gprof, because pants are for squares, I want to go fast

Sometimes, when you want to go REALLY fast, just using a compiled language like C won’t be enough. This situation, as strange as it sounds is unfortunately common. Fortunately st. Ignucious provides. Enter gprof, a tool for understanding the “execution profile” of a program written in C, Pascal, or Fortran77. For those of you without a fortran77 program handy, IBM has a nice (read silly) piece of code in their gprof tutorial, listed below.


#include <stdio.h>

int a(void) {
     int i=0,g=0;
     while(i++<100000)
     {
          g+=i;
     }
     return g;
}
int b(void) {
     int i=0,g=0;
     while(i++<400000)
     {
          g+=i;
     }
     return g;
}

int main(int argc, char** argv)
{
     int iterations;

     if(argc != 2)
     {
          printf("Usage %s <No of Iterations>\n", argv[0]);
          exit(-1);
     }
     else
          iterations = atoi(argv[1]);

     printf("No of iterations = %d\n", iterations);

     while(iterations--)
     {
          a();
          b();
     }
}

Looking at it, the behavior is fairly obvious. Given a number of iterations n the program calls function a and b n times. Functions a and b both perform the same task, but a different number of times.

Intuitivley the program should spend exactly four times as long in b as it does in a, but we don’t know how long that is, and validating this hypothesis is a good demonstration of gprof’s value. Gprof, like gcc depends on code injected by the compiler. To do this you compile with the ‘-pg’ flag, e.g. ‘gcc -pg -o program_under_test test.c’. Once you’ve done this, invoke the program with the relevant command line arguments. The rest of this article assumes that the program is invoked with argv[1]==1000. Invoking a program compiled with ‘-pg’ will create identical behavior to an invocation of an instance compiled without, however, it will output a file, gmon.out, that contains statistics about the program’s runtime. Once this file has been generated you can run ‘gprof <program name>’, and a wall of text will appear describing the runtime characterstics of the program.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 80.53      8.69     8.69     1000     8.69     8.69  b
 20.51     10.90     2.21     1000     2.21     2.21  a

 %         the percentage of the total running time of the
time       program used by this function.

cumulative a running sum of the number of seconds accounted
 seconds   for by this function and those listed above it.

 self      the number of seconds accounted for by this
seconds    function alone.  This is the major sort for this
           listing.

calls      the number of times this function was invoked, if
           this function is profiled, else blank.
 
 self      the average number of milliseconds spent in this
ms/call    function per call, if this function is profiled,
           else blank.

 total     the average number of milliseconds spent in this
ms/call    function and its descendents per call, if this 
           function is profiled, else blank.

name       the name of the function.  This is the minor sort
           for this listing. The index shows the location of
           the function in the gprof listing. If the index is
           in parenthesis it shows where it would appear in
           the gprof listing if it were to be printed.
Copyright (C) 2012-2014 Free Software Foundation, Inc.

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
                     Call graph (explanation follows)


granularity: each sample hit covers 2 byte(s) for 0.09% of 10.90 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00   10.90                 main [1]
                8.69    0.00    1000/1000        b [2]
                2.21    0.00    1000/1000        a [3]
-----------------------------------------------
                8.69    0.00    1000/1000        main [1]
[2]     79.7    8.69    0.00    1000         b [2]
-----------------------------------------------
                2.21    0.00    1000/1000        main [1]
[3]     20.3    2.21    0.00    1000         a [3]
-----------------------------------------------

 This table describes the call tree of the program, and was sorted by
 the total amount of time spent in each function and its children.

 Each entry in this table consists of several lines.  The line with the
 index number at the left hand margin lists the current function.
 The lines above it list the functions that called this function,
 and the lines below it list the functions this one called.
 This line lists:
     index      A unique number given to each element of the table.
                Index numbers are sorted numerically.
                The index number is printed next to every function name so
                it is easier to look up where the function is in the table.

     % time     This is the percentage of the `total' time that was spent
                in this function and its children.  Note that due to
                different viewpoints, functions excluded by options, etc,
                these numbers will NOT add up to 100%.

     self       This is the total amount of time spent in this function.

     children   This is the total amount of time propagated into this
                function by its children.

     called     This is the number of times the function was called.
                If the function called itself recursively, the number
                only includes non-recursive calls, and is followed by
                a `+' and the number of recursive calls.

     name       The name of the current function.  The index number is
                printed after it.  If the function is a member of a
                cycle, the cycle number is printed between the
                function's name and the index number.


 For the function's parents, the fields have the following meanings:

     self       This is the amount of time that was propagated directly
                from the function into this parent.

     children   This is the amount of time that was propagated from
                the function's children into this parent.

     called     This is the number of times this parent called the
                function `/' the total number of times the function
                was called.  Recursive calls to the function are not
                included in the number after the `/'.

     name       This is the name of the parent.  The parent's index
                number is printed after it.  If the parent is a
                member of a cycle, the cycle number is printed between
                the name and the index number.

 If the parents of the function cannot be determined, the word
 `<spontaneous>' is printed in the `name' field, and all the other
 fields are blank.

 For the function's children, the fields have the following meanings:

     self       This is the amount of time that was propagated directly
                from the child into the function.

     children   This is the amount of time that was propagated from the
                child's children to the function.

     called     This is the number of times the function called
                this child `/' the total number of times the child
                was called.  Recursive calls by the child are not
                listed in the number after the `/'.

     name       This is the name of the child.  The child's index
                number is printed after it.  If the child is a
                member of a cycle, the cycle number is printed
                between the name and the index number.

 If there are any cycles (circles) in the call graph, there is an
 entry for the cycle-as-a-whole.  This entry shows who called the
 cycle (as parents) and the members of the cycle (as children.)
 The `+' recursive calls entry shows the number of function calls that
 were internal to the cycle, and the calls entry for each member shows,
 for that member, how many times it was called from other members of
 the cycle.
Copyright (C) 2012-2014 Free Software Foundation, Inc.

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
Index by function name

   [3] a                       [2] b

gprof is kind enough to provide explanations of its outputs, so I won’t go into significant depth about how to interpret them. However, there are a couple of things it doesn’t tell you. Gprof profiles the program by sampling at even intervals. Because of this the values displayed, while roughly four to one are not exactly accurate. In fact were you to decrease the number of iterations to 100 and run gprof a couple times you would see noticeably different results. With an appropriate number of samples though, you can see that b runs for about 4 times as long as a overall. The flat profile also shows the number of calls made and the ms/call for each function. These statistics can be very helpful in identifying the sections of a project that are incurring significant runtime cost. For example, if you have a function that gets called several places, which accounts for a significant portion of execution time, it is easy to identify and optimize.

The call graph provides a textual representation of all the function calls made by the program, as well as subgraphs for each individual function which include their caller. Each listing provides information about how much time is spent in the function itself, as well as how much of that time is spent in functions it calls. I find the call graph to be helpful in developing a more nuanced understanding of a program’s behavior. You can see where functions are called often, and identify other potential routes for optimization, for example identifying places where reducing the use of a function is valuable.

Now, since we’re good (read lazy) programmers, who weild glorious multipass optimizing compilers the first/only step we might take is turning some optimizations on. This can be done by recompiling the program with ‘-O’; in this case I used ‘-O3’ but you can use lower or no numbers for less optimization. If we run the newly created program we see some interesting things, namely the following flat profile.

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

This seems a little odd, the program’s execution time has more or less dropped to nothing. To see why, we must delve into the machine code the compiler is generating. To do this, we once again look to GNU. objdump is a tool for disassembling binary files. It shows section delimited listings of the program translated to the relevant assembly language. It can be invoked by calling ‘objdump -d <program name>’. If you look at the section containing main() in the optimized and unoptimized program there are some interesting things to be seen. The unoptimized version is about what we expect. It does some things and calls (the callq opcode calls the associated function) the functions we expect.

0000000000400706 <main>:
  400706:       55                      push   %rbp
  400707:       48 89 e5                mov    %rsp,%rbp
  40070a:       48 83 ec 20             sub    $0x20,%rsp
  40070e:       e8 fd fd ff ff          callq  400510 <mcount@plt>
  400713:       89 7d ec                mov    %edi,-0x14(%rbp)
  400716:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
  40071a:       83 7d ec 02             cmpl   $0x2,-0x14(%rbp)
  40071e:       74 23                   je     400743 <main+0x3d>
  400720:       48 8b 45 e0             mov    -0x20(%rbp),%rax
  400724:       48 8b 00                mov    (%rax),%rax
  400727:       48 89 c6                mov    %rax,%rsi
  40072a:       bf 5c 08 40 00          mov    $0x40085c,%edi
  40072f:       b8 00 00 00 00          mov    $0x0,%eax
  400734:       e8 a7 fd ff ff          callq  4004e0 <printf@plt>
  400739:       bf ff ff ff ff          mov    $0xffffffff,%edi
  40073e:       e8 fd fd ff ff          callq  400540 <exit@plt>
  400743:       48 8b 45 e0             mov    -0x20(%rbp),%rax
  400747:       48 83 c0 08             add    $0x8,%rax
  40074b:       48 8b 00                mov    (%rax),%rax
  40074e:       48 89 c7                mov    %rax,%rdi
  400751:       b8 00 00 00 00          mov    $0x0,%eax
  400756:       e8 d5 fd ff ff          callq  400530 <atoi@plt>
  40075b:       89 45 fc                mov    %eax,-0x4(%rbp)
  40075e:       8b 45 fc                mov    -0x4(%rbp),%eax
  400761:       89 c6                   mov    %eax,%esi
  400763:       bf 79 08 40 00          mov    $0x400879,%edi
  400768:       b8 00 00 00 00          mov    $0x0,%eax
  40076d:       e8 6e fd ff ff          callq  4004e0 <printf@plt>
  400772:       eb 0a                   jmp    40077e <main+0x78>
  400774:       e8 1d ff ff ff          callq  400696 <a>
  400779:       e8 50 ff ff ff          callq  4006ce <b>
  //Calls to a and b, the source of most of the runtime overhead
  40077e:       8b 45 fc                mov    -0x4(%rbp),%eax
  400781:       8d 50 ff                lea    -0x1(%rax),%edx
  400784:       89 55 fc                mov    %edx,-0x4(%rbp)
  400787:       85 c0                   test   %eax,%eax
  400789:       75 e9                   jne    400774 <main+0x6e>
  40078b:       b8 00 00 00 00          mov    $0x0,%eax
  400790:       c9                      leaveq 
  400791:       c3                      retq   
  400792:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400799:       00 00 00 
  40079c:       0f 1f 40 00             nopl   0x0(%rax)

Everything in the unoptimized version is about what we’d expect, but looking at the optimized version there are some interesting things to be seen.

0000000000400550 <main>:
  400550:       55                      push   %rbp
  400551:       48 89 e5                mov    %rsp,%rbp
  400554:       e8 b7 ff ff ff          callq  400510 <mcount@plt>
  400559:       83 ff 02                cmp    $0x2,%edi
  40055c:       75 1d                   jne    40057b <main+0x2b>
  40055e:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  400562:       31 c0                   xor    %eax,%eax
  400564:       e8 c7 ff ff ff          callq  400530 <atoi@plt>
  400569:       bf b1 08 40 00          mov    $0x4008b1,%edi
  40056e:       89 c6                   mov    %eax,%esi
  400570:       31 c0                   xor    %eax,%eax
  400572:       e8 69 ff ff ff          callq  4004e0 <printf@plt>
  400577:       31 c0                   xor    %eax,%eax
  400579:       5d                      pop    %rbp
  40057a:       c3                      retq   
  40057b:       48 8b 36                mov    (%rsi),%rsi
  40057e:       bf 94 08 40 00          mov    $0x400894,%edi
  400583:       31 c0                   xor    %eax,%eax
  400585:       e8 56 ff ff ff          callq  4004e0 <printf@plt>
  40058a:       83 cf ff                or     $0xffffffff,%edi
  40058d:       e8 ae ff ff ff          callq  400540 <exit@plt>
  400592:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400599:       00 00 00 
  40059c:       0f 1f 40 00             nopl   0x0(%rax)

The optimizer has produced more succinct machine code, and interestingly it has removed the calls to a and b, accounting for the massive reduction in execution time. This is because the compiler has some ability to assess the side effects of a function, and since there are no relevant side effects, and the return value is unused, it can optimize away the function calls. Often compiler optimizations won’t be a silver bullet, but they can prove very helpful. When you they aren’t enough though, gprof can be incredibly helpful in identifying the sections of code worth hand optimizing.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s