Better living through abstraction

While not much short of hand crafted assembly will provide better performance than well written C there are many downsides to being able to do literally whatever you want to memory. The concept of an owning scope, and the ability of a scope to take ownership of some data is a good way to manage where variables are freed. While concepts of ownership are applied in existing languages such as C++, Rust , a recently released systems programming language developed by the Mozilla foundation uses it to create 100% memory safe code. Unlike languages that employ garbage collection Rust’s compiler uses ownership semantics to ensure that Rust code is memory safe. This technique creates memory safe code without the overhead of a complex language run-time. This does come at the cost of fighting the borrow checker, the component of the rust compiler that ensures ownership rules are followed. To understand why this can be helpful the piece of code below can be helpful

fn stack_return() -> &int {
    let i = 1234;
    return &i;
}

This code will not compile because the variable i is stack allocated, going out of scope at the end of the function. Since the function returns the borrowed pointer &i which has an owner whose lifetime is potentially longer than stack_return() this code will fail to compile because this creates a potential for an invalid memory access. By associating each piece of memory with an owner who can then lend it to other scopes rust is able to provide powerful guarantees about object lifetime and access. These impose certain restrictions on the language. For example the following code will not compile

fn main() {
    let mut x = vec!["Hello", "world"];

    let y = &x[0];

    x.push("foo");
}

The code above will fail to compile. While the vector is marked as mutable the line x.push(“foo”); is invalid. This is because push() requires a mutable reference to our vector x. This is because y is an immutable reference. As long as y is alive, there can be no mutable reference to x as it could potentially invalidate the data in y. Though typical rust code guarantees safety, it is possible to write potentially unsafe code using the unsafe{} directive, a necessary aspect of interfacing Rust with C, and other unsafe languages.

While semantics like this may seem over-complicated they completely eliminate an entire class of error without the cost of garbage collection. They also allow rust to make promises about data races that make writing concurrent code a much more manageable process than in classic systems languages. Rust also has many of the higher level features found in languages like C++ and Ocaml such as type traits, iterators, and RAII. while it’s performance is still not quite as good as that of C, Rust is rapidly becoming a viable, and reasonable, choice for several classes of problems. Additionally Rust’s compiler is based on LLVM providing excellent cross compilation and optimization capabilities.

Advertisements

Calling the System

There are many things that a programmer will need to do that require functionality reserved to the operating system. This typically has to do with which functionality the operating system developer thinks is safe and secure to expose to a user. For example in linux the execve() system call is used to spawn processes as this is a task that requires loading a file into memory not owned by the calling program which is typically prohibited. It also allows the operating system to handle the scheduling and signaling of the second program.

Because system calls require a context switch to kernel code they are typically implemented using an interrupt. x86 has several unused interrupts which can be bound to interrupt service routines  and invoked in userland to jump into kernel code. In the case of linux, interrupt 0x80 is used.While system calls are all callable as C functions it is an interesting study to invoke one in assembly. The listing below is written in Intel syntax, and can be assembled using NASM, the netwide assembler. This choice is mostly motivated by my undying hatred of percent signs, and this listing can be fairly easily converted to the AT&T syntax used by the GNU assembler. We will be trying to print some text so we want to use sys_write. The C signature is

ssize_t write(int <i>fd</i>, const void *<i>buf</i>, size_t <i>count</i>)

Since all system calls are invocations of the same interrupt we will also need to pass an argument denoting which syscall we want (0x04 in the case of sys write). A full listing of system calls and associated register values can be found here.

section .text
    global _start
_start:
    mov eax, 0x04 ;int 0x80 causes the operating system to check register eax for the syscall's code 
    mov ebx, 1    ;fd=1 for stdout
    mov ecx, msg  ;set buf to the string allocated bellow
    mov edx, len  ;count=len(message)
    int 0x80      ;trigger interrupt
;we also have to invoke the exit system call (sys_exit is call 1)
    mov eax, 1    ;code 1 for sys_exit
    mov ebx, 0    ;equivalent to exit(0)
    int 0x80

section .data
    msg db "My Love's the Bogans!",0xa ;0xa is newline
    len equ $ - msg ;msg is a pointer to the head of the string $ is an alias in NASM for the previous address, in this case the end of the string

Finally, this can be compiled with

nasm -felf64 bogans.asm
ld bogans.o -o bogans

Running the resulting binary should print the text then exit. While this is an atypical amount of work to print some text it is a good example of how system calls work at a low level.

Speak softy, and carry a big fork

Fork is one of the many powerful tools you have at your disposal as a C developer. Take for example the code below. Here was an honest attempt at implementing an example using fork, but things went very wrong.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

// main takes two parameters: argc is the number of command-line
// arguments; argv is an array of strings containing the command
// line arguments
int main (int argc, char *argv[])
{
    int status;
    pid_t pid;
    int i, num_children;

    // the first command-line argument is the name of the executable.
    // if there is a second, it is the number of children to create.
    if (argc == 2) {
        num_children = atoi (argv[1]);
    } else {
        num_children = 1;
    }

    for (i=0; i<num_children; i++) {

        // create a child process
        printf ("Creating child %d.\n", i);
        pid = fork ();

        /* check for an error */
        if (pid == -1) {
            fprintf (stderr, "fork failed: %s\n", strerror(errno));
            perror (argv[0]);
            exit (1);
        }
    }

    /* see if we're the parent or the child */
    if (pid == 0) {
        printf ("Hello from child %d.\n", i);
    }

    /* parent continues */
    printf ("Hello from the parent.\n");

    exit (0);
}

The code is supposed to take in a command line argument of the number of processes to create, but it doesn’t quite behave as expected. What’s wrong it? It may not be obvious from a first pass, but I recommend reading through and figuring out what this code actually does. Run the code with different arguments to get a better idea of what’s happening.

This behavior is the fundamental idea behind an exploit sometimes seen in malicious software called a fork bomb. The code in a fork bomb infinitely spawns new processes, starving the system of resources and effectively crashing it. The simplest fork bomb in C looks like this:

#include <unistd.h>

// WARNING! Run at your own risk!
int main(void)
{
    while(1)
        fork();
}

Unit Testing in C

Like any good software developer, I’m sure you’re beginning to wonder how to write sustainable C code. The way to do that in any language is with unit tests. The idea behind unit tests is to have each “unit” of your code be validated with assertive tests. A “unit” is typically a single function, and the goal of unit tests on that function is to validate that the function behaves as you expect for a range of inputs.

While there are many unit testing frameworks for C, for the purposes of this article I’ll be using minunit. Minunit is an incredibly simple framework that makes learning how to unit test easy. In fact, the source code is a whole 3 lines (seen below).

To install minunit, copy the below code into a “minunit.h” file. It doesn’t matter where it’s located in your file system, as long as you know how to include it in your unit test files.

#define mu_assert(message, test) do { if (!(test)) return message; } while (0)
#define mu_run_test(test) do { char *message = test(); tests_run++; \
                               if (message) return message; } while (0)
extern int tests_run;

Minunit provides two critical macros, mu_assert and mu_run_test. mu_assert takes a message and a statement that evaluates to a boolean value (i.e. my_int == 5), returning the message if the boolean value is false. mu_run_test handles calling the functions containing all of your mu_assert calls.

A simple test suite using minunit looks something like this:

#include <stdio.h>
#include "minunit.h"

int tests_run = 0;

// Here's a sample function we'll test
int add(int x, int y) {
    return x + y;
}

// Here's another sample function,
// except this one has an error (* instead of +)
int bad_add(int x, int y) {
    return x * y;
}

// Testing the add() function
static char * test_add() {
    mu_assert("error, adding 2 and 3 != 5", add(2, 3) == 5);
    return 0;
}

// Testing the bad_add() function
static char * test_bad_add() {
    mu_assert("error, adding 2 and 3 != 5", bad_add(2, 3) == 5);
    return 0;
}

// Here is where we call all of the unit test functions
static char * all_tests() {
    mu_run_test(test_add);
    mu_run_test(test_bad_add);
    return 0;
}

int main(int argc, char **argv) {
    char *result = all_tests();
    // result will be the failed test message if a test fails
    if (result != 0) {
        printf("%s\n", result);
    }
    else {
        printf("ALL TESTS PASSED\n");
    }
    printf("Tests run: %d\n", tests_run);

    return result != 0;
}

test_add and test_bad_add are the unit testing functions in this example. They should contain 1 or more calls of mu_assert that should test something regarding the functionality of add and bad_add, respectively. A good unit test should have a representative sample of the range of possible inputs as well as a representative sample of the expected domain of the function being tested.

As a note, the above code does not do this well. For example, if I had chosen to have the assertions testing adding 2 and 2, both unit tests would have passed because the bad_add function would find 2 * 2 = 4, which is the expected result as well.

In minunit, there is no automation regarding running the test suite. As such, you have to call mu_run_test on each unit test function, which is aggregated in the all_tests function in the above sample.

While the functionality of minunit is limited, I think it makes a good place to start with unit testing. When you feel comfortable with the idea of unit testing, I’d recommend going and finding another, more feature-rich alternative. You can find an excellent discussion regarding the available choices on Stack Overflow.

Degugging Memory Useage with Valgrind for Fun and Profit

Everyone is familiar with ‘malloc()’ and ‘free()’, and also their reputation for being notoriously dangerous when misused. Memory leaks, and other allocation related bugs, are some of the most common and pernicious issues affecting software. With the advent of more sophisticated compilers, as well as the burgeoning use of interpreted languages memory leaks are becoming a less commonly encountered problem, but they are still something that anyone working in C will probably face. Since C allows the programmer to do more of less whatever they please with their pointers it’s easy to mismanage a few bytes of memory in a program of non-trivial size, as there is no semantic enforced by the language to ensure any degree of memory safety. Memory management being so common an issue there has naturally developed a set of tools to help us deal with it. One of these tools, Valgrind can be incredibly helpful in the quest for perfect memory management. If you don’t have a working Valgrind install one can be obtained as follows. As of the writing of this article the current version is 3.11.0.

wget http://www.valgrind.org/downloads/valgrind-3.11.0.tar.bz2
tar -jxvf valgrind-3.11.0.tar.bz2
//once in the untarred directory
./configure
make && make install

Now that we have Valgrind we need some code that mismanages memory. One of Valgrind’s abilities is detecting memory leaks, so we need some code that leaks.

#include <stdlib.h>

int main() {
    void* unfreed;
    int idx;
    for (idx=0; idx<5; idx++) {
        unfreed = malloc(100);
    }
    return 0;
}

While Valgrind works with literally any binary (try it out on some of the coreutils), in practice it’s helpful to have debugging symbols, so it’s worth compiling the program under test with -g. Once you have a binary, it’s time to break out Valgrind. Since Valgrind supports a number of features your invocation has to specify the subset you wish to use. For this example, you want to check for leaks so you will have to enable the leak check tool. This should look something like

valgrind --tool=memcheck --leak-check=full ./program_name

Running this will print information about the programs memory (mis)useage that should look something like the listing below.

...
==24810== Command: ./program_name
==24810== 
==24810== 
==24810== HEAP SUMMARY:
==24810==     in use at exit: 500 bytes in 5 blocks
==24810==   total heap usage: 5 allocs, 0 frees, 500 bytes allocated
==24810== 
==24810== 500 bytes in 5 blocks are definitely lost in loss record 1 of 1
==24810==    at 0x4A05E46: malloc (vg_replace_malloc.c:195)
==24810==    by 0x400525: main (test.c:7)
==24810== 
==24810== LEAK SUMMARY:
==24810==    definitely lost: 500 bytes in 5 blocks
==24810==    indirectly lost: 0 bytes in 0 blocks
==24810==      possibly lost: 0 bytes in 0 blocks
==24810==    still reachable: 0 bytes in 0 blocks
==24810==         suppressed: 0 bytes in 0 blocks
...

Valgrind is kind enough to show us that the allocation on line 7 results in the program leaking 500 bytes of data across 5 different allocations. Knowing how much and where you’re program is leaking is a step up from the massive no information that you typically have. This gives me an excellent opportunity to quote the legendary Billy Mays “but wait there’s more”! Valgrind can also tell you when you use invalid memory. Time for another broken program.

#include <stdlib.h>

int main()
{
char *x = malloc(10);
x[10] = 'a';
return 0;
}

The mistake here is pretty obvious, but it’s also a startlingly easy class of error to make. Write enough MatLab and all your indices go to hell. Fortunately Valgrind has tools to detect invalid heap access. If we invoke Valgrind we can see that it detects invalid memory access.

valgrind --tool=memcheck --leak-check=yes program_name
...
==9814==  Invalid write of size 1
==9814==    at 0x804841E: main (program_name.c:6)
==9814==  Address 0x1BA3607A is 0 bytes after a block of size 10 alloc'd
==9814==    at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==9814==    by 0x804840F: main (program_name.c:5)
...

If you order today for the low low price of free, memcheck will also let you know when you’re using uninitialized memory. For example, the statement

int x;
if (x==0) {...}

will generate a warning along the lines of

==17943== Conditional jump or move depends on uninitialised value(s)
==17943==    at 0x804840A: main (program_name.c:6)

In fact, Valgrind can trace the use of uninitialized variables throughout the program. Also worth mentioning is Valgrind’s ability to detect double frees.

As powerful as memcheck is Valgrind has aditional capabilities including thread error detection, call graph generation, and cache and branch prediction capabilities. see the docs for more information. Also, for those interested in the LLVM ecosystem Address sanitizer is an interesting tool for memory analysis using clang.

gprof, because pants are for squares, I want to go fast

Sometimes, when you want to go REALLY fast, just using a compiled language like C won’t be enough. This situation, as strange as it sounds is unfortunately common. Fortunately st. Ignucious provides. Enter gprof, a tool for understanding the “execution profile” of a program written in C, Pascal, or Fortran77. For those of you without a fortran77 program handy, IBM has a nice (read silly) piece of code in their gprof tutorial, listed below.


#include <stdio.h>

int a(void) {
     int i=0,g=0;
     while(i++<100000)
     {
          g+=i;
     }
     return g;
}
int b(void) {
     int i=0,g=0;
     while(i++<400000)
     {
          g+=i;
     }
     return g;
}

int main(int argc, char** argv)
{
     int iterations;

     if(argc != 2)
     {
          printf("Usage %s <No of Iterations>\n", argv[0]);
          exit(-1);
     }
     else
          iterations = atoi(argv[1]);

     printf("No of iterations = %d\n", iterations);

     while(iterations--)
     {
          a();
          b();
     }
}

Looking at it, the behavior is fairly obvious. Given a number of iterations n the program calls function a and b n times. Functions a and b both perform the same task, but a different number of times.

Intuitivley the program should spend exactly four times as long in b as it does in a, but we don’t know how long that is, and validating this hypothesis is a good demonstration of gprof’s value. Gprof, like gcc depends on code injected by the compiler. To do this you compile with the ‘-pg’ flag, e.g. ‘gcc -pg -o program_under_test test.c’. Once you’ve done this, invoke the program with the relevant command line arguments. The rest of this article assumes that the program is invoked with argv[1]==1000. Invoking a program compiled with ‘-pg’ will create identical behavior to an invocation of an instance compiled without, however, it will output a file, gmon.out, that contains statistics about the program’s runtime. Once this file has been generated you can run ‘gprof <program name>’, and a wall of text will appear describing the runtime characterstics of the program.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 80.53      8.69     8.69     1000     8.69     8.69  b
 20.51     10.90     2.21     1000     2.21     2.21  a

 %         the percentage of the total running time of the
time       program used by this function.

cumulative a running sum of the number of seconds accounted
 seconds   for by this function and those listed above it.

 self      the number of seconds accounted for by this
seconds    function alone.  This is the major sort for this
           listing.

calls      the number of times this function was invoked, if
           this function is profiled, else blank.
 
 self      the average number of milliseconds spent in this
ms/call    function per call, if this function is profiled,
           else blank.

 total     the average number of milliseconds spent in this
ms/call    function and its descendents per call, if this 
           function is profiled, else blank.

name       the name of the function.  This is the minor sort
           for this listing. The index shows the location of
           the function in the gprof listing. If the index is
           in parenthesis it shows where it would appear in
           the gprof listing if it were to be printed.
Copyright (C) 2012-2014 Free Software Foundation, Inc.

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
                     Call graph (explanation follows)


granularity: each sample hit covers 2 byte(s) for 0.09% of 10.90 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00   10.90                 main [1]
                8.69    0.00    1000/1000        b [2]
                2.21    0.00    1000/1000        a [3]
-----------------------------------------------
                8.69    0.00    1000/1000        main [1]
[2]     79.7    8.69    0.00    1000         b [2]
-----------------------------------------------
                2.21    0.00    1000/1000        main [1]
[3]     20.3    2.21    0.00    1000         a [3]
-----------------------------------------------

 This table describes the call tree of the program, and was sorted by
 the total amount of time spent in each function and its children.

 Each entry in this table consists of several lines.  The line with the
 index number at the left hand margin lists the current function.
 The lines above it list the functions that called this function,
 and the lines below it list the functions this one called.
 This line lists:
     index      A unique number given to each element of the table.
                Index numbers are sorted numerically.
                The index number is printed next to every function name so
                it is easier to look up where the function is in the table.

     % time     This is the percentage of the `total' time that was spent
                in this function and its children.  Note that due to
                different viewpoints, functions excluded by options, etc,
                these numbers will NOT add up to 100%.

     self       This is the total amount of time spent in this function.

     children   This is the total amount of time propagated into this
                function by its children.

     called     This is the number of times the function was called.
                If the function called itself recursively, the number
                only includes non-recursive calls, and is followed by
                a `+' and the number of recursive calls.

     name       The name of the current function.  The index number is
                printed after it.  If the function is a member of a
                cycle, the cycle number is printed between the
                function's name and the index number.


 For the function's parents, the fields have the following meanings:

     self       This is the amount of time that was propagated directly
                from the function into this parent.

     children   This is the amount of time that was propagated from
                the function's children into this parent.

     called     This is the number of times this parent called the
                function `/' the total number of times the function
                was called.  Recursive calls to the function are not
                included in the number after the `/'.

     name       This is the name of the parent.  The parent's index
                number is printed after it.  If the parent is a
                member of a cycle, the cycle number is printed between
                the name and the index number.

 If the parents of the function cannot be determined, the word
 `<spontaneous>' is printed in the `name' field, and all the other
 fields are blank.

 For the function's children, the fields have the following meanings:

     self       This is the amount of time that was propagated directly
                from the child into the function.

     children   This is the amount of time that was propagated from the
                child's children to the function.

     called     This is the number of times the function called
                this child `/' the total number of times the child
                was called.  Recursive calls by the child are not
                listed in the number after the `/'.

     name       This is the name of the child.  The child's index
                number is printed after it.  If the child is a
                member of a cycle, the cycle number is printed
                between the name and the index number.

 If there are any cycles (circles) in the call graph, there is an
 entry for the cycle-as-a-whole.  This entry shows who called the
 cycle (as parents) and the members of the cycle (as children.)
 The `+' recursive calls entry shows the number of function calls that
 were internal to the cycle, and the calls entry for each member shows,
 for that member, how many times it was called from other members of
 the cycle.
Copyright (C) 2012-2014 Free Software Foundation, Inc.

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
Index by function name

   [3] a                       [2] b

gprof is kind enough to provide explanations of its outputs, so I won’t go into significant depth about how to interpret them. However, there are a couple of things it doesn’t tell you. Gprof profiles the program by sampling at even intervals. Because of this the values displayed, while roughly four to one are not exactly accurate. In fact were you to decrease the number of iterations to 100 and run gprof a couple times you would see noticeably different results. With an appropriate number of samples though, you can see that b runs for about 4 times as long as a overall. The flat profile also shows the number of calls made and the ms/call for each function. These statistics can be very helpful in identifying the sections of a project that are incurring significant runtime cost. For example, if you have a function that gets called several places, which accounts for a significant portion of execution time, it is easy to identify and optimize.

The call graph provides a textual representation of all the function calls made by the program, as well as subgraphs for each individual function which include their caller. Each listing provides information about how much time is spent in the function itself, as well as how much of that time is spent in functions it calls. I find the call graph to be helpful in developing a more nuanced understanding of a program’s behavior. You can see where functions are called often, and identify other potential routes for optimization, for example identifying places where reducing the use of a function is valuable.

Now, since we’re good (read lazy) programmers, who weild glorious multipass optimizing compilers the first/only step we might take is turning some optimizations on. This can be done by recompiling the program with ‘-O’; in this case I used ‘-O3’ but you can use lower or no numbers for less optimization. If we run the newly created program we see some interesting things, namely the following flat profile.

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

This seems a little odd, the program’s execution time has more or less dropped to nothing. To see why, we must delve into the machine code the compiler is generating. To do this, we once again look to GNU. objdump is a tool for disassembling binary files. It shows section delimited listings of the program translated to the relevant assembly language. It can be invoked by calling ‘objdump -d <program name>’. If you look at the section containing main() in the optimized and unoptimized program there are some interesting things to be seen. The unoptimized version is about what we expect. It does some things and calls (the callq opcode calls the associated function) the functions we expect.

0000000000400706 <main>:
  400706:       55                      push   %rbp
  400707:       48 89 e5                mov    %rsp,%rbp
  40070a:       48 83 ec 20             sub    $0x20,%rsp
  40070e:       e8 fd fd ff ff          callq  400510 <mcount@plt>
  400713:       89 7d ec                mov    %edi,-0x14(%rbp)
  400716:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
  40071a:       83 7d ec 02             cmpl   $0x2,-0x14(%rbp)
  40071e:       74 23                   je     400743 <main+0x3d>
  400720:       48 8b 45 e0             mov    -0x20(%rbp),%rax
  400724:       48 8b 00                mov    (%rax),%rax
  400727:       48 89 c6                mov    %rax,%rsi
  40072a:       bf 5c 08 40 00          mov    $0x40085c,%edi
  40072f:       b8 00 00 00 00          mov    $0x0,%eax
  400734:       e8 a7 fd ff ff          callq  4004e0 <printf@plt>
  400739:       bf ff ff ff ff          mov    $0xffffffff,%edi
  40073e:       e8 fd fd ff ff          callq  400540 <exit@plt>
  400743:       48 8b 45 e0             mov    -0x20(%rbp),%rax
  400747:       48 83 c0 08             add    $0x8,%rax
  40074b:       48 8b 00                mov    (%rax),%rax
  40074e:       48 89 c7                mov    %rax,%rdi
  400751:       b8 00 00 00 00          mov    $0x0,%eax
  400756:       e8 d5 fd ff ff          callq  400530 <atoi@plt>
  40075b:       89 45 fc                mov    %eax,-0x4(%rbp)
  40075e:       8b 45 fc                mov    -0x4(%rbp),%eax
  400761:       89 c6                   mov    %eax,%esi
  400763:       bf 79 08 40 00          mov    $0x400879,%edi
  400768:       b8 00 00 00 00          mov    $0x0,%eax
  40076d:       e8 6e fd ff ff          callq  4004e0 <printf@plt>
  400772:       eb 0a                   jmp    40077e <main+0x78>
  400774:       e8 1d ff ff ff          callq  400696 <a>
  400779:       e8 50 ff ff ff          callq  4006ce <b>
  //Calls to a and b, the source of most of the runtime overhead
  40077e:       8b 45 fc                mov    -0x4(%rbp),%eax
  400781:       8d 50 ff                lea    -0x1(%rax),%edx
  400784:       89 55 fc                mov    %edx,-0x4(%rbp)
  400787:       85 c0                   test   %eax,%eax
  400789:       75 e9                   jne    400774 <main+0x6e>
  40078b:       b8 00 00 00 00          mov    $0x0,%eax
  400790:       c9                      leaveq 
  400791:       c3                      retq   
  400792:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400799:       00 00 00 
  40079c:       0f 1f 40 00             nopl   0x0(%rax)

Everything in the unoptimized version is about what we’d expect, but looking at the optimized version there are some interesting things to be seen.

0000000000400550 <main>:
  400550:       55                      push   %rbp
  400551:       48 89 e5                mov    %rsp,%rbp
  400554:       e8 b7 ff ff ff          callq  400510 <mcount@plt>
  400559:       83 ff 02                cmp    $0x2,%edi
  40055c:       75 1d                   jne    40057b <main+0x2b>
  40055e:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  400562:       31 c0                   xor    %eax,%eax
  400564:       e8 c7 ff ff ff          callq  400530 <atoi@plt>
  400569:       bf b1 08 40 00          mov    $0x4008b1,%edi
  40056e:       89 c6                   mov    %eax,%esi
  400570:       31 c0                   xor    %eax,%eax
  400572:       e8 69 ff ff ff          callq  4004e0 <printf@plt>
  400577:       31 c0                   xor    %eax,%eax
  400579:       5d                      pop    %rbp
  40057a:       c3                      retq   
  40057b:       48 8b 36                mov    (%rsi),%rsi
  40057e:       bf 94 08 40 00          mov    $0x400894,%edi
  400583:       31 c0                   xor    %eax,%eax
  400585:       e8 56 ff ff ff          callq  4004e0 <printf@plt>
  40058a:       83 cf ff                or     $0xffffffff,%edi
  40058d:       e8 ae ff ff ff          callq  400540 <exit@plt>
  400592:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400599:       00 00 00 
  40059c:       0f 1f 40 00             nopl   0x0(%rax)

The optimizer has produced more succinct machine code, and interestingly it has removed the calls to a and b, accounting for the massive reduction in execution time. This is because the compiler has some ability to assess the side effects of a function, and since there are no relevant side effects, and the return value is unused, it can optimize away the function calls. Often compiler optimizations won’t be a silver bullet, but they can prove very helpful. When you they aren’t enough though, gprof can be incredibly helpful in identifying the sections of code worth hand optimizing.

GDB, or how I learned to stop worrying and love the debugger

While compiled languages are typically faster, this comes at the cost of increased obfuscation. If you’re familiar with an interpreted language you are probably used to errors getting caught at runtime and being handled by the language interpreter. Since an interpreter completely contains the execution of the software it is able to do things like inspect stack frames and variable types. In addition there is typically some language support for complex error handling. While these things can be accopmlished in C it is typically less intuitive, and often requires the use of additional tools. GDB, the GNU Debugger is one such tool. From the GDB man page

“GDB can do four main kinds of things (plus other things in support of these) to help you catch bugs in
the act:
· Start your program, specifying anything that might affect its behavior.
· Make your program stop on specified conditions.
· Examine what has happened, when your program has stopped.
· Change things in your program, so you can experiment with correcting the effects of one bug and go on to learn about another.”

To quickly demonstrate some of GDBs ability we will need a program as contrived as it is broken. Fortunately Wikipedia provides.

#include <stdlib.h>;

void fault();

int main() {
    int i = 0;
    fault(&i);
}

void fault() {
    *i = 4;
    char *s = "segfaults";
    *s = 'H';
}

If you compile and run this program you should see something along the lines of “segmentation fault(core dumped)”. To use GDB we need two things, a binary with debugging symbols, and a working GDB install. All of the examples in this post use the program listed above, if you wanna try it out you can create a binary with debug symbols using the -g flag with clang or gcc e.g.

gcc -g -o program program.c

It is imperative to compile a version of your binary with -g on to get the most out of GDB. It is also worth noting that gcc was used to generate the binary used to make this and clang’s output will be minimally but noticeably different. Next, if you don’t have GDB installed you can download a copy of the latest release from the GNU repos, currently 7.10, and build it as follows

wget https://ftp.gnu.org/gnu/gdb/gdb-7.10.tar.xz
tar -Jxvf gdb-7.10.tar.xz && cd gdb-7.10
./configure
make && make install

With GDB installed one can start using it to debug programs. Begin by entering ‘gdb [program name]’ in a terminal. If everything is working you should recieve a prompt. If you enter ‘run’ the program will run and GDB will print something along the lines of

(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0x00000000004004db in fault (i=0x7fffffffe95c) at toasted.c:11
11          *s = 'H';

GDB already provides some useful information. We’re able to see the line that causes the segfault, as well as the function it’s in and the arguments to that function. However, when code reuse is nontrivial, it isn’t always clear where a function is being called. If you want to inspect the full series of function calls that led to an error you can enter ‘backtrace’. backtrace will print the entirety of the current call stack along with the address, line number, and arguments of each function call.

(gdb) backtrace
#0  0x00000000004004db in fault (i=0x7fffffffe95c) at toasted.c:11
#1  0x00000000004004c4 in main () at toasted.c:6

GDB also supports breakpoints. Breakpoints are a tool for stopping execution within the debugging environment, making it possible to inspect and alter a programs state upon the attempted execution of an arbitrary section of code. This allows you to preempt faulty code and study the source of errors. To insert a breakpoint enter ‘breakpoint [LOCATION]’, where LOCATION can be a function name, a line number, or a memory address. Once you’ve set a breakpoint, when you run your program using ‘run’ within GDB, its execution will be halted when the program counter reaches the specified address. While breakpoints can be implemented in software, it is interesting to note that x86 has several registers which implement breakpoints at the hardware level. Breakpoints are numbered, and to remove a breakpoint simply enter ‘delete [NUMBER]’

(gdb) break fault
Breakpoint 1 at 0x4004e0: file toasted.c, line 11.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y


Starting program: /root/warez/program 

Breakpoint 1, fault (i=0x7fffffffe95c) at toasted.c:11
11          *i = 4;

Above a breakpoint is created, breakpoint 1, and the program restarted, only to be paused at the breakpoint. From this point we can do a variety of interesting things. One useful command is ‘print’, which prints memory and register values. The syntax is ‘print [FORMAT] [VALUE]’. FORMAT is optional, and specifies the base VALUE will be printed in, for example /d is decimal, /t is binary, and /x is hex. VALUE is the register, variable, or location you wish to print. To print registers, prefix the register name with ‘$’, variables can be optionally annotated with a C style type cast

(gdb) print fault
$2 = {void (int *)} 0x4004d8 <fault>;
(gdb) print /x $rip
$3 = 0x4004e0
(gdb) print /x 0x4004d8
$4 = 0x4004d8(gdb)
print /d (int) *i
$5 = 0

We can see now that the program counter, stored in register rip, is 8 bytes higher than the address of our function. Since our function has a label at the beginning which is just the 64-bit address of that location it makes sense that we see the disparity we do. To print the value of pointers, C style dereferencing is supported. In addition to inspecting variables, GDB can also change their values. This can be done using ‘set [VARIABLE]=[VALUE]’. It is important that the type of value matches that of variable. When in doubt ‘whatis [VARIABLE]’ should provide some type information.

(gdb) whatis i
type = int *
(gdb) set *i=5
(gdb) print *i
$6 = 5

Having inspected and modified program state, it is time to resume execution. Two ways to do this are ‘step’ and ‘continue’. step, and the variant stepi move the program forward one line, and one instruction respectivly, stopping again after executing the appropriate number of instructions. You can step forward repeatedly to inspect how a program changes state line by line, or, if you’re not using a high level language like C, instruction by instruction. continue will continue executing your program normally, stopping only at the next breakpoint.

(gdb) step
12          char *s =this is the  "segfaults";
//next line to be executed
(gdb) print *i
$6 = 4

All of these things can be quite helpful, but we still haven’t used GDB to discover the cause of the segfault. This warrants a closer inspection of what the program is doing. To get this disassemble the function in question using ‘disas [FUNCTION]’

(gdb) disas fault
Dump of assembler code for function fault:
   0x00000000004004d8 <+0>:     push   %rbp
   0x00000000004004d9 <+1>:     mov    %rsp,%rbp
   0x00000000004004dc <+4>:     mov    %rdi,-0x18(%rbp)
   0x00000000004004e0 <+8>:     mov    -0x18(%rbp),%rax

   0x00000000004004e4 <+12>:    movl   $0x4,(%rax)
=> 0x00000000004004ea <+18>:    movq   $0x400584,-0x8(%rbp)
   0x00000000004004f2 <+26>:    mov    -0x8(%rbp),%rax
   0x00000000004004f6 <+30>:    movb   $0x48,(%rax)

   0x00000000004004f9 <+33>:    nop
   0x00000000004004fa <+34>:    pop    %rbp
   0x00000000004004fb <+35>:    retq   
End of assembler dump.

Since the program counter currently resides in this function, GDB is kind enough to annotate exactly which address the program counter points at.

And now for a brief digression about assembly. The broken out section constitutes the actual body of the C function. The other two sections are boilerplate the compiler adds to functions. The first instruction in the relevant block can be ignored as it deals with setting the value of i. The pointed instruction, which we know is the first instruction in the assignment ‘char* s = “segfault”;’, can be read ‘move 0x400584 as a 64-bit value to the address 8 bytes lower than the base of the current stack frame’.The offset is because the stack grows down in memory. Everything is still pretty reasonable, 0x400584 makes sense typewise as a value for our pointer s, which should be stored on the stack, it’s a static value because the compiler layed it out as binary data when it was compiled. You can confirm this by opening another terminal, leaving your precious GDB session untouched, and running

strings [program name] | grep segfaults

. strings is a tool to print the various ascii strings present in a binary file, and piping its output through grep should show the presence of the string “segfaults” in the binary itself. Running step executes the pointed instruction, and the one after that, moving the pointer’s address into a register for later use. The next instruction seems sane, moving 0x48 (ascii H) to the beginning of the string’s location in memory. The only instructions left are a nop and some stuff common to most functions, so this instruction must constitute the entirety of the offending line.

While the next instruction will trigger the segfault, the key lies in the address being written to, so it’s worth finding out a little more about it. It so happens that by looking at the section information the error reveals itself, and while printing out the section layout is rarely a thing I find myself doing in GDB it is certainly within the capabilities of the tool. The rather unclear syntax to get this information is ‘maint info section’, which when run gives you a summary of each of the individual sections in the binary.

(gdb) maint info section
Exec file:
    `/root/warez/program', file type elf64-x86-64.
 [0]     0x00400200->0x0040021c at 0x00000200: .interp ALLOC LOAD READONLY DATA HAS_CONTENTS
 [1]     0x0040021c->0x0040023c at 0x0000021c: .note.ABI-tag ALLOC LOAD READONLY DATA HAS_CONTENTS
 [2]     0x0040023c->0x00400260 at 0x0000023c: .note.gnu.build-id ALLOC LOAD READONLY DATA HAS_CONTENTS
 [3]     0x00400260->0x0040027c at 0x00000260: .gnu.hash ALLOC LOAD READONLY DATA HAS_CONTENTS
 [4]     0x00400280->0x004002c8 at 0x00000280: .dynsym ALLOC LOAD READONLY DATA HAS_CONTENTS
 [5]     0x004002c8->0x00400300 at 0x000002c8: .dynstr ALLOC LOAD READONLY DATA HAS_CONTENTS
 [6]     0x00400300->0x00400306 at 0x00000300: .gnu.version ALLOC LOAD READONLY DATA HAS_CONTENTS
 [7]     0x00400308->0x00400328 at 0x00000308: .gnu.version_r ALLOC LOAD READONLY DATA HAS_CONTENTS
 [8]     0x00400328->0x00400340 at 0x00000328: .rela.dyn ALLOC LOAD READONLY DATA HAS_CONTENTS
 [9]     0x00400340->0x00400370 at 0x00000340: .rela.plt ALLOC LOAD READONLY DATA HAS_CONTENTS
 [10]     0x00400370->0x0040038a at 0x00000370: .init ALLOC LOAD READONLY CODE HAS_CONTENTS
 [11]     0x00400390->0x004003c0 at 0x00000390: .plt ALLOC LOAD READONLY CODE HAS_CONTENTS
 [12]     0x004003c0->0x00400572 at 0x000003c0: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
 [13]     0x00400574->0x0040057d at 0x00000574: .fini ALLOC LOAD READONLY CODE HAS_CONTENTS
 [14]     0x00400580->0x0040058e at 0x00000580: .rodata ALLOC LOAD READONLY DATA HAS_CONTENTS

Looking at the section data one can see that 0x400584 is in the read only data section. Writing to read only memory causes a segfault, but this is hard to determine as the cause without a bit of knowledge about how compilers lay out binaries or a trusty debugger.

One final command worth mentioning is ‘watch’, like break it will pause the execution of the program under certain conditions, but it does so when a piece of data changes. That data must exist, so it is sometimes necessary to use watch in conjunction with break. For example

(gdb) break main
Breakpoint 1 at 0x4004be: file toasted.c, line 6.
(gdb) run
Starting program: /root/warez/program 

Breakpoint 1, main () at toasted.c:6
6           int i = 0;
(gdb) watch i
Hardware watchpoint 2: i
(gdb) delete 1
(gdb) continue
Continuing.
Hardware watchpoint 2: i

Old value = 0
New value = 4
fault (i=0x7fffffffe95c) at toasted.c:12
12          char *s = "segfaults";
(gdb) bt
#0  fault (i=0x7fffffffe95c) at toasted.c:12
#1  0x00000000004004d1 in main () at .c:7

GDB has a load of additional functionality, and I highly reccomend checking out the official documentation in all of its Javascript free wonderment.