GrammaTalk

Interpreting CodeSonar for Binaries Results Part 1


INTRODUCTION:

So you’ve installed CodeSonar for Binaries, set up your hub, and managed to perform an analysis. You are now face-to-face with tons of warning reports that are encoded in this cryptic language called assembly code. How do you interpret this information? How do you make heads or tails of what CodeSonar is telling you?


Interpreting CodeSonar for Binaries Results

Let’s take a look at a typical CodeSonar for Binaries warning. Figure 1 shows a warning report that CodeSonar produced when analyzing the open source program gnuchess. The warning describes a buffer overrun that CodeSonar found in the gnuchess executable and displays a listing of the relevant code surrounding where the buffer overrun occurs.

Figure 1: A Buffer Overrun Warning Generated by CodeSonar

Note that this view is entirely generated by CodeSonar. The actual machine code is just a bunch of bytes that are difficult for humans to comprehend. CodeSonar reconstructs an assembly code representation of the machine code that’s more amenable for human consumption. The process of creating this view is called disassembly, and it involves a number of tasks including deciding which bytes in the program are code and which are data, how to decode the code bytes into individual instructions, how to name things like function entry points, and where references to such names occur within individual instructions and data objects.

Sections, Addresses, and Symbols

The display of the disassembly includes several features. Down the left-hand side, CodeSonar includes section and effective address information. For example:

_text:0000000000419350

Executable binary files are divided up into separate sections such as the text, data, rodata, and bss. Each of these sections generally contains a particular kind of content (e.g., code, read/write data, read-only data, or zero-initialized data). Each of these sections is then loaded by the operating system at a particular address in memory. And as a result, each element in a section gets loaded at a specific address as well. From the disassembly view produced by CodeSonar, we can see that the function return_append_str is located in the _text section and is loaded at address 0x419350.
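To make the section-and-address label concrete, here is a minimal Python sketch that splits a label of the form shown above into its two parts. The function name parse_location is made up for illustration and is not part of CodeSonar; the sketch only assumes the "section:hex address" layout visible in the listing.

```python
def parse_location(label):
    """Split a 'section:hexaddress' label into (section name, numeric address)."""
    section, _, hex_address = label.partition(":")
    return section, int(hex_address, 16)


section, address = parse_location("_text:0000000000419350")
print(section, hex(address))  # _text 0x419350
```

The leading zeros are just padding to the full 64-bit address width, which is why the text refers to the same location as 0x419350.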

The concept of a structured function doesn’t really exist explicitly in binary code. Control is passed from one function to another by jumping to the address of the called function’s entry point. Addresses are labeled by symbols to help tools like the linker connect a call site to the called function. CodeSonar recovers these symbols when they are still present in the binary. When they are not, it generates symbols as needed to make the disassembly listing easier to read. The following two symbols from our example refer to addresses displayed directly in the listing:

  • return_append_str
  • loc_4193B8

The first of these was recovered from debugging information. The latter was generated by CodeSonar.
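The idea of "use the recovered symbol if there is one, otherwise generate a name" can be sketched in a few lines of Python. This is only an illustration: the function symbol_for is hypothetical, and the "loc_" plus uppercase hex pattern is inferred from the single generated name in this listing, not taken from CodeSonar documentation.

```python
def symbol_for(address, recovered_symbols):
    """Return the recovered symbol for an address, or a generated placeholder name."""
    if address in recovered_symbols:
        return recovered_symbols[address]
    # Assumed naming pattern, modeled on 'loc_4193B8' in the listing.
    return f"loc_{address:X}"


recovered = {0x419350: "return_append_str"}  # e.g. from debugging information
print(symbol_for(0x419350, recovered))  # return_append_str
print(symbol_for(0x4193B8, recovered))  # loc_4193B8
```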

There are also uses of symbols in this listing that refer to addresses not directly listed in this view. These include the following symbols:

  • __thunk_.strlen
  • __thunk_.malloc
  • __thunk_.memcpy

These symbols represent the addresses of functions that return_append_str calls to accomplish its task. In this case, all three are functions imported from a dynamically linked library (a DLL on Windows, a shared library on Linux). The “__thunk_” prefix indicates that the functions are not called directly, but rather are invoked through a small snippet of code that mediates the library interface. CodeSonar uses the term thunk to refer to these little snippets.

Local Variables and Parameters

At the function entry point for return_append_str, there is a block of statements that define the locations of local variables and parameters in the function’s stack frame. Each function maintains its own stack frame where it stores local variables and passes parameters. Generally, when a function calls another function, it pushes parameter values onto the stack. Then it executes the call, which also saves the return address to the stack. The called function then reserves space on top of this for its own local variables. When the called function returns, it removes its local variables (by “popping” the stack), and the return address and parameters are removed as the return executes (or possibly by the calling function).

Most accesses to local variables and parameters are performed by computing an offset off of an address that represents the base of the stack frame (often called the “frame base”). The frame base is generally stored in a register, so accessing a variable or parameter usually involves dereferencing a fixed offset added to that register.

In our example, we see the following local variable declaration:

ext_28 = qword ptr -40

This indicates that CodeSonar thinks there is an 8-byte local variable (a “qword”) located on the stack at offset -40 from the frame base.

We also see this declaration:

return-addr$ = qword ptr 0

This indicates that the return address for the function is also 8 bytes in size and is located at offset 0 from the frame base (i.e., it is stored at the base of the stack frame).
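As a rough illustration of how these declarations turn into memory addresses, the Python sketch below adds each declared offset to a frame base. The frame base value is an assumption chosen for the example; the real value depends on where the stack happens to be when the function runs.

```python
# Offsets taken from the two declarations in the listing.
FRAME_SLOTS = {
    "ext_28": -40,       # 8-byte local variable
    "return-addr$": 0,   # 8-byte return address
}


def slot_address(frame_base, name):
    """Compute the address of a stack slot as frame base plus declared offset."""
    return frame_base + FRAME_SLOTS[name]


frame_base = 0x7FFFFFFFE000  # assumed value, for illustration only
for name in FRAME_SLOTS:
    print(name, hex(slot_address(frame_base, name)))
```

Locals sit at negative offsets because the stack grows toward lower addresses, so anything the function allocates after the return address ends up below it.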

In this particular example, there are no declarations for parameters in the stack frame. That’s because for the x64 platform, the calling convention dictates that the first few parameters are passed in registers rather than on the stack. And since this example was compiled for Linux, those registers are rdi, rsi, rdx, rcx, r8, and r9, in that order. In our example, return_append_str takes two parameters, which are stored in rdi and rsi.
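The register assignment described above can be written down as a small lookup. This sketch covers only integer and pointer arguments under the Linux x64 convention. The parameter names passed in are placeholders, since the listing does not tell us what the parameters of return_append_str are called.

```python
# Register order for integer/pointer arguments on Linux x64.
ARG_REGISTERS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def argument_location(index):
    """Return where the argument at a zero-based position is passed."""
    if index < len(ARG_REGISTERS):
        return ARG_REGISTERS[index]
    return "stack"  # the seventh and later arguments go on the stack


# Placeholder parameter names for a two-parameter function.
for position, name in enumerate(["first_param", "second_param"]):
    print(name, "->", argument_location(position))
```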

Decoded Instructions

Finally, we have the instructions themselves. For x86 and x64, CodeSonar generates assembly code in what’s called “Intel syntax”. That means that two-operand instructions generally have the following form:

<opcode>  <dst>, <src>

For example:

mov r13, rsi

The <opcode> indicates what operation is performed by the instruction. The <src> operand indicates where the source value comes from, and the <dst> operand indicates where the result of the operation is stored. In some instructions, the <dst> operand is also used as a second source value to be used in the operation.

The mnemonics for the <opcode> part of the instruction are generally pretty obvious, though some can be confusing. For example, the “mov” instruction moves data from the <src> operand to the <dst> operand. The “sub” instruction subtracts the <src> operand from the <dst> operand and stores the result in the <dst> operand. When in doubt, check the instruction manual for the particular processor you’re working with.
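To see the destination-first operand order in action, here is a toy Python interpreter for just the two instructions discussed above. It is a deliberately simplified sketch: it handles only register and immediate operands, ignores flags and memory operands, and the helper name execute is made up for this example.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, like a real register


def execute(instruction, registers):
    """Apply one 'opcode dst, src' instruction to a dict of register values."""
    opcode, operands = instruction.split(None, 1)
    dst, src = [operand.strip() for operand in operands.split(",")]
    # The source is either another register or an immediate number.
    value = registers[src] if src in registers else int(src, 0)
    if opcode == "mov":
        registers[dst] = value
    elif opcode == "sub":
        # dst is both the second source and the destination.
        registers[dst] = (registers[dst] - value) & MASK64
    else:
        raise ValueError(f"unsupported opcode: {opcode}")


registers = {"r13": 0, "rsi": 0x1000}
execute("mov r13, rsi", registers)   # r13 now holds 0x1000
execute("sub r13, 0x10", registers)  # r13 now holds 0xFF0
print(hex(registers["r13"]))  # 0xff0
```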

There are also instructions with only one operand in our example. The “push” instruction causes the stack to grow and stores the value of its operand at the new “top” of the stack (which is actually the lowest address in the stack, because the function stack on x64 grows downward in the address space). Another class of one-operand instructions is the group of control-flow instructions. Normally, execution works down the sequence of instructions one after the next. Control-flow instructions alter this flow by causing execution to jump elsewhere. Our example includes the “jz” instruction, which is a conditional jump: it jumps only if the zero flag is set, which happens when the most recent flag-setting instruction (typically an arithmetic or comparison operation) produced a zero result. We also see a number of “call” instructions. These generally transfer control to the entry point of a function. When the called function returns, control flow comes back to the instruction following the “call” instruction.
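The behavior of these one-operand instructions can be modeled with a tiny Python class. This is a simplified sketch, not how a real processor or CodeSonar represents things: memory is a dictionary, every stack slot is 8 bytes, and all addresses other than the entry point of return_append_str are made up for the example.

```python
class Machine:
    """Toy model of the stack pointer, stack memory, and the zero flag."""

    def __init__(self, rsp):
        self.rsp = rsp          # stack pointer
        self.memory = {}        # address -> 8-byte value
        self.zero_flag = False

    def push(self, value):
        self.rsp -= 8           # the stack grows toward lower addresses
        self.memory[self.rsp] = value

    def pop(self):
        value = self.memory.pop(self.rsp)
        self.rsp += 8
        return value

    def jz(self, target, fallthrough):
        """Return the next instruction address for a jz at this point."""
        return target if self.zero_flag else fallthrough

    def call(self, target, return_address):
        """Save the return address on the stack and transfer to the target."""
        self.push(return_address)
        return target

    def ret(self):
        """Resume at the address saved by the matching call."""
        return self.pop()


machine = Machine(rsp=0x7FFFFFFFE000)                    # assumed stack pointer
next_pc = machine.call(0x419350, return_address=0x400123)  # hypothetical call site
print(hex(next_pc), hex(machine.rsp))
print(hex(machine.ret()))
```

In this model a call is just a push of the return address followed by a jump, which is why the return address shows up as an 8-byte slot in the called function’s stack frame.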


In Part 2 of this post we’ll cover how to interpret the warning execution path and how to use this to discover the source of the error.


 
