DIY Binary Analysis with OBIN



To learn how tools like IDA work under the hood, and to learn more about binary analysis, I made OBIN (short for Osiris Binary analysis tool), which does the following:

  • Parsing the ELF file and showing the information in the header
  • Disassembling the sections that contain program code (there is also an experimental GUI made with TkInter)
  • Generating the function call graph
  • Checking whether a sequence of syscalls or function calls can happen during the execution of the program

The source of OBIN is available here.


Parsing the ELF

Parsing the ELF header is a tedious process of reading the documentation and implementing it down to the last specified detail. The ELF contains information about the architecture of the machine, the endianness, and the layout of the other data in the file, which includes the actual code of the program. The additional data is there so that the OS knows how to load the program into memory and prepare it for execution. There are tools (like readelf) that display the information of an ELF and can be used to verify our code. OBIN displays this header information for a given ELF file.
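To give a feel for what "implementing the documentation" means, here is a minimal sketch of reading the fixed-size ELF header with Python's struct module. It is not OBIN's actual code: the function name parse_elf_header and the returned dictionary layout are made up for illustration, and only the header itself is handled (no section or program headers).

```python
import struct

ELF_MAGIC = b"\x7fELF"

# Field names of the ELF header that follow the 16-byte e_ident array.
FIELDS = (
    "e_type", "e_machine", "e_version", "e_entry", "e_phoff", "e_shoff",
    "e_flags", "e_ehsize", "e_phentsize", "e_phnum", "e_shentsize",
    "e_shnum", "e_shstrndx",
)


def parse_elf_header(data: bytes) -> dict:
    """Parse the ELF header found at the start of a file image."""
    if len(data) < 16 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")

    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unsupported EI_CLASS or EI_DATA value")

    # EI_DATA: 1 = little endian, 2 = big endian.
    endian = "<" if ei_data == 1 else ">"
    # EI_CLASS: 1 = 32-bit, 2 = 64-bit. Only e_entry, e_phoff and e_shoff
    # change width between the two layouts (4 bytes vs. 8 bytes).
    layout = "HHIIIIIHHHHHH" if ei_class == 1 else "HHIQQQIHHHHHH"

    values = struct.unpack_from(endian + layout, data, 16)
    header = dict(zip(FIELDS, values))
    header["class"] = 32 if ei_class == 1 else 64
    header["endianness"] = "little" if ei_data == 1 else "big"
    return header
```

Feeding it the first 64 bytes of a binary and comparing e_entry, e_machine and the header sizes against the output of readelf -h is a quick way to check that the offsets are right.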

Disassembling

Once we have the relevant sections of the file, the ones that contain the code, we need to transform them from binary chunks into a human-readable assembly format. This looks like the simplest task: for a given architecture there is a specification detailing how each instruction is encoded in binary (e.g. in x86, 0x90 is the NOP instruction). But despite the looks of it, instructions have different lengths (in x86, some are 1 byte long, like NOP, but an instruction can be as long as 15 bytes!), and there are 1000+ instructions for x86 alone. Now imagine wanting to do all this for multiple architectures. So this is something I knew I was definitely not going to implement myself!

This problem has been tackled by other people in the past, and there are really cool libraries that we can use for it. I opted to use Capstone: it is free, it has bindings for whatever language you desire, and it covers most architectures you care about. Recently, during CSAW’19, I had the privilege of running into some lads from Binary Ninja, and they told me that despite these advantages, Capstone is not so fast (which matters when you want to analyze large programs) and it is also not free of bugs. That is also one of the challenges in making a good disassembler: handling every instruction and every possible edge case and quirk in the ISA. There is a great blog post by Binary Ninja regarding disassemblers, which shows that making disassemblers is neither a mundane nor an archaic practice.

In OBIN I also made a very simple GUI using TkInter to show the disassembly (invoke the program with the “-gui” option to see it).
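To make the variable-length problem concrete, here is a toy linear-sweep decoder that knows only a handful of hand-picked x86-64 opcodes. It is purely illustrative and has nothing to do with how Capstone or OBIN work internally: the OPCODES table and the linear_sweep function are invented for this sketch, and prefixes, ModRM bytes and operand decoding are ignored entirely.

```python
# opcode byte -> (mnemonic, total instruction length in bytes)
OPCODES = {
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xCC: ("int3", 1),
    0x55: ("push rbp", 1),
    0x5D: ("pop rbp", 1),
    0xEB: ("jmp rel8", 2),    # opcode + 1-byte displacement
    0xE8: ("call rel32", 5),  # opcode + 4-byte displacement
    0xE9: ("jmp rel32", 5),   # opcode + 4-byte displacement
}


def linear_sweep(code: bytes, base: int = 0):
    """Decode instructions back to back, returning (address, length, text)."""
    result = []
    offset = 0
    while offset < len(code):
        opcode = code[offset]
        mnemonic, length = OPCODES.get(opcode, (None, 1))
        if mnemonic is None or offset + length > len(code):
            # Unknown opcode or truncated instruction: emit the raw byte and
            # move on by one, which is where real decoders can lose sync.
            result.append((base + offset, 1, "db 0x%02x" % opcode))
            offset += 1
            continue
        result.append((base + offset, length, mnemonic))
        offset += length
    return result


if __name__ == "__main__":
    for address, length, text in linear_sweep(bytes.fromhex("5590e8000000005dc3"), 0x1000):
        print("0x%x  (%d bytes)  %s" % (address, length, text))
```

Even this tiny table shows why the length of each instruction has to be known before the next one can be found. With Capstone's Python binding the real thing is roughly Cs(CS_ARCH_X86, CS_MODE_64).disasm(code, base), which yields instruction objects carrying address, mnemonic and op_str.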

What’s next?

After parsing the ELF format and disassembling the code, we can extract useful information like the call graph: by looking at the call instructions, we can build a graph of which function calls which. Imagine having an exploitable function called foo; you will then be interested to know how it is invoked in the program so that you can exploit it. Or take the notorious Heartbleed vulnerability, which boils down to the size argument of a memcpy call coming from attacker-controlled network data (the kind of value you get from a call such as ntohl); you want to check for this behaviour and flag it as unsafe. Usually, as in the Heartbleed example, we need more analysis than just the call graph, like knowing where variables come from and where they go, which is called taint analysis. Maybe you can add this feature to OBIN yourself. OBIN can generate the call graph for a given program.
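As a rough sketch of the idea (not OBIN's implementation), a call graph can be built from a list of function address ranges plus the disassembled instructions. The input shapes here are assumptions: functions maps a name to a (start, end) range with end exclusive, and each instruction is an (address, mnemonic, op_str) tuple in the style Capstone produces. The helper names build_call_graph and callers_of are made up.

```python
from bisect import bisect_right
from collections import defaultdict


def build_call_graph(functions, instructions):
    """Return {caller: {callee, ...}} for direct calls between known functions."""
    ranges = sorted((start, end, name) for name, (start, end) in functions.items())
    starts = [start for start, _, _ in ranges]

    def owner(address):
        # Find the function whose [start, end) range contains the address.
        i = bisect_right(starts, address) - 1
        if i >= 0 and ranges[i][0] <= address < ranges[i][1]:
            return ranges[i][2]
        return None

    graph = defaultdict(set)
    for address, mnemonic, op_str in instructions:
        if mnemonic != "call":
            continue
        try:
            target = int(op_str, 16)
        except ValueError:
            continue  # indirect call (e.g. "call rax"): target unknown statically
        caller, callee = owner(address), owner(target)
        if caller is not None and callee is not None:
            graph[caller].add(callee)
    return graph


def callers_of(graph, target):
    """Every function that can reach target through a chain of calls."""
    reverse = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            reverse[callee].add(caller)
    seen, stack = set(), [target]
    while stack:
        for caller in reverse[stack.pop()]:
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return seen
```

Asking callers_of(graph, "foo") then answers the question of how the exploitable foo can be reached. Indirect calls are silently skipped, which is one reason a graph built this way is only an approximation.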


OBIN can also check for syscalls: you specify a sequence of syscalls, and OBIN checks whether that sequence can happen in the program. Antivirus software does this kind of analysis, but usually in a much more sophisticated fashion, because it has to deal with very large binaries, and in very large volumes. It is thus very important to do such checks very fast, and to check against a very large database of suspicious or malicious behaviour. These behaviours are encoded very efficiently into signatures and matched against any binary in the system.
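One simple way to phrase such a check, sketched here under my own assumptions rather than taken from OBIN, is a search over a flow graph: each node (a basic block or a function) lists the syscalls it performs in order, and we ask whether some path from the entry node performs the wanted syscalls in order, not necessarily back to back. The function name sequence_reachable, the dictionary-based graph and the use of syscall names instead of numbers are all illustrative choices.

```python
def sequence_reachable(successors, syscalls, entry, wanted):
    """True if some path from entry performs the wanted syscalls in order.

    successors: {node: [next nodes]}
    syscalls:   {node: [syscall names performed in that node, in order]}
    """
    wanted = list(wanted)
    # A search state is (node, number of wanted syscalls matched before it).
    seen = {(entry, 0)}
    stack = [(entry, 0)]
    while stack:
        node, matched = stack.pop()
        for name in syscalls.get(node, ()):
            # Greedily consume the next wanted syscall when it shows up.
            if matched < len(wanted) and name == wanted[matched]:
                matched += 1
        if matched == len(wanted):
            return True
        for nxt in successors.get(node, ()):
            if (nxt, matched) not in seen:
                seen.add((nxt, matched))
                stack.append((nxt, matched))
    return False
```

Because each (node, matched) state is visited at most once, loops in the graph do not cause the search to run forever. The answer is "can happen" in the graph, not "does happen" at run time, since the search does not know which branches are actually feasible.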