Basic instructions for LLVM and control flow graph extraction

I am currently attempting to learn how to use LLVM for control flow graph (CFG) extraction on Linux (Ubuntu). Basically, I need to be able to break specific functions in assembly code down into basic blocks, and use those blocks to build a CFG.

Do any of you upstanding human beings have any knowledge or resources that could possibly assist me in this task?

I apologize if this is a very basic question. I have already installed the proper files/programs.

Thank you in advance.

This isn’t too difficult by itself, as I have done something similar recently, but it does require some modifications to LLVM.

The basic algorithm is simple:

For each ISA instruction, create a new MachineInstr and add it to the current MachineBasicBlock (MBB).

At each branch instruction, add the branch to the current MBB, append that MBB to a list, and start a new MBB.

After creating your list of MBBs, iterate through them and connect the successors based on branches and fall-throughs.
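
To make the three steps concrete, here is a minimal sketch on a toy instruction list. It does not use LLVM at all: Instr, Block and build_blocks are hypothetical stand-ins for MachineInstr, MachineBasicBlock and the pass you would write, and the five-instruction program is made up. Branch targets that land in the middle of a block are simply ignored here, which a real implementation would have to handle.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instr:
    addr: int
    size: int
    mnemonic: str
    target: Optional[int] = None  # branch destination, None for non-branches
    conditional: bool = False     # True when the branch may fall through


@dataclass
class Block:
    instrs: list = field(default_factory=list)
    succs: list = field(default_factory=list)  # start addresses of successors

    @property
    def start(self):
        return self.instrs[0].addr


def build_blocks(instrs):
    # Append every instruction to the current block; a branch closes the
    # block, which goes on the list, and a fresh block is started.
    blocks, current = [], Block()
    for ins in instrs:
        current.instrs.append(ins)
        if ins.target is not None:
            blocks.append(current)
            current = Block()
    if current.instrs:
        blocks.append(current)

    # Connect successors: the branch target (if it starts a block) and,
    # for conditional branches, the fall-through block that follows.
    starts = {b.start for b in blocks}
    for i, b in enumerate(blocks):
        last = b.instrs[-1]
        if last.target is not None and last.target in starts:
            b.succs.append(last.target)
        if last.conditional and i + 1 < len(blocks):
            b.succs.append(blocks[i + 1].start)
    return blocks


# Made-up program: every instruction is 2 bytes long.
program = [
    Instr(0, 2, "cmp"),
    Instr(2, 2, "jne", target=8, conditional=True),
    Instr(4, 2, "add"),
    Instr(6, 2, "jmp", target=0),
    Instr(8, 2, "mov"),
]
blocks = build_blocks(program)
for b in blocks:
    print(hex(b.start), [hex(s) for s in b.succs])
```

The successor lists hold start addresses rather than block objects only to keep the printout short; in LLVM you would call addSuccessor on the MachineBasicBlock instead.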

The problem is that what you are producing has no connection to the IR, and there are parts of LLVM that expect that link, specifically the printing/CFG dumping functions.

> This isn’t by itself too difficult, as I have done something similar
> recently, but does require some modifications of LLVM.

By the way, there’s some stuff in LLVM that creates an MC CFG
(MCModule, MCObjectDisassembler, ...), but it still needs a lot of
work to be reliable and to handle more cases. I have some local
patches that need more work, and that I’ll eventually push, though.

It gets tricky when you want true basic blocks, without duplicating
subsets of the instructions when you discover an entry point in the
middle of a basic block you already created. It’s even trickier when
you consider jumping inside an instruction, and then needing to join
an existing basic block.
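
As a rough illustration of the first case (a new entry point inside a block you already built), the sketch below splits the existing block instead of creating a second, overlapping copy. The dict layout, the split_block helper and the addresses are all hypothetical; this is not how MCModule stores things.

```python
def split_block(blocks, succs, entry):
    """Make `entry` the start of a block, splitting an existing one if needed.

    blocks: dict mapping block start -> sorted list of instruction addresses
    succs:  dict mapping block start -> list of successor block starts
    Returns the block start, or None when `entry` is not on an instruction
    boundary of any known block (the "jump inside an instruction" case).
    """
    if entry in blocks:
        return entry  # already a block start, nothing to do
    for start, addrs in list(blocks.items()):
        if entry in addrs:
            cut = addrs.index(entry)
            blocks[start] = addrs[:cut]   # head keeps the first instructions
            blocks[entry] = addrs[cut:]   # tail becomes its own block
            succs[entry] = succs[start]   # tail inherits the old successors
            succs[start] = [entry]        # head now falls through to the tail
            return entry
    return None


# One block of four instructions that ends by jumping to 0x200.
blocks = {0x100: [0x100, 0x102, 0x105, 0x108]}
succs = {0x100: [0x200]}

new_start = split_block(blocks, succs, 0x105)   # entry point found mid-block
mid_instr = split_block(blocks, succs, 0x103)   # not an instruction boundary
```

A None result is the signal that you have to start disassembling a fresh block at that address, which is where the overlapping-instruction problem comes from.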

For instance, if you jump to an instruction that starts at address X
and takes up 7 bytes, but disassembling at address X+5 gives you a
valid 2-byte instruction, then you need one basic block with the
7-byte instruction and another with the 2-byte one, both having the
basic block that starts at X+7 as a successor.
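
Here is that exact situation as a small sketch. The decode_size table stands in for whatever the disassembler reports at each address, and X, the 1-byte instruction at X+7 and the set of entry points are all made-up assumptions for illustration.

```python
X = 0x1000

# Hypothetical decoder results: address -> size of the instruction found there.
decode_size = {X: 7, X + 5: 2, X + 7: 1}

# Entry points assumed to be known already (both X and X+5 are jump targets,
# and X+7 is where the two paths meet).
entries = {X, X + 5, X + 7}


def straight_line_block(start):
    """Decode from `start` until the next known entry point or undecodable byte."""
    addrs, pc = [], start
    while pc in decode_size and (pc == start or pc not in entries):
        addrs.append(pc)
        pc += decode_size[pc]
    return addrs, pc


cfg = {}
for start in sorted(entries):
    addrs, end = straight_line_block(start)
    # The block falls through to `end` only if a block actually starts there.
    cfg[start] = (addrs, [end] if end in entries else [])

for start, (addrs, succs) in cfg.items():
    print(hex(start), [hex(a) for a in addrs], [hex(s) for s in succs])
```

The blocks at X and X+5 cover overlapping bytes but share no instruction, so neither can be expressed as a split of the other; they only meet again at X+7.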

If you want to do some quick experimentation, you can use
"llvm-objdump -cfg -d <binary>", which gives you a CFG for each
function found in the binary in a separate Graphviz dot file. It
doesn’t look at the object file format stuff (symbols, or fancier
things like the FUNCTION_STARTS load command on Mach-O), but again,
I’ll get around to all this eventually.
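
If you haven’t seen the dot format before, it is just a list of nodes and edges. The sketch below writes one by hand from a hypothetical successor map; the cfg_to_dot helper and the bb_<address> node names are my own invention and do not reproduce what llvm-objdump actually emits.

```python
def cfg_to_dot(func_name, succs):
    """Render a successor map (block start -> list of successor starts) as DOT text."""
    lines = [f'digraph "{func_name}" {{']
    for start in sorted(succs):
        lines.append(f'  "bb_{start:x}";')  # declare the node
        for target in succs[start]:
            lines.append(f'  "bb_{start:x}" -> "bb_{target:x}";')  # one edge per successor
    lines.append("}")
    return "\n".join(lines)


# Made-up CFG: block 0 branches to 8 or falls through to 4, block 4 loops back.
dot = cfg_to_dot("example", {0x0: [0x8, 0x4], 0x4: [0x0], 0x8: []})
print(dot)
```

Saving that text to a file and running "dot -Tpng example.dot -o example.png" renders it, assuming Graphviz is installed.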

Until then, patches welcome!

— Ahmed