Lifting ASM to IR

Does there exist a tool that could lift a binary (assembly for some supported target) to LLVM IR? If there isn’t, does this seem like something that would be feasible?

http://llvm.org/devmtg/2013-04/bougacha-slides.pdf
might be a starting point.

Joerg

Does there exist a tool that could lift a binary (assembly for some
supported target) to LLVM IR? If there isn't, does this seem like
something that would be feasible?

There's plenty of variations on the idea: Revgen/S2E, Fracture, Dagger
(my own), libcpu, several closed-source ones used by pentest shops,
some that use another representation before going to IR (say
llvm-qemu), and probably others still I forgot about.

Are you interested in a specific target / use case?

http://llvm.org/devmtg/2013-04/bougacha-slides.pdf
might be a starting point.

Note that after a hiatus I've been slowly revamping Dagger (more to
come), making the implementation parts of the slides tremendously
out-of-date (it doesn't help that, at the time, I was a kid with a
laptop and a dream - not to say I'm much more now).

For instance, the translation now re-uses the existing
instruction-selection patterns in LLVM as much as possible, rather
than hand-writing them.

Also note that, as opposed to the other projects, it's a for-fun
hobby, so you might want to investigate your options :wink:

-Ahmed

I was thinking something along the lines of lifting a binary into IR and
spitting it out for a different processor.

Or maybe a decompiling tool of some kind.

    >> Does there exist a tool that could lift a binary (assembly for some
    >> supported target) to LLVM IR? If there isn't, does this seem like
    >> something that would be feasible?

    There's plenty of variations on the idea: Revgen/S2E, Fracture, Dagger
    (my own), libcpu, several closed-source ones used by pentest shops,
    some that use another representation before going to IR (say
    llvm-qemu), and probably others still I forgot about.

    Are you interested in a specific target / use case?

I was thinking something along the lines of lifting a binary into IR and
spitting it out for a different processor.

This is going to be extremely difficult. Imagine for example how this function would be compiled:

   struct Foo {
     void *v;
     int i;
     long l;
   };

   long bar(Foo *f) {
     return f->l;
   }

If we pick a particular target, and compile this function for that, then 'foo' will have some offset into the struct from which it loads 'l'. This is easy because we know the sizes of the struct's members, and the layout rules for structs on the target.

Now turn that around: given an offset into a struct for one target, what's the offset into the same struct on another target? We're stuck because we can't reconstruct from this offset what the sizes of v and i are individually; all we have is their sum (and that doesn't even take alignment issues into account). This is because when we're looking at the binary we don't know, given that offset, that the two elements in front of it are a void* and an int.

Now, you might think that: "well, okay we'll just use the offsets from one target in the other target's binaries". But that isn't going to work either: what if void* isn't the same size between the two targets? And that's just the tip of the iceberg.

TL;DR: binary translation is a very difficult problem.

Jon

I was afraid of something like that. I was thinking that translating
function calls would be an issue; I didn't think about data layout.

what if void* isn't the same size between the two targets?

I believe that the translated program has no other choice but to use the same
pointer size and data layout as the original one. Though, memory operations have
to be modified to map one memory space into another properly.

Having spent my formative years (or whatever it's called) many years
back writing a disassembler that produced "re-assemblable" code (that
worked most of the time), I'm quite familiar with the fundamentals of
simply knowing "what is read-only data, what is instructions". I had
to invent several heuristics to solve for example jump-tables [switch
statement] that are PC-relative or text-strings embedded in the "code"
section - both typically "look kind of like instructions, but aren't
really" - strings are a little less difficult than jump-tables, but
both can be very tricky - especially if they are relatively short
sequences (is "F\n" an instruction or string?)

If you then want to abstract it further, to a higher level language
(e.g LLVM IR), you have to actually undersstand the semantics of each
instruction.

Calls are typically not so bad - you can figure out from the system
calling convention what is part of the call and what isn't (assuming
the calling convention of the compiler is documented, of course)

Indeed, I guess one of the most difficult questions in binary translation must be
to discriminate between code and data in an executable. And the most difficult
task must be to support self-modifying code.

It seems that though one can use some heuristics to handle both these issues
properly in most cases, to create a robust binary translator one needs to have
both a full copy of initial binary and a full-fledged binary *interpreter* at
run time. Quite similar to compilation of dynamic languages.