help decompiling x86 ASM to LLVM IR

Hi,

I am looking to decompile x86 ASM to LLVM IR.
The original C is this:
int test61 ( unsigned value ) {
        int ret;
        if (value < 1)
                ret = 0x40;
        else
                ret = 0x61;
        return ret;
}

It compiles with GCC -O2 to (rather cleverly removing any branches):
0000000000000000 <test61>:
   0: 83 ff 01 cmp $0x1,%edi
   3: 19 c0 sbb %eax,%eax
   5: 83 e0 df and $0xffffffdf,%eax
   8: 83 c0 61 add $0x61,%eax
   b: c3 retq

How would I represent the SBB instruction in LLVM IR?
Would I have to first convert the ASM to something like:
   0000000000000000 <test61>:
   0: cmp $0x1,%edi Block A
   1: jb 4: Block A
   2: mov 0x61,%eax Block B
   3: jmp 5: Block B
   4: mov 0x40,%eax Block C
   5: retq Block D (Due to join point)

...before I could convert it to LLVM IR ?
I.e. Re-write it in such a way as to not need the SBB instruction.

The aim is to be able to then recompile it to maybe a different target.
The aim is to go from binary -> LLVM IR -> binary for cases where the
C source code is not available or lost.

I.e. binary available for x86 32 bit. Re-target it to ARM or x86-64bit.
The LLVM IR should be target agnostic, but would permit the
re-targetting task without having to build AST and structure as a C or
C++ source code program.

Any comments?

James

James Courtier-Dutton <james.dutton@gmail.com> writes:

I am looking to decompile x86 ASM to LLVM IR.
The original C is this:
int test61 ( unsigned value ) {
        int ret;
        if (value < 1)
                ret = 0x40;
        else
                ret = 0x61;
        return ret;
}

It compiles with GCC -O2 to (rather cleverly removing any branches):
0000000000000000 <test61>:
   0: 83 ff 01 cmp $0x1,%edi
   3: 19 c0 sbb %eax,%eax
   5: 83 e0 df and $0xffffffdf,%eax
   8: 83 c0 61 add $0x61,%eax
   b: c3 retq

How would I represent the SBB instruction in LLVM IR?
Would I have to first convert the ASM to something like:
   0000000000000000 <test61>:
   0: cmp $0x1,%edi Block A
   1: jb 4: Block A
   2: mov 0x61,%eax Block B
   3: jmp 5: Block B
   4: mov 0x40,%eax Block C
   5: retq Block D (Due to join point)

...before I could convert it to LLVM IR ?
I.e. Re-write it in such a way as to not need the SBB instruction.

The aim is to be able to then recompile it to maybe a different target.
The aim is to go from binary -> LLVM IR -> binary for cases where the
C source code is not available or lost.

I.e. binary available for x86 32 bit. Re-target it to ARM or x86-64bit.
The LLVM IR should be target agnostic, but would permit the
re-targetting task without having to build AST and structure as a C or
C++ source code program.

Any comments?

This is not possible, except for specific cases.

Consider this code:

long foo(long *p) {
  ++p;
  return *p;
}

The X86 machine code would do something like

add %eax, 4

for `++p', but for x86_64 it would be

add %rax, 8

But you can't know that without looking at the original C code.

And that's the most simple case.

The gist is that the assembly code does not contain enough semantic
information.

It compiles with GCC -O2 to (rather cleverly removing any branches):
0000000000000000 <test61>:
    0: 83 ff 01 cmp $0x1,%edi
    3: 19 c0 sbb %eax,%eax
    5: 83 e0 df and $0xffffffdf,%eax
    8: 83 c0 61 add $0x61,%eax
    b: c3 retq

How would I represent the SBB instruction in LLVM IR?

If you're decompiling an assembly language into IR, it is best to treat the EFLAGS register as just another register that is manipulated as a side effect of instructions, and to let a dead-code elimination pass remove the flag computations that are never used. A rough equivalent in LLVM IR in this case would be:
%cf = icmp ult i32 %edi, 1      ; cmp $0x1,%edi: CF is set when %edi < 1 (unsigned)
%eax2 = sub i32 %eax, %eax      ; sbb %eax,%eax, first half: eax - eax
%1 = zext i1 %cf to i32
%eax3 = sub i32 %eax2, %1       ; sbb, second half: subtract the carry
%eax4 = and i32 %eax3, -33      ; and $0xffffffdf,%eax (0xffffffdf as a signed i32)
%eax5 = add i32 %eax4, 97       ; add $0x61,%eax

The aim is to be able to then recompile it to maybe a different target.
The aim is to go from binary -> LLVM IR -> binary for cases where the
C source code is not available or lost.

I know qemu can use LLVM IR as an intermediate form for optimizing emulation; you might want to look into their source code. Or actually just outright use qemu.

I.e. binary available for x86 32 bit. Re-target it to ARM or x86-64bit.
The LLVM IR should be target agnostic, but would permit the
re-targetting task without having to build AST and structure as a C or
C++ source code program.

Retargeting binaries for different hardware sounds like a losing proposition to me, especially if you're trying to retarget x86 binary code to x86-64: problems here include code that assumes sizeof(void*) == 4 when the correct value on the new target is 8. The only safe way to do this is to effectively emulate the original target machine... which is more or less what qemu does.

Hi James,

Hi,

I am looking to decompile x86 ASM to LLVM IR.
The original C is this:
int test61 ( unsigned value ) {
         int ret;
         if (value < 1)
                 ret = 0x40;
         else
                 ret = 0x61;
         return ret;
}

It compiles with GCC -O2 to (rather cleverly removing any branches):
0000000000000000 <test61>:
    0: 83 ff 01 cmp $0x1,%edi
    3: 19 c0 sbb %eax,%eax
    5: 83 e0 df and $0xffffffdf,%eax
    8: 83 c0 61 add $0x61,%eax
    b: c3 retq

How would I represent the SBB instruction in LLVM IR?

you could use an llvm.ssub.with.overflow.i32 intrinsic to get the
sub-with-carry, and then explicitly extend the carry flag to i32 and
subtract it off too. See

http://llvm.org/docs/LangRef.html#llvm-ssub-with-overflow-intrinsics

Ciao, Duncan.

I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell if a register is referring to a pointer or an
integer based on how it is used. Does it get de-referenced or not? So,
I would know that "p" is a pointer.
2) From the binary, I would know if it was for 32bit or 64bit.
3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
1" (64bit long), or "p = p + 2(32bit long)"

So, I think your "It is not possible" is a bit too black and white.

This is a bad example. A compiler compiling LP64 code would generate the above code on x86_64 for the given C code. An ILP32 compiler for x86_64 would generate something more akin to the 32-bit x86_32 code given above. It should be possible to statically convert such a simple program from one instruction set to another (provided that they're not funky instruction sets with 11 bit words).

That said, converting machine code from one machine to another is, I believe, an undecidable problem for arbitrary code. Certainly self-modifying code can be a problem. There's no type information, either, so optimizations that may rely on it can't be done. Anything that uses memory-mapped I/O or I/O ports is going to cause a real challenge, and system calls won't work the same way on a different architecture. There are probably other gotchas of which I am not aware. In short, it's an exercise fraught with danger, and there will always be a program that breaks your translator.

Most systems that do binary translation do it dynamically (i.e., they grab a set of instructions, translate them to the new instruction set, and then cache the translation for reuse as the program runs). They are essentially machine code interpreters enhanced with Just-In-Time compilation for speed.

-- John T.

I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell if a register is referring to a pointer or an
integer based on how it is used. Does it get de-referenced or not? So,
I would know that "p" is a pointer.

What if the variable is being loaded out of a memory location, and the current use increments it by four but never dereferences it, while some other location dereferences it?

What if (in x86-64 code) the program clears the low three bits of a pointer and uses them as scratchpad space for a few tracking bits? In 32-bit code, that's unsafe, since you can only guarantee two unused bits.

What if you have a pointer variable in the middle of the struct, so you need to shift the data offset of a pointer-relative address to get the correct variable?

What if you have the equivalent assembly code for this C code:
union {
   struct {
     int *a;
     int b;
   };
   struct {
     int c;
     int d;
   };
} x;

...
switch () {
  case A: return &x.b;
  case B: return &x.d;
}

After optimization, cases A and B reduce to the same assembly in 32-bit code but not in 64-bit code.

How would you propose to detect and fix these cases?

2) From the binary, I would know if it was for 32bit or 64bit.
3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
1" (64bit long), or "p = p + 2(32bit long)"

So, I think your "It is not possible" is a bit too black and white.

No, it's AI-hard, as evidenced by the fact that porting programs from 32-bit to 64-bit at the source-code level is nontrivial for large projects with lots of developers. And you have even less information at the assembly level.

James Courtier-Dutton <james.dutton@gmail.com> writes:

I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell if a register is referring to a pointer or an
integer based on how it is used. Does it get de-referenced or not? So,
I would know that "p" is a pointer.
2) From the binary, I would know if it was for 32bit or 64bit.
3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
1" (64bit long), or "p = p + 2(32bit long)"

So, I think your "It is not possible" is a bit too black and white.

There is no amount of automated analysis that makes possible
"translating" arbitrary binary code from one architecture to another.

Your above stated rules would fail for my example. This code:

int foo(int *p) {
   ++p;
   return *p;
}

compiled in x86 (Linux or Windows) would generate the very same binary
code as

long foo(long *p) {
   ++p;
   return *p;
}

but those functions generate different code in x86_64-linux, where `int'
is 32 bits and `long' 64 bits. In the general case, it is unfeasible to
decide if `p' is a pointer to `int' or `long' on x86.

There are lots and lots of examples of that kind. Other type of problems
are translating ABI-related code, reflecting external data structures...

John Criswell <criswell@illinois.edu> writes:

This is a bad example. A compiler compiling LP64 code would generate
the above code on x86_64 for the given C code. An ILP32 compiler for
x86_64 would generate something more akin to the 32-bit x86_32 code
given above. It should be possible to statically convert such a simple
program from one instruction set to another (provided that they're not
funky instruction sets with 11 bit words).

"such a simple program" can't be converted without looking at the
original C code (or some representation that retains the relevant info.)

Please see my other response to the OP.

Apart from that, you need to figure out the big picture starting from
the "simple program." That's what examples are for :-)

[snip]

Joshua Cranmer 🐧 <Pidgeot18@gmail.com> writes:

So, I think your "It is not possible" is a bit too black and white.

No, it's AI-hard, as evidenced by the fact that porting programs from
32-bit to 64-bit at the source-code level is nontrivial for large
projects with lots of developers. And you have even less information
at the assembly level.

Now, *that's* a very good argument.

So, if we take the source-code level case.
You can write a source-code level program that will compile unchanged
to produce a 32-bit application or a 64-bit application.
Proof of this is just looking at almost any Linux based distro
available in 32-bit or 64-bit applications.
So, if you then ask a different question:
Instead of porting a 32-bit program to 64-bit, port the 32-bit program
to a program that will work equally well if compiled for 32-bit target
or 64-bit target?

First steps in this might be looking at every use of "int" and "long"
and replacing them with int32_t and int64_t. I.e. replace target
specific types with target agnostic types.
So, if the binary is 32bit, int will be 32bit, change the source code
to say "int32_t" instead of "int".
If the binary is 32bit, and on that target long will be 32bit, change
the source code to say "int32_t".

I know that there will be special cases that are difficult to handle.
I don't expect 100%. I am looking to write a tool that can do say 80%
of the work.
I believe that I could recognise blocks that we know will work, and
highlight the "unsure" sections of the code, for closer inspection.
I am hoping to be able to highlight target agnostic code and highlight
target specific code and automate the target agnostic parts.

My current decompiler does statistical analysis in order to identify types.
E.g. This register at this instruction is most likely an int32_t but
might be a uint32_t, but definitely not a uint64_t.

So, it is not black and white. I want it to work say 80% of the time,
but at least highlight where the remaining 20% is, and do manual work
on it.

I did not know that. Thank you. I will take a look.

Kind Regards

James

So, if we take the source-code level case.
You can write a source-code level program that will compile unchanged
to produce a 32-bit application or a 64-bit application.
Proof of this is just looking at almost any Linux based distro
available in 32-bit or 64-bit applications.
So, if you then ask a different question:
Instead of porting a 32-bit program to 64-bit, port the 32-bit program
to a program that will work equally well if compiled for 32-bit target
or 64-bit target?

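That's still impossible. In C++, it's trivial to write code like this:
#include <cstddef>
#include <cstdint>

template<size_t size>
struct AlignedStorage;

// Chosen when pointers are 4 bytes wide ("aligner" is an illustrative member name).
template<>
struct AlignedStorage<4> {
   union {
     uint32_t aligner;
     uint8_t element;
   };
};

// Chosen when pointers are 8 bytes wide.
template<>
struct AlignedStorage<8> {
   union {
     uint64_t aligner;
     uint8_t element;
   };
};

...
AlignedStorage<sizeof(void*)> storage;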

You end up compiling literally different code based on the size of a pointer with templates. Or you could use it in a macro. This isn't academic:
<http://dxr.mozilla.org/search?tree=mozilla-central&q=regexp%3A%2F%23if.*SIZEOF_%2F&redirect=true> [1]. (Note: this is in a code base that already uses the intN_t types almost everywhere instead of plain int/long/etc.).

This is a concern even before we get to optimizations that exploit identical representations to deduplicate code on different branches, or to the fact that inlining sizeof() results as constants has profound second-order effects, such as radically altering structure layout. (Grepping a recent paper indicates that the precision of structural typing on binary programs, even when you collapse all types of the same size, is 90%. That is an upper bound on your effectiveness.)

First steps in this might be looking at every use of "int" and "long"
and replacing them with int32_t and int64_t. I.e. replace target
specific types with target agnostic types.
So, if the binary is 32bit, int will be 32bit, change the source code
to say "int32_t" instead of "int".
If the binary is 32bit, and on that target long will be 32bit, change
the source code to say "int32_t".

In 3 million lines of code, there are:
* >1000 uses of size_t
* 857 uses of ptrdiff_t
* >1000 uses of intptr_t and uintptr_t
* 839 uses of ssize_t

I am assuming that all of these are intended to be explicitly pointer-sized integer variables. In addition, there are over 504 distinct unions, which is a subset of places where types are polymorphically used--I'm not counting uses of reinterpret_cast or static_cast (or C-style type-punning)--which chalk up into several thousand more possible combinations.

I know that there will be special cases that are difficult to handle.
I don't expect 100%. I am looking to write a tool that can do say 80%
of the work.

You are *very* optimistic to assume that you can well-type 80% of the program given only the binary code of the program. I think DSA managed, given LLVM IR mid-optimization, to determine 80% of the objects accessed by loads/stores to be a type more precise than a bag of bytes on SPEC2000, which isn't a particularly hard benchmark for real-world programs.

So, it is not black and white. I want it to work say 80% of the time,
but at least highlight where the remaining 20% is, and do manual work
on it.

I am assuming a lot about your background knowledge here, but the fact that you were not aware of qemu as prior art, and also some of your choices of words, lead me to believe that you have not looked very hard into prior research on static analysis of either C code or binary code. That is not a recipe for success.

[1] Shameless DXR plug: we support regex searches :-)