GSoC Proposal: Table-Driven Decompilation

Hi,

Here's one of my proposals for GSoC 2012. What do you think?

Chip

Project Title: Table-Driven Decompilation

Abstract:
Over the years, the LLVM family has grown to include nearly every type of build tool in existence. One of the few missing is a decompiler. LLVM's TableGen tool could potentially accelerate development of such a tool; most backends already have the information needed to implement it. This project proposes implementing support for decompilation in LLVM using information gleaned from target description files. Such a decompiler could be used for analysis, optimization, and recompilation of machine code.

Proposal:
Since its humble beginnings in 2001, LLVM has grown from a simple compiler toolkit to an entire family of build tools. Currently, it includes an assembler, a disassembler, a JIT, a C compiler, a debugger, an archiver, various tools for analyzing object files, and even a linker. In fact, just about the only tool missing from this set (aside from various language compilers) is a decompiler--a tool to turn machine code back into LLVM IR. This project proposes adding such a tool.

Some of the information needed to produce such a tool is already present--in the form of target description files, some of which contain patterns used to transform LLVM IR--or, more accurately, a selection DAG--into machine code. Since a decompiler is largely a compiler working in reverse, it should be conceivable to use these patterns to transform machine code back into a selection DAG--and transform that, in turn, back into raw LLVM IR. To actually read machine code, the decompiler will use the MC disassembler to produce MCInst objects, which can be transformed back into CodeGen's MachineInstr representation, so it can be fed through the selection DAG in reverse.
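The table-driven idea can be sketched in miniature. The following is a toy illustration only, with hypothetical opcode and type names rather than real LLVM APIs: each TableGen selection pattern maps a DAG operation to a machine opcode, so a reverse table keyed on the machine opcode can recover the IR-level operation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for llvm::MCInst and an LLVM IR instruction.
struct MachineInst {
    std::string opcode;                // e.g. "ADD32rr"
    std::vector<std::string> operands; // def first, then uses
};

struct IRInst {
    std::string op;                    // e.g. "add"
    std::vector<std::string> args;
};

// Reverse pattern table. In the real project this would be generated
// by a TableGen backend from the same .td patterns ISel uses, e.g.
//   (set GR32:$dst, (add GR32:$src1, GR32:$src2)) -> ADD32rr
static const std::map<std::string, std::string> kReverseTable = {
    {"ADD32rr", "add"},
    {"SUB32rr", "sub"},
    {"IMUL32rr", "mul"},
};

// "Unselect" one machine instruction back to its IR-level form.
IRInst unselect(const MachineInst &mi) {
    auto it = kReverseTable.find(mi.opcode);
    assert(it != kReverseTable.end() && "would need a custom C++ reversal");
    // Operand 0 is the def; the remaining operands are the uses.
    return IRInst{it->second,
                  std::vector<std::string>(mi.operands.begin() + 1,
                                           mi.operands.end())};
}
```

Instructions whose selection is done by custom C++ code rather than a pattern would miss this table and need hand-written reversals, as noted below.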

Some DAG->machine code transformations aren't controlled by TableGen patterns, but by custom transformations implemented as C++ code. For those transformations, custom C++ code performing the reverse transformation will be necessary.

Proposed schedule:
Week 1-4: Work on MCInst->MachineInstr transformation.
Week 5: Work on TableGen backend to generate MachineInstr->SDAG tables.
Week 6-8: Work on support for custom MachineInstr->SDAG transformation.
Week 9-12: Work on SDAG->LLVM IR transformation.

Contact Information:
email: cdavis (at) mymail (dot) mines (dot) edu
phone: (seven-one-nine) 963-4781
IRC: cdavis5x@oftc, cdavis5x@freenode

Interest:
Since Apple gutted PowerPC support in Mac OS X 10.7, I have been searching for a way to run old PowerPC apps. While it would be easy to make my own recompiler manually using LLVM's JIT, I believe it would be far more useful to create a general framework for transforming machine code back to LLVM IR.

Use to LLVM:
Decompiling machine code back to LLVM IR could have many potential uses within the LLVM community. Such a component would make it easier to build a recompiler--a tool that transforms machine code from one form to another. In addition, it would allow clients to use the existing LLVM optimization and analysis passes on machine code, without the need to implement special tools that operate directly on MCInst objects.

Experience:
I have been working with LLVM for over two and a half years. I implemented support for the force_align_arg_pointer attribute, as well as the beginnings of Win64 exception-handling support. In GSoC 2010, I added support for multiple C++ ABIs, including the beginnings of support for the Microsoft Visual C++ ABI. I have also contributed to the Wine project--part of which involved making it easier to compile with Clang :).

Hi Chip,

I have little experience in this area, but here is some feedback:

The proposal looks nice, and decompilation (and, later, binary-to-binary translation) sounds very interesting. However, this seems to me to be a very difficult topic, and it would be good if the proposal showed that you understand the difficulties and have ideas for how to solve them. One topic I have heard is difficult, for example, is how to keep track of the state of registers and CPU flags.

The libcpu project solves this by not even trying to reverse the individual LLVM IR-to-machine-code transformations; instead, it directly emits LLVM IR that models each instruction as function calls which perform the original calculation and, at the same time, model the CPU state. You may want to investigate what exactly they do.
Also, it would be interesting to explain how your approach compares to the libcpu approach. Do you think yours has benefits? What are its drawbacks? Did you consider improving libcpu instead of starting your own tool?

Cheers
Tobi

This strikes me as an extremely ambitious project, and I wonder how much you could actually get done in a summer.

Take the MCInst->MachineInstr conversion. To do this phase properly, you have to reconstruct functions, basic blocks, and control flow, as well as identify global variables and their initializers (should they have any). I am not an expert in the state of the art, but some experiments with IDA Free indicate that identifying the latter correctly is hard: consider arrays where you have references to particular elements. References into the middle of the array would generally require constructing a new symbol offset, and arrays of structs are even worse!

Another issue you would have is getting the implicit register usages correct... if you leave these out, you would get a MachineFunction that you can't do any code generation optimizations on safely. This isn't necessarily a problem, but it's something which would need warning labels in big, flashing neon signs. In any case, tracking the implicit use of registers is critical to being able to do register "deallocation" and recovering SSA values for the machine code. It's not clear to me that you could let this fall out during the SelectionDAG process.
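The implicit-operand problem can be made concrete with a toy table in the style of TableGen's Defs/Uses lists. The opcode names and data below are illustrative assumptions, not the actual tables: the point is that a reverse pass must consult implicit-operand information to see that a compare produces a flags value a later conditional jump consumes, even though neither instruction names EFLAGS explicitly.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of TableGen-style implicit operand lists, e.g. x86's
//   let Defs = [EFLAGS] in def ADD32rr : ...
struct ImplicitInfo {
    std::vector<std::string> defs; // registers implicitly written
    std::vector<std::string> uses; // registers implicitly read
};

// Hypothetical per-opcode data for illustration only.
static const std::map<std::string, ImplicitInfo> kImplicits = {
    {"ADD32rr", {{"EFLAGS"}, {}}},
    {"CMP32rr", {{"EFLAGS"}, {}}},
    {"JCC_1",   {{}, {"EFLAGS"}}},
};

// Does `consumer` implicitly read a register that `producer`
// implicitly defines? A reverse pass needs exactly this kind of
// query to thread flag values between instructions.
bool threadsFlags(const std::string &producer, const std::string &consumer) {
    const ImplicitInfo &p = kImplicits.at(producer);
    const ImplicitInfo &c = kImplicits.at(consumer);
    for (const std::string &d : p.defs)
        for (const std::string &u : c.uses)
            if (d == u)
                return true;
    return false;
}
```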

Finally, it's not clear to me that using a selection DAG would gain you much over just doing a straight MachineInstr->LLVM IR transformation. However, this is not my area of expertise, so I won't comment any further in this regard.


> Hi Chip,
>
> I have little experience in this area, but here is some feedback:
>
> The proposal looks nice, and decompilation (and, later, binary-to-binary translation) sounds very interesting. However, this seems to me to be a very difficult topic, and it would be good if the proposal showed that you understand the difficulties and have ideas for how to solve them. One topic I have heard is difficult, for example, is how to keep track of the state of registers and CPU flags.
>
> The libcpu project solves this by not even trying to reverse the individual LLVM IR-to-machine-code transformations; instead, it directly emits LLVM IR that models each instruction as function calls which perform the original calculation and, at the same time, model the CPU state. You may want to investigate what exactly they do.
> Also, it would be interesting to explain how your approach compares to the libcpu approach. Do you think yours has benefits? What are its drawbacks?

My plan was to do "register deallocation," whereby code that uses a limited number of physical registers is converted back to SSA form. (I guess I should have elaborated on that in my proposal, huh? :) I think my approach would result in faster code, since it wouldn't essentially be running an emulator on each instruction. A major drawback is that reversing the IR->SDAG and SDAG->MachineInstr transformations seems to be very difficult. Perhaps it would be better and faster to use something like FastISel for the MachineInstr->LLVM IR transformation?
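For straight-line code, the "register deallocation" idea can be sketched quite simply: every definition of a physical register gets a fresh SSA name, and later uses of that register read the most recent definition. This is a minimal sketch under that assumption (single basic block, no live-ins, no phi nodes), with made-up types rather than LLVM's:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Inst {                       // toy machine instruction
    std::string op;
    std::string def;                // physical register written
    std::vector<std::string> uses;  // physical registers read
};

struct SSAInst {                    // toy SSA-form instruction
    std::string op;
    std::string def;                // fresh SSA value, e.g. "%2"
    std::vector<std::string> uses;  // SSA values read
};

// Rewrite one basic block out of physical registers into SSA form.
std::vector<SSAInst> deallocate(const std::vector<Inst> &block) {
    std::map<std::string, std::string> current; // phys reg -> SSA name
    std::vector<SSAInst> out;
    int counter = 0;
    for (const Inst &i : block) {
        SSAInst s{i.op, "", {}};
        for (const std::string &u : i.uses)
            s.uses.push_back(current.at(u)); // read latest definition
        s.def = "%" + std::to_string(counter++);
        current[i.def] = s.def;              // this def kills the old value
        out.push_back(s);
    }
    return out;
}
```

Across basic blocks this gets much harder: join points need phi nodes, and (as noted elsewhere in this thread) implicit defs and uses must be tracked as well, or the renaming above silently loses values.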

> Did you consider improving libcpu instead of starting your own tool?

No, I did not. I'll have to look into that.

Thanks.

Chip


> This strikes me as an extremely ambitious project, and I wonder how
> much you could actually get done in a summer.
>
> Take the MCInst->MachineInstr conversion. To do this phase properly,
> you have to reconstruct functions, basic blocks, and control flow, as
> well as identify global variables and their initializers (should they
> have any). I am not an expert in the state of the art, but some
> experiments with IDA Free indicate that identifying the latter
> correctly is hard: consider arrays where you have references to
> particular elements. References into the middle of the array would
> generally require constructing a new symbol offset, and arrays of
> structs are even worse!

Maybe I should devote my SoC project to just this, then? Perhaps make a simple MCInst->MachineInstr transformer that doesn't handle anything complex like global variables or arrays?

> Another issue you would have is getting the implicit register usages
> correct... if you leave these out, you would get a MachineFunction
> that you can't do any code generation optimizations on safely. This
> isn't necessarily a problem, but it's something which would need
> warning labels in big, flashing neon signs. In any case, tracking the
> implicit use of registers is critical to being able to do register
> "deallocation" and recovering SSA values for the machine code. It's
> not clear to me that you could let this fall out during the
> SelectionDAG process.

True.

> Finally, it's not clear to me that using a selection DAG would gain
> you much over just doing a straight MachineInstr->LLVM IR
> transformation. However, this is not my area of expertise, so I won't
> comment any further in this regard.

Perhaps after I finish the MCInst->MachineInstr conversion, I should do a MachineInstr->LLVM IR transformation--or maybe I should just do MCInst->LLVM IR directly! I wanted to use the SelectionDAG to take advantage of the patterns stored in the target description files, but now that I think about it, I'm not sure I'd get much out of it, either...

Thanks for your input.

Chip