Questions on LLVM and binary translation

Hi,

I'm currently investigating whether LLVM can be used for dynamic binary translation. My goal is to translate "source" machine code into "target" machine code at runtime, e.g., MIPS -> x86. LLVM has a well-defined intermediate representation that separates source and target machine code (source -> LLVM bytecode -> target), and it is quite an extensible and adaptable framework. So I consider LLVM a good choice for building advanced binary manipulation tools. But I have several questions about fitting LLVM into dynamic binary translation use cases:

1. The current JIT implementation assumes the bytecode file is fully generated and is read and parsed by [BytecodeFileReader] before JIT compilation (right?). Can the current LLVM be extended to parse bytecode just in time, that is, to parse a block of bytecode whenever it becomes available? I think it would be a useful and interesting feature for LLVM.

2. Why do the current codegen passes work one function at a time? I'd rather do it one basic block (BB) at a time, because some BBs in a function may never be executed. Is there any difficulty in doing codegen one BB at a time?

Thank you for your attention. Any suggestions and comments on applying LLVM to dynamic binary translation are most welcome.

- Daniel Bao

I'm currently investigating whether LLVM can be used for dynamic binary translation. My goal is to translate "source" machine code into "target" machine code at runtime, e.g., MIPS -> x86. LLVM has a well-defined intermediate representation that separates source and target machine code (source -> LLVM bytecode -> target), and it is quite an extensible and adaptable framework. So I consider LLVM a good choice for building advanced binary manipulation tools. But I have several questions about fitting LLVM into dynamic binary translation use cases:

Ok.

1. The current JIT implementation assumes the bytecode file is fully generated and is read and parsed by [BytecodeFileReader] before JIT compilation (right?). Can the current LLVM be extended to parse bytecode just in time, that is, to parse a block of bytecode whenever it becomes available? I think it would be a useful and interesting feature for LLVM.

Actually, it already does this. You can add functions to the Module whenever you'd like. When you call a newly added function, it will be JIT'd. The one missing feature that would be really nice is a callback from the JIT when an external function is called, so that the .bc file could be generated in a 'pull' style instead of a 'push' style.
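To make the push/pull distinction concrete, here is a minimal toy sketch in plain C++. It does not use the LLVM API: `LazyModule`, `Body`, `Compiled`, and `Resolver` are hypothetical names invented for illustration, and "compiling" is modeled as running a callable once. Functions can be pushed in at any time and are compiled on first call, and an optional resolver callback models the pull style described above.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Toy stand-ins, not LLVM classes.
using Compiled = std::function<int(int)>;   // models machine code
using Body = std::function<Compiled()>;     // running it models JIT compilation

class LazyModule {
public:
    // Asked for a body when an unknown function is called ("pull" style).
    using Resolver = std::function<std::optional<Body>(const std::string&)>;

    explicit LazyModule(Resolver resolver = nullptr)
        : resolver_(std::move(resolver)) {}

    // "Push" style: functions may be added at any time, even mid-execution.
    void addFunction(const std::string& name, Body body) {
        pending_[name] = std::move(body);
    }

    int call(const std::string& name, int arg) {
        auto it = compiled_.find(name);
        if (it == compiled_.end()) {
            auto p = pending_.find(name);
            if (p == pending_.end() && resolver_) {
                // Not added yet: pull the body from the client on demand.
                if (auto body = resolver_(name)) {
                    p = pending_.emplace(name, std::move(*body)).first;
                }
            }
            if (p == pending_.end()) {
                throw std::out_of_range("unknown function: " + name);
            }
            // Compile on first call only, then cache the result.
            ++compileCount_;
            it = compiled_.emplace(name, p->second()).first;
            pending_.erase(p);
        }
        assert(it->second && "compiled code must be callable");
        return it->second(arg);
    }

    int compileCount() const { return compileCount_; }

private:
    Resolver resolver_;
    std::map<std::string, Body> pending_;
    std::map<std::string, Compiled> compiled_;
    int compileCount_ = 0;
};
```

A caller would construct a `LazyModule`, optionally with a resolver, call `addFunction` as new code becomes available, and invoke `call`; each function is compiled once, the first time it is called.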

2. Why do the current codegen passes work one function at a time? I'd rather do it one basic block (BB) at a time, because some BBs in a function may never be executed. Is there any difficulty in doing codegen one BB at a time?

The algorithms used are function-at-a-time, e.g. register allocation.

Note that an LLVM "Function" is just a unit of code with well-defined inputs and outputs. LLVM Functions don't need to correspond to the functions in your input program. If you turn each input basic block into a separate LLVM Function, it should do what you want.
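As a rough illustration of the block-per-function idea, here is a toy sketch in plain C++ that does not use the LLVM API. The guest instruction set (`Op`, `Insn`), the `Cpu` state, and the `Translator` class are all hypothetical. Each guest basic block becomes one callable that takes the CPU state as input and returns the next guest PC as output, and a dispatch loop translates a block only when execution first reaches it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct Cpu { std::int64_t acc = 0; };

// Hypothetical guest instruction set, just enough to form basic blocks.
enum class Op { AddImm, Jump, JumpIfZero, Halt };
struct Insn {
    Op op;
    std::int64_t imm;           // used by AddImm
    std::uint32_t target;       // used by Jump and JumpIfZero (taken)
    std::uint32_t fallthrough;  // used by JumpIfZero (not taken)
};
using GuestBlock = std::vector<Insn>;

constexpr std::uint32_t kHalt = 0xFFFFFFFFu;

// One translated unit per guest basic block:
// input is the CPU state, output is the next guest PC.
using BlockFn = std::function<std::uint32_t(Cpu&)>;

class Translator {
public:
    explicit Translator(std::map<std::uint32_t, GuestBlock> program)
        : program_(std::move(program)) {}

    // Dispatch loop: translate a block only when execution first reaches it.
    std::int64_t run(std::uint32_t entry) {
        Cpu cpu;
        std::uint32_t pc = entry;
        while (pc != kHalt) {
            auto it = cache_.find(pc);
            if (it == cache_.end()) {
                it = cache_.emplace(pc, translate(pc)).first;
            }
            pc = it->second(cpu);
        }
        return cpu.acc;
    }

    std::size_t translatedBlocks() const { return cache_.size(); }

private:
    // A real translator would emit target code here; this toy version
    // wraps the block in a closure that interprets it.
    BlockFn translate(std::uint32_t pc) const {
        const GuestBlock& block = program_.at(pc);
        assert(!block.empty() && "a basic block needs at least a terminator");
        return [block](Cpu& cpu) -> std::uint32_t {
            for (const Insn& i : block) {
                switch (i.op) {
                case Op::AddImm:     cpu.acc += i.imm; break;
                case Op::Jump:       return i.target;
                case Op::JumpIfZero: return cpu.acc == 0 ? i.target : i.fallthrough;
                case Op::Halt:       return kHalt;
                }
            }
            return kHalt;  // block without a terminator: stop
        };
    }

    std::map<std::uint32_t, GuestBlock> program_;
    std::map<std::uint32_t, BlockFn> cache_;
};
```

With a three-block guest program in which one branch target is never taken, `run` translates only the two blocks that execute; the untaken block never reaches `translate`. The per-function algorithms (register allocation, for instance) then simply operate on these small units.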

Thank you for your attention. Any suggestions and comments on applying LLVM to dynamic binary translation are most welcome.

I don't have much experience with the field; maybe some other readers of this list do :)

-Chris