First, is there a way to search the archives for this list ? I apologize in advance if I have stepped on a FAQ.
My goal is to execute legacy binary machine code from a very old one of a kind computer on a variety of modern computers. I already wrote an emulator for the legacy machine that executes the old machine code. However, my emulator is just an interpreter and therefore has some limitations:
- The emulator spends a lot of time in an executive loop that fetches legacy instructions, decodes them, and jumps to appropriate C functions that emulate each legacy instruction. The executive loop also has to handle emulated interrupts, support single-step debugging, etc.
- The emulator is compiled and run on only a few modern hardware/operating system combinations. The emulator is fairly portable, but extensive optimizations on some platforms restrict capabilities on other platforms.
- The emulator executes the legacy machine code unmodified which is good, but that means opportunities for optimization are lost. The legacy machine code is full of dead code, jumps to jumps, redundant sub-expressions, unnecessary memory accesses, etc. Back in the old days, compilers really didn't optimize at all. They generated horrible code that was sometimes hand modified.
My idea is to convert my emulator into a translator that emits LLVM IR either directly or via calls to the LLVM library. I would then execute the result via JIT or native code compilation...
Is this a reasonable approach ?
Can this approach be used even when the legacy code is self modifying ? After a code modification, a re-translation and re-JIT would be needed.
Are there any general suggestions ?
I'm currently writing an IA-32 to LLVMIR translator. I'm only mid way
through, but I can certainly say that there have been more difficulties
than I anticipated when I began!
I think that it is a reasonable approach, perhaps especially in your
case, since you have an emulator already. Automatic static translation
is equivalent to the halting problem for IA-32 code, though perhaps it
wouldn't be for yours (what architecture are you using?). A dynamic
phase is therefore necessary for me -- if it is for you too, you'll have
a leg up.
Self-modifying code is both hideous and unusual, and very difficult to
deal with. I'm leaving it to one side.
General thoughts: are you sure that LLVMIR is suitable? You may be
better off with a lower-level representation. At least in my case, LLVM
enforces a level of structure that doesn't exist in machine code. That's
something you'll also probably have to deal with.
Its type system also hampers the modification of translated code, so
it's advantageous to ensure that you won't need to change any code once
translated. This is of particular importance when you're trying to
figure out the bounds of an array, and things like that: a change to the
size of an array is a change of its type, which means it's much easier
just to get the size of the array right in the first place. I'm
currently in the process of altering my code so that a lot more analysis
takes place before translation even begins!
Finally, how will you deal with memory accesses and aliasing? This is
certainly the thorniest problem, and its the one my dynamic phase exists
Do email me off-list if you like -- it sounds like we're pursuing
similar lines of inquiry!
You may also want to take a look at valgrind. It is capable of translating IA-32/64, PPC-32/64, etc.. to its own SSA-style IR. And then it has backends from the IR to the very same architectures. You can also build a backend to LLVM and let it further optimize the generated code (although valgrind already has its own optimizers).
Building such translators is not easy business. I can tell you that by experience.. Depending on what you want to achieve, I would reuse something already existent.
This is off-topic. I would reply directly to Erick, but I seem to have
missed the note that had his reply address.
The evidence in the literature supporting non-trivial optimization in a
dynamic binary to binary translator should be viewed with caution. Our
suspicion when we started the HDTrans work was that all that the fancy
optimizations accomplish in most cases is to (almost) make up for the
computational cost of generating the IR in the first place. The numbers
we achieved on IA32->IA32 translation seem to support that view. There
are special cases where this might not be true, and of course there are
specific counter-examples, but the HDTrans numbers apparently came as a
shock to some people in the dynamic translator world.
Erick mentioned in particular a number of control flow optimizations
that are very easily handled in a direct emitter (i.e. with no IR). Dead
code is also naturally eliminated by any dynamic translator. Redundant
expressions are not, but the performance advantage of modern processors
over legacy machines is so large that eliminating those may not be
worthwhile. Newer machines typically have more registers and/or fast L1
caches, both of which are your friend in a dynamic translator.
Given this, if performance actually matters your time and effort might
be better spent crafting a really tight binary decoder that pays careful
attention to cache utilization, and then implementing a relatively
modest number of branch to branch and call/return optimizations that are
fairly simple to do with a low-level direct translator.
Regardless of whether you accept my argument about the value of
optimization, it's clear from the HDTrans numbers that you can lose a
lot more performance in decoding and IR generation than the published
research work has generally acknowledged.
Please don't hesitate to contact me directly if I can be helpful.