Incorporating Parallelism into a dynamic binary translator

(Redirect me to a more appropriate mailing list if need be)

For a class project, I intend to write a binary translator for QEMU
using LLVM. My plan (still forming) is to implement a frontend that
converts DEC Alpha instructions into LLVM IR, which can then be
optimized and run through whatever backends LLVM has to produce
optimized native code.

This was done as a GSoC project in 2007 for ARM CPUs [1][2], which
showed that the time to translate to LLVM IR, optimize, and generate
native code is huge compared to the programs' actual execution time
(also see [3]).

So, I'm going to experiment with caching the generated bytecode and
native code on-disk and skipping retranslation on subsequent
executions of the program. (Other suggestions are greatly
appreciated.)
Now, my class is specifically about parallel code optimization. As I
understand it, LLVM doesn't have the infrastructure (loop dependency
analysis, vector instructions in IR, etc.) for auto-vectorization, and
I don't think LLVM is going to automatically split code into threads
for me--so the most immediate way to implement some form of
parallelism is to translate/optimize/link basic blocks, functions, or
full programs in parallel. But if I'm already caching the results,
this is going to be progressively less useful.
I'd much rather have parallelism in the generated code than use
parallelism in generating the code, you see. :-)

I'd love to implement auto-vectorization or a pass that's able to
divide orthogonal code into separate threads, but it's a project with
a six-week timeline. The project only needs to be at the
proof-of-concept stage at the six-week mark, but I certainly want
something I'm able to demonstrate.

What can I do?