Need help on JIT compilation speed

Hi there,

I am trying to JIT-compile a rather big Wasm bytecode program to x86 native code and am running into long JIT compilation times. In the first stage, I use MCJIT to translate the Wasm bytecode into a single LLVM IR module which ends up with 927 functions. It then takes a pretty long time to apply several optimization passes to this big IR module and finally generate x86 code. What should I do to shorten the compilation time? Is it possible to compile this single big IR module with MCJIT in parallel? Is the OrcV2 JIT faster than MCJIT? Can the 'concurrent compilation' feature mentioned on the OrcV2 webpage help with this? Thanks in advance for any advice.

This is how I apply passes to my single IR module (which actually contains 927 functions):

if (comp_ctx->optimize) {
    for (i = 0; i < comp_ctx->func_ctx_count; i++)
        /* ... apply the function passes to each function ... */
}


Hi Terry,
CC'ing Lang Hames; he is the best person to answer.

In general, OrcV2 is the new and stable JIT environment. To get fast compile times you can use the lazy compilation option in OrcV2; this gives fast time-to-first-execution and interleaves compile time with execution time. You can also use the concurrent compilation option in OrcV2 for a further speedup. Additionally, we added a new feature called "speculative compilation" to OrcV2 which yields good results on a set of benchmarks. If you are interested, please try it out; we would like to see some benchmarks for your case 🙂
To try things out you can check out the ExecutionEngine examples directory in LLVM.
I hope this helps

Hi Praveen,

Thanks for your help. I will follow your suggestions and get back if I can make some progress.


Hi Terry,

As Praveen mentioned, OrcV2 supports concurrent compilation and lazy compilation, both of which may help reduce time-to-execution. There are a number of things to keep in mind as you consider your options though:

(1) Neither OrcV2 nor MCJIT has any special tricks to speed up compilation of LLVM IR: The IR compilation process is opaque to them, and for the same input IR they will both take the same time*.
(2) LLVM does not currently support concurrent optimization within a module: Different modules can be compiled concurrently (provided they are attached to different LLVMContexts), but two functions in the same module cannot.
(3) Laziness can help to reduce time-to-execution if some of your IR is either unlikely to be used (in which case you can avoid compiling it altogether) or unlikely to be used until later in program execution (in which case you can defer compilation).
(4) Concurrency can help reduce the wall-clock time required for compilation if you can break your modules up in a suitable way. If you’re relying on whole module optimizations then there are some trade-offs to consider: breaking up a module to enable concurrent compilation may eliminate inlining opportunities. Cloning available_externally function definitions into your module to re-enable inlining opportunities can address this, but adds overhead of its own. Since it appears that you are just doing function-at-a-time optimization (without inlining) you may not have to worry about this.
(5) Some false dependencies still exist in OrcV2's concurrent compilation system – these may artificially limit the amount of parallel work you can do at the moment. I have fixes in mind, but I also have some other features and bugs to address first, so fixes may not be available for a while.
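As a back-of-the-envelope illustration of point (4), the 927 functions could be partitioned into roughly equal batches, one batch per module, so that ORC can compile the modules concurrently. The `partition` helper below is hypothetical, not an LLVM API; the real work of cloning each batch of functions into its own module (with its own LLVMContext) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split `count` function indices into `parts`
// batches, one batch per future LLVM module. Each batch would then be
// cloned into a separate module with its own LLVMContext so that ORC's
// compile threads can work on the batches in parallel.
std::vector<std::vector<std::size_t>> partition(std::size_t count,
                                                std::size_t parts) {
  std::vector<std::vector<std::size_t>> batches(parts);
  for (std::size_t i = 0; i < count; ++i)
    batches[i % parts].push_back(i); // round-robin keeps sizes balanced
  return batches;
}
```

With 8 batches of ~116 functions each, 8 compile threads can in principle cut the wall-clock codegen time close to 8x, at the cost of losing any cross-batch inlining (which, per point (4), doesn't apply to function-at-a-time pipelines anyway).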

Finally: I’m not sure whether you’re just measuring IR optimization time or including CodeGen time too, but the best way to reduce the amount of compilation work to be done is to play around with your optimization settings and look for optimizations that you can discard without having too much impact on generated code quality. You’ll want to look at both the IR optimizations and codegen optimization levels for this.


  • Note: OrcV2 and MCJIT will take the same time to compile the same IR once they reach the compilation stage; however, Orc's lazy compilation utilities will automatically break up modules before they reach the compiler, so you can't do an apples-to-apples comparison of compile times there.