Question about LLVM LLJIT Compile time

Hi,

We are using the new LLJIT class in our compiler, but we have not been successful with the parallel JIT feature. When we tried it previously on multiple modules, our compile time increased significantly. I don’t know whether we are using it incorrectly, or whether we are missing out on optimizations we get when running on a single merged module, but it hasn’t worked for us yet. We are pretty far behind HEAD at the moment, but will try it again soon.

In the meantime, we are trying to find ways to gauge the compilation time of a module. We pass a single module to the LLJIT instance. Is there any information we can get during JIT construction that would let us compare against other modules we run through the JIT? We’re trying to find hot spots or performance issues in our modules. Timers or statistical data would be helpful, if they exist during the execution of the JIT engine.

I imagine parallelizing the JIT will be our best bet for increasing performance, but we have not been able to use that yet.

Any help/ideas would be appreciated.

Thanks,
Chris

Hi Chris,

I can think of a couple of things to check up front:

(1) Are you timing this with a release build or a debug build? ORC uses asserts liberally, including in code that is run under the session lock, and this may decrease parallelism in debug builds.

(2) Are you using a fixed-size thread pool with an appropriate limit? Compiling too many things in parallel can negatively impact performance if it leads to memory exhaustion.

(3) Are you loading each module on a different LLVMContext? Modules sharing an LLVMContext cannot be compiled concurrently, as contexts cannot be shared between threads.
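To make the last two checks concrete, here is a minimal sketch (assuming a reasonably current LLJIT API; `ModulePaths` and the thread count are placeholders) of a setup that uses a fixed-size compile pool and puts each module on its own LLVMContext:

```cpp
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/SourceMgr.h"

using namespace llvm;
using namespace llvm::orc;

Expected<std::unique_ptr<LLJIT>> buildJIT(ArrayRef<std::string> ModulePaths) {
  // Fixed-size pool: compiling too many modules at once can exhaust memory.
  auto J = LLJITBuilder().setNumCompileThreads(2).create();
  if (!J)
    return J.takeError();

  for (const auto &Path : ModulePaths) {
    // One LLVMContext per module -- modules that share a context
    // cannot be compiled concurrently.
    auto Ctx = std::make_unique<LLVMContext>();
    SMDiagnostic Err;
    auto M = parseIRFile(Path, Err, *Ctx);
    if (!M)
      return createStringError(inconvertibleErrorCode(), Err.getMessage());
    if (auto E = (*J)->addIRModule(
            ThreadSafeModule(std::move(M), std::move(Ctx))))
      return std::move(E);
  }
  return J;
}
```

The `ThreadSafeModule` wrapper is what ties the module to its context; handing each module its own context is what allows the compile threads to run independently.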

And some follow up questions: What platform are you running on? Are you using LLJIT or LLLazyJIT? What kind of slow-down do you see relative to single-threaded compilation?

Finally, some thoughts: The performance of concurrent compilation has not received any attention at all yet, as I have been busy with other feature work. I definitely want to get this working though. There are no stats or timings collected at the moment, but I can think of a few that I think would be useful and relatively easy to implement: (1) Track time spent under the session lock by adding timers to runSessionLocked, (2) Track time spent waiting on LLVMContexts in ThreadSafeContext, (3) Add a runAs utility with timers to time execution of JIT functions.

What are your thoughts? Are there any other tools you would like to see added?

Cheers,
Lang.

I did a few compile-and-run benchmarks with lli et al recently. I didn’t see overall performance improvements from parallelization either.
Using a release build on my MacBook Pro, typical runtimes for 403.gcc from bitcode precompiled with clang -O1 -g0 look like this (real, user, sys):

Static build and run:
clang -o main -O0 <bitcode> && ./main <input> 4.652s 4.471s 0.172s

Eager JIT:
lli -jit-kind=mcjit 18.086s 17.975s 0.088s
lli -jit-kind=orc-eager (local hack) 15.334s 11.534s 0.264s

Per-function lazy JIT:
lli -jit-kind=orc-lazy 13.939s 13.779s 0.146s
lli -jit-kind=orc-lazy -compile-threads=8 15.171s 15.590s 0.245s
SpeculativeJIT -num-threads=8 10.292s 17.306s 0.380s

Per-module lazy JIT:
lli -jit-kind=orc-lazy -per-module-lazy 4.655s 4.580s 0.069s
lli -jit-kind=orc-lazy -per-module-lazy -compile-threads=8 4.695s 6.184s 0.173s

Invocations with the -compile-threads parameter dispatch compilation to parallel threads. My guess is that so far the synchronization overhead eats up all the speedup, but I didn’t investigate enough to back this up with evidence. It would be nice to see the difference for -jit-kind=orc-eager, but with my local hack I am currently running into an internal error in the JITed code that I don’t understand yet.

Cheers,
Stefan

Hi Lang,

(1) Are you timing this with a release build or a debug build? ORC uses asserts liberally, including in code that is run under the session lock, and this may decrease parallelism in debug builds.

  • Mostly with release builds. I’ve only attempted debug builds when trying to take a trace with Valgrind/Vtune.

(2) Are you using a fixed-size thread pool with an appropriate limit? Compiling too many things in parallel can negatively impact performance if it leads to memory exhaustion.

  • Yes I’ve tried as few as 2 threads. Doesn’t seem to help.

(3) Are you loading each module on a different LLVMContext? Modules sharing an LLVMContext cannot be compiled concurrently, as contexts cannot be shared between threads.

  • I’ve tried both ways, but yes, I stuck with separate contexts per module.

And some follow up questions: What platform are you running on? Are you using LLJIT or LLLazyJIT? What kind of slow-down do you see relative to single-threaded compilation?

  • Platform is a beefy server (shared among developers) with lots of cores, running Ubuntu. We’re using LLLazyJIT, but with laziness turned off by setting CompileWholeModule. One test I was using took roughly 2-3 minutes to compile (single module). When splitting the module and compiling with the thread count set to 2, it took roughly twice as long.

Finally, some thoughts: The performance of concurrent compilation has not received any attention at all yet, as I have been busy with other feature work. I definitely want to get this working though. There are no stats or timings collected at the moment, but I can think of a few that I think would be useful and relatively easy to implement: (1) Track time spent under the session lock by adding timers to runSessionLocked, (2) Track time spent waiting on LLVMContexts in ThreadSafeContext, (3) Add a runAs utility with timers to time execution of JIT functions.

What are your thoughts? Are there any other tools you would like to see added?

  • I’m curious about (1), runSessionLocked. I’m unfamiliar with that. Not to sound greedy, but all three sound very helpful :)

  • If it helps I could possibly send some code over. Let me know if you’d like to see it.

Thanks,
Chris

Hi Chris,

When splitting the module and compiling, and setting threads to 2, it was taking roughly twice as long.

Yikes.

  • If it helps I could possibly send some code over. Let me know if you’d like to see it.

Yes – that would be great!

  • I’m curious about (1) - runSessionLocked. Unfamiliar with that.

The JIT symbol table operations (registering symbol definitions, lodging queries, updating symbol state) are all protected by the session lock in ExecutionSession. The intent is that these operations should be fast relative to compilation of modules, so there shouldn’t be too much serialization on the session lock. If lots of tasks are waiting on access to the session lock then we need to look at the performance of the symbol table operations.

Given what you’re seeing, I suspect this is poor performance in the symbol dependence tracking system. We should be able to see that with the logging from (1).

One other thing to note: If you break your module M up into modules A and B then you’ll need to make sure that you issue lookups for symbols in both A and B up-front. If you only look for your entry symbol then you’ll trigger compilation of one module, but your second thread will sit idle until the first module reaches the linker and starts issuing lookups for symbols in the second. That serialization will prevent you from seeing any concurrency benefits, even if I fix the dependence tracking performance. To make this easy I’ll add a new transform to ExecutionUtils.h to issue these lookups for you.
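Until that transform lands, the workaround might look something like the following sketch (against a current ORC API, which varies between LLVM versions; the symbol names `entryA` and `entryB` are hypothetical stand-ins for one entry point defined in each half of the split module):

```cpp
#include "llvm/ExecutionEngine/Orc/LLJIT.h"

using namespace llvm;
using namespace llvm::orc;

// Issue a single lookup covering a symbol from each split module, so both
// compile threads get work immediately instead of serializing behind the
// first module's lookup. "entryA"/"entryB" are hypothetical names.
Error warmBothModules(LLJIT &J) {
  auto &ES = J.getExecutionSession();
  auto Syms = ES.lookup(
      makeJITDylibSearchOrder(&J.getMainJITDylib()),
      SymbolLookupSet({J.mangleAndIntern("entryA"),
                       J.mangleAndIntern("entryB")}));
  if (!Syms)
    return Syms.takeError();
  return Error::success();
}
```

The key point is that both names go into one SymbolLookupSet: a single query can trigger materialization of both modules concurrently, whereas two sequential lookups would compile them one after the other.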

Cheers,
Lang.