So I finally took the plunge and switched to MCJIT (it wasn’t too bad, as long as you remember to call InitializeNativeTargetDisassembler if you want disassembly…). Once I got the functionality to a point I was happy with, I wanted to test the performance of the system. I set up a simple benchmark and thought I’d share the results, both because I personally had no idea what to expect, and because it looks like there’s some low-hanging fruit for improving performance.
My JIT is currently structured to create a new module per function it wants to jit. I had experimented with an approach where all IR starts in an “incubator module” and is then extracted on demand into “compilation modules” when I want to send it to MCJIT, but in my experience this wasn’t very helpful. (My goal was to enable cross-function optimizations such as inlining, but there’s no easy way [and it might not even make sense] to run module-level optimizations on a single function.)
The benchmark I set up is a simple REPL loop whose input is a pre-parsed no-op statement. I ran this loop and measured the elapsed time, testing at 1k iterations and 10k iterations. The measurement includes my IR generation, but my expectation was that it would be negligible compared to the MCJIT time, which profiling confirmed. The absolute numbers are from a Release build with asserts turned off (this made a big difference); the percentages are from a Release+Profiling build.
For 1k iterations, the test took about 640ms on my desktop machine, i.e. 0.64ms per module. Looking at the profiling results, about 47% of the time is spent in PassManagerImpl::run, and another 47% in addPassesToEmitMC, which feels like it could be avoided by doing that setup just once. Of the time spent in PassManagerImpl::run, about 35% is PassManager overhead such as initializeAnalysisImpl() / removeNotPreservedAnalysis() / removeDeadPasses().
For 10k iterations, the test took about 12.6s, or 1.26ms per module, so there’s definitely some slowdown happening. Looking at the profiling output, the main difference is the appearance of MCJIT::finalizeLoadedModules(), which ultimately calls RuntimeDyldImpl::resolveRelocations() and SectionMemoryManager::applyMemoryGroupPermissions(), both of which iterate over all memory sections, leading to quadratic overhead. I’m not sure how easy it would be, but it seems like there could be single-module variants of these APIs that would cut down on the overhead, since MCJIT appears to know which modules need to be finalized but doesn’t pass this information down to the dyld / memory manager.
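To see why this goes quadratic, here’s a toy model of the pattern (not LLVM code, just the shape of it): if finalizing walks every section loaded so far, then finalizing after each of n module loads does 1 + 2 + … + n = n(n+1)/2 section visits in total.

```cpp
#include <cstddef>

// Toy model of finalize-after-every-load: each finalize pass walks *all*
// sections loaded so far, not just the newly added one.
struct ToyDyld {
  std::size_t sections = 0;
  std::size_t visits = 0;
  void addModule() { ++sections; }            // assume one section per module
  void finalizeAll() { visits += sections; }  // walks every loaded section
};

// Total section visits after n load+finalize cycles: n(n+1)/2.
std::size_t totalVisits(std::size_t n) {
  ToyDyld d;
  for (std::size_t i = 0; i < n; ++i) {
    d.addModule();
    d.finalizeAll();  // what finalizing after every module effectively does
  }
  return d.visits;
}
```

A single-module finalize would make each cycle O(1) in sections visited, turning the total back into O(n).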
My overall takeaway from these numbers is pretty positive: they’re good enough for where my JIT is right now, and there seems to be some relatively straightforward work that could make them better. I’m curious what other people think.