ORC JIT Weekly #34 -- ORC Runtime and JITLink improvements, and a performance question

Hi All,

Just a few minor updates this week:

  • Initial ORC runtime unit testing infrastructure has landed. Now that all the basic infrastructure is in place, I plan to start rolling out implementation patches next week.

  • ObjectLinkingLayer acquired support for JITLink LinkGraphs as first-class input (on a par with object files).

  • LinkGraph debug dumping has been improved.

An interesting question came up in discussion with @Xexizy in #jit on the LLVM Discord server: Given that we’re re-using the static compiler and trying to match the native code model, how and where would we expect JIT’d code performance to differ from AOT code performance? Leaving aside compilation and linking overhead and focusing on the performance of JIT’d code once it’s in memory, here are a few quick observations / thoughts:

  • Feedback on JIT’d code performance has been sparse. Performance has been good enough for my use cases so far, so I haven’t gotten around to measuring it systematically. It would be cool to build some ORC JIT benchmarks, but exactly what those benchmarks should look like is not clear yet.

  • Memory layout: We make no attempt to lay memory out in a way that is friendly to the memory system, though the user has some control over that through their choice of memory manager implementation. The cost of poor memory layout will vary from system to system. In theory JITLink should give us enough information and flexibility to re-layout function bodies in memory (using some sort of measurement/analysis to determine a favorable layout). Nobody has actually tried this yet to my knowledge.

  • Indirect access: RuntimeDyld usually uses indirect access through registers for functions and globals. On some platforms this may be quite inefficient. JITLink uses direct calls, synthesizes jump stubs for external call targets only, and uses global offset tables to access data. Built-in JITLink optimizations opportunistically bypass the jump stubs and GOT loads whenever the target ends up being in range. The resulting linked code should be nearly identical to the ahead-of-time compiled version.

  • Laziness: Lazy compilation in ORCv1 always used pointer stubs, and this is the default behavior in ORCv2 too. JITLink allows us to identify call sites, which we could use to rewrite calls (security model permitting) to bypass the stubs after function bodies are lazily compiled.

  • Thread local variables: The JIT only supports emulated thread local variables at the moment (where it supports them at all). The ORC runtime enables support for native thread locals on MachO, but the current implementation isn’t optimized – performance still won’t be as good as pre-compiled TLVs. Future implementations should be able to reduce the cost to something much closer to pre-compiled TLVs.
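
To make the "in range" condition from the indirect access point concrete, here is a minimal, self-contained sketch of the kind of check involved. It is not JITLink's actual code: the names fitsInInt32Displacement, CallSite and chooseCallDestination are hypothetical, and it assumes an x86-64 style call whose 32-bit signed displacement is measured from the address of the next instruction.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of the JITLink API): returns true if a
// signed 32-bit PC-relative displacement, measured from NextPC, can reach
// Target. Assumes the two addresses are less than 2^63 bytes apart.
bool fitsInInt32Displacement(uint64_t NextPC, uint64_t Target) {
  int64_t Delta = static_cast<int64_t>(Target - NextPC);
  return Delta >= std::numeric_limits<int32_t>::min() &&
         Delta <= std::numeric_limits<int32_t>::max();
}

// Hypothetical model of a call site that was initially routed via a stub.
struct CallSite {
  uint64_t NextPC;     // address of the instruction following the call
  uint64_t StubAddr;   // address of the synthesized jump stub
  uint64_t TargetAddr; // final address of the callee
};

// Chooses where the call should branch: straight to the callee when it is
// reachable with a 32-bit displacement, otherwise through the stub.
uint64_t chooseCallDestination(const CallSite &CS) {
  return fitsInInt32Displacement(CS.NextPC, CS.TargetAddr) ? CS.TargetAddr
                                                           : CS.StubAddr;
}
```

The same shape of test applies to GOT loads: if the symbol's final address is reachable directly, the load through the table can in principle be relaxed to a direct address computation. Other architectures have different branch ranges, so the 32-bit bound here is only an illustration.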

I’ll keep thinking about this and add to this list if I come up with more. If any of you have thoughts or insights on JIT’d code performance that you’d like to share, please jump in.

– Lang.