LLVM Orc Weekly #28 -- ORC Runtime Prototype update

Hi All,

Happy 2021!

I’ve just posted a new Orc Runtime Preview patch: https://github.com/lhames/llvm-project/commit/8833a7f24693f1c7a3616438718e7927c6624894

Quick background:

To date, neither ORC nor MCJIT has had its own runtime library. This has limited and complicated the implementation of many features (e.g. JIT re-entry functions, exception handling, JIT’d initializers and de-initializers), and more-or-less prevented the implementation of others (e.g. native thread local storage).

Late last year I started work on a prototype ORC runtime library to address this, and with the above commit I’ve finally got something worth sharing.

The prototype above is simultaneously limited and complex. Limited, in that it only tackles a small subset of the desired functionality. Complex, in that it’s one of the most involved pieces of functionality that I anticipate supporting, as it requires two-way communication between the executor and JIT processes. My aim in choosing to tackle the hard part first was to get a sense of our ultimate requirements for the project, particularly with regard to where it should live within the LLVM Project. It’s not a perfect fit for LLVM proper: there will be lots of target specific code, including assembly, and it should be easily buildable for multiple targets (that sounds more like compiler-rt). On the other hand it’s not a perfect fit for compiler-rt: it shares data structures with LLVM, and it would be very useful to be able to re-use llvm::Error / llvm::Expected (that sounds like LLVM). At the moment I think the best way to square things would be to keep it in compiler-rt, allow inclusion of header-only code from LLVM in compiler-rt, and then make Error / Expected header-only (or copy / adapt them for this library). This will be a discussion for llvm-dev at some point in the near future.

On to the actual functionality though: The prototype makes significant changes to the MachOPlatform class and introduces an ORC runtime library in compiler-rt/lib/orc. Together, these changes allow us to emulate dlopen / dlsym / dlclose in the JIT executor process. We can use this to define what it means to run a JIT’d program, rather than just running a JIT’d function (the way TargetProcessControl::runAsMain does):

ORC_RT_INTERFACE int64_t __orc_rt_macho_run_program(int argc, char *argv[]) {
  using MainTy = int (*)(int, char *[]);

  void *H = __orc_rt_macho_jit_dlopen("Main", ORC_RT_RTLD_LAZY);
  if (!H) {
    __orc_rt_log_error(__orc_rt_macho_jit_dlerror());
    return -1;
  }

  auto *Main = reinterpret_cast<MainTy>(__orc_rt_macho_jit_dlsym(H, "main"));
  if (!Main) {
    __orc_rt_log_error(__orc_rt_macho_jit_dlerror());
    return -1;
  }

  int Result = Main(argc, argv);

  if (__orc_rt_macho_jit_dlclose(H) == -1)
    __orc_rt_log_error(__orc_rt_macho_jit_dlerror());

  return Result;
}

The functions __orc_rt_macho_jit_dlopen, __orc_rt_macho_jit_dlsym, and __orc_rt_macho_jit_dlclose behave the same as their dlfcn.h counterparts (dlopen, dlsym, dlclose), but operate on JITDylibs rather than regular dylibs. This includes running static initializers and registering with language runtimes (e.g. ObjC).

While we could run static initializers before (e.g. via LLJIT::runConstructors), we had to initiate this from the JIT process side, which has two significant drawbacks: (1) Extra RPC round trips, and (2) in the out-of-process case: initializers not running on the executor thread that requested them, since that thread will be blocked waiting for its call to return. Issue (1) only affects performance, but (2) can affect correctness if the initializers modify thread local values, or interact with locks or threads. Interacting with threads from initializers is generally best avoided, but nonetheless is done by real-world code, so we want to support it. By using the runtime we can improve both performance and correctness (or at least consistency with current behavior).
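
To illustrate issue (2), here is a small, hypothetical example (not taken from the prototype or its tests): a static initializer that writes a thread-local variable. If the initializer runs on a thread other than the one that later calls main, main observes the default value rather than the initialized one.

#include <cstdio>

thread_local int TLSFlag = 0;

struct Init {
  Init() { TLSFlag = 1; } // Runs during static initialization.
};
Init TheInit;

int main() {
  // Prints "TLSFlag = 1" only if the initializer ran on this thread.
  std::printf("TLSFlag = %d\n", TLSFlag);
  return 0;
}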

The effect of this is that we can now load C++, Objective-C and Swift programs in the JIT and expect them to run correctly, at least for simple cases. This works regardless of whether the JIT’d code runs in-process or out-of-process. To test all this I have integrated support for the prototype runtime into llvm-jitlink. Demo output from this tool is shown below for two simple input programs: one Swift, one C++. All of this is MachO specific at the moment, but it provides a template that could easily be re-used to support this on ELF platforms, and likely on COFF platforms too.

While the discussion on where the runtime should live plays out I will continue adding / moving functionality to the prototype runtime. Next up will be eh-frame registration and resolver functions (both currently in OrcTargetProcess). After that I’ll try to tackle support for native MachO thread local storage.

As always: Questions and comments are very welcome.

– Lang.

lhames@Langs-MacBook-Pro scratch % cat foo.swift
class MyClass {
  func foo() {
    print("foo")
  }
}

let m = MyClass()
m.foo()

lhames@Langs-MacBook-Pro scratch % xcrun swiftc -emit-object -o foo.o foo.swift
lhames@Langs-MacBook-Pro scratch % llvm-jitlink -dlopen /usr/lib/swift/libswiftCore.dylib foo.o
foo
lhames@Langs-MacBook-Pro scratch % llvm-jitlink -oop-executor -dlopen /usr/lib/swift/libswiftCore.dylib foo.o
foo
lhames@Langs-MacBook-Pro scratch % cat inits.cpp
#include <iostream>

class Foo {
public:
  Foo() { std::cout << "Foo::Foo()\n"; }
  ~Foo() { std::cout << "Foo::~Foo()\n"; }
  void foo() { std::cout << "Foo::foo()\n"; }
};

Foo F;

int main(int argc, char *argv[]) {
  F.foo();
  return 0;
}
lhames@Langs-MacBook-Pro scratch % xcrun clang++ -c -o inits.o inits.cpp
lhames@Langs-MacBook-Pro scratch % llvm-jitlink inits.o
Foo::Foo()
Foo::foo()
Foo::~Foo()
lhames@Langs-MacBook-Pro scratch % llvm-jitlink -oop-executor inits.o
Foo::Foo()
Foo::foo()
Foo::~Foo()

Wow, thanks for the update. One more ORC milestone in a short period of time!

On macOS I built the C++ example like this:

% cmake -GNinja -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS="clang;compiler-rt" ../llvm
% ninja llvm-jitlink llvm-jitlink-executor lib/clang/12.0.0/lib/darwin/libclang_rt.orc_osx.a
% clang++ -c -o inits.o inits.cpp

The in-process version works perfectly, but with the out-of-process flag the example fails:

% ./bin/llvm-jitlink inits.o
Foo::Foo()
Foo::foo()
Foo::~Foo()
% ./bin/llvm-jitlink -oop-executor inits.o
JIT session error: Symbols not found: [ __ZTIN4llvm6detail14format_adapterE ]

Any idea what could go wrong here? Otherwise I can try to debug it later this week. (Full error below.)

Best
Stefan

Hi Stefan,

% ./bin/llvm-jitlink -oop-executor inits.o
JIT session error: Symbols not found: [ __ZTIN4llvm6detail14format_adapterE ]

I’ve been testing with a debug build:

% xcrun cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_PROJECTS="llvm;clang;compiler-rt" ../llvm

Matching this build might fix the issue, though building with my config (if it works) is only a short-term fix. The error that you’re seeing implies that the runtime is depending on a symbol from libSupport that is not being linked into the target (llvm-jitlink-executor). I’ll aim to break these dependencies on libSupport in the future. Mostly that means either removing the dependence on llvm::Error / llvm::Expected (e.g. by creating stripped down versions for the ORC runtime), or making those types header-only.
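
As a rough sketch of that first option (all names here, e.g. orc_rt::Expected, are hypothetical and not the prototype’s actual API), a stripped-down, header-only Expected-style wrapper with no libSupport dependency might look something like this:

#include <optional>
#include <string>
#include <utility>

namespace orc_rt {

// Holds either a value or an error message, without depending on llvm::Error.
template <typename T> class Expected {
public:
  Expected(T Val) : Value(std::move(Val)) {}

  static Expected makeError(std::string Msg) {
    Expected E;
    E.ErrMsg = std::move(Msg);
    return E;
  }

  explicit operator bool() const { return Value.has_value(); }
  T &operator*() { return *Value; }
  const std::string &getErrorMessage() const { return ErrMsg; }

private:
  Expected() = default;
  std::optional<T> Value;
  std::string ErrMsg;
};

} // namespace orc_rt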

– Lang.

Big question for JIT clients: Does anyone have any objection to APIs in ORC relying on the runtime being loaded in the target? If so, now is the time to let me know. :)

I think possible objections are JIT’d program startup time (unlikely to be very high, and likely fixable via careful runtime design and pre-linking of parts of the runtime), and difficulties building compiler-rt (which sounds like something we should fix in compiler-rt).

If we can assume that the runtime is loadable then we can significantly simplify the TargetProcess library, and TargetProcessControl API, and further accelerate feature development in LLVM 13.

– Lang.

Interesting that we don’t see the same failure then (Debug is the default build type).

So I did some debugging:

Actually, the ZTIN prefix of the missing symbol indicates a typeinfo symbol. This is weird because we don’t usually build anything with RTTI in LLVM, right? Anyway, I rebuilt orc_rt_macho_dlfcn_remote.cpp.o with an explicit "-fno-rtti" and voila! That fixes the issue.

It seems that COMPILER_RT_COMMON_CFLAGS doesn’t disable RTTI by default. Or maybe just on my system? Should we add it for clang_rt.orc as long as we do have the libSupport dependency? I put up a PR here:

Cheers,
Stefan

We always compile and execute in the same process, so I don’t imagine that would make any difference to us…?

Geoff

Hi Geoff,

We always compile and execute in the same process, so I don’t imagine that would make any difference to us…?

The runtime would be required for both in-process and out-of-process JITing if you use any features that depend on it. Existing and planned features that would require the runtime include: lazy compilation, exception handling, static initializers / destructors, thread local variables, and API based target memory access. Failure to load the runtime will cause any of these features to issue an error when used (probably a "failed to resolve symbol XXX" error for the corresponding runtime implementation symbol).

So you’ll need to build compiler-rt and ship the orc runtime alongside your JIT to use any of the features above.
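
For reference, a minimal sketch of what that could look like with the existing ORC APIs (the archive path is hypothetical, and error handling is simplified): attach a StaticLibraryDefinitionGenerator for the shipped runtime archive to the main JITDylib, so JIT’d code can resolve the __orc_rt_* symbols.

#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/LLJIT.h"

using namespace llvm;
using namespace llvm::orc;

// Make the ORC runtime archive (e.g. libclang_rt.orc_osx.a from a compiler-rt
// build) visible to JIT'd code in the main JITDylib.
Error addOrcRuntime(LLJIT &J, const char *RuntimeArchivePath) {
  auto G = StaticLibraryDefinitionGenerator::Load(J.getObjLinkingLayer(),
                                                  RuntimeArchivePath);
  if (!G)
    return G.takeError();
  J.getMainJITDylib().addGenerator(std::move(*G));
  return Error::success();
}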

– Lang.

Actually, the ZTIN prefix of the missing symbol indicates a typeinfo symbol. This is weird because we don’t usually build anything with RTTI in LLVM, right? Anyway, I rebuilt orc_rt_macho_dlfcn_remote.cpp.o with an explicit "-fno-rtti" and voila! That fixes the issue.

/foreheadslap

That’s the issue. I forgot that I turned on RTTI for my LLVM build while testing this. I guess compiler-rt builds with it by default, which makes sense.

So, for anyone wanting to test out the prototype, you’ll want to add '-DLLVM_ENABLE_RTTI=True' to your CMake invocation for now.

– Lang.

Would this introduce start-up latency based on the size of the compilation? JIT jobs with 100-200MB of source are not uncommon here, and I’d hate to see latency get much worse.

A small enough fixed latency would be ok.

Hi Kevin,

Would this introduce start-up latency based on the size of the compilation? JIT jobs with 100-200MB of source are not uncommon here, and I’d hate to see latency get much worse.

Short answer: I don’t see this adding measurable latency for your use case.

Longer answer:

I think the runtime will be non-mandatory: If you’re not going to use any of the features you could opt out of loading it, in which case there will be no overhead at all.

If you load the runtime but don’t use it you will pay a small cost each time a lookup traverses the StaticLibraryDefinitionGenerator for the runtime. The cost would be one virtual call plus N DenseMap<uintptr_t> lookups. You can almost certainly arrange your libraries so that this cost is never paid in practice (by making the runtime the last thing searched).

The first time each feature (laziness, thread locals, dlfcn call, etc.) is used the JIT will load and link the corresponding part of the library. This will involve jit-linking a few kilobytes of pre-compiled code at most, and happens only once per feature per session.

I don’t think any of this will be measurable against the cost of compiling any source files, let alone tens or hundreds of MB worth.

Are you using the concurrent compilation feature of ORC? If you’re seeing high startup latency on a multicore machine, that may offer a way to reduce it.
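
If you’re not already using it, a minimal sketch of enabling concurrent compilation via the existing LLJITBuilder API (thread count chosen arbitrarily here) looks like this:

#include "llvm/ExecutionEngine/Orc/LLJIT.h"

using namespace llvm;
using namespace llvm::orc;

Expected<std::unique_ptr<LLJIT>> createConcurrentJIT() {
  // Dedicate a pool of threads to compilation so that modules added to the
  // JIT can be compiled in parallel / in the background.
  return LLJITBuilder().setNumCompileThreads(4).create();
}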

– Lang.

I am always in favor of getting JIT improvements into mainline as soon as possible. Also, simplifying APIs is an important goal.

On the other hand this raises a few general questions for me:

IIUC the current patch introduces a dependency from LLVM to compiler-rt/clang build artifacts, because the llvm-jitlink-executor executable is fully functional only if it can find the clang_rt.orc static library in the build tree. Do we have dependencies like that in mainline so far?

And how does it affect testing? So far, it seems we have no testing for out-of-process execution, because the only tool that exercises this functionality is llvm-jitlink, which itself is mostly used as a testing helper. If the functionality now moves into the TargetProcess library, it might be worth thinking through the test strategy first.

At the moment, the TargetProcess library is only used by JITLink. Will it stay like that? If so, RuntimeDyld-based JITs will not be affected by the patch. Will it be possible to build and run e.g. an LLJIT instance with a RuntimeDyldLinkingLayer if only LLVM gets built (omitting LLVM_ENABLE_PROJECTS="clang;compiler-rt")?

Last but not least, there are examples that use JITLink. Could we still build and run them if only LLVM gets built?

Thanks,
Stefan

Hi Stefan,

I am always in favor of getting JIT improvements into mainline as soon as possible. Also, simplifying APIs is an important goal.

On the other hand this raises a few general questions for me:

IIUC the current patch introduces a dependency from LLVM to compiler-rt/clang build artifacts, because the llvm-jitlink-executor executable is fully functional only if it can find the clang_rt.orc static library in the build tree. Do we have dependencies like that in mainline so far?

These are excellent questions.

As far as I know we do not have any dependencies like this (LLVM tool depends on compiler-rt for full functionality) in the mainline yet. Even superficially similar situations (e.g. building clang without building compiler-rt or libcxx) are quite different as there must already be compatible system libraries available to bootstrap LLVM in the first place.

On the other hand, while it’s true that llvm-jitlink will no longer have full functionality without loading the runtime, it’s not the case that there will be any regression from the current functionality: we’re only enabling new functionality here (at least in llvm-jitlink, which never used to run initializers). This means that we can use the new ‘-use-orc-runtime=false’ option to disable runtime loading and still run all llvm-jitlink tests that are in-tree today.

And how does it affect testing? So far, it seems we have no testing for out-of-process execution, because the only tool that exercises this functionality is llvm-jitlink, which itself is mostly used as a testing helper. If the functionality now moves into the TargetProcess library, it might be worth thinking through the test strategy first.

I think it is reasonable to maintain the existing tests (adding ‘-use-orc-runtime=false’), then add new end-to-end tests of the runtime via llvm-jitlink that will be dependent on having built compiler-rt.

At the moment, the TargetProcess library is only used by JITLink. Will it stay like that? If so, RuntimeDyld-based JITs will not be affected by the patch. Will it be possible to build and run e.g. an LLJIT instance with a RuntimeDyldLinkingLayer if only LLVM gets built (omitting LLVM_ENABLE_PROJECTS="clang;compiler-rt")?

I think that the LLI tool should be migrated to TargetProcess too. We should distinguish between TargetProcess and the ORC runtime though: TargetProcess can be linked into the executor without requiring the runtime.

LLI does run static initializers / deinitializers today, but only in-process using the existing initializer infrastructure (which I think is a hack). I think two reasonable options going forward would be:
(1) Make running initializers via lli dependent on building compiler-rt (as in llvm-jitlink). This is my preferred solution.
(2) Move the existing hacks out of ORC and into LLI (or a specific LLI_LLJIT class in ORC).

For MCJIT-like use cases you will definitely still be able to build and run an LLJIT instance with either RTDyldObjectLinkingLayer (RuntimeDyld) or ObjectLinkingLayer (JITLink) with LLVM only. I think this will remain true indefinitely. What I think may change over time is how advanced features (e.g. laziness, initializers) are implemented: It is so much easier and lower maintenance to implement these within the runtime. Eventually I could see us dropping support for them in LLVM-only builds, at which point it will be a runtime error to attempt to use those features (e.g. attempting to add a lazyReexport will yield a “cannot resolve __orc_rt_jit_reentry” error).

Do these solutions seem reasonable to you?

Thank you very much again for thinking about this – I think testing and dependencies are the trickiest aspects of introducing this runtime, and it’s very helpful to get other developers’ perspectives.

Regards,
Lang.

Thanks for your comprehensive reply!

This means that we can use the new ‘-use-orc-runtime=false’ option to disable runtime loading and still run all llvm-jitlink tests that are in-tree today.

Fantastic. I had a look at the code and it seems totally reasonable. One detail in your prototype implementation is that the default value for the option would depend on whether compiler-rt was built or not. It’s possible to infer this info from CMake, but it’s a little tricky since compiler-rt is generally configured after LLVM. I made a patch that might be used for that purpose once your prototype has landed:

Yes, agreed. Though it seems to me that this makes the LLJITBuilder configuration quite complex. After all, we’d need to keep RuntimeDyld support, and that doesn’t work with TargetProcess, right? Would that create the case for a separate jit-kind 'orc-rtdyld'? If so, orc-rtdyld could keep the hack a la (2) and the JITLink-based kinds could switch to (1). Moving lli-specific code out of ORC makes sense in any case.

Sounds reasonable. Maybe that would also give an opportunity to reevaluate the role of lli. It always seemed to me that it was intended as a helper tool for testing and for developers to quickly run some bitcode snippets. With the advent of orc-lazy, lli also started to serve as an entry point to these advanced ORC features. I recently started wondering how big the overlap between these two use cases still is, and whether it might be favorable to have two separate executables instead. What’s your impression? I guess there are also good reasons why lli should remain as it is today.

Stefan