How will OrcJIT guarantee thread-safety when a function is asked to be regenerated?

I’ve been playing with OrcJIT a bit, and from the looks of it I can (as in the previous JIT, I suppose?) ask for a function to be regenerated.

If I’ve given the address of the function that LLVM gave me to an external party, do “I” need to ensure thread-safety?

Or is it safe to ask OrcJIT to regenerate code at that address, and everything will work magically?

I’m thinking it won’t, because it’s quite possible some thread might be executing that code while we’re asking LLVM to write new bytes there.

How does one generally go about such updates? I’m looking for some guidance on doing this without adding a trampoline in front of the function. Do runtimes that support regeneration of code have an if-check or something before entering the method?

[+Lang, keeper of JITs, designer of ORCs]

Hi Hayden,

Dave’s answer covers this pretty well. Neither Orc nor MCJIT currently reason about replacing function bodies. They may let you add duplicate definitions, but how they’ll behave if you do that isn’t specified in their contracts. They definitely won’t replace old definitions unless you provide a custom memory manager that’s rigged to lay new definitions down on top of old ones.

I suspect that existing clients of MCJIT have tackled this by adding thread safety into their wrappers around MCJIT, or into the JIT’d code itself, but I’m just guessing. (CC’ing Keno and Philip, in case they have insights).

I think this would be cool to build into Orc though. Two quick thoughts:

(1) Replacing function bodies at the same address is impossible if the function is already on the stack: You’d be replacing a definition that you’re later going to return through. So, if you want to replace functions at the same address you’ll have to have some sort of safe-point concept where you know the function you want to replace isn’t on the stack.

(2) Replacing function bodies at the same address isn’t the only way to avoid the overhead of a trampoline. I haven’t implemented this yet, but I really want to add llvm.patchpoint support to Orc. In that case you can lay down your replacement definition at a different address, update all your callsites, then delete your old definition after you’re done executing it. Relative to using trampolines this lowers your execution cost (calls are direct rather than indirect), but increases your update cost (you have to update many callsites, rather than a single trampoline).

Out of interest, why the desire to avoid trampolines? They do make life a lot easier here. :slight_smile:

Cheers,
Lang.

We haven’t tackled either code replacement or threading yet, so we haven’t run into this particular problem. I’m planning to push through my TLS patch for RuntimeDyld (currently my blocker for adding threading), but parallel codegen will probably still be a little way off for me. On the function-replacement front, I was planning to go the patchpoint route as well.


(1) Replacing function bodies at the same address is impossible if the
function is already on the stack: You'd be replacing a definition that
you're later going to return through.

If the function you wish to replace is active on the stack, you can
replace the return PC that was going to return into that active frame
with a PC pointing into a stub that knows how to replace the active
stack frame with something that would let the new code continue
executing. The stub will then have to branch into a suitable position
in the new generated code. Once you have done this for all "pending
returns" into the old bit of generated code, you can throw the old
code away, since nothing will ever return into it.

This can be tricky to get right but if you have built OSR support
already for some other reason then this is a viable option. This
scheme is very similar to throwing an exception, and the semantics of
"catching" an exception is to branch to a newly generated block of
code.

So, if you want to replace functions
at the same address you'll have to have some sort of safe-point concept
where you know the function you want to replace isn't on the stack.

That will work, but can be very hard to make happen. For instance,
the method you want to replace may have called a function that has an
infinite loop in it.

Hi Sanjoy,

(1) Replacing function bodies at the same address is impossible if the function is already on the stack: You’d be replacing a definition that you’re later going to return through.

If the function you wish to replace is active on the stack, you can
replace the return PC that was going to return into that active frame
with a PC pointing into a stub that knows how to replace the active
stack frame with something that would let the new code continue
executing. The stub will then have to branch into a suitable position
in the new generated code. Once you have done this for all “pending
returns” into the old bit of generated code, you can throw the old
code away, since nothing will ever return into it.

This can be tricky to get right but if you have built OSR support
already for some other reason then this is a viable option. This
scheme is very similar to throwing an exception, and the semantics of
“catching” an exception is to branch to a newly generated block of
code.

That all makes sense. What are your thoughts on the trade-offs of this vs the patchpoint approach though? If you can modify previously executable memory it seems like the patchpoint approach would have lower overhead, unless you have a truly huge number of callsites to update?

So, if you want to replace functions at the same address you’ll have to have some sort of safe-point concept where you know the function you want to replace isn’t on the stack.

That will work, but can be very hard to make happen. For instance,
the method you want to replace may have called a function that has an
infinite loop in it.

Agreed. This might find a home in simple REPLs, where calling into an infinite loop would be undesirable/unexpected behavior, but that’s also an environment where you are unlikely to want reoptimization.

(2) Replacing function bodies at the same address isn’t the only way to avoid the overhead of a trampoline. I haven’t implemented this yet, but I really want to add llvm.patchpoint support to Orc. In that case you can lay down your replacement definition at a different address, update all your callsites, then delete your old definition after you’re done executing it. Relative to using trampolines this lowers your execution cost (calls are direct rather than indirect), but increases your update cost (you have to update many callsites, rather than a single trampoline).

FWIW, Pete Cooper and I have tossed around ideas about adding utilities to Orc for injecting frame-residence counting and automatic cleanup into functions to facilitate this 2nd approach. The rough idea was that each function would increment a counter on entry and decrement it on exit. Every time the counter hits zero it would check whether it has been “deleted” (presumably due to being replaced), and if so it would free its memory. This scheme should be easy to implement, but hasn’t gone past speculation on our part.

- Lang.

That all makes sense. What are your thoughts on the trade-offs of this vs
the patchpoint approach though? If you can modify previously executable
memory it seems like the patchpoint approach would have lower overhead,
unless you have a truly huge number of callsites to update?

You need the hijack-return-pc approach *in addition* to a call-site
patching approach. Modifying the return PC lets you guarantee that
nothing will *return* into the old generated code. To guarantee that
nothing will *call* into it either you could use a double indirection
(all calls go through a trampoline) or patchpoints.

-- Sanjoy

Thanks for all the comments on this.

The reason for wanting to avoid a trampoline is that it’s very likely my language will interpret for a while, then generate code, then generate better code, etc., much like the WebKit FTL JavaScript project. It seems like they will be using patchpoints? I should probably use those for my perf use-case as well.

However, I have an additional case that requires correctness to be maintained across threads, i.e. the new code may be semantically different. In this case I need either all callsites updated at once, or none; it can’t be that half of the callsites are calling into a different function, because there may be real differences.

I suppose an approach I can take is to use patchpoints for eventual consistency, plus an if-check that maintains correctness and will either trampoline or execute the code directly depending on whether all callsites are done being patched.

Or maybe I’m over complicating this for my language :slight_smile:

Hi Sanjoy,

You need the hijack-return-pc approach *in addition* to a call-site
patching approach. Modifying the return PC lets you guarantee that
nothing will *return* into the old generated code. To guarantee that
nothing will *call* into it either you could use a double indirection
(all calls go through a trampoline) or patchpoints.

You need to hijack the return addresses if you want to delete the original
function body immediately, but what if you just leave the original in-place
until you return through it? That is the scheme that Pete and I had been
talking about. On the one hand that means that the old code may live for an
arbitrarily long time, on the other hand it saves you from implementing some
complicated infrastructure. I suspect that in most JIT use-cases the cost of
keeping the duplicate function around will be minimal, but I don't actually
have any data to back that up. :slight_smile:

I agree that the cost of keeping around the old generated code is
likely to be small.

However, you may need to ensure that nothing returns into the original
code for correctness: the reason you're replacing an old compilation
with a new one could very well be that you want to do something that
makes the old piece of code incorrect to execute. For instance, in
the old compilation you may have assumed that class X does not have
any subclasses (and you devirtualized some methods based on this
assumption), but now you're about to dlopen a shared object that will
introduce a class subclassing from X. Your devirtualization decisions
will be invalid after you've run dlopen, and it would no longer be
correct to execute the old bit of code.

You could also do this by following all calls / invokes with

  some_other_method();
  if (!this_method->is_still_valid) {
    do_on_stack_replacement(); // noreturn
    unreachable;
  }
  // continue normal execution

and that's perfectly sound because you'll always check validity before
re-entry into a stack frame. Is this what you meant by frame
residence counting?

It's worth noting that, as per its design goals, Orc is agnostic about all
this. Either scheme, or both, should be able to be implemented in the new
framework. It's just that nobody has implemented it yet.

So this is the Patches Welcome(TM) part. :stuck_out_tongue:

-- Sanjoy

Sanjoy, so does the JVM do what you described? I always thought the JVM used a trampoline, as methods are HotSpot-compiled.

I couldn't reproduce that behavior. I just wrote a small program to

1. mmap one page read/write,
2. write mov eax, 0x12345678; ret to it,
3. mprotect it to read/exec,
4. call the instructions and print the result,
5. mprotect it to read/write,
6. change 0x12345678 to 0x55aa,
7. mprotect it to read/exec, and
8. call the instructions and print the result.

steve$ ./a.out
0x12345678
0x55aa

I just tested this on OS X 10.9.5 and 10.10.1.

a.c (806 Bytes)

We solve this by using MCJIT to generate the code, and then managing it ourselves. We have an instance of MCJIT per compiler thread. We use MCJIT to perform one compilation at a time, and then disconnect the generated code from MCJIT.

Our runtime has existing mechanisms for patching the call sites to point to the new version of the code. We've been able to use those essentially without modification. To put it differently, we consider that out of scope for MCJIT.

P.S. It's worth stating that this type of code life-cycle management is *hard*. If the older version is still valid when the new one is installed, it gets a bit easier, but if you have to invalidate before install, you need to build a sophisticated deoptimization mechanism.

Philip

Hi Sanjoy,

I hadn’t started thinking about deoptimization yet. That makes perfect sense though.

So this is the Patches Welcome™ part. :stuck_out_tongue:

Yep. :slight_smile:

Cheers,
Lang.