Why does the x86-64 JIT emit stubs for external calls?

In X86CodeGen.cpp, the following code appears in the handler used for
CALL64pcrel32 instructions:

        // Assume undefined functions may be outside the Small codespace.
        bool NeedStub =
          (Is64BitMode &&
              (TM.getCodeModel() == CodeModel::Large ||
               TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
          Opcode == X86::TAILJMPd;
        emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
                          MO.getOffset(), 0, NeedStub);

This causes every external call to be emitted as a call to a stub
which then jumps to the real function.
I understand, thanks to the helpful folks on #llvm, that calls across
more than 31 bits of address space need to be emitted as a "mov
$ADDRESS, r10; call *r10" pair instead of the simple "call
rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
pair emitted inline? And why are Darwin and TAILJMPs special?
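
To make the 31-bit limit concrete, here's a tiny standalone sketch (my own,
not LLVM code; the helper name is invented) of the reachability check: a
5-byte "call rel32" encodes a signed 32-bit displacement from the end of the
instruction, so it only reaches targets within about +/-2GB, while the
mov+call pair reaches anything:

#include <cstdint>
#include <cstdio>

// Hypothetical helper, not part of LLVM: can a 5-byte "call rel32" located
// at `callSite` reach `target`?  The rel32 displacement is measured from the
// end of the call instruction and must fit in a signed 32-bit integer.
static bool reachableWithRel32(uint64_t callSite, uint64_t target) {
  int64_t delta = (int64_t)target - (int64_t)(callSite + 5);
  return delta >= INT32_MIN && delta <= INT32_MAX;
}

int main() {
  // A callee a few bytes away is reachable; one mapped above 4G from a
  // low-address call site is not, and needs "mov $ADDRESS, %r10; call *%r10".
  printf("%d\n", reachableWithRel32(0x400000, 0x400123));       // 1
  printf("%d\n", reachableWithRel32(0x400000, 0x7f0012345678)); // 0
  return 0;
}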

Having this out of line seems to lose up to 2% performance on the
Unladen Swallow benchmarks, so, while it's not urgent, it'd be nice to
figure out how to avoid the stubs.

What kind of patch would be welcome to fix this?

Thanks,
Jeffrey

In X86CodeGen.cpp, the following code appears in the handler used for
CALL64pcrel32 instructions:

       // Assume undefined functions may be outside the Small codespace.
       bool NeedStub =
         (Is64BitMode &&
             (TM.getCodeModel() == CodeModel::Large ||
              TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
         Opcode == X86::TAILJMPd;
       emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
                         MO.getOffset(), 0, NeedStub);

This causes every external call to be emitted as a call to a stub
which then jumps to the real function.
I understand, thanks to the helpful folks on #llvm, that calls across
more than 31 bits of address space need to be emitted as a "mov
$ADDRESS, r10; call *r10" pair instead of the simple "call
rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
pair emitted inline? And why are Darwin and TAILJMPs special?

This is needed because of lazy compilation: before the callee is resolved, it is just a JIT stub. It's heap allocated, so it may not be in the lower 4G even if the code size model is small. I know this is the case on Darwin x86_64; I am not sure about other targets. I forgot why this is needed for tail calls, sorry.
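
To make the stub concrete, here is a rough sketch (illustrative only, not the
actual JIT stub emitter) of the kind of far-call thunk being described: load
the full 64-bit callee address into a scratch register and jump through it,
so the stub works no matter where the callee lives:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative sketch, not LLVM's actual stub code: fill `buf` with
//   movabs $target, %r10
//   jmpq   *%r10
// which reaches any 64-bit address, unlike a 5-byte pc-relative call.
static size_t writeFarJumpStub(uint8_t *buf, uint64_t target) {
  buf[0] = 0x49; buf[1] = 0xBA;                   // REX.WB + mov r10, imm64
  std::memcpy(buf + 2, &target, 8);               // the 64-bit callee address
  buf[10] = 0x41; buf[11] = 0xFF; buf[12] = 0xE2; // jmp *%r10
  return 13;
}

int main() {
  uint8_t stub[16];
  size_t n = writeFarJumpStub(stub, 0x7f0012345678ULL);
  for (size_t i = 0; i < n; ++i) printf("%02x ", stub[i]);
  printf("\n");
  return 0;
}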

In theory we could make the code generator inline the mov+call; the reality is that it doesn't know whether it's jitting or not. Also, we really want to keep the code generation the same (as much as possible) whether it's jitting or compiling. One possible solution for this is to add a code size model specifically for the JIT so the code generator can generate more efficient code in that configuration.

Evan

In X86CodeGen.cpp, the following code appears in the handler used for
CALL64pcrel32 instructions:

  // Assume undefined functions may be outside the Small codespace.
  bool NeedStub =
    (Is64BitMode &&
        (TM.getCodeModel() == CodeModel::Large ||
         TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
    Opcode == X86::TAILJMPd;
  emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
                    MO.getOffset(), 0, NeedStub);

This causes every external call to be emitted as a call to a stub
which then jumps to the real function.
I understand, thanks to the helpful folks on #llvm, that calls across
more than 31 bits of address space need to be emitted as a "mov
$ADDRESS, r10; call *r10" pair instead of the simple "call
rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
pair emitted inline? And why are Darwin and TAILJMPs special?

This is needed because of lazy compilation: before the callee is resolved,
it is just a JIT stub.

Even with lazy compilation, the contents of the stub get emitted (by
JITEmitter::getPointerToGlobal) as a direct call to the function, not
the compilation callback, because the function is an external
declaration. You can watch this happen with the following program:

declare i32 @rand()

define i32 @main() nounwind {
entry:
  %call = tail call i32 @rand() ; <i32> [#uses=1]
  %add = add i32 %call, 2 ; <i32> [#uses=1]
  ret i32 %add
}

and the command line `lli -debug-only=jit -march=x86-64 test.bc`.

With lazy compilation and a call to an internal function, the
JITEmitter can emit a stub even if MachineRelocation::doesntNeedStub()
(the field NeedStub gets passed into) returns true. Only returning
false constrains the emitter.

It's heap allocated, so it may not be in the lower 4G
even if the code size model is small. I know this is the case on Darwin
x86_64; I am not sure about other targets.

Oh, other targets can certainly allocate code above 4G too.
sys::AllocateRWX just uses mmap with no constraints on the returned
address, and I've got a Linux desktop where that always produces an
address over 4G.
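
For what it's worth, here's a quick experiment (not LLVM code) along the
lines of what sys::AllocateRWX does on Unix: ask for an anonymous RWX mapping
with no address hint and see where it lands:

#include <sys/mman.h>
#include <cstdint>
#include <cstdio>

// Quick experiment (Linux/Unix only): an anonymous RWX mapping with no
// address hint, roughly what sys::AllocateRWX asks for.  On many 64-bit
// systems the kernel returns an address well above 4G.
int main() {
  size_t size = 1 << 20;
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }
  printf("code region at %p (%s 4G)\n", p,
         (uintptr_t)p > UINT32_MAX ? "above" : "below");
  munmap(p, size);
  return 0;
}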

I forgot why this is needed for
tail calls, sorry.

In theory we could make the code generator inline the mov+call; the reality is
that it doesn't know whether it's jitting or not. Also, we really want to keep
the code generation the same (as much as possible) whether it's jitting or
compiling. One possible solution for this is to add a code size model
specifically for the JIT so the code generator can generate more efficient
code in that configuration.

For non-JIT, the code generator doesn't ever need a stub, right? The
linker does it using the relocation information? It must be ignoring
the NeedStub parameter. ... But wait, is this code generator used for
anything besides the JIT? Compiling uses the AsmPrinter until direct
object code generation lands, and presumably they're redesigning this
whole subsystem.

It sounds like I'd have to fully understand the whole structure of the
code generator to fix this, and for <=2% performance, that's not
really worth it. I'll probably wait for the direct object code people
to get around to it. Thanks though.

In X86CodeGen.cpp, the following code appears in the handler used for
CALL64pcrel32 instructions:

       // Assume undefined functions may be outside the Small codespace.
       bool NeedStub =
         (Is64BitMode &&
             (TM.getCodeModel() == CodeModel::Large ||
              TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
         Opcode == X86::TAILJMPd;
       emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
                         MO.getOffset(), 0, NeedStub);

This causes every external call to be emitted as a call to a stub
which then jumps to the real function.
I understand, thanks to the helpful folks on #llvm, that calls across
more than 31 bits of address space need to be emitted as a "mov
$ADDRESS, r10; call *r10" pair instead of the simple "call
rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
pair emitted inline? And why are Darwin and TAILJMPs special?

This is needed because of lazy compilation: before the callee is
resolved, it is just a JIT stub. It's heap allocated, so it may not be
in the lower 4G even if the code size model is small. I know this is
the case on Darwin x86_64; I am not sure about other targets. I forgot
why this is needed for tail calls, sorry.

In theory we could make the code generator inline the mov+call; the
reality is that it doesn't know whether it's jitting or not. Also, we
really want to keep the code generation the same (as much as possible)
whether it's jitting or compiling. One possible solution for this is to
add a code size model specifically for the JIT so the code generator
can generate more efficient code in that configuration.

Since the CodeEmitters are now generically parameterized, they can be specialized for the JIT quite easily.
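
Illustratively (these type names are made up, not LLVM's), the idea is that
once the emission logic is a template over the concrete emitter type, a
JIT-specific emitter can get different behavior, e.g. the inline mov+call,
without touching the static-compilation path:

#include <cstdint>
#include <cstdio>

// Sketch only; none of these types exist in LLVM under these names.
struct StaticEmitter { static constexpr bool IsJIT = false; };
struct JITEmitter    { static constexpr bool IsJIT = true;  };

// One templated code-emission routine, specialized by the emitter type it is
// instantiated with, so the JIT can get far-call-safe code while static
// compilation keeps the short pc-relative form plus a relocation.
template <typename Emitter>
void emitCall(uint64_t callee) {
  if (Emitter::IsJIT)
    printf("movabs $0x%llx, %%r10; call *%%r10\n", (unsigned long long)callee);
  else
    printf("call <rel32 reloc to 0x%llx>\n", (unsigned long long)callee);
}

int main() {
  emitCall<StaticEmitter>(0x401000);
  emitCall<JITEmitter>(0x7f0012345678ULL);
  return 0;
}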

Aaron

In X86CodeGen.cpp, the following code appears in the handler used for
CALL64pcrel32 instructions:

      // Assume undefined functions may be outside the Small codespace.
      bool NeedStub =
        (Is64BitMode &&
            (TM.getCodeModel() == CodeModel::Large ||
             TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
        Opcode == X86::TAILJMPd;
      emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
                        MO.getOffset(), 0, NeedStub);

This causes every external call to be emitted as a call to a stub
which then jumps to the real function.
I understand, thanks to the helpful folks on #llvm, that calls across
more than 31 bits of address space need to be emitted as a "mov
$ADDRESS, r10; call *r10" pair instead of the simple "call
rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
pair emitted inline? And why are Darwin and TAILJMPs special?

This is needed because of lazy compilation: before the callee is resolved,
it is just a JIT stub.

Even with lazy compilation, the contents of the stub get emitted (by
JITEmitter::getPointerToGlobal) as a direct call to the function, not
the compilation callback, because the function is an external
declaration. You can watch this happen with the following program:

There are probably some opportunities to improve the codegen here. Please file a Bugzilla report so I'll be reminded to take a look at some point.

declare i32 @rand()

define i32 @main() nounwind {
entry:
  %call = tail call i32 @rand() ; <i32> [#uses=1]
  %add = add i32 %call, 2 ; <i32> [#uses=1]
  ret i32 %add
}

and the command line `lli -debug-only=jit -march=x86-64 test.bc`.

With lazy compilation and a call to an internal function, the
JITEmitter can emit a stub even if MachineRelocation::doesntNeedStub()
(the field NeedStub gets passed into) returns true. Only returning
false constrains the emitter.

It's heap allocated, so it may not be in the lower 4G
even if the code size model is small. I know this is the case on Darwin
x86_64; I am not sure about other targets.

Oh, other targets can certainly allocate code above 4G too.
sys::AllocateRWX just uses mmap with no constraints on the returned
address, and I've got a Linux desktop where that always produces an
address over 4G.

I forgot why this is needed for
tail calls, sorry.

In theory we could make the code generator inline the mov+call; the reality
is that it doesn't know whether it's jitting or not. Also, we really want to
keep the code generation the same (as much as possible) whether it's jitting
or compiling. One possible solution for this is to add a code size model
specifically for the JIT so the code generator can generate more efficient
code in that configuration.

For non-JIT, the code generator doesn't ever need a stub, right?

Right.

The linker does it using the relocation information? It must be ignoring
the NeedStub parameter. ... But wait, is this code generator used for
anything besides the JIT?

We are talking about the system linker; it doesn't use this code. The code generator proper doesn't know whether it's generating code for static compilation or for the JIT. The code that creates stubs etc. is JIT-specific. The JIT has to do a bit more work since it can't rely on anything else to relocate symbols.
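
As a sketch of what relocating a symbol yourself means for the
reloc_pcrel_word case (the names below are illustrative, not LLVM's): once
the callee address is known, the JIT patches the 4 displacement bytes of the
already-emitted call so it reaches the resolved target:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative only: patch the rel32 displacement of an emitted E8 call so
// it reaches `target`.  The displacement is relative to the end of the
// 5-byte instruction, which is what a pc-relative word relocation amounts
// to on x86-64.
static void resolvePCRel32(uint8_t *callInsn, uint64_t target) {
  uint64_t nextInsn = (uint64_t)(uintptr_t)callInsn + 5;
  int32_t disp = (int32_t)(target - nextInsn);
  std::memcpy(callInsn + 1, &disp, sizeof(disp));
}

int main() {
  uint8_t code[5] = {0xE8, 0, 0, 0, 0};                    // unresolved call
  resolvePCRel32(code, (uint64_t)(uintptr_t)code + 0x40);  // pretend callee
  int32_t disp;
  std::memcpy(&disp, code + 1, sizeof(disp));
  printf("patched displacement: %d\n", disp);              // 0x40 - 5 = 59
  return 0;
}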

Compiling uses the AsmPrinter until direct object code generation lands,
and presumably they're redesigning this whole subsystem.

It sounds like I'd have to fully understand the whole structure of the
code generator to fix this, and for <=2% performance, that's not
really worth it. I'll probably wait for the direct object code people
to get around to it. Thanks though.

This is not a part of the direct object code path. I'll look at it at some point.

Evan