llvm register reload/spilling around calls

Hi,

I was investigating some performance issues with llvm JIT-generated code
(x86_64), and looking at the assembly it indeed seemed quite suboptimal.
In particular, the code basically implements some kind of caching.
On a cache hit, the code just takes the value from the cache; on a
miss, it calls a function to do whatever is necessary to update the
cache - this is expensive, but happens in only about 1% of all cases.
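(For illustration, the pattern in question is roughly the following C++ sketch; the names and the cache logic are made up for this example and are not the actual JIT-generated code.)

```cpp
#include <cassert>

// Hypothetical model of the JITted code: a fast path that hits a
// small cache, and a rarely taken slow path that calls out to a
// comparatively expensive update function.
static int cache_key = -1;
static int cache_value = 0;

static int update_cache(int key) {
    // Expensive recomputation; taken in only ~1% of all lookups.
    cache_key = key;
    cache_value = key * 2;   // stand-in for the real work
    return cache_value;
}

int lookup(int key) {
    if (__builtin_expect(key == cache_key, 1))
        return cache_value;          // hot path: no call, no spills needed
    return update_cache(key);        // cold path: function call
}
```

The complaint below is that the spills/reloads forced by the call on the cold path are emitted so that they execute even when only the hot path runs.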
So I saw that the code is doing lots of register spilling/reloading. Now
I understand that due to calling conventions, there's not really a way
to avoid this - I tried using coldcc but apparently the backend doesn't
implement it and hence this is ignored.
But what is really bad about this is that the spilling/reloading ALWAYS
happens, regardless of whether the branch containing the call is taken.
Since the branch is almost never taken, that is obviously quite bad (but
even if the branch were taken more often, which the compiler can't
know, I don't see why the reloads always have to execute).
I'm not sure exactly what performance impact this has, but it looks to me
like it would definitely make a difference, as the code on the path that
doesn't take the branch is quite simple.
I tried with both llvm 2.7 and 2.8, no difference.
So is there any optimization option I'm missing which could improve
this? Or is this simply the way things are (and would that be considered
a bug)? If this is a known limitation, is there any way to work
around it (by changing the affected jit code)?
I'm attaching the IR I hack-extracted from the jit code (it might be
bogus, but it compiles just fine). I think the assembly shows what I'm
talking about quite well (it even has comments about the
restores/spills). I used llc -O3 to compile.

Roland

sillysaverestore (3.41 KB)

sillysaverestore.s (6.58 KB)

So I saw that the code is doing lots of register spilling/reloading. Now
I understand that due to calling conventions, there's not really a way
to avoid this - I tried using coldcc but apparently the backend doesn't
implement it and hence this is ignored.

Yes, unfortunately the list of call-clobbered registers is fixed at the moment, so coldcc is mostly ignored by the backend.

Patches welcome.

So is there any optimization option I'm missing which could improve
this? Or is this simply the way things are (would that be considered a
bug?). If this is a known limitation, any ideas if it's possible to work
around that (by changing the affected jit code)?

The -pre-alloc-split option should handle stuff like this when calls clobber an entire register class. That probably only applies to XMM registers.

Work on proper live range splitting is in progress. You can try it out with -spiller=inline, but it is highly experimental and volatile at the moment.

I don't know any short term solutions.

/jakob

Thanks for giving it a look!

So I saw that the code is doing lots of register
spilling/reloading. Now I understand that due to calling
conventions, there's not really a way to avoid this - I tried using
coldcc but apparently the backend doesn't implement it and hence
this is ignored.

Yes, unfortunately the list of call-clobbered registers is fixed at
the moment, so coldcc is mostly ignored by the backend.

Patches welcome.

What would be needed there? I actually tried a quick hack and simply
changed the registers included in the list in
X86RegisterInfo::getCalleeSavedRegs, so some xmm regs were included
(similar to what was done for win64). But the result wasn't what I
expected - the callee now indeed saved/restored all the xmm regs I
added; however, the calling code did not change at all...

So is there any optimization option I'm missing which could improve
this? Or is this simply the way things are (would that be
considered a bug?). If this is a known limitation, any ideas if
it's possible to work around that (by changing the affected jit
code)?

The -pre-alloc-split option should handle stuff like this when calls
clobber an entire register class. That probably only applies to XMM
registers.

I tried that and the generated code did not change at all.

Work on proper live range splitting is in progress. You can try it
out with -spiller=inline, but it is highly experimental and volatile
at the moment.

Tried that too, but the code mostly remained the same (there were 2
additional spills right at the beginning and some of the register
numbers changed, but that was all).
There's also a -spiller=splitting option; I don't know what it is
supposed to do, but it just crashed...

Roland

Look in X86InstrControl.td. The call instructions are all prefixed by:

   let Defs = [RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11,
              FP0, FP1, FP2, FP3, FP4, FP5, FP6, ST0, ST1,
              MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7,
              XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7,
              XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15, EFLAGS],

This is the fixed list of call-clobbered registers. It should really be controlled by the calling convention of the called function instead.

The WINCALL* instructions only exist because of this.

One problem is that calling conventions are handled while building the selection DAG, and the DAG doesn't really have a way to represent clobbered registers.

Perhaps X86TargetLowering::LowerCall() could decorate the X86ISD::CALL node with the calling convention somehow?

Dan, do you have any thoughts on how to communicate the calling convention and call clobbered registers to the eventual CALL MachineInstr?

/jakob

Thanks for giving it a look!

So I saw that the code is doing lots of register
spilling/reloading. Now I understand that due to calling
conventions, there's not really a way to avoid this - I tried
using coldcc but apparently the backend doesn't implement it
and hence this is ignored.

Yes, unfortunately the list of call-clobbered registers is fixed
at the moment, so coldcc is mostly ignored by the backend.

Patches welcome.

What would be needed there? I actually tried a quick hack and
simply changed the registers included in the list in
X86RegisterInfo::getCalleeSavedRegs, so some xmm regs were included
(similar to what was done for win64). But the result wasn't what I
expected - the callee now indeed saved/restored all the xmm regs I
added, however the calling code did not change at all...

Look in X86InstrControl.td. The call instructions are all prefixed
by:

let Defs = [RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11, FP0, FP1, FP2,
FP3, FP4, FP5, FP6, ST0, ST1, MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7,
XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8, XMM9, XMM10,
XMM11, XMM12, XMM13, XMM14, XMM15, EFLAGS],

This is the fixed list of call-clobbered registers. It should really
be controlled by the calling convention of the called function
instead.

The WINCALL* instructions only exist because of this.

Ahh, I see now. I hacked this up, and indeed the code looks much better.
I can't force it to use the win64 calling convention, right?
That would do just fine for this case (it's much closer to a cold calling
convention; I really only need 5 preserved xmm regs).

Roland

Look in X86InstrControl.td. The call instructions are all prefixed
by:

let Defs = [RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11, FP0, FP1, FP2,
FP3, FP4, FP5, FP6, ST0, ST1, MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7,
XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8, XMM9, XMM10,
XMM11, XMM12, XMM13, XMM14, XMM15, EFLAGS],

This is the fixed list of call-clobbered registers. It should really
be controlled by the calling convention of the called function
instead.

The WINCALL* instructions only exist because of this.

Ahh, I see now. I hacked this up, and indeed the code looks much better.
I can't force it to use the win64 calling convention, right?

No, only by targeting Windows.

That would do just fine for this case (it's much closer to a cold calling
convention; I really only need 5 preserved xmm regs).

If XMM registers are the problem, -pre-alloc-split really ought to help you.

You may want to investigate why it doesn't.

/jakob

The simplest way would probably be to add separate X86ISD opcodes for
each desired set of call-clobbered registers.

Dan

The problem is the large number of call-like instructions.

We would need copies of CALL*, TCRETURN*, and TAILJMP* for each calling convention. I was hoping we could avoid that, and even get rid of the WINCALL instructions.

What if InstrEmitter::EmitMachineNode called a target hook to add call-clobbered registers when II.isCall()? We would need some way of communicating the calling convention to the target hook. An immediate operand could work.
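(To make the idea concrete, here is a purely illustrative C++ model of such a hook; the types and names below are invented for this sketch and are not the real LLVM interfaces.)

```cpp
#include <cassert>
#include <string>
#include <vector>

// Invented stand-ins for a MachineInstr and calling conventions; this
// only models the idea of a target hook picking the call-clobbered
// register set based on a convention carried as an immediate operand.
enum CallingConv { CC_C, CC_Cold };

struct MachineInstrModel {
    CallingConv cc;                        // would be the immediate operand
    std::vector<std::string> implicitDefs; // call-clobbered registers
};

// Hypothetical target hook: choose the clobber list from the calling
// convention instead of a single fixed list baked into the
// instruction definition (as in X86InstrControl.td today).
void addCallClobbers(MachineInstrModel &mi) {
    if (mi.cc == CC_Cold) {
        // A cold convention could preserve most registers...
        mi.implicitDefs = {"RAX", "R11"};
    } else {
        // ...while the default convention clobbers the full set.
        mi.implicitDefs = {"RAX", "RCX", "RDX", "RSI", "RDI",
                           "R8", "R9", "R10", "R11"};
    }
}
```

With something along these lines, the WINCALL* duplicates (and per-convention copies of CALL*, TCRETURN*, TAILJMP*) would become unnecessary.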

/jakob

The simplest way would probably be to add separate X86ISD opcodes for
each desired set of call-clobbered registers.

Sounds really gross... Maybe it would be better just to teach codegen
to query a single place for the set of call-clobbered registers?

What if InstrEmitter::EmitMachineNode called a target hook to add call-clobbered registers when II.isCall()? We would need some way of communicating the calling convention to the target hook. An immediate operand could work.

+1

This is something we want to do. But it turns out to be more complicated since it touches a lot of code.

My guess is we'll implement this sometime during the first half of next year in part to implement cold_cc.

Evan

(repost with right sender address)

Look in X86InstrControl.td. The call instructions are all prefixed
by:

let Defs = [RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11, FP0, FP1, FP2,
FP3, FP4, FP5, FP6, ST0, ST1, MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7,
XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8, XMM9, XMM10,
XMM11, XMM12, XMM13, XMM14, XMM15, EFLAGS],

This is the fixed list of call-clobbered registers. It should really
be controlled by the calling convention of the called function
instead.

The WINCALL* instructions only exist because of this.

Ahh, I see now. I hacked this up, and indeed the code looks much better.
I can't force it to use the win64 calling convention, right?

No, only by targeting Windows.

That would do just fine for this case (it's much closer to a cold calling
convention; I really only need 5 preserved xmm regs).

If XMM registers are the problem, -pre-alloc-split really ought to help you.

You may want to investigate why it doesn't.

Ok, I'll see if I can figure something out, though I have no in-depth
knowledge of llvm.
I think only xmm regs are really a problem because r12-r15 are
callee-saved and hence used for holding the most frequently used values,
which seems to be enough to avoid spilling there.

It looked to me like it could be related to something mentioned in the
lib/Target/README.txt file:

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
  subl $28, %esp
  call "L1$pb"
"L1$pb":
  popl %eax
  cmpl $0, 32(%esp)
  je LBB1_2 # cond_true
LBB1_1: # return
  # ...
  addl $28, %esp
  ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Though maybe that's not related, since the arguments are actually
(mostly) used in all paths.

Roland

I can't really see what's happening there. Any hints on what to look at?

Roland