spilling & xmm register usage

Hello everybody,

I have stumbled upon a test case (the attached module is a slightly
reduced version) that runs much slower on Linux than on Windows
when executed using LLVM's JIT.

We narrowed the problem down to the actual code being generated; the
source IR on both systems is the same.
Try compiling the attached module:

llc -O3 -filetype=asm -o BAD.s BAD.ll

Under Linux, the resulting assembly file uses only registers up to
xmm5, while the same command under Windows generates assembly that uses
all registers up to xmm15 (on the same 64-bit Intel Q9550).
At the same time, the Linux assembly shows lots and lots of spills and
reloads.

Although I did not check whether the code generated by the JIT is the
same or comparable, the fact that this occurs with the static llc seems
to prove that there is a major problem here.

This applies both to the current SVN trunk and SVN revision 112036.

Can somebody reproduce that or give comments on what happens there?

Best regards,
Ralf

BAD.ll (9.5 KB)

What you are describing sounds a lot like http://llvm.org/bugs/show_bug.cgi?id=1512

New awesomeness is being added to the register allocator to deal with this, but it will probably be about 6 months before it can be turned on by default.

Some people have gotten better code from the fast allocator if the basic blocks are few and large.

The difference between Linux and Windows could be caused by different calling conventions. Don't call functions in the middle of your go-fast floating point code.

/jakob
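
For illustration, the advice about keeping calls out of hot floating-point code could look roughly like the C sketch below. It is not taken from BAD.ll; scale_for, sum_call_inside and sum_call_hoisted are hypothetical names, and the noinline attribute is a GCC/Clang extension used here only to keep the helper a real call. The first version keeps four accumulators live across every call; the second makes all the calls in a separate pass so the arithmetic loop contains none.

```c
#include <stddef.h>

/* Hypothetical helper; noinline (GCC/Clang) keeps it an actual call. */
__attribute__((noinline)) static double scale_for(size_t block)
{
    return (double)(block % 3) + 1.0;
}

/* Call inside the hot loop: s0..s3 are live across every call.
 * Only whole groups of four elements are processed. */
double sum_call_inside(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i + 3 < n; i += 4) {
        double k = scale_for(i / 4);
        s0 += k * x[i];
        s1 += k * x[i + 1];
        s2 += k * x[i + 2];
        s3 += k * x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

/* Calls moved into their own pass; the arithmetic loop is call-free.
 * scratch must hold at least n / 4 doubles. */
double sum_call_hoisted(const double *x, double *scratch, size_t n)
{
    size_t blocks = n / 4;
    for (size_t b = 0; b < blocks; ++b)
        scratch[b] = scale_for(b);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t b = 0; b < blocks; ++b) {
        const double *p = x + 4 * b;
        double k = scratch[b];
        s0 += k * p[0];
        s1 += k * p[1];
        s2 += k * p[2];
        s3 += k * p[3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions compute the same sum. Whether the second form actually reduces spills depends on the target and the register allocator, so it is worth checking the generated assembly rather than assuming.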

The Win64 calling convention defines XMM6..XMM15 as callee-saved, so their values can remain live across calls. On Linux all XMM registers are call-clobbered, so any live values must be spilled across calls. That's the basic reason for the difference. It may be that something can be done to improve the code on Linux, such as scheduling differently, but I haven't looked in detail.
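
A minimal C sketch of the situation described above, not taken from BAD.ll: opaque and sum_across_call are hypothetical names, and the noinline attribute (GCC/Clang) is only there to keep the call from being inlined. Several floating-point values are computed before a call and used again after it, so they either sit in callee-saved registers (possible on Win64) or go to the stack (the only option when every XMM register is call-clobbered).

```c
/* Hypothetical stand-in for any call the compiler cannot inline. */
__attribute__((noinline)) double opaque(double x)
{
    return x * 0.5 + 1.0;
}

/* p must point to at least 8 doubles. */
double sum_across_call(const double *p)
{
    double a = p[0] * p[1];
    double b = p[2] * p[3];
    double c = p[4] * p[5];
    double d = p[6] * p[7];
    double r = opaque(a);   /* a, b, c and d are all needed after this call */
    return r + a + b + c + d;
}
```

Assuming the sketch is saved as live.c (a made-up file name), the two conventions can be compared on one machine by switching the target triple:

  clang -O2 -S --target=x86_64-pc-linux-gnu    live.c -o live-linux.s
  clang -O2 -S --target=x86_64-pc-windows-msvc live.c -o live-win64.s

The exact output varies with the compiler version, and the compiler is free to move the multiplications after the call instead of keeping their results live, so treat this as a starting point for experiments rather than a guaranteed reproduction.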

Don't we already have splitting around calls for exactly this sort of situation?

-Chris