(Sending this again, since apparently I sent it by mail after the Discourse migration.)
I actually developed a prototype for spilling into registers on X86 last summer. Unfortunately I was not able to measure any speedups in real-world code with it. I saw some improvements for artificial examples.
I have now uploaded my patches to Phabricator as inspiration:
https://reviews.llvm.org/D118728
https://reviews.llvm.org/D118729
https://reviews.llvm.org/D118730
https://reviews.llvm.org/D118731
The work was based on a clang-12 branch and will not apply cleanly on ToT.
I added some comments below:
Spill2Reg in x86
Some notes here:
- Code size increases on average when spilling into XMM registers! Typical examples:
# "Traditional" spilling, 4 bytes each for spill/reload with a small offset
48 89 5d c8 mov %rbx,-0x38(%rbp)
48 8b 5d d8 mov -0x28(%rbp),%rbx
# 7 bytes for large offsets (not that common)
4c 89 bd 78 ff ff ff mov %r15,-0x88(%rbp)
4c 8b bd 70 ff ff ff mov -0x90(%rbp),%r15
# 1 or 2 bytes for push/pop!
41 57 push %r15
53 push %rbx
5b pop %rbx
41 5f pop %r15
# 5 bytes for copy to/from SSE register.
66 48 0f 6e c7 movq %rdi,%xmm0
66 48 0f 7e c0 movq %xmm0,%rax
- When reloading from the stack you can often “fold” the reload into the instruction by using X86 memory addressing modes. For SSE registers you always need separate instructions and a register to hold the value between SSE copy and use.
- On the other hand, for a combined RELOAD, OP, SPILL sequence it may be possible to turn the whole operation on GP registers (think ANDQ) into an equivalent operation on XMM registers (ANDPD). I did not model that in my prototype.
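As a rough sanity check of the code-size point, the encodings listed above can simply be counted. This is a minimal sketch in Python; the dictionary keys and helper names are made up for illustration, and only the hex strings are taken from the listing.

```python
# Hex encodings copied from the listing above. The labels are
# illustrative names, not anything LLVM uses.
ENCODINGS = {
    "stack spill (small offset)": "48 89 5d c8",
    "stack reload (small offset)": "48 8b 5d d8",
    "stack spill (large offset)": "4c 89 bd 78 ff ff ff",
    "stack reload (large offset)": "4c 8b bd 70 ff ff ff",
    "push %r15": "41 57",
    "push %rbx": "53",
    "pop %rbx": "5b",
    "pop %r15": "41 5f",
    "movq %rdi,%xmm0": "66 48 0f 6e c7",
    "movq %xmm0,%rax": "66 48 0f 7e c0",
}


def encoded_size(hex_bytes: str) -> int:
    """Number of bytes in a space-separated hex encoding."""
    return len(bytes.fromhex(hex_bytes))


SIZES = {name: encoded_size(enc) for name, enc in ENCODINGS.items()}

# Cost of one spill plus one reload for each strategy.
small_stack_pair = (SIZES["stack spill (small offset)"]
                    + SIZES["stack reload (small offset)"])
large_stack_pair = (SIZES["stack spill (large offset)"]
                    + SIZES["stack reload (large offset)"])
xmm_pair = SIZES["movq %rdi,%xmm0"] + SIZES["movq %xmm0,%rax"]

if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{size} bytes  {name}")
    print("small-offset stack pair:", small_stack_pair)
    print("large-offset stack pair:", large_stack_pair)
    print("XMM copy pair:", xmm_pair)
```

For these particular encodings, a spill/reload pair through an XMM register costs 10 bytes against 8 bytes for the small-offset stack pair, and only wins on size against the less common large-offset pair (14 bytes).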
In the spiller versus as a standalone pass
I implemented things as part of the InlineSpiller, which wasn’t too bad and is a more natural fit than a separate pass IMO. It avoids pass-ordering issues, where the register allocator could have done a better job had it seen the effects of the SSE copies right away.
Please note that the profitability heuristic is implemented as a
target-dependent component, so other targets can implement their own
specialized heuristics.
Yeah, I did not have any heuristics on my end… maybe that explains my bad results.
Performance Results
The performance evaluation was done using
ba51d26ec4519f5b31de3acf643264504ea7bc7c as a base commit on a Skylake
Xeon Gold 6154. The code was compiled with -O3 -march=native.
Applying Spill2Reg to the code of the motivating example shown in a
previous section replaces some of the spills/reloads with movd
instructions, leading to about 10% better performance.
Spill2Reg also works on real-life applications: In SPEC CPU 2017 525.x264
Spill2Reg improves performance by about 0.3%.
0.3% seems close to typical noise levels when measuring SPEC (at least in my setups)…
I evaluated a number of big internal server applications, none of which improved. I also tried the llvm-test-suite and SPEC; the results swung in both positive and negative directions, and I could not distinguish them from run-to-run noise.
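To make the noise argument concrete, here is a minimal sketch in Python of how one might check whether a small change stands out from run-to-run variation. The runtimes are made-up placeholder numbers, not measurements from this thread, and the "2x noise" threshold and function names are arbitrary assumptions for illustration.

```python
import statistics


def relative_change(baseline, candidate):
    """Relative change in median runtime; negative means candidate is faster."""
    b = statistics.median(baseline)
    c = statistics.median(candidate)
    return (c - b) / b


def noise_level(samples):
    """Run-to-run noise as sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Made-up runtimes in seconds, purely illustrative (NOT real SPEC results).
baseline = [100.0, 100.6, 99.5, 100.3, 99.8]
candidate = [99.7, 100.2, 99.3, 100.1, 99.4]

change = relative_change(baseline, candidate)
noise = max(noise_level(baseline), noise_level(candidate))

# Assumed rule of thumb: only trust a change larger than twice the noise.
significant = abs(change) > 2 * noise

if __name__ == "__main__":
    print(f"relative change: {change:+.2%}")
    print(f"run-to-run noise: {noise:.2%}")
    print("distinguishable from noise:", significant)
```

With these placeholder numbers the median improves by about 0.3%, but the spread between runs is larger than that, so the difference would not count as distinguishable under the assumed threshold.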
- Matthias