[RFC] Spill2Reg: Selectively replace spills to stack with spills to vector registers

(Sending this again, since apparently I sent it by mail after the Discourse migration)

I actually developed a prototype for spilling into registers on X86 last summer. Unfortunately I was not able to measure any speedups in real-world code with it. I saw some improvements for artificial examples.

I have now uploaded my patches to Phabricator as inspiration:
https://reviews.llvm.org/D118728
https://reviews.llvm.org/D118729
https://reviews.llvm.org/D118730
https://reviews.llvm.org/D118731

The work was based on a clang-12 branch and will not apply cleanly on ToT.

I added some comments below:

Spill2Reg in x86

Some notes here:

  • Code size increases on average when spilling into XMM registers! Typical examples:
# "Traditional" spilling, 4 bytes for spill/reload with small offset
48 89 5d c8 mov %rbx,-0x38(%rbp)
48 8b 5d d8 mov -0x28(%rbp),%rbx
# 7 bytes for large offsets (not that common)
4c 89 bd 78 ff ff ff mov %r15,-0x88(%rbp)
4c 8b bd 70 ff ff ff mov -0x90(%rbp),%r15

# 1 or 2 bytes for push/pop!
41 57 push %r15
53 push %rbx
5b pop %rbx
41 5f pop %r15

# 5 bytes for copy to/from SSE register.
66 48 0f 6e c7 movq %rdi,%xmm0
66 48 0f 7e c0 movq %xmm0,%rax
  • When reloading from the stack you can often “fold” the reload into the instruction by using X86 memory addressing modes. For SSE registers you always need separate instructions and a register to hold the value between SSE copy and use.
  • Contrary to my last point, though: for a combined RELOAD, OP, SPILL sequence it may be possible to turn a whole operation on GP registers (think ANDQ) into an equivalent operation on XMM registers (ANDPD). I did not model that in my prototype.
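
To make the code-size comparison concrete, here is a minimal Python sketch that counts the encoded bytes in a few of the listing lines above. The helper name `encoded_size` and the `LISTING` variable are made up for this illustration; the listing lines themselves are copied from the post.

```python
# Listing lines copied from the post: hex bytes first, then the instruction.
LISTING = [
    "48 89 5d c8 mov %rbx,-0x38(%rbp)",           # stack spill, small offset
    "4c 89 bd 78 ff ff ff mov %r15,-0x88(%rbp)",  # stack spill, large offset
    "41 57 push %r15",
    "53 push %rbx",
    "66 48 0f 6e c7 movq %rdi,%xmm0",             # copy GP register to XMM
]

HEX_DIGITS = set("0123456789abcdef")


def encoded_size(line):
    """Count the leading two-digit hex tokens, i.e. the instruction length in bytes."""
    size = 0
    for token in line.split():
        if len(token) == 2 and set(token) <= HEX_DIGITS:
            size += 1
        else:
            break  # the first non-hex token is the mnemonic
    return size


for line in LISTING:
    print(encoded_size(line), line)
```

With these inputs, a small-offset stack spill is one byte shorter than the `movq` into an XMM register, and push/pop is shorter still. Only the large-offset form is longer than the XMM copy.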

In the spiller versus as a standalone pass

I implemented things as part of the InlineSpiller, which wasn’t too bad and is a more natural fit than a separate pass IMO. You avoid pass-ordering issues, where the register allocator could have done a better job had it seen the effects of the SSE copies right away.

Please note that the profitability heuristic is implemented as a
target-dependent component, so other targets can implement their own
specialized heuristics.

Yeah, I did not have any heuristics for this… maybe that explains my bad results.

Performance Results

The performance evaluation was done using
ba51d26ec4519f5b31de3acf643264504ea7bc7c as a base commit on a Skylake
Xeon Gold 6154. The code was compiled with -O3 -march=native.
Applying Spill2Reg to the code of the motivating example shown in a
previous section, replaces some of the spills/reloads with movd
instructions, leading to about 10% better performance.
Spill2Reg also works on real-life applications: In SPEC CPU 2017 525.x264
Spill2Reg improves performance by about 0.3%.

0.3% seems close to typical noise levels when measuring SPEC (at least in my setups)…

I evaluated a number of big internal server applications, none of which improved. I also tried the llvm-test-suite and SPEC; the results swung in both positive and negative directions, and I could not distinguish them from run-to-run noise.
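
As a rough illustration of what "not distinguishable from noise" means here, the following Python sketch compares a measured speedup against the run-to-run variation of repeated timings. It is a crude rule of thumb, not a proper statistical test and not the method I actually used. The function names and the default `factor` of 2 are assumptions made for this example.

```python
from statistics import mean, stdev


def speedup_vs_noise(baseline_times, patched_times):
    """Return (relative speedup, relative noise) for two sets of repeated run times.

    Speedup is the relative reduction in mean run time. Noise is the larger
    of the two coefficients of variation (standard deviation / mean).
    """
    base_mean = mean(baseline_times)
    patched_mean = mean(patched_times)
    speedup = (base_mean - patched_mean) / base_mean
    noise = max(stdev(baseline_times) / base_mean,
                stdev(patched_times) / patched_mean)
    return speedup, noise


def is_distinguishable(baseline_times, patched_times, factor=2.0):
    """Crude check: treat a change as real only if it exceeds `factor` times the noise."""
    speedup, noise = speedup_vs_noise(baseline_times, patched_times)
    return abs(speedup) > factor * noise
```

Each list needs at least two timings, since `stdev` is undefined for a single sample. With a run-to-run variation of around 1%, a 0.3% change would not pass this check.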

- Matthias