[RFC] Spill2Reg: Selectively replace spills to stack with spills to vector registers

Hi everyone,

This is an RFC for a new machine IR pass called Spill2Reg, which
selectively replaces spills to the stack with spills to vector registers.
The chain of patches can be found here: https://reviews.llvm.org/D118298
(it includes all patches from D118298 to D118305).

Overview

The register allocator emits spill/reload instructions to temporarily save
register values to memory. These are typically stores to the stack (aka
spills) and loads from the stack (aka reloads or fills). These instructions
hurt performance not only because they increase the number of instructions
executed, but also because they add pressure to the memory resources, which
may already be heavily used by the actual memory instructions of the
workload.

Spill2Reg aims to reduce this overhead by selectively replacing
spills/reloads to/from the stack with spills/reloads to/from registers, only
when it is profitable. The target registers that hold the spilled
value must be of a different class than those that caused the spill; on
architectures like x86 or ARM we can use vector registers to save values
from general-purpose registers. Spill2Reg can be profitable in several
cases: (i) on targets where spills to the stack are always slower than spills
to registers, (ii) in pathological cases with lots of back-to-back
spill/reload instructions, and (iii) in memory-intensive workloads. It is
worth pointing out that Spill2Reg can be profitable even on targets where
spills to registers and spills to the stack have similar latency, because
replacing some of the stack instructions with register instructions can
help remove stalls caused by bottlenecks in the memory resources.

Early evaluation on a Skylake (Server) x86_64 system shows that Spill2Reg
can improve performance of both synthetic tests and real-life workloads.

Why Spill to Registers?

There are a couple of reasons why it makes sense to spill to registers
instead of memory. To summarize: (i) there is usually a lot of free
register space even when spilling and (ii) spilling to vector registers can
remove back-end stalls. The following sections discuss these points in more
detail.

  1. Free register space even when spilling

Consider the following code:

int D0, D1, D2, ..., D18;
void foo() {
   int t0 = D0;
   int t1 = D1;
   ...
   int t18 = D18;
   // Some code
   ... = t0;
   ... = t1;
       ...
   ... = t18;
}

Variables t0 to t18 are all live across the middle point (marked with
// Some code). When compiled for x86_64, this code will assign t0 to t14 to
registers, but will spill t15 to t18.
Here is what the assembly looks like:

        movl    D0(%rip), %eax
        movl    %eax, -8(%rsp)                  # 4-byte Spill
        movl    D1(%rip), %ecx
        movl    D2(%rip), %edx
        movl    D3(%rip), %esi
        movl    D4(%rip), %edi
        movl    D5(%rip), %r8d
        movl    D6(%rip), %r9d
        movl    D7(%rip), %r10d
        movl    D8(%rip), %r11d
        movl    D9(%rip), %ebx
        movl    D10(%rip), %ebp
        movl    D11(%rip), %r14d
        movl    D12(%rip), %r15d
        movl    D13(%rip), %r12d
        movl    D14(%rip), %r13d
        movl    D15(%rip), %eax
        movl    %eax, -4(%rsp)                  # 4-byte Spill
        movl    D16(%rip), %eax
        movl    %eax, -12(%rsp)                 # 4-byte Spill
        movl    D17(%rip), %eax
        movl    %eax, -16(%rsp)                 # 4-byte Spill
        movl    D18(%rip), %eax
        movl    %eax, -20(%rsp)                 # 4-byte Spill
        # ...  Some code ...
        movl    -8(%rsp), %eax                  # 4-byte Reload
        movl    %eax, U0(%rip)
        movl    %ecx, U1(%rip)
        movl    %edx, U2(%rip)
        movl    %esi, U3(%rip)
        movl    %edi, U4(%rip)
        movl    %r8d, U5(%rip)
        movl    %r9d, U6(%rip)
        movl    %r10d, U7(%rip)
        movl    %r11d, U8(%rip)
        movl    %ebx, U9(%rip)
        movl    %ebp, U10(%rip)
        movl    %r14d, U11(%rip)
        movl    %r15d, U12(%rip)
        movl    %r12d, U13(%rip)
        movl    %r13d, U14(%rip)
        movl    -4(%rsp), %eax                  # 4-byte Reload
        movl    %eax, U15(%rip)
        movl    -12(%rsp), %eax                 # 4-byte Reload
        movl    %eax, U16(%rip)
        movl    -16(%rsp), %eax                 # 4-byte Reload
        movl    %eax, U17(%rip)
        movl    -20(%rsp), %eax                 # 4-byte Reload
        movl    %eax, U18(%rip)

Meanwhile there is a lot of free space in the vector register file, yet we
are spilling to memory.
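To make the transformation concrete, here is a hand-written sketch of what
Spill2Reg could emit for the first spill/reload pair above. The choice of
%xmm0 is purely illustrative; the pass picks any free vector register:

```asm
        # Original, stack-based:
        movl    %eax, -8(%rsp)                  # 4-byte Spill
        # ...
        movl    -8(%rsp), %eax                  # 4-byte Reload

        # With Spill2Reg, vector-register-based:
        movd    %eax, %xmm0                     # 4-byte Spill to vector reg
        # ...
        movd    %xmm0, %eax                     # 4-byte Reload from vector reg
```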

  2. Memory and register spills use different ports in x86

According to [1], spills/reloads to/from the stack use ports 2, 3, 4, and 7
on the Skylake Server x86 micro-architecture, while spills/reloads to/from
the vector register file use ports 0, 1, and 5. This means that stack-based
and vector-register-based spills/reloads use different back-end resources,
so replacing one with the other can shift the overhead from one type of
resource to another.

This is particularly important for Spill2Reg, because it shows that the pass
can improve performance even if spills-to-stack and spills-to-registers have
a similar latency. If the CPU is stalling due to over-subscribed memory
resources, Spill2Reg can replace some spills-to-stack with
spills-to-registers, which can help remove some of the stalls.

I think this looks familiar

If spilling to vector registers looks familiar, it is probably because GCC
has included support for spilling to registers for quite some time. In x86
this was done with vmovd instructions spilling to xmm registers.

However, to the best of my knowledge, spilling to vector registers in x86
has been disabled [2] due to stability [3,4,5], correctness [6], and
performance [7] issues.
The performance issues highlighted in [7] seem to be related to:
(i) double spilling: the vector register used for spilling gets spilled to
memory itself, and
(ii) folded reloads: if a reload can be folded into an instruction, then
spilling to a vector register results in an additional instruction: the one
extracting the value from the vector register and inserting it into the
general purpose register.
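For example (registers and stack offsets below are hypothetical), a stack
reload that folds into its user costs one instruction, while the
vector-register version needs a separate extract:

```asm
        # Stack reload folded into the user (one instruction):
        addl    -8(%rsp), %edx

        # Vector-register reload cannot be folded (two instructions):
        movd    %xmm0, %eax
        addl    %eax, %edx
```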

I believe that the proposed Spill2Reg design can address all these issues.
These points are discussed in the following sections.

Spill2Reg in x86

There are several interesting points about spilling to vector registers in
x86:

  • Writing/reading a value to/from a vector register can be done with a
    single assembly instruction, so spilling to a vector register does not
    introduce more instructions.
  • The vector register space is quite large at 2KB (32 ZMM registers × 64
    bytes) in Skylake Server, so chances are that there will be a lot of
    free vector register space in general-purpose workloads.
  • If needed we could insert/extract values to/from any vector lane using
    (V)PINSR{B,W,D,Q} and (V)PEXTR{B,W,D,Q} (supported since SSE4.1), or use
    lane 0 only with MOVD (supported since MMX).
  • We can choose between two types of instructions for moving data to vector
    registers: MOVD/Q and PINSRD/Q:
    • PINSRD/Q and PEXTRD/Q allow us to insert/extract a value to/from any
      lane of the vector register. They are available in SSE4.1 and later.
    • MOVD/Q moves a 32/64-bit value to the first lane of a vector register.
      It is available since MMX, so it is widely supported.
      GCC’s implementation uses MOVD.
    • According to Agner Fog’s latency/throughput data for Skylake [9],
      MOVD has a lower latency than PINSRD/Q, the same latency as MOV to/from
      memory, but lower throughput:
                  uops   uops      uops
                  fused  unfused   each   latency  throughput
                  domain domain    port
Spill-to-reg
------------
MOVD mm/x r32/64    1     1        p5       2       1
MOVD r32/64 mm/x    1     1        p0       2       1

PINSRD/Q x,r,i      2     2        2p5      3       2
PEXTRB/W/D/Q r,x,i  2     2       p0 p5     3       1

Spill-to-stack
--------------
MOV  m,r            1     2       p237 p4   2       1
MOV r32/64, m       1     1        p23      2       0.5
------------------------------------------------------------
mm  : 64-bit mmx register
x   : 128-bit xmm register
m   : memory operand
m32 : 32-bit memory operand

Source: Agner Fog's Instruction tables:
https://www.agner.org/optimize/instruction_tables.pdf

Spill2Reg on ARM

According to ARM’s software optimization guide [8], section 4.3 (page 57),
spilling to vector registers is encouraged because these instructions
have a lower latency than spills to memory.
Also, given the RISC nature of ARM instructions, there are no folded memory
operands to worry about when applying Spill2Reg.

Caveats

  • Instruction throughput/latency: Vector insert/extract instructions may
    have lower throughput or higher latency than memory loads/stores on
    certain targets. If this is the case, they should be used only when the
    memory resources are over-subscribed.
  • Double spilling: We must take extra care not to spill the vector
    registers that hold the spilled value, because this will only add overhead.
    This was also observed in GCC’s implementation [7]. This is a concern when
    Spill2Reg is implemented as part of the register allocator’s spiller.
  • Additional instructions: Stack-based spills/reloads can potentially be
    folded, which is quite common in x86. For example, given a reload MOV
    Reg1, [Spill] and an instruction using Reg1, like ADD Reg2, Reg1, the
    reload can be folded into ADD Reg2, [Spill], saving one instruction.
    Spill2Reg can potentially block this folding, adding an instruction that
    reloads the data from the vector register before the ADD. This is a
    concern when Spill2Reg is implemented as part of the register allocator’s
    spiller, because folding of spills/reloads may take place later on, after
    all spills/reloads have been emitted.
  • Frequency throttling: Some targets will lower their turbo frequency when
    running vector instructions [10]. Throttling is the highest for AVX-512 but
    decreases as the vector width decreases, and it seems that there is no
    frequency drop for 128-bit instructions.
  • Profitability models: There may be different trade-offs for different
    targets or different generations of the same target. Each target should
    implement its own profitability model.

Proposed Design

Spill2Reg can be implemented either as part of the spiller, within the
register allocator, or as a standalone pass after register allocation. Each
design point has its own benefits and shortcomings. The following sections
list some of the most important points for and against each design. The
discussion provides some insights into why we believe Spill2Reg is best
implemented as a separate pass after the register allocator and after
physical registers are introduced.

In the spiller versus as a standalone pass

In the spiller

  • Pros:

    • The spiller is a natural location for the Spill2Reg code.
    • All analysis data needed are already computed and readily available.
    • The actual logic is quite simple.
  • Cons:

    • The register allocator expects that a Spill/Reload is a store/load. So
      supporting Spill2Reg requires some refactoring.
    • Spill/Reload removal optimizations within the register allocator need
      to be updated to handle spills to registers.
    • The register allocation pass is already quite complex, and adding
      Spill2Reg can increase complexity even further.
    • Folded spills/reloads (see GCC issue [7]): We need to be able to skip
      Spill2Reg if spills/reloads to the stack can be folded. Folding happens
      after the spill code is emitted, so implementing this is not trivial.
    • Double-spills must be avoided (see GCC issue [7]): The vector pseudo
      register that holds the spilled value needs to be colored by the register
      allocator itself, and we need to guarantee that it won’t get spilled.

As a standalone pass

  • Pros:

    • Small pass, easy to design and maintain.
    • Easier pass testing.
    • Straightforward tracking of performance improvements/regressions,
      without having to worry whether changes in Spill2Reg affect the
      decisions of the register allocator.
  • Cons:

    • May replicate analysis that is already available within the register
      allocator.
    • Yet one more compiler pass.

Given the points listed above, and the performance/stability bugs reported
in GCC [2,3,4,5,6,7], I believe that it makes sense to implement Spill2Reg
as a standalone pass. This results in a simpler design, with simpler
testing and easier performance tracking to avoid performance regressions.

As a standalone-pass: Before or after VirtRegRewriter

Another design decision is whether we place the pass before or after the
Virtual Register Rewriter pass.

Before VirtRegRewriter

  • Pros:

    • We are still working with pseudo-registers.
    • Straightforward checking of pseudo-register interference using
      LiveIntervals and LiveRegMatrix.
  • Cons:

    • The main issue is with testing. Testing Spill2Reg would require running
      the register allocator pass first, and relying on it to generate
      spill/reload code, which would make tests very verbose and tricky to write.
    • The pass would need to maintain the state of several data-structures,
      like LiveIntervals and LiveRegMatrix.

After VirtRegRewriter

  • Pros:

    • The main point is ease of testing: We can test the pass in isolation,
      using small tests containing hand-written spill/reload code. No need to run
      other passes before it.
    • Fewer data structures to maintain.
  • Cons:

    • Spill2Reg needs its own live register tracking implementation since it
      can no longer rely on LiveIntervals and LiveRegMatrix for finding free
      physical registers.

Given that the pass design is quite similar in both cases, and that testing
is significantly nicer in one of them, the preferred option is after
VirtRegRewriter.

Target independent component

The Spill2Reg pass works as follows:

  1. It collects all candidate spills/reloads. This step filters out folded
    spills/reloads and data types unsupported by the target (a
    target-dependent legality check).
  2. It then iterates through the collected candidates and checks whether it
    is profitable to spill to a vector register (target-dependent cost model).
  3. If profitable, it generates the new spills/reloads to/from the vector
    register file and removes the original instructions (target-dependent
    spill/reload instruction generation).

Target dependent components

This includes:

  • Legality checks that test whether the opcodes and types of the
    spills/reloads can be handled by the target.
  • The profitability heuristic, which checks whether applying Spill2Reg to a
    specific set of spills/reloads will lead to better-performing code on the
    target.
  • The generation of spill/reload instructions to/from a vector register.

Profitability Heuristic in x86

Given a candidate set of spills/reloads, we need to decide whether applying
Spill2Reg is profitable. We are currently implementing an all-or-nothing
approach for the whole set: we will either replace all spills/reloads or none.

According to [9], spills to vector registers have the same latency as spills
to memory, but have lower throughput on Skylake. So replacing all spills to
the stack with spills to registers can degrade performance. Instead, a better
strategy is to spill to registers only when the memory units are
over-subscribed, in an attempt to reduce any potential back-end stalls.

Since our goal is to avoid back-end stalls caused by bottlenecks in the
memory resources, we need some way to measure when these stalls could happen.
Ideally we would query a pipeline model (like the one used in instruction
schedulers) to determine if spills-to-stack can cause pipeline stalls. For
now the implementation is based on a simple instruction count: if the count
of memory instructions in the proximity of the spill/reload is above a
threshold, then we apply Spill2Reg.

Please note that the profitability heuristic is implemented as a
target-dependent component, so other targets can implement their own
specialized heuristics.

Performance Results

The performance evaluation was done using
ba51d26ec4519f5b31de3acf643264504ea7bc7c as a base commit on a Skylake
Xeon Gold 6154. The code was compiled with -O3 -march=native.
Applying Spill2Reg to the motivating example shown in a previous section
replaces some of the spills/reloads with movd instructions, leading to about
10% better performance.
Spill2Reg also works on real-life applications: in SPEC CPU 2017 525.x264,
Spill2Reg improves performance by about 0.3%.

References

[1] WikiChip Skylake microarchitecture:

[2] GCC spill to register:

[3] GCC stability bug: 70902 – GCC freezes while compiling for 'skylake-avx512' target

[4] GCC stability bug: 71596 – gcc bootstrap fails due to segv in genrecog

[5] GCC stability bug: 71555 – ICE: compilation "never" finishes with -O -mtune=sandybridge -mavx512bw

[6] GCC correctness bug: 71657 – Wrong code on trunk gcc (std::out_of_range), westmere

[7] GCC performance bug: 71453 – Spills to vector registers are sub-optimal.

[8] ARM Software Optimization Guide:
https://developer.arm.com/documentation/swog309707/a

[9] Agner Fog’s Instruction tables:
https://www.agner.org/optimize/instruction_tables.pdf

[10] Daniel Lemire’s AVX throttling blog:

Thanks,
Vasileios

This is a wonderful writeup of an interesting design problem and the chosen solution. I learned several things from reading it. Thank you for taking the time to write this up.

My takeaway after reading was that you’ve clearly thought through all the issues in some depth. I might not lean the same direction on each choice, but you’ve more than convinced me that you’ve thought about them and made a reasoned choice. Solid +1 on the design here.

Philip

Thanks Philip for the feedback, I am glad you enjoyed reading it.

Vasileios

Same here, enjoyed reading this. Just a drive by comment about this:

Spill2Reg also works on real-life applications: In SPEC CPU 2017 525.x264 Spill2Reg improves performance by about 0.3%.

Our codegen for SPEC is relatively inefficient and there is a lot to gain here (compared to GCC), especially for x264 where we are more than 20% behind. That’s for AArch64 by the way, but since we are mostly fixing target-independent things, I guess that’s also the case for X86. While this 0.3% is nice (every little helps), I think things would look quite different when the inefficiencies are addressed (we are working on it/some), and spill2reg that is fixing up things now may not be able to do so when that is the case. Long story short, we only have a 10% uplift for 1 case, so we need some more performance numbers? Ideally for AArch64 too?

But thanks again for the write up and the work, really nice.

Cheers.

Hi Vasileios,

Spilling to registers is actually already possible with the existing allocator.
See https://lists.llvm.org/pipermail/llvm-dev/2019-December/137700.html

Depending on how you write to your vector registers that may be tricky to pull off, but I thought I’d share the information to make sure we’re not reinventing something that is already there.

Cheers,
-Quentin

It’s interesting how this interacts with lazy save/restore of the FP context across context switches.

~chill

Thanks for your comments Sjoerd,

So far I have only tried a small synthetic test on AArch64, like the one with the back-to-back spills and reloads, and it seems to improve performance there too. I definitely need to do some more extensive performance evaluation, including AArch64.
I have seen a few SPEC benchmarks improve, not just x264, but this depends on the compiler options used and the exact compiler commit checked out. But yeah, the performance improvements are usually around 0.5%.

Vasileios

+1. Liked the overall approach and thought process. BTW, in case you didn’t notice, AMDGPU backend does have a similar pass in place.

I have some suggestions/thoughts.

Please note that the profitability heuristic is implemented as a target-dependent component, so other targets can implement their own specialized heuristics.

Was there any thought about finer control? e.g., controlling spilling decisions for a certain region of the code, like a loop. Attaching metadata to a loop can help control the decisions. It is quite possible that spilling to vector registers is profitable in one loop but not in another. Designing robust heuristics is a hard problem.

Hi Quentin,

Thanks for pointing that out. I have not actually tried the exact configuration you are referring to, but what I tried instead was generating spills to vector registers in the spiller, which was fairly straightforward. I would assume that both approaches would have the same issues I mention in the RFC. To summarize, the most important ones that were not straightforward to handle from within the register allocator:
i. avoiding performance regressions: avoiding spilling to a register when the spill can be folded, and also avoiding double spilling,
ii. non-trivial changes in the existing code: updating the spill/reload removal optimizations to handle non-memory spill instructions, and
iii. testing is quite a bit trickier than when this is done in a separate pass.
Another point is that the spill2reg pass itself is not very large or complex (~500 lines of code), so I think it makes sense to keep it as a separate pass.

Thanks,
Vasileios

Hi Momchil,
Spill2Reg definitely adds to the FP context, but I am not sure how it would interact with lazy save/restore.
Vasileios

Hi Madhur,

Thank you for the feedback. I am not aware of the AMDGPU backend pass, which pass are you referring to?

The current goal, on x86 at least, is to spill to vectors only when we get stalls due to the memory units being fully utilized. This is a safe choice because it helps avoid performance regressions. The current heuristic is fairly simple: it just counts nearby instructions and checks against a threshold, but I assume that this can’t be very accurate. So I think the next step is to query a pipeline model about stalls in the memory units instead. This should be relatively safe.

The next thing we can improve on is to step away from the all-or-nothing approach. So instead of applying spill2reg only when all spills/reloads are causing stalls, we could apply it even if a fraction of them are doing so. This would definitely require more sophisticated heuristics or manually marked regions, like what you are describing, because it would be much harder to avoid performance regressions. But I think this should also improve Spill2Reg’s coverage quite a bit, and possibly lead to more speedups. So yeah, there is a lot of room for improving the profitability decision, and any ideas on this are always welcome.

Thanks,
Vasileios

Hi Momchil,
Spill2Reg definitely adds to the FP context, but I am not sure how it would interact with lazy save/restore.
Vasileios

The XSAVE instructions allow the OS to skip restoring the XMM registers on context switches if they are in their initial state or haven’t changed. Spilling to XMM registers will make the registers dirty in more cases than they might have been otherwise.

That reminds me that this needs to be disabled when the NoImplicitFloat attribute is present. One use for NoImplicitFloat is to prevent uses of FP and vector registers when compiling OS kernels. If the kernel code doesn’t change the FP or vector registers, they don’t need to be saved/restored during system calls. I hope you already took into account -mno-sse/-mno-sse2.

Thank you for your comments Craig,
Great point, I will add a check for NoImplicitFloat.
Regarding -mno-sse, the pass is currently enabled only if X86STI->hasSSE41() so I guess this is covered.

Vasileios

Hi Vasileios,

For the related pass, please have a look at SILowerSGPRSpills MachineFunctionPass and how things are handled in FrameLowering. (Connecting with AMD folks will give more info, if needed).

(Sending this again, since apparently I sent it by mail after discourse migration)

I actually developed a prototype for spilling into registers on X86 last summer. Unfortunately I was not able to measure any speedups in real-world code with it. I saw some improvements for artificial examples.

I now uploaded my patches to phabricator as inspiration:
https://reviews.llvm.org/D118728
https://reviews.llvm.org/D118729
https://reviews.llvm.org/D118730
https://reviews.llvm.org/D118731

The work was based on a clang-12 branch and will not apply cleanly on ToT.

I added some comments below:

Spill2Reg in x86

Some notes here:

  • Code size increases on average when spilling into XMM registers! Typical examples:
# "Traditional" spilling, 4 byte for spill/reload with small offset
48 89 5d c8 mov %rbx,-0x38(%rbp)
48 8b 5d d8 mov -0x28(%rbp),%rbx
# 8 byte for large offsets (not that common)
4c 89 bd 78 ff ff ff mov %r15,-0x88(%rbp)
4c 8b bd 70 ff ff ff mov -0x90(%rbp),%r15

# 1 or 2 bytes for push/pop!
41 57 push %r15
53 push %rbx
5b pop %rbx
41 5f pop %r15

# 5 bytes for copy to/from SSE register.
66 48 0f 6e c7 movq %rdi,%xmm0
66 48 0f 7e c0 movq %xmm0,%rax
  • When reloading from the stack you can often “fold” the reload into the instruction by using X86 memory addressing modes. For SSE registers you always need separate instructions and a register to hold the value between SSE copy and use.
  • Well contrary to my last point. In case of combined RELOAD,OP,SPILL sequence there may be a possibility to turn a whole operation operating on GP registers (think ANDQ) to an equivalent operation on XMM registers (ANDPD); I did not model that in my prototype
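A hypothetical sketch of that last case (register choices and offsets are
arbitrary): a reload-op-spill sequence on GP registers could in principle
become a single vector operation if the spilled value never leaves its
vector register:

```asm
        # Reload, operate, spill again (stack-based):
        movq    -8(%rsp), %rax                  # Reload
        andq    %rbx, %rax
        movq    %rax, -8(%rsp)                  # Spill

        # If the value lives in %xmm0 instead:
        movq    %rbx, %xmm1
        pand    %xmm1, %xmm0                    # value stays in %xmm0
```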

In the spiller versus as a standalone pass

I implemented things as part of the InlineSpiller, which wasn’t too bad and is a more natural fit than a separate pass IMO. You avoid the pass-ordering issue, where the register allocator could have done a better job had it seen the effects of the SSE copies right away.

Please note that the profitability heuristic is implemented as a
target-dependent component, so other targets can implement their own
specialized heuristics.

Yeah I did not have any heuristics on this end… maybe that explains the bad results on my end.

Performance Results

The performance evaluation was done using
ba51d26ec4519f5b31de3acf643264504ea7bc7c as a base commit on a Skylake
Xeon Gold 6154. The code was compiled with -O3 -march=native .
Applying Spill2Reg to the code of the motivating example shown in a
previous section, replaces some of the spills/reloads with movd
instructions, leading to about 10% better performance.
Spill2Reg also works on real-life applications: In SPEC CPU 2017 525.x264
Spill2Reg improves performance by about 0.3%.

0.3% seems close to typical noise levels when measuring SPEC (at least in my setups)…

I evaluated a number of big internal server application which did not improve. I also tried the llvm-test-suite and SPEC swinging in both positive and negative directions, which I could not differentiate from run-to-run noise.

  • Matthias

(Thanks for sending it again, re-posting my reply here too)

Thank you Matthias for your feedback and for sharing your prototype.

Can you share some details on what you evaluated?

0.3% seems close to typical noise levels when measuring SPEC (at least in my setups)…
I evaluated a number of big internal server application which did not improve. I also tried the llvm-test-suite and SPEC swinging in both positive and negative directions, which I could not differentiate from run-to-run noise.

The evaluation was on SPEC CPU2017 with -O3 -march=native on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz.

Indeed, several benchmarks are quite noisy, but the small speedup for x264 that I am reporting looks reproducible in both ref and train datasets with 20+ runs.

These are the exact performance numbers for 525.x264:

Spill2Reg: 206.21 with 0.21 standard deviation

Original: 207.21 with 0.26 standard deviation

  • When reloading from the stack you can often “fold” the reload into the instruction by using X86 memory addressing modes. For SSE registers you always need separate instructions and a register to hold the value between SSE copy and use.

Yes, touching folded spills/reloads is usually not worth it. The additional instruction seems to be adding overhead.

  • Well contrary to my last point. In case of combined RELOAD,OP,SPILL sequence there may be a possibility to turn a whole operation operating on GP registers (think ANDQ) to an equivalent operation on XMM registers (ANDPD); I did not model that in my prototype

I did not model this either, but yeah it could be useful to handle that case too.

I implemented things as part of the InlineSpiller which wasn’t too bad and is a more natural fit than a separate pass IMO. You avoid pass reordering issue, where the register allocator could have done a better job had it seen the effects of the SSE copies right away.

I started with a prototype in the InlineSpiller too, because indeed it is the most natural place for it. But having it as a separate pass has advantages of its own: you can implement different strategies for when to spill to vector registers without having to make a decision right away and in the order dictated by the spiller, and I also find it easier to test in isolation.

Indeed you can avoid pass ordering issues, but this can also make it trickier to check whether the performance change came from Spill2Reg or because of some slight change in the decisions made by the register allocator.

Yeah I did not have any heuristics on this end… maybe that explains the bad results on my end.

Yeah, I think it is important to be very picky about when to spill to vectors because some of the spills will speed things up but some others may slow you down.

Thanks,

Vasileios