The current state of spilling, function calls and related problems

Flakebi · February 19, 2021, 11:03am

This is an attempt to summarize the current state of spilling, function calls and related problems.

A general problem that often manifests with LLVM is that LLVM (IR) only has the view of a single thread, not of the whole wave that is executing. In MachineIR, control flow is modeled in the wave view (through implicit uses of exec on all vector operations), but registers are still modeled for a single lane. That is, the model has a single v0 register, while the hardware has 32 or 64 of them, one per lane. This works fine in most cases but creates problems for operations that involve more than a single lane. These operations are:

Spilling (and restoring) SGPRs to VGPRs
Spilling (and restoring) SGPRs to scratch
Whole wave mode (WWM)
Function calls do not directly touch other lanes, but they are indirectly involved when they contain spilling or WWM parts
The writelane and readlane intrinsics
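
To make the mismatch concrete, here is a toy model of a single VGPR in Python. It is only a sketch: the class and function names are made up for illustration and are not LLVM or hardware APIs, and wave32 is assumed. An ordinary vector write respects exec, while a writelane-style write does not, so it can destroy a value that an inactive lane still needs.

```python
WAVE_SIZE = 32  # assumption: wave32; wave64 behaves the same way


class VGPR:
    """Toy model of one vector register: one value per lane."""

    def __init__(self):
        self.lanes = [0] * WAVE_SIZE


def v_mov(dst, value, exec_mask):
    """Ordinary vector write: only lanes whose exec bit is set change."""
    for lane in range(WAVE_SIZE):
        if (exec_mask >> lane) & 1:
            dst.lanes[lane] = value


def v_writelane(dst, value, lane):
    """Cross-lane write: ignores exec and writes exactly one lane."""
    dst.lanes[lane] = value


v0 = VGPR()
v_mov(v0, 7, exec_mask=0xFFFFFFFF)  # every lane holds 7
exec_mask = 0xFFFFFFFE              # lane 0 becomes inactive
v_mov(v0, 9, exec_mask)             # lanes 1..31 := 9, lane 0 keeps its 7
lane0_before = v0.lanes[0]
v_writelane(v0, 1234, lane=0)       # an SGPR spill into lane 0 clobbers it
lane0_after = v0.lanes[0]
```

From the single-lane view of the register allocator, v0 looks like one value. Only the per-lane view shows that lane 0 held something different that was lost.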

Whole quad mode (WQM) activates lanes that would otherwise be inactive. AFAIK this can only happen at the top level of a shader, i.e. it is not possible to activate lanes that are currently inactive but are already in use for something else, so WQM does not exhibit this problem.

Spilling SGPRs to scratch first saves the SGPRs to lanes of a VGPR (same as SGPR→VGPR) and then saves the VGPR to scratch. Restoring works the other way around.

What is happening when

We need more VGPRs ⇒ Spill VGPR to scratch (does not involve cross-lane operations), only saves currently active lanes
We need more SGPRs ⇒ Spill SGPRs to a VGPR if possible; if no VGPR is unused in the function, spill to scratch
A function is called (caller) ⇒ Save (and restore) the currently active lanes of caller-save VGPRs and live SGPRs (same as spilling), does not save inactive lanes of used VGPRs and does not save VGPRs that are unused in the currently active lanes.
A function is called (callee) ⇒ Save (and restore) all lanes of VGPRs reserved for SGPR spills. Callee-save SGPRs and FP/BP (frame pointer and base pointer) are saved to VGPRs or scratch (same as spilling, FP/BP are handled separately).
We want to sum over all lanes ⇒ Switch to WWM and set a VGPR to 0 in inactive lanes
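
The last case can be sketched in a few lines of Python. This is an illustrative model only (wave32 assumed, names invented here): in WWM every lane executes, so the working register is written in all lanes, including the ones that were inactive before.

```python
WAVE_SIZE = 32  # assumption: wave32


def wwm_sum(src, tmp, exec_mask):
    """Sum src over the active lanes, using tmp as the WWM working register."""
    for lane in range(WAVE_SIZE):  # WWM: every lane executes
        active = (exec_mask >> lane) & 1
        # Inactive lanes get 0, the identity of the sum.
        tmp[lane] = src[lane] if active else 0
    return sum(tmp)


src = list(range(WAVE_SIZE))
tmp = [99] * WAVE_SIZE  # pretend inactive lanes held a live 99
total = wwm_sum(src, tmp, exec_mask=0b1010)  # only lanes 1 and 3 are active
```

After the call, the inactive lanes of tmp hold 0 instead of 99, which is why the working register has to be one that nothing else uses in any lane.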

As VGPRs are incompletely modeled, this requires workarounds for all cross-lane operations:

SGPR→VGPR spilling
When the need to spill an SGPR comes up, a VGPR is searched that is unused during the whole function. If one is found, it is reserved for the whole function and used for spilling. If none is found, it resorts to SGPR→scratch spilling.
SGPR→scratch spilling
There are no workarounds, which is why it can currently overwrite registers that are in use.
WWM
Same as for SGPR→VGPR spilling, VGPRs are reserved for the whole function if WWM is used.
Function calls
Reserving VGPRs for the whole function makes sure no other lane inside the function is overwritten. To not overwrite registers of the caller, the callee saves all lanes of these reserved registers.
(It does not yet do that, except for FP/BP→VGPR saves or if the VGPR is callee-saved. I forgot the second spill function and did not yet look at WWM reserved registers.)
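
The reservation scheme for SGPR→VGPR spilling could be modeled roughly like this. The class below is a hypothetical stand-in written for this summary, not the actual MFI::allocateSGPRSpillToVGPR code: it hands out (VGPR, lane) slots, reserves another VGPR that is unused in the whole function when the lanes run out, and falls back to scratch when no such VGPR exists.

```python
class SpillAllocator:
    """Hypothetical sketch of handing out VGPR lanes for SGPR spills."""

    def __init__(self, used_vgprs, num_vgprs=256, wave_size=64):
        # VGPRs that are unused during the whole function.
        self.free = [r for r in range(num_vgprs) if r not in used_vgprs]
        self.wave_size = wave_size
        self.reserved = []   # VGPRs reserved for the whole function
        self.next_lane = 0   # next free lane in the last reserved VGPR

    def allocate(self):
        """Return ("vgpr", reg, lane), or ("scratch", None, None) as fallback."""
        if not self.reserved or self.next_lane == self.wave_size:
            if not self.free:
                return ("scratch", None, None)
            self.reserved.append(self.free.pop(0))
            self.next_lane = 0
        slot = ("vgpr", self.reserved[-1], self.next_lane)
        self.next_lane += 1
        return slot


# Tiny configuration so the fallback is easy to see: 4 VGPRs, 2 lanes each.
alloc = SpillAllocator(used_vgprs={0, 1}, num_vgprs=4, wave_size=2)
slots = [alloc.allocate() for _ in range(5)]
```

With two VGPRs already in use, four spill slots fit into v2 and v3, and the fifth request has to go to scratch.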

Without function calls, SGPR→VGPR spilling and WWM work correctly; SGPR→scratch spilling does not.
With function calls, none of these work correctly, but SGPR→VGPR spills and WWM should be easy to fix. Saving FP and BP to scratch in the function prolog currently overwrites all lanes without saving them first.
The hard part to get correct is SGPR→scratch spilling.

How does SGPR→scratch spilling currently work:
Setting: All SGPRs are in use and we want to save some into scratch.

The register scavenger searches for a free VGPR. If none is free, it saves one to scratch (only saving active lanes).
The SGPRs are written into lanes of the VGPR, starting with 0. This can overwrite values if lane 0 is inactive and the VGPR is live there (this happens independently of finding a free VGPR or saving one as both only look at currently active lanes).
The exec mask is saved to the SGPRs (the SGPRs are currently unused). If it does not fit (spilling one SGPR in wave64 mode), exec is saved to more lanes of the VGPR.
The exec mask is set to a constant where all written lanes are live (e.g. -1 if all lanes are used for spills).
The VGPR containing the SGPR values is saved to scratch.
The exec mask is restored.
If more than 32 SGPRs should be spilled, repeat from step 2 with the next batch of SGPRs
If the register scavenger spilled a VGPR, it is restored.
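
The sequence above can be replayed in a small Python model to show where the clobbering happens. This is a simplified sketch under several assumptions: wave32, no free VGPR (so the scavenger has to save one), a single batch of SGPRs, and scratch modeled as a dictionary. All names are made up for illustration.

```python
WAVE_SIZE = 32  # assumption: wave32


def store_active(vgpr, exec_mask):
    """Scratch store: only lanes enabled in exec are written out."""
    return {l: vgpr[l] for l in range(WAVE_SIZE) if (exec_mask >> l) & 1}


def load_active(vgpr, saved, exec_mask):
    """Scratch load: only lanes enabled in exec are read back."""
    for lane, value in saved.items():
        if (exec_mask >> lane) & 1:
            vgpr[lane] = value


def spill_sgprs_to_scratch(sgprs, vgpr, exec_mask):
    assert len(sgprs) <= WAVE_SIZE
    # Step 1: no VGPR is free, so one is saved (active lanes only).
    scavenged = store_active(vgpr, exec_mask)
    # Step 2: SGPR values go into lanes 0, 1, ... regardless of exec.
    for lane, value in enumerate(sgprs):
        vgpr[lane] = value
    # Steps 3 and 4: save exec, then enable exactly the written lanes.
    saved_exec = exec_mask
    exec_mask = (1 << len(sgprs)) - 1
    # Step 5: the VGPR holding the SGPR values is stored to scratch.
    spilled = store_active(vgpr, exec_mask)
    # Step 6: exec is restored.
    exec_mask = saved_exec
    # Last step: the scavenged VGPR is restored (active lanes only).
    load_active(vgpr, scavenged, exec_mask)
    return spilled


vgpr = [100 + lane for lane in range(WAVE_SIZE)]
exec_mask = 0xFFFFFFFC  # lanes 0 and 1 are inactive but hold live values
spilled = spill_sgprs_to_scratch([11, 22], vgpr, exec_mask)
```

Lanes 0 and 1 end up holding the SGPR values 11 and 22 instead of 100 and 101, because neither the save nor the restore of the scavenged VGPR covered the inactive lanes.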

Some possible solutions to not clobber precious variables from other lanes:

Put SGPRs only in currently active lanes of the VGPR: Does not work when exec is 0, also needs some free registers to find and write into active lanes
Save all VGPRs without needing an extra register to save exec: Save VGPR, flip exec (xor with -1), save previously inactive lanes, flip exec again to restore it
Reserve 1/2 SGPRs per function to save exec: Save exec to reserved registers and set to -1, save whole VGPR, write SGPRs into VGPR, save VGPR again, restore VGPR, restore exec
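
The second option, saving a whole VGPR by flipping exec, can be sketched as follows. Again this is a toy model with invented names and wave32 assumed; the point is that two stores with complementary exec masks cover every lane, and exec ends up unchanged without needing a register to hold a copy of it.

```python
WAVE_SIZE = 32                  # assumption: wave32
ALL_LANES = (1 << WAVE_SIZE) - 1  # the "-1" mask from the text


def store_active(vgpr, exec_mask):
    """Scratch store: only lanes enabled in exec are written out."""
    return {l: vgpr[l] for l in range(WAVE_SIZE) if (exec_mask >> l) & 1}


def save_all_lanes(vgpr, exec_mask):
    """Save every lane of vgpr using two stores and two exec flips."""
    saved = store_active(vgpr, exec_mask)         # currently active lanes
    exec_mask ^= ALL_LANES                        # flip exec
    saved.update(store_active(vgpr, exec_mask))   # previously inactive lanes
    exec_mask ^= ALL_LANES                        # flip back to the original
    return saved, exec_mask


vgpr = [100 + lane for lane in range(WAVE_SIZE)]
saved, exec_after = save_all_lanes(vgpr, exec_mask=0xF0F0F0F0)
saved_zero, exec_zero = save_all_lanes(vgpr, exec_mask=0)
```

Unlike the first option, this also works when exec is 0: the first store does nothing and the second one covers all lanes.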

Regarding reserved registers, there might be more efficient ways than saving them (partially) twice:

For caller-save VGPRs, the callee only needs to save inactive lanes, so we could do (exec = exec ^ -1) instead of (exec = -1) for these. This comes at the cost of some SALU instructions and some stalls after changing exec
Once we have proper liveness tracking, we could mark a few (or all – at the expense of memory bandwidth) VGPRs as caller-save for the whole wave, so a callee can clobber them in WWM or SGPR spilling

Long-term solution for some of the problems: Tracking live range of VGPRs of other lanes

Model VGPRs as registers with sub-registers for active and inactive lanes. Normal instructions will use the “active lanes” subreg, special instructions like SGPR spilling, writelane, etc. use the whole register. Handling control flow is the complex part (i.e. transitioning from an if- to an else-block needs to merge the active subregs into the inactive subregs, this gets more complex with nested control-flow).
Alternatively, add implicit uses on instructions inside control flow that overwrite a VGPR only for parts of a wave (probably a bad idea as it fixes some problems but still is not modeling anything correctly).
Something completely different, e.g. a concept of register “lanes” in LLVM
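
As a rough illustration of the first idea, liveness could be tracked separately for the active and inactive part of each VGPR. The sketch below is purely hypothetical (no such structure exists in LLVM as far as this summary knows) and ignores the hard part, control flow. It only shows the query a spiller would ask.

```python
from dataclasses import dataclass


@dataclass
class VGPRLiveness:
    """Hypothetical per-VGPR liveness, split by lane activity."""
    active_live: bool    # a value is live in the currently active lanes
    inactive_live: bool  # a value is live in the currently inactive lanes


def is_free_for(liveness, whole_register):
    """Can an instruction write this VGPR without destroying a live value?

    Normal vector instructions only touch the active-lanes subregister.
    SGPR spilling, writelane and WWM code touch the whole register.
    """
    if whole_register:
        return not (liveness.active_live or liveness.inactive_live)
    return not liveness.active_live


# Dead in the active lanes, but an inactive lane still needs its value.
v5 = VGPRLiveness(active_live=False, inactive_live=True)
ok_for_vector_op = is_free_for(v5, whole_register=False)
ok_for_sgpr_spill = is_free_for(v5, whole_register=True)
```

With the current single-lane model, v5 would look free in both cases, which is exactly the situation where an SGPR spill clobbers another lane.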

Any long-term solution should also make it possible to stop reserving VGPRs for the whole function. They should only be reserved for as long as they are needed.

What is happening where

An overview of the passes, which spilling-related steps happen in which functions, and how they fit together. The target classes here (e.g. TargetFrameLowering) refer to the AMDGPU variants (e.g. SIFrameLowering).

+-----------------+       +----------------------+
| SIWholeQuadMode | ----> | SIPreAllocateWWMRegs | ---->
+-----------------+       +----------------------+


+----------+       +---+       +-------------------+
| RegAlloc | ----> | … | ----> | SILowerSGPRSpills |
+----------+       +---+       +-------------------+


      +---+       +----------------------+       +---+
----> | … | ----> | PrologEpilogInserter | ----> | … |
      +---+       +----------------------+       +---+


      +------+
----> | IPRA |
      +------+

SIPreAllocateWWMRegs

Save VGPRs for WWM in MFI (MachineFunctionInfo)::WWMReservedRegs

RegAlloc

TII (TargetInstrInfo)::storeRegToStackSlot
InlineSpiller::spill → TII::storeRegToStackSlot

SILowerSGPRSpills

MFI::allocateSGPRSpillToVGPR
TRI (TargetRegisterInfo)::eliminateSGPRToVGPRSpillFrameIndex → TRI::spillSGPR
spillCalleeSavedRegs → TFI (TargetFrameLowering)::determineCalleeSavesSGPR (determineCalleeSaves but SGPRs only)
spillCalleeSavedRegs → TII::storeRegToStackSlot → Create SI_SPILL_S32_SAVE and similar pseudos

PrologEpilogInserter (PEI)

TFI::determineCalleeSaves (VGPRs only) → getVGPRSpillLaneOrTempRegister (for FP/BP) → MFI::allocateSGPRSpillToVGPR reserve lanes of VGPRs if possible
TFI::determineCalleeSaves (VGPRs only) → getVGPRSpillLaneOrTempRegister (for FP/BP) → findScratchNonCalleeSaveRegister
insertCSRSaves → TII::storeRegToStackSlot
insertCSRSaves → MFI::allocateSGPRSpillToVGPR
TFI::processFunctionBefore: Decide to add scavenging FI (frame index)
Add function prolog and epilog
- TFI::emitPrologue
  - exec = -1
  - Save reserved VGPRs for SGPR→VGPR spills → buildPrologSpill
  - Save FP/BP to scratch or VGPR → buildPrologSpill
- TFI::emitEpilogue
replaceFrameIndices → TRI::eliminateFrameIndex → TRI::spillSGPR: Creates SI_SPILL_… pseudos, which are later lowered to instructions

IPRA (inter-procedural register allocation)

If a call destination is known (direct call), callers only need to save registers that are actually clobbered by the callee
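
In set terms, the effect of IPRA on the caller can be sketched like this. The function and register names are illustrative assumptions, not LLVM APIs: without knowledge of the callee, the caller has to assume every caller-save register is clobbered; with a known callee, only the registers it actually clobbers matter.

```python
def registers_to_save(live_across_call, caller_save, callee_clobbers=None):
    """Registers the caller must save around a call (illustrative only)."""
    if callee_clobbers is None:
        # Unknown destination: assume the worst case allowed by the ABI.
        return live_across_call & caller_save
    # Known destination: only what the callee really clobbers.
    return live_across_call & callee_clobbers


live = {"v0", "v1", "v2"}
caller_save = {"v0", "v1"}
indirect = registers_to_save(live, caller_save)
direct = registers_to_save(live, caller_save, callee_clobbers={"v1", "v7"})
```

Here the indirect call forces saving v0 and v1, while the direct call to a callee that only clobbers v1 and v7 needs just v1 saved.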

Related Reviews

D95946: Fix FP/BP→VGPR spilling
D96336: Fix SGPR→scratch spilling
D96869: Fix FP/BP→scratch spilling
D96517: Optimize SGPR→scratch spilling

You survived until the end!
