The current state of spilling, function calls and related problems

This is an attempt to summarize the current state of spilling, function calls and related problems.

A general problem that often manifests with LLVM is that LLVM IR only has the view of a single thread, not of the whole wave executing. In MachineIR, control flow is modeled in the wave view (by implicit uses of exec on all vector operations), but registers are still modeled for a single lane: there is a single v0 register, while the hardware has 32 or 64 of them (one per lane). This works fine in most cases but creates problems for operations that involve more than a single lane. These operations are:

  • Spilling (and restoring) SGPRs to VGPRs
  • Spilling (and restoring) SGPRs to scratch
  • Whole wave mode (WWM)
  • Function calls do not directly touch other lanes, but they are indirectly involved when they contain spilling or WWM parts
  • The writelane and readlane intrinsics

Whole quad mode (WQM) activates lanes that would otherwise be inactive. AFAIK this can only happen at the top level of a shader, i.e. it is not possible to activate lanes that are currently inactive but already in use for something else, so WQM does not exhibit this problem.

Spilling SGPRs to scratch first saves the SGPRs to lanes of a VGPR (same as SGPR→VGPR) and then saves the VGPR to scratch. Restoring works the other way around.
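
The SGPR→VGPR step can be pictured with a small model in which a VGPR is a list with one slot per lane. This is only a sketch: the helper names are made up for illustration, wave32 is assumed, and writelane is modeled as ignoring exec.

```python
WAVE_SIZE = 32  # assumption: wave32; wave64 works the same with 64 lanes


def writelane(vgpr, lane, value):
    """Write one lane of a VGPR; modeled here as ignoring exec."""
    out = list(vgpr)
    out[lane] = value
    return out


def readlane(vgpr, lane):
    """Read one lane of a VGPR back into a scalar value."""
    return vgpr[lane]


def spill_sgprs_to_vgpr(vgpr, sgprs):
    """SGPR number i is stored in lane i of the spill VGPR."""
    for lane, value in enumerate(sgprs):
        vgpr = writelane(vgpr, lane, value)
    return vgpr


def restore_sgprs_from_vgpr(vgpr, count):
    """Read the first `count` lanes back as scalar values."""
    return [readlane(vgpr, lane) for lane in range(count)]


spill_vgpr = [0] * WAVE_SIZE
sgprs = [0xAAAA, 0xBBBB, 0xCCCC]
spill_vgpr = spill_sgprs_to_vgpr(spill_vgpr, sgprs)
restored = restore_sgprs_from_vgpr(spill_vgpr, len(sgprs))
```

In this model one VGPR can hold up to WAVE_SIZE spilled SGPRs, which is why a single reserved VGPR goes a long way.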

What is happening when

  • We need more VGPRs ⇒ Spill a VGPR to scratch. This does not involve cross-lane operations and only saves currently active lanes
  • We need more SGPRs ⇒ Spill SGPRs to a VGPR if possible; if no VGPR is unused in the function, spill to scratch
  • A function is called (caller) ⇒ Save (and restore) the currently active lanes of caller-save VGPRs and live SGPRs (same as spilling). This does not save inactive lanes of used VGPRs and does not save VGPRs that are unused in the currently active lanes.
  • A function is called (callee) ⇒ Save (and restore) all lanes of VGPRs reserved for SGPR spills. Callee-save SGPRs and FP/BP (frame pointer and base pointer) are saved to VGPRs or scratch (same as spilling, FP/BP are handled separately).
  • We want to sum over all lanes ⇒ Switch to WWM and set a VGPR to 0 in inactive lanes
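
The WWM case in the last bullet can be sketched as follows. The function name and the 4-lane "wave" are illustrative assumptions; the point is only that the reduction writes 0 into lanes that are inactive, so the VGPR it uses must not hold live values there.

```python
def wwm_sum(vgpr, exec_mask):
    """Sum the active lanes' values the way a WWM reduction would.

    Returns the sum and the VGPR contents after inactive lanes were zeroed.
    """
    # In WWM every lane executes, so inactive lanes first get the identity 0.
    zeroed = [value if (exec_mask >> lane) & 1 else 0
              for lane, value in enumerate(vgpr)]
    return sum(zeroed), zeroed


values = [1, 2, 3, 4]      # a tiny 4-lane "wave" for readability
exec_mask = 0b0101         # lanes 0 and 2 are active
total, after = wwm_sum(values, exec_mask)
```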

As VGPRs are incompletely modeled, this requires workarounds for all cross-lane operations:

  1. SGPR→VGPR spilling
    When the need to spill an SGPR comes up, a VGPR is searched that is unused during the whole function. If one is found, it is reserved for the whole function and used for spilling. If none is found, it resorts to SGPR→scratch spilling.
  2. SGPR→scratch spilling
    There are no workarounds, which is why it can currently overwrite registers that are in use.
  3. WWM
    Same as for SGPR→VGPR spilling, VGPRs are reserved for the whole function if WWM is used.
  4. Function calls
    Reserving VGPRs for the whole function makes sure no other lane inside the function is overwritten. To not overwrite registers of the caller, the callee saves all lanes of these reserved registers.
    (It does not yet do that, except for FP/BP→VGPR saves or if the VGPR is callee-saved. I forgot the second spill function and did not yet look at WWM reserved registers.)
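
The decision in item 1 (use a VGPR that is unused in the whole function, otherwise fall back to scratch) can be sketched like this. The function and register names are hypothetical and do not correspond to the actual LLVM implementation.

```python
def pick_sgpr_spill_target(all_vgprs, used_vgprs):
    """Return ("vgpr", reg) for the first VGPR unused in the whole function,
    or ("scratch", None) if every VGPR is used somewhere."""
    for reg in all_vgprs:
        if reg not in used_vgprs:
            return ("vgpr", reg)
    return ("scratch", None)


all_vgprs = ["v0", "v1", "v2", "v3"]
with_free = pick_sgpr_spill_target(all_vgprs, {"v0", "v1", "v3"})
without_free = pick_sgpr_spill_target(all_vgprs, set(all_vgprs))
```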

Without function calls, SGPR→VGPR spilling and WWM work correctly, but SGPR→scratch spilling does not.
With function calls, none of these work correctly, but SGPR→VGPR spills and WWM should be easy to fix. Saving FP and BP to scratch in the function prolog currently overwrites all lanes without saving them first.
The hard part to get right is SGPR→scratch spilling.

How SGPR→scratch spilling currently works:
Setting: All SGPRs are in use and we want to save some into scratch.

  • The register scavenger searches for a free VGPR. If none is free, it saves one to scratch (only saving active lanes).
  • The SGPRs are written into lanes of the VGPR, starting with 0. This can overwrite values if lane 0 is inactive and the VGPR is live there (this happens independently of finding a free VGPR or saving one as both only look at currently active lanes).
  • The exec mask is saved to the SGPRs (the SGPRs are currently unused). If it does not fit (spilling one SGPR in wave64 mode), exec is saved to more lanes of the VGPR.
  • The exec mask is set to a constant where all written lanes are live (e.g. -1 if all lanes are used for spills).
  • The VGPR containing the SGPR values is saved to scratch.
  • The exec mask is restored.
  • If more than 32 SGPRs should be spilled, repeat from step 2 with the next batch of SGPRs
  • If the register scavenger spilled a VGPR, it is restored.
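
The steps above can be replayed in a small simulation to see where the clobbering comes from. This is a sketch under assumptions: wave32, a VGPR modeled as a list of lanes, scratch stores that only write lanes enabled in exec, lane writes that ignore exec, and the case where the scavenger found no free VGPR. All helper names are made up.

```python
WAVE_SIZE = 32  # assumption: wave32


def store_active(vgpr, exec_mask):
    """Scratch store: only lanes enabled in exec reach memory."""
    return {lane: vgpr[lane] for lane in range(WAVE_SIZE)
            if (exec_mask >> lane) & 1}


def load_active(vgpr, saved, exec_mask):
    """Scratch load: only lanes enabled in exec are written back."""
    out = list(vgpr)
    for lane, value in saved.items():
        if (exec_mask >> lane) & 1:
            out[lane] = value
    return out


def spill_sgprs_to_scratch(vgpr, exec_mask, sgprs):
    # The scavenger found no free VGPR, so it saves one (active lanes only).
    backup = store_active(vgpr, exec_mask)
    # The SGPRs are written into lanes 0..n-1, regardless of exec.
    tmp = list(vgpr)
    for lane, value in enumerate(sgprs):
        tmp[lane] = value
    # Save exec, enable exactly the written lanes, store, restore exec.
    saved_exec = exec_mask
    exec_mask = (1 << len(sgprs)) - 1
    spill_slot = store_active(tmp, exec_mask)
    exec_mask = saved_exec
    # The scavenged VGPR is restored, again for active lanes only.
    return load_active(tmp, backup, exec_mask), spill_slot


vgpr = [1000 + lane for lane in range(WAVE_SIZE)]
exec_mask = 0b1100                     # only lanes 2 and 3 are active
after, slot = spill_sgprs_to_scratch(vgpr, exec_mask, [11, 22, 33])
```

In this run lane 2 gets its old value back because it was active, while lanes 0 and 1 keep the SGPR values: they were inactive, so neither the scavenger's save nor its restore touched them.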

Some possible solutions to not clobber precious variables from other lanes:

  • Put SGPRs only in currently active lanes of the VGPR: Does not work when exec is 0, also needs some free registers to find and write into active lanes
  • Save all VGPRs without needing an extra register to save exec: Save VGPR, flip exec (xor with -1), save previously inactive lanes, flip exec again to restore it
  • Reserve 1/2 SGPRs per function to save exec: Save exec to reserved registers and set to -1, save whole VGPR, write SGPRs into VGPR, save VGPR again, restore VGPR, restore exec
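
The second option (saving all lanes by flipping exec) can be sketched in the same lane-list model. Wave32 and the helper names are assumptions; the store is modeled as writing only lanes enabled in exec.

```python
WAVE_SIZE = 32  # assumption: wave32
FULL_MASK = (1 << WAVE_SIZE) - 1


def store_active(vgpr, exec_mask):
    """Scratch store: only lanes enabled in exec reach memory."""
    return {lane: vgpr[lane] for lane in range(WAVE_SIZE)
            if (exec_mask >> lane) & 1}


def save_all_lanes(vgpr, exec_mask):
    """Save every lane of a VGPR without a register to hold the old exec."""
    slot = store_active(vgpr, exec_mask)          # currently active lanes
    exec_mask ^= FULL_MASK                        # flip exec (xor with -1)
    slot.update(store_active(vgpr, exec_mask))    # previously inactive lanes
    exec_mask ^= FULL_MASK                        # flip again to restore exec
    return slot, exec_mask


vgpr = [1000 + lane for lane in range(WAVE_SIZE)]
exec_before = 0b1100
slot, exec_after = save_all_lanes(vgpr, exec_before)
```

Because the second flip undoes the first, exec never has to be copied anywhere, and the sequence also covers all lanes when exec is 0.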

Regarding reserved registers, there might be more efficient ways than saving them (partially) twice:

  • For caller-save VGPRs, the callee only needs to save inactive lanes, so we could do (exec = exec ^ -1) instead of (exec = -1) for these. This comes at the cost of some SALU instructions and some stalls after changing exec
  • Once we have proper liveness tracking, we could mark a few (or all – at the expense of memory bandwidth) VGPRs as caller-save for the whole wave, so a callee can clobber them in WWM or SGPR spilling

Long-term solution for some of the problems: Tracking live range of VGPRs of other lanes

  • Model VGPRs as registers with sub-registers for active and inactive lanes. Normal instructions will use the “active lanes” subreg, special instructions like SGPR spilling, writelane, etc. use the whole register. Handling control flow is the complex part (i.e. transitioning from an if- to an else-block needs to merge the active subregs into the inactive subregs, this gets more complex with nested control-flow).
  • Alternatively, add implicit uses on instructions inside control flow that overwrite a VGPR only for parts of a wave (probably a bad idea as it fixes some problems but still is not modeling anything correctly).
  • Something completely different, e.g. a concept of register “lanes” in LLVM

Any long-term solution should also make it possible to stop reserving VGPRs for the whole function. They should only be reserved for as long as they are needed.

What is happening where

An overview of the passes involved: which spilling-related steps happen in which pass and function, and how they interact. The target classes here (e.g. TargetFrameLowering) refer to the AMDGPU variants (e.g. SIFrameLowering).

+-----------------+       +----------------------+
| SIWholeQuadMode | ----> | SIPreAllocateWWMRegs | ---->
+-----------------+       +----------------------+


+----------+       +---+       +-------------------+
| RegAlloc | ----> | … | ----> | SILowerSGPRSpills |
+----------+       +---+       +-------------------+


      +---+       +----------------------+       +---+
----> | … | ----> | PrologEpilogInserter | ----> | … |
      +---+       +----------------------+       +---+


      +------+
----> | IPRA |
      +------+

SIPreAllocateWWMRegs

  • Save VGPRs for WWM in MFI (MachineFunctionInfo)::WWMReservedRegs

RegAlloc

  • TII (TargetInstrInfo)::storeRegToStackSlot
  • InlineSpiller::spill → TII::storeRegToStackSlot

SILowerSGPRSpills

  • MFI::allocateSGPRSpillToVGPR
  • TRI (TargetRegisterInfo)::eliminateSGPRToVGPRSpillFrameIndex → TRI::spillSGPR
  • spillCalleeSavedRegs → TFI (TargetFrameLowering)::determineCalleeSavesSGPR (determineCalleeSaves but SGPRs only)
    spillCalleeSavedRegs → TII::storeRegToStackSlot → Create SI_SPILL_S32_SAVE and similar pseudos

PrologEpilogInserter (PEI)

  • TFI::determineCalleeSaves (VGPRs only) → getVGPRSpillLaneOrTempRegister (for FP/BP) → MFI::allocateSGPRSpillToVGPR reserve lanes of VGPRs if possible
    TFI::determineCalleeSaves (VGPRs only) → getVGPRSpillLaneOrTempRegister (for FP/BP) → findScratchNonCalleeSaveRegister
  • insertCSRSaves → TII::storeRegToStackSlot
    insertCSRSaves → MFI::allocateSGPRSpillToVGPR
  • TFI::processFunctionBefore: Decide to add scavenging FI (frame index)
  • Add function prolog and epilog
    • TFI::emitPrologue
      • exec = -1
      • Save reserved VGPRs for SGPR→VGPR spills → buildPrologSpill
      • Save FP/BP to scratch or VGPR → buildPrologSpill
    • TFI::emitEpilogue
  • replaceFrameIndices → TRI::eliminateFrameIndex → TRI::spillSGPR: Creates SI_SPILL_… pseudos, which are later lowered to instructions

IPRA (inter-procedural register allocation)

  • If a call destination is known (direct call), callers only need to save registers that are actually clobbered by the callee
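
The effect of IPRA on the caller can be sketched as a set computation. The function name and register names are illustrative only; an unknown callee is modeled by passing None, in which case every live caller-save register is assumed clobbered.

```python
def regs_to_save(live_across_call, caller_save, callee_clobbers=None):
    """Registers the caller has to save around a call.

    callee_clobbers is None for an indirect call (callee unknown).
    """
    candidates = live_across_call & caller_save
    if callee_clobbers is None:
        return candidates
    return candidates & callee_clobbers


live = {"v0", "v1", "v40"}
caller_save = {"v0", "v1", "v2", "v3"}
indirect = regs_to_save(live, caller_save)
direct = regs_to_save(live, caller_save, callee_clobbers={"v1", "v2"})
```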

Related Reviews

  • D95946: Fix FP/BP→VGPR spilling
  • D96336: Fix SGPR→scratch spilling
  • D96869: Fix FP/BP→scratch spilling
  • D96517: Optimize SGPR→scratch spilling

You survived until the end! :dragon: