Investigating high register pressure (at the IR level?)

Hello,

I’m investigating a performance issue in an AMDGPU kernel where a large number of values are live at the same time, causing extremely high register pressure (hundreds of spills of both SGPRs and VGPRs).

I’ve been looking at this for months on-and-off and I can’t seem to find a breakthrough. All my attempts net me 2-3% fewer spills at most.

  • I ran an ad-hoc register pressure analysis pass (sketched below) after every backend pass to see whether the pressure was caused by a backend optimization. It is not: pressure is very high right out of ISel and stays high throughout the pass pipeline.
  • I also checked for missing DAG combines.
  • I tried many common command-line options for the AMDGPU backend and IR optimizations, such as loop unrolling thresholds, with no significant impact (nothing more than a couple of percent).
  • I spent a lot of time looking through debug logs of many common passes/ISel for clues but couldn’t find any.
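For the curious, here is a minimal sketch of such a pass — not my exact code, and simplified: it only counts live virtual registers via LiveIntervals, where a fuller version would use RegPressureTracker to get per-register-class numbers. On recent LLVM the legacy analysis is reached through LiveIntervalsWrapperPass instead.

```cpp
// Simplified sketch of an ad-hoc pressure-reporting pass: report the peak
// number of simultaneously live virtual registers in a MachineFunction.
// Note: on recent LLVM, request LiveIntervalsWrapperPass and call getLIS().
#include <algorithm>
#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/SlotIndexes.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct PrintMaxLiveVRegs : MachineFunctionPass {
  static char ID;
  PrintMaxLiveVRegs() : MachineFunctionPass(ID) {}

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<LiveIntervals>();
    AU.setPreservesAll();
    MachineFunctionPass::getAnalysisUsage(AU);
  }

  bool runOnMachineFunction(MachineFunction &MF) override {
    LiveIntervals &LIS = getAnalysis<LiveIntervals>();
    MachineRegisterInfo &MRI = MF.getRegInfo();
    unsigned MaxLive = 0;
    for (MachineBasicBlock &MBB : MF) {
      for (MachineInstr &MI : MBB) {
        if (MI.isDebugInstr())
          continue;
        // Count every virtual register whose live interval covers this
        // instruction. Quadratic and class-blind, but fine for digging.
        SlotIndex Idx = LIS.getInstructionIndex(MI);
        unsigned Live = 0;
        for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I) {
          Register Reg = Register::index2VirtReg(I);
          if (LIS.hasInterval(Reg) && LIS.getInterval(Reg).liveAt(Idx))
            ++Live;
        }
        MaxLive = std::max(MaxLive, Live);
      }
    }
    errs() << MF.getName() << ": max live vregs = " << MaxLive << '\n';
    return false;
  }
};
} // end anonymous namespace

char PrintMaxLiveVRegs::ID = 0;
```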

My current theory is that some IR optimizations (alone or in combination) unfortunately cause register pressure to rise dramatically. For instance, there are many values defined in the entry block that are reused throughout the function (some loads of <64-bit values have 100+ users).

I was wondering if anyone has done some digging into this kind of issue in the past? Any help would be greatly appreciated; I’m looking for anything, really:

  • Ideas of passes to look at/disable/tweak
  • Theoretical optimizations that could be performed but aren’t implemented yet

Here’s some information on the kernel that may help:

  • The kernel has about 20,000 lines of IR and 500+ basic blocks.
  • There are many layers of inlining, I think up to 10 deep.
  • Extensive use of loop unrolling (critical for performance)
  • There are a lot of reads from a big constant global array. All sorts of things are loaded from it, from simple i32/i64 values to pointers that are dereferenced later.
  • A lot of values defined in the entry block are reused throughout the function. For instance, some small loads (i32/i64) have 100+ users spread across many basic blocks.
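To illustrate the last point, here is a minimal sketch (illustrative names, nothing AMDGPU-specific) of how one can list such entry-block definitions and see how widely their users are spread:

```cpp
// Sketch: list entry-block definitions with many users, and how many basic
// blocks those users span. A wide span implies a long live range, i.e. a
// register held across most of the function.
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Hypothetical helper; the threshold of 100 matches the pattern above.
static void listHotEntryDefs(Function &F, unsigned Threshold = 100) {
  for (Instruction &I : F.getEntryBlock()) {
    if (I.getNumUses() < Threshold)
      continue;
    SmallPtrSet<const BasicBlock *, 32> UserBlocks;
    for (const User *U : I.users())
      if (const auto *UI = dyn_cast<Instruction>(U))
        UserBlocks.insert(UI->getParent());
    errs() << I << "\n    " << I.getNumUses() << " users across "
           << UserBlocks.size() << " blocks\n";
  }
}
```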

Thanks,
Pierre

One of the common culprits we’ve identified is loop-invariant code motion (LICM), which can end up hoisting a lot of values to the top of a function. CSE is also a likely culprit, based on what you describe.
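To make the failure mode concrete, a toy illustration (my own example, not from your kernel):

```cpp
// Before LICM: the invariant product is recomputed each iteration and is
// dead again by the end of it, so it holds a register only briefly.
void scale(const int *in, int *out, int n, int a, int b, int c) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * (a * b + c);
}

// After LICM: t is live across the entire loop and pins a register for the
// loop's whole duration.
void scaleHoisted(const int *in, int *out, int n, int a, int b, int c) {
  const int t = a * b + c;
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * t;
}
```

Each hoisted value trades a recomputation per iteration for a register held across the whole loop; with hundreds of such values, the trade stops paying off.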

Both LICM and CSE are important canonicalization transforms, so the right solution is not to disable them but to learn to rematerialize those values, i.e., recompute them near their uses instead of keeping them live (or spilling and reloading them).

We’ve been talking about moving the spilling logic out of the register allocator proper, in favor of a scheme that looks more like what SSA-based register allocation proposes (without literally doing SSA-based register allocation):

  • In a backend region where we meticulously track register pressure:
    • There is a pass that reduces register pressure through a combination of rematerialization, spilling, and (potentially) code motion
    • Register allocation can then always succeed without (additional) spilling

Thank you, that helps a lot. I will look more closely at rematerialization.
Is there an easy way to see which values failed to be rematerialized by RA (and thus which instructions I should look at making rematerializable)? Do I just add some logging to canRematerializeAt, or is there a better place?
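Something like this is the kind of logging I mean (a sketch — the helper names are from my reading of LiveRangeEdit, so treat them as assumptions; that file’s DEBUG_TYPE is "regalloc", so it should print under -debug-only=regalloc):

```cpp
// Sketch: instrumentation pasted into the failure paths of
// LiveRangeEdit::canRematerializeAt (lib/CodeGen/LiveRangeEdit.cpp).
// RM.OrigMI is the defining instruction being considered for remat;
// getReg() is the virtual register being edited. Enable with:
//   llc -debug-only=regalloc ...
LLVM_DEBUG(dbgs() << "remat of " << printReg(getReg())
                  << " rejected: " << *RM.OrigMI);
```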

EDIT: I took a look from the POV of canRematerializeAt and noticed that the instructions rejected most often are:

  • Dereferenceable invariant loads from a constant global variable (S_LOAD_DWORDX2 to X8)
  • COPY

I’m wondering if it would be a good idea to rematerialize such loads sometimes, but it seems tricky to do right. I think it’s cheaper to re-load from that constant global than to spill 8 SGPRs and reload them later, no?
I did a quick test by setting the rematerializable flag on the instruction and modifying isReallyTriviallyReMaterializable (roughly as sketched below), but then remat fails a lot of the time because the value that holds the address is no longer live.
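For concreteness, a sketch of that kind of change — not my exact patch, and the hook’s signature varies across LLVM versions (the in-tree AMDGPU implementation also handles other cases that I’m omitting here):

```cpp
// Sketch (AMDGPU, SIInstrInfo.cpp): treat invariant, dereferenceable SMEM
// loads as trivially rematerializable. This is exactly where it breaks
// down: the generic remat machinery still requires the register operands
// (here, the base address) to be live at the remat point, which by then is
// often no longer true.
bool SIInstrInfo::isReallyTriviallyReMaterializable(
    const MachineInstr &MI) const {
  if (isSMRD(MI) && MI.hasOneMemOperand()) {
    const MachineMemOperand *MMO = *MI.memoperands_begin();
    if (MMO->isInvariant() && MMO->isDereferenceable())
      return true;
  }
  return TargetInstrInfo::isReallyTriviallyReMaterializable(MI);
}
```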

As for COPY, I don’t know the context: why don’t we rematerialize copies? Aren’t they cheap?