Register allocators and constrained classes at -O0

Hi all, I’ve run into an issue involving the fast allocator and constrained register classes that I’m looking for pointers on.

Consider a target with a register store instruction using an addressing mode with register base and offset:

*(base + index) = src;

This instruction may be modeled in tablegen with the operands: (outs) (ins REGCLASS:$src, REGCLASS:$base, REGCLASS_CONSTRAINED:$index)

In my particular case, REGCLASS contains 8 registers, R0-R7, and REGCLASS_CONSTRAINED has only 2, R0 and R1. The allocation order for REGCLASS is specifically set up to avoid R0 and R1 via an order that rotates the list left to (R2-R7, R0, R1).

In a piece of code at -O0, I’m at a point where I’ve allocated R2-R7, and the fast allocator sees a store of the form above. It chooses R0 and R1 for $src and $base. Because R0 and R1 are taken, $index can’t be allocated at all, and the allocator emits an error.
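As a toy illustration of how this goes wrong (hypothetical names and a deliberately simplified model of operand-at-a-time allocation, not actual LLVM code), the earlier operands can claim the only registers the constrained class contains, and a same-instruction operand can't be spilled to make room:

```python
# Simplified operand-at-a-time allocation. Registers holding other values
# can be spilled, but registers claimed by an earlier operand of the same
# instruction cannot be reclaimed. All names are illustrative.
REGCLASS = ["R2", "R3", "R4", "R5", "R6", "R7", "R0", "R1"]  # rotated order
REGCLASS_CONSTRAINED = ["R0", "R1"]

def allocate(operands, live):
    """Assign each (name, regclass) operand a register in class order."""
    claimed, assignment = set(), {}
    for name, regclass in operands:
        free = [r for r in regclass if r not in live and r not in claimed]
        if free:
            reg = free[0]
        else:
            # Spill a live value, but never a same-instruction operand.
            spillable = [r for r in regclass if r not in claimed]
            if not spillable:
                raise RuntimeError(f"ran out of registers for ${name}")
            reg = spillable[0]
            live.discard(reg)  # its old value goes to the stack
        assignment[name] = reg
        claimed.add(reg)
    return assignment

# R2-R7 already hold other values; $src and $base take R0 and R1,
# and $index is left with nothing it can use.
store = [("src", REGCLASS), ("base", REGCLASS),
         ("index", REGCLASS_CONSTRAINED)]
try:
    allocate(store, live={"R2", "R3", "R4", "R5", "R6", "R7"})
except RuntimeError as e:
    print(e)  # ran out of registers for $index
```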

I first thought that the backend should be able to prioritize the constrained register first and found the AllocationPriority field in tablegen, but then noted that it was only used in the greedy allocator. I then looked at the fast allocator and noted that it seems to have no method for the target to affect priority or ordering. This is probably by design, so I pose my conundrum to the community since I can’t imagine that this hasn’t come up at some point in the past.

My current options seem to be:

  • Update REGCLASS so that its allocation order does not contain R0 or R1 at all, though this will likely cause spills in cases where none are needed
  • Run the greedy allocator at -O0. It seems like no other target does something like this currently
    • NVPTX and WebAssembly override createTargetRegisterAllocator to return nullptr
    • Overriding addRegAssignAndRewriteFast to just return addRegAssignAndRewriteOptimized seems to fix the allocation problem, but I wonder if there’s a fundamental issue with how the target has been defined.
  • Perhaps attempt to adapt the recently-committed staged-allocation mechanism from https://reviews.llvm.org/rGeebe841a47cb to allocate constrained registers first

Regards,

J.B. Nagurne

Code Generation

Texas Instruments

* This problem is independent of allocation priorities (you can always have R2-R7 already assigned for other values in the surrounding code).
* The fast register allocator by default assigns registers in a greedy fashion, one operand at a time, and cannot revert a choice within an instruction. The really unusual situation here is that picking certain registers for the earlier "base" and "src" operands can make it impossible to pick one for the constrained regclass. I don't remember seeing anything similar for other targets: they either have operands fixed to one physical register, or the register class contains more registers than the number of operands...

Some ideas to work around the problems (though I guess none of them is great):
* Could shuffle the order of operands so the constrained one comes first?
* Could use a register class without R0 and R1 for the "base" and "src" inputs.
* Could fix "index" to R0 or R1 upfront.
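Sticking with a toy model (illustrative names, far simpler than the real fast allocator), the first suggestion can be checked directly: if the constrained operand is visited first, it grabs R0 while it still can, and the unconstrained operands then spill other live values as needed:

```python
REGCLASS = ["R2", "R3", "R4", "R5", "R6", "R7", "R0", "R1"]  # rotated order
REGCLASS_CONSTRAINED = ["R0", "R1"]

def allocate(operands, live):
    """One register per operand; spill live values if needed, but never a
    register already claimed by an earlier operand of this instruction."""
    claimed, assignment = set(), {}
    for name, regclass in operands:
        candidates = [r for r in regclass if r not in claimed]
        if not candidates:
            raise RuntimeError(f"ran out of registers for ${name}")
        free = [r for r in candidates if r not in live]
        reg = free[0] if free else candidates[0]  # spill if nothing is free
        live.discard(reg)
        assignment[name] = reg
        claimed.add(reg)
    return assignment

# Constrained operand first: $index takes R0, $src takes R1, and $base
# spills one of the R2-R7 values to make room.
store = [("index", REGCLASS_CONSTRAINED), ("src", REGCLASS), ("base", REGCLASS)]
print(allocate(store, live={"R2", "R3", "R4", "R5", "R6", "R7"}))
# {'index': 'R0', 'src': 'R1', 'base': 'R2'}
```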

Thanks for the response, Matthias.

  • Shuffling operands is viable, but I had only considered it half-seriously, since tying the odds of an unallocatable situation to something as seemingly unrelated as operand order seems very fragile.

  • The option to effectively separate the base/src and index allocation sets is also viable, but becomes more problematic as the ISA has fewer and fewer registers. Say we remove all but R0-R3: it becomes extremely easy to get into pathological situations if we keep the ‘general’ set to two registers and the constrained set to the other two, especially given how rare base+index memory operations are compared to other operations. I admit that should this situation occur, the fault lies somewhat with overzealous architecture designers, but it’s a possibility to consider nonetheless.

  • Fixing index to R0 or R1 prior to allocation has merit, and its spirit is reflected in the AMDGPU “staged” allocation mechanism: the target tells the allocator to assign the constrained registers first and then allocate the rest. I think this is a good solution in the long run, and I am very happy that it has been shared upstream.

I did find it interesting that you used ‘greedy’ in the description for the fast allocator, given that there is a ‘Greedy’ allocator. I also feel that fast seems far greedier.

Can you or anyone else see an issue with always running the Greedy (tuned Basic) allocator at all optimization levels, or offer any history as to why Fast was chosen as the default at -O0? I can imagine it might involve the debuggability of values as their live ranges are split.

JB

Hi James,

Can you or anyone else see an issue with always running the Greedy (tuned Basic) allocator at all optimization levels, or offer any history as to why Fast was chosen as the default at -O0?

Always running the greedy allocator would be ideal because it would reduce the maintenance burden of the allocators.
Now, the problem is that greedy is just too slow, even when we disable some non-essential heuristics. Switching to that allocator would be a pretty big compile-time regression for our users. I don’t remember the numbers off the top of my head, but the slowdown was substantial.

That said, if you’re okay with that I believe you could tweak the pipeline for your target to always use the greedy allocator. You may want to make TargetPassConfig::getOptimizeRegAlloc virtual in that case.

Cheers,
-Quentin

The fast LLVM register allocator truly is on-the-fly allocation, with minimal analysis and bookkeeping.

The LLVM register allocator that supports the current “basic” and “greedy” variants should probably be called the “incremental” allocator. The goal was to allow live range splitting in between individual allocation decisions without rerunning any analyses. It’s really part of a larger register allocation pipeline that requires significant setup just to get going: SlotIndexes, LiveVariables, LiveIntervals, etc. Then, to perform allocation, it maintains a LiveIntervalUnion data structure. The live range splitting and register reuse that it performs might also slightly degrade debug information, or at least make assembly-level debugging more confusing.

Once you get through the setup, LLVM’s incremental allocation itself is pretty fast for an optimizing register allocator (at least for targets that don’t have thousands of physical registers). The basic variant is probably a good choice for first- or second-level JITs. It was intended to be about as fast as linear scan when spilling/splitting is not required, but to avoid linear scan’s quadratic behavior when spilling/splitting is required.

The greedy variant adds eviction. This is a form of backtracking, actually making it algorithmically non-greedy: it reverses incremental allocation decisions. (I started using the name because eviction seemed like a “greedy” thing to do in the context of a single allocation decision, and the name regrettably stuck.) Eviction can be interleaved with live range splitting and other incremental allocation decisions, making it hard to control the number of steps. I suspect live range splitting can get somewhat out of control for certain CFG structures. I’m sure that would be fixable by someone who understands the splitter well enough, though.
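As a toy sketch of that backtracking (a deliberately crude model: every interval is assumed to interfere with every other, and spill weights are given up front; the real greedy allocator is far more involved), eviction reverses an earlier assignment and re-queues the victim:

```python
def assign(order, weight, regs):
    """Toy greedy-with-eviction. Process intervals in queue order; when no
    register is free, evict the cheapest current occupant if the new
    interval outweighs it, and re-queue the victim (which may then spill).
    Every interval is assumed to interfere with every other."""
    assignment, spilled, queue = {}, [], list(order)
    while queue:
        name = queue.pop(0)
        free = [r for r in regs if r not in assignment.values()]
        if free:
            assignment[name] = free[0]
            continue
        victim = min(assignment, key=lambda n: weight[n])
        if weight[victim] < weight[name]:
            assignment[name] = assignment.pop(victim)  # undo a prior decision
            queue.append(victim)                       # give it another try
        else:
            spilled.append(name)
    return assignment, spilled

weight = {"a": 1, "b": 2, "hot": 5}
print(assign(["a", "b", "hot"], weight, ["R0", "R1"]))
# 'hot' evicts 'a'; re-queued 'a' finds nothing cheaper and spills:
# ({'b': 'R1', 'hot': 'R0'}, ['a'])
```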

-Andy