[LLVM][CodeGen][RFC] Synthetic register classes and allocation mask to determine the final allocation order

This RFC is to gather opinions on two features I’m introducing in MRI for codegen.
I put out 4 related patches.

  1. [CodeGen][MRI] Introduce synthetic register classes by cdevadas · Pull Request #86006 · llvm/llvm-project · GitHub
  2. [CodeGen][MRI] Add allocation mask for register classes by cdevadas · Pull Request #86007 · llvm/llvm-project · GitHub
  3. [AMDGPU] Add synthetic regclass WWM_VGPR_32 for wwm reg operands by cdevadas · Pull Request #86008 · llvm/llvm-project · GitHub
  4. [AMDGPU] Implement wwm-register allocation by cdevadas · Pull Request #86012 · llvm/llvm-project · GitHub

(1 & 2 are the generic patches, 3 & 4 are the AMDGPU specific.)

I’m trying to solve a recurring problem in the AMDGPU backend with vector register allocation. We made some attempts earlier, but couldn’t fully solve it. The real problem (as @arsenm mentioned in the earlier discussion) is that we can’t correctly model the two CFGs and the two parallel modes of register liveness that coexist at the same time.

Currently, register classes can only be defined statically; there is no way to enable or disable them conditionally during CodeGen. I’m trying to have two register classes (named WWM_VGPR_32 and VGPR_32) that use the same register pool. To be precise, virtual registers of the WWM_VGPR_32 class will be introduced in a specific pass before regalloc, and we don’t want the various regclass queries like getAllocatableClass, getMinimalPhysRegClass, etc. to choose the WWM regclass for operands that belong to the other class. Instructions using the wwm regclass are executed differently than those using the other class, so we must determine dynamically when to make the WWM class active. The synthetic field added to the regclass definition allows us to achieve this. It imposes changes in many files due to the new MRI helper function isEnabled() that checks whether a regclass is active.
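Conceptually, the definition could look something like the hypothetical TableGen sketch below. The `isSynthetic` field name and the exact class parameters are illustrative, not necessarily the ones used in the PRs:

```tablegen
// Illustrative sketch only; the field name is hypothetical.
// WWM_VGPR_32 reuses the same physical registers as VGPR_32 but is
// marked synthetic, so generic queries such as getAllocatableClass
// skip it unless the target explicitly enables it via MRI.
def WWM_VGPR_32 : RegisterClass<"AMDGPU", [i32], 32, (add VGPR_32)> {
  let isSynthetic = 1;
}
```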

Even though the two classes statically share all available VGPRs, their actual allocatable sets should be distinct. To achieve that, I introduced an allocation mask vector in MRI that holds a mask value per register class. The masks are determined by the targets and get applied while computing the AllocationOrder for each class. We could instead partition the total VGPRs into two sets statically in their definitions so that there is no overlap during allocation, but that would bring huge performance penalties. The dynamic partition, on the other hand, helps us minimize the total number of VGPRs allocated and ultimately reduces the performance cost.

I hope I explained the motivation for these PRs. Let me know if you have any questions.

But that would bring huge performance penalties.

What kind of performance penalties are we talking about here? I’d like to reduce the number of orthogonal concepts here if we can… having a register change register files seems really confusing. Would dynamically generating register IDs solve the issue?


I think this situation is asking for a better mechanism to manage register classes. On one hand, the .td files define a set of classes usually motivated by the target architecture; on the other hand, TableGen invents new classes whenever it believes it needs more. The result is that we can’t consciously select the set of classes we’re interested in when we want some kind of register class operation: the ones we’ve defined may not be sufficient, while all of them together may be too many.

For example, register pressure calculation once took into account register classes that were completely irrelevant, and inventing a new class for convenience would all of a sudden affect scheduling. This was a very long time ago, so I’m not sure if this problem still exists.

I haven’t read the entire first PR, but I’m not sure if having a single boolean flag per class will add enough generality. I’m thinking of identifying register classes by some property (e.g. the set of register classes that are associated with operands of instructions in this function, etc.), but I don’t have a concrete solution in mind.

I’m hoping that with sufficient characterization of register classes, we could direct scheduling, allocation, etc to do the exact thing we expect.

In GPU-like execution, computations are divided and executed in parallel among waves/warps, each holding multiple workItems/threads. Hardware resources like registers and memory are shared among the multiple waves in a group. The more waves we launch, the better the throughput. The available vector registers (VGPRs) in AMDGPUs are shared by the waves in a group, so if a program occupies more VGPRs, it limits the number of waves that can execute simultaneously, and with it the performance.
It is crucial, therefore, for GPU execution to minimize the resources allocated to programs. If we partition the registers statically, the split won’t be optimal for all use cases. However, if the allocation order can be determined dynamically for the different regclasses based on the program, the allocator can do a better job of allocating a minimal set of vector registers.

We tried a similar approach earlier by attaching a virtual register flag called WWM_REG to the special operands, set via an MRI interface. But that didn’t solve the problem entirely. Apart from the actual register assignment, interfaces like InlineSpiller, SplitKit, etc. depend on reg-liveness; the flag approach doesn’t handle all corner cases there and causes incorrect allocation at times. Virtual registers are considered for allocation based on their regclasses, and the actual allocation unit is the register itself. So, having two distinct regclasses seemed a better approach to me.

This is another instance where representing ~anything as “registers” is problematic. Really we want to partition the register units into disjoint allocatable sets, but that requires unrolling into the huge number of register aliases present in all the register classes.

This doesn’t really fit into the current uses of register classes. We aren’t expressing a hardware operand constraint. We are expressing a software constraint to control allocation.

What I have in mind is being able to specify a set of classes we’re interested in in a call to getMinimalPhysRegClass for example, and do it in some way that is generic, or at least easy to extend. The “classes used in operands” was just an example.

If you were able to restrict the set of classes for allocation, that should solve your problem if I’m understanding the issue correctly.

The synthetic flag is a “specialization” of sorts, where you only have a single property. I’m just wondering how it can be extended, because I think it would solve a more general problem.

I still need some time to fully understand the patch stack, but my first thought while going through it was “this is adding a lot of boilerplate for a very specific purpose”.

IMO if we need to add that much boilerplate, we should definitely make this more generic so that when (if?) this needs to be done again some other day, it’s much easier and less messy.

e.g.:

  • Make the “synthetic” flag in the RegClass just a target-specific 8 or 16 bit flag (a bit like MachineOperand flags such as MONoClobber)
  • Allow targets to optionally attach extra info to MachineRegisterInfo, e.g. a MRIContext that’s opt-in.
    • Or just put it in MachineFunction::getInfo?
  • Some parts of your patch, such as MRI.isEnabled(RC) in TRI, just become isEnabled(RC, MRI), which would be a virtual function of TRI that lets targets do whatever they want to enable/disable the info.
    • In our case it’d look at the per-MF context to see which RC has been dynamically disabled.

This is just a loose suggestion, my point is just that this PR stack solves an important problem but also introduces a lot of invasive changes (which are needed!), and I think it’s better if those changes are as generic as possible.

Sounds like this mechanism of dynamically disabling/enabling reg classes can be used at other places as well. Have you happened to discuss any other use cases inside or outside of AMDGPU backend?


This RFC is also meant to gather opinions on similar interests. Do you know of any specific use case to describe here?

The overall design of the synthetic field was made generic rather than implementing it as a target-specific property. It is meant to achieve the concept of dynamically defined regclasses.
The helper methods can be turned into virtuals in the future if more use cases are found for other targets. It is more appropriate as a single entity rather than a flag that can represent multiple properties, because this field is meant for a distinct purpose.

I don’t see a strong recommendation to go ahead with the patches, but I also don’t see a better alternative being proposed to address the underlying problem. A gentle reminder to have a detailed look at the patches. They are essential to solving some critical issues we face in AMDGPU codegen, and I would like to keep the review moving.

I introduced the allocation mask vector in MRI that holds a mask value per register class determined by the targets and will get applied while computing the AllocationOrder for them.

Can you explain on a high level how this mask vector will be used to determine Allocation order?

RegisterClassInfo::compute() determines the allocation order for register classes. The RawOrder is by default all registers as specified in the register class definition. Currently, the target’s reserved register mask is applied to it to get the final AllocationOrder. With my patch, the allocation mask will also be applied before finalizing the set of allocatable registers.
See here [CodeGen][MRI] Add allocation mask for register classes by cdevadas · Pull Request #86007 · llvm/llvm-project · GitHub
The interested targets should compute and update the allocation masks beforehand as we do for AMDGPU here,
[AMDGPU] Implement wwm-register allocation by cdevadas · Pull Request #86012 · llvm/llvm-project · GitHub


If I understand correctly how these features get used in the AMDGPU backend:

  1. You are tracking whether a particular virtual register is a WWM value by using the special WWM_VGPR_32 register class. Previously this was done via VRegFlags in AMDGPU’s SIMachineFunctionInfo. So this is not new functionality, just a different way of storing the WWM-ness info.

  2. In #86012 you use updateAllocationMaskForRCs to ensure that the registers allocated for VGPR_32 and WWM_VGPR_32 are disjoint, to avoid “incorrect liveness info being plotted for wwm-operands during allocation, eventually leading to wrong spills and splits of liveranges when come short of physical registers.”

I don’t understand why #2 is necessary. At this point we have the lowered per-wave CFG, so LLVM’s liveness info should pessimistically report VGPR interference as if they are all WWM values, which is what you want for the real WWM values, and conservatively correct for normal (non-WWM) values. So what are the correctness problems?

(If the correctness problems are caused by the SIOptimizeVGPRLiveRange pass then perhaps we just need to teach that pass not to touch WWM values.)

This interference is not captured, because the interference is between unrelated segments. Unless we added a second dimension to the interference matrix, a physical register assigned a WWM value is tainted and cannot mix with any non-WWM segments. We also need to have different handling for split/spill depending on whether the value is WWM in that segment. If a WWM value is evicted, that physical register becomes available for assignment. It may then be assigned to a non-WWM segment which is invalid.

Most of the issues are with SGPR spills, which are obviously inserted after this point.

This interference is not captured, because the interference is between unrelated segments. Unless we added a second dimension to the interference matrix, a physical register assigned a WWM value is tainted and cannot mix with any non-WWM segments.

The interference matrix is based on liveness which is based on the per-wave CFG, so surely it is treating all values like WWM values? (Which is correct for WWM values and conservatively correct, but pretty wasteful, for other values.)

We also need to have different handling for split/spill depending on whether the value is WWM in that segment.

Understood, but there may be other ways of plumbing that information through to the place where the copy/spill is generated.

If a WWM value is evicted, that physical register becomes available for assignment. It may then be assigned to a non-WWM segment which is invalid.

I don’t understand why this is invalid. I’d need to see an example.

The virtual reg flag method used earlier was inefficient and, as Matt mentioned, the PhysRegs used for WWM allocation can’t be reused for non-wwm variables. The flag method didn’t help us determine the disjoint sets for assignment. But the combination of the synthetic class (WWM_VGPR_32) plus the allocation mask allowed us to achieve the disjoint sets dynamically, based on the wwm and non-wwm virtreg requirements of the function.

We don’t have a problem when RA decides to reuse a physreg assigned to a wwm-liverange for a non-wwm operand, because their spills/splits are handled by inserting whole-wave spills/splits. That is achieved with the help of the virtReg flag WWM_REG in the compiler today. But the other way around is a problem.
We have identified some cases internally. Certain VGPRs assigned to non-wwm liveranges in an outer region have been reused for wwm-allocation in a nested region where only a subset of the threads is active. Even though the physReg is spilled/restored around that inner region, the wwm-operation clobbers some of its inactive lanes, which the spill-reload cannot restore, because the original liverange being spilled was non-wwm and the spiller interface has no way to determine that the register might get reassigned to a wwm-operand. This ends up corrupting the original non-wwm operand’s uses after the inner region. This is the primary reason we need disjoint sets.