RFC: Introduce generic predicated COPY opcode

The COPY operation for some targets is more than just a copy between two registers. There are some predicated dependencies associated with copies of certain register operands.
Such dependencies are currently marked with an implicit operand that sometimes limits targets while lowering them into target instructions.
Especially, when the copies are inserted during generic transformations (like regalloc pipeline), adding such target constraints becomes difficult if not correctly modeled.
I ran into a situation recently while implementing a feature for AMDGPU that made us think of the need for a Predicated COPY opcode.

For all vector register copies, AMDGPU target has a predicated dependency on a special hardware register called execution mask (exec).
The vector registers have multiple lanes (64, for example) and the execution mask represents the active lanes at a given point.
With the divergent control flow, GPUs dynamically turn on/off the lanes at various points, and for certain operands, we need to force enable all lanes to perform whole-wave vector operations including the copy.
That said, the vector copies either become wave-copy (all lanes enabled) or lane-copy (only a subset of lanes enabled).
Converting certain copies inserted during live range split in the regalloc pipeline into wave-copies turned out to be difficult.
The introduction of the predicated copy helped us model them better and later correctly lower them into target instructions.

I posted a basic implementation of the generic predicated copy with ⚙ D143754 [MachineInstr] Introduce generic predicated copy opcode
To enable Predicated copy for AMDGPU ⚙ D143757 [AMDGPU] Enable predicated copy right from instruction selection
The use case of Predicated copy ⚙ D143762 [AMDGPU] Enable whole wave register copy

I’m not sure I understand all the details here. I am assuming your problem is this, correct me if I am missing something:

The default generic COPY operation in machine-IR assumes that there is a to plainly copy a register to another without any other effects or dependencies. Usually some late target specific pass replaces the COPY instructions with actual machine instructions performing the copy. It seems you are having trouble with this in AMDGPU because the real copy instructions read some kind of control/predication register and you somehow need this dependency to be visible earlier in codegen? (why?)

  • The proposed solution feels very specific to your target architecture. Is this generic enough to ever be useful to any other target?
  • What if the next target has copies writing a other control register instead of reading it? What if it depends on 2 other registers? Inventing ever more COPY_XXX variants for every case does not sounds like good design for target-agnostic generic code.
  • The name predicated copy seems confusing. As as far as I understand the generic code does not need to know this being predicated as there is no way to influence the predication anyway. It’s merely a COPY with an extra operand, using the term predication in the generic code is bound to bring more confusion than it helps…
  • Usually in the case of target specific constraints the best solution is to use target specific opcodes. So would a target specific AMDGPU_COPY operation work? Usually we have some room for callbacks to still enable generic code to operate on target specific operations when necessary (though does it need to do much with those COPYs?).

The real problem we’re trying to solve here is how to represent the fact that we have 2 different CFGs at the same time and 2 parallel modes of register liveness. We have another dimension of liveness in register lanes that isn’t properly modeled. For most of the IR we assume a single-thread view of the registers, but in specific value contexts we need to expose the fact that VGPRs have many lanes (I’ll refer to these as “WWM values”).

It’s not entirely accurate to describe this as a property of the copy instruction themselves. Physically we’ll end up with the same mov instruction to implement the copy in either case. However, depending on the context of the copied value, we need to insert additional instructions to disable the predication when the copy is lowered. This is a property of a specific value / life segment, and not the physical register. We’re currently tracking this with flags set on the virtual register, but once that’s rewritten to a physical register, we need to track this property in an opcode.

Probably not. I think it’s net less painful to just add some generic pieces only AMDGPU uses than try to over-engineer some interface to access a backend defined operation. If someone came up with another use for virtual register flags, they would probably need the same basic thing to preserve the interpretation after virtual registers are rewritten.

We’re not really trying to represent the control register property, we’re trying to preserve the meaning of flags we set on a virtual register after the virtual register is rewritten in the opcode.

Correct, it’s more like switching how the copy is lowered and we didn’t come up with a better name. The predication details is more like this specific instance.

To my previous point, I think it’s easier to just define a generic opcode for copy-with-preserved-virtreg-flags meaning rather than trying to over-engineer an interface to get to some backend defined opcode. We have way too many poorly defined hooks as it is.

Could you explain how that works? Is this something that other GPUs can run into?

This is the defining feature of AMDGPUs. I don’t know if any other architectures do quite the same thing. We have a scalar unit and a vector unit. The scalar unit manages control flow, both in the sense of predicated lane masking and in terms of regular scalar branching. The programming model (and the way we treat the IR) is as a single lane view. We then deduce parts of the program that are uniform and can be moved to the scalar unit, and can use true branching in those cases.

LLVM understands the scalar branching but currently has limited understanding of the divergent / vector / predicated / logical control flow graph. Ideally we would track a second CFG that follows the original program’s vector semantics. The live ranges for vector registers really follow this CFG, not the scalar branching. This results in both us having situations where the register allocator is more conservative than it needs to be, but we also can’t represent the edge cases where the vector registers are being used specially for scalar values which is the problem we’re trying to address here.

In summary we need additional context for a particular live segment of a register, that’s not associated with a register class or physical register. We’re currently setting virtual register flags in relevant contexts, but these need to be maintained (through a different opcode) after allocation.

We have a similar situation. I posted about it last week

In our case the COPY and REG_SEQUENCE need to sometimes (based on register class) have an extra implicit operand.

I have been wondering about a callback to copy myself but I was thinking on the lines of a callback which accepts the MachineInstruction as argument and returns a target specific opcode…

I don’t want to digress this discussion, just pointing out that there is another architecture with similar requirements. I’ve read the linked patches and I don’t think this solves the exact problem I’m facing and I think that’s because our implementations are different.

This all sounds like you have some variation of predicated code which generic LLVM codegen can’t express/model. I am aware of some precedent in modeling this[1] but I suspect none of this applies here.

So I am still left wondering how you intend to model things specifically within MIR. Would it be possible to give a more concrete example?

Do the currently available patches add all the support necessary for the generic codegen, or are they just steps towards some future goal?

I think the worry that needs to be addressed here is whether other contributors will be able to correctly deal with the new instructions when evolving and maintaining the code going forward…

[1] Probably not relevant for this discussion directly as I have a feeling we have a different flavor of things here. These are the other examples I am aware of LLVM targets dealing with predicated code:

  • For ARM32 we just don’t model predication earlier in the pipeline. Predicated code is only formed post-regalloc as an optimization. The predication is only known/modeled within the IfConversion pass which at the end adds artificial implicit defs/uses to the code so the resulting liveness is a conservative over-approximation of the real one so that at later passes at least don’t mis-optimize.
  • LLVM also features the instruction bundle concept that as far as I know is used by out-of-tree targets to add predication early in the pipeline and bundle result up. Meaning the rest of the backend won’t touch the sequence within the bundle anymore. This of course means that register allocation, copy coalescing etc. will only happen in-between bundles but not for anything within them.

Yes. For the moment we’re avoiding truly expressing this, and just adding flags and pseudo-instructions to track the minimum support for codegen we need. A more honest representation would require adding a second CFG to nearly everything and a parallel set of liveness info.

The primary user of this is for compiler inserted spill instructions (there are some other special case uses, but they haven’t been ported over to the new scheme yet). The specific uses have the special property that they ignore the exec mask, and unconditionally run for all lanes. We know at construction that they have these properties, but any copies or other instructions derived from these values need to maintain this property.

Between ⚙ D134950 [CodeGen] Use delegate to notify targets when virtual registers are created and ⚙ D143758 [CodeGen] MRI call back in TargetMachine, we’re adding a way to track custom register flags attached to a virtual register to tag these WWM values. We’re creating new spill instructions with virtual registers. These will be propagated to any new virtual registers split from the originals. However, since these are associated with a specific register value, we lose this once the virtual register is assigned. To preserve the codegen semantics of a copy, we need to change the opcode so we know to change the emission based on what the flag had to have been for the copy.

So after running SGPR allocation and lowering SGPR spills, we have something like

%0:vgpr_32 = IMPLICIT_DEF ; is_wwm flag set on %0
%0:vgpr_32 = V_WRITELANE_B32 %0, 0, %0 (tied-def %1)

%2:sgpr_32 = V_READLANE_B32 %0
S_USE %2
%4:vgpr_32 = PRED_COPY %3:vgpr_32 ; unrelated value, no wwm flag Physically has $exec read

During the second register allocation run for the VGPRs, we end up splitting the live range for %0. We need the WWMness preserved, so

%0:vgpr_32 = IMPLICIT_DEF ; is_wwm flag set on %0
%0:vgpr_32 = V_WRITELANE_B32 %9:sgpr_32, 0, implicit %0 (tied-def %0)
%5:vgpr_32 = COPY %0 ; is_wwm flag propagated to this split %5

%2:sgpr_32 = V_READLANE_B32 %5
S_USE %2
%4:vgpr_32 = PRED_COPY %3:vgpr_32 ; unrelated value, no wwm flag Physically has $exec read

After the second allocation run, we need to codegen the 2 flavors of copies differently:

$vgpr0 = IMPLICIT_DEF ; needs to be treated as a WWM value for the WWM live segments
$vgpr0 = V_WRITELANE_B32 $sgpr0, 0, implicit $vgpr0 (tied-def)
$reserved_registers = S_OR_SAVEEXEC_B64 $exec, -1 ; Activate all lanes for WWM copy
$vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec ; exec is irrelevant, forced to all 1s
$exec = S_MOV_B64 $reserved_registers

$sgpr1 = V_READLANE_B32 $vgpr1
S_USE $sgpr1
$vgpr3 = V_MOV_B32_e32 $vgpr5 ; unrelated value, really reads exec and doesn’t need it forced on

The part I’m least happy with is we still need to make a contextless decision about copies to insert prior to register allocation (such as during initial MIR construction). We start by assuming everything is a PRED_COPY, and replace those with non-pred COPY for registers where the distinction can’t exist. Really it’s simply wrong to use PRED_COPY for any other WWM values at any point.

Right, this notion of predication is completely different, in that it’s optional and you can choose to form it. Predication is more fundamental here. We need to opt out of it with extra instructions for semantics in some situations, not optimize in optionally.

Let us know if there is any further concern or if anyone needs clarity. We would like to upstream this work soon.

Could you instead change the live range splitting code to use a target hook to insert a copy instruction? I think it would make more sense to add flexibility to how registers are copied instead of creating a target-specific opcode in the generic namespace. This may have deeper impact on things like register coalescing (if we use target hooks instead of ever using COPY directly), but may be a more robust solution in the end.

As per the suggestions on this thread and the reviews posted earlier, we implemented a target specific solution. Instead of a new generic PRED_COPY opcode, we now have an AMDGPU::PRED_COPY opcode which we intend to use for WholeWaveMode(WWM) copy operations.

For now, our primary objective is to use this opcode instead of TARGET_OPCODE::COPY for live range splitting in regalloc which will allow us to construct appropriate MIR for these WWM operations.

The solution involves modifying relevant copy checks in the register allocator pipeline(LiveRangeSplitting.cpp, InlineSpiller, TII::foldMemoryOperand, etc) so that this new PRED_COPY is still recognized as a COPY. Achieved by replacing MI.isCopy() checks with TII.isCopyInstr(MI) allowing targets to query if MI is a copy instruction.

There is a major issue that we are facing with this. TII.isCopyInstr() is interpreted very differently by targets, for example in MIPS it returns true for “WRDSP %0:gpr32, 16, implicit-def $dspccond” and since COPY is considered to be within registers the immediate operand here causes the compiler to crash. Moreover, now a wider set of instructions are queried as copies in the register allocator. Some x86 tests also required updating after this but I think they are okay [D150388]. Shouldn’t these target implementations be revised?

Requesting review comments from MIPS, X86, Thumb2 target experts.

WIP patches: [D150388, D150390]

edit: MIPS test crashing after D150388 : CodeGen/Mips/dsp-r1.ll

Seems like this depends on what TII.isCopyInstr wants to be.
Does it want to return true for anything that’s copy-like, including some target-specific instructions (like it seems to be the case currently), or should it just return true for trivial, generic copy operations that needs to be lowered?

Maybe introducing a new function, e.g. isGenericCopyInstr and renaming the older one to isCopyLikeInstr could work?

(note I’ve never worked on these bits before, this is just my 2 cents after reading this update)

I think this is just a Mips bug. I don’t see how you can parse this as remotely copy like since there’s no def operand here. If it is a copy, I think it’s the target’s responsibility to report no for something that isn’t a reg-to-reg copy. This is analogous to isMoveImm where the target needs to filter out cases that operate as constant materialization that can also copy registers