Cleanly Addressing GlobalISel/DAG Pattern Matching Differences

Hello!

I’m creating this thread with the hopes of sparking a discussion regarding how differences between the DAG/GISel should be addressed when writing TableGen patterns.

Intro

I have encountered multiple GlobalISel pattern-matching difficulties in the past few weeks, and the latest occurrence is in D132483.

In essence, the typical scenario is that a pattern that we try to match is straightforward in the DAG, but more intricate/varied in gMIR. Those differences can be very difficult - or even impossible - to cleanly express in TableGen.

This thread is a bit of a continuation of a previous thread started by @jayfoad: globalisel: cross-bank constant propagation? - #5 by jayfoad

Disclaimer

I’m a fairly new contributor to GISel, I still have much to learn. Take everything I wrote here with a pinch of salt.

The Problem

The above diff tries to improve generation of the V_BFI_B32 instruction in the AMDGPU backend when GlobalISel is being used. Take the following pattern, for example:

def : AMDGPUPat <
  (DivergentBinFrag<xor> i32:$z, (and i32:$x, (xor i32:$y, i32:$z))),
  (V_BFI_B32_e64 VSrc_b32:$x, VSrc_b32:$y, VSrc_b32:$z)
>;

This simple pattern can sometimes fail to match due to copies between register banks, which are common the the AMDGPU backend (they’re always inserted to transforrm between VGPR/SGPR). Thus, the following gMIR cannot match the pattern above:

%4:sgpr(s32) = G_XOR %1:sgpr, %2:sgpr
%7:vgpr(s32) = COPY %4:sgpr(s32)             ; COPY prevents pattern matching
%5:vgpr(s32) = G_AND %0:vgpr, %7:vgpr
%6:vgpr(s32) = G_XOR %2:sgpr, %5:vgpr

As far as I know, there is no way to tell TableGen to ignore COPY instructions currently. Thus, we need to get rid of the copy (which the diff tries to do, but we quickly run into other - more complex - issues), or perform instruction selection in C++ (which is annoying).

Potential Solutions

A maybe_copy DAG operator

A while ago, @jayfoad started a patch to introduce a dummy DAG operator that can be used to tell TableGen that a copy may be present in gMIR. There was a lot of discussion about it but it never landed.

Personally, I don’t think this is the right approach. It seems innocent enough to add a maybe_copy operation in every affected pattern. It reminds me a bit of using null/checking for null in programming language design: it starts small, but it quickly finds its way everywhere. Much like you’d check for null values every other line in some language, I fear that we’ll find ourselves adding maybe_copy every other operation in patterns.

Automatically skipping cross-registerbank copies in GISel

Another option would be to consistently ignore cross registerbank copies in GISel when checking opcodes. e.g.

  • If GIM_CheckOpcode sees a COPY, but the expected opcode is not COPY, it checks if it’s a cross-registerbank copy. If it is, it looks past it to try and find the correct instruction.
  • A COPY would only be considered as cross-registerbank if the input/output type is strictly equal, and the only difference between the input/output VReg is the register bank being different.

This would immediately solve the problem, and I think short-term it would have less adverse effects. There’s a few concerns with this but I believe they could be addressed rather easily.

The biggest concern for me is if we want to disable this behavior in some cases, e.g. a pattern shouldn’t match if the operand is a copy. I think we could solve this by adding a special operator (e.g. no_copy or direct, for instance) to express that copies are undesired for a given instruction’s operands.

// could match even if there's a COPY between the adds.
(add $x, (add $y, $z)
// cannot match if there's a COPY between the adds
(add $x, (nocopy (add $y, $z))

This has the same drawbacks as maybe_copy, but I think it would be used less often and thus would be less of an issue.

Conclusion/Help Wanted

How do you think we should move forward with those issues? Are any of the proposed solution acceptable at this time? Is there a better solution I’m missing?

Thanks a lot!

1 Like

Sorry for the ping @nhaehnle, @arsenm, @jayfoad, but do you have any thoughts about this?

I don’t know much about AMDGPU, but it doesn’t seem valid to ignore cross bank copies during selection. Coming from an AArch64 perspective, a selection pattern is necessarily specific to particular register banks (to select the right instruction variant). If I were to come across this problem in AArch64, one way might be to have a pass to optimize cross bank copies to leave the resulting MIR more amenable to pattern selection.

In our case, a pass isn’t possible as those copies - as far as I understand them - represent the transition between scalar & vector operations. Scalar and vector operations run on different ALUs and naively changing the nature of an operation (e.g. Scalar op → Vector op so we can remove a copy) is not trivially possible as it would mess up performance, occupancy, etc.

Sometimes we have patterns where we want to match a full operation, whether it’s scalar/vector/mix of both, so we can generate a single instruction that can take all the operands.

The problem we really need to solve here is how to match patterns while respecting the constant bus restriction, and if/how we should be using tablegen to do this. Ideally we would be able to keep using tablegen, because otherwise we’re going to have to write an awful lot of custom matching code in a post-selector optimization pass.

To some degree, we can have an SGPR or VGPR in any operand for most instructions from a register bank perspective, but there are additional complications. We’re making RegBankSelect insert copies to VGPR for any VALU operation so that every instruction trivially respects the constant bus restriction. Otherwise, if we had more than one SGPR operand, direct selection would violate the restriction. We need full instruction context to decide which operands should be folded after the fact.

So really we need a pattern level predicate that looks through all the copies and considers the number of SGPRs if the fold were to occur. We sort of do this for some operations already with ThreeOpFrag, but if I remember correctly that’s missing some cases involving copies too (e.g. see the fixmes). For more nested patterns, this gets more complicated (and I’m not entirely sure what the contract is for Operands and the custom predicate code)

So really we need a pattern level predicate that looks through all the copies and considers the number of SGPRs if the fold were to occur

That sounds good, but IIRC complex predicates have to be leaf patterns in GISel. How would this be implemented? Do you have an idea of what the final pattern would look like?

That sounds good, but IIRC complex predicates have to be leaf patterns in GISel. How would this be implemented? Do you have an idea of what the final pattern would look like?

I don’t really have a concrete idea. The language we’re working with feels limiting. I can see improving the API for complex predicates to be able to access the operands by name in the code fragment. Works fine as long as there’s a single output instruction, if a bit ugly. For patterns with multiple outputs, I think we would have to write custom code for any of them (don’t think we really have any of these though).

I’m not sure I fully understand. Do you mean making code like this work in tablegen?

(SomeComplexPat<xor> i32:$z, (SomeComplexPat<and> i32:$x, (SomeComplexPat<xor> i32:$y, i32:$z)))

I’m not sure how the different patterns would cooperate (preserve state) to determine if a fold is possible (if the constant bus limitation is satisified). Though, I could see this working if we just follow the current pattern of “trivially satisfying the constant bus limitation by inserting copies”. The nested patterns could simply work by checking for a COPY before checking the opcode

However I’m not sure how easy it is to fix tablegen to do this in practice. Wouldn’t it require some rework to the return value of ComplexPatterns to make it more easy to manipulate? (allow the GISel logic to extract the Nth result (register) to get the instruction that defines it & match the nested ComplexPattern against it). Seems like quite a bit of work (and maybe worth an RFC beforehand), but the payoff would be worth it I think?

Finally, if we really do want to use ComplexPatterns to ignore the copies, I’m afraid it’ll be way too common and we’ll end up with many instructions using it (= adding noise to already complicated TableGen files). Is there an alternative we could consider?

One idea I had at some point was to simply add a “non trivial” opcode matching system. Instead of using GIM_CheckOpcode, we could add a system where a custom function is called (like the current CxxPredicates) to check the opcode (& in our case, ignore copies) & return the instruction. What would you think about that?

(SomeComplexPat<xor> i32:$z, (SomeComplexPat<and> i32:$x, (SomeComplexPat<xor> i32:$y, i32:$z)))

No. The predicate needs to solve a property of the output instruction, considering all of the matching inputs. Some kind of predicate on each individual input doesn’t help you. This is what we get with the current complex predicates (although I couldn’t tell you how the operand indices you get correspond to these named input values).

So what we would - ideally - need is a system that allows predicate to take all properties of the matched fragment into account? Could this be achieved using some GISelPredicateCode or do we need some upgrades to TableGen?

Alternatively, we could just work around this issue by implementing the selection logic in C++. It deviates from the original goal of getting those BFI patterns to match though, and I guess we can’t keep working around such issues indefinitely, but it’d get this working.

I’d really like to get those patterns to work as soon as possible, what would be the next steps to implement the required fix(es)?

Selection for all of these patterns in manual C++ is what we’re trying to avoid.

GISelPredicateCode might be good enough, but I think the API is too unclear with what the meaning of the incoming operands is. It would be better if we had some generated enum or something so we can refer to the operands by name in the fragment.

The other ugly piece is applying predicates per-output instruction, but I don’t know if we really have any patterns with multiple outputs where this could come up.

Would the GISelPredicateCode be run in-between matching instructions, and allow for pre-processing the operands to skip the COPY? How would it work?

Not sure what you mean by “in-between”. The predicate is a function of the input pattern. The custom predicate has the option of looking through copies as it sees fit. However, for cases where we need to choose an operand, I don’t think the predicates give you way to pass the looked-through value into the output operand

What I meant is that, in the pattern above:

def : AMDGPUPat <
  (DivergentBinFrag<xor> i32:$z, (and i32:$x, (xor i32:$y, i32:$z))),
  (V_BFI_B32_e64 VSrc_b32:$x, VSrc_b32:$y, VSrc_b32:$z)
>;

A typical problem is that we can’t match the “and” because the “xor” RHS has a COPY before the and.
We need a way to ignore this copy before GISel attempts to match the XOR.

So by “in-between”, I meant that we would need something that will run on the xor, the and, and the other xor so it can “pre-process” the operands before GISel attempts to match the next instruction.

I see that there is ⚙ D86294 AMDGPU/GlobalISel: Add tablegen operator that looks through copies would that be an adequate solution for you?
I will check if the patch author has any intention to resume work & land it, otherwise I could take over it then.

Overall I think that this should be done in global-isel emitter.
Emitter could recognize fields in PatFrag with special name, some ideas for fields:

  • look through copies when checking opcode
  • when you find operand save it somewhere to be inspected later when all operands are matched (PredicateCodeUsesOperands might be enough but requires named operands iirc)
  • check bus limit for matched operands (in InstructionSelectorImpl, not in complex predicate)

I did a test where I added a GISelIgnoreCopies bit to PatFrag to allow the following:

def : AMDGPUPat <
  (DivergentBinFrag<xor> i32:$z, (MaybeCopied<and> i32:$x, (MaybeCopied<xor> i32:$y, i32:$z))),
  (V_BFI_B32_e64 VS_32:$x, VS_32:$y, VS_32:$z)
>;

And it’s implemented by adding a new “RecordInsnIgnoreCopies” operation.

It seems to work but then it trips on “CheckIsSameOperand”. I can work around it by relaxing the check & also allowing getDefIgnoringCopies equality but it’s hacky. How should we deal with this one? I’m not a fan of adding another “IgnoringCopies” variant of that opcode; IMO it’s a bit of a slippery slope, I don’t want to end up with a lot of opcodes being duplicated in this fashion.

I think we can solve the constant bus limit problem by adding some new “Verifier” InstructionSelector opcode. We could emit it before BuildMI and make it call a function in the InstructionSelector w/ the Opcode + the operands & fail the current Try block when it returns false.

Draft: ⚙ D136234 [WIP] Allow ignoring copies in GISel

I did it in “Pattern” instead of “PatFrag” to avoid needing to wrap all operands in a specific PatFrag to enable this.

I didn’t update the tests yet but it looks like there’s quite a few regressions. I also don’t see any crashes due to reg bank limits so I’m not sure if I did it right - I expected a lot more failures and a lot of crashes.

[build] Failed Tests (9):
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/add.v2i16.ll
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/add_shl.ll
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/inst-select-pattern-and-or.mir
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/inst-select-pattern-xor3.mir
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/shlN_add.ll
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/store-local.128.ll
[build]   LLVM :: CodeGen/AMDGPU/GlobalISel/store-local.96.ll

Ping - can anyone give some feedback on ⚙ D136234 [WIP] Allow ignoring copies in GISel?

I would like to restart the discussion here because I did some experiments and I have new questions.

Firstly, is it possible for GlobalISel to select instructions that violate constant bus restrictions?
If I understand correctly, that’s impossible since it always constrains the selected instruction operands to the right registers after selecting an instruction. That would explain why I do not see any crashes in my diff above.

Secondly, do we have any intention of selectively ignoring copies (enable it on a case-by-case basis), or is it the kind of thing we would do backend-wide (e.g. all pattern matching in the AMDGPU backend ignores cross-regbank copies)?

  • If we want to ignore copies by default, perhaps what we need isn’t a bit in Pattern/PatFrag, but instead something like a InstructionSelectorTraits class that would allow us to control how opcodes like RecordInsn work. Using the right design pattern, we may be able to get all “traits” functions to be inlined and have zero performance cost.
  • If we want to enable it on a case-by-case basis, why? Are there cases where ignoring them could be a disadvantage?

Finally, if we want to ignore them on a case-by-case basis, which case would be more common? Ignoring them or not ignoring them?

  • If we want to ignore them more often than not, then I believe a bit in the Pattern class would be the way to go (like in my WIP diff above), as we probably don’t want to clutter all the patterns with a new operation like “ignore_copy”.
  • Else, then maybe a new operation to ignore copies explicitly is a better idea.