globalisel: cross-bank constant propagation?

jayfoad · March 26, 2021, 2:43pm

Hi,

While experimenting with alternative ways of lowering funnel shifts on
AMDGPU I have run into a case where this GMIR:

    %0:vgpr(s32) = COPY $vgpr0
    %1:vgpr(s32) = COPY $vgpr1
    %2:sgpr_64 = COPY $sgpr30_sgpr31
    %3:sgpr(s32) = G_CONSTANT i32 5
    %7:vgpr(s64) = G_MERGE_VALUES %1(s32), %0(s32)
    %9:vgpr(s32) = COPY %3(s32)
    %8:vgpr(s64) = G_LSHR %7, %9(s32)
    %4:vgpr(s32) = G_TRUNC %8(s64)
    $vgpr0 = COPY %4(s32)
    %5:ccr_sgpr_64 = COPY %2
    S_SETPC_B64_return %5, implicit $vgpr0

does not get matched by this selection pattern:

def : GCNPat<(i32 (trunc (srl i64:$src0, (i32 ShiftAmt32Imm:$src1)))),
(V_ALIGNBIT_B32_e64 (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
(i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)),
VSrc_b32:$src1)>;

because the COPY of G_CONSTANT 5 from sgpr to vgpr gets in the way of
matching the (i32 ShiftAmt32Imm:$src1).

What's a good way of fixing this? I know there has been some
discussion before about whether the generated matcher should look
through copies, either automatically or only when the tablegen pattern
is augmented with some kind of "please look through copies here" node.

As an alternative, would it make sense to run some cross-bank constant
propagation after regbankselect? It seems pretty obvious to me that %9
above should be simplified to:

%9:vgpr(s32) = G_CONSTANT i32 5

so long as it's legal, which we can easily check, can't we?

Thanks,
Jay.

nhaehnle · March 27, 2021, 7:57am

Hi Jay,

Where does the copy of a constant from SGPR to VGPR come from? Naively, I’d think that better GMIR would be

%3:sgpr(s32) = G_CONSTANT i32 5
…

%8:vgpr(s64) = G_LSHR %7, %3(s32)

Cheers,
Nicolai

jayfoad · March 27, 2021, 8:56am

Hi Nicolai!

For simplicity our regbankselect says that all operands of VALU
instructions have to go in vgprs. Moving some of them into sgprs is
left as an optimisation for a later pass. As you know there are limits
on //how many// operands of a VALU instruction can be sgprs or
constants, which are not simple to express in terms of alternative
operand mappings.

Thanks,
Jay.

nhaehnle · March 27, 2021, 1:40pm

Hi Jay,

For simplicity our regbankselect says that all operands of VALU
instructions have to go in vgprs. Moving some of them into sgprs is
left as an optimisation for a later pass. As you know there are limits
on //how many// operands of a VALU instruction can be sgprs or
constants, which are not simple to express in terms of alternative
operand mappings.

Okay, that makes sense: the initial register bank selection is biased to generate legal code.

It sounds likely that similar issues of obstructive COPYs will appear in other patterns as well, right? For example, if you have original code of the form (x * (-y)), where x is divergent and y is uniform, it seems likely you’d get something like (typed completely off the top of my head):

t:sgpr = G_FNEG y:sgpr
t’:vgpr = COPY t:sgpr
r:vgpr = G_FMUL x:vgpt, t’:vgpr

… and then an instruction selection pattern that pulls the negation into a source modifier would fail due to the COPY? The point is that this is not just an issue for constants, so a constant-specific pass seems the wrong way to go.

The remaining options I see right now are:

(1) Allow pattern matching to look through COPYs.
(2) Eliminate those COPYs before instruction selection.

Not having thought about this very long, the relative merits seem to be:

(2) results in smaller GMIR and a simpler pattern matcher, which is better for compile time
(2) requires legalization after(?) instruction selection (constant bus limit etc.), which is going to be a pain in the existing CodeGen framework

Neither option makes me happy I guess I would go with (2) if this was a greenfield development, but (1) fits better in the existing LLVM code base.

Cheers,
Nicolai

jayfoad · March 29, 2021, 10:14am

For simplicity our regbankselect says that all operands of VALU
instructions have to go in vgprs. Moving some of them into sgprs is
left as an optimisation for a later pass. As you know there are limits
on //how many// operands of a VALU instruction can be sgprs or
constants, which are not simple to express in terms of alternative
operand mappings.

Okay, that makes sense: the initial register bank selection is biased to generate legal code.

It sounds likely that similar issues of obstructive COPYs will appear in other patterns as well, right? For example, if you have original code of the form (x * (-y)), where x is divergent and y is uniform, it seems likely you'd get something like (typed completely off the top of my head):

  t:sgpr = G_FNEG y:sgpr
  t':vgpr = COPY t:sgpr
  r:vgpr = G_FMUL x:vgpt, t':vgpr

... and then an instruction selection pattern that pulls the negation into a source modifier would fail due to the COPY? The point is that this is not just an issue for constants, so a constant-specific pass seems the wrong way to go.

Yes, good point.

The remaining options I see right now are:

(1) Allow pattern matching to look through COPYs.
(2) Eliminate those COPYs before instruction selection.

Not having thought about this very long, the relative merits seem to be:

- (2) results in smaller GMIR and a simpler pattern matcher, which is better for compile time
- (2) requires legalization after(?) instruction selection (constant bus limit etc.), which is going to be a pain in the existing CodeGen framework

Neither option makes me happy I guess I would go with (2) if this was a greenfield development, but (1) fits better in the existing LLVM code base.

Here's a patch that tackles (1): ⚙ D86294 AMDGPU/GlobalISel: Add tablegen operator that looks through copies . The
current version of the patch adds a dag operator that looks through
copies. I'm not particularly thrilled about that because as a pattern
writer I don't know when I would be required to use it, so I would
probably sprinkle it everywhere.

Previous versions of the same patch changed GlobalISel pattern
matching to look through copies automatically, without any special dag
operator. This seems much more user friendly to me, but Matt argued
that it "is selecting instructions incorrectly, and then relying on
something else to fix them up".

To be honest I am confused about the role of RegBankSelect when you
have instruction selection patterns that match more than one GMIR
node. How can RegBankSelect possibly tell you what alternative
mappings (i.e. instructions) are available, without predicting what
patterns will be matched by instruction selection?

Jay.

arsenm · March 29, 2021, 1:04pm

There are 2 issues:
1. Current RegBankSelect does not consider the uses when selecting the bank. This is a general missing optimization
2. For the AMDGPU case, I think we should have a post-regbankselect combiner for this. It’s often better to materialize constants for each bank

I don’t think we actually want to have to look through copies, and the places we do are just working around the status quo.

The folding SGPR/constants into instructions should be a new and improved version of SIFoldOperands. I think optimizing this is beyond the scope of what RegBankSelect and selection patterns. Far too much code would need to be taught to respect and preserve the constant bus limitation otherwise, so that’s why everything uses VGPRs.

-Matt

arsenm · March 29, 2021, 1:06pm

There is a table of costs for alternative bank mappings. For the AMDGPU case, I don’t think this cost table makes much sense. The condition is mostly divergence based. The exception would be for some VGPR vs. AGPR cases.

-Matt

jayfoad · March 29, 2021, 1:34pm

I can understand leaving it to a later pass to fold //sgprs or
constants// into an instruction. What I can't understand is how you do
the same kind of thing for more complex selection patterns like:

  t:sgpr = G_ADD y:sgpr, z:sgpr
  t':vgpr = COPY t:sgpr
  r:vgpr = G_ADD x:vgpt, t':vgpr

How can we select v_add3_u32 from this? I can only think of two options:

1. Select s_add and v_add and leave it to a later pass to combine
them. This seems to be giving up on doing decent pattern-based
instruction selection.
2. Match it in the instruction selector, using a pattern that
(explicitly or implicitly) looks through the cross-bank copy. But then
you're back to the problem that two of the inputs are sgprs, which may
or may not be valid according to complex operand restrictions.

Thanks,
Jay.

nhaehnle · March 29, 2021, 6:51pm

Hi Nicolai!

For simplicity our regbankselect says that all operands of VALU
instructions have to go in vgprs. Moving some of them into sgprs is
left as an optimisation for a later pass. As you know there are limits
on //how many// operands of a VALU instruction can be sgprs or
constants, which are not simple to express in terms of alternative
operand mappings.

Thanks,
Jay.

There are 2 issues:

Current RegBankSelect does not consider the uses when selecting the bank. This is a general missing optimization

For the AMDGPU case, I think we should have a post-regbankselect combiner for this. It’s often better to materialize constants for each bank

This sounds desirable, but how would this work on GMIR without interfering in strange ways with pattern matching?

The challenge I see is: due to the constant bus limitation, there are situations in which such a combiner has to make a choice as to which operand to combine / eliminate the COPY. If the choice is arbitrary, it may sometimes inhibit later pattern matching.

Related: are we clear on what is legal for GMIR? If G_ADD of regbank sgpr into regbank vgpr is supposed to be legal, does it then have to satisfy the constant bus limitation? If so, why are we still calling it G_ADD instead of V_ADD?

I don’t think we actually want to have to look through copies, and the places we do are just working around the status quo.

The folding SGPR/constants into instructions should be a new and improved version of SIFoldOperands. I think optimizing this is beyond the scope of what RegBankSelect and selection patterns. Far too much code would need to be taught to respect and preserve the constant bus limitation otherwise, so that’s why everything uses VGPRs.

I can understand leaving it to a later pass to fold //sgprs or
constants// into an instruction. What I can’t understand is how you do
the same kind of thing for more complex selection patterns like:

t:sgpr = G_ADD y:sgpr, z:sgpr
t’:vgpr = COPY t:sgpr
r:vgpr = G_ADD x:vgpt, t’:vgpr

How can we select v_add3_u32 from this? I can only think of two options:

Select s_add and v_add and leave it to a later pass to combine
them. This seems to be giving up on doing decent pattern-based
instruction selection.

Match it in the instruction selector, using a pattern that
(explicitly or implicitly) looks through the cross-bank copy. But then
you’re back to the problem that two of the inputs are sgprs, which may
or may not be valid according to complex operand restrictions.

The SelectionDAG pattern for v_add3 and friends already checks this using a C++ code fragment, doesn’t it?

Cheers,
Nicolai

jayfoad · March 30, 2021, 10:32am

Yes but I had always assumed that was just a heuristic. Are we saying
it's required for correctness? Actually I'm confused about the whole
concept of checking for constant bus violations at this stage.

In a normal compiler, the instruction selector would be allowed to
select any instruction that works, regardless of register classes, and
it would be the register allocator's job to copy the input values into
suitable input registers. So given this GMIR:

  t:sgpr = G_ADD y:sgpr, z:sgpr
  t':vgpr = COPY t:sgpr
  r:vgpr = G_ADD x:vgpt, t':vgpr

I would naively hope that the instruction selector could select this
without worrying about register banks or constant bus restrictions:

%4:vgpr_32 = V_ADD3_U32 %0:vgpr_32, %1:vgpr_32, %2:vgpr_32

Then optionally SIFoldOperands (which does know about constant bus
restrictions) could modify some of the input register classes to take
sgprs instead.

Then the register allocator would do its job, inserting sgpr-to-vgpr
copies as and where necessary.

Jay.

nhaehnle · March 31, 2021, 9:30pm

I don’t think we actually want to have to look through copies, and the places we do are just working around the status quo.

The folding SGPR/constants into instructions should be a new and improved version of SIFoldOperands. I think optimizing this is beyond the scope of what RegBankSelect and selection patterns. Far too much code would need to be taught to respect and preserve the constant bus limitation otherwise, so that’s why everything uses VGPRs.

I can understand leaving it to a later pass to fold //sgprs or
constants// into an instruction. What I can’t understand is how you do
the same kind of thing for more complex selection patterns like:

t:sgpr = G_ADD y:sgpr, z:sgpr
t’:vgpr = COPY t:sgpr
r:vgpr = G_ADD x:vgpt, t’:vgpr

How can we select v_add3_u32 from this? I can only think of two options:

Select s_add and v_add and leave it to a later pass to combine
them. This seems to be giving up on doing decent pattern-based
instruction selection.

Match it in the instruction selector, using a pattern that
(explicitly or implicitly) looks through the cross-bank copy. But then
you’re back to the problem that two of the inputs are sgprs, which may
or may not be valid according to complex operand restrictions.

The SelectionDAG pattern for v_add3 and friends already checks this using a C++ code fragment, doesn’t it?

Yes but I had always assumed that was just a heuristic. Are we saying
it’s required for correctness? Actually I’m confused about the whole
concept of checking for constant bus violations at this stage.

Point 1: We need to define what legal MIR must satisfy. Legal final ISA must satisfy the constant bus limitation, so that’s what legal MIR must satisfy – at least for most of the codegen pipeline.

Point 2: To help with development, we have a MIR verifier to catch issues. Given point 1, the MIR verifier should catch constant bus limitation violations.

Point 3: Therefore, MIR has to satisfy that limitation at all times.

There were a few times when I’ve thought that perhaps we need to be more explicit about different flavors of MIR, which would affect what the MIR verifier checks. So far, there’s MIR in SSA form and MIR out of SSA form. That would imply an LLVM-wide change of philosophy, but may well be worth it compared to the status quo which is comparatively unprincipled.

In a normal compiler, the instruction selector would be allowed to
select any instruction that works, regardless of register classes, and
it would be the register allocator’s job to copy the input values into
suitable input registers. So given this GMIR:

t:sgpr = G_ADD y:sgpr, z:sgpr
t’:vgpr = COPY t:sgpr
r:vgpr = G_ADD x:vgpt, t’:vgpr

I would naively hope that the instruction selector could select this
without worrying about register banks or constant bus restrictions:

%4:vgpr_32 = V_ADD3_U32 %0:vgpr_32, %1:vgpr_32, %2:vgpr_32

This should be possible, no? Actually, I imagine that isel would immediately assign an sgpr class to %1 and %2. I thought the only thing missing is for the pattern to actually match the incoming GMIR despite the COPY?

Cheers,
Nicolai

jayfoad · April 1, 2021, 2:19pm

Thanks for your thoughts. I realise there are more and more things
about instruction selection and register classes that I don't fully
understand yet, so I'll try to educate myself before coming back to
this.

Jay.

Topic		Replies	Views
Cleanly Addressing GlobalISel/DAG Pattern Matching Differences Code Generation	18	1432	November 23, 2022
Dealing with illegal operand mappings in RegBankSelect LLVM Dev List Archives	8	137	February 27, 2019
Simpler subreg ops in machine code IR LLVM Dev List Archives	4	229	June 16, 2010
X86: copyConstantToRegister LLVM Dev List Archives	3	87	June 8, 2004
Some backend questions LLVM Dev List Archives	5	124	June 7, 2004

globalisel: cross-bank constant propagation?

Related topics