[GlobalISel][MIPS] Legality and instruction combining

Hello,

I am developing GlobalISel for MIPS. I have a few questions and observations about defining legality of generic instruction and also possible combining of instructions and artifacts in pre/post legalizer combiner or elsewhere (e.g. in some sort of instruction-select patterns).

I look at legality as “If generic instruction can be selected into machine instruction, it is legal”.

For example, let’s look at G_ICMP and G_SELECT. In llvm IR type of result of icmp is always i1, and test argument (type 1) in select is also i1.
Here is an .ll example:

define i32 @f(i32 %a, i32 %b, i32 %c, i32 %d) {
entry:
%cmp = icmp slt i32 %a, %b
%cond = select i1 %cmp, i32 %d, i32 %c
ret i32 %cond
}

and corresponding MIR snippet:

%4:(s1) = G_ICMP intpred(sgt), %0(s32), %1
%5:
(s32) = G_SELECT %4(s1), %2, %3

On mips 32, integer compare uses i32 as result and that result is zero extended.
For G_SELECT, we will select instructions that check whether “test register” was zero or not (selected instructions are movz or movn).
Test register has size 32, having that in mind, idealy, legalizer should produce

%4:(s32) = G_ICMP intpred(sgt), %0(s32), %1
%5:
(s32) = G_SELECT %4(s32), %2, %3

after combining all extends and truncs.

Currently, it is not possible to widen test register (type 1) in G_SELECT and we could end up with telling legalizer that test argument is legal for i1, and later select that register as i32 regardless. Therefore, the question is:

Q: What are the plans:
a) Is it planned for legalizer to set sizes of all arguments to appropriate sizes of phisical registers (i.e. types s1, s8 and s16 should be widened to s32)?
b) Is the plan to sometimes let s1 as legal type and ignore it later?
c) Is the plan to sometimes let s1 as legal type and use this type as a hint to select between multiple available instructions using some patterns?

For example, in https://reviews.llvm.org/D51489, G_ICMP got its result (type0) legalized as legal for s32 (corresponding to plan a)). We would get similar final output (also correct) if we said that type0 is legal for s1 (corresponding to b), but in this case it is also necessary to: tell that extends are legal for {s32, s1}, regbankselct and instruction select G_ZEXT.

It would be nice to have a place around legalizer where we could combine extending artifacts with some instructions, as combining instruction with G_AND (or some other bitwise instr) seem like much more work later that could be done earlier. Its better to not produce superfluous extend (bitwise instr) instruction at all, then deal with it later.
It would also be nice if it could be possible to choose if extend instructions are replaced with bitwise instructions or not. Not replacing them in early passes will give us fewer instructions to work with and carry on in later passes. It would let us select extend in instruction-select. This approach could result in using a little less memory. Also some mips processors have instructions that extract bits or do sign extend. These could be selected from G_ZEXT or G_SEXT generating single instruction instead of two equivalent bitwise instructions.

Q: I see that there is a plan to add pre/post legalizer combining pass. Where should combining happen, in pre/post legalizer combining pass or in the instruction select pass?

For example, G_SELECT could be selected into ‘movn’, the default option, i.e. move on true.
However, G_SELECT where test is defined with G_ICMP, could be selected into movz or movn based on compare opcode (opcode that requires negation (with xor) could be selected into movz and remove xor (or into movn with but we need to swap true and false register)). If we select icmp and select independently we could end up with an extra xor instruction.

I ask this question since combination of G_LOAD and G_SEXT

%1:(s8) = G_LOAD %0(p0)
%2:
(s32) = G_SEXT %1(s8)

will be replaced with G_SEXTLOAD in upcoming combiner. And if we let %1:_(s8) = G_LOAD %0(p0) to be legal, we could select G_SEXT into G_LOAD (going bottom up) into sign extending load of appropriate size in instruction-select.

Best regards,
Petar

Hi Petar,

Hello,

I am developing GlobalISel for MIPS. I have a few questions and observations about defining legality of generic instruction and also possible combining of instructions and artifacts in pre/post legalizer combiner or elsewhere (e.g. in some sort of instruction-select patterns).

I look at legality as "If generic instruction can be selected into machine instruction, it is legal".

Yes, that's right. I'd insert 'always' into that sentence since legality can't be conditional on having the right instructions nearby. If you say that G_ADD:s32 is legal then you are saying that this opcode is selectable in any context. If you need conditional legality then you need to conditionally lower into something that's unconditionally legal.

For example, let's look at G_ICMP and G_SELECT. In llvm IR type of result of icmp is always i1, and test argument (type 1) in select is also i1.
Here is an .ll example:

define i32 @f(i32 %a, i32 %b, i32 %c, i32 %d) {
entry:
  %cmp = icmp slt i32 %a, %b
  %cond = select i1 %cmp, i32 %d, i32 %c
  ret i32 %cond
}

and corresponding MIR snippet:

    %4:_(s1) = G_ICMP intpred(sgt), %0(s32), %1
    %5:_(s32) = G_SELECT %4(s1), %2, %3

On mips 32, integer compare uses i32 as result and that result is zero extended.
For G_SELECT, we will select instructions that check whether "test register" was zero or not (selected instructions are movz or movn).
Test register has size 32, having that in mind, idealy, legalizer should produce

    %4:_(s32) = G_ICMP intpred(sgt), %0(s32), %1
    %5:_(s32) = G_SELECT %4(s32), %2, %3

after combining all extends and truncs.

I assume we're ignoring mips32r6 for the moment. r6 changed the behaviour for some of the conditional branches such that they really are s1 (instead of testing the whole register for 0 and not-0, they only test bit 0).

Currently, it is not possible to widen test register (type 1) in G_SELECT and we could end up with telling legalizer that test argument is legal for i1, and later select that register as i32 regardless.

Could you clarify what you mean here? The new legalizer info can define this with:
  getActionDefinitionsBuilder(G_SELECT).clampScalar(1, s32, s32)
so I'm guessing you mean that code to mutate the G_SELECT is currently missing

Therefore, the question is:

Q: What are the plans:
a) Is it planned for legalizer to set sizes of all arguments to appropriate sizes of phisical registers (i.e. types s1, s8 and s16 should be widened to s32)?

Targets will generally legalize to their native types which is often the same as the physical register sizes. I say 'often' because it's really about the operations that can be performed rather than the register size. One common case is having 64-bit registers but having lots of s32 operations be legal (e.g. mips64 and aarch64). Another case that comes up occasionally is s64 operations being legal but only having 32-bit physical registers.

However, none of this is actually a requirement. It really boils down to controlling the scale of the problem that the post-legalizer passes have to deal with. The only real requirement is that post-legalizer passes have to support everything the legalizer declares legal.

b) Is the plan to sometimes let s1 as legal type and ignore it later?

I'm not sure what you mean here

c) Is the plan to sometimes let s1 as legal type and use this type as a hint to select between multiple available instructions using some patterns?

That's a viable option. You do need to ensure the patterns cover every possibility though (including G_ICMP:s1 by itself and (G_SELECT:sN s1, ...) by itself).

For example, in https://reviews.llvm.org/D51489, G_ICMP got its result (type0) legalized as legal for s32 (corresponding to plan a)). We would get similar final output (also correct) if we said that type0 is legal for s1 (corresponding to b), but in this case it is also necessary to: tell that extends are legal for {s32, s1}, regbankselct and instruction select G_ZEXT.

I believe the minimum required if any sN operation is legal is G_TRUNC and G_ANYEXT but I'm not completely certain of that. We might require G_ZEXT and G_SEXT too.

It would be nice to have a place around legalizer where we could combine extending artifacts with some instructions, as combining instruction with G_AND (or some other bitwise instr) seem like much more work later that could be done earlier. Its better to not produce superfluous extend (bitwise instr) instruction at all, then deal with it later.
It would also be nice if it could be possible to choose if extend instructions are replaced with bitwise instructions or not. Not replacing them in early passes will give us fewer instructions to work with and carry on in later passes. It would let us select extend in instruction-select. This approach could result in using a little less memory. Also some mips processors have instructions that extract bits or do sign extend. These could be selected from G_ZEXT or G_SEXT generating single instruction instead of two equivalent bitwise instructions.

The reason we emit the bitwise instructions is because we don't want the later passes to have to deal with extend/truncate for every possible size combination. For example, suppose we can't handle s1 after the legalizer and widen s1 to s32 everywhere we see it. We'll get code like this:
  %1(s1) = G_TRUNC %0(s32)
  %2(s32) = G_ZEXT %1(s1)
but post-legalizer we can't handle s1 at all. We can't just eliminate the pair though, because this code zeros bits 1 up to 31. So we emit the closest thing we have to zero-extend-inreg which is G_AND
  %2(s32) = G_AND %0(s32), i32 1

One aspect of this that I'm not very keen on is the G_SHL+G_SHR pairs emitted for G_SEXT. I'd prefer a G_SEXT_IN_REG instead

Q: I see that there is a plan to add pre/post legalizer combining pass. Where should combining happen, in pre/post legalizer combining pass or in the instruction select pass?

The more that can be done pre-legalizer the better but some is only exposed by the legalizers changes so it needs to be both.

For example, G_SELECT could be selected into 'movn', the default option, i.e. move on true.
However, G_SELECT where test is defined with G_ICMP, could be selected into movz or movn based on compare opcode (opcode that requires negation (with xor) could be selected into movz and remove xor (or into movn with but we need to swap true and false register)). If we select icmp and select independently we could end up with an extra xor instruction.

I would probably handle this case with ISel patterns. There's no real need to combine the G_SELECT+G_ICMP cases into something more specific.

I ask this question since combination of G_LOAD and G_SEXT

%1:_(s8) = G_LOAD %0(p0)
%2:_(s32) = G_SEXT %1(s8)

will be replaced with G_SEXTLOAD in upcoming combiner. And if we let %1:_(s8) = G_LOAD %0(p0) to be legal, we could select G_SEXT into G_LOAD (going bottom up) into sign extending load of appropriate size in instruction-select.

That's the way it's handled at the moment and it interacts badly with ISel patterns in the presence of optimization. The root problem is that it's not possible to define a pattern for an i8 load that results in an s8 in ISel patterns due to the way CodeGenDAGPatterns relates pattern types with register types. As a result of this, only the G_SEXT+G_LOAD pair is selectable which violates the principles of the legalizer. While we're not optimizing it's not too bad of a problem but if an optimization deletes the G_SEXT then ISel will fail.

Hi Daniel,

Could you clarify what you mean here? The new legalizer info can define this with:
getActionDefinitionsBuilder(G_SELECT).clampScalar(1, s32, s32)
so I’m guessing you mean that code to mutate the G_SELECT is currently missing

Yes, LegalizerHelper::widenScalar widens only TypeIdx==0, it doesn’t do that for TypeIdx==1. Is it intentionally implemented this way?

b) Is the plan to sometimes let s1 as legal type and ignore it later?

I’m not sure what you mean here

For example lets look at AArch64 G_SELECT:
getActionDefinitionsBuilder(G_SELECT)
.legalFor({{s32, s1}, {s64, s1}, {p0, s1}})
.clampScalar(0, s32, s64)
.widenScalarToNextPow2(0);

In this case LLT of operand 1 (s1) in G_SELECT has size 1, and corresponding register class in selected instruction has size 32 (that is $src1 in AArch64::ANDSWri, it has GPR32 regsiter class).
For that reason s1 is practically ignored and silently converted to s32.
We could get similar result if we have:

getActionDefinitionsBuilder(G_SELECT)
.legalFor({{s32, s32}, {s64, s32}, {p0, s32}})
.clampScalar(0, s32, s64)
.clampScalar(1, s32, s32)
.widenScalarToNextPow2(0);

And in this case sizes of LLTs and register classes would be the same.
Implementation for G_ICMP on AArch64 is very similar to second described option for G_SELECT.
Is there a reason for different implementation of G_SELECT and G_ICMP on AArch64? Are there some general ruses to determine which of the two given options is better in which case?

Hi Daniel,

Could you clarify what you mean here? The new legalizer info can define this with:
getActionDefinitionsBuilder(G_SELECT).clampScalar(1, s32, s32)
so I’m guessing you mean that code to mutate the G_SELECT is currently missing

Yes, LegalizerHelper::widenScalar widens only TypeIdx==0, it doesn’t do that for TypeIdx==1. Is it intentionally implemented this way?

No, we just haven’t needed to implement the TypeIdx==1 case so far. Up until recently we didn’t have a means to test things that weren’t in use by a target. We decided it was better for the compiler to fail than to submit untested/untestable code so there’s quite a few gaps like this.

b) Is the plan to sometimes let s1 as legal type and ignore it later?

I’m not sure what you mean here

For example lets look at AArch64 G_SELECT:
getActionDefinitionsBuilder(G_SELECT)
.legalFor({{s32, s1}, {s64, s1}, {p0, s1}})
.clampScalar(0, s32, s64)
.widenScalarToNextPow2(0);

In this case LLT of operand 1 (s1) in G_SELECT has size 1, and corresponding register class in selected instruction has size 32 (that is $src1 in AArch64::ANDSWri, it has GPR32 regsiter class).
For that reason s1 is practically ignored and silently converted to s32.

Oh, I see what you mean. This is also a viable option but it requires that bits 1-31 don’t affect the results of legal operations no matter what value they are. This makes little difference to something like G_ADD, but things like G_LSHR (that one isn’t really an issue for s1, but applies for s2-s30) have to ensure they insert leading zeros even if the bits in the container are 1’s. Of course, you don’t have to declare the difficult operations to be legal and can let them widenScalar up to a type that’s easy to deal with.

For Mips, this approach is probably ok so long as G_TRUNC defines bits 1-31 and any other s1-producing operations that are legal maintain those bits to keep them at the value they would be if you had only just done the G_TRUNC. I expect you only want G_TRUNC/G_ANYEXT/G_SEXT/G_ZEXT, G_ICMP/G_FCMP, and G_OR/G_XOR/G_AND in which case it’s easy to achieve this.

We could get similar result if we have:

getActionDefinitionsBuilder(G_SELECT)
.legalFor({{s32, s32}, {s64, s32}, {p0, s32}})
.clampScalar(0, s32, s64)
.clampScalar(1, s32, s32)
.widenScalarToNextPow2(0);

And in this case sizes of LLTs and register classes would be the same.
Implementation for G_ICMP on AArch64 is very similar to second described option for G_SELECT.
Is there a reason for different implementation of G_SELECT and G_ICMP on AArch64? Are there some general ruses to determine which of the two given options is better in which case?

At the moment, it largely comes down to preference. I don’t have any real evidence to back this up but I expect the latter will be the better optimized in the long run since maintaining the excess bits requires instructions and these are explicit in the GMIR. This means we can optimize it and decline to do it whenever it doesn’t matter. For example:
%2(s2) = G_ADD %0(s2), %1(s2)
%4(s2) = G_ADD %2(s2), %3(s2)
%5(s2) = G_LSHR %4(s2), i2 1
has to either preserve bits 2-31 between each operation just in case one that cares about it comes along, or define bits 2-31 just before an operation that cares about it even if they’re already correct. This costs six and instructions. Whereas:

%3(s32) = G_AND %0(s2), i32 3
%4(s32) = G_AND %1(s2), i32 3
%5(s32) = G_AND %2(s2), i32 3

%6(s32) = G_ADD %3(s32), %4(s32)
%7(s32) = G_AND %6(s32), i32 3

%8(s32) = G_ADD %5(s32), %7(s32)
%9(s32) = G_AND %8(s32), i32 3
%10(s32) = G_LSHR %9(s32), i2 1
%11(s32) = G_AND %9(s32), i32 3
can optimize to:

%6(s32) = G_ADD %0(s32), %1(s32)
%8(s32) = G_ADD %2(s32), %6(s32)

%10(s32) = G_LSHR %8(s32), i2 1
%11(s32) = G_AND %9(s32), i32 1
which costs only one and instruction.

Thanks Daniel,

Then, for now, I will widen s1 to size of register class in instruction “to be selected”.