RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization

We have seen a series of performance issues where we are emitting unnecessary masks on ARM64 that all have a similar form. Here is an example:

define zeroext i1 @test(i8 zeroext %theChar) align 2 {
    %theChar.off = add i8 %theChar, -97
    %0 = icmp ult i8 %theChar.off, 26
    br i1 %0, label %return, label %lor.lhs.false
    lor.lhs.false: ; preds = %entry
    ret i1 false
  return: ; preds = %entry
    ret i1 true

compiles to:

  .globl _test
  .align 2
_test: ; @test
; BB#0: ; %entry
  sub w8, w0, #97 ; =97
  and w8, w8, #0xff
  cmp w8, #26 ; =26
  cset w0, lo

but the “and w8, w8, 0xff” is superfluous and could be removed, yielding:

  .globl _test
  .align 2
_test: ; @test
; BB#0: ; %entry
  sub w8, w0, #97 ; =97
  cmp w8, #26 ; =26
  cset w0, lo

The AND is just a single extra instruction, but it is an additional data dependency and it sometimes shows up in the middle of things like string processing loops and several hot loops in SPEC, so we would really like to fix it.

After looking into it I have determined what is going on is that when an i8 or i16 is legalized up to an i32 the extra masks are introduced at intermediary steps of the calculation to preserve the same overflow behavior that would have occurred if there were native 8 or 16 bit data types on the processor. In some cases that is necessary, but in many cases (such as the one above) based on the type of compares involved and the provenance of the inputs it is possible to know that behavior is the same regardless of whether or not the AND instruction is there, and thus it is safe to remove it.

While at first glance it seems like it would be reasonable to handle this with a few DAG patterns in the TableGen files it turns out that is a less than satisfactory approach. Whether the AND is removable is constrained by the values of 2-3 of the inputs and the compare, which ends up with a combinatorial explosion of patterns, especially once one considers the inputs values may come from slightly different nodes such AssertZexts or extending loads (which we need to capture since we need to know the intended extension semantics). Attached is a file (AndDAG.pdf) that shows what I believe to be a canonical form of the sub-DAG we are looking at, the following details may vary:

1) The orange subtree code be a different form of extension, such a load<LD1[%arrayidx], anyext from i8> instead of an AssertZext
2) The blue and purple elements could be different constants
3) The green element (which is the condition code) could be different
4) It could all be feeding a different instruction (in particular an AArch64ISD::BRCOND) at the end.
5) The inputs code be i16 instead i8

Rather than try respond reactively to individual cases we see in bugs I would like to solve the entire problem. To that end I propose introducing an architecture specific DAG combine that recognizes this basic tree, and performs the and elimination if legal. Recognizing the DAG and recording the value is fairly straight forward, but determining when the and is legal is somewhat more complicated. Each case has taken me a few minutes, and that is just looking at constant values.

In order to speed of the process I wrote a small command line tool that runs over all inputs for both assembly patterns and generates 2 dimensional charts of the cases where the code is the same with and without the mask. Using these charts it is possible to write a series of equations to cover all the legal cases, not unlike a Carnaugh map. Since these equations need to work for inputs that are both i8 and i16 they have to be written symbolically in terms of the input types (max representable, min representable, positive, negative, 0, etc) in order to be generally applicable. Using that fact it is possible to reduce the test case down to i4’s using 0x0f, making the tables much more tractable. The tables run from -8 to +15 in both dimensions expressing all the 32 bit representations of zero and sign extended i4 values. I am attaching an example chart (ZeroExtendedLO.png), but I actually generate 24 of them (the al, nv, vc and vs condition codes are not relevant leaving us with 12 codes, and then we need 2 charts for each, one when the input is sign extended and when when it is zero extended).

Fox example, a set of covering equations for a zero extended variable using for the LO cond code is:

(AddConstant >= 0 && SubsConstant <= 0)

(AddConstant <= 0 && SubsConstant >= 0 && SubsConstant <= int_max)
(Subs2 >= Subs2 + uint_max))

Which would result in an equivalent chart to the one attached when using 4 bit integers, but representing it symbolically it also expands to 8 and 16 bit integers.

Attached is a WIP patch that I have made. It is incomplete in several ways, but I believe it should be enough to discuss whether or not this is reasonable path to pursue. In particular it:

1) It is not quite LLVM coding style
2) It only works against a AssertZext input
3) It only has equations for LO
4) It only matches on CSEL, not BRCOND et al
6) It pass the CC around as an unsigned
7) It makes also several simplifying assumptions that I believe are true but need to confirm:
        1) By the time the DAG gets to me it has been canonicalized such that it is of the form:
             (AArch64ISD::SUBS (and (+ variable constant1) mask) constant2)
         2) The mask is a 2^8-1 or 2^16-1 (though it is possible to extend this to arbitrary powers of 2 it doesn’t
             seem like any fronted would generate those)

If there is consensus this is a reasonable path to pursue then my intent is to clean up the patch fill in the other sets of equations, modifying the script I used to generate the tables to output JSON or something similar, and write a validator to ensure the equations 100% accurately match the generated output, then choose a suitable subset of test cases to include with the patch.


AndDAG.pdf (22 KB)

DAG-AND.patch (5.8 KB)

Hi Louis,

This is interesting. Moreover, it reminds me of the problem that I addressed in PPCTargetLowering::DAGCombineExtBoolTrunc and PPCTargetLowering::DAGCombineTruncBoolExt for i1 (and i32) <-> i64 extensions in the PPC backend. I think you could apply a similar methodology to finding larger clusters of instructions that can benefit from this kind of transformation.