Simplifying selects + arm stuff

<cc'ing llvmdev>

Hi Chris, in commit 12800 you introduced the following transform:

Implement select.ll:test12*

This transforms code like this:

    %C = or %A, %B
    %D = select %cond, %C, %A
into:
    %C = select %cond, %B, 0
    %D = or %A, %C

Since B is often a constant, the select can often be eliminated. In any case,
this reduces the usage count of A, allowing subsequent optimizations to
happen.

Right, I still think this is goodness.

Now consider the following example:

%z = and i32 %x, %y
%s = select i1 %cond, i32 %y, i32 %z
%r = and i32 %x, %s

Here %r could be simplified to
%r = and i32 %x, %y
(i.e. the same as %z) so replaced with %z everywhere.

We even have logic to do this (ThreadBinOpOverSelect, which I just added) in
InstructionSimplify. The nice thing about InstructionSimplify is that it
doesn't ever create any new instructions, so is pretty much always a win.
The problem is that the logic never fires: your instcombine logic gets there
first, and transforms the example into

%z = select i1 %cond, i32 -1, i32 %x
%s = and i32 %z, %y
%r = and i32 %s, %x

which is not nearly as good.

Yeah, that's not nearly as good as "%r = and i32 %x, %y" :slight_smile:

Note that:

%z = select i1 %cond, i32 -1, i32 %x
%s = and i32 %z, %y

Is a "conditional and". It would be interesting to know if the ARM backend gets this as a single predicated 'and' instruction (similarly for the 'or' and 'xor' version of these patterns). I bet not, which is bad if instcombine is canonicalizing this way.

Do you have any thoughts for how to get the best of both worlds? How about
only applying your transform when one of the operands (eg B) is a constant?

I don't think that's a great solution, because reducing # uses of a value is general goodness (increases odds that "hasOneUse()" xforms will kick in for example).

Alternatively, it could look at uses of the select and try to simplify those
first using InstructionSimplify (not sure if instcombine is allowed to
simplify a different instruction to the one it is visiting), or check to see
if some or all uses of the select would be simplified by InstructionSimplify,
and not perform the transform if so.

Instead of turning this into a phase ordering issue, I'd rather increase the power of the folding logic to catch the general form. In this case, instead of handling this in instsimplify, why not handle this in reassociate?

This is conceptually no different than (x & y & x) -> (x & y), it's just that in this case you have: (x & y & (x or -1)) which should fold to (x & y) just the same.

-Chris

Hi Chris,

We even have logic to do this (ThreadBinOpOverSelect, which I just added) in
InstructionSimplify. The nice thing about InstructionSimplify is that it
doesn't ever create any new instructions, so is pretty much always a win.
The problem is that the logic never fires: your instcombine logic gets there
first, and transforms the example into

  %z = select i1 %cond, i32 -1, i32 %x
  %s = and i32 %z, %y
  %r = and i32 %s, %x

which is not nearly as good.

...

Instead of turning this into a phase ordering issue, I'd rather increase the power of the folding logic to catch the general form. In this case, instead of handling this in instsimplify, why not handle this in reassociate?

This is conceptually no different than (x& y& x) -> (x& y), it's just that in this case you have: (x& y& (x or -1)) which should fold to (x& y) just the same.

I actually came to the same conclusion, only since I didn't know about
the reassociate pass I planned to do this in instcombine :slight_smile: I will give
it a go.

Thanks for your help,

Duncan.

Hi Chris,

Instead of turning this into a phase ordering issue, I'd rather increase the power of the folding logic to catch the general form. In this case, instead of handling this in instsimplify, why not handle this in reassociate?

This is conceptually no different than (x& y& x) -> (x& y), it's just that in this case you have: (x& y& (x or -1)) which should fold to (x& y) just the same.

in fact this already works, at least for the given example:
  -instcombine -reassociate -instcombine results in:

original:
   %z = and i32 %x, %y
   %s = select i1 %cond, i32 %y, i32 %z
   %r = and i32 %x, %s

after first instcombine:
   %z = select i1 %cond, i32 -1, i32 %x
   %s = and i32 %z, %y
   %r = and i32 %s, %x

after reassociate:
   %z = select i1 %cond, i32 -1, i32 %x
   %s = and i32 %y, %x
   %r = and i32 %s, %z

after second instcombine:
   %s = and i32 %y, %x

This might be a coincidence, but I am happy for the moment :slight_smile:

Ciao,

Duncan.

Ok! If this "conditional and" idiom is common, it would still be worth teaching reassociate about it, so we don't run into phase order issues.

-Chris

Yeah, that’s not nearly as good as “%r = and i32 %x, %y” :slight_smile:

Note that:

%z = select i1 %cond, i32 -1, i32 %x
%s = and i32 %z, %y

Is a “conditional and”. It would be interesting to know if the ARM backend gets this as a single predicated ‘and’ instruction (similarly for the ‘or’ and ‘xor’ version of these patterns). I bet not, which is bad if instcombine is canonicalizing this way.

On ARM, an instruction predicated on false predicate is still executed so it’s frequently undesirable. Because llvm canonicalize to select instruction, it already generates significantly more predicated instructions than gcc. We have seen some regressions due to overly aggressive select formation.

By definition select requires both source operands to be evaluated. Given how good branch predicators are these days, I’m not surprised it often turns branching code performs better. ICC also almost never generates conditional moves.

Evan

That may be, but I can't imagine that:

$ cat t.ll
define i32 @test(i32 %a, i32 %b, i32 %x, i32 %y) nounwind {
%cond = icmp slt i32 %a, %b
%z = select i1 %cond, i32 -1, i32 %x
%s = and i32 %z, %y
ret i32 %s
}
$ llc t.ll -o - -march=arm
_test: @ @test
@ BB#0:
  cmp r0, r1
  mvn r12, #0
  movlt r2, r12
  and r0, r2, r3
  bx lr

is better than a cmp + conditional and + bx.

-Chris

%z = select i1 %cond, i32 -1, i32 %x
%s = and i32 %z, %y

Is a "conditional and". It would be interesting to know if the ARM backend gets this as a single predicated 'and' instruction (similarly for the 'or' and 'xor' version of these patterns). I bet not, which is bad if instcombine is canonicalizing this way.

On ARM, an instruction predicated on false predicate is still executed so it's frequently undesirable. Because llvm canonicalize to select instruction, it already generates significantly more predicated instructions than gcc. We have seen some regressions due to overly aggressive select formation.

By definition select requires both source operands to be evaluated. Given how good branch predicators are these days, I'm not surprised it often turns branching code performs better. ICC also almost never generates conditional moves.

That may be, but I can't imagine that:

$ cat t.ll
define i32 @test(i32 %a, i32 %b, i32 %x, i32 %y) nounwind {
%cond = icmp slt i32 %a, %b
%z = select i1 %cond, i32 -1, i32 %x
%s = and i32 %z, %y
ret i32 %s
}
$ llc t.ll -o - -march=arm
_test: @ @test
@ BB#0:
  cmp r0, r1
  mvn r12, #0
  movlt r2, r12
  and r0, r2, r3
  bx lr

is better than a cmp + conditional and + bx.

This should be

        cmp r0, r1
        movlt.w r2, #-1 @ or mvnlt r2, #0
        and.w r0, r2, r3
        bx lr

which we gets right in Thumb2 mode (I need to check why it's not matching in ARM mode). How can we use a conditional and here? The result is either (y & -1) or (y & x), the "and" is not conditional.

Evan

y&-1 == y. There is no need to materialize -1 as a constant.

-Chris

This should be

      cmp r0, r1
      movlt.w r2, #-1 @ or mvnlt r2, #0
      and.w r0, r2, r3
      bx lr

which we gets right in Thumb2 mode (I need to check why it's not matching in ARM mode). How can we use a conditional and here? The result is either (y & -1) or (y & x), the "and" is not conditional.

y&-1 == y. There is no need to materialize -1 as a constant.

Yes, *face palm*. That will require target specific dag combine.

Evan