[InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass

Hi,

We recently found a testcase showing that simplifications in
instcombine sometimes change the instruction without reducing the
instruction cost, but causing problems in TwoAddressInstruction pass.
And it looks like the problem is generic and other simplification may
have the same issue. I want to get some ideas about what is the best
way to fix such kind of problem.

The testcase:
----------------------------- a.ll ----------------------------------
@a = global i64 0, align 8
@b = global i32 0, align 8

define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr {
entry:
  %t1 = load i16, i16* %arrayidx, align 2
  %conv.i = zext i16 %t1 to i32
  %and.i = and i32 %conv.i, 255
  %and7.i = and i32 %conv.i, 1792
  %t3 = zext i32 %and7.i to i64
  %t4 = load i64, i64* @a, align 8
  %add.i = add i64 %t4, %t3
  %cmp1 = icmp eq i64 %add.i, 1
  %cmp2 = icmp ult i32 %and.i, 101
  %bool = and i1 %cmp1, %cmp2
  br i1 %bool, label %if.then, label %if.else, !prof !0

if.then: ; preds = %entry
  %r1 = trunc i64 %add.i to i32
  br label %return

if.else: ; preds = %entry
  %r2 = and i32 %and.i, 31
  store i32 %and.i, i32* @b, align 8
  br label %return

return: ; preds = %if.else, %if.then
  %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
  ret i32 %ret
}

!0 = !{!"branch_weights", i32 2000, i32 1}

So to write this in a more condensed form, you have:

%v0 = ...
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = ...
%v1 = and %v0, 255
%v2 = and %v0, 31
...

This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to situations where you can prove there aren't multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users
will match the same pattern).

- Matthias

So to write this in a more condensed form, you have:

%v0 = ...
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = ...
%v1 = and %v0, 255
%v2 = and %v0, 31
...

This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to situations where you can prove there aren't multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users
will match the same pattern).

- Matthias

Thanks Matthias, yes, the transformation in instcombine is of non-root
node pattern. What I am worried about to add the restriction that %v1
has to have only one use is that such change may regress some
benchmarks. For the same case, after we add the limitation, we will
save a move in BB1, but may add another move in BB2, because %v2 and
%v1 are both alive after %v2 = and %v1, 31. Which choice is better
depends on the hotness of BB1 and BB2, so the result of such
transformation in instcombine is like flipping a coin. Currently we
don't have a pass to use BFI to do such transformation in a meaningful
way. Maybe we need such a pass, and with that pass, we can disable the
transformations in instcombine without regressing certain benchmarks.

BB1:

%v0 = ...
%v1 = and %v0, 255

BB2:

%v2 = and %v1, 31
use %v1
use %v2

Wei.

So to write this in a more condensed form, you have:

%v0 = …
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = …
%v1 = and %v0, 255
%v2 = and %v0, 31

This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to situations where you can prove there aren’t multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users
will match the same pattern).

  • Matthias

Thanks Matthias, yes, the transformation in instcombine is of non-root
node pattern. What I am worried about to add the restriction that %v1
has to have only one use is that such change may regress some
benchmarks. For the same case, after we add the limitation, we will
save a move in BB1, but may add another move in BB2, because %v2 and
%v1 are both alive after %v2 = and %v1, 31. Which choice is better
depends on the hotness of BB1 and BB2, so the result of such
transformation in instcombine is like flipping a coin. Currently we
don’t have a pass to use BFI to do such transformation in a meaningful
way. Maybe we need such a pass, and with that pass, we can disable the
transformations in instcombine without regressing certain benchmarks.

Yes this strategy will regress some situations and also goes against the LLVM spirit of perform as much normalization as possible in the middle end.

On the other hand eagerly transform all such opportunities has more potential to regress things than to improve things in my experience, especially since we currently have no way of reverting this transformation. So at the very least changing the strategy would be a very interesting experiment and at least in another compiler I worked on we saw a huge number of improvements and nearly no regressions when switching to this strategy of not transforming in the presence of multiple users on a non-root node.

  • Matthias

So to write this in a more condensed form, you have:

%v0 = …
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = …
%v1 = and %v0, 255
%v2 = and %v0, 31

This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to situations where you can prove there aren’t multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users
will match the same pattern).

  • Matthias

Thanks Matthias, yes, the transformation in instcombine is of non-root
node pattern. What I am worried about to add the restriction that %v1
has to have only one use is that such change may regress some
benchmarks. For the same case, after we add the limitation, we will
save a move in BB1, but may add another move in BB2, because %v2 and
%v1 are both alive after %v2 = and %v1, 31. Which choice is better
depends on the hotness of BB1 and BB2, so the result of such
transformation in instcombine is like flipping a coin. Currently we
don’t have a pass to use BFI to do such transformation in a meaningful
way. Maybe we need such a pass, and with that pass, we can disable the
transformations in instcombine without regressing certain benchmarks.

Yes this strategy will regress some situations and also goes against the LLVM spirit of perform as much normalization as possible in the middle end.

On the other hand eagerly transform all such opportunities has more potential to regress things than to improve things in my experience, especially since we currently have no way of reverting this transformation.

There are many cases where we have no way of reverting this transformation, but it seems like this is not one of those cases.

It seems like we could, potentially very late (not sure how late, maybe in remat itself? maybe earlier…), recognize that we can reduce the live value set by rewriting %v2 in terms of %v1 (in your reduced example). Even something as simple as demanded-bits should be able to tell that %v0 and %v1 are interchangeable.

But maybe there is something about this live-set-reduction transform that is hard? I’ve not thought about it much.

So at the very least changing the strategy would be a very interesting experiment and at least in another compiler I worked on we saw a huge number of improvements and nearly no regressions when switching to this strategy of not transforming in the presence of multiple users on a non-root node.

None of the above would invalidate the utility of this experiment of course.

I would guess we want to restrict the experiment to combines where the rewrite doesn’t substantially improve our ability to reason about the computed value (IE, something we can see through transparently like an and seems like a no-brainer).

My guess at why this would hurt LLVM, if it does, is due to the recursion depth limit and the degree to which we rely on lazy “up” walks with that depth limit rather than forward-propagation.

So to write this in a more condensed form, you have:

%v0 = ...
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = ...
%v1 = and %v0, 255
%v2 = and %v0, 31
...

This is a classical problem with instruction combining. When you replace a
non-root node of a pattern (in this case %v2 would be the root pattern we
match, but we actually replace %v1 with %v0 as part of the replacement) and
that non-root node has multiple users then it becomes very hard to predict
whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to
situations where you can prove there aren't multiple users on the
intermediate/non-root node that we replace. (Or if you want to get fancier
that all users
will match the same pattern).

I agree. The propagation happens in the hope that v1 has only one use so
that '%v1 = ..' can be later eliminated, or that v1 has only two uses such
that the propagation can shrink the live range of %v1, saving one move
instruction for the other use of %v1. In reality, this is like flip a
coin and likely to hurt performance on x86.

Avoid doing this one the fly, but in a separate pass might be the way to go.

David

So to write this in a more condensed form, you have:

%v0 = …
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = …
%v1 = and %v0, 255
%v2 = and %v0, 31

This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to situations where you can prove there aren’t multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users
will match the same pattern).

  • Matthias

Thanks Matthias, yes, the transformation in instcombine is of non-root
node pattern. What I am worried about to add the restriction that %v1
has to have only one use is that such change may regress some
benchmarks. For the same case, after we add the limitation, we will
save a move in BB1, but may add another move in BB2, because %v2 and
%v1 are both alive after %v2 = and %v1, 31. Which choice is better
depends on the hotness of BB1 and BB2, so the result of such
transformation in instcombine is like flipping a coin. Currently we
don’t have a pass to use BFI to do such transformation in a meaningful
way. Maybe we need such a pass, and with that pass, we can disable the
transformations in instcombine without regressing certain benchmarks.

Yes this strategy will regress some situations and also goes against the LLVM spirit of perform as much normalization as possible in the middle end.

On the other hand eagerly transform all such opportunities has more potential to regress things than to improve things in my experience, especially since we currently have no way of reverting this transformation.

There are many cases where we have no way of reverting this transformation, but it seems like this is not one of those cases.

It seems like we could, potentially very late (not sure how late, maybe in remat itself? maybe earlier…), recognize that we can reduce the live value set by rewriting %v2 in terms of %v1 (in your reduced example). Even something as simple as demanded-bits should be able to tell that %v0 and %v1 are interchangeable.

This is also a mostly neutral example. I don’t know

But maybe there is something about this live-set-reduction transform that is hard? I’ve not thought about it much.

It certainly seems feasible and would be a cool project. There is some space here, where we get less register pressure but also less ILP when serializing here.

  • Matthias

This is also a mostly neutral example. I don’t know

Pressing send too fast, ignore this half sentence :slight_smile:

So to write this in a more condensed form, you have:

%v0 = ...
%v1 = and %v0, 255
%v2 = and %v1, 31
use %v1
use %v2

and transform this to
%v0 = ...
%v1 = and %v0, 255
%v2 = and %v0, 31
...

This is a classical problem with instruction combining. When you replace a
non-root node of a pattern (in this case %v2 would be the root pattern we
match, but we actually replace %v1 with %v0 as part of the replacement) and
that non-root node has multiple users then it becomes very hard to predict
whether the transformations pays off in the end.

In my experience it is best to restrict those sort of transformation to
situations where you can prove there aren't multiple users on the
intermediate/non-root node that we replace. (Or if you want to get fancier
that all users
will match the same pattern).

- Matthias

Thanks Matthias, yes, the transformation in instcombine is of non-root
node pattern. What I am worried about to add the restriction that %v1
has to have only one use is that such change may regress some
benchmarks. For the same case, after we add the limitation, we will
save a move in BB1, but may add another move in BB2, because %v2 and
%v1 are both alive after %v2 = and %v1, 31. Which choice is better
depends on the hotness of BB1 and BB2, so the result of such
transformation in instcombine is like flipping a coin. Currently we
don't have a pass to use BFI to do such transformation in a meaningful
way. Maybe we need such a pass, and with that pass, we can disable the
transformations in instcombine without regressing certain benchmarks.

Yes this strategy will regress some situations and also goes against the
LLVM spirit of perform as much normalization as possible in the middle end.

On the other hand eagerly transform all such opportunities has more
potential to regress things than to improve things in my experience,
especially since we currently have no way of reverting this transformation.
So at the very least changing the strategy would be a very interesting
experiment and at least in another compiler I worked on we saw a huge number
of improvements and nearly no regressions when switching to this strategy of
not transforming in the presence of multiple users on a non-root node.

- Matthias

Thanks for sharing the experience! I will try the same and see how the
performance looks like.

Wei.