Branch is not optimized because of right shift

Hi everyone,

In a twitch chat someone redirected me to an example that is not optimized: https://godbolt.org/z/BL-4jL

I included the original source code and this is after -O2. We both thought that the -8 branch could be optimized out. I added a nuw in the subtraction and it actually does it.

Any thoughts on why that doesn’t happen already?

Best,
Stefanos Baziotis

I'm not sure that I've captured what you mean here, but you can explore these sorts of things using Alive2:

   http://volta.cs.utah.edu:8080/z/b7Yj7j

John

Hi John,

I hadn’t seen alive2, amazing program, thank you! Keep up the good work.

I’m not sure that I’ve captured what you mean here

To be more specific for everyone:

  • First of all, I hope it’s clear that in the original (C) code, the region - 0x8 > 1000 branch should
    be eliminated. That is because it is inside a block that has ensured that 8 < region < 12. But for some reason,
    it is not eliminated. I’m trying to find that reason. :slight_smile:
  • I understand that in the -O2 .ll output, we take %0 and do it a right shift. That means that we don’t have
    any range info about %0 or %0 >>= 1, so we can’t eliminate the branch. What I don’t understand
    is why we chose to reuse %0 and do it a right shift and instead we didn’t have a phi there.
  • Finally, I saw that putting nuw in the addition with -8 eliminates the branch. This makes sense since,
    we know probably know from the code above that %0 >>= 1 is less than 12. Although, I don’t understand
    why the right shift was translated to an add with -16.

I hope this made some sense. You may ignore the last 2 and focus on the first, i.e. what optimization
should have been done but it’s not and what we can do about it.

A clearer version of the .ll: https://godbolt.org/z/2t4RU5

Best,
Stefanos

p.s. John I had sent you an e-mail regarding loop rotation some time ago. I’m not sure if it was intentional
on your part to be skipped, if so, sorry to bother you. :slight_smile: But there’s the alternative that it was just lost.

Στις Κυρ, 5 Απρ 2020 στις 7:12 π.μ., ο/η John Regehr via llvm-dev <llvm-dev@lists.llvm.org> έγραψε:

Hi,

To be more specific for everyone:
- First of all, I hope it's clear that in the original (C) code, the region - 0x8 > 1000 branch should
be eliminated. That is because it is inside a block that has ensured that 8 < region < 12. But for some reason,
it is not eliminated. I'm trying to find that reason. :slight_smile:
- I understand that in the -O2 .ll output, we take %0 and do it a right shift. That means that we don't have
any range info about %0 or %0 >>= 1, so we can't eliminate the branch. What I don't understand
is why we chose to reuse %0 and do it a right shift and instead we didn't have a phi there.
- Finally, I saw that putting nuw in the addition with -8 eliminates the branch. This makes sense since,
we know probably know from the code above that %0 >>= 1 is less than 12. Although, I don't understand
why the right shift was translated to an add with -16.

I hope this made some sense. You may ignore the last 2 and focus on the first, i.e. what optimization
should have been done but it's not and what we can do about it.

A clearer version of the .ll: https://godbolt.org/z/2t4RU5

I think the IR in both of your examples makes things harder for the compiler than expected from the original C source.

With the IR in your original example (https://godbolt.org/z/BL-4jL), I think the problem is that the branch condition is '%0 - 16 < 12’, which allows us to limit the range of `%0 - 16` easily, but in the branch only %0 is used. Sinking the lshr too early made the analysis harder.

In the second example (https://godbolt.org/z/2t4RU5), things are hard to simplify because we need to use the information from the condition of one select to simplify the earlier select.

The version in https://godbolt.org/z/_ipKhb is probably the easiest for analysis (basically the original C source code built with `clang -O0 -S -emit-llvm`, followed by running `opt -mem2reg`). There’s a patch under review that adds support for conditional range propagation (https://reviews.llvm.org/D76611) and with the patch it can be simplified to the code below by running `opt -ipsccp -simplifycfg -instcombine`

define i32 @test(i32 %0) {
  %.off = add i32 %0, -16
  %2 = icmp ult i32 %.off, 12
  %spec.select = select i1 %2, i32 66, i32 45
  ret i32 %spec.select
}

The reason it does not yet work in the default pipeline is that the branch condition will be combined to use an AND before the conditional propagation and D76611 does not support that yet. But once it lands that should be an easy extension.

The problems with your two versions could be addressed as well of course for that special case relatively easily I think, but the challenge is to fix it in a general and efficient (compile-time wise) way. I hope the conditional propagation should cover such cases soon though.

Cheers,
Florian

Hi,

I think the IR in both of your examples makes things harder for the compiler than expected from the original C source.

Note that both versions are from clang with -O2. The first is with version 9.0 and the second is with the trunk.

but in the branch only %0 is used. Sinking the lshr too early made the analysis harder.

Yes, exactly! That’s what I figured too.

The version in https://godbolt.org/z/_ipKhb is probably the easiest for analysis (basically the original C source code built with clang -O0 -S -emit-llvm, followed by running opt -mem2reg). There’s a patch under review that adds support for conditional range propagation (https://reviews.llvm.org/D76611) and with the patch it can be simplified to the code below by running opt -ipsccp -simplifycfg -instcombine

define i32 @test(i32 %0) {
%.off = add i32 %0, -16
%2 = icmp ult i32 %.off, 12
%spec.select = select i1 %2, i32 66, i32 45
ret i32 %spec.select
}

The reason it does not yet work in the default pipeline is that the branch condition will be combined to use an AND before the conditional propagation and D76611 does not support that yet. But once it lands that should be an easy extension.

The problems with your two versions could be addressed as well of course for that special case relatively easily I think, but the challenge is to fix it in a general and efficient (compile-time wise) way. I hope the conditional propagation should cover such cases soon though.

Ok, I checked the patch, best of luck! I see how it does the transformation, it’s interesting. I think that https://reviews.llvm.org/rGe9963034314edf49a12ea5e29f694d8f9f52734a could maybe help in such things.
Any idea about how the compiler could remove the lshr and use a add -16?

Best,
Stefanos Baziotis

Στις Κυρ, 5 Απρ 2020 στις 10:06 μ.μ., ο/η Florian Hahn <florian_hahn@apple.com> έγραψε:

Any idea about how the compiler could remove the lshr and use a add -16?
Actually, I just figured that doing this test is like solving this:

8 <= x/2 <= 13
16 <= x <= 26
0 <= x - 16 <= 10 => 0 <= x < 11
The left part is know since it’s unsigned
The right part could be done x <= 11 => x < 12 because it’s actually an integer division.
Wow… I would be really happy to know what pass does that.

Best,

  • Stefanos

Στις Δευ, 6 Απρ 2020 στις 12:00 π.μ., ο/η Stefanos Baziotis <stefanos.baziotis@gmail.com> έγραψε:

Typo: The last 3 x’s are x-16 of course.

Στις Δευ, 6 Απρ 2020 στις 12:20 π.μ., ο/η Stefanos Baziotis <stefanos.baziotis@gmail.com> έγραψε:

I’d guess a combination of instcombine rules together with some other transforms. You could probably use `-print-after-all` (`clang -mllvm -print-after-all` if you are using clang) to track down the relevant passes/steps.

Thanks, I didn’t know that! Indeed, it’s instruction combine that does the job.

  • Stefanos

Στις Δευ, 6 Απρ 2020 στις 12:38 π.μ., ο/η Florian Hahn <florian_hahn@apple.com> έγραψε:

Adding a nuw to the add -8 is incorrect. From the perspective of the unsigned math, -8 is treated a very large positive number. The input to the add is [8,13) and adding a large positive number to it wraps around past 0. So that is guaranteed unsigned wrap. On the other hand, a sub nuw 8 would be correct.

Hi Craig,

Adding a nuw to the add -8 is incorrect.
Yeah, I didn’t mean to say it was correct. It was just an observation that with nuw the optimization was happened and I asked if someone thought it was somehow connected.

From the perspective of the unsigned math, -8 is treated a very large positive number. The input to the add is [8,13) and adding a large positive number to it wraps around past 0. So that is guaranteed unsigned wrap
I understand yes, but I don’t think it is guaranteed. Unless I miss something, for values in [0, 7] it won’t wrap. But past that and up to (and including in the original source code) 13, it will wrap yes.

Best,

  • Stefanos

Στις Δευ, 6 Απρ 2020 στις 3:10 π.μ., ο/η Craig Topper <craig.topper@gmail.com> έγραψε:

Hi Craig,

Adding a nuw to the add -8 is incorrect.
Yeah, I didn’t mean to say it was correct. It was just an observation that with nuw the optimization was happened and I asked if someone thought it was somehow connected.

From the perspective of the unsigned math, -8 is treated a very large positive number. The input to the add is [8,13) and adding a large positive number to it wraps around past 0. So that is guaranteed unsigned wrap
I understand yes, but I don’t think it is guaranteed. Unless I miss something, for values in [0, 7] it won’t wrap. But past that and up to (and including in the original source code) 13, it will wrap yes.

But the value can’t be [0,7] due to the earlier branch. When I said it was guaranteed to wrap, I only meant for the range of values that were possible after the first branch.

In theory, the CorrelatedValuePropagation pass should have been able to optimize the select. It was able to see that the input to add was in the range [8,14) in the call to LVI->getConstantRange in processBinOp. processCmp skips calling LVI for the select’s icmp because the input isn’t in the same basic block and isn’t a phi. And the call to LVI->getConstant for the select in processSelect didn’t return a constant. I think this is because the code executed for getConstant doesn’t handle icmp even when it can prove the input is in a constant range. I tried removing the local value check in processCmp so that getPredicateAt would called, but that didn’t help either.

Hi,

But the value can’t be [0,7] due to the earlier branch. When I said it was guaranteed to wrap, I only meant for the range of values that were possible after the first branch.
Indeed, for some reason I remembered that the left check is >= 0 when I wrote this.

Thanks a lot for the breakdown in CorrelatedValuePropagation. I had no idea about this pass.

It was able to see that the input to add was in the range [8,14) in the call to LVI->getConstantRange in processBinOp.

I could not reproduce that. I get:
BinOp: %4 = add nsw i32 %2, -8
LRange: [0,-2147483648)
RRange: [-8,-7)

processCmp skips calling LVI for the select’s icmp because the input isn’t in the same basic block and isn’t a phi.
Do you maybe mean: is in the same BB (and isn’t a PHI).

By the way, I don’t get the reasoning in the comment above:
// As a policy choice, we choose not to waste compile time on anything where
// the comparison is testing local values.

I think this is because the code executed for getConstant doesn’t handle icmp even when it can prove the input is in a constant range.

Maybe we ended up on the same thing. I’m not sure I followed that correctly but getValueFromICmpCondition() should have been able to handle that.

Best,
Stefanos

Στις Δευ, 6 Απρ 2020 στις 5:22 π.μ., ο/η Craig Topper <craig.topper@gmail.com> έγραψε:

Hi,

I think this should be optimized as expected on latest trunk: https://godbolt.org/z/vW7K7f

Cheers

Hi Florian,

Great, thanks for your effort!

  • Stefanos

Στις Σάβ, 11 Ιουλ 2020 στις 11:13 π.μ., ο/η Florian Hahn <florian_hahn@apple.com> έγραψε: