Questions about jump threading optimization and what we can do

Since the bug report here: https://bugs.llvm.org/show_bug.cgi?id=44679 I’ve been thinking about cases like it, such as: https://godbolt.org/z/Fwq8mn

I wonder what we can do about this in a general sense. As far as I can tell, the jump threading algorithm is really conservative, which is one reason this isn’t working as well as I’d hope; however, we don’t want to produce irreducible control flow that the other passes would work less effectively on. Thoughts?

  • Karl

Since the bug report here: https://bugs.llvm.org/show_bug.cgi?id=44679 I've been thinking about cases like it, such as: Compiler Explorer

I wonder what we can do about this in a general sense. As far as I can tell, the jump threading algorithm is *really* conservative, which is one reason this isn't working as well as I'd hope; however, we don't want to produce irreducible control flow that the other passes would work less effectively on. Thoughts?

Just brainstorming, really, but what about running a less conservative
version of JumpThreading later, _after_ loop passes have run?

Cheers,
Nicolai

I'm confused. In your godbold link you run it with O0 which disables
almost all transformations. If we take the IR, remove the optnone
attribute, run mem2reg and jump-threading we get what I think is
reasonable:
  Compiler Explorer

Please correct me if I misunderstand anything here.

Cheers,
  Johannes

Holy crap, I completely missed that. I’m sorry! That’s my fault.

Here’s a better example. https://godbolt.org/z/fpTyFS
I don’t know what exactly you would need to change in JumpThreading to improve this, but I’m open to ideas.

How does the code you would like to have look like? I don't see a
relevant difference compared to gcc:

(clang unnecessarily introduces another temporary register, but that
seems unrelated)

Michael

Well ideally it’d do the same thing as before (with two if statements). Something like:
ret i64 4

It’s not really about needing to be as good as gcc, it’s more that I wonder why the same concept with just two if-else blocks gets optimized yet more doesn’t.

An interesting point to make is that I don’t think it’s jump threading optimizing the original example fully: https://godbolt.org/z/rGvHJk

-Karl

I am not convinced this is a jump-threading issue, but in the domain
of ScalarEvolution. ScEv would need to find out which of the exits are
taken or whether it loops infinitely. One can see that it exits after
10000 iterations, but ScalarEvolution cannot tell:

Michael

Ah, I see. I think there’s something else going on here too, though. https://godbolt.org/z/dCqdvv is optimized away but ScEv doesn’t know whether this loop terminates either.

Wait, you used the same example as I did. I’m confused then; if ScEv is having troubles but it still gets optimized away somewhere, what do you think is doing it?

I assumed the LLVM-IR behind the godbolt link represented the C code
you used before.

SimplifyCFG actually helps the loop being removed:

But this doesn't even involve JumpThreading.

Michael

Hm. I assumed that JumpThreading would be the primary factor in optimizing code like this. Guess not. I’ll need to look into SimplifyCFG to see what prevents it from doing the same thing to the other loop: https://godbolt.org/z/F6NjdG

I think SimplifyCFG could be abusing the undef value for this. If num is defined before the loop, it no longer becomes optimized: https://godbolt.org/z/woHRQf

Your code and the one you have seen in
https://bugs.llvm.org/show_bug.cgi?id=44679 are structurally
different. In that latter, setting n to a constant value causes a
branch taken in the next iteration. Your's is a more classic for-loop
from 0 to 2. It is ScalarEvolution's job to find this out. SimplyCFG
job is just to normalize the IR into a form understood by
ScalarEvolution. That is, ScalarEvolution is the primary engine
enabling this.

Michael

Ok, thanks for the clarification. So looking into ScalarEvolution is what I’d look into if I wanted to optimize this a bit more? I’m mainly interested in how I’d go about “de-looping” this.

Ideally, with only 2 loop iteration with different code being executed
per execution, it is fully unrolled (-loop-unroll). But for this to
happen, LoopUnroll needs to know the iteration count from
ScalarEvolution. Your code is structured a bit differently than the
typical for-loop, making it a bit more difficult. Canonicalization
(-simplifycfg -loop-rotate -loop-simplify -indvarsimplify) should make
it recognizable.

Michael