CFG normalization: avoiding `br i1 false`

Hi,

I was looking at Rust programs which are poorly optimized by LLVM, and
one occasional factor is that LLVM allows for long-lived `br i1 false`
instructions.

The problem is that if a propagation pass discovers the condition of a
branch, the branch itself will not be eliminated until a SimplifyCFG
pass is reached, of which there are few in the pipeline.

One example is https://github.com/rust-lang/rust/issues/44041 - the
branch conditions are found by loop unrolling, but the `br i1 false`
instruction leaks to codegen.

Another example is https://github.com/rust-lang/rust/issues/45466.

In that case, induction variable simplification discovers that an
integer overflow check is unnecessary, but the remaining `br i1 false`
still blocks the following LoopIdiomRecognize pass and prevents the
loop from being optimized to a memset.
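For concreteness, here is a C++ analogue (my own sketch, not the code from the Rust issue) of the kind of zero-fill loop LoopIdiomRecognize turns into a memset - but only once the provably-false overflow-check branch has actually been deleted:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A plain zero-fill loop. After induction variable simplification
// proves the overflow/bounds check false, LoopIdiomRecognize can
// replace the whole loop with a single memset - unless the dead
// `br i1 false` is still sitting in the loop and blocks the idiom.
void zero_fill(std::vector<unsigned char> &buf) {
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = 0;
}
```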

While it is possible to have point-fixes for both problems, this may
be a widespread enough problem that a more general solution would be
better - for example, to have some sort of canonicalization that
automatically removes these branches in some situations (this requires
things such as the dominator tree to be rebuilt, so it shouldn't be
done literally everywhere, but it can be done fairly often).
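To make the idea concrete, here is a minimal sketch of such a canonicalization on a toy CFG of my own invention (this is not LLVM's representation, and the names are hypothetical):

```cpp
#include <cassert>
#include <vector>

// Toy basic block: if cond_known is set, some propagation pass has
// proven the branch condition to be the constant cond_value.
struct Block {
    bool cond_known = false;
    bool cond_value = false;
    Block *true_succ = nullptr;   // taken when the condition is true
    Block *false_succ = nullptr;  // taken when the condition is false
};

// Fold conditional branches with a known-constant condition into
// unconditional branches; returns the number of branches folded.
// Afterwards, analyses such as the dominator tree must be rebuilt or
// incrementally updated - which is why this can't be done everywhere.
int foldConstantBranches(std::vector<Block *> &blocks) {
    int folded = 0;
    for (Block *b : blocks) {
        if (!b->cond_known || !b->false_succ)
            continue;
        // Keep only the successor the constant condition selects.
        Block *live = b->cond_value ? b->true_succ : b->false_succ;
        b->true_succ = live;
        b->false_succ = nullptr;  // now an unconditional branch
        b->cond_known = false;
        ++folded;
    }
    return folded;
}
```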

What are your opinions?

Thanks,
- Ariel

Hi,

I was looking at Rust programs which are poorly optimized by LLVM, and
one occasional factor is that LLVM allows for long-lived `br i1 false`
instructions.

The problem is that if a propagation pass discovers the condition of a
branch, the branch itself will not be eliminated until a SimplifyCFG
pass is reached, of which there are few in the pipeline.

One example is https://github.com/rust-lang/rust/issues/44041 - the
branch conditions are found by loop unrolling, but the `br i1 false`
instruction leaks to codegen.

Another example is https://github.com/rust-lang/rust/issues/45466.

In that case, induction variable simplification discovers that an
integer overflow check is unnecessary, but the remaining `br i1 false`
still blocks the following LoopIdiomRecognize pass and prevents the
loop from being optimized to a memset.

SimplifyCFG is the correct pass for this kind of transformation. I
don't think it's entirely unreasonable to run it more often, but the
compile-time cost needs to be taken into account.

I'm not necessarily sympathetic to the idea of adding another canonicalization
pass only for this purpose.

Side note:
IMHO, at this point SimplifyCFG is doing even more than it should
(including, e.g., sinking of variables out of "almost empty" BBs, i.e.
BBs which have only a single instruction & terminator, if I recall
correctly). If anything, I'd rather split the sinking logic and other
operations into a different pass than move the simplification of
proven unconditional branches elsewhere :)

Also, FWIW, there has been a recent effort to split SimplifyCFG into
multiple variants (with several cl::opt flags associated to enable
specific transformations).

While it is possible to have point-fixes for both problems, this may
be a widespread enough problem that a more general solution would be
better - for example, to have some sort of canonicalization that
automatically removes these branches in some situations (this requires
things such as the dominator tree to be rebuilt, so it shouldn't be
done literally everywhere, but it can be done fairly often).

Now that LLVM has an incremental dominator tree API, it should be more
feasible to run SimplifyCFG more often without recomputing the
dominator tree every time. I haven't checked whether there are other
expensive analyses that need to be preserved.

Thanks,

I'm not necessarily sympathetic to the idea of adding another canonicalization pass only for this purpose.

The problem is that, as you said, SimplifyCFG does all sorts of stuff,
and I suspect it is not the fastest pass in the world.

Also, in the case that annoys me, there is an LCSSA pass in the
middle, and I suspect it would be better to redo the LCSSA etc.
transforms only if changes were actually made - and I suspect no
changes is the most common case.

I'm not necessarily sympathetic to the idea of adding another canonicalization pass only for this purpose.

The problem is that, as you said, SimplifyCFG does all sorts of stuff,
and I suspect it is not the fastest pass in the world.

I've not seen it show up in a profile myself, although I'm not sure
whether that will stay true in the future.
Sinking, e.g., could be moved into a separate pass (in fact, there's
one already, disabled by default).

Also, in the case that annoys me, there is an LCSSA pass in the
middle, and I suspect it would be better to redo the LCSSA etc.
transforms only if changes were actually made - and I suspect no
changes is the most common case.

LCSSA should be linear. Currently it's not, because it uses the SSA
updater, which computes dominance frontiers and can therefore be
O(N^3). The alternative SSA updater proposed solves this problem using
the novel algorithm from Braun et al. linked in the review:
https://reviews.llvm.org/D28934
It still has the problem that it leaves redundant PHIs around at the
beginning of the block, e.g.

%patatino1 = phi i32 [ %tinky, %somebb ], [ %winky, %someotherbb ]
%patatino2 = phi i32 [ %tinky, %somebb ], [ %winky, %someotherbb ]

The current SSA updater pays a (sometimes high) cost to perform this
deduplication/removal (it actually checks whether an identical PHI is
already there, and only inserts a new one if not).
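A toy sketch of that check-before-insert step (my own simplification; the real updater works on llvm::PHINode, and the names here are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A PHI here is reduced to its (incoming value, incoming block) pairs.
using Phi = std::vector<std::pair<std::string, std::string>>;

// Inserts `phi` into `block` unless an identical PHI already exists;
// returns the index of the PHI callers should use. This avoids
// leaving redundant duplicates like %patatino1/%patatino2 behind,
// at the cost of a linear scan over the block's existing PHIs.
std::size_t insertOrReuse(std::vector<Phi> &block, const Phi &phi) {
    for (std::size_t i = 0; i < block.size(); ++i)
        if (block[i] == phi)
            return i;  // reuse the existing, identical PHI
    block.push_back(phi);
    return block.size() - 1;
}
```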

There's already a LoopSimplifyCFG which is a loop-pass (and thus can iterate with other loop passes) and eliminates trivial branches. Having a simple pass which just strips unreachable blocks and converts conditional branches to unconditional ones while updating the appropriate analyses (LoopInfo, DomTree, LCSSA, etc.) seems very reasonable. This could also be a utility function called from the end of other passes. The hard bit is the analysis preservation. A good place to copy from might be the recent loop-unswitch work Chandler did.

Philip

There's already a LoopSimplifyCFG which is a loop-pass (and thus can iterate
with other loop passes) and eliminates trivial branches. Having a simple
pass which just strips unreachable blocks and converts conditional branches
to unconditional ones while updating the appropriate analyses (LoopInfo,
DomTree, LCSSA, etc.) seems very reasonable.

I'm not necessarily convinced that a one-trick-pony pass is the best
option here.
It adds complexity, and there's already a pass in-tree which can
provide this functionality.
What are your concerns about running SimplifyCFG more often? It
shouldn't be a compile-time sink.

This could also be a utility
function called from the end of other passes. The hard bit is the analysis
preservation. A good place to copy from might be the recent loop-unswitch
work Chandler did.

Philip

I don't think preserving the analyses is extremely hard (but I may be
wrong on this).
The incremental dominator tree API makes the updates fairly easy;
LCSSA is slightly more complicated.
In fact, if you take a look at the new LoopUnswitch, it calls
recalculate rather than fixing LCSSA incrementally.
But if LCSSA is computed right, recomputing it entirely should take
very little time.

Thanks,

LoopSimplifyCFG should be updated to do this (I don’t know if it’s in the main pass pipeline yet, though).

When I wrote it I basically only included the simplest possible behavior that I needed with the possibility to add more in the future; I don’t know how much has been added since, of course.

—escha

Hi Davide,

There's already a LoopSimplifyCFG which is a loop-pass (and thus can iterate
with other loop passes) and eliminates trivial branches. Having a simple
pass which just strips unreachable blocks and converts conditional branches
to unconditional ones while updating the appropriate analyses (LoopInfo,
DomTree, LCSSA, etc.) seems very reasonable.

I'm not necessarily convinced that a one-trick-pony pass is the best
option here.
It adds complexity, and there's already a pass in-tree which can
provide this functionality.
What are your concerns about running SimplifyCFG more often? It
shouldn't be a compile-time sink.

The main issue with running SimplifyCFG more often is if we were to place the pass
*in between* a bunch of loop passes. At least in the legacy pass manager, I think this breaks the nice
loop caching we get from placing all loop passes together.
I've not done any measurements on compile time to verify this theory.

LoopSimplifyCFG being a loop pass will not have this problem.

This could also be a utility
function called from the end of other passes. The hard bit is the analysis
preservation. A good place to copy from might be the recent loop-unswitch
work Chandler did.

Philip

I don't think preserving the analyses is extremely hard (but I may be
wrong on this).
The incremental dominator tree API makes the updates fairly easy;
LCSSA is slightly more complicated.
In fact, if you take a look at the new LoopUnswitch, it calls
recalculate rather than fixing LCSSA incrementally.
But if LCSSA is computed right, recomputing it entirely should take
very little time.

The trouble (when I tried this transform in LoopSimplifyCFG) is updating the
LoopInfo analysis: we can have cases where eliminating trivial branches in an inner loop
breaks the outer loop structure and causes it to no longer be a loop (consider a perfectly nested
inner loop within an outer loop, where the trivial branch is `br i1 false, label %outerloopheader, label %innerloopHdr`).

Another problem is if we change the branches to unconditional ones, which then
makes some inner loop unreachable (i.e. no longer a loop in LLVM terminology).

We will need to record these kinds of loops and update the LPM. There may be more cases here.
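The first hazard can be illustrated with a toy model (my own crude approximation, not LLVM's LoopInfo: blocks are numbered in reverse post-order, and a "back-edge" is any edge from a later block to an earlier header):

```cpp
#include <cassert>
#include <set>
#include <utility>

using Edge = std::pair<int, int>;  // (from block, to block)

// Crude loop-header test: with blocks numbered in reverse post-order,
// a loop header is a block targeted by an edge from a later block.
bool isLoopHeader(const std::set<Edge> &cfg, int header) {
    for (const Edge &e : cfg)
        if (e.second == header && e.first > header)
            return true;
    return false;
}
```

If block 3 is the latch of an inner loop (header 2) perfectly nested in an outer loop (header 1), folding its `br i1 false` deletes the edge 3 -> 1, and block 1 silently stops being a loop header - exactly the LoopInfo update the pass has to get right.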

Thanks,
Anna

[ + cc llvm-dev].

This came up again internally and I realized I hadn’t cc’ed llvm-dev in my original response.

Anna

I agree that LoopSimplifyCFG is the right place to do this. As Anna pointed out, it's not as easy as just running SimplifyCFG on the blocks belonging to the current loop: updating the loop structure and preserving LCSSA make it challenging. But in the end it should not only solve the original issue mentioned in this thread (eliminating `br i1 false`), it can also help compile time by reducing IR size.

Michael