Missed optimization with indirectbr terminator

Consider this IR fragment produced after -O3:

%7:
%8 = phi i8* [ blockaddress(@0, %19), %19 ], [ %12, %11 ]
%9 = phi i32 [ %20, %19 ], [ 0, %11 ]
indirectbr i8* %8, [label %4, label %19]

%19:
%20 = add nsw i32 %9, 1
%21 = icmp eq i32 %9, 9999
br i1 %21, label %16, label %7

the br in %19 should be optimized to branch directly to itself rather than going back to %7 (note that the arg %8 to the indirectbr will always be the address of %19 when coming from %19).
Is this a known missed optimization?

I haven’t read the code in detail, but it looks like JumpThreading at least attempts to thread across indirect branches. You can either try to fix it or file a bug with your test case.

Cameron

In the source it says “If the predecessor is an indirect goto, we can’t split the edge.” Why would that be the case?

Splitting an edge creates a block which executes when leaving a
specific block to go to a specific successor. The only way to split an
indirect goto is to insert code before the jump which checks for a
specific destination address. This is very likely to be a pessimization.

To answer your original question, the current implementation design
for indirect goto is intentionally based around having a single block
that terminates in an indirectbr. Otherwise the CFG of a
context-threaded interpreter loop becomes inherently O(N^2): there
are N instruction labels, each of which has approximately N
predecessors. We rely on a code-generation optimization, tail
duplication, to actually restore the context-threading behavior.

In short, don’t look at the IR for whether context-threading is
happening; look at the generated code.

John.

Nella citazione giovedì 7 luglio 2011 19:41:16, John McCall ha scritto:

I haven't read the code in detail, but it looks like JumpThreading at least attempts to thread across indirect branches. You can either try to fix it or file a bug with your test case.

In the source it says "If the predecessor is an indirect goto, we can't split the edge. <http://llvm.org/docs/doxygen/html/JumpThreading_8cpp_source.html#l00914>" Why would that be the case?

Splitting an edge creates a block which executes when leaving a
specific block to go to a specific successor. The only way to split an
indirect goto is to insert code before the jump which checks for a
specific destination address. This is very likely to be a pessimization.

Do you mean like turning a indirectbr %addr, [%a, %b, ...]
into a switch %addr, undef, [blockaddr(%a), %a], [blo
ckaddr(%b), %b], ...
? (I know that the switch doesn't accept a pointer as first argument, it's just an example)

To answer your original question, the current implementation design
for indirect goto is intentionally based around having a single block
that terminates in an indirectbr. Otherwise the CFG of a
context-threaded interpreter loop becomes inherently O(N^2): there
are N instruction labels, each of which has approximately N
predecessors. We rely on a code-generation optimization, tail
duplication, to actually restore the context-threading behavior.

In short, don't look at the IR for whether context-threading is
happening; look at the generated code.

I'll try to inspect the assembler. Just a quick thought in the mean time, in the snippet I posted, if the backedge pointed directly to %19, other optimizations would likely notice that the loop could be removed entirely and replaced with a single addition. Do you think the code generator is able t
o do this?

Nella citazione giovedì 7 luglio 2011 19:41:16, John McCall ha scritto:

I haven't read the code in detail, but it looks like JumpThreading at least attempts to thread across indirect branches. You can either try to fix it or file a bug with your test case.

In the source it says "If the predecessor is an indirect goto, we can't split the edge. <http://llvm.org/docs/doxygen/html/JumpThreading_8cpp_source.html#l00914>" Why would that be the case?

Splitting an edge creates a block which executes when leaving a
specific block to go to a specific successor. The only way to split an
indirect goto is to insert code before the jump which checks for a
specific destination address. This is very likely to be a pessimization.

Do you mean like turning a indirectbr %addr, [%a, %b, ...]
into a switch %addr, undef, [blockaddr(%a), %a], [blo
ckaddr(%b), %b], ...
? (I know that the switch doesn't accept a pointer as first argument, it's just an example)

That difference is relevant, though, because there's no way to
implement such a switch except as a series of pointer comparisons
(or at least offset comparisons). But I was approaching it from
the perspective of trying to split a single edge, rather than changing
the entire terminator; there's no natural way to do that with an indirect
branch, so you have to do explicit pointer comparisons before the goto.
Doing even one or two of those could easily kill a lot of the benefit of
the indirect branch.

Basically, generic optimizations should not be in the business of
trying to split indirect branch edges, which is why the generic
splitting routines fail on them.

I'll try to inspect the assembler. Just a quick thought in the mean time, in the snippet I posted, if the backedge pointed directly to %19, other optimizations would likely notice that the loop could be removed entirely and replaced with a single addition. Do you think the code generator is able to do this?

I don't think anyone's taught the loop optimizations to work
over indirect branches, but it's not impossible. This specific
case would be better fixed by teaching some existing pass,
maybe JumpThreading, that an indirectbr on a phi with constant
blockaddr operands can be usefully optimized.

John.

Why would you write a loop with an indirect branch where the loop can be deleted?

Cameron

Actually, jumpthreading already knows about indirectbr.

The problem is that JumpThreading refuses to thread across loop
headers (that is, the target blocks of backedges). It should have the
same problem with a phi of booleans and a conditional branch instead
of a phi of pointers and an indirectbr.

The comment above JumpThreading::FindLoopHeaders() explains the
current situation:

/// FindLoopHeaders - We do not want jump threading to turn proper loop
/// structures into irreducible loops. Doing this breaks up the loop nesting
/// hierarchy and pessimizes later transformations. To prevent this from
/// happening, we first have to find the loop headers. Here we approximate this
/// by finding targets of backedges in the CFG.
///
/// Note that there definitely are cases when we want to allow threading of
/// edges across a loop header. For example, threading a jump from outside the
/// loop (the preheader) to an exit block of the loop is definitely profitable.
/// It is also almost always profitable to thread backedges from within the loop
/// to exit blocks, and is often profitable to thread backedges to other blocks
/// within the loop (forming a nested loop). This simple analysis is not rich
/// enough to track all of these properties and keep it up-to-date as the CFG
/// mutates, so we don't allow any of these transformations.

Note that the case discussed in this thread (a backedge to a block
within the loop) is mentioned as "often profitable", but that
JumpThreading simply doesn't know enough about loops to recognize it.

There are several ways to solve this, as I see it:
1) teach jumpthreading about loops,
2) teach some other pass that knows about loops how to thread backedges, or
3) introduce a whole new pass just for this purpose.

I’m working on an IPO that uses indirectbrs. The loop was part of the code being compiled.

Filed bug 10377 w/testcase.