Why should we have the LoopPass and LoopPassManager? Can we get rid of this complexity?

As came up recently in other threads, loop passes really get in the way. Here are some of the ways:

  • There is only one Loop analysis pass - IVUsers. It seems unlikely that the loop nest walk is critical to computing this or preserving it.

  • Almost all of the things we think of as “required” and dependencies are actually transforms that canonicalize the form of a loop into particular shapes.

  • Because LoopPasses are nested inside of some other pass manager, we can’t compute function analyses after transforming loops with LoopSimplify and/or LCSSA, and have these analyses available to other LoopPasses such as the vectorizer and unroller.

  • LoopPasses don’t obey the typical “rules” of a pass over a narrow chunk of IR: they routinely access and modify IR outside of the loop.

There appear to be two chunks of “functionality” provided by loop passes:

  1. A worklist of loops to process. This is very rarely used:
    1.1) LoopSimplify and LoopUnswitch add loops to the queue.
    1.2) LoopExtractor, LoopDeletion, and LoopUnroll delete loops from the queue (but “no-op” would likely be correct if we didn’t have the loop pass manager).

  2. LoopUnswitch and the pass manager itself use some hooks to update analyses. This is only actually leveraged by LICM, though, to update its AliasSetTracker.

These don’t seem like very serious use cases, and the cost is insanely high. Can we get rid of it?
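To make the worklist behavior concrete, here is a toy model of what the LoopPassManager's queue provides. This is an illustrative sketch only, not LLVM's actual code: the `Loop`, `LoopQueue`, and pass shapes are invented for the example. A pass run over a queued loop may enqueue newly created loops (as LoopSimplify and LoopUnswitch do) or mark the current loop deleted (as LoopUnroll and LoopDeletion do):

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy stand-in for an LLVM Loop (hypothetical, for illustration only).
struct Loop {
  std::string Name;
  bool Deleted = false;
};

// Toy stand-in for the LoopPassManager's worklist.
struct LoopQueue {
  std::deque<Loop *> Work;
  void push(Loop *L) { Work.push_back(L); }
  // Run Pass over every queued loop; the pass may call push() to
  // enqueue new loops, or set Deleted on a loop to drop it from
  // further processing. Returns how many loops were actually visited.
  template <typename PassT> int run(PassT Pass) {
    int Visited = 0;
    while (!Work.empty()) {
      Loop *L = Work.front();
      Work.pop_front();
      if (L->Deleted)
        continue; // deleted loops are silently skipped
      ++Visited;
      Pass(*L, *this);
    }
    return Visited;
  }
};
```

The two "features" above are exactly this: a mutable queue and a skip-if-deleted check, which is a thin mechanism relative to the cost of nesting a whole pass manager inside another one.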

As came up recently in other threads, loop passes really get in the way. Here are some of the ways:

- There is only one Loop analysis pass - IVUsers. It seems unlikely that the loop nest walk is critical to computing this or preserving it.

IVUsers is only used by LSR, and would make more sense as an LSR utility IMO.

- Almost all of the things we think of as "required" and dependencies are actually *transforms* that canonicalize the form of a loop into particular shapes.

- Because LoopPasses are nested inside of some other pass manager, we can't compute function analyses after transforming loops with LoopSimplify and/or LCSSA, and have these analyses available to other LoopPasses such as the vectorizer and unroller.

- LoopPasses don't obey the typical "rules" of a pass over a narrow chunk of IR: they routinely access and modify IR outside of the loop.

There appear to be two chunks of "functionality" provided by loop passes:

1) A worklist of loops to process. This is very rarely used:
1.1) LoopSimplify and LoopUnswitch add loops to the queue.

I’m making this up without much thought, but we may benefit from iterative loop transforms like Rotate -> LICM -> Unswitch -> Rotate -> LICM. We might need to come up with a preferred alternative: we can either continue to convert transforms into utilities, or we can invent new pass manager tricks, e.g. iterate over a set of function passes or selectively run a pass on “dirty” regions.
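A minimal sketch of what such an iterative scheme might look like, assuming each transform reports whether it changed anything. The driver and names here are hypothetical, not an existing pass manager feature; the point is just that "iterate until quiescent" is a small, generic mechanism:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A transform over some state, reporting whether it changed anything.
// (Toy model: real passes would run over a Function or Loop.)
using Transform = std::function<bool(int &)>;

// Re-run the whole list of transforms until a full round changes
// nothing, modeling Rotate -> LICM -> Unswitch -> Rotate -> LICM
// style iteration. MaxRounds bounds the work. Returns rounds executed.
static int runToFixpoint(const std::vector<Transform> &Passes, int &State,
                         int MaxRounds) {
  int Rounds = 0;
  while (Rounds < MaxRounds) {
    bool Changed = false;
    for (const Transform &T : Passes)
      Changed |= T(State);
    ++Rounds;
    if (!Changed) // one quiet round confirms the fixpoint
      break;
  }
  return Rounds;
}
```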

1.2) LoopExtractor, LoopDeletion, and LoopUnroll delete loops from the queue (but "no-op" would likely be correct if we didn't have the loop pass manager).

I think the LoopPassManager is historically just a source of bugs in this respect.

2) LoopUnswitch and the pass manager itself use some hooks to update analyses. This is only actually leveraged by LICM, though, to update its AliasSetTracker.

These don't seem like very serious use cases, and the cost is insanely high. Can we get rid of it?

Traditionally, compilers sometimes apply a pipeline of transforms to the loop tree bottom-up, or only to inner loops, to avoid recomputing a global analysis like DomTree between each of the transforms (you can locally invalidate some analyses but still safely use them for other loops). This is obviously tricky and would be best handled within a single Function pass if anyone did want to do it. So it’s not a good reason to have a LoopPassManager.
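The bottom-up order being described is just a post-order walk of the loop nest. A toy sketch, with a made-up `LoopNode` standing in for LLVM's LoopInfo tree: visiting children before their parent guarantees every inner loop is transformed before the loop that contains it, which is what lets a single Function pass amortize analysis recomputation across the whole nest.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy loop-nest node (hypothetical; not LLVM's LoopInfo).
struct LoopNode {
  std::string Name;
  std::vector<LoopNode> SubLoops;
};

// Post-order traversal: every inner loop is visited before its parent.
void visitBottomUp(const LoopNode &L, std::vector<std::string> &Order) {
  for (const LoopNode &Sub : L.SubLoops)
    visitBottomUp(Sub, Order); // children first...
  Order.push_back(L.Name);     // ...then the parent
}
```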

I agree that the LoopPassManager cost is very high because it’s something that needs to be worked around almost every time someone wants to improve the pass pipeline.

-Andy

This is a really good point. Owen pointed it out to me as well in another
guise: when we unroll a loop, we need to re-run a bunch of the passes, but
re-running them when we *haven't* successfully unrolled a loop is a total
waste.

I'd like to think more about this, so a simpler option: what if we *just*
extract LoopSimplify and LCSSA? If all the LoopPasses preserve these, then
them being function passes would be fine. This would allow us to at least
*start* writing function passes over loops (like the LoopVectorizer) rather
than being forced to phrase them as LoopPasses.

I think I could implement this simpler option right away, and that would
unblock a lot of our efforts around unrolling, vectorizing, and PGO.

Great. LoopSimplify can be used as a utility to cleanup transformed loops if needed.

LCSSA could eventually be eliminated as a prerequisite for some loop passes, which should just use the SSA updater instead.

SCEV based passes would be next in line to convert to function passes. Reproducing and debugging issues related to SCEV invalidation and loop pass order is terrifying.

-Andy

This would really only work with pretty specialized logic in the LoopVectorizer and LSR to select only the cleanup aspects of LoopSimplify that they need. And it wouldn’t even have any utility in LSR.

Unfortunately, the loop pass manager makes doing this incrementally hard. The loop vectorizer runs on the innermost loop and leaves it in non-simplified form. We then pop out to the outer loop and verify everything. The now-function-pass LoopSimplify fails its verification because it verifies all of the loops in the function, even though the inner loops aren’t going to be revisited.

My suggestion is to disable the whole-function verification in LoopSimplify while we’re fixing this. We can instead have LCSSA verify the specific loop it is processing which will give us the same safety guarantees. Does that sound OK?

That will let me do the following:

  1. Hoist LoopSimplify to a function pass
  2. Hoist LCSSA to a function pass
  3. Make the LoopVectorizer a function pass
  4. Make IVUsers a utility of LSR, run on-demand for the loop that LSR is processing (no pass manager involvement)
  5. Make LSR a function pass

As a consequence of this, we will actually fix the long-standing issue of LoopSimplify vs. ScalarEvolution, and cut the number of runs of LoopSimplify and LCSSA by half (!!!), so I think this is worth doing.
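Step 4 of the plan, making an analysis a plain utility of the transform that needs it, can be sketched abstractly. Everything below is hypothetical and for illustration (ToyLoop, collectStrideUses, reduceStrength are invented names, not LLVM's IVUsers or LSR API); the point is the lifetime model: the transform builds the per-loop information on demand, uses it, and discards it, with no pass-manager scheduling or invalidation involved.

```cpp
#include <cassert>
#include <vector>

// Toy loop carrying stand-in "induction variable uses" (hypothetical).
struct ToyLoop {
  std::vector<int> Strides;
};

// The "analysis": computed for exactly one loop, on demand.
static std::vector<int> collectStrideUses(const ToyLoop &L) {
  return L.Strides;
}

// The "transform" owns the analysis lifetime: build, use, drop.
static int reduceStrength(const ToyLoop &L) {
  std::vector<int> Uses = collectStrideUses(L);
  int Rewritten = 0;
  for (int S : Uses)
    if (S > 1) // pretend only non-unit strides need rewriting
      ++Rewritten;
  return Rewritten; // Uses dies here; nothing to invalidate later
}
```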

I have a patch that does #1 already, but wanted to check that you’re OK weakening the verification. Otherwise, I have to do 1, 2, 3, and 5 in a single commit, or teach the LoopVectorizer and LSR to preserve LoopSimplify… Yuck.

-Chandler

This patch appears to cause slight changes in three test cases:

    LLVM :: Transforms/IndVarSimplify/lftr-reuse.ll
    LLVM :: Transforms/LoopSimplify/ashr-crash.ll
    LLVM :: Transforms/LoopStrengthReduce/2011-12-19-PostincQuadratic.ll

Looking at lftr-reuse.ll, we successfully hoist the 'icmp slt' into the
'outer:' block as the comment says would be nice (because the outer loop is
simplified now, the test is checking for unsimplified). The LSR failure is
just that the loop basic blocks have different names (loopexit instead of
preheader).

The ashr-crash.ll case is minutely interesting -- we fail to hoist the
comparison that the test wants hoisted into the entry block. My suspicion
is that getting this to hoist with the heavily reduced pipeline used is
problematic, as the test seems more geared to tickle SCEV bugs than test
important optimization invariants.

All of the other regression tests pass, so this looks pretty good to me.
Let me know! I can mail the patch tomorrow if it helps.

This would really only work with pretty specialized logic in the LoopVectorizer and LSR to select only the cleanup aspects of LoopSimplify that they need. And it wouldn’t even have any utility in LSR.

Unfortunately, the loop pass manager makes doing this incrementally hard. The loop vectorizer runs on the innermost loop and leaves it in non-simplified form. We then pop out to the outer loop and verify everything. The now-function-pass LoopSimplify fails its verification because it verifies all of the loops in the function, even though the inner loops aren’t going to be revisited.

My suggestion is to disable the whole-function verification in LoopSimplify while we’re fixing this. We can instead have LCSSA verify the specific loop it is processing which will give us the same safety guarantees. Does that sound OK?

Yes, in general we cannot rely on LoopSimplify for correctness since it’s not always possible to put loops in canonical form. Passes should check isLoopSimplifyForm(), so it should only be a question of missed optimization.

That will let me do the following:

  1. Hoist LoopSimplify to a function pass
  2. Hoist LCSSA to a function pass
  3. Make the LoopVectorizer a function pass
  4. Make IVUsers a utility of LSR, run on-demand for the loop that LSR is processing (no pass manager involvement)
  5. Make LSR a function pass

Great.

As a consequence of this, we will actually fix the long-standing issue of LoopSimplify vs. ScalarEvolution, and cut the number of runs of LoopSimplify and LCSSA by half (!!!), so I think this is worth doing.

This was a really annoying issue that I spent a lot of time on in the past. So thanks.

I have a patch that does #1 already, but wanted to check that you’re OK weakening the verification. Otherwise, I have to do 1, 2, 3, and 5 in a single commit, or teach the LoopVectorizer and LSR to preserve LoopSimplify… Yuck.

Yes. Thank you.

-Andy

This looks like an example of the kind of order-of-loop-transform problems that I spent a staggering amount of time debugging. So the underlying problem should go away.

That said, it also looks like an example where we benefit from applying multiple passes to the inner loop before optimizing the outer loop. If you want to file a PR and disable it for now that’s cool.

-Andy

From: "Andrew Trick" <atrick@apple.com>
To: "Chandler Carruth" <chandlerc@gmail.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, January 22, 2014 2:09:06 PM
Subject: Re: [LLVMdev] Why should we have the LoopPass and LoopPassManager? Can we get rid of this complexity?

I have a patch that does #1 already, but wanted to check that you're
OK weakening the verification. Otherwise, I have to do 1, 2, 3, and
5 in a single commit, or teach the LoopVectorizer and LSR to
preserve LoopSimplify... Yuck.
This patch appears to cause slight changes in three test cases:

LLVM :: Transforms/IndVarSimplify/lftr-reuse.ll
LLVM :: Transforms/LoopSimplify/ashr-crash.ll
LLVM :: Transforms/LoopStrengthReduce/2011-12-19-PostincQuadratic.ll

Looking at lftr-reuse.ll, we successfully hoist the 'icmp slt' into
the 'outer:' block as the comment says would be nice (because the
outer loop is simplified now, the test is checking for
unsimplified). The LSR failure is just that the loop basic blocks
have different names (loopexit instead of preheader).

The ashr-crash.ll case is minutely interesting -- we fail to hoist
the comparison that the test wants hoisted into the entry block. My
suspicion is that getting this to hoist with the heavily reduced
pipeline used is problematic, as the test seems more geared to
tickle SCEV bugs than test important optimization invariants.

This looks like an example of the kind of order-of-loop-transform
problems that I spent a staggering amount of time debugging. So the
underlying problem should go away.

That said, it also looks like an example where we benefit from
applying multiple passes to the inner loop before optimizing the
outer loop. If you want to file a PR and disable it for now that’s
cool.

I had mentioned to Chandler previously that in the new pass manager I wanted the ability to register cleanup passes: "run this pass only if the previous pass actually changed something". Maybe this speaks for a slightly more general functionality: "run this pass only if the previous pass changed something, and here's a hint of a list of blocks or loops at which you should look." If we had something like this, we might be able to handle cases like this without burdening the overall configuration with extra (unneeded) pass invocations.
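A minimal sketch of the first half of that idea, assuming each pass reports whether it changed the function. This is hypothetical, not the new pass manager's actual API (FnState, Pass, runWithCleanups are invented names): each main pass is paired with a cleanup pass that only runs after a reported change, so unchanged code never pays for the extra invocations.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy function state and pass signature (hypothetical).
struct FnState {
  int Value = 0;
};
using Pass = std::function<bool(FnState &)>; // returns "changed"

// Run each (main, cleanup) pair; the cleanup only runs when the main
// pass reports a change. Returns how many cleanups actually ran.
static int runWithCleanups(FnState &F,
                           const std::vector<std::pair<Pass, Pass>> &Pipeline) {
  int CleanupRuns = 0;
  for (const auto &Entry : Pipeline)
    if (Entry.first(F)) { // only invoke the cleanup after a real change
      Entry.second(F);
      ++CleanupRuns;
    }
  return CleanupRuns;
}
```

The "hint of blocks or loops to look at" generalization would amount to the main pass handing the cleanup a worklist instead of a boolean.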

-Hal

Turns out it was just a bug. =]

See r199884 for gory details. This works, is landed, and I'm working on
LCSSA now.

The key bit is that we didn't really need to do order-of-loop-transform
stuff; we just really need to preserve LoopSimplify form, because if we
don't, everything goes to crap. Notably, there are two loops that get
completely unrolled in this test case. After the inner one is unrolled we
will do a *significantly* poorer job of unrolling the outer loop unless we
re-simplify it first. That's not surprising really as this is the canonical
form the unroller expects.

It's looking like the loop *canonicalization* stuff really doesn't need the
inside-out pipelining, what it needs is for each loop pass to ensure that
the canonical form is preserved at each nesting of the loop.

But I still think that the loop *simplification* stuff really *does* need
the pipelining. As you pointed out originally, full-unroll, rotate, and
LICM can combine to dramatically simplify the structure iteratively at each
level. I'm still pondering what the best long-term structure for managing
this type of tightly coupled optimization is, but it doesn't look like it
will get in the way of easing access to function analysis passes.