[RFC] LCSSA vs. SSAUpdater ... FIGHT!

So, there are two primary ideas behind SSA form management in the loop optimizers of LLVM:

  • Require LCSSA form input, leverage its (very powerful) guarantees to simplify maintaining SSA form, and also maintain LCSSA form.

  • Don’t bother with LCSSA form input, assume the worst, and use powerful incremental SSA formation utilities built on SSAUpdater to form SSA on demand when needed.

(Note, there are plenty of places where SSAUpdater makes sense, so this isn’t really about doing away with it at all.)

In discussions, Andy had expressed a desire to move entirely away from LCSSA and Nick had expressed a desire to do the opposite, so I’d like to start a proper discusison of what people think and why.

I’ve worked a lot of preserving both LCSSA and LoopSimplify form in all of our loop passes recently thanks to yanking off the bandaid we’ve been relying on for the past N years of letting the LoopPassManager simply re-create these at all loop nest levels on the fly as necessary. During the course of that I’m starting to form an opinion on the subject as well.

I think that SSAUpdater works great for passes like GVN and friends that need to fast, incremental patching of the SSA graph. I also think it is completely incompatible with LCSSA in its current form. It just isn’t built in a way that would be friendly to preserving this kind of information. I think that’s OK, but it means that I think we can’t mix the two solutions effectively long term. PR18688 is the current manifestation of this madness.

Consistently, I am finding that doing incremental update and maintaining canonical form for loops is substantially simpler with LoopSimplify+LCSSA. The combination is just incredibly powerful. It also has a huge advantage in the way it is structures the updates: they avoid perturbing nested or enclosing loops. This property makes them ideal for working inside-out and iteratively across the loop nest as it is simplified.

So I would personally want to see a really strong argument against relying on LCSSA. But I’m open to that argument existing, and sending this email to tease it out. =] If we decide to go forward with LCSSA as the canonical form for loops, I’d like to merge LCSSA and LoopSimplify into a single canonicalization pass, and then I’ll work harder at re-casting the existing loop optimizations to leverage LCSSA more heavily and simplify their preservation of it as a consequence. Right now, we’re just brute-force recomputing LCSSA because we don’t really rely on it coming into the pass. It’s kind of the worst of both worlds. =/

-Chandler

From: "Chandler Carruth" <chandlerc@gmail.com>
To: "Andy Trick" <atrick@apple.com>, nicholas@mxc.ca, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Saturday, February 1, 2014 6:33:26 AM
Subject: [LLVMdev] [RFC] LCSSA vs. SSAUpdater ... FIGHT!

So, there are two primary ideas behind SSA form management in the
loop optimizers of LLVM:

- Require LCSSA form input, leverage its (very powerful) guarantees
to simplify maintaining SSA form, and also maintain LCSSA form.

- Don't bother with LCSSA form input, assume the worst, and use
powerful incremental SSA formation utilities built on SSAUpdater to
form SSA on demand when needed.

(Note, there are plenty of places where SSAUpdater makes sense, so
this isn't really about doing away with it at all.)

In discussions, Andy had expressed a desire to move entirely away
from LCSSA and Nick had expressed a desire to do the opposite, so
I'd like to start a proper discusison of what people think and why.

I've worked a lot of preserving both LCSSA and LoopSimplify form in
all of our loop passes recently thanks to yanking off the bandaid
we've been relying on for the past N years of letting the
LoopPassManager simply re-create these at all loop nest levels on
the fly as necessary. During the course of that I'm starting to form
an opinion on the subject as well.

I think that SSAUpdater works *great* for passes like GVN and friends
that need to fast, incremental patching of the SSA graph. I also
think it is completely incompatible with LCSSA in its current form.
It just isn't built in a way that would be friendly to preserving
this kind of information. I think that's OK, but it means that I
think we *can't* mix the two solutions effectively long term.
PR18688 is the current manifestation of this madness.

Consistently, I am finding that doing incremental update and
maintaining canonical form for loops is *substantially* simpler with
LoopSimplify+LCSSA. The combination is just incredibly powerful. It
also has a huge advantage in the way it is structures the updates:
they avoid perturbing nested or enclosing loops. This property makes
them ideal for working inside-out and iteratively across the loop
nest as it is simplified.

So I would personally want to see a really strong argument against
relying on LCSSA. But I'm open to that argument existing, and
sending this email to tease it out. =]
If we decide to go forward
with LCSSA as the canonical form for loops, I'd like to merge LCSSA
and LoopSimplify into a single canonicalization pass,

What is the benefit of merging them? LoopSimplify is certainly useful without LCSSA, and LCSSA is certainly useful without LoopSimplify's canonicalization. Also, and I don't know if this matters, but LoopSimplify can "fail" (for non-natural loops, at least), LCSSA cannot fail.

-Hal

Forming LCSSA is simpler with simplified loops, and when the loop is
simplified we should take advantage of it. Also, by doing them at the same
time we can have better locality and fewer trips over the various data
structures.

There's no important reason, it just seems a nicer organization to have a
single "canonicalize" step.

From: "Chandler Carruth" <chandlerc@gmail.com>
To: "Andy Trick" <atrick@apple.com>, nicholas@mxc.ca, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Saturday, February 1, 2014 6:33:26 AM
Subject: [LLVMdev] [RFC] LCSSA vs. SSAUpdater ... FIGHT!

So, there are two primary ideas behind SSA form management in the
loop optimizers of LLVM:

- Require LCSSA form input, leverage its (very powerful) guarantees
to simplify maintaining SSA form, and also maintain LCSSA form.

- Don't bother with LCSSA form input, assume the worst, and use
powerful incremental SSA formation utilities built on SSAUpdater to
form SSA on demand when needed.

(Note, there are plenty of places where SSAUpdater makes sense, so
this isn't really about doing away with it at all.)

In discussions, Andy had expressed a desire to move entirely away
from LCSSA and Nick had expressed a desire to do the opposite, so
I'd like to start a proper discusison of what people think and why.

I've worked a lot of preserving both LCSSA and LoopSimplify form in
all of our loop passes recently thanks to yanking off the bandaid
we've been relying on for the past N years of letting the
LoopPassManager simply re-create these at all loop nest levels on
the fly as necessary. During the course of that I'm starting to form
an opinion on the subject as well.

I think that SSAUpdater works *great* for passes like GVN and friends
that need to fast, incremental patching of the SSA graph. I also
think it is completely incompatible with LCSSA in its current form.
It just isn't built in a way that would be friendly to preserving
this kind of information. I think that's OK, but it means that I
think we *can't* mix the two solutions effectively long term.
PR18688 is the current manifestation of this madness.

Consistently, I am finding that doing incremental update and
maintaining canonical form for loops is *substantially* simpler with
LoopSimplify+LCSSA. The combination is just incredibly powerful. It
also has a huge advantage in the way it is structures the updates:
they avoid perturbing nested or enclosing loops. This property makes
them ideal for working inside-out and iteratively across the loop
nest as it is simplified.

So I would personally want to see a really strong argument against
relying on LCSSA. But I'm open to that argument existing, and
sending this email to tease it out. =]
If we decide to go forward
with LCSSA as the canonical form for loops, I'd like to merge LCSSA
and LoopSimplify into a single canonicalization pass,

What is the benefit of merging them? LoopSimplify is certainly useful without LCSSA, and LCSSA is certainly useful without LoopSimplify's canonicalization. Also, and I don't know if this matters, but LoopSimplify can "fail" (for non-natural loops, at least), LCSSA cannot fail.

Right. One of my issues with LCSSA is that it has to be completely maintained for correctness.

OTOH. It is not hard to compute LCSSA correctly and problems will probably be caught by verification. I have seen bugs in SCEV caused by LCSSA update. But not bugs in LCSSA itself (probably because it is recomputed from scratch)

I'll respond to the other points later.

Andy

I also don't want to get rid of LCSSA (if I said that before, I retract the statement). It can be useful to summarize the loop live-out values.

I think the question is: do we want to be in LCSSA during the early/canonical rounds of IR loop optimization? This is, LoopSimplify, Rotate, LICM, Unswitch, full unrolling (I think full unrolling should run earlier).

Why did I suggest avoiding LCSSA?
(1) Unnecessary overhead.
(2) Phi nodes "in the way".
(3) Awkward pass dependencies.

(1) We're forcing a significant analysis to run between passes to make it easier to update SSA after transformation. But we don't even know if any transformations are needed. I would prefer to use SSAUpdater to handle the transformations as they occur. We could save 1-2% of opt time [1].

(2) SSA based analysis are way simpler when they don't handle phis. They could be adapted to handle single operand phis, but we don't usually bother to check. I don't have a specific issue in mind that would impact the early loop opts.

(3) As you know LCSSA needs to be rerun between passes, which in turn requires the domtree. Generally removing dependence on analysis is a good thing.

So, what's the benefit of LCSSA? I'm told that it makes SSA update easier. I don't understand what could be easier than using an SSAUpdater utility. LoopRotate and LICM already use the updater. LoopUnroll requires LCSSA [2][3]. I don't understand the impact on LoopSimplify.

If LCSSA actually makes writing transformations easier, then I'm all for it. Could you point to some specific transformations that are easier with LCSSA?

-Andy

[1] I admit that LCSSA could be greatly speeding up our implementation of SSAUpdater by restricting the use lists of loop values.

[2] LoopUnroll does an incremental SSA update using LCSSA. SSAUpdater would probably be a batch update after unrolling. SSAUpdater might be a little simpler by separating concerns.

[3] We could make a much more efficient SSAUpdater that works with loop unrolling and the domtree by making use of the information that the old values always dominate the new ones. It is really dumb that our unroller invalidates the domtree.

I also don't want to get rid of LCSSA (if I said that before, I retract
the statement). It can be useful to summarize the loop live-out values.

I think the question is: do we want to be in LCSSA during the
early/canonical rounds of IR loop optimization? This is, LoopSimplify,
Rotate, LICM, Unswitch, full unrolling (I think full unrolling should run
earlier).

Just for clarification, LoopSimplify is not a loop pass, and just creates
the canonical loop nest structure with preheaders and exit blocks. LCSSA
requires this to have a place for its PHI nodes, not the other way around.

Also, full unrolling already runs as part of the early rounds of the IR
loop optimizations just as you say it should.

Why did I suggest avoiding LCSSA?

(1) Unnecessary overhead.
(3) Awkward pass dependencies.

(1) We're forcing a significant analysis to run between passes to make it
easier to update SSA after transformation. But we don't even know if any
transformations are needed. I would prefer to use SSAUpdater to handle the
transformations as they occur. We could save 1-2% of opt time [1].

Do you have any measurement of this? I have no evidence that LCSSA is a
remotely significant optimization time cost, but I've not gone looking.

(3) As you know LCSSA needs to be rerun between passes, which in turn
requires the domtree. Generally removing dependence on analysis is a good
thing.

So, we never re-run domtree here. It is preserved throughout, and it is
required for LoopInfo, so there is nothing to be saved on that front.
AFAICT, there is no analysis dependency cost of LCSSA given the other loop
optimizer.

Also, LCSSA *doesn't* need to be re-run between passes. If a pass has LCSSA
coming into it, then it is extremely easy to preserve it in place rather
than re-running things. We just have a *lot* of bugs because the old
loop-pass-manager setup just blindly re-ran LCSSA for us, papering over all
manner of badness.

(2) Phi nodes "in the way".
(2) SSA based analysis are way simpler when they don't handle phis. They
could be adapted to handle single operand phis, but we don't usually bother
to check. I don't have a specific issue in mind that would impact the early
loop opts.

Ok, I think this would be the more serious issue if it comes up in
practice. I don't have any cases where it comes up in practice for early
loop opts either though.

So, what's the benefit of LCSSA? I'm told that it makes SSA update
easier. I don't understand what could be easier than using an SSAUpdater
utility. LoopRotate and LICM already use the updater.

<snip>

If LCSSA actually makes writing transformations easier, then I'm all for

it. Could you point to some specific transformations that are easier with
LCSSA?

LICM. There is already a fast path in LICM that is the exact path that
LCSSA guarantees exists when sinking code into the exit blocks. The only
thing missing is for LICM to actually *use* LCSSA instead of just trying to
patch it up if it happens to find it in the loop. I think it would
dramatically simplify the entire sinking code.

A good use of SSAUpdater: re-forming SSA values *within* a loop body due to
mem2reg essentially. LICM has a little mini mem2reg pass in it that uses
SSAUpdater. This is a great use and I can't imagine something simpler. The
fact that we have LCSSA also makes it faster by bounding the space where it
has to rewrite uses (to the edge of the current loop).

But if we have LCSSA, we should never need SSAUpdater for updating SSA form
across the loop boundary in the way LICM does. Instead we should be able to
place an instruction, RAUW it, and insert a PHI node for each operand. Done.

Maybe a patch would be better to explain matters? As it happens, fixing
LICM this way will also fix PR18753 and a host of other latent bugs where
LICM invalidates LCSSA in some far-off loop and we crash later.

1-2% is the time I measured when I wrote that email. But that doesn’t account for any speedups we might get by being in LCSSA for (I’m speculating that calls to SSAUpdater might be faster).

I don’t think it’s a problem now, or will ever be for the standard -O2 pipeline. Just saying that an on-demand SSA updater is better if all else is equal.

-Andy

Agreed. Then if instcombine removes them its not an issue. Hopefully it doesn’t slow down instcombine.

That sounds reasonable.

That makes sense.

Sounds good.

-Andy

It’s worth noting that LCSSA predates SSAUpdater. If I went back in time and knew what I knew now, I wouldn’t have gone with LCSSA.

My gripes are three fold: 1) SSAUpdater can handle anything that LCSSA simplifies, 2) that LCSSA is annoying to keep up to date, 3) LCSSA burns compile time optimistically rewriting loop values, which are then later collapsed away even if nothing cares about those values.

My personal preference would be to get rid of LCSSA completely, but I don’t know how to stage that.

-Chris

> So, there are two primary ideas behind SSA form management in the loop
optimizers of LLVM:
>
> - Require LCSSA form input, leverage its (very powerful) guarantees to
simplify maintaining SSA form, and also maintain LCSSA form.
>
> - Don't bother with LCSSA form input, assume the worst, and use powerful
incremental SSA formation utilities built on SSAUpdater to form SSA on
demand when needed.
>
> (Note, there are plenty of places where SSAUpdater makes sense, so this
isn't really about doing away with it at all.)

It’s worth noting that LCSSA predates SSAUpdater. If I went back in time
and knew what I knew now, I wouldn’t have gone with LCSSA.

My gripes are three fold:

1) SSAUpdater can handle anything that LCSSA simplifies

Yes, it can, but it is *significantly* less simple. I think the simplicity
of reasoning and handling things with LCSSA is not without value.

2) that LCSSA is annoying to keep up to date

I have not found that to be the case. There are many places where we fail
to keep it up to date that have to be fixed, but as far as I can tell none
of these are due to the complexity, slowness, or challenge of keeping it up
to date. We just "forgot" (in that the loop pass manager made everything
work without any effort but with considerable compile time cost).

3) LCSSA burns compile time optimistically rewriting loop values, which
are then later collapsed away even if nothing cares about those values.

Sure. Put another way though, LCSSA puts the SSA graph into the "canonical
form" that simplifies the model of writing loop passes. I'm not really
disagreeing with you here, just saying that there is a tradeoff.

My suspicion having worked on this quite a bit now is that about 80% of the
compile time we are burning on LCSSA is due to failures to update LCSSA
appropriately in places where it is both convenient and simple to do so.

My personal preference would be to get rid of LCSSA completely, but I
don’t know how to stage that.

OK, I'm not *really* invested one way or the other. But having hacked on
the loop optimizer somewhat over the past 3 weeks, I have to say LCSSA form
has always been easier to reason about, debug, and transform. Perhaps I'll
run out of that warm fuzzy feeling, but so far its holding. And several
others have seemed to like LCSSA when I talked to them, so I don't want it
to be dismissed out of hand.

Either way, we *must* make a decision. The current state is untenable as
passes flagrantly destroy LCSSA making it very expensive indeed to
recompute. And we have to recompute it because a decent number of passes
really do rely on it. Interestingly, many of the *users* of LCSSA are not
rebuilding SSA form! They're using it to capture loop live-out values very
quickly, which is a very useful property IMO. As I mentioned in the thread
to Andy, there seem to still be clear uses for SSAUpdater to update SSA
values, and it merely needs to be able to incrementally preserve LCSSA
rather than breaking it. That doesn't seem like a ton of code to me, so I
think we can reasonably go either way. The tradeoff is in complexity of
loop passes' analysis and transforms of live-out values vs. the complexity
of canonicalizing to LCSSA and preserving it through the loop pass pipeline.

Until recently I felt exactly the same way. I didn’t want LCSSA just as another mechanism for updating SSA.

I’m warming up to having it run during the early loop passes if it
- significantly simplifies the LICM logic
- greatly speeds up SSAUpdater within loops (maybe little net compile-time increase on average?)
- compartmentalizes loop transforms and SSA update so we can debug loop opts one loop at a time

I still don’t particularly like that we force all LLVM clients to perform LCSSA when all they end up doing is rotating and simplifying loops (no LICM/unroll). So it is a tradeoff.

-Andy

I don't have a strong opinion here, just throwing out some thoughts. You've been working on the loop passes much more recently, so if you think it is worth holding on to (or worth using just for the early passes?) then go for it.

-Chris

>>> (Note, there are plenty of places where SSAUpdater makes sense, so
this isn't really about doing away with it at all.)
>>
>> It’s worth noting that LCSSA predates SSAUpdater. If I went back in
time and knew what I knew now, I wouldn’t have gone with LCSSA.
>>
>> My gripes are three fold: 1) SSAUpdater can handle anything that LCSSA
simplifies, 2) that LCSSA is annoying to keep up to date, 3) LCSSA burns
compile time optimistically rewriting loop values, which are then later
collapsed away even if nothing cares about those values.
>>
>> My personal preference would be to get rid of LCSSA completely, but I
don’t know how to stage that.
>
> Until recently I felt exactly the same way. I didn’t want LCSSA just as
another mechanism for updating SSA.
>
> I’m warming up to having it run during the early loop passes if it
> - significantly simplifies the LICM logic

I've committed this in r201148 in order to fix several PRs. We can of
course back it out and go down a different path if that's the end decision,
but it seems better to not have asserts in the interim. =]

You can see the simplification to 'sink' that falls out of this though.
This is, IMO, not a-typical.

> - greatly speeds up SSAUpdater within loops (maybe little net
compile-time increase on average?)
> - compartmentalizes loop transforms and SSA update so we can debug loop
opts one loop at a time
>
> I still don’t particularly like that we force all LLVM clients to
perform LCSSA when all they end up doing is rotating and simplifying loops
(no LICM/unroll). So it is a tradeoff.

Again, LoopSimplify does not require LCSSA today. Nothing to do there. If
we folded rotate into simplify (which we should probably do) then we would
be done.

I don't have a strong opinion here, just throwing out some thoughts.

You've been working on the loop passes much more recently, so if you think
it is worth holding on to (or worth using just for the early passes?) then
go for it.

It's worth noting that getting rid of LCSSA would be a non-trivial amount
of work and would have to be done somewhat carefully to preserve invariants
of other passes in the LPM. Not saying we can't do it, but there is a
reason to keep piling onto LCSSA until there is a clear decision to rip it
out, and only then to rip all of it out at once.

Anyways, I'm still somewhat preferring to keep it in place for the
simplifications. The cost appears to be quite small now that we don't
invalidate all AA. ;]

This looks great. Thanks!

This is not affected be your change, but you may be able to comment having just worked on it. I just noticed from code inspection that when we sink instructions into multiple loop exits, we insert multiple clones into potentially empty blocks, even if we only have one use.

I don’t think we have any pass that can clean up and properly sink these redundant instructions to their use and combine them. i.e. we don’t do lazy code motion. So it looks like LICM splits critical edges and bloats the code for no good reason.

-Andy

That’s a good point, and I’m totally on board with your massive simplification of the loop passes. Just for the record, the disadvantage that I’m pointing out is that LCSSA will be computed for all loops even is the code has no LICM opportunities. As opposed to the SSAUpdater approach which simply does nothing if there’s nothing to do.
-Andy

I strongly agree with this. I recently looked at using LCSSA vs SSAUpdater for an out-of-tree pass I’m working on and quickly settled on LCSSA due to its simplicity. I may revisit that decision someday if performance proves problematic, but for initial prototyping, LCSSA was a major win. Part of that is that I never really found clear documentation on what SSAUpdater actually did and how to use it. There’s some documentation in the header, but it’s not exactly obvious on first read. Philip