loop transforms and function analyses; especially divergence analysis

Hi all,

I have been doing some homework on enabling divergence analysis in the new pass manager. I am focusing only on the new divergence analysis (usually called GPU DA) for now. Adding the appropriate classes was easy enough, but loop unswitching has a problem:

https://bugs.llvm.org/show_bug.cgi?id=48819

The current situation is as follows:

1. Legacy DA is available in the old pass manager and is used by LoopUnswitch for correctness.
2. LoopUnswitch is not available in the new pass manager.
3. SimpleLoopUnswitch is available in both pass managers.
4. SimpleLoopUnswitch does not use any DA, and hence it is broken in both pass managers.

I intend to fix #4 by making the GPU DA available in the new pass manager, and then using whichever DA is available in SimpleLoopUnswitch for the two pass managers respectively.

The main problem is that the state of any divergence analysis seems under-defined when running a loop pass in either pass manager. For this, I am using the following terms: "invalid" implies that a value previously marked as not divergent has now become divergent due to a transform, while "stale" means that a value previously marked as divergent is now non-divergent due to some transform. The DA being stale does not affect correctness, but being invalid definitely does.

The LoopUnswitch pass in the legacy pass manager seems to rely on undocumented behaviour to ensure that divergence analysis is made available before loop passes are invoked. But it seems unsafe to assume that the divergence analysis is still valid if any loop transform happens before the LoopUnswitch pass is invoked. Is that something that people just chose to overlook, or am I missing something here?

In the new pass manager, it seems the equivalent would be to add divergence analysis to LoopStandardAnalysisResults. I tried using getCachedResults on an outer proxy, but that requires that the analysis can "never" be invalidated, which is a more explicit way to force the same question as the previous paragraph.

If all this is correct, then we could either arrange things so that divergence is updated whenever its dependencies (like dominator tree and loopinfo) are updated, or we need to have a second version of SimpleLoopUnswitch that runs on a function instead of a loop. This version can rely on a simpler statement that loop unswitching does not invalidate divergence and hence the DA need not be updated in the function pass while it is iterating over the loops in that function. The DA may become stale, but that just marks a potential opportunity to rerun the pass on that function.

Sameer.

There is an option to enable SimpleLoopUnswitch instead of LoopUnswitch in the legacy PM, but I doubt anybody is using that, so I wouldn’t worry about it.

Looking more closely at where divergence analysis is used, it looks like it’s only used for non-trivial loop unswitching. In the short term, we could just disable non-trivial unswitching if we detect that the target has divergence. In fact, non-trivial unswitching was only turned on in the new PM recently at -O3.

diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index 9d3c8d0f3739…e490c18b2568 100644
— a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -2908,6 +2908,10 @@ static bool unswitchLoop(Loop &L, DominatorTree &DT, LoopInfo &LI,
if (L.getHeader()->getParent()->hasOptSize())
return false;

  • // Skip non-trivial unswitching for targets with divergence.
  • if (TTI.hasBranchDivergence())
  • return false;

There is an option to enable SimpleLoopUnswitch instead of LoopUnswitch in the legacy PM, but I doubt anybody is using that, so I wouldn't worry about it.

Right. But SimpleLoopUnswitch does support the legacy PM, and the check for divergence is at an internal point independent of the pass manager. So either way, I have to make sure it plays well with both pass managers. But like you said, it's not super important, and I can choose to do something simple at that point.

> Looking more closely at where divergence analysis is used, it looks like it's only used for non-trivial loop unswitching. In the short term, we could just disable non-trivial unswitching if we detect that the target has divergence. In fact, non-trivial unswitching was only turned on in the new PM recently at -O3.

Yeah, that's definitely something we want to consider. But it all depends on how important are the non-trivial cases in actual programs. We need to keep track of any potential performance regression from this decision, otherwise it will get wrongly attributed to the new pass manager. A simple way is to make the same change in LoopUnswitch with the legacy PM and see how it affects existing programs.

  > If we want to add something to LoopStandardAnalysisResults, I believe all loop passes must preserve everything in LoopStandardAnalysisResults, and updating all passes to handle divergence would be a pain. asbirlea should be able to provide more info.

That is clearly not desirable; divergence analysis is not used frequently enough to justify the effort.

> An alternative is to skip the whole analysis infra and recalculate it on every run of SimpleLoopUnswitch. The LoopUnswitch pass doesn't preserve divergence analysis, so the analysis would be invalidated every time the pass successfully unswitches. If it doesn't unswitch very many loops right now, there could be a large compile time impact.

I assume you mean SimpleLoopUnswitch in your second sentence? Skipping the infra would undermine the purpose of having that infra in the first place. But even if we did skip it, divergence analysis will have to be recomputed every time SimpleLoopUnswitch is invoked, since it's all the other loop passes that will invalidate it. That would mean at least once per each loop, which sounds rather costly. That is why I think it would be better to have a second version that runs on the whole Function once. Then it's up to each specific compilation flow to decide how to order it relative to the loop passes.

Sameer.

Here’s an attempt to add divergence analysis to the new pass manager as the first step:

https://reviews.llvm.org/D96615

This only introduces the analysis, and does not try to use it with other passes like SimpleLoopUnswitch. That comes next.

Sameer.

Here's an attempt at fixing the original problem, which is that SimpleLoopUnswitch needs access to divergence analysis ... non-trivial branches should not be unswitched if they are divergent because that is harmful to performance.

https://reviews.llvm.org/D96754

From the commit description:

"Loop unswitching on divergent conditions is harmful for
performance. The LoopUnswitch pass depends on LegacyDivergenceAnalysis
to avoid this, but the state of divergence analysis may be
stale (neither preserved nor invalidated) due to previous loop passes.

"The new pass manager provides SimpleLoopUnswitch which currently does
not skip divergent branches. Loop passes can request function analysis
results from an "outer proxy" analysis manager, but only if such
results are never invalidated. This change introduces another method
to request an analysis from the outer proxy even if it is stale. This
is sufficient for the current use-case, where it is not necessary to
update the divergence analysis after every loop pass, and the existing
stale result is still safely useable. The effect is equivalent to the
use of divergence analysis by LoopUnswitch in the legacy pass manager."

This arrangement is clearly different from LoopStandardAnalysisResults, which must always be updated by every loop pass. This change introduces a new class of analysis results that does not have to be passed as an argument to every loop pass. It is sufficient that they are made available by the loop pass manager, but with no guarantees about the state of the result. They are known to be available and valid when the loop passes start, and remain available via the outer proxy, although not necessarily valid.

Sameer.

IMO there are only two options that make sense as far as the cost involved.
The first, most straight-forward and cheap but a fairly big hammer, is to skip non-trivial unswitching for targets with divergence, as Arthur suggested.
The second, more expensive, is to compute DA inside SimpleLoopUnswitch inside the if (TTI.hasBranchDivergence()) clause, before defaulting to return false.

Alina

Yes, those are the only options available under the current state of the new pass manager, but neither is very attractive. We end up not doing anyting, or recomputing a function analysis on every loop. They are both just workarounds for functionality that was not ported over from the old PM to the new PM.

I am not sure if you had a chance to consider my comments on the review, but perhaps this is a better place to continue to that discussion:

In the legacy pass manager, the function LoopPass::preparePassManager() actually makes sure that if a loop transform T running inside a loop pass manager M invalidates analyses used by other passes in M, then T is split out into a separate loop pass manager M'. Thus every instance of loop pass manager is responsible for making sure that the required function analyses are recomputed before starting any loop passes. There doesn't seem to be a way to do something similar in the new PM. A loop pass should be able to isolate itself from side-effects of other loop passes on function analyses. The list of standard analyses serves as a good contract for heavily used analyses, but divergence analysis is not used frequently enough to justify adding it there.

The missing functionality can be reimplemented as follows:

1. Allow a loop pass to declare required function analyses that are not part of the standard list, by returning a list in some function, say "getRequiredOuterAnalyses()". For divergence analysis, this function needs to check TTI because the requirement is target dependent. We can do that by passing the standard analyses as an argument.

2. Rather than adding passes to an LPM directly, an outer manager should add them through a proxy. This proxy can own more than one LPMs and has the ability to enqueue function analyses between these LPMs.

3. Whenever addPass() is called on this proxy, it should check the loop pass to see if it requires any non-standard analyses. If yes, it should start a new LPM and enqueue the required analyses before the new LPM.

This is pretty much what the old PM does to handle such a dependency.

The class FunctionToLoopPassAdaptor is a good candidate ... it already does non-trivial things like running a canonicalization pipeline and is responsible for iterating over loops. It seems like the right place to split the flow into multiple LPMs.

Sameer.