SimpleLoopUnswitch (SLU) is a loop pass that transforms a loop containing branches or switches into multiple loops, by taking the conditional outside the loop body.
For example, it turns:

```
for (...)
  A
  if (cond)
    B
  C
```

into:

```
if (cond)
  for (...)
    A; B; C
else
  for (...)
    A; C
```
The current SLU performs two kinds of unswitching. Trivial unswitching applies when the condition can be unswitched without cloning any code from inside the loop. Non-trivial unswitching requires duplicating the loop body.
Related work
The existing SLU pass generates incorrect results when transforming divergent branches on certain architectures (e.g. AMDGPU). To generate correct results, DivergenceAnalysis is required. However, DivergenceAnalysis is a function-level analysis, so the existing SLU, which is a loop pass, cannot use it efficiently.
Re-implementing the existing SLU pass has been discussed in D109762.
Proposed solution
I am actively looking at implementing non-trivial loop unswitching as a function pass.
Although re-implementing a loop pass as a function pass is not a new topic, one critical question is worth discussing: how should we integrate the new function pass into the pipelines?
The existing SLU is a loop pass and is integrated into the loop pass managers. The new function pass cannot be integrated into these loop pass managers and needs to be inserted somewhere else. The exact position cannot be determined without experimentation.
Would it be possible to factor out the part of the analysis you are interested in into a utility that can be called from any pass (be it function or loop pass)?
Apart from the optimal positioning of the pass in the pipeline, we also need to consider the impact on loop nests, and would likely need to change SLU to not only work with one loop at a time but instead look at invariant conditions within the whole loop nest. Currently SLU is positioned in the loop pass manager after LICM but before many other passes such as loop-idiom, interchange, etc. AFAIR there is an intricate dependency between LICM and SLU: if a branch is deep inside a loop nest, we require multiple runs of these passes to eventually move an invariant condition out of the whole nest. LICM may also depend on SLU in some cases, e.g. when wanting to move an invariant load that is guarded by an invariant condition.
The wording in this RFC seems to suggest that you want to separate out the handling of “non-trivial” vs “trivial” cases into different passes (one a loop pass and one a function pass). I don’t think that’d be a good idea, but please clarify in case I misunderstood your proposal.
Ruobing is exploring the refactoring of SimpleLoopUnswitch for his summer internship.
The main motivation for this work came from requests to use DivergenceAnalysis in the loop pass, while the new pass manager format disallows the use of function analyses in loop passes without them being preserved.
As I understood it, one of the main motivations for this restriction in the loop pass managers was to limit interleaving of loop optimizations to loop passes that are considered tightly-coupled, and in such a way that performance is prioritized.
Due to its need to either compute or have access to DivergenceAnalysis for non-trivial unswitching to be correct on targets with divergence, it seems that non-trivial unswitching doesn’t fit this design decision and may be better as a function pass.
Until D94448 ([NewPM] Run non-trivial loop unswitching under -O2/3/s/z), non-trivial unswitching was not included in the loop simplification pipeline. It was added to resolve a regression when we switched to the new pass manager.
The end goal here is to determine if we can resolve that same regression with a function pass added to do the non-trivial unswitching; you understood correctly that this is the proposal.
The first step is not to remove the non-trivial unswitch functionality from the loop pass, but to disable it in the simplification pipeline and replace it with a function pass with which we can experiment: pipeline position, optimizations missed due to the lack of loop interleaving, and opportunities gained on divergent targets.
You are absolutely right that besides the challenge of its position, the new function pass may need to do additional optimizations in the absence of its interleaving with other loop passes. This is another big part of the problem that we’d like to understand this summer.
A recent example of its need for LICM is in D124251 ([SimpleLoopUnswitch] Run LICM for nested unswitching tests). The function pass doing non-trivial unswitching will need to hoist the freeze instruction it inserts itself before trying to unswitch the outer loop.
There are many open questions and it would be great to discuss more at the upcoming Loop Optimization WG call. I added it to the agenda.
We have workloads that rely on non-trivial loop unswitching for optimal performance. If non-trivial loop unswitching is disabled in the loop pipeline and changed to a function pass, potentially modifying its position/implementation, it could negatively impact the execution performance of these workloads.
IMO the trivial and non-trivial pieces should stay together. I think the unification of unswitching as a whole is a good thing. Otherwise we have some things being unswitched in one place at some point and others being unswitched in another place at a different point, each with a different set of optimizations running leading up to that form of unswitching. Making the proposed change could result in further complications; some are already anticipated, such as the potential for additional optimizations required to make the function pass work.
Have any other solutions to the problem been explored? Maybe an approach from the analysis side instead if that’s the issue this proposal would like to solve?
I think it would be a good idea to revisit function analysis access from loop passes.
We currently don’t allow inner IR passes to on-demand query/update/invalidate outer IR-level analyses (e.g. a function pass potentially invalidating a module analysis that another function pass could later use). This was done to not block potential future pass manager concurrency. The way this is enforced is by only providing a getCachedResult method in OuterAnalysisManagerProxy, no getResult. This also makes things a bit more deterministic. So in the loop pass/function analysis case, loop passes cannot on-demand request or invalidate a function analysis.
However, loop passes must query/update/invalidate function-level analyses, given that the boundary between loops and functions is not as clear as something like function vs module. This is evidenced by the LoopStandardAnalysisResults every loop pass has access to, which contains some function analyses that loop passes can use and must keep up to date. This works around the previous limitation.
Experimenting with allowing the loop/function version of OuterAnalysisManagerProxy (FunctionAnalysisManagerLoopProxy) to call FunctionAnalysisManager::getResult might be interesting. We’d have to make sure to properly invalidate function/loop analyses between loop passes. Maybe we can even get rid of LoopStandardAnalysisResults, although that may not be desirable for other reasons.
Talking this over with some people, the main concern about this approach is the potential for quadratic behavior. DivergenceAnalysis is a function analysis and scans the entire function. If we recalculate it per loop, we’d scan the whole function once for each loop in it, which is quadratic in the worst case.
But actually I see that we have DivergenceAnalysisImpl::RegionLoop which limits the analysis to only check in the loop. We could make a loop analysis version of DivergenceAnalysis and use that in SimpleLoopUnswitch.
Thank you for the feedback provided at the Loop Optimization sync!
Following up as discussed on the issues that were raised with this proposal.
First concern was if there were any run-time regressions. We ran SPEC (and the rest of the LLVM test suite) and a series of internal benchmarks and saw no regressions. @nkhoun, could you please test the set of patches and provide us with feedback on their impact on performance for your workloads?
Second concern was about impact on compile-time. Here are the patches tested through the compiler tracker: LLVM Compile-Time Tracker
Third, you asked about clarifications on why non-trivial unswitching was not enabled by default originally. The context is that SimpleLoopUnswitch was designed as a replacement for LoopUnswitch. This latter pass didn’t do a lot of non-trivial unswitching; it kept state and was designed to have a “budget” on how much it would unswitch to avoid code size increases. In SimpleLoopUnswitch, the trivial part doesn’t have this problem because it doesn’t duplicate code, while non-trivial unswitching does and there is no cost model or budget to limit how much unswitching it will do. This was why it was not enabled by default originally.
Once the switch sticks, we plan to take advantage of the new function pass and explore changes that can make a difference in both compile and run time. One change is introducing DivergenceAnalysis (D128001: apply DivergenceAnalysis for SLU). Another change to be explored is using profile information (BFI) to limit the amount of non-trivial unswitching performed for cold loops, in order to limit code size increases.
Lastly, per your suggestion we will rename the use-funcpass-nontrivial-unswitch flag to remove nontrivial from the naming.
Thanks for your suggestion. I have looked through the code and found no existing use case where RegionLoop is non-null: all use cases set it to NULL, making it a function-level analysis. Also, I think loop-level analysis is much harder than function-level analysis: for SPMD programs we know the only source variables that cause divergence are built-in function calls (threadIdx.x), but for a loop, any variable live before the loop may be divergent. I think that’s why no one has ever analyzed divergence at the loop level.
In short, this interface is not well tested and would not reduce the overhead either (we would need to record the divergent live variables for each loop), so I think moving to a function pass is the better choice.
Ok yeah it looks like it’s not tested as a loop analysis and would be less precise.
I think making non-trivial unswitching a function pass is fine as long as nothing regresses (especially in regard to the interaction with LICM) and we can show improvements on targets with divergence. @bcahoon you mentioned this in D109762 ([NewPM][SimpleLoopUnswitch] Add DivergenceInfo), is this still an issue?
Hi @aeubanks, at the time of adding D109762 we did have regressions due to the differences in loop unswitching between the pass managers. The regressions were addressed with other changes, so I paused pushing the patch due to the concerns mentioned in the ticket. I assumed I would find another example that would require the change, but I haven’t come across one. I’ll track down the test case again to see if the non-trivial unswitching has any effect currently.
@alina Are these results comparing current compile-time vs when moving non-trivial to a function pass? The other question raised in the loop opt call was what would compile-time impact be if we leave the non-trivial handling in LPM1 while adding additional non-trivial unswitching as a function pass.
> @nkhoun, could you please test the set of patches and provide us with feedback on their impact on performance for your workloads?
I have applied the set of patches and done a number of performance runs on the workloads we have that exercise non-trivial loop unswitching. From the data I collected, there are no notable degradations from applying the patches.
@nkhoun Thank you for testing the patches and confirming no regressions on your workloads.
@bmahjour We haven’t tested the compile time for running non-trivial unswitching twice (in the loop pipeline and as a follow-on function pass). As we noticed no regressions with a single non-trivial unswitching run (as a function pass), the assumption was that any additional compile time added by a second run would be wasted.
We can get actual measurements if you think it’s still meaningful.
Following the discussions at the open meetings, Ruobing is now looking at adding PGO to the current loop unswitch pass in order to reduce code size for cold loops.
We may revisit this proposal at a later time.