RFC: Uniformity Analysis for Irreducible Control Flow

https://reviews.llvm.org/D130746

Uniformity analysis is a generalization of divergence analysis to include irreducible control flow:

  1. The proposed spec presents a notion of “maximal convergence” that captures the existing convention of converging threads at the headers of natual loops.
  2. Maximal convergence is then extended to irreducible cycles. The identity of irreducible cycles is determined by the choices made in a depth-first traversal of the control flow graph. Uniformity analysis uses criteria that depend only on closed paths and not cycles, to determine maximal convergence. This makes it a conservative analysis that is independent of the effect of DFS on CycleInfo.
  3. The analysis is implemented as a template that can be instantiated for both LLVM IR and Machine IR.

Current status:

  • passes existing tests for divergence analysis
  • passes new tests with irreducible control flow
  • currently working on Machine IR tests

Based on concepts originally outlined by Nicolai Haehnle nicolai.haehnle@amd.com

With contributions from Ruiling Song ruiling.song@amd.com and Jay Foad jay.foad@amd.com.

3 Likes

Does this completely supersede both IR divergence analyses (lib/Analysis/DivergenceAnalysis.cpp and lib/Analysis/LegacyDivergenceAnalysis.cpp) such that we could remove them from the codebase?

Yes, that is the intention.
Legacy DA has some incorrect corner cases. The current DA is correct but limited to reducible control flow. The uniformity analysis is both correct and supports irreducible control flow. It uses CycleInfo instead of LoopInfo. Cycles are a generalization of loops that recognize irreducible strongly connected components in the CFG.
And separately, uniformity analysis also works with Machine IR!

Bump! Mentioning some folks who had commented on previous work on convergence control intrinsics. The concepts introduced in uniformity analysis are very closely related to that.

D85603: IR: Add convergence control operand bundle and intrinsics

@simoll @dsanders @jdoerfert @efriedma-quic @AnastasiaStulova @jholewinski

I’m happy to see the existing convergence semantics in LLVM IR formalized in a consistent, explainable way.

I don’t really work with targets where convergence where convergence is relevant, so I don’t really have the intuition to say whether this formulation is “right”, but I don’t see any obvious issues with this formulation or the code implementing it.

It might be worth borrowing one of the examples from the “Motivating Examples of Convergent Operations” section of the documentation for D85603, or adding some external reference explaining some specific use of convergent operations. The current document goes straight to the formal definition without any informal introduction.

@ssahasra gave an overview of this RFC during the LLVM GPU WG Meeting earlier today. The recording is available at Uniformity Analysis - LLVM GPU WG Meeting 2022-08-19 - YouTube.

I haven’t tested it but from reading the code I suspect it won’t work for our use case unless we inline everything. Two extensions would probably be needed but should be doable if we split off the logic from the pass. That way I can provide inter-procedural and assumed information while using the intra-procedural transformation steps. Said differently, are people opposed to refactoring this (in tree) to allow usage inside the Attributor (=inter-procedural + optimistic fixpoint iteration framework)?

Actually this work is intended to be separate from the treatment of convergent operations. Convergent operations specify that certain threads must execute them convergently. Uniformity analysis (or divergence analysis) merely answers the question that if two threads execute an operation convergently (which is any operation, not necessarily a convergent operation), then whether the output is uniform.

Like the introduction says, convergent operations put some constraints on CFG transforms, which hold only when manipulating control flow around divergent branches. The uniformity analysis or the existing divergence analysis provide a way to identify non-divergent or uniform branches, where those transforms may be allowed. Neither analysis affects or depends on the semantics of convergent operations (as expressed by the “convergent” attribute). This is why convergent operations are not mentioned in this document other than the introduction.

This takes us way past the sale, as far as I can see! The current intent of the RFC is to invite attention to how we are defining convergence and uniformity in irreducible CFGs. This definition works for AMDGPU, but we believe it can easily be more general than that.

I am certainly open to any refactoring in tree once the implementation lands. I would be happy to contribute to such efforts.

I am still curious about where this will be used in the Attributor framework. Note that divergence analysis and uniformity analysis are both expensive, and need to be recomputed for the whole function with any change in the CFG. Attempts to defining incremental updates have not panned out yet. We usually avoid depending on this analysis in transformations unless the value is considerable, and it’s usually meant to disable a transformation (for example, don’t unswitch a loop if the branch is divergent).

Fair. We have downstream code to determine “all threads run together” regions in a cheap way. And upstream we already determine single threaded regions. Both are useful for us as they allow us to do store-load forwarding across functions. This is otherwise hard as synchronization make dominance and reachability unusable. With the “thread analysis” we can utilize those concepts, see https://tianshilei.me/wp-content/uploads/2022/06/ipdps-2022.pdf

Bump!

This review has been waiting for attention for close to four months now. The uniformity analysis is important for the following:

  1. Lay a strong groundwork for defining the convergence control intrinsics proposed in D85603. This is meant to completely replace the currently under-specified “convergent” attribute in LLVM IR.
  2. Improve optimizations in irreducible CFGs.
  3. Make uniformity analysis available in MIR, thus improving code generation. Especially the AMDGPU backend will greatly benefit from this information when allocating scalar/vector registers in a wave.

So far, there have been some comments that are “not opposed” to the change. It’s not clear if people who should be interested in the patch are able to provide the right attention. The work was presented at the LLVM Dev Meeting in San Jose too, in order to attract more attention.

I would really like some advice on how to move this forward. The concepts are sound, as far as they were reviewed by people working directly on that change. The analysis is new and does not affect existing pipelines. The AMDGPU pipeline is eager to use this analysis and is committed to maintaining it. Is this sufficient reason to let the patch through?

For submitting, at least a code review is necessary, which does not need domain expertise.

Sameer.

Bump! This review has been waiting for quite some time now. If there are no objections, I would like to just commit the patch next week!

Sameer.