Hi there,
I’d like to have parallel reduction support in the OpenMP dialect. It seems to boil down to two parts: (1) modeling the reduction scoping and participation clauses in the dialect, and (2) modeling the data-sharing aspects of reduction clauses. I understand how to model (1) reasonably well, but (2) has a non-trivial interplay with not-yet-designed parts of the dialect.
For (1), I propose to model the reduction kind definition through a dedicated op and model reduction participation through token-typed region arguments.
Reduction Declarations
The definition is similar to defining any other symbol, along the lines of
// Reduction declarations are symbols living in the surrounding module.
// They have the type of the value being reduced as an attribute.
// A new reduction is declared for every type (no generics in MLIR).
omp.declare_reduction @sub_i32 : i32
// Initialization region describes how the private reduction value is
// initialized. Its entry argument is the value of the reduction
// initializer that can be used inside. The region is isolated from above.
// It is expected to omp.yield the initialization result on all control
// flow paths.
init(%omp_orig: i32) {
  %c0 = constant 0 : i32
  omp.yield %c0 : i32
}
// The combiner region describes how to combine the two partial reduction
// accumulators. Note that in case of parallel loops it is *NOT* executed
// on each iteration. Instead, loops perform partial reductions through
// loop-carried values, which are then combined across different threads
// using the combiner. For example, subtraction reduction is implemented
// by *subtracting from zero* in the loop and by *adding up* the partial
// accumulator values (which are negative).
// The combiner takes two partial values as region arguments, with the
// first value coming from the accumulator. It is expected to omp.yield the
// combined value on all control flow paths.
combine(%omp_out: i32, %omp_in: i32) {
  %0 = addi %omp_out, %omp_in : i32
  omp.yield %0 : i32
}
// The optional atomic region describes how to combine the partial
// values atomically. It takes a *pointer* to the accumulator and a
// partial value and expects the pointer to be updated atomically
// to account for the partial value. In absence, the lowering can either
// generate the default `cmpxchg`-based version around the combiner
// or advise the OpenMP runtime that the atomic version is not
// available.
atomic(%omp_out: !llvm.ptr<i32>, %omp_in: i32) {
  llvm.atomicrmw add monotonic %omp_out, %omp_in : !llvm.ptr<i32>
}
The two mandatory regions are almost a direct modeling of OpenMP’s declare reduction directive, immediately enabling support for custom reductions, while the optional region provides finer-grained control over the generated code.
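For example, a user-defined multiplicative reduction over f32 would be declared in exactly the same way. This is an illustrative sketch reusing the syntax proposed above; the @mul_f32 name is made up:
omp.declare_reduction @mul_f32 : f32
init(%omp_orig: f32) {
  // Neutral element of multiplication.
  %c1 = constant 1.0 : f32
  omp.yield %c1 : f32
}
combine(%omp_out: f32, %omp_in: f32) {
  %0 = mulf %omp_out, %omp_in : f32
  omp.yield %0 : f32
}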
Reduction Scoping and Participation
Reduction scoping clauses are usually attached to worksharing directives, which are modeled as operations with regions. Therefore, I propose to model reduction scoping as an optional attribute/operand list on the scoping op. Each reduction creates an additional region argument of !llvm.token type that is used to refer to this reduction inside the region. Values that participate in the reduction are passed as operands to the omp.reduction operation along with the token that identifies the reduction. The token-based approach makes it possible to refer to reductions in arbitrarily nested constructs and to perform verification (e.g., that every reduction is sent a value and that the value has the expected type) without relying on the relative order of operations the way affine/scf.parallel do. This also lets us work around the fact that omp dialect operations, unlike higher-level loops, may have multiple blocks in their regions.
Loops specifically have additional block arguments that contain the partially accumulated value, carried across iterations. This allows loops to combine values differently than with the declared combiner.
Note that the initialization and results of reductions are discussed separately below.
Simple example:
/*... = */ omp.wsloop ... reduction(%token1 -> %accum1 = ...,
                                    %token2 -> %accum2 = ...) {
  // %token* and %accum* are actually arguments of the entry block.
  %0 = addf %accum1, ... : f32
  omp.reduction %token1, @add_f32, %0 : f32
  %1 = subi %accum2, ... : i32
  omp.reduction %token2, @sub_i32, %1 : i32
}
Example with nesting:
omp.parallel ... reduction(%token1) {
  %0 = "do_something"() : () -> f32
  omp.wsloop ... reduction(%token2 -> %accum = ...) {
    %1 = subi %accum, ... : i32
    omp.reduction %token2, @sub_i32, %1 : i32
  }
  // Note that one cannot use %token2 here thanks to value
  // visibility rules.
  omp.reduction %token1, @add_f32, %0 : f32
}
An alternative to using token values is to use an op interface and associate omp.reduction with the closest op that implements the interface (e.g., we can have omp.reduction inside scf.if inside omp.wsloop and have the reduction associated with the loop). This will likely require having one omp.reduction op per reduction and is sensitive to the order in which reductions are listed as well as to diverging control flow.
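For comparison, a rough sketch of what the interface-based alternative could look like (the syntax is hypothetical; the association is implicit through nesting rather than through a token):
omp.wsloop ... reduction(@add_f32 -> %accum = ...) {
  scf.if %cond {
    // With an op interface, this omp.reduction would be associated with
    // the closest enclosing op that implements the interface, i.e. the
    // omp.wsloop above, based on nesting and on the order in which the
    // reductions are listed on that op.
    omp.reduction @add_f32, %0 : f32
  }
}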
By default, we expect all reductions to have been declared as a symbol. As a simplification, we may later support reduction keywords in omp.reduction for kinds defined by the OpenMP specification, and have a pass that expands those keywords into declarations, specialized by element type, in the closest module.
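For example (hypothetical syntax for both the shorthand and the expansion pass):
// Shorthand: "add" is a keyword defined by the OpenMP specification.
omp.reduction %token1, add, %0 : f32

// After the expansion pass, a declaration specialized for f32 (the name
// is made up here) is materialized in the closest module and referenced
// by symbol instead:
//   omp.declare_reduction @add_f32 : f32 ...
omp.reduction %token1, @add_f32, %0 : f32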
Initialization and Results
Now to the most interesting part: how do we model reduction initialization and result value propagation? Consider the following example:
omp.parallel {
  omp.wsloop … reduction(...) {
    omp.reduction ...
  }
  // The reduced value must be available here.
}
// How does one access the reduction result here?
The specification assumes the existence of variables and that the result of the reduction is going to be stored in one. Furthermore, the reduction clause is also a data-sharing construct that creates private copies of the reduction variable. Finally, the runtime assumes such variables exist for the purpose of implementing tree-style reduction.
The most straightforward modeling here is to introduce the notion of “variable” and require reductions to be associated with variables.
%c1 = llvm.mlir.constant(1 : index) : i64
%0 = llvm.alloca %c1 x f32 : !llvm.ptr<f32>
// Values are shared by default, but let's specify for clarity.
omp.parallel shared(%0) {
  // Pass the reduction variable to the reduction scoping operation.
  // Exact syntax TBD.
  // The initial value is the one stored in the pointer.
  omp.wsloop ... reduction_vars(%0 : !llvm.ptr<f32>)
                 reduction(%token1 -> %accum : f32) {
    omp.reduction %token1, @add_f32, ...
  }
  // The value can be loaded here if necessary.
  %1 = llvm.load %0 : !llvm.ptr<f32>
}
// The value can also be trivially loaded here.
One practical difficulty is making this scale to the open type system in MLIR: we cannot assume variables are always of LLVM pointer type. This can be solved by introducing a “memory referring” type trait or interface in core MLIR and requiring “variables” to be of a type that has this trait or interface. While this is intrusive in core, such an interface is needed anyway for common compiler functionality such as aliasing analysis.
Note that this “memory referring” type trait is likely necessary for the “atomic” region in the declaration above.
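For illustration, assuming memref types were given such a trait, the variable-based form above could equally take a memref instead of an LLVM pointer (a sketch reusing the syntax proposed earlier; nothing here is final):
// Any type carrying the hypothetical "memory referring" trait would do.
%0 = memref.alloca() : memref<f32>
omp.wsloop ... reduction_vars(%0 : memref<f32>)
               reduction(%token1 -> %accum : f32) {
  omp.reduction %token1, @add_f32, ...
}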
A conceptual difficulty with this approach is that it requires going through memory instead of SSA registers, which can be detrimental to canonicalization and other optimizations (although I don’t have concrete examples).
We can consider an alternative representation, closer to how we model reductions with loops in other dialects: the initial value is passed into the reduction-carrying op and the result is returned from it. However, it is unclear how to marshal the reduction result upwards in the region hierarchy.
omp.parallel {
  %1 = omp.wsloop ... reduction(%token1 -> %accum = %0 : f32) {
    omp.reduction %token1, @add_f32, ...
  }
  // %1 is the reduced value, readily available here.
}
// How does one access the reduction result here?
This is actually an instance of the more general issue of **returning values from omp.parallel**.
We can assume that all parallel threads have the same value in %1 after the reduction. Theoretically, one could just llvm.store this value into some variable, but that store creates a race condition on purpose, which isn’t nice. Using an atomic store or restricting the store to a single thread would introduce synchronization overhead, and so would using a second reduction at the omp.parallel level (one could use a min or max reduction that simply yields the value when all reduced values are identical, or introduce an “assignment” reduction kind). We can also consider omp.yield-ing the value and defining the yield semantics so that the result of the parent operation is undefined if the threads do not yield identical values, leaving lowering passes the choice of how to implement this in practice. In general, I think this should adopt the same mechanism as other cases of returning values from parallel regions.
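A sketch of the omp.yield-based option (the result-carrying omp.parallel syntax is entirely hypothetical, shown only to make the semantics concrete):
%result = omp.parallel ... -> (f32) {
  %1 = omp.wsloop ... reduction(%token1 -> %accum = %0 : f32) {
    omp.reduction %token1, @add_f32, ...
  }
  // Each thread yields its copy of the reduced value; the result of
  // omp.parallel would be undefined unless all yielded values agree.
  omp.yield %1 : f32
}
// %result is accessible outside the parallel region.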
There is an additional hurdle with this modeling, though: the nowait attribute/clause. In the presence of this clause, threads can continue executing past the worksharing construct without waiting for the other threads to finish executing it. In the reduction case, this means that the final reduced value may not yet be available; it is only guaranteed to be available after all threads synchronize, e.g., on an explicit barrier. This creates a strange situation where an SSA value may hold different contents depending on where it is accessed, which sounds problematic with respect to SSA itself. For example, one cannot move a side-effect-free user of this value above the barrier without breaking the semantics.
%0 = omp.wsloop … reduction(...) nowait {
}
// the value of %0 is unspecified here (may be partial)
omp.barrier
// now the value of %0 is well-defined
// doesn't this contradict SSA?
These issues, together with the additional complexity of translating the value-based approach to LLVM IR (since the runtime ultimately expects reductions to go through memory, the lowering or the translation would have to introduce memory buffers for the reduction “variable” and for its private copies), make me lean towards the variable/pointer-based approach.
Discussion questions:
- Variable/pointer-based approach or value-based approach?
- If variable/pointer-based approach, do you see any optimizations that could be prevented or significantly complicated by this approach?
- If value-based approach, how do we return values from the parallel construct (this needs an answer regardless of reductions, but it is unnecessary to design it now if we don’t take the value-based approach)?
- If value-based approach, how do we handle the nowait+barrier case?
- Any other comments on the design?
cc @kiranchandramohan for OpenMP dialect, @jdoerfert and @Meinersbur for other OpenMP insights, @wsmoses for barrier semantics and @mehdi_amini for SSA and parallelism.