[RFC] OpenMP reduction support

Hi there,

I’d like to have parallel reduction support in the OpenMP dialect. It seems to boil down to two parts (1) modeling the aspects reduction scoping and participation clauses in the dialect and (2) modeling the data sharing aspects of reduction clauses. I understand how to model (1) reasonably well but (2) has a non-trivial interplay with not-yet-designed parts of the dialect.

For (1), I propose to model the reduction kind definition through a dedicated op and model reduction participation through token-typed region arguments.

Reduction Declarations

The definition is similar to defining any other symbol, along the lines of

// Reduction declarations are symbols living in the surrounding module.
// They have the type of the value being reduced as an attribute.
// New reduction is declared for every type (no generics in MLIR).
omp.declare_reduction @sub_i32 : i32
// Initialization region describes how the private reduction value is
// initialized. Its entry argument is the value of the reduction
// initializer that can be used inside. The region is isolated from above.
// It is expected to omp.yield the initialization result on all control
// flow paths.
init(%omp_orig: i32) {
  %c0 = constant 0 : i32
  omp.yield %c0 : i32
}
// The combiner region describes how to combine the two partial reduction
// accumulators. Note that in case of parallel loops it is *NOT* executed
// on each iteration. Instead, loops perform partial reductions through
// loop-carried values, which are then combined across different threads
// using the combiner. For example, subtraction reduction is implemented
// by *subtracting from zero* in the loop and by *adding up* the partial
// accumulator values (which are negative).
// The combiner takes the accumulator values as region arguments, with the
// first value coming from the accumulator. It is expected to omp.yield the
// combined value on all control flow paths.
combine(%omp_out: i32, %omp_in: i32) {
  %0 = addi %omp_out, %omp_in : i32
  omp.yield %0 : i32
}
// The optional atomic region describes how to combine the partial
// values atomically. It takes a *pointer* to the accumulator and a 
// partial value and expects the pointer to be updated atomically
// to account for the partial value. In absence, the lowering can either
// generate the default `cmpxchg`-based version around the combiner
// or advise the OpenMP runtime that the atomic version is not
// available.
atomic(%omp_out: !llvm.ptr<i32>, %omp_in: i32) {
  llvm.atomicrmw add monotonic %omp_out, %omp_in : !llvm.ptr<i32>
}

The two mandatory regions are almost a direct modeling of OpenMP’s declare reduction directive, immediately enabling the support for custom reductions, and the optional region provides finer-grain control over the generated code.

Reduction Scoping and Participation

Reduction scoping clauses are usually attached to workshare directives, which are modeled as operations with regions. Therefore, I propose to model reduction scoping as an optional attribute/operand list on the scoping op. Each reduction creates an additional region argument of !llvm.token type that is used to refer to this reduction inside the region. Values that participate in the reduction are passed as operands to the omp.reduction operation along with the token that identifies the reduction. The token-based approach allows one to refer to reductions in arbitrarily nested constructs and to perform verification such as all reductions are being sent a value or a value of the same type, etc, without relying on the relative order of operations like affine/scf.parallel do. This allows us to work around the fact that omp dialect operations may have multiple blocks in their regions unlike higher-level loops.

Loops specifically have additional block arguments that contain the partially accumulated value, carried across iterations. This allows loops to combine values differently than with the declared combiner.

Note that the initialization and results of reductions are discussed separately below.

Simple example:

/*... = */ omp.wsloop ... reduction(%token1 -> %accum1 = ...,
                                    %token2 -> %accum2 = ...) {
// %token* and %accum* are actually arguments of the entry block
  %0 = addf %accum1, ... : f32
  omp.reduction %r1, @add_f32, %0 : f32
  %1 = subi %accum2, ... : i32
  omp.reduction %r2, @sub_i32, %1 : i32
}

Example with nesting:

omp.parallel ... reduction(%token1) {
  %0 = "do_something"() : () -> f32
  omp.wsloop ... reduction(%token2 -> %accum = ...) {
    %1 = subi %accum, ... : i32
    omp.reduction %token, @sub_i32, %1 : i32
  }
  // Note that one cannot use %r2 here thanks to value
  // visibility rules.
  omp.reduction %token, @add_f32, %0 : f32
}

An alternative to using token values is to use an op interface and associate omp.reduction with the closest op that implements the interface (e.g., we can have omp.reduction inside scf.if inside omp.wsloop and have the reduction associated with the loop). This will likely require having one omp.reduction op per reduction and is sensitive to the order in which reductions are listed as well as to diverging control flow.

By default, we expect all reductions to have been declared as a symbol. As a simplification, we may later support reduction keywords in omp.reduction for kinds defined by the OpenMP specification and have a pass that expands those keywords to declarations in the closest module specialized by element type.

Initialization and results

Now to the most interesting part, how do we model reduction initialization and result value propagation. Consider the following example:

omp.parallel {
  omp.wsloop … reduction(...) {
    omp.reduce ...
  }
  // The reduced value must be available here.
}
// How does one access the reduction result here?

The specification assumes the existence of variables and that the result of the reduction is going to be stored in one. Furthermore, reduce is also a data sharing construct that creates private copies of the reduction variable. Finally, the runtime assumes such variables exist for the purpose of implementing tree-style reduction.

The most straightforward modeling here is to introduce the notion of “variable” and require reductions to be associated with variables.

%c1 = llvm.mlir.constant(1 : index) : i64
%0 = llvm.alloca %c1 x f32 : !llvm.ptr<f32>
omp.parallel 
// Values are shared by default, but let's specify for clarity.
shared(%0) {
  // Pass the reduction variable to the reduction scoping operation.
  // Exact syntax TBD.
  // The initial value is the one stored in the pointer.
  omp.wsloop ... reduction_vars(%0 : !llvm.ptr<f32>) 
                 reduction(%token1 -> %accum : f32) {
    omp.reduce %token1, @add_f32, ...
  }
  // The value can be loaded here if necessary.
  llvm.load %0 : !llvm.ptr<f32>
}
// The value can also be trivially loaded here.

One practical difficulty is making this scale to the open type system in MLIR. We cannot assume variables to always be of LLVM pointer type. This can be solved by introducing a “memory referring” type trait or interface in core MLIR and requiring “variables” to be of a type that has this trait or interface. While this is intrusive in core, such an interface is anyway necessary for common compiler things like aliasing analysis.

Note that this “memory referring” type trait is likely necessary for the “atomic” region in the declaration above.

A conceptual difficulty with this approach is that it requires going through memory instead of SSA registers, which can be detrimental to canonicalization and other optimizations (although I don’t have concrete examples).

We can consider an alternative representation closer to how we model reductions in other dialects with loops: the initial value is passed into the reduction-carrying op and the result is returned from it. However, it is unclear how to marshall the reduction result upwards in the region hierarchy.

omp.parallel {
  %1 = omp.wsloop ... reduction(%token1 -> %accum = %0 : f32) {
    omp.reduce %token1, @add_f32, ...
  }
  // %1 is the reduced value readily available
}
// How does one access the reduction result here?

This is actually an instance of a more general issue of **returning values from omp.parallel.

We can assume that all parallel threads have the same value in %1 after reduction. Theoretically, one could just do llvm.store of this value into some variable, but this store creates a race condition on purpose, which isn’t nice. Using an atomic store or restricting it to one thread would introduce synchronization overhead. So would using a second reduction at the omp.parallel level (one could use a min or max reduction that will just yield the value if all reduced values are identical, or introduce an “assignment” reduction kind). We can also consider omp.yielding the value and defining the yield semantics so that the result of the parent operation is undefined if threads are not yielding identical values, leaving the lowering passes the choice of how to implement this in practice. In general, I think this should adopt the same mechanism as other cases of returning values from parallel regions.

There is an additional hurdle with this modeling though, the nowait attribute/clause. In presence of this clause, the threads can continue executing after the worksharing construct without waiting for other threads to finish executing the construct. In the reduction case, it means that the final reduced value may not yet be available. It is guaranteed to be available after all threads synchronize, e.g., on an explicit barrier. This creates a weird situation where an SSA value may have different values depending on where it is accessed, which sounds problematic wrt to SSA itself. For example, one cannot move a side effect-free user of this value above the barrier without breaking the semantics.

%0 = omp.wsloop … reduction(...) nowait {
}
// the value of %0 is unspecified here (may be partial)
omp.barrier
// now the value of %0 is well-defined
// doesn't this contradict SSA?

These issues, together with the additional complexity of implementing the translation of the value-based approach to LLVM IR (since the runtime ultimately expects reductions to go through memory, the lowering or the translation will have to introduce memory buffers for the reduction “variable” and for its private copies), make me lean towards the variable/pointer-based approach.

Discussion questions:

  • Variable/pointer-based approach or value-based approach?
  • If variable/pointer-based approach, do you see any optimizations that can be prevented or significantly complexified by this approach?
  • If value-based approach, how do we return values from the parallel construct (this needs an answer regardless of reductions, but unnecessary to design now if we don’t take the value-based approach)
  • If value-based approach, how do we handle the nowait+barrier case?
  • Any other comments on the design?

cc @kiranchandramohan for OpenMP dialect, @jdoerfert and @Meinersbur for other OpenMP insights, @wsmoses for barrier semantics and @mehdi_amini for SSA and parallelism.

1 Like

Thanks @ftynse for this detailed RFC. I haven’t got my head through all this. You raise some interesting questions and I welcome the general question about variables and values in OpenMP dialect. In the past we have hand-waved and added AnyType to some of the operands which corresponds to variables. I have a few questions as a first reply.

In the section “Reduction Declarations”, you mention about reduction declaration symbols and three regions. I did not understand how these are all tied together. Are you suggesting that there is an operation with three regions and a constant symbol?

In the examples and explanation you have used the llvm dialect. You clarify further about not restricting to the LLVM pointer type. Does this extend to other usage of the LLVM dialect, like the token type.

At the moment there are no values returned from the parallel region. What are the cases other than reduction results that has to be returned from parallel region? I am assuming that updates to variables will be stores. For worksharing loops (and a few others) there is a last-private clause which updates the original variable with the last value. One thought I had was there can be a region which does finalisation (storing the results of lastprivate and reduction). Does that make any sense? I guess it will still incur the penalty.

I have not thought about the nowait and reduction results issue. Thanks for pointing out this issue.

I responded by email, but discourse didn’t like it. Here it is again:

Thanks for the extensive write up. Here are some of my thoughts:

  1. I’d find the value approach nicer for an SSA-based IR, with well-defined content that does not change when it is accessed. The possibility of delaying the result might be a major problem that would require an alternative means to define a definition as well. This reminds me of a ‘future’, and we could base an SSA representation on that (please excuse my broken MLIR):
%parallel_reduction_future = omp.parallel ... reduction(@add_f32: %in -> % wsloop_reduction_result : f32){
  %wsloop_reduction_future = omp.wsloop ... reduction(@add_f32: %in -> %out : f32) {
    %intermediate = @add_f32(%in, %value)
    %out = @add_f32(%intermediate, %another_value) ...
    yield %out
  }
  %wsloop_reduction_result = omp.reduction.get(%wsloop_reduction_future, %in)
  yield %wsloop_reduction_result
}
omp.barrier
%parallel_reduction_result = omp.reduction.get(%parallel_reduction_future, %in)

use(%parallel_reduction_result)
  1. It might be beneficial to have in-built reductions (for add, min, max, …) instead of having to define the init/combine functions for them. Some architectures have dedicated support for them and these could not be used easily if their semantics is ‘hidden’ inside some function definitions.

  2. You mention the runtime would decide whether to use the atomic. To keep the overhead down, I think the involvement of the runtime should be minimal, and the majority of decisions be made by the OpenMPIRBuilder.

  3. A lot of OpenMP directives support reduction clauses, the newest being the Scope directive (scope Construct). This might motivate to define reductions orthogonal to other constructs, maybe as independent operations instead of attributes, but I don;t know how that would look like.

  4. In your suggested syntax the omp.reduction operation seems to be the only one carrying which reduction to apply. What if is optimized away, e.g. by dead code elimination:

omp.wsloop ... reduction(%token1 -> %accum1 = ...,) {
  if (false) {
    omp.reduction %r1, @add_f32, %0 : f32
  }
}

The result of the reduction with 0 elements to reduction could be 0 (for addition), but also 1 (if multiplication), i.e. it needs to know the neutral element.

  1. A @sub_i32 reduction is weird to define. In OpenMP a subtract reduction is actually that same as addition, but the OpenMP committee was considering deprecating it.

Michael

TBH, I only skimmed over most of this, it’s a lot.
From what I can tell, the value-based approach won’t work, conceptually. Take

void do_some_additions(int *p) { *p += *p; }
int test(int x, int y, int *A) {
#pragma omp parallel reduction(+:x,y)
  {
    do_some_additions(&x);
    do_some_additions(&y);
    do_some_additions(&A[omp_get_thread_num]);
  }
  return x + y;
}

there is no way you see all “reduction operations”, nor is it that an operation is always part of a reduction.

I fail to understand the token thing, but I suspect it breaks down with the same example above.

I would have expected something very simple, matching the OpenMP source. Let’s take the example above, and forgive my inability to write MLIR:

%x = llvm.alloca i32 : !llvm.ptr<i32>
%y = llvm.alloca i32 : !llvm.ptr<i32>
omp.parallel reduction(%x, %y) {
   call @do_some_additions(i32* %x)
   call @do_some_additions(i32* %y)
   call @do_some_additions(i32* %A)
}
%xv = load i32, i32* %x
%yv = load i32, i32* %y
%add = add i32 %xv, %yv
ret i32 %add

That seems to be easy and sufficient for the MLIR level, the OMPIRBuilder can generate the proper code from that, I think.

P.S. I don’t really monitor discourse nor my email folder with all the mails I get, ping me if it looks I am not paying attention.

I feel like resolving the value/variable discussion is the key to properly modeling OpenMP :slight_smile:

Yes, it it is an operation with three regions. Symbol is a “property” of an operation and the symbol name is stored as an operation attribute. We do this for functions, modules, etc.

Tokens are only there to communicate between two operations in the OpenMP dialect and will disappear at translation, I see no point in supporting arbitrary types there. The OpenMP dialect already depends on the LLVM dialect, so we can just use the available type and not define a new one.

Anything that one wants to be usable as value after the region and be subject to SSA-style analyses. We don’t need to return reduction results strictly speaking, they can also pass through memory aka variables. However, I’ve repeatedly seen benefits of modeling things as data flow of values instead of going through memory (for example, scf and affine loops model reductions through loop-carried “iteration arguments” so reduction detection is trivial and not subject to aliasing and value analysis, and so is parallelization). I am not saying we necessarily want to adopt maximally value-based approach for the OpenMP dialect, but going only though memory/variables should be a conscious choice. Hence my question about missed optimization opportunities.

I’m not sure it has to be a region.

Indeed, I thought about using futures for this problem, and that was exactly the moment where I decided that being fully value-based might be counter-productive. OpenMP itself doesn’t have futures AFAIK, so I suppose it can be strange for the compiler to have to deal with them when targeting the dialect. Even worse if OpenMP decides to have futures in some form…

We can have built-in kinds as in the extended version I proposed, and lower those directly to target-specific constructs instead of expanding the definitions.

The runtime has a flag that lets the compiler tell the runtime whether it generated an atomic version or not. Not providing the “atomic” region means that downstream flow is free to unset that flag or to generate the atomic version itself. Providing it means that the downstream flow is expected to set the flag and use the instructions from that region.

Using independent operations sounds messy to me, we would have to define complex inter-operation validity rules that make manipulating IR a mess. We could use a trait or an operation interface instead (interfaces are exactly what you think they are in OOP).

This is a good example. For one, we may want to make sure reduction operations communicate with DCE properly. I suppose we can move the symbol reference to the reduction clause and look it up through the token. This would solve the problem as the neutral element is known given the symbol reference.

Yeah, that’s why I wanted to explore @sub_i32 specifically. In other places in MLIR we wouldn’t “repeat” the reduction operation in the body, so subtraction would have to be implemented differently than what OpenMP does.

None of that is actually required. Your example needs a reduction that works on pointers-to-values rather than just values, but it can be expressed along the lines of

/* definition of "variable addition" */
omp.declare_reduction @add_i32ptr
init(%omp_orig : !llvm.ptr<i32>) {
  %0 = llvm.alloca ... : !llvm.ptr<i32>
  %c0 = constant 0 : i32
  llvm.store %c0, %0
  omp.yield %0
}
combine(%omp_out : !llvm.ptr<i32>, %omp_in : !llvm.ptr<i32>) {
  %0 = llvm.load %omp_out
  %1 = llvm.load %omp_in
  %2 = llvm.add %0, %1
  llvm.store %2, %omp_out
  omp.yield %omp_out
}

%x = llvm.alloca ... : !llvm.ptr<i32>
%y = llvm.alloca ... : !llvm.ptr<i32>
%A = ... : !llvm.ptr<i32>
// The results will be values of pointer type aliasing %x and %y.
%xr, %yr = omp.parallel reduction(%t1 -> %xv = %x : !llvm.ptr<i32>,
                                  %t2 -> %yv = %y : !llvm.ptr<i32>) {
  // %xv and %yv are privatized values of pointer type.

  call @do_some_additions(%xv)
  omp.reduce @add_i32ptr %t1, %xv
  
  call @do_some_additions(%yv)
  omp.reduce @add_i32ptr %t2, %yv
  
  %0 = call @omp_get_thread_num() : () -> i32
  %Aelem = llvm.gep %A, %0 : !llvm.ptr<i32>
  call @do_some_additions(%Aelem)
  // No "omp.reduce" here because we are not reducing.
}

Emitting OpenMP dialect directly from a language representation with the notion of variable will likely have to create this pointer-reduction form, which can be later transformed into the value-reduction form if desired and possible. The transformation is similar to an advanced mem2reg. Emitting OpenMP dialect from other dialects can directly target the value-reduction form though, without having to deal with pointers and variables. In the flows I have, there will be no pointers when OpenMP is generated and I have a mild preference of not introducing pointers unless it is necessary.

There are other things that make me doubt the value-based approach – the “future” scenario with nowait, described in the original post, and returning values out of nested constructs (i.e., #pragma omp parallel has a #pragma omp loop reduction(...) in it). Both are trivially solved by the variable-based approach.

It is not. But now that I think about it more, we don’t necessarily need tokens, at least in the variable-based approach. We can use values containing the variable instead.

FWIW, I have a prototype that implements the variable-based approach and it is indeed easy and connects well to the builder. I had also tried the value-based approach and it felt more complex to translate.

I guess if people want multiple representations for the same thing and we can ensure the value one can be lowered to the pointer one in the end, I don’t have strong objections. Personally, I’d not invent a second OpenMP reduction encoding without having a good use case that couldn’t be solved by the first reasonably well.

Cool, we can use that as a starting point :slight_smile:


Long story short, let’s do the variable-based one first with Builder integration for lowering. That is necessary for OpenMP in general. If we have additional MLIR representations other than that is fine with me. I would potentially do specialization it without OpenMP though, e.g., if you see a reduction you want to deal with in the MLIR level, remove it from the OpenMP reduction list and drop all the necessary blobs to completely deal with it.

Even if OpenMP doesn’t have futures doesn’t mean the internal cannot use them, e.g. the internal representation uses SSA which the source languages do not use. For dialects that don’t know futures, it is just another function call that returns some value.

However, the variable holding the reduction result doesn’t necessarily need be local to the function. If the first use after the reduction result is outside of the function doing the reduction, it will not know about the future it has to call to get the value. That might be a showstopper.

You mention the issue of using the LLVM pointer type. Doesn’t have MLIR the dialect-independent MemRef type? Why not use that one?

Like finding that the if statement is dead code in the following?

integer :: x
!$omp parallel
x = 5
!$omp end parallel
if (x .ne. 5)
  call impossible_func()
end if

+1

Possibly to keep it more general. Like there will be no memrefs in the Flang flow.

Hi, we’ve noticed a couple of issues with the current flow for reductions in OpenMP, most notably that there are a few situations that aren’t modelled correctly. In particular, the OpenMP specification does not require that the only operations performed on the reduction variable match the reduction clause; in fact all the specification says is that each thread is given a private instance of that variable and those private instances are combined using the reduction function.

Take the following example in Fortran:

program main
  integer :: n = 0
  !$omp parallel reduction(+:n)
    n = 1
  !$omp end parallel
  print *, n
end program main

This should print the number of threads, as each thread will assign 1 to its private variable and then they will be combined using +. However currently it will print 1.

In order to better represent the semantics in the OpenMP specification I’d like to propose we change how reductions are done. For example for the above program we could generate something like

  func.func @main() {
    %c0 = arith.constant 0 : i32
    %cst = arith.constant 0.000000e+00 : f32
    %1 = llvm.alloca %c0 x i32 : !llvm.ptr<i32>
    omp.parallel  reduction(@__add_reduction %1 -> %prv : !llvm.ptr<i32>) {
      %c1 = arith.constant 1 : i32
      llvm.store %c1 to %prv : !llvm.ptr<i32>
      omp.terminator
    }
    %2 = llvm.load %1 : !llvm.ptr<i32> -> i32
    // print %2 ...
    return
  }

This would be similar to one of the proposed solutions above except that in this example instead of having a special omp.reduce operation inside the body we just use normal operations on a specially created private pointer. That pointer can then be initialized using the specified default value for the reduction. This would better match the semantics in OpenMP where there is nothing “special” about the reduction variable other than that it is collected at the end of the region.

I wonder if anyone has an opinion on this?
Thanks!
David Truby

cc @ftynse @kiranchandramohan @jdoerfert

1 Like

Makes sense to me. I am unsure why we do anything different than Clang, which does not treat the variable different in the region, just replaces it with the private copy.

The following code is probably OK for the parallel reduction. Assuming %prv is a block argument here. In the case of loop-like operations (omp.wsloop), not sure whether that would mean instantiating a private value in the body of the loop.

    omp.parallel  reduction(@__add_reduction %1 -> %prv : !llvm.ptr<i32>) {
      %c1 = arith.constant 1 : i32
      llvm.store %c1 to %prv : !llvm.ptr<i32>
      omp.terminator
    }

The current modelling did not have to explicitly make the private value visible in the code. It has other advantages like unifying the combination function in the body as well as for combining the thread values. The scf-to-openmp translation makes use of this. So changes will have to be made there.

Privatizing simple types is easy, but complex types (allocatables, pointers, derived types) might need special handling.

I don’t think you’d need to do a manual initialisation; my idea was that %prv would create a private variable which is automatically initialised to the correct initial value not in the loop body but as part of starting the reduction.

The question I was asking was whether the following modelling of reduction in a worksharing loop would lead to the instantiation of the private copy inside the loop-body? Instantiating in the body would mean instantiation on every iteration. This is not a problem for omp.parallel since instantiation in the body is fine and correct.

    omp.wsloop  reduction(@__add_reduction %1 -> %prv : !llvm.ptr<i32>)  for %iv : i32 %start to %end step(%c1) {
    ^bb0(%prv: !llvm.ptr<i32>):
      %c1 = arith.constant 1 : i32
      llvm.store %c1 to %prv : !llvm.ptr<i32>
      omp.terminator
    }

Ok I see what you mean by a block argument here, I was getting confused with something else. Is it necessary to have a block argument like that? I was somewhat under the impression that the reduction clause could create and bring into scope the %prv variable itself inside the region rather than having to have it as a block argument but perhaps I’ve misunderstood something.
I was mostly following the examples above posted by @ftynse and @Meinersbur that appear to have variables introduced in the reduction clause that are used in the region and don’t require block arguments

The proposed change is now implemented in the following patches and is ready for review. The first one introduces changes for Parallel reduction, the second one makes changes for the Worksharing loop reduction and also modifies the SCF to OpenMP conversion.