Summary
We need to decide on a way to represent combined and composite OpenMP constructs in the OpenMP dialect. A couple of options are being considered and it would be helpful to get wider community feedback to set a path forward. This decision will impact the representation of OpenMP loop constructs and possibly require changes to the current MLIR representation for the SIMD and DISTRIBUTE directives, as well as how the lowering of worksharing loops is done.
Below is an introduction to the problem and a set of proposed solutions, with comments on some of their pros and cons based on earlier discussions in the following PRs: #67798, #79843.
The proposal here is to keep the representation of combined constructs unchanged and to choose between the following composite construct representation alternatives, defined in the sections below:
- Wrapper operations (only `omp.wsloop`).
- Wrapper operations (with `omp.{ws,simd,dist}loop`).
- Composite operations.
- Explicit outer loops.
- Composite operations + explicit outer loops.
Though the chosen representation will have to integrate with the representation of loop transformations that is still under discussion, I present an example to show that these two decisions can be made independently as long as we agree on whether to use the existing `OpenMPIRBuilder` support or move that logic to a separate MLIR pass.
Background
Certain OpenMP block and loop constructs can be grouped into what are called combined and composite constructs, where each of the constituent individual constructs is called a "leaf construct". We want to be able to represent combined and composite constructs in the OpenMP dialect in a non-redundant, scalable and standard-conforming way that works well with the existing lowering process.
Combined constructs are those in which a second construct is immediately nested inside the first leaf construct, with no other statements between them. The nested construct can be a leaf, but it can also be another combined or composite construct. For example, the `target teams` combined construct can be decomposed as shown below:
// Combined construct.
#pragma omp target teams
{
foo();
}
// Equivalent individual constructs.
#pragma omp target
{
#pragma omp teams
{
foo();
}
}
Composite constructs, on the other hand, differ from combined constructs in that they do not have the same semantics as nesting one of the constructs immediately inside of another. They can, for example, define specific semantics to join together multiple leaf constructs that could not otherwise be directly nested inside of each other. In particular, composite constructs allow applying multiple loop-associated constructs to a single loop by introducing some additional implicit loops. For example, the `distribute simd` composite construct can be decomposed as shown below:
// Composite construct.
#pragma omp distribute simd
for (int i = 0; i < n; i++)
foo(i);
// Equivalent individual constructs.
// This implementation assumes trip count is divisible by block size.
#pragma omp distribute
for (int j = 0; j < n; j += blockSize) {
#pragma omp simd
for (int i = j; i < j + blockSize; i++)
foo(i);
}
As of version 5.2 of the OpenMP specification, the complete list of combined and composite constructs (using "+" to denote composite and a space to denote combined) is the following:
for+simd
distribute+simd
distribute+parallel+for
distribute+parallel+for+simd
taskloop+simd
parallel for
parallel loop
parallel sections
parallel workshare
parallel for+simd
parallel masked
masked taskloop
masked taskloop+simd
parallel masked taskloop
parallel masked taskloop+simd
teams distribute
teams distribute+simd
teams distribute+parallel+for
teams distribute+parallel+for+simd
teams loop
target parallel
target parallel for
target parallel for+simd
target parallel loop
target simd
target teams
target teams distribute
target teams distribute+simd
target teams loop
target teams distribute+parallel+for
target teams distribute+parallel+for+simd
As can be gathered from that list, the current set of composite constructs is relatively limited, as there are only 5 different ones: `for simd`, `distribute simd`, `distribute parallel for`, `distribute parallel for simd` and `taskloop simd`. However, there is ongoing discussion to significantly increase the number of supported combined and composite constructs in OpenMP 6.0 or later versions of the specification, which may extend that list in the future.
MLIR Representation
Combined Constructs
Representing combined constructs in the OpenMP MLIR dialect is currently done through nesting the set of operations that represent each of the leaf constructs. This is a natural representation, since it directly follows the definition of a combined construct, it introduces no redundancy in the dialect and it is scalable. If new combined constructs based on existing leaf constructs are added, no changes to the MLIR operations would be needed by following this approach.
Below is the correspondence between an example use of the `target teams` combined construct and the matching MLIR representation:
#pragma omp target teams <clauses...>
{
foo();
}
omp.target <target-clauses...> {
omp.teams <teams-clauses...> {
llvm.call @foo() : () -> ()
omp.terminator
}
omp.terminator
}
Composite Constructs
There is currently no widely accepted approach to representing composite constructs in the OpenMP MLIR dialect. Proof of this is the lack of support for representing the `for simd` composite construct even though the `for` and `simd` constructs are supported through the `omp.wsloop` and `omp.simdloop` operations. At the same time, the `omp.distribute` operation is designed to serve as a wrapper operation over a loop defined through one of these two operations, but there is no `omp.distloop` or similar operation to represent the case where it does not appear as part of a composite construct.
In this section, the main alternatives discussed so far to represent composite constructs are presented, using the following source code as a reference to show how each alternative would represent the same composite directive in MLIR.
#pragma omp distribute parallel for simd <clauses>
for (int i = 0; i < n; i++)
foo(i);
The "wrapper operations" alternative consists of having a single MLIR operation for each leaf construct that can be part of a composite construct. These operations hold information about the clauses applicable to the construct and contain a single region into which either the next leaf construct or, for the innermost operation, the actual loop or code block is nested.
// Wrapper operations.
omp.distribute <distribute-clauses...> {
omp.parallel <parallel-clauses...> {
omp.wsloop %i... <for-clauses...> {
omp.simd <simd-clauses...> {
llvm.call @foo(%i) : (i32) -> ()
omp.terminator
}
omp.yield
}
omp.terminator
}
omp.terminator
}
The "composite operations" alternative introduces a new MLIR operation to represent each composite construct with its corresponding list of supported clauses.
// Composite operations.
omp.distparwssimdloop %i... <clauses> {
llvm.call @foo(%i) : (i32) -> ()
omp.yield
}
The "explicit outer loops" alternative consists of not representing composite constructs at all in MLIR, but rather splitting them and representing them as the equivalent set of individual constructs. This makes the implicit loops introduced by leaf constructs explicit.
// Explicit outer loops.
%n = ...
%blksz0 = ...
%blksz1 = ...
%c0 = arith.constant 0 : i32
%c1 = arith.constant 1 : i32
omp.distloop for (%k) : i32 = (%c0) to (%n) step (%blksz0) <distribute-clauses> {
omp.parallel <parallel-clauses> {
%end0 = arith.addi %k, %blksz0 : i32
omp.wsloop for (%j) : i32 = (%k) to (%end0) step (%blksz1) <for-clauses> {
%end1 = arith.addi %j, %blksz1 : i32
omp.simdloop for (%i) : i32 = (%j) to (%end1) step (%c1) <simd-clauses> {
llvm.call @foo(%i) : (i32) -> ()
omp.yield
}
omp.yield
}
omp.terminator
}
omp.yield
}
Evaluation of Alternatives
For the representation of combined constructs, it seems clear that the current approach already meets the goals of being non-redundant, scalable and standard-conforming, so my proposal in that regard is to leave it unchanged.
Regarding the various alternatives for composite constructs, I summarize below some thoughts and concerns about each one, organized into three main categories.
Wrapper Operations
OpenMP dialect and standard-conformance:
- This is a scalable representation that would not require new operations if new composite constructs are added to the specification, since it is based on only representing leaf constructs.
- Composite constructs would look very similar to combined constructs in MLIR, though the way they have to be processed would be different. The MLIR representation would look like non-conforming OpenMP, with operations directly nested inside other operations they cannot really be directly nested inside of.
- `omp.wsloop`'s semantics would become context-dependent rather than exclusively representing worksharing across threads ("+" here denotes nesting of MLIR operations):
  - `omp.parallel` + `omp.wsloop`: worksharing across threads.
  - `omp.distribute` + `omp.wsloop`: worksharing across teams.
  - `omp.wsloop` + `omp.simd`: worksharing across SIMD lanes.
  - `omp.distribute` + `omp.parallel` + `omp.wsloop`: worksharing across threads over multiple teams.
  - `omp.distribute` + `omp.parallel` + `omp.wsloop` + `omp.simd`: worksharing across threads and SIMD lanes over multiple teams.
  - `omp.parallel` + `omp.wsloop` + `omp.simd`: worksharing across threads and SIMD lanes.
  - `omp.distribute` + `omp.wsloop` + `omp.simd`: worksharing across teams and SIMD lanes.
- Alternatively, rather than representing all worksharing loops using `omp.wsloop` and using wrappers to represent all cases as described in the previous point, `omp.simdloop` and `omp.distloop` could be kept and added, respectively, to represent `#pragma omp simd` and `#pragma omp distribute` loops, using wrapper operations only for composite constructs.
  - It might be confusing to keep both `omp.{simd,dist}loop` and `omp.{simd,distribute}` in the dialect, and there would be a significant overlap between the arguments the loop and wrapper variants would take.
  - It does not remove the need to make these loop operations context-dependent.
MLIR creation:
- The current approach for creating MLIR for combined constructs could be used for composite constructs as well, so not much additional work is required.
MLIR to LLVM IR translation:
- Expressing the hierarchy of leaf constructs explicitly in MLIR would match certain already-existing LLVM IR code generation flows implemented in the `OpenMPIRBuilder`.
- However, certain other flows might require sharing information gathered in nested constructs with some of their parents, e.g. for reductions or even host code generation for target device kernel launches.
  - This can be solved by adding more state to the `OpenMPIRBuilder`, which might be something to avoid if possible.
  - Also solvable by implementing custom translation functions for certain patterns of nested OpenMP MLIR operations rather than always calling a single translation function per leaf construct.
Composite Operations
OpenMP dialect and standard-conformance:
- Semantically distinct groups of constructs are represented by their own operation, which seems like something that we want to have.
- However, the semantics are not quite that different from applying each leaf construct: these operations only really represent introducing some additional loops, transforming the original one, and then applying each leaf construct to one of the resulting loops.
- Introduces redundancy in the dialect, as composite operations would contain arguments for clauses applicable to each of their leaf constructs, for which we would have standalone operations too.
- Maybe it is possible to refactor the tablegen definitions of certain argument lists, etc., to allow some reuse across different operations.
- This representation is not scalable with regards to the number of different composite constructs in the specification.
- Whether this should be a concern entirely depends on how many of the potential additions that may come in OpenMP 6.0 or later are expected to be for composite constructs specifically.
- clang/flang already list each combined/composite construct individually with their applicable clauses (see here), so this MLIR representation would not be the only part of the compiler that makes adding new ones cumbersome.
MLIR creation:
- Trivial construction of the MLIR representation, since there would be a 1-to-1 correspondence between source constructs and MLIR operations.
MLIR to LLVM IR translation:
- Some refactoring would be needed to allow translation functions for the composite operations to share code with translation functions for each leaf operation.
- This might be made easier by introducing an MLIR interface for each leaf operation that can appear on a composite operation.
- It should generally give flexibility as to how to translate each composite leaf construct vs. standalone constructs, as well as passing information shared between leaf constructs.
Explicit Outer Loops
OpenMP dialect and standard-conformance:
- This representation mirrors what is done for combined constructs, since it expands composite constructs into exactly what they represent.
- It results in producing the same MLIR representation for semantically equivalent source codes regardless of whether composite constructs are used, so composite constructs would be treated as syntactic sugar and not result in different binaries.
- Although it would be possible, this would be more difficult to achieve starting from different MLIR representations.
- This approach would most likely prevent the use of the `OpenMPIRBuilder` for loop transformations, since they need to be applied before the new worksharing loops are introduced.
MLIR creation:
- More complicated construction of the MLIR representation, as it needs to also represent blocking/scheduling according to the constructs and clauses specified while considering non-divisible trip counts.
- Alternatively, it could be left to the compiler frontend to split these constructs on its own (i.e. make parse tree rewrites) before creating the MLIR representation. This might be unnecessarily difficult to do.
- It is possible to combine this approach with the âcomposite operationsâ approach to simplify initial MLIR creation. There would be two levels of abstraction and an MLIR pass to lower composite operations into a set of loops implementing explicit blocking and applying each leaf operation independently.
- Creating an "hlomp" dialect might be overkill for this purpose, but the alternative of having high-level and low-level versions of these operations which cannot be mixed complicates the dialect.
- It would be difficult to implement both loop transformations and explicit outer loops directly when creating the MLIR representation.
- Mixing it with the âcomposite operationsâ approach enables the application of loop transformations in MLIR before introducing the new worksharing loops.
MLIR to LLVM IR translation:
- Translation to LLVM IR might become significantly more complicated in some cases, because any composite construct for which we need to target specific runtime calls would require recognizing certain MLIR patterns including multiple nested operations and chunk calculations.
- However, when the expected LLVM IR matches exactly the hierarchical structure of the source code after splitting composite constructs, this approach could be the simplest to implement.
- Lowering information sharing between leaf constructs would be more difficult following this approach, since they apply to different loops and it would most likely require adding more state to the `OpenMPIRBuilder`.
Interaction with Loop Transformations
The OpenMP specification also defines some loop transformation constructs (e.g. tile, unroll) that replace a given loop nest with a structured block that might itself be another loop nest. It is expected that OpenMP 6.0 will introduce more of these, so extensible support for them in the OpenMP MLIR dialect will be needed.
Currently, the existing loop operations (`omp.wsloop` and `omp.simdloop`) represent the loop nest themselves, including induction variables, ranges and steps. This approach makes it difficult to represent arbitrary loop transformations independently of how the loop might later be executed, so there has been discussion of introducing the `omp.canonical_loop` operation to separate the definition of the loop nest, the transformations applied to it, and its execution. The discussion on how this operation might be represented is still ongoing, and it spans multiple threads: D155765, #65380, #67720.
One important feature of loop transformation constructs is that they should be processed before any loop constructs are applied, so that the latter refer to the result of the loop transformations and not the original loop nest. This is where there can be some interactions between the decision to represent loop transformations and how loop constructs in general and combined/composite loop constructs in particular are represented. Both representations should be compatible and not introduce unreasonable difficulties.
Some loop transformations are already implemented in the `OpenMPIRBuilder`, which is also the preferred place for OpenMP LLVM IR codegen to reside, as it can be shared between clang and flang. So, in principle, any approach to representing OpenMP loop transformations in MLIR should be able to target this existing support.
The alternative would be to implement loop transformations entirely within MLIR, which would introduce a point of divergence between clang and flang, as well as require more development work. Following this path would mean that any alternative for representing composite constructs that requires handling loop transformations in MLIR would need agreement from the community to move this logic out of the `OpenMPIRBuilder`.
Assuming that, at least initially, we would prefer to rely on the `OpenMPIRBuilder`, we can narrow down the available options to the "wrapper operations" and "composite operations" alternatives, because these are the ones that allow delaying the loop transformations to LLVM IR rather than doing them in MLIR. If we also assume that we define `omp.canonical_loop` as a kind of lambda function that returns an MLIR value representing an `OpenMPIRBuilder::CanonicalLoopInfo` object, both options would be able to convey the semantics of composite constructs while integrating seamlessly with loop transformation operations. Below is an example of how that could potentially look:
#pragma omp target teams distribute parallel for
#pragma omp tile sizes(2)
for (int i = 0; i < n; i++)
foo(i);
// Define loop body (%i would be an entry block argument to represent the
// induction variable).
%loop = omp.canonical_loop (%i) : i32 context(...) {
// Use additional block args to access context values defined in the execution
// site...
llvm.call @foo(%i) : (i32) -> ()
omp.terminator
} (...) -> !omp.cli
// Loop transformations.
%tloop = omp.tile %loop { sizes=[2] } : (!omp.cli) -> (!omp.cli)
// OPTION 1: Run loop with wrapper operations.
omp.target {
%n = ... : i32
%c0 = arith.constant 0 : i32
%c1 = arith.constant 1 : i32
omp.teams {
omp.distribute {
omp.parallel {
// Execute loop, specifying pre-transformations loop range.
omp.wsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.cli, i32, i32, i32, ...) -> ()
omp.terminator
}
omp.terminator
}
omp.terminator
}
omp.terminator
}
// OPTION 2: Run loop with composite operations.
omp.target {
%n = ... : i32
%c0 = arith.constant 0 : i32
%c1 = arith.constant 1 : i32
omp.teams {
// Execute loop, specifying pre-transformations loop range.
omp.distparwsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.cli, i32, i32, i32, ...) -> ()
omp.terminator
}
omp.terminator
}
The example above shows that, after deciding on where loop transformations should be implemented (MLIR or LLVM IR), the representation of these and that of loop constructs should be orthogonal decisions. So my proposal is to:
- Decide whether it would be justified to implement loop transformations as an MLIR pass.
- Focus on what the best pre-canonical-loop representation for composite and combined constructs would be, regardless of how loop transformations eventually get represented. These should be separate discussions, and adapting what we come up with here to what eventually gets decided in that other discussion should not take significant effort.