Background
The OpenMP specification defines a set of standalone constructs that can be grouped together into combined and composite constructs. Combined constructs are those that are semantically equivalent to splitting the construct into its first “leaf” construct containing the rest, whereas composite constructs are those that would require further code transformations to be split.
As a very simple overview, some OpenMP constructs are block-associated (i.e. they can be applied to a block of code) and others are loop-associated (i.e. they can only be applied to a loop). Currently, all composite constructs are loop-associated.
The `parallel` construct is peculiar in that it generally works as a block-associated construct when used standalone or in a combined construct, but it can also be a leaf in a loop-associated composite construct. That is the case for `distribute parallel do` and `distribute parallel do simd` (spelled `distribute parallel for` and `distribute parallel for simd` in C/C++). Below is an example of each case.
```c
// Block-associated (equivalent to the combined 'parallel for' construct).
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < 10; ++i) {
    // ...
  }
}

// Loop-associated.
#pragma omp distribute parallel for
for (int i = 0; i < 10; ++i) {
  // ...
}
```
MLIR Representation
Loop-associated constructs in the OpenMP dialect in MLIR are currently represented as loop wrappers. These are operations that define a single region where only a single loop wrapper or `omp.loop_nest` operation and a terminator are allowed.
This representation enables us to apply multiple loop wrappers to a single loop nest, which is how composite constructs can currently be represented in a scalable manner.
On the other hand, the region defined by an OpenMP dialect operation representing a block-associated construct does not impose the restrictions that loop wrappers have.
Below are examples of what the MLIR representation for `parallel for` and `distribute parallel for` would look like.
```mlir
// #pragma omp parallel for
omp.parallel { // NON-WRAPPER (block-associated)
  // <Allocations inserted here...>
  %c0_i32 = arith.constant 0 : i32
  %c10_i32 = arith.constant 10 : i32
  %c1_i32 = arith.constant 1 : i32
  omp.wsloop {
    omp.loop_nest (%arg0) : i32 = (%c0_i32) to (%c10_i32) step (%c1_i32) {
      // ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
```
```mlir
// #pragma omp distribute parallel for
// <Allocations inserted here...>
%c0_i32 = arith.constant 0 : i32
%c10_i32 = arith.constant 10 : i32
%c1_i32 = arith.constant 1 : i32
omp.distribute {
  omp.parallel { // WRAPPER (loop-associated)
    omp.wsloop {
      omp.loop_nest (%arg0) : i32 = (%c0_i32) to (%c10_i32) step (%c1_i32) {
        // ...
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}
```
Problem
The problem with having the `omp.parallel` operation be able to represent both block-associated and loop-associated cases is that, as part of its behavior as a block-associated construct, it implements the `OutlineableOpenMPOpInterface`. This interface is used in various places in Flang to identify blocks where allocations can be inserted, which is only legal for `omp.parallel` if it is not taking a loop wrapper role.
In the examples above, the intended place where allocations should go is marked with a comment. However, the current behavior is to always insert these allocations inside of the `omp.parallel` region.
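To make the failure mode concrete, here is a simplified sketch, not the exact Flang code, of how an allocation-insertion utility might locate its target block through this interface (`findAllocaBlock` is an illustrative name):

```cpp
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/IR/Builders.h"

// Illustrative sketch: walk outwards from the current insertion point
// to the nearest operation implementing the outlineable interface and
// ask it where allocations belong.
static mlir::Block *findAllocaBlock(mlir::OpBuilder &builder) {
  mlir::Operation *op = builder.getInsertionBlock()->getParentOp();
  while (op) {
    if (auto iface =
            mlir::dyn_cast<mlir::omp::OutlineableOpenMPOpInterface>(op))
      // A wrapper omp.parallel still answers here, so allocations end
      // up inside its region rather than outside the loop wrappers.
      return iface.getAllocaBlock();
    op = op->getParentOp();
  }
  return nullptr;
}
```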
Potential solutions
There are different ways we could go about this, but I wanted to get some opinions first before committing to any option:
- Updating each place where the `OutlineableOpenMPOpInterface` is used to make an exception if it's an `omp.parallel` taking a loop wrapper role. This doesn't seem like a very scalable solution, but there's an implementation downstream that could be upstreamed relatively quickly (the sketch after this list shows the kind of check each use site would need).
- Updating the `OutlineableOpenMPOpInterface::getAllocaBlock()` method itself to check whether the operation is an `omp.parallel` taking a loop wrapper role, and returning a null pointer if so. Callers would need to be updated to not rely on it always returning a valid pointer, and from a design perspective it wouldn't look very good (maybe we can force all operations using this interface to define this method, or somehow override it for `omp.parallel` instead).
- Adding an `omp.parallel.wrapper` or similar operation to disambiguate between block-associated and loop-associated `parallel` constructs. It would basically be a copy of `omp.parallel`, differing only in its list of traits and its description. This may make passes and translation to LLVM IR more difficult, since there would then be two operations representing basically the same construct.
- Relaxing loop wrapper restrictions to allow other operations. It would be problematic to define what to allow and what not to allow, and how to deal with it when translating to LLVM IR, since a composite construct is intended to work as a semantic unit that implicitly incorporates some code transformations.
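The first two options share a common ingredient: a way to tell whether a given `omp.parallel` is playing the loop wrapper role. A minimal sketch of such a check, assuming the wrapper form always holds exactly one nested loop wrapper plus a terminator (the helper name is hypothetical, and a fuller version would accept any loop wrapper as the nested operation, not just `omp.wsloop`):

```cpp
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"

// Hypothetical helper: returns true if this omp.parallel acts as a
// loop wrapper, i.e. its single block contains nothing but another
// loop wrapper (omp.wsloop in the composite cases above) followed by
// the terminator.
static bool isWrapperParallel(mlir::omp::ParallelOp op) {
  mlir::Block &block = op.getRegion().front();
  if (block.getOperations().size() != 2)
    return false;
  return mlir::isa<mlir::omp::WsloopOp>(block.front());
}
```

Option 1 would consult a check like this at every use site of the interface, while option 2 would fold it into `getAllocaBlock()` and return `nullptr` in the wrapper case.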
There might also be other solutions I haven’t thought about, so I’m open to other ideas as well.