[RFC] Representing combined/composite constructs in the OpenMP dialect

Summary

We need to decide on a way to represent combined and composite OpenMP constructs in the OpenMP dialect. A couple of options are being considered and it would be helpful to get wider community feedback to set a path forward. This decision will impact the representation of OpenMP loop constructs and possibly require changes to the current MLIR representation for the SIMD and DISTRIBUTE directives, as well as how the lowering of worksharing loops is done.

Below is an introduction to the problem and a set of proposed solutions, with comments on their pros and cons based on earlier discussions in the following PRs: #67798, #79843.

The proposal here is to keep the representation of combined constructs unchanged and to choose between the following composite construct representation alternatives, each defined in a later section:

  • Wrapper operations (only omp.wsloop).
  • Wrapper operations (with omp.{ws,simd,dist}loop).
  • Composite operations.
  • Explicit outer loops.
  • Composite operations + Explicit outer loops.

Though the chosen representation will have to integrate with the representation of loop transformations, which is still under discussion, I present an example showing that these two decisions can be made independently, as long as we agree on whether to use the existing OpenMPIRBuilder support or to move that logic to a separate MLIR pass.

Background

Certain OpenMP block and loop constructs can be grouped into what are called combined and composite constructs, where each of the constituent individual constructs is called a “leaf construct”. We want to be able to represent combined and composite constructs in the OpenMP dialect in a non-redundant, scalable and standard-conforming way that works well with the existing lowering process.

Combined constructs are those in which one construct is immediately nested inside the first leaf construct, which contains no other statements. The nested construct can be a leaf, but it can also be another combined or composite construct. For example, the target teams combined construct can be decomposed as shown below:

// Combined construct.
#pragma omp target teams
{
  foo();
}

// Equivalent individual constructs.
#pragma omp target
{
  #pragma omp teams
  {
    foo();
  }
}

Composite constructs, on the other hand, differ from combined constructs in that they do not have the same semantics as nesting one of the constructs immediately inside of another. They can, for example, define specific semantics to join together multiple leaf constructs that could not otherwise be directly nested inside each other. In particular, composite constructs allow applying multiple loop-associated constructs to a single loop by introducing additional implicit loops. For example, the distribute simd composite construct can be decomposed as shown below:

// Composite construct.
#pragma omp distribute simd
for (int i = 0; i < n; i++)
  foo(i);

// Equivalent individual constructs.
// This implementation assumes trip count is divisible by block size.
#pragma omp distribute
for (int j = 0; j < n; j += blockSize) {
  #pragma omp simd
  for (int i = j; i < j + blockSize; i++)
    foo(i);
}

As of version 5.2 of the OpenMP specification, the complete list of combined and composite constructs (using ‘+’ to join the leaf constructs of a composite construct and a space to join those of a combined construct) is the following:

  • for+simd
  • distribute+simd
  • distribute+parallel+for
  • distribute+parallel+for+simd
  • taskloop+simd
  • parallel for
  • parallel loop
  • parallel sections
  • parallel workshare
  • parallel for+simd
  • parallel masked
  • masked taskloop
  • masked taskloop+simd
  • parallel masked taskloop
  • parallel masked taskloop+simd
  • teams distribute
  • teams distribute+simd
  • teams distribute+parallel+for
  • teams distribute+parallel+for+simd
  • teams loop
  • target parallel
  • target parallel for
  • target parallel for+simd
  • target parallel loop
  • target simd
  • target teams
  • target teams distribute
  • target teams distribute+simd
  • target teams loop
  • target teams distribute+parallel+for
  • target teams distribute+parallel+for+simd

As can be gathered from that list, the current set of composite constructs is relatively limited, as there are only 5 different ones: for simd, distribute simd, distribute parallel for, distribute parallel for simd and taskloop simd. However, there is ongoing discussion to significantly increase the number of supported combined and composite constructs in OpenMP 6.0 or later versions of the specification, which may extend that list in the future.

MLIR Representation

Combined Constructs

Representing combined constructs in the OpenMP MLIR dialect is currently done by nesting the set of operations that represent each of the leaf constructs. This is a natural representation, since it directly follows the definition of a combined construct, introduces no redundancy in the dialect and is scalable. If new combined constructs based on existing leaf constructs are added, no changes to the MLIR operations would be needed with this approach.

The correspondence between an example use of the target teams combined construct and the matching MLIR representation is shown below.

#pragma omp target teams <clauses...>
{
  foo();
}

omp.target <target-clauses...> {
  omp.teams <teams-clauses...> {
    llvm.call @foo() : () -> ()
    omp.terminator
  }
  omp.terminator
}

Composite Constructs

There is currently no widely accepted approach to representing composite constructs in the OpenMP MLIR dialect. One symptom of this is the lack of support for representing the for simd composite construct, even though the for and simd constructs are supported through the omp.wsloop and omp.simdloop operations. At the same time, the omp.distribute operation is designed to serve as a wrapper over a loop defined through one of these two operations, but there is no omp.distloop or similar operation to represent the case where distribute does not appear as part of a composite construct.
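To illustrate that asymmetry, here is a rough sketch using the current operations (clause lists and value definitions elided):

// Standalone 'simd' has a dedicated loop operation...
omp.simdloop for (%i) : i32 = (%lb) to (%ub) step (%step) {
  omp.yield
}

// ...whereas 'distribute' exists only as a wrapper, here around a 'simd' loop.
omp.distribute <distribute-clauses...> {
  omp.simdloop for (%i) : i32 = (%lb) to (%ub) step (%step) {
    omp.yield
  }
  omp.terminator
}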

In this section, the main alternatives discussed so far to represent composite constructs are presented, using the following source code as a reference to show how each alternative would represent the same composite directive in MLIR.

#pragma omp distribute parallel for simd <clauses>
for (int i = 0; i < n; i++)
  foo(i);

The “wrapper operations” alternative consists of having a single MLIR operation for each leaf construct that can be part of a composite construct. These operations hold information about clauses applicable to the construct and contain a single region, inside which either the next leaf construct or, for the last leaf, the actual loop or code block is nested.

// Wrapper operations.
omp.distribute <distribute-clauses...> {
  omp.parallel <parallel-clauses...> {
    omp.wsloop %i... <for-clauses...> {
      omp.simd <simd-clauses...> {
        llvm.call @foo(%i) : (i32) -> ()
        omp.terminator
      }
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}

The “composite operations” alternative introduces a new MLIR operation to represent each composite construct with its corresponding list of supported clauses.

// Composite operations.
omp.distparwssimdloop %i... <clauses> {
  llvm.call @foo(%i) : (i32) -> ()
  omp.yield
}

The “explicit outer loops” alternative consists of not representing composite constructs at all in MLIR, but rather splitting them and representing them as the equivalent set of individual constructs. This makes the implicit loops introduced by leaf constructs explicit.

// Explicit outer loops.
// Like the earlier decomposition, this assumes trip counts are divisible by the block sizes.
%n = ...
%blksz0 = ...
%blksz1 = ...
%c0 = arith.constant 0 : i32
%c1 = arith.constant 1 : i32
omp.distloop for (%k) : i32 = (%c0) to (%n) step (%blksz0) <distribute-clauses> {
  omp.parallel <parallel-clauses> {
    %end0 = arith.addi %k, %blksz0 : i32
    omp.wsloop for (%j) : i32 = (%k) to (%end0) step (%blksz1) <for-clauses> {
      %end1 = arith.addi %j, %blksz1 : i32
      omp.simdloop for (%i) : i32 = (%j) to (%end1) step (%c1) <simd-clauses> {
        llvm.call @foo(%i) : (i32) -> ()
        omp.yield
      }
      omp.yield
    }
    omp.terminator
  }
  omp.yield
}

Evaluation of Alternatives

For the representation of combined constructs, it seems clear that the current approach already meets the goals of being non-redundant, scalable and standard-conforming, so my proposal in that regard is to leave it unchanged.

Regarding the various alternatives for composite constructs, I summarize below some thoughts and concerns about each one, organized into three main categories.

Wrapper Operations

OpenMP dialect and standard-conformance:

  • This is a scalable representation that would not require new operations if new composite constructs are added to the specification, since it is based on only representing leaf constructs.
  • Composite constructs would look very similar to combined constructs in MLIR, though the way they have to be processed would be different. The MLIR representation would look like non-conforming OpenMP, with operations directly nested inside other operations they cannot really be directly nested inside of.
  • omp.wsloop’s semantics would become context-dependent rather than exclusively representing worksharing across threads (‘+’ here denotes nesting of MLIR operations):
    • omp.parallel + omp.wsloop: worksharing across threads.
    • omp.distribute + omp.wsloop: worksharing across teams.
    • omp.wsloop + omp.simd: worksharing across SIMD lanes.
    • omp.distribute + omp.parallel + omp.wsloop: worksharing across threads over multiple teams.
    • omp.distribute + omp.parallel + omp.wsloop + omp.simd: worksharing across threads and SIMD lanes over multiple teams.
    • omp.parallel + omp.wsloop + omp.simd: worksharing across threads and SIMD lanes.
    • omp.distribute + omp.wsloop + omp.simd: worksharing across teams and SIMD lanes.
  • Alternatively, rather than representing all worksharing loops using omp.wsloop and using wrappers to represent all cases as described in the previous point, omp.simdloop and omp.distloop could be kept and added, respectively, to represent #pragma omp simd and #pragma omp distribute loops, using wrapper operations only for composite constructs (see the sketch after this list).
    • It might be confusing to keep both omp.{simd,dist}loop and omp.{simd,distribute} in the dialect, and there would be significant overlap between the arguments the loop and wrapper variants would take.
    • It does not remove the need to make these loop operations context-dependent.
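As a rough sketch of that last alternative (clause lists and value definitions elided), standalone simd would keep a dedicated loop operation, while the composite for simd would use the wrapper form:

// Standalone 'simd': dedicated loop operation.
omp.simdloop for (%i) : i32 = (%lb) to (%ub) step (%step) <simd-clauses...> {
  omp.yield
}

// Composite 'for simd': wrapper form only.
omp.wsloop for (%i) : i32 = (%lb) to (%ub) step (%step) <for-clauses...> {
  omp.simd <simd-clauses...> {
    llvm.call @foo(%i) : (i32) -> ()
    omp.terminator
  }
  omp.yield
}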

MLIR creation:

  • The current approach for creating MLIR for combined constructs could be used for composite constructs as well, so not much additional work is required.

MLIR to LLVM IR translation:

  • Expressing the hierarchy of leaf constructs explicitly in MLIR would match certain already-existing LLVM IR code generation flows implemented in the OpenMPIRBuilder.
  • However, certain other flows might require sharing information gathered in nested constructs with some of their parents, for example for reductions or even host code generation for target device kernel launches.
    • This can be solved by adding more state to the OpenMPIRBuilder, which might be something to avoid if possible.
    • Also solvable by implementing custom translation functions for certain patterns of nested OpenMP MLIR operations rather than always calling a single translation function per leaf construct.

Composite Operations

OpenMP dialect and standard-conformance:

  • Semantically distinct groups of constructs are represented by their own operation, which seems like something that we want to have.
    • However, the semantics are not quite that different from those of applying each leaf construct. These operations only really represent introducing some loops, applying some transformation to the original one, and then applying each leaf construct to one of the resulting loops.
  • Introduces redundancy in the dialect, as composite operations would contain arguments for clauses applicable to each of their leaf constructs, for which we would have standalone operations too.
    • Maybe it is possible to refactor the tablegen definitions of certain argument lists, etc. to allow some reuse across different operations.
  • This representation is not scalable with regards to the number of different composite constructs in the specification.
    • Whether this should be a concern entirely depends on how many of the potential additions that may come in OpenMP 6.0 or later are expected to be for composite constructs specifically.
    • clang/flang already list each combined/composite construct individually with their applicable clauses (see here), so this MLIR representation would not be the only part of the compiler where adding new ones is cumbersome.

MLIR creation:

  • Trivial construction of the MLIR representation, since there would be a 1-to-1 correspondence between source constructs and MLIR operations.

MLIR to LLVM IR translation:

  • Some refactoring would be needed to allow translation functions for the composite operations to share code with translation functions for each leaf operation.
    • This might be made easier by introducing an MLIR interface for each leaf operation that can appear on a composite operation.
  • It should generally give flexibility as to how to translate each composite leaf construct vs. standalone constructs, as well as how to pass information shared between leaf constructs.

Explicit Outer Loops

OpenMP dialect and standard-conformance:

  • This representation mirrors what is done for combined constructs, since it expands composite constructs into exactly what they represent.
  • It results in producing the same MLIR representation for semantically equivalent source code regardless of whether composite constructs are used, so composite constructs would be treated as syntactic sugar and not result in different binaries.
    • Although it would be possible, this would be more difficult to achieve starting from different MLIR representations.
  • This approach would most likely prevent the use of the OpenMPIRBuilder for loop transformations, since they need to be applied before the new worksharing loops are introduced.

MLIR creation:

  • More complicated construction of the MLIR representation, as it also needs to represent blocking/scheduling according to the constructs and clauses specified, while handling non-divisible trip counts.
    • Alternatively, it could be left to the compiler frontend to split these constructs on its own (i.e. make parse tree rewrites) before creating the MLIR representation. This might be unnecessarily difficult to do.
  • It is possible to combine this approach with the “composite operations” approach to simplify initial MLIR creation (see the sketch after this list). There would be two levels of abstraction and an MLIR pass to lower composite operations into a set of loops implementing explicit blocking and applying each leaf operation independently.
    • Creating an “hlomp” dialect might be overkill for this purpose, but the alternative of having high-level and low-level versions of these operations which cannot be mixed complicates the dialect.
  • It would be difficult to implement both loop transformations and explicit outer loops directly when creating the MLIR representation.
    • Mixing it with the “composite operations” approach enables the application of loop transformations in MLIR before introducing the new worksharing loops.
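As a rough sketch of that two-level idea (hypothetical omp.distsimdloop name; clauses and value definitions elided):

// Before the lowering pass: a single high-level composite operation.
omp.distsimdloop for (%i) : i32 = (%c0) to (%n) step (%c1) <clauses...> {
  llvm.call @foo(%i) : (i32) -> ()
  omp.yield
}
// After the pass: the explicit blocked loops from the “explicit outer loops”
// example above, with any loop transformations applied in between.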

MLIR to LLVM IR translation:

  • Translation to LLVM IR might become significantly more complicated in some cases, because any composite construct for which we need to target specific runtime calls would require recognizing certain MLIR patterns including multiple nested operations and chunk calculations.
  • However, when the expected LLVM IR matches exactly the hierarchical structure of the source code after splitting composite constructs, this approach could be the simplest to implement.
  • Sharing lowering information between leaf constructs would be more difficult following this approach, since they apply to different loops, and it would most likely require adding more state to the OpenMPIRBuilder.

Interaction with Loop Transformations

The OpenMP specification also defines some loop transformation constructs (e.g. tile, unroll) that replace a given loop nest with a structured block that might itself be another loop nest. It is expected that OpenMP 6.0 will introduce some more of these, so extensible support for them in the OpenMP MLIR dialect will be needed.

Currently, the existing loop operations (omp.wsloop and omp.simdloop) represent the loop nest themselves, including induction variables, ranges and steps. This makes it difficult to represent arbitrary loop transformations independently of how the loop might later be executed, so there has been discussion of introducing an omp.canonical_loop operation to separate the definition of the loop nest, the transformations applied to it and its execution. The discussion on how this operation might be represented is still ongoing, and it spans multiple threads: D155765, #65380, #67720.

One important feature of loop transformation constructs is that they must be processed before any loop constructs are applied, so that the latter refer to the result of the loop transformations and not to the original loop nest. This is where the decision on how to represent loop transformations can interact with how loop constructs in general, and combined/composite loop constructs in particular, are represented. Both representations should be compatible and not introduce unreasonable difficulties.

Some loop transformations are already implemented in the OpenMPIRBuilder, which is also the preferred place for OpenMP LLVM IR codegen to reside, as it can be shared between clang and flang. So, in principle, any approach to representing OpenMP loop transformations in MLIR should be able to target this existing support.

The alternative would be to implement loop transformations entirely within MLIR, which would introduce a point of divergence between clang and flang, as well as require more development work. Following this path would mean that any alternative for representing composite constructs that requires handling loop transformations in MLIR would need agreement from the community to move this logic out of the OpenMPIRBuilder.

Assuming that, at least initially, we would prefer to rely on the OpenMPIRBuilder, we can narrow down the available options to the “wrapper operations” and “composite operations” alternatives, because these are the ones that allow delaying loop transformations to LLVM IR rather than doing them in MLIR. If we also assume that we define omp.canonical_loop as a kind of lambda function that returns an MLIR value representing an OpenMPIRBuilder::CanonicalLoopInfo object, both options would be able to convey the semantics of composite constructs while integrating seamlessly with loop transformation operations. An example of how that could potentially look is shown below:

#pragma omp target teams distribute parallel for
#pragma omp tile sizes(2)
for (int i = 0; i < n; i++)
  foo(i);

// Define loop body (%i would be an entry block argument to represent the
// induction variable).
%loop = omp.canonical_loop (%i) : i32 context(...) {
  // Use additional block args to access context values defined in the execution
  // site...
  llvm.call @foo(%i) : (i32) -> ()
  omp.terminator
} (...) -> !omp.cli

// Loop transformations.
%tloop = omp.tile %loop { sizes=[2] } : (!omp.cli) -> (!omp.cli)

// OPTION 1: Run loop with wrapper operations.
omp.target {
  %n = ... : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  omp.teams {
    omp.distribute {
      omp.parallel {
        // Execute loop, specifying pre-transformations loop range.
        omp.wsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.cli, i32, i32, i32, ...) -> ()
        omp.terminator
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}

// OPTION 2: Run loop with composite operations.
omp.target {
  %n = ... : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  omp.teams {
    // Execute loop, specifying pre-transformations loop range.
    omp.distparwsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.cli, i32, i32, i32, ...) -> ()
    omp.terminator
  }
  omp.terminator
}

The example above shows that, after deciding on where loop transformations should be implemented (MLIR or LLVM IR), the representation of loop transformations and that of loop constructs can be decided orthogonally. So my proposal is to:

  1. Decide whether it would be justified to implement loop transformations as an MLIR pass.
  2. Focus on what the best representation for composite and combined constructs would be prior to the introduction of omp.canonical_loop, regardless of how loop transformations eventually get represented. These should be separate discussions, and adapting what we come up with here to whatever eventually gets decided in that other discussion should not take significant effort.

Thanks for the well-described RFC. I didn’t have time to go through all of it yet, but just a simple comment about leaving the current representation unchanged.

I think it will still be valuable to carry the information in the IR that this was a combined construct to begin with, maybe with an attribute or something like that. It would be useful if you plan to do some of the processing in MLIR passes rather than during lowering from the different frontends.

I prefer the wrapper approach, as it appears to be more scalable and future-proof compared to composite operations. With the release of OpenMP 6.0 coming soon, new combined and composite constructs could be introduced, so we can defer the discussion on representing composite constructs, including additional attributes, until the standard is released and we know the final set. Until then, we can implement lowering from MLIR to LLVM IR and draw conclusions about the important points for discussion before making a final decision.

Thanks @skatrak for the detailed RFC and explanation and further info and tables in the tickets.

The wrapper approach probably fits in with the infrastructure that we have currently. It also fits well with what @kparzysz is implementing for splitting the clauses among the constructs.
My only concern is that we have already created a wsloop construct that has pulled the loop control flow into it. This will cause difficulties, like requiring a simd operation that encloses the body of the wsloop but is nested inside it.

The alternatives for this are to deploy the canonical loop approach and have all the OpenMP operations as wrapper operations:

omp.wsloop {
  omp.simd {
    omp.canonical_loop i=1, 100 {
    }
  }
}

or to have simd as an attribute:

omp.wsloop for i=1,100 simd {
}

Did the lowering to LLVM IR work reasonably well for you for the composite constructs with simd?

We can retain composite and combined information either using an attribute, like @clementval suggested, or by doing something like the following to maintain chains that can be traversed in both directions.

omp.distribute leaf(%t1) {
bb0 (%t1 : !omp.token):
  omp.parallel leaf(%t2), construct(%t1 : !omp.token) {
  bb0 (%t2 : !omp.token):
    omp.wsloop construct(%t2 : !omp.token) for i=1,100 {
    }
  }
}

Given that composite and combined information can be retained by using attributes/operands I guess it is fine to split and represent as wrapper operations.

I’m also in favor of wrapper ops, since this keeps the number of ops lower and is more modular. For lowering of composite constructs, I think matching a set of ops and calling a single lowering function is probably preferable, since that will avoid adding more state to the OpenMPIRBuilder and keep the dependencies within a single function.

Thank you @clementval, @DominikAdamski, @kiranchandramohan and @jansjodi for the comments and suggestions. I think we’re leaning towards the “wrapper operations” option at the moment, so it looks like it would be good to nail down the details of that approach, since there are some decisions to be made still. I’ll try to summarize here some of the things that we talked about in Monday’s community call as well.

Retaining combined/composite information

With regards to adding attributes, block arguments or any other kind of explicit information to the MLIR ops to mark them as part of a combined/composite construct, I’m not sure whether that is necessary. I understand that we might want to process certain combined/composite constructs differently and this information would be useful to easily detect them. However, the following two snippets, for example, represent semantically the same code:

// Combined construct.
#pragma omp parallel for
for ...

// Equivalent individual constructs.
#pragma omp parallel
{
  #pragma omp for
  for ...
}

If we relied on an explicit attribute to tell us this is a combined parallel for construct, the second case wouldn’t be marked as such unless we did some work in the frontend to combine the constructs beforehand. In my opinion, we’d want a solution that represents such cases as closely as possible, so that they follow the same codegen path.
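For example, with that approach both snippets above could map to the same nest of operations, making the equivalence structural rather than relying on an attribute (a rough sketch; clauses and value definitions elided):

omp.parallel {
  omp.wsloop for (%i) : i32 = (%lb) to (%ub) step (%step) {
    omp.yield
  }
  omp.terminator
}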

I think the “wrapper operations” representation already holds all the information that’s needed to tell whether a given nest of operations could be represented as combined/composite constructs. We can check the region of a given operation and know whether it could have been represented by a combined construct by looking at the operations inside. Likewise, if we wanted to do some kind of bottom-up lowering, we could look at the parent operation and siblings to tell if this is a combined/composite operation. Composite constructs would be easy to identify because they would result in operation nests that couldn’t come from legal OpenMP code otherwise.

One way to make such pattern-checking less cumbersome would be to make it a part of each applicable operation’s extra class declarations, so it would be as simple to use as checking some explicit attribute. Eventually, it seems plausible that we could be able to extract this pattern-checking into some operation interfaces/traits to generalize it and simplify reuse if new combined/composite constructs are added.

Wrapper operations alternatives

There are a few ways we can use the “wrapper operations” approach to represent composite constructs, which I tried to summarize in the original post. After thinking about it and hearing others’ opinions, I guess at the moment we have two main alternatives to do this. Either way, I propose we don’t try to solve the loop transformation problem in the same discussion. We can take the current loop operations (omp.{ws,simd}loop) and make some changes to represent all combined and composite constructs, while later focusing on making the necessary changes to the representation of the loops themselves (excluding worksharing, parallelism-generation information, …) to accept transformations.

I’ll use the following constructs to illustrate both alternatives:

#pragma omp distribute parallel for simd
for (i = lb; i < ub; i += step) {}

#pragma omp distribute simd
for (i = lb; i < ub; i += step) {}

Canonical loop solution

This proposal consists of simply extracting the loop range, step and induction variable definition into its own operation. This simplified canonical loop wouldn’t be the final representation intended to work with loop transformations, but just a way to avoid having multiple MLIR operations each holding loop information.

omp.distribute {
  omp.parallel {
    omp.wsloop {
      omp.simd {
        omp.canonical_loop (%i) in (%lb) to (%ub) step (%step) {
          omp.yield
        }
        omp.terminator
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}

omp.distribute {
  omp.simd {
    omp.canonical_loop (%i) in (%lb) to (%ub) step (%step) {
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}

The steps needed would be to add the simplified omp.canonical_loop operation and to turn the existing omp.simdloop and omp.wsloop operations into wrapper-style operations by removing their information about the loop itself (ranges, induction variables and steps).

Having the loop information already extracted out of omp.wsloop into its own operation seems like a good step towards eventually getting this to work with loop transformations at the expense of more work in the short-term.

Non-canonical loop solution

The other main alternative would be to have specific operations that contain the loop information (ranges, induction variables and steps) for those that can appear on their own and also have wrapper-style variants for those that can appear in a composite construct as anything but the last leaf construct (i.e. for, distribute, parallel, taskloop).

omp.distribute {
  omp.parallel {
    omp.ws {
      omp.simdloop (%i) in (%lb) to (%ub) step (%step) {
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}

omp.distribute {
  omp.simdloop (%i) in (%lb) to (%ub) step (%step) {
    omp.yield
  }
  omp.terminator
}

In this case, we would need to add the omp.ws, “omp.taskloopwrapper” (hopefully with a much better name than that) and omp.simd wrappers, as well as the omp.distloop loop operation. In the original proposal, no omp.ws or “omp.taskloopwrapper” operations were added, but without them it’s quite unnatural to represent composite simd constructs, because the options in that case are:

// Rather than a regular wrapper op over a loop, it wraps the
// loop body because the parent operation is the one defining
// the loop.
omp.{ws,task}loop (%i) ... {
  omp.simd {
    omp.terminator
  }
  omp.yield
}

// The wrapper op here is expressing some behavior that is
// conceptually applied after its child op.
omp.simd {
  omp.{ws,task}loop (%i) ... {
    omp.yield
  }
  omp.terminator
}

The advantage of the non-canonical loop alternative is that it possibly introduces fewer changes, so it should be simpler to get incorporated. However, it also introduces some redundancy into the dialect, since there would be a couple of variants of some operations sharing most of their attributes and arguments, as well as yet another operation that holds loop information.

In my opinion, it might be time to introduce a simplified omp.canonical_loop as a starting point and build the “wrapper operations” on top of it, since it seems to introduce the fewest compromises.

With regards to @kiranchandramohan’s suggestion to perhaps make simd an attribute of omp.wsloop, I think there are some problems with that. Mainly, we’d then have to add SIMD-specific attributes and arguments to omp.wsloop, and we’d have to do this not just for that operation but also for omp.taskloop. It doesn’t seem to scale very well and it would be a kind of exception that wouldn’t follow the “wrapper operations” design.

Did the lowering to LLVM IR work reasonably well for you for the composite constructs with simd?

As far as I know, we haven’t gotten there, since that wouldn’t benefit us for GPU codegen at the moment. The idea we had with PR #79843 was to simply allow representing do simd in MLIR differently from simd, so that we could treat do simd as do (simd length=1) on the GPU, while still allowing its clauses to be stored somewhere in case more sophisticated lowering is needed later for that case or for another target. At the moment, do simd is just lowered as omp.wsloop and the SIMD information is discarded.

Thank you @skatrak for exploring all the various alternatives. I agree that we should probably use omp.canonical_loop, since that is something we are going to need in the future. I think the current omp.canonical_loop PR can be used, since the CLIs are optional; we could even remove them and add them back later in case the way we want to represent CLIs changes. Keeping things smaller and simpler, but still moving in the right direction, is a good thing imo. Also, since we can infer the information about combined/composite constructs from the structure of the wrapper ops, we don’t need to add anything; we would initially only need this for the codegen lowering afaik. I am working on prototyping this matching to get a better sense of what the lowering will look like when matching and lowering multiple operations in a single function (using existing functions as helpers).


+1. I agree that this seems to be the best solution for us. Although it will bring some short-term pain, it is more future-proof and will also help us implement the OpenMP loop transformation operations.

Thanks for all the investigation and detailed writeups.


I am not fully on board with this part. There is the question of how much the dialect should match what is specified in the standard, and also of adding verifiers to ensure that what is not permitted in the standard is not representable in the dialect. Carrying additional info about composite constructs will ensure that we can have wrapper ops only where they are part of a composite construct.

It is also important for a user of the dialect to clearly understand the representation of a combined or composite construct. If we carry information in the dialect to mark operations as part of composite constructs, then this is clear to the user.

When pattern matching is done for lowering or transformations, maybe the additional info can be ignored.


I see, would you suggest adding this information for both combined and composite constructs or only for composite ones? It seems reasonable to make sure composite constructs are explicitly linked in MLIR, since their leaf constructs could only have been specified together, but I’d still suggest not doing the same for combined ones, as they can be split by the user with no changes to semantics. Using a similar syntax to what you previously shared and a teams distribute parallel for construct as reference, it would look like this:

// No leaf(...) because it's a combined construct
omp.teams {
  // 'distribute parallel for' is composite
  omp.distribute leaf(%t1) {
    omp.parallel leaf(%t2) composite(%t1) {
      omp.wsloop composite(%t2) {
        omp.canonical_loop %i... { }
      }
    }
  }
}

I think we should definitely have a representation for composite. Since combined is legal in both ways, we can maybe skip it for now and add it later if required.

This representation is fine. You can also start with attributes to keep it simple if you do not anticipate an immediate use.

This thread tends strongly towards a wrapper solution, but let me give some arguments in favor of composite operations. By composite operation, I mean something like:

omp.distribute_parallel_ws_loop (%i) in (%lb) to (%ub) step (%step) <clauses...> {
  ...
}

This is not discussing the “combined” part of constructs, which is more obviously split, i.e.

omp.parallel {
  omp.wsloop(%i) in (%lb) to (%ub) step (%step) <clauses...> {
    ...
  }
}

should remain as-is. So we only have to care about operations that apply to loop nests.

  1. First the downsides: It is less flexible (allows only pre-determined combinations), requires an explicit implementation/lowering of each combination, does not allow loop transformations/non-perfectly nested loops/non-rectangular loops, and passes are difficult to make generic over all operations that contain e.g. a worksharing loop. However, I think these are acceptable for a temporary solution, e.g. we don’t have MLIR-specific passes optimizing these directives yet.

  2. We can still have building blocks to implement each composite construct: A function lowering the distribute part to be called from each composite operation lowering function. Lower bound, upper bound, step etc. would be members of a base class for composite operations.

  3. omp.wsloop is basically already modeled this way: A composite of a canonical loop nest and a worksharing-loop directive.

  4. This is how Clang already models composite directives for the AST and CodeGen; we can directly clone its behavior.

  5. Deciding which clause goes onto which wrapper operation is non-trivial. We don’t have to with composite operations.

  6. The listing of possible modeling in [RFC] Representing combined/composite constructs in the OpenMP dialect - #6 by skatrak shows that a normalized modeling using wrappers is not so obvious.

  7. It offers more control over which combinations of leaf directives we allow. A pass transforming a (wrapper) region might transform it into something that cannot be nested inside the wrapper operation that it is nested in. For the goal of orthogonality, passes should generally not need to consider what outer wrappers they are nested in. IMHO leaf(%t2) composite(%t1) adds too much unnecessary complexity for a temporary solution.

  8. I don’t think we save any work transitioning from a wrapper solution to an omp.canonical_loop solution. If the lowering itself is implemented in its own functions (as outlined in 1.), then the difference is where the arguments come from, which will be different in any case. IMHO we should not reuse any operation names for the permanent solution: that would otherwise contribute to confusion, the temporary and permanent operations should not be combined (can an omp.tile be nested inside a wrapper operation?), and reuse would later make it harder to remove the temporary solution.

I think the main thing about the wrapper ops is composability and the potential for non-pure nesting of the operations, in case some ops need to be there in the future (CLI handling, instrumentation or similar), which wouldn’t be possible with a single composite operation. Krzysztof is working on a solution for point 4, which I think is probably the main issue with the nesting approach. Having a single function that lowers the code (which uses shared utility functions) can still be done by matching a nest. As for marking something as a combined op, I think just having a ‘combined’ attribute on an op, to indicate that it is combined with the parent op, would be enough.

Thank you Michael for adding your concerns to the discussion. In general, at this point I already lean towards going with the wrapper approach, as I’ve already started work on this, so I might be a bit biased at this point. For a different approach to be taken, I think it should have clear advantages. I’ll try to share my opinion on the points you’ve raised below.

  1. First the downsides: It is less flexible (allows only pre-determined combinations), requires an explicit implementation/lowering of each combination, does not allow loop transformations/non-perfectly nested loops/non-rectangular loops, and passes are difficult to make generic over all operations that contain e.g. a worksharing loop. However, I think these are acceptable for a temporary solution, e.g. we don’t have MLIR-specific passes optimizing these directives yet.

The main issue I have with this point is that you see the composite ops approach as a temporary solution before doing it differently later on. As far as I understand it, our goal with this discussion was to agree on a longer-term path. In the future we might learn that it has some issues that force us to do something else, but the wrapper approach was chosen as a long-term solution.

  2. We can still have building blocks to implement each composite construct: A function lowering the distribute part to be called from each composite operation lowering function. Lower bound, upper bound, step etc. would be members of a base class for composite operations.

Yes, lowering from MLIR to LLVM IR wouldn’t be prevented from reusing functions for each leaf construct. This wouldn’t be much different for the wrapper ops because, as @jansjodi mentioned, the plan for the wrapper ops approach would also be to match the composite construct first and then decide how to lower it based on the leaf constructs present.

Where I see more of a problem in this regard is in the definition of the MLIR dialect. The problem is that composite operations would take the union of the lists of arguments expected for each leaf construct, while there would also be operations for each leaf construct on its own. That’s quite a bit of redundancy, even if we don’t expect the number of composite constructs to grow much.
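As a rough illustration (hypothetical omp.wssimdloop name; clause lists abbreviated):

// Leaf operations each define their own clause arguments...
omp.wsloop <schedule, ordered, nowait, ...> { }
omp.simd <simdlen, safelen, aligned, ...> { }

// ...while a composite operation would duplicate the union of both lists.
omp.wssimdloop <schedule, ordered, nowait, simdlen, safelen, aligned, ...> { }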

  3. omp.wsloop is basically already modeled this way: A composite of a canonical loop nest and a worksharing-loop directive.

Yes, and I think this lack of separation of responsibilities is already something to address. To me, that is one of the reasons to introduce the omp.canonical_loop operation. Currently, omp.wsloop, omp.simdloop and omp.taskloop all define the loop as well as the worksharing, and omp.distribute would have to as well if we went with the composite ops approach.

  4. This is how Clang already models composite directives for the AST and CodeGen; we can directly clone its behavior.

I think this is related to my response to point 2. If we match the composite construct based on the set of wrappers present at the MLIR to LLVM IR translation stage, we are effectively able to match Clang if needed. That is, we could have convertOmpDistributeParallelWsLoop-style translation functions.

  5. Deciding which clause goes onto which wrapper operation is non-trivial. We don’t have to with composite operations.

I agree that matching clauses to leaf constructs is not trivial. As @jansjodi said, @kparzysz is already working on an approach to do this prior to MLIR creation. But even without that, creating composite operations wouldn’t prevent us from having to deal with this problem. It would just be delayed to the MLIR to LLVM IR translation rather than doing it during the initial MLIR creation.

  6. The listing of possible modeling in [RFC] Representing combined/composite constructs in the OpenMP dialect - #6 by skatrak shows that a normalized modeling using wrappers is not so obvious.

Well, the only discussion at that point was really whether to keep combining loop range information into each loop construct or to just have pure wrappers and a single loop definition. In this case, we agreed that it was best to separate loop information into something that should eventually be replaced by the full omp.canonical_loop, which for the time being could just mirror the way loops are defined in the existing loop operations.

  7. It offers more control over which combinations of leaf directives we allow. A pass transforming a (wrapper) region might transform it into something that cannot be nested inside the wrapper operation that it is nested in. For the goal of orthogonality, passes should generally not need to consider what outer wrappers they are nested in. IMHO leaf(%t2) composite(%t1) adds too much unnecessary complexity for a temporary solution.

That’s right, in the wrapper ops approach we must make sure that no transformation changes a wrapper without taking into account any other wrappers applied to the same loop (i.e. composite constructs must be processed as a whole). The main point is that wrappers do indeed make an operation’s meaning context-dependent: having a loop inside omp.wsloop is not the same as having it inside omp.wsloop and omp.simd. However, this only means that the set of wrappers attached to a loop must be treated by passes and lowering as a single entity. I don’t expect generic MLIR passes to do anything to break this, and OpenMP passes will know how to deal with them.

The leaf(%t2) composite(%t1) scheme does seem slightly overkill, given that we can just check parent and child ops to figure out the full set of wrappers. In my current work, I just went with a “composite” UnitAttr instead, which conveys the needed information. Again, I don’t see this as a temporary solution, but as one aspect of the expected solution going forward that still needs to be ironed out.
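For reference, a rough sketch of how that could look (hypothetical attribute spelling and placement; clauses elided), tagging every operation taking part in the composite construct:

omp.distribute composite {
  omp.parallel composite {
    omp.wsloop composite {
      omp.canonical_loop %i... { }
    }
  }
}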

  8. I don’t think we save any work transitioning from a wrapper solution to an omp.canonical_loop solution. If the lowering itself is implemented in its own functions (as outlined in 1.), then the difference is where the arguments come from, which will be different in any case. IMHO we should not reuse any operation names for the permanent solution: that would otherwise contribute to confusion, the temporary and permanent operations should not be combined (can an omp.tile be nested inside a wrapper operation?), and reuse would later make it harder to remove the temporary solution.

I think the main problem with this point is that I don’t see wrapper ops as a temporary step before switching to an omp.canonical_loop solution. Both things should be compatible and independent to the extent possible. The canonical loop is for representing loops, supporting transformations, etc., whereas wrappers are intended to represent loop worksharing constructs that apply to a loop, regardless of transformations done to it in advance and of how these end up being represented. So wrappers wouldn’t save much work in implementing omp.canonical_loop (except perhaps that, as part of this, we’re reducing the number of ops holding loop information to be modified/replaced to a single one), because they deal with a different problem.

About the naming, I’m not against renaming the loop operation to omp.rectangular_loop, omp.collapsed_loops or something else to avoid problems in the process of replacing it with a full omp.canonical_loop. Maybe someone has a better proposal for the name of this placeholder loop operation. A set of wrappers must contain a single loop nest, though perhaps it doesn’t have to be in canonical form; I’m not sure about this.


Currently, I’m not convinced that we should stop the ongoing work to support the wrapper ops approach and reconsider the composite ops approach, but please do let me know if my comments didn’t address the issues you have with this solution.

@skatrak, @jansjodi and I had a call about this topic. @skatrak thinks he can cope with all the concerns that I brought up, and I have no further objections.

I’d still ask to use a different name than canonical_loop, which represents only a single loop in [OpenMP][MLIR] Add omp.canonical_loop operation, !omp.cli type, omp.new_cli operation by jsjodin · Pull Request #71712 · llvm/llvm-project (github.com) and uses !omp.cli as a handle. Here, it represents an entire loop nest and has no handle, but relies on nesting in wrappers. I think they are conceptually different enough to warrant different operations to avoid conflicts. Name suggestion: omp.(rectangular/canonical_)loop_nest, to signify that it represents the entire loop nest.
