[RFC] Disambiguation between loop and block associated omp::ParallelOp

Background

The OpenMP specification defines a set of standalone constructs which can be grouped together into combined and composite constructs. Combined constructs are those that can be replaced by splitting the construct into its first “leaf” and the rest of the construct, and composite constructs are those which would require further code transformations to be split.

As a very simple overview, some OpenMP constructs are block-associated (i.e. they can be applied to a block of code), while others are loop-associated (i.e. they can only be applied to a loop). Currently, all composite constructs are loop-associated.

The parallel construct is peculiar in that it generally works as a block-associated construct when used standalone or in a combined construct, but it can also be a leaf in a loop-associated composite construct. That is the case for distribute parallel do and distribute parallel do simd. Below is an example of each.

// Block-associated (equivalent to the combined 'parallel for' construct).
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < 10; ++i) {
    // ...
  }
}

// Loop-associated.
#pragma omp distribute parallel for
for (int i = 0; i < 10; ++i) {
  // ...
}

MLIR Representation

Loop-associated constructs in the OpenMP dialect in MLIR are currently represented as loop wrappers. These are operations that define a single region in which only another loop wrapper or an omp.loop_nest operation, plus a terminator, are allowed.

This representation enables us to apply multiple loop wrappers to a single loop nest, which is how composite operations can currently be represented in a scalable manner.

On the other hand, the region defined by an OpenMP dialect operation representing a block-associated construct is not subject to the restrictions that loop wrapper regions have.

Below are examples of what the MLIR representation for parallel for and distribute parallel for would look like.

// #pragma omp parallel for
omp.parallel { // NON-WRAPPER (block-associated)
  // <Allocations inserted here...>
  %c0_i32 = arith.constant 0 : i32
  %c10_i32 = arith.constant 10 : i32
  %c1_i32 = arith.constant 1 : i32
  omp.wsloop {
    omp.loop_nest (%arg0) : i32 = (%c0_i32) to (%c10_i32) step (%c1_i32) {
      // ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}

// #pragma omp distribute parallel for
// <Allocations inserted here...>
%c0_i32 = arith.constant 0 : i32
%c10_i32 = arith.constant 10 : i32
%c1_i32 = arith.constant 1 : i32
omp.distribute {
  omp.parallel { // WRAPPER (loop-associated)
    omp.wsloop {
      omp.loop_nest (%arg0) : i32 = (%c0_i32) to (%c10_i32) step (%c1_i32) {
        // ...
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}

Problem

The problem with allowing the omp.parallel operation to represent both block-associated and loop-associated cases is that, as part of its behavior as a block-associated construct, it implements the OutlineableOpenMPOpInterface. This interface is used in various places in Flang to identify blocks where allocations can be inserted, which is only legal for omp.parallel when it is not taking a loop wrapper role.

In the examples above, the intended place where allocations should go is marked with a comment. However, the current behavior would be to always insert these allocations inside of the omp.parallel region.
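
To make this more concrete, the lookup pattern enabled by the interface looks roughly like the following sketch. It is simplified and not the actual Flang helpers; only OutlineableOpenMPOpInterface and its getAllocaBlock() method are taken from the dialect, and it is assumed to live in the mlir::omp namespace.

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/IR/Operation.h"

// Walk up from 'op' to the closest outlineable OpenMP region and return the
// block where temporary allocas would be created. If that region is an
// omp.parallel acting as a loop wrapper, this still returns its entry block,
// which is exactly the problem described above.
static mlir::Block *findAllocaBlock(mlir::Operation *op) {
  for (; op; op = op->getParentOp())
    if (auto iface = llvm::dyn_cast<mlir::omp::OutlineableOpenMPOpInterface>(op))
      return iface.getAllocaBlock();
  return nullptr;
}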

Potential solutions

There are different ways we could go about this, but I wanted to get some opinions first before committing to any option:

  • Updating each place where the OutlineableOpenMPOpInterface is used to make an exception when it’s an omp.parallel taking a loop wrapper role (a rough sketch of the kind of check each such place would need is shown after this list). It doesn’t seem like a very scalable solution, but there’s a downstream implementation that could be upstreamed relatively quickly.
  • Updating the OutlineableOpenMPOpInterface::getAllocaBlock() method itself to check whether the operation is an omp.parallel taking a loop wrapper role, and returning a null pointer if so. Callers would need to be updated to not rely on it always returning a valid pointer, and from a design perspective it would not be very clean (maybe we could instead force all operations using this interface to define this method, or somehow override it for omp.parallel).
  • Adding an omp.parallel.wrapper or similar operation to disambiguate between block-associated and loop-associated parallel constructs. It would basically be a copy of omp.parallel, differing only in its list of traits and its description. This may make passes and translation to LLVM IR more difficult, since there would then be two operations representing essentially the same construct.
  • Relaxing loop wrapper restrictions to allow other operations. It would be problematic to define what to allow and what not to, and how to deal with it when translating to LLVM IR, since a composite construct is intended to work as a semantic unit that implicitly incorporates some code transformations.
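
For reference, the kind of check that the first two options would need looks roughly like this. It is only a sketch: the C++ op class names (e.g. WsloopOp, DistributeOp) and the exact region accessors are assumed rather than confirmed.

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "llvm/ADT/STLExtras.h"

// Heuristically detect an omp.parallel taking a loop wrapper role: it sits
// directly inside another loop wrapper (omp.distribute) and its region holds
// nothing but a single nested wrapper (omp.wsloop) plus a terminator.
static bool isWrapperParallel(mlir::omp::ParallelOp op) {
  mlir::Operation *parent = op->getParentOp();
  if (!parent || !llvm::isa<mlir::omp::DistributeOp>(parent))
    return false;
  mlir::Region &region = op->getRegion(0);
  if (region.empty())
    return false;
  mlir::Block &body = region.front();
  return llvm::hasSingleElement(body.without_terminator()) &&
         llvm::isa<mlir::omp::WsloopOp>(body.front());
}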

There might also be other solutions I haven’t thought about, so I’m open to other ideas as well.

I think we should lower distribute parallel xyz as parallel (distribute (xyz ...)), and have the distribute op bind to all encountering threads.

I may be misunderstanding your suggestion, but is it to lower distribute parallel do/for simd into this?

omp.parallel {
  omp.distribute {
    omp.wsloop {
      omp.simd {
        omp.loop_nest ... {
          ...
        }
      }
    }
  }
}

If that’s the case, I’m not too sure about it. It would mean basically creating a representation that doesn’t match the way these constructs work with each other in the spec.

The goal with loop wrappers was basically to be able to attach clause information related to each leaf of a composite construct to its corresponding operation while closely following the order in which they are applied as per the spec, making it simpler to translate to LLVM IR later.

Making this change would mean omp.parallel is no longer a loop wrapper, which addresses the issue raised by this RFC. However, if it contains omp.distribute + omp.wsloop inside (potentially located after an arbitrary set of other operations), then its contents should be lowered as if nested inside omp.distribute; otherwise, it should be treated as the block-associated variant. I don’t know if making an exception that breaks the representation in this way is worth it just to avoid having to disambiguate between loop-associated and block-associated omp.parallel.

IIUC, the semantics of distribute parallel are that the parallel creates a team for each thread that encountered the distribute parallel, and then the distribute part of it assigns some task to each of these threads. In essence, the creation of extra threads precedes the work distribution, hence my suggestion.

Is ParallelOp the only op with this issue? Maybe we can get rid of it, and replace it with an attribute “creates-more-threads”. We could come up with a representation of standalone parallel, e.g. NopOp + attribute…

My aim would be at composability. We can represent the OpenMP operations with ops that perform specific actions, e.g. “create team”, “create tasks”, “distribute work”, etc. It would be a big change from what we’re trying to do now, but maybe it’s better in the long term, since we wouldn’t be tied to the set of ops defined in the OpenMP standard.

Edit: Maybe in the short term we can replace the trait you mentioned above with an attribute? This way the ParallelOp could have it or not, depending on the context. Maybe that could be a viable permanent solution…?

The teams construct creates a set (“league”, per the spec) of “initial” teams composed of one thread, and then the distribute construct workshares chunks of the associated loop across these teams. Similarly, the parallel construct creates a team of multiple threads and the for/do construct workshares iterations across these threads. So I believe we can’t really reorder leaf constructs in distribute parallel for/do [simd] if we want the MLIR representation to be consistent with the spec (which also makes it easier to translate to LLVM IR later), because the “create teams - distribute across teams - create threads - workshare across threads” sequence needs to be maintained.

Yes, this is only an issue for omp.parallel since it’s the one construct that can be attached to a block or a loop (only as part of a composite construct). I guess it’s possible to abstract the representation more than the spec calls for, but given that it is the OpenMP dialect I wonder if that would make things more difficult than necessary.

Maybe I’m not completely seeing your suggestion, but it seems like this is what we already have by following the constructs in the spec: omp.teams = create league of teams; omp.parallel = create team of threads; omp.distribute = workshare across teams; omp.wsloop = workshare across threads. It does seem that what you’re proposing here is to create another higher level dialect that can be lowered to OpenMP (otherwise, it seems odd trying to make the OpenMP dialect not be defined in terms of OpenMP-specific concepts). In any case, I feel like such a fundamental redesign wouldn’t really be needed to address this issue. Maybe still something worth discussing separately if you see it can address some shortcomings of the current approach.

I think that’s one possible solution. Maybe the interface doesn’t say “this operation 100% for sure will be outlined and can hold allocas”, but rather this could be overridden by a discardable attribute.

It’s still not implemented, but when consensus was reached on following the wrapper approach to represent OpenMP loops, we decided we would attach an omp.composite unit attribute to each loop wrapper. Maybe that’s the attribute we could check in the OutlineableOpenMPOpInterface, since it would only be present on the loop-associated omp.parallel operation.
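
As a rough illustration of that idea, assuming the attribute ends up as a discardable unit attribute spelled “omp.composite” (neither the spelling nor the place where the check lives is final):

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"

// Treat 'op' as a valid alloca holder only if it is outlineable and is not an
// omp.parallel flagged as being part of a composite construct (i.e. one that
// plays a loop wrapper role).
static bool canHoldAllocas(mlir::Operation *op) {
  if (!llvm::isa<mlir::omp::OutlineableOpenMPOpInterface>(op))
    return false;
  if (llvm::isa<mlir::omp::ParallelOp>(op) && op->hasAttr("omp.composite"))
    return false;
  return true;
}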

I agree with @kparzysz that we should just move the parallel outside the loop-associated constructs.

In one of my presentations, I described the core idea of OpenMP as pairs of directives: one to create parallelism (teams, parallel) and one to work-distribute/-share the workload for that parallelism (distribute/workdistribute, for/do/sections/workshare)[1]. The composite directive distribute parallel ... uses the teams level and starts a new thread level of parallelism. The order does not matter. In fact, that the order does not matter is essential to LLVM’s SPMD-mode. So the construct

#pragma omp target teams distribute parallel for
for (int i = ...) { ... }

allows lowering to something CUDA-like:

__global__ void kernel() {
  int i = threadIdx.x; // for
  i += blockIdx.x * blockDim.x; // distribute
  ...
}
...
kernel<<<num_teams,num_threads>>>(); // (target) teams parallel

In these lower-level APIs, team creation and thread creation cannot be split. When using separate teams and parallel directives, the implementation has to guess the number of threads at kernel launch and then use a state machine to track which parallel region we are currently in.

OpenMP has the concept of “binding”. E.g. for binds to the current team. That is, even if used inside a teams, for workshares only between threads of the same team (which is just a single thread with no parallel between them). So the nesting order for it is unimportant. Unfortunately, the following is currently not allowed for syntactic reasons:

#pragma omp teams
#pragma omp parallel
{
  #pragma omp distribute
  for (int i = ...) {
    #pragma omp for
    for (int j = ...)
      ;
  }
}

There are clauses that apply only to the innermost leaf directive (in the upcoming OpenMP 6.0, those with the “innermost-leaf property”) or only to the outermost leaf directive. For those, the order of the directives in the combined/compound construct matters. But AFAICS, none of those change the semantics for this case. Even if they did, it would belong to the same logic in the frontend that decides which clause belongs to which wrapper operation, and that logic knows the original order.

With the compound/composite directive split into several MLIR operations, we already diverge from the literal representation of the source. E.g. even though #pragma omp parallel for simd is a single directive from the user’s point of view, only for simd belongs to the wrapper set, while parallel does not. IMHO this is a good thing, since mid-ends do not need to bother with syntactic-only differences. Moving all the non-loop-associated directives outside is just one step further.


  1. simd violates this by doing both in a single directive. In the OpenMP committee we were discussing splitting it up, which would result in something like the implicit SPMD compiler ispc.

Thank you @Meinersbur for sharing your insights. I understand, then, that it wouldn’t be a problem to extract omp.parallel out of the loop wrapper nest because, even when it appears inside a composite construct, it can still be conceptually understood to be block-associated and executed prior to distribute without necessarily breaking anything.

I guess the reason I didn’t like this idea is that, to me, loop wrappers were basically a way to split the information attached to each leaf of a composite construct while still working as a unit. In that case, it never makes sense to reorder, since you’re basically representing the original syntax 1-to-1, only split into several ops. But if we dig further and accept that loop wrappers should only represent loop-associated constructs, then from your explanation I gather that omp.parallel shouldn’t be one.

Since we know that the only way a loop can be wrapped with omp.distribute and omp.wsloop (plus an optional omp.simd) is as a result of lowering a distribute parallel for/do [simd], we should be able to figure out how to translate it to LLVM IR. It will probably be a bit trickier, but as long as the location where other operations present inside omp.parallel should be lowered is well defined in that case, it can be done.

One minor thing that I’m thinking is that this will distance us a bit further from having a similar representation for a composite construct and the equivalent code that applies each directive separately. Starting with this code as an example:

#pragma omp teams
{
  // Composite.
  #pragma omp distribute parallel for
  for (int i = ...) { // loop over items
    // ...
  }

  // Equivalent non-composite (assuming the same blocking method, etc).
  #pragma omp distribute
  for (int i = ...) { // loop over blocks
    #pragma omp parallel
    {
      #pragma omp for
      for (int j = ...) { // loop over items
        // ...
      }
    }
  }
}

The corresponding MLIR representation would be:

omp.teams {
  // Composite (omp.parallel extracted).
  omp.parallel {
    // ...
    omp.distribute {
      omp.wsloop {
        omp.loop_nest i... {
          // ...
        }
      }
    }
  }
  // Composite (omp.parallel wrapper).
  omp.distribute {
    omp.parallel {
      omp.wsloop {
        omp.loop_nest i... {
          // ...
        }
      }
    }
  }
  // Non-composite.
  omp.distribute {
    omp.loop_nest i... {
      omp.parallel {
        omp.wsloop {
          omp.loop_nest j... {
            // ...
          }
        }
      }
    }
  }
}

I think it’s a minor consideration, since we will produce different MLIR representations for cases that should be semantically equivalent either way. But the closer they are, the easier it should be to actually produce the same code, and the more consistent and easier to understand the dialect becomes.

The purpose of an IR is to abstract over syntax differences, so code processing it does not need separate paths for syntax-only differences. parallel in a composite construct versus stand-alone parallel or parallel in a combined construct seems to be such a case.

I am not unsympathetic to your argument. We would basically pre-lower to SPMD-mode in the frontend. MLIR allows iterative lowering, so that simplification could be done later by a pass. Another lowering possibility would be your example where a single loop is split into i and j according to dist_schedule/OpenMP’s definition of composite constructs:

The semantics of each composite construct first apply the semantics of the enclosing construct as specified by directive-name-A and any clauses that apply to it. For each task (possibly implicit, possibly initial) as appropriate for the semantics of directive-name-A, the application of its semantics yields a nested loop of depth two in which the outer loop iterates over the chunks assigned to that task and the inner loop iterates over the logical iterations of each chunk. The semantics of directive-name-B and any clauses that apply to it are then applied to that inner loop.
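
Concretely, the “nested loop of depth two” from that wording can be sketched as follows; lb, ub, chunkSize, and body are just placeholders, and the chunking shown is only illustrative.

// Outer loop: chunks assigned by directive-name-A (distribute); inner loop:
// logical iterations of each chunk, to which directive-name-B (parallel
// for/do) is then applied.
void compositeSketch(int lb, int ub, int chunkSize, void (*body)(int)) {
  for (int chunk = lb; chunk < ub; chunk += chunkSize)
    for (int i = chunk; i < chunk + chunkSize && i < ub; ++i)
      body(i);
}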

IMHO the cost of that modelling is high: omp.parallel would need to be a polymorphic operation that can either be applied to regions or be inserted between loop wrappers, or we would need separate operations for each (your third proposal: omp.parallel.wrapper). If given the choice, I prefer the latter: it makes the “loop-associated” role explicit and preserves the modeling that separates block-associated and loop-associated operations. It may also avoid subtle errors, e.g. when interpreting a loop as a block or the other way around:

#pragma omp teams distribute parallel for num_teams(1)
for (int i = 0; i < n; ++i) ;
// omp.distribute optimized away because `num_teams(1)`
// Is omp.parallel part of the wrapper nest? It was before.
omp.parallel {
  omp.wsloop {
    omp.loop_nest (%iv) : i32 = (0) to (%n) step (1) {
    }
  }
}

In any case, there is the cost of introducing the semantics of a wrapper omp.parallel for this edge case when, in the end, we will want to lower this to “SPMD-mode” anyway for performance, even on non-GPUs. Is it worth it? I was mostly making the point that the OpenMP specification allows us to hoist the parallel outside, and Clang effectively already does so in the form of SPMD-mode.

I think at this point it’s pretty clear that we don’t want an omp.parallel operation that can take both the role of a loop wrapper and that of a regular block-associated operation. Trying to force this approach results in having to add exceptions frequently wherever loop wrappers are involved. Some of these may be helped by adding the omp.composite attribute as well, but I think we currently have two main options:

  • Adding an omp.parallel.wrapper operation to disambiguate between loop-associated and block-associated parallel. This means there is some redundancy in the dialect, but perhaps it can be minimized by deriving both it and omp.parallel from a common tablegen class. It also follows the representation of the construct in the spec a bit more closely.
  • Always using an SPMD-mode representation for distribute parallel do by extracting omp.parallel out ahead of all loop wrapper operations. This makes it so that omp.parallel is never a loop wrapper and maps easily to the form already expected by the runtime.

The second option has grown on me, but I’d like to know what your thoughts are on making a decision on this.

I think the second option makes sense. Only one operation for parallel is needed instead of two. I don’t think it will introduce any ambiguity with nested parallel ops, since the ParallelOp closest to the DistributeOp would be the one associated with the loop. Having two ops and the resulting redundancy seems worse, but if it turns out to be needed it can always be added later on.

Over the past week I haven’t heard any objections to the currently proposed solution of removing omp.parallel from the set of loop wrappers and hoisting it out when it appears in a distribute parallel do/for composite construct.

If anybody isn’t convinced and prefers to discuss alternatives, please do let me know. Otherwise, I’m planning to start working on this approach shortly. Thank you everyone for your input so far, it’s very much appreciated!

I haven’t followed the discussion fully. Would it be possible to hoist it out as a canonicalization transformation, for ease of lowering (or handwriting MLIR) and understanding?

Thank you @kiranchandramohan for sharing your thoughts.

The main problem is that omp.parallel has the OutlineableOpenMPOpInterface and LoopWrapperInterface, which are functionally incompatible. The first is used to find blocks where allocas are inserted during lowering and certain transforms (e.g. FirOpBuilder::getAllocaBlock() or AllocMemConversion::findAllocaInsertionPoint()), and the second disallows that kind of use.

So, this solution removes the loop wrapper interface, which is a “static property” of the operation. This forces us to hoist omp.parallel out of an omp.distribute + omp.wsloop loop wrapper nest, because leaving it in the middle of two wrappers would be an invalid MLIR representation. What this means in practice is that the “hoisted omp.parallel representation” would be the only way to represent distribute parallel do/for. To pretend omp.parallel was a loop wrapper while constructing an initial MLIR representation, we’d have to initially produce something that wouldn’t pass op verifiers. My guess is that this wouldn’t work with handwritten MLIR, since I suppose it would have to pass verification as a first step.

I think the way to achieve something like that would involve creating a separate parallel wrapper op, which would later be hoisted and swapped for the non-wrapper version. The only difference would be that temporary allocas created in lowering would have already materialized into the parent function rather than the parallel op’s region. Though, if we were to add that operation, then there wouldn’t be a need for hoisting the parallel region out at all; we could just make that the canonical representation too (this is option 1 in my other comment above).

@kiranchandramohan did my previous reply address your concerns about that approach, or should we discuss further before committing to it? Thank you!

Would it work if we restricted the use of the OutlineableOpenMPOpInterface to only the cases where it is
→ a standalone construct (not part of a combined/composite construct), or
→ the outermost leaf construct in a combined/composite construct?

If we hoist out the omp.parallel operation, wouldn’t the allocas also end up being hoisted? And wouldn’t that be incorrect?

I think that’s close to the second potential solution I presented initially in the RFC. We wouldn’t have to restrict combined constructs, regardless of the other leaves in the construct, since they are equivalent to specifying each leaf construct separately. In this case, the exception we could make to the OutlineableOpenMPOpInterface would just have to be that the operation is not currently part of a composite construct. This can be easily done by checking that it implements the LoopWrapperInterface and looking at its child/parent ops or querying the yet-to-be-added “composite” attribute.

In terms of what happens with allocas if we hoist the omp.parallel operation, we can make a comparison for a distribute parallel do construct:

  1. If omp.parallel was a loop wrapper and we inhibited the OutlineableOpenMPOpInterface as suggested, allocas for temporaries would be inserted in the entry block of the function. Alternatively, we could implement further changes to get them as close to the loop as possible (i.e. in the entry block of the parent operation of omp.distribute), but they wouldn’t be added to any loop wrapper.
  2. If we hoisted omp.parallel out of the loop wrapper set, then allocas would be inserted into that operation’s entry block. When lowering to LLVM IR we’d be able to identify this case easily, because there’s no other way omp.distribute and omp.wsloop wrappers could end up being applied to the same loop nest (there’s no distribute do/for construct); a rough sketch of such a check is shown below. At that point we should be able to make sure that allocations are lowered in the right spot and that control flow follows the expected behavior.
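
As a rough sketch of such a check on the translation side (op class names are assumed to match the dialect’s generated C++; this is not the actual OpenMPIRBuilder-based code):

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"

// Under the hoisted representation, an omp.distribute directly wrapping an
// omp.wsloop can only come from lowering 'distribute parallel do/for', since
// 'distribute do/for' does not exist; the enclosing omp.parallel is then known
// to be the hoisted part of that composite construct.
static bool isHoistedDistributeParallel(mlir::omp::DistributeOp distOp) {
  mlir::Region &region = distOp->getRegion(0);
  if (region.empty())
    return false;
  return llvm::isa<mlir::omp::WsloopOp>(region.front().front()) &&
         llvm::isa_and_nonnull<mlir::omp::ParallelOp>(distOp->getParentOp());
}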

Moving allocas out of the parallel region is likely to lead to race conditions and would be incorrect.

Are you talking about something like the following?

omp.teams
  omp.parallel
    omp.distribute
      omp.wsloop
        omp.loop_nest

Wouldn’t this violate constraints like “A distribute region must be strictly nested inside a teams region”? Although we are probably not checking this constraint, it would look a bit odd that this is allowed and is our canonical representation.

Yes, by default it would result in that kind of behavior. However, once delayed privatization support is rolled out to all operations involved, it should be possible to add “private” clauses to the relevant operation when these allocas are created. I think that’s an option to address that issue.

Yes, that’s the hoisted omp.parallel approach. In that case, allocas would go between omp.parallel and omp.distribute, there could be any number of ops between omp.teams and omp.parallel, and omp.distribute and omp.wsloop would wrap omp.loop_nest with no other ops in between.

Not necessarily. That is a constraint of the OpenMP distribute construct, but we can choose to say that omp.distribute doesn’t represent the same restrictions in exactly the same way. Even now we can’t check in the op verifier that omp.distribute ops always have an omp.teams direct parent, because the teams region might be defined in another function that calls the one omp.distribute resides in. As long as we can represent the same constructs allowed by the spec and detect unsupported cases, the MLIR representation doesn’t have to look exactly the same (though we do want to be as close as possible). We already made that decision when we chose to follow the loop wrapper approach instead of the “explicit outer loops” one, which wouldn’t have had any of these issues (but would have been difficult to make work with the OpenMPIRBuilder-based translation to LLVM IR).

If we decide to go with the hoisted omp.parallel approach, we would be creating a unique representation for distribute parallel do where operations appear out of order. However, I don’t think that necessarily means their semantics would also be applied in that new order; we just know this is a special case because that sort of nesting of operations cannot represent anything else (distribute do is an invalid construct). We should be able to implement the MLIR to LLVM IR translation in a way that processes this pattern such that the semantics of the OpenMP construct it represents are still followed.

General FIR/HLFIR lowering can also create these temporary allocas. It will be difficult to create these as privateClause operations without being intrusive.

We should add the composite attribute or something similar to explicitly call this out. Also document this appropriately so that another frontend or a user manually writing MLIR can do this without being confused.

I agree, it would be difficult to do this transparently.

Yes, we’re currently working on the attribute, since we intended to add it in any case, and I think it’s a good idea to document this to avoid confusion as well. If you’re ok with this approach, then I’ll start working on it.