[OpenMP Dialect] Workshare loop lowering flow

The OpenMP worksharing loop construct specifies that the iterations of a loop will be executed in parallel by the threads in the team of the parallel region to which the loop binds.
The loop’s body is constrained to be a structured block (single entry, single exit), but branches inside the body are allowed. Such branches are common in high-level languages, particularly in older Fortran-style code, so the body of the loop can be modelled as an operation with an AnyRegion body. The worksharing loop construct also has several clauses associated with it; these can be modelled as in the review David created for the omp.do construct, https://reviews.llvm.org/D86071. The loop bounds and increment can be Index or LLVM integers.
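For illustration, an omp.do operation carrying several clauses might be written as follows. This is only a hypothetical rendering; the exact clause syntax would follow D86071.
omp.do schedule(static) collapse(2) nowait {
  // loop nest to be workshared
}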
The general lowering flow that we are using for the OpenMP dialect is: High Level Dialect + OpenMP → Conversion → LLVM + OpenMP → Translation (using the OpenMP IRBuilder [1]) → LLVM IR. So a lowering flow with the omp.do operation and an scf.for loop nested inside would look like the following code blocks. Note: there is another RFC which discusses whether we should have one operation or two operations for the worksharing loop flow, OpenMP Worksharing Loop RFC. Note: this RFC assumes the two-operation flow; if we decide to go with the single-operation flow, then an omp.wsloop with Index-typed bounds and increment and all the other clauses would be created directly, converted to an omp.wsloop on the LLVM dialect, and then passed to the OpenMP IRBuilder. The single-operation flow is thus similar to starting from step (2) below. For the lowering flow described below, I assume two operations.
i) omp.do : This works like the directive in the OpenMP standard. The loop to be workshared is nested inside this operation; anyone who wants to workshare a loop can wrap it in an omp.do operation.
ii) omp.wsloop : This is the real OpenMP worksharing loop, with explicit bounds and increment. This loop (on the LLVM dialect) will be passed to the OpenMP IRBuilder.

  1. Consider an scf.for loop nested inside an omp.do operation as given below.
func @some_ops_inside_loop(%start: index, %end: index, %inc: index, %threshold: index) {
  omp.do schedule(static) {
    scf.for %i = %start to %end step %inc {
      %cond = cmpi "slt", %i, %threshold : index
      scf.if %cond {
        "some.op1"() : () -> ()
      } else {
        "some.op2"() : () -> ()
      }
    }
  }
  return
}
  2. An early transformation pass can merge the omp.do and the scf.for and create an omp.wsloop operation. This pass can also invoke transformations like loop coalescing to collapse loop nests if necessary before the merge happens (see the sketch after the next code block).
func @some_ops_inside_loop(%start: index, %end: index, %inc: index, %threshold: index) {
  omp.wsloop %i = %start to %end step %inc schedule(static) {
    %cond = cmpi "slt", %i, %threshold : index
    scf.if %cond {
      "some.op1"() : () -> ()
    } else {
      "some.op2"() : () -> ()
    }
  }
  return
}
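For illustration, here is a minimal sketch of the coalescing step for a doubly nested loop under a collapse request; the collapse clause and the div/rem index-recovery recipe are hypothetical here, not part of the proposal above.
func @collapse_example(%n: index, %m: index) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  omp.do collapse(2) schedule(static) {
    scf.for %i = %c0 to %n step %c1 {
      scf.for %j = %c0 to %m step %c1 {
        "some.op"(%i, %j) : (index, index) -> ()
      }
    }
  }
  return
}
After coalescing and merging, this could become a single omp.wsloop over the product trip count:
func @collapse_example(%n: index, %m: index) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %total = muli %n, %m : index
  omp.wsloop %iv = %c0 to %total step %c1 schedule(static) {
    %i = divi_signed %iv, %m : index
    %j = remi_signed %iv, %m : index
    "some.op"(%i, %j) : (index, index) -> ()
  }
  return
}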
  3. Now the normal lowering flow can kick in for lowering the rest of scf to the standard dialect and then to LLVM, and we end up with OpenMP + LLVM dialect.
llvm.func @some_ops_inside_loop(%start: !llvm.i64, %end: !llvm.i64, %inc: !llvm.i64, %threshold: !llvm.i64) {
  omp.wsloop %i = %start to %end step %inc schedule(static) {
    %2 = llvm.icmp "slt" %i, %threshold : !llvm.i64
    llvm.cond_br %2, ^bb1, ^bb2
  ^bb1:
    "some.op1"() : () -> ()
    llvm.br ^bb3
  ^bb2:
    "some.op2"() : () -> ()
    llvm.br ^bb3
  ^bb3:  // 2 preds: ^bb1, ^bb2
  }
  llvm.return
}
  4. The OpenMP + LLVM dialect is then translated to LLVM IR using the OpenMP IRBuilder. The OpenMP IRBuilder will insert the control flow for the worksharing loop as well as the runtime calls. The OpenMP IRBuilder function for creating the worksharing loop in LLVM IR will have a signature similar to the existing CreateParallel function (https://llvm.org/doxygen/OMPIRBuilder_8cpp_source.html#l00396):
    InsertPos CreateWorksharingLoop(…, InductionVar, LowerBound, UpperBound, Step, …, BodyCodeGenCallback)

[1] The OpenMP IRBuilder project generates the LLVM IR with runtime calls for an
OpenMP construct. It also aims to unify the OpenMP LLVM IR code generation for
Clang and Flang. This is achieved by refactoring the codegen for OpenMP directives
out of Clang and placing it in the llvm/Frontend directory.

FWIW, I would go with a single operation until you have a clear reason not to. So far I am not aware of anything specific you could correctly reuse anyway.

I think we should do the collapse in the IRBuilder but I can be convinced that we should not.


I assume there must be some sort of constraint that triggers this transformation? (Otherwise, why go for two new ops at all?)
I’m interested in the cases where this fails and we are not able to create omp.wsloop: what will the lowering flow be then?

Thanks, @SouraVX for the question. I believe the one operation flow is the top contender now. But I will answer since you asked.

If the conversion is not possible then that would signal an error; there is no alternative flow. It is just that the two operations might be good for:

  1. Debugging the lowering from the parse tree to MLIR (assuming no OpenMP-specific transformations have happened and there are no direct calls to the runtime in the code).
  2. If the OpenMP IRBuilder is capable of lowering only a single loop, then the two-operation flow provides the opportunity to perform the loop collapse in MLIR.
  3. The directive operation provides an easy-to-use mechanism inside MLIR. For example, someone might have an affine or scf loop they want to parallelize; putting an omp.do operation around it is straightforward (see the sketch below).
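
A minimal sketch of that last point, wrapping an existing affine loop in the omp.do operation proposed above; the function and the ops inside the loop are made up for illustration.
func @parallelize_me(%n: index, %A: memref<?xf32>) {
  omp.do schedule(static) {
    affine.for %i = 0 to %n {
      %c = constant 1.0 : f32
      affine.store %c, %A[%i] : memref<?xf32>
    }
  }
  return
}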

What is the exit procedure from this loop, i.e. what kinds of terminators are accepted and return control flow to the loop (continue, break?), and from which blocks?

This is probably out of scope, but I would seriously consider introducing a type interface for integer-like types and allowing any such type to be used as a loop induction variable. This would remove the dependence on the LLVM dialect and make the op future-proof against somebody writing custom integer types (although OpenMP arguably won’t be compatible with integers that we cannot model as standard types).

What is the benefit/need of having this split-op representation? I understand that the OpenMP standard has it as a separate entity, but I assume the goal of the standard is to avoid significantly modifying the languages that embed OpenMP, which is not a concern for MLIR, where ops are cheap.

Furthermore, the layering is unclear. If core is only aware of the “omp.do + scf.for -> omp.wsloop” transformation, an out-of-tree user of “omp.do + out-of-tree.for” will have to provide an “omp.do + out-of-tree.for” -> “omp.do + scf.for” transformation out of tree, at which point they could just as well target omp.wsloop directly.

If we come from a frontend that has a pragma with a collapse clause, I think we should thread it through different representations, e.g. with multi-for loops. However, if we generate loops from higher-level abstraction such as TF and then map them to OpenMP loops as a parallelization strategy, I’d pretty much like to be able to collapse the loops myself, either in a multi-for or in a single loop.

I would much rather have transformation directives as attributes than as ops. The GPU dialect has made some progress there. In this specific case, I would expect affine.for to be converted to scf.parallel, which in turn is converted to omp.wsloop. We can have a should-map-to-openmp attribute attached early to an operation and have a mechanism to check validity at the level where the conversion actually happens.
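
For concreteness, a minimal sketch of such a declaration-of-intent attribute on scf.parallel; the attribute name and its placement in the trailing attribute dictionary are hypothetical:
scf.parallel (%i) = (%lb) to (%ub) step (%step) {
  "some.op"(%i) : (index) -> ()
} {omp.should_map_to_openmp}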

Thanks @ftynse for your comments.

Continue (or CYCLE, the equivalent in Fortran) is allowed. Only the loop condition check can branch out of the loop.

This sounds interesting. Can you give me some pointers for similar code?

The only concern with attributes is whether they add information about one dialect (OpenMP) into another (scf).
You talk about a parallelization strategy in the previous point; is that a concept in MLIR?
Also, just to be sure, is the following the flow that you are suggesting?

  1. affine.for i = 1 to n
  2. scf.parallel i = 1 to n {should-map-to-openmp}
  3. omp.parallel {
       omp.wsloop i = 1 to n
     }

Maybe this is a lack of understanding on my part.
-> Since scf.for is the only general loop construct available in core, I was thinking that out-of-tree users would have a conversion from their loop construct to scf.for in their flow, and scf.for is a common intermediate point users can reach to get/reuse a bunch of loop transformations.
-> And since there is a transformation in core from omp.do { scf.for } to omp.wsloop, it would just work for the user without any further changes. Why would the user need a separate omp.do + out-of-tree.for to omp.do + scf.for conversion?

How are these represented in the IR? It is not specified in the proposal, and the example just omits the terminator in one of the blocks, which is probably invalid IR.

I’m not aware of type interfaces being used in-tree. That’s another reason why it is interesting :slight_smile:

This is a tricky question. I do want to avoid the dependency. Anybody can attach any (dialect) attributes to operations, but they can also be removed. Dialect attributes have a verifier, in which it is possible to add additional checks and catch IR in an invalid state. But nothing other than having a dedicated op prevents passes that are unaware of the additional semantics from breaking it. That’s why I’m thinking in terms of “declaration-of-intent” attributes and additional checks that happen before the actual transformation.

affine.for -> affine.parallel -> scf.parallel -> omp.parallel/wsloop

The first two arrows exist already, which is the whole point of progressive lowering and reuse.


Good question, and I will need some handholding here. For the terminator, we have a few points to consider:

  1. Only the loop condition check can branch out. So the terminator can be thought of as giving control back to the parent op, with execution continuing at the next operation.
  2. Continue can be modeled as a branch to the block containing the terminator.
  3. Since there are no loop-carried dependences in a worksharing loop, I guess such values need not be carried around or returned.
  4. There is a reduction clause on the worksharing loop; this has to be handled. I see that scf.reduce returns the reduced values in the terminator and also accepts initial values. You previously mentioned that SizedRegion<1> is required for the reduction. Does that mean we cannot use such a reduction for the worksharing loop and will have to use loads and stores? I guess if the reduction is always executed (and not bypassed) it should be OK.
  5. The lastprivate clause sets a variable to its value from the last iteration. Could this be modelled with a value returned in the terminator? (A sketch follows this list.)
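
A minimal sketch of point 5, assuming the omp.wsloop syntax above is extended so that the loop produces results and omp.yield carries operands; both extensions are hypothetical:
// Hypothetical: lastprivate modeled as a loop result fed by omp.yield.
%x_last = omp.wsloop %i = %lb to %ub step %step -> (i64) {
  %x = "some.compute"(%i) : (index) -> i64
  omp.yield %x : i64
}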

I updated the example with a yield terminator below.

llvm.func @some_ops_inside_loop(%start: !llvm.i64, %end: !llvm.i64, %inc: !llvm.i64, %threshold: !llvm.i64) {
  omp.wsloop %i = %start to %end step %inc schedule(static) {
    %2 = llvm.icmp "slt" %i, %threshold : !llvm.i64
    llvm.cond_br %2, ^bb1, ^bb2
  ^bb1:
    "some.op1"() : () -> ()
    llvm.br ^bb3
  ^bb2:
    "some.op2"() : () -> ()
    llvm.br ^bb3
  ^bb3:  // 2 preds: ^bb1, ^bb2
    omp.yield
  }
  llvm.return
}