OpenMP Worksharing Loop RFC

Hi all,
We have been working on the OpenMP “worksharing loop” construct (aka “do”) for
MLIR and would like to start a discussion on a couple of design topics we have
encountered. A separate RFC on the lowering process for this construct will be posted soon.

The OpenMP worksharing loop represents a loop whose iterations can be distributed
across a number of threads and executed in parallel.

When developing a construct for this in the OpenMP MLIR dialect, the question
arises whether to model it as an operation that contains another loop operation,
or as a loop operation in its own right.
For example:

omp.do {
  scf.for %iv = %lb to %ub step %step {
    // body
  }
}

or:

omp.do %iv = %lb to %ub step %step {
  // body
}

The first has the advantage of letting us reuse optimisations that already exist
in the scf.for operation, or in other loop-style operations that we might want
to contain inside the omp operation (for example fir.do).
However, when the loop body is lowered to the LLVM dialect the loop bounds are
no longer present, since LLVM doesn’t have a loop operation itself. As such, the
loop bounds would somehow need to be remembered so that they can be used when
generating the relevant OpenMP runtime calls with the OpenMPIRBuilder[1]. This
approach is also more similar to how the OpenMP directives appear in higher-level
languages, and is more consistent with other operations like omp.parallel.

The second resolves this problem, but doesn’t allow optimisations from other
dialects to be reused.

Another option would be to have both available, with a preference for MLIR to be
generated using the first form and a later lowering step adding the bounds to
the omp.do construct so that they can be used later. Alternatively, the lower
bound, upper bound and step could simply be replicated on both constructs every
time (though this might be quite error-prone if the two sets of values ever get
out of sync).

Does anyone have any input on this, or possibly any ideas that I haven’t thought
of for resolving this issue?

You can find the current proposed patch and some discussion on the review for it
here: https://reviews.llvm.org/D86071

Thanks!
David Truby

[1] The OpenMP IRBuilder project generates the LLVM IR with runtime calls for an
OpenMP construct. It also aims to unify the OpenMP LLVM IR code generation for
Clang and Flang. This is achieved by refactoring the codegen for OpenMP directives
out of Clang and placing it in LLVM’s Frontend directory.

Mentioning everyone who commented in D86071 or was pinged there. @mehdi_amini @River707 @jdoerfert @ftynse @bondhugula @nicolasvasilache @clementval @SouraVX

The first has the advantage of letting us reuse optimisations that already exist
in the scf.for operation, or in other loop-style operations that we might want
to contain inside the omp operation (for example fir.do).

I have to stress this again: a loop “inside” an omp.do is not a regular loop; it is a worksharing loop. In addition to all the effects omp.do clauses have, the loop itself is special. Let’s assume you want to do a transformation of an omp.do loop. You certainly can do that, but only if you check for legality. Even transformations which are legal for all regular loops, e.g., strip-mining a few iterations (see below), are not legal for a worksharing loop. Thus, whatever you want to “reuse” needs to be aware of the omp.do, and therefore it is questionable why you can’t make the thing you want to “reuse” work directly with the omp.do.

strip-mining:

for (int i = 0; i < 12; ++i)
    body(i);

===>

for (int i = 0; i < 12; i+=4)
    for (int j = 0; j < 4; ++j)
        body(i+j);

Thanks @jdoerfert for highlighting the issue about illegal transformations. Thanks @DavidTruby for the RFC. Just wanted to add two more points.

  • The OpenMP do construct can apply to more than one loop, e.g. when it has the collapse(n) clause (see the C sketch after this list). In that case, what would the representation be? Keeping the construct separate from the loops seemed to be a better representation here: a transformation can collapse the n loops, and the collapsed loop can then be converted to an omp loop-like operation.

  • Keeping omp.do separate from the loop to be workshared allows us to remove the omp.do without changing the semantics of the loop. This was suggested as a nice to have by @schweitz.
    http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000277.html
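
For reference, a minimal C sketch of a worksharing loop associated with two loops via collapse(2) (N, M and body are placeholders); the directive’s iteration space is the combined i/j space, so wrapping only the outermost loop would not capture it:

#pragma omp for collapse(2)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        body(i, j);  // i and j iterations are distributed together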

Keeping omp.do separate from the loop to be workshared allows us to remove the omp.do without changing the semantics of the loop. This was suggested as a nice to have by @schweitz.

This I don’t understand. In general, omp.do cannot be removed from a loop. The most obvious example uses a firstprivate(x) clause where each iteration increments x. If you remove the omp.do, the value of x carries over from one iteration to the next; with the omp.do, it “starts fresh” at the beginning of each chunk. This is clearly a change in semantics, not to speak of thread-local memory, the result of omp_get_thread_num(), …
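
As a C sketch of that difference (use() is just a placeholder):

int x = 0;

#pragma omp parallel for firstprivate(x)
for (int i = 0; i < 12; ++i) {
    ++x;     // increments this thread's private copy of x,
    use(x);  // which was initialised from the original x
}

// Without the directive, the increments accumulate across all 12
// iterations, so use() observes a different sequence of values.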

Thanks for the feedback @jdoerfert. Given what you’ve outlined here, it might be that the simple option of just having one omp loop operation that does not contain another loop is the best approach, since sharing optimisations might be an issue anyway.

I wonder if there is a way to share some optimisations on an opt-in basis? For example, do affine.for and scf.for share any optimisations at all? If so, we could use the same mechanism that they do.

I guess the next question is how to represent collapse if we go this way.

I guess the next question is how to represent collapse if we go this way.

One way is to store the lower bound, upper bound, and step of each associated loop on the omp.do and only nest the body (plus any intermediate statements, which we implicitly sink into the innermost body). We would then allow OMPIRBuilder::CreateWorksharing to take a list of loop bounds and the body callback.

Thanks for bringing the proposal up for discussion @DavidTruby, and thanks for the feedback @jdoerfert!

A couple of general observations first:

Generally, we encourage folks to implement transformations on op interfaces rather than make them aware of specific ops. The idea is that ops can opt into a transformation by implementing a corresponding interface and thus establishing a contract with the transformation about legality. This is a relatively recent feature and “older” transformations such as those on loops may not be using it to its full potential.

When the generalization is impossible or undesirable, the progressivity of lowering still plays an important role. If at some stage in your compilation pipeline, scf.for exists as an op, you can run optimizations on it before converting it to something else.

Generally, semantic differences lead to a new op being introduced.

I also share the layering concern. We don’t want to make other dialects or transformations unnecessarily aware of OpenMP. Consider lowering a nest of three scf.for nested inside omp.do. The lowering would need to transform some loop ops to CFG but keep the others intact depending on the attributes of omp.do. Arguably, this can be achieved by setting up pattern benefits accordingly, but it sounds too brittle.

You can also consider using dialect attributes to annotate existing loops with omp.do. Attributes give you a verifier hook and can be attached to any operation, partially inverting the awareness chain (an attribute may know about the operations it can be attached to, but the operations don’t have to know about the attribute). This is still subject to the layering concern above when it comes to lowering.

With all this considered, I would suggest having a separate omp.do operation that has the semantics of a worksharing loop nest, similar to scf.parallel.

omp.do (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%stepi, %stepj)
        other-clauses-here-if-necessary {
  // collapse(2) is implied by having two iterators
  omp.yield  // custom terminator can be helpful for reductions
             // or reuse scf.reduce if suitable
}

Other loop types would have conversion patterns to omp.do. I would encourage going through scf.parallel, which has a similar structure and can be targeted from higher-level dialects; it also has some useful transformations like coalescing (folding two nested loops into a single “collapse(2)” loop). Creating instances of this operation in the IR will essentially tell the transformations to leave it alone all the way down to LLVM, modulo type changes, while letting the code around and inside it be transformed.
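
To illustrate coalescing in C terms (N, M and body are placeholders; the actual transformation works on scf.parallel in the IR, but the effect is the same):

for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        body(i, j);

===>

for (int k = 0; k < N * M; ++k)
    body(k / M, k % M);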

Now the question remains of what should be emitted by frontends. I wouldn’t mind if the frontend directly emits an omp.do instead of its language-specific loop construct, with the same argument as above: it may make language-specific transformations agnostic of OpenMP (by making them treat unknown “omp”-dialect ops conservatively). It can also emit a “wrapper” around its loop construct, or attach an attribute to it, if it wants to make certain transformations aware of OpenMP, then convert the wrapped/attributed construct to omp.do. I can imagine the wrapper being a language-specific “pragma” construct, for example.

I added the lowering flow RFC (link below). Had prepared it before reading @ftynse’s comments.

[… reuse optimizations …]

Can someone please give me an example of an optimization you want to reuse for omp.do loops? I feel this discussion needs some more concrete details rather than “ideas”.

FWIW, that is what I would vote for here. (Unrelated to my thoughts on the general concept here.)

And new patterns will break old code.

+1

I feel this is again a risky encoding for which we hope everything knows about it or it breaks. IMHO, we should revisit my first question (the why) before we choose something like this.

That sounds fine. I have a general question about loops in MLIR: most of the loops (scf, affine) seem to have only a single block with SizedRegion<1>. Is there any particular reason for choosing a single block? Here we would want any number of blocks (AnyRegion). Will that cause any issues?

I personally read this as “use existing optimizations on non-OpenMP loops”, which remain represented as scf.for or equivalent.

New patterns should just fail to apply if there is no OpenMP operation in the IR. The benefit is generally configurable independently of the pattern action, but I’m not aware of any upstream flows that actually do this.

This is a relatively underexplored area, but in general MLIR transformations should have strong precondition/postcondition checks and conservatively bail out on unknown operations they cannot handle. This is not always the case in practice, and I’m certain we can make most loop transformations do bad things with respect to the code surrounding them. I agree this is risky; my line of thought was to represent AST-like constructs, including a pragma node, in the IR and lower them to something else before calling any sort of non-language-specific transformation.

Originally, this was due to the notion of structured control flow and the unwillingness to mix structured control flow and CFG. For affine analyses, it still matters that the loop body executes entirely. For SCF, today, it is mostly a question of modeling reductions without resorting to memory loads/stores: it is easier to have a single terminator that “yields” values from the iteration and no possibility of undefined iteration results. I don’t think there is a fundamental problem with multi-block loops.

It seems that we’ve reached a consensus that omp.do needs to be treated differently enough from other types of loop that it makes the most sense to have it as a separate loop operation rather than a directive-like operation attached to another loop.

I’ll proceed with the implementation of the operation, parser and pretty printer along those lines.

Thanks!

Originally, the consideration was to have the OMP dialect look more like the directives. One advantage of this was conjectured to be that it would allow the non-directive code to be lowered in a straightforward, regular, and rigorous way, and the dialects could co-exist and cooperate. Secondly, it should make eliminating the directive-like ops easy to do, leaving the original non-directive-annotated code.

It was also clearly understood that the OMP IR design would have to evolve as more experience was gained.

It should be pointed out that there is a compromise alternative between this original “OMP directive” approach and the alternative of lowering the code directly into an agnostic OMP dialect: have lowering use a directive-like IR, and then have an IR pass that transforms those directives plus “other dialect” ops into a source-language-agnostic OMP IR substrate, potentially replacing fir.do_loop, scf.for, etc. with some OMP parallel loop form.

This would keep the advantages mentioned above, and allow the OMP dialect to perform a rewrite on the MLIR representation, which is very well suited to performing rewrites. A disadvantage might be that the complexity would be split across lowering (to a directive-like form) and a pass (to rewrite the IR); however, that point could equally be argued as an advantage, since it provides a separation of concerns. The show-stopper would be if there are some OMP semantics that would be truly unrecoverable because they cannot be properly expressed in any sort of directive-like IR form, regardless of its design.


@schweitz I am unsure if you followed this discussion (it is rather long), but your proposal was discussed and there are multiple problems with it.

A worksharing loop is not a worksharing “thing” plus a regular loop. You cannot pretend the loop is like any other scf/fir loop and expect correctness. As pointed out earlier, any transformation that modifies the loop stride or bounds would be incorrect. Loop interchange, collapsing, strip-mining, tiling, … are all incorrect if there is a worksharing directive bound to the loop (and they do not take appropriate precautions). That means merging late exposes the potential for miscompilation, as was pointed out today in the Flang-Dev call.
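
To make the strip-mining case concrete, here is a C sketch (body is a placeholder). With schedule(static, 1) the specification guarantees a round-robin mapping of iterations to threads; strip-mining the associated loop first silently changes that mapping:

#pragma omp for schedule(static, 1)
for (int i = 0; i < 12; ++i)
    body(i);  // iteration i runs on thread i % nthreads

===>

#pragma omp for schedule(static, 1)
for (int i = 0; i < 12; i += 4)  // only 3 chunks are now shared
    for (int j = 0; j < 4; ++j)
        body(i + j);             // 4 consecutive iterations per thread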

It was argued that separation doesn’t even provide benefits. If you have to teach everything about the worksharing loop, why not teach it about a wsloop op instead? In case you disagree, could you please provide a specific example to justify separation? Also, why do you believe it is sound to do so? I mean, why can generic transformations be unaware of the special meaning of worksharing, offloading, and data-sharing (private/firstprivate/…)?
