MLIR Summit OpenMP Roundtable discussion (Summary)

I am trying to summarize the roundtable discussion at the MLIR summit last Thursday. We were around 15 people; most expressed interest only in the general development of the OpenMP dialect. I had hoped that someone from the team currently developing OpenMP for Flang would have been there as well.

This summary is limited to what I can remember, so it is necessarily subjective. Feel free to add topics we discussed.

Dialect Pipeline Overview

  • The current design has Flang generate FIR and the OpenMP dialect from its internal AST. The OpenMP dialect is lowered directly to the LLVM substrate (NOT the LLVM dialect[1]) by the OpenMPIRBuilder. This design allows Clang to use the OpenMPIRBuilder such that we only have to maintain a single OpenMP lowering component. Clang’s AST-based OpenMP codegen would be moved gradually to the OpenMPIRBuilder. This codegen path is enabled with the -fopenmp-enable-irbuilder flag, which is off by default.

  • OpenMP-specific optimization takes place in the LLVM openmp-opt pass which recognizes libomp runtime calls.

  • In the long term, if Clang moves to an MLIR substrate[1:1], it could also emit the OpenMP dialect. If that is the case, the OpenMPIRBuilder would no longer be needed and OpenMP could instead lower to other dialects, such as GPU or scf. Optimizations could take place at a higher level directly in MLIR.

  • Until then, OpenMP semantics should be implemented in the OpenMPIRBuilder only to avoid duplication of logic for Clang and Flang, and to not diverge in semantics.

  • After the summit I had a discussion with the ClangIR developers at Meta about the possibility of Clang generating the OpenMP dialect, and how long it may take.

OpenMP Offloading

  • Offloading support for the OpenMP dialect is currently under development (e.g. D136872)

  • Clang parses the source (at least) twice: once for the host code and once for each target. However, the OpenMPIRBuilder, at least for #pragma omp parallel, does the outlining itself. I do not know how Flang will handle this. Personally, I think parsing only once is better: it compiles faster and ensures consistency between host and offload code[2].

OpenMP Loop Transformations

Implementation of #pragma omp unroll and #pragma omp tile

  • I intended this topic to be a continuation of an earlier thread.
  • In OpenMP semantics, further loop-associated constructs can apply to the loops generated by loop transformations. The generated loop nest can have a completely different structure than the one the loop transformation applies to. For instance, tiling a loop nest with 3 loops results in 6 loops. Further loop-associated constructs can be applied to those 6 loops.

  • The current design of omp.wsloop does not consider loop transformations. The above thread considers using scf.for and the transform dialect for it.

  • I think this is a “leaky abstraction”: it does not quite fit the OpenMP semantics. For instance, an optimization may optimize away an scf.for because it has only a single iteration, not considering that it is syntactically required by OpenMP because wsloop needs a loop to apply to. Another example: #pragma omp unroll has a “compiler heuristic mode”, but transform.loop.unroll always requires an unroll factor. transform.loop.unroll also does not return a handle, whereas #pragma omp unroll partial generates a loop that other directives can apply to.

  • I consider it best if the OpenMP dialect had its own loop and transformation operations, which by definition correspond to the OpenMP definitions, even as the OpenMP specification evolves in newer versions. The OpenMPIRBuilder already has tileLoops and unrollLoop methods that return handles that can be further transformed. Even without using those, the OpenMP operations could still be lowered to other dialects if their definitions match.

  • An idea of how this could look:

%outer = omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  %inner = omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    ..
  }
}
%tiled:4 = omp.tile loops(%outer,%inner) { tile_sizes=[4,4] } ;
omp.ws loops(%tiled#0)

This first tiles and then workshares a loop (omp.ws instead of omp.wsloop).

@ftynse %inner is within the scope of %outer and therefore would need to be yielded into the surrounding scope, which is awkward for a static property. How does the transform dialect handle this?


  1. “Substrate” was coined by @nhaehnle in his LLVM DevMtg keynote to distinguish the LLVM dialect within the MLIR substrate from plain LLVM-IR (with PHINode instructions etc.).

  2. There were issues where different overloads were selected on host and accelerator when an argument is long double.


Thanks @Meinersbur for updating us about the OpenMP roundtable meeting. None of us from the OpenMP for Flang team could make it to the LLVM meeting or the MLIR summit since most of us are based outside the US.

Some trivial canonicalisation is modelled in the OpenMP dialect, like removing empty OpenMP parallel regions (D130657, [mlir][OpenMP] omp.parallel side effects).

The OpenMPIRBuilder could still be useful in that it can emit the LLVM dialect in MLIR instead of LLVM-IR. Also, we probably cannot completely do away with the requirement of using the OpenMP runtime, so a full decomposition into other dialects might be difficult. The OpenMPIRBuilder might still be a useful interface to the OpenMP runtime.

Is applying further loop transformation constructs to produced inner loops part of the standard?

There were a few other proposals as well.

This looks fine. The question I have is whether this will be more powerful than what will be in the OpenMP standard, and whether successive application of loop transformation constructs can be modelled by nesting. An example is given below.

omp.ws loops
omp.tile loops(%outer,%inner) { tile_sizes=[4,4] }
omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    ..
  }
}

I need some clarification here: if we have three loops, what will be the arguments for loops?

omp.ws loops
omp.tile loops(%outer,%inner) { tile_sizes=[4,4,4] }
omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    omp.canonical_loop for (%y0) to (%y10) step (%y1) {
      ..
    }
  }
}

If tile_sizes with three entries [4,4,4] here means applying tiling to the loops with indices %c0, %d0 and %y0, then there is probably no need to distinguish the loops with the loops attribute/operand. So it can effectively be simplified to the following, where the omp.ws operation applies worksharing to the outermost loop formed after tiling.

omp.ws
omp.tile { tile_sizes=[4,4,4] }
omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    omp.canonical_loop for (%y0) to (%y10) step (%y1) {
      ..
    }
  }
}

The question I was asking is whether nesting is sufficient to model the loop transformation operations in the OpenMP standard.

For OpenMP 5.2 this would be sufficient. However, future OpenMP versions will include an apply clause that allows applying directives to loops in the middle of a generated loop nest. E.g.

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ; 

which is equivalent to

for (int i1 = 0; i1 < 64; i1 += 4)
  #pragma omp unroll
  for (int i = i1; i < i1 + 4; ++i) ;