About OpenMP dialect in MLIR

Hi,

I have a few questions / concerns regarding the design of the OpenMP dialect in MLIR that is currently being implemented, mainly for the f18 compiler. Below, I summarize the current state of the various efforts in clang / f18 / MLIR / LLVM in this area. Feel free to add to the list in case I have missed something.

  1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and clang frontends. Note that this proposal was before considering MLIR for FIR.

a. llvm-dev proposal : http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

b. Patches in review: https://reviews.llvm.org/D70290. This also includes the clang codegen changes.

  2. [July - September 2019] OpenMP dialect for MLIR was discussed / proposed with respect to the f18 compilation stack (keeping FIR in mind).

a. flang-dev discussion link: https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html

b. Design decisions captured in PPT: https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view

c. MLIR google groups discussion: https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw

d. Target constructs design: http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html

e. SIMD constructs design: http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html

  3. [Jan 2020] OpenMP dialect RFC in llvm discourse : https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397

  4. [Jan - Feb 2020] Implementation of OpenMP dialect in MLIR:

a. The first patch which introduces the OpenMP dialect was pushed.

b. Review of barrier construct is in progress: https://reviews.llvm.org/D72962

I have tried to list below different topics of interest (to different people) around this work. Most of these are in the design phase (or very new) and multiple parties are interested with different sets of goals in mind.

I. Flang frontend and its integration

II. Fortran representation in MLIR / FIR development

III. OpenMP development for flang, OpenMP builder in LLVM.

IV. Loop Transformations in MLIR / LLVM with respect to OpenMP.

It looks like the design has evolved over time and there is no one place which contains the latest design decisions that fit all the different pieces of the puzzle. I will try to deduce it from the above-mentioned references. Please correct me if I am referring to anything which has changed.

A. For most OpenMP design discussions, FIR examples are used (as seen in (2) and (3)). The MLIR examples mentioned in the design only talk about the FIR dialect and the LLVM dialect.

This completely ignores the likes of the standard, affine (where most loop transformations are supposed to happen) and loop dialects. I think it is critical to decouple the OpenMP dialect development in MLIR from the current flang / FIR effort. It would be useful if someone could show these examples using existing dialects in MLIR and also explain how the different transformations / lowerings are planned.

B. In the latest RFC (3), it is mentioned that the initial OpenMP dialect version will be as follows,

omp.parallel {
  omp.do {
    fir.do %i = 0 to %ub3 : !fir.integer {
    }
  }
}

and then after the “LLVM conversion” it is converted as follows:

omp.parallel {
  %ub3 =
  omp.do %i = 0 to %ub3 : !llvm.integer {
  }
}

a. Is it the same omp.do operation which now contains the bounds and induction variables of the loop after the LLVM conversion? If so, will the same operation have two different semantics during a single compilation?

b. Will there be different lowerings for various loop operations from different dialects? loop.for and affine.for under omp operations would need different OpenMP / LLVM lowerings. Currently, both of them are lowered to CFG-based loops during the LLVM dialect conversion (which happens much before the proposed OpenMP dialect lowering).

There would be no standard way to represent OpenMP operations (especially the ones which involve loops) in MLIR. This would drastically complicate lowering.

C. It is also not mentioned how clauses like firstprivate, shared, private, reduce, map, etc. are lowered to the OpenMP dialect. The example in the RFC contains FIR and LLVM types and nothing about std dialect types. Consider the example below:

#pragma omp parallel for reduction(+:x)
for (int i = 0; i < N; ++i)
  x += a[i];

How would the above be represented in the OpenMP dialect, and what type would “x” be in MLIR? The design does not mention how the SSA values for the various OpenMP clauses are passed around in OpenMP operations.
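For illustration only, one could imagine an encoding along the lines of the sketch below, where the reduction variable is threaded through the operations as an SSA value of a standard type. The operation names, the clause syntax and the values %a, %c0, %N, %c1 and %x0 are all made up for this sketch; nothing like it is specified in the RFC:

omp.parallel {
  // %x0 : f32 is the incoming value of "x"; how the accumulator is
  // combined and yielded back is hypothetical here.
  %x1 = omp.do %i = %c0 to %N step %c1 reduction(addf, %x0 : f32) {
    %ai = load %a[%i] : memref<?xf32>
    omp.reduce %ai : f32
  }
}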

D. Because of (A), (B) and (C), it would be beneficial to have an omp.parallel_do operation which has semantics similar to other loop structures in MLIR (though it may not implement LoopLikeInterface). To me, it looks like having OpenMP operations based on standard MLIR types and operations (mainly scalars and memrefs) is the right way to go.

Why not have an omp.parallel_do operation with AffineMap-based bounds, so as to decouple it from Value/Type, similar to affine.for?

  1. With the current design, the number of transformations / optimizations that one can write on OpenMP constructs would become limited as there can be any custom loop structure with custom operations / types inside it.

  2. It would also be easier to transform the Loop nests containing OpenMP constructs if the body of the OpenMP operations is well defined (i.e., does not accept arbitrary loop structures). Having nested redundant “parallel” , “target” and “do” regions seems unnecessary.

  3. There would also be new sets of loop structures in new dialects when C/C++ is compiled to MLIR. It would complicate the number of possible combinations inside the OpenMP region.
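To make the omp.parallel_do suggestion concrete, the loop header could mirror affine.for, with bounds given as affine expressions of SSA operands; the sketch below is purely illustrative and is not part of any proposal referenced above:

omp.parallel_do %i = 0 to %N {
  // the body would be restricted to a well-defined set of dialects and
  // memory access operations, so OpenMP-aware passes know what they operate on
}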

E. The lowering of target constructs mentioned in ( 2(d) ) specifies direct lowering to LLVM IR, ignoring all the advantages that MLIR provides. Being able to compile code for heterogeneous hardware is one of the biggest advantages that MLIR brings to the table, and that is being completely missed here. This also requires solving the problem of handling target information in MLIR, but that is a problem which needs to be solved anyway. Using the GPU dialect also gives us an opportunity to represent offloading semantics in MLIR.

Given the ability to represent multiple ModuleOps and the existence of the GPU dialect, couldn’t higher-level optimizations on offloaded code be done at the MLIR level? The proposed design would lead us to the same problems that we are currently facing in LLVM IR.

Also, OpenMP codegen will automatically benefit from the GPU dialect based optimizations. For example, it would be way easier to hoist a memory reference out of GPU kernel in MLIR than in LLVM IR.

Thanks,

Vinay

Hi Vinay,

Thanks for taking an interest and the detailed discussion.

To start by picking a few paragraph from your email to clarify a couple
of things that lead to the current design or that might otherwise need
clarification. We can talk about other points later as well.

[
  Site notes:
    1) I'm not an MLIR person.
    2) It seems unfortunate that we do not have a mlir-dev list.
]

1. With the current design, the number of transformations / optimizations
that one can write on OpenMP constructs would become limited as there can
be any custom loop structure with custom operations / types inside it.

OpenMP, as an input language, does not make many assumptions about the
code inside of constructs*. So, inside a parallel can be almost anything
the base language has to offer, both lexically and dynamically.
Assuming otherwise is not going to work. Analyzing a "generic" OpenMP
representation in order to determine if it can be represented as a more
restricted "op" seems at least plausible. You will run into various
issues, some mentioned explicitly below. For starters, you still have to
generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
if it is "just" to make sure the OMPD/OMPT interfaces expose useful
information.

* I preclude the `omp loop` construct here as it is not even implemented
  anywhere as far as I know.

2. It would also be easier to transform the Loop nests containing OpenMP
constructs if the body of the OpenMP operations is well defined (i.e., does
not accept arbitrary loop structures). Having nested redundant "parallel" ,
"target" and "do" regions seems unnecessary.

As mentioned above, you cannot start with the assumption OpenMP input is
structured this way. You have to analyze it first. This is the same
reason we cannot simply transform C/C++ `for loops` into `affine.for`
without proper analysis of the loop body.

Now, more concrete. Nested parallel and target regions are not
necessarily redundant, nor can/should we require the user not to have
them. Nested parallelism can easily make sense, depending on the problem
decomposition. Nested target will make a lot of sense with reverse
offload, which is already in the standard, and it also should be allowed
for the sake of a modular (user) code base.

3. There would also be new sets of loop structures in new dialects when
C/C++ is compiled to MLIR. It would complicate the number of possible
combinations inside the OpenMP region.

Is anyone working on this? If so, what is the timeline? I personally was
not expecting Clang to switch over to MLIR any time soon but I am happy
if someone wants to correct me on this. I mention this only because it
interacts with the arguments I will make below.

E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
able to compile the code for heterogeneous hardware is one of the biggest
advantages that MLIR brings to the table. That is being completely missed
here. This also requires solving the problem of handling target information
in MLIR. But that is a problem which needs to be solved anyway. Using GPU
dialect also gives us an opportunity to represent offloading semantics in
MLIR.

I'm unsure what the problem with "handling target information in MLIR" is but
whatever design we end up with, we need to know about the target
(triple) in all stages of the pipeline, even if it is just to pass it
down.

Given the ability to represent multiple ModuleOps and the existence of GPU
dialect, couldn't higher level optimizations on offloaded code be done at
MLIR level? The proposed design would lead us to the same problems that we
are currently facing in LLVM IR.

Also, OpenMP codegen will automatically benefit from the GPU dialect based
optimizations. For example, it would be way easier to hoist a memory
reference out of GPU kernel in MLIR than in LLVM IR.

While I agree with the premise that you can potentially reuse MLIR
transformations, it might not be as simple in practice.

As mentioned above, you cannot assume much about OpenMP codes, almost
nothing for a lot of application codes I have seen. Some examples:

If you have a function call, or any synchronization event for that
matter, located between two otherwise adjacent target regions (see
below), you cannot assume the two target regions will be offloaded to
the same device.

  #omp target
  {}
  foo();
  #omp target
  {}

Similarly, you cannot assume a `omp parallel` is allowed to be executed
with more than a single thread, or that a `omp [parallel] for` does not
have loop carried data-dependences, ...
Data-sharing attributes are also something that has to be treated
carefully:

x = 5;
#omp task
  x = 3;
print(x);

Should print 5, not 3.

I hope I convinced you that OpenMP is not trivially mappable to existing
dialects without proper analysis. If not, please let me know why you
expect it to be.

Now when it comes to code analyses, LLVM-IR offers a variety of
interesting features, ranging from a mature set of passes to the
cross-language LTO capabilities. We are working on the missing parts,
e.g., heterogeneous llvm::Modules as we speak. Simple OpenMP
optimizations are already present in LLVM and interesting ones are
prototyped for a while now (let me know if you want to see more not-yet
merged patches/optimizations). I also have papers, results, and
talks that might be interesting here. Let me know if you need pointers
to them.

Cheers,
  Johannes

Hi Vinay,

Thanks for taking an interest and the detailed discussion.

To start by picking a few paragraph from your email to clarify a couple
of things that lead to the current design or that might otherwise need
clarification. We can talk about other points later as well.

[
Site notes:

  1. I’m not an MLIR person.
  2. It seems unfortunate that we do not have a mlir-dev list.

MLIR uses discourse, llvm.discourse.group.

Hello Vinay,

Thanks for your mail about the OpenMP dialect in MLIR. Happy to know that you and several other groups are interested in the OpenMP dialect. At the outset, I must point out that the design is not set in stone and will change as we make progress. You are welcome to participate, provide feedback and criticism to change the design as well as to contribute to the implementation. I provide some clarifications and replies to your comments below. If it is OK we can have further discussions in discourse as River points out.

  1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and clang frontends. Note that this proposal was before considering MLIR for FIR.

A correction here. The proposal for OpenMPIRBuilder was made when MLIR was being considered for FIR.
(i) Gary Klimowicz’s minutes for the Flang call in April 2019 mention considering MLIR for FIR.
http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-April/000194.html
(ii) My reply to Johannes’s proposal in May 2019 mentions MLIR for FIR.
http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000220.html

b. Review of barrier construct is in progress: https://reviews.llvm.org/D72962

Minor correction here. The addition of barrier construct was accepted and has landed (https://reviews.llvm.org/D7240). It is the review for translation to LLVM IR that is in progress.

It looks like the design has evolved over time and there is no one place which contains the latest design decisions that fits all the different pieces of the puzzle. I will try to deduce it from the above mentioned references. Please correct me If I am referring to anything which has changed.

Yes, the design has mildly changed over time to incorporate feedback. But the latest is what is there in the RFC in discourse.

For most OpenMP design discussions, FIR examples are used (as seen in (2) and (3)). The MLIR examples mentioned in the design only talks about FIR dialect and LLVM dialect.

Our initial concern was how all these pieces (FIR, LLVM Dialect, OpenMPIRBuilder, LLVM IR) would fit together. Hence you see the prominence of FIR and the LLVM dialect, and more information about lowering/translation than about transformations/optimisations.

This completely ignores the likes of standard, affine (where most loop transformations are supposed to happen) and loop dialects.

Adding to the reply above. We would like to take advantage of the transformations in cases that are possible. FIR loops will be converted to affine/loop dialect. So the loop inside an omp.do can be in these dialects as clarified in the discussion in discourse and also shown in slide 20 of the fosdem presentation (links to both below).
https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397/7?u=kiranchandramohan
https://fosdem.org/2020/schedule/event/llvm_flang/attachments/slides/3839/export/events/attachments/llvm_flang/slides/3839/flang_llvm_frontend.pdf
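Paraphrasing the representation on slide 20, keeping the loop in the loop dialect inside the OpenMP operations would look roughly like this (bodies elided):

omp.parallel {
  omp.do {
    loop.for %i = %lb to %ub step %step {
      ...
    }
  }
}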

I must also point out that the question of where to do loop transformations is a topic we have not fully converged on. See the following thread for discussions.
http://lists.llvm.org/pipermail/flang-dev/2019-September/000042.html

Is it the same omp.do operation which now contains the bounds and induction variables of the loop after the LLVM conversion?

The point here is that i) we need to keep the loops separate so as to take advantage of the transformations that other dialects like affine/loop would provide; ii) we will need the loop information while lowering the OpenMP do operation. For implementation, if reusing the same operation (in different contexts) is difficult then we can add a new operation.

It is also not mentioned how clauses like firstprivate, shared, private, reduce, map, etc are lowered to OpenMP dialect.

Yes, it is not mentioned. We did a study of a few constructs and clauses which was shared as mails to flang-dev and the RFC. As we make progress and before implementation, we will share further details.

it would be beneficial to have an omp.parallel_do operation which has semantics similar to other loop structures (may not be LoopLikeInterface) in MLIR.

I am not against adding parallel_do if it can help with transformations or reduce the complexity of lowering. Please share the details in discourse as a reply to the RFC or a separate thread.

it looks like having OpenMP operations based on standard MLIR types and operations (scalars and memrefs mainly) is the right way to go.

This will definitely be the first version that we implement. But I do not understand why we should restrict ourselves to only the standard types and operations. To ease lowering and translation and to avoid adding OpenMP operations to other dialects, I believe the OpenMP dialect should also be able to co-exist with other dialects like FIR and LLVM.

E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct lowering to LLVM IR ignoring all the advantages that MLIR provides.

Also, OpenMP codegen will automatically benefit from the GPU dialect based optimizations. For example, it would be way easier to hoist a memory reference out of GPU kernel in MLIR than in LLVM IR.

I might not have fully understood you here. But the dialect lives independently of the translation to LLVM IR. If there are optimisations (like hoisting that you mention here) I believe they can be performed as transformation passes on the dialect. It is not ruled out.

–Kiran

Thanks for the reply!

It sounds like LLVM IR is being considered for optimizations on OpenMP constructs. There seem to be plans regarding improvements to the LLVM IR framework to provide things required for OpenMP / flang(?)

Are there any design considerations which contain pros and cons about using the MLIR vs LLVM IR for various OpenMP related optimizations/ transformations?

The latest RFC [ (3) in my original post ] mentions that:

So there exist some questions regarding where the optimisations should be carried out.

Could you please provide more details on this?

I would like to quote Chris here:

“if you ignore the engineering expense, it would clearly make sense to reimplement the mid-level LLVM optimizers on top of MLIR and replace include/llvm/IR with a dialect definition in MLIR instead.“ – http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html

The rest of the comments are inlined.

Hi Vinay,

Thanks for taking an interest and the detailed discussion.

To start by picking a few paragraph from your email to clarify a couple
of things that lead to the current design or that might otherwise need
clarification. We can talk about other points later as well.

[
Site notes:

  1. I’m not an MLIR person.
  2. It seems unfortunate that we do not have a mlir-dev list.
    ]
  1. With the current design, the number of transformations / optimizations
    that one can write on OpenMP constructs would become limited as there can
    be any custom loop structure with custom operations / types inside it.

OpenMP, as an input language, does not make many assumptions about the
code inside of constructs*.

This isn’t entirely correct because the current OpenMP API specification (https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code inside the constructs belongs to C, C++ and Fortran programs.

So, inside a parallel can be almost anything
the base language has to offer, both lexically and dynamically.

I am mostly concerned with the MLIR side of things for OpenMP representation.

MLIR can not only support operations for General Purpose languages like C, C++, Fortran, etc. but also various Domain Specific Language representations as dialects (Example, ML, etc.). Note that there is also the SPIR-V dialect, which is again meant for “Parallel Compute”.

It becomes important to define the scope of the dialects / operations / types supported inside OpenMP operations in MLIR.

Assuming otherwise is not going to work. Analyzing a “generic” OpenMP
representation in order to determine if it can be represented as a more
restricted “op” seems at least plausible. You will run into various
issues, some mentioned explicitly below.

Isn’t it the other way around? For example, it doesn’t make much sense to wrap OpenMP operations around SPIR-V operations / types.

I think it is important to specify (in the design) which existing MLIR dialects are supported in this effort and the various lowerings / transformations / optimizations which are planned for them.

For starters, you still have to
generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
if it is “just” to make sure the OMPD/OMPT interfaces expose useful
information.

You can have a well-defined call-like mlir::Operation which calls the GPU kernel, making it easier to perform all cross-device transformations. This operation can then be lowered to OpenMP runtime calls during the LLVM dialect conversion. I think this is much better than directly having calls to the OpenMP runtime library based on a kernel name mentioned in an llvm::GlobalVariable.
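As a purely hypothetical sketch of that idea (the op name and its operands are invented here; tgt_target() is simply the runtime entry point named in the target-constructs proposal):

// Before the LLVM dialect conversion: a structured, analyzable launch
// operation referring to the kernel by symbol.
omp.target_launch @device_module::@kernel(%arg0 : memref<?xf32>)

// After the conversion: the usual runtime call, with the outlined kernel
// and mapped arguments encoded however the runtime expects.
llvm.call @__tgt_target(...) : (...) -> !llvm.i32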

  * I preclude the omp loop construct here as it is not even implemented
    anywhere as far as I know.

  2. It would also be easier to transform the Loop nests containing OpenMP
     constructs if the body of the OpenMP operations is well defined (i.e., does
     not accept arbitrary loop structures). Having nested redundant “parallel”,
     “target” and “do” regions seems unnecessary.

As mentioned above, you cannot start with the assumption OpenMP input is
structured this way. You have to analyze it first. This is the same
reason we cannot simply transform C/C++ for loops into affine.for
without proper analysis of the loop body.

Now, more concrete. Nested parallel and target regions are not
necessarily redundant, nor can/should we require the user not to have
them. Nested parallelism can easily make sense, depending on the problem
decomposition. Nested target will make a lot of sense with reverse
offload, which is already in the standard, and it also should be allowed
for the sake of a modular (user) code base.

Just to be clear, having all three of “target”, “parallel” and “do” doesn’t represent “Nested parallelism” at all in the proposed design! ( 2(d) ).

omp.target {
  omp.parallel {
    omp.do {
      ...
    }
  }
}

The above invokes a call to tgt_target() for the code inside omp.do, as mentioned in the proposal.

  3. There would also be new sets of loop structures in new dialects when
    C/C++ is compiled to MLIR. It would complicate the number of possible
    combinations inside the OpenMP region.

Is anyone working on this? If so, what is the timeline? I personally was
not expecting Clang to switch over to MLIR any time soon but I am happy
if someone wants to correct me on this. I mention this only because it
interacts with the arguments I will make below.

E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
able to compile the code for heterogeneous hardware is one of the biggest
advantages that MLIR brings to the table. That is being completely missed
here. This also requires solving the problem of handling target information
in MLIR. But that is a problem which needs to be solved anyway. Using GPU
dialect also gives us an opportunity to represent offloading semantics in
MLIR.

I’m unsure what the problem with “handling target information in MLIR” is but
whatever design we end up with, we need to know about the target
(triple) in all stages of the pipeline, even if it is just to pass it
down.

Given the ability to represent multiple ModuleOps and the existence of GPU
dialect, couldn’t higher level optimizations on offloaded code be done at
MLIR level? The proposed design would lead us to the same problems that we
are currently facing in LLVM IR.

Also, OpenMP codegen will automatically benefit from the GPU dialect based
optimizations. For example, it would be way easier to hoist a memory
reference out of GPU kernel in MLIR than in LLVM IR.

While I agree with the premise that you can potentially reuse MLIR
transformations, it might not be as simple in practice.

As mentioned above, you cannot assume much about OpenMP codes, almost
nothing for a lot of application codes I have seen. Some examples:

If you have a function call, or any synchronization event for that
matter, located between two otherwise adjacent target regions (see
below), you cannot assume the two target regions will be offloaded to
the same device.

#omp target
{}
foo();
#omp target
{}

These kinds of optimizations are much easier to write in MLIR:

LLVM IR for the above code would contain a series of OpenMP runtime setup instructions, the call to foo() in the middle, followed by another set of OpenMP runtime related instructions. The bodies of the two target constructs would be in two different outlined functions (if not modules).

It takes quite a bit of code to do analysis / transformation to write any optimization on the generated LLVM IR.

vs.

MLIR provides a way to represent the operations closer to the source. It is as simple as checking the next operation(s) in the mlir::Block. The OpenMP target operation contains an inlined region which can easily be fused, split, or subjected to any other valid transformation.

Note that you can also perform various control-structure analyses / transformations much more easily in MLIR. For example, you can decide to execute foo() based on certain conditions, and merge the two target regions in the else path.
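A rough sketch of that versioning idea, assuming a hypothetical omp.target operation with an inlined region and using the structured conditional from the loop dialect (the condition %needs_foo is invented for illustration):

loop.if %needs_foo {
  omp.target { ...body0... }
  call @foo() : () -> ()
  omp.target { ...body1... }
} else {
  // foo() is known not to run on this path, so the two regions can be merged
  omp.target { ...body0... ...body1... }
}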

Similarly, you cannot assume a omp parallel is allowed to be executed
with more than a single thread, or that a omp [parallel] for does not
have loop carried data-dependences, …

With multi-dimensional index support for arrays, wouldn’t it be better to do the data dependence analysis in MLIR?

LLVM IR has linearized subscripts for multi-dimensional arrays. llvm::DependenceAnalysis tries to “guess” the indices based on different patterns in SCEV. It takes an intrinsic or metadata or some other mechanism of communication from the front end (not the built-in set of instructions) to solve this problem.

Data-sharing attributes are also something that has to be treated
carefully:

x = 5;
#omp task
x = 3;
print(x);

Should print 5, not 3.

You can have “x” as a locally defined variable inside the region contained by the “task” operation in MLIR, OR use custom data-sharing attributes in the OpenMP dialect.
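For instance (illustrative only; the omp.task operation and its capture-by-value convention are hypothetical), an encoding where the region takes the value as a block argument naturally behaves like firstprivate:

%x = constant 5 : i32
"omp.task"(%x) ( {
^bb0(%x_priv: i32):
  // updates based on %x_priv stay in the task's private copy and do not
  // reach the outer %x, so a later print of x still observes 5
  ...
}) : (i32) -> ()
// print(%x) -> 5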

I hope I convinced you that OpenMP is not trivially mappable to existing
dialects without proper analysis. If not, please let me know why you
expect it to be.

I do not see much reason why the issues you mentioned can’t trivially be mapped to the MLIR infrastructure. There is an easy way to define custom operations / types / attributes in OpenMP dialect and perform optimizations based on the IR that is created especially for OpenMP. The analysis / transformations required can be easily written on the custom operations defined rather than having a lowered form in the LLVM IR.

The various dialects / transformations in MLIR are in development / early phase (Example, GPU dialect) waiting to be improved with use cases such as this!

It sounds like LLVM IR is being considered for optimizations in OpenMP
constructs. There seems to be plans regarding improvement of LLVM IR
Framework for providing things required for OpenMP / flang(?)

LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific
transformations. For now it is simple but we have some more downstream
patches, e.g., parallel region expansion [Section 5, 1]. Other
optimizations [Section 3 & 4, 1], will be performed by the Attributor
(see [4] after [2,3]) after one missing piece (basically [5] with some
more plumbing) was put in place, see [2,3] for details on the idea.

Please feel free to ask questions on any of this.

[0] https://reviews.llvm.org/D69930 ([OpenMP] Introduce the OpenMPOpt transformation pass)
[1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[2] https://www.youtube.com/watch?v=zfiHaPaoQPc
[3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf
[4] https://youtu.be/CzWkc_JcfS0
[5] https://reviews.llvm.org/D71505 ([Utils] Provide a callback encapsulation utility for call sites)

It might also be worth looking into [6,7] mentioned below.

Are there any design considerations which contain pros and cons about using
the MLIR vs LLVM IR for various OpenMP related optimizations/
transformations?

The biggest pro for LLVM-IR is that it works for C/C++ right now. In
addition, as I mentioned before, LLVM-IR has mature analysis and
transformation passes for real world programs and supports things like
LTO out of the box.

The latest RFC [ (3) in my original post ] mentions that:

> So there exist some questions regarding where the optimisations should be
carried out.

Could you please provide more details on this?

I would like to quote Chris here:

“if you ignore the engineering expense, it would clearly make sense to
reimplement the mid-level LLVM optimizers on top of MLIR and replace
include/llvm/IR with a dialect definition in MLIR instead.“ --
http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html

*The rest of the comments are inlined.*

> Hi Vinay,
>
> Thanks for taking an interest and the detailed discussion.
>
> To start by picking a few paragraph from your email to clarify a couple
> of things that lead to the current design or that might otherwise need
> clarification. We can talk about other points later as well.
>
> [
> Site notes:
> 1) I'm not an MLIR person.
> 2) It seems unfortunate that we do not have a mlir-dev list.
> ]
>
>
> > 1. With the current design, the number of transformations / optimizations
> > that one can write on OpenMP constructs would become limited as there can
> > be any custom loop structure with custom operations / types inside it.
>
> OpenMP, as an input language, does not make many assumptions about the
> code inside of constructs*.

This isn’t entirely correct because the current OpenMP API specification (
https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code
inside the constructs belongs to C, C++ and Fortran programs.

(FWIW, my next sentence specified that I talk about the base language,
but anyway.)

While technically true, I will recommend not to make that assumption. We
do already allow non-base language constructs, e.g., CUDA intrinsics in
target regions, and that will not go away because it is required to
maximize performance.

> So, inside a parallel can be almost anything
> the base language has to offer, both lexically and dynamically.
>

I am mostly concerned with the MLIR side of things for OpenMP
representation.

MLIR can not only support operations for General Purpose languages like
C,C++, Fortran, etc but also various Domain Specific Language
representations as dialects (Example, ML, etc.). Note that there is also
SPIR V dialect which is again meant for “Parallel Compute”.

It becomes important to define the scope of the dialects / operations /
types supported inside OpenMP operations in MLIR.

Arguably, the OpenMP dialect in MLIR should match the OpenMP directives
and clauses as defined by the standard. Anything else is "not OpenMP".

> Assuming otherwise is not going to work. Analyzing a "generic" OpenMP
> representation in order to determine if it can be represented as a more
> restricted "op" seems at least plausible. You will run into various
> issues, some mentioned explicitly below.

Isn’t it the other way around? For example, it doesn’t make much sense to
wrap OpenMP operations for SPIR-V operations / types.

I maybe misunderstanding but I thought you want to use something like
the GPU / Affine dialect to represent an OpenMP target region / loop.
That is plausible if you analyze the target region / loop and verify it
fits into the more generic dialect semantics.

I think it is important to specify (in the design) which existing MLIR
dialects are supported in this effort and the various lowerings /
transformations / optimizations which are planned for them.

That I cannot really help you with. TBH, I don't even know what
transformations people plan to do on OpenMP MLIR (and why).

> For starters, you still have to
> generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
> if it is "just" to make sure the OMPD/OMPT interfaces expose useful
> information.
>
>
You can have a well-defined call-like mlir::Operation which calls the GPU
kernel. Perform all cross-device transformations in an easier way.
Then, this operation can be lowered to OpenMP runtime calls during LLVM
dialect conversion.

You missed my point I made in the other email. An OpenMP target region
is statically not a GPU offload so you should not model it as such "for
some time".

I think this is much better than directly having calls
to the OpenMP runtime library based on a kernel name mentioned in
llvm::GlobalVariable.

(Current) implementation is not semantics. There is no reason not to
change the way we lower OpenMP, e.g., by getting rid of the global
variables. They are present for a reason but not intrinsically required.
See the TRegions for example [6,7], they totally change the GPU lowering,
making it device agnostic and easy to analyze and optimize in the middle
end. Arguing the current encoding of OpenMP in LLVM-IR is problematic is
the same as arguing MLIR's LLVM dialect doesn't support atomic_rmw, it
might be true but it's changeable.

[6] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
[7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf

> * I preclude the `omp loop` construct here as it is not even implemented
> anywhere as far as I know.
>
>
> > 2. It would also be easier to transform the Loop nests containing OpenMP
> > constructs if the body of the OpenMP operations is well defined (i.e.,
> does
> > not accept arbitrary loop structures). Having nested redundant
> "parallel" ,
> > "target" and "do" regions seems unnecessary.
>
> As mentioned above, you cannot start with the assumption OpenMP input is
> structured this way. You have to analyze it first. This is the same
> reason we cannot simply transform C/C++ `for loops` into `affine.for`
> without proper analysis of the loop body.
>
> Now, more concrete. Nested parallel and target regions are not
> necessarily redundant, nor can/should we require the user not to have
> them. Nested parallelism can easily make sense, depending on the problem
> decomposition. Nested target will make a lot of sense with reverse
> offload, which is already in the standard, and it also should be allowed
> for the sake of a modular (user) code base.
>

Just to be clear, having all three of “target”, “parallel” and “do” doesn’t
represent “Nested parallelism” at all in the proposed design! ( 2(d) ).

omp.target {

  omp.parallel {

     omp.do {

      …...

      }

   }

}

Above invokes a call to the tgt_target() for the code inside omp.do as
mentioned in the proposal.

I do not follow. Just to make sure, the above should be roughly
equivalent to the code below, correct? There is no "nested"
parallelism, sure, but I thought you were talking about the case where
there is, e.g. add another `#pragma omp parallel` inside the one that
already is there. That is nested parallelism which can happen and make
total sense for the application.

#pragma omp target
{
  #pragma omp parallel
  {
    #pragma omp for
    for (...)
    {
      ...
    }
  }
}

>
> > 3. There would also be new sets of loop structures in new dialects when
> > C/C++ is compiled to MLIR. It would complicate the number of possible
> > combinations inside the OpenMP region.
>
> Is anyone working on this? If so, what is the timeline? I personally was
> not expecting Clang to switch over to MLIR any time soon but I am happy
> if someone wants to correct me on this. I mention this only because it
> interacts with the arguments I will make below.
>
>
> > E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
> > lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
> > able to compile the code for heterogeneous hardware is one of the biggest
> > advantages that MLIR brings to the table. That is being completely missed
> > here. This also requires solving the problem of handling target
> information
> > in MLIR. But that is a problem which needs to be solved anyway. Using GPU
> > dialect also gives us an opportunity to represent offloading semantics in
> > MLIR.
>
> I'm unsure what the problem with "handling target information in MLIR" is
> but
> whatever design we end up with, we need to know about the target
> (triple) in all stages of the pipeline, even if it is just to pass it
> down.
>
>
> > Given the ability to represent multiple ModuleOps and the existence of
> GPU
> > dialect, couldn't higher level optimizations on offloaded code be done at
> > MLIR level?. The proposed design would lead us to the same problems that
> we
> > are currently facing in LLVM IR.
> >
> > Also, OpenMP codegen will automatically benefit from the GPU dialect
> based
> > optimizations. For example, it would be way easier to hoist a memory
> > reference out of GPU kernel in MLIR than in LLVM IR.
>
> While I agree with the premise that you can potentially reuse MLIR
> transformations, it might not be as simple in practice.
>
> As mentioned above, you cannot assume much about OpenMP codes, almost
> nothing for a lot of application codes I have seen. Some examples:
>
> If you have a function call, or any synchronization event for that
> matter, located between two otherwise adjacent target regions (see
> below), you cannot assume the two target regions will be offloaded to
> the same device.
> ```
> #omp target
> {}
> foo();
> #omp target
> {}
> ```
>

These kinds of optimizations are much easier to write in MLIR:

LLVM IR for the above code would contain a series of instructions of OpenMP
runtime call setup and foo() in the middle followed by another set of
OpenMP runtime related instructions. The body of the two target constructs
would be in two different outlined functions (if not modules).

It takes quite a bit of code to do analysis / transformation to write any
optimization on the generated LLVM IR.

You are right about the modules being a problem. As I mentioned in my
last email, we are working on that by not having them in different ones
during the optimization pipeline. If we make the `target` `parallel`
instead we can simulate that right now. The bodies are in different
functions, sure, but does it matter? Let's walk through parallel region
expansion (see above [Section 5, 1]) so you can judge for yourself:

#omp parallel
{ body0 }
some_code
#omp parallel
{ body1 }

will become

__kmpc_fork_call(..., @body0_fn, ...)
some_code
__kmpc_fork_call(..., @body1_fn, ...)

in IR. Simplified, there are 3 cases here:
1) some_code is harmless, meaning all of it can be executed redundantly.
2) parts of some_code need to be guarded to be sequential but
   they can be executed in a parallel region otherwise, e.g., the code
   will not observe the difference through runtime calls.
3) parts of some_code cannot be executed in a parallel region as
   they might observe the difference through runtime calls.
First note that you need to do the classification regardless of your
encoding (=IR). In case of 3) we are done and nothing is happening.
Let's consider case 2) as 1) is basically a special case of it. As shown
in the paper [1], you need to broadcast values created by some_code
across all threads and synchronize appropriately to preserve semantics.
Other than that, the transformation is straightforward:

A) Create a function "@body01_fn" that is basically the outlined region
in which the code is then guarded and the __kmpc_fork_calls are replaced by
direct calls. It looks like this:

  call @body0_fn(...)
  #omp master
  some_code
  #omp barrier
  call @body1_fn(...)

B) Replace the region you put in the new function with a
   __kmpc_fork_call to it:

  __kmpc_fork_call(..., @body01_fn, ...)

C) Done.

If you are interested in the implementation I'll add you as a reviewer
once I put it on Phab. I'm in the process of cleaning up my stand-alone
pass and moving it into the OpenMPOpt pass instead.

vs.

MLIR provides a way to represent the operations closer to the source. It is
as simple as checking the next operation(s) in the mlir::Block. OpenMP
target operation contains an inlined region which can easily be fused/
split / or any other valid transformation for that matter.

Note that you can also perform various Control Structure Analysis /
Transformations much easier in MLIR. For example, you can decide to execute
foo() based on certain conditions, and you can merge the two target regions
in the else path.

At the end, it's an encoding difference. Sure, the handling might be
easier in certain situations but all the validity checks, hence code
analyses, are still required. The actual "rewrite" is usually not the
hard part.

> Similarly, you cannot assume a `omp parallel` is allowed to be executed
> with more than a single thread, or that a `omp [parallel] for` does not
> have loop carried data-dependences, ...
>

With multi-dimensional index support for arrays, wouldn’t it be better to
do the data dependence analysis in MLIR?

Yes, probably.

LLVM IR has linearized subscripts for multi-dimensional arrays.
llvm::DependenceAnalysis tries to “guess” the indices based on different
patterns in SCEV. It takes an intrinsic or metadata or
some other mechanism of communication from the front end (not the built-in
set of instructions) to solve this problem.

Not disagreeing with you on this one ;)

The only caveat is that we still live in a world in which C/C++ is a
thing.

> Data-sharing attributes are also something that has to be treated
> carefully:
> ```
> x = 5;
> #omp task
> x = 3;
> print(x);
> ```
> Should print 5, not 3.
>

You can have “x” as a locally defined variable inside the “task” contained
region in MLIR OR custom data-sharing attributes in OpenMP dialect.

I'm not saying it is impossible or even hard, but maybe not as
straightforward as one might think. Your encoding is for example very reasonable.

In the example below you need to print 3, not 5, e.g., constant prop on
the outer level should not happen.

x = 5;
#omp task shared(x)
{
  x = 3;
  some_form_of_sync();
  ...
}
some_form_of_sync();
print(x);

> I hope I convinced you that OpenMP is not trivially mappable to existing
> dialects without proper analysis. If not, please let me know why you
> expect it to be.
>
I do not see much reason why the issues you mentioned can’t trivially be
mapped to the MLIR infrastructure. There is an easy way to define custom
operations / types / attributes in OpenMP dialect and perform optimizations
based on the *IR that is created especially for OpenMP*. The analysis /
transformations required can be easily written on the custom operations
defined rather than having a lowered form in the LLVM IR.

You can totally define your OpenMP dialect and map it to that. Mapping
to other dialects is the problematic part. As I mentioned, `omp
parallel` does not mean "parallel" or "dependence-free".

Since you mention it, why do you think it is conceptually or practically
harder to write an analysis/transformations on IR? I mean, you teach
your analysis what the op "omp.parallel" means, right? Why not teach an
(interprocedural) analysis what __kmpc_fork_call() does (see [2,3] above)?

FWIW, there are LLVM analyses and transformations that already know
about the transitive call made by __kmpc_fork_call and pthread_create
(see [4] above). It is done in a way that you can easily annotate your
own C/C++ or IR to make use of it, e.g., for your own transitive
callbacks:
  Attributes in Clang: https://clang.llvm.org/docs/AttributeReference.html
  LLVM Language Reference Manual: https://llvm.org/docs/LangRef.html

The various dialects / transformations in MLIR are in development / early
phase (Example, GPU dialect) waiting to be improved with use cases such as
this!

Great! I am eagerly looking forward to this.

Cheers,
  Johannes

IMHO, it's not just the engineering expense, but also additional
overhead from having a more general data structure that clang does not
need. In some sense, LLVM-IR has been designed to match the semantics
of C, such that a more general representation makes less sense. There
are still opportunities, e.g. representing C++ virtual methods instead of
lowering to a vtable lookup. This could make devirtualization easier.

However, it seems nobody is currently pushing for this change to
happen, there is not even an RFC on whether the community wants this
change. As such, I would not plan on using MLIR if your frontend
language is C/C++/Objective-C.

Michael

Reply to Kiran Chandramohan:

You are welcome to participate, provide feedback and criticism to change the design as well as to contribute to the implementation.

Thank you Kiran.

But the latest is what is there in the RFC in discourse.

I have used this as reference for the response.

We did a study of a few constructs and clauses which was shared as mails to flang-dev and the RFC. As we make progress and before implementation, we will share further details.

“ Yes, parallel and flush would be the next two constructs that we will do.” – from a comment in latest RFC

For the above-mentioned reasons, I will try to restrict my reply to how the “parallel (do)” construct would be lowered.

If it is OK we can have further discussions in discourse as River points out.

Given that the multiple components of the LLVM project, namely clang, flang, MLIR and LLVM are involved, llvm-dev is probably a better place, with a much wider audience, until it is clear how different components must interact.

It is the review for translation to LLVM IR that is in progress.

“If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR and that is the best place to do it (based on experiments) then we will not use the OpenMP IRBuilder for these constructs.” – latest RFC in discourse

If it is not finalized that the OpenMPIRBuilder will be used for all the constructs, wouldn’t it be better to delay the submission of the “translation to LLVM IR” patch in MLIR? The lowering code will become inconsistent if the OpenMPIRBuilder is used only for a few constructs and not for others.

Also, the patch lowers the OpenMP dialect to LLVM IR alongside the LLVM dialect. This is different from most dialects, which get lowered to the LLVM dialect first. I think lowering to the LLVM dialect would be a cleaner way if the OpenMPIRBuilder is not being considered for all constructs.

Mehdi also seems to have the same suggestion: “I agree that having dialect lowering would be cleaner” in https://reviews.llvm.org/D72962

Yes, the design has mildly changed over time to incorporate feedback. But the latest is what is there in the RFC in discourse.

The RFC fails to discuss the following (I have also mentioned some of these in my reply to Johannes):

The proposed plan involves a) lowering the F18 AST with OpenMP directly to a mix of OpenMP and FIR dialects, and b) finally converting this to a mix of OpenMP and LLVM dialects.

It is unclear in the RFC what other dialects are considered as supported for OpenMP dialect (std, affine, vector, loop, etc) and how it would be transformed, used and lowered from FIR to LLVM.

It becomes important to list down the various dialects / operations / types supported for OpenMP (which is mainly defined for C, C++ and Fortran programs). MLIR has a much wider scope.

It wouldn’t add much value for the proposed OpenMP dialect to be in the MLIR tree if it cannot support at least the relevant standard dialect types / operations.

We would like to take advantage of the transformations in cases that are possible. FIR loops will be converted to affine/loop dialect. So the loop inside an omp.do can be in these dialects as clarified in the discussion in discourse and also shown in slide 20 of the fosdem presentation (links to both below).

https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397/7?u=kiranchandramohan

https://fosdem.org/2020/schedule/event/llvm_flang/attachments/slides/3839/export/events/attachments/llvm_flang/slides/3839/flang_llvm_frontend.pdf

Although it is mentioned that affine.for / loop.for is used, the following things are unclear:

I am assuming that there will be lowering / conversion code in the f18 repo from fir.do to loop.for / affine.for. Is that the case? If so, I think it is worth mentioning it in the “sequential code flow representation” in the RFC.

This raises the following questions.

  1. Which types are supported? Standard dialect types and FIR types?

For example, what types are used for Fortran arrays used inside OpenMP regions? Is it std.memref, OR the Fortran array representation in the FIR dialect (fir.array?), OR both? Note that Fortran has support for column-major arrays, while std.memref supports custom memory layouts. What custom layouts are supported?
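As a point of reference (this is existing memref syntax rather than anything from the OpenMP proposal), a column-major 2-D view can already be expressed with a permutation layout map; whether FIR arrays would lower to such a form is exactly the open question:

// 2-D memref whose layout map transposes the indices, i.e. column-major storage
memref<?x?xf32, affine_map<(d0, d1) -> (d1, d0)>>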

How would different non-scalar types in standard dialect be lowered to LLVM IR and passed to OpenMP runtime calls? Can you please elaborate on this?

The example provided in slide 20 of the fosdem presentation contains

“loop.for %j = %lb2 to %ub2 : !integer {“

But loop.for accepts “index” type. Not sure what type “!integer” represents here.
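For comparison, loop.for as it exists today requires index-typed bounds and step, e.g. (a minimal sketch assuming SSA values %lb2, %ub2 and %c1 of type index):

loop.for %j = %lb2 to %ub2 step %c1 {
  // body
}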

  2. What are the different memory access operations which are supported inside the OpenMP region and lowered to proper OpenMP runtime calls in LLVM IR?

The possibilities are:

  1. affine.load / affine.store

  2. std.load / std.store

  3. FIR dialect memory access operations.

I must also point out that the question of where to do loop transformations is a topic we have not fully converged on. See the following thread for discussions. http://lists.llvm.org/pipermail/flang-dev/2019-September/000042.html

Looks like placement (MLIR / LLVM) of various transformations related to OpenMP has not been finalized, from what I could infer from Johannes’s reply and the below text in the latest RFC in discourse:

“So there exist some questions regarding where the optimisations should be carried out. We will decide on which framework to choose only after some experimentation.”

i) we need to keep the loops separately so as to take advantage of transformations that other dialects like affine/loop would provide.

  1. Keeping the loops separate from the OpenMP operations will expose them to the “regular” transformation passes in MLIR inside the OpenMP region. Most of these are invalid or inefficient for OpenMP operations.

Examples:

  1. Constant propagation example mentioned by Johannes in this thread. (omp task shared(x))

  2. Loop (nest) transformations (permute / split / fuse / tile, etc) will happen ignoring the surrounding OpenMP operations.

  3. Hoisting and sinking of various memory/ SSA values inside the OpenMP region. This goes against the likes of “map”, “firstprivate”, shared, etc clauses and more.

  2. Various loop operations (loop.for, affine.for, fir.do) have (or will have) different transformation / optimization passes which differ from one another.

Example:

  1. AffineLoopInvariantCodeMotion.cpp is different from LoopInvariantCodeMotion.cpp.

  2. Other Loop transformation passes for affine.for

These loops also use different Types and memory access operations in general for transformations. Example, most Affine dialect transformations (if not all) work on affine.load and affine.store operations.

Supporting different loop operations means that there would be OpenMP specific transformations for each one of them and also requires a way to restrict each of them from existing transformations (when nested in OpenMP constructs).

There would be different lowerings for different loop operations as well. Example, affine.for and loop.for would have to be lowered to omp.do in different ways.

From slide 20 of fosdem presentation you mentioned, the LLVM + OpenMP dialect representation is as follows:

It sounds like LLVM IR is being considered for optimizations in OpenMP
constructs. There seems to be plans regarding improvement of LLVM IR
Framework for providing things required for OpenMP / flang(?)

LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific
transformations. For now it is simple but we have some more downstream
patches, e.g., parallel region expansion [Section 5, 1]. Other
optimizations [Section 3 & 4, 1], will be performed by the Attributor
(see [4] after [2,3]) after one missing piece (basically [5] with some
more plumbing) was put in place, see [2,3] for details on the idea.

Please feel free to ask questions on any of this.

[0] https://reviews.llvm.org/D69930
[1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[2] https://www.youtube.com/watch?v=zfiHaPaoQPc
[3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf
[4] https://youtu.be/CzWkc_JcfS0
[5] https://reviews.llvm.org/D71505

It might also be worth looking into [6,7] mentioned below.

Are there any design considerations which contain pros and cons about using
the MLIR vs LLVM IR for various OpenMP related optimizations/
transformations?

The biggest pro for LLVM-IR is that it works for C/C++ right now. In
addition, as I mentioned before, LLVM-IR has mature analysis and
transformation passes for real world programs and supports things like
LTO out of the box.

+1: MLIR is awesome (purely unbiased opinion ;)) but it won’t be in the C/C++ path anytime soon (I hope it’ll be one day though).

There are plenty of frontends targeting LLVM directly, and LLVM should continue to improve.

The latest RFC [ (3) in my original post ] mentions that:

So there exist some questions regarding where the optimisations should be
carried out.

Could you please provide more details on this?

I would like to quote Chris here:

“if you ignore the engineering expense, it would clearly make sense to
reimplement the mid-level LLVM optimizers on top of MLIR and replace
include/llvm/IR with a dialect definition in MLIR instead.“ –
http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html

The rest of the comments are inlined.

Hi Vinay,

Thanks for taking an interest and the detailed discussion.

To start by picking a few paragraph from your email to clarify a couple
of things that lead to the current design or that might otherwise need
clarification. We can talk about other points later as well.

[
Site notes:

  1. I’m not an MLIR person.
  2. It seems unfortunate that we do not have a mlir-dev list.
    ]
  1. With the current design, the number of transformations / optimizations
    that one can write on OpenMP constructs would become limited as there can
    be any custom loop structure with custom operations / types inside it.

OpenMP, as an input language, does not make many assumptions about the
code inside of constructs*.

This isn’t entirely correct because the current OpenMP API specification (
https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code
inside the constructs belongs to C, C++ and Fortran programs.

(FWIW, my next sentence specified that I talk about the base language
but anyway.)

While technically true, I will recommend not to make that assumption. We
do already allow non-base language constructs, e.g., CUDA intrinsics in
target regions, and that will not go away because it is required to
maximize performance.

+1 : I worked on a project where we added support for OpenMP directives on Python Numpy loops and compiled these!

So, inside a parallel can be almost anything
the base language has to offer, both lexically and dynamically.

I am mostly concerned with the MLIR side of things for OpenMP
representation.

MLIR can not only support operations for General Purpose languages like
C,C++, Fortran, etc but also various Domain Specific Language
representations as dialects (Example, ML, etc.). Note that there is also
SPIR V dialect which is again meant for “Parallel Compute”.

It becomes important to define the scope of the dialects / operations /
types supported inside OpenMP operations in MLIR.

Arguably, the OpenMP dialect in MLIR should match the OpenMP directives
and clauses as defined by the standard. Anything else is “not OpenMP”.

+1

Assuming otherwise is not going to work. Analyzing a “generic” OpenMP
representation in order to determine if it can be represented as a more
restricted “op” seems at least plausible. You will run into various
issues, some mentioned explicitly below.

Isn’t it the other way around? For example, it doesn’t make much sense to
wrap OpenMP operations for SPIR-V operations / types.

I may be misunderstanding, but I thought you want to use something like
the GPU / Affine dialect to represent an OpenMP target region / loop.
That is plausible if you analyze the target region / loop and verify it
fits into the more generic dialect semantics.

I think it is important to specify (in the design) which existing MLIR
dialects are supported in this effort and the various lowerings /
transformations / optimizations which are planned for them.

I agree, but I am puzzled why you bring this here instead of discussing this in the RFC thread. You seem to have good ideas overall, but if you’d like to discuss the development of MLIR and MLIR dialects I invite you to engage there instead, I suspect you’ll have more impact.

That I cannot really help you with. TBH, I don’t even know what
transformations people plan to do on OpenMP MLIR (and why).

For starters, you still have to
generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
if it is “just” to make sure the OMPD/OMPT interfaces expose useful
information.

You can have a well-defined call-like mlir::Operation which calls the GPU
kernel. Perform all cross-device transformations in an easier way.
Then, this operation can be lowered to OpenMP runtime calls during LLVM
dialect conversion.
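
As a rough illustration of that idea (purely a sketch; the op name, the syntax, and the exact runtime entry point shown are hypothetical, not an existing design), such a call-like operation and its later lowering could look like:

// hypothetical sketch: a call-like op that refers to the outlined target
// body symbolically, so cross-device rewrites stay structural
"omp.target_call"(%a, %n) { kernel = @target_body_0 } : (memref<?xf32>, index) -> ()

// a possible shape after LLVM dialect conversion, with the op expanded into
// offload runtime calls (argument setup elided, as in the proposal's
// tgt_target() example)
llvm.call @__tgt_target(...) : (...) -> !llvm.i32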

You missed my point I made in the other email. An OpenMP target region
is statically not a GPU offload so you should not model it as such “for
some time”.

I think this is much better than directly having calls
to the OpenMP runtime library based on a kernel name mentioned in
llvm::GlobalVariable.

The (current) implementation is not the semantics. There is no reason not to
change the way we lower OpenMP, e.g., by getting rid of the global
variables. They are present for a reason but not intrinsically required.
See the TRegions for example [6,7]; they totally change the GPU lowering,
making it device agnostic and easy to analyze and optimize in the middle
end. Arguing that the current encoding of OpenMP in LLVM-IR is problematic is
the same as arguing that MLIR’s LLVM dialect doesn’t support atomic_rmw: it
might be true, but it’s changeable.

[6] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
[7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf

  • I preclude the omp loop construct here as it is not even implemented
    anywhere as far as I know.
  1. It would also be easier to transform the loop nests containing OpenMP
    constructs if the body of the OpenMP operations is well defined (i.e.,
    does not accept arbitrary loop structures). Having nested redundant
    “parallel”, “target” and “do” regions seems unnecessary.

As mentioned above, you cannot start with the assumption OpenMP input is
structured this way. You have to analyze it first. This is the same
reason we cannot simply transform C/C++ for loops into affine.for
without proper analysis of the loop body.

Now, more concrete. Nested parallel and target regions are not
necessarily redundant, nor can/should we require the user not to have
them. Nested parallelism can easily make sense, depending on the problem
decomposition. Nested target will make a lot of sense with reverse
offload, which is already in the standard, and it also should be allowed
for the sake of a modular (user) code base.

Just to be clear, having all three of “target”, “parallel” and “do” doesn’t
represent “Nested parallelism” at all in the proposed design! ( 2(d) ).

omp.target {

omp.parallel {

omp.do {

……

}

}

}

The above invokes a call to tgt_target() for the code inside omp.do, as
mentioned in the proposal.

I do not follow. Just to make sure, the above should be roughly
equivalent to the code below, correct? There is no “nested”
parallelism, sure, but I thought you were talking about the case where
there is, e.g. add another #pragma omp parallel inside the one that
already is there. That is nested parallelism which can happen and make
total sense for the application.

#pragma omp target
{
#pragma omp parallel
{
#pragma omp for
for (…)
{

}
}
}

  1. There would also be new sets of loop structures in new dialects when
    C/C++ is compiled to MLIR. It would increase the number of possible
    combinations inside the OpenMP region.

Is anyone working on this? If so, what is the timeline? I personally was
not expecting Clang to switch over to MLIR any time soon but I am happy
if someone wants to correct me on this. I mention this only because it
interacts with the arguments I will make below.

E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
able to compile the code for heterogeneous hardware is one of the biggest
advantages that MLIR brings to the table. That is being completely missed
here. This also requires solving the problem of handling target
information in MLIR. But that is a problem which needs to be solved anyway. Using GPU
dialect also gives us an opportunity to represent offloading semantics in
MLIR.

I’m unsure what the problem with “handling target information in MLIR” is,
but whatever design we end up with, we need to know about the target
(triple) in all stages of the pipeline, even if it is just to pass it
down.

Given the ability to represent multiple ModuleOps and the existence of the
GPU dialect, couldn’t higher-level optimizations on offloaded code be done at
the MLIR level? The proposed design would lead us to the same problems that
we are currently facing in LLVM IR.

Also, OpenMP codegen will automatically benefit from GPU-dialect-based
optimizations. For example, it would be way easier to hoist a memory
reference out of a GPU kernel in MLIR than in LLVM IR.
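
As a loose sketch of the kind of rewrite meant here (hypothetical omp.target syntax; data-mapping questions are ignored, this only shows that the rewrite stays local while the region is still a single op):

// before: a region-invariant load sits inside the offloaded region
omp.target {
  %c = load %cfg[] : memref<f32>
  ...
}

// after: hoisting is a small, local rewrite on the op that owns the region
%c = load %cfg[] : memref<f32>
omp.target {
  ...
}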

While I agree with the premise that you can potentially reuse MLIR
transformations, it might not be as simple in practice.

As mentioned above, you cannot assume much about OpenMP codes, almost
nothing for a lot of application codes I have seen. Some examples:

If you have a function call, or any synchronization event for that
matter, located between two otherwise adjacent target regions (see
below), you cannot assume the two target regions will be offloaded to
the same device.

#omp target
{}
foo();
#omp target
{}

These kinds of optimizations are much easier to write in MLIR:

LLVM IR for the above code would contain a series of OpenMP runtime-call
setup instructions, with foo() in the middle, followed by another set of
OpenMP-runtime-related instructions. The bodies of the two target constructs
would be in two different outlined functions (if not modules).

It takes quite a bit of code to do analysis / transformation to write any
optimization on the generated LLVM IR.

I agree that MLIR makes it easier, with the constraint that you have to express it at the right level of abstraction in the first place.
But what is the suggestion here? Not doing it in LLVM at all? This does not seem quite reasonable with respect to all the LLVM users.

You are right about the modules being a problem. As I mentioned in my
last email, we are working on that by not having them in different ones
during the optimization pipeline. If we make the target regions parallel
regions instead, we can simulate that right now. The bodies are in different
functions, sure, but does it matter? Let’s walk through parallel region
expansion (see above [Section 5, 1]) so you can judge for yourself:

#omp parallel
{ body0 }
some_code
#omp parallel
{ body1 }

will become

__kmpc_fork_call(..., @body0_fn, ...)
some_code
__kmpc_fork_call(..., @body1_fn, ...)

in IR. Simplified, there are 3 cases here:

  1. some_code is harmless, meaning all of it can be executed redundantly.
  2. parts of some_code need to be guarded to be sequential but
    they can be executed in a parallel region otherwise, e.g., the code
    will not observe the difference through runtime calls.
  3. parts of some_code cannot be executed in a parallel region as
    they might observe the difference through runtime calls.

First note that you need to do the classification regardless of your
encoding (=IR). In case of 3) we are done and nothing is happening.
Let’s consider case 2), as 1) is basically a special case of it. As shown
in the paper [1], you need to broadcast values created by some_code
across all threads and synchronize appropriately to preserve semantics.
Other than that, the transformation is straightforward:

A) Create a function “@body01_fn” that is basically the outlined region,
in which the code is then guarded and the __kmpc_fork_call calls are replaced by
direct calls. It looks like this:

call @body0_fn(...)
#omp master
some_code
#omp barrier
call @body1_fn(...)

B) Replace the region you put in the new function with a
__kmpc_fork_call to it:

__kmpc_fork_call(..., @body01_fn, ...)

C) Done.

If you are interested in the implementation I’ll add you as a reviewer
once I put it on Phab. I’m in the process of cleaning up my stand alone
pass and moving it into the OpenMPOpt pass instead.

vs.

MLIR provides a way to represent the operations closer to the source. It is
as simple as checking the next operation(s) in the mlir::Block. OpenMP
target operation contains an inlined region which can easily be fused/
split / or any other valid transformation for that matter.

Note that you can also perform various Control Structure Analysis /
Transformations much easier in MLIR. For example, you can decide to execute
foo() based on certain conditions, and you can merge the two target regions
in the else path.
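
To make the adjacency point concrete (hypothetical omp.target syntax again), the earlier example keeps its source-level shape at this stage, so a pass merely has to look at the neighbouring operations in the mlir::Block:

omp.target {
  ...   // body of the first region, still an inlined region of the op
}
call @foo() : () -> ()
omp.target {
  ...   // the second region is literally the next-but-one operation
}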

At the end, it’s an encoding difference. Sure, the handling might be
easier in certain situations but all the validity checks, hence code
analyses, are still required. The actual “rewrite” is usually not the
hard part.

In general this is correct. One subtlety about the “validity checks” being easier in MLIR is that the “encoding” can include in the IR the guarantees you need: for example a parallel loop can be encoded directly in the IR, you don’t have to recover the information with an analysis. Even if you do recover information with an analysis, encoding it in the IR makes it more likely to survive through other transformations (it is more robust than LLVM metadata, for example).
But I’m off-topic here :slight_smile:

Similarly, you cannot assume an omp parallel is allowed to be executed
with more than a single thread, or that an omp [parallel] for does not
have loop-carried data dependences, …

With multi-dimensional index support for arrays, wouldn’t it be better to
do the data dependence analysis in MLIR?

Yes, probably.

LLVM IR has linearized subscripts for multi-dimensional arrays.
llvm::DependenceAnalysis tries to “guess” the indices based on different
patterns in SCEV. It takes an intrinsic
<http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_88> or metadata or
some other mechanism of communication from the front end (not the built-in
set of instructions) to solve this problem.
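
For illustration (a sketch assuming a plain memref-based lowering), the subscripts of a 2-D access stay explicit at this level, which is exactly the information llvm::DependenceAnalysis has to reconstruct from a linearized address computation:

// both subscripts are first-class operands of the access; in LLVM IR the
// same access is a single linearized index (roughly i * 100 + j)
%v = affine.load %A[%i, %j] : memref<100x100xf32>
affine.store %v, %B[%i + 1, %j] : memref<100x100xf32>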

Not disagreeing with you on this one :wink:

The only caveat is that we still live in a world in which C/C++ is a
thing.

+1

Reply to Kiran Chandramohan:

You are welcome to participate, provide feedback and criticism to change the design as well as to contribute to the implementation.

Thank you Kiran.

But the latest is what is there in the RFC in discourse.

I have used this as reference for the response.

We did a study of a few constructs and clauses which was shared as mails to flang-dev and the RFC. As we make progress and before implementation, we will share further details.

“ Yes, parallel and flush would be the next two constructs that we will do.” – from a comment in latest RFC

For the above mentioned reasons, I will try to restrict my reply to how the “parallel (do)” construct would be lowered.

If it is OK we can have further discussions in discourse as River points out.

Given that the multiple components of the LLVM project, namely clang, flang, MLIR and LLVM are involved, llvm-dev is probably a better place, with a much wider audience

Possibly wider, but maybe less focused about discussing MLIR dialect design. In particular there is an RFC thread for this particular dialect on Discourse, which is the canonical place to discuss its design.

, until it is clear how different components must interact.

They don’t need to interact so closely: they are very loosely related: flang will use MLIR but clang won’t (in the foreseeable future) and LLVM has many other frontends.

It is the review for translation to LLVM IR that is in progress.

“If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR and that is the best place to do it (based on experiments) then we will not use the OpenMP IRBuilder for these constructs.” – latest RFC in discourse

If it is not finalized that the OpenMPIRBuilder will be used for all the constructs, wouldn’t it be better to delay the submission of “translation to LLVM IR” patch in MLIR? Lowering code will become inconsistent if the OpenMPIRBuilder is used only for a few constructs and not for others.

Also, the patch does OpenMP dialect lowering alongside LLVM Dialect to LLVM IR. This is different from most dialects which get directly lowered to LLVM Dialect. I think lowering to LLVM Dialect would be a cleaner way if OpenMPIRBuilder is not being considered for all constructs.

I don’t disagree, but there are a lot of speculation here: your quote starts with “If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR”, are you thinking that we need to first decide this once and for all before making progress on building this path?
What disadvantages do you perceive to an approach where we would bring up this dialect using the OpenMPIRBuilders for exporting to LLVM IR until we gain enough experience? Do you think starting like this will make it significantly harder to transition away from the builders if this is what we want?
It seemed to me like it wouldn’t, and that’s why I’m supportive of this path: the omp dialect design, implementation, and the transformation/analysis that will be performed there seems entirely disjoint from the LLVM lowering, I’d hope we can swap the LLVM lowering at a later time (if this is what we’d want).

Mehdi also seems to have the same suggestion: “I agree that having dialect lowering would be cleaner” in https://reviews.llvm.org/D72962

Since you’re calling me out: yes it would be cleaner from a pure MLIR point of view, I don’t think there is much disagreement on this (I think?).
However we already have the OpenMP builders available and they will continue to be maintained/evolved to support OpenMP in clang.
Duplicating them entirely in MLIR for the sake of purity seems like a lack of pragmatism here, so I support the current approach with the current tradeoffs.

Yes, the design has mildly changed over time to incorporate feedback. But the latest is what is there in the RFC in discourse.

RFC fails to discuss the following (I have also mentioned some of them in my reply to Johannes):

The proposed plan involves a) lowering F18 AST with OpenMP directly to a mix of OpenMP and FIR dialects. b) converting this finally to a mix of OpenMP and LLVM dialects.

It is unclear in the RFC what other dialects are considered as supported for OpenMP dialect (std, affine, vector, loop, etc) and how it would be transformed, used and lowered from FIR to LLVM.

It becomes important to list the various dialects / operations / types supported for OpenMP (which is mainly defined for C, C++ and Fortran programs; MLIR has a much wider scope).

It wouldn’t add much value for the proposed OpenMP dialect to be in the MLIR tree if it cannot support at least the relevant standard dialect types / operations.

I agree, and I think this was something I called out as important in the RFC: “It seems that the dialect can be orthogonal to FIR and its type system, which the most important thing to me to integrate MLIR (favor reusability across other frontends / compiler frameworks)”.
If you don’t think that this is the case, then please raise this in the RFC!
I think it is perfectly fair to ask for more examples from the author and digging a bit deeper if you’re unconvinced that the proposed modeling can be applicable outside of FIR. This is exactly why we ask such proposal to go through RFC by the way: to allow people like you to point at the blindspot and ask the right questions.

Best,

I'd point out that from an engineering perspective, instead of
switching everybody from LLVM IR to MLIR, considering the relative
maturity of the projects the smarter move would be to switch folks
from MLIR to LLVM IR, by making LLVM IR itself extensible via
dialects.

It's understandable that MLIR was developed as a completely separate
thing, given that there was no proven example of an extensible IR.
However, now that such an example exists, porting the
extensibility features that have been proven to work into LLVM IR is a
logical next step.

Cheers,
Nicolai

Thanks, Vinay for further details and discussion.

“If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR and that is the best place to do it (based on experiments) then we will not use the OpenMP IRBuilder for these constructs.” – latest RFC in discourse

If it is not finalized that the OpenMPIRBuilder will be used for all the constructs, wouldn’t it be better to delay the submission of “translation to LLVM IR” patch in MLIR? Lowering code will become inconsistent if the OpenMPIRBuilder is used only for a few constructs and not for others.

I was hoping that we can identify a set of constructs/clauses which can be fully handled inside the MLIR layer itself. As an example, I provided the collapse clause. This will include constructs/clauses which do not generate runtime API calls, which pass metadata to LLVM to do some optimisation etc. Yes, the list is not finalized. The point here is that we do not want to use MLIR just as a pass-through layer because MLIR has a lot of strengths. Would it be OK to continue with only constructs that use OpenMPIRBuilder before the final list is made?
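
To give one concrete (and purely hypothetical) picture of what “fully handled inside the MLIR layer” could mean for collapse: coalescing the two associated loops into a single loop before any OpenMP runtime lowering, sketched here on the loop dialect:

// before: the two loops associated with a collapse(2) clause
loop.for %i = %c0 to %n step %c1 {
  loop.for %j = %c0 to %m step %c1 {
    ...
  }
}

// after: a single coalesced iteration space, produced entirely in MLIR
%total = muli %n, %m : index
loop.for %k = %c0 to %total step %c1 {
  %i = divi_signed %k, %m : index
  %j = remi_signed %k, %m : index
  ...
}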

Also, the patch does OpenMP dialect lowering alongside LLVM Dialect to LLVM IR. This is different from most dialects which get directly lowered to LLVM Dialect. I think lowering to LLVM Dialect would be a cleaner way if OpenMPIRBuilder is not being considered for all constructs.
Mehdi also seems to have the same suggestion: “I agree that having dialect lowering would be cleaner” in https://reviews.llvm.org/D72962

The point here is that if we lower to LLVM dialect, we will not be able to reuse OpenMP codegen & optimisation code from Clang/LLVM.
It was pointed out to me early on by the MLIR developers that there are a few dialects (like NVVM) which are lowered along with LLVM dialect.

You ask a lot of specific questions about which types, dialects, memory access operations will be supported and also the lowering for parallel do. Yes, the RFC does not provide all this information. This will become clear only as we make progress with the OpenMP dialect. I would like and I am interested to provide answers to all your questions in the following weeks. Please allow some time.

–Kiran

> Mehdi also seems to have the same suggestion: “I agree that having
> dialect lowering would be cleaner” in https://reviews.llvm.org/D72962 ([MLIR, OpenMP] Translation of OpenMP barrier construct to LLVM IR)
>

Since you're calling me out: yes it would be cleaner from a pure MLIR point
of view, I don't think there is much disagreement on this (I think?).
However we already have the OpenMP builders available and they will
continue to be maintained/evolved to support OpenMP in clang.
Duplicating them entirely in MLIR for the sake of purity seems like a lack
of pragmatism here, so I support the current approach with the current
tradeoffs.

What benefit would we hope to achieve by representing runtime API calls
in the LLVM dialect? I mean, as far as I understand, the central idea of
MLIR is to retain and exploit "high-level" information. Translating to
LLVM dialect for the sake of it seems in any context like a suboptimal
choice.

> > Yes, the design has mildly changed over time to incorporate feedback.
> But the latest is what is there in the RFC in discourse.
>
> RFC fails to discuss the following (I have also mentioned some of them in
> my reply to Johannes):
>
> > The proposed plan involves a) lowering F18 AST with OpenMP directly to a
> mix of OpenMP and FIR dialects. b) converting this finally to a mix of
> OpenMP and LLVM dialects.
>
> It is unclear in the RFC what other dialects are considered as supported
> for OpenMP dialect (std, affine, vector, loop, etc) and how it would be
> transformed, used and lowered from FIR to LLVM.
>
> It becomes important to list down the various dialects / operations /
> types supported for OpenMP (which is mainly defined for C, C++ and Fortran
> programs. MLIR has a much wider scope.
>
> It wouldn’t add much value for the proposed OpenMP dialect to be in the
> MLIR tree if it cannot support at least the relevant standard dialect types
> / operations.
>

I agree, and I think this was something I called out as important in the
RFC: "It seems that the dialect can be orthogonal to FIR and its type
system, which the most important thing to me to integrate MLIR (favor
reusability across other frontends / compiler frameworks)".
If you don't think that this is the case, then please raise this in the RFC!
I think it is perfectly fair to ask for more examples from the author and
digging a bit deeper if you're unconvinced that the proposed modeling can
be applicable outside of FIR. This is exactly why we ask such proposal to
go through RFC by the way: to allow people like you to point at the
blindspot and ask the right questions.

I was told that there is no technical downside of having FIR in
`/flang/`. If that is the case, why doesn't the OpenMP dialect live
there as well? There is also `/openmp/`, which arguably would make
sense, but that would introduce a dependence we do not yet have.

Long story short, does it make much of a difference? It seems we are in
agreement the dialect will live in `/llvm-project/`, which subproject
seems less important (to me).

> > We would like to take advantage of the transformations in cases that are
> possible. FIR loops will be converted to affine/loop dialect. So the loop
> inside an omp.do can be in these dialects as clarified in the discussion in
> discourse and also shown in slide 20 of the fosdem presentation (links to
> both below).
>
>
> RFC: OpenMP dialect in MLIR - #7 by kiranchandramohan - MLIR - LLVM Discussion Forums
>
>
> https://fosdem.org/2020/schedule/event/llvm_flang/attachments/slides/3839/export/events/attachments/llvm_flang/slides/3839/flang_llvm_frontend.pdf
>
> Although it is mentioned that the affine/ loop.for is used, following
> things are unclear:
>
> I am assuming that there will be lowering / conversion code in f18 repo
> dialect from fir.do to loop.for / affine.for. Is it the case? If so, I
> think it is worth mentioning it in the “sequential code flow
> representation” in the RFC.

Off topic:
Do we have a counter for the different number of "loop-like" constructs
in and around MLIR? I think plotting it over time will eventually make a
great XKCD comic :wink:

Please find the reply inline below:

It sounds like LLVM IR is being considered for optimizations in OpenMP
constructs. There seems to be plans regarding improvement of LLVM IR
Framework for providing things required for OpenMP / flang(?)

LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific
transformations. For now it is simple but we have some more downstream
patches, e.g., parallel region expansion [Section 5, 1]. Other
optimizations [Section 3 & 4, 1], will be performed by the Attributor
(see [4] after [2,3]) after one missing piece (basically [5] with some
more plumming) was put in place, see [2,3] for details on the idea.

Please feel free to ask questions on any of this.

[0] https://reviews.llvm.org/D69930
[1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[2] https://www.youtube.com/watch?v=zfiHaPaoQPc
[3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf
[4] https://youtu.be/CzWkc_JcfS0
[5] https://reviews.llvm.org/D71505

It might also be worth looking into [6,7], mentioned below.

Thanks for sharing all the interesting work going on LLVM IR side of things for OpenMP. I will take a look at it.

As I have mentioned before, I am mostly concerned (in this thread) about the MLIR side of things for OpenMP and various optimizations/ transformations that are planned. I was surprised to see the plans for LLVM IR and not for MLIR.

Are there any design considerations which contain pros and cons about using
the MLIR vs LLVM IR for various OpenMP related optimizations/
transformations?

The biggest pro for LLVM-IR is that it works for C/C++ right now.

TBH, I don’t even know what transformations people plan to do on OpenMP MLIR (and why).

A good support for C/C++ in LLVM-IR shouldn’t be an impediment for considering OpenMP dialect transformations / optimizations for the ongoing flang effort. As we all agree, things like dependencies etc can be solved in an easier way in MLIR and C/C++ seems to be the “only caveat”.

In
addition, as I mentioned before, LLVM-IR has mature analysis and
transformation passes for real world programs and support things like
LTO out of the box.

I think LTO for flang is planned to be based on LLVM. I do not see any other option for now!

The latest RFC [ (3) in my original post ] mentions that:

So there exist some questions regarding where the optimisations should be
carried out.

Could you please provide more details on this?

I would like to quote Chris here:

“if you ignore the engineering expense, it would clearly make sense to
reimplement the mid-level LLVM optimizers on top of MLIR and replace
include/llvm/IR with a dialect definition in MLIR instead.“ –
http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html

The rest of the comments are inlined.

Hi Vinay,

Thanks for taking an interest and the detailed discussion.

To start by picking a few paragraph from your email to clarify a couple
of things that lead to the current design or that might otherwise need
clarification. We can talk about other points later as well.

[
Site notes:

  1. I’m not an MLIR person.
  2. It seems unfortunate that we do not have a mlir-dev list.
    ]
  1. With the current design, the number of transformations / optimizations
    that one can write on OpenMP constructs would become limited as there can
    be any custom loop structure with custom operations / types inside it.

OpenMP, as an input language, does not make many assumptions about the
code inside of constructs*.

This isn’t entirely correct because the current OpenMP API specification (
https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code
inside the constructs belongs to C, C++ and Fortran programs.

(FWIW, my next sentence specified that I talk about the base language
but anyway.)

While technically true, I will recommend not to make that assumption. We
do already allow non-base language constructs, e.g., CUDA intrinsics in
target regions, and that will not go away because it is required to
maximize performance.

So, inside a parallel can be almost anything
the base language has to offer, both lexically and dynamically.

I am mostly concerned with the MLIR side of things for OpenMP
representation.

MLIR can not only support operations for General Purpose languages like
C,C++, Fortran, etc but also various Domain Specific Language
representations as dialects (Example, ML, etc.). Note that there is also
SPIR V dialect which is again meant for “Parallel Compute”.

It becomes important to define the scope of the dialects / operations /
types supported inside OpenMP operations in MLIR.

Arguably, the OpenMP dialect in MLIR should match the OpenMP directives
and clauses as defined by the standard. Anything else is “not OpenMP”.

Yes! This is what needs to be “defined” for the various dialects…

Assuming otherwise is not going to work. Analyzing a “generic” OpenMP
representation in order to determine if it can be represented as a more
restricted “op” seems at least plausible. You will run into various
issues, some mentioned explicitly below.

Isn’t it the other way around? For example, it doesn’t make much sense to
wrap OpenMP operations for SPIR-V operations / types.

I may be misunderstanding, but I thought you want to use something like
the GPU / Affine dialect to represent an OpenMP target region / loop.
That is plausible if you analyze the target region / loop and verify it
fits into the more generic dialect semantics.

I think it is important to specify (in the design) which existing MLIR
dialects are supported in this effort and the various lowerings /
transformations / optimizations which are planned for them.

That I cannot really help you with. TBH, I don’t even know what
transformations people plan to do on OpenMP MLIR (and why).

For starters, you still have to
generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
if it is “just” to make sure the OMPD/OMPT interfaces expose useful
information.

You can have a well-defined call-like mlir::Operation which calls the GPU
kernel. Perform all cross-device transformations in an easier way.
Then, this operation can be lowered to OpenMP runtime calls during LLVM
dialect conversion.

You missed my point I made in the other email. An OpenMP target region
is statically not a GPU offload so you should not model it as such “for
some time”.

I did not get your point here.

Why isn’t an OpenMP program compiled with a single target (say, nvptx) a GPU offload case? Are you saying that tgt_target might invoke some code other than the GPU kernel even if the user intends to run it on the GPU?

Even when there are multiple targets, all the code should get “statically” compiled and kept ready.

In any case, unless it is statically proved that the GPU kernel is not executed, you need to optimize the code inside the GPU kernel, and the point was that it would be better to do that in MLIR.

I think this is much better than directly having calls
to the OpenMP runtime library based on a kernel name mentioned in
llvm::GlobalVariable.

The (current) implementation is not the semantics. There is no reason not to
change the way we lower OpenMP, e.g., by getting rid of the global
variables. They are present for a reason but not intrinsically required.
See the TRegions for example [6,7]; they totally change the GPU lowering,
making it device agnostic and easy to analyze and optimize in the middle
end. Arguing that the current encoding of OpenMP in LLVM-IR is problematic is
the same as arguing that MLIR’s LLVM dialect doesn’t support atomic_rmw: it
might be true, but it’s changeable.

[6] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
[7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf

  • I preclude the omp loop construct here as it is not even implemented
    anywhere as far as I know.
  1. It would also be easier to transform the loop nests containing OpenMP
    constructs if the body of the OpenMP operations is well defined (i.e.,
    does not accept arbitrary loop structures). Having nested redundant
    “parallel”, “target” and “do” regions seems unnecessary.

As mentioned above, you cannot start with the assumption OpenMP input is
structured this way. You have to analyze it first. This is the same
reason we cannot simply transform C/C++ for loops into affine.for
without proper analysis of the loop body.

Now, more concrete. Nested parallel and target regions are not
necessarily redundant, nor can/should we require the user not to have
them. Nested parallelism can easily make sense, depending on the problem
decomposition. Nested target will make a lot of sense with reverse
offload, which is already in the standard, and it also should be allowed
for the sake of a modular (user) code base.

Just to be clear, having all three of “target”, “parallel” and “do” doesn’t
represent “Nested parallelism” at all in the proposed design! ( 2(d) ).

omp.target {

omp.parallel {

omp.do {

……

}

}

}

The above invokes a call to tgt_target() for the code inside omp.do, as
mentioned in the proposal.

I do not follow. Just to make sure, the above should be roughly
equivalent to the code below, correct? There is no “nested”
parallelism, sure, but I thought you were talking about the case where
there is, e.g. add another #pragma omp parallel inside the one that
already is there. That is nested parallelism which can happen and make
total sense for the application.

No, actually. I was just saying that all three could have been merged into one OpenMP operation, “target_parallel_for”, in MLIR rather than having three and analyzing them.

#pragma omp target
{
#pragma omp parallel
{
#pragma omp for
for (…)
{

}
}
}

  1. There would also be new sets of loop structures in new dialects when
    C/C++ is compiled to MLIR. It would increase the number of possible
    combinations inside the OpenMP region.

Is anyone working on this? If so, what is the timeline? I personally was
not expecting Clang to switch over to MLIR any time soon but I am happy
if someone wants to correct me on this. I mention this only because it
interacts with the arguments I will make below.

E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
able to compile the code for heterogeneous hardware is one of the biggest
advantages that MLIR brings to the table. That is being completely missed
here. This also requires solving the problem of handling target
information in MLIR. But that is a problem which needs to be solved anyway. Using GPU
dialect also gives us an opportunity to represent offloading semantics in
MLIR.

I’m unsure what the problem with “handling target information in MLIR” is,
but whatever design we end up with, we need to know about the target
(triple) in all stages of the pipeline, even if it is just to pass it
down.

Given the ability to represent multiple ModuleOps and the existence of the
GPU dialect, couldn’t higher-level optimizations on offloaded code be done at
the MLIR level? The proposed design would lead us to the same problems that
we are currently facing in LLVM IR.

Also, OpenMP codegen will automatically benefit from GPU-dialect-based
optimizations. For example, it would be way easier to hoist a memory
reference out of a GPU kernel in MLIR than in LLVM IR.

While I agree with the premise that you can potentially reuse MLIR
transformations, it might not be as simple in practice.

As mentioned above, you cannot assume much about OpenMP codes, almost
nothing for a lot of application codes I have seen. Some examples:

If you have a function call, or any synchronization event for that
matter, located between two otherwise adjacent target regions (see
below), you cannot assume the two target regions will be offloaded to
the same device.

#omp target
{}
foo();
#omp target
{}

These kinds of optimizations are much easier to write in MLIR:

LLVM IR for the above code would contain a series of OpenMP runtime-call
setup instructions, with foo() in the middle, followed by another set of
OpenMP-runtime-related instructions. The bodies of the two target constructs
would be in two different outlined functions (if not modules).

It takes quite a bit of code to do analysis / transformation to write any
optimization on the generated LLVM IR.

You are right about the modules being a problem. As I mentioned in my
last email, we are working on that by not having them in different ones
during the optimization pipeline. If we make the target regions parallel
regions instead, we can simulate that right now. The bodies are in different
functions, sure, but does it matter? Let’s walk through parallel region
expansion (see above [Section 5, 1]) so you can judge for yourself:

#omp parallel
{ body0 }
some_code
#omp parallel
{ body1 }

will become

__kmpc_fork_call(..., @body0_fn, ...)
some_code
__kmpc_fork_call(..., @body1_fn, ...)

in IR. Simplified, there are 3 cases here:

  1. some_code is harmless, meaning all of it can be executed redundantly.
  2. parts of some_code need to be guarded to be sequential but
    they can be executed in a parallel region otherwise, e.g., the code
    will not observe the difference through runtime calls.
  3. parts of some_code cannot be executed in a parallel region as
    they might observe the difference through runtime calls.

First note that you need to do the classification regardless of your
encoding (=IR). In case of 3) we are done and nothing is happening.
Let’s consider case 2), as 1) is basically a special case of it. As shown
in the paper [1], you need to broadcast values created by some_code
across all threads and synchronize appropriately to preserve semantics.
Other than that, the transformation is straightforward:

A) Create a function “@body01_fn” that is basically the outlined region,
in which the code is then guarded and the __kmpc_fork_call calls are replaced by
direct calls. It looks like this:

call @body0_fn(...)
#omp master
some_code
#omp barrier
call @body1_fn(...)

B) Replace the region you put in the new function with a
__kmpc_fork_call to it:

__kmpc_fork_call(..., @body01_fn, ...)

C) Done.

If you are interested in the implementation I’ll add you as a reviewer
once I put it on Phab. I’m in the process of cleaning up my stand alone
pass and moving it into the OpenMPOpt pass instead.

vs.

MLIR provides a way to represent the operations closer to the source. It is
as simple as checking the next operation(s) in the mlir::Block. OpenMP
target operation contains an inlined region which can easily be fused/
split / or any other valid transformation for that matter.

Note that you can also perform various Control Structure Analysis /
Transformations much easier in MLIR. For example, you can decide to execute
foo() based on certain conditions, and you can merge the two target regions
in the else path.

At the end, it’s an encoding difference. Sure, the handling might be
easier in certain situations but all the validity checks, hence code
analyses, are still required. The actual “rewrite” is usually not the
hard part.

Similarly, you cannot assume an omp parallel is allowed to be executed
with more than a single thread, or that an omp [parallel] for does not
have loop-carried data dependences, …

With multi-dimensional index support for arrays, wouldn’t it be better to
do the data dependence analysis in MLIR?

Yes, probably.

LLVM IR has linearized subscripts for multi-dimensional arrays.
llvm::DependenceAnalysis tries to “guess” the indices based on different
patterns in SCEV. It takes an intrinsic
<http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_88> or metadata or
some other mechanism of communication from the front end (not the built-in
set of instructions) to solve this problem.

Not disagreeing with you on this one :wink:

The only caveat is that we still live in a world in which C/C++ is a
thing.

That doesn’t mean we should not have optimizations in MLIR for Fortran :slight_smile:

Data-sharing attributes are also something that has to be treated
carefully:

x = 5;
#omp task
x = 3;
print(x);

Should print 5, not 3.

You can have “x” as a locally defined variable inside the region contained
by the “task” operation in MLIR, or use custom data-sharing attributes in the OpenMP dialect.
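
A minimal sketch of the first option (the firstprivate-style clause syntax here is hypothetical, only meant to show the shape of such an encoding):

%x = alloc() : memref<i32>
%c5 = constant 5 : i32
store %c5, %x[] : memref<i32>
omp.task firstprivate(%x : memref<i32>) {
  // hypothetically, the region receives a task-local copy of the storage
  // behind %x, so the store of 3 performed in here is not visible outside
  ...
}
// the print after the task reads %x and still sees 5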

I’m not saying it is impossible or even hard, but maybe not as
straightforward as one might think. Your encoding is for example very reasonable.

In the example below you need to print 3, not 5, e.g., constant prop on
the outer level should not happen.

Yes, that is why I am concerned about the design considerations of clauses like shared, map, firstprivate, etc.

x = 5;
#omp task shared(x)
{
x = 3;
some_form_of_sync();
...
}
some_form_of_sync();
print(x);

I hope I convinced you that OpenMP is not trivially mappable to existing
dialects without proper analysis. If not, please let me know why you
expect it to be.

I do not see much reason why the issues you mentioned can’t trivially be
mapped to the MLIR infrastructure. There is an easy way to define custom
operations / types / attributes in OpenMP dialect and perform optimizations
based on the IR that is created especially for OpenMP. The analysis /
transformations required can be easily written on the custom operations
defined rather than having a lowered form in the LLVM IR.

You can totally define your OpenMP dialect and map it to that. Mapping
to other dialects is the problematic part.

Yes, this is why I have mentioned that RFC should talk about the following:

"I think it is important to specify (in the design) which existing MLIR

dialects are supported in this effort and the various lowerings /

transformations / optimizations which are planned for them."

Please find the reply inline below

Reply to Kiran Chandramohan:

You are welcome to participate, provide feedback and criticism to change the design as well as to contribute to the implementation.

Thank you Kiran.

But the latest is what is there in the RFC in discourse.

I have used this as reference for the response.

We did a study of a few constructs and clauses which was shared as mails to flang-dev and the RFC. As we make progress and before implementation, we will share further details.

“ Yes, parallel and flush would be the next two constructs that we will do.” – from a comment in latest RFC

For the above mentioned reasons, I will try to restrict my reply to how the “parallel (do)” construct would be lowered.

If it is OK we can have further discussions in discourse as River points out.

Given that the multiple components of the LLVM project, namely clang, flang, MLIR and LLVM are involved, llvm-dev is probably a better place, with a much wider audience

Possibly wider, but maybe less focused about discussing MLIR dialect design. In particular there is an RFC thread for this particular dialect on Discourse, which is the canonical place to discuss its design.

, until it is clear how different components must interact.

They don’t need to interact so closely: they are very loosely related: flang will use MLIR but clang won’t (in the foreseeable future) and LLVM has many other frontends.

It is the review for translation to LLVM IR that is in progress.

“If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR and that is the best place to do it (based on experiments) then we will not use the OpenMP IRBuilder for these constructs.” – latest RFC in discourse

If it is not finalized that the OpenMPIRBuilder will be used for all the constructs, wouldn’t it be better to delay the submission of “translation to LLVM IR” patch in MLIR? Lowering code will become inconsistent if the OpenMPIRBuilder is used only for a few constructs and not for others.

Also, the patch does OpenMP dialect lowering alongside LLVM Dialect to LLVM IR. This is different from most dialects which get directly lowered to LLVM Dialect. I think lowering to LLVM Dialect would be a cleaner way if OpenMPIRBuilder is not being considered for all constructs.

I don’t disagree, but there are a lot of speculation here: your quote starts with “If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR”, are you thinking that we need to first decide this once and for all before making progress on building this path?
What disadvantages do you perceive to an approach where we would bring up this dialect using the OpenMPIRBuilders for exporting to LLVM IR until we gain enough experience? Do you think starting like this will make it significantly harder to transition away from the builders if this is what we want?
It seemed to me like it wouldn’t, and that’s why I’m supportive of this path: the omp dialect design, implementation, and the transformation/analysis that will be performed there seems entirely disjoint from the LLVM lowering, I’d hope we can swap the LLVM lowering at a later time (if this is what we’d want).

The statement you quoted is from the RFC in discourse by Kiran. It is actually unclear to whom you are referring here. I am assuming that it is for him to answer.

The below details would cover some of your questions as well.

About Clang / MLIR / LLVM being loosely related and not being relevant in llvm-dev:

With the introduction of the OpenMPIRBuilder in MLIR (from this review: https://reviews.llvm.org/D72962), Clang and MLIR would now share common code for building OpenMP constructs. I do not think they are so loosely related anymore. Note that the MLIR and Clang frontends for LLVM are very different: Clang emits LLVM IR with almost no optimizations, while MLIR already supports a considerable amount of optimization.

The decision to use the OpenMPIRBuilder for MLIR was discussed in the following flang-dev threads (please correct me if I am missing some newer discussions on the topics below):

  1. [May 2019] http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

  2. [June 2019] http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-June/000251.html

However I could not find any conclusions for the concerns raised by Kiran:

  1. Early outlining (in MLIR) vs. Late outlining (in LLVM)

  2. Handling of target constructs: high-level transformations for GPUs and CPUs (offloading in LLVM vs. MLIR?)

Kiran seems to suggest that early outlining (version 2) would be better (http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000224.html). But currently, late outlining has been implemented in LLVM (version 1) (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp).

Early outlining in MLIR would have the following benefits, as suggested in the thread (a rough sketch of the early-outlined form follows the list):

  1. Enables more optimization in MLIR (intra-procedural because of regions).

  2. Offloading in MLIR (which is designed for heterogeneous hardware compilation support)

  3. Direct LLVM Dialect lowering of OpenMP operations (no LLVM IR lowering)
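
A rough sketch of what the early-outlined form could look like at the MLIR level (the op name, its syntax, and the way arguments are captured are all hypothetical; data-sharing clauses are glossed over):

// the parallel body already lives in its own function before leaving MLIR
func @parallel_body_0(%a : memref<?xf32>) {
  ...
  return
}

func @caller(%a : memref<?xf32>) {
  // a call-like op that could later be lowered either directly to the LLVM
  // dialect or to __kmpc_fork_call-style runtime calls
  "omp.parallel_call"(%a) { callee = @parallel_body_0 } : (memref<?xf32>) -> ()
  return
}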

The MLIR Google Groups discussion (https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw) regarding the use of the OpenMPIRBuilder doesn’t seem to discuss the above concerns, nor how the various design decisions in the OpenMPIRBuilder affect MLIR in general.

Also,

“The point here is that we do not want to use MLIR just as a pass-through layer because MLIR has a lot of strengths”

“The point here is that if we lower to LLVM dialect, we will not be able to reuse OpenMP codegen & optimisation code from Clang/LLVM.”

— by Kiran in https://lists.llvm.org/pipermail/llvm-dev/2020-February/139181.html

In his latest reply to this thread (quoted above), Kiran seems to suggest that lowering to LLVM Dialect (instead of LLVM IR) would restrict the use of OpenMP optimization code from LLVM, and that MLIR will just be a pass-through to the OpenMPIRBuilder.

For the above reasons, it seems to me that the design considerations of using the OpenMPIRBuilder for MLIR should also be mentioned (and discussed) before committing the LLVM IR lowering part for the OpenMP dialect in https://reviews.llvm.org/D72962.

The point here is that we do not want to use MLIR just as a pass-through layer because MLIR has a lot of strengths

The point here is that if we lower to LLVM dialect, we will not be able to reuse OpenMP codegen & optimisation code from Clang/LLVM.

Just to be clear, are you suggesting that if OpenMPIRBuilder is used, MLIR will have to be used as pass-through without the optimizations?

Thanks, Vinay for further details and discussion.

“If we decide that the OpenMP construct (for e.g. collapse) can be handled fully in MLIR and that is the best place to do it (based on experiments) then we will not use the OpenMP IRBuilder for these constructs.” – latest RFC in discourse

If it is not finalized that the OpenMPIRBuilder will be used for all the constructs, wouldn’t it be better to delay the submission of “translation to LLVM IR” patch in MLIR? Lowering code will become inconsistent if the OpenMPIRBuilder is used only for a few constructs and not for others.

I was hoping that we can identify a set of constructs/clauses which can be fully handled inside the MLIR layer itself. As an example, I provided the collapse clause. This will include constructs/clauses which do not generate runtime API calls, which pass metadata to LLVM to do some optimisation etc. Yes, the list is not finalized. The point here is that we do not want to use MLIR just as a pass-through layer because MLIR has a lot of strengths. Would it be OK to continue with only constructs that use OpenMPIRBuilder before the final list is made?

Also, the patch does OpenMP dialect lowering alongside LLVM Dialect to LLVM IR. This is different from most dialects which get directly lowered to LLVM Dialect. I think lowering to LLVM Dialect would be a cleaner way if OpenMPIRBuilder is not being considered for all constructs.
Mehdi also seems to have the same suggestion: “I agree that having dialect lowering would be cleaner” in https://reviews.llvm.org/D72962

The point here is that if we lower to LLVM dialect, we will not be able to reuse OpenMP codegen & optimisation code from Clang/LLVM.
It was pointed out to me early on by the MLIR developers that there are a few dialects (like NVVM) which are lowered along with LLVM dialect.

You ask a lot of specific questions about which types, dialects, memory access operations will be supported and also the lowering for parallel do. Yes, the RFC does not provide all this information. This will become clear only as we make progress with the OpenMP dialect. I would like and I am interested to provide answers to all your questions in the following weeks. Please allow some time.

Thank you all for your responses so far. Awaiting your further responses.

I replied inline and marked one section with the tag [MOST
INTERESTING] in order to highlight it. I believe it is important that
the Flang + OpenMP community takes a closer look at this and that we
discuss it as early as possible.

Please find the reply inline below:

> > It sounds like LLVM IR is being considered for optimizations in OpenMP
> > constructs. There seems to be plans regarding improvement of LLVM IR
> > Framework for providing things required for OpenMP / flang(?)
>
> LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific
> transformations. For now it is simple but we have some more downstream
> patches, e.g., parallel region expansion [Section 5, 1]. Other
> optimizations [Section 3 & 4, 1], will be performed by the Attributor
> (see [4] after [2,3]) after one missing piece (basically [5] with some
> more plumming) was put in place, see [2,3] for details on the idea.
>
> Please feel free to ask questions on any of this.
>
> [0] https://reviews.llvm.org/D69930 ([OpenMP] Introduce the OpenMPOpt transformation pass)
> [1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
> [2] https://www.youtube.com/watch?v=zfiHaPaoQPc
> [3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf
> [4] https://youtu.be/CzWkc_JcfS0
> [5] https://reviews.llvm.org/D71505 ([Utils] Provide a callback encapsulation utility for call sites)
>
> It might also worth looking into [6,7] mentioned below.
>
>
Thanks for sharing all the interesting work going on LLVM IR side of things
for OpenMP. I will take a look at it.

As I have mentioned before, I am mostly concerned (in this thread) about
the MLIR side of things for OpenMP and various optimizations/
transformations that are planned. I was surprised to see the plans for LLVM
IR and not for MLIR.

We publicize our plans as widely as possible and as early as possible.
The talks above are from LLVM-Dev meetings*, usually a good spot to look
for bigger changes. Also, there were early discussions/RFCs on the -dev
lists.

* We spoke at the Euro and US developers’ meetings in 2018 and 2019 about
  these things. EuroLLVM 2020 has an OpenMP BoF/panel.

> > Are there any design considerations which contain pros and cons about
> using
> > the MLIR vs LLVM IR for various OpenMP related optimizations/
> > transformations?
>
> The biggest pro for LLVM-IR is that it works for C/C++ right now.

> TBH, I don't even know what transformations people plan to do on OpenMP
MLIR (and why).

A good support for C/C++ in LLVM-IR shouldn’t be an impediment for
considering OpenMP dialect transformations / optimizations for the ongoing
*flang* effort. As we all agree, things like dependencies etc can be solved
in an *easier way* in MLIR and C/C++ seems to be the “only caveat”.

Agreed. C/C++ should not be an impediment for the *flang* effort. It
should be taken into account wrt. resources and how we spend them but I
never tried to say that for flang we couldn't do something special in
addition.

> In
> addition, as I mentioned before, LLVM-IR has mature analysis and
> transformation passes for real world programs and support things like
> LTO out of the box.
>
I think LTO for flang is planned to be based on LLVM. I do not see any
other option for now!

Seems reasonable to me. Just to make the implication clear: If you want
OpenMP optimizations that require inter-translation-unit information,
e.g., merge parallel regions from different files, you'll need
LLVM-IR-based OpenMP transformations.

> > The latest RFC [ (3) in my original post ] mentions that:
> >
> > > So there exist some questions regarding where the optimisations should
> be
> > carried out.
> >
> > Could you please provide more details on this?
> >
> > I would like to quote Chris here:
> >
> > “if you ignore the engineering expense, it would clearly make sense to
> > reimplement the mid-level LLVM optimizers on top of MLIR and replace
> > include/llvm/IR with a dialect definition in MLIR instead.“ --
> > > http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html
> >
> > *Rest of the comment are inlined.*
> >
> >
> > > Hi Vinay,
> > >
> > > Thanks for taking an interest and the detailed discussion.
> > >
> > > To start by picking a few paragraph from your email to clarify a couple
> > > of things that lead to the current design or that might otherwise need
> > > clarification. We can talk about other points later as well.
> > >
> > > [
> > > Site notes:
> > > 1) I'm not an MLIR person.
> > > 2) It seems unfortunate that we do not have a mlir-dev list.
> > > ]
> > >
> > >
> > > > 1. With the current design, the number of transformations /
> optimizations
> > > > that one can write on OpenMP constructs would become limited as
> there can
> > > > be any custom loop structure with custom operations / types inside
> it.
> > >
> > > OpenMP, as an input language, does not make many assumptions about the
> > > code inside of constructs*.
> >
> >
> > This isn’t entirely correct because the current OpenMP API specification
> > (https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code
> > inside the constructs belongs to C, C++ and Fortran programs.
>
> (FWIW, my next sentence specified that I talk about the base language
> but anyway.)
>
> While technically true, I will recommend not to make that assumption. We
> do already allow non-base language constructs, e.g., CUDA intrinsics in
> target regions, and that will not go away because it is required to
> maximize performance.
>
>
> > > So, inside a parallel can be almost anything
> > > the base language has to offer, both lexically and dynamically.
> > >
> >
> >
> > I am mostly concerned with the MLIR side of things for OpenMP
> > representation.
> >
> > MLIR can not only support operations for General Purpose languages like
> > C,C++, Fortran, etc but also various Domain Specific Language
> > representations as dialects (Example, ML, etc.). Note that there is also
> > SPIR V dialect which is again meant for “Parallel Compute”.
> >
> > It becomes important to define the scope of the dialects / operations /
> > types supported inside OpenMP operations in MLIR.
>
> Arguably, the OpenMP dialect in MLIR should match the OpenMP directives
> and clauses as defined by the standard. Anything else is "not OpenMP".
>
>
Yes! This is what needs to be “defined” for the various dialects...

I think we are in agreement here.

>
> > > Assuming otherwise is not going to work. Analyzing a "generic" OpenMP
> > > representation in order to determine if it can be represented as a more
> > > restricted "op" seems at least plausible. You will run into various
> > > issues, some mentioned explicitly below.
> >
> > Isn’t it the other way around? For example, it doesn’t make much sense to
> > wrap OpenMP operations around SPIR-V operations / types.
>
> I maybe misunderstanding but I thought you want to use something like
> the GPU / Affine dialect to represent an OpenMP target region / loop.
> That is plausible if you analyze the target region / loop and verify it
> fits into the more generic dialect semantics.
>
>
> > I think it is important to specify (in the design) which existing MLIR
> > dialects are supported in this effort and the various lowerings /
> > transformations / optimizations which are planned for them.
>
> That I cannot really help you with. TBH, I don't even know what
> transformations people plan to do on OpenMP MLIR (and why).
>
>
> > > For starters, you still have to
> > > generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
> > > if it is "just" to make sure the OMPD/OMPT interfaces expose useful
> > > information.
> > >
> > >
> > You can have a well-defined call-like mlir::Operation which calls the GPU
> > kernel, and perform all cross-device transformations on it more easily.
> > This operation can then be lowered to OpenMP runtime calls during LLVM
> > dialect conversion.
>
> You missed my point I made in the other email. An OpenMP target region
> is statically not a GPU offload so you should not model it as such "for
> some time".
>
I did not get your point here.

Why is an OpenMP program compiled with a single target (say, nvptx) not a GPU
offload case? Are you saying that tgt_target might invoke some code other
than the GPU kernel even if the user intends to run it on the GPU?

Even when there are multiple targets, all the code should get “statically”
compiled and kept ready.

In any case, unless it is statically proven that the GPU kernel is not
executed, you need to optimize the code inside the GPU kernel, and my point
was that it would be better to do that in MLIR.

There are at least two different things that need to be considered here:
  1) Fallback code for the initial device which *has* to be present in
     the general case.
  2) How do you handle multiple targets (which can be just a single GPU
     and the host) wrt. preprocessing & compilation.

The first is fairly simple to handle by doing some "local" versioning,
e.g., you can have your "GPU version" and you keep a "non-GPU one"
around as well. I only mention this because not keeping the CPU version
around could arguably introduce subtle bugs when "GPU behavior" is
assumed during some analysis/transformation.
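
For illustration, a minimal sketch of that "local versioning" idea; every name
here is hypothetical, not an actual runtime interface:

```
// Hypothetical names throughout; the point is only that the initial-device
// fallback is kept next to the device path and the choice is made at runtime,
// so analyses must not assume "GPU behavior" unconditionally.
void region_fallback(void);      // host (initial device) version of the region
int  region_try_offload(void);   // returns 0 on success (assumed convention)

void run_target_region(void) {
  if (region_try_offload() != 0)
    region_fallback();           // executed when offloading is not possible
}
```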

[MOST INTERESTING]

The second point is more complex though and I'm unsure whether the wider
community has considered this completely:

Let's start by looking at the C/C++ side (which is what I am more
familiar with so my expectations/argumentation is skewed):
- Clang basically compiles the code for each target once. That includes
  preprocessing, parsing, ...
- GCC runs the preprocessor only once and then splits the code
  internally at some point.
There are interesting consequences to this choice, mostly when it comes
to the question of how you deal with "non-standard" code, e.g., target
specific vector types, target specific assembly, target specific
intrinsics, ....

In clang, you can use #ifdefs to guard code for the host and device
respectively.
In GCC all #ifdefs are resolved wrt. the host.
[ Site note: We introduced `omp begin/end declare variant` to provide an
  OpenMP way that avoids #ifdefs but that is not the point right now. ]
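
For illustration, a minimal sketch of such an #ifdef inside a target region,
assuming __NVPTX__ is defined only while the nvptx device copy is compiled (as
in clang's per-target model); under GCC's single-preprocessing model the same
#ifdef would be resolved once, with respect to the host:

```
void scale(int n, float *x) {
  #pragma omp target map(tofrom: x[0:n])
  {
    for (int i = 0; i < n; ++i) {
#ifdef __NVPTX__
      x[i] = x[i] * 2.0f;   // device-specific path (e.g., a device intrinsic)
#else
      x[i] = x[i] + x[i];   // host / fallback path
#endif
    }
  }
}
```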

The central question now is:
  What is the flang story on this and how does that interact with the
  MLIR representation?

We might want to talk about an example situation:
Let's say we want to compile a target region for 2 targets, a GPU and a
multi-core of a different vendor. We want to share most of the code but
we'll end up with target specific intrinsics and target specific
assembly either because we typed them explicitly or because we included
some specialized library which we can just call <math> for simplicity.
Now in Clang you would get 2 modules:
  1) Host (=multi-core device) + fallback code
  2) GPU code

In fact, declare variant, metadirective, and an #ifdef solution
compatible with clang require us to do some sort of *early* versioning,
though potentially at a finer granularity.

I would be really interested in the ways Flang/FC is supposed to handle
(or not?) these situations (a small sketch follows the list below):
  - target specific #ifdefs
  - declare variant
  - (nested) metadirective
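
For reference, a minimal sketch of the last two items; the function names and
the nvptx64 arch selector are assumptions, only the directives themselves come
from the OpenMP 5.x specification:

```
// Device-tuned variant, selected when the context matches arch(nvptx64).
void saxpy_nvptx(int n, float a, const float *x, float *y);

#pragma omp declare variant(saxpy_nvptx) match(device = {arch(nvptx64)})
void saxpy(int n, float a, const float *x, float *y) {
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

void scale(int n, float *x) {
  // metadirective: pick a directive variant based on the compilation context.
  #pragma omp metadirective \
      when(device = {arch(nvptx64)}: target teams distribute parallel for) \
      default(parallel for)
  for (int i = 0; i < n; ++i)
    x[i] *= 2.0f;
}
```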

[MOST INTERESTING DONE]

> > I think this is much better than directly having calls
> > to the OpenMP runtime library based on a kernel name stored in an
> > llvm::GlobalVariable.
>
> (Current) implementation is not semantics. There is no reason not to
> change the way we lower OpenMP, e.g., by getting rid of the global
> variables. They are present for a reason but not intrinsically required.
> See the TRegions for example [6,7]: they totally change the GPU lowering,
> making it device agnostic and easy to analyze and optimize in the middle
> end. Arguing that the current encoding of OpenMP in LLVM-IR is problematic
> is the same as arguing MLIR's LLVM dialect doesn't support atomic_rmw; it
> might be true but it's changeable.
>
> [6] The TRegion Interface and Compiler Optimizations for OpenMP Target Regions | SpringerLink
> [7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf
>
>
> > > * I preclude the `omp loop` construct here as it is not even
> implemented
> > > anywhere as far as I know.
> > >
> > >
> > > > 2. It would also be easier to transform the Loop nests containing
> OpenMP
> > > > constructs if the body of the OpenMP operations is well defined
> (i.e.,
> > > does
> > > > not accept arbitrary loop structures). Having nested redundant
> > > "parallel" ,
> > > > "target" and "do" regions seems unnecessary.
> > >
> > > As mentioned above, you cannot start with the assumption OpenMP input is
> > > structured this way. You have to analyze it first. This is the same
> > > reason we cannot simply transform C/C++ `for loops` into `affine.for`
> > > without proper analysis of the loop body.
> > >
> > > Now, more concrete. Nested parallel and target regions are not
> > > necessarily redundant, nor can/should we require the user not to have
> > > them. Nested parallelism can easily make sense, depending on the
> problem
> > > decomposition. Nested target will make a lot of sense with reverse
> > > offload, which is already in the standard, and it also should be
> allowed
> > > for the sake of a modular (user) code base.
> > >
> >
> > Just to be clear, having all three of “target”, “parallel” and “do” doesn’t
> > represent “nested parallelism” at all in the proposed design! ( 2(d) )
> >
> > omp.target {
> >   omp.parallel {
> >     omp.do {
> >       …...
> >     }
> >   }
> > }
> >
> > The above invokes a call to tgt_target() for the code inside omp.do, as
> > mentioned in the proposal.
>
> I do not follow. Just to make sure, the above should be roughly
> equivalent to the code below, correct? There is no "nested"
> parallelism, sure, but I thought you were talking about the case where
> there is, e.g., adding another `#pragma omp parallel` inside the one that
> already is there. That is nested parallelism, which can happen and make
> total sense for the application.
>

No, actually. I was just saying that all three could have been merged into
one OpenMP operation, “target_parallel_for”, in MLIR rather than having three
and analyzing them.

Sure, if they are closely nested without other code nested between them,
you can combine them just fine. Given that Clang does that as well, I'm
unsure what benefit we get from MLIR.

What I was talking about is the general situation of nested parallelism,
e.g., more along the lines of:
    
#pragma omp target
{
   #pragma omp parallel
   { ... some code ... }
   ... some code ...
   #pragma omp parallel
   {
     ... some code ...
     #pragma omp for
     for (...)
     {
       ...
     }
     ... some code ...
     #pragma omp parallel
     {
       ... some code ...
     }
     ... some code ...
   }
   ... some code ...
}

>
> #pragma omp target
> {
>   #pragma omp parallel
>   {
>     #pragma omp for
>     for (...)
>     {
>       ...
>     }
>   }
> }
>

>
> > >
> > > > 3. There would also be new sets of loop structures in new dialects
> when
> > > > C/C++ is compiled to MLIR. It would complicate the number of possible
> > > > combinations inside the OpenMP region.
> > >
> > > Is anyone working on this? If so, what is the timeline? I personally
> was
> > > not expecting Clang to switch over to MLIR any time soon but I am happy
> > > if someone wants to correct me on this. I mention this only because it
> > > interacts with the arguments I will make below.
> > >
> > >
> > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies
> direct
> > > > lowering to LLVM IR ignoring all the advantages that MLIR provides.
> Being
> > > > able to compile the code for heterogeneous hardware is one of the
> biggest
> > > > advantages that MLIR brings to the table. That is being completely
> missed
> > > > here. This also requires solving the problem of handling target
> > > information
> > > > in MLIR. But that is a problem which needs to be solved anyway.
> Using GPU
> > > > dialect also gives us an opportunity to represent offloading
> semantics in
> > > > MLIR.
> > >
> > > I'm unsure what the problem with "handling target information in MLIR"
> is
> > > but
> > > whatever design we end up with, we need to know about the target
> > > (triple) in all stages of the pipeline, even if it is just to pass it
> > > down.
> > >
> > >
> > > > Given the ability to represent multiple ModuleOps and the existence
> of
> > > GPU
> > > > dialect, couldn't higher level optimizations on offloaded code be
> done at
> > > > MLIR level?. The proposed design would lead us to the same problems
> that
> > > we
> > > > are currently facing in LLVM IR.
> > > >
> > > > Also, OpenMP codegen will automatically benefit from the GPU dialect
> > > based
> > > > optimizations. For example, it would be way easier to hoist a memory
> > > > reference out of GPU kernel in MLIR than in LLVM IR.
> > >
> > > While I agree with the premise that you can potentially reuse MLIR
> > > transformations, it might not be as simple in practice.
> > >
> > > As mentioned above, you cannot assume much about OpenMP codes, almost
> > > nothing for a lot of application codes I have seen. Some examples:
> > >
> > > If you have a function call, or any synchronization event for that
> > > matter, located between two otherwise adjacent target regions (see
> > > below), you cannot assume the two target regions will be offloaded to
> > > the same device.
> > > ```
> > > #omp target
> > > {}
> > > foo();
> > > #omp target
> > > {}
> > > ```
> > >
> >
> > These kinds of optimizations are much easier to write in MLIR:
> >
> > The LLVM IR for the above code would contain a series of OpenMP runtime
> > call setup instructions, then foo() in the middle, followed by another set
> > of OpenMP-runtime-related instructions. The bodies of the two target
> > constructs would be in two different outlined functions (if not modules).
> >
> > It takes quite a bit of analysis / transformation code to write any
> > optimization on the generated LLVM IR.
>
> You are right about the modules being a problem. As I mentioned in my
> last email, we are working on that by not having them in different ones
> during the optimization pipeline. If we make the `target` a `parallel`
> instead, we can simulate that right now. The bodies are in different
> functions, sure, but does it matter? Let's walk through parallel region
> expansion (see above [Section 5, 1]) so you can judge for yourself:
>
> ```
> #omp parallel
> { body0 }
> some_code
> #omp parallel
> { body1 }
> ```
>
> will become
>
> ```
> __kmpc_fork_call(..., @body0_fn, ...)
> some_code
> __kmpc_fork_call(..., @body1_fn, ...)
> ```
>
> in IR. Simplified, there are 3 cases here:
> 1) some_code is harmless, meaning all of it can be executed redundantly.
> 2) parts of some_code need to be guarded to be sequential but
>    they can be executed in a parallel region otherwise, e.g., the code
>    will not observe the difference through runtime calls.
> 3) parts of some_code cannot be executed in a parallel region as
>    they might observe the difference through runtime calls.
> First note that you need to do the classification regardless of your
> encoding (=IR). In case of 3) we are done and nothing is happening.
> Let's consider case 2), as 1) is basically a special case of it. As shown
> in the paper [1], you need to broadcast values created by some_code
> across all threads and synchronize appropriately to preserve semantics.
> Other than that, the transformation is straightforward:
>
> A) Create a function "@body01_fn" that is basically the outlined region
> in which the code is then guarded and the __kmpc_fork_call sites are
> replaced by direct calls. It looks like this:
> ```
> call @body0_fn(...)
> #omp master
> some_code
> #omp barrier
> call @body1_fn(...)
> ```
>
> B) Replace the region you put in the new function with a
> __kmpc_fork_call to it:
> ```
> __kmpc_fork_call(..., @body01_fn, ...)
> ```
>
> C) Done.
>
> If you are interested in the implementation I'll add you as a reviewer
> once I put it on Phab. I'm in the process of cleaning up my stand alone
> pass and moving it into the OpenMPOpt pass instead.
>
>
> > vs.
> >
> > MLIR provides a way to represent the operations closer to the source. It
> > is as simple as checking the next operation(s) in the mlir::Block (see the
> > sketch below). An OpenMP target operation contains an inlined region which
> > can easily be fused, split, or subjected to any other valid transformation.
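
As an illustration of the quoted claim, a minimal C++ sketch, assuming an
"omp.target" op exists in the OpenMP dialect; this is illustrative, not an
existing pass:

```
#include "mlir/IR/Operation.h"

// True if `op` is an omp.target op immediately followed by another one in the
// same mlir::Block, i.e., a candidate for fusing the two regions.
static bool followedByAnotherTarget(mlir::Operation *op) {
  if (op->getName().getStringRef() != "omp.target")
    return false;
  mlir::Operation *next = op->getNextNode(); // next op in the enclosing block
  return next && next->getName().getStringRef() == "omp.target";
}
```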
> >
> > Note that you can also perform various control-structure analyses /
> > transformations much more easily in MLIR. For example, you can decide to
> > execute foo() based on certain conditions, and you can merge the two
> > target regions in the else path.
>
> At the end, it's an encoding difference. Sure, the handling might be
> easier in certain situations but all the validity checks, hence code
> analyses, are still required. The actual "rewrite" is usually not the
> hard part.
>
>
> > > Similarly, you cannot assume a `omp parallel` is allowed to be executed
> > > with more than a single thread, or that a `omp [parallel] for` does not
> > > have loop carried data-dependences, ...
> > >
> >
> > With multi-dimensional index support for arrays, wouldn’t it be better to
> > do the data dependence analysis in MLIR?
>
> Yes, probably.
>
>
> > LLVM IR has linearized subscripts for multi-dimensional arrays.
> > llvm::DependenceAnalysis tries to “guess” the indices based on different
> > patterns in SCEV. It takes an intrinsic, or metadata, or some other
> > mechanism of communication from the front end (beyond the built-in set of
> > instructions) to solve this problem.
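
A small illustration of that linearization (not from the thread): in the
snippet below the 2-D subscript becomes, after lowering to LLVM IR, effectively
a single flat index that llvm::DependenceAnalysis has to reconstruct from SCEV,
whereas a memref-style representation keeps the dimensions explicit.

```
void copy_row(float dst[][128], const float src[][128], int i, int n) {
  for (int j = 0; j < n; ++j)
    dst[i][j] = src[i][j];   // roughly dst[i * 128 + j] at the LLVM IR level
}
```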
>
> Not disagreeing with you on this one :wink:
>
> The only caveat is that we still live in a world in which C/C++ is a
> thing.
>
>
That doesn’t mean we should not have optimizations in MLIR for Fortran
:slight_smile:

Agreed. FWIW, I never tried to say that.

>
>
> > > Data-sharing attributes are also something that has to be treated
> > > carefully:
> > > ```
> > > x = 5;
> > > #omp task
> > > x = 3;
> > > print(x);
> > > ```
> > > Should print 5, not 3.
> > >
> >
> > You can have “x” as a locally defined variable inside the region contained
> > in the “task” in MLIR, OR custom data-sharing attributes in the OpenMP
> > dialect.
>
> I'm not saying it is impossible or even hard, but maybe not as
> straightforward as one might think. Your encoding is, for example, very reasonable.
>
> In the example below you need to print 3, not 5, e.g., constant prop on
> the outer level should not happen.
>
>
Yes, that is why I am concerned about the design considerations of clauses
like shared, map, firstprivate, etc.

> ```
> x = 5;
> #omp task shared(x)
> {
> x = 3;
> some_form_of_sync();
> ...
> }
> some_form_of_sync();
> print(x);
> ```
>
>
> > > I hope I convinced you that OpenMP is not trivially mappable to
> existing
> > > dialects without proper analysis. If not, please let me know why you
> > > expect it to be.
> > >
> > I do not see much reason why the issues you mentioned can’t trivially be
> > mapped to the MLIR infrastructure. There is an easy way to define custom
> > operations / types / attributes in OpenMP dialect and perform
> optimizations
> > based on the *IR that is created especially for OpenMP*. The analysis /
> > transformations required can be easily written on the custom operations
> > defined rather than having a lowered form in the LLVM IR.
>
> You can totally define your OpenMP dialect and map it to that. Mapping
> to other dialects is the problematic part.

Yes, this is why I have mentioned that the RFC should talk about the following:

"I think it is important to specify (in the design) which existing MLIR
dialects are supported in this effort and the various lowerings /
transformations / optimizations which are planned for them."

Seems reasonable to me, though I'm *not* an MLIR person.

Cheers,
  Johannes

Just to be clear, are you suggesting that if the OpenMPIRBuilder is used, MLIR will have to be used as a pass-through without the optimizations?

No, I am not suggesting that. For example, we can perform optimizations in the MLIR layer, like constant propagation into the OpenMP region, barrier reduction, loop transformations, etc., and still use the OpenMPIRBuilder to generate IR for the OpenMP constructs. With the collapse example, I was just pointing out that we need not always use the OpenMPIRBuilder.

The OpenMPIRBuilder provides us with a fast and easy-to-use mechanism to generate IR for OpenMP constructs which is as good as what Clang generates. It is our choice whether to use it or not. But if we lower all the OpenMP operations to the LLVM dialect, then we cannot use this mechanism.
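
For context, a minimal sketch of what using the OpenMPIRBuilder from the
MLIR-to-LLVM-IR translation looks like; the signatures are approximate,
based on the builder around the time of D72962:

```
#include "llvm/Frontend/OpenMP/OMPIRBuilder.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Emit the runtime calls for an `omp barrier` at the current insertion point,
// using the builder code shared with Clang instead of hand-written lowering.
void emitBarrier(llvm::Module &M, llvm::IRBuilder<> &Builder) {
  llvm::OpenMPIRBuilder OMPBuilder(M);
  OMPBuilder.initialize();
  OMPBuilder.CreateBarrier({Builder.saveIP()}, llvm::omp::OMPD_barrier);
}
```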

Thanks,
Kiran

Please find the reply inline below

[...]

About Clang / MLIR / LLVM being loosely related and not being relevant in
llvm-dev:

With the introduction of the OpenMPIRBuilder in MLIR (from this review:
https://reviews.llvm.org/D72962, "[MLIR, OpenMP] Translation of OpenMP barrier
construct to LLVM IR"), Clang and MLIR would now share the common code for
building OpenMP constructs. I do not think it is so loosely related anymore.
Note that the MLIR and Clang frontends for LLVM are very different: Clang
emits LLVM IR with almost no optimizations, while MLIR already supports a
considerable amount of optimization.

The decision to use the OpenMPIRBuilder for MLIR was discussed in the
following flang-dev threads (please correct me if I am missing some newer
discussions on the below topics).

I still am unsure about a basic detail that seems really important:

Could you explain to me what the differences/benefits are between
lowering OpenMP Ops with something we call CGOpenMP[0] into LLVM dialect
(as far as I understand this) instead of using the OpenMPIRBuilder to
generate LLVM-IR from these Ops?

I am unsure because the functions/functionality in those two files look
pretty similar to me, except that the CGOpenMP solution requires the
entire* runtime call encoding of clang to be duplicated and maintained
inside MLIR [1] while it is shared in a single space for the
OpenMPIRBuilder.

* For comparison: OpenMPLowering in FC lowers to 9 runtime calls [1].
                  Clang for the host runtime part knows about 64 [2].

Once the above is sorted out we can discuss other things, but IMHO the
arguments are all over the place right now, which makes it hard to
justify anything. I mean, the OpenMPIRBuilder is about creating OpenMP
runtime calls for OpenMP directives. To me it seems a lot of the
arguments below talk about high-level transformations on MLIR, which I
have a hard time relating to the OpenMPIRBuilder.

Cheers,
  Johannes

[0] https://github.com/compiler-tree-technologies/fc/blob/master/lib/codegen/CGOpenMP.cpp
[1] https://github.com/compiler-tree-technologies/fc/blob/master/lib/transforms/OpenMPLowering.cpp
[2] https://github.com/llvm/llvm-project/blob/master/clang/lib/CodeGen/CGOpenMPRuntime.cpp#L568

No, I’m asking you. You quoted Kiran and you concluded from this quote “wouldn’t it be better to delay the submission […]”. I am questioning this aspect in particular when I wrote “are you thinking that we need to first decide this once and for all before making progress on building this path?”
This question and the following are important to answer; it isn’t clear to me that you did so in your answer below. In particular, “Do you think starting like this will make it significantly harder to transition away from the builders if this is what we want?” is important: even if using the OpenMPIRBuilder would be suboptimal in the long term, how much of a problem would it be to replace it later? It seems to me that it shouldn’t limit anything, unless you plan to write optimizations on the LLVM dialect itself.

Best,