[RFC] - Lowering and codegen of the depend clause on target ops in the omp dialect in MLIR

bhandarkar-pranav · April 19, 2024, 8:52pm

I am posting this RFC as per @kiranchandramohan’s feedback on my PR.

Goal

The goal of this RFC is to get feedback on the approach I have used to implement translation of the depend clause on OpenMP offloading directives (omp.target, omp.target_enter_data, omp.target_exit_data and omp.target_update_data)

Background

A few weeks ago, I added support for the depend clause on OpenMP target directives. Here are the relevant PRs

#81081 - Support in MLIR for the depend clause in OpenMP target offloading MLIR ops
#81601 - Lowering the depend clause into MLIR in flang

This is needed so that we can use the depend clause in Fortran.

Translation to LLVMIR

The next step in compilation is to translate this into LLVMIR. In Clang, this is done by enclosing the offloading call (target directive) inside a new OpenMP task when lowering to LLVMIR. The dependencies are then associated with the newly created task. This is convenient to implement and also consistent with the OpenMP standard.

From the 5.2 spec -

The target construct generates a target task. The generated task region encloses the target region.
If a depend clause is present, it is associated with the target task

High-level abstraction and progressive lowering benefit of MLIR

The obvious choice to implement this translation was to move Clang’s implementation into OpenMPIRBuilder. However, I believe this transformation is a higher level problem that MLIR is well suited to solve. So, I implemented a transformation that converts a target offloading omp MLIR op (eg. omp.target) with depend into an omp.task with the same depend clause. The new omp.task now encloses the original omp.target without the depend clause on it.

omp.target depend(..) {
  omp.terminator
}

is transformed into

omp.task depend(..) {
  omp.target {
    omp.terminator
  }
  omp.terminator
}

I have two PRs that are alternatives to each other open for this.

The first approach is an MLIR pass that does the above transformation -
#83966.
The second approach is to do the transformation ‘on-the-fly’ as we lowering to FIR in flang - #85130
The benefit of the first approach is that it is high-level language-agnostic while the second approach doesn’t require a full pass over the code and does the right transformation early while no other changes in MLIR are needed.

I would love to hear thoughts on this especially from the point of view of not going the OpenMPIRBuilder route as I believe, like I said earlier, this transformation is more suited to MLIR.

bhandarkar-pranav · April 19, 2024, 8:54pm

@Meinersbur and @jdoerfert - @kiranchandramohan recommended I seek your input on this.

Meinersbur · April 22, 2024, 11:15am

In essence, I would prefer going the OpenMPIRBuilder route. The reasons are the same as why we introduced the OpenMPIRBuilder in the first place: Common code base, re-use existing Clang implementation as much as possible, consistency of implementations, only one lowering to maintain. If one day for Clang -fopenmp-enable-irbuilder becomes standard, we still need another implementation in OpenMPIRBuilder.

For this concretely, this rather seems to be a question of modelling. Either one directive one operation, or logical nesting as by @skatrak’s wrapper approach. In the latter that would be the canonical representation (depend or no depend), not a lowering. But be careful because a target task is not the same as a task that executes something synchronously on the device, so the lowering using OpenMPIRBuilder would need to handle a omp.task with a omp.target inside differently than without one.

bhandarkar-pranav · April 24, 2024, 5:33am

I had a discussion with @Meinersbur about this and he elaborated on how a target task is different from a regular #pragma omp task used by the programmer. In particular, he mentioned that the runtime calls made to the OpenMP library for a target task are different from the ones made for user-specified OpenMP tasks (#pragma omp task).

This difference will have to be dealt with in translation.
So, I am now exploring two approaches

Peel out the commonalities in the target directives (omp.target, omp.target_enter_data) into a new op and use wrappers around that much in the same way as @skatrak is treating loops in his work.
Or, Implement translation entirely in OpenMPIRBuilder by moving code from codegen in clang to handle the target task.

skatrak · April 24, 2024, 11:57am

If the implicit task associated to a target region has some different behavior or restrictions than a regular #omp task, I agree that adding a parent omp.task is not an option. I believe the way to go would be to deal with it at the OpenMP to LLVM IR / OpenMPIRBuilder level rather than adding another sort of “wrapper” operation.

We can extract common features by creating a TaskInterface OpenMP dialect interface that exposes them, and then have the applicable operations implement it, rather than creating a wrapper operation to hold these. Then this information might be processed in the same way or differently during translation to LLVM IR depending on the actual operation (omp.target, omp.task, omp.target_enter_data, …).

Wrapper operations were chosen to represent loops because they can be stacked in different ways to represent multiple composite and standalone constructs in an extensible way. I see it as sort of an exceptional solution for a particular problem. I think in this case the main disadvantage of that approach of introducing context-dependent operations isn’t compensated by its benefits.

Those are at least my first thoughts on this issue, but if that’s generally preferred I think it’s still possible to make it work.

bhandarkar-pranav · April 25, 2024, 7:54pm

Thank you for your comment @skatrak
Did you perhaps mean TaskInterface to be an OpInterface in the OpenMP dialect and not a Dialect Interface?

skatrak · April 26, 2024, 9:47am

Yeah, exactly. Sorry for not explaining myself properly!

Meinersbur · April 26, 2024, 8:11pm

I still think it is possible, but means at whenever there is a difference between a regular task and a target task (that is potentially in a lot of places), an condition must check whether the task wraps a target. I had a similar concern with the wrappers for loop-associated constructs where it may make a semantic difference whether something is wrapped by another wrapper or not.

What we could benefit from is that for some clauses such as depend that equally apply to either kind of task to be handled by the same logic. At this point I do not know what choice offers the better tradeoff. A TaskInterface might be the right compromise.

bhandarkar-pranav · April 29, 2024, 6:47pm

Thank you once again, @Meinersbur and @skatrak
I looked at how Clang lowers a target construct with depend clause. I compared it with the OpenMP 5.2. spec as well. Here are my thoughts.

The only time that clang creates and lowers an outer task around the lowering of the target construct is
- If the depend clause is used. This is just as well, because here is what the spec says
If a depend clause is present, it is associated with the target task

I take this to mean that we have to “manifest” the target task to “associate” the depend clause with it.
- Or, if the nowait clause is used. Again, from the spec
If the nowait clause is present, execution of the target task may be deferred

Again, this means that the target task is a deferred task just like a vanilla task.

NOTE: an outer task is also created for the in_reduction clause or thread_limit clause but I consider that outside the scope of what we are discussing right now)

Case 1 - #pragma omp target depend(..)
In the presence of depend and no nowait, the code generated does the following

Wait for dependencies to be satisfied.
Signal task-begin
Execute the undeferred and included task
Signal task end

Remember,

If the nowait clause is not present, the target task is an included task.

Further, this is the definition of an included task

A task for which execution is sequentially included in the generating task region.
That is, an included task is undeferred and executed by the encountering thread

Here is the LLVM IR.

  %22 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %0, i32 1, i64 64, i64 8, ptr @.omp_task_entry..3)
  ; 
  ; privatization and dependency setup code.
  ;
  
  ; wait for dependencies to be satisfied 
  call void @__kmpc_omp_taskwait_deps_51(ptr @1, i32 %0, i32 1, ptr %30, i32 0, ptr null, i32 0)
  ; Signal task begin
  call void @__kmpc_omp_task_begin_if0(ptr @1, i32 %0, ptr %22)
  ; Execute the undeferred task
  %36 = call i32 @.omp_task_entry..3(i32 %0, ptr %22) #4
  ; Signal task end
  call void @__kmpc_omp_task_complete_if0(ptr @1, i32 %0, ptr %22)
  %37 = load i32, ptr %a, align 4
  ret i32 %37

Case 2 - #pragma omp target depend(..) nowait
Since the nowait task is present, the target task can be deferred. Here is the LLVM IR

%22 = call ptr @__kmpc_omp_target_task_alloc(ptr @1, i32 %0, i32 1, i64 64, i64 8, ptr @.omp_task_entry..3, i64 -1)
;
; privatization and dependency setup code.
;
%36 = call i32 @__kmpc_omp_task_with_deps(ptr @1, i32 %0, ptr %22, i32 1, ptr %30, i32 0, ptr null)

Now, my PR, makes the following transformation
Input code:

omp.target depend(..) {
}

to

omp.task depend(..) {
   omp.target {
   }
}

The LLVM IR for the above is

 %17 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %0, i32 1, i64 40, i64 8, ptr @.omp_task_entry..2)
 ;...
 %27 = call i32 @__kmpc_omp_task_with_deps(ptr @1, i32 %0, ptr %17, i32 1, ptr %21, i32 0, ptr null)
 %28 = load i32, ptr %a, align 4

That it, a deferred task is generated for the target task. This violates the spec which calls for the target task to be an included task i.e undeferred and executed by the encountering thread

However, when the if clause on the task construct evalues to false, it results in an undeferred task. So we can change the transformation to

omp.target depend(..) {
}

to

omp.task depend(..) if(0) {
   omp.target {
   }
}

Sure enough, now I see this LLVM IR

%17 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %0, i32 1, i64 40, i64 8, ptr @.omp_task_entry..2)
...
...
call void @__kmpc_omp_taskwait_deps_51(ptr @1, i32 %0, i32 1, ptr %21, i32 0, ptr null, i32 0)
call void @__kmpc_omp_task_begin_if0(ptr @1, i32 %0, ptr %17)
%27 = call i32 @.omp_task_entry..2(i32 %0, ptr %17) #4
call void @__kmpc_omp_task_complete_if0(ptr @1, i32 %0, ptr %17)
%28 = load i32, ptr %a, align 4

To summarize, here is the equivalence

User Code	Equivalent code
`#pragma omp target depend(..)`	`#pragma omp task depend(..) if(0) { #pragma omp target}`
`#pragma omp target depend(..) nowait`	`#pragma omp task depend(..) if(1) { #pragma omp target}`

skatrak · April 30, 2024, 2:40pm

Thank you @bhandarkar-pranav for the detailed research on this. If I understand your findings correctly, then it appears that it would be standard-conforming to transform (even at the source code level) a TARGET construct by moving all clauses related to the implicit task it defines into an explicit parent TASK construct. The only extra thing we would need to make sure of is that the IF clause of the parent TASK is set according to the presence of a NOWAIT clause.

In this case, I should say I now lean towards implementing support for these TASK-related clauses on TARGET constructs by creating an explicit TASK parent rather than making changes to the OpenMPIRBuilder. However, I can see two ways we could potentially achieve this:

An MLIR pass: This was your original proposal, and it’s relatively simple to implement. One property it has is that the omp.target operation will still be able to hold task-related operands, and those will be present or not depending on where in the compilation flow they are checked.
A PFT rewrite pass: Since this change can also be represented in the source, another option is to do it earlier (e.g. extending the OmpRewriteMutator in flang/lib/Semantics/rewrite-directives.cpp). One feature of this is that the omp.target operation will no longer have to support representing any TASK-related clauses, since there will always be a parent omp.task whenever necessary. So it won’t see its representation changed in that way depending on the compilation stage we’re on. It might also be quite a bit trickier to implement.

Additionally, there’s the option of dealing with this directly during PFT to MLIR lowering, which would also allow a single representation of omp.target. However, this wasn’t the preferred alternative back when I initially proposed it, so maybe it’s just not the right way to go in this case.

bhandarkar-pranav · April 30, 2024, 4:40pm

Thank you @skatrak. Of the three alternatives you mention -

An MLIR pass: This was your original proposal, and it’s relatively simple to implement. One property it has is that the omp.target operation will still be able to hold task-related operands, and those will be present or not depending on where in the compilation flow they are checked.

Indeed, this was my original proposal. It is an open PR here → [mlir][OpenMP] - Transform target offloading directives for easier translation to LLVMIR by bhandarkar-pranav · Pull Request #83966 · llvm/llvm-project · GitHub

A PFT rewrite pass: Since this change can also be represented in the source, another option is to do it earlier (e.g. extending the OmpRewriteMutator in flang/lib/Semantics/rewrite-directives.cpp). One feature of this is that the omp.target operation will no longer have to support representing any TASK-related clauses, since there will always be a parent omp.task whenever necessary. So it won’t see its representation changed in that way depending on the compilation stage we’re on. It might also be quite a bit trickier to implement.

I haven’t tried this alternative yet.

Additionally, there’s the option of dealing with this directly during PFT to MLIR lowering, which would also allow a single representation of omp.target. However, this wasn’t the preferred alternative back when I initially proposed it, so maybe it’s just not the right way to go in this case.

Funnily enough, before I went on leave, I had been advised by the community to explore this approach and the subsequent PR is here → [flang][OpenMP] - Transform target offloading directives with dependencies during PFT to MLIR conversion by bhandarkar-pranav · Pull Request #85130 · llvm/llvm-project · GitHub

The first alternative (MLIR Pass) is the cleanest IMHO because it gets a chance to operate on well-formed omp.target op. It’ll however require changes to Flang so that we invoke the pass at the right place, which again, I do not think is a big deal.

The last alternative (PFT->MLIR Lowering) is possible as my PR shows, but it requires us to write a clause processing method that concerns itself with two clauses (We need to process depend and nowait both at the same time to arrive at the right if clause for the new omp.task we are to create) thereby breaking the nicely modular design of the ClauseProcessor.

jdoerfert · May 2, 2024, 8:00pm

A target * directive, with or without depend, with or without nowait, should be lowered by the OpenMPIRBuilder.

What you are discussing here is making implementation choices, partially bad ones, explicit in the MLIR layer. There is little point. All the semantics are already in the target * directive with the respective clauses attached. No need to make the implementation explicit early on. This is especially true because 1) we cannot reuse whatever you do for clang (see @Meinersbur first comment and -fopenmp-enable-irbuilder), and 2) we need to change two places as we get rid of the requirement that dependences (and nowait) introduce an explicit “task” around the target. That implementation is very convenient, but not very helpful to us. We’d rather feed the information into the offload runtime and hope it can resolve the dependences without any explicit task, e.g., via events.

Long story short:

Keep omp target .... in MLIR around and lower it in the OpenMP IR Builder.
Move the clang code and use the IR Builder route for clang as well.

My 2c.

Meinersbur · May 6, 2024, 11:38am

Independent of what is semantically equivalent, Clang currently calls __kmpc_omp_target_task_alloc for target tasks instead of __kmpc_omp_task_alloc as for regular tasks. Compiler Explorer vs Compiler Explorer .

That is, it is not possible to have the complete lowering/rewriting to happen in MLIR (unless we also lower to __kmpc_omp_target_task_alloc in MLIR itself, effectively abandoning the OpenMPIRBuilder), even if it is such an IsTargetTask argument passed to OpenMPIRBuilder::createTask I would not like partial or lowering only in certain cases in an MLIR while other things/cases are done by the OpenMPIRBuilder. I also don’t see any issue with adding storage for depend(...) arguments to the omp.target operation. If instead we want to have a dedicated operation for holding task-related information (wait/nowait, depend, priority, threadset, affinity, …) so nothing else has to, I think that would qualify as a wrapper. I’d avoid anything in-between.

Similarly, an undeferred task, although emitted as __tgt_target_kernel, is still considered a task in the OpenMP spec. I don’t think we need to model the difference in MLIR other than with a flag.

bhandarkar-pranav · May 6, 2024, 1:42pm

Thanks, @Meinersbur and @jdoerfert - I have started working on moving clang’s codegen to OMPIRBuilder.

One related question, is there any formal documentation on the OpenMP runtime library API as implemented in llvm-project/openmp?

jdoerfert · May 8, 2024, 10:31am

No.

bhandarkar-pranav · May 8, 2024, 5:22pm

Meinersbur:

Independent of what is semantically equivalent, Clang currently calls __kmpc_omp_target_task_alloc for target tasks instead of __kmpc_omp_task_alloc as for regular tasks. Compiler Explorer vs Compiler Explorer .

That is, it is not possible to have the complete lowering/rewriting to happen in MLIR (unless we also lower to __kmpc_omp_target_task_alloc in MLIR itself, effectively abandoning the OpenMPIRBuilder), even if it is such an IsTargetTask argument passed to OpenMPIRBuilder::createTask I would not like partial or lowering only in certain cases in an MLIR while other things/cases are done by the OpenMPIRBuilder. I also don’t see any issue with adding storage for depend(...) arguments to the omp.target operation. If instead we want to have a dedicated operation for holding task-related information (wait/nowait, depend, priority, threadset, affinity, …) so nothing else has to, I think that would qualify as a wrapper. I’d avoid anything in-between.

Fair point. It does this if the the target construct has the nowait clause on it. I’ll be moving this logic over as I work on moving lowering target to OMPIRBuilder.

Topic		Replies	Views
RFC for omp.target construct MLIR	13	1135	July 1, 2021
MLIR omp.target for gpu offloading MLIR	23	662	November 8, 2023
[RFC] OpenMP reduction support MLIR	15	1719	February 5, 2024
[RFC] Adding new Util file to MLIR Dialect Translation MLIR	2	268	February 7, 2023
MLIR Summit OpenMP Roundtable discussion (Summary) MLIR llvm	10	1133	January 11, 2023

[RFC] - Lowering and codegen of the depend clause on target ops in the omp dialect in MLIR

Goal

Background

Translation to LLVMIR

High-level abstraction and progressive lowering benefit of MLIR

Related Topics