Background
OpenMP, OpenACC, and do concurrent all provide utilities to control the locality/privatization/shareability of data items within the scopes of their constructs. For OpenMP, we have the private, firstprivate, and lastprivate clauses to control how a data item is to be privatized within the scope of a construct. For do concurrent, the user can use the local and local_init locality specifiers to achieve a goal similar to that of those OpenMP clauses. There are some semantic differences between the OpenMP clauses and the do concurrent specifiers. For example, in the do concurrent case a privatized item is created/allocated for each iteration of a concurrent loop, while in the OpenMP case the same allocation might be used for all iterations of a chunk. However, on the syntactic level, these constructs are quite similar.
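To make the syntactic similarity concrete, here is a minimal sketch that writes the same loop once with OpenMP clauses and once with do concurrent locality specifiers. The program, variable names, and values are invented for illustration. It assumes a compiler that accepts the Fortran 2018 locality specifiers; if OpenMP is not enabled, the directives are treated as comments and the first loop simply runs sequentially.

```fortran
program privatization_demo
  implicit none
  integer, parameter :: n = 8
  integer :: i, tmp, offset
  integer :: a(n), b(n)

  offset = 100
  tmp = -1

  ! OpenMP: each thread gets its own uninitialized copy of tmp (private)
  ! and its own copy of offset initialized from the original (firstprivate).
  !$omp parallel do private(tmp) firstprivate(offset)
  do i = 1, n
    tmp = i * 2
    a(i) = tmp + offset
  end do
  !$omp end parallel do

  ! do concurrent: each iteration gets its own undefined tmp (local)
  ! and its own copy of offset initialized from the outer one (local_init).
  do concurrent (i = 1:n) local(tmp) local_init(offset)
    tmp = i * 2
    b(i) = tmp + offset
  end do

  ! Both loops compute the same values.
  if (any(a /= b)) error stop "results differ"
  print *, a
end program privatization_demo
```

Note that the clause and specifier lists line up one-to-one here: private(tmp) corresponds to local(tmp), and firstprivate(offset) corresponds to local_init(offset).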
A similar observation can be made for the reduction clause in OpenMP and reduce specifier in do concurrent constructs.
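The same comparison for reductions can be sketched as follows. Again, the program and variable names are illustrative only. The reduce specifier was added in Fortran 2023, so this assumes a compiler that already supports it.

```fortran
program reduction_demo
  implicit none
  integer, parameter :: n = 100
  integer :: i, sum_omp, sum_dc

  ! OpenMP: reduction clause with the + operator.
  sum_omp = 0
  !$omp parallel do reduction(+:sum_omp)
  do i = 1, n
    sum_omp = sum_omp + i
  end do
  !$omp end parallel do

  ! do concurrent: reduce locality specifier with the same operator.
  sum_dc = 0
  do concurrent (i = 1:n) reduce(+:sum_dc)
    sum_dc = sum_dc + i
  end do

  ! Both loops accumulate the sum of 1..n.
  if (sum_omp /= sum_dc) error stop "sums differ"
  print *, sum_dc
end program reduction_demo
```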
This RFC will mainly focus on OpenMP and do concurrent since I am not as familiar with OpenACC. However, the discussion should be extendable to OpenACC as well.
Current status in flang (and relevant MLIR dialects)
Each of the 3 programming models (OpenMP, OpenACC, and do concurrent) implements the above utilities on its own and, in some cases, differently from the other models.
Delayed privatization vs. early privatization
Over the past year, the OpenMP dialect implemented “delayed privatization”. With delayed privatization, privatization clauses are modeled in the IR and only lowered (or inlined) as late as possible in the pipeline, in particular, when MLIR is lowered to LLVM. For example, consider the following Fortran input:
!$omp target private(simple_var)
simple_var = 10
!$omp end target
When delayed privatization is enabled (which is currently the default for most OpenMP constructs), flang emits a separate operation to encapsulate the privatization logic:
omp.private {type = private} @_QFtarget_simpleEsimple_var_private_i32 : i32
and links this op to the relevant OpenMP construct by referencing its symbol:
omp.target private(@_QFtarget_simpleEsimple_var_private_i32 %2#0 -> %arg0 : !fir.ref<i32>) {
  // ... use %arg0 within the construct's scope ...
}
Note that such delayed privatizers become more complex when they model firstprivate (to express the copying logic) or more complex data types (e.g. to clean up allocatables).
This is not a new or unique idea, since OpenACC models privatization in a similar way through its acc.private.recipe operation, which has very similar syntax, semantics, and usage to omp.private.
The opposite of delayed privatization is early/eager privatization. In this case, instead of modeling the privatization logic in a separate op, we inline that logic early within the construct on which the privatization is specified. The obvious downside is that the privatization logic and the parent construct are intermingled, which reduces debuggability within the compiler’s pipeline. At the moment, do concurrent locality specifiers are still modeled using early privatization.
Modeling reduction
For reduction, the 3 programming models have their own separate but very similar approaches as well. For example, OpenMP has the omp.declare_reduction op while OpenACC has the acc.reduction.recipe op. do concurrent does not use a separate op but models reductions using attributes that store the reduction operation, e.g. reduce(#fir.reduce_attr<add> -> %sum : ....).
Proposal
This RFC proposes starting a new separate dialect to model privatization/locality as well as reduction clauses/specifiers across OpenMP, OpenACC, and do concurrent. In particular, such a dialect would contain the following ops as a start:
- One operation that merges both omp.private and acc.private.recipe. The same op can then be used to model delayed privatization for do concurrent’s local and local_init specifiers.
- One operation that merges both omp.declare_reduction and acc.reduction.recipe. The same op can then be used for do concurrent’s reduce specifier as well.
Proof of concept
To provide a more concrete idea of what this looks like, a proof-of-concept was implemented here. This PoC does not create a new dialect but rather reuses OpenMP table-gen constructs for modeling do concurrent’s local and local_init specifiers. The PoC is divided into a number of commits, each of which is self-contained and focused on a specific part of the pipeline, e.g. there is a commit for lowering from PFT to MLIR, a commit for parsing and printing, a commit for lowering between relevant MLIR constructs, etc. Tests are also included to showcase the resulting MLIR.
Productizing the PoC
As mentioned, the PoC does not actually start a new dialect but rather reuses some of the OpenMP table-gen records in the FIR dialect. The next steps towards productizing this PoC might be the following:
- Moving the used OpenMP records to the new dialect. In particular, moving the OpenMP_Clause, OpenMP_PrivateClauseSkip, and OpenMP_PrivateClause records and generalizing their names as appropriate.
- Using the generalized OpenMP_PrivateClause in both the OpenMP and FIR dialects, similar to what the PoC currently does.
- Using the generalized OpenMP_PrivateClause in the OpenACC dialect.
- Doing a similar round of changes for reductions.
Questions
Any feedback on the above is, of course, welcome. However, a few questions to start:
- Are there any expected blockers to having such a dialect? I might be missing intricate details specific to the relevant programming models. Therefore, I am interested to know if there are any major issues with having shared constructs/table-gen records across the 3 dialects.
- If this new dialect is a reasonable idea, are there any suggestions for naming the dialect as well as the private/local-related records?