[RFC] omp.module and omp.function vs. dialect attributes to encode OpenMP properties

We are in the process of implementing target offloading for OpenMP in Flang. Some of the required information would naturally be represented as additional attributes on functions/modules, e.g. whether a function is a target or host function, variants, etc. Similarly, at the module level we need to know whether the module is being compiled for the host or for a device, and there are various directives that might be present.

One option is to add omp.module and omp.function ops to the omp dialect and define these attributes on them. The other option is to use dialect attributes on the builtin module/function ops. What is the better approach, and what are the criteria for choosing one approach over the other?

Generally, new ops are recommended, since dialect attributes are technically “discardable” attributes, so you would need to audit your whole pass pipeline to ensure that their semantics are preserved.

We haven’t been super diligent about this in the past, though, upstream or in the ecosystem, but all the systems that I have seen that use their own ops end up working extremely well.

* *inherent attributes* are inherent to the definition of an operation’s semantics. The operation itself is expected to verify the consistency of these attributes. An example is the `predicate` attribute of the `arith.cmpi` op. These attributes must have names that do not start with a dialect prefix.
* *discardable attributes* have semantics defined externally to the operation itself, but must be compatible with the operation’s semantics. These attributes must have names that start with a dialect prefix. The dialect indicated by the dialect prefix is expected to verify these attributes. An example is the `gpu.container_module` attribute.
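
To make the distinction concrete, here is a small illustration using existing upstream ops (just a sketch to show the two kinds of attributes):

// `slt` is the inherent `predicate` attribute of arith.cmpi; the op itself verifies it.
func.func @cmp(%a: i32, %b: i32) -> i1 {
  %0 = arith.cmpi slt, %a, %b : i32
  return %0 : i1
}

// `gpu.container_module` is a discardable, dialect-prefixed attribute on the
// builtin module; the gpu dialect verifies it, and unaware passes may drop it.
module attributes {gpu.container_module} {
}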

Thank you for the feedback. Yes, the inherent vs. discardable attribute distinction was the primary reason why we were considering adding the ops. To argue the other side: since modules and functions are rarely transformed to the extent that the attributes would be discarded, would adding new ops for high-level constructs such as modules and functions cause problems for existing passes?

Thanks @jansjodi for starting this discussion.

I see that other dialects (gpu/spirv) that have device/offloading flows have their own module and function operations. It would be interesting to know whether this is just for the attributes or for something more. I remember that having multiple modules is an advantage in MLIR, but at the same time, since LLVM does not support multiple modules, it was not clear whether we could leverage this in a final conversion to LLVM.

The OpenMP dialect is currently designed to work and co-exist with other dialects. I guess this is different from something like the spirv dialect (which is fairly well contained). I suspect that creating a function operation in the OpenMP dialect might be disruptive for existing flows, though off the top of my head I cannot think of a disruption beyond any place in the code where we specifically check or cast for function operations. For the immediate requirement of Flang, we should check whether adding such an omp.function operation causes any issues. Flang uses the builtin module and the func.func function op, but I think it uses the fir.call operation; we have to check whether there is anything special there. Also, the OpenMP dialect is currently translated to LLVM IR along with the LLVM dialect. We have to check whether there will be any issues there as well.

One alternative possibility is to create interfaces that the function or module ops should implement so that they can be processed for target offloading. The interface could require that the offloading attributes be present.

  • There are several single-source offload programming models: at least CUDA, OpenMP, and OpenACC, plus SYCL.
  • Flang’s semantic analysis should be able to detect loops with a target directive and which functions are called inside them. I don’t know how OpenMP target works with cross-TU calls.
  • Maybe there is space for an offload dialect on top of the OpenMP and OpenACC dialects to host the necessary interfaces, etc.

Offload is not that special. All “(declare) target” functions need to be compiled for the targeted devices, but other than that it is like any other compilation. You can call across TUs. You can take pointers, etc.
I’m not sure why we need so much machinery here at the MLIR level. The frontend can (and should) determine what needs to be compiled for the target, and simply ignore the rest of the code during the target compilation run. We create a regular MLIR module with the target architecture set.

Where do I need special ops for a module? If it’s only to encode the fact that this is a target module, there should be an easier way to set a single “module metadata” bit.
Similarly, “target/openmp” functions are just functions. I don’t understand what a new op solves here.
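
For instance, the “module metadata bit” could be as small as one discardable attribute on the existing builtin module. A minimal sketch, assuming a hypothetical omp.is_device spelling (a placeholder, not a settled name):

// Hypothetical discardable attribute marking device compilation; the name
// `omp.is_device` is illustrative only.
module attributes {omp.is_device = true} {
}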

As far as I know Flang creates one MLIR module and translates everything into one LLVM module.

Does the OpenMPIRBuilder support, say, CUDA + AArch64 in one LLVM module?

Does Flang need to create one CUDA MLIR module + one LLVM IR module, and one AArch64 MLIR module + one LLVM IR module? And is the driver in charge of handling the artefacts?

As far as I know Flang creates one MLIR module and translates everything into one LLVM module.

One module per target. This is the same as in clang (for all our offloading models).

Does the OpenMPIRBuilder support, say, CUDA + AArch64 in one LLVM module?

CUDA is not a target. NVPTX and AArch64 cannot be mixed because LLVM (modules/contexts) do not allow that (reasonably). That is irrelevant at this point since we perform a full compilation per target.

Does Flang need to create one CUDA MLIR module + one LLVM IR module, and one AArch64 MLIR module + one LLVM IR module? And is the driver in charge of handling the artefacts?

Yes (if I understand what you’re saying), and yes (already today).

We are on the same page. Still, it needs a wider discussion.

It seems like it would be an inherent attribute, which means an op is recommended, but it also seems like a big thing to add a new op. Is adding an op a big thing in MLIR, or is a new module/function op just a module/function with some extra stuff? From an LLVM perspective, adding an op would probably be unreasonable.

Having the frontend ignore (remove?) everything that is not going to be compiled may not be desirable. For example, an input to a target region could be determined to be constant if the host code can be analyzed (maybe through interprocedural analysis).

Does the driver invoke Clang twice in OpenMP target mode? Once for host and once for every other target?

It seems like it would be an inherent attribute, which means an op is recommended, but it also seems like a big thing to add a new op.

If all you want is a “flag”, add a global.

extern weak_odr OpenMP_Device_Code;

In the translator to LLVM IR you transform it into what clang uses:

!llvm.module.flags = !{!11, !12}
!11 = !{i32 7, !"openmp", i32 51}
!12 = !{i32 7, !"openmp-device", i32 51}

We really don’t need more.

Having the frontend ignore (remove?) everything that is not going to be compiled may not be desirable. For example, an input to a target region could be determined to be constant if the host code can be analyzed (maybe through interprocedural analysis).

That doesn’t work (as easily as one might think). And we rehash this discussion every few months.
The easiest example to break this is using pre-processor directives.

#ifdef __SSE2__
const int VF = 4;
#else
const int VF = 1;
#endif

Now if you use the macros defined for one target to determine values used on the other target you will create mismatches.

Other things likely to break (or cause trouble) if you start mixing code targeting different architectures:
(unavailable or different) types, builtins, predefined macros, predefined/library functions, …

Does the driver invoke Clang twice in OpenMP target mode? Once for host and once for every other target?

Yes. As I mentioned before, this is what all our drivers (flang + clang) do for all our offloading models (cuda, OpenMP, …).

That is one way of doing it. Is that the right thing to do in MLIR? There are many ways to do this, but I guess my primary concern is that we are following best practices and that we use the infrastructure that is provided.

I see, yes, there is no way (in the general case) to reason about code that is not defined for the current device.

But the workflow will still be different from OpenMP target with Clang. In target mode, Flang would have to somehow add a marker to `declare variant` functions. An interface?

Flang needs to outline the target regions? Flang just creates MLIR and at the end of the pass pipeline it talks to the OpenMPIRBuilder. But there is a gap in between.

Is it possible to add an interface dynamically to an op? If not, the interface would have to be added to the builtin ops definitions. If it is possible, then it could maybe be combined with the global variable approach.

Flang won’t outline the target regions. That happens in the lowering of the omp ops in MLIR, so some information has to be encoded in the IR. I think the simplest and least intrusive thing would be to add the dialect attributes and see how far that takes us. That means no new ops, and it seems unlikely that the attributes will be discarded during compilation.
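
For concreteness, a minimal sketch of what those dialect attributes could look like on the existing builtin module and func.func ops (the attribute names below are placeholders, not a settled spelling):

// Placeholder attribute names; the real spellings would be defined and verified
// by the omp dialect.
module attributes {omp.is_device = false} {
  func.func @work() attributes {omp.declare_target} {
    return
  }
}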

Clang does not use MLIR yet. It directly talks to the OpenMPIRBuilder to create LLVM IR.

As said above, Flang will be invoked twice: once for host mode and once for target mode. What happens if you have a 1GB Fortran file with a tiny target region? In target mode, are you going to lower the whole 1GB to MLIR, or only the target region in whatever form?

There is an omp dialect in MLIR ('omp' Dialect - MLIR). My best guess is that it works for shared-memory OpenMP and not yet for Flang in target mode.

The SYCL upstreaming Working Group met and they want to move to one Clang invocation for combined host and device compilation to enable new optimisations. If you invoke Flang once and it generates two MLIR modules (host + target) and then creates two LLVM modules, there could be new optimisation opportunities on the MLIR side.

@tschuett Do you have an RFC or white paper that describes how to fix the preprocessor issue for a single Clang invocation?

The plan/idea comes from the SYCL guys. But if Clang/Flang only lexes, parses, and performs sema once and then performs code generation, the preprocessor issue may go away.