[RFC] Prevent optimization/analysis across omp.target region boundaries

When compiling for offloading devices, various optimizations (constant folding, CSE, etc.) can optimize across target region boundaries, so that host and device code become mixed. This is illegal because the compilation environments (macro values, etc.) may differ, so there is no way to safely reason about host code when compiling for the device, and vice versa. The initial take was to prevent the optimizations from doing this, but that cannot be done in an easy way, and every new optimization would have to take it into account. Instead, it seems better to erase all host code from the IR when doing device compilation. However, just deleting the code outside omp.target ops is also not enough, because optimizations could still hoist code out of two different regions (in the same function) in order to share it between them.
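For example (a contrived sketch, not from the RFC itself): with two target regions in the same host function, an optimizer that sees the whole function could hoist the common subexpression x * y out of both regions into the surrounding host code, mixing device computation into host code that is compiled in a different environment:

void f(int x, int y, int *r1, int *r2) {
  #pragma omp target map(to: x, y) map(from: r1[0:1])
  { r1[0] = x * y + 1; }

  #pragma omp target map(to: x, y) map(from: r2[0:1])
  { r2[0] = x * y + 2; }
}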

The proposed solution: when doing device compilation, outline each omp.target op from its host function into a separate function (potentially a new omp.target_func op, instead of the regular func op?). The input values of the omp.target op become parameters of the outlined function, and its outputs become return values. The outlined function would also need the additional attributes required to generate the kernel function, e.g. the parent function name, which is used to generate the kernel function names. Once this is done, the original host function can be deleted.
For host compilation nothing needs to be changed.
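Roughly, as a sketch in C-like pseudocode (the real transformation happens on MLIR ops; the outlined function name and the attribute spelling below are placeholders):

// Before: the host function as the user wrote it.
void host_fn(int x, int *res) {
  int tmp = x * 2;                        // host code around the region
  #pragma omp target map(to: x, tmp) map(from: res[0:1])
  { res[0] = x + tmp; }
}

// After outlining for the device pass: the region body becomes a
// free-standing function; region inputs become parameters. An
// attribute (sketched here as a comment) records what kernel
// generation needs, e.g. the parent function's name. host_fn itself,
// including the host-only code, is then erased from the device IR.
/* attributes: parent function name = "host_fn" */
void outlined_target_0(int x, int tmp, int *res) {
  res[0] = x + tmp;
}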


I believe the following code falls into illegal/stupid/your-own-fault territory:

int i;
#ifdef OMP_HOST
i = 10;
#endif
#ifdef OMP_DEVICE
i = 9;
#endif
#pragma omp parallel for
for(...) {
  foo(i);
}

However, using regions in MLIR may enable new optimizations. Maybe instead of the standard CSE pass, you will need an OffLoadCSE pass.

It seems to me that if the source code is compiled twice, once for the device and once for the host, there is inherently no point in having an IR that exposes both host and device constructs.

The whole point of having a unified IR is to be able to perform cross-boundary optimizations, and that assumes a flow where the source code is lowered to the IR once for both device and host together.

Agreed. If I build the example for host only, Clang compiles it once. If instead I compile:

#pragma omp target parallel for
for(...) {
  foo(i);
}

Notice the additional target. Clang will compile it once for host and once for device. That is why there are issues with macros. Flang will probably follow the same model.

The Intel SYCL compiler was planning to move to a single compilation model.

In MLIR, you could have nested modules: one for the host and one for the device.

This is almost two nested MLIR modules, but not quite:

module.outer_module {
  module.host_module {
    int a = 0;
    #pragma omp target parallel for map(to:a)
    for (..) {
      // no a
    }
  }
  module.device_module {
    int a = 0;
    #pragma omp target parallel for map(to:a)
    for (..) {
      // no a
    }
  }
}

You might notice that the variable a is mapped to the device but unused there, so the map clause is redundant. The current OpenMP optimiser will have a hard time eliminating it. With nested modules and cross-module optimizations, you might be able to eliminate the map clause.
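For instance (continuing the sketch above), once a cross-module pass proves that a is unused in both copies of the region, the redundant clause could be dropped consistently from each module:

int a = 0;
#pragma omp target parallel for   // map(to:a) removed: a is unused inside
for (..) {
  // no a
}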

MLIR might benefit from an OptimizationBarrierInterface. Don’t hoist the Gold out of my Block. It is my Gold and not for public consumption.

You can shoot yourself in the foot with macros and such, but the code you have shown will work fine (also if you add target, etc.). It will do what you expect it to do. The consistency the user has to guarantee is more at the level of matching entry points, and even there we can generally tolerate a lot.

They do, and they have reasons for that. Some of those reasons apply to some degree to OpenMP offload as well. That said, multi-pass has distinct advantages, and it will require real-world data to drop that support (from Clang compiling OpenMP offload). Even so, you can have a unified module while still doing separate “compilation” for everything but the middle end.

As mentioned above, doing pre-processing, parsing, and (target) codegen per target does not preclude you from having a single middle-end module with host and target “IR” in any way. We know because we ran some experiments that effectively merged the IR from host and device code after the frontend produced it. That said, we also have numbers for the performance you gain from “simple” cross-boundary optimizations, e.g., constant propagation, and for our codes the result is not (yet) worth the effort.

Yes, please. That model makes sense. The driver can deal with the merging of modules if we wish to do that.

This is what I have told people from the very beginning (of MLIR and parallel LLVM-IR): if you make the representation “too transparent”, analyses and transformations will assume sequential, single-target semantics, and that will cause miscompilations. The approach we take for all parallel languages handled by Clang is to “opt in” to cross-parallel-boundary optimizations rather than “opt out” of them. That is the sane approach and it works.

Once Clang migrates to mlir-Clang, it might be interesting to move the offload programming models to two nested MLIR modules and two separate LLVM modules. Then you can safely start optimising between host and device code.

As this post shows, that is not as simple a statement as people make it out to be. You can do cross-device optimizations with less infrastructure, but how safe it is depends mostly on your encoding/approach.
Anyway, see the next comment.

If that happens, we can talk about it :slight_smile:

I forgot to reply to this part, sorry. Yes. I think that is a very sensible approach to make it work, and once it works, people can try out more “optimistic” encodings to get better performance, but at least then we do not have to fight all the bugs right now during the time we want to get things going.

It would be good to see a worked-out example (source, MLIR, LLVM IR) for both the host and device compilation, just to make sure that we all understand this the same way.
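For instance, the source side of such a walkthrough could be as small as this (a hypothetical starting point):

#include <stdio.h>

int main() {
  int x = 0;
  #pragma omp target map(tofrom: x)
  { x = 42; }
  printf("%d\n", x);   // host prints the value computed on the device
  return 0;
}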

Does this mean the nested omp.target representation will be retained for the host side?

If there is nesting, I believe a recursive translation approach could work: keep track of the current target region's device type and, depending on the compilation mode and the nesting level, either outline the region and erase the original, or leave the code alone.

There is more to it than that, though, because everything has to be kept for the host side. So when doing host codegen for an inner region, the host pass would have to create the outlined function. The device pass would generate the reverse-offloading launch code.
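For reference, the nested case being discussed looks roughly like this in source form (a sketch; the requires directive and the ancestor modifier are standard OpenMP 5.0 reverse offload, the rest is illustrative):

#include <stdio.h>

#pragma omp requires reverse_offload

void f(int *a, int n) {
  #pragma omp target map(tofrom: a[0:n])   // outer region: device code
  {
    for (int i = 0; i < n; ++i)
      a[i] *= 2;

    // Inner region runs back on the host. For device compilation its
    // body is host code and only the launch must be generated; for
    // host compilation the body must be outlined so it can be
    // called back into.
    #pragma omp target device(ancestor: 1) map(to: a[0:1])
    { printf("first element: %d\n", a[0]); }
  }
}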

[quote]
If that happens, we can talk about it :slight_smile:
[/quote]

We are trying very hard to share code generation between Clang and MLIR via the OpenMPIRBuilder. It helps to have the same two-pass compilation model for both. I imagine it would make it easier for Clang to migrate to mlir-Clang, and once that is done, moving to a single-pass model can happen for both of them by default.

No worries. The future model would still use the OpenMPIRBuilder for the LLVM part. The difference is that one flang invocation would create two nested MLIR modules (host and device) and then two LLVM modules.

Maybe you can attach an OptimizationBarrierInterface to target regions to prevent optimisation into and out of them, while still allowing CSE inside the region.

I think just removing the “useless” code for now makes sense, since it will speed up compilation and maybe make code inspection easier. Once we know of specific optimizations that should be done, we can add the OptimizationBarrierInterface. I’m just a bit paranoid that some optimization we haven’t thought about will cause a miscompilation. There are enough bugs to worry about, since we are still very early in the development of OpenMP offloading support in llvm-flang/MLIR.

GlobalISel (instruction selection) just added a barrier for constant folding:
https://reviews.llvm.org/D151945

The change in GlobalISel seems to be about controlling some profitability aspects, and not about correctness issues?

If I understand correctly, the issue right now is that clang is invoked twice, builds different ASTs, and emits different IR for host and device. But both IRs contain code for both the host and the device, except that in the host case the “device IR” is incorrect, and in the device case it is the “host IR” that is incorrect? Is this accurate? To me, that is hardly a statement that “cross host-device transformation is difficult”.