[RFC] Prevent optimization/analysis across omp.target region boundaries

It prevents constant folding, i.e. optimizations. It is not about correctness.

That is accurate afaik. I still think it is a tricky thing to do. Moving any computation across target region boundaries may give different results since, e.g., rounding modes may differ between the device and the host.

#pragma omp target parallel for map(to:a) map(from:b)
for (int i = 0; i < n; ++i) {
  b[i] = 5 * a[i] + 5 + 5 + 5 + 5 + 5;
}

LICM might silently hoist the 5 + 5 + 5 + 5 + 5 out of the loop, and thereby out of the target region. But then you have to copy the sum over to the device. How should I know that I have to do that?

Hopefully the resulting code would be legal:

omp.target {
    tf = 5 + 5 + 5 + 5 + 5
    for () {
        b[i] = 5 * a[i] + tf
    }
}

This would be illegal:

tf = 5 + 5 + 5 + 5 + 5
omp.target {
    for () {
        b[i] = 5 * a[i] + tf
    }
}

Exactly! tf = 25 should also be fine. omp.target must be some kind of barrier that prevents optimisations across it.

There is a fold interface where we have implemented shouldMaterializeInto to return true for target regions, so it should be “safe” for this case. But the function is not called mustMaterializeInto, so it seems to be intended for profitability rather than correctness. There is really a need for a mechanism that acts as an optimization/analysis barrier. If someone is interested in adding it, that would be nice, but it will take some effort.
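
For reference, a minimal sketch of what such a fold-interface implementation looks like, mirroring mlir::DialectFoldInterface (the exact upstream code in the OpenMP dialect may differ slightly):

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/Interfaces/FoldInterfaces.h"

using namespace mlir;

// Constants folded inside an omp.target region are materialized into that
// region rather than hoisted out of it. Note this is a "should", not a
// "must": the hook expresses profitability, it is not enforced for legality.
struct OpenMPDialectFoldInterface final : DialectFoldInterface {
  using DialectFoldInterface::DialectFoldInterface;

  bool shouldMaterializeInto(Region *region) const final {
    return isa<omp::TargetOp>(region->getParentOp());
  }
};

// Registered when the dialect is initialized, e.g.
//   addInterfaces<OpenMPDialectFoldInterface>();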

@mehdi_amini @ftynse Is there interest from the MLIR team for an OptimizationBarrierInterface to prevent optimizations across region boundaries, see discussion above?

As I mentioned before, we have no evidence that this is the right thing to do. On the contrary, we already know what will break for users if we do. IMHO, the right thing to do is to merge the IR in the middle end, rather than to parse it once.

No, it is not accurate. The AST is not “incorrect” for either side. There might be differences in unrelated parts, but that is not the same as “incorrect”. The important thing is that the interfaces are consistent.

Assuming it wasn’t a constant, LLVM-IR LICM will right now hoist the value out of the loop just fine. It will be computed on the target once and used in the loop w/o recomputation.
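
As a source-level sketch (not the actual IR transformation), assuming the summand c is not a compile-time constant, the effect of LLVM-IR LICM during the device compilation is roughly:

void foo(float *a, float *b, float c, int n) {
#pragma omp target map(to : a[0:n]) map(from : b[0:n])
  {
    float t = c + c + c + c + c;   // hoisted out of the loop, computed once,
                                   // but still inside the target region
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
      b[i] = 5 * a[i] + t;
  }
}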

FWIW, I was always hoping this would be the default. Operations/regions can have arbitrary semantics; how can one justify moving anything in/out without knowing what that really implies?

In the example that was given, a device-only macro would affect the “host side”, but maybe this example wasn’t really clear in terms of what it was actually showing?

I am not sure I follow. Users can write all sorts of things, that is up to them. If they do not keep the host-device interface consistent, bad things will happen; otherwise, things should be “fine” and you get what you asked for. If you refer to the example with a different value for i on the host and device, that is totally fine. We even have various pragmas to allow exactly such things, diverging code for host and device. Device and host do not need to do the same thing in the same way; they can really do whatever the user wants, as long as the interface is sane.
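
A small (hypothetical) example of such intentional divergence that is still perfectly fine, because the mapping interface stays consistent, using the standard API routine omp_is_initial_device():

#include <omp.h>
#include <stdio.h>

int main() {
  int on_host = -1;
#pragma omp target map(from : on_host)
  on_host = omp_is_initial_device();   // 0 on the device, 1 if we fell back to the host

  printf("target region ran on the %s\n", on_host ? "host (fallback)" : "device");
  return 0;
}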

The issue is when LICM hoists the 5+5+5+5+5 out of the loop and the target is a network-attached device with no shared memory. In MLIR the loop is wrapped into an omp.target op with an attached region.

@tschuett: I think we are in agreement. In case we are not:
I never said it should, by default, hoist anything out of a target. In fact, my position for a long time has been that it should not/cannot. See my comment above: LLVM-IR LICM right now will hoist it out of the loop, not out of the target. That is correct in our (OpenMP and LLVM) memory model, and I am unsure how that could go wrong.

We are close. But I believe in MLIR nothing stops LICM from hoisting anything out of the region.

I understand, and that is a problem. FWIW, if I talk about LICM (or anything really) I refer to LLVM-IR. I should have made that clear.

There is a LICM in MLIR:

LICM is a good example to show the problem, because it hoists code out of the loop, or rather out of the region.

With the early outlining we can easily detect if something gets hoisted. There should only be an omp.target and a func.return operation in the body of the outlined function. If there is anything more, we can error out, so at least it won’t miscompile.
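
A rough sketch of such a check (the function and diagnostic names are illustrative, assuming the outlined functions are func.func ops containing an omp.target):

#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/Support/LogicalResult.h"

using namespace mlir;

// After early outlining, the body of an outlined target function must contain
// nothing but the omp.target op and the terminating func.return. Anything
// else means some pass hoisted code out of the target region.
static LogicalResult verifyOutlinedTargetFunc(func::FuncOp fn) {
  for (Operation &op : fn.getBody().front())
    if (!isa<omp::TargetOp, func::ReturnOp>(&op))
      return op.emitError("operation was hoisted out of the omp.target "
                          "region of an early-outlined function");
  return success();
}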

I can see why some optimizations like LICM and CSE can be problematic when dealing with mixed host/device code - both from a legality perspective and for heuristic reasons (device heuristics may be different, such as wanting to reduce register pressure).

That said, I am struggling to see how the interface between host and device is reconciled. I am echoing what Kiran asked for - it would be useful to see some examples of the impacts of this proposal.

In my mind, doing early outlining to work around the optimization issue feels limiting. For example, consider that in flang, descriptors are used in many places. In order to get to the data, one also needs the descriptor on the device. But the descriptor must be mapped to the device in an implicit manner. The mapping that is used definitely impacts the device routine - and maybe in some cases we would want to firstprivatize the descriptor instead. So how will the host and device interface be reconciled if they are handled on completely different paths from the beginning?

Also, what if we wanted to do SROA on the descriptor? In many cases, we just use the pointer to the data and maybe the upper bound. So we could just pass the info needed as kernel arguments instead of via a mapping operation. How would this be done when device/host code is split into different compilation paths?
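
To make the question concrete, a hypothetical illustration of what “SROA on the descriptor” would aim for (the Descriptor struct below is a made-up stand-in for a flang descriptor):

// Simplified stand-in; real descriptors carry rank, element size, strides, ...
struct Descriptor {
  float *base;
  long lower, upper, stride;
};

// Instead of implicitly mapping the whole descriptor to the device, only the
// pieces the kernel actually uses travel: the data via a map clause and the
// bound as a firstprivate scalar, i.e. effectively a plain kernel argument.
void scale(Descriptor d) {
  float *base = d.base;
  long upper = d.upper;
#pragma omp target map(tofrom : base[0:upper]) firstprivate(upper)
  for (long i = 0; i < upper; ++i)
    base[i] *= 2.0f;
}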

See any actual offloading code compiled with Clang: CUDA, HIP, OpenMP, OpenCL, SYCL, etc.


Some abstract notion of “optimization” we might want to perform on code that we can neither compile nor execute should not block us from going forward for an unknown amount of time (and work). If people want to try out some optimizations, go ahead and report back; until then, all data clearly favors the Clang approach.

If I understand the problem correctly, optimization within one target or the other is OK, but the problems occur at the interface between targets. Maybe there is an assumption that was made in the past that is no longer true. For example, if both targets are the same, one can reasonably assume that the same optimization of the interface can be done on both sides of the interface. If the targets are different, one side might be able to make the optimization but the other side might not.
So, unless there is coordination between targets, the rule should probably be that one only makes the optimization if one can ensure that the same optimization is actually being made on both sides. If that assurance is not available, the optimization across the interface should be suppressed.

The traditional Clang offload model compiles the source code twice: once for the host and once for the device. If you want to optimise across the host/device boundary, you have to coordinate between the two compilations.

The other issue is the one mentioned above: even in the device compilation, there is host and device code. The offload region for OpenMP and OpenACC is modelled as a loop or a region in MLIR. Currently there is no mechanism to prevent optimisations across the border between the inside and the outside of the loop/region. My favourite anti-optimisation is LICM.