[RFC] Prevent optimization/analysis across omp.target region boundaries

It prevents constant folding, i.e. optimizations. It is not about correctness.

That is accurate afaik. I still think it is a tricky thing to do. Moving any computation across target region boundaries may give different results since, e.g., rounding modes may differ between the device and the host.

#pragma omp target parallel for map(to:a) map(from:b)
for (int i = 0; i < n; ++i) {
  b[i] = 5 * a[i] + 5 + 5 + 5 + 5 + 5;
}

LICM might silently hoist the 5 + 5 + 5 + 5 + 5 out of the loop, and thereby out of the target region. But then you have to copy the sum over to the device. How should I know that I have to do that?

Hopefully the resulting code would be legal:

omp.target {
    tf = 5 + 5 + 5 + 5 + 5
    for () {
        b[i] = 5 * a[i] + tf
    }
}

This would be illegal:

tf = 5 + 5 + 5 + 5 + 5
omp.target {
    for () {
        b[i] = 5 * a[i] + tf
    }
}

Exactly! tf = 25 should also be fine. omp.target must be some kind of barrier that prevents optimisations across it.

There is a fold interface where we have implemented shouldMaterializeInto to return true for target regions, so it should be “safe” for this case. But the function is not called mustMaterializeInto, so it seems to be intended for profitability rather than correctness. There is really a need for a mechanism that acts as an optimization/analysis barrier. If someone is interested in adding it, that would be nice, but it will take some effort.
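
For reference, a minimal sketch of what such a fold-interface implementation looks like, mirroring mlir::DialectFoldInterface (the exact upstream code in the OpenMP dialect may differ slightly):

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/Interfaces/FoldInterfaces.h"

using namespace mlir;

// Constants folded inside an omp.target region are materialized into that
// region rather than hoisted out of it. Note this is a "should", not a
// "must": the hook expresses profitability, it is not enforced for legality.
struct OpenMPDialectFoldInterface final : DialectFoldInterface {
  using DialectFoldInterface::DialectFoldInterface;

  bool shouldMaterializeInto(Region *region) const final {
    return isa<omp::TargetOp>(region->getParentOp());
  }
};

// Registered when the dialect is initialized, e.g.
//   addInterfaces<OpenMPDialectFoldInterface>();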

@mehdi_amini @ftynse Is there interest from the MLIR team for an OptimizationBarrierInterface to prevent optimizations across region boundaries, see discussion above?

As I mentioned before, we have no evidence that this is the right thing to do. On the contrary, we already know what will break for users if we do. IMHO, the right thing to do is to merge the IR in the middle end, rather than to parse it once.

No, it is not accurate. The AST is not “incorrect” for either side. There might be differences in unrelated parts, but that is not the same as “incorrect”. The important thing is that the interfaces are consistent.

Assuming it wasn’t a constant, LLVM-IR LICM will right now hoist the value out of the loop just fine. It will be computed on the target once and used in the loop w/o recomputation.
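
As a source-level sketch (not the actual IR transformation), assuming the summand c is not a compile-time constant, the effect of LLVM-IR LICM during the device compilation is roughly:

void foo(float *a, float *b, float c, int n) {
#pragma omp target map(to : a[0:n]) map(from : b[0:n])
  {
    float t = c + c + c + c + c;   // hoisted out of the loop, computed once,
                                   // but still inside the target region
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
      b[i] = 5 * a[i] + t;
  }
}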

FWIW, I was always hoping this would be the default. Operations/regions can have arbitrary semantics; how can one justify moving anything in/out without knowing what that really implies?

In the example that was given, a device-only macro would affect the “host side”, but maybe this example wasn’t really clear in terms of what it was actually showing?

I am not sure I follow. Users can write all sorts of things, that is up to them. If they do not keep the host-device interface consistent, bad things will happen; otherwise, things should be “fine” and you get what you asked for. If you refer to the example with a different value for i on the host and device, that is totally fine. We even have various pragmas to allow exactly such things, diverging code for host and device. Device and host do not need to do the same thing in the same way; they can really do whatever the user wants, as long as the interface is sane.
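
A small (hypothetical) example of such intentional divergence that is still perfectly fine, because the mapping interface stays consistent, using the standard API routine omp_is_initial_device():

#include <omp.h>
#include <stdio.h>

int main() {
  int on_host = -1;
#pragma omp target map(from : on_host)
  on_host = omp_is_initial_device();   // 0 on the device, 1 if we fell back to the host

  printf("target region ran on the %s\n", on_host ? "host (fallback)" : "device");
  return 0;
}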

The issue is when LICM hoists the 5+5+5+5+5 out of the loop and the target is a network-attached device with no shared memory. In MLIR the loop is wrapped into an omp.target op with an attached region.

@tschuett: I think we are in agreement. In case we are not:
I never said it should, by default, hoist anything out of a target. In fact, my position for a long time has been that it should not/cannot. See my comment above: LLVM-IR LICM right now will hoist it out of the loop, not out of the target. That is correct in our (OpenMP and LLVM) memory model, and I am unsure how that could go wrong.

We are close. But I believe in MLIR nothing stops LICM from hoisting anything out of the region.

I understand, and that is a problem. FWIW, if I talk about LICM (or anything really) I refer to LLVM-IR. I should have made that clear.

There is a LICM in MLIR:

LICM is a good example to show the problem, because it hoists code out of the loop, or rather out of the region.

With the early outlining we can easily detect if something gets hoisted. There should only be an omp.target and a func.return operation in the body of the outlined function. If there is anything more, we can error out, so at least it won’t miscompile.
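
A rough sketch of such a check (the function and diagnostic names are illustrative, assuming the outlined functions are func.func ops containing an omp.target):

#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/Support/LogicalResult.h"

using namespace mlir;

// After early outlining, the body of an outlined target function must contain
// nothing but the omp.target op and the terminating func.return. Anything
// else means some pass hoisted code out of the target region.
static LogicalResult verifyOutlinedTargetFunc(func::FuncOp fn) {
  for (Operation &op : fn.getBody().front())
    if (!isa<omp::TargetOp, func::ReturnOp>(&op))
      return op.emitError("operation was hoisted out of the omp.target "
                          "region of an early-outlined function");
  return success();
}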

I can see why some optimizations like LICM and CSE can be problematic when dealing with mixed host/device code - both from a legality perspective and for heuristic reasons (device heuristics may be different, such as wanting to reduce register pressure).

That said, I am struggling to see how the interface between host and device is reconciled. I am echoing what Kiran asked for - it would be useful to see some examples of the impacts of this proposal.

In my mind, doing early outlining to work around the optimization issue feels limiting. For example, consider that in flang, descriptors are used in many places. In order to get to the data, one also needs the descriptor on the device. But the descriptor must be mapped to the device in an implicit manner. The mapping that is used definitely impacts the device routine - and maybe in some cases we would want to firstprivatize the descriptor instead. So how will the host and device interface be reconciled if they are handled on completely different paths from the beginning?

Also, what if we wanted to do SROA on the descriptor? In many cases, we just use the pointer to the data and maybe the upper bound. So we could just pass the info needed as kernel arguments instead of via a mapping operation. How would this be done when device/host code is split into different compilation paths?
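
To make the question concrete, a hypothetical illustration of what “SROA on the descriptor” would aim for (the Descriptor struct below is a made-up stand-in for a flang descriptor):

// Simplified stand-in; real descriptors carry rank, element size, strides, ...
struct Descriptor {
  float *base;
  long lower, upper, stride;
};

// Instead of implicitly mapping the whole descriptor to the device, only the
// pieces the kernel actually uses travel: the data via a map clause and the
// bound as a firstprivate scalar, i.e. effectively a plain kernel argument.
void scale(Descriptor d) {
  float *base = d.base;
  long upper = d.upper;
#pragma omp target map(tofrom : base[0:upper]) firstprivate(upper)
  for (long i = 0; i < upper; ++i)
    base[i] *= 2.0f;
}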

See any actual offloading code compiled with Clang: CUDA, HIP, OpenMP, OpenCL, SYCL, etc.


Some abstract notion of “optimization” we might want to perform on code that we can neither compile nor execute should not block us from going forward for an unknown amount of time (and work). If people want to try out some optimizations, go ahead and report back; until then, all data clearly favors the Clang approach.

If I understand the problem correctly, optimization within one target or the other is OK, but the problems occur at the interface between targets. Maybe there is an assumption that was made in the past that is no longer true. For example, if both targets are the same, one can reasonably assume that the same optimization of the interface can be done on both sides of the interface. If the targets are different, one side might be able to make the optimization but the other side might not.
So, unless there is coordination between targets, the rule should probably be that one only makes the optimization if one can ensure that the same optimization is actually being made on both sides. If that assurance is not available, the optimization across the interface should be suppressed.

The traditional Clang offload model compiles the source code twice: once for the host and once for the device. If you want to optimise across the host/device boundary, you have to coordinate between the two compilations.

The other issue is the one mentioned above: even in the device compilation, there is host and device code. The offload region for OpenMP and OpenACC is modelled as a loop or a region in MLIR. Currently there is no mechanism to prevent optimisations across the border between the inside and the outside of the loop/region. My favourite anti-optimisation is LICM.