[RFC] Prevent CSE from removing expressions inside some non-`IsolatedFromAbove` operation regions

Currently, the common sub-expression elimination (CSE) pass in MLIR only stops at the nested region boundaries of IsolatedFromAbove operations when checking for existing known values. However, for some operations, reading or writing a value defined in a parent region might be costly, and re-computing the common expression inside the region would be the better choice. There may also be cases where crossing an operation’s boundary changes the meaning of a value, so that some constants inside the region are not the same as in the parent regions, even if they are computed from the same values.

One precedent for this behavior being overridable is constant hoisting by the OperationFolder class. There, constants are inserted into the first parent region that belongs to an IsolatedFromAbove operation, unless an earlier parent operation’s dialect has registered a DialectFoldInterface that returns true from shouldMaterializeInto(Region*). This allows dialects to override the default behavior of the canonicalization pass and prevent constants from being hoisted out of certain regions.
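
For reference, here is a minimal sketch of how a dialect can use that hook today. DialectFoldInterface and shouldMaterializeInto are the actual MLIR API; the MyTargetOp check is hypothetical:

// A minimal sketch, assuming a hypothetical region-carrying op MyTargetOp;
// DialectFoldInterface and shouldMaterializeInto are the real MLIR hooks.
#include "mlir/IR/Dialect.h"
#include "mlir/Interfaces/FoldInterfaces.h"

struct MyDialectFolderInterface : public mlir::DialectFoldInterface {
  using DialectFoldInterface::DialectFoldInterface;

  // Returning true makes OperationFolder materialize constants into this
  // region instead of hoisting them to the closest IsolatedFromAbove parent.
  bool shouldMaterializeInto(mlir::Region *region) const final {
    return mlir::isa<MyTargetOp>(region->getParentOp());
  }
};

// Registered from the dialect's initialize() method:
//   addInterfaces<MyDialectFolderInterface>();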

In D159212 I propose adding a similar mechanism to CSE, so that the elimination of common sub-expressions between the region(s) of a given operation and any parent regions can be prevented without forcing that operation to be IsolatedFromAbove, with all the restrictions that trait would bring.
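
As a rough sketch of the shape such a hook could take (the names below are illustrative, not necessarily those used in the patch; only DialectInterface::Base is existing MLIR machinery):

// Illustrative sketch only -- see D159212 for the actual proposal.
#include "mlir/IR/DialectInterface.h"

struct DialectCSEInterface
    : public mlir::DialectInterface::Base<DialectCSEInterface> {
  DialectCSEInterface(mlir::Dialect *dialect) : Base(dialect) {}

  // Returning true would make CSE treat `region` like an IsolatedFromAbove
  // boundary: known values defined above it would not be propagated inside.
  virtual bool shouldPreventCSEAcross(mlir::Region *region) const {
    return false;
  }
};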

We stumbled across this need while working on adding target offload support to the OpenMP dialect for LLVM Flang. The omp.target operation, which contains a single region, represents code to be executed by a target device that can be different from the host device and have its own memory space. The region can use values defined in parent regions, but those values need to be mapped to target device memory. This mapping may involve sending data from the host device (CPU) to the target device (e.g. GPU) and vice versa, which can be expensive. Additionally, array-indexing calculations cannot be done on the host device and then passed to the target device, because the same array can be placed at different addresses on the host and the device.

Below is an example showing the problems we encountered, which could be solved by preventing CSE across omp.target boundaries. This is the pre-CSE MLIR representation, with comments showing the Fortran it was lowered from:

// subroutine f()
func.func @_QPf() {
  // integer :: a(10)
  %0 = fir.alloca !fir.array<10xi32>
  // integer :: b(10)
  %1 = fir.alloca !fir.array<10xi32>
  // b(1) = 100
  %c0_i64 = arith.constant 0 : i64
  %c100_i32 = arith.constant 100 : i32
  %2 = fir.coordinate_of %1, %c0_i64 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
  fir.store %c100_i32 to %2 : !fir.ref<i32>
  // !$omp target map(from: a) map(to: b)
  omp.target map((from -> %0 : !fir.ref<!fir.array<10xi32>>), (to -> %1 : !fir.ref<!fir.array<10xi32>>)) {
    // a(1) = b(1)
    %c0_i64_0 = arith.constant 0 : i64
    %3 = fir.coordinate_of %1, %c0_i64_0 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
    %4 = fir.load %3 : !fir.ref<i32>
    %5 = fir.coordinate_of %0, %c0_i64_0 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
    fir.store %4 to %5 : !fir.ref<i32>
    omp.terminator
  }
  return
}

And this is what happens after the CSE pass runs: the two arith.constant 0 : i64 operations are consolidated, and the result of fir.coordinate_of %1, %c0_i64 is reused both outside and inside the omp.target region. The first case could force the zero value to be copied from the host device to the target device, and the second case can result in a wrong address calculation for the fir.load operation:

func.func @_QPf() {
  %0 = fir.alloca !fir.array<10xi32>
  %1 = fir.alloca !fir.array<10xi32>
  %c0_i64 = arith.constant 0 : i64
  %c100_i32 = arith.constant 100 : i32
  %2 = fir.coordinate_of %1, %c0_i64 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
  fir.store %c100_i32 to %2 : !fir.ref<i32>
  omp.target map((from -> %0 : !fir.ref<!fir.array<10xi32>>), (to -> %1 : !fir.ref<!fir.array<10xi32>>)) {
    // Loading from host device address
    %3 = fir.load %2 : !fir.ref<i32>
    // Implicit map of %c0_i64
    %4 = fir.coordinate_of %0, %c0_i64 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
    fir.store %3 to %4 : !fir.ref<i32>
    omp.terminator
  }
  return
}

So I wanted to ask whether the approach I proposed in my patch would be a good way to address this issue, or whether there are preferred approaches, maybe involving a new trait, extending an existing interface rather than creating a new one, or making the omp.target operation IsolatedFromAbove and dealing with mapped variables differently.

Is there a way to have a generic solution to this, like a new trait or more general interface? It would be nice to avoid the case where a new transform will have to touch all the dialects, or a new op will require implementing/modifying a bunch of interfaces expressing basically the same restrictions.

In general I have treated such cases as an optimization problem: CSE performs a canonicalization, and rematerialization of computations inside a target region is an independent transformation that “undoes” the CSE based on some optimization goal.
That avoids overfitting to one specific code form: the input code could already be in the post-CSE form, and you’d want to optimize it just as well.

Unfortunately it cannot be treated as an undo, because the user could have written the code in the post-CSE form to begin with. It is not legal to add or remove values that cross a target region boundary.

Is there a “device” barrier on regions (like an interface that says “these regions are in a different scope”)? Similar to pointer address space?

Then any kind of canonicalization would have to understand the context (perhaps via another interface) and be specific to the operations at hand, not treat the op region like the others?

A somewhat similar case is linalg.index, where CSE currently results in an incorrect transformation: see llvm/llvm-project issue #62644, “`linalg.index` shouldn't be marked as `Pure`”.
It seems in that case that there’s a dependency between linalg.index and its parent linalg.generic that isn’t modelled, and this happens to affect CSE but isn’t specific to it. I wonder if there’s a similar issue here?

I would argue that your IR model is incorrect if this is the case: you shouldn’t use an implicitly capturing region here; you should make your operation isolated.

(It’s also not clear to me why we would ever have this restriction preventing optimization from crossing target region boundaries)

Thank you all very much for your contributions to the discussion and for pointing to alternative approaches and similar issues encountered in other dialects. I’ll try to share some of my thoughts and hopefully that brings us closer to a decision. It’s possible I’m not making a lot of sense, since I’m not perfectly familiar with all aspects of MLIR, so let me know if I’m misunderstanding something or proposing things that cannot work.

A new trait or more generic interface could be defined to both disallow CSE and mark regions into which constants should be materialized, and maybe extended over time to disable other generic canonicalizations/optimizations. The issue with this is that I’m not sure what that trait/interface would represent. I see it as just a slightly different take on IsolatedFromAbove, where values are allowed to pass through only in certain dialect-specific ways.

Rematerializing those expressions inside target regions seems like it may be difficult in this case. For ordinary arithmetic expressions I wouldn’t expect big problems, but array indexing or pointer arithmetic may be more problematic. How would we distinguish a target-device address calculation that was improperly hoisted to the host from a host-device address that the user deliberately uses inside the target region? Dereferencing the latter would not generally be allowed, but I don’t think using it as an integer should be explicitly disallowed, even though I can’t think of a use case myself.

IsolatedFromAbove seems to be the only region barrier we currently have in MLIR, as far as I can tell. Adding some sort of host/target address space could maybe help prevent the fir.coordinate_of optimization above, but I don’t know how well such a new “OpenMP address” type would integrate with FIR dialect operations that are only supposed to represent Fortran, where that distinction doesn’t exist. This concern probably extends to other dialects we may want to interoperate with OpenMP in the future. At that point we would be dealing with two values representing a single array/pointer anyway: the host one and the target device one. Together with IsolatedFromAbove, that may be all we need to keep them separate, so address spaces might not be needed in the end.

These problems do seem similar, although my feeling is that the solutions are going to be different. In this case, I don’t think we could model some specific relationship between fir.coordinate_of and omp.target to prevent this from happening. An interface like the one I proposed in the patch mentioned in my post could be used by both linalg.generic and omp.target to prevent any CSE across the boundaries of these operations, but it only applies to that one transform, so it wouldn’t address other manifestations of the same root problem. A more generic interface/trait might address those, but I’m not sure what its semantics would be, i.e. which passes it would impact and what we would call it.

I’m tending to agree here. It’s probably not going to be trivial to add the IsolatedFromAbove trait to omp.target, but it does make sense for that operation to be isolated, since it defines a separate memory/execution space and data can only cross its boundary through dialect-specific handling (explicit or implicit mapping). I’m thinking mapped variables could be passed as arguments and used to initialize copies visible only inside the target region. That way the region can access them without letting SSA values in from outside.

I think the general problem with the transforms/analyses is that they apply by default, which means MLIR implicitly restricts the semantics of the operations it allows through the assumptions those transforms make. There are interfaces to restrict optimizations, but that still leaves implicit assumptions in place, because it may not be possible to anticipate the various execution models of future ops/dialects. If transforms were instead disabled by default, interfaces would be used to allow optimizations; that would prevent illegal transforms, and the assumptions about the semantics of the operations would be explicitly stated.

I’m not sure I follow what you’re trying to say, but the concept of an op being not isolated is meant to express direct SSA dependencies and to allow transforming more freely across the boundary. The fact that you want to prevent this movement not for optimization but for correctness reasons just points, to me, at an incorrect use of a non-isolated region.

What I’m trying to express is that operations with internal regions can have different execution modes that may allow some transformations but not others. E.g. omp.target should be isolated from above, but omp.parallel runs things in parallel on multiple threads; it may not necessarily be isolated from above, and some optimizations may be illegal while others are fine. My point was that the current MLIR transforms make assumptions about regions, and if ops don’t adhere to those assumptions/design decisions we end up in the “incorrect use” scenario. If by default the analyses/transforms assumed that crossing region boundaries is illegal, but had interfaces to enable them by expressing the properties an op must have for the transform to be legal, then it would be possible to have finer control over the various transforms.

I found this comment from Alex in another RFC:
" Generally, we encourage folks to implement transformations on op interfaces rather than make them aware of specific ops. The idea is that ops can opt into a transformation by implementing a corresponding interface and thus establishing a contract with the transformation about legality. This is a relatively recent feature and “older” transformations such as those on loops may not be using it to its full potential"

I think this is what I was saying, and perhaps CSE is one of these older transforms that should be updated. Common properties like single-threaded execution, maintained machine state (e.g. rounding mode or device), no implicit loops, no volatile memory, etc. could be part of an interface that limits the transform.

Link to the other RFC:

The quote refers to whether a transformation checks for a specific operation, e.g. if (isa<scf::ForOp>(op)), instead of using traits/interfaces, e.g. if (op->hasTrait<Pure>()).
CSE has never been in the category of transforms that don’t use traits/interfaces.
See for example hasOtherSideEffectingOpInBetween and how it uses the MemoryEffectOpInterface.

int i;
float p[N], v1[N], v2[N];
#pragma omp target map(v1, v2, p)
#pragma omp parallel for
   for (i=0; i<N; i++)
     p[i] = v1[i] * v2[i];

This is a simplified example from the current official OpenMP examples. I believe you must use the IsolatedFromAbove trait to separate device from host code and explicitly model the i variable.
Unmodified optimizations like CSE and constant propagation should not affect the correctness of your code. You are always free to add OpenMP-specific optimization passes à la OMPTargetOptPass.

Can you clarify on this example how CSE would break it? I’m not necessarily familiar with all the subtleties of the OpenMP spec.

Notice the use of the variables i and N: they are defined in host code and used in device code. What stops you from hoisting v1[0] above the pragma? Sorry, this one is not CSE-specific.

I see. I think the problem is that the semantics the IR represents have expanded since the pass was written, and it is not easy to specify something that is unknown at the time. Some effort could be made in this regard, but it may still not cover all future possibilities. Would it then be an acceptable approach to expand the interfaces (or create new ones) to capture the new properties of the IR and make the optimizations do the right thing?

My question would still be: why would this be invalid under an “as-if” rule?

int i;
float p[N], v1[N], v2[N];
// init v1/v2 here...

#pragma omp target map(v1, v2, p)
#pragma omp parallel for
   for (i=0; i<N; i++)
     p[i] = v1[i] * v2[i];

=>

int i;
float p[N], v1[N], v2[N];
// init v1/v2 here...

p[0] = v1[0] * v2[0];

#pragma omp target map(v1, v2, p)
#pragma omp parallel for
   for (i=1; i<N; i++) // iteration 0 is peeled off
     p[i] = v1[i] * v2[i];

Now, this is a loop transformation that is a bit more specific than CSE. If we go back to CSE, the fundamental property is that we can replace an SSA value with another SSA value that is “equivalent” (from the OperationEquivalence point of view).

So in the IR if we have:

%value = ...

op.region() {
   %value2 = ....
   use(%value2, %value)
}

If OperationEquivalence indicates that %value2 is equivalent to %value, I think the substitution in use should always be valid.
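
For context, this is roughly the check CSE performs, simplified from mlir/lib/Transforms/CSE.cpp (the exact signature and flags may differ between MLIR versions):

// Sketch of CSE's structural-equivalence check; exact API may vary by version.
#include "mlir/IR/OperationSupport.h"

bool sameExpression(mlir::Operation *lhs, mlir::Operation *rhs) {
  return mlir::OperationEquivalence::isEquivalentTo(
      lhs, rhs,
      /*checkEquivalent=*/mlir::OperationEquivalence::exactValueMatch,
      /*markEquivalent=*/nullptr,
      mlir::OperationEquivalence::IgnoreLocations);
}

Note that this check is purely structural: nothing in it considers where the two operations sit relative to a region boundary such as omp.target, which is exactly why the substitution above is performed.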

Of course, but we have to agree on what the reasonable IR properties are first, which is exactly my line of commenting in this thread so far!
So what IR properties would make the substitution of “equivalent” SSA values invalid?

The standard does not say what a target is. It could be a compute device at the end of a 100 m Ethernet cable. Talking about SSA values on the host and on the target being equivalent, or secretly moving them across, violates the idea of an offload device/target.

You’re now switching the discussion from “semantics” and “correctness” to “optimization”; my understanding was that the line of argumentation was about the validity of the transformation regardless of any profitability aspects.