Bug in `OperationEquivalence` (breaks `-cse` on `linalg.index`)

Hi,

It seems I found a really horrible bug today. Let me first describe how to reproduce it in a concrete test case before I get into my analysis.

Reproducing the issue

Consider the following (quite stupid) implementation of matrix multiplication, which uses nested linalg.map ops to compute elementwise products and reduces them into dot products with linalg.reduce.

func.func @matmul(%A: tensor<3x4xf64>, %B: tensor<4x3xf64>) -> tensor<3x3xf64> {
    %init = tensor.empty() : tensor<3x3xf64>
    %mm = linalg.map outs(%init: tensor<3x3xf64>) (){
        %i = linalg.index 0 : index
        %j = linalg.index 1 : index
        %init2 = tensor.empty() : tensor<3xf64>
        %mv = linalg.map outs(%init2: tensor<3xf64>) (){
            %k = linalg.index 0 : index
            %a = tensor.extract %A[%i, %k] : tensor<3x4xf64>
            %b = tensor.extract %B[%k, %j] : tensor<4x3xf64>
            %c = arith.mulf %a, %b : f64
            linalg.yield %c : f64
        }
        %init3 = arith.constant dense<0.0> : tensor<f64>
        %reduce = linalg.reduce { arith.addf }
            ins(%mv: tensor<3xf64>)
            outs(%init3: tensor<f64>)
            dimensions = [0]
        %out = tensor.extract %reduce[] : tensor<f64>
        linalg.yield %out : f64
    }
    return %mm : tensor<3x3xf64>
}

If you run `-cse` on this, you get the following incorrect program:

module {
  func.func @matmul(%arg0: tensor<3x4xf64>, %arg1: tensor<4x3xf64>) -> tensor<3x3xf64> {
    %0 = tensor.empty() : tensor<3x3xf64>
    %mapped = linalg.map outs(%0 : tensor<3x3xf64>)
      () {
        %1 = linalg.index 0 : index
        %2 = linalg.index 1 : index
        %3 = tensor.empty() : tensor<3xf64>
        %mapped_0 = linalg.map outs(%3 : tensor<3xf64>)
          () {
            %extracted_1 = tensor.extract %arg0[%1, %1] : tensor<3x4xf64>
            %extracted_2 = tensor.extract %arg1[%1, %2] : tensor<4x3xf64>
            %4 = arith.mulf %extracted_1, %extracted_2 : f64
            linalg.yield %4 : f64
          }
        %cst = arith.constant dense<0.000000e+00> : tensor<f64>
        %reduced = linalg.reduce { arith.addf } ins(%mapped_0 : tensor<3xf64>) outs(%cst : tensor<f64>) dimensions = [0] 
        %extracted = tensor.extract %reduced[] : tensor<f64>
        linalg.yield %extracted : f64
      }
    return %mapped : tensor<3x3xf64>
  }
}

This program's result is now different, because CSE has substituted %i for %k, which is not semantics-preserving. This is not the expected behavior, as the documentation for linalg.index states:

The linalg.index operation returns the iteration index of the immediately enclosing linalg structured operation for the iteration dimension dim. The dim attribute specifies the position of the accessed dimension in the indexing map domain.

Tracing the issue

In my opinion, the root cause of this issue lies in the way OperationEquivalence decides that the two linalg.index operations are equivalent: they are equal as OpStates (see OperationEquivalence::isEquivalentTo). Since one dominates the other and is in scope, the CSEDriver replaces the dominated one with the dominating one.

OpState equivalence does not imply semantic equivalence, as operation semantics in MLIR may depend on structural properties, such as parent-child relationships (as is the case here) or sibling-sibling relationships (think of a struct definition where the fields are not control flow but still ordered).

Fixing the issue

I hope that you all agree that this is a bug and not intended behavior. I can see how the way forward to fix it might be controversial. I will give my own point of view first.

Let me begin by saying that, under the current status quo, I don’t believe linalg is in the wrong here. Declaring the operation as [Pure] is semantically sound in my opinion. Introducing some new SSA dependency on the parent is clearly against the spirit of the abstraction, too.

I have two points I’d like to address:

  1. OperationEquivalence should respect structural semantics.

    To me, MLIR not having mandatory traits that indicate operation “well-behaved-ness” is a huge issue. I do want to have safe “metadata-like”/“non-control-flow-like” operations, and others that have semantics that depend on their surrounding structure. If such a trait, e.g. OpTrait::Structural, was introduced, OperationEquivalence could use it.

    For the particular case of linalg.index, however, we do want CSE to collapse repeated uses in the same scope. In this special case, the actual semantics are along the lines of “this op is structurally dependent on its parent”. A trait that reflects this relationship, e.g. OpTrait::Scoped<Parent>, could be introduced, which would require OperationEquivalence to check that both ops have the same first ancestor satisfying the given op constraint.

    This means that linalg.index and other ops with these semantics will need to explicitly declare them going forward, but I think this is desirable.

  2. CSEDriver should be more conservative.

    Looking at the current implementation (see CSEDriver::simplifyOperation), the CSEDriver only makes a special exception for ops with memory side effects. I think this is way too optimistic.

    The CSEDriver should probably bail on any op that has side-effects it does not recognize. In addition, CSE-ing out non-speculatable ops may change the observed program behavior, and there seems to be no guard against that either.

I have a different analysis here: the issue isn’t with OperationEquivalence or CSE; it is a bug in the modeling of the linalg.index operation, which is incorrectly declared as “Pure”. This has been tracked in this issue.

I had a similar remark to Mehdi.

Pure is defined as

// Always speculatable operation that does not touch memory.  These operations
// are always legal to hoist or sink.
def Pure : TraitList<[AlwaysSpeculatable, NoMemoryEffect]>;

Neither hoisting nor sinking of linalg.index seems to be always legal.

I disagree with that interpretation on the grounds that the terminology is not applicable here.

  1. Does that mean, according to your interpretation, I may “hoist” any pure operation from a func.func into the surrounding builtin.module?

  2. I would define “hoisting” and “sinking” as control-flow-related terminology, which has no meaning for ops that do not participate in control flow (both the op itself and the context it sits in). linalg.index is perfectly “sinkable” and “hoistable” as long as its surrounding scope stays the same, whereas outside of it it has no meaning at all.

  3. Pure is not a strong trait itself, only the observation that an op has no memory effects and is always speculatable. If you want to add “can always hoist/sink” semantics to it, then what about all the ops that have both other properties (such as linalg.index) but aren’t always legal to move?

In any case, the current CSE driver will still cause the issue even if the op becomes impure, as long as it doesn’t declare a memory side effect. And that doesn’t make sense, right?

Thank you for this reference! I thought I was going insane because this never came up; I must have been blind instead.

I can’t find a thread about this though, and the issue does not discuss the implications of what that means for pure operations everywhere.

func.func is IsolatedFromAbove. That prevents any hoisting of ops to the module.


I know, but then that implies legality of hoisting or sinking depends on the op to move AND the context. So clearly Pure doesn’t mean you can always do that.

Right now, whoever does the hoisting or sinking has to check that. I’m pretty sure the loop-invariant code motion utils let you hoist stuff you shouldn’t; you have to supply a sane predicate.

My point is that insisting on the current definition of pure means that a whole class of ops is either in danger of being moved or replaced illegally, or precluded from other optimizations in contexts they behave “normally” in.

Maybe we could start by seeing how linalg.index should be declared, and then work our way towards all the stuff we break by making it impure.

I think you can make a strong argument if you can find a few more ops that are semantically tied to one of their enclosing ops. Otherwise, it’s unclear whether this issue is due to a missing feature in MLIR or an incorrectly designed linalg.index op.


I’d argue that linalg.index is an awkward op at best – my interpretation is that (conceptually) it relies on some implicit state that needs to be read from the environment / memory. One alternative would be to have an explicit block arg with the index tuple that can be unpacked when necessary (and dropped when unused).


I do think there’s a missing op trait weaker than IsolatedFromAbove, something like EffectsScope or EffectsBarrier, telling passes like CSE or the canonicalizer that ops inside such an op cannot be hoisted outside that op.
This would be especially useful for ops like gpu.launch, where after canonicalization, constants have to be re-materialized for outlining the kernel. It could also help in this particular case.


That wouldn’t help here: we want to hoist pure operations out of a linalg.generic, I believe, just not linalg.index (because its modeling is just wrong).

An op that isn’t declared “Pure” has “unknown side effects” by default, which should preclude CSE. If I am missing something, I’m happy to look into an example further!

This isn’t right. An impure op will not be CSE’d even if it doesn’t declare a memory effect. For example, the func.call op doesn’t declare any memory effects, and neither does gpu.barrier. These aren’t CSE’d for sure, or else most generated IR would break semantics.


Maybe we have been a bit too lax with terminology here, because I don’t feel like my point has gotten across, and none of you have provided a workable solution that does not eliminate linalg.index altogether. Let me be as explicit as I can.

There is the ODS Trait list Pure and the C++ predicate isPure. The former implies AlwaysSpeculatable and NoMemoryEffect. For reference, here is the C++ side:

bool mlir::isPure(Operation *op) {
  return isSpeculatable(op) && isMemoryEffectFree(op);
}

CSE will remove an op for which !isPure(op), as long as isMemoryEffectFree(op) holds, even when !isSpeculatable(op). That means declaring linalg.index with just [NoMemoryEffect] won’t fix the issue, for example.

When I say “unknown side effects”, I actually meant any implementation of an EffectOpInterfaceBase interface. I see why that is too conservative, but I believe we should say somewhere that only memory effects are pessimized and user-introduced side effects are discouraged / at least not pessimized.

Your terminology is off. func.call is not CSE’d because it doesn’t implement the memory effect interface, which means its memory effects are pessimized and thus isMemoryEffectFree(op) is false. I would argue func.call actually has all possible memory effects without declaring them.

Also, an op with !isPure(op) will still be CSE’d if isMemoryEffectFree(op) holds, because the CSEDriver does not evaluate isSpeculatable(op). For example, suppose I had an op that models an instruction that could trap based on some CPU state. I could declare it [NoMemoryEffect], which implicitly makes it never speculatable, but CSE will still touch it.

Summarizing all the above, let me show you how I think this needs a missing feature. Looking at possible existing traits for linalg.index:

  • [Pure] lets CSE wrongfully eliminate the op.

  • [NoMemoryEffect] also lets CSE do that, but my position is that it should check that isSpeculatable(op). Only if we patch the CSEDriver would this fix the issue.

  • [AlwaysSpeculatable] or [] would fix the issue as-is, but mean that it could never be isOpTriviallyDead(op), which is undesirable.

    Also, linalg.index would now act as a barrier to CSE, preventing it for MemRead<...> ops straddling it, which is clearly not necessary. And, of course, there are more such cases - LinalgStructuredBase_Op declares RecursiveMemoryEffects, for example. That means our linalg.map above would suddenly have memory effects even though it is in tensor land. Now that’s just shifting the problem from one pile to another in my opinion.

MLIR is an extensible IR framework; I can create arbitrarily many such ops if I want. I understand that it is hard to argue about something theoretical, but it needs to be either forbidden or specified.

I also wanted to explicitly express my support for this solution when it comes to linalg.index. In my understanding of idiomatic MLIR usage so far, having the indices be forwarded as block argument(s) is the proper way to accomplish this. But to me linalg.index’s troubles are only a symptom, as operations like it simply can’t exist properly at the moment, but aren’t forbidden either.

CSE should not speculate something that isn’t speculatable; this is a bug to fix. It’s independent of the issue with the incorrect modeling of linalg.index, though.

Pure does not model linalg.index, we already established this.

NoMemoryEffect does not model linalg.index correctly either: it reads some hidden state.

linalg.index is really reading some state that is written by the enclosing operation. An operation with a simple read (or allocate) effect should be compatible with isOpTriviallyDead(op).

I don’t believe this is correct, cf. the paragraph right above.
Modeling it that way may have other side effects (pun intended), but the modeling is definitely possible in MLIR today.
What MLIR does not give you is a first-class concept for tying together this particular relationship of passing hidden state implicitly from the enclosing op to the linalg.index: it goes through the generic effect modeling.


I’m not fully caught up on the thread yet, but I have two relevant historical design points to make:

  • “Structured” linalg operations like linalg.generic or linalg.map are not designed to be nested inside each other. If that isn’t explicitly marked as UB, it should be. The rest of the operations, including linalg.index, and most transformations assume this.
  • We used to have linalg.indexed_generic that produced indices as block arguments and moved away from that: [RFC] Linalg (Indexed)GenericOp Unification.

AFAIK, making linalg.index have [AlwaysSpeculatable, MemRead<LinalgIndexResource>] will fix the CSE issue since the memory side effects are scoped within blocks already. It will also be able to become trivially dead as you said. But tensor-semantic linalg structured ops are still adversely impacted by this change.

The enclosing LinalgStructuredBase_Op will now have new potential side effects to consider (via recursive side effects). With a linalg.index inside it, it can still become trivially dead, but never isPure(op). I believe this is not intentional, as it could have been effectively pure otherwise, in particular when hasPureTensorSemantics() is true.

While isPure(op) is not used upstream, isMemoryEffectFree(op) is used, for example in the default moveLoopInvariantCode. Making this change to linalg.index could therefore at least break pipelines relying on LICM moving linalg structured ops, which sounds pretty concerning.

What I wanted to prove is that it is not possible in the status quo to give linalg.index traits such that the CSE issue goes away and no currently working analyses and transforms are unnecessarily pessimized.

While linalg.index must have an immediate LinalgOp-interface parent, suppose there were an op that only needed a specific ancestor from which it captured its semantics. The op traits proposed above would then overconstrain CSE, because ops with read memory effects that sit in different blocks are not CSE’d. You get the same problem if you keep the immediate-parent constraint but the parent is not SingleBlock: inside an SSACFG region, CSE will not substitute values across different blocks for ops with memory side effects.

I was not aware of this, but it makes sense. Adding a memory side effect to linalg.index still comes at the cost described above, though, even when no nesting takes place.

I found another CSE problem with a different operation:

func.func @masked_reduce_minf_f32(%arg0: vector<16xf32>, %mask : vector<16xi1>) -> (f32,f32) {
  %2 = vector.reduction <minnumf>, %arg0 : vector<16xf32> into f32
  %0 = vector.mask %mask { vector.reduction <minnumf>, %arg0 : vector<16xf32> into f32 } : vector<16xi1> -> f32
  return %0, %2 : f32, f32
}

If you run CSE, you get this:

  func.func @masked_reduce_minf_f32(%arg0: vector<16xf32>, %arg1: vector<16xi1>) -> (f32, f32) {
    %0 = vector.reduction <minnumf>, %arg0 : vector<16xf32> into f32
    %1 = vector.mask %arg1 { vector.yield %0 : f32 } : vector<16xi1> -> f32
    return %1, %0 : f32, f32
  }

Great, this case is indeed not covered adequately by my first suggestion; I really appreciate the effort. This time, conceptually, the parent imposes semantics on the child.

However, I really don’t fault OperationEquivalence or CSEDriver for this one. Well, all vector ops could get a MemRead<VectorMask> I suppose. But since masking is tied to an interface, which is extensible, this would remain a loaded footgun.