Clarification on observed behavior (linalg, CSE)

I have been studying MLIR by expanding the Toy example, and have come across this problem which I can’t quite figure out. I don’t have a compilers background, so there might be something obvious I’m missing here:

My example consists of this basic program: I have two “linalg.fill” ops, each filling with a separate value obtained by calling out to a system time function (toy.time and toy.sleep get lowered into C calls). I then subtract the two tensors and print the resulting values of each tensor:

#map = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @main() {
    %cst = constant dense<1.000000e+00> : tensor<f32>
    %0 = "toy.time"() : () -> f64
    %1 = linalg.init_tensor [2, 2] : tensor<2x2xf64>
    %2 = linalg.fill(%0, %1) : f64, tensor<2x2xf64> -> tensor<2x2xf64> 
    "toy.sleep"(%cst) : (tensor<f32>) -> ()
    %3 = "toy.time"() : () -> f64
    %4 = linalg.init_tensor [2, 2] : tensor<2x2xf64>
    %5 = linalg.fill(%3, %4) : f64, tensor<2x2xf64> -> tensor<2x2xf64> 
    %6 = linalg.init_tensor [2, 2] : tensor<2x2xf64>
    %7 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<2x2xf64>, tensor<2x2xf64>) outs(%6 : tensor<2x2xf64>) {
    ^bb0(%arg0: f64, %arg1: f64, %arg2: f64):  // no predecessors
      %8 = subf %arg0, %arg1 : f64
      linalg.yield %8 : f64
    } -> tensor<2x2xf64>
    toy.print %2 : tensor<2x2xf64>
    toy.print %5 : tensor<2x2xf64>
    toy.print %7 : tensor<2x2xf64>
    return
  }
}

If I run the pass sequence of “CSE, linalgBufferizePass, linalgToAffineLoopsPass”, the result is:

module  {
  func @main() {
    %cst = constant dense<1.000000e+00> : tensor<f32>
    %0 = "toy.time"() : () -> f64
    %1 = memref.alloc() : memref<2x2xf64>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        affine.store %0, %1[%arg0, %arg1] : memref<2x2xf64>
      }
    }
    %2 = memref.tensor_load %1 : memref<2x2xf64>
    "toy.sleep"(%cst) : (tensor<f32>) -> ()
    %3 = "toy.time"() : () -> f64
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        affine.store %3, %1[%arg0, %arg1] : memref<2x2xf64>
      }
    }
    %4 = memref.tensor_load %1 : memref<2x2xf64>
    %5 = memref.alloc() : memref<2x2xf64>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 2 {
        %7 = affine.load %1[%arg0, %arg1] : memref<2x2xf64>
        %8 = affine.load %1[%arg0, %arg1] : memref<2x2xf64>
        %9 = subf %7, %8 : f64
        affine.store %9, %5[%arg0, %arg1] : memref<2x2xf64>
      }
    }
    %6 = memref.tensor_load %5 : memref<2x2xf64>
    toy.print %2 : tensor<2x2xf64>
    toy.print %4 : tensor<2x2xf64>
    toy.print %6 : tensor<2x2xf64>
    return
  }
}

Obviously this is not what I want, because now the “FillOp” is repeatedly filling the same buffer and the subtraction ends up subtracting that buffer from itself. Removing the CSE pass corrects this behavior. What am I doing wrong here? The “toy.time” calls are preserved, and the actual fill operations are preserved, but the analysis merges their output tensors before bufferization.

Since you know that the issue happens with CSE, the best thing is to look at the output right after the CSE pass, don’t you think?

Here is the IR post-CSE (after I tweaked it to make it work at the current repo HEAD):

#map = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @main() {
    %cst = arith.constant dense<1.000000e+00> : tensor<f32>
    %0 = "toy.time"() : () -> f64
    %1 = linalg.init_tensor [2, 2] : tensor<2x2xf64>
    %2 = linalg.fill(%0, %1) : f64, tensor<2x2xf64> -> tensor<2x2xf64> 
    "toy.sleep"(%cst) : (tensor<f32>) -> ()
    %3 = "toy.time"() : () -> f64
    %4 = linalg.fill(%3, %1) : f64, tensor<2x2xf64> -> tensor<2x2xf64> 
    %5 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%4, %2 : tensor<2x2xf64>, tensor<2x2xf64>) outs(%1 : tensor<2x2xf64>) {
    ^bb0(%arg0: f64, %arg1: f64, %arg2: f64):  // no predecessors
      %6 = arith.subf %arg0, %arg1 : f64
      linalg.yield %6 : f64
    } -> tensor<2x2xf64>
    "toy.print"(%2) : (tensor<2x2xf64>) -> ()
    "toy.print"(%4) : (tensor<2x2xf64>) -> ()
    "toy.print"(%5) : (tensor<2x2xf64>) -> ()
    return
  }
}

The linalg.init_tensor operation has been CSE’d. This seems fine to me here.

So I suspect that this is a miscompile in the bufferization pass. Adding -linalg-bufferize I get:

#map = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @main() {
    %cst = arith.constant dense<1.000000e+00> : tensor<f32>
    %0 = "toy.time"() : () -> f64
    %1 = memref.alloc() : memref<2x2xf64>
    linalg.fill(%0, %1) : f64, memref<2x2xf64> 
    %2 = memref.tensor_load %1 : memref<2x2xf64>
    "toy.sleep"(%cst) : (tensor<f32>) -> ()
    %3 = "toy.time"() : () -> f64
    linalg.fill(%3, %1) : f64, memref<2x2xf64> 
    %4 = memref.tensor_load %1 : memref<2x2xf64>
    %5 = memref.alloc() : memref<2x2xf64>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%1, %1 : memref<2x2xf64>, memref<2x2xf64>) outs(%5 : memref<2x2xf64>) {
    ^bb0(%arg0: f64, %arg1: f64, %arg2: f64):  // no predecessors
      %7 = arith.subf %arg0, %arg1 : f64
      linalg.yield %7 : f64
    }
    %6 = memref.tensor_load %5 : memref<2x2xf64>
    "toy.print"(%2) : (tensor<2x2xf64>) -> ()
    "toy.print"(%4) : (tensor<2x2xf64>) -> ()
    "toy.print"(%6) : (tensor<2x2xf64>) -> ()
    return
  }
}

Indeed, the two fill ops are using the same buffer. So a minimal reproducer for linalg-bufferize would be:

  func @main(%0 : f64, %1 : f64) -> (tensor<2x2xf64>, tensor<2x2xf64>) {
    %2 = linalg.init_tensor [2, 2] : tensor<2x2xf64>
    %3 = linalg.fill(%0, %2) : f64, tensor<2x2xf64> -> tensor<2x2xf64> 
    %4 = linalg.fill(%1, %2) : f64, tensor<2x2xf64> -> tensor<2x2xf64> 
    return %3, %4 : tensor<2x2xf64>, tensor<2x2xf64>
  }

Which produces:

  func @main(%arg0: f64, %arg1: f64) -> (tensor<2x2xf64>, tensor<2x2xf64>) {
    %0 = memref.alloc() : memref<2x2xf64>
    linalg.fill(%arg0, %0) : f64, memref<2x2xf64> 
    %1 = memref.tensor_load %0 : memref<2x2xf64>
    linalg.fill(%arg1, %0) : f64, memref<2x2xf64> 
    %2 = memref.tensor_load %0 : memref<2x2xf64>
    return %1, %2 : tensor<2x2xf64>, tensor<2x2xf64>
  }

I suspect that the LinalgBufferize pass is taking a shortcut on the init_tensor and is not modeling live ranges properly when making the allocation decision.
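
For reference, here is a hand-written sketch (not actual pass output) of the bufferization I would expect a conservative pass to produce for the reproducer above: each fill gets its own allocation, so the two results stay distinct.

  func @main(%arg0: f64, %arg1: f64) -> (tensor<2x2xf64>, tensor<2x2xf64>) {
    // First fill writes into its own buffer.
    %0 = memref.alloc() : memref<2x2xf64>
    linalg.fill(%arg0, %0) : f64, memref<2x2xf64>
    %1 = memref.tensor_load %0 : memref<2x2xf64>
    // Second fill gets a separate buffer instead of clobbering the first.
    %2 = memref.alloc() : memref<2x2xf64>
    linalg.fill(%arg1, %2) : f64, memref<2x2xf64>
    %3 = memref.tensor_load %2 : memref<2x2xf64>
    return %1, %3 : tensor<2x2xf64>, tensor<2x2xf64>
  }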

Edit: indeed, this is primitive and unsafe.

Since you know that the issue happens with CSE, the best thing is to look at the output right after the CSE pass, don’t you think?

Yes, I did look at this, but I stopped at that point because I was under the impression that combining the init_tensors was wrong. I see now from your comments that this impression was incorrect, thanks for the clarification. Does your final point mean this pass has a bug, or am I just using it outside of its intended use case? Tomorrow I’ll try out the other linalg bufferize passes that have additional analysis.

Thanks!

Probably both :slight_smile:
I suspect this was developed with simplistic assumptions that happen to work in the use cases it has been tested on, but that isn’t an excuse not to be conservative here.

@cbate please also see an older post about other issues I had found: Properly using Bufferization related passes.

These days we are using a single -linalg-comprehensive-module-bufferize pass that is more restricted but designed to provide a more “batteries included” experience.

Right now, I think this would fail due to the unknown ops (the pass is conservative and will easily fail in a bunch of cases atm). However, @matthias-springer has started a generalization effort based on interfaces that you should be able to use in a few days to hook into the toy ops.

Alternatively, you could hide the toy operations that operate on tensors behind an external function call (see the sketch below).
Pro: -linalg-comprehensive-module-bufferize should be able to bufferize these properly, and it would work today.
Con: you’ll need to hook toy.print to one of the print functions and link with the runtime support libraries, and you’ll need to do the same for toy.sleep. But it seems you already have some of that on your end?
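
As a rough sketch of what that could look like (hand-written; the function names are made up), the tensor-consuming toy ops become calls to external functions that are resolved by a runtime support library:

  // External declarations resolved by a runtime library (hypothetical names).
  func private @toy_print_2x2xf64(tensor<2x2xf64>)
  func private @toy_sleep(tensor<f32>)

  func @user(%t : tensor<2x2xf64>, %d : tensor<f32>) {
    // The bufferization pass only sees opaque calls taking tensors.
    call @toy_sleep(%d) : (tensor<f32>) -> ()
    call @toy_print_2x2xf64(%t) : (tensor<2x2xf64>) -> ()
    return
  }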

You can see relevant tests around here: https://sourcegraph.com/github.com/llvm/llvm-project/-/blob/mlir/test/Dialect/Linalg/comprehensive-module-bufferize.mlir?L395

Thanks for the pointer to the linalg-comprehensive-module-bufferize pass. What information is missing from the IR that the new interfaces will add?

Regardless, I can definitely outline the calls to a function as a workaround, thanks for the heads up.

Basically, only a minimal set of ops is supported atm (https://sourcegraph.com/github.com/llvm/llvm-project/-/blob/mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferize.cpp?L439).

The interface will make it easy to extend to new ops and will greatly reduce the amount of code in the pass.

In your particular case, I do not expect you’ll need a layout / inplaceable specification such as the one here: https://sourcegraph.com/github.com/llvm/llvm-project/-/blob/mlir/test/Dialect/Linalg/comprehensive-module-bufferize.mlir?L701.

If you wanted your MLIR programs to take tensors created externally (e.g. from numpy), you’d need that.
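
For illustration, the kind of annotation in those tests looks roughly like this (a hand-written sketch; the function itself is made up, and the attribute spelling may differ at head): a tensor argument is marked as writable in place by the caller, which lets the bufferization reuse the caller’s buffer instead of allocating and copying.

  // Hypothetical example: %A may be overwritten in place, so the fill can
  // bufferize directly into the caller-provided buffer.
  func @fill_inplace(%A : tensor<64xf32> {linalg.inplaceable = true}, %v : f32) -> tensor<64xf32> {
    %0 = linalg.fill(%v, %A) : f32, tensor<64xf32> -> tensor<64xf32>
    return %0 : tensor<64xf32>
  }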