BufferDeallocationInternals: canonicalization of the target buffer of the clone operation

Hello, I’m running into an issue similar to the one in the topic Bufferization: how to avoid extra alloc with 'buffer-results-to-out-params'. My current solution is to run a CopyRemoval pass after -buffer-results-to-out-params and -buffer-deallocation to remove the memref.alloc/copy/dealloc.

Looking at Buffer Deallocation - Internals - MLIR, I find that the “Canonicalization of the Target Buffer of the Clone Operation” section has an IR pattern very similar to the original post’s, and it can remove the memref.alloc/dealloc and bufferization.clone. Does this mean that the -buffer-results-to-out-params pass should create a bufferization.clone instead of a memref.copy?

Also, I’m testing mlir-opt --canonicalize on the IR below, but I get error: redefinition of SSA value '%result' on the line %result = bufferization.clone %temp : memref<2xf32> to memref<2xf32>. Any hint as to how this error can happen? I’m using an MLIR version from January 2023. Any comments are appreciated, thank you!

#map = affine_map<(d0) -> (d0)>
func.func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>){
  %temp = memref.alloc() : memref<2xf32>
  linalg.generic {
    indexing_maps = [#map, #map],
    iterator_types = ["parallel"]} ins(%arg0 : memref<2xf32>) outs(%temp : memref<2xf32>) {
  ^bb0(%gen2_arg0: f32):
    %tmp2 = math.exp %gen2_arg0 : f32
    linalg.yield %tmp2 : f32
  }
  %result = bufferization.clone %temp : memref<2xf32> to memref<2xf32>
  memref.dealloc %temp : memref<2xf32>
  return
}

Your code snippet is invalid: it has two definitions of %result (the block argument and %result = ...). SSA values cannot be reassigned.
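
For example, giving the clone’s result a fresh name makes the snippet verify (a hand-edited fix; the clone is then simply unused):

  %0 = bufferization.clone %temp : memref<2xf32> to memref<2xf32>
  memref.dealloc %temp : memref<2xf32>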

bufferization.clone cannot be used with buffer-results-to-out-params. That pass adds a new memref parameter to the function signature, and it is the caller’s responsibility to provide that buffer. bufferization.clone is essentially an alloc + memcpy, but in the case of buffer-results-to-out-params we do not allocate at all.
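
To illustrate the difference (a hand-written sketch, not the output of any pass): a bufferization.clone materializes a fresh buffer, roughly

  %new = memref.alloc() : memref<1024xbf16>
  memref.copy %src, %new : memref<1024xbf16> to memref<1024xbf16>

whereas with buffer-results-to-out-params the callee only writes into a buffer that the caller already owns:

func.func @callee(%in: memref<1024xbf16>, %out: memref<1024xbf16>) {
  // fill %out in place; no allocation for the result
  return
}
func.func @caller(%in: memref<1024xbf16>) {
  // the caller provides the result buffer
  %buf = memref.alloc() : memref<1024xbf16>
  func.call @callee(%in, %buf) : (memref<1024xbf16>, memref<1024xbf16>) -> ()
  memref.dealloc %buf : memref<1024xbf16>
  return
}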

Can you show the input IR on which you want to run buffer-results-to-out-params and buffer-deallocation? (Or is it the snippet above?)

@matthias-springer Thanks for the reply and for answering my questions. The code snippet above is from the “Canonicalization of the Target Buffer of the Clone Operation” section in Buffer Deallocation - Internals - MLIR. If the code is invalid, perhaps we should update the code snippet in the doc.

As for the input IR for buffer-results-to-out-params and buffer-deallocation, here is an example. Our IR starts from TOSA:

func.func @func1(%arg0: tensor<1024xbf16>, %arg1: tensor<1024xbf16>) -> (tensor<1024xbf16>) {
    %0 = "tosa.add"(%arg0,%arg1) : (tensor<1024xbf16>,tensor<1024xbf16>)  -> (tensor<1024xbf16>)
    return %0 : tensor<1024xbf16>
}

After lowering TOSA to Linalg, a tensor.empty() is introduced that will eventually become a memref.alloc. The lowering passes are --pass-pipeline="builtin.module(func.func(tosa-to-linalg-named, tosa-to-linalg))" followed by --linalg-fuse-elementwise-ops.
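
For reference, the full invocation looks roughly like this (file names are placeholders, and I split it into two mlir-opt runs because --pass-pipeline typically cannot be mixed with individual pass flags):

mlir-opt input.mlir \
  --pass-pipeline="builtin.module(func.func(tosa-to-linalg-named, tosa-to-linalg))" \
  -o lowered.mlir
mlir-opt lowered.mlir --linalg-fuse-elementwise-ops -o fused.mlir

This produces: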

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: tensor<1024xbf16>, %arg1: tensor<1024xbf16>) -> tensor<1024xbf16> {
    %0 = tensor.empty() : tensor<1024xbf16>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1024xbf16>, tensor<1024xbf16>) outs(%0 : tensor<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %2 = arith.addf %in, %in_0 : bf16
      linalg.yield %2 : bf16
    } -> tensor<1024xbf16>
    return %1 : tensor<1024xbf16>
  }
}

Next, bufferize with --empty-tensor-to-alloc-tensor --one-shot-bufferize="allow-return-allocs allow-unknown-ops bufferize-function-boundaries function-boundary-type-conversion=identity-layout-map" --canonicalize --cse:

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>) -> memref<1024xbf16> {
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<1024xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%alloc : memref<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %0 = arith.addf %in, %in_0 : bf16
      linalg.yield %0 : bf16
    }
    return %alloc : memref<1024xbf16>
  }
}

Then transform to destination-passing-style code with --buffer-results-to-out-params --buffer-deallocation --canonicalize --cse; this creates the memref.copy and memref.dealloc ops:

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>, %arg2: memref<1024xbf16>) {
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<1024xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%alloc : memref<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %0 = arith.addf %in, %in_0 : bf16
      linalg.yield %0 : bf16
    }
    memref.copy %alloc, %arg2 : memref<1024xbf16> to memref<1024xbf16>
    memref.dealloc %alloc : memref<1024xbf16>
    return
  }
}

Finally, our own canonicalization pass performs a CopyRemoval similar to ⚙ D82757 [mlir] Add redundant copy removal transform (a simplified sketch of such a rewrite follows the IR below). The linalg.generic then has its outs writing directly to the destination argument %arg2. We want this IR form because our flow includes a few Polygeist passes.

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>, %arg2: memref<1024xbf16>) {
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%arg2 : memref<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %0 = arith.addf %in, %in_0 : bf16
      linalg.yield %0 : bf16
    }
    return
  }
}
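
For illustration, here is a much-simplified C++ sketch of this kind of copy-removal rewrite (hypothetical code; D82757’s actual transform performs a proper interference analysis):

#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Fold `memref.copy %alloc, %out` by replacing %alloc with %out.
// Simplification: the only accepted uses of %alloc are (a) the copy
// itself, (b) one dealloc, or (c) ops in the same block that precede
// the copy. A real pass must additionally verify that %out is not
// read or written between those writers and the copy.
struct FoldAllocCopyIntoOutParam : public OpRewritePattern<memref::CopyOp> {
  using OpRewritePattern<memref::CopyOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(memref::CopyOp copy,
                                PatternRewriter &rewriter) const override {
    auto alloc = copy.getSource().getDefiningOp<memref::AllocOp>();
    if (!alloc || alloc.getType() != copy.getTarget().getType())
      return failure();
    memref::DeallocOp dealloc;
    for (Operation *user : alloc->getUsers()) {
      if (user == copy.getOperation())
        continue;
      if (auto d = dyn_cast<memref::DeallocOp>(user)) {
        dealloc = d;
        continue;
      }
      if (user->getBlock() != copy->getBlock() ||
          !user->isBeforeInBlock(copy))
        return failure();
    }
    Value target = copy.getTarget();
    if (dealloc)
      rewriter.eraseOp(dealloc);
    rewriter.eraseOp(copy);
    // Remaining users of the alloc now write directly into %out.
    rewriter.replaceOp(alloc, target);
    return success();
  }
};

In a pass, this pattern would be added to a RewritePatternSet and applied with applyPatternsAndFoldGreedily.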

Your lowering pass pipeline looks good to me.

There is an alternative way to avoid the copy: handle it before bufferization. If the function had a tensor “out” parameter (fed to the linalg.generic’s “outs”), the copy would not be generated in the first place. But we are currently missing a pass to get there from TOSA.

What we are missing is a result-tensor-to-out-param pass. Such a hypothetical pass could rewrite your second snippet as follows:

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: tensor<1024xbf16>, %arg1: tensor<1024xbf16>, %arg2: tensor<1024xbf16>) -> tensor<1024xbf16> {
    %0 = tensor.empty() : tensor<1024xbf16>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1024xbf16>, tensor<1024xbf16>) outs(%0 : tensor<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %2 = arith.addf %in, %in_0 : bf16
      linalg.yield %2 : bf16
    } -> tensor<1024xbf16>
    %2 = tensor.insert_slice %1 into %arg2 [0][1024][1] : tensor<1024xbf16> into tensor<1024xbf16>
    return %2 : tensor<1024xbf16>
  }
}

Then you could run -eliminate-empty-tensors, which rewrites the use of the tensor.empty so that it becomes dead:

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: tensor<1024xbf16>, %arg1: tensor<1024xbf16>, %arg2: tensor<1024xbf16>) -> tensor<1024xbf16> {
    %0 = tensor.empty() : tensor<1024xbf16>
    %extracted = tensor.extract_slice %arg2 [0][1024][1] : tensor<1024xbf16> to tensor<1024xbf16>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1024xbf16>, tensor<1024xbf16>) outs(%extracted : tensor<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %2 = arith.addf %in, %in_0 : bf16
      linalg.yield %2 : bf16
    } -> tensor<1024xbf16>
    %2 = tensor.insert_slice %1 into %arg2 [0][1024][1] : tensor<1024xbf16> into tensor<1024xbf16>
    return %2 : tensor<1024xbf16>
  }
}

Note that the tensor.empty is now dead. If you bufferize this IR with -one-shot-bufferize -cse -canonicalize, there should be no copies.
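
Roughly, I would expect the bufferized result to look like this (hand-written for illustration, modulo layout maps; not actual pass output):

#map = affine_map<(d0) -> (d0)>
module {
  func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>, %arg2: memref<1024xbf16>) -> memref<1024xbf16> {
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%arg2 : memref<1024xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %0 = arith.addf %in, %in_0 : bf16
      linalg.yield %0 : bf16
    }
    return %arg2 : memref<1024xbf16>
  }
}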

So you can remove the copy after bufferization (your pass pipeline) or avoid it before bufferization (the example that I showed here). Either way should work.
