@matthias-springer Thanks for the reply and for answering my questions. The code snippet above is from the “Canonicalization of the Target Buffer of the Clone Operation” section in Buffer Deallocation - Internals - MLIR. If the code is invalid, perhaps we should update the snippet in the doc.
As for the IR that runs through buffer-results-to-out-params and buffer-deallocation, here is an example.
Our IR starting from TOSA:
func.func @func1(%arg0: tensor<1024xbf16>, %arg1: tensor<1024xbf16>) -> (tensor<1024xbf16>) {
%0 = "tosa.add"(%arg0,%arg1) : (tensor<1024xbf16>,tensor<1024xbf16>) -> (tensor<1024xbf16>)
return %0 : tensor<1024xbf16>
}
After the TOSA->Linalg lowering, a tensor.empty() is introduced and will eventually become a memref.alloc. The passes for the lowering are --pass-pipeline="builtin.module(func.func(tosa-to-linalg-named, tosa-to-linalg))" --linalg-fuse-elementwise-ops.
#map = affine_map<(d0) -> (d0)>
module {
func.func @func1(%arg0: tensor<1024xbf16>, %arg1: tensor<1024xbf16>) -> tensor<1024xbf16> {
%0 = tensor.empty() : tensor<1024xbf16>
%1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1024xbf16>, tensor<1024xbf16>) outs(%0 : tensor<1024xbf16>) {
^bb0(%in: bf16, %in_0: bf16, %out: bf16):
%2 = arith.addf %in, %in_0 : bf16
linalg.yield %2 : bf16
} -> tensor<1024xbf16>
return %1 : tensor<1024xbf16>
}
}
Do the bufferization with --empty-tensor-to-alloc-tensor --one-shot-bufferize="allow-return-allocs allow-unknown-ops bufferize-function-boundaries function-boundary-type-conversion=identity-layout-map" --canonicalize --cse:
#map = affine_map<(d0) -> (d0)>
module {
func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>) -> memref<1024xbf16> {
%alloc = memref.alloc() {alignment = 64 : i64} : memref<1024xbf16>
linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%alloc : memref<1024xbf16>) {
^bb0(%in: bf16, %in_0: bf16, %out: bf16):
%0 = arith.addf %in, %in_0 : bf16
linalg.yield %0 : bf16
}
return %alloc : memref<1024xbf16>
}
}
Transform to destination-passing-style code with --buffer-results-to-out-params --buffer-deallocation --canonicalize --cse. The memref.copy and memref.dealloc ops are created.
#map = affine_map<(d0) -> (d0)>
module {
func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>, %arg2: memref<1024xbf16>) {
%alloc = memref.alloc() {alignment = 64 : i64} : memref<1024xbf16>
linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%alloc : memref<1024xbf16>) {
^bb0(%in: bf16, %in_0: bf16, %out: bf16):
%0 = arith.addf %in, %in_0 : bf16
linalg.yield %0 : bf16
}
memref.copy %alloc, %arg2 : memref<1024xbf16> to memref<1024xbf16>
memref.dealloc %alloc : memref<1024xbf16>
return
}
}
Finally, our own canonicalization pass does a CopyRemoval similar to D82757 [mlir] Add redundant copy removal transform (a rough sketch of such a rewrite is included after the final IR below). After that, the linalg.generic writes its outs directly to the destination argument %arg2. We are looking for this IR form because our flow includes a few Polygeist passes.
#map = affine_map<(d0) -> (d0)>
module {
func.func @func1(%arg0: memref<1024xbf16>, %arg1: memref<1024xbf16>, %arg2: memref<1024xbf16>) {
linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : memref<1024xbf16>, memref<1024xbf16>) outs(%arg2 : memref<1024xbf16>) {
^bb0(%in: bf16, %in_0: bf16, %out: bf16):
%0 = arith.addf %in, %in_0 : bf16
linalg.yield %0 : bf16
}
return
}
}
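For reference, here is a minimal sketch of the kind of rewrite our CopyRemoval does, written as an MLIR C++ pattern. This is not the D82757 code or our actual in-house pass; the pattern name is made up and the legality checks are heavily simplified, but it shows the idea of redirecting the producer to write into the copy target:

#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical sketch (not the real pass): fold `memref.copy %alloc, %out`
// by making the producer write directly into %out, when %alloc is a local
// allocation whose only uses are the producer, this copy, and a dealloc.
struct RedundantCopyRemoval : public OpRewritePattern<memref::CopyOp> {
  using OpRewritePattern<memref::CopyOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(memref::CopyOp copyOp,
                                PatternRewriter &rewriter) const override {
    Value source = copyOp.getSource();
    auto allocOp = source.getDefiningOp<memref::AllocOp>();
    if (!allocOp)
      return failure();

    memref::DeallocOp deallocOp;
    for (Operation *user : source.getUsers()) {
      if (user == copyOp)
        continue;
      if (auto dealloc = dyn_cast<memref::DeallocOp>(user)) {
        deallocOp = dealloc;
        continue;
      }
      // Any other user must be the producer and must run before the copy.
      // A real pass also needs dominance and aliasing checks here.
      if (user->getBlock() != copyOp->getBlock() ||
          !user->isBeforeInBlock(copyOp.getOperation()))
        return failure();
    }

    // Make the producer write into the copy target, then drop the temporary
    // buffer, its dealloc, and the copy itself.
    rewriter.replaceAllUsesWith(source, copyOp.getTarget());
    if (deallocOp)
      rewriter.eraseOp(deallocOp);
    rewriter.eraseOp(copyOp);
    rewriter.eraseOp(allocOp);
    return success();
  }
};

In our real flow the legality analysis is more involved (aliasing, multiple writes, control flow), but after this kind of rewrite the linalg.generic writes straight into %arg2, as in the IR above.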