Hi everyone,
I am currently working on a prototype to make linalg on tensors executable end-to-end, in the context of tensor-based transformations. In particular, the stickiest point I am facing is avoiding copies and allocs more aggressively across function boundaries. I am having difficulties properly connecting the many bufferization-related passes. I think I may also be hitting more fundamental bufferization design issues, and I am unclear how to categorize each of the following as:
- known and being worked on,
- known but without a clear plan for resolution,
- as yet unknown.
Setup
func @init_and_matmul(%arg0: tensor<128x128xf32>, %arg1: tensor<128x128xf32>, %arg2: tensor<128x128xf32>) -> tensor<128x128xf32> {
%cst = constant 0.000000e+00 : f32
%0 = linalg.fill(%arg2, %cst) : tensor<128x128xf32>, f32 -> tensor<128x128xf32>
%1 = linalg.matmul ins(%arg0, %arg1 : tensor<128x128xf32>, tensor<128x128xf32>) outs(%0 : tensor<128x128xf32>) -> tensor<128x128xf32>
return %1 : tensor<128x128xf32>
}
func @main() {
// Some IR to create %1, %2 and %3
%5 = call @init_and_matmul(%1, %2, %3) : (tensor<128x128xf32>, tensor<128x128xf32>, tensor<128x128xf32>) -> tensor<128x128xf32>
// print stuff from %5.
return
}
In the following, I’ll call my prototype pass -mystuff.
Dealloc + copy removal
I start with the mlir-opt -func-bufferize -buffer-results-to-out-params -mystuff pipeline and get the following IR:
func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
%cst = constant 0.000000e+00 : f32
%0 = alloc() : memref<128x128xf32>
%1 = alloc() : memref<128x128xf32>
linalg.fill(%1, %cst) : memref<128x128xf32>, f32
linalg.copy(%1, %0) : memref<128x128xf32>, memref<128x128xf32>
linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
linalg.copy(%0, %arg3) : memref<128x128xf32>, memref<128x128xf32>
return
}
From this, running mlir-opt -buffer-deallocation -copy-removal
produces:
func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
%cst = constant 0.000000e+00 : f32
%0 = alloc() : memref<128x128xf32>
linalg.fill(%0, %cst) : memref<128x128xf32>, f32
linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%arg3 : memref<128x128xf32>)
return
}
The fill is now dead, the computation is incorrect and %0 leaks.
For completeness, I’ll add that copy, fill and named ops have their effects properly set.
Am I missing something?
Return memref semantics
To try to circumvent the above, I now remove -buffer-results-to-out-params and run -func-bufferize -mystuff. I get:
func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>) -> memref<128x128xf32> {
%cst = constant 0.000000e+00 : f32
%0 = alloc() : memref<128x128xf32>
%1 = alloc() : memref<128x128xf32>
linalg.fill(%1, %cst) : memref<128x128xf32>, f32
linalg.copy(%1, %0) : memref<128x128xf32>, memref<128x128xf32>
linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
return %0 : memref<128x128xf32>
}
This looks reasonable at first sight. After running mlir-opt -buffer-deallocation -copy-removal, I see:
func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>) -> memref<128x128xf32> {
%cst = constant 0.000000e+00 : f32
%0 = alloc() : memref<128x128xf32>
linalg.fill(%0, %cst) : memref<128x128xf32>, f32
linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
return %0 : memref<128x128xf32>
}
Great, the extra alloc + copy is optimized away.
Unfortunately on the caller side I get:
func @main() {
%cst = constant 0.000000e+00 : f32
%cst_0 = constant 1.000000e+00 : f32
%c0 = constant 0 : index
%c1 = constant 1 : index
%0 = alloc() : memref<128x128xf32>
%1 = alloc() : memref<128x128xf32>
%2 = alloc() : memref<128x128xf32>
linalg.fill(%2, %cst_0) : memref<128x128xf32>, f32
linalg.fill(%1, %cst_0) : memref<128x128xf32>, f32
%4 = call @init_and_matmul(%2, %1, %0) : (memref<128x128xf32>, memref<128x128xf32>, memref<128x128xf32>) -> memref<128x128xf32>
dealloc %2 : memref<128x128xf32>
dealloc %1 : memref<128x128xf32>
dealloc %0 : memref<128x128xf32>
// print stuff from %4.
return
}
i.e. the program executes properly but %4 leaks.
More generally, I am quite unclear what the contract is across function boundaries: the return value of init_and_matmul could be either a new alloc that needs to be deallocated (this case) or an alias that must not be deallocated.
This makes me wonder whether -func-bufferize should even be allowed in the absence of -buffer-results-to-out-params. While we could add a ref-counted type in the future to better model many of the lifetime-related issues, it seems to me that the current status always creates leaks?
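To make the ambiguity concrete, here are two callees that are both plausible outputs of bufferization but require opposite caller behavior (the function names are made up for illustration):

```mlir
// Returns a fresh allocation: the caller must dealloc the result.
func @fresh(%arg0: memref<128x128xf32>) -> memref<128x128xf32> {
  %0 = alloc() : memref<128x128xf32>
  linalg.copy(%arg0, %0) : memref<128x128xf32>, memref<128x128xf32>
  return %0 : memref<128x128xf32>
}
// Returns an alias of the argument: the caller must NOT dealloc the result.
func @aliasing(%arg0: memref<128x128xf32>) -> memref<128x128xf32> {
  return %arg0 : memref<128x128xf32>
}
```

From a call site, both have the same type (memref<128x128xf32>) -> memref<128x128xf32>, so no local analysis in @main can decide whether deallocating the result is required or a double-free.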
Phase ordering
It is unclear to me why small “composable” chunks of bufferization are deemed beneficial. In my experience, there is a big difference between:
- small composable IR building blocks, patterns and transformations that compose nicely in any order,
- fine-grained passes that each perform part of the job on a subset of IR operations.
The latter creates phase-ordering issues and tight implicit coupling between passes that make the system very hard to use. For instance, copy removal depends on -buffer-deallocation
and cannot work with static allocations. In my bigger picture, phase ordering is the one big dragon to slay.
Am I missing something?
What I would like
So what I’d really like is a mix of:
- a guarantee that -buffer-results-to-out-params is always called (i.e. never returning memref, for correctness re. alloc/alias semantics),
- support for folding in + out buffers into in-out buffers,
- a function attribute that tells me a buffer is safe to write in-place from the perspective of the caller (i.e. when func bufferization occurred, there were no other reads / unknown uses, outside of the function, of the tensor that folded into an in-out buffer).
Internal function bufferization is responsible for doing its own in-place analysis and doing the right thing.
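One hypothetical way to carry the caller-side guarantee would be a function argument attribute set by whoever performs the boundary bufferization (the attribute name below is made up for illustration; nothing like it exists in core today):

```mlir
// Hypothetical attribute: the caller guarantees there are no other reads /
// unknown uses of the tensor that bufferized into %arg2, so the callee may
// write to it in-place without a protective copy.
func @init_and_matmul(%arg0: memref<128x128xf32>,
                      %arg1: memref<128x128xf32>,
                      %arg2: memref<128x128xf32> {mystuff.inplaceable = true},
                      %arg3: memref<128x128xf32>) {
  ...
}
```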
If I locally simulate that %arg2 is safe to write to and run -func-bufferize -buffer-results-to-out-params -mystuff, I get:
func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
%cst = constant 0.000000e+00 : f32
linalg.fill(%arg2, %cst) : memref<128x128xf32>, f32
linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%arg2 : memref<128x128xf32>)
linalg.copy(%arg2, %arg3) : memref<128x128xf32>, memref<128x128xf32>
return
}
This is as good as I can do for now: the local analysis determines that I can do whatever I want with %arg2, but I cannot get rid of linalg.copy(%arg2, %arg3) until I go modify -func-bufferize and/or -buffer-results-to-out-params.
If I simulate the attribute not being present I get:
func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
%cst = constant 0.000000e+00 : f32
%0 = alloc() : memref<128x128xf32>
linalg.copy(%arg2, %0) : memref<128x128xf32>, memref<128x128xf32>
linalg.fill(%0, %cst) : memref<128x128xf32>, f32
linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
linalg.copy(%0, %arg3) : memref<128x128xf32>, memref<128x128xf32>
return
}
This is also what I expected.
All this becomes even more interesting in the presence of transformations on linalg on tensors, but that is another level of discussion (i.e. it’s all internal function bufferization).
Getting everything the way I want would require changes to -func-bufferize and/or -buffer-results-to-out-params. Given that I am considering a much simpler version of the problem (i.e. bailing on branch ops), it does not seem reasonable to do any of this in core atm. OTOH, I am interested in high-performance codegen and I’d claim branches should be aggressively multi-versioned/hyperblock-scheduled away.
So I am wondering whether core bufferization prematurely attacked the more difficult problem, and now makes it harder to make progress on the simple problem?
The bigger question
Now, what I have in my prototype is definitely not rosy and uses matchers for specific patterns to avoid inserting copies in the first place. However, I would claim that the problem reduces to either:
- turning destructive updates into inplace buffer writes and detecting special cases while relying on SSA use-def chains.
- introducing copies everywhere, dropping SSA values and throwing use-def chains out of the window to instead rely on memory-based dependence analysis. As we consider cross-function-boundary issues, this also begs for aliasing information that we don’t have in MLIR.
None of these are particularly appealing to me, but I have found the first to be relatively surprise-free.
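As a concrete instance of option 1, the tensor-level IR from the setup already exhibits the pattern: the use-def chain alone shows when an in-place write is safe.

```mlir
// %0 has exactly one use, as the outs operand of the matmul: SSA use-def
// chains prove the matmul may update the filled tensor in place, with no
// memory-based dependence analysis required.
%0 = linalg.fill(%arg2, %cst) : tensor<128x128xf32>, f32 -> tensor<128x128xf32>
%1 = linalg.matmul ins(%arg0, %arg1 : tensor<128x128xf32>, tensor<128x128xf32>)
                   outs(%0 : tensor<128x128xf32>) -> tensor<128x128xf32>
```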
There is also the possibility of refcounting and adding alloc_if + copy_if, but this essentially gives up on static guarantees. I would claim that statically knowing whether some fine-grained function performs an alloc and copies is important enough to warrant aggressive multi-versioning/hyperblock-scheduling.
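For illustration, such a refcounted scheme could look like the sketch below (alloc_if and copy_if are hypothetical ops, not existing IR):

```mlir
// Hypothetical ops: allocate and copy only when %needs_own_buffer is true
// at runtime; otherwise the result aliases %src. Whether an alloc + copy
// happens is no longer a static property of the IR.
%buf = alloc_if %needs_own_buffer : memref<128x128xf32>
copy_if %needs_own_buffer, %src, %buf
    : memref<128x128xf32>, memref<128x128xf32>
```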
Do people have a significantly better alternative to the binary (ternary?) choice raised in this last paragraph?
Thanks for reading!