The One-Shot Bufferize pass is quite sensitive to op ordering. This IR will bufferize such that there is only one allocation and no copy:
```mlir
func.func @main(%t: tensor<2xf32>) -> (tensor<4xf32>) {
  %c = tensor.empty() : tensor<4xf32>
  %ex1 = tensor.extract_slice %c[0] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %1 = linalg.exp ins(%t : tensor<2xf32>) outs(%ex1 : tensor<2xf32>) -> tensor<2xf32>
  %in1 = tensor.insert_slice %1 into %c[0] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %ex2 = tensor.extract_slice %in1[2] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %2 = linalg.negf ins(%t : tensor<2xf32>) outs(%ex2 : tensor<2xf32>) -> tensor<2xf32>
  %in2 = tensor.insert_slice %2 into %in1[2] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %3 = linalg.abs ins(%in2 : tensor<4xf32>) outs(%in2 : tensor<4xf32>) -> tensor<4xf32>
  return %3 : tensor<4xf32>
}
```
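For reference, the in-place bufferized form of the IR above looks roughly like this (a hand-written sketch, not actual pass output; the strided memref result types of the subviews are simplified):

```mlir
func.func @main(%t: memref<2xf32>) -> memref<4xf32> {
  // Single allocation for the result; every op writes into it in place.
  %alloc = memref.alloc() : memref<4xf32>
  %sv1 = memref.subview %alloc[0] [2] [1]
      : memref<4xf32> to memref<2xf32, strided<[1]>>
  linalg.exp ins(%t : memref<2xf32>) outs(%sv1 : memref<2xf32, strided<[1]>>)
  %sv2 = memref.subview %alloc[2] [2] [1]
      : memref<4xf32> to memref<2xf32, strided<[1], offset: 2>>
  linalg.negf ins(%t : memref<2xf32>) outs(%sv2 : memref<2xf32, strided<[1], offset: 2>>)
  linalg.abs ins(%alloc : memref<4xf32>) outs(%alloc : memref<4xf32>)
  return %alloc : memref<4xf32>
}
```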
The following IR, while computing the same thing, results in an extra buffer copy:
```mlir
func.func @main(%t: tensor<2xf32>) -> (tensor<4xf32>) {
  %c = tensor.empty() : tensor<4xf32>
  %ex1 = tensor.extract_slice %c[0] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %1 = linalg.exp ins(%t : tensor<2xf32>) outs(%ex1 : tensor<2xf32>) -> tensor<2xf32>
  %ex2 = tensor.extract_slice %c[2] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %2 = linalg.negf ins(%t : tensor<2xf32>) outs(%ex2 : tensor<2xf32>) -> tensor<2xf32>
  %in1 = tensor.insert_slice %1 into %c[0] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %in2 = tensor.insert_slice %2 into %in1[2] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %3 = linalg.abs ins(%in2 : tensor<4xf32>) outs(%in2 : tensor<4xf32>) -> tensor<4xf32>
  return %3 : tensor<4xf32>
}
```
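The reason for the copy, roughly: in this version `%ex2` reads from `%c`, and `%in1` later writes into the buffer of `%c` while that read is still live. Without subset reasoning, the analysis only sees "read of `%c`, then write to `%c`" and conservatively bufferizes the write out of place, even though the two slices (`[0, 2)` and `[2, 4)`) are disjoint. The extra buffer shows up as something like this in the bufferized output (a sketch, not actual pass output):

```mlir
%alloc = memref.alloc() : memref<4xf32>   // buffer of %c
// ...
// %in1 cannot write into %alloc in place while %ex2's read is live,
// so One-Shot Bufferize materializes it in a fresh buffer:
%alloc_0 = memref.alloc() : memref<4xf32>
memref.copy %alloc, %alloc_0 : memref<4xf32> to memref<4xf32>
```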
It should be possible to fix that by improving the analysis (`OneShotAnalysis.cpp`). The main problem is that the bufferization analysis does not really reason about tensor subsets yet. (Some very simple cases are handled.) I.e., we know that a buffer is written to, but we do not take into account the fact that only a small part of it is written to.
Any improvement to this would be great. But it’s a lot of work and the devil is in the detail. Most of what we have today was driven by the use cases that I had.
Since One-Shot Bufferize is a pass that can work at the module level, wouldn't it make sense to extend the pass to also support optimizations for non-DPS dialects?
Non-DPS ops are difficult to support. The reason why we have DPS (destination-passing style) is so that the user can inject additional hints about the buffer in which a result should materialize. Without DPS, the problem is kind of similar to register allocation, which is NP-complete. If you have an idea how to bufferize efficiently without DPS, please let me know!
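To make the DPS point concrete: in DPS, an op's tensor result is tied to one of its `outs` operands, which gives the bufferization a candidate buffer for the result. A minimal contrast (`some_dialect.exp` is a hypothetical non-DPS op, used only for illustration):

```mlir
// DPS: the result %r is tied to the outs operand %dest, so bufferization
// can try to materialize %r directly in the buffer of %dest.
%r = linalg.exp ins(%t : tensor<2xf32>) outs(%dest : tensor<2xf32>) -> tensor<2xf32>

// Non-DPS (hypothetical op): no hint about where %s should live; the
// bufferization has to choose, and typically allocate, a buffer on its own.
%s = "some_dialect.exp"(%t) : (tensor<2xf32>) -> tensor<2xf32>
```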