The One-Shot Bufferize pass is quite sensitive to op ordering. This IR will bufferize such that there is only one allocation and no copy:
```mlir
func.func @main(%t: tensor<2xf32>) -> (tensor<4xf32>) {
  %c = tensor.empty() : tensor<4xf32>
  %ex1 = tensor.extract_slice %c[0] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %1 = linalg.exp ins(%t : tensor<2xf32>) outs(%ex1 : tensor<2xf32>) -> tensor<2xf32>
  %in1 = tensor.insert_slice %1 into %c[0] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %ex2 = tensor.extract_slice %in1[2] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %2 = linalg.negf ins(%t : tensor<2xf32>) outs(%ex2 : tensor<2xf32>) -> tensor<2xf32>
  %in2 = tensor.insert_slice %2 into %in1[2] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %3 = linalg.abs ins(%in2 : tensor<4xf32>) outs(%in2 : tensor<4xf32>) -> tensor<4xf32>
  return %3 : tensor<4xf32>
}
```
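For reference, the in-place bufferized form of the IR above looks roughly like this (a hand-written sketch, not actual pass output; the strided memref result types of the subviews are simplified):

```mlir
func.func @main(%t: memref<2xf32>) -> memref<4xf32> {
  // Single allocation for the result; every op writes into it in place.
  %alloc = memref.alloc() : memref<4xf32>
  %sv1 = memref.subview %alloc[0] [2] [1]
      : memref<4xf32> to memref<2xf32, strided<[1]>>
  linalg.exp ins(%t : memref<2xf32>) outs(%sv1 : memref<2xf32, strided<[1]>>)
  %sv2 = memref.subview %alloc[2] [2] [1]
      : memref<4xf32> to memref<2xf32, strided<[1], offset: 2>>
  linalg.negf ins(%t : memref<2xf32>) outs(%sv2 : memref<2xf32, strided<[1], offset: 2>>)
  linalg.abs ins(%alloc : memref<4xf32>) outs(%alloc : memref<4xf32>)
  return %alloc : memref<4xf32>
}
```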
The following IR, while computing the same thing, results in an extra buffer copy:
```mlir
func.func @main(%t: tensor<2xf32>) -> (tensor<4xf32>) {
  %c = tensor.empty() : tensor<4xf32>
  %ex1 = tensor.extract_slice %c[0] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %1 = linalg.exp ins(%t : tensor<2xf32>) outs(%ex1 : tensor<2xf32>) -> tensor<2xf32>
  %ex2 = tensor.extract_slice %c[2] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %2 = linalg.negf ins(%t : tensor<2xf32>) outs(%ex2 : tensor<2xf32>) -> tensor<2xf32>
  %in1 = tensor.insert_slice %1 into %c[0] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %in2 = tensor.insert_slice %2 into %in1[2] [2] [1] : tensor<2xf32> into tensor<4xf32>
  %3 = linalg.abs ins(%in2 : tensor<4xf32>) outs(%in2 : tensor<4xf32>) -> tensor<4xf32>
  return %3 : tensor<4xf32>
}
```
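The reason for the copy, roughly: in this version `%ex2` reads from `%c`, and `%in1` later writes into the buffer of `%c` while that read is still live. Without subset reasoning, the analysis only sees "read of `%c`, then write to `%c`" and conservatively bufferizes the write out of place, even though the two slices (`[0, 2)` and `[2, 4)`) are disjoint. The extra buffer shows up as something like this in the bufferized output (a sketch, not actual pass output):

```mlir
%alloc = memref.alloc() : memref<4xf32>   // buffer of %c
// ...
// %in1 cannot write into %alloc in place while %ex2's read is live,
// so One-Shot Bufferize materializes it in a fresh buffer:
%alloc_0 = memref.alloc() : memref<4xf32>
memref.copy %alloc, %alloc_0 : memref<4xf32> to memref<4xf32>
```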
It should be possible to fix that by improving the analysis (`OneShotAnalysis.cpp`). The main problem is that the bufferization analysis does not really reason about tensor subsets yet. (Some very simple cases are handled.) I.e., we know that a buffer is written to, but we do not take into account the fact that only a small part of it is written to.
Any improvement to this would be great. But it’s a lot of work and the devil is in the detail. Most of what we have today was driven by the use cases that I had.
Since One-Shot Bufferize is a pass that can work at the module level, wouldn't it make sense to extend the pass to also support optimizations for non-DPS dialects?
Non-DPS ops are difficult to support. The reason why we have DPS (destination-passing style) is so that the user can inject additional hints about the buffer in which a result should materialize. Without DPS, the problem is kind of similar to register allocation, which is NP-complete. If you have an idea how to bufferize efficiently without DPS, please let me know!
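To make the DPS point concrete: in DPS, an op's tensor result is tied to one of its `outs` operands, which gives the bufferization a candidate buffer for the result. A minimal contrast (`some_dialect.exp` is a hypothetical non-DPS op, used only for illustration):

```mlir
// DPS: the result %r is tied to the outs operand %dest, so bufferization
// can try to materialize %r directly in the buffer of %dest.
%r = linalg.exp ins(%t : tensor<2xf32>) outs(%dest : tensor<2xf32>) -> tensor<2xf32>

// Non-DPS (hypothetical op): no hint about where %s should live; the
// bufferization has to choose, and typically allocate, a buffer on its own.
%s = "some_dialect.exp"(%t) : (tensor<2xf32>) -> tensor<2xf32>
```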