A series of changes (D156662, D156663, D158421, D158756, D158828, D158979, D159432) will improve the way buffer deallocation is performed. The main benefits are a more modular design (buffer deallocation is decoupled from One-Shot Bufferize and becomes a separate pass), fewer buffer copies, and support for IR that was previously rejected or that produced invalid IR.
## Migration Guide
- If you use `-one-shot-bufferize`: Run `-buffer-deallocation-pipeline` after `-one-shot-bufferize`. One-Shot Bufferize will no longer insert any buffer deallocations.
- If you own `BufferizableOpInterface` implementations: `bufferizesToAllocation` will be deleted and is no longer necessary, as One-Shot Bufferize no longer deals with deallocations.
- If you use `-buffer-deallocation`: This pass will be replaced with a new buffer deallocation pass. It is recommended to replace `-buffer-deallocation` with `-buffer-deallocation-pipeline`, which will perform additional canonicalizations and foldings before lowering deallocation-specific ops.
This should be everything that's needed, unless the `AllocationOpInterface` was used to build custom clone or deallocation operations. In that case, a custom lowering from the `bufferization.clone` and `bufferization.dealloc` operations has to be implemented as well.
## Background
There are currently two passes/implementations dealing with buffer deallocation:

- `-one-shot-bufferize`: Bufferizes tensor IR and inserts `memref.alloc` and `memref.dealloc`; everything happens in a single pass.
- `-buffer-deallocation`: Inserts `memref.dealloc` so that there are no memory leaks. Assumes that the input program does not have any `memref.dealloc` operations.
The current design has several limitations:

- `-one-shot-bufferize` is not composable with other passes. E.g., `-buffer-hoisting`/`-buffer-loop-hoisting` must run after bufferization but before any `memref.dealloc` ops are introduced.
- `-one-shot-bufferize` cannot deallocate new buffers that are yielded from blocks (e.g., yielded from a loop or passed to a new block as part of unstructured control flow); bufferization will fail or buffers will leak when `allow-return-allocs` is set.
- `-one-shot-bufferize` cannot deallocate new buffers originating from ops for which it is not known (without an expensive analysis) whether they bufferize to a new allocation (e.g., `tensor.collapse_shape`, which may or may not have to allocate based on the layout map of the bufferized source).
Buffer deallocation can be deactivated in `-one-shot-bufferize` with `create-deallocs=0` and delegated to the existing `-buffer-deallocation` pass. However, this pass also has a few downsides/limitations:
- It inserts additional allocations and buffer copies around branches and loops. E.g., when one `scf.if` branch allocates, so must the other (see the sketch after this list).
- It does not support unstructured control flow loops.
- It assumes that for each buffer, all writes dominate all reads. (This means it cannot be used safely with One-Shot Bufferize.)
- It has known bugs and allocations sometimes leak.
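To illustrate the first point, here is a hedged sketch (hypothetical IR, not actual pass output) of the extra copy: because the then-branch of the `scf.if` yields a fresh allocation, the other branch must yield a `bufferization.clone` of its buffer so that a single unconditional `memref.dealloc` is correct on both paths.

```mlir
// Sketch: both branches must yield a deallocatable buffer.
%0 = scf.if %cond -> (memref<2xf32>) {
  // This branch allocates a new buffer.
  %alloc = memref.alloc() : memref<2xf32>
  scf.yield %alloc : memref<2xf32>
} else {
  // The other branch must copy, so that deallocating %0 is always safe.
  %clone = bufferization.clone %arg1 : memref<2xf32> to memref<2xf32>
  scf.yield %clone : memref<2xf32>
}
// One unconditional deallocation covers both paths.
memref.dealloc %0 : memref<2xf32>
```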
## New Buffer Deallocation Pass
The new buffer deallocation pass is based on the concept of "ownership" and inspired by @jreiffers's buffer deallocation pass in MLIR-HLO. Memref ownership is similar to a C++ `unique_ptr` and may materialize in IR as `i1` SSA values ("ownership indicators"). You may see ownership indicators being added as operands/results/block arguments to ops that express control flow. Ownership indicators of buffers whose ownership is statically known at compile time can be optimized away by a separate buffer deallocation simplification pass and the canonicalizer pass.
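For intuition, a minimal sketch (value names hypothetical) of how an ownership indicator is consumed by a `bufferization.dealloc` op: the `i1` condition decides at runtime whether the buffer is actually freed.

```mlir
// %cond is the ownership indicator of %buf: the buffer is deallocated
// at runtime only if this block actually owns it.
bufferization.dealloc (%buf : memref<2xf32>) if (%cond)
```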
The new buffer deallocation pass pipeline is internally broken down into:

- `memref.realloc` expansion without deallocation,
- a new `-buffer-deallocation` pass that conservatively inserts `bufferization.dealloc` ops at the end of blocks, which lower to runtime checks and guarded `memref.dealloc` ops,
- a buffer deallocation simplification pass and the canonicalizer pass, which simplify/fold away `bufferization.dealloc` ops based on static information, so that fewer or no runtime checks are necessary,
- a pass to lower the `bufferization.dealloc` operations to (guarded) `memref.dealloc` operations (sketched below).
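A hedged sketch of that final lowering step (hypothetical value names; compare the complete example in the next section): a conditional `bufferization.dealloc` turns into a runtime-guarded `memref.dealloc` on the extracted base buffer.

```mlir
// Before lowering: a conditional deallocation at the bufferization level.
//   bufferization.dealloc (%buf : memref<2xf32>) if (%cond)
// After lowering: extract the base buffer and guard the actual dealloc.
%base_buffer, %offset, %sizes, %strides = memref.extract_strided_metadata %buf
    : memref<2xf32> -> memref<f32>, index, index, index
scf.if %cond {
  memref.dealloc %base_buffer : memref<f32>
}
```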
More details can be found in the documentation of the buffer deallocation infrastructure.
## Example
Consider a simple diamond-shaped CFG where the two predecessors of the exit block forward a function argument and a newly allocated buffer, respectively. The newly allocated buffer has to be deallocated in the exit block if the control flow path through `^bb2` was taken.
```mlir
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  test.buffer_based in(%arg1: memref<2xf32>) out(%arg2: memref<2xf32>)
  cf.br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  %0 = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1: memref<2xf32>) out(%0: memref<2xf32>)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>)
  return
}
```
The old `-buffer-deallocation` pass had to insert two `bufferization.clone` operations such that there could be one unified deallocation operation in the exit block. The canonicalizer was able to optimize away one of them when run afterwards:
```mlir
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:  // pred: ^bb0
  test.buffer_based in(%arg1 : memref<2xf32>) out(%arg2 : memref<2xf32>)
  %0 = bufferization.clone %arg1 : memref<2xf32> to memref<2xf32>
  cf.br ^bb3(%0 : memref<2xf32>)
^bb2:  // pred: ^bb0
  %alloc = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1 : memref<2xf32>) out(%alloc : memref<2xf32>)
  cf.br ^bb3(%alloc : memref<2xf32>)
^bb3(%1: memref<2xf32>):  // 2 preds: ^bb1, ^bb2
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>)
  memref.dealloc %1 : memref<2xf32>
  return
}
```
The new `-buffer-deallocation-pipeline` forwards a condition instead and performs a guarded deallocation:
```mlir
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  %false = arith.constant false
  %true = arith.constant true
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:  // pred: ^bb0
  test.buffer_based in(%arg1 : memref<2xf32>) out(%arg2 : memref<2xf32>)
  cf.br ^bb3(%arg1, %false : memref<2xf32>, i1)
^bb2:  // pred: ^bb0
  %alloc = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1 : memref<2xf32>) out(%alloc : memref<2xf32>)
  cf.br ^bb3(%alloc, %true : memref<2xf32>, i1)
^bb3(%0: memref<2xf32>, %1: i1):  // 2 preds: ^bb1, ^bb2
  test.copy(%0, %arg2) : (memref<2xf32>, memref<2xf32>)
  %base_buffer, %offset, %sizes, %strides = memref.extract_strided_metadata %0 : memref<2xf32> -> memref<f32>, index, index, index
  scf.if %1 {
    memref.dealloc %base_buffer : memref<f32>
  }
  return
}
```
Note that the `memref.extract_strided_metadata` op is unnecessary here and could be optimized away by a future simplification pattern. Also note that instead of a `bufferization.clone` operation, there is now only an `scf.if` guarding the deallocation.
## Known Limitations and Function Boundary ABI
- The input IR must not have any deallocations.
- Control flow ops must implement the respective interfaces (e.g., `RegionBranchOpInterface`, `BranchOpInterface`). Alternatively, ops can implement the `BufferDeallocationOpInterface` if custom deallocation logic is required.
- The IR has to adhere to the function boundary ABI (see documentation added in D158421), which is enforced by the deallocation pass (see the sketch below).
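As a rough illustration only (assuming the ownership convention described in the D158421 documentation, where returned memrefs are owned by the caller and a function never deallocates its own memref arguments; function names are hypothetical):

```mlir
// A callee that returns a fresh allocation transfers ownership of the
// buffer to the caller via the return.
func.func @returns_buffer() -> memref<2xf32> {
  %alloc = memref.alloc() : memref<2xf32>
  return %alloc : memref<2xf32>
}

// The caller owns the result, so the deallocation pass frees it on the
// caller's side. The caller likewise does not deallocate its own memref
// arguments, because it does not own them.
func.func @caller() {
  %0 = func.call @returns_buffer() : () -> memref<2xf32>
  memref.dealloc %0 : memref<2xf32>
  return
}
```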