A series of changes (D156662, D156663, D158421, D158756, D158828, D158979, D159432) will improve the way buffer deallocation is performed. The main benefits are a more modular design (decoupling buffer deallocation from One-Shot Bufferize and making it a separate pass), fewer buffer copies, and support for IR that was previously rejected or for which invalid IR was produced.
Migration Guide
If you use -one-shot-bufferize: Run -buffer-deallocation-pipeline after -one-shot-bufferize (see the example invocation below). One-Shot Bufferize will no longer insert any buffer deallocations.
If you own BufferizableOpInterface implementations: bufferizesToAllocation will be deleted and is no longer necessary, as One-Shot Bufferize no longer deals with deallocations.
If you use -buffer-deallocation: This pass will be replaced with a new buffer deallocation pass. It is recommended to replace -buffer-deallocation with -buffer-deallocation-pipeline, which will perform additional canonicalizations and foldings before lowering deallocation-specific ops.
This should be everything that's needed unless the AllocationOpInterface was used to build custom clone or deallocation operations. In that case, a custom lowering of the bufferization.clone and bufferization.dealloc operations has to be implemented as well.
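For example, a flow that previously relied on -one-shot-bufferize alone might now be invoked as follows (a minimal sketch; the input file name and any additional pass options are placeholders):

mlir-opt input.mlir -one-shot-bufferize -buffer-deallocation-pipeline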
Background
There are currently two passes/implementations dealing with buffer deallocation:
- -one-shot-bufferize: Bufferizes tensor IR and inserts memref.alloc and memref.dealloc, everything in a single pass.
- -buffer-deallocation: Inserts memref.dealloc so that there are no memory leaks. Assumes that the input program does not have any memref.dealloc operations.
The current design has several limitations:
- -one-shot-bufferize is not composable with other passes. E.g., -buffer-hoisting/-buffer-loop-hoisting must run after bufferization but before any memref.dealloc ops are introduced.
- -one-shot-bufferize cannot deallocate new buffers that are yielded from blocks (e.g., yielded from a loop or passed to a new block as part of unstructured control flow); bufferization will fail or buffers will leak when allow-return-allocs is set.
- -one-shot-bufferize cannot deallocate new buffers originating from ops for which it is not known (without an expensive analysis) whether they bufferize to a new allocation or not (e.g., tensor.collapse_shape, which may or may not have to allocate based on the layout map of the bufferized source).
Buffer deallocation can be deactivated in -one-shot-bufferize with create-deallocs=0 and delegated to the existing -buffer-deallocation pass. However, this pass also has a few downsides/limitations:
- It inserts additional allocations and buffer copies around branches and loops. (E.g., when one scf.if branch allocates, so must the other; see the sketch after this list.)
- It does not support unstructured control flow loops.
- It assumes that for each buffer, all writes dominate all reads. (This means it cannot be used safely with One-Shot Bufferize.)
- It has known bugs and allocations sometimes leak.
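To sketch the first limitation, consider an scf.if where only one branch allocates. The following is hand-written to illustrate the old pass's strategy, not actual pass output; %arg0 and %arg1 are placeholders:

%0 = scf.if %arg0 -> (memref<2xf32>) {
  %alloc = memref.alloc() : memref<2xf32>
  scf.yield %alloc : memref<2xf32>
} else {
  // The old pass had to insert an extra allocation and copy here, so
  // that %0 always refers to a buffer that can be deallocated
  // unconditionally after its last use.
  %clone = bufferization.clone %arg1 : memref<2xf32> to memref<2xf32>
  scf.yield %clone : memref<2xf32>
}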
New Buffer Deallocation Pass
The new buffer deallocation pass is based on the concept of "ownership" and inspired by @jreiffers's buffer deallocation pass in MLIR-HLO. Memref ownership is similar to a C++ unique_ptr and may materialize in IR as i1 SSA values ("ownership indicators"). You may see ownership indicators being added as operands/results/block arguments to ops that express control flow. Ownership indicators of buffers whose ownership is known statically at compile time can be optimized away by a separate buffer deallocation simplification pass and by the canonicalizer pass.
The new buffer deallocation pass pipeline is internally broken down into:
- memref.realloc expansion without deallocation
- a new -buffer-deallocation pass that conservatively inserts bufferization.dealloc ops at the end of blocks, which lower to runtime checks and guarded memref.dealloc ops (see the sketch after this list)
- a buffer deallocation simplification pass and the canonicalizer pass, which simplify/fold away bufferization.dealloc ops based on static information, so that fewer or no runtime checks are necessary
- a pass to lower the remaining bufferization.dealloc operations to (guarded) memref.dealloc operations
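The bufferization.dealloc op mentioned above roughly takes the following form (a hand-written sketch based on the op documentation; all SSA names are illustrative):

// Deallocate %alloc iff %owned is true and %alloc does not alias the
// retained memref %ret; the returned i1 is the updated ownership of %ret.
%new_ownership = bufferization.dealloc (%alloc : memref<2xf32>) if (%owned) retain (%ret : memref<2xf32>)

If the simplification and canonicalization passes can prove the condition and aliasing facts statically, such ops fold away or lower to plain memref.dealloc operations without any runtime check.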
More details can be found in the documentation of the buffer deallocation infrastructure.
Example
Consider a simple diamond-shaped CFG where the two predecessors of the exit block forward a function argument and a newly allocated buffer, respectively. The newly allocated buffer has to be deallocated in the exit block if the control flow path through ^bb2 was taken.
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  test.buffer_based in(%arg1: memref<2xf32>) out(%arg2: memref<2xf32>)
  cf.br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  %0 = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1: memref<2xf32>) out(%0: memref<2xf32>)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>)
  return
}
The old -buffer-deallocation pass had to insert two bufferization.clone operations such that there could be one unified deallocation operation in the exit block. The canonicalizer was able to optimize away one of them when run afterwards:
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:  // pred: ^bb0
  test.buffer_based in(%arg1 : memref<2xf32>) out(%arg2 : memref<2xf32>)
  %0 = bufferization.clone %arg1 : memref<2xf32> to memref<2xf32>
  cf.br ^bb3(%0 : memref<2xf32>)
^bb2:  // pred: ^bb0
  %alloc = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1 : memref<2xf32>) out(%alloc : memref<2xf32>)
  cf.br ^bb3(%alloc : memref<2xf32>)
^bb3(%1: memref<2xf32>):  // 2 preds: ^bb1, ^bb2
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>)
  memref.dealloc %1 : memref<2xf32>
  return
}
The new -buffer-deallocation-pipeline forwards a condition instead and performs a guarded deallocation:
func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  %false = arith.constant false
  %true = arith.constant true
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:  // pred: ^bb0
  test.buffer_based in(%arg1 : memref<2xf32>) out(%arg2 : memref<2xf32>)
  cf.br ^bb3(%arg1, %false : memref<2xf32>, i1)
^bb2:  // pred: ^bb0
  %alloc = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1 : memref<2xf32>) out(%alloc : memref<2xf32>)
  cf.br ^bb3(%alloc, %true : memref<2xf32>, i1)
^bb3(%0: memref<2xf32>, %1: i1):  // 2 preds: ^bb1, ^bb2
  test.copy(%0, %arg2) : (memref<2xf32>, memref<2xf32>)
  %base_buffer, %offset, %sizes, %strides = memref.extract_strided_metadata %0 : memref<2xf32> -> memref<f32>, index, index, index
  scf.if %1 {
    memref.dealloc %base_buffer : memref<f32>
  }
  return
}
Note that the memref.extract_strided_metadata operation is unnecessary here and could be optimized away by a future simplification pattern. Instead of a bufferization.clone operation, there is now only an scf.if guarding the deallocation.
Known Limitations and Function Boundary ABI
- The input IR must not have any deallocations.
- Control flow ops must implement the respective interfaces (e.g., RegionBranchOpInterface, BranchOpInterface). Alternatively, ops can implement the BufferDeallocationOpInterface if custom deallocation logic is required.
- The IR has to adhere to the function boundary ABI (see the documentation added in D158421), which is enforced by the deallocation pass. A sketch of this ABI follows below.
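The following hand-written sketch illustrates the function boundary ABI as we understand it (the documentation added in D158421 is authoritative): ownership is never acquired through a function argument, but it is always transferred to the caller through a return value.

func.func @callee(%arg0: memref<2xf32>) -> memref<2xf32> {
  // %arg0 is owned by the caller; @callee must not deallocate it.
  %alloc = memref.alloc() : memref<2xf32>
  // Ownership of %alloc is transferred to the caller by returning it.
  return %alloc : memref<2xf32>
}

func.func @caller(%arg0: memref<2xf32>) {
  %0 = func.call @callee(%arg0) : (memref<2xf32>) -> memref<2xf32>
  // @caller now owns %0, so the deallocation pass inserts a deallocation
  // for it at the end of this block. %arg0 is not deallocated here; it
  // remains owned by @caller's caller.
  return
}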