[PSA] Bufferization: New Buffer Deallocation Pipeline

A series of changes (D156662, D156663, D158421, D158756, D158828, D158979, D159432) will improve the way buffer deallocation is performed. The main benefits are a more modular design (decoupling buffer deallocation from One-Shot Bufferize and making it a separate pass), fewer buffer copies, and support for IR that was previously rejected or for which invalid IR was produced.

Migration Guide

If you use -one-shot-bufferize: Run -buffer-deallocation-pipeline after -one-shot-bufferize. One-Shot Bufferize will no longer insert any buffer deallocations.
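For example, with an mlir-opt-based flow, the two steps can simply be chained (the input file name is illustrative):

mlir-opt input.mlir -one-shot-bufferize -buffer-deallocation-pipeline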

If you own BufferizableOpInterface implementations: bufferizesToAllocation will be deleted and is no longer necessary, as One-Shot Bufferize no longer deals with deallocations.

If you use -buffer-deallocation: This pass will be replaced with a new buffer deallocation pass. It is recommended to replace -buffer-deallocation with -buffer-deallocation-pipeline, which will perform additional canonicalizations and foldings before lowering deallocation-specific ops.

This should be everything that’s needed, unless the AllocationOpInterface was used to build custom clone or deallocation operations. In that case, a custom lowering for the bufferization.clone and bufferization.dealloc operations has to be provided as well.

Background

There are currently two passes/implementations dealing with buffer deallocation:

  • -one-shot-bufferize: Bufferizes tensor IR, inserts memref.alloc and memref.dealloc, everything in a single pass.
  • -buffer-deallocation: Inserts memref.dealloc, so that there are no memory leaks. Assumes that the input program does not have any memref.dealloc operations.

The current design has several limitations:

  • -one-shot-bufferize is not composable with other passes. E.g., -buffer-hoisting/-buffer-loop-hoisting must run after bufferization but before any memref.dealloc ops are introduced.
  • -one-shot-bufferize cannot deallocate new buffers that are yielded from blocks (e.g., yielded from a loop or passed to a new block as part of unstructured control flow); bufferization will fail or buffers will leak when allow-return-allocs is set.
  • -one-shot-bufferize cannot deallocate new buffers originating from ops for which it is not known (without an expensive analysis) whether they bufferize to a new allocation or not (e.g., tensor.collapse_shape, which may or may not have to allocate based on the layout map of the bufferized source).

Buffer deallocation can be deactivated in -one-shot-bufferize with create-deallocs=0 and delegated to the existing -buffer-deallocation pass. However, that pass has downsides and limitations of its own, which the new design described below addresses.

New Buffer Deallocation Pass

The new buffer deallocation pass is based on the concept of “ownership” and inspired by @jreiffers’s buffer deallocation pass in MLIR-HLO. Memref ownership is similar to a C++ unique_ptr and may materialize in IR as i1 SSA values (“ownership indicators”). You may see ownership indicators being added as operands/results/block arguments to ops that express control flow. Ownership indicators of buffers whose ownership is known statically at compile time can be optimized away by a separate buffer deallocation simplification pass together with the canonicalizer pass.

The new buffer deallocation pass pipeline is internally broken down into:

  1. memref.realloc expansion without deallocation
  2. a new -buffer-deallocation pass that conservatively inserts bufferization.dealloc ops at the end of blocks, which lower to runtime checks and guarded memref.dealloc ops
  3. a buffer deallocation simplification pass and the canonicalizer pass, which simplify/fold away bufferization.dealloc ops based on static information, so that fewer or no runtime checks are necessary.
  4. a pass to lower the bufferization.dealloc operations to (guarded) memref.dealloc operations.

More details can be found in the documentation of the buffer deallocation infrastructure.
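For illustration, the pipeline roughly corresponds to running the individual passes below; treat the exact pass flags and their options as approximate, since they may change over time:

mlir-opt input.mlir \
  -expand-realloc \
  -ownership-based-buffer-deallocation \
  -buffer-deallocation-simplification -canonicalize \
  -lower-deallocations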

Example

Consider a simple diamond-shaped CFG where the two predecessors of the exit block forward a function argument and a newly allocated buffer, respectively. The newly allocated buffer has to be deallocated in the exit block if the control-flow path through ^bb2 was taken.

func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cf.cond_br %arg0, ^bb1, ^bb2
^bb1:
  test.buffer_based in(%arg1: memref<2xf32>) out(%arg2: memref<2xf32>)
  cf.br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  %0 = memref.alloc() : memref<2xf32>
  test.buffer_based in(%arg1: memref<2xf32>) out(%0: memref<2xf32>)
  cf.br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>)
  return
}

The old -buffer-deallocation pass had to insert two bufferization.clone operations such that there could be one unified deallocation operation in the exit block. The canonicalizer was able to optimize away one of them when run afterwards:

  func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
    cf.cond_br %arg0, ^bb1, ^bb2
  ^bb1:  // pred: ^bb0
    test.buffer_based in(%arg1 : memref<2xf32>) out(%arg2 : memref<2xf32>)
    %0 = bufferization.clone %arg1 : memref<2xf32> to memref<2xf32>
    cf.br ^bb3(%0 : memref<2xf32>)
  ^bb2:  // pred: ^bb0
    %alloc = memref.alloc() : memref<2xf32>
    test.buffer_based in(%arg1 : memref<2xf32>) out(%alloc : memref<2xf32>)
    cf.br ^bb3(%alloc : memref<2xf32>)
  ^bb3(%1: memref<2xf32>):  // 2 preds: ^bb1, ^bb2
    test.copy(%1, %arg2) : (memref<2xf32>, memref<2xf32>)
    memref.dealloc %1 : memref<2xf32>
    return
  }

The new -buffer-deallocation-pipeline forwards a condition instead and performs a guarded deallocation:

  func.func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
    %false = arith.constant false
    %true = arith.constant true
    cf.cond_br %arg0, ^bb1, ^bb2
  ^bb1:  // pred: ^bb0
    test.buffer_based in(%arg1 : memref<2xf32>) out(%arg2 : memref<2xf32>)
    cf.br ^bb3(%arg1, %false : memref<2xf32>, i1)
  ^bb2:  // pred: ^bb0
    %alloc = memref.alloc() : memref<2xf32>
    test.buffer_based in(%arg1 : memref<2xf32>) out(%alloc : memref<2xf32>)
    cf.br ^bb3(%alloc, %true : memref<2xf32>, i1)
  ^bb3(%0: memref<2xf32>, %1: i1):  // 2 preds: ^bb1, ^bb2
    test.copy(%0, %arg2) : (memref<2xf32>, memref<2xf32>)
    %base_buffer, %offset, %sizes, %strides = memref.extract_strided_metadata %0 : memref<2xf32> -> memref<f32>, index, index, index
    scf.if %1 {
      memref.dealloc %base_buffer : memref<f32>
    }
    return
  }

Note that the memref.extract_strided_metadata op is unnecessary here and could be optimized away by a future simplification pattern. Also note that instead of a bufferization.clone operation, there is now only an scf.if guarding the deallocation.
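For illustration, a hypothetical sketch of what ^bb3 could look like once such a simplification pattern exists (deallocating %0 directly is possible here because it has an identity layout):

  ^bb3(%0: memref<2xf32>, %1: i1):  // 2 preds: ^bb1, ^bb2
    test.copy(%0, %arg2) : (memref<2xf32>, memref<2xf32>)
    scf.if %1 {
      memref.dealloc %0 : memref<2xf32>
    }
    return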

Known Limitations and Function Boundary ABI

  • The input IR must not have any deallocations.
  • Control flow ops must implement the respective interfaces (e.g., RegionBranchOpInterface, BranchOpInterface). Alternatively, ops can implement the BufferDeallocationOpInterface if custom deallocation logic is required.
  • The IR has to adhere to the function boundary ABI (see documentation added in D158421), which is enforced by the deallocation pass.
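As a rough, non-normative illustration of what the ABI implies (the function name is made up): the caller owns every returned memref, so a function that returns a buffer it does not own ends up, after running the deallocation pass, returning a fresh, caller-owned allocation instead, e.g.:

func.func @return_arg(%arg0: memref<4xf32>) -> memref<4xf32> {
  // %arg0 is not owned by this function, so a caller-owned copy is returned.
  %alloc = memref.alloc() : memref<4xf32>
  memref.copy %arg0, %alloc : memref<4xf32> to memref<4xf32>
  return %alloc : memref<4xf32>
}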

To give downstream users more time for migration, we added the new deallocation pass alongside the old (now deprecated) deallocation pass.

  • The migration guide still applies, with the addition that it is not only recommended but required to replace -buffer-deallocation with -buffer-deallocation-pipeline.
  • The new pass is now called OwnershipBasedBufferDeallocation (flag: -ownership-based-buffer-deallocation).
  • The old pass (-buffer-deallocation) should be considered deprecated and will be removed in the future once downstream users have had enough time to migrate.

The new series of patches: #66337, #66349, #66350, #66351, #66352, #66517, #66619

Quick question: why is dealloc_helper not defined as private?

privateFunctionDynamicOwnership should probably ignore functions that are declared but not defined (currently it generates broken IR).
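For reference, the problematic case is a plain declaration with no body to rewrite, e.g. (the name is made up):

func.func private @external_fn(memref<2xf32>) -> memref<2xf32>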

func.func @test2() -> (memref<1xf64>, memref<1xf64>) {
  %1 = memref.alloc() : memref<1xf64>
  return %1, %1 : memref<1xf64>, memref<1xf64>
}

If I understand the function deallocation ABI correctly, one of the returned values must be cloned after the deallocation pipeline, but currently it is not.

Hi @Hardcode84

Thanks for taking a look at the new deallocation pipeline and the great feedback!

Quick question: why is dealloc_helper not defined as private?

You’re absolutely right, it should be declared private.

privateFunctionDynamicOwnership should probably ignore functions that are declared but not defined (currently it generates broken IR).

Yes, this is an oversight on my side. I’ll submit a PR fixing this and the above soon!

Regarding your code example w.r.t. the function boundary ABI:

Honestly, the ABI is a bit under-specified in that case, mostly because we focused on deallocation within functions (which was also the focus of the old pass) and didn’t properly think about such a case. I agree with you that a clone should be inserted, because then the deallocation pass knows for sure that no aliasing can occur among the results of a call operation and can thus eliminate the runtime checks.

Let’s assume that your function @test2 is external (i.e., we cannot analyze its body). If we don’t have the guarantee that no results alias, we end up with (about) the following IR:

func.func @top() {
  %true = arith.constant true
  %0:2 = call @test2() : () -> (memref<1xf64>, memref<1xf64>)
  bufferization.dealloc (%0#0, %0#1 : memref<1xf64>, memref<1xf64>) if (%true, %true)
  return
}

This can be lowered to something like the following (I omitted the extract_strided_metadata ops that obtain the base memrefs):

func.func @top() {
  %true = arith.constant true
  %0:2 = call @test2() : () -> (memref<1xf64>, memref<1xf64>)
  memref.dealloc %0#0 : memref<1xf64>
  %intptr = memref.extract_aligned_pointer_as_index %0#0 : memref<1xf64> -> index
  %intptr_0 = memref.extract_aligned_pointer_as_index %0#1 : memref<1xf64> -> index
  %1 = arith.cmpi ne, %intptr, %intptr_0 : index
  scf.if %1 {
    memref.dealloc %0#1 : memref<1xf64>
  }
  return
}

If we know statically that the results don’t alias, we can split the bufferization.dealloc as follows, which lowers directly to memref.dealloc ops without any condition checks:

bufferization.dealloc (%0#0 : memref<1xf64>) if (%true)
bufferization.dealloc (%0#1 : memref<1xf64>) if (%true)

This additional clone in @test2 would probably have to be inserted by the frontend that generates this IR and has enough high-level information to know that the caller does not rely on those two results being aliases. In some sense, this is an additional limitation of the buffer deallocation pass, while allowing aliasing would make the pass work for more cases at the cost of performance. But I think it’s still better than the original deallocation pass, which would not deallocate the results of the func.call at all (i.e., a memory leak).
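For concreteness, a hypothetical version of @test2 with such a frontend-inserted clone could look like this (whether to use bufferization.clone or an explicit alloc+copy is discussed further down):

func.func @test2() -> (memref<1xf64>, memref<1xf64>) {
  %1 = memref.alloc() : memref<1xf64>
  // Frontend-inserted clone so that the two results never alias.
  %2 = bufferization.clone %1 : memref<1xf64> to memref<1xf64>
  return %1, %2 : memref<1xf64>, memref<1xf64>
}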

Regarding your code example w.r.t. the function boundary ABI:

In our case, the external ABI expects all function results to have unique ownership, but after some thought, I think I can work around it on our side by checking extract_aligned_pointer_as_index and conditionally cloning, like you are doing in the bufferization.dealloc lowering.

But I have another related question then, from CloneOp doc:

Valid implementations of this operation may alias the input and output views

But I assume your extract_aligned_pointer_as_index-based deallocation logic will break in this case. How can the deallocation pipeline be modified to handle this (we can also assume clone is cheap in this case, i.e., an incref)?

For context, I’m testing the new deallocation pipeline in the numba-mlir project (GitHub: numba/numba-mlir, a POC MLIR backend); we currently use the old deallocation pass with some hacks on top of it. Besides those three issues (the first two are workaroundable and the third caused one segfault per 15k tests), things look good.

Thanks for pointing that out. I didn’t properly read this documentation but just looked at how the clone operation is lowered in BufferizationToMemref. It was always intended to just be memref.alloc+memref.copy and bufferization.clone was (incorrectly) used as a shorthand for that. I’ll change the deallocation pipeline to always insert those two ops instead to fix this.
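For illustration, a minimal sketch of the intended replacement (the wrapper function is made up; the point is the memref.alloc + memref.copy pair in place of bufferization.clone):

func.func @copy_instead_of_clone(%arg1: memref<2xf32>) -> memref<2xf32> {
  // What bufferization.clone was (incorrectly) used as a shorthand for:
  %0 = memref.alloc() : memref<2xf32>
  memref.copy %arg1, %0 : memref<2xf32> to memref<2xf32>
  return %0 : memref<2xf32>
}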

Does “incref” refer to increasing the counter in a reference counting system?
If so, couldn’t you just query the reference count and use that as the ownership boolean, plus override the way aliasing is checked in LowerDeallocations, either by writing your own lowering or by letting the pass take options to customize the IR inserted for those checks?

Can you make it a pass flag instead (with the safe option as the default)? Having explicit clones can still be potentially useful.

I have a feeling that using the reference counter directly can cause unexpected results, but I don’t have a concrete example. I think I can instead replace extract_aligned_pointer_as_index with something like mydialect.extract_memref_token and guarantee that it is always unique and changes on clone ops.

Also, things will probably break if the user already has a dealloc_helper function defined in the IR when running the deallocation pipeline. We probably need a flag to override this name.

Also, the alias check will break if any memref view/subview op changes the aligned pointer. They don’t do that currently, but I don’t think it’s guaranteed to stay that way forever. So we probably also need a dedicated memref.get_unique_token op upstream (which would return the allocated pointer by default).

And another question: how is memref.get_global handled in this new pipeline?

Sure, we can make it configurable what kind of operation(s) should be inserted in those situations.

I’m wondering whether we should just create a new dealloc_helper function with a unique name every time the pass is called (and needs the helper) and then rely on some function deduplication pass to collapse them into one, which is possible because the function will be private. The user of the pass might not know which names are still available and would then just end up implementing such name-uniquing functionality in the pass option anyway.

Yes, we need some extract_allocated_pointer_as_index operation or similar to implement this properly at some point; I took a bit of a shortcut here, to be honest. There is also a related issue on GitHub (#64334) about how memref metadata is stored.

The result of the memref.get_global op will be assigned ownership ‘false’ and thus deallocation will never be called on it. If it is returned from a function, a copy will be created.
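For example, a sketch of what the IR could look like after the deallocation pass in that situation (the global and function names are made up):

memref.global "private" constant @gv : memref<4xf32> = dense<0.0>

func.func @return_global() -> memref<4xf32> {
  // The result of memref.get_global has ownership 'false' and is never deallocated.
  %0 = memref.get_global @gv : memref<4xf32>
  // Because it is returned, a caller-owned copy is materialized instead.
  %alloc = memref.alloc() : memref<4xf32>
  memref.copy %0, %alloc : memref<4xf32> to memref<4xf32>
  return %alloc : memref<4xf32>
}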


OK, now I’m getting:

Terminators must implement either BranchOpInterface or RegionBranchTerminatorOpInterface (but not both)!
see current operation: "scf.reduce.return"(%7) : (i64) -> ()

The code doesn’t even have any allocations:

// -----// IR Dump After OwnershipBasedBufferDeallocation Failed (ownership-based-buffer-deallocation) ('func.func' operation: @_ZN10numba_mlir4mlir5tests10test_basic12test_prange112_3clocals_3e11py_func_241B106c8tJTC_2fWQAlzW1yBDkop6GEOEUMEOYSPGuIQMViAQ3iQ8IbKQIMbwoOGNoQDDWwQR1NHAS3lQ9XgSucwm4pgLNTQs80DZTPd3JzMIk0AEx) //----- //
module attributes {numba.pipeline_jump_markers = []} {
  func.func @_ZN10numba_mlir4mlir5tests10test_basic12test_prange112_3clocals_3e11py_func_241B106c8tJTC_2fWQAlzW1yBDkop6GEOEUMEOYSPGuIQMViAQ3iQ8IbKQIMbwoOGNoQDDWwQR1NHAS3lQ9XgSucwm4pgLNTQs80DZTPd3JzMIk0AEx(%arg0: i64 {numba.restrict}) -> i64 attributes {gpu_runtime.fp64_truncate = false, gpu_runtime.use_64bit_index = true, numba.max_concurrency = 16 : i64, numba.opt_level = 3 : i64} {
    %c0_i64 = arith.constant 0 : i64
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %0 = numba_util.env_region #numba_util.parallel -> i64 {
      %1 = arith.index_cast %arg0 : i64 to index
      %2 = scf.parallel (%arg1) = (%c0) to (%1) step (%c1) init (%c0_i64) -> i64 {
        %3 = arith.index_cast %arg1 : index to i64
        scf.reduce(%3)  : i64 {
        ^bb0(%arg2: i64, %arg3: i64):
          %4 = arith.addi %arg2, %arg3 : i64
          scf.reduce.return %4 : i64
        }
        scf.yield
      }
      numba_util.env_region_yield %2 : i64
    }
    return %0 : i64
  }
}

A fix for that already landed via #66886


Thanks, I was missing scf::registerBufferDeallocationOpInterfaceExternalModels(registry);

So, I’ve successfully ported numba-mlir to the new deallocation pipeline.
Some notes:

  • I’m using our custom get_alloc_token op for alias checks. This token is guaranteed to stay the same across various view ops, but our bufferization.clone lowering will generate a new token even if the result uses the same underlying buffer. For now I’m just replacing all extract_aligned_pointer_as_index ops with get_alloc_token in a separate pass, since the only source of them in our pipeline is the deallocation passes, but it would be nice to either have this configurable in the deallocation pass or to have a similar op upstream. This could also potentially lead to better codegen, as we can canonicalize get_alloc_token through various view ops.
  • I’ve written a custom pass to ensure all returned memrefs have unique ownership, which is needed for our ABI. The pass just checks all returned memrefs’ tokens dynamically and inserts clones if necessary. https://github.com/numba/numba-mlir/blob/39adc14f70b6a44a0202c718f084a468df4e73da/numba_mlir/numba_mlir/mlir_compiler/lib/pipelines/PlierToLinalg.cpp#L3243