[RFC] [GSOC] Buffer Reuse Pass for Non-Overlapping Allocations After lower-deallocations

Motivation & Background

Many real-world workloads are expressed as structured, sequential computation graphs such as pipelines, processing chains, and iterative blocks. In these patterns, intermediate buffers are allocated, used, and released in strict sequence, resulting in lifetimes that do not overlap.

After the bufferization and deallocation pipeline, the IR contains explicit memref.alloc / memref.dealloc pairs. Buffers with non-overlapping lifetimes each receive their own heap allocation even when they could share memory. As a result, peak memory usage scales with the total number of intermediate allocations rather than with the maximum number of simultaneously live buffers.

// Example IR with two allocations whose lifetimes do not overlap

func.func @example(%arg0: memref<1024xf32>, %arg1: memref<512xf64>) {
  %a = memref.alloc() : memref<1024xf32>            // allocates 4096 bytes
  linalg.generic … ins(%arg0) outs(%a) …
  memref.dealloc %a : memref<1024xf32>

  %b = memref.alloc() : memref<512xf64>             // allocates 4096 bytes
  linalg.generic … ins(%arg1) outs(%b) …
  memref.dealloc %b : memref<512xf64>
}

// These could share one 4096-byte allocation, a significant drop in peak memory requirement.

The bufferization documentation acknowledges this gap:

“This implies reusing already allocated buffers when possible, turning bufferization into an algorithmically complex problem with similarities to register allocation.”
Source: mlir/docs/Bufferization.md

Today there is no upstream buffer reuse pass in MLIR, which forces every major downstream stack (IREE, TVM, and XLA) to build its own. This duplicates effort, fragments memory optimizations across ecosystems, and raises the barrier for new MLIR adopters, who must either accept inflated memory usage or build complex infrastructure themselves.

Proposal

This project proposes two components to address this gap:

  1. An analysis pass that computes allocation lifetimes and reports peak live bytes and reuse opportunities without mutating IR.

  2. An opt-in rewrite pass that merges non-overlapping memref.alloc / memref.dealloc pairs into a shared memory pool using memref.view.
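As a rough illustration of what the analysis could report, here is a minimal Python sketch of "peak live bytes" computed by sweeping over alloc/dealloc events. The `Lifetime` record and `peak_live_bytes` function are hypothetical names used only for this sketch, and lifetimes are assumed to be given as operation indices within one block; this is not the proposed pass itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lifetime:
    name: str
    start: int  # op index of the memref.alloc
    end: int    # op index of the memref.dealloc (buffer is live through this op)
    size: int   # allocation size in bytes


def peak_live_bytes(lifetimes):
    """Sweep alloc/free events in program order and track the running total."""
    events = []
    for lt in lifetimes:
        events.append((lt.start, lt.size))      # buffer becomes live
        events.append((lt.end + 1, -lt.size))   # buffer is dead after its dealloc
    live = peak = 0
    # Tuples sort by position first; at equal positions frees (negative) come first.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


# The two buffers from the example IR: %a lives over ops 0..2, %b over ops 3..5.
example = [Lifetime("a", 0, 2, 4096), Lifetime("b", 3, 5, 4096)]
total_bytes = sum(lt.size for lt in example)   # 8192
peak_bytes = peak_live_bytes(example)          # 4096
```

The gap between `total_bytes` and `peak_bytes` is the reuse opportunity the analysis would surface without mutating the IR.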

Pass placement:

The pass runs after `lower-deallocations` and `promote-buffers-to-stack`:

one-shot-bufferize

  → buffer-deallocation-pipeline  (ownership-based-buffer-deallocation → lower-deallocations)

  → optimize-allocation-liveness    // shrinks lifetimes (more reuse chances)

  → promote-buffers-to-stack        // small allocs → alloca (exclude from reuse)

  → buffer-reuse                   ←  Proposed Pass

  → lower to LLVM

This placement avoids conflicts with ownership-based deallocation because:

  • bufferization.dealloc ops are already lowered so ownership flags and base-pointer aliasing checks are no longer present.
  • Lifetimes are already optimized and small allocs are already promoted to the stack, so only large heap-allocated buffers with explicit deallocs remain.

Scope

Our current scope is deliberately conservative to guarantee correctness while leaving a clear path for future extensions.

An allocation is eligible for pooling only if all conditions hold:

  • memref.alloc and memref.dealloc are in the same block

  • Static shape

  • Identity (contiguous) layout

  • Default memory space

  • Non-escaping (not returned, not passed to calls, not captured by region ops)

  • Proven non-overlapping lifetime with another eligible allocation

Uncertain cases will be skipped and tracked with per-reason statistics.
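The eligibility gate and the per-reason statistics could look roughly like the following Python sketch. The `AllocInfo` fields and the reason strings are hypothetical stand-ins for facts the real pass would query from the IR. The last condition (a proven non-overlapping lifetime) is left out here because it belongs to the later lifetime step.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AllocInfo:
    name: str
    dealloc_in_same_block: bool
    has_static_shape: bool
    has_identity_layout: bool
    in_default_memory_space: bool
    escapes: bool


# (skip reason, predicate that is True when the allocation must be skipped)
SKIP_CHECKS = [
    ("dealloc-in-other-block", lambda a: not a.dealloc_in_same_block),
    ("dynamic-shape", lambda a: not a.has_static_shape),
    ("non-identity-layout", lambda a: not a.has_identity_layout),
    ("non-default-memory-space", lambda a: not a.in_default_memory_space),
    ("escapes", lambda a: a.escapes),
]


def skip_reason(alloc):
    """Return the first reason the allocation is ineligible, or None."""
    for reason, must_skip in SKIP_CHECKS:
        if must_skip(alloc):
            return reason
    return None


def partition(allocs):
    """Split allocations into pooling candidates and per-reason skip counts."""
    candidates, skipped = [], Counter()
    for alloc in allocs:
        reason = skip_reason(alloc)
        if reason is None:
            candidates.append(alloc)
        else:
            skipped[reason] += 1
    return candidates, skipped


allocs = [
    AllocInfo("a", True, True, True, True, False),
    AllocInfo("b", True, False, True, True, False),   # dynamic shape
    AllocInfo("c", True, True, True, True, True),     # escapes
]
candidates, skipped = partition(allocs)
```

Keeping the checks in an ordered table makes each skip attributable to exactly one reason, which is what the statistics need.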

Mechanism

  1. Collect alloc/dealloc pairs via MemoryEffectOpInterface.

  2. Compute lifetime intervals using operation numbering and BufferViewFlowAnalysis::resolve().

  3. Assign reusable slots using greedy linear-scan allocation.

  4. Compute pool layout:

    • Total pool size = sum of slot sizes with alignment padding.

  5. Rewrite IR:

    • Insert pool allocation

    • Replace original allocs with memref.view

    • Remove individual deallocs

    • Insert a single pool deallocation
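Steps 3 and 4 above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the pass implementation: buffers are plain `(name, start, end, size)` tuples, lifetimes are inclusive operation indices, and `assign_slots`, `layout_pool`, and `align_up` are hypothetical helper names.

```python
def align_up(n, alignment):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def assign_slots(buffers):
    """Greedy linear scan over (name, start, end, size) tuples.

    Buffers are visited by increasing start; each one takes the first slot
    whose previous occupant is already dead, or opens a new slot.
    """
    slot_busy_until = []  # last op index at which each slot is occupied
    slot_size = []        # largest buffer ever placed in each slot
    slot_of = {}
    for name, start, end, size in sorted(buffers, key=lambda b: b[1]):
        for slot, busy_until in enumerate(slot_busy_until):
            if busy_until < start:
                break
        else:
            slot = len(slot_size)
            slot_busy_until.append(end)
            slot_size.append(0)
        slot_busy_until[slot] = end
        slot_size[slot] = max(slot_size[slot], size)
        slot_of[name] = slot
    return slot_of, slot_size


def layout_pool(slot_size, alignment=64):
    """Place slots back to back; pool size is the sum of padded slot sizes."""
    offsets, cursor = [], 0
    for size in slot_size:
        offsets.append(cursor)
        cursor += align_up(size, alignment)
    return offsets, cursor


# %a and %b never overlap; the hypothetical %c overlaps both.
buffers = [("a", 0, 2, 4096), ("b", 3, 5, 4096), ("c", 1, 4, 1000)]
slot_of, slot_size = assign_slots(buffers)
offsets, pool_bytes = layout_pool(slot_size)
```

With only `a` and `b` as input, both land in slot 0 and the pool is 4096 bytes. Adding the overlapping `c` opens a second slot, padded to the assumed 64-byte alignment.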

// After Rewrite Example:

func.func @example(%arg0: memref<1024xf32>, %arg1: memref<512xf64>) {
  %c0 = arith.constant 0 : index
  %pool = memref.alloc() {alignment = 64} : memref<4096xi8>

  %a = memref.view %pool[%c0][] : memref<4096xi8> to memref<1024xf32>
  linalg.generic ... ins(%arg0) outs(%a) ...

  %b = memref.view %pool[%c0][] : memref<4096xi8> to memref<512xf64>
  linalg.generic ... ins(%arg1) outs(%b) ...

  memref.dealloc %pool : memref<4096xi8>
}

// Peak memory reduces from 8192 bytes to 4096 bytes.

Expected Impact

  • Peak memory reduction for any pipeline with sequential non-overlapping temporary buffers (common in ML inference workloads).

  • Reusable lifetime analysis infrastructure that other passes or downstream users can build on.

  • Presents an upstream alternative to ad-hoc buffer reuse strategies, potentially reducing the need for each downstream user to develop and maintain separate solutions.

I’m considering this as a GSOC project. I’m posting this RFC to find potential mentors, gather feedback, and ensure the direction aligns with MLIR expectations. I would appreciate feedback and guidance from maintainers or contributors in this area.

CC: @matthias-springer

Each memref.alloc is a dynamic memory allocation. In case of an m1 = alloc(); dealloc(m1); m2 = alloc() pattern, the underlying memory allocator (such as malloc and free) may choose to reuse the same memory.

It’s unclear to me whether it’s worth optimizing this at the MLIR level, or whether the underlying memory allocator already does a good enough job. Maybe we can get some data points by looking into downstream projects that utilize the MLIR bufferizer. Do these projects have a buffer reuse pass?

Another potentially interesting angle could be to look at static allocation. Sometimes, you don’t want dynamic memory allocation. E.g., when running a GPU program, calling malloc is really expensive (or may not work at all). Now you have to pre-allocate a sufficiently large buffer (e.g., memref.alloca, memref.global or some other special mechanism) and manage it manually. This actually sounds very similar to what you describe.

Let’s try to get some feedback from the community on whether such a feature would be useful. If nobody is interested in this feature, you would just be adding more code that will eventually become unmaintained and be deleted.

Thanks for the thoughtful questions; they pushed me to dig deeper into the ecosystem and gain better insights.

For dynamic memory allocation, i.e. a pure CPU target with a good allocator, the benefit of compile-time pooling is small: fewer malloc/free calls and a deterministic peak memory.

The static allocation angle you mention is exactly the core motivation here. On targets such as GPUs (CUDA/Vulkan), TPUs (XLA), NPUs (Hexagon), FPGAs, and Metal, dynamic (runtime) allocation is either very costly or not available at all.

So every stack targeting these devices has implemented some compile-time logic for static memory allocation, using lifetime analysis plus buffer reuse passes. Here are some concrete data points:

  1. IREE: stream.resource.pack + ScheduleAllocation
  • compiler/src/iree/compiler/Dialect/Stream/Transforms/ScheduleAllocation.cpp
  • compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.td (definition of stream.resource.pack)
    This pass performs operation ordering and, for each buffer, computes a lifetime interval (the range from where the buffer is first defined to where it is last used). Buffers that alias the same underlying data have their intervals merged into one, because the memory must stay live as long as any alias is in use. These intervals are then passed to the stream.resource.pack op, which performs greedy packing: it takes a list of [start, end] = byte_size entries and assigns each a byte offset inside a single shared memory block, reusing offsets when lifetimes don’t overlap.

  2. XLA: hlo_live_range + heap_simulator

“BufferAllocation class abstracts an allocation of contiguous memory which can hold the values described by LogicalBuffers. Each LogicalBuffer occupies a subrange of the allocation, represented by a Slice. A single BufferAllocation may hold LogicalBuffers with disjoint liveness, which may have overlapping Slices.”
Source: xla/service/buffer_assignment.h

  3. TVM: StorageRewrite + Unified Static Memory Planning (USMP)
    StorageRewrite operates within individual operators, sharing space between temporary buffers whose lifetimes don’t overlap. For embedded targets, USMP can operate on the whole program and generates code with zero runtime allocation calls.

  4. PyTorch ExecuTorch: MemoryPlanningPass
  • memory_planning_pass.py
    The official documentation states: “It analyzes the size and lifespan of each mutable tensor and plans their allocation within fixed-size memory arenas, enabling efficient buffer reuse.”

  5. Qualcomm Hexagon: the tensor and vector units need even more aggressive buffer reuse, as they can only access TCM (8 MB on-chip SRAM), not global memory, and there’s no dynamic allocation.

  6. NVIDIA: cudaMallocAsync is a runtime pool allocator, but for inference, TensorRT pre-computes all buffer lifetimes at engine build time and performs static memory planning: one large allocation, then offset-based sub-allocation.
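The offset-based packing these stacks share can be illustrated with a generic greedy first-fit sketch in Python. This is not the algorithm of any of the projects listed. The `pack` function, the largest-first ordering, and the 16-byte alignment are assumptions made for illustration.

```python
def align_up(n, alignment):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def pack(buffers, alignment=16):
    """Assign byte offsets to (name, start, end, size) tuples in one arena.

    Buffers are placed largest first. Each buffer takes the lowest aligned
    offset that does not collide with an already placed buffer whose
    lifetime [start, end] overlaps its own.
    """
    placed = []   # (start, end, offset, size) of buffers already in the arena
    offsets = {}
    for name, start, end, size in sorted(buffers, key=lambda b: -b[3]):
        # Byte ranges occupied by buffers that are live at the same time.
        conflicts = sorted(
            (off, off + sz)
            for s, e, off, sz in placed
            if s <= end and start <= e
        )
        offset = 0
        for lo, hi in conflicts:
            if offset + size <= lo:
                break                      # the gap before this range fits
            offset = max(offset, align_up(hi, alignment))
        placed.append((start, end, offset, size))
        offsets[name] = offset
    arena_bytes = max((off + sz for _, _, off, sz in placed), default=0)
    return offsets, arena_bytes


# x and z never overlap in time, so they share offset 0; y overlaps both.
offsets, arena_bytes = pack([("x", 0, 3, 256), ("y", 2, 5, 128), ("z", 4, 7, 256)])
```

In this example the three buffers total 640 bytes but fit in a 384-byte arena, because `x` and `z` reuse the same byte range.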

Unlike runtime allocators, a compiler can reason about lifetimes, which plays a key role in static memory planning. Based on the evidence above, every stack is doing the same thing that I proposed. The only difference is that they do it on their own IR, because upstream MLIR doesn’t provide it.

There should be an upstream analysis pass that computes object lifetimes and identifies memory reuse opportunities, plus an opt-in rewrite pass for reuse.

There’s a prior RFC thread by @Menooker which explored the same problem space and saw engagement from multiple contributors. I’d appreciate their input on this.

On maintenance cost: The proposed pass utilizes existing analyses and strict legality gates. Almost every framework reimplemented the same algorithm: lifetime intervals, non-overlap detection, and offset assignment into shared memory. That’s the textbook signal that the shared infrastructure is missing. Providing it upstream reduces total maintenance across the ecosystem rather than increasing it.

CC those who may have insight: @matthias-springer @Menooker @rengolin @MaheshRavishankar @feiyulv

Relying on a smart enough allocator is a bit dangerous, even on CPUs. For example, it’s less predictable when switching allocators at runtime (common in high-performance code). Reducing the number of malloc calls (when proven correct) is always beneficial.

While I agree with this statement, I also agree with @matthias-springer’s caution.

IIUC, you’re trying to implement an arena allocation model in MLIR that is at least opinionated, but likely restrictive in what it can do (cross-target wise). In itself, this is a worthy contribution, but to get to a point where we can use this upstream (and in multiple downstreams), a lot of customization may be needed.

So, a liveness analysis of various memory regions would be a good start. Some interval tree with meta-data (shape, element type, alignment, address space). One can do different things with that analysis, and it ought to be useful to more than just an arena allocation scheme.

The analysis is likely the same across all projects. The overlapping detection may have small differences (ex. allowing only same type reuse, or allowing smaller buffers reusing larger ones). The actual transform may be different enough (ex. GPU, TPU and CPU would make different decisions further down the pipeline on the meaning of alloc). Devil in the details kind of thing.
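To make the split between shared analysis and project-specific policy concrete, here is a small Python sketch. The `LiveRange` record, the two policy functions, and `reuse_candidates` are hypothetical. A plain pairwise scan stands in for the interval tree mentioned above, purely to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveRange:
    name: str
    start: int           # first op index at which the buffer is live
    end: int             # last op index at which the buffer is live
    shape: tuple
    element_type: str
    size_bytes: int
    alignment: int
    address_space: int


def overlaps(a, b):
    """Inclusive interval overlap test."""
    return a.start <= b.end and b.start <= a.end


# Two example policies; a downstream project would supply its own.
def same_type_only(old, new):
    return old.element_type == new.element_type and old.shape == new.shape


def smaller_into_larger(old, new):
    return new.size_bytes <= old.size_bytes


def reuse_candidates(ranges, policy):
    """Pairs (earlier, later) where `later` may reuse the memory of `earlier`."""
    pairs = []
    for i, a in enumerate(ranges):
        for b in ranges[i + 1:]:
            if a.address_space != b.address_space or overlaps(a, b):
                continue
            old, new = (a, b) if a.end < b.start else (b, a)
            if policy(old, new):
                pairs.append((old.name, new.name))
    return pairs


ranges = [
    LiveRange("a", 0, 2, (1024,), "f32", 4096, 64, 0),
    LiveRange("b", 3, 5, (512,), "f64", 4096, 64, 0),
]
strict = reuse_candidates(ranges, same_type_only)        # no pair: types differ
relaxed = reuse_candidates(ranges, smaller_into_larger)  # a's memory can host b
```

The liveness data is identical in both calls. Only the policy argument changes what counts as a reuse opportunity.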

Exactly. That’s why I’d start with the analysis only first, and replace downstreams with that. Then iterate.

Although this comment was about the underlying allocator, I’d like to expand this concern to the overall MLIR users’ stack. While IREE and OpenXLA’s MLIR compiler can directly use it (@kuhar, @jpienaar), PyTorch and TVM would never do so. The new Hexagon compiler could probably use that, too (@javedabsar). Nvidia using TensorRT, a much higher abstraction level, would probably never use it either.

So the key question is whether the stakeholders mentioned above are in sync with your proposal and will use it, contribute to it, and help maintain it.

Another question, since this was proposed as a GSOC project, who could volunteer as a mentor? Everything bufferization-related would usually fall into my area of expertise. And I’d like to encourage students to participate in GSOC. (I have been a GSOC student myself, for a different project). But I’m afraid I don’t have the time to sign up for any extra tasks apart from the OSS things I’m working on already. But I’d be happy to help with reviewing designs + code.

I am interested in this, and happy to guide (often with help from Matthias). Thanks @rengolin for bringing this to attention.

Based on @rengolin’s feedback, it seems an analysis pass will be more beneficial for downstream stacks. So a more appropriate direction would be:

  • Start with an analysis pass that computes allocation lifetimes, tracks aliasing relationships, and exposes metadata such as type, shape, size, alignment, and address space.
  • Then provide an extensible framework that identifies reuse opportunities and leaves transformation decisions to downstream projects.
  • Then possibly provide an opt-in generic rewrite pass (maybe similar to what XLA has implemented) as a demonstration of how the analysis can be used.

This way, downstream projects could share the same analysis foundation while still keeping flexibility in their memory planning strategies.

@rengolin thanks for valuable feedback.

@matthias-springer thanks for clarifying and for offering help with reviewing design+code.

@javedabsar thank you for your interest and willingness to guide.

For some extra context, we also have a pass that uses a similar approach in IREE for handling shared (workgroup) memory allocations for GPU kernels: compiler/Codegen/Common/GPU/GPUReuseSharedMemoryAllocs.cpp

It sounds like we could definitely plug at least the analysis part into IREE, and maybe the transformation as well.

I might be able to help here

One warning I have is that merging allocations runs into issues around expand_strided_metadata and, in particular, the fact that an identity layout is currently defined to mean “I assert that the offset field of this memref is and will always remain 0”, which causes problems with transforms like this that we’ve been working around in IREE. What should be done about this, if anything, has been a rather lively discussion topic in the past.

I don’t think this should block your project, just figured I’d let you know about disclaimers you might need to put on it.

Thanks @krzysz00 for pointing this out.

Yes, identity-layout memrefs requiring a zero offset can create issues when trying to merge allocations using views into a shared buffer.

I’d be interested to know what kinds of workarounds can be used here; for example, we could introduce explicit strided layouts or perform some layout conversions. If nothing works, we will restrict the scope of the project to cases where this constraint doesn’t apply and document the limitations.

Given your experience with IREE, I would appreciate your further guidance. Thanks again.

As mentioned by @akorobeynikov, the mentor (@javedabsar) needs to follow these guidelines to comply with GSOC requirements. I’m happy to help with preparing the draft if needed.
Another concern is that they want two mentors (one primary and another as a backup).

At least two mentors are required. Three or four are fine as well :wink:

Anyone interested in mentoring or providing guidance for this exciting GSoC project is warmly welcome.
Your expertise and support would be greatly appreciated.

If @javedabsar agrees to be the primary one, I can be the second one.

Sounds good.

Hi @Prince25, I have a PR, [mlir] [memref] Compile-time memref.alloc Scheduling/Merging optimization (Pull Request #95882, llvm/llvm-project), for a similar proposal: https://discourse.llvm.org/t/rfc-compile-time-memref-alloc-scheduling-merging-optimization/78872. Feel free to use it as a reference!

Thanks for sharing, I had followed your RFC thread but hadn’t seen this PR before.

My approach is fairly different, particularly in terms of pipeline placement, and we’re going with a different method for the analysis. Still, I’ll use this to learn from earlier design challenges and avoid similar pitfalls.

Hello @Prince25 @matthias-springer @javedabsar Thank you for the RFC, I am really interested in this project.

I was looking at the official GSoC project listing (The LLVM Compiler Infrastructure Project) and noticed that it describes the expected output as “all allocations are turned to memref.subviews of a statically-allocated memref”, so memref.subview, not memref.view.

Since memref.subview can’t change the element type (you can’t go from i8 to f32), would the rewrite need both ops? Something like memref.subview to carve out the byte range, then memref.view to reinterpret the type?

I also had another question: the official description says “turns dynamic allocations into static ones”. Should the pool itself be a memref.alloca or memref.global (truly static) rather than another memref.alloc?

I’m especially curious about this for targets like GPUs or NPUs where runtime allocation may not exist. Thank you