[RFC] Compile-time memref.alloc Scheduling/Merging optimization

This document proposes a compile-time optimization on existing memref.alloc to reduce memory usage and improve memory locality.

Current status of bufferization and memref pass pipeline

Bufferization is the process in MLIR of converting ops with tensor semantics to ops with memref semantics. MLIR currently has two different bufferization passes: One-Shot Bufferize and the older/partial (legacy) bufferization.
One-Shot Bufferize is a newer tensor bufferization pass designed for IR in destination-passing style, with aggressive in-place bufferization. The older/partial bufferization was built around multiple dialects. The community is gradually deprecating the older bufferization in favor of One-Shot Bufferize.
The goal of bufferization is to use as little memory as possible and to copy as little memory as possible. As a result, the existing focus is on deciding in-place vs. out-of-place bufferization for the OpOperands and OpResults of individual ops, without much consideration of overall memory reuse across operators within a sub-graph (or partition).

The current implementation of Bufferization and memref pass pipeline focuses on copy-avoidance and in-place reusing of the memory. Consider a computation graph of 4 layers of matmul sharing the same weight:

func.func @mlp(%x: tensor<128x128xf32>, %y: tensor<128x128xf32>) -> tensor<128x128xf32> {
   %a0 = tensor.empty() : tensor<128x128xf32>
   %a = linalg.matmul ins(%x, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%a0: tensor<128x128xf32>) -> tensor<128x128xf32>
   %b0 = tensor.empty() : tensor<128x128xf32>
   %b = linalg.matmul ins(%a, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%b0: tensor<128x128xf32>) -> tensor<128x128xf32>
   %c0 = tensor.empty() : tensor<128x128xf32>
   %c = linalg.matmul ins(%b, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%c0: tensor<128x128xf32>) -> tensor<128x128xf32>
   %d0 = tensor.empty() : tensor<128x128xf32>
   %d = linalg.matmul ins(%c, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%d0: tensor<128x128xf32>) -> tensor<128x128xf32>
   return %d : tensor<128x128xf32>
}

The bufferization pass will create a memref.alloc for each of the tensors a0, b0, c0 and d0. The bufferization result should look like:

func.func @mlp(%x: memref<128x128xf32>, %y: memref<128x128xf32>) -> memref<128x128xf32> {
   %a0 = memref.alloc() : memref<128x128xf32>
   linalg.matmul ins(%x, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%a0: memref<128x128xf32>)
   %b0 = memref.alloc() : memref<128x128xf32>
   linalg.matmul ins(%a0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%b0: memref<128x128xf32>)
   %c0 = memref.alloc() : memref<128x128xf32>
   linalg.matmul ins(%b0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%c0: memref<128x128xf32>)
   %d0 = memref.alloc() : memref<128x128xf32>
   linalg.matmul ins(%c0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%d0: memref<128x128xf32>)
   return %d0 : memref<128x128xf32>
}

Without further optimizations, 3 temporary buffers will be allocated at runtime for these tensors. However, as we can see in the IR, buffer a0 is no longer used by the time buffer c0 is allocated. So buffer c0 can reuse the memory of buffer a0, to reduce the memory footprint and improve locality.

An observation of the current bufferization and memref passes is that they do not perform memory buffer planning - reusing buffers/memrefs for a smaller total allocation size and better locality.

Proposal

This RFC proposes an optimization that consolidates multiple allocations (memref.alloc ops) into a single memref.alloc op; each static-shaped memref.alloc op is transformed into a "slice" of the single allocated buffer via memref.view with a compile-time decided offset. This optimization works on memref instead of tensor ops, so it should be executed after the bufferization pass and before buffer deallocation.

While merging the memory allocations, the transform should consider the lifetime of each allocated memref. By lifetime, we mean the range of time during which a memref allocated by memref.alloc is actively used. References to views of a "base" memref contribute to the lifetime of the "base". A later memref.alloc should consider reusing the memory of a previously allocated memref if the lifetimes of the two do not overlap. The transform performs this "reuse" by setting the offset of the later memref.view to a position within the memory range of the previous allocation's memref.view on the single allocated buffer.

Below is the expected transformation result of the example IR in the above section:

func.func @mlp(%x: memref<128x128xf32>, %y: memref<128x128xf32>) -> memref<128x128xf32> {
   %single_buffer = memref.alloc() : memref<131072xi8> // 128*128*sizeof(f32)*2
   %a0 = memref.view %single_buffer[0][] : memref<131072xi8> to memref<128x128xf32> // a0 takes the memory from byte offset 0
   linalg.matmul ins(%x, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%a0: memref<128x128xf32>)
   %b0 = memref.view %single_buffer[65536][] : memref<131072xi8> to memref<128x128xf32> // b0 takes the memory from byte offset 128*128*sizeof(f32)
   linalg.matmul ins(%a0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%b0: memref<128x128xf32>) 
   %c0 = memref.view %single_buffer[0][] : memref<131072xi8> to memref<128x128xf32> // c0 takes the memory from byte offset 0
   linalg.matmul ins(%b0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%c0: memref<128x128xf32>)
   %d0 = memref.alloc() : memref<128x128xf32> // d0 is returned, do not merge
   linalg.matmul ins(%c0, %y: memref<128x128xf32>, memref<128x128xf32>) outs(%d0: memref<128x128xf32>)
   return %d0 : memref<128x128xf32>
}

There is one single allocation, single_buffer, for all temp buffers, and the alloc ops for a0, b0 and c0 are removed. The returned memref d0 is untouched. The memrefs a0, b0 and c0 are replaced by memref.view ops on single_buffer. Since the lifetimes of a0 and b0 overlap, the transformation "allocates" different memory ranges on single_buffer for them - note that a0 and b0 have different offsets %single_buffer[0] and %single_buffer[65536], and their memory ranges do not overlap. The lifetime of c0 does not overlap with that of a0, so c0 can reuse the memory range of a0 by setting its offset to %single_buffer[0], the same as a0's. The final allocation size of the temporary buffer will be 128*128*sizeof(f32)*2, instead of three memref<128x128xf32> buffers as in the original IR.

The transformation should only consider merging a memref.alloc if

  • the ownership of the memref does not escape from the function. That is, the current function is responsible for allocating and deallocating this memref
  • and, the allocated memref is contiguous and has a static shape

In this RFC, we call such memref.alloc ops mergeable allocations.

The memrefs passed in as function arguments, or returned by the function, will be untouched by this optimization.
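
For illustration, a mergeability check along the lines of the conditions above could look like the C++ sketch below. The helper name isMergeableAlloc is a placeholder, and the escape check shown here is deliberately naive (only direct func.return uses); a real implementation needs alias information to track views that escape.

#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"

using namespace mlir;

// Sketch: decide whether a memref.alloc is a candidate for merging.
static bool isMergeableAlloc(memref::AllocOp alloc) {
  MemRefType type = alloc.getType();
  // Only contiguous, statically shaped buffers are merged.
  if (!type.hasStaticShape() || !type.getLayout().isIdentity())
    return false;
  // The buffer must stay owned by the function, i.e. it must not be returned.
  for (Operation *user : alloc->getUsers())
    if (isa<func::ReturnOp>(user))
      return false;
  return true;
}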

Other solutions

Another (not yet existing) approach to the memory reuse issue is to insert memref.dealloc as soon as the buffer is no longer used. For example, in the above "matmul" example, a memref.dealloc can be inserted after the last use of a0 at linalg.matmul ins(%a0, %y...). So even without the memref merging transformation, a common runtime memory allocator will try to reuse the memory freed by memref.dealloc(%a0) when allocating the buffer for c0. However, this approach has some disadvantages compared to the compile-time memref merging transformation of this proposal:

  1. it depends on the implementation of the runtime memory allocator.
  2. the runtime memory allocator does not have the full picture of the future allocation/deallocation patterns of the program. For example, if we change the above example so that the buffer size of c0 is greater than the size of a0, the runtime memory allocator will likely not be able to reuse the memory of a0 for c0, because the free memory chunk of a0 does not fit the allocation of c0. In contrast, the proposed optimization of this document has knowledge of the allocation patterns. Thus, it can put the memory chunk for a0 in the right place within the single allocation buffer, so that the allocation of c0 can fit into it.
  3. calling the runtime memory allocator for each buffer introduces more runtime overhead than a single merged allocation after allocation merging.

However, utilizing a runtime memory allocator can be viewed as a supplementary approach to the compile-time allocation merging, for example, to handle memrefs with dynamic shapes. These two memory optimization approaches should coexist and cooperate in the pass pipeline.

Implementation

The transformation first needs to identify the alloc scopes, which are MLIR Blocks of

  • a function’s body
  • or, the body of an scf parallel execution op, such as scf.forall or scf.parallel

For example, below is the IR of a function with nested scf.forall ops.

func.func @mlp(...) { // <---- alloc scope 1
   scf.for(...) { // <---- NOT an alloc scope!
      // allocations inside will be merged into alloc scope 1 above
   }
   ...
   scf.forall(...) { // <---- alloc scope 2
      ...
      // allocations here will be merged into alloc scope 2
      %buf = memref.alloc() : ...
      scf.forall(...) { // <---- alloc scope 3
      }
   }
}

There will be three alloc scopes as marked in the comments above. An alloc scope marks the position where the single allocation buffer will be inserted after allocation merging. After the transformation, all “mergeable” memref.alloc ops will be merged into the single allocation buffer of the nearest ancestor alloc scope.

The transformation consists of an analysis sub-pass and a mutation sub-pass. For each alloc scope, the analysis sub-pass finds the lifetime of each mergeable memref.alloc belonging to that alloc scope. Given the lifetime of each allocation, a memory planning algorithm is run to find the single allocation buffer size for each alloc scope and the offset of each mergeable allocation within its single allocation buffer. Based on the memory planning result, the mutation sub-pass transforms the IR to

  1. insert memref.alloc at the front of alloc scope body for its single allocation buffer
  2. replace mergeable memref.alloc with memref.view on its alloc scope’s single allocation buffer
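
A rough sketch of the mutation step, given the planning result for one alloc scope (a total size in bytes and a byte offset per mergeable allocation), might look like the following; the function name rewriteScope and the shape of the planning result are placeholders, not the final implementation:

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/Builders.h"

using namespace mlir;

// Sketch: rewrite one alloc scope based on the memory planning result.
static void rewriteScope(Block *scope, int64_t totalSize,
                         const llvm::DenseMap<memref::AllocOp, int64_t> &offsets) {
  OpBuilder builder = OpBuilder::atBlockBegin(scope);
  Location loc = scope->getParentOp()->getLoc();
  // Step 1: insert the single allocation buffer at the front of the scope body.
  auto bufferType = MemRefType::get({totalSize}, builder.getI8Type());
  Value singleBuffer = builder.create<memref::AllocOp>(loc, bufferType);
  // Step 2: replace each mergeable alloc with a memref.view at its offset.
  for (auto [alloc, offset] : offsets) {
    builder.setInsertionPoint(alloc);
    Value byteShift = builder.create<arith::ConstantIndexOp>(loc, offset);
    Value view = builder.create<memref::ViewOp>(
        loc, alloc.getType(), singleBuffer, byteShift, /*sizes=*/ValueRange{});
    alloc.getResult().replaceAllUsesWith(view);
    alloc->erase();
  }
}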

Ticks are assigned to each operation in the func.func by an increasing counter during a pre-order recursive walk of the IR, serving as the “execution tick” of each operation. The lifetime analysis pass will assign two integers to each mergeable allocation as the analysis result: begin_tick and end_tick, indicating the first and last tick of the use of the allocated memref in the IR. There needs to be special handling for loop ops that reference memrefs allocated in parent scopes, to avoid incorrect reuse of buffers used in the loop.

The analysis result for each mergeable allocation will be an integer range [begin_tick, end_tick], where begin_tick <= end_tick.
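
A much simplified version of this tick collection could be sketched as below; the findBaseAlloc helper, which walks view chains back to the defining memref.alloc, is assumed here, and the special handling of loops is omitted:

#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

using namespace mlir;

// Lifetime of one mergeable allocation, in "ticks".
struct Lifetime {
  int64_t beginTick = -1;
  int64_t endTick = -1;
};

// Assumed helper: walks view chains back to the defining memref.alloc.
memref::AllocOp findBaseAlloc(Value v);

// Sketch: assign a tick to every operation with a pre-order walk and record
// the first/last tick at which each base allocation is referenced.
static llvm::DenseMap<memref::AllocOp, Lifetime>
collectLifetimes(func::FuncOp func) {
  llvm::DenseMap<memref::AllocOp, Lifetime> lifetimes;
  int64_t tick = 0;
  func->walk<WalkOrder::PreOrder>([&](Operation *op) {
    ++tick;
    if (isMemoryEffectFree(op)) // pure ops do not extend any lifetime
      return;
    for (Value operand : op->getOperands()) {
      if (memref::AllocOp alloc = findBaseAlloc(operand)) {
        Lifetime &lt = lifetimes[alloc];
        if (lt.beginTick < 0)
          lt.beginTick = tick;
        lt.endTick = tick;
      }
    }
  });
  return lifetimes;
}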

The collected ticks for each buffer will be processed by the memory planning algorithm. It should output the total size of the single allocation buffer for each alloc scope, and the offset of each individual mergeable buffer. The algorithm should also consider the locality of the buffers when multiple candidate locations are available.


This is indeed something that the bufferization framework does not take into account. It’s been on our TODO list for a long time and any improvement would be greatly appreciated!

So allocations with dynamic sizes would be ignored by this transformation, right? What about static allocations with different element types? Would you have one alloc per unique element type?

The bufferization framework is already quite complex. So from a modularity perspective, it makes sense to have this as bufferization post-processing pass.

Your transformation will likely require aliasing information. One-Shot Bufferize computes alias sets during bufferization, but given that your transformation is a post-bufferization pass, it won’t be able to access this aliasing information.

When we designed the ownership-based buffer deallocation pass, we needed something similar: an analysis to decide if two memref SSA values originate from the same allocation. We first built something on top of MLIR’s alias analysis. This analysis turned out to be too conservative (but could likely be improved) and not exactly what we needed (“no aliasing” does not imply “different allocation”). We then built a simple “is-same-allocation” analysis (that still treats many cases quite conservatively) on top of BufferViewFlowAnalysis. These may be useful for your transformation.

This has been proposed a while ago: [RFC] Lifetime Annotations of Memory Within MLIR. I’m not sure what happened to that RFC in the end.

This sounds similar to the concept of “ownership” in the ownership-based buffer deallocation pass. In the ownership-based buffer deallocation pass, the thing that owns an allocation is always a block. It is not always statically known if a block owns an allocation or not. In such cases, an “ownership indicator” will be materialized in the form of an i1 value. E.g.:

func.func @foo() {
  %0 = memref.alloc
  // Nested regions/blocks never take ownership, so the ownership of %0 stays with the function
  %1 = scf.for ... iter_args(%arg0 = %0) -> memref<5xf32> {
    // Does this block have ownership of %arg0? Depends on whether %arg0 is %0 or %2 --> runtime ownership indicator needed
    %2 = memref.alloc
    %3 = arith.select ..., %arg0, %2
    scf.yield %3
  }
}

Do you have the same concept of “ownership”? Can you run into such cases?

Side note: The ownership-based buffer deallocation pass has a “dynamic ownership” option. If that option is enabled, ownership indicators are added to the function signature, making it possible to support buffer deallocation across function boundaries.

I think this would be relatively easy to implement in the ownership-based buffer deallocation pass. At the moment, bufferization.dealloc ops (which turn into memref.dealloc) are always inserted at the end of a block (right before the terminator). But we could insert them earlier. It would be interesting to see how much this improves the situation.

Why not any basic block?

I think you could compute these “ticks” (or something similar) with the BufferViewFlowAnalysis. If I remember correctly, for a given allocation, it can give you all SSA values that may be a view/alias of the allocation.

This may be the interesting part. It always gets tricky when there are loops or unstructured control flow.

We have a draft implementation at our downstream repo: [mlir][Memref] Add memref-merge optimization by Menooker · Pull Request #44 · intel/graph-compiler · GitHub

So allocations with dynamic sizes would be ignored by this transformation, right? What about static allocations with different element types? Would you have one alloc per unique element type?

Dynamic-sized allocations are untouched. And allocs of different element types are merged into one allocation if possible. We use a single alloc of i8 and memref.view ops on it.

These may be useful for your transformation.

Thanks for the suggestions. I am new to MLIR, so they should be helpful. By the way, in my draft implementation, I recursively walk the bases of the ViewLike ops to find the base allocations.
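
For reference, that recursive walk can be expressed with ViewLikeOpInterface roughly as follows (a simplified sketch, not the exact code in the PR):

#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Interfaces/ViewLikeInterface.h"

using namespace mlir;

// Sketch: follow view-like ops (memref.view, memref.subview, ...) back to the
// memref.alloc they are derived from; returns a null op if there is none.
memref::AllocOp findBaseAlloc(Value v) {
  while (Operation *def = v.getDefiningOp()) {
    if (auto alloc = dyn_cast<memref::AllocOp>(def))
      return alloc;
    if (auto view = dyn_cast<ViewLikeOpInterface>(def)) {
      v = view.getViewSource();
      continue;
    }
    break;
  }
  return memref::AllocOp();
}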

This sounds similar to the concept of “ownership” in the ownership-based buffer deallocation pass. In the ownership-based buffer deallocation pass, the thing that owns an allocation is always a block. It is not always statically known if a block owns an allocation or not. In such cases, an “ownership indicator” will be materialized in the form of an i1 value.

In my current implementation, I only consider whether a function returns a memref. Such memrefs are not mergeable. If a memref.alloc is not returned, we don’t consider the “block-ownership”. Instead, we only consider its lifetime. This test file is an example of annotated lifetimes: graph-compiler/test/gc-dialects/Transforms/buffer-merge-lifetime.mlir at ef3d150ad591dc9ff3c86686341c2e7ce007695d · intel/graph-compiler · GitHub

Side note: The ownership-based buffer deallocation pass has a “dynamic ownership” option. If that option is enabled, ownership indicators are added to the function signature, making it possible to support buffer deallocation across function boundaries.

Maybe we should support function-local allocations as a first step, and revisit “dynamic ownership” later?

I think this would be relatively easy to implement in the ownership-based buffer deallocation pass. At the moment, bufferization.dealloc ops (which turn into memref.dealloc) are always inserted at the end of a block (right before the terminator). But we could insert them earlier. It would be interesting to see how much this improves the situation.

Yes, indeed. It should help to improve the memory usage and locality in some cases.

Why not any basic block?

I will update the wording in the RFC. It should be:
The transformation first needs to identify the alloc scopes, which are MLIR Blocks

  • of ops implementing AutomaticAllocationScope
  • and are not the body of an scf.for (allocations in an scf.for can be hoisted to the parent AutomaticAllocationScope)

I think you could compute these “ticks” (or something similar) with the BufferViewFlowAnalysis. If I remember correctly, for a given allocation, it can give you all SSA values that may be a view/alias of the allocation.

The “ticks” are the core of this pass. It is not just about getting the aliases of an allocation. We need to assign a [start, end] range to each mergeable allocation, to linearize the lifetime (or maybe liveness) of each buffer into a stream of malloc/free events. A static memory buffer allocator is then used to “schedule” the buffers.

This may be the interesting part. It always gets tricky when there are loops or unstructured control flow.

We can now handle RegionBranchOpInterface and LoopLikeOpInterface. Any memref buffers that are accessed in a for or if are ensured to have overlapping lifetimes.

e.g.

a = memref.alloc() // tick=1
b = memref.alloc() // tick=2
scf.for(...) {
    use1 a // tick=4
    use2 a // tick=5
    use3 b // tick=6
    use4 b // tick=7
}
// end-of-loop-tick=7

The pass makes sure the lifetime of a is at least [4,7] and that of b is at least [6,7]. Because they are used in the loop, even though the last use of a in the IR is at tick 5, we extend it to end-of-loop-tick=7.

We can be more aggressive when the memref.alloc is defined within the for:

scf.for(...) { // tick=0
    a = memref.alloc() // tick=1
    b = memref.alloc() // tick=2
    use1 a // tick=3
    use2 a // tick=4
    use3 b // tick=5
    use4 b // tick=6
}
// end-of-loop-tick=7

The lifetime of a will be [3,4] and that of b will be [5,6]. They do not overlap, so the buffers can reuse each other's memory.
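
A rough sketch of this conservative rule (reusing the Lifetime struct and the assumed findBaseAlloc helper from the earlier sketches; scf.if and other RegionBranchOpInterface ops would be handled the same way):

#include <algorithm>

#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Interfaces/LoopLikeInterface.h"

using namespace mlir;

struct Lifetime { int64_t beginTick = -1, endTick = -1; }; // as in the earlier sketch
memref::AllocOp findBaseAlloc(Value v); // assumed helper, sketched earlier

// Sketch: after ticks are collected, extend the end tick of every allocation
// that is referenced inside the loop but defined outside of it to the loop's
// end tick, so that later iterations can never observe a reused buffer.
static void extendLifetimesForLoop(LoopLikeOpInterface loop, int64_t loopEndTick,
                                   llvm::DenseMap<memref::AllocOp, Lifetime> &lifetimes) {
  loop->walk([&](Operation *op) {
    for (Value operand : op->getOperands()) {
      memref::AllocOp alloc = findBaseAlloc(operand);
      if (!alloc)
        continue;
      // Allocations defined inside the loop body keep their per-iteration
      // ranges and may still reuse each other's memory.
      if (loop->isProperAncestor(alloc))
        continue;
      Lifetime &lt = lifetimes[alloc];
      lt.endTick = std::max(lt.endTick, loopEndTick);
    }
  });
}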

BTW, Is there a way to edit my original post? I would like to refine some of the wording, like the definition of alloc scopes.

Not after some time, no. I was also caught out by this rule before.


Thanks for the post! I have been thinking about something similar for a while and have a few thoughts on this.

I have one main question regarding your implementation: why do this as a post-bufferization pass and merge memref.allocs? During bufferization, you already have lifetime annotations (last use of each tensor value) and aliasing information for each tensor value. Combining these, you can produce a verifier that checks whether your buffer assignment is legal (happy to discuss more details on how). I would expect a smarter bufferizer to use this verifier to transform bufferization.alloc_tensor ops into memref.view operations instead of first bufferizing and then relying on an analysis to recover this information.

Just a word of caution: the bufferization pass is already quite complex. If we can reuse BufferViewFlowAnalysis (or some other existing infrastructure) to gather aliasing information, I’d recommend implementing this transformation as a separate pass instead of making the bufferization pass even larger. (This would also make it easier to test the transformation, etc.)

For reference, see the existing bufferization pre/post-processing passes (red/green boxes), some of which could’ve been part of the One-Shot Bufferize pass, but we decided against it.


When I say “bufferization”, I mean the whole process up to (and including) one-shot-bufferization in your flow chart. The allocation merging optimization would sit between eliminate-empty-tensor (or whatever state after which every allocation is an alloc_tensor) and the actual bufferization. It would then perform the merging optimization on alloc_tensor ops instead, possibly converting them into bufferization.alloc_slice %big_buffer_to_allocate_from.

The point I want to make is that we shouldn’t be running an analysis to get aliasing information after buffers have been allocated because aliasing/lifetime information is readily available in tensor land.

The input of the compiler may be an MLIR program that directly uses memref and memref.alloc, which has no tensor ops and does not need bufferization. I suppose decoupling this pass from bufferization will help optimize the program in that case.

I found that we indeed need this analysis to find the base allocation of a memref, since a select can choose between two different memrefs.

Could you elaborate on how bufferization currently handles the lifetime annotations? I guess the aliasing info is extracted via BufferViewFlowAnalysis, as mentioned by @matthias-springer. I am quite interested in how lifetime is annotated and how it correctly handles control flow like scf.if and scf.for.

For example, at the last part of the post [RFC] Compile-time memref.alloc Scheduling/Merging optimization - #4 by Menooker there are two scf.for loops with expected lifetime annotations.

Another question: it seems that the current liveness analysis works at the Block level. If allocations a and b are used in the same block, will the current liveness analysis say they have overlapping lifetimes? What if the IR looks like:

{
    a = memref.alloc() // tick=1
    b = memref.alloc() // tick=2
    use1 a // tick=3
    use2 a // tick=4
    use3 b // tick=5
    use4 b // tick=6
    // a and b are not used outside of this block
}

allocations a and b actually have non-overlapping lifetimes.
I would like to confirm whether the current liveness analysis can handle this.

I would like to extend the discussion on the implementation.

Static memory scheduler

Currently we have a “static memory scheduler” which works well, and we would like to contribute it to MLIR. The input of the static memory scheduler is a “vector” of memory “events”, where each “event” contains {bufID, size, freeOrAlloc}. The static memory scheduler outputs the size of the merged single buffer for all the allocations and a map of bufID => offset.
This requires linearizing the first-use/last-use of each allocation. For example,

    a = memref.alloc() : memref<10xi8> // tick=1
    b = memref.alloc() : memref<20xi8> // tick=2
    use1 a // tick=3
    use2 b // tick=4
    use3 a // tick=5
    use4 b // tick=6

will be linearized to events of

{bufID="a", size = 10, alloc}
{bufID="b", size = 20, alloc}
{bufID="a", size = 10, free}
{bufID="b", size = 20, free}

The alloc and free events represent the first use and last use in the IR.

A possible output of the scheduler may be

totalsize = 30
{
  "a"=> offset=0
  "b"=> offset=10
}
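
To make the interface concrete, below is a minimal sketch of such a scheduler written as a plain first-fit allocator over a linear address space; it ignores the free-chunk merging and locality heuristics of our actual implementation, and the names are placeholders:

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct MemoryEvent {
  std::string bufId;
  int64_t size;
  bool isAlloc; // true = first use ("alloc"), false = last use ("free")
};

struct ScheduleResult {
  int64_t totalSize = 0;
  std::map<std::string, int64_t> offsets; // bufId => byte offset
};

ScheduleResult scheduleBuffers(const std::vector<MemoryEvent> &events) {
  ScheduleResult result;
  std::map<int64_t, int64_t> live; // offset => size of currently live chunks
  for (const MemoryEvent &e : events) {
    if (!e.isAlloc) {
      live.erase(result.offsets.at(e.bufId)); // the chunk becomes free
      continue;
    }
    // First fit: find the lowest gap between live chunks that can hold e.size.
    int64_t offset = 0;
    for (const auto &[start, size] : live) {
      if (start - offset >= e.size)
        break;
      offset = std::max(offset, start + size);
    }
    live[offset] = e.size;
    result.offsets[e.bufId] = offset;
    result.totalSize = std::max(result.totalSize, offset + e.size);
  }
  return result;
}

For the event stream in the example above, this returns totalsize = 30 with a at offset 0 and b at offset 10.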

Our current MLIR implementation of the whole pass

Our current MLIR implementation is “tick”-based. We use a monotonically increasing clock when walk()ing the function and assign a tick with the current clock value to each walked op. While walking the IR, we find the non-pure ops (those that may have memory effects) and extend the [start_tick, end_tick] of each allocation the op references.

After collecting the ticks for each allocation, we sort the start_tick and end_tick of all allocations to form a vector of “events” {bufID, size, freeOrAlloc}. Finally it is passed to the “static memory scheduler”.

We also handle RegionBranchOpInterface and LoopLikeOpInterface specially, to make sure the allocations referenced inside for and if ops have overlapping lifetimes, so they cannot reuse each other's buffers. This ensures correctness in a conservative way.

In the example above, allocation a has ticks [3,5] and b has ticks [4,6]. That is, the “alloc event for a” happens at tick 3 and the “free event for a” happens at tick 5. We sort the events for a and b by their ticks to get the linearized events shown in the previous section.

Other ways to linearize the memory events

As suggested, it might be possible to use the LivenessAnalysis in the current MLIR code base to linearize the memory events. However, I have some doubts:

  • The static-memory-scheduler requires a linearized stream of memory events. We need to sort all allocations' first-uses and last-uses in a total order. I am not sure whether the current Liveness can provide such information efficiently. For each allocation, we need its liveness range, and possibly need to compare it with other allocations'. Will it end up with a complexity of O(num_alloc * program_length)?
  • We need to skip the ops explicitly marked as having no memory effects. Those ops do not contribute to the liveness of allocations, even if they reference a buffer.

Skimming through the thread, but this seems pretty much like the Linear Scan Register Allocator, just for tensors. I have implemented such a thing in the past.


Thanks for sharing. The memory allocator and register allocator look alike, but they are a bit different.

  • Registers are considered fixed-sized - 8, 16, 32, 64 bits or longer SIMD lengths - while memory allocations can have arbitrary sizes.
  • Allocated/deallocated memory chunks affect allocations at nearby memory locations. If you free a chunk of memory and the adjacent memory chunks on both sides are also free, you can merge them into a larger free chunk. The algorithm also needs to handle and minimize memory fragmentation for better memory utilization. Memory allocation works on a 1-D flat memory space, while register allocation works on 0-D discrete registers.
  • We also try to produce scheduling results with better cache locality, by choosing free chunks that were recently used.

A pertinent comment is that you can avoid all these problems by not creating intermediate tensor.empty()s in the first place, and by using the DPS (destination-passing style) system to chain the operations.

   %a0 = tensor.empty() : tensor<128x128xf32>
   %a = linalg.matmul ins(%x, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%a0: tensor<128x128xf32>) -> tensor<128x128xf32>
   %b = linalg.matmul ins(%a, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%a: tensor<128x128xf32>) -> tensor<128x128xf32>
   ...

This avoids all the problems you describe and can be done for any sequence of operations that are “known to be” in-place (i.e. single consumer, same shapes).

The complexity comes in other forms:

  1. More than one consumer, where now you really need to write the first ones (in whichever order) to new buffers. More users create more buffers, and their own chains need liveness analysis to determine when to reuse.
  2. Different output shapes, where you can’t reuse the input buffer as is without controlling the surrounding memory (ex. a compiler controlled arena). This seems to be part of your proposal, and it’s quite a large detraction of the current approach.
  3. Multiple producers, where it’s not clear which of the inputs should be reused, if any. If all inputs are dead after the operation, you can reuse any, otherwise, you’ll need liveness analysis to determine which, if any, you can reuse.

These problems are usually best solved at the graph level, not the IR level. This is why generating DPS from a graph representation already “deciding” (really, just hinting the compiler), which buffers can/should be reused is the current approach.

I agree. But what works for CPUs (ex. arena) doesn’t quite work for GPUs (ex. shared memory, work groups).

We need to split the solution in two parts:

  1. A common infrastructure: ops, passes, semantics that can be used by different buffer allocator strategies.
  2. Various implementations (likely incompatible with each other) that take into account the semantics of the target’s memory when de-duplicating things.

Indeed, and it will also play with tiling and fusion strategies that have been decided at tensor level. Aliasing is not as trivial here as it is in low-dim-tensor land. We have to answer questions about sub-buffers, parallel accumulation, multiple read-only copies of the inputs, etc.

I was going to point this out, yes. I don’t know if anyone continued working on that, but it would be a good start to my “common infrastructure” point above.

We can always skip unstructured control flow on the first approach and assume the worst.

Not quite. After you tile and fuse, and run 2D parallelization, and threading, work-grouping, etc. you have completely lost the idea of “values”, and any aliasing you had originally is probably lost.

The value of a memref pass is that it’s not looking at the tensors from a value perspective, but from a buffer perspective, and can reuse buffers even when the original code had no idea it was possible (ex. same size, different sub-graph, dead).

But it also has to know about intentional duplication of inputs (to avoid external loads), lock-step parallel accumulation (avoiding race conditions), etc. None of that is visible from tensor land because the intermediate passes haven’t run yet.

There is an argument to run as many of those passes while still in tensors, but that’d require extensive cost modelling and a rich infrastructure for passing down information (attributes, target info, costs, benefits, etc), which are non-existent today in MLIR.

Hopefully, this will all be redundant once we have intelligent schedule searches yielding the “perfect” compiler pipeline for each input IR, but we’re not there yet.

Memory allocation also needs the ordering of alloc/free events for the “locality” information - e.g. we should reuse a buffer that was recently freed because it is more likely to be in the cache. This is part of the reason we “linearize” the IR. I am not sure if the “time ordering” of the buffers is preserved at the graph level.

I am not an expert in GPU programming. But I think an arena still works for shared memory on GPUs? We could consolidate “single” memory buffers per address space - one for global memory and one for shared memory? The allocation algorithm may need to change to avoid bank conflicts.

Right, this is what I mean. The algorithms are different because they have different constraints. It should work fine as long as you work with those constraints, which means we’ll need not only different cost models, but also different validation processes.

How about providing a generic framework for memref.alloc merging? We can then implement a default pass based on it. The framework would drive the following steps:

  1. Collect the memory aliases via BufferViewFlowAnalysis
  2. Collect the memory lifetime traces
  3. Schedule the buffers with an allocation algorithm and compute the offset of each allocation
  4. Rewrite the IR to replace allocations with views of merged buffers

The memory-alloc-merge passes for different devices/constraints can provide their own implementations of steps 2 and 3 (e.g. the memory lifetime traces can be “tick ranges” or graph-based info, and the allocator algorithm can differ between devices). Steps 1 and 4 can be shared by all implementations.

#include <map>
#include <memory>
#include <vector>

#include "mlir/Dialect/Bufferization/Transforms/BufferViewFlowAnalysis.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"

using namespace mlir;

// abstract base class for the lifetime info of the buffers
// that are to be merged into the same allocation
struct LifetimeTrace {
  virtual ~LifetimeTrace() = default;
};

// the traces and insertion location for a future merged allocation
struct LifetimeTraceScope {
  // the position for the merged allocation to be inserted at
  Block *block;
  std::unique_ptr<LifetimeTrace> traces;
};

// top-level memory trace info
struct MemoryTraces {
  std::vector<LifetimeTraceScope> scopes;
};

// the memory scheduling result: allocation => offset map, one per alloc scope
struct MemorySchedule {
  std::vector<llvm::DenseMap<memref::AllocOp, int64_t>> allocToOffset;
};

// step 4: rewrite the IR based on the schedule (implementation not shown)
void rewriteOp(Operation *op, const MemorySchedule &schedule);

template <typename TraceFuncT, typename ScheduleFuncT>
void mergeAllocGeneric(Operation *op, TraceFuncT collectTrace,
                       ScheduleFuncT scheduler) {
  BufferViewFlowAnalysis aliasInfo{op};               // step 1: collect aliases
  MemoryTraces traces = collectTrace(op, aliasInfo);  // step 2: lifetime traces
  MemorySchedule schedule = scheduler(traces);        // step 3: plan offsets
  rewriteOp(op, schedule);                            // step 4: rewrite the IR
}

We can provide tick-based linearized memory traces in our implementation of collectTrace and use the linearized traces for our current algorithm in scheduler.
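
For example, the default tick-based pass could be wired up roughly like this (the lambda bodies are placeholders):

// Sketch: instantiating the generic driver with a tick-based trace collector
// and the static memory scheduler.
void runDefaultMergeAlloc(Operation *root) {
  mergeAllocGeneric(
      root,
      /*collectTrace=*/[](Operation *op, BufferViewFlowAnalysis &aliases) {
        MemoryTraces traces;
        // ... walk alloc scopes, assign ticks, build per-scope lifetime traces
        return traces;
      },
      /*scheduler=*/[](const MemoryTraces &traces) {
        MemorySchedule schedule;
        // ... linearize ticks into alloc/free events and run the scheduler
        return schedule;
      });
}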

This will be useful. I am thinking of a device-to-device copy that can be done asynchronously. If we reuse the memory thinking it is dead, we may end up overwriting the memory while it is still being copied. Normally this could be handled by a device-aware memory allocation/placement pass after bufferization. So we would have to provide a way to decide which buffers can be reused under what conditions, if we want to make this part of the bufferization pipeline.