[RFC] Compile-time memref.alloc Scheduling/Merging optimization

So allocations with dynamic sizes would be ignored by this transformation, right? What about static allocations with different element types? Would you have one alloc per unique element type?

Dynamically sized allocations are left untouched. Allocations of different element types are merged into one allocation when possible: we use a single alloc of i8 and create memref.view ops on it.
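
For example, a minimal sketch (hypothetical sizes and offsets, not the exact output of the pass) of how two static allocs with different element types could become views into one i8 buffer:

%buf = memref.alloc() : memref<768xi8>        // single underlying i8 allocation
%c0 = arith.constant 0 : index
%c512 = arith.constant 512 : index
%a = memref.view %buf[%c0][] : memref<768xi8> to memref<128xf32>    // 512 bytes at offset 0
%b = memref.view %buf[%c512][] : memref<768xi8> to memref<64xi32>   // 256 bytes at offset 512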

These may be useful for your transformation.

Thanks for the suggestions. I am new to MLIR, so they should be helpful. By the way, in my draft implementation, I recursively walk through the ViewLike ops to find the base allocations.
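
For instance, given a chain like the following (hypothetical IR), repeatedly following the view source (e.g. via ViewLikeOpInterface::getViewSource()) from %s leads back to the base allocation:

%c0 = arith.constant 0 : index
%base = memref.alloc() : memref<1024xi8>
%v = memref.view %base[%c0][] : memref<1024xi8> to memref<16x16xf32>
%s = memref.subview %v[0, 0] [8, 16] [1, 1] : memref<16x16xf32> to memref<8x16xf32, strided<[16, 1]>>
// %s -> %v -> %base: the base allocation of %s is %base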

This sounds similar to the concept of “ownership” in the ownership-based buffer deallocation pass. In the ownership-based buffer deallocation pass, the thing that owns an allocation is always a block. It is not always statically known if a block owns an allocation or not. In such cases, an “ownership indicator” will be materialized in the form of an i1 value.
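
Schematically, the indicator then guards the deallocation, e.g. (a sketch, not the exact pass output):

// %own : i1 is the materialized ownership indicator for %alloc in this block
bufferization.dealloc (%alloc : memref<64xf32>) if (%own)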

In my current implementation, I only check whether a function returns a memref. Such memrefs are not merge-able. If a memref.alloc is not returned, we don’t consider the “block ownership”; instead, we only consider its lifetime. This test file is an example of annotated lifetimes: graph-compiler/test/gc-dialects/Transforms/buffer-merge-lifetime.mlir at ef3d150ad591dc9ff3c86686341c2e7ce007695d · intel/graph-compiler · GitHub
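
For example (hypothetical IR): a returned alloc is excluded from merging, while a function-local one is tracked by its lifetime only:

func.func @f() -> memref<64xf32> {
  %ret = memref.alloc() : memref<64xf32>   // escapes via return: not merge-able
  %tmp = memref.alloc() : memref<64xf32>   // function-local: merge-able, tracked by lifetime
  // ... uses of %tmp and %ret ...
  return %ret : memref<64xf32>
}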

Side note: The ownership-based buffer deallocation pass has a “dynamic ownership” option. If that option is enabled, ownership indicators are added to the function signature, making it possible to support buffer deallocation across function boundaries.
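
Roughly (a sketch; the exact placement of the extra arguments/results may differ), a callee signature would be extended like:

func.func @callee(%m: memref<?xf32>, %own: i1) -> (memref<?xf32>, i1)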

Maybe our first step is to support function-local allocations. We can revisit “dynamic ownership” later?

I think this would be relatively easy to implement in the ownership-based buffer deallocation pass. At the moment, bufferization.dealloc ops (which turn into memref.dealloc) are always inserted at the end of a block (right before the terminator). But we could insert them earlier. It would be interesting to see how much this improves the situation.
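
For example (a sketch, shown after lowering to memref.dealloc):

%a = memref.alloc() : memref<128xf32>
// ... last use of %a ...
memref.dealloc %a : memref<128xf32>    // hoisted to right after the last use
%b = memref.alloc() : memref<128xf32>  // the runtime allocator may now reuse %a's memory
// ... uses of %b, then the terminator; today both deallocs would sit just before it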

Yes, indeed. It should help to improve the memory usage and locality in some cases.

Why not any basic block?

I will update the wording in the RFC. It should be:
The transformation first needs to identify the alloc scopes, which are mlir Blocks

  • that are inside ops implementing AutomaticAllocationScope
  • and that are not scf.for bodies (allocations in an scf.for can be hoisted to the parent AutomaticAllocationScope); see the sketch below
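
For example (hypothetical IR, constants omitted), the body of func.func is an alloc scope, while the scf.for body is not, so the allocation inside the loop could be hoisted:

func.func @g() {                         // func.func implements AutomaticAllocationScope
  // allocations from the loop below can be hoisted to this block
  scf.for %i = %c0 to %cn step %c1 {     // not treated as an alloc scope
    %t = memref.alloc() : memref<32xf32>
    // ... use %t ...
  }
  return
}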

I think you could compute these “ticks” (or something similar) with the BufferViewFlowAnalysis. If I remember correctly, for a given allocation, it can give you all SSA values that may be a view/alias of the allocation.

The “ticks” are the core of this pass. It is not just about collecting the aliases of an allocation: we need to assign a [start, end] range to each merge-able allocation in order to linearize the lifetime (or maybe liveness) of each buffer into a stream of malloc/free events. A static memory buffer allocator is then used to “schedule” the buffers.
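
As an illustration (hypothetical ticks and sizes), the linearized events drive a static allocation plan:

// lifetimes (ticks):  a = [1, 5]   b = [3, 8]   c = [6, 8]
// sizes (bytes):      a = 512      b = 1024     c = 512
// event stream: alloc(a), alloc(b), free(a), alloc(c), free(b), free(c)
// static allocator plan: a -> offset 0, b -> offset 512, c -> offset 0 (c reuses a's slot)
%merged = memref.alloc() : memref<1536xi8>   // 1536 bytes instead of 512 + 1024 + 512 = 2048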

This may be the interesting part. It always gets tricky when there are loops or unstructured control flow.

We can now handle RegionBranchOpInterface and LoopLikeOpInterface. Any memref buffers that are accessed in a for or if are guaranteed to have overlapping lifetimes.

e.g.

a = memref.alloc() // tick=1
b = memref.alloc() // tick=2
scf.for(...) {
    use1 a // tick=4
    use2 a // tick=5
    use3 b // tick=6
    use4 b // tick=7
}
// end-of-loop-tick=7

The pass makes sure the lifetime of a is at least [4, 7] and the lifetime of b is at least [6, 7]. Because they are used in the loop, even though the last use of a in the IR is at tick 5, we extend its lifetime to end-of-loop-tick=7.

We can be more aggressive when the memref.alloc is defined within a for:

scf.for(...) { // tick=0
    a = memref.alloc() // tick=1
    b = memref.alloc() // tick=2
    use1 a // tick=3
    use2 a // tick=4
    use3 b // tick=5
    use4 b // tick=6
}
// end-of-loop-tick=7

The lifetime of a will be [3, 4] and the lifetime of b will be [5, 6]. They do not overlap, so a and b can reuse each other's buffer.
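
Assuming a and b are both, say, 128xf32 (512 bytes each), the merged result could look roughly like:

base = memref.alloc() : memref<512xi8>   // merged buffer, hoisted to the parent alloc scope
scf.for(...) {
    a = memref.view base[0] : memref<512xi8> to memref<128xf32>   // bytes [0, 512)
    use1 a
    use2 a
    b = memref.view base[0] : memref<512xi8> to memref<128xf32>   // same bytes: a's lifetime has ended
    use3 b
    use4 b
}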