Bufferization: how to avoid extra alloc with 'buffer-results-to-out-params'

Hi all-

I feel like I’m missing something. I want to bufferize the following code and convert the result to a memref argument:

  func.func @forward(%arg0: tensor<5xi32>, %arg1: tensor<5xi32>) -> tensor<i32> {
    %c0_i32 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<i32>
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%0 : tensor<i32>) -> tensor<i32>
    %2 = linalg.dot ins(%arg0, %arg1 : tensor<5xi32>, tensor<5xi32>) outs(%1 : tensor<i32>) -> tensor<i32>
    return %2 : tensor<i32>
  }

I can do this simply enough with -one-shot-bufferize='allow-return-allocs bufferize-function-boundaries' -buffer-results-to-out-params -buffer-deallocation, but this creates a needless memref.alloc and memref.copy which I can’t have.

  func.func @forward(%arg0: memref<5xi32, strided<[?], offset: ?>>, %arg1: memref<5xi32, strided<[?], offset: ?>>, %arg2: memref<i32>) {
    %c0_i32 = arith.constant 0 : i32
    %alloc = memref.alloc() {alignment = 128 : i64} : memref<i32>
    linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<i32>)
    linalg.dot ins(%arg0, %arg1 : memref<5xi32, strided<[?], offset: ?>>, memref<5xi32, strided<[?], offset: ?>>) outs(%alloc : memref<i32>)
    memref.copy %alloc, %arg2 : memref<i32> to memref<i32>
    memref.dealloc %alloc : memref<i32>
    return
  }

I think there was once a -copy-removal pass which would handle this scenario? Either way, I’ve tried a huge number of permutations of relevant-sounding passes but still am not getting the results I need. Am I missing something? How should I accomplish this?

The output I want is something like:

  func.func @forward(%arg0: memref<5xi32, strided<[?], offset: ?>>, %arg1: memref<5xi32, strided<[?], offset: ?>>, %arg2: memref<i32>) {
    %c0_i32 = arith.constant 0 : i32
    linalg.fill ins(%c0_i32 : i32) outs(%arg2 : memref<i32>)
    linalg.dot ins(%arg0, %arg1 : memref<5xi32, strided<[?], offset: ?>>, memref<5xi32, strided<[?], offset: ?>>) outs(%arg2 : memref<i32>)
    return
  }

Hi!
I wrote a pass as part of our project but have not uploaded it yet (I wasn’t sure it would be relevant for others).
I just ran it on your example and it works. I’m adding the log here so you can verify it’s what you need, and then I can upload a patch.

mlir-opt --one-shot-bufferize='allow-return-allocs bufferize-function-boundaries'

Input:

  func.func @forward(%arg0: tensor<5xi32>, %arg1: tensor<5xi32>) -> tensor<i32> {
    %c0_i32 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<i32>
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%0 : tensor<i32>) -> tensor<i32>
    %2 = linalg.dot ins(%arg0, %arg1 : tensor<5xi32>, tensor<5xi32>) outs(%1 : tensor<i32>) -> tensor<i32>
    return %2 : tensor<i32>
  }

Output:

  func.func @forward(%arg0: memref<5xi32, strided<[?], offset: ?>>, %arg1: memref<5xi32, strided<[?], offset: ?>>) -> memref<i32> {
    %c0_i32 = arith.constant 0 : i32
    %alloc = memref.alloc() {alignment = 128 : i64} : memref<i32>
    linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<i32>)
    linalg.dot ins(%arg0, %arg1 : memref<5xi32, strided<[?], offset: ?>>, memref<5xi32, strided<[?], offset: ?>>) outs(%alloc : memref<i32>)
    return %alloc : memref<i32>
  }

Then I ran my pass, replaceAllocWithArg, followed by DropEquivalentBufferResults:

mlir-opt --replace-alloc-with-arg --drop-equivalent-buffer-results

And the output is:

  func.func @forward(%arg0: memref<5xi32, strided<[?], offset: ?>>, %arg1: memref<5xi32, strided<[?], offset: ?>>, %arg2: memref<i32>) {
    %c0_i32 = arith.constant 0 : i32
    linalg.fill ins(%c0_i32 : i32) outs(%arg2 : memref<i32>)
    linalg.dot ins(%arg0, %arg1 : memref<5xi32, strided<[?], offset: ?>>, memref<5xi32, strided<[?], offset: ?>>) outs(%arg2 : memref<i32>)
    return
  }

The typical way to achieve this with only --one-shot-bufferize is to convert the function into “destination-passing style” ahead of time and have the caller create the tensor.empty.

  func.func @caller(...) -> (...) {
    %0 = tensor.empty() : tensor<i32>
    %arg0 = ... : tensor<5xi32>
    %arg1 = ... : tensor<5xi32>
    %1 = func.call @forward(%arg0, %arg1, %0) : (tensor<5xi32>, tensor<5xi32>, tensor<i32>) -> (tensor<i32>)
    ... use(%1)
  }

  func.func @forward(%arg0: tensor<5xi32>, %arg1: tensor<5xi32>, %arg2: tensor<i32>) -> tensor<i32> {
    %c0_i32 = arith.constant 0 : i32
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%arg2 : tensor<i32>) -> tensor<i32>
    %2 = linalg.dot ins(%arg0, %arg1 : tensor<5xi32>, tensor<5xi32>) outs(%1 : tensor<i32>) -> tensor<i32>
    return %2 : tensor<i32>
  }

Then, @forward should bufferize in-place without needing return allocs.

You may prefer to do this post-hoc with a solution like the one @maya_amrami has devised.

Separately, note some of my concerns with

that I had outlined here: Properly using Bufferization related passes, which also discusses concerns with MLIR functions returning a memref in general.

After reading @nicolasvasilache’s reply and Properly using Bufferization related passes, I see that there really was a -copy-removal pass.
It worked on linalg.copy and was added here: D82757 [mlir] Add redundant copy removal transform.
It was then removed here: D99172 [mlir] Introduce CloneOp and adapt test cases in BufferDeallocation. (@dfki-jugr).
My pass is a bit long and might not be the simplest solution for the given case.
I can think of a few more options:

  1. Editing the buffer-results-to-out-params pass, so the copy is not inserted in the first place.
  2. Adding a simple pattern for the canonicalizer pass that will remove the copy.
  3. A more complicated pass if needed, for copy removal.

What do you guys think?

I’m using torch_mlir as the frontend, so I’ll have to dig into how to get it to generate destination-passing style code.

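import torch
import torch_mlir


class DotModule(torch.nn.Module):

  def forward(self, a, b):
    # For two 1-D tensors, torch.matmul computes their dot product.
    return torch.matmul(a, b)


# Two rank-1 int32 inputs of length 5.
shape = torch_mlir.TensorPlaceholder([5], torch.int32)

module = torch_mlir.compile(DotModule(), [shape, shape],
                            output_type="linalg-on-tensors")

# Write the linalg-on-tensors IR to disk so it can be fed to mlir-opt.
with open("dot.mlir", "w") as f:
  f.write(str(module))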
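
For reference, the bufferization flags quoted at the top of the thread could be driven from the same script. This is only a minimal sketch: it assumes an mlir-opt binary on PATH from an MLIR version that still accepts these flag spellings, and `build_bufferize_command` and `run_bufferize` are hypothetical helper names, not part of torch_mlir or MLIR.

```python
import shutil
import subprocess


def build_bufferize_command(input_path, mlir_opt="mlir-opt"):
    """Build the argv for the pipeline quoted in the original post."""
    return [
        mlir_opt,
        input_path,
        # No shell is involved, so the pass options form one unquoted argv entry.
        "-one-shot-bufferize=allow-return-allocs bufferize-function-boundaries",
        "-buffer-results-to-out-params",
        "-buffer-deallocation",
    ]


def run_bufferize(input_path):
    """Run the pipeline and return the bufferized IR as text (assumes mlir-opt exists)."""
    cmd = build_bufferize_command(input_path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


command = build_bufferize_command("dot.mlir")
print(" ".join(command))
```

Calling `run_bufferize("dot.mlir")` would then return the IR that still contains the memref.alloc and memref.copy discussed above.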

I found and skimmed that thread – it’s where I learned of the old copy-removal pass. I agree about returning pointers. It gets tricky whether it’s IR or regular code, and it’s typically not performant. There’s a reason it is only allowed in certain cases in the LLVM code base. The only reason I was using allow-return-allocs was that we were combining it with a Polygeist pass, mem2reg, which eliminated the memref.

Yeah, I was considering writing my own pass to handle this simple case.

That would leave the alloc. IMHO, converting to “destination-passing style” should either be an option to one-shot bufferize or a separate pass run before it.


I meant editing buffer-results-to-out-params so there won’t be an alloc.
The pass is naive: it adds an out parameter for the result buffer and copies the alloc into it.
The suggestion was to make it less naive: it would identify the case where the copy is not really needed and add an out parameter that replaces the alloc.
The other suggestion was cleaning it later with a canonicalizer pattern. Both ways can give you your desired output.
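
To make the intended rewrite concrete, here is a toy model of it in Python. It is not MLIR code and uses no MLIR API: `Op` and `elide_copy_to_out_param` are hypothetical names, ops are plain records, and the safety checks are deliberately simplistic (it assumes straight-line code and ignores aliasing between the out parameter and the inputs, which a real pass would have to consider).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A toy operation: a name plus operand and result value names."""
    name: str
    operands: tuple = ()
    results: tuple = ()


def elide_copy_to_out_param(ops, out_params):
    """Replace a local alloc by the out-param it is copied into, when that looks safe."""
    ops = list(ops)
    for i, copy in enumerate(ops):
        if copy.name != "memref.copy":
            continue
        src, dst = copy.operands
        if dst not in out_params:
            continue
        # The copied buffer must come from a local memref.alloc.
        allocs = [j for j, op in enumerate(ops)
                  if op.name == "memref.alloc" and src in op.results]
        if not allocs:
            continue
        # The out-param must not be touched anywhere else in the body.
        if any(dst in op.operands for j, op in enumerate(ops) if j != i):
            continue
        # After the copy, the temporary may only be deallocated.
        if any(src in op.operands and op.name != "memref.dealloc"
               for op in ops[i + 1:]):
            continue
        deallocs = {j for j, op in enumerate(ops)
                    if op.name == "memref.dealloc" and src in op.operands}
        drop = {i, allocs[0]} | deallocs
        rewritten = [
            Op(op.name, tuple(dst if v == src else v for v in op.operands), op.results)
            for j, op in enumerate(ops) if j not in drop
        ]
        # Look for further alloc/copy pairs in the rewritten body.
        return elide_copy_to_out_param(rewritten, out_params)
    return ops


body = [
    Op("arith.constant", (), ("%c0_i32",)),
    Op("memref.alloc", (), ("%alloc",)),
    Op("linalg.fill", ("%c0_i32", "%alloc")),
    Op("linalg.dot", ("%arg0", "%arg1", "%alloc")),
    Op("memref.copy", ("%alloc", "%arg2")),
    Op("memref.dealloc", ("%alloc",)),
    Op("func.return"),
]
result = elide_copy_to_out_param(body, out_params={"%arg2"})
for op in result:
    print(op.name, *op.operands)
```

On the body from the original post, the model drops the alloc, copy, and dealloc and makes linalg.fill and linalg.dot write to %arg2 directly. Whether this logic lives inside buffer-results-to-out-params or in a later cleanup pattern, the conditions to check are the same.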

I agree that both could work. But if it were an option to one-shot bufferize (instead of allow-return-allocs, which is what creates the alloc), then you wouldn’t have to identify cases where the copy is necessary, since neither the alloc nor the copy would be added to begin with.