How to bufferize `scf.execute_region` op?

When I tried to bufferize this temp.mlir:

  func.func nested @func1(%arg0: tensor<10x6xi1>, %arg1: memref<5x5xi1>, %arg2: i32) {
    %13 = tensor.empty() : tensor<10x6xf32>
    %173 = scf.execute_region -> tensor<10x6xf32> {
      scf.yield %13 : tensor<10x6xf32>
    }
    return
  }

with `mlir-opt --empty-tensor-to-alloc-tensor --func-bufferize --scf-bufferize --bufferization-bufferize`, the output is:

module {
  func.func nested @func1(%arg0: memref<10x6xi1>, %arg1: memref<5x5xi1>, %arg2: i32) {
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<10x6xf32>
    %0 = bufferization.to_tensor %alloc : memref<10x6xf32>
    %1 = scf.execute_region -> tensor<10x6xf32> {
      scf.yield %0 : tensor<10x6xf32>
    }
    return
  }
}

The scf.execute_region still yields a tensor type.

Is there any step I missed?

@matthias-springer

--func-bufferize, --scf-bufferize, etc. are deprecated; I will turn them into test passes soon. Use -one-shot-bufferize instead, then it should bufferize.
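
For the example above, something along these lines should work (a sketch; the exact flag spellings may vary between MLIR versions):

mlir-opt temp.mlir \
  --empty-tensor-to-alloc-tensor \
  --one-shot-bufferize="bufferize-function-boundaries"

With `bufferize-function-boundaries`, the function signature is bufferized as well, so a separate `--func-bufferize` run is not needed.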

Is there any other ‘proper’ way to bufferize mixed tensor/memref code? For context: we have our own custom tensor dialect that allows in-place modifications. We lower the parts we can prove safe to linalg-on-tensors and the rest to memrefs, and bufferize the resulting mixed code later using these passes, which works reasonably well for us (although we have a couple of custom patterns).

There is no good way to bufferize mixed tensor/memref code. We cannot analyze through memref code to decide when copies must be inserted during bufferization. You probably noticed that passes like --tensor-bufferize introduce many copies. E.g., tensor.insert will always bufferize to alloc + copy + memref.store. That’s why these passes are not very useful in general, apart from small unit tests.
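
To illustrate the point, here is a rough sketch (not the literal pass output) of what a single tensor.insert turns into when no analysis is available:

// %t is the tensor being inserted into; %d is its dynamic size (e.g. obtained via memref.dim).
%m = bufferization.to_memref %t : memref<?xf32>
%alloc = memref.alloc(%d) : memref<?xf32>                  // fresh buffer for the result
memref.copy %m, %alloc : memref<?xf32> to memref<?xf32>    // copy the entire tensor
memref.store %cst, %alloc[%idx] : memref<?xf32>            // the actual insert
%r = bufferization.to_tensor %alloc : memref<?xf32>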

You could try bufferizing your code with --one-shot-bufferize="allow-unknown-ops". Your custom ops that don’t implement BufferizableOpInterface will be skipped. Then you can run your own custom bufferization for the remaining code.
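
Roughly, ops without the interface stay on tensors and get bridged with to_tensor/to_memref casts, e.g. (a sketch; "mydialect.op" is a placeholder for one of your custom ops):

%0 = bufferization.to_tensor %buf : memref<10xf32>
%1 = "mydialect.op"(%0) : (tensor<10xf32>) -> tensor<10xf32>
%2 = bufferization.to_memref %1 : memref<10xf32>

Your own bufferization can then pattern-match these casts around the remaining ops.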

We don’t generate insert_slice, I think, but we had to do some custom lowering to avoid copies on extract_slice. Still, passes like func/scf bufferize just change types on op boundaries and are useful on their own.

Regarding extract_slice: the issue was that it has to insert a copy because the conversion expects identity-layout memrefs on op boundaries, while the memref.subview result is strided. We handled this by introducing a change_layout op, so extract_slice bufferizes like this:

%1 = ... : memref<?xf32>
%2 = memref.subview %1 ... -> memref<?xf32, strided<[?], offset: ?>>
%3 = change_layout %2 -> memref<?xf32>

And then we have a set of patterns that tries to propagate and cancel out these change_layout ops.

Can you elaborate a bit more with an example? I’m curious :slight_smile:

A bit of background: When we designed One-Shot Bufferize, we had two design options:

  1. Analyze tensor IR and insert buffer copies only when needed.
  2. Insert copies on every write (without analyzing anything). Then run a memref analysis to remove copies again.

We went with the first option. I have no good answer as to which design is better. My gut feeling says variant 1 is simpler because we can utilize SSA use-def chains for the analysis and implement special rules to bufferize certain tensor ops efficiently in the absence of difficult analyses (e.g., range analyses). Also, a tensor-based analysis fits better with destination style, which we have already been utilizing in other components (e.g., tiling).

The bufferization analysis is driven by the BufferizableOpInterface. There are two methods that model the flow of data through the program: getAliasingOpOperands and getAliasingOpResults. The former is a property of tensor results, the latter of tensor operands.

E.g.:

// getAliasingOpOperands(%r) = {%t}
// getAliasingOpResults(%t) = {%r}
%r = tensor.insert %cst into %t[%idx] : tensor<?xf32>

This tells us that if %t bufferizes in-place, buffer(%r) == buffer(%t). The bufferization analysis maintains alias sets based on this information.
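
For example, if %t is read again after the insert, the insert cannot be in-place and the analysis will allocate a copy for %r (a sketch):

%r = tensor.insert %cst into %t[%idx] : tensor<?xf32>
// This read of %t happens after the write above. Writing into buffer(%t)
// in-place would change the value read here, so One-Shot Bufferize
// bufferizes the insert out-of-place: buffer(%r) != buffer(%t).
%v = tensor.extract %t[%idx] : tensor<?xf32>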

Operations that are not tensor-based do not implement the BufferizableOpInterface, so there is no getAliasingOpOperands/Results property that could be queried. Instead, we have bufferization.to_memref at the boundary, for which the BufferizableOpInterface can be queried.

E.g.:

%m = bufferization.to_memref %t : memref<?xf32>
// Do something with %m

Our analysis stops at bufferization.to_memref. We don’t know what’s happening to %m. In particular, we don’t know whether some op is going to read from %m and/or write to %m. So we have to be conservative and assume that the answer is “yes”, which can lead to unnecessary buffer copies.
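
To make this concrete (a sketch; "some.consumer" is a placeholder op):

%r = tensor.insert %cst into %t[%idx] : tensor<?xf32>
%m = bufferization.to_memref %t : memref<?xf32>
// We cannot tell whether this op reads buffer(%t) after the insert's write,
// so the insert would have to bufferize out-of-place (extra alloc + copy).
"some.consumer"(%m) : (memref<?xf32>) -> ()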

Then there’s bufferization.to_tensor on the other end.
E.g.:

// ...
%t = bufferization.to_tensor %m : memref<?xf32>

Our analysis does not know where %t is coming from. But it has to implement BufferizableOpInterface::getAliasingOpOperands. Usually, we would look up the alias set of the OpOperand and maybe union the alias sets of %t and %m. But that doesn’t work because %m is a memref. So we have to be conservative and assume that buffer(%t) may, after bufferization, alias with any other SSA value whose definition dominates the bufferization.to_tensor op.

(We don’t do this at the moment. Instead, we assert that there’s no to_tensor/to_memref in the program.)
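
To illustrate why such a conservative assumption would be needed (a sketch; "some.producer" and "some.consumer" are placeholders, not actual One-Shot Bufferize output):

%m = "some.producer"() : () -> memref<?xf32>
%t = bufferization.to_tensor %m : memref<?xf32>
%r = tensor.insert %cst into %t[%idx] : tensor<?xf32>
// If the insert above were bufferized in-place into %m, this memref consumer
// would observe the modified data. Without analyzing the memref side, the
// safe choice is to copy.
"some.consumer"(%m) : (memref<?xf32>) -> ()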