[bufferization] [linalg] Should CSE be used before bufferization to remove tensor.empty?

Hi all!

I have a fundamental question about how tensor.empty should be used, and whether it is necessary to run CSE to remove redundant tensor.empty ops before converting them to bufferization.alloc_tensor and running one-shot-bufferize.

From the documentation, tensor.empty simply communicates a shape to DestinationStyleOpInterface ops; however, the eliminate-empty-tensors pass only eliminates a tensor.empty in certain cases where the computed tensor is inserted into another tensor. This means that any tensor.empty op that is not eliminated will be converted to bufferization.alloc_tensor and will correspond to a memref.alloc after one-shot-bufferize.
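
For example, as I understand it, a pattern like the following (a hand-written sketch, not from my actual pipeline) is one that eliminate-empty-tensors handles: the value computed into the tensor.empty is eventually inserted into another tensor, so the pass can replace %0 with a tensor.extract_slice of %dest and no separate allocation is needed:

  #map1 = affine_map<(d0) -> (d0)>
  func.func @insert_case(%in: tensor<64xf32>, %dest: tensor<128xf32>) -> tensor<128xf32> {
    // This empty tensor only carries a shape; its contents are never read.
    %0 = tensor.empty() : tensor<64xf32>
    %1 = linalg.generic {indexing_maps = [#map1, #map1], iterator_types = ["parallel"]}
        ins(%in : tensor<64xf32>) outs(%0 : tensor<64xf32>) {
    ^bb0(%a: f32, %b: f32):
      %n = arith.negf %a : f32
      linalg.yield %n : f32
    } -> tensor<64xf32>
    // Because %1 is inserted into %dest, the pass can rewrite the generic to
    // compute directly into a slice of %dest's future buffer.
    %2 = tensor.insert_slice %1 into %dest[0] [64] [1] : tensor<64xf32> into tensor<128xf32>
    return %2 : tensor<128xf32>
  }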

So this leads to my question: when creating a linalg op (or any other op that implements the DestinationStyleOpInterface), should I:
1. always create a new tensor.empty op with the right shape for its destination operand, or
2. try to reuse tensors returned by other ops?

I have an example in which creating a new tensor.empty for each destination operand results in one-shot-bufferize allocating one more buffer than reusing tensor.empty does:

  #map = affine_map<(d0, d1) -> (d0, d1)>
  func.func @test(%arg0: tensor<512x512xbf16>, %arg1: tensor<512x512xbf16>) -> tensor<512x512xbf16> {
    %0 = tensor.empty() : tensor<512x512xbf16>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<512x512xbf16>, tensor<512x512xbf16>) outs(%0 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %6 = arith.mulf %in, %in_0 : bf16
      linalg.yield %6 : bf16
    } -> tensor<512x512xbf16>
    %2 = tensor.empty() : tensor<512x512xbf16>
    %3 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%1, %1 : tensor<512x512xbf16>, tensor<512x512xbf16>) outs(%2 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %6 = arith.mulf %in, %in_0 : bf16
      linalg.yield %6 : bf16
    } -> tensor<512x512xbf16>
    %4 = tensor.empty() : tensor<512x512xbf16>
    %5 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%3, %arg1 : tensor<512x512xbf16>, tensor<512x512xbf16>) outs(%4 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %6 = arith.addf %in, %in_0 : bf16
      linalg.yield %6 : bf16
    } -> tensor<512x512xbf16>
    return %5 : tensor<512x512xbf16>
  }

Running the above with the flags --eliminate-empty-tensors --empty-tensor-to-alloc-tensor --one-shot-bufferize="allow-return-allocs=true allow-unknown-ops=false bufferize-function-boundaries=true", I got the following codegen with three memref.alloc ops:

#map = affine_map<(d0, d1) -> (d0, d1)>

module {
  func.func @optimized(%arg0: memref<512x512xbf16, strided<[?, ?], offset: ?>>, %arg1: memref<512x512xbf16, strided<[?, ?], offset: ?>>) -> memref<512x512xbf16> {
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<512x512xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : memref<512x512xbf16, strided<[?, ?], offset: ?>>, memref<512x512xbf16, strided<[?, ?], offset: ?>>) outs(%alloc : memref<512x512xbf16>) {
    ^bb0(%in: bf16, %in_2: bf16, %out: bf16):
      %0 = arith.mulf %in, %in_2 : bf16
      linalg.yield %0 : bf16
    }
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<512x512xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%alloc, %alloc : memref<512x512xbf16>, memref<512x512xbf16>) outs(%alloc_0 : memref<512x512xbf16>) {
    ^bb0(%in: bf16, %in_2: bf16, %out: bf16):
      %0 = arith.mulf %in, %in_2 : bf16
      linalg.yield %0 : bf16
    }
    %alloc_1 = memref.alloc() {alignment = 64 : i64} : memref<512x512xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%alloc_0, %arg1 : memref<512x512xbf16>, memref<512x512xbf16, strided<[?, ?], offset: ?>>) outs(%alloc_1 : memref<512x512xbf16>) {
    ^bb0(%in: bf16, %in_2: bf16, %out: bf16):
      %0 = arith.addf %in, %in_2 : bf16
      linalg.yield %0 : bf16
    }
    memref.dealloc %alloc : memref<512x512xbf16>
    memref.dealloc %alloc_0 : memref<512x512xbf16>
    return %alloc_1 : memref<512x512xbf16>
  }
}

Rerunning with the same flags plus CSE (--cse --eliminate-empty-tensors --empty-tensor-to-alloc-tensor --one-shot-bufferize="allow-return-allocs=true allow-unknown-ops=false bufferize-function-boundaries=true"), so that the redundant tensor.empty ops are deduplicated first, produced the following codegen:

#map = affine_map<(d0, d1) -> (d0, d1)>

module {
  func.func @optimized(%arg0: memref<512x512xbf16, strided<[?, ?], offset: ?>>, %arg1: memref<512x512xbf16, strided<[?, ?], offset: ?>>) -> memref<512x512xbf16> {
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<512x512xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : memref<512x512xbf16, strided<[?, ?], offset: ?>>, memref<512x512xbf16, strided<[?, ?], offset: ?>>) outs(%alloc : memref<512x512xbf16>) {
    ^bb0(%in: bf16, %in_1: bf16, %out: bf16):
      %0 = arith.mulf %in, %in_1 : bf16
      linalg.yield %0 : bf16
    }
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<512x512xbf16>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%alloc, %alloc : memref<512x512xbf16>, memref<512x512xbf16>) outs(%alloc_0 : memref<512x512xbf16>) {
    ^bb0(%in: bf16, %in_1: bf16, %out: bf16):
      %0 = arith.mulf %in, %in_1 : bf16
      linalg.yield %0 : bf16
    }
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%alloc_0, %arg1 : memref<512x512xbf16>, memref<512x512xbf16, strided<[?, ?], offset: ?>>) outs(%alloc : memref<512x512xbf16>) {
    ^bb0(%in: bf16, %in_1: bf16, %out: bf16):
      %0 = arith.addf %in, %in_1 : bf16
      linalg.yield %0 : bf16
    }
    memref.dealloc %alloc_0 : memref<512x512xbf16>
    return %alloc : memref<512x512xbf16>
  }
}

In this case, there are only two memref.alloc ops: the third linalg.generic now writes its result in place into %alloc.

Another question I have: is it necessary and correct to always use tensor.empty for destination operands? Would something like the following be correct? (Here I reuse the result tensor of one linalg op as the output operand of another.)

  #map = affine_map<(d0, d1) -> (d0, d1)>
  func.func @test(%arg0: tensor<512x512xbf16>, %arg1: tensor<512x512xbf16>) -> tensor<512x512xbf16> {
    %0 = tensor.empty() : tensor<512x512xbf16>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<512x512xbf16>, tensor<512x512xbf16>) outs(%0 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %4 = arith.mulf %in, %in_0 : bf16
      linalg.yield %4 : bf16
    } -> tensor<512x512xbf16>
    %2 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%1, %1 : tensor<512x512xbf16>, tensor<512x512xbf16>) outs(%1 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %4 = arith.mulf %in, %in_0 : bf16
      linalg.yield %4 : bf16
    } -> tensor<512x512xbf16>
    %3 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%2, %arg1 : tensor<512x512xbf16>, tensor<512x512xbf16>) outs(%2 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %4 = arith.addf %in, %in_0 : bf16
      linalg.yield %4 : bf16
    } -> tensor<512x512xbf16>
    return %3 : tensor<512x512xbf16>
  }

Thank you!

CSE seems like a canonicalization of the IR to me here, and hopefully bufferization is always correct without it, even if possibly less performant, as you found out.

Now, ideally, the bufferization algorithm should not be influenced by whether tensor.empty is or isn't de-duplicated. That said, maybe there is extra algorithmic complexity involved there that explains the current situation?

This should be correct. You should be able to reason about the IR at the tensor level, and an important property of SSA values is the substitution principle: it does not matter how a value is produced; it carries the same information.

Can you link to the documentation? I believe the value is also used as the initial value for reductions, isn't it?

Thank you for the response!

So, to be safe, we should run canonicalization before bufferization, right?

I got the information from the tensor.empty op's description.

Sorry about another naive question, as I have very limited insight into bufferization; I only know at a very high level that modeling tensors as SSA values helps simplify certain bufferization and optimization analyses.

So if using the result tensor of one linalg op as the init tensor of another op is also correct, would doing so influence the bufferization analysis and therefore produce a different bufferization result compared to just using tensor.empty for every init tensor (assuming the two IRs are semantically equivalent)? From your reply, it seems that as long as the dimensions match up and the IR semantics remain the same, it doesn't matter what I use as the init tensor, right?

I guess what I'm getting at is: when lowering to linalg, should I worry about having to "hint" the bufferization pass about buffers that can be written in place by carefully choosing the outs operand? Or can I just use tensor.empty for every init tensor, run canonicalization, and let bufferization decide everything?

Thank you!

Hello @hoangnhat, I have started a post as a place to commonalize experiments answering such questions: PSA: Status of end-to-end Structured Codegen on Tensors and Declarative Pipelines.

As far as "enabling transformations" go (canonicalization, dce, cse, licm), yes, these should be sprinkled in a few key places around the pipeline. Once this PR lands in IREE (https://github.com/openxla/iree/pull/12467), I will be able to update the examples described above to show more precisely where such transformations are added.

The rule of thumb is: if you only write to the output, then tensor.empty is the right thing to use.
This is the case when your op contains no "reduction" iterator and does not use the bbArg corresponding to the output tensor in a computation in the region.

If these constraints are not satisfied, your op will read the original value, and that value had better be initialized to something meaningful. A common way is to use tensor.empty + linalg.fill, or any other tensor you have previously computed.
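
For concreteness, here is a sketch of a row-wise sum (my own example, not IR from this thread): the reduction reads the output bbArg as the running accumulator, so the destination must carry a meaningful initial value, here produced by linalg.fill:

  #map_in = affine_map<(d0, d1) -> (d0, d1)>
  #map_out = affine_map<(d0, d1) -> (d0)>
  func.func @row_sum(%arg0: tensor<512x512xbf16>) -> tensor<512xbf16> {
    %cst = arith.constant 0.000000e+00 : bf16
    %empty = tensor.empty() : tensor<512xbf16>
    // The generic below reads %out as the running sum, so the destination
    // must be initialized: fill it with zero.
    %init = linalg.fill ins(%cst : bf16) outs(%empty : tensor<512xbf16>) -> tensor<512xbf16>
    %res = linalg.generic {indexing_maps = [#map_in, #map_out], iterator_types = ["parallel", "reduction"]}
        ins(%arg0 : tensor<512x512xbf16>) outs(%init : tensor<512xbf16>) {
    ^bb0(%in: bf16, %out: bf16):
      %sum = arith.addf %in, %out : bf16
      linalg.yield %sum : bf16
    } -> tensor<512xbf16>
    return %res : tensor<512xbf16>
  }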

Bufferization will perform an alias analysis based on the "future buffers" that correspond to each tensor and insert the proper copies accordingly. Using a chain of tensor SSA values, result -> output -> ... -> result -> output, amounts to hinting to the bufferization analysis that the buffers can be reused in place. Other, more sophisticated analyses will be implemented later, but this simple heuristic allows building a system that performs well in most cases and can be influenced by the user (i.e., by changing the use-def chains to provide other hints): this is surprisingly useful in practice.
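
To illustrate the copy-insertion side, here is a hypothetical sketch (mine, and assuming function arguments are writable under bufferize-function-boundaries): an in-place write that would clobber a value still read later forces the analysis to go out of place:

  #map = affine_map<(d0, d1) -> (d0, d1)>
  func.func @conflict(%arg0: tensor<512x512xbf16>, %arg1: tensor<512x512xbf16>) -> (tensor<512x512xbf16>, tensor<512x512xbf16>) {
    // This op would like to write in place into %arg0's future buffer...
    %0 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]}
        ins(%arg1 : tensor<512x512xbf16>) outs(%arg0 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %out: bf16):
      %a = arith.addf %in, %out : bf16
      linalg.yield %a : bf16
    } -> tensor<512x512xbf16>
    // ...but the original %arg0 is still read below (a RAW conflict), so the
    // analysis bufferizes the write above out of place (alloc + copy).
    %1 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]}
        ins(%arg0 : tensor<512x512xbf16>) outs(%arg1 : tensor<512x512xbf16>) {
    ^bb0(%in: bf16, %out: bf16):
      %m = arith.mulf %in, %out : bf16
      linalg.yield %m : bf16
    } -> tensor<512x512xbf16>
    return %0, %1 : tensor<512x512xbf16>, tensor<512x512xbf16>
  }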

No, it should be safe to bufferize without canonicalization.

That seems like a good shortcut to take, but I'm concerned about a long-term strategy built on this kind of heuristic: it is fragile and does not compose well.
Ideally, I would think that bufferization should internally "canonicalize" its analysis independently of how the chains are built in the value-based representation, and be influenced as little as possible by how the chains are presented in the input.

Yes, there is clearly an analysis/optimization path lurking here that would be able to look at sets of future buffers and make better global decisions. This will require heuristics and will be subject to surprises/failures, as the problem has (I think) been shown to be difficult.

You can view the current implementation as a low-surprise “all sets are of size 1” approximation.
