Linalg vectorizer, reductions, and bufferization

Hi, folks. I’ve been playing a bit with vectorization by way of linalg and I’m curious about the loops it generates.

For part of my reduction loop I end up with code like this, which accumulates into a single vector:

      %17 = vector.transfer_write %cst, %10[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
      %18 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %17) -> (tensor<64xf32>) {
        %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
        %25 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
        %26 = vector.transfer_read %arg13[%c0], %cst_0 {in_bounds = [true]} : tensor<64xf32>, vector<64xf32>
        %27 = vector.shape_cast %25 : vector<1x64xf32> to vector<64xf32>
        %28 = arith.addf %26, %27 : vector<64xf32>
        %29 = vector.transfer_write %28, %arg13[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
        scf.yield %29 : tensor<64xf32>
      }

When this loop passes through bufferization I get some pretty terrible code that spills the vector to memory on each iteration of the loop. I’m not going to reproduce that IR here, but I think you can imagine each “tensor” value in the code above becoming a memref backed by an allocation, with that memref read and written on each loop iteration.

If I can instead get the loop to iterate on the vector itself, the resulting IR looks simpler and the generated code is improved. The modified loop looks like this:

      %18 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %cst) -> (vector<64xf32>) {
        %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
        %24 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
        %25 = vector.shape_cast %24 : vector<1x64xf32> to vector<64xf32>
        %26 = arith.addf %arg13, %25 : vector<64xf32>
        scf.yield %26 : vector<64xf32>
      }

This IR results in no additional memory allocated and no “explicit” spills in the loop after running bufferization.

So my questions:

  • Am I thinking about this problem correctly, from others’ point of view?
  • Is there a way to get the linalg vectorizer to automatically generate the above code?
  • Is the transformation I’ve described above something that already exists in MLIR core?

Thanks,
Ian Bearman
AI Compilers, Microsoft

Before bufferization, you want to run LoopInvariantSubsetHoisting, which should make the loop iterate on the vector itself: llvm-project/mlir/lib/Transforms/LoopInvariantCodeMotion.cpp at main · llvm/llvm-project · GitHub

In IREE we have a pass, “OptimizeTensorInsertExtractSlices”, which runs some of these simplifications (which exist upstream) before bufferization (after vectorization): iree/compiler/src/iree/compiler/Codegen/Common/OptimizeTensorInsertExtractSlices.cpp at main · iree-org/iree · GitHub


Thank you!
EDIT: I misread the code. The IR is correct, it just needs a bit of cleanup:

      %19:2 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %17, %arg14 = %18) -> (tensor<64xf32>, vector<64xf32>) {
        %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
        %27 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
        %28 = vector.shape_cast %27 : vector<1x64xf32> to vector<64xf32>
        %29 = arith.addf %arg14, %28 : vector<64xf32>
        scf.yield %arg13, %29 : tensor<64xf32>, vector<64xf32>
      }
      %20 = vector.transfer_write %19#1, %19#0[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>

Yeah, the pass leaves the dead iter_args. The IREE pass runs scf::ForOp canonicalization patterns to fix this: iree/compiler/src/iree/compiler/Codegen/Common/OptimizeTensorInsertExtractSlices.cpp at main · iree-org/iree · GitHub
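
For reference, here is a hand-written sketch of roughly what the loop should look like once the subset hoisting and the scf.for canonicalization have both run. It is not tool output: the function name, argument names, and SSA value names are made up for illustration, and the mlir-opt flag spellings in the comment are an assumption based on the upstream pass names, so check them against your build.

```mlir
// Sketch of the expected result of something like:
//   mlir-opt input.mlir -loop-invariant-subset-hoisting -canonicalize
// (input.mlir is a hypothetical file containing the original loop).
func.func @reduce_rows(%expanded: tensor<16x64xf32>,
                       %init: tensor<64xf32>) -> tensor<64xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %pad = arith.constant 0.000000e+00 : f32
  // The read of the accumulator is hoisted in front of the loop.
  %acc0 = vector.transfer_read %init[%c0], %pad {in_bounds = [true]} : tensor<64xf32>, vector<64xf32>
  // Only the vector is loop-carried; the unchanged tensor iter_arg is gone.
  %acc = scf.for %i = %c0 to %c16 step %c1 iter_args(%arg = %acc0) -> (vector<64xf32>) {
    %slice = tensor.extract_slice %expanded[%i, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
    %row2d = vector.transfer_read %slice[%c0, %c0], %pad {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
    %row = vector.shape_cast %row2d : vector<1x64xf32> to vector<64xf32>
    %sum = arith.addf %arg, %row : vector<64xf32>
    scf.yield %sum : vector<64xf32>
  }
  // A single write after the loop puts the result back into the tensor.
  %res = vector.transfer_write %acc, %init[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
  return %res : tensor<64xf32>
}
```

The tensor iter_arg can be dropped because the loop yields it unchanged, so its result is just the init value. Whether the hoisted read in front of the loop also folds into the constant written earlier depends on which other patterns run.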
