Linalg vectorizer, reductions, and bufferization

manbearian · December 18, 2024, 5:47pm

Hi, folks. I’ve been playing a bit with vectorization by way of linalg, and I’m curious about the loops it generates.

For part of my reduction loop, I end up with code like this, which accumulates into a single vector:

      %17 = vector.transfer_write %cst, %10[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
      %18 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %17) -> (tensor<64xf32>) {
        %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
        %25 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
        %26 = vector.transfer_read %arg13[%c0], %cst_0 {in_bounds = [true]} : tensor<64xf32>, vector<64xf32>
        %27 = vector.shape_cast %25 : vector<1x64xf32> to vector<64xf32>
        %28 = arith.addf %26, %27 : vector<64xf32>
        %29 = vector.transfer_write %28, %arg13[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
        scf.yield %29 : tensor<64xf32>
      }

When this loop passes through bufferization, I get some pretty terrible code that spills the vector to memory on each iteration of the loop. I’m not going to reproduce that IR here, but I think you can imagine each “tensor” value in the code above becoming a memref backed by an allocation, with that memref read and written on each loop iteration.

If I can instead get the loop to iterate on the vector itself, the resulting IR looks simpler and the generated code is improved. The modified loop looks like this:

      %18 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %cst) -> (vector<64xf32>) {
        %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
        %24 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
        %25 = vector.shape_cast %24 : vector<1x64xf32> to vector<64xf32>
        %26 = arith.addf %arg13, %25 : vector<64xf32>
        scf.yield %26 : vector<64xf32>
      }

This IR results in no additional memory allocated and no “explicit” spills in the loop after running bufferization.

So my questions:

Am I thinking about this problem correctly from others’ point of view?
Is there a way to get the linalg vectorizer to automatically generate the above code?
Is the transformation I’ve described above something that already exists in MLIR core?

Thanks,
Ian Bearman
AI Compilers, Microsoft

Groverkss · December 18, 2024, 5:54pm

Before bufferization, you want to run LoopInvariantSubsetHoisting, which should make the loop iterate on the vector itself: llvm-project/mlir/lib/Transforms/LoopInvariantCodeMotion.cpp at main · llvm/llvm-project · GitHub

In IREE we have a pass, “OptimizeTensorInsertExtractSlices”, which runs some of these simplifications (which exist upstream) after vectorization and before bufferization: iree/compiler/src/iree/compiler/Codegen/Common/OptimizeTensorInsertExtractSlices.cpp at main · iree-org/iree · GitHub
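
For anyone who wants to try this upstream, here is a minimal, self-contained sketch of the original loop wrapped in a function. The function name, the SSA value names, and the file name reduce.mlir are made up for illustration, and the mlir-opt invocation in the comment is an assumed command line, not one taken from this thread.

```mlir
// Assumed invocation (reduce.mlir is a hypothetical file name):
//   mlir-opt reduce.mlir -loop-invariant-subset-hoisting -canonicalize
func.func @reduce(%expanded: tensor<16x64xf32>, %init: tensor<64xf32>) -> tensor<64xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %pad = arith.constant 0.0 : f32
  %zero = arith.constant dense<0.0> : vector<64xf32>
  // Zero-initialize the accumulator tensor.
  %acc0 = vector.transfer_write %zero, %init[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
  // The accumulator is carried through the loop as a tensor, so every
  // iteration reads it into a vector and writes it back.
  %res = scf.for %i = %c0 to %c16 step %c1 iter_args(%acc = %acc0) -> (tensor<64xf32>) {
    %row = tensor.extract_slice %expanded[%i, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
    %v2d = vector.transfer_read %row[%c0, %c0], %pad {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
    %cur = vector.transfer_read %acc[%c0], %pad {in_bounds = [true]} : tensor<64xf32>, vector<64xf32>
    %v = vector.shape_cast %v2d : vector<1x64xf32> to vector<64xf32>
    %sum = arith.addf %cur, %v : vector<64xf32>
    %next = vector.transfer_write %sum, %acc[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
    scf.yield %next : tensor<64xf32>
  }
  return %res : tensor<64xf32>
}
```

The read and write of %acc inside the loop touch the same loop-invariant subset of the iter_arg, which is the pattern the hoisting pass is meant to pull out of the loop.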

manbearian · December 18, 2024, 6:04pm

Thank you!
EDIT: I misread the code; the IR is correct, it just needs a bit of cleanup.

      %19:2 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %17, %arg14 = %18) -> (tensor<64xf32>, vector<64xf32>) {
        %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
        %27 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
        %28 = vector.shape_cast %27 : vector<1x64xf32> to vector<64xf32>
        %29 = arith.addf %arg14, %28 : vector<64xf32>
        scf.yield %arg13, %29 : tensor<64xf32>, vector<64xf32>
      }
      %20 = vector.transfer_write %19#1, %19#0[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>

Groverkss · December 18, 2024, 6:07pm

Yeah, the pass leaves the dead iter_args. The IREE pass runs scf::ForOp canonicalization patterns to fix this: iree/compiler/src/iree/compiler/Codegen/Common/OptimizeTensorInsertExtractSlices.cpp at main · iree-org/iree · GitHub
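
As a rough picture of where the cleanup should land, here is a hand-written sketch (not actual tool output) of the loop once the unused tensor iter_arg is dropped. It assumes the hoisted initial read folds to the zero constant; the function name and SSA value names are made up for illustration.

```mlir
// Hand-written sketch of the expected shape after hoisting and cleanup.
// Assumption: the initial transfer_read of the zero-filled tensor folds
// to the %zero constant, so the loop starts directly from it.
func.func @reduce_hoisted(%expanded: tensor<16x64xf32>, %init: tensor<64xf32>) -> tensor<64xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %pad = arith.constant 0.0 : f32
  %zero = arith.constant dense<0.0> : vector<64xf32>
  // Only the vector accumulator is loop-carried.
  %sum = scf.for %i = %c0 to %c16 step %c1 iter_args(%acc = %zero) -> (vector<64xf32>) {
    %row = tensor.extract_slice %expanded[%i, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
    %v2d = vector.transfer_read %row[%c0, %c0], %pad {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
    %v = vector.shape_cast %v2d : vector<1x64xf32> to vector<64xf32>
    %next = arith.addf %acc, %v : vector<64xf32>
    scf.yield %next : vector<64xf32>
  }
  // A single write of the final accumulator after the loop.
  %res = vector.transfer_write %sum, %init[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
  return %res : tensor<64xf32>
}
```

With the accumulator carried as a vector, bufferization only has the one transfer_write after the loop to turn into a memory store.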
