Hi, folks. I’ve been experimenting with vectorization by way of Linalg, and I’m curious about the loops it generates.
For part of my reduction loop I end up with code like this, which accumulates into a single vector:
%17 = vector.transfer_write %cst, %10[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
%18 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %17) -> (tensor<64xf32>) {
  %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
  %25 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
  %26 = vector.transfer_read %arg13[%c0], %cst_0 {in_bounds = [true]} : tensor<64xf32>, vector<64xf32>
  %27 = vector.shape_cast %25 : vector<1x64xf32> to vector<64xf32>
  %28 = arith.addf %26, %27 : vector<64xf32>
  %29 = vector.transfer_write %28, %arg13[%c0] {in_bounds = [true]} : vector<64xf32>, tensor<64xf32>
  scf.yield %29 : tensor<64xf32>
}
When this loop passes through bufferization, I get some pretty terrible code that spills the vector to memory on each iteration of the loop. I’m not going to reproduce that IR verbatim here, but you can imagine each “tensor” value in the code above becoming a memref backed by an allocation, with that memref read and written on every loop iteration.
If I can instead get the loop to carry the vector itself as its iter_arg, the resulting IR is simpler and the generated code is better. The modified loop looks like this:
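A rough hand-written sketch of that shape follows. It is not the actual bufferization output: the names `@sketch`, `%src`, `%alloc`, `%pad`, `%row`, `%acc`, `%sum`, and `%res` are made up for illustration, `%src` is assumed to be the memref form of `%expanded`, and the row read is simplified to a single transfer_read instead of a slice plus shape_cast.

```mlir
// Illustrative only: approximates the load/add/store pattern described above.
func.func @sketch(%src: memref<16x64xf32>) -> vector<64xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %pad = arith.constant 0.0 : f32
  %cst = arith.constant dense<0.0> : vector<64xf32>
  // The accumulator tensor becomes a buffer.
  %alloc = memref.alloc() : memref<64xf32>
  vector.transfer_write %cst, %alloc[%c0] {in_bounds = [true]} : vector<64xf32>, memref<64xf32>
  scf.for %i = %c0 to %c16 step %c1 {
    %row = vector.transfer_read %src[%i, %c0], %pad {in_bounds = [true]} : memref<16x64xf32>, vector<64xf32>
    // The accumulator is reloaded from memory on every iteration...
    %acc = vector.transfer_read %alloc[%c0], %pad {in_bounds = [true]} : memref<64xf32>, vector<64xf32>
    %sum = arith.addf %acc, %row : vector<64xf32>
    // ...and stored back on every iteration.
    vector.transfer_write %sum, %alloc[%c0] {in_bounds = [true]} : vector<64xf32>, memref<64xf32>
  }
  %res = vector.transfer_read %alloc[%c0], %pad {in_bounds = [true]} : memref<64xf32>, vector<64xf32>
  memref.dealloc %alloc : memref<64xf32>
  return %res : vector<64xf32>
}
```

The point of the sketch is only the round trip through `%alloc` inside the loop body; the real output may differ in how the slice and the allocation are expressed.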
%18 = scf.for %arg12 = %c0 to %c16 step %c1 iter_args(%arg13 = %cst) -> (vector<64xf32>) {
  %extracted_slice_5 = tensor.extract_slice %expanded[%arg12, 0] [1, 64] [1, 1] : tensor<16x64xf32> to tensor<1x64xf32>
  %24 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<1x64xf32>, vector<1x64xf32>
  %25 = vector.shape_cast %24 : vector<1x64xf32> to vector<64xf32>
  %26 = arith.addf %arg13, %25 : vector<64xf32>
  scf.yield %26 : vector<64xf32>
}
After bufferization, this IR results in no additional memory allocation and no “explicit” spills in the loop.
So my questions:
- Am I thinking about this problem correctly, from others’ point of view?
- Is there a way to get the Linalg vectorizer to generate the above code automatically?
- Does the transformation I’ve described above already exist in MLIR core?
Thanks,
Ian Bearman
AI Compilers, Microsoft