I’m trying to understand how affine-supervectorize pass is working, after few experiments I get the issue when underlying broadcasting of vector.transfer_read
operation resulting in incorrect behaviour of original procedure:
minimal example
module {
func @foo() {
%0 = alloc() : memref<1x1x3x3xf32>
%1 = alloc() : memref<1x1x3x1xf32>
%c0 = constant 0 : index
affine.for %arg2 = 0 to 1 {
affine.for %arg3 = 0 to 1 {
affine.for %arg4 = 0 to 3 {
affine.for %arg5 = 0 to 3 {
%4 = affine.load %0[%c0, %c0, %arg4, %arg5] : memref<1x1x3x3xf32>
// NOTE(Egor): this operation will be lowered into :
// #map5 = affine_map<(d0, d1, d2, d3) -> (0)>
// %11 = vector.transfer_read %1[%7, %8, %9, %10], %cst_0 {permutation_map = #map5} : memref<1x1x3x1xf32>, vector<4xf32>
%5 = affine.load %1[%c0, %c0, %arg4, %c0] : memref<1x1x3x1xf32>
%6 = addf %4, %5 : f32
affine.store %6, %0[%arg2, %arg3, %arg4, %arg5] : memref<1x1x3x3xf32>
}
}
}
}
return
}
}
this is the minimal example that needed to reproduce behavior, after invoking this command
mlir-opt --affine-super-vectorize=“virtual-vector-size=4” --convert-vector-to-scf MVE.mlir > MVE_vectorized.mlir
it will produce code:
module {
func @foo() {
%cst = constant 0.000000e+00 : f32
%c4 = constant 4 : index
%c1 = constant 1 : index
%c0 = constant 0 : index
%0 = alloca() : memref<vector<4xf32>>
%1 = alloc() : memref<1x1x3x3xf32>
%2 = alloc() : memref<1x1x3x1xf32>
affine.for %arg0 = 0 to 1 {
affine.for %arg1 = 0 to 1 {
affine.for %arg2 = 0 to 3 {
affine.for %arg3 = 0 to 3 step 4 {
%3 = vector.transfer_read %1[%c0, %c0, %arg2, %arg3], %cst : memref<1x1x3x3xf32>, vector<4xf32>
// START: this is what transfer read with non-indentity affine map was lowered into:
scf.for %arg4 = %c0 to %c4 step %c1 {
%7 = index_cast %arg4 : index to i32
%8 = load %0[] : memref<vector<4xf32>>
%9 = cmpi "slt", %arg4, %c0 : index
scf.if %9 {
%10 = load %2[%c0, %c0, %arg2, %c0] : memref<1x1x3x1xf32>
%11 = vector.insertelement %10, %8[%7 : i32] : vector<4xf32>
store %11, %0[] : memref<vector<4xf32>>
} else {
%10 = vector.insertelement %cst, %8[%7 : i32] : vector<4xf32>
store %10, %0[] : memref<vector<4xf32>>
}
}
%4 = vector.type_cast %0 : memref<vector<4xf32>> to memref<vector<4xf32>>
%5 = load %4[] : memref<vector<4xf32>>
// END
%6 = addf %3, %5 : vector<4xf32>
vector.transfer_write %6, %1[%arg0, %arg1, %arg2, %arg3] : vector<4xf32>, memref<1x1x3x3xf32>
}
}
}
}
return
}
}
looking at this implementation it’s determined that the only IF case that will be visited, will be the lower one (where vector is populated with padding value %cst
), this is not the correct representation of original loop. Can someone please explain this situation to me? Is this a bug or my understanding of broadcasting is wrong in this situation?
As far as I understood #map5 = affine_map<(d0, d1, d2, d3) -> (0)>
means that we will take the first element
%11 = vector.transfer_read %1[%7, %8, %9, %10], %cst_0 {permutation_map = #map5} : memref<1x1x3x1xf32>, vector<4xf32>
vector.transfer_read
is pointing into, and we will use that element to populate the whole vector.