This is a GEMM kernel in MLIR (affine dialect), saved as gemm.mlir:
module {
  func.func @_Z4gemmPA32_iS0_S0_(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>, %arg2: memref<32x32xi32>) {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 32 {
        %c0_i32 = arith.constant 0 : i32
        %0 = affine.for %arg5 = 0 to 32 iter_args(%arg6 = %c0_i32) -> (i32) {
          %1 = affine.load %arg0[%arg3, %arg5] : memref<32x32xi32>
          %2 = affine.load %arg1[%arg5, %arg4] : memref<32x32xi32>
          %3 = arith.muli %1, %2 : i32
          %4 = arith.addi %arg6, %3 : i32
          affine.yield %4 : i32
        }
        affine.store %0, %arg2[%arg3, %arg4] : memref<32x32xi32>
      }
    }
    return
  }
}
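For reference, the IR above appears to correspond to a C++ loop nest like the one below. This is only a sketch reconstructed from the IR: the mangled name `_Z4gemmPA32_iS0_S0_` demangles to `gemm(int (*)[32], int (*)[32], int (*)[32])`, but the parameter names `A`, `B`, `C`, the constant `N`, and the local `acc` are my own labels, not taken from the original source.

```cpp
// Hypothetical C++ source matching the MLIR above (names are assumptions).
constexpr int N = 32;

void gemm(int A[N][N], int B[N][N], int C[N][N]) {
  for (int i = 0; i < N; ++i) {        // %arg3
    for (int j = 0; j < N; ++j) {      // %arg4
      int acc = 0;                     // %c0_i32, carried as iter_args %arg6
      for (int k = 0; k < N; ++k) {    // %arg5, the reduction loop
        acc += A[i][k] * B[k][j];      // loads from %arg0 and %arg1
      }
      C[i][j] = acc;                   // store to %arg2
    }
  }
}
```

Here `i`, `j`, and `k` map to `%arg3`, `%arg4`, and `%arg5`, so the "second loop" in the question is the `j` loop and the "innermost loop" is the `k` reduction.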
Command: mlir-opt gemm.mlir -affine-super-vectorize="virtual-vector-size=8 test-fastest-varying=0 vectorize-reductions=true"
Why does the output vectorize the second loop (%arg4) but fail to vectorize the innermost loop (%arg5), even though vectorize-reductions=true is set?