Consider writing the tensor-times-vector kernel C(i, j) = A(i, j, k) * B(k) (summing over k) in MLIR:
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1, d2) -> (d2)>
func.func @func() {
  %A = memref.alloc() : memref<100x100x100xf64>
  %B = memref.alloc() : memref<100xf64>
  %C = memref.alloc() : memref<100x100xf64>
  linalg.generic {indexing_maps = [#map1, #map3, #map2],
                  iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%A, %B : memref<100x100x100xf64>, memref<100xf64>)
      outs(%C : memref<100x100xf64>) {
  ^bb0(%in: f64, %in_0: f64, %out: f64):
    %0 = arith.mulf %in, %in_0 : f64
    %1 = arith.addf %out, %0 : f64
    linalg.yield %1 : f64
  }
  return
}
Running this through mlir-opt with the convert-linalg-to-affine-loops pass yields:
module {
  func.func @func() {
    %alloc = memref.alloc() : memref<100x100x100xf64>
    %alloc_0 = memref.alloc() : memref<100xf64>
    %alloc_1 = memref.alloc() : memref<100x100xf64>
    affine.for %arg0 = 0 to 100 {
      affine.for %arg1 = 0 to 100 {
        affine.for %arg2 = 0 to 100 {
          %0 = affine.load %alloc[%arg0, %arg1, %arg2] : memref<100x100x100xf64>
          %1 = affine.load %alloc_0[%arg2] : memref<100xf64>
          %2 = affine.load %alloc_1[%arg0, %arg1] : memref<100x100xf64>
          %3 = arith.mulf %0, %1 : f64
          %4 = arith.addf %2, %3 : f64
          affine.store %4, %alloc_1[%arg0, %arg1] : memref<100x100xf64>
        }
      }
    }
    return
  }
}
A common optimization here would be to introduce a scalar temporary for the value of C(i, j) inside the k loop, accumulate into that temporary instead of storing to memory on every iteration, and then store the result into C(i, j) once the k loop is done. While an optimization like this may be possible in the affine dialect (I didn't find anything like it), it seems like the most information needed to perform it is present at the level of the linalg.generic operation, especially given that the iterator type for the k loop has been marked as "reduction". I spent some time searching the code and the documentation for something like this, so any pointers will be helpful!
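For concreteness, the transformed loop nest I have in mind would carry the accumulator as an SSA value via iter_args on the innermost affine.for. This is a hand-written sketch of the desired output, not the result of any existing pass (and it assumes C starts at zero; initializing %acc from an affine.load of %alloc_1 would preserve C's initial contents instead):

```mlir
affine.for %arg0 = 0 to 100 {
  affine.for %arg1 = 0 to 100 {
    %cst = arith.constant 0.000000e+00 : f64
    // Accumulate C(i, j) in a loop-carried SSA value instead of memory.
    %sum = affine.for %arg2 = 0 to 100 iter_args(%acc = %cst) -> (f64) {
      %a = affine.load %alloc[%arg0, %arg1, %arg2] : memref<100x100x100xf64>
      %b = affine.load %alloc_0[%arg2] : memref<100xf64>
      %p = arith.mulf %a, %b : f64
      %next = arith.addf %acc, %p : f64
      affine.yield %next : f64
    }
    // A single store once the k loop is done.
    affine.store %sum, %alloc_1[%arg0, %arg1] : memref<100x100xf64>
  }
}
```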