Hi,
I’ve been working on some data processing that requires sub-sampling a tensor at regular intervals, with an offset within each tile. My thinking was that this could be implemented as a linalg.generic, using the indexing map on the input to encode the interval and offset - e.g. for sampling every 8x8 tile at offsets [4, 5]:
#map0 = affine_map<(d0, d1) -> (d0 * 8 + 4, d1 * 8 + 5)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
  func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> attributes {llvm.emit_c_interface} {
    %0 = tensor.empty() : tensor<45x80xf32>
    %1 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<360x640xf32>) outs(%0 : tensor<45x80xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<45x80xf32>
    return %1 : tensor<45x80xf32>
  }
}
This works fine with a direct lowering to CPU, but we noticed we weren’t getting the right results when running the pipeline through IREE. I’ve tracked it down to the tiling, and I can reproduce it with the -linalg-tile pass (as I understand it, IREE uses the same pass/code from upstream MLIR, which is why I’m raising it here).
mlir-opt -linalg-tile="tile-sizes=15,20" test.mlir
produces:
#map0 = affine_map<(d0) -> (d0 * 8 + 4)>
#map1 = affine_map<(d0) -> (d0 * -8 + 353, 117)>
#map2 = affine_map<(d0) -> (d0 * 8 + 5)>
#map3 = affine_map<(d0) -> (d0 * -8 + 633, 158)>
#map4 = affine_map<(d0, d1) -> (d0 * 8 + 4, d1 * 8 + 5)>
#map5 = affine_map<(d0, d1) -> (d0, d1)>
module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
  func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> attributes {llvm.emit_c_interface} {
    %c0 = arith.constant 0 : index
    %c45 = arith.constant 45 : index
    %c15 = arith.constant 15 : index
    %c80 = arith.constant 80 : index
    %c20 = arith.constant 20 : index
    %0 = tensor.empty() : tensor<45x80xf32>
    %1 = scf.for %arg1 = %c0 to %c45 step %c15 iter_args(%arg2 = %0) -> (tensor<45x80xf32>) {
      %2 = scf.for %arg3 = %c0 to %c80 step %c20 iter_args(%arg4 = %arg2) -> (tensor<45x80xf32>) {
        %3 = affine.apply #map0(%arg1)
        %4 = affine.min #map1(%arg1)
        %5 = affine.apply #map2(%arg3)
        %6 = affine.min #map3(%arg3)
        %extracted_slice = tensor.extract_slice %arg0[%3, %5] [%4, %6] [1, 1] : tensor<360x640xf32> to tensor<?x?xf32>
        %extracted_slice_0 = tensor.extract_slice %arg4[%arg1, %arg3] [15, 20] [1, 1] : tensor<45x80xf32> to tensor<15x20xf32>
        %7 = linalg.generic {indexing_maps = [#map4, #map5], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<?x?xf32>) outs(%extracted_slice_0 : tensor<15x20xf32>) {
        ^bb0(%in: f32, %out: f32):
          linalg.yield %in : f32
        } -> tensor<15x20xf32>
        %inserted_slice = tensor.insert_slice %7 into %arg4[%arg1, %arg3] [15, 20] [1, 1] : tensor<15x20xf32> into tensor<45x80xf32>
        scf.yield %inserted_slice : tensor<45x80xf32>
      }
      scf.yield %2 : tensor<45x80xf32>
    }
    return %1 : tensor<45x80xf32>
  }
}
It looks like the indexing map is split into per-dimension maps (map0 and map2), which are applied to compute the offsets of the input slice, but the original map is then kept on the linalg.generic over the tile, so the map is effectively applied twice.
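Tracing the first dimension through the second tile (%arg1 = 15) as a sanity check: the tensor.extract_slice offset is 15 * 8 + 4 = 124, and map4 then maps the local index i within the tile to 124 + (i * 8 + 4) = (15 + i) * 8 + 8 in the original tensor, whereas the untiled op reads (15 + i) * 8 + 4 - so every element ends up shifted by the offset a second time.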
Is this something that needs fixing in the tiling pass, or is this an invalid use of indexing maps? For now, I’ve been able to work around it by using linalg.index to get the output indices and computing the corresponding input indices for a tensor.extract.
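Roughly, the workaround looks like this (just a sketch of the idea, with the stride and offsets hard-coded as arith constants and only the identity map left on the output):
#map = affine_map<(d0, d1) -> (d0, d1)>
func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> {
  %c8 = arith.constant 8 : index
  %c4 = arith.constant 4 : index
  %c5 = arith.constant 5 : index
  %0 = tensor.empty() : tensor<45x80xf32>
  %1 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel", "parallel"]}
      outs(%0 : tensor<45x80xf32>) {
  ^bb0(%out: f32):
    // Compute the input coordinates from the output iteration indices:
    // row = i * 8 + 4, col = j * 8 + 5.
    %i = linalg.index 0 : index
    %j = linalg.index 1 : index
    %i8 = arith.muli %i, %c8 : index
    %row = arith.addi %i8, %c4 : index
    %j8 = arith.muli %j, %c8 : index
    %col = arith.addi %j8, %c5 : index
    %val = tensor.extract %arg0[%row, %col] : tensor<360x640xf32>
    linalg.yield %val : f32
  } -> tensor<45x80xf32>
  return %1 : tensor<45x80xf32>
}
This tiles correctly since the only indexing map left is the identity on the output, but it obviously hides the access pattern from the rest of the pipeline.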
Thanks
Rob