This is a long thread, but as a fly-on-the-wall comment, another approach here that is isomorphic to what is being proposed is:
%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ = %in: tensor<100x100xf32>)
    outs (%out_ = %out: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel") {
  // Slice out this iteration's 10x10 input tile.
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
      : tensor<100x100xf32> to tensor<10x10xf32>
  // Scratch init for the result tile (e.g. via linalg.init_tensor);
  // note: *not* an extract_slice of %out_.
  %out_sub = linalg.init_tensor [10, 10] : tensor<10x10xf32>
  %transpose_sub = linalg.generic {
      indexing_maps = [#id, #tr],
      iterator_types = ["parallel", "parallel"]}
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>) {
    ^bb0(%in_elem: f32, %out_elem: f32):
      linalg.yield %in_elem : f32
  } -> tensor<10x10xf32>
  // One yield per outs operand; the transposed insertion
  // coordinates [%j, %i] live here, not on a dummy slice.
  linalg.tiled_loop_terminator {
    tiled_yield %transpose_sub at [%j, %i][%c10, %c10][%c1, %c1]
  }
}
That is, the terminator has a region, with one op per outs operand, and that op holds the offsets/sizes/strides to insert at for this iteration. I think this dodges the weirdness of linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32> needing %out_sub to be defined by a tensor.extract_slice. It's the same information, but represented without needing to traverse a use-def chain to reach the dummy "read".
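For contrast, a rough sketch of the form being dodged, as I understand the proposal upthread (same transpose tile as above; the tiled_yield syntax is the one quoted earlier in the thread):

%out_sub = tensor.extract_slice %out_[%j, %i][%c10, %c10][%c1, %c1]
    : tensor<100x100xf32> to tensor<10x10xf32>
// ... %transpose_sub computed into %out_sub as above ...
linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>
// Here a consumer has to walk from %out_sub back to the
// extract_slice to recover the insertion offsets/sizes/strides.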