If we are doing something that involves a lot more work, then I propose a slightly different design that avoids the weird terminator syntax with the variadic pair of args.
%sum = linalg.tiled_loop (%i) = (%c0) to (%c100) step (%c10) {
  input(%in: tensor<100xf32>) {
    %0 = linalg.tile_range %in [%i][%c10][%c1] : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  output(%out: tensor<f32>) {
    %0 = linalg.full_range %out : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  computation(%in_: tensor<10xf32>, %out_: tensor<f32>) {
    %sum = <some_computation>(%in_, %out_) : tensor<f32>
    linalg.yield %sum : tensor<f32>
  }
}
Every input and output tensor/memref will have a corresponding region. These regions define subset transformations to get arguments for the “computation” region. At first, only tiled and non-tiled args will be produced with linalg.tile_range and linalg.full_range operations. Later, we might add support for non-rectangular subsets.
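To make the intended semantics concrete, here is a rough Python model of the tensor version. It is only a sketch under my own assumptions: the helper names (tile_range, full_range, tiled_loop, add_tile_sum) are hypothetical stand-ins for the proposed ops, tensors are modeled as plain lists, and I assume the value yielded by the “computation” region becomes the output argument of the next iteration.

```python
def tile_range(values, offset, size, stride=1):
    """Hypothetical stand-in for linalg.tile_range: a rectangular 1-D subset
    described by [offset][size][stride]."""
    return values[offset : offset + size * stride : stride]


def full_range(value):
    """Hypothetical stand-in for linalg.full_range: the whole argument, untiled."""
    return value


def tiled_loop(lower, upper, step, inp, out, computation):
    """Model of the proposed op for one input and one output.

    Assumption: the result of `computation` is threaded through the loop
    as the output argument of the next iteration.
    """
    for i in range(lower, upper, step):
        in_tile = tile_range(inp, i, step)   # "input" region
        out_arg = full_range(out)            # "output" region
        out = computation(in_tile, out_arg)  # "computation" region
    return out


def add_tile_sum(in_tile, acc):
    """Placeholder for <some_computation>: accumulate the sum of a tile."""
    return acc + sum(in_tile)


data = [float(x) for x in range(100)]
total = tiled_loop(0, 100, 10, data, 0.0, add_tile_sum)
```

The point of the split is that the subset computation (which slice of which argument) lives in its own region and can be swapped out, for example for a non-rectangular subset later, without touching the computation.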
We can also bufferize this op.
linalg.tiled_loop (%i) = (%c0) to (%c100) step (%c10) {
  input(%in: memref<100xf32>) {
    %0 = linalg.tile_range %in [%i][%c10][%c1] : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  output(%out: memref<f32>) {
    %0 = linalg.full_range %out : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  computation(%in_: memref<10xf32, #map>, %out_: memref<f32>) {
    <some_computation>(%in_, %out_)
    linalg.yield
  }
}
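The bufferized form can be modeled the same way, again only as a sketch under my assumptions: buffers are Python arrays, a tile is a memoryview slice (a view that aliases the original buffer, roughly what the strided memref<10xf32, #map> argument stands for), the rank-0 memref<f32> is modeled as a one-element buffer, and the names tiled_loop_buffers and accumulate are hypothetical.

```python
from array import array


def tiled_loop_buffers(lower, upper, step, in_buf, out_buf, computation):
    """Model of the bufferized op: nothing is returned, the computation
    writes into the output buffer in place."""
    in_view = memoryview(in_buf)
    out_view = memoryview(out_buf)
    for i in range(lower, upper, step):
        in_tile = in_view[i : i + step]  # view into in_buf, no copy
        computation(in_tile, out_view)   # mutates out_buf through the view


def accumulate(in_tile, out):
    """Placeholder for <some_computation>: add the tile sum to out[0]."""
    out[0] += sum(in_tile)


in_buf = array("d", range(100))
out_buf = array("d", [0.0])
tiled_loop_buffers(0, 100, 10, in_buf, out_buf, accumulate)
```

Compared to the tensor model, the loop carries no value between iterations; the only observable effect is the write to the output buffer.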
linalg.yield can still be used as a terminator in all of these regions. In the “computation” region it yields slices or entire tensors for the corresponding tensor output argument.
One type, !linalg.subset, and two subset ops, linalg.tile_range and linalg.full_range, will have to be added. Also, in order to develop linalg.tiled_loop 2.0 incrementally, I would suggest creating a separate operation, making sure that every pass works correctly with it, and then replacing the original linalg.tiled_loop with it.