In general, I think it makes sense to have the body yield a tile instead of the whole tensor, so +1 for the direction. (Note that this is roughly what we do in IREE with `flow.dispatch.tensor.load` and `flow.dispatch.tensor.store`.)
My main nit is about `linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>`. What does "`%transpose_sub in %out_sub`" mean?
Would something like `linalg.tiled_yield %transpose_sub as %out_sub` be more readable? It would essentially say that `%transpose_sub` replaces what was `%out_sub`.
Also, in `tiled_loop.yield %sub_sum in %out_`, was this a typo, or is `tiled_loop.yield` signifying something else?
Side note: this does seem to fit well with the RFC for `TilingInterface`, an interface for tiling operations that don't fit the Linalg structured operation definition. That proposal likewise has the tiled implementation return only the tile and makes the `tensor.insert_slice` an implementation detail of the generated tiled code.