[RFC] Changes to linalg::TiledLoopOp to unblock reductions

In general I think it makes sense to have the body yield a tile instead of the whole tensor, so +1 for the direction. (Note that this is roughly what we do in IREE with flow.dispatch.tensor.load and flow.dispatch.tensor.store.)

I mostly have a nit about the syntax of

linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>

What does “%transpose_sub in %out_sub” mean?

Would something like

linalg.tiled_yield %transpose_sub as %out_sub

be more readable? Essentially, it says that %transpose_sub replaces what was %out_sub.

Also,

 tiled_loop.yield %sub_sum in %out_

Was this a typo, or does tiled_loop.yield signify something else?

Side note: This seems to fit well with the `TilingInterface` RFC for tiling operations that don't fit the Linalg structured operation definition. That RFC also has the tiled implementation return only the tile, and makes the tensor.insert_slice an implementation detail of the generated tiled code.
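
To make the "body returns only the tile" idea concrete, here is a rough sketch in plain Python rather than MLIR. It is not the proposed op or the `TilingInterface` API; `transpose_tile`, `extract_slice`, `insert_slice`, and `tiled_transpose` are hypothetical names made up for illustration. The point is only that the body never sees the full output, and the insertion of the tile is handled by the surrounding loop.

```python
def transpose_tile(tile):
    """Hypothetical tiled body: returns only the transposed tile."""
    return [list(row) for row in zip(*tile)]


def extract_slice(tensor, r0, c0, nr, nc):
    """Copy an nr x nc tile starting at (r0, c0) out of a 2-D list."""
    return [row[c0:c0 + nc] for row in tensor[r0:r0 + nr]]


def insert_slice(out, tile, r0, c0):
    """Write a tile into `out` at offset (r0, c0), in place."""
    for i, row in enumerate(tile):
        out[r0 + i][c0:c0 + len(row)] = row


def tiled_transpose(inp, tile_size):
    """Loop driver: owns the output and decides where each tile lands."""
    rows, cols = len(inp), len(inp[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile_size):
        for c0 in range(0, cols, tile_size):
            nr = min(tile_size, rows - r0)
            nc = min(tile_size, cols - c0)
            in_sub = extract_slice(inp, r0, c0, nr, nc)
            transpose_sub = transpose_tile(in_sub)  # body yields a tile
            # The insertion is a detail of the loop, not of the body.
            insert_slice(out, transpose_sub, c0, r0)
    return out
```

With this split, the body stays the same whether the driver inserts the tile, accumulates it, or stores it somewhere else, which is roughly why yielding the tile seems like the more composable choice.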