[RFC] Changes to linalg::TiledLoopOp to unblock reductions

Yes: I understood a few aspects thanks to Mahesh!

First, it is good to keep in mind that linalg.tiled_loop isn’t really a “structured op”, in the sense that the ins and outs aren’t connected in any way to the iteration space, which is driven exclusively by the (%i) = (%c0) to (%size_0) step (%c10) expression.

So it isn’t even clear what semantics the ins and outs really carry in the first place. At the tensor level it seems more like a convention than anything; it may turn out to be more useful at the memref level (can you also provide a memref example?).

For example it should be valid to write the transpose tiled_loop without an “outs”:

%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ =  %in: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel")  {
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c20][%c1, %c1]
  %out_sub = tensor.init_tensor : tensor<10x10xf32>
  %transpose_sub = linalg.generic {
      indexing_maps =  [#id, #tr],
      iterator_types =  ["parallel", "parallel"]}
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>)  {
    ^bb0(%in_elem: f32,  %out_elem: f32):
      linalg.yield  %in_elem : f32
  } -> tensor<10x10xf32>
  linalg.tiled_yield %transpose_sub in /* use range, see below */: tensor<10x10xf32>
}

Unless the outs is used to compute the shape of the results of the tiled_loop?
That would allow a tiled_loop to produce the output only partially, with the remaining values read from the outs (which is really not well named at the tensor level, as this shows again).

Then, as Mahesh mentioned, linalg.tiled_yield takes the tile to yield as its first argument, but because linalg.tiled_loop isn’t structured, it also needs to carry the range for the tile.
I feel that a more accurate model of what we want to express here would be to actually carry the intent with a type, instead of “faking” it and hoping to recover it from the extract_slice:

...
  %tile_range = tensor.tile_range [%j, %i][%c20, %c10][%c1, %c1] : !tensor.tile_range
  %out_sub = tensor.extract_slice %out(%tile_range) // only if %out_sub is used as input to the computation!
  ...
  linalg.tiled_yield %transpose_sub at %tile_range : tensor<10x10xf32>
}

The alternative is to express the coordinates in the tiled_yield explicitly, but that was what @ntv found hard to manage in the original proposal because of the number of variadic operands.
Using a proper SSA value for the range makes the linalg.tiled_yield very regular: operands come in pairs. A variadic list of pairs seems quite manageable here (actually the proposal here also has a variadic list of pairs, so I guess we’re on the same page in terms of complexity of the yield…).
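
To make the "operands come in pairs" idea concrete, here is a rough Python model of the semantics I have in mind. It is only a sketch, not MLIR: `TileRange`, `extract_slice`, `tiled_yield` and `tiled_transpose` are hypothetical names made up for illustration, and I am assuming that the result starts out as a copy of the outs operand, so that anything the loop does not yield keeps its original value.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


@dataclass(frozen=True)
class TileRange:
    """Hypothetical stand-in for a first-class tile range value."""
    offsets: Tuple[int, int]
    sizes: Tuple[int, int]
    strides: Tuple[int, int] = (1, 1)


def extract_slice(src: Matrix, r: TileRange) -> Matrix:
    """Read the tile of `src` described by `r`."""
    (o0, o1), (s0, s1), (t0, t1) = r.offsets, r.sizes, r.strides
    return [[src[o0 + i * t0][o1 + j * t1] for j in range(s1)]
            for i in range(s0)]


def tiled_yield(dest: Matrix, pairs: Sequence[Tuple[Matrix, TileRange]]) -> None:
    """Write each (tile, range) pair into `dest`; other elements are untouched."""
    for tile, r in pairs:
        (o0, o1), (s0, s1), (t0, t1) = r.offsets, r.sizes, r.strides
        for i in range(s0):
            for j in range(s1):
                dest[o0 + i * t0][o1 + j * t1] = tile[i][j]


def tiled_transpose(src: Matrix, init: Matrix, step: Tuple[int, int]) -> Matrix:
    """Tile-by-tile transpose; `init` plays the role of the outs operand."""
    rows, cols = len(src), len(src[0])
    out = [row[:] for row in init]  # assumption: result starts as a copy of outs
    for i in range(0, rows, step[0]):
        for j in range(0, cols, step[1]):
            h, w = min(step[0], rows - i), min(step[1], cols - j)
            in_sub = extract_slice(src, TileRange((i, j), (h, w)))
            transposed = [list(col) for col in zip(*in_sub)]  # w x h tile
            # The yield carries the tile together with the range it covers.
            tiled_yield(out, [(transposed, TileRange((j, i), (w, h)))])
    return out
```

The point of the model is that the loop body never has to be pattern-matched to find out where a tile goes: the range travels with the tile, and yielding several results is just a longer list of pairs.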