[RFC] Changes to linalg::TiledLoopOp to unblock reductions

This is a long thread, but as a fly-on-the-wall comment: another approach, isomorphic to what is being proposed, is

```mlir
%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ = %in: tensor<100x100xf32>)
    outs (%out_ = %out: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel") {
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
      : tensor<100x100xf32> to tensor<10x10xf32>
  // Init tensor for the result tile; note: no slice of %out_ is needed.
  %out_sub = linalg.init_tensor [10, 10] : tensor<10x10xf32>

  %transpose_sub = linalg.generic {
      indexing_maps = [#id, #tr],
      iterator_types = ["parallel", "parallel"]}
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>) {
    ^bb0(%in_elem: f32, %out_elem: f32):
      linalg.yield %in_elem : f32
  } -> tensor<10x10xf32>
  linalg.tiled_loop_terminator {
    tiled_yield %transpose_sub at [%j, %i][%c10, %c10][%c1, %c1]
  }
}
```

That is, the terminator has a region with one op per `outs` operand, and each such op holds the offsets/sizes/strides to insert at for this iteration. I think this dodges the weirdness of `linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>` needing `%out_sub` to be defined by a `tensor.extract_slice`. It is the same information, but represented without needing to traverse a use-def chain to reach the dummy "read".
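For contrast, here is my reading of the `tiled_yield`-based form this dodges; the syntax is assumed from the snippet quoted above, with the bodies elided:

```mlir
%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ = %in: tensor<100x100xf32>)
    outs (%out_ = %out: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel") {
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
      : tensor<100x100xf32> to tensor<10x10xf32>
  // The dummy "read": %out_sub exists only to name the destination tile.
  %out_sub = tensor.extract_slice %out_[%j, %i][%c10, %c10][%c1, %c1]
      : tensor<100x100xf32> to tensor<10x10xf32>
  %transpose_sub = linalg.generic { /* as above */ }
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>) { /* as above */ } -> tensor<10x10xf32>
  // Recovering the insertion offsets/sizes/strides requires walking
  // back through %out_sub to the extract_slice that defined it.
  linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>
}
```

The terminator-region variant keeps exactly this information, but attaches it directly to the yield instead of to a slice of the output.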