[RFC] Changes to linalg::TiledLoopOp to unblock reductions

If we are willing to take on considerably more work, I propose a slightly different design that avoids the awkward terminator syntax with its variadic pairs of arguments.

%sum = linalg.tiled_loop (%i) = (%c0) to (%c100) step (%c10) {
  input(%in: tensor<100xf32>) {
    %0 = linalg.tile_range %in [%i][%c10][%c1] : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  output(%out: tensor<f32>) {
    %0 = linalg.full_range %out : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  computation(%in_: tensor<10xf32>, %out_: tensor<f32>) {
    %sum = <some_computation>(%in_, %out_) : tensor<f32>
    linalg.yield %sum : tensor<f32>
  }
}

Every input and output tensor/memref gets its own region. These regions define the subset transformations that produce the arguments of the “computation” region. Initially, only tiled and full (non-tiled) arguments will be supported, produced by the linalg.tile_range and linalg.full_range operations, respectively. Support for non-rectangular subsets might be added later.
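To make the subset regions concrete, here is a small Python model of what the two proposed subset ops would hand to the “computation” region on each iteration. This is only a sketch under simplifying assumptions: tensors are flat 1-D Python lists, a tensor is passed to the subset functions by its length, and `Subset`, `tile_range`, `full_range` and `extract` are hypothetical Python names chosen to mirror the proposed MLIR ops, not an existing API.

```python
from dataclasses import dataclass
from typing import List, Sequence


# Hypothetical model of the proposed !linalg.subset type for 1-D tensors:
# a strided range described by offset, size and stride.
@dataclass(frozen=True)
class Subset:
    offset: int
    size: int
    stride: int

    def indices(self) -> List[int]:
        # Element positions of the source tensor covered by this subset.
        return [self.offset + k * self.stride for k in range(self.size)]


def tile_range(length: int, offset: int, size: int, stride: int) -> Subset:
    # Models `linalg.tile_range %t [offset][size][stride]` on a tensor of
    # the given length; rejects tiles that would read out of bounds.
    if size < 0 or offset < 0 or stride < 1:
        raise ValueError("invalid tile parameters")
    last = offset + (size - 1) * stride
    if size > 0 and last >= length:
        raise ValueError("tile is out of bounds")
    return Subset(offset, size, stride)


def full_range(length: int) -> Subset:
    # Models `linalg.full_range %t`: the whole tensor with unit stride.
    return Subset(0, length, 1)


def extract(tensor: Sequence[float], subset: Subset) -> List[float]:
    # The value the "computation" region would receive for this subset.
    return [tensor[i] for i in subset.indices()]
```

For the example above with `%i = 20`, `extract(data, tile_range(100, 20, 10, 1))` returns elements 20 through 29 of a 100-element `data`, matching the `tensor<10xf32>` argument `%in_`.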

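The op can also be bufferized. After bufferization it operates on memrefs, the input tile becomes a view into the original buffer with a non-identity layout (#map), and the loop no longer returns a result: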

linalg.tiled_loop (%i) = (%c0) to (%c100) step (%c10) {
  input(%in: memref<100xf32>) {
    %0 = linalg.tile_range %in [%i][%c10][%c1] : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  output(%out: memref<f32>) {
    %0 = linalg.full_range %out : !linalg.subset
    linalg.yield %0 : !linalg.subset
  }
  computation(%in_: memref<10xf32, #map>, %out_: memref<f32>) {
    <some_computation>(%in_, %out_)
    linalg.yield
  }
}
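As a rough illustration of the bufferized semantics, the Python sketch below models the loop over buffers. It assumes that the memref<f32> output is a one-element Python list that the body updates in place, and that the input is only read, so copying the tile is harmless. `tiled_loop_buffers` and `accumulate` are hypothetical names, with `accumulate` standing in for `<some_computation>`.

```python
from typing import Callable, List

# Body of the bufferized "computation" region: it receives the input tile
# and the output buffer, mutates the buffer, and returns nothing (this
# mirrors `linalg.yield` with no operands).
Computation = Callable[[List[float], List[float]], None]


def tiled_loop_buffers(lb: int, ub: int, step: int,
                       inp: List[float], out: List[float],
                       computation: Computation) -> None:
    for i in range(lb, ub, step):
        # Input subset: tile_range [i][step][1]. The input is only read,
        # so a copied slice behaves like a view here.
        in_tile = inp[i:min(i + step, ub)]
        # Output subset: full_range, so the same buffer is passed every time.
        computation(in_tile, out)


def accumulate(in_tile: List[float], out: List[float]) -> None:
    # Hypothetical stand-in for <some_computation>: a sum reduction.
    out[0] += sum(in_tile)
```

With `buf = [0.0]` and `data` holding 0.0 to 99.0, `tiled_loop_buffers(0, 100, 10, data, buf, accumulate)` leaves `4950.0` in `buf[0]`. Because the result lives in the buffer, the op itself produces no value.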

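linalg.yield can still serve as the terminator in all of these regions. In the “computation” region on tensors, it yields the updated slice (for a tiled output) or the entire tensor (for a full-range output) for each corresponding output argument. In the bufferized form the outputs are updated in place, so it takes no operands.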
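The following Python sketch models the tensor (value-semantics) version, which is what makes reductions work. The value passed to `linalg.yield` is inserted back into the output at that output's subset, and the updated output is what the next iteration sees. When the output subset is the full range, the yielded value replaces the whole accumulator. This is a model under assumptions, not the proposed implementation: 1-D lists stand in for tensors, and `tiled_loop_tensors` and the `out_subset` callback (returning an `(offset, size)` pair in place of the output region) are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the output region: maps the induction variable
# to the (offset, size) of the output subset.
OutSubset = Callable[[int], Tuple[int, int]]
# Body of the "computation" region: returns the value it would linalg.yield.
Computation = Callable[[List[float], List[float]], List[float]]


def tiled_loop_tensors(lb: int, ub: int, step: int,
                       inp: List[float], out: List[float],
                       out_subset: OutSubset,
                       computation: Computation) -> List[float]:
    result = list(out)  # value semantics: the caller's `out` is not mutated
    for i in range(lb, ub, step):
        in_tile = inp[i:min(i + step, ub)]        # input: tile_range [i][step][1]
        off, size = out_subset(i)
        out_tile = result[off:off + size]         # output subset given to the body
        yielded = computation(in_tile, out_tile)  # operand of linalg.yield
        if len(yielded) != size:
            raise ValueError("yielded value must match the output subset")
        result[off:off + size] = yielded          # insert it back into the output
    return result
```

A sum reduction into a rank-0 output (modeled as a one-element list) uses the full range, `lambda i: (0, 1)`, and a body like `lambda t, o: [o[0] + sum(t)]`. An elementwise op instead tiles the output with `lambda i: (i, 10)`, so each yield writes back only its own slice.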

This requires adding one type, !linalg.subset, and two subset ops, linalg.tile_range and linalg.full_range. To develop linalg.tiled_loop 2.0 incrementally, I suggest creating it as a separate operation, making sure every pass works correctly with it, and only then replacing the original linalg.tiled_loop.