[RFC] Changes to linalg::TiledLoopOp to unblock reductions

The objection was to a terminator that reproduces the logic of multiple subtensor_insert ops, all crammed together within a single op + attributes.

I’ll repost from my reply at the time:

The difference in this new RFC is that the “insert_like” becomes unnecessary and Alex proposes to instead tie the yield to the “extract_like” op. The same discussion about non-local behavior for validation applies.

Using %out_sub and yielding a tile instead of the whole tensor is what enables reduction semantics at the tile level. The one thing that is not ideal is the “non local verification behavior” I mentioned above.

The crux of the problem is that an interesting abstraction gap appears when one mixes all of the constraints below:

  1. tiling (and other transformations)
  2. parallel iterators
  3. reduction iterators
  4. tensor SSA values

This is related to the semantics of the yield and the type of the yielded value vs the returned value.
That abstraction gap disappears once bufferization is performed.

The only job of the tiled_loop abstraction is to hide that abstraction gap and allow transformations on tensors + late bufferization.

The abstraction we have today uses

`linalg.tiled_loop` + `extract` + `insert` + `yield the whole tensor`

This works fine for 1. + 2. + 4. but not for 3. Yielding the tile appears essential to encoding parallel + reduction at the tile level.
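
To make the current form concrete, here is a rough Python model of its value semantics, not MLIR and not the actual op definition. The helpers `extract`, `insert` and `tiled_loop_yield_whole` are hypothetical names invented for this sketch, and the tiled computation (doubling each element) is just a stand-in for a purely parallel op on a 1-D tensor.

```python
def extract(tensor, offset, size):
    """Hypothetical model of an extract-like op: copy of tensor[offset:offset+size]."""
    return list(tensor[offset:offset + size])


def insert(tile, tensor, offset):
    """Hypothetical model of an insert-like op with tensor (value) semantics.

    Returns a NEW tensor; the input tensor is left untouched.
    """
    result = list(tensor)
    result[offset:offset + len(tile)] = tile
    return result


def tiled_loop_yield_whole(inp, out, tile_size):
    """Parallel tiled loop in which every iteration yields the whole output tensor."""
    for offset in range(0, len(inp), tile_size):
        in_sub = extract(inp, offset, tile_size)
        out_sub = [2 * x for x in in_sub]    # the tiled computation (assumed)
        out = insert(out_sub, out, offset)   # explicit insert, then "yield" all of out
    return out


doubled = tiled_loop_yield_whole([1, 2, 3, 4, 5], [0] * 5, tile_size=2)
```

In this model the yielded value always has the type of the full result, which is easy to verify locally but leaves no obvious place to express that several iterations contribute to the same output tile.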

This RFC proposes to drop the usage of insert_like in this case, which adds an extra simplification. The representation would resemble:

`linalg.tiled_loop` + `extract` + `yield the tile`
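
As a rough illustration of why yielding the tile helps with reductions, here is a Python model of the semantics, again not MLIR. It assumes a row-sum over a 2-D input, tiled along rows (parallel) and columns (reduction); the function name `tiled_row_sum` and the tile sizes are made up for this sketch. The write-back position of the yielded tile is taken from where `out_sub` was extracted, which is the kind of non-local dependency between the yield and the extract-like op discussed above.

```python
def tiled_row_sum(matrix, tile_rows, tile_cols):
    """Hypothetical model of 'extract + yield the tile' with a reduction dimension."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    out = [0] * n_rows

    for i in range(0, n_rows, tile_rows):        # parallel iterator
        for j in range(0, n_cols, tile_cols):    # reduction iterator
            # extract-like ops on the input and on the output
            in_sub = [row[j:j + tile_cols] for row in matrix[i:i + tile_rows]]
            out_sub = out[i:i + tile_rows]

            # body: accumulate into the output tile and yield only that tile
            yielded_tile = [acc + sum(row) for acc, row in zip(out_sub, in_sub)]

            # enclosing-loop semantics (assumed): the yielded tile replaces the
            # region that out_sub was extracted from, no explicit insert op
            out[i:i + len(yielded_tile)] = yielded_tile

    return out


row_sums = tiled_row_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]], tile_rows=2, tile_cols=2)
```

The same output tile is revisited for every step of the reduction iterator, so the yielded value has the tile type while the loop result has the full tensor type. After bufferization the write-back becomes an in-place update and this distinction goes away.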

As a reminder, the ideal form we’d have loved to have is mentioned in that previous post and can be most naturally expressed as the result of a transformation:

Long story short, there are some deep representational considerations here that are driven by transformations. The solution proposed in this RFC is a nice incremental step forward. Compared to the ideal-but-still-missing-stuff form of linalg.some_fancy_op, this simplifies the “insert_like” part, but the “extract_like” part is still obtained through “nested extract op + SSA values” and not “enclosing op semantics”. There is a strong suspicion that we cannot get rid of the “nested extract op + SSA” part without significant degradations elsewhere.

Still, I don’t think this is the end of the road and this is likely to continue evolving as we learn more from practice once we can also handle parallel codegen across reductions on tensors (mouthful).