[RFC] Changes to linalg::TiledLoopOp to unblock reductions

nicolasvasilache · July 16, 2021, 8:04am

The objection was to a terminator that reproduces the logic of multiple subtensor_insert all crammed together within a single op + attributes.

I’ll repost from my reply at the time:

[RFC] Add Linalg TileOp

I’d split that into multiple terminators or multiple ops + yields, something close to subtensor_insert + yield. This will require a little bit of non-local behavior for the validation and has somewhat awkward semantics:
%rX = insert_like %x into %X[offsets][sizes][strides]: typeof(%x) into typeof(%X)
yield %rX : typeof(%rX) // But we actually really yield %x@[offsets][sizes][strides]

The difference in this new RFC is that the “insert_like” becomes unnecessary and Alex proposes to instead tie the yield to the “extract_like” op. The same discussion about non-local behavior for validation applies.

Using %out_sub and yielding a tile instead of the whole tensor is what enables reduction semantics at the tile level. The one thing that is not ideal is the “non-local verification behavior” I mentioned above.

The crux of the problem is that an interesting abstraction gap appears when one mixes all of the constraints below:

1. tiling (and other transformations)
2. parallel iterators
3. reduction iterators
4. tensor SSA values

This is related to the semantics of the yield and the type of the yielded value vs the returned value.
That abstraction gap disappears once bufferization is performed.

The only job of the tiled_loop abstraction is to hide that abstraction gap and allow transformations on tensors + late bufferization.

The abstraction we have today uses

`linalg.tiled_loop` + `extract` + `insert` + `yield the whole tensor`

This works fine for 1. + 2. + 4. but not for 3. Yielding the tile appears essential to encoding parallel + reduction at the tile level.

This RFC proposes to drop the usage of insert_like in this case, which adds an extra simplification. The representation would resemble:

`linalg.tiled_loop` + `extract` + `yield the tile`
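
To make the contrast between the two forms concrete, here is a toy Python model. It is only a sketch: plain lists stand in for tensor SSA values, and the function names, the row-sum example, and the sequential loop order are all assumptions made for illustration, not the actual linalg.tiled_loop semantics. In the first form the loop body inserts its tile and yields the whole tensor; in the second the body yields only the tile, and the loop writes it back at the offsets the output tile was extracted from.

```python
# Toy model only: Python lists stand in for tensors; names are hypothetical.

def body_tile(out_sub, a_sub):
    """Loop body: add the partial row sums of a_sub onto the output tile."""
    return [acc + sum(row) for acc, row in zip(out_sub, a_sub)]


def row_sums_yield_whole(a, tile_i, tile_j):
    """extract + insert + yield the whole tensor."""
    m, n = len(a), len(a[0])
    out = [0] * m
    for i0 in range(0, m, tile_i):        # parallel iterator (rows)
        for j0 in range(0, n, tile_j):    # reduction iterator (columns)
            i1, j1 = min(i0 + tile_i, m), min(j0 + tile_j, n)
            a_sub = [row[j0:j1] for row in a[i0:i1]]   # extract input tile
            out_sub = out[i0:i1]                       # extract output tile
            tile = body_tile(out_sub, a_sub)
            # "insert" builds a new whole tensor, which is what gets yielded.
            out = out[:i0] + tile + out[i1:]
    return out


def row_sums_yield_tile(a, tile_i, tile_j):
    """extract + yield the tile."""
    m, n = len(a), len(a[0])
    out = [0] * m
    for i0 in range(0, m, tile_i):        # parallel iterator (rows)
        for j0 in range(0, n, tile_j):    # reduction iterator (columns)
            i1, j1 = min(i0 + tile_i, m), min(j0 + tile_j, n)
            a_sub = [row[j0:j1] for row in a[i0:i1]]   # extract input tile
            out_sub = out[i0:i1]                       # extract output tile
            # Only the tile is yielded; the loop (not the body) places it
            # back at the offsets out_sub was extracted from.
            out[i0:i1] = body_tile(out_sub, a_sub)
    return out


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
whole = row_sums_yield_whole(matrix, 2, 2)
tiled = row_sums_yield_tile(matrix, 2, 2)
```

Run sequentially, both forms compute the same row sums. The difference this sketch tries to show is that in the second form the body only ever sees tiles, never the whole output tensor.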

As a reminder, the ideal form we’d have loved to have is mentioned in that previous post and can be most naturally expressed as the result of a transformation:

[RFC] Add Linalg TileOp

tile(some_op_2(some_op_1(%A, %B), %C), 5).

And we would like to express this as an op with a region that I’ll call “region after”-tiling:
linalg.some_fancy_op (%A, %B, %C) <some_extra_attributes_and_operands> {
^bb(%tiledA, %tiledB, %tiledC) {
  %tiledD = some_op_1(%tiledA, %tiledB) ...
  %tiledE = some_op_2(%tiledD, %tiledC) ...
  linalg.yield %tiledE ...
}}
Unfortunately, we could not find a good generic form for this that works in the general case as it raises some nasty inverse problems.
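
For intuition only, the “region after tiling” idea can be sketched in Python for the easy case where every operand is 1-D and tiled identically and the body is elementwise. The helper tiled_region and the concrete definitions of some_op_1 and some_op_2 below are hypothetical stand-ins chosen for this sketch. It deliberately does not model the general case referred to above, where operand tiles cannot simply be sliced the same way as the result.

```python
# Toy model only: lists stand in for 1-D tensors; all names are hypothetical.

def some_op_1(a, b):
    """Stand-in elementwise op: addition."""
    return [x + y for x, y in zip(a, b)]


def some_op_2(d, c):
    """Stand-in elementwise op: multiplication."""
    return [x * y for x, y in zip(d, c)]


def tiled_region(body, operands, tile_size):
    """Apply body to matching tiles of each operand and concatenate results.

    Assumes every operand has the same length and that the body is
    elementwise, so each result tile depends only on the same-offset
    operand tiles.
    """
    n = len(operands[0])
    result = []
    for start in range(0, n, tile_size):
        tiles = [x[start:start + tile_size] for x in operands]
        result.extend(body(*tiles))       # "yield" of the result tile
    return result


A = list(range(7))
B = [1] * 7
C = [2] * 7

untiled = some_op_2(some_op_1(A, B), C)
tiled = tiled_region(
    lambda ta, tb, tc: some_op_2(some_op_1(ta, tb), tc),
    [A, B, C],
    5,
)
```

In this restricted setting the enclosing helper does all the extraction, so the body contains no extract-like ops at all, which is the property the ideal form is after.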

Long story short, there are some deep representational considerations here that are driven by transformations. The solution proposed in this RFC is a nice incremental step forward. Compared to the ideal-but-still-missing-stuff form of linalg.some_fancy_op, this simplifies the “insert_like” part, but the “extract_like” part is still obtained through “nested extract op + SSA values” rather than through “enclosing op semantics”. There is a strong suspicion that we cannot get rid of the “nested extract op + SSA” part without significant degradations elsewhere.

Still, I don’t think this is the end of the road and this is likely to continue evolving as we learn more from practice once we can also handle parallel codegen across reductions on tensors (mouthful).
