[RFC] Split fusion portions of the TilingInterface into a new interface

qed · September 9, 2024, 9:00pm

Problem Statement

The TilingInterface is designed to allow generating tiled implementations of operations. In addition to tiling specific interface methods for use within helpers like scf::tileUsingSCF, the interface includes a number of methods designed for “tile and fuse” algorithms, grouped below:

Tiling

getLoopIteratorTypes
getIterationDomain
getTiledImplementation
getResultTilePosition

Fusion

Producer fusion
- generateResultTileValue
- getIterationDomainTileFromResultTile
Consumer fusion
- getTiledImplementationFromOperandTile
- getIterationDomainTileFromOperandTile

This design works well for chains of tilable ops as users can in theory choose to tile any of the operators in the chain and fuse in the surrounding operations. The fact that fusion is tied to the TilingInterface, however, restricts the fusion algorithm to tilable ops.

Consider the following example

%0 = ...
%1 = linalg.transpose
%2 = tensor.extract_slice %1
%4 = linalg.transpose %2

Currently, trying to tile %4 and fuse producers would stop on %3 (a la tileConsumersAndFuseProducersUsingSCF) as tensor.extract_slice does not implement the tiling interface. Implementing the tiling interface should be unecessary to perform the fusion though, as after tiling the linalg.transpose we should expect something like the following:

%0 = ...
%1 = linalg.transpose
%2 = tensor.extract_slice %1
%3 = scf.for {
  %4 = tensor.extract_slice %2
  %5 = linalg.transpose %4
  scf.yield %5
}

Where at this point we could merge consecutive extract slice ops and continue fusion on the merged extract slice, fusing in %1.

Similar reasoning applies to tensor.expand_shape and tensor.collapse_shape where we do not want to implement the tiling interface for those ops but do want to fuse them in situations like the above.

Proposed Solution

The solution proposed here is to split the TilingInterface into two interfaces: itself (TilingInterface) and one specifically for fusion, FusableOpInterface (naming suggestions welcome). The interface methods would be split into the two groups listed above (matching with the existing documentation in TilingInterface.td).

The TilingInterface can be changed to inherit from the FusableOpInterface, leaving existing TilingInterface implementations alone and avoiding API breakages.

cc @MaheshRavishankar @ftynse @nicolasvasilache

MaheshRavishankar · September 9, 2024, 9:28pm

I dont know if you actually want to split the methods.

All you are looking for is a way to change

%0 = <some_op>(%operand0, %operand1, ...)
%1 = tensor.extract_slice

%0 = <some_slice_op> %operand0
%1 = <some_slice_op> %operand
%2 = <some op>(%0, %1)

i.e. you are just trying to implement a method to swap the extract_slice and the op. For ops that implement the TilingInterface that is done generically via replaceExtractSliceWithTiledProducer method. You want to be able to do this swap (or replacement of tensor.extract_slice) for operations that dont implement the FusionInterface.

I know we discussed splitting the interface, but another approach is we implement this as a stand-alone interface with a separate method. TilingInterface can inherit from this interface and implement this method by just redirecting to call replaceExtractSliceWithTiledProducer.

So you only need to add to the TilingInterface (and adapt the tile and fuse logic in scf::tileConsumerAndFuseProducerUsingSCF.

ftynse · September 10, 2024, 5:44pm

I am favorable to splitting the interface, as per previous discussions. But the specific example from the RFC affords other solutions. Specifically, we may want to just swap the extract_slice and the first transpose as that would just reduce the amount of work the transpose does. (Or duplicate the work if the first transpose has other users, but so would fusion in that case.) More generally, we may want to run partial canonicalization/cse/other cleanups after each fusion somehow.

MaheshRavishankar · September 11, 2024, 4:03pm

Yeah, I am also trying to figure out how we can run some cleanup after fusion. Any suggestions (obviously you can do it with transform dialect, but for non-transform dialect cases)

ftynse · September 13, 2024, 5:12pm

applyPatternsAndFoldGreedily has an overload that takes a list of operations to simplify rather than the root. It can be called on operations produced by fusion to do a sort of local canonicalization. And we can add a listener to the rewriter passed into the tiling/fusion logic that captures all created ops for that purpose.

qed · September 17, 2024, 3:07pm

Sorry I am late to respond, thanks for the responses.

This works too. I figured that if we were going to design a new interface for this anyway though, we might as well split the interface given that there was discussion about it in the past, but I don’t have a strong opinion here.

I probably should have provided more examples of the cases I’m interested in supporting with a new interface. I used tensor.extract_slice as the example here because I considered it an operation that we certainly would not want to implement the tiling interface for, but I am actually more interested in the ability to fuse operations like tensor.expand_shape and tensor.collapse_shape. For reshaping operations, I need something more powerful than just canonicalization/cse.

This sounds like an interesting approach, and in fact sounds like we could do a lot more than just canonicalization/cse with this. So to summarize it sounds like there are two directions here:

Introduce a new interface, either by splitting out the fusion portions of the TilingInterface or standalone.
Add a FrozenRewritePatternSet option to the SCFTileAndFuseOptions with a custom listener for tracking newly inserted slices and restricting pattern application to those slices.

ftynse · September 19, 2024, 8:37am

Note that I’m still favorable to splitting the interface. My argument for splitting was that the tiling interface is unnecessarily complex due to coupling to fusion. The counterargument to that was that there are cases where we want (for efficiency) or even need (not expressible otherwise) a joint implementation to “tile-and-fuse” rather than separately tile and then fuse. Such cases look like a minority though. So maybe we can have a SimpleTilingOpInterface that only provides defines tiling methods, FusionOpInterface that defines fusion, and the current TilingInterface that inherits both with the default implementation deferring to those and a possibility for the user to override it. We just need to find the right kind of examples that motivate this work and ensure the new interfaces properly support them as a quality check of our design.

qed · September 21, 2024, 10:54pm

I opened a PR for option 2) here: [mlir] Add option for a cleanup pattern set to SCF tiling helper by qedawkins · Pull Request #109554 · llvm/llvm-project · GitHub

I am still interested in splitting the TilingInterface as well but need to come up with clearer motivating examples.

javedabsar · September 22, 2024, 7:13pm

+1 for splitting the interfaces.

ofri_frishman · February 4, 2025, 10:30am

I noticed in your PR that you handle cases regarding tensor.extract_slice or ops that can be folded into it like tensor.cast.
But you mentioned that you are more interested in fusing operations that cannot be folded like tensor.expand_shape and tensor.collapse_shape.
I also have a use case where I would like to fuse groups of ops that include reshaping operations and was wondering if you found a way to include such operations in tiling, or if you still think that splitting the tiling interface into 2 interfaces would be the best way forwards, but not implemented yet?
I also encountered another issue regarding tiling and added a post about the issues I encountered .
@qed

qed · February 4, 2025, 1:30pm

This ended up happening half downstream:

github.com/iree-org/iree

compiler/src/iree/compiler/Codegen/Common/Transforms.cpp

2733115ef


      
          /// // The transformation is possible because each reassociation group has a
          /// // contiguous slice. (i.e., [2x4->2x4], [2x8->1x5], [4x2x4->1x1x4])
          /// // After the transformation:
          ///
          /// %slice = tensor.extract_slice %in ...
          ///     tensor<8x16x32xf32> to tensor<8x5x4xf32>
          /// %reshape = tensor.expand_shape %slice [[0, 1], [2, 3], [4, 5, 6]]
          ///     tensor<8x5x4xf32> to tensor<2x4x1x5x1x1x4xf32>
          /// ```
          static LogicalResult
          swapExpandShapeWithSlice(RewriterBase &rewriter,
                                   tensor::ExpandShapeOp expandShapeOp,
                                   tensor::ExtractSliceOp sliceOp) {
            SmallVector<OpFoldResult> offsets = sliceOp.getMixedOffsets();
            SmallVector<OpFoldResult> sizes = sliceOp.getMixedSizes();
          
            if (sliceOp.getResultType().getRank() != sizes.size()) {
              return rewriter.notifyMatchFailure(sliceOp,
                                                 "unimplemented: rank reducing slice");
            }

And populated here:

github.com/iree-org/iree

compiler/src/iree/compiler/Codegen/Common/GPU/GPUApplyTilingLevel.cpp

2733115ef


      
          if (tilingLevel != IREE::GPU::TilingLevel::Subgroup) {
            tensor::ExtractSliceOp::getCanonicalizationPatterns(cleanupPatterns,
                                                                context);
            tensor::DimOp::getCanonicalizationPatterns(cleanupPatterns, context);
            tensor::populateMergeConsecutiveInsertExtractSlicePatterns(
                cleanupPatterns);
            populateSwapExtractWithExpandPattern(cleanupPatterns);

This pattern also only works for special cases where shapes nicely divide. If this pattern looks useful I would be open to having it upstream (I don’t have cycles for that atm). I still see splitting the interface as worthwhile too, but the benefit is probably mostly cleanup.

ofri_frishman · February 4, 2025, 6:04pm

I have a use case where this pattern would be useful. Not using it would probably require implementing the tiling interface for tensor.extract_slice or at least part of the interface as per the suggestion to split the interface into 2 parts.
If you are open to upstreaming it from IREE but don’t have the capacity to do that currently then I would be willing to try and do that, any pointers would be appreciated.

I’m not sure I follow regarding the special cases, could you elaborate or point to a relevant test case?

I agree, but don’t think I would try to make that change without having at least some use case of an operation that would require solely the fuse interface and can’t be dealt with via other means.

ofri_frishman · April 30, 2025, 4:15pm

Revisiting this thread from a while back.
I’ve upstreamed from IREE the pattern that @qed wrote for tensor.expand_shape and wrote a new one for tensor.collapse_shape. This enables expanding fusion beyond these ops which otherwise would be blocked.
With that done, I came across an issue that this approach creates regarding deciding which ops to fuse.

The issue is that using such patterns makes it hard to understand which ops from the original IR have equivalents created within the loop nest. When we tile an op that implements the tiling interface it is easy to track this mapping, and the scf utility tileConsumerAndFuseProducersUsingSCF even provides this mapping in the replacements field of the SCFTileAndFuseResult. But for ops such as tensor.collapse_shape no such mapping information exists, and I’m not sure how to find this mapping since different patterns could be applied between tiling steps with varying effects. Some may swap between tensor.extract_slice and another op, and in doing so, create an equivalent of the other op inside the loop nest. In other cases, the pattern may fold an op into tensor.extract_slice (which is also kind of a way of creating an equivalent of the op inside the loop nest).

I’m wondering if there is some way of getting the mapping of these ops (that were added to the loop nest in one way or another but don’t implement the tiling interface) and their equivalents from outside the loop nest.

I thought about 2 ways to get this mapping:

Adding a listener to one or more rewriter methods. But I’m not sure that would work, as the different patterns used could be diverse and not use the same methods. One pattern could use something like replaceOpWithNewOp and another pattern might use a different rewriter method.
Splitting the tiling interface into a tiling interface and a fuse interface as suggested previously, and then the fuse interface could be used instead of the various patterns currently used

Just to fill in the picture regarding the need for this mapping - I want to make a final decision regarding tiling as an iterative process with the following steps:

Tile and fuse some selected ops without changing the original IR beyond creating a new loop nest
Analyze the result and decide which ops to add/remove to the list of ops that should be tiled and fused
Return to the first step if any changes were made
Connect the final version of loop nest to the original IR (update consumers) and run cleanups to remove dead code

The mapping is a requirement for the analysis done in the second step. Without this mapping I’m not sure it would be possible to understand if the result contains any computations that might be done within the loop and outside it (recomputation) or any other factors that could affect the decision.

So, this use case might provide more motivation for splitting the tiling interface into a tiling interface and a fuse interface.

Any thoughts about how would be the best way to get this mapping?

ofri_frishman · May 5, 2025, 12:50pm

The solution that was implemented by @qed for adding ops that don’t implement the tiling interface into loop nests was to use various patterns between fusion steps.
But how can this solution support the case where results of such ops should be yielded from the loop nest?
@qed @ftynse @MaheshRavishankar

ftynse · May 5, 2025, 2:09pm

This all has been evicted from my cache, so I can only make a high-level comment. I am still in favor of splitting the interface in the name of (cognitive) simplicity. It just seems that @qed went for a simpler solution for them downstream, which is a fair game, and nobody else is willing to pick up the grunt work of refactoring upstream.

For “analysis” purposes, a hacky solution could be to clone a part of the IR you want to transform, e.g. by first isolating it in an scf.execute_region and cloning the region. Partially transform the cloned IR, analyze it and either (a) throw it away or (b) inline it back to replace the original execute region. We have the IRMapping class to maintain all sorts of mapping between parts of IR.

ofri_frishman · May 6, 2025, 12:40pm

I saw the distinction in the TilingInterface between “tiling” methods and “fusing” methods that @MaheshRavishankar documented and that @qed suggested to split the interface along.
But after reading the implementation of scf::tileConsumerAndFuseProducersUsingSCF and scf::tileAndFuseConsumerOfSlice I think that splitting this into 2 interfaces will only help with cognitive simplicity as you stated, but not much more.
The reason I think that is because of the following:

If we only want to fuse a producer without yielding its results, then implementing generateResultTileValue for the producer seems sufficient.
But if we want to also yield the producer’s results, then in addition to that method, also getIterationDomainTileFromResultTile and getResultTilePosition are required, even though getResultTilePosition was not specified to be part of the “fusing” methods. So yielding from such an op would require it to implement also the suggested stand alone tiling interface.

On the other hand, if we want to fuse a consumer, it seems in the code that we should always yield the consumer’s results (which makes sense). Doing that requires using getTiledImplementationFromOperandTile, getIterationDomainTileFromOperandTile and getResultTilePosition. So again this requires the use of getResultTilePosition which was not specified to be part of the “fusing” methods, so fusing such an op would also require it to implement the suggested stand alone tiling interface.

I would really like to get @MaheshRavishankar take on this to clarify if I am misunderstanding something regarding the implementation or my understanding here.

MaheshRavishankar · May 6, 2025, 9:02pm

@ofri_frishman thanks a lot for looking into this more, and sorry for the delayed response. Have been following the chat, but just didnt get the bandwidth to write up a reasoned response.

I have long had this opinion. To me the TilingInterface is tightly linked with fusion. So splitting does not make too much sense to me, but I am sensitive to counter arguments on these. Another argument for not splitting is that in some form tiling is just the degenerate case of fusion. If you had all parallel iterator types, tiling

%result = linalg.generic

is same as first creating the loops based on the iteration space of linalg.generic

%result = linalg.generic
%tiled_result = scf.for ... iter_args(%arg0 = %result) {
   %0 = tensor.extract_slice %result [..][..][..]
   %1 = tensor.insert_slice %0 into %arg0 [..][..][..]
   scf.yield %1 
}

and then just applying fusion. This doesnt work when you need to tile the reduction iterator type, but still there is a tight link between tiling and fusion.

I think getResultTilePosition is also required for just tiling (if the comment doesnt say so already then we should add that). The methods required for fusion are a super set of the methods required for tiling.

Beyond the specific questions though, I think the best way to control fusion is before you do tile and fuse you look at the untiled IR and see what are the operations that you want to tile and fuse. Then you can use the fusionControlFn in SCFTileAndFuseOptions (here. If there is some enhancements to the control mechanism that would help, we can certainly do that, but I’d need some more info on what you are trying to do.

ofri_frishman · May 7, 2025, 6:19am

Originally I wanted to do as you suggested

Look at the untiled IR and analyze it
Decide what to tile
Apply tile and fuse on those ops

But after adding tile and fuse support for a non trivial example, it seemed that separating the decision making step and the transformation step could cause the required transformation step to not be possible.
The reason for that is:

The tiling interface implementation of an op may hold various limitations regarding when tiling is possible. So, if we decide in the analysis phase that an op should be tiled in a specific way, then without trying that out, we might find out later on that the op just returns failure instead of the tiled implementation when trying to tile it as we wanted
Some ops do not implement the tiling interface, but we do want to fuse them into a loop nest. That can be done with various patterns that fold an op into an extract_slice or insert_slice or swap the place of the op with them. But again, those patterns have various limitations and without trying to use them it would be hard to know if they apply to a specific case

It would be interesting to understand how those limitations are dealt with in IREE.
But to me it seemed that another approach would be to couple the analysis and transformation steps more tightly, via iterating on different possible tile and fuse transformations and seeing what is supported and what isn’t, and updating transformation controls accordingly.
A difficulty with that approach is that it can be hard to understand which ops entered the loop nest via pattern since general patterns don’t have a clear interface to define what exactly they are meant to do (unlike something like getTiledImplementation where the required transformation in very clear).

MaheshRavishankar · May 16, 2025, 5:29am

Everything you pointed to is valid. I think this is a more a “still figuring this out”, and your hitting many of the same limits we are hitting in iREE. The way we deal with this in IREE is to implement the worklist by ourselves. I think @qed added the cleanupPatterns to SCFTileAndFuseOptions that got us to a state that for now seems to work.

Since you have looked deeply, Id be interested to know what your suggestions are. My thinking is untiled code has the best form to analyse “what should be tiled”. Once you start tiling a lot of the information you need to perform tiling gets “dispersed” in the IR and its hard to recover. This is the reason in IREE we “analyse first” and then do the transformation. Keeping the two separate also keeps distinct, what should be tile and fusable, and what is currently based on the implementation. The hope is that the implementation catches up to the analysis.

ofri_frishman · May 18, 2025, 5:40am

I definitely agree with your statement that I’m “still figuring this out”, so Its hard to make concrete suggestions. One issue is that I think that some ops that currently do not implement the tilingInterface and do not perform compute operations such as expand_shape or collapse_shape will be required to implement the interface. I’m trying to postpone this as much as possible, but if it becomes a must, I hope there will be openness to accepting upstreaming this implementation.

I haven’t encountered that yet, but makes sense that if you don’t do all the tiling and fusing at once, but instead do it on phases, then you could have a difficulty to update or redo previous decisions about tiling. But I’ll probably better understand if my implementation hits this issue in the future.

Topic		Replies	Views
Extending `tileConsumerAndFuseProducer` to handle more patterns MLIR	11	412	June 18, 2024
Tile and fuse support MLIR mlir	5	510	February 11, 2025
Implementing TilingInterface for tensor ops MLIR mlir	7	252	October 27, 2025
RFC for `TilingInterface` for tiling operations that dont fit into Linalg Structured Operation definition MLIR	6	1070	July 20, 2021
RFC: Changes to `TilingInterface` MLIR	21	1377	December 6, 2022

[RFC] Split fusion portions of the TilingInterface into a new interface

Problem Statement

Tiling

Fusion

Proposed Solution

Related topics