This topic has come up multiple times, but after reviewing past discussions, I couldn’t find a clear or unanimous decision on how to move forward.
We’re using TileAndFuse, specifically the tileConsumerAndFuseProducersUsingSCF method, which fuses operations implementing the TilingInterface. However, many “no-op” tensor operations, such as tensor.expand_shape, tensor.insert_slice, and others, don’t implement the interface and thus can’t be fused.
In some ways, this post revives this inactive RFC thread: [RFC] Split fusion portions of the TilingInterface into a new interface.
That RFC proposed splitting the existing interface, which currently supports both tiling and fusion, into two separate interfaces: one dedicated to tiling and the other to fusion.
The general sentiment seemed to be that this separation made sense conceptually, but in practice both interfaces would likely share very similar logic, since tiling and fusion are tightly coupled.
Since that discussion, @qed introduced an option to run cleanup pattern sets between fusion steps. This allows patterns like MergeConsecutiveInsertExtractSlice and BubbleUpExtractSliceOp to “skip over” intermediate tensor ops and directly connect the slices to a TilingInterface op.
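To make the role of these cleanup patterns concrete, here is a toy Python sketch (not MLIR code; the function name and the 1-D `(offset, size, stride)` encoding are hypothetical) of the offset arithmetic that a MergeConsecutiveInsertExtractSlice-style rewrite performs: two consecutive extract_slice-like ops are folded into a single one.

```python
def merge_extract_slices(outer, inner):
    """Compose two consecutive 1-D extract_slice-like ops.

    Each slice is (offset, size, stride) into its source. Extracting
    `inner` out of the result of `outer` is equivalent to a single
    extract with composed offset and stride -- the essence of what a
    merge-consecutive-slices cleanup pattern rewrites.
    """
    o1, s1, st1 = outer
    o2, s2, st2 = inner
    assert o2 + (s2 - 1) * st2 < s1, "inner slice must stay in bounds"
    return (o1 + o2 * st1, s2, st1 * st2)

def apply_slice(slc, xs):
    o, s, st = slc
    return [xs[o + i * st] for i in range(s)]

data = list(range(100))
outer = (10, 40, 2)   # elements 10, 12, ..., 88
inner = (5, 8, 3)     # positions 5, 8, ..., 26 of the outer result

merged = merge_extract_slices(outer, inner)
assert apply_slice(inner, apply_slice(outer, data)) == apply_slice(merged, data)
```

The real patterns operate on multi-dimensional mixed static/dynamic offsets, but the composition idea is the same.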
While this approach is helpful, I see two key drawbacks compared to directly implementing a TilingInterface (or a dedicated FusableInterface):
Loss of IR context between fusion steps: As noted in the RFC, applying cleanup patterns between fusion steps disconnects the transformed IR from the original structure. This makes it harder to reason about what to fuse first, and apply targeted transformations later.
Pattern limitations: In some cases, cleanup patterns simply aren’t sufficient:
BubbleUp doesn’t support strided slices. While the current TilingInterface doesn’t support fusion of strided slices either, this is something we’d like to enable in the future.
BubbleUp only works on expand_shape if each reassociation group corresponds to a contiguous slice.
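The contiguity restriction on expand_shape can be illustrated with a small Python sketch (function names are mine, not MLIR's): a slice of the expanded tensor corresponds to one extract_slice of the collapsed source only when the touched flat indices are contiguous, which requires each reassociation group's inner dimensions to be taken whole.

```python
def flat_indices_of_slice(expanded_shape, offsets, sizes):
    """Enumerate, in row-major order, the flat (collapsed) indices
    touched by a slice of a 2-D expansion of a 1-D tensor."""
    _, inner = expanded_shape
    return [(offsets[0] + i) * inner + (offsets[1] + j)
            for i in range(sizes[0]) for j in range(sizes[1])]

def is_contiguous(ix):
    return all(b - a == 1 for a, b in zip(ix, ix[1:]))

# expand_shape of a 24-element tensor into 4x6.
# Slicing whole rows maps to one contiguous extract_slice of the source:
assert is_contiguous(flat_indices_of_slice((4, 6), (1, 0), (2, 6)))
# Slicing a sub-range of the inner dimension does not -- the flat
# indices have gaps, so the pattern cannot bubble the slice up:
assert not is_contiguous(flat_indices_of_slice((4, 6), (1, 1), (2, 4)))
```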
I understand there’s some reluctance to implement TilingInterface for the mentioned tensor ops, but I’d like to better understand the rationale. Even if there’s no performance benefit, the capability is technically feasible and could provide significant value for pre-transformation analysis.
Separating transformation mechanisms from heuristics already allows the caller to decide which operations should be tiled.
I’ll start by strongly agreeing with this statement. This is in line with the work that @rolfmorel is doing on the transform and DLTI dialects.
Regardless of their current usage, these patterns are meaningful on their own. They can be used to expose fusion and simplify load/store access patterns. So, even if we implement the interfaces and allow fusion without the cleanups, we should still be aware that they may be part of the pipeline somewhere.
I think the keyword here is targeted. If you’re running targeted transformations (e.g. a particular hand-crafted schedule), then you know what you have at every step. The problem is when you have a default pipeline and can’t control (or skip) what has run before.
Deciding on the interplay between passes and transforms is complicated. Often you get unpredictable side-effects, either because you missed something, or because the original implementation was too vague/focused.
So, I’d start with a comparison between the current “clean up” approach and tiling tensor operations. Show actual IR on either side and what a pass / transform would do to reach the next stage.
I wouldn’t necessarily say there is opposition to implementing the interface for shape-related ops; it just requires a substantial amount of work, and there are simpler workarounds that people who needed this could implement downstream. Specifically, tiling expand_shape and collapse_shape is a recurrent request, so IMO it is something that belongs upstream even if it has to remain somewhat optional.
The challenge with expand_shape and collapse_shape specifically is that the tiling interface currently also requires fusion-related methods to be implemented, and I don’t remember offhand whether they can be left unimplemented or return failure and have fusion fail gracefully. Producer fusion looks a bit tricky because one would need to invert the “indexing map”, (d0, d1) → (d0 * SIZE + d1) for collapse and (d0) → (d0 floordiv SIZE, d0 mod SIZE) for expand, in order to compute through it, which is problematic for the affine abstraction we use in Linalg. That is not to say this is impossible; the interfaces are designed not to require affine expressions.
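For concreteness, here is a minimal Python sketch of the two index maps mentioned above and a check that they are mutual inverses (the `SIZE` constant and function names are illustrative; the floordiv/mod on the expand side is what sits awkwardly with purely affine maps).

```python
SIZE = 6  # hypothetical static inner dimension

def collapse_map(d0, d1):
    # collapse_shape indexing: (d0, d1) -> (d0 * SIZE + d1)
    return d0 * SIZE + d1

def expand_map(d):
    # expand_shape indexing: (d0) -> (d0 floordiv SIZE, d0 mod SIZE)
    return d // SIZE, d % SIZE

# Producer fusion through a collapse/expand needs exactly this
# inversion: the two maps undo each other on every in-range index.
for d in range(4 * SIZE):
    assert collapse_map(*expand_map(d)) == d
for d0 in range(4):
    for d1 in range(SIZE):
        assert expand_map(collapse_map(d0, d1)) == (d0, d1)
```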
Note that it is possible to register interfaces on the op without modifying the op itself, which allows for non-disruptive downstream experimentation here. I’d say it would be interesting for someone to try doing that and report back with the findings.
Diving a bit deeper into the existing BubbleUp patterns, which replace expand/collapse_shape → extract_slice with extract_slice → expand/collapse_shape, I realized that the limitation I previously referenced is in fact inherent and cannot be addressed using the TilingInterface alone. So the main issue is the loss of context with the initial IR throughout the transformation.
The interface does allow returning failure in case fusion is not possible. I suppose that any TilingInterface implementation is likely to share similar logic with the BubbleUpExtractSlice patterns.
I agree, I think it will allow us to move forward on our downstream project and hopefully register it upstream later on.
Sure:
```mlir
%0 = linalg.generic
%1 = tensor.expand_shape %0
%2 = linalg.generic %1
```
After tiling %2 and either fusing %1 through a new TilingInterface implementation or applying cleanup patterns, I have:
```mlir
%0 = linalg.generic
%1 = tensor.expand_shape %0
%2 = linalg.generic %1
scf.for {
  %3 = tensor.extract_slice %0
  %4 = tensor.expand_shape %3
  %5 = linalg.generic %4
  scf.yield %5
}
```
At this point I can decide whether or not I should fuse %0 into the tiled loop. If expand_shape was fused using its interface, I retain a mapping between %4 and %1, making it clear that %0 has a single real consumer.
However, if the BubbleUp pattern was applied, this mapping is lost and the analysis becomes more difficult.
Another, more specific case is tensor.insert_slice. Consider:
I don’t know what the status here is, but it seems unreasonable to implement the tiling interface for such operations. If you want to try something out, you can implement the tiling interface for this op as an external interface and do that downstream. I don’t think this will work out too well, though.
This hasn’t been implemented downstream yet. Why is it unreasonable to implement the tiling interface for such operations?
Could you elaborate on why you think this won’t work out well?
As I see it, using cleanup patterns between fusion steps has a serious drawback: the mapping between the original op and its tiled (or folded) version is lost, since cleanup patterns don’t provide that information, as Ziv stated. I would like to better understand what the drawbacks are of implementing the tiling interface for ops that we wouldn’t define as “compute operations”.
Yeah, that is the problem with cleanup patterns. Ideally you don’t want cleanup patterns either; that is adding too much in one step. If I were doing this myself, I would remove the cleanup patterns as well, but they are a pragmatic choice.
From my perspective, it comes down to what tiling really is in the traditional polyhedral compiler sense. You are taking an iteration space where each iteration does “some amount of work” or “compute”, and then modifying the schedule to change where this compute happens. So tiling and fusion really are a matter of reordering the computation under some constraints. Things like reshapes aren’t really doing “work”, so in that sense they do not seem to be candidates for tiling. The cleanup patterns are something I would remove completely. The way to ideally structure this is “preprocess for tile and fuse”, followed by “tile and fuse”, followed by “post-process for tile and fuse”. The tile and fuse methods give you back all the information about the transformations applied; it’s the cleanup patterns that lose this information. And now we are following along that path to say, well, maybe we should tile these operations instead, which seems to be going down a path that does not connect to what tile and fuse transformations are fundamentally meant for.
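The “tiling only reorders compute” view can be sketched in a few lines of Python (purely illustrative; function names and tile sizes are mine): the tiled schedule visits exactly the same iteration set as the untiled one, just in a different order.

```python
def untiled(n, m):
    # The original schedule: plain row-major traversal of an n x m
    # iteration space.
    return [(i, j) for i in range(n) for j in range(m)]

def tiled(n, m, ti, tj):
    # The tiled schedule: same iteration set, grouped into ti x tj
    # tiles. Only *where in the schedule* each iteration runs changes.
    out = []
    for ii in range(0, n, ti):
        for jj in range(0, m, tj):
            for i in range(ii, min(ii + ti, n)):
                for j in range(jj, min(jj + tj, m)):
                    out.append((i, j))
    return out

a, b = untiled(4, 4), tiled(4, 4, 2, 2)
assert sorted(a) == sorted(b)  # same iterations, i.e. the same "work"
assert a != b                  # but a different schedule
```

On this view an op that performs no work per iteration has nothing for tiling to reorder, which is the objection being made.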
I’m trying to look at this from a pragmatic point of view, where I just want to get the best results out of the tile and fuse transformation. In that view, I don’t really see a difference between something we would instinctively view as a compute op, like linalg.abs, and something we would instinctively view as a non-compute op, like tensor.collapse_shape.
I definitely agree with you that the tile and fuse methods give you the information you require about what happened in the transformations they performed and cleanup patterns don’t, so we should strive to remove the cleanup patterns from being part of the tile and fuse transformation.
Given my outlook from the first paragraph, I thought that the easiest way to get rid of the cleanup patterns would be to just implement the tiling interface for any op we come across during the tile and fuse process. But I have found that there are cases where this is not possible. For example, when tiling with dynamic tensor shapes, tensor.dim ops often need to be added to the IR. These ops do not have a tensor result, so they cannot be tiled, but they can create additional dependencies that prevent fusion; or, if fusion is done, they prohibit the original non-tiled compute op from being erased during dead code elimination.
Having a pre and/or post step to fusion to handle such cases seems like a possible way to get the pragmatic results that I’m after as well as leaving the tiling interface only for ops that we look at as compute ops. But for that to work the pre/post step would need to apply some transformations and also provide the relevant information required for continued analysis similar to the information that the tiling interface supplies. Do you have any thoughts on how to implement this, would it require a new interface or some other mechanism?
In my understanding, the polyhedral representation of tiling and fusion operates on opaque statements described by the iteration domain, access patterns and potentially some cost. It doesn’t really care about the semantics of the statement beyond that. The statement may well be a noop; the representation wouldn’t tell the difference and would happily tile the surrounding loops or schedule their iterations anywhere in the program (because there are no dependencies). A collapse_shape is not even a noop; it’s more of an A’[N*i+j] = A[i][j], but the in-place-ness makes it a noop. In the polyhedral representation, it would just compose with the access function. From this view, tiling these should be just fine. Fusion may be problematic if the composition makes the access function non-invertible.
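The “compose with the access function” point can be shown with a small Python sketch (shapes and names are illustrative): treating collapse_shape as the statement A2[N*i+j] = A[i][j], a consumer reading A2[k] can instead read A[k floordiv N][k mod N] directly, fusing the copy away by composing access functions.

```python
N = 5
A = [[i * N + j for j in range(N)] for i in range(3)]

# collapse_shape as a polyhedral statement: A2[N*i + j] = A[i][j]
A2 = [0] * (3 * N)
for i in range(3):
    for j in range(N):
        A2[N * i + j] = A[i][j]

# A consumer reading A2[k] can compose the (inverted) access function
# and read A[k // N][k % N] from the producer directly:
for k in range(3 * N):
    assert A2[k] == A[k // N][k % N]
```

The fusion caveat in the last sentence above is exactly when this inversion step fails: if the composed access function cannot be inverted, the consumer’s indices cannot be mapped back through the reshape.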