Hello, I’m looking at the Linalg fusion code in mlir/lib/Dialect/Linalg/Transforms/Fusion.cpp, and I’m curious about this limitation:

// Must be a subview or a slice to guarantee there are loops we can fuse
// into.
auto subView = dyn_cast_or_null<SubViewOp>(consumedView.getDefiningOp());
auto slice = dyn_cast_or_null<SliceOp>(consumedView.getDefiningOp());
if (!subView && !slice) {
  LLVM_DEBUG(dbgs() << "\nNot fusable (not a subview or slice)");
  continue;
}

I’d like to understand more about what this is doing/intending. If anyone has some insight here they could share, I would be grateful.

Thanks in advance,

Ian Bearman
Principal Software Engineering Manager
Microsoft Visual C++ Team: Optimization & Code Generation /* Making your code faster, smaller, smarter! */

The idea is that Linalg ops and transformations operate on “views”.
The ops have a strong data-driven flavor as explained in the rationale and the dialect doc.

In this context, there are currently 2 types of fusion on Linalg ops:

simple pointwise tensor-level fusion, which takes 2 linalg.generic ops and combines them into a single linalg.generic op with a new region;

“tileAndFuse” buffer-level fusion in which, given a producer-consumer relationship:
a) the consumer is tiled,
b) the subview/slices of the consumer are derived for all operands,
c) the producer subcomputation is derived and computed (literally spliced just before) in the same scope as the tiled consumer.
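To make type (1) concrete, here is a rough before/after sketch. The IR below is schematic: the operand names, shapes, trait attribute, and the exact linalg.generic syntax are made up for illustration and vary across MLIR versions.

```mlir
// Before fusion: two pointwise linalg.generic ops on tensors,
// connected only through the SSA value %0.
%0 = linalg.generic #pointwise_trait ins(%a : tensor<?xf32>) {
^bb0(%x: f32):
  %e = exp %x : f32
  linalg.yield %e : f32
} -> tensor<?xf32>
%1 = linalg.generic #pointwise_trait ins(%0, %b : tensor<?xf32>, tensor<?xf32>) {
^bb0(%x: f32, %y: f32):
  %s = addf %x, %y : f32
  linalg.yield %s : f32
} -> tensor<?xf32>

// After fusion: a single linalg.generic whose region merges both bodies.
%1 = linalg.generic #pointwise_trait ins(%a, %b : tensor<?xf32>, tensor<?xf32>) {
^bb0(%x: f32, %y: f32):
  %e = exp %x : f32
  %s = addf %e, %y : f32
  linalg.yield %s : f32
} -> tensor<?xf32>
```

The key point is that this is a pure SSA use-def rewrite: no loops, subviews, or memory are involved.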

Steps a) and b) of #2 are exactly the tiling transformation in Linalg. Tiling introduces appropriate enclosing loops and creates subviews/slices.

Fusion uses these subviews to determine the subset of the producer’s computation that is needed.
Note that in the limit, when the tile size is 1, Linalg fusion will behave similarly to loop fusion.
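Schematically, tiling a pointwise consumer on buffers with tile size 256 produces something like the following (illustrative syntax only; the loop dialect and the subview spelling have changed over time):

```mlir
// Tiling introduces an enclosing loop and subviews of the operands.
// These subviews are exactly what the fusion pass keys on to decide
// which slice of the producer to recompute.
scf.for %i = %c0 to %M step %c256 {
  %inTile  = subview %in[%i] [%c256] [1]  : memref<?xf32> to memref<?xf32, #strided>
  %outTile = subview %out[%i] [%c256] [1] : memref<?xf32> to memref<?xf32, #strided>
  linalg.generic #pointwise_trait ins(%inTile) outs(%outTile) { ... }
}
```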

We could also envision other ways to trigger fusion depending on requirements.

Thank you for replying. Your explanation is helpful, though I’m in need of more education here, if you please.

For #1, as implemented, the pointwise tensor-level fusion operates only on tensor types (!op.hasBufferSemantics()). I’m interested in this type of fusion over memref types (op.hasBufferSemantics()), and I’m wondering if this is a fundamental limitation, a design choice, or just something not yet approached/implemented.

Further on #1, the --convert-linalg-to-X lowering passes seem to require memref types on operations (op.hasBufferSemantics()). For example, in LinalgOpToLoopsImpl there is an assert that this is the case. Question: am I missing a pass to lower tensor-typed operations?

On #2, after spending more time stepping through test cases for this type of fusion, I’m understanding it a bit more mechanically, but conceptually I think I’m still missing something for this pass. Can you please tell me more about the expectation that the consumer is already tiled? Is the expectation that an earlier pass has tiled some of the operations (e.g. the really expensive ones?) and the remaining ones are then fused into the tiles?

I am not sure I follow the question here. The fusion of Linalg operations on tensors is different from the fusion of Linalg operations on buffers. Both are trying to fuse operations, but in different ways. The check is a sanity check to make sure that the right approach is used based on the operation’s semantics. It would be easier to answer if you could provide some more specifics about what you are looking for.

Lowering from tensor types, which have SSA value semantics, to memref types, which have buffer semantics, requires memory allocation to happen. This is not implemented right now, AFAIK. Linalg does allow you to incrementally allocate memory, i.e. allocate a buffer for a single tensor at a time, instead of having to do a one-shot allocation for the memory of all tensors.
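As a sketch of what such an incremental tensor-to-buffer step could look like (this is hypothetical IR, not an existing pass; names and syntax are invented for illustration):

```mlir
// Tensor form: the result is just an SSA value.
%t = linalg.generic #trait ins(%a : tensor<?xf32>) { ... } -> tensor<?xf32>

// Buffer form: this one tensor result gets its own allocation, one tensor
// at a time, rather than requiring a one-shot allocation plan for every
// tensor in the function.
%buf = alloc(%n) : memref<?xf32>
linalg.generic #trait ins(%a_buf : memref<?xf32>) outs(%buf : memref<?xf32>) { ... }
```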

Linalg fuses operations based on a producer-consumer relationship. The intuition for fusing Linalg operations on buffers is that you tile the consumer operation first, i.e. you produce a tile of the result of the computation, with surrounding loops that generate all tiles of the result. Then, for each tile, you can reason about which tiles of the consumer’s operands are needed to produce that tile of the result. Then you generate each operand tile within the inter-tile loops by tiling (and fusing) the producer op. Now you have a producer that produces a tile that is immediately consumed by the consumer. All of this logic is implemented by the tileAndFuseLinalgOpAndSetMarker method here.
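The end state of that process looks roughly like this (schematic IR; %T stands for the intermediate buffer connecting producer and consumer, and the subview/loop syntax is illustrative rather than version-accurate):

```mlir
// After tileAndFuse: inside the inter-tile loop, the producer computes
// exactly the tile of %T that the tiled consumer is about to read.
scf.for %i = %c0 to %M step %ts {
  %aTile = subview %A[%i] [%ts] [1] : memref<?xf32> to memref<?xf32, #strided>
  %tTile = subview %T[%i] [%ts] [1] : memref<?xf32> to memref<?xf32, #strided>
  linalg.generic #trait ins(%aTile) outs(%tTile) { ... }   // producer tile

  %cTile = subview %C[%i] [%ts] [1] : memref<?xf32> to memref<?xf32, #strided>
  linalg.generic #trait ins(%tTile) outs(%cTile) { ... }   // consumer tile
}
```

With a tile size of 1, this degenerates to classic loop fusion.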

I hope that gives some context. Happy to go into more detail if there are things that aren’t clear yet.

Thank you again for your response. Let me clarify the first question. I’m interested in a pass that can fuse generic operations that operate on memref-typed operands. I envision a pass similar to what has been written for tensor-typed operands. The question I’m asking is why the existing implementation of fusion type #1 is based on tensor types rather than memref types. Is there a fundamental reason for this? I don’t see why not, but I haven’t considered the problem deeply.

It’s more historical than anything else. The first fusion implemented was tileAndFuse on buffers. Then we realized there are a lot of pointwise ops that can be fused trivially in the tensor world (i.e. without needing to create loops and subviews). Fusing in the tensor world also avoids allocating intermediate buffers that would only be removed again after fusion in the buffer world.

The region-based fusion implemented on tensors could also be implemented in the buffer world. Things are less simple because there are no SSA use-def chains, so you need alias analysis and have to make some assumptions. But it should be possible.

Still, the preferred path would be to fuse linalg on tensors and then allocate buffers (fusing on buffers requires allocating them first and later erasing them).

But tensor → buffer is WIP. Our collaborators from DFKI are looking at it (@pifon2a to ping the right people here).

I would also be interested to see this implemented, even if it is limited to a simple alias analysis. We just did not find the time to look at it so far.

+1. I don’t immediately see what implementing fusion on buffers (along the lines of fusion on tensors) would achieve beyond what fusion on tensors + buffer allocation already gives you to reach Linalg on buffers. @herhut, some context from you might help, based on what you are looking at: does doing fusion on buffers give you more than a different path to the same end point as fusion on tensors + buffer allocation (even a dumb buffer allocation)?