[RFC] Tiling interface supports fuse consumer

TLDR;

This RFC proposes an addition to the tiling interface by introducing the generateOperandTileValue method, allowing for fusion with consumers.

Motivation
The tiling interface currently only supports fusing producers, via the interface method ‘generateResultTileValue’; there is no counterpart such as ‘generateOperandTileValue’ for fusing consumers. This means we cannot fuse a consumer through the tiling interface.

Consider the following input CFG:

      A
    /   \
   B     C

We cannot achieve the following with the current tile-and-fuse infrastructure: merging these three ops into the same loop through tiling and fusion. Tiling B (or C) and fusing A in as a producer leaves the other consumer of A outside the loop, because there is no way to fuse it in as a consumer.

Previous Discussions

See the earlier thread “Any plan to fuse consumers?”.

Proposal

First, we will add a new method to the tiling interface:


InterfaceMethod<
 /*desc=*/[{
   Method to generate the code that consumes a tile of an operand.

   Generates the IR that computes the part of the operation that uses
   exactly the given tile of the specified operand.

   - `offsets` provides the offset of the tile within the operand,
     i.e., in the coordinate system of the original operand (as
     opposed to a zero-based offset "relative" to some slice of it).
   - `sizes` provides the size of the tile.
 }],
 /*retType=*/"FailureOr<TilingResult>",
 /*methodName=*/"generateOperandTileValue",
 /*args=*/(ins
   "OpBuilder &":$b,
   "unsigned":$operandNumber,
   "ArrayRef<OpFoldResult>":$offsets,
   "ArrayRef<OpFoldResult>":$sizes),
 /*methodBody=*/"",
 /*defaultImplementation=*/[{
   return failure();
 }]
>

Next, we will implement this method for ops that implement the tiling interface. For example, for linalg ops, we can map the operand tile back to the iteration domain using the “getMatchingIndexingMap” method and then call “getTiledImplementation” to produce the tiled computation.
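To make this concrete, here is a rough sketch (not final code) of what the linalg implementation could look like, assuming the chosen operand’s indexing map is a projected permutation; the function name and local variables are illustrative only:

  FailureOr<TilingResult> generateOperandTileValueForLinalg(
      OpBuilder &b, linalg::LinalgOp linalgOp, unsigned operandNumber,
      ArrayRef<OpFoldResult> offsets, ArrayRef<OpFoldResult> sizes) {
    // Map the operand tile back to a tile of the iteration domain.
    OpOperand &operand = linalgOp->getOpOperand(operandNumber);
    AffineMap map = linalgOp.getMatchingIndexingMap(&operand);
    if (!map.isProjectedPermutation())
      return failure();

    // Start from the full iteration domain of the op.
    auto tilingOp = cast<TilingInterface>(linalgOp.getOperation());
    SmallVector<Range> domain = tilingOp.getIterationDomain(b);
    SmallVector<OpFoldResult> iterOffsets, iterSizes;
    for (Range r : domain) {
      iterOffsets.push_back(r.offset);
      iterSizes.push_back(r.size);
    }

    // Overwrite the iteration-space dimensions constrained by the operand
    // tile with the tile's offsets/sizes.
    for (auto [idx, expr] : llvm::enumerate(map.getResults())) {
      unsigned dim = cast<AffineDimExpr>(expr).getPosition();
      iterOffsets[dim] = offsets[idx];
      iterSizes[dim] = sizes[idx];
    }

    // Reuse the existing tiling entry point on the mapped iteration tile.
    return tilingOp.getTiledImplementation(b, iterOffsets, iterSizes);
  }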

What about the fusion?
In the current implementation, we achieve loop fusion mostly through FuseIntoContainingOp, so we use it as an example. FuseIntoContainingOp finds a tensor.extract_slice of the producer’s result inside the loop to obtain the result tile info, and then calls the generateResultTileValue method to generate the producer’s tile inside the loop.

For example, consider the following input:

   A = "foo.op"(I)
   scf.forall (...) {
      A_slice = tensor.extract_slice(A) // Anchor
      B = "foo.op1"(A_slice) 
   }
   "foo.op2"(A)

After fusing foo.op0 into the scf.forall, we get the following IR:

   A = "foo.op"(I)
   scf.forall (...) {
      I_slice = tensor.extract_slice(I)
      A_local = "foo.op0"(I_slice)
      B = "foo.op1"(A_local) 
   }
   "foo.op2"(A)

To support fusion with consumers, we can similarly find the tensor.parallel_insert_slice corresponding to the loop output to obtain the operand tile info. Then, we can call the generateOperandTileValue method to generate the consumer op within the loop.

For example:

  %1 = scf.forall ... {
    %0 = ...
    scf.forall.in_parallel {                                                    
      tensor.parallel_insert_slice %0 into %o[...][...][...] : ... // Anchor                         
    }                                                                           
  } 
  %2 = "foo.op0"(%1)
  "foo.op1"(%1)
  "foo.op2"(%2)

After fusing foo.op0 into the scf.forall, we get the following IR:

  %1:2 = scf.forall ... {
    %0 = ...
    %2_local = "foo.op0"(%0)
    scf.forall.in_parallel {                                                    
      tensor.parallel_insert_slice %0 into %o[...][...][...] : ... // Anchor 
      tensor.parallel_insert_slice %2_local into %o1[...][...][...] : ... // Anchor 
    }                                                                           
  } 
  "foo.op1"(%1#0)
  "foo.op2"(%1#1)

It is worth noting that we will not remove the existing results of scf.forall, to avoid breaking other uses. If a result ends up with no uses, it can be removed later by DCE.
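To sketch the overall flow of consumer fusion into scf.forall (an illustration of the mechanism only, not the actual implementation): the helper getAnchorInsertSlice below is a hypothetical placeholder, and the bookkeeping for adding a new shared output and redirecting external uses is omitted.

  static FailureOr<TilingResult>
  fuseConsumerIntoForall(RewriterBase &rewriter, scf::ForallOp forallOp,
                         OpOperand &consumerOperand) {
    auto consumer = dyn_cast<TilingInterface>(consumerOperand.getOwner());
    if (!consumer)
      return failure();

    // Hypothetical helper: find the tensor.parallel_insert_slice in the
    // terminator that produces the forall result used by `consumerOperand`
    // (the "Anchor" in the examples above).
    tensor::ParallelInsertSliceOp anchor =
        getAnchorInsertSlice(forallOp, consumerOperand);
    if (!anchor)
      return failure();

    // The anchor's offsets/sizes describe exactly the tile of the consumer's
    // operand produced by this loop iteration.
    SmallVector<OpFoldResult> offsets = anchor.getMixedOffsets();
    SmallVector<OpFoldResult> sizes = anchor.getMixedSizes();

    // Generate the tiled consumer inside the loop body, right before the
    // terminator, via the proposed interface method.
    OpBuilder::InsertionGuard guard(rewriter);
    rewriter.setInsertionPoint(forallOp.getTerminator());
    FailureOr<TilingResult> tiled = consumer.generateOperandTileValue(
        rewriter, consumerOperand.getOperandNumber(), offsets, sizes);
    if (failed(tiled))
      return failure();

    // Omitted: add a new shared output and parallel_insert_slice for the
    // tiled consumer's result, and redirect uses outside the loop to the
    // new forall result.
    return tiled;
  }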

Other issues

  • As discussed in Any plan to fuse consumers?:
    @rengolin mentioned complex bufferization logic, and I didn’t understand why. The in-place logic should be specified by the op’s bufferization interface; what does this have to do with tiling itself?
  • TBD

The code for this RFC has not been implemented yet, and we hope to receive more suggestions before starting the implementation. Any suggestions and questions are welcome; thank you in advance for your attention.

Does it mean that you will need to fuse all the consumers into the loop? What happens if the CFG becomes more complicated, e.g.,

  %gemm = matmul ( %a : <16x10>, %b: <10x10> )
  %res0 = abs ( %gemm );
  %res1 = add ( %res0, %gemm);
  store %res0, %res1 to ...

With CFG:

       matmul
      /      \
    abs       |
    |  \      |
    |    add
    v     |
  output  v
        output

It looks like you will need to fuse abs first, then add; otherwise, there could be a dominance issue. If the graph becomes more complicated, the fusion order seems hard to determine.

Also, what happens with other LoopLikeInterface ops, e.g., scf.for? Are you going to model them with other insert operations (e.g., tensor.insert_slice ops)?

cc @MaheshRavishankar @nicolasvasilache who have more detailed context about TilingInterface


Exactly, this is what I meant in my original reply but couldn’t articulate too well.

There should be a way to run dominance analysis through an arbitrary graph and figure out what can be fused and what cannot, maybe even partitioning the graph, and specify the order in which those happen to influence bufferization later.

My concerns in the other thread were that, once we get to that stage, the complications span beyond just fusion. We could be modifying the CFG, changing liveness and potentially creating a lot of corner cases in bufferization.

Mathematically speaking, nothing. Practically speaking, it’s like generating LLVM IR that the instruction selection / register allocation cannot consume in any reasonable way. You can’t propose a transformation without understanding the consequences further down the pipeline.

We are only providing the mechanism for fusing consumers; the intention is not to necessarily fuse all ops. For a transform operation like fuseIntoContainingOp, it depends on how the transformation is used.

Fusing consumers does indeed introduce a dominance issue. However, I believe the root cause is not consumer fusion itself; rather, it requires users of the transform operation to analyze the dependency order of their calls. There are two reasons for this:

  1. We cannot demand that all fusions succeed; currently, fusing producers may also fail for various reasons. Therefore, it seems reasonable for consumer fusion to fail as well; the constraints on fusing consumers are just a bit more restrictive.
  2. When fusing producers, we still have an optimal fuse order, as illustrated in the following example:
       A
     /   \
    B --> C
     \   /
       D

When we start fusing from D, fusing C first is the better choice. Therefore, making reasonable choices about the fuse order can lead to better performance, and the same applies to fusing consumers.

Overall, I do hope to provide a mechanism for fusing consumers, but I cannot guarantee that it will always succeed (as mentioned earlier, fusing producers also cannot be guaranteed). As for greedy fusion, we can adjust the fuse order through built-in analysis to maximize the number of ops that can be fused.

This issue does indeed exist. We expect bufferization to be as independent as possible and capable of handling all situations. However, the complexity introduced by consumer fusion will inevitably create new corner cases, possibly requiring changes in conjunction with bufferization. Currently, we cannot assess how complex this modification might be. If there are experts in bufferization who can provide some advice, it would be helpful.

@cxy thank you so much for this post. This was pointed out to me because I was discussing exactly this with @hanchung and he mentioned this post to me.

I think the plan you have in the original post is exactly how I would go about it. In reality, “tile consumer + fuse producer” is a choice. It is equally valid to “tile producer and fuse consumer” as you have described. If this is something you are still interested in, I am happy to help review and provide suggestions. I recently cleaned up the implementation of tile and fuse using scf ops to use LoopLikeOpInterface. That should be how this is done as well.

With respect to impact on fusion and what is fused, that is a caller issue to handle. What these methods provide is just the mechanism. The caller should decide what should be fused and how.

Thank you for your approval of this plan. I will soon begin implementing this feature, and your assistance will be greatly appreciated :slightly_smiling_face:

I’m glad our views are aligned on this issue.

I have already submitted the code on GitHub: [mlir][linalg] Enable fuse consumer by cxy-1993 · Pull Request #85528 · llvm/llvm-project · GitHub. If you have time, I would greatly appreciate your help in reviewing the code. @MaheshRavishankar @hanchung @rengolin