What if Producer has multiple Results or Result with multiple Uses for `tileAndFuseProducerOfSlice`

Hi, all.

In tileConsumerAndFuseProducersUsingSCF, tileAndFuseProducerOfSlice is used to fuse Producer into containingOp(like scf.for). E.g.

%1 = op1
// op2 has two result
%2_1, %2_2 = op2(%1)
%3 = scf.for(...){
   // tiled op3, the Owner of %2_1
   %ex_2_1  = extract %2_1
   %tiled_3... = tiled_op3(%ex_2_1)
   %ins_3 = insert %tiled_3 into .. ;
   yield  %ins_3
}
// the Owner of %2_2
%4 = op4(%2_2)

Through current tileAndFuseProducerOfSlice, above MLIR will be transformed into

%1 = op1
%2_1, %2_2 = op2(%1) // TO be removed by caller
%3 = scf.for(...){
   // tiled op2, the Owner of %1
   %ex_1 = extract %1
  // generated by `getTiledImplementation`
   %tiled_2_1, %tiled_2_2 = tiled_op2(ex_1)
   // tiled op3, the Owner of %tiled_2_1
   %ex_2_1  = extract %2_1 // TO be removed by caller
   %tiled_3... = tiled_op3(%tiled_2_1)
   %ins_3 = insert %tiled_3 into .. ;
   yield  %ins_3
}
// the Owner of %2_2
%4 = op4(%2_2)

It seems that no new insert_slice and yield value were created for %tiled_2_2? Do we expect op4 still use untiled %2_2 rather yeild result of %tiled_2_2? even though op2 has been tiled and fused?

If I have any misunderstanding, please correct me, thx.

Another similar case for Producer with single Result but owning multiple users:

%1 = op1
%2= op2(%1)
%3 = scf.for(...){
   // tiled op3, one of Owners of Value %2
   %ex_2  = extract %2
   %tiled_3... = tiled_op3(%ex_2)
   %ins_3 = insert %tiled_3 into .. ;
   yield  %ins_3
}
// another Owner of %2: op4
%4 = op4(%2)

After tileAndFuseProducerOfSlice:

%1 = op1
%2 = op2(%1) // TO be removed by caller
%3 = scf.for(...){
   // tiled op2, the Owner of %1
   %ex_1 = extract %1
  // generated by `getTiledImplementation`
   %tiled_2 = tiled_op2(ex_1)
   // tiled op3, the Owner of %tiled_2
   %ex_2  = extract %2 // TO be removed by caller
   %tiled_3... = tiled_op3(%tiled_2)
   %ins_3 = insert %tiled_3 into .. ;
   yield  %ins_3
}
// the Owner of %2
%4 = op4(%2)

In this case, The Value %2 has two users, respectively op3 and op4. When op2 was tiled and fused into containingOp(generated by op3), it seems that another %2 user op4 still use untiled %2?

If we can create new insert_slice and yield for value %tiled_2, it is feasible to continually fuse op4 via possibly coming tileAndFuseConsumerOfSlice. Anyway, the original untiled op2 can be eliminated safely in avoid of recomputation, which is actually a dead code in IR.

(answering one post at a time)

Yeah, this is expected (at least with the current implementation, happy to get correcting fixes). The issue is that it isnt clear how to “reconstruct” %2_2 within the scf.for loop. Maybe given that you know the tile of the iteration space of op2 used within the loop based on its use of %2_1 you can get the tile of %2_2 created and use tensor.insert_slice to reconstruct the replacement for %2_2. That would require you knowing somehow every iteration of the loop evaluates a different slice of %2_2 and the insert_slice puts things back the right way. Not cleat how to do that in a general setting, but maybe supportable in some narrow cases.
If you have a more concrete example, would be great to enhance the current methods to handle this case.

If you provide a repro, id be happy to look into it… I think this should be supportable by using the control function here appropriately.

I think this example is what you are looking for ?

Thanks for your suggestion and very useful example.

In example you provided, it indeed deals with the single result with two uses(%mm2 and return). However, I found that whether yield/insert_slice result of fusedProducer is controlled by dominanceInfo. Lets consider following topology, which only differs from your example in newOp:

   matmul0
   /    \
newOp matmul1    
   \       /
    return

Given current check logic, newOp, one of the user of matmul0, would (possibly) locate before matmul1. As the result, the matmul1 would not dominates newOp(seems caused by this line. Then the matmul0 would not yield its result and the newOp still has to use untiled matmul0 as worried.

As you said, no matter multiple Results or Result with multiple Uses, I agree that they should be unified and feasibly controlled by TillingOption. Here are some points need to talk:

  1. I also found your detailed comment about whether reconstruct fused producer inner loop here. For the cases what you showed, I want to confirm something with you firstly:
/// ```mlir
/// %0 = linalg.matmul ins(...) outs(...) -> tensor<?x?xf32>
/// %1 = linalg.matmul ins(%0, ..) outs(...) -> tensor<?x?x?f32>
///
/// If `%1` is tiled in a 2D fashion and `%0` is fused with it, the resulting IR
/// is
///
/// ```mlir
/// %t1_0 = scf.for .... iter_args(%arg0 = ...) {
///   %t1_1 = scf.for ... iter_args(%arg1 = %arg0) {
///     ...
///     %t1_2 = linalg.matmul ins(...) outs(...) -> tensor<?x?xf32>
///     %t1_3 = linalg.matmul ins(%t1_2, ...)
///     %t1_4 = tensor.insert_slice %t1_3 into %arg1 ...
///     scf.yield %t1_4
///   }
///   scf.yield %t1_1
/// }
/// ```

I guess what you concern here is redundant computation about the first matmul because that it could not share the %t1_1 loop generated by the second matmul and has to generate another loop related to column of first matmul, is it right?
If so, I think the issue actually results from why this kind of producer fusion is allowed from the perspective of performance.

If this scenario should be supported for robustness anyway, here is the possible resulting IR:

// generated by the row of matmul1 and shared by matmul0
%1:2 = scf.for %arg1=  to  iter_args(%arg1_for_mm0=%?, %arg1_for_mm1=%?) 
   // column of matmul1
   %2:2 = scf.for %arg2 =  to iter_args(%arg2_for_mm0=%arg1_for_mm0, %arg2_for_mm1=%arg1_for_mm1) 
     // column of matmul0
     %3 = scf.for %arg3 = to iter_args(%arg3_for_mm1 = %arg2_for_mm1) 
         %t0 = matmul(...)
         // insert matmul0, the `SliceParameters` can be inferred by `AffineMap`
         %insert0 = insert %t0 into %arg3_for_mm1[%arg1,%arg3] [...]
         // yield matmul0   
         yield %insert0
     %t1 = matmul(...)
     // insert matmul1
     %insert1= insert %t1 into %arg2_for_mm0[%arg1,%arg2] [...]
     // yield matmul0(actually the result of `scf.for`) and matmul1
     yield %3, %insert1
  // yield matmul0 and matmul1
  yield %2#0, %2#1

Next,

  1. the control function maybe better guided by whether any Uses of all Result of fusedProducer remain untiled. If so, insert slice and yield for it/them are needed. As illustrated above, the tiled order might be matmul1->matmul0, and then newOp is still not tiled, so we should insert/yield for result of matmul0, in which case we can:
    1. remove redundant computation about matmul0.
    2. furthermore fuse newOp.
  2. currently yield/insert process is created by yieldReplacementForFusedProducer which is outside tileAndFuseProducerOfSlice. Shall we move the former one into latter for several possible reasons:
    1. it is almost only used in tileAndFuseProducerOfSlice at least so far.
    2. since that tileAndFuseProducerOfSlice is very important and frequently used utility function related to fusion, it is reasonable to accept fusionControlFn as its argument. Then, all arguments yieldReplacementForFusedProducer needed is covered in tileAndFuseProducerOfSlice.
    3. it looks more friendly for other developers or users who may forget or not aware to deal with it.
  3. Certainly, as you may find above, it is also required to extend current member variable of scf::SCFFuseProducerOfSliceResult, which maybe called SmallVector<OpResult> origResultsYield indicating which results are inserted to a new sliceOp and yield by a loop. Then, the caller can continue to get necessary information just like what you have done here.

Surely, I am also glad to prepare a PR when reaching agreement with community about this topic.

1 Like

That is just one example. You could control it anyway based on what makes sense for your use case. That is why there is a callback. To allow callers to make that decision based on your use case. In general I havent found a “one size fits all” solution here.

Lets separate out functionality and performance. The transformation is not making any opinions on performance (there is just not enough information for the transformation to make this decision). From the transformations perspective it is entirely valid to create this IR, it is just going to be introducing redundant computation. Whether that is a good idea or not is upto the caller to decide.

Regarding the IR you posted (I dont have all the context of these methods in my head right now w.r.t implementation details, but) it is making assumptions about what is efficient on a given platform. For example, it might be efficient for a particular use case (i.e. shape and platform) where you want to just redundantly compute the tile of the first matmul, store it in some local scratch memory and use it, instead of writing all the way back to some global memory space and might also require synchronization. See the forall version of the generated code. There your suggested IR would not work (I think)
. Your final IR might be what you desire, but it must be broken into individual steps that would use a combination of tiling (potentially multiple times) + promotion to get the right state. The transformation is just one step of a larger sequence of transformations to get the final IR state.

In general this is the sequence we’ve used

  1. Tile any root to get the initial loop structure you want
  2. Then use tileAndFuse to fuse into that loop.
  3. Further apply tiling to either the tiled root op or fused ops to get more of the final loop structure you want.
    This basically requires caller to “analyze” the sequence of ops and figure out the “strategy” to use. These transformations are meant to be building blocks of that strategy and are purposely un-opinionated. They “do what you tell it to do”.

That just seems like a code-organization question. Keeping it out makes the code more readable. It is also exposed so that someone could put together a worklist without having to call the umbrella entry point of tileAndFuseUsingSCF.

In general, as far as I can remember right now, this method is meant to do a single step of taking a op -> tensor.extract_slice and converting it to tensor.extract_slice -> tiled_op transformation. Information about uses of the op seem outside what this method needs to care about.

All right, thanks for your detailed explanation.

As for how to deal with multiple results case, I think we should at least provide the option for user to yield all of them just like enabled yieldReplacementForFusedProducer for multiple uses case.

It looks like you prefer to reconstruct %2_2(another OpResult) outside tileAndFuseProducerOfSlice, just like what you did for yieldReplacementForFusedProducer here?

If so, let me try to break down what we need to enhance about how to yield other OpResult of fused producer:

  1. As you mentioned above, it is up to the caller to make decision and should be also controlled by fusionControlFn, which need to return the third Boolean value indicating whether yield other OpResult of fusedProducer within tiled loop.
  2. Similar to yieldReplacementForFusedProducer, we need another function like yieldOtherResultForFusedProducer to deal with fusedProducer with multiple results.
  3. the caller need to replace the original Value with new Result of Loop.

Another feasible solution is that we can directly enhance yieldReplacementForFusedProducer to deal with multiple OpResult at the same time. It just lacks slice information which can be calculated via AffineMap of fusableProducer.

[Edited]: It seems addInitOperandsToLoopNest only support scf.for but excluding scf.forall. We also need to extend it.

Looking forward to your opinion on this.

The patch is available here. Please help to review, thanks.