Thanks for your suggestion and the very useful example.
The example you provided does indeed handle a single result with two uses (`%mm2` and `return`). However, I found that whether the result of `fusedProducer` gets a yield/insert_slice is controlled by `dominanceInfo`. Let's consider the following topology, which differs from your example only in `newOp`:
    matmul0
    /     \
newOp    matmul1
    \     /
    return
Given the current check logic, `newOp`, one of the users of `matmul0`, may be located before `matmul1`. As a result, `matmul1` would not dominate `newOp` (this seems to be caused by this line). Then `matmul0` would not yield its result, and `newOp` would still have to use the untiled `matmul0`, which is exactly the concern you raised.
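To make the ordering issue concrete, here is a toy C++ model of the check as I read it. It is only a sketch: `ToyBlock` and `wouldYieldProducer` are hypothetical names, not MLIR APIs, and I assume that the result is yielded only when some other user is properly dominated by the tiled loop, which sits at the position of `matmul1`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for one basic block: ops are listed in program order.
// Within a single block, op A properly dominates op B iff A comes before B.
struct ToyBlock {
  std::vector<std::string> ops;

  int indexOf(const std::string &name) const {
    for (size_t i = 0; i < ops.size(); ++i)
      if (ops[i] == name)
        return static_cast<int>(i);
    return -1;
  }

  bool properlyDominates(const std::string &a, const std::string &b) const {
    int ia = indexOf(a), ib = indexOf(b);
    return ia >= 0 && ib >= 0 && ia < ib;
  }
};

// Hypothetical model of the check (an assumption, not the upstream code):
// yield the fused producer's result only if at least one of its other users
// is properly dominated by the tiled loop, anchored at `loopAnchor`.
bool wouldYieldProducer(const ToyBlock &block, const std::string &loopAnchor,
                        const std::vector<std::string> &otherUsers) {
  for (const auto &user : otherUsers)
    if (block.properlyDominates(loopAnchor, user))
      return true;
  return false;
}
```

With the order `matmul0, newOp, matmul1, return` the function returns false for the user `newOp`, while with `matmul0, matmul1, newOp, return` it returns true, so the outcome depends purely on where `newOp` happens to be placed.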
As you said, whether it is multiple results or a single result with multiple uses, I agree that they should be handled uniformly and be controllable through `TilingOption`. Here are some points that need discussion:
- I also found your detailed comment here about whether to reconstruct the inner loop of the fused producer. For the case you showed, I would first like to confirm something with you:
/// ```mlir
/// %0 = linalg.matmul ins(...) outs(...) -> tensor<?x?xf32>
/// %1 = linalg.matmul ins(%0, ..) outs(...) -> tensor<?x?xf32>
/// ```
///
/// If `%1` is tiled in a 2D fashion and `%0` is fused with it, the resulting IR
/// is
///
/// ```mlir
/// %t1_0 = scf.for .... iter_args(%arg0 = ...) {
///   %t1_1 = scf.for ... iter_args(%arg1 = %arg0) {
///     ...
///     %t1_2 = linalg.matmul ins(...) outs(...) -> tensor<?x?xf32>
///     %t1_3 = linalg.matmul ins(%t1_2, ...)
///     %t1_4 = tensor.insert_slice %t1_3 into %arg1 ...
///     scf.yield %t1_4
///   }
///   scf.yield %t1_1
/// }
/// ```
I guess your concern here is the redundant computation of the first `matmul`: it cannot share the `%t1_1` loop generated by the second `matmul` and has to generate another loop over the columns of the first `matmul`. Is that right?
If so, I think the underlying question is why this kind of producer fusion is allowed at all from a performance perspective.
If this scenario should be supported anyway for robustness, here is a possible resulting IR:
// generated by the row of matmul1 and shared by matmul0
%1:2 = scf.for %arg1 = to iter_args(%arg1_for_mm0 = %?, %arg1_for_mm1 = %?)
  // column of matmul1
  %2:2 = scf.for %arg2 = to iter_args(%arg2_for_mm0 = %arg1_for_mm0, %arg2_for_mm1 = %arg1_for_mm1)
    // column of matmul0
    %3 = scf.for %arg3 = to iter_args(%arg3_for_mm0 = %arg2_for_mm0)
      %t0 = matmul(...)
      // insert matmul0, the `SliceParameters` can be inferred by `AffineMap`
      %insert0 = insert %t0 into %arg3_for_mm0[%arg1, %arg3] [...]
      // yield matmul0
      yield %insert0
    %t1 = matmul(...)
    // insert matmul1
    %insert1 = insert %t1 into %arg2_for_mm1[%arg1, %arg2] [...]
    // yield matmul0 (actually the result of `scf.for`) and matmul1
    yield %3, %insert1
  // yield matmul0 and matmul1
  yield %2#0, %2#1
Next,
- The control function may be better guided by whether any use of any result of `fusedProducer` remains untiled. If so, an insert_slice and a yield are needed for that result (or those results). As illustrated above, the tiling order might be `matmul1` -> `matmul0`, after which `newOp` is still not tiled, so we should insert/yield the result of `matmul0`, in which case we can:
  - remove the redundant computation of `matmul0`.
  - go on to fuse `newOp`.
- Currently the yield/insert process is created by `yieldReplacementForFusedProducer`, which lives outside `tileAndFuseProducerOfSlice`. Shall we move the former into the latter? Several possible reasons:
  - it is used almost exclusively with `tileAndFuseProducerOfSlice`, at least so far.
  - since `tileAndFuseProducerOfSlice` is a very important and frequently used fusion utility, it is reasonable for it to accept `fusionControlFn` as an argument. Then every argument that `yieldReplacementForFusedProducer` needs is already available in `tileAndFuseProducerOfSlice`.
  - it is friendlier for other developers or users, who may forget, or not be aware, that they have to deal with it.
- Certainly, as you may have noticed above, this also requires extending the member variables of `scf::SCFFuseProducerOfSliceResult` with a new field, perhaps called `SmallVector<OpResult> origResultsYield`, indicating which results are inserted into a new sliceOp and yielded by a loop. The caller can then continue to get the necessary information, just as you have done here.
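
To illustrate the proposed control rule, here is a toy C++ sketch. It does not use the MLIR API: `ToyUse`, `ToyResult`, `ToyFuseResult` and `selectResultsToYield` are hypothetical names, results are identified by plain strings, and the `tiled` flag is assumed to be known up front. The only idea it shows is "yield a result when at least one of its uses remains untiled", with the selection recorded in a field that plays the role of the suggested `origResultsYield`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One use of a producer result; `tiled` says whether the user was fused
// into the tiled loop nest (assumed to be known by the caller).
struct ToyUse {
  std::string user;
  bool tiled;
};

// One result of the fused producer together with all of its uses.
struct ToyResult {
  std::string name;
  std::vector<ToyUse> uses;
};

// Hypothetical analogue of the extended fusion result: it records which
// original results were inserted into a slice and yielded by the loop.
struct ToyFuseResult {
  std::vector<std::string> origResultsYield;
};

// Select every result that still has at least one untiled use.
ToyFuseResult selectResultsToYield(const std::vector<ToyResult> &results) {
  ToyFuseResult out;
  for (const auto &result : results) {
    bool hasUntiledUse = false;
    for (const auto &use : result.uses)
      if (!use.tiled)
        hasUntiledUse = true;
    if (hasUntiledUse)
      out.origResultsYield.push_back(result.name);
  }
  return out;
}
```

For the topology above, `matmul0` has a tiled use (`matmul1`) and an untiled one (`newOp`), so it is selected regardless of where `newOp` sits in the block.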
Of course, I am also glad to prepare a PR once we reach agreement with the community on this topic.