Generally, we can fuse the producers of an op into a containing op, but if a producer also has users outside the containing op, it cannot be fused into the containing op completely.
For example:
func.func @add(%a: tensor<16x10xf32>, %b: tensor<10x10xf32>) -> (tensor<16x10xf32>, tensor<16x10xf32>) {
  %init_0 = tensor.empty() : tensor<16x10xf32>
  %matmul = linalg.matmul ins(%a, %b : tensor<16x10xf32>, tensor<10x10xf32>)
                          outs(%init_0 : tensor<16x10xf32>) -> tensor<16x10xf32>
  %init_1 = tensor.empty() : tensor<16x10xf32>
  %res0 = linalg.elemwise_unary {fun = #linalg.unary_fn<abs>, "res0"}
            ins(%matmul : tensor<16x10xf32>) outs(%init_1 : tensor<16x10xf32>) -> tensor<16x10xf32>
  %init_2 = tensor.empty() : tensor<16x10xf32>
  %res1 = linalg.elemwise_unary {fun = #linalg.unary_fn<ceil>, "res1"}
            ins(%matmul : tensor<16x10xf32>) outs(%init_2 : tensor<16x10xf32>) -> tensor<16x10xf32>
  return %res0, %res1 : tensor<16x10xf32>, tensor<16x10xf32>
}
transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):
  transform.sequence %arg0 : !pdl.operation failures(propagate) {
  ^bb0(%arg1: !pdl.operation):
    %res0 = transform.structured.match attributes{"res0"} in %arg1 : (!pdl.operation) -> !pdl.operation
    %foreach_thread_op1, %tiled_op1 = transform.structured.tile_to_foreach_thread_op %res0 num_threads [4, 0]
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!pdl.operation) -> !pdl.operation
    transform.structured.fuse_into_containing_op %matmul into %foreach_thread_op1
  }
}
A copy of the ‘matmul’ op has to stay outside ‘foreach_thread_op1’, because it still has a user (‘res1’) outside ‘foreach_thread_op1’.
I want to tile and fuse all of the operations into a single foreach_thread_op, but I can’t do that with the current infrastructure.
I think that if we could fuse consumers into the containing op, we could tile the ‘matmul’ first and then fuse its two consumers into the resulting loop. Of course, this would require adding a ‘generateOperandTileValue’ method to the tiling interface, as the counterpart of ‘generateResultTileValue’, and supporting the sinking of the insert slice.
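To make the goal concrete, here is a hand-written sketch of the IR I would like to end up with. It is not the output of any existing transform. The value names (%off, %a_tile, %mm, and so on) are made up for illustration, and I am assuming 4 threads over the 16 rows, so each thread handles a 4x10 tile:

```mlir
// Hypothetical target IR, written by hand: the tiled matmul and both of its
// consumers live inside one scf.foreach_thread, and nothing remains outside.
func.func @add(%a: tensor<16x10xf32>, %b: tensor<10x10xf32>) -> (tensor<16x10xf32>, tensor<16x10xf32>) {
  %c4 = arith.constant 4 : index
  %init_0 = tensor.empty() : tensor<16x10xf32>
  %init_1 = tensor.empty() : tensor<16x10xf32>
  %init_2 = tensor.empty() : tensor<16x10xf32>
  %res:2 = scf.foreach_thread (%i) in (%c4)
      shared_outs(%o1 = %init_1, %o2 = %init_2) -> (tensor<16x10xf32>, tensor<16x10xf32>) {
    // Row offset of this thread's tile: 4 rows per thread.
    %off = affine.apply affine_map<(d0) -> (d0 * 4)>(%i)
    // Tiled producer: each thread computes only its own 4x10 slice of the matmul.
    %a_tile = tensor.extract_slice %a[%off, 0] [4, 10] [1, 1] : tensor<16x10xf32> to tensor<4x10xf32>
    %mm_init = tensor.extract_slice %init_0[%off, 0] [4, 10] [1, 1] : tensor<16x10xf32> to tensor<4x10xf32>
    %mm = linalg.matmul ins(%a_tile, %b : tensor<4x10xf32>, tensor<10x10xf32>)
                        outs(%mm_init : tensor<4x10xf32>) -> tensor<4x10xf32>
    // Both consumers read the same matmul tile.
    %o1_tile = tensor.extract_slice %o1[%off, 0] [4, 10] [1, 1] : tensor<16x10xf32> to tensor<4x10xf32>
    %abs = linalg.elemwise_unary {fun = #linalg.unary_fn<abs>}
             ins(%mm : tensor<4x10xf32>) outs(%o1_tile : tensor<4x10xf32>) -> tensor<4x10xf32>
    %o2_tile = tensor.extract_slice %o2[%off, 0] [4, 10] [1, 1] : tensor<16x10xf32> to tensor<4x10xf32>
    %ceil = linalg.elemwise_unary {fun = #linalg.unary_fn<ceil>}
              ins(%mm : tensor<4x10xf32>) outs(%o2_tile : tensor<4x10xf32>) -> tensor<4x10xf32>
    scf.foreach_thread.perform_concurrently {
      tensor.parallel_insert_slice %abs into %o1[%off, 0] [4, 10] [1, 1] : tensor<4x10xf32> into tensor<16x10xf32>
      tensor.parallel_insert_slice %ceil into %o2[%off, 0] [4, 10] [1, 1] : tensor<4x10xf32> into tensor<16x10xf32>
    }
  }
  return %res#0, %res#1 : tensor<16x10xf32>, tensor<16x10xf32>
}
```

In this form the loop no longer yields the matmul result at all; only the two consumer results are inserted back, which is why the insert slice of the tiled ‘matmul’ would have to be sunk past its consumers.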
Does MLIR plan to provide infrastructure for fusing consumers into the containing op in the future? If not, is there a better way to fuse all of these operations into one foreach_thread_op?