Any plan to fuse consumers?

Generally, we can fuse the producers of an op into the containing op, but if a producer has multiple users and some of them are outside the containing op, we can’t fuse the producer into the containing op completely.

For example:

func.func @add(%a: tensor<16x10xf32>, %b: tensor<10x10xf32>) -> (tensor<16x10xf32>, tensor<16x10xf32>) {
  %init_0 = tensor.empty() :tensor<16x10xf32>
  %matmul = linalg.matmul ins(%a, %b : tensor<16x10xf32>, tensor<10x10xf32>)
                          outs(%init_0 : tensor<16x10xf32>) -> tensor<16x10xf32>
  %init_1 = tensor.empty() :tensor<16x10xf32>
  %res0 = linalg.elemwise_unary {fun = #linalg.unary_fn<abs>, "res0"}
                                 ins(%matmul : tensor<16x10xf32>) outs(%init_1 : tensor<16x10xf32>) -> tensor<16x10xf32>
  %init_2 = tensor.empty() :tensor<16x10xf32>
  %res1 = linalg.elemwise_unary {fun = #linalg.unary_fn<ceil>, "res1"}
                                 ins(%matmul : tensor<16x10xf32>) outs(%init_2 : tensor<16x10xf32>) -> tensor<16x10xf32>
  return %res0, %res1 : tensor<16x10xf32>, tensor<16x10xf32>
}

transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):
  transform.sequence %arg0 : !pdl.operation failures(propagate) {
    ^bb0(%arg1: !pdl.operation):
      %res0 = transform.structured.match attributes{"res0"} in %arg1 : (!pdl.operation) -> !pdl.operation
      %foreach_thread_op1, %tiled_op1 = transform.structured.tile_to_foreach_thread_op %res0 num_threads [4, 0]
      %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!pdl.operation) -> !pdl.operation
      transform.structured.fuse_into_containing_op %matmul into %foreach_thread_op1
  }
}

We have to keep a ‘matmul’ op outside ‘foreach_thread_op1’, because it still has a user (‘res1’) outside ‘foreach_thread_op1’.
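To make the condition concrete, here is a minimal sketch in plain Python (not the MLIR API). The function name `can_erase_producer` and the string op names are hypothetical and only model the use-def situation in the IR above: a producer can disappear from the outer scope only if every one of its users ends up inside the containing op.

```python
def can_erase_producer(producer_users, users_in_container):
    """Return True when every user of the producer is inside the containing op."""
    return set(producer_users) <= set(users_in_container)


# Model of the example: the matmul feeds both "res0" and "res1",
# but only "res0" was tiled into the foreach_thread op.
users_of_matmul = {"res0", "res1"}
tiled_into_foreach_thread = {"res0"}

# "res1" still reads the full matmul result, so the original matmul must stay.
print(can_erase_producer(users_of_matmul, tiled_into_foreach_thread))  # False
```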

I want to tile and fuse all operations into one foreach_thread_op, but I can’t do that with the current infrastructure.

I think that if we could fuse consumers into the containing op, we could tile the ‘matmul’ and then fuse its two consumers. Of course, we would need to add a ‘generateOperandTileValue’ method to the tiling interface, as the counterpart of ‘generateResultTileValue’, and also support sinking the insert slice.
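As a rough illustration of what such a counterpart would have to compute, the sketch below maps a tile of a consumer's operand to the matching tile of its result. This is plain Python, not MLIR; the function `result_tile_from_operand_tile` and its `result_dim_of_operand_dim` argument are hypothetical, and the sketch assumes the operand is accessed through a pure permutation of the result dimensions (which covers elementwise ops like the two consumers here).

```python
def result_tile_from_operand_tile(offsets, sizes, result_dim_of_operand_dim):
    """Map an operand tile (offsets, sizes) to the consumer's result tile.

    result_dim_of_operand_dim[i] is the result dimension that operand
    dimension i corresponds to. Assumption: this is a permutation.
    """
    rank = len(offsets)
    result_offsets = [0] * rank
    result_sizes = [0] * rank
    for operand_dim, result_dim in enumerate(result_dim_of_operand_dim):
        result_offsets[result_dim] = offsets[operand_dim]
        result_sizes[result_dim] = sizes[operand_dim]
    return result_offsets, result_sizes


# Elementwise consumer (identity mapping): the second 4x10 row tile of the
# matmul result maps to the same 4x10 tile of the consumer's result.
print(result_tile_from_operand_tile([4, 0], [4, 10], [0, 1]))  # ([4, 0], [4, 10])

# A transposing consumer would swap the dimensions instead.
print(result_tile_from_operand_tile([4, 0], [4, 10], [1, 0]))  # ([0, 4], [10, 4])
```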

I would like to know whether MLIR plans to provide infrastructure for fusing consumers into the containing op. If not, is there a better way to fuse all operations into one foreach_thread_op?

If I understand your query, this is certainly doable, but not as trivial as it sounds.

Your IR has two (independent) fusion opportunities:

  %res0 = abs ( matmul ( %a : <16x10>, %b: <10x10> ) );
  %res1 = ceil ( matmul ( %a: <16x10>, %b: <10x10> ) );

With the dataflow graph:

   matmul
   /    \
abs     ceil

You cannot compute both paths in place; you need to choose one. But you can tile and at least fuse all the ops into the same inner loop. However, you still need a separate buffer for each output, potentially making the last one in place.
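Here is a minimal sketch of that loop structure in plain Python, as a model of the computation rather than of the MLIR that a transform would emit. The helper names `matmul_tile` and `fused_abs_ceil` are made up for illustration, and the split into 4 row tiles mirrors `num_threads [4, 0]` from the question. Each iteration computes one tile of the matmul and feeds it to both consumers, which write into two separate output buffers.

```python
import math


def matmul_tile(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a_rows]


def fused_abs_ceil(a, b, num_tiles=4):
    """Tile over rows; compute the matmul tile once and apply both consumers."""
    m, n = len(a), len(b[0])
    res0 = [[0.0] * n for _ in range(m)]  # separate buffer for abs
    res1 = [[0.0] * n for _ in range(m)]  # separate buffer for ceil
    tile = math.ceil(m / num_tiles)
    for start in range(0, m, tile):
        stop = min(start + tile, m)
        mm = matmul_tile(a[start:stop], b)  # producer tile, never materialized in full
        for i, row in enumerate(mm):
            res0[start + i] = [abs(x) for x in row]
            res1[start + i] = [float(math.ceil(x)) for x in row]
    return res0, res1


# Same shapes as the example: A is 16x10, B is 10x10.
a = [[(i - j) * 0.5 for j in range(10)] for i in range(16)]
b = [[float((k + j) % 3 - 1) for j in range(10)] for k in range(10)]
res0, res1 = fused_abs_ceil(a, b)
```

The full 16x10 matmul result is never stored; only one 4x10 tile is live at a time, while both outputs keep their own buffers.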

Extending this to arbitrary graphs can lead to complex bufferization logic. It introduces many corner cases that interact with other passes, and it needs work that doesn’t just add the transform but also considers its side effects.

I have proposed a new RFC ([RFC] Tiling interface supports fuse consumer) to address this issue. You are welcome to join the discussion there.