[Linalg][Performance] Unconditional cloning in the elementwise fusion pass leads to redundant ops

RUN: mlir-opt %s -pass-pipeline="builtin.func(tosa-to-linalg,linalg-fuse-elementwise-ops)" -split-input-file
on the TOSA module below:

func @fused_unfused_mix(%arg0: tensor<4x16xi32>, %arg1: tensor<4x16xi32>) -> (tensor<2x8xi32>, tensor<2x8xi32>) {
  %0 = "tosa.add"(%arg0, %arg1) : (tensor<4x16xi32>, tensor<4x16xi32>) -> tensor<4x16xi32>
  %1 = "tosa.sub"(%0, %arg0) : (tensor<4x16xi32>, tensor<4x16xi32>) -> tensor<4x16xi32>
  %2 = "tosa.slice"(%1) {start = [2, 8], size = [2, 8]} : (tensor<4x16xi32>) -> tensor<2x8xi32>
  %3 = "tosa.div"(%0, %1) : (tensor<4x16xi32>, tensor<4x16xi32>) -> tensor<4x16xi32>
  %4 = "tosa.slice"(%3) {start = [2, 8], size = [2, 8]} : (tensor<4x16xi32>) -> tensor<2x8xi32>
  return %2, %4 : tensor<2x8xi32>, tensor<2x8xi32>
}

we will get:

#map = affine_map<(d0, d1) -> (d0, d1)>
module {
  func @fused_unfused_mix(%arg0: tensor<4x16xi32>, %arg1: tensor<4x16xi32>) -> (tensor<2x8xi32>, tensor<2x8xi32>) {
    %0 = linalg.init_tensor [4, 16] : tensor<4x16xi32>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<4x16xi32>, tensor<4x16xi32>) outs(%0 : tensor<4x16xi32>) {
    ^bb0(%arg2: i32, %arg3: i32, %arg4: i32):
      %6 = arith.addi %arg2, %arg3 : i32
      %7 = arith.subi %6, %arg2 : i32
      linalg.yield %7 : i32
    } -> tensor<4x16xi32>
    %2 = "tosa.slice"(%1) {size = [2, 8], start = [2, 8]} : (tensor<4x16xi32>) -> tensor<2x8xi32>
    %3 = linalg.init_tensor [4, 16] : tensor<4x16xi32>
    %4 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<4x16xi32>, tensor<4x16xi32>) outs(%3 : tensor<4x16xi32>) {
    ^bb0(%arg2: i32, %arg3: i32, %arg4: i32):
      %6 = arith.addi %arg2, %arg3 : i32
      %7 = arith.subi %6, %arg2 : i32
      %8 = arith.addi %arg2, %arg3 : i32
      %9 = arith.divsi %8, %7 : i32
      linalg.yield %9 : i32
    } -> tensor<4x16xi32>
    %5 = "tosa.slice"(%4) {size = [2, 8], start = [2, 8]} : (tensor<4x16xi32>) -> tensor<2x8xi32>
    return %2, %5 : tensor<2x8xi32>, tensor<2x8xi32>
  }
}

SliceOp here is merely illustrative; it could be any other non-elementwise op.
The result we would expect is something like:

#map = affine_map<(d0, d1) -> (d0, d1)>
module {
  func @fused_unfused_mix(%arg0: tensor<4x16xi32>, %arg1: tensor<4x16xi32>) -> (tensor<2x8xi32>, tensor<2x8xi32>) {
    %0 = linalg.init_tensor [4, 16] : tensor<4x16xi32>
    %1 = linalg.init_tensor [4, 16] : tensor<4x16xi32>
    %2:2 = linalg.generic {indexing_maps = [#map, #map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<4x16xi32>, tensor<4x16xi32>) outs(%0, %1 : tensor<4x16xi32>, tensor<4x16xi32>) {
    ^bb0(%arg2: i32, %arg3: i32, %arg4: i32, %arg5: i32):
      %6 = arith.addi %arg2, %arg3 : i32
      %7 = arith.subi %6, %arg2 : i32
      %8 = arith.divsi %6, %7 : i32
      linalg.yield %7, %8 : i32, i32
    } -> (tensor<4x16xi32>, tensor<4x16xi32>)
    %4 = "tosa.slice"(%2#0) {size = [2, 8], start = [2, 8]} : (tensor<4x16xi32>) -> tensor<2x8xi32>
    %5 = "tosa.slice"(%2#1) {size = [2, 8], start = [2, 8]} : (tensor<4x16xi32>) -> tensor<2x8xi32>
    return %4, %5 : tensor<2x8xi32>, tensor<2x8xi32>
  }
}

We get the duplicated ops because the pass always clones a fusable producer instead of fusing it in place.
Cloning does solve two problems:

1. In-place fusion can create a cycle, like this:
          fusable_operand
               \      |
                \     |- unfusable_op
                 \       /
               fusable_op0

2. In-place fusion can break topological order.

When we traverse in postorder and insert the fused op at others_op.loc:
    fusable_operand
             |    \
    unfusable_op   \
             |   fusable_op
             |
         others

When we traverse in preorder and insert the fused op at fusable_operand.loc:
    fusable_operand
             |    \
      unfusable_op \
             | \    \
             |  fusable_op
         others
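The cycle hazard in problem 1 can be checked before fusing. A minimal Python sketch (using a hypothetical edge-list graph, not the actual pass code): merging a producer into its consumer creates a cycle exactly when the consumer is reachable from the producer through some other node.

```python
def reachable_through_other(graph, src, dst):
    # Is dst reachable from src along a path that first leaves src
    # through some node other than dst (i.e. a path of length >= 2)?
    stack = [n for n in graph[src] if n != dst]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def fusion_creates_cycle(graph, producer, consumer):
    # Fusing producer into consumer merges the two nodes; any other
    # producer -> ... -> consumer path then becomes a cycle.
    return reachable_through_other(graph, producer, consumer)

# The "problem 1" diagram: fusable_operand feeds fusable_op0 both
# directly and through unfusable_op, so in-place fusion would cycle.
graph = {
    "fusable_operand": ["unfusable_op", "fusable_op0"],
    "unfusable_op": ["fusable_op0"],
    "fusable_op0": [],
}
print(fusion_creates_cycle(graph, "fusable_operand", "fusable_op0"))  # True
```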

But this cloning obviously hurts performance. For the second problem, we could do a partial topological re-sort of the interval between the fused ops; for the first problem, we need to decide whether to clone or to simply give up on that fusion.
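The partial re-sort could look like the Python sketch below (hypothetical names; `deps[op]` maps each op to the ops whose results it uses): stably re-order only the interval touched by fusion so every op again follows its operands, leaving the rest of the function untouched.

```python
def resort_interval(ops, deps, lo, hi):
    # Stable topological re-sort of ops[lo:hi+1]; ops outside the
    # interval keep their positions, so the rest of the IR is untouched.
    member = set(ops[lo:hi + 1])
    pending = list(ops[lo:hi + 1])
    placed, resorted = set(), []
    while pending:
        for i, op in enumerate(pending):
            # Ready once every in-interval dependency is already placed.
            if all(d not in member or d in placed for d in deps[op]):
                placed.add(op)
                resorted.append(op)
                del pending[i]
                break
        else:
            raise ValueError("dependency cycle inside the interval")
    return ops[:lo] + resorted + ops[hi + 1:]

# Preorder insertion put fusable_op before unfusable_op, whose result
# it uses; re-sorting the interval [1, 2] restores a valid order.
ops = ["fusable_operand", "fusable_op", "unfusable_op", "others"]
deps = {
    "fusable_operand": set(),
    "fusable_op": {"fusable_operand", "unfusable_op"},
    "unfusable_op": {"fusable_operand"},
    "others": {"unfusable_op"},
}
print(resort_interval(ops, deps, 1, 2))
# ['fusable_operand', 'unfusable_op', 'fusable_op', 'others']
```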

I think postorder may be the better fit for the fusion pass: the data flow converges in that direction, so we can do more fusion.
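To illustrate why convergence helps, here is a Python sketch (hypothetical helpers, not the actual pass) that assigns fusion groups while visiting consumers before producers; on the module above, add, sub and div all converge into one group, i.e. one multi-result linalg.generic.

```python
def fusion_groups(ops, producers, is_elementwise):
    # Walk ops in reverse program order (consumers before producers).
    # A fusable op joins its consumers' group when all of its
    # elementwise consumers already agree on a single group.
    uses = {op: set() for op in ops}
    for op in ops:
        for p in producers[op]:
            uses[p].add(op)
    group = {}
    for op in reversed(ops):
        if not is_elementwise(op):
            group[op] = op  # non-elementwise ops stay alone
            continue
        consumer_groups = {group[u] for u in uses[op] if is_elementwise(u)}
        group[op] = consumer_groups.pop() if len(consumer_groups) == 1 else op
    return group

# The example module: div uses both add and sub, so the data flow
# converges and all three elementwise ops land in one fusion group.
producers = {
    "add": set(), "sub": {"add"}, "slice1": {"sub"},
    "div": {"add", "sub"}, "slice2": {"div"},
}
ops = ["add", "sub", "slice1", "div", "slice2"]
groups = fusion_groups(ops, producers, lambda op: op in {"add", "sub", "div"})
print(groups["add"] == groups["sub"] == groups["div"])  # True
```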

If the above is not the behavior we expect, maybe I can help with something; I'm interested in that.