Question about linalg matmul tile method

There is a basic linalg.matmul op with m = 128, n = 128, k = 128:
```mlir
transform.sequence failures(propagate) {
^bb0(%arg1: !pdl.operation):
  %0 = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!pdl.operation) -> !pdl.operation
  %1, %loops:3 = transform.structured.tile %0 [16, 16, 4] : (!pdl.operation) -> (!pdl.operation, !pdl.operation, !pdl.operation, !pdl.operation)
}
```

```mlir
func.func @tile_linalg_matmul(
    %arg0: tensor<128x128xf32>, %arg1: tensor<128x128xf32>, %arg2: tensor<128x128xf32>)
    -> tensor<128x128xf32> {
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<128x128xf32>, tensor<128x128xf32>)
                     outs(%arg2 : tensor<128x128xf32>)
      -> tensor<128x128xf32>
  return %0 : tensor<128x128xf32>
}
```
It is then tiled into blocks of m = 16, n = 16, k = 4 using the command `./mlir-opt linalg_struct.mlir -test-transform-dialect-interpreter -split-input-file --verify-diagnostics`, which produces the IR below. Why is the computation result of each block not accumulated, but instead written directly back to the result matrix?

```mlir
module {
  transform.sequence failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.matmul"]} in %arg0 : (!pdl.operation) -> !pdl.operation
    %tiled_linalg_op, %loops:3 = transform.structured.tile %0 [16, 16, 4] : (!pdl.operation) -> (!pdl.operation, !pdl.operation, !pdl.operation, !pdl.operation)
  }
  func.func @tile_linalg_matmul(%arg0: tensor<128x128xf32>, %arg1: tensor<128x128xf32>, %arg2: tensor<128x128xf32>) -> tensor<128x128xf32> {
    %c16 = arith.constant 16 : index
    %c16_0 = arith.constant 16 : index
    %c4 = arith.constant 4 : index
    %c0 = arith.constant 0 : index
    %c128 = arith.constant 128 : index
    %0 = scf.for %arg3 = %c0 to %c128 step %c16 iter_args(%arg4 = %arg2) -> (tensor<128x128xf32>) {
      %c0_1 = arith.constant 0 : index
      %c128_2 = arith.constant 128 : index
      %1 = scf.for %arg5 = %c0_1 to %c128_2 step %c16_0 iter_args(%arg6 = %arg4) -> (tensor<128x128xf32>) {
        %c0_3 = arith.constant 0 : index
        %c128_4 = arith.constant 128 : index
        %2 = scf.for %arg7 = %c0_3 to %c128_4 step %c4 iter_args(%arg8 = %arg6) -> (tensor<128x128xf32>) {
          %extracted_slice = tensor.extract_slice %arg0[%arg3, %arg7] [16, 4] [1, 1] : tensor<128x128xf32> to tensor<16x4xf32>
          %extracted_slice_5 = tensor.extract_slice %arg1[%arg7, %arg5] [4, 16] [1, 1] : tensor<128x128xf32> to tensor<4x16xf32>
          %extracted_slice_6 = tensor.extract_slice %arg8[%arg3, %arg5] [16, 16] [1, 1] : tensor<128x128xf32> to tensor<16x16xf32>
          %3 = linalg.matmul ins(%extracted_slice, %extracted_slice_5 : tensor<16x4xf32>, tensor<4x16xf32>) outs(%extracted_slice_6 : tensor<16x16xf32>) -> tensor<16x16xf32>
          %inserted_slice = tensor.insert_slice %3 into %arg8[%arg3, %arg5] [16, 16] [1, 1] : tensor<16x16xf32> into tensor<128x128xf32>
          scf.yield %inserted_slice : tensor<128x128xf32>
        }
        scf.yield %2 : tensor<128x128xf32>
      }
      scf.yield %1 : tensor<128x128xf32>
    }
    return %0 : tensor<128x128xf32>
  }
}
```
`%3` is just inserted into the result matrix.

Because that's exactly what the tiling transform op does. The accumulation is not lost, it is just implicit: `linalg.matmul` reads its `outs` operand and accumulates into it, and `%extracted_slice_6` is extracted from the loop's `iter_args`, so each k-iteration adds onto the partial sums written back by the previous iterations before `tensor.insert_slice` stores the updated tile.
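To make the implicit accumulation visible, here is a sketch of what `linalg.matmul` computes when spelled out as a `linalg.generic` (the function and map names are illustrative, not the exact builtin expansion):

```mlir
// The outs operand %C is read inside the body and accumulated into:
// C[i, j] += A[i, k] * B[k, j].
#map_a = affine_map<(i, j, k) -> (i, k)>
#map_b = affine_map<(i, j, k) -> (k, j)>
#map_c = affine_map<(i, j, k) -> (i, j)>
func.func @matmul_as_generic(%A: tensor<16x4xf32>, %B: tensor<4x16xf32>,
                             %C: tensor<16x16xf32>) -> tensor<16x16xf32> {
  %0 = linalg.generic
      {indexing_maps = [#map_a, #map_b, #map_c],
       iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%A, %B : tensor<16x4xf32>, tensor<4x16xf32>)
      outs(%C : tensor<16x16xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %mul = arith.mulf %a, %b : f32
    // %c is the current value of the output element: this is the accumulation.
    %acc = arith.addf %c, %mul : f32
    linalg.yield %acc : f32
  } -> tensor<16x16xf32>
  return %0 : tensor<16x16xf32>
}
```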

You may also want another transform, `transform.structured.hoist_redundant_vector_transfers`, which can be applied after vectorization to move the accumulator's `vector.transfer_read`/`vector.transfer_write` pairs out of the reduction loop.
Here is a link to the test.
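For reference, here is a sketch of how that could be chained onto the tiling above, keeping the pdl-style syntax from this thread (op names and signatures have shifted across MLIR versions, so treat this as a sketch rather than a drop-in script):

```mlir
transform.sequence failures(propagate) {
^bb0(%arg0: !pdl.operation):
  %0 = transform.structured.match ops{["linalg.matmul"]} in %arg0 : (!pdl.operation) -> !pdl.operation
  %tiled, %loops:3 = transform.structured.tile %0 [16, 16, 4] : (!pdl.operation) -> (!pdl.operation, !pdl.operation, !pdl.operation, !pdl.operation)
  // Vectorize the enclosing function, then hoist the now-redundant
  // transfer_read/transfer_write pairs on the accumulator tile out of
  // the k-loop so the partial sums stay in vector registers.
  %func = transform.get_closest_isolated_parent %tiled : (!pdl.operation) -> !pdl.operation
  %vectorized = transform.structured.vectorize %func : (!pdl.operation) -> !pdl.operation
  %hoisted = transform.structured.hoist_redundant_vector_transfers %vectorized : (!pdl.operation) -> !pdl.operation
}
```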


Linalg's TileUsingForOp just tiles an op as-is, while the PartialReductionOpInterface path in Linalg is what tiles a reduction op like matmul into explicit partial reductions. That is clear now, but tiling matmul with plain TileUsingForOp can be confusing at first.
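For example, a sketch of the reduction-tiling path (in recent MLIR the op is `transform.structured.tile_reduction_using_for`; its name, result order, and handle types have changed across versions, so check your build):

```mlir
transform.sequence failures(propagate) {
^bb0(%arg0: !pdl.operation):
  %0 = transform.structured.match ops{["linalg.matmul"]} in %arg0 : (!pdl.operation) -> !pdl.operation
  // Tile only the reduction (k) dimension by 4. Unlike plain tiling, this
  // materializes the partial reduction explicitly: a linalg.fill that
  // initializes the partial accumulator, the partial matmul inside the
  // generated loop, and a combining reduction after it.
  %fill, %partial, %combine, %loop =
      transform.structured.tile_reduction_using_for %0 by tile_sizes = [0, 0, 4]
}
```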
Thanks!