Splitting TileOp with two tile sizes into two TileOps

I’m working on some experiments and I’d like to split the tiling into two separate ops. Basically, I want to apply another transform between the tiling of the first loop and the tiling of the second loop. For example, let’s say we have:

%1, %2 = transform.structured.tile %0[32, 16] : (!pdl.operation) -> (!pdl.operation, !pdl.operation)

and I want to split it in two, something like:

%1, %2 = transform.structured.tile %0[32] : (!pdl.operation) -> (!pdl.operation, !pdl.operation)    
// the transform I want to apply would be here
%3, %4 = transform.structured.tile %1[16] : (!pdl.operation) -> (!pdl.operation, !pdl.operation) 

However, the output I get is different, which means that tiling the tiled loop is not equivalent to tiling once with two tile sizes. I don’t understand why it is not equivalent, since conceptually it seems like it should be. Is there any way to split the TileOp in two and still get the same, equivalent code?

Complete Example

For the sake of completeness, here is the full example using one TileOp:

#map0 = affine_map<(d0, d1) -> (d0, d1)>
module {  
  func.func @gemm(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %init: tensor<?x?xf32>) -> tensor<?x?xf32> {
    %0 = linalg.generic {indexing_maps = [#map0, #map0, #map0], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%init : tensor<?x?xf32>) {
    ^bb0(%arg6 : f32, %arg7 : f32, %arg8 : f32):
       %1 = arith.mulf %arg6, %arg7 : f32
       linalg.yield %1 : f32
    } -> tensor<?x?xf32>    
    return %0 : tensor<?x?xf32>
  }
  transform.sequence  failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} attributes {iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>]} in %arg0 : (!pdl.operation) -> !pdl.operation      
    %tiled_linalg_op, %loops:2 = transform.structured.tile %0[32, 16] : (!pdl.operation) -> (!pdl.operation, !pdl.operation, !pdl.operation)        
  }  
}

and using two separate ops:

#map0 = affine_map<(d0, d1) -> (d0, d1)>
module {  
  func.func @gemm(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %init: tensor<?x?xf32>) -> tensor<?x?xf32> {
    %0 = linalg.generic {indexing_maps = [#map0, #map0, #map0], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%init : tensor<?x?xf32>) {
    ^bb0(%arg6 : f32, %arg7 : f32, %arg8 : f32):
       %1 = arith.mulf %arg6, %arg7 : f32
       linalg.yield %1 : f32
    } -> tensor<?x?xf32>    
    return %0 : tensor<?x?xf32>
  }
  transform.sequence  failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} attributes {iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>]} in %arg0 : (!pdl.operation) -> !pdl.operation      
    %tiled_linalg_op, %loops = transform.structured.tile %0[32] : (!pdl.operation) -> (!pdl.operation, !pdl.operation)    
    %tiled_linalg_op2, %loops2 = transform.structured.tile %tiled_linalg_op[16] : (!pdl.operation) -> (!pdl.operation, !pdl.operation)    
  }  
}

Your second version tiles the same dimension twice with two different sizes. If you want the second operation to tile the second dimension, specify tile size 0 for the first dimension and 16 for the second, as in:

%1, %2 = transform.structured.tile %0[32] : (!pdl.operation) -> (!pdl.operation, !pdl.operation)    
// the transform I want to apply would be here
%3, %4 = transform.structured.tile %1[0, 16] : (!pdl.operation) -> (!pdl.operation, !pdl.operation)

Thanks for the idea 🙂.

I forgot to mention it in my original post, but I also tried specifying tile size 0 for the first dimension. That way I get a slightly different output compared to tiling twice with [32] and then [16], but it is still very far from what I get from a single TileOp with [32, 16].

I don’t think it’s an issue with my setup, but I’ll summarize it just in case: I’m running it with mlir-opt --test-transform-dialect-interpreter. I also tried a second setup with a third-party tool that applies several optimizations, but either way I don’t get the output I expect. Any ideas?

Let me reiterate. Tiling by [32] and then by [16] is absolutely not the same as tiling by [32, 16]. The former tiles the same iteration dimension twice; the latter tiles two different iteration dimensions. These two are expected to produce different results.
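To make the difference concrete, here is a hand-written schematic of the two loop structures (not actual interpreter output; iter_args, the slice ops, and the affine.min boundary clamps are all omitted):

// Tiling by [32] and then by [16]: both loops step over the first iteration
// dimension (the inner loop walks within the 32-wide tile produced by the
// outer one); the second dimension is never tiled.
scf.for %i = %c0 to %d0 step %c32 {
  scf.for %ii = %c0 to %c32 step %c16 {
    // body works on (at most) a 16 x %d1 tile
  }
}

// Tiling by [32, 16] in a single op: one loop per iteration dimension.
scf.for %i = %c0 to %d0 step %c32 {
  scf.for %j = %c0 to %d1 step %c16 {
    // body works on (at most) a 32 x 16 tile
  }
}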

Now, tiling by [32] and then by [0, 16] is “structurally equivalent” to tiling by [32, 16]. There may be variations in what the IR looks like, but the loop structure is the same in both cases. Specifically, the latter tiling ([32, 16] in one op) will only have tensor.extract_slice inside the inner loop, slicing along both dimensions at the same time. The former tiling ([32] then [0, 16]) will have a tensor.extract_slice along the first dimension in the outer loop, and then another tensor.extract_slice along the second dimension in the inner loop.
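In the same schematic form (again hand-written, with iter_args and boundary clamps elided, so not something the interpreter would print verbatim), the slicing difference looks roughly like this:

// Single tile op with [32, 16]: one extract_slice per operand, inside the
// inner loop, slicing both dimensions at once.
scf.for %i = %c0 to %d0 step %c32 {
  scf.for %j = %c0 to %d1 step %c16 {
    %a = tensor.extract_slice %arg0[%i, %j] [32, 16] [1, 1]
        : tensor<?x?xf32> to tensor<32x16xf32>
    // ... same for the other operands, then the tile-sized linalg.generic
  }
}

// Tiling by [32] and then by [0, 16]: the outer loop slices along the first
// dimension only, and the inner loop re-slices that slice along the second.
scf.for %i = %c0 to %d0 step %c32 {
  %rows = tensor.extract_slice %arg0[%i, 0] [32, %d1] [1, 1]
      : tensor<?x?xf32> to tensor<32x?xf32>
  scf.for %j = %c0 to %d1 step %c16 {
    %a = tensor.extract_slice %rows[0, %j] [32, 16] [1, 1]
        : tensor<32x?xf32> to tensor<32x16xf32>
    // ...
  }
}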

Thanks for reiterating.

I was expecting the outputs to be similar, and since they were not, I was not convinced that tiling by [32] and then by [0, 16] would be equivalent.