I did some experiments with --affine-loop-tile and noticed that it only tiles the loop nest, not the underlying buffers. This switch seems aimed at improving memory hierarchy performance rather than working around memory resource restrictions, but I want to make sure I am not overlooking a switch that would also tile the buffers for --affine-loop-tile (see the sketch at the end of this post for the kind of result I have in mind). In comparison, I see that --linalg-tile and --linalg-tile-and-fuse-on-tensors do tile the underlying buffers. Is --affine-loop-tile not designed for working around memory size restrictions?
For example, running -affine-loop-tile="tile-size=32" on the function simple_matmul from the test mlir\test\Dialect\Affine\loop-tiling.mlir gives the IR below.
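For reference, a command along these lines (path relative to an llvm-project checkout, assuming mlir-opt is built) should reproduce the dump:

mlir-opt mlir/test/Dialect/Affine/loop-tiling.mlir -affine-loop-tile="tile-size=32"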
- Before -affine-loop-tile="tile-size=32"
func @simple_matmul(%arg0: memref<256x256xvector<64xf32>>, %arg1: memref<256x256xvector<64xf32>>, %arg2: memref<256x256xvector<64xf32>>) -> memref<256x256xvector<64xf32>> {
  affine.for %arg3 = 0 to 256 {
    affine.for %arg4 = 0 to 256 {
      affine.for %arg5 = 0 to 250 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<256x256xvector<64xf32>>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<256x256xvector<64xf32>>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<256x256xvector<64xf32>>
        %3 = mulf %0, %1 : vector<64xf32>
        %4 = addf %2, %3 : vector<64xf32>
        affine.store %4, %arg2[%arg3, %arg4] : memref<256x256xvector<64xf32>>
      }
    }
  }
  return %arg2 : memref<256x256xvector<64xf32>>
}
- After -affine-loop-tile="tile-size=32"
#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 32)>
#map2 = affine_map<(d0) -> (d0 + 32, 250)>
module {
  func @simple_matmul(%arg0: memref<256x256xvector<64xf32>>, %arg1: memref<256x256xvector<64xf32>>, %arg2: memref<256x256xvector<64xf32>>) -> memref<256x256xvector<64xf32>> {
    affine.for %arg3 = 0 to 256 step 32 {
      affine.for %arg4 = 0 to 256 step 32 {
        affine.for %arg5 = 0 to 250 step 32 {
          affine.for %arg6 = #map0(%arg3) to #map1(%arg3) {
            affine.for %arg7 = #map0(%arg4) to #map1(%arg4) {
              affine.for %arg8 = #map0(%arg5) to min #map2(%arg5) {
                %0 = affine.load %arg0[%arg6, %arg8] : memref<256x256xvector<64xf32>>
                %1 = affine.load %arg1[%arg8, %arg7] : memref<256x256xvector<64xf32>>
                %2 = affine.load %arg2[%arg6, %arg7] : memref<256x256xvector<64xf32>>
                %3 = mulf %0, %1 : vector<64xf32>
                %4 = addf %2, %3 : vector<64xf32>
                affine.store %4, %arg2[%arg6, %arg7] : memref<256x256xvector<64xf32>>
              }
            }
          }
        }
      }
    }
    return %arg2 : memref<256x256xvector<64xf32>>
  }
}
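To make the question concrete, below is a rough hand-written sketch (not the output of any pass, and op spellings may differ between MLIR versions) of the kind of result I would want for working around memory size restrictions: each 32x32 tile works out of a small local buffer instead of addressing the full 256x256 memrefs directly.

affine.for %i = 0 to 256 step 32 {
  affine.for %j = 0 to 256 step 32 {
    // Local buffer sized to the tile, not the full 256x256 memref.
    %tile = memref.alloc() : memref<32x32xvector<64xf32>>
    // ... copy the 32x32 block of %arg2 into %tile, run the inner
    // loops against %tile, then copy the result back to %arg2 ...
    memref.dealloc %tile : memref<32x32xvector<64xf32>>
  }
}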