Hello, we are trying to tile and parallelize a linalg.matmul so that it can run on a GPU.
While lowering, we encountered a loop that does not get parallelized by the affine-parallelize pass.
We created this minimal example (the same loop nest as a tiled matmul, but with a simplified body), in which the loop over 2048 is likewise not parallelized by mlir-opt --affine-parallelize:
```mlir
#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 4)>
#map2 = affine_map<(d0) -> (d0 + 32, 1000)>
#map3 = affine_map<(d0) -> (d0 + 32)>
func.func @main() {
  %alloc = memref.alloc() : memref<16x2048xf32>
  %alloc_0 = memref.alloc() : memref<2048x1000xf32>
  %alloc_1 = memref.alloc() : memref<16x1000xf32>
  %cst = arith.constant 1.000000e+00 : f32
  affine.for %arg0 = 0 to 16 step 4 {
    affine.for %arg1 = 0 to 2048 step 32 {
      affine.for %arg2 = 0 to 1000 step 32 {
        affine.for %arg3 = #map(%arg0) to #map1(%arg0) {
          affine.for %arg4 = #map(%arg2) to min #map2(%arg2) {
            affine.for %arg5 = #map(%arg1) to #map3(%arg1) {
              affine.store %cst, %alloc_1[%arg3, %arg4] : memref<16x1000xf32>
            }
          }
        }
      }
    }
  }
  return
}
```
Running the pass on it results in the following IR:
```mlir
#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 32)>
module {
  func.func @main() {
    %alloc = memref.alloc() : memref<16x2048xf32>
    %alloc_0 = memref.alloc() : memref<2048x1000xf32>
    %alloc_1 = memref.alloc() : memref<16x1000xf32>
    %cst = arith.constant 1.000000e+00 : f32
    affine.parallel (%arg0) = (0) to (16) step (4) {
      affine.for %arg1 = 0 to 2048 step 32 {
        affine.parallel (%arg2) = (0) to (1000) step (32) {
          affine.parallel (%arg3) = (%arg0) to (%arg0 + 4) {
            affine.parallel (%arg4) = (%arg2) to (min(%arg2 + 32, 1000)) {
              affine.for %arg5 = #map(%arg1) to #map1(%arg1) {
                affine.store %cst, %alloc_1[%arg3, %arg4] : memref<16x1000xf32>
              }
            }
          }
        }
      }
    }
    return
  }
}
```
We don’t understand why this loop in particular (the one over 2048, and its inner counterpart over %arg5) is not parallelized by the pass, while the surrounding loops are.
For GEMM, we would ultimately like to map these loops to blocks and threads for GPU execution. Does anybody have an idea how we could achieve that?
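To make the goal concrete, below is a hand-written sketch of the loop structure we are hoping to reach (not the output of any pass; the bounds are just taken from the example above, and the names %i, %j, %k are ours). If it turns out that the loop over 2048 cannot legally be parallelized, the idea would be to fuse the remaining parallel dimensions into one multi-dimensional affine.parallel that can then be mapped to blocks and threads, with the 2048 loop kept sequential inside it:

```mlir
// Hand-written sketch, not verified output of any pass.
affine.parallel (%i, %j) = (0, 0) to (16, 1000) {  // candidate for blocks/threads
  affine.for %k = 0 to 2048 {                      // kept sequential inside
    // ... body ...
  }
}
```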