Affine-Parallelize not parallelizing some loops

Hello, we tried to tile and parallelize a linalg.matmul so it can run on a GPU.
While lowering, we encountered a loop that doesn't get parallelized by the affine-parallelize pass.

We created this minimal example, in which the loop over 2048 also doesn't get parallelized by mlir-opt --affine-parallelize (the same loop nest as a tiled matmul, but with a simplified body):

#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 4)>
#map2 = affine_map<(d0) -> (d0 + 32, 1000)>
#map3 = affine_map<(d0) -> (d0 + 32)>

func.func @main() {
    %alloc = memref.alloc() : memref<16x2048xf32>
    %alloc_0 = memref.alloc() : memref<2048x1000xf32>
    %alloc_1 = memref.alloc() : memref<16x1000xf32>
    
    %cst = arith.constant 1.000000e+00 : f32
    
    affine.for %arg0 = 0 to 16 step 4 {
      affine.for %arg1 = 0 to 2048 step 32 {
        affine.for %arg2 = 0 to 1000 step 32 {
          affine.for %arg3 = #map(%arg0) to #map1(%arg0) {
            affine.for %arg4 = #map(%arg2) to min #map2(%arg2) {
              affine.for %arg5 = #map(%arg1) to #map3(%arg1) {
                affine.store %cst, %alloc_1[%arg3, %arg4] : memref<16x1000xf32>
              }
            }
          }
        }
      }
    }
    return
}

It results in the following IR:

#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 32)>
module {
  func.func @main() {
    %alloc = memref.alloc() : memref<16x2048xf32>
    %alloc_0 = memref.alloc() : memref<2048x1000xf32>
    %alloc_1 = memref.alloc() : memref<16x1000xf32>
    %cst = arith.constant 1.000000e+00 : f32
    affine.parallel (%arg0) = (0) to (16) step (4) {
      affine.for %arg1 = 0 to 2048 step 32 {
        affine.parallel (%arg2) = (0) to (1000) step (32) {
          affine.parallel (%arg3) = (%arg0) to (%arg0 + 4) {
            affine.parallel (%arg4) = (%arg2) to (min(%arg2 + 32, 1000)) {
              affine.for %arg5 = #map(%arg1) to #map1(%arg1) {
                affine.store %cst, %alloc_1[%arg3, %arg4] : memref<16x1000xf32>
              }
            }
          }
        }
      }
    }
    return
  }
}

We don't understand why this particular loop doesn't get parallelized by the pass while the surrounding ones do.

For GEMM, we ideally want to map these loops to blocks and threads for GPU execution. Does anybody have an idea how we could achieve that?

But %arg1 and %arg5 aren’t parallel loops. Their iterations are writing to the same memory location!


IIRC, the linalg.matmul op can also be represented as a linalg.generic. It has iterator_types = ["parallel", "parallel", "reduction"], which means only two of the loops can be mapped to affine.parallel. You can use scf.forall instead, and the transform dialect provides methods to map scf.forall to blocks and threads.
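For reference, here is roughly what the matmul looks like when written as a linalg.generic, using the 16x2048x1000 shapes from the example above (a minimal sketch; the function name is just for illustration, and the third iterator is the reduction over K):

#mapA = affine_map<(d0, d1, d2) -> (d0, d2)>
#mapB = affine_map<(d0, d1, d2) -> (d2, d1)>
#mapC = affine_map<(d0, d1, d2) -> (d0, d1)>

func.func @matmul_generic(%A: memref<16x2048xf32>, %B: memref<2048x1000xf32>,
                          %C: memref<16x1000xf32>) {
  linalg.generic {indexing_maps = [#mapA, #mapB, #mapC],
                  iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%A, %B : memref<16x2048xf32>, memref<2048x1000xf32>)
      outs(%C : memref<16x1000xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    // Multiply-accumulate into C: only d0 (M) and d1 (N) are parallel,
    // d2 (K) is the reduction dimension.
    %0 = arith.mulf %a, %b : f32
    %1 = arith.addf %c, %0 : f32
    linalg.yield %1 : f32
  }
  return
}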

Yeah, as @bondhugula mentioned, the affine.for loops for %arg1 and %arg5 are not parallel in the IR above.

My assumption, since you called it GEMM, is that 2048 is the K dimension of a 16x2048x1000 GEMM and that you are using 32x32 tiles.
In that case, the loop over the 2048 axis is a reduction, so you probably want to keep it sequential unless you plan to use a special instruction such as an atomic add; see the sketch below.
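To illustrate, the loop structure you would want to end up with looks roughly like this (a simplified sketch in the style of your minimal example, with the store standing in for the real accumulation; the function name is just for illustration):

func.func @gemm_loop_structure(%C: memref<16x1000xf32>) {
  %cst = arith.constant 0.000000e+00 : f32
  // M and N address independent elements of C, so they can be parallel
  // and are good candidates for GPU blocks/threads.
  affine.parallel (%m) = (0) to (16) {
    affine.parallel (%n) = (0) to (1000) {
      // K (the 2048 axis) accumulates into the same C[%m, %n]; it has to
      // stay a sequential affine.for unless the accumulation is made atomic.
      affine.for %k = 0 to 2048 {
        affine.store %cst, %C[%m, %n] : memref<16x1000xf32>
      }
    }
  }
  return
}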

Yeah of course, thank you, totally overlooked that :smiley:

We tried to tile a matrix-matrix multiplication and get block and thread mappings in the pipeline, ideally just by using passes (not the transform dialect, at least not until we are more familiar with MLIR). With

mlir-opt \
    --convert-linalg-to-affine-loops \
    --affine-loop-tile="tile-sizes=4,32,32" \
    --affine-loop-unroll="unroll-factor=4" \
    --canonicalize \
    --affine-loop-invariant-code-motion \
    --affine-loop-normalize \
    --affine-parallelize \
    --lower-affine \
    --canonicalize \
    --gpu-map-parallel-loops \
    --convert-parallel-loops-to-gpu \
    matmul.mlir

we only get a single block dimension and a single thread dimension (the other dimensions stay at 1), and we're still experimenting…

    ...
    %c1 = arith.constant 1 : index
    %c4 = arith.constant 4 : index
    %c0 = arith.constant 0 : index
    ...
    %c1_5 = arith.constant 1 : index
    %0 = affine.apply #map(%c4)[%c0, %c1]
    %1 = affine.apply #map(%c32)[%c0, %c1]

    gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %0, %arg7 = %c1_5, %arg8 = %c1_5) threads(%arg3, %arg4, %arg5) in (%arg9 = %1, %arg10 = %c1_5, %arg11 = %c1_5) {

      %2 = affine.apply #map1(%arg0)[%c1, %c0]
      %3 = arith.muli %2, %c4 : index
      %4 = affine.apply #map1(%arg3)[%c1, %c0]
      %5 = arith.muli %4, %c32 : index
      scf.for %arg12 = %c0 to %c64 step %c1 {
        %6 = arith.muli %arg12, %c32 : index
        scf.parallel (%arg13) = (%c0) to (%c4) step (%c1) {
          %7 = arith.addi %3, %arg13 : index
          %8 = arith.muli %4, %c-32 : index
          %9 = arith.addi %8, %c1000 : index
          %10 = arith.cmpi sgt, %9, %c32 : index
          %11 = arith.select %10, %c32, %9 : index
          scf.parallel (%arg14) = (%c0) to (%11) step (%c1) {
            %12 = arith.addi %5, %arg14 : index
            scf.for %arg15 = %c0 to %c8 step %c1 {
     ...

It's a bit more complicated than what we were looking for, but we will try to solve it with the transform dialect. Thank you for the tip!
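Something along these lines is what we plan to try with the transform dialect (an untested sketch, meant to be run with mlir-opt --transform-interpreter; the exact transform op names, result order, and syntax can differ between MLIR versions):

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // Tile the two parallel (M and N) dimensions into an scf.forall that
    // carries a GPU block mapping. The result order (tiled op vs. forall op)
    // varies between versions, so the combined handle is not used further here.
    %tiled:2 = transform.structured.tile_using_forall %matmul tile_sizes [4, 32]
        (mapping = [#gpu.block<y>, #gpu.block<x>])
        : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    // Wrap the block-mapped scf.forall into a gpu.launch. A second, nested
    // tiling with #gpu.thread mappings plus
    // transform.gpu.map_nested_forall_to_threads would then assign threads.
    %func = transform.structured.match ops{["func.func"]} in %root
        : (!transform.any_op) -> !transform.any_op
    %launch = transform.gpu.map_forall_to_blocks %func {generate_gpu_launch}
        : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}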