Constructing pipeline lowering an affine parallel loop to NVIDIA GPU

Consider the following snippet of MLIR:

func.func @kernel(%arg0: memref<?xf64>, %arg1: memref<?xf64>, %arg3: memref<1xindex>) attributes {llvm.emit_c_interface} {
  %c0 = arith.constant 0 : index
  %dim = memref.load %arg3[%c0] : memref<1xindex>
  %alloc = memref.alloc(%dim) : memref<?xf64>
  affine.for %arg2 = %c0 to %dim {
    %0 = affine.load %arg0[%arg2] : memref<?xf64>
    %1 = arith.addf %0, %0 : f64
    affine.store %1, %alloc[%arg2] : memref<?xf64>
  }
  return
}

I would like to lower this loop onto a GPU, which should be possible using the utilities currently available in MLIR. The following pipeline gets me there, but not in the way I’d expect:

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-loop-fusion{fusion-maximal=true}, affine-scalrep, affine-parallelize, lower-affine, gpu-map-parallel-loops, convert-parallel-loops-to-gpu, lower-affine), gpu-kernel-outlining, gpu.module(convert-gpu-to-nvvm, convert-nvgpu-to-nvvm, gpu-to-cubin), gpu-to-llvm)"

In particular, the generated code launches kernels with a grid size equal to the number of elements in the array and a block size of 1. This is obviously terrible for performance. I’m not sure which passes should be applied to get the behavior I want. Things go wrong at least by this point in the pipeline:
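To illustrate the problem, the launch shape the pipeline ends up producing looks roughly like this (a hand-written sketch, not actual pass output; the value names are made up):

```mlir
// Sketch: the grid has %dim blocks, and each block has a single thread.
gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %dim, %grid_y = %c1, %grid_z = %c1)
           threads(%tx, %ty, %tz) in (%block_x = %c1, %block_y = %c1, %block_z = %c1) {
  // ... body indexed by %bx only ...
  gpu.terminator
}
```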

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-parallelize, lower-affine, gpu-map-parallel-loops))"

where the mapping looks like:

module {
  func.func @kernel(%arg0: memref<?xf64>, %arg1: memref<?xf64>, %arg2: memref<1xindex>) attributes {llvm.emit_c_interface} {
    %c0 = arith.constant 0 : index
    %0 = memref.load %arg2[%c0] : memref<1xindex>
    %alloc = memref.alloc(%0) : memref<?xf64>
    %c0_0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    scf.parallel (%arg3) = (%c0_0) to (%0) step (%c1) {
      %1 = memref.load %arg0[%arg3] : memref<?xf64>
      %2 = arith.addf %1, %1 : f64
      memref.store %2, %alloc[%arg3] : memref<?xf64>
    } {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
    return
  }
}

where I’d instead expect the processor to be some function of block_x, thread_x, and block_dim.
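Concretely, after tiling I’d expect a two-level mapping along these lines (a hand-written sketch of what gpu-map-parallel-loops assigns to nested parallel loops, not actual output):

```mlir
// Outer (tiled) loop mapped to blocks, inner loop mapped to threads.
scf.parallel (%tile) = (%c0) to (%0) step (%ts) {      // one iteration per tile
  scf.parallel (%i) = (%c0) to (%ts) step (%c1) {      // one iteration per element in the tile
    // ... body ...
  } {mapping = [#gpu.loop_dim_map<processor = thread_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
} {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
```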

Any guidance is appreciated here! I was playing around with affine-loop-tile but that didn’t seem to be pushing me in the right direction.

After digging around for a while, I came to the following solution:

  • I first used the canonicalize pass to merge nested scf.parallel loops into multi-dimensional parallel loops.
  • I wrote a pass that found these multi-dimensional loops and used the collapseParallelLoops utility to flatten each of them into a 1-D loop.
  • I used the scf-parallel-loop-tiling pass to tile the loop, yielding one tile per block.
  • I used the scf-for-loop-canonicalization pass to clean up the modified loops so that the gpu-map-parallel-loops pass assigns everything correctly.
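Put together, the steps above correspond to a pipeline along these lines (a sketch only: my-collapse-parallel is a placeholder for the custom collapsing pass from the second bullet, and the tile size of 256 is illustrative):

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-parallelize, lower-affine, canonicalize, my-collapse-parallel, scf-parallel-loop-tiling{parallel-loop-tile-sizes=256}, scf-for-loop-canonicalization, gpu-map-parallel-loops, convert-parallel-loops-to-gpu, lower-affine), gpu-kernel-outlining, gpu.module(convert-gpu-to-nvvm, gpu-to-cubin), gpu-to-llvm)"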

For the second step, you didn’t find a pass (or a test pass) in the repo to do this? Ideally, one shouldn’t have to write a new pass or utility to lower, map, and execute a simple scf.parallel or affine.parallel on NVIDIA or AMD GPUs – all of this is/should be available in the upstream infrastructure.

For the second step, lib/Dialect/SCF/Transforms/ParallelLoopCollapsing.cpp?

Yeah, there’s the “TestSCFParallelLoopCollapsing” pass, but the doc-string etc. made it appear that this pass was for testing purposes only. My implementation of the second bullet was derived from that pass.
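For reference, if I remember correctly that test pass is registered as test-scf-parallel-loop-collapsing and takes lists of loop indices to merge per resulting dimension, e.g. (flag names from memory, so verify against your build):

./bin/mlir-opt input.mlir -test-scf-parallel-loop-collapsing="collapsed-indices-0=0,1"

which would collapse a 2-D scf.parallel into a 1-D one by merging both induction variables into dimension 0.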
