Constructing pipeline lowering an affine parallel loop to NVIDIA GPU

rohany · May 26, 2023, 7:04pm

Consider the following snippet of MLIR:

func.func @kernel(%arg0: memref<?xf64>, %arg1: memref<?xf64>, %arg3: memref<1xindex>) attributes {llvm.emit_c_interface} {
  %c0 = arith.constant 0 : index
  %dim = memref.load %arg3[%c0] : memref<1xindex>
  %alloc = memref.alloc(%dim) : memref<?xf64>
  affine.for %arg2 = %c0 to %dim {
    %0 = affine.load %arg0[%arg2] : memref<?xf64>
    %1 = arith.addf %0, %0 : f64
    affine.store %1, %alloc[%arg2] : memref<?xf64>
  }
  return
}

I would like to lower this loop onto a GPU, which should be possible using all of the utilities available in MLIR currently. The following pipeline gets me there, but not in the way I’d expect:

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-loop-fusion{fusion-maximal=true}, affine-scalrep, affine-parallelize, lower-affine, gpu-map-parallel-loops, convert-parallel-loops-to-gpu, lower-affine), gpu-kernel-outlining, gpu.module(convert-gpu-to-nvvm, convert-nvgpu-to-nvvm, gpu-to-cubin), gpu-to-llvm)"

In particular, the generated code launches kernels over a grid = number of elements in the array, with block size of 1. This is obviously terrible for performance. I’m not sure which passes should be applied in order to get the behavior I want. Things are wrong at at least this point in the pipeline:

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-parallelize, lower-affine, gpu-map-parallel-loops))"

where the mapping looks like:

module {
  func.func @kernel(%arg0: memref<?xf64>, %arg1: memref<?xf64>, %arg2: memref<1xindex>) attributes {llvm.emit_c_interface} {
    %c0 = arith.constant 0 : index
    %0 = memref.load %arg2[%c0] : memref<1xindex>
    %alloc = memref.alloc(%0) : memref<?xf64>
    %c0_0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    scf.parallel (%arg3) = (%c0_0) to (%0) step (%c1) {
      %1 = memref.load %arg0[%arg3] : memref<?xf64>
      %2 = arith.addf %1, %1 : f64
      memref.store %2, %alloc[%arg3] : memref<?xf64>
      scf.yield
    } {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
    return
  }
}

where i’d maybe expect the processor to be some function of block_x, thread and block_dim.

Any guidance is appreciated here! I was playing around with affine-loop-tile but that didn’t seem to be pushing me in the right direction.

rohany · June 5, 2023, 9:23pm

After digging around for a while, I came to the following solution:

I first used the canonicalize pass to merge nested scf parallel loops into multi-dimensional parallel loops
wrote a pass that found these multi-dimensional loops and used the collapseParallelLoops utility to flatten them into a 1-D loop
Use the scf parallel loop tiling pass to tile the loop for each block
Use the scf for loop canonicalization pass to clean up the modified loops for the gpu parallel loop mapping pass to assign everything correctly.

bondhugula · June 6, 2023, 2:03am

For the second step, you didn’t find a pass (or a test pass) in the repo to do this? Ideally, one shouldn’t have to write a new pass or utility to get a simple scf.parallel or affine.parallel lower, map, and execute on NVIDIA or AMD GPUs – all of this is/should be available in the upstream infrastructure.

bondhugula · June 6, 2023, 2:04am

For the second step, lib/Dialect/SCF/Transforms/ParallelLoopCollapsing.cpp?

rohany · June 6, 2023, 4:17am

Yeah, there’s the “TestSCFParallelLoopCollapsing” pass, but the doc-string etc made it appear that this pass was for testing purposes only. My implementation of the second bullet was derived from that pass.

Topic		Replies	Views
Help lowering GPU modules to LLVM MLIR	11	779	August 24, 2023
Issues with the Lowering Path for Generating GPU Code Using MLIR MLIR gpu	4	115	January 14, 2025
Failed to lower GPU dialect using gpu-lower-to-nvvm pipeline MLIR	5	261	February 29, 2024
Memref to bare pointer conversion when lowering to GPU code MLIR	2	305	March 14, 2022
How to lowering gpu.launch correctly MLIR	4	258	December 4, 2023

Constructing pipeline lowering an affine parallel loop to NVIDIA GPU

Related topics