Following existing examples, I can handle allocations and elementwise ops on GPUs, but I'm running into issues with reduction ops.
Here's a reduction example using a linalg named op:
%reduce = linalg.reduce { arith.addf } ins(%input : tensor<16xf32>)
          outs(%output : tensor<f32>) dimensions = [0]
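For completeness, a standalone reproducer wrapping this in a function might look like the following (function and value names are mine):

```mlir
// Hypothetical wrapper around the reduction above; names are assumptions.
func.func @reduce_sum(%input: tensor<16xf32>, %init: tensor<f32>) -> tensor<f32> {
  // Sum-reduce the single dimension of a 16-element tensor into a rank-0 tensor.
  %reduce = linalg.reduce { arith.addf }
      ins(%input : tensor<16xf32>)
      outs(%init : tensor<f32>)
      dimensions = [0]
  return %reduce : tensor<f32>
}
```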
After the -linalg-generalize-named-ops pass, this becomes:
#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> ()>
%0 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["reduction"]}
    ins(%arg0 : tensor<16xf32>) outs(%arg1 : tensor<f32>) {
^bb0(%in: f32, %out: f32):
  %1 = arith.addf %out, %in : f32
  linalg.yield %1 : f32
} -> tensor<f32>
The -convert-linalg-to-parallel-loops pass then converts it (after bufferization, hence the memrefs below) to a sequential scf.for, since the reduction iterator is not lowered to scf.parallel:
scf.for %arg0 = %c0 to %c16 step %c1 {
  %0 = memref.load %memref[%arg0] : memref<16xf32>
  %1 = memref.load %memref_1[] : memref<f32>
  %2 = arith.addf %1, %0 : f32
  memref.store %2, %memref_1[] : memref<f32>
}
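For comparison, my understanding is that the GPU mapping passes expect the parallel-loop form with an scf.reduce region, roughly like the following hand-written sketch (this is not actual pass output, and names such as %cst0 are mine):

```mlir
// Hand-written sketch of the scf.parallel + scf.reduce form that the GPU
// mapping passes operate on; not produced by the pipeline above.
%sum = scf.parallel (%i) = (%c0) to (%c16) step (%c1) init (%cst0) -> f32 {
  %elem = memref.load %memref[%i] : memref<16xf32>
  // Combine partial results with addition.
  scf.reduce(%elem : f32) {
  ^bb0(%lhs: f32, %rhs: f32):
    %add = arith.addf %lhs, %rhs : f32
    scf.reduce.return %add : f32
  }
}
```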
As far as I understand, the existing GPU passes such as -gpu-map-parallel-loops and -convert-parallel-loops-to-gpu only apply to scf.parallel loops; they do not modify scf.for.
Are there any existing passes that could be used to offload the above kernel to the GPU? We can assume targeting the NVVM pipeline, for example.
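For reference, the pipeline I have so far is roughly the following (a sketch using the passes mentioned above; the input filename and bufferization options are placeholders):

```
# Sketch of the mlir-opt invocation; bufferization options elided.
mlir-opt reduce.mlir \
  --linalg-generalize-named-ops \
  --one-shot-bufferize \
  --convert-linalg-to-parallel-loops
```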