Consider the IR below. The function computes a reduction over a 1D buffer that has been tiled into nested scf.parallel ops. Note that the inner loop is also an scf.parallel loop, with an scf.reduce op. The loops carry the appropriate gpu.loop_dim_map mapping attributes.
#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @sum(%arg0: memref<8192xf32>) -> memref<f32> {
    %c128 = arith.constant 128 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<f32>
    memref.store %cst, %alloc[] : memref<f32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<64xf32>
    scf.for %arg1 = %c0 to %c64 step %c1 {
      memref.store %cst, %alloc_0[%arg1] : memref<64xf32>
    }
    scf.parallel (%arg1) = (%c0) to (%c64) step (%c1) {
      %subview = memref.subview %alloc_0[%arg1] [1] [1] :
          memref<64xf32> to memref<f32, strided<[], offset: ?>>
      %0 = affine.apply #map(%arg1)
      %subview_1 = memref.subview %arg0[%0] [128] [1] :
          memref<8192xf32> to memref<128xf32, strided<[1], offset: ?>>
      %1 = scf.parallel (%arg2) = (%c0) to (%c128) step (%c1) init (%cst) -> f32 {
        %2 = memref.load %subview_1[%arg2] : memref<128xf32, strided<[1], offset: ?>>
        scf.reduce(%2 : f32) {
        ^bb0(%arg3: f32, %arg4: f32):
          %3 = arith.addf %arg3, %arg4 : f32
          scf.reduce.return %3 : f32
        }
      } {mapping = [#gpu.loop_dim_map<processor = thread_x,
                                      map = (d0) -> (d0),
                                      bound = (d0) -> (d0)>]}
      memref.store %1, %subview[] : memref<f32, strided<[], offset: ?>>
      scf.reduce
    } {mapping = [#gpu.loop_dim_map<processor = block_x,
                                    map = (d0) -> (d0),
                                    bound = (d0) -> (d0)>]}
    scf.for %arg1 = %c0 to %c64 step %c1 {
      %0 = memref.load %alloc_0[%arg1] : memref<64xf32>
      %1 = memref.load %alloc[] : memref<f32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %alloc[] : memref<f32>
    }
    memref.dealloc %alloc_0 : memref<64xf32>
    return %alloc : memref<f32>
  }
}
The above IR is in a form that the convert-parallel-loops-to-gpu pass could consume. However, at the moment SCFToGPU does not support reductions.
Arguably, the above IR could be lowered to a gpu.launch op and a gpu.all_reduce op. Specifically, the inner scf.parallel can be replaced by a gpu.all_reduce that inherits the reduction region from the scf.reduce op. The converted IR (after canonicalization) would be:
#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @sum(%arg0: memref<8192xf32>) -> memref<f32> {
    %c128 = arith.constant 128 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<f32>
    memref.store %cst, %alloc[] : memref<f32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<64xf32>
    scf.for %arg1 = %c0 to %c64 step %c1 {
      memref.store %cst, %alloc_0[%arg1] : memref<64xf32>
    }
    gpu.launch blocks(%arg1, %arg2, %arg3) in (%arg7 = %c64, %arg8 = %c1, %arg9 = %c1)
               threads(%arg4, %arg5, %arg6) in (%arg10 = %c128, %arg11 = %c1, %arg12 = %c1) {
      %subview = memref.subview %alloc_0[%arg1] [1] [1] :
          memref<64xf32> to memref<f32, strided<[], offset: ?>>
      %0 = affine.apply #map(%arg1)
      %subview_1 = memref.subview %arg0[%0] [128] [1] :
          memref<8192xf32> to memref<128xf32, strided<[1], offset: ?>>
      %1 = memref.load %subview_1[%arg4] : memref<128xf32, strided<[1], offset: ?>>
      %2 = gpu.all_reduce %1 uniform {
      ^bb0(%arg13: f32, %arg14: f32):
        %3 = arith.addf %arg13, %arg14 : f32
        gpu.yield %3 : f32
      } : (f32) -> f32
      memref.store %2, %subview[] : memref<f32, strided<[], offset: ?>>
      gpu.terminator
    } {SCFToGPU_visited}
    scf.for %arg1 = %c0 to %c64 step %c1 {
      %0 = memref.load %alloc_0[%arg1] : memref<64xf32>
      %1 = memref.load %alloc[] : memref<f32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %alloc[] : memref<f32>
    }
    memref.dealloc %alloc_0 : memref<64xf32>
    return %alloc : memref<f32>
  }
}
This IR can be lowered to a GPU binary using existing passes, for example the NVVM pipeline. (Moving buffers to/from the device, and potentially executing the final reduction on the GPU, would require additional changes, but those can be dealt with separately.)
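For reference, here is a rough sketch of how that downstream lowering could be driven programmatically. The pipeline spelling below, in particular gpu-lower-to-nvvm-pipeline, is an assumption that depends on the MLIR version and on the NVVM pipeline being registered (as it is in mlir-opt); the helper name is made up.

#include "mlir/Pass/PassManager.h"
#include "mlir/Pass/PassRegistry.h"

// Parse a textual pipeline into `pm`: first the (extended) parallel-loops-to-
// GPU conversion that produces the gpu.launch form above, then the registered
// NVVM pipeline to take the kernel the rest of the way.
static mlir::LogicalResult buildLoweringPipeline(mlir::PassManager &pm) {
  return mlir::parsePassPipeline(
      "builtin.module(convert-parallel-loops-to-gpu,"
      "gpu-lower-to-nvvm-pipeline)",
      pm);
}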
The necessary code change in the SCFToGPU conversion looks relatively straightforward (at least for simple cases like this one).
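To make the shape of that change concrete, here is a minimal sketch of the core rewrite step: turning the scf.reduce of the thread-mapped scf.parallel into a gpu.all_reduce that inherits its reduction region. This is not the existing SCFToGPU code; the helper name is made up, operand is assumed to be the reduced value already remapped into the gpu.launch body, and the exact builder/accessor spellings may differ between MLIR versions.

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/PatternMatch.h"

#include <cassert>

using namespace mlir;

// Sketch: build a gpu.all_reduce from the scf.reduce terminating the
// thread-mapped scf.parallel. `operand` is the value fed into scf.reduce,
// already remapped into the gpu.launch body (here, the memref.load indexed by
// the thread id).
static Value convertReduceToAllReduce(scf::ReduceOp reduceOp, Value operand,
                                      PatternRewriter &rewriter) {
  Location loc = reduceOp.getLoc();

  // Create gpu.all_reduce without a named reduction kind; the combinator is
  // supplied by the region moved over below. Marked `uniform` because every
  // thread in the block reaches it.
  NamedAttribute uniform =
      rewriter.getNamedAttr("uniform", rewriter.getUnitAttr());
  auto allReduce = rewriter.create<gpu::AllReduceOp>(
      loc, TypeRange{operand.getType()}, ValueRange{operand},
      ArrayRef<NamedAttribute>{uniform});

  // Move the reduction region of scf.reduce into the gpu.all_reduce body.
  // gpu.all_reduce is IsolatedFromAbove, so the caller must have checked that
  // this region does not capture values defined outside of it.
  rewriter.inlineRegionBefore(reduceOp->getRegion(0), allReduce->getRegion(0),
                              allReduce->getRegion(0).end());

  // Rewrite the terminator: scf.reduce.return %x  ->  gpu.yield %x.
  Operation *terminator = allReduce->getRegion(0).front().getTerminator();
  assert(isa<scf::ReduceReturnOp>(terminator) && "expected scf.reduce.return");
  OpBuilder::InsertionGuard guard(rewriter);
  rewriter.setInsertionPoint(terminator);
  rewriter.replaceOpWithNewOp<gpu::YieldOp>(
      terminator, ValueRange{terminator->getOperand(0)});

  return allReduce.getResult();
}

The surrounding ParallelToGpuLaunchLowering would then map the result of the inner scf.parallel to this gpu.all_reduce result instead of bailing out, which is why the inner loop disappears entirely in the converted IR above.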
Is this a reasonable approach to lowering reduction ops, and would it be of general interest?
My motivation is that the first form, with the tiled scf.parallel ops, could plausibly be generated from, say, a linalg.reduce using something like linalg::tileReductionUsingForall (with some modifications; at the moment it generates a serial scf.for for the inner loop). That would make lowering from linalg all the way to a GPU binary possible.