Consider the IR below. The function computes a reduction over a 1D buffer, which has been tiled into nested scf.parallel ops. Note that the inner loop is also an scf.parallel loop, with an scf.reduce op, and that both loops carry appropriate gpu.loop_dim_map attributes.
#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @sum(%arg0: memref<8192xf32>) -> memref<f32> {
    %c128 = arith.constant 128 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<f32>
    memref.store %cst, %alloc[] : memref<f32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<64xf32>
    scf.for %arg1 = %c0 to %c64 step %c1 {
      memref.store %cst, %alloc_0[%arg1] : memref<64xf32>
    }
    scf.parallel (%arg1) = (%c0) to (%c64) step (%c1) {
      %subview = memref.subview %alloc_0[%arg1] [1] [1] :
          memref<64xf32> to memref<f32, strided<[], offset: ?>>
      %0 = affine.apply #map(%arg1)
      %subview_1 = memref.subview %arg0[%0] [128] [1] :
          memref<8192xf32> to memref<128xf32, strided<[1], offset: ?>>
      %1 = scf.parallel (%arg2) = (%c0) to (%c128) step (%c1) init (%cst) -> f32 {
        %2 = memref.load %subview_1[%arg2] : memref<128xf32, strided<[1], offset: ?>>
        scf.reduce(%2 : f32) {
        ^bb0(%arg3: f32, %arg4: f32):
          %3 = arith.addf %arg3, %arg4 : f32
          scf.reduce.return %3 : f32
        }
      } {mapping = [#gpu.loop_dim_map<processor = thread_x,
                                      map = (d0) -> (d0),
                                      bound = (d0) -> (d0)>]}
      memref.store %1, %subview[] : memref<f32, strided<[], offset: ?>>
      scf.reduce
    } {mapping = [#gpu.loop_dim_map<processor = block_x,
                                    map = (d0) -> (d0),
                                    bound = (d0) -> (d0)>]}
    scf.for %arg1 = %c0 to %c64 step %c1 {
      %0 = memref.load %alloc_0[%arg1] : memref<64xf32>
      %1 = memref.load %alloc[] : memref<f32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %alloc[] : memref<f32>
    }
    memref.dealloc %alloc_0 : memref<64xf32>
    return %alloc : memref<f32>
  }
}
The above IR is in a form that the convert-parallel-loops-to-gpu pass could consume. However, at the moment SCFToGPU does not support reductions. Arguably, the above IR could be lowered to a gpu.launch op together with a gpu.all_reduce op: specifically, the inner scf.parallel can be replaced by a gpu.all_reduce that inherits its reduction region from the scf.reduce op. The converted IR (after canonicalization) would be:
#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @sum(%arg0: memref<8192xf32>) -> memref<f32> {
    %c128 = arith.constant 128 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<f32>
    memref.store %cst, %alloc[] : memref<f32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<64xf32>
    scf.for %arg1 = %c0 to %c64 step %c1 {
      memref.store %cst, %alloc_0[%arg1] : memref<64xf32>
    }
    gpu.launch blocks(%arg1, %arg2, %arg3) in (%arg7 = %c64, %arg8 = %c1, %arg9 = %c1)
               threads(%arg4, %arg5, %arg6) in (%arg10 = %c128, %arg11 = %c1, %arg12 = %c1) {
      %subview = memref.subview %alloc_0[%arg1] [1] [1] :
          memref<64xf32> to memref<f32, strided<[], offset: ?>>
      %0 = affine.apply #map(%arg1)
      %subview_1 = memref.subview %arg0[%0] [128] [1] :
          memref<8192xf32> to memref<128xf32, strided<[1], offset: ?>>
      %1 = memref.load %subview_1[%arg4] : memref<128xf32, strided<[1], offset: ?>>
      %2 = gpu.all_reduce %1 uniform {
      ^bb0(%arg13: f32, %arg14: f32):
        %3 = arith.addf %arg13, %arg14 : f32
        gpu.yield %3 : f32
      } : (f32) -> f32
      memref.store %2, %subview[] : memref<f32, strided<[], offset: ?>>
      gpu.terminator
    } {SCFToGPU_visited}
    scf.for %arg1 = %c0 to %c64 step %c1 {
      %0 = memref.load %alloc_0[%arg1] : memref<64xf32>
      %1 = memref.load %alloc[] : memref<f32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %alloc[] : memref<f32>
    }
    memref.dealloc %alloc_0 : memref<64xf32>
    return %alloc : memref<f32>
  }
}
This IR can be lowered to a GPU binary using existing passes, for example the nvvm pipeline. (Moving buffers to/from the device, and potentially executing the final reduction on the GPU, would require additional changes, but those can be dealt with separately.)
The necessary code change in the SCFToGPU conversion looks relatively straightforward (at least for simple cases like this).
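For illustration, here is a minimal sketch of what that change might look like, assuming the existing scf.parallel-to-gpu.launch pattern in SCFToGPU.cpp calls into a helper when it encounters a thread-mapped loop whose terminator is an scf.reduce with an operand. The helper name rewriteReductionToAllReduce and its integration point are hypothetical; the gpu.all_reduce is built through the generic OperationState API to avoid depending on version-specific builder overloads, and wiring the result back into the surrounding loop nest (as well as erasing the now-empty scf.reduce) is left to the enclosing pattern.

// Rough sketch, not upstream code: turn the scf.reduce of the thread-mapped
// scf.parallel into a gpu.all_reduce that reuses the same reduction region.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Assumes a single reduced value, as in the example above. Returns the value
// produced by the new gpu.all_reduce so the caller can map the result of the
// inner scf.parallel to it.
static Value rewriteReductionToAllReduce(PatternRewriter &rewriter,
                                         scf::ReduceOp reduceOp) {
  Location loc = reduceOp.getLoc();
  Value operand = reduceOp->getOperand(0);

  // Create an empty gpu.all_reduce (no `op` attribute, body filled below)
  // via the generic OperationState API.
  OperationState state(loc, gpu::AllReduceOp::getOperationName());
  state.addOperands(operand);
  state.addTypes(operand.getType());
  state.addRegion();
  Operation *allReduce = rewriter.create(state);

  // Inherit the reduction region from the scf.reduce op.
  Region &body = allReduce->getRegion(0);
  rewriter.inlineRegionBefore(reduceOp->getRegion(0), body, body.end());

  // Swap the scf.reduce.return terminator for gpu.yield.
  Operation *ret = body.front().getTerminator();
  OpBuilder::InsertionGuard guard(rewriter);
  rewriter.setInsertionPoint(ret);
  rewriter.replaceOpWithNewOp<gpu::YieldOp>(ret, ret->getOperands());

  return allReduce->getResult(0);
}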
Is this a reasonable approach to lowering reduction ops, and would it be of general interest?
My motivation is that the first version, with the tiled scf.parallel ops, could plausibly be generated from, say, linalg.reduce using something like linalg::tileReductionUsingForall (with some modifications; at the moment it generates a serial scf.for for the inner loop), which would make it possible to lower from linalg all the way to a GPU binary.