Add reduction support to SCFToGPU

Consider the IR below. The function computes a reduction over a 1D buffer that has been tiled into nested scf.parallel ops. Note that the inner loop is also an scf.parallel loop, with an scf.reduce op. Both loops carry appropriate gpu.loop_dim_map mapping attributes.

#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @sum(%arg0: memref<8192xf32>) -> memref<f32> {
    %c128 = arith.constant 128 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<f32>
    memref.store %cst, %alloc[] : memref<f32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<64xf32>
    scf.for %arg1 = %c0 to %c64 step %c1 {
      memref.store %cst, %alloc_0[%arg1] : memref<64xf32>
    }
    scf.parallel (%arg1) = (%c0) to (%c64) step (%c1) {
      %subview = memref.subview %alloc_0[%arg1] [1] [1] :
          memref<64xf32> to memref<f32, strided<[], offset: ?>>
      %0 = affine.apply #map(%arg1)
      %subview_1 = memref.subview %arg0[%0] [128] [1] :
          memref<8192xf32> to memref<128xf32, strided<[1], offset: ?>>
      %1 = scf.parallel (%arg2) = (%c0) to (%c128) step (%c1) init (%cst) -> f32 {
        %2 = memref.load %subview_1[%arg2] : memref<128xf32, strided<[1], offset: ?>>
        scf.reduce(%2 : f32) {
        ^bb0(%arg3: f32, %arg4: f32):
          %3 = arith.addf %arg3, %arg4 : f32
          scf.reduce.return %3 : f32
        }
      } {mapping = [#gpu.loop_dim_map<processor = thread_x,
                                      map = (d0) -> (d0),
                                      bound = (d0) -> (d0)>]}
      memref.store %1, %subview[] : memref<f32, strided<[], offset: ?>>
      scf.reduce 
    } {mapping = [#gpu.loop_dim_map<processor = block_x,
                                    map = (d0) -> (d0),
                                    bound = (d0) -> (d0)>]}
    scf.for %arg1 = %c0 to %c64 step %c1 {
      %0 = memref.load %alloc_0[%arg1] : memref<64xf32>
      %1 = memref.load %alloc[] : memref<f32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %alloc[] : memref<f32>
    }
    memref.dealloc %alloc_0 : memref<64xf32>
    return %alloc : memref<f32>
  }
}

The above IR is in a form that the convert-parallel-loops-to-gpu pass could consume. However, SCFToGPU does not currently support reductions.

Arguably, the above IR could be lowered to a gpu.launch op and a gpu.all_reduce op. Specifically, the inner scf.parallel can be replaced by a gpu.all_reduce that inherits its reduction region from the scf.reduce op. The converted IR (after canonicalization) would be:

#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @sum(%arg0: memref<8192xf32>) -> memref<f32> {
    %c128 = arith.constant 128 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<f32>
    memref.store %cst, %alloc[] : memref<f32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<64xf32>
    scf.for %arg1 = %c0 to %c64 step %c1 {
      memref.store %cst, %alloc_0[%arg1] : memref<64xf32>
    }
    gpu.launch blocks(%arg1, %arg2, %arg3) in (%arg7 = %c64, %arg8 = %c1, %arg9 = %c1)
        threads(%arg4, %arg5, %arg6) in (%arg10 = %c128, %arg11 = %c1, %arg12 = %c1) {
      %subview = memref.subview %alloc_0[%arg1] [1] [1] :
          memref<64xf32> to memref<f32, strided<[], offset: ?>>
      %0 = affine.apply #map(%arg1)
      %subview_1 = memref.subview %arg0[%0] [128] [1] :
          memref<8192xf32> to memref<128xf32, strided<[1], offset: ?>>
      %1 = memref.load %subview_1[%arg4] : memref<128xf32, strided<[1], offset: ?>>
      %2 = gpu.all_reduce  %1 uniform {
      ^bb0(%arg13: f32, %arg14: f32):
        %3 = arith.addf %arg13, %arg14 : f32
        gpu.yield %3 : f32
      } : (f32) -> f32
      memref.store %2, %subview[] : memref<f32, strided<[], offset: ?>>
      gpu.terminator
    } {SCFToGPU_visited}
    scf.for %arg1 = %c0 to %c64 step %c1 {
      %0 = memref.load %alloc_0[%arg1] : memref<64xf32>
      %1 = memref.load %alloc[] : memref<f32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %alloc[] : memref<f32>
    }
    memref.dealloc %alloc_0 : memref<64xf32>
    return %alloc : memref<f32>
  }
}

This IR can be lowered to a GPU binary using existing passes, for example the NVVM pipeline. (Moving buffers to/from the device, and potentially executing the final reduction on the GPU, would require additional changes, but those can be dealt with separately.)

The necessary code change in the SCFToGPU conversion looks relatively straightforward, at least for simple cases like this one.
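To make the idea concrete, here is a rough, uncompiled sketch of the kind of helper that could sit next to the existing loop-to-launch logic in SCFToGPU.cpp (which would stay as it is; only the handling of the reducing inner loop is new). The helper name convertReductionToAllReduce is made up, and the exact builder arguments for gpu.all_reduce are an assumption based on the ODS-generated API; the surrounding pass is assumed to supply the rewriter and the per-thread operand.

// Sketch only: converts the single-reduction terminator of an inner
// scf.parallel into a gpu.all_reduce carrying the same reduction region.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical helper. `operand` is the value each thread contributes
// (the value passed to scf.reduce in the original loop body).
static Value convertReductionToAllReduce(PatternRewriter &rewriter,
                                         scf::ParallelOp parallelOp,
                                         Value operand) {
  auto reduceOp = cast<scf::ReduceOp>(parallelOp.getBody()->getTerminator());
  // Only the single-reduction case is sketched here.
  Region &reduction = reduceOp.getReductions().front();

  // Create an empty gpu.all_reduce. The reduction is expressed as a region,
  // so no `op` attribute is set. (Builder arguments are an assumption.)
  auto allReduce = rewriter.create<gpu::AllReduceOp>(
      parallelOp.getLoc(), operand.getType(), operand,
      /*op=*/gpu::AllReduceOperationAttr{}, /*uniform=*/true);

  // Inherit the reduction body from scf.reduce; its block already has the
  // two element-typed arguments that gpu.all_reduce expects.
  rewriter.inlineRegionBefore(reduction, allReduce.getBody(),
                              allReduce.getBody().end());

  // Swap the scf.reduce.return terminator for gpu.yield.
  auto reduceReturn =
      cast<scf::ReduceReturnOp>(allReduce.getBody().front().getTerminator());
  rewriter.setInsertionPoint(reduceReturn);
  rewriter.replaceOpWithNewOp<gpu::YieldOp>(reduceReturn,
                                            reduceReturn.getResult());
  return allReduce.getResult();
}

The gist is that the scf.reduce region can be moved wholesale into the gpu.all_reduce body; only the terminator needs to change from scf.reduce.return to gpu.yield.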

Is this a reasonable approach for reduction op lowering that would be of general interest?

My motivation is that the first form, with tiled scf.parallel ops, could plausibly be generated from, say, linalg.reduce using something like linalg::tileReductionUsingForall (with some modifications; at the moment it generates a serial scf.for for the inner loop). With that, lowering from Linalg all the way to a GPU binary should be possible.

Related thread: SCFToGPU convertion -convert-parallel-loops-to-gpu

cc: @Groverkss @qed

Any thoughts @ftynse @rohany ?

I don’t have objections to an attempt to implement such functionality, but I need more details if you want design/code advice. It is reasonable to support targeting reductions from scf.parallel for input formats that can only target scf.parallel.

The general design point of Linalg is to avoid discarding information only to infer it again later via analysis. That may apply here, depending on how complex the analysis needed to handle reductions properly turns out to be. Instead, one may want to target the appropriate GPU operations directly from Linalg without going through loops.

Thanks. I think I can propose a patch, but I wanted to get feedback on whether this approach sounds useful before investing more time in it.

Mapping ops directly from Linalg to GPU does sound like a good idea. However, at the moment the only existing pass I know of that maps loops to gpu.launch ops is convert-parallel-loops-to-gpu. That’s why I’d start by generalizing it.

The Linalg-to-GPU pipeline that I have in mind requires a linalg.reduce op at the high level. Its reduction block can be inlined directly into both scf.reduce and gpu.all_reduce, so the conversions are simple (see the sketch below). If, however, the linalg op is a linalg.generic, more complex analysis is required to infer the correct reduction block.
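For illustration, a minimal sketch of the gpu.all_reduce side of that claim, under the same caveats as the earlier sketch: the helper name is hypothetical and the gpu.all_reduce op is assumed to have already been created with an empty body.

// Sketch only: move the combiner of a linalg.reduce into the body of an
// already-created gpu.all_reduce and fix up the terminator.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/IR/PatternMatch.h"

static void inlineCombinerIntoAllReduce(mlir::PatternRewriter &rewriter,
                                        mlir::linalg::ReduceOp reduceOp,
                                        mlir::gpu::AllReduceOp allReduce) {
  using namespace mlir;
  // linalg.reduce carries its reduction logic in the `combiner` region.
  rewriter.inlineRegionBefore(reduceOp.getCombiner(), allReduce.getBody(),
                              allReduce.getBody().end());
  // The combiner ends in linalg.yield; gpu.all_reduce expects gpu.yield.
  auto yield =
      cast<linalg::YieldOp>(allReduce.getBody().front().getTerminator());
  rewriter.setInsertionPoint(yield);
  rewriter.replaceOpWithNewOp<gpu::YieldOp>(yield, yield.getValues());
}

Targeting scf.reduce would look essentially the same, just with scf.reduce.return as the new terminator.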

@ftynse I have drafted a patch in this PR that converts scf.parallel+scf.reduce op combo to a gpu.all_reduce op.