SCFToGPU conversion: -convert-parallel-loops-to-gpu

Hi all. I am trying to convert a parallel reduction to GPU. I constructed the IR below, which contains an scf.parallel with an add reduction, and ran mlir-opt -convert-parallel-loops-to-gpu affine_reduction.mlir. The pass leaves the IR unchanged.

#map = affine_map<(d0) -> (d0)>
module  {
  func @affine_parallel_with_reductions(%arg0: memref<3x3xf32>) -> f32{
    %c0 = arith.constant 0 : index
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = scf.parallel (%arg1, %arg2) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) init (%cst) -> f32 {
      %1 = memref.load %arg0[%arg1, %arg2] : memref<3x3xf32>
      scf.reduce(%1)  : f32 {
      ^bb0(%arg3: f32, %arg4: f32):  // no predecessors
        %2 = arith.addf %arg3, %arg4 : f32
        scf.reduce.return %2 : f32
      }
      scf.yield
    } {mapping = [{bound = #map, map = #map, processor = 0}, {bound = #map, map = #map, processor = 1}]}
    return %0 : f32
  }
}
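For reference, the value this loop is meant to produce can be written as a sequential C++ sketch. This is only a model of the IR above, not MLIR API code, and the name referenceSum is hypothetical. Note that the bounds run from 0 to 2, so only the upper-left 2x2 corner of the 3x3 memref is summed.

```cpp
#include <array>
#include <cstddef>

// Sequential model of the scf.parallel reduction in the IR above.
// (i, j) range over [0, 2) x [0, 2) with step 1; the accumulator starts
// from the init value 0.0 and is combined with addition, as in scf.reduce.
float referenceSum(const std::array<std::array<float, 3>, 3> &a) {
  float acc = 0.0f; // corresponds to init (%cst)
  for (std::size_t i = 0; i < 2; ++i) {
    for (std::size_t j = 0; j < 2; ++j) {
      acc += a[i][j]; // corresponds to arith.addf inside scf.reduce
    }
  }
  return acc;
}
```

A parallel lowering may combine the elements in a different order, so for floating-point inputs the result can differ from this sequential sum by rounding.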

After reading the implementation in llvm-project/mlir/lib/Conversion/SCFToGPU.cpp, it appears that the SCF-to-GPU conversion does not currently support reductions:

static LogicalResult processParallelLoop(
    ParallelOp parallelOp, gpu::LaunchOp launchOp,
    BlockAndValueMapping &cloningMap, SmallVectorImpl<Operation *> &worklist,
    DenseMap<gpu::Processor, Value> &bounds, PatternRewriter &rewriter) {
  // TODO: Verify that this is a valid GPU mapping.
  // processor ids: 0-2 block [x/y/z], 3-5 -> thread [x/y/z], 6-> sequential
  ArrayAttr mapping =
      parallelOp->getAttrOfType<ArrayAttr>(gpu::getMappingAttrName());

  // TODO: Support reductions.
  if (!mapping || parallelOp.getNumResults() != 0)
    return failure();

When I simply remove the check on the number of results (see the modified code below), I get the error shown below. I suspect the cause is the SSA value returned by scf.parallel, because in the existing unit tests in MLIR the scf.parallel op never returns values.

I have two questions.

The first is when the community plans to support this feature, i.e. reductions in the SCF-to-GPU conversion.

The second is: if I want to implement a basic version myself, are there any potential problems or points I should focus on?

Thank you.

Modified code:

  // TODO: Support reductions.
  if (!mapping )
    return failure();

Error:

parallel.mlir:39:10: error: failed to legalize operation 'scf.parallel' marked as erased
    %0 = scf.parallel (%arg1, %arg2) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) init (%cst) -> f32 {
         ^

Current parallel unit test:

  %step = arith.constant 2 : index
  scf.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
                                          step (%arg4, %step)  {
    %val = memref.load %buf[%i0, %i1] : memref<?x?xf32>
    memref.store %val, %res[%i1, %i0] : memref<?x?xf32>
  } { mapping = [{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}, {processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}] }
  return

Ahem, this isn’t how code works… You can’t just remove the check and expect everything to start working magically, you need to actually implement the support for reductions. parallelOp.getNumResults() != 0 is checking whether the loop has reductions because reductions are the only thing that produces results of the loop operation.

Whenever somebody contributes it. There are no concrete plans or timelines for the absolute majority of the features. The project being community-based means that you, me, or anybody else is welcome to contribute new features (as long as they follow the guidelines) and thus become part of the community. Usually, new features are contributed because somebody needs them and thinks they may be beneficial for others.

Figure out the way of mapping scf.reduce to something like gpu.all_reduce, and write the code that implements it.
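One thing to keep in mind when designing that mapping: gpu.all_reduce combines values across the work items of a single workgroup only. In the example from the question, both loop dimensions use processor ids 0 and 1, which the comment in processParallelLoop describes as block dimensions, so a lowering would likely need a second step that combines the per-workgroup results. Below is a minimal CPU-side sketch of that two-stage structure in plain C++. It is a model of the semantics under that assumption, not MLIR API code, and the names workgroupAllReduce and gridReduce are hypothetical.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for gpu.all_reduce with an add reduction: combines
// the values held by the threads of ONE workgroup. On a real GPU, every
// thread of that workgroup would receive the same combined value.
float workgroupAllReduce(const std::vector<float> &threadValues) {
  return std::accumulate(threadValues.begin(), threadValues.end(), 0.0f);
}

// Hypothetical second stage: combines the per-workgroup partial results,
// starting from the loop's init value. On a GPU this step could be done
// with atomics or a follow-up kernel; that choice is left open here.
float gridReduce(const std::vector<std::vector<float>> &workgroups,
                 float init) {
  float result = init;
  for (const std::vector<float> &threadValues : workgroups) {
    result += workgroupAllReduce(threadValues);
  }
  return result;
}
```

For example, with two workgroups holding {1, 2} and {4, 5} and an init value of 0, gridReduce first produces the partial sums 3 and 9 and then combines them into 12. The combined value also has to replace the result of the original scf.parallel, since the loop result is still used after the loop.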