Confusion around scf.reduce

I’m confused about the semantics of scf.reduce, in particular when a loop reduces multiple values. Consider the following bit of affine code, which computes a sum and a max over a memref.

func.func @reduce2(%input : memref<10xf32>) -> (f32, f32) {
  %zero = arith.constant 0. : f32
  %reduceval, %maxval = affine.for %i = 0 to 10 iter_args(%sum = %zero, %max = %zero) -> (f32, f32) {
    %0 = affine.load %input[%i] : memref<10xf32>
    %1 = arith.addf %0, %sum : f32
    %2 = arith.maxf %0, %max : f32
    affine.yield %1, %2 : f32, f32
  }
  return %reduceval, %maxval : f32, f32
}

This loop can be parallelized, but when I lower it to SCF with -pass-pipeline="builtin.module(func.func(affine-parallelize{parallel-reductions}, lower-affine, canonicalize))", I am confused about what’s going on:

  func.func @reduce2(%arg0: memref<10xf32>) -> (f32, f32) {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c10 = arith.constant 10 : index
    %c1 = arith.constant 1 : index
    %cst_0 = arith.constant 0xFF800000 : f32
    %0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1) init (%cst, %cst_0) -> (f32, f32) {
      %3 = memref.load %arg0[%arg1] : memref<10xf32>
      scf.reduce(%3)  : f32 {
      ^bb0(%arg2: f32, %arg3: f32):
        %4 = arith.addf %arg2, %arg3 : f32
        scf.reduce.return %4 : f32
      }
      scf.reduce(%3)  : f32 {
      ^bb0(%arg2: f32, %arg3: f32):
        %4 = arith.maxf %arg2, %arg3 : f32
        scf.reduce.return %4 : f32
      }
      scf.yield
    }
    %1 = arith.addf %0#0, %cst : f32
    %2 = arith.maxf %0#1, %cst : f32
    return %1, %2 : f32, f32
  }

There are two scf.reduce operations here, but it is unclear to me (from this output and from the documentation) how to tell which reduce operation corresponds to which reduction value of the original for loop. How do I know which reduce corresponds to which reduction?

This question is part of a larger effort to map code containing scf.reduce operations onto GPUs. I’m working on getting parallel loop tiling to work for loops with reductions, after which I’ll try to get scf.reduce mapped to gpu.all_reduce.
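For concreteness, here is a rough, hand-written sketch of the kind of form I imagine each scf.reduce eventually turning into inside a gpu.launch body. This is not the output of any existing pass, and the value names (%tid, %loaded, etc.) are made up; the two gpu.all_reduce spellings are the named-operation and region forms shown in the GPU dialect documentation.

// Hypothetical target: one gpu.all_reduce per reduced value, executed by
// every thread of a workgroup inside a gpu.launch region.
%loaded = memref.load %arg0[%tid] : memref<10xf32>
// Named-operation form for the sum.
%sum = gpu.all_reduce add %loaded {} : (f32) -> (f32)
// Region form for the max, combining two partial results.
%max = gpu.all_reduce %loaded {
^bb(%lhs : f32, %rhs : f32):
  %m = arith.maxf %lhs, %rhs : f32
  "gpu.yield"(%m) : (f32) -> ()
} : (f32) -> (f32)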

I’m surprised that this doesn’t exist already, given that many MLIR dialects aim to target GPUs. Are these other projects skipping scf and mapping to gpu directly when their code contains reductions (as common linear algebra operations do)?

Isn’t that what this extract of the scf.parallel documentation is about?

Reductions are matched to result and initial values in order of their appearance in the body.

So the first scf.reduce corresponds to the first result of the scf.parallel.
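Applied to the lowered output above, the correspondence looks like this (an annotated excerpt of the code already posted; the comments are mine):

%0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1) init (%cst, %cst_0) -> (f32, f32) {
  %3 = memref.load %arg0[%arg1] : memref<10xf32>
  scf.reduce(%3) : f32 { ... addf ... }  // first reduce:  produces %0#0, initialized from %cst
  scf.reduce(%3) : f32 { ... maxf ... }  // second reduce: produces %0#1, initialized from %cst_0
  scf.yield
}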

Thank you; I must have missed that while reading the documentation earlier.