Confusion around scf.reduce

I’m confused about the semantics of scf.reduce, in particular when a loop reduces multiple values. Consider the following bit of affine code, which computes a sum and a max over a memref.

func.func @reduce2(%input : memref<10xf32>) -> (f32, f32) {
  %zero = arith.constant 0. : f32
  %reduceval, %maxval = affine.for %i = 0 to 10 iter_args(%sum = %zero, %max = %zero) -> (f32, f32) {
    %0 = affine.load %input[%i] : memref<10xf32>
    %1 = arith.addf %0, %sum : f32
    %2 = arith.maxf %0, %max : f32
    affine.yield %1, %2 : f32, f32
  }
  return %reduceval, %maxval : f32, f32
}

This loop can be parallelized, but when I lower it to SCF with -pass-pipeline="builtin.module(func.func(affine-parallelize{parallel-reductions}, lower-affine, canonicalize))", I am confused about what’s going on:

  func.func @reduce2(%arg0: memref<10xf32>) -> (f32, f32) {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c10 = arith.constant 10 : index
    %c1 = arith.constant 1 : index
    %cst_0 = arith.constant 0xFF800000 : f32
    %0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1) init (%cst, %cst_0) -> (f32, f32) {
      %3 = memref.load %arg0[%arg1] : memref<10xf32>
      scf.reduce(%3)  : f32 {
      ^bb0(%arg2: f32, %arg3: f32):
        %4 = arith.addf %arg2, %arg3 : f32
        scf.reduce.return %4 : f32
      }
      scf.reduce(%3)  : f32 {
      ^bb0(%arg2: f32, %arg3: f32):
        %4 = arith.maxf %arg2, %arg3 : f32
        scf.reduce.return %4 : f32
      }
      scf.yield
    }
    %1 = arith.addf %0#0, %cst : f32
    %2 = arith.maxf %0#1, %cst : f32
    return %1, %2 : f32, f32
  }

There are two scf.reduce operations here, but it is unclear to me (from this output and from the documentation) how to tell which reduce operation corresponds to which reduction value of the original for loop. How do I know which reduce corresponds to which reduction?

This question is part of a larger effort to map code containing scf.reduce operations onto GPUs. I’m working on getting parallel loop tiling to work for loops with reductions, after which I’ll try to get scf.reduce mapped to gpu.all_reduce.
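For concreteness, here is a rough, hand-written sketch of the kind of form I imagine each scf.reduce eventually turning into inside a gpu.launch body. This is not the output of any existing pass, and the value names (%tid, %loaded, etc.) are made up; the two gpu.all_reduce spellings are the named-operation and region forms shown in the GPU dialect documentation.

// Hypothetical target: one gpu.all_reduce per reduced value, executed by
// every thread of a workgroup inside a gpu.launch region.
%loaded = memref.load %arg0[%tid] : memref<10xf32>
// Named-operation form for the sum.
%sum = gpu.all_reduce add %loaded {} : (f32) -> (f32)
// Region form for the max, combining two partial results.
%max = gpu.all_reduce %loaded {
^bb(%lhs : f32, %rhs : f32):
  %m = arith.maxf %lhs, %rhs : f32
  "gpu.yield"(%m) : (f32) -> ()
} : (f32) -> (f32)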

I’m surprised that this doesn’t exist already, given that many MLIR dialects aim to target GPUs. Are these other projects skipping scf and mapping to gpu directly when their code contains reductions (as common linear algebra operations do)?

Isn’t that what this extract of the scf.parallel documentation is about?

Reductions are matched to result and initial values in order of their appearance in the body.

So the first scf.reduce corresponds to the first result of the scf.parallel.
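Applied to the lowered output above, the correspondence looks like this (an annotated excerpt of the code already posted; the comments are mine):

%0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1) init (%cst, %cst_0) -> (f32, f32) {
  %3 = memref.load %arg0[%arg1] : memref<10xf32>
  scf.reduce(%3) : f32 { ... addf ... }  // first reduce:  produces %0#0, initialized from %cst
  scf.reduce(%3) : f32 { ... maxf ... }  // second reduce: produces %0#1, initialized from %cst_0
  scf.yield
}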

Thank you; I must have missed that while reading the documentation earlier.