I’m confused about the semantics of scf.reduce, in particular when a loop may be reducing multiple values. Consider the following bit of affine code which reduces a sum and max over a memref.

```
func.func @reduce2(%input : memref<10xf32>) -> (f32, f32) {
  %zero = arith.constant 0. : f32
  %reduceval, %maxval = affine.for %i = 0 to 10 iter_args(%sum = %zero, %max = %zero) -> (f32, f32) {
    %0 = affine.load %input[%i] : memref<10xf32>
    %1 = arith.addf %0, %sum : f32
    %2 = arith.maxf %0, %max : f32
    affine.yield %1, %2 : f32, f32
  }
  return %reduceval, %maxval : f32, f32
}
```

This loop can be parallelized, but when I lower it to SCF with `-pass-pipeline="builtin.module(func.func(affine-parallelize{parallel-reductions}, lower-affine, canonicalize))"`, I am confused about what's going on:

```
func.func @reduce2(%arg0: memref<10xf32>) -> (f32, f32) {
  %cst = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  %c10 = arith.constant 10 : index
  %c1 = arith.constant 1 : index
  %cst_0 = arith.constant 0xFF800000 : f32
  %0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1) init (%cst, %cst_0) -> (f32, f32) {
    %3 = memref.load %arg0[%arg1] : memref<10xf32>
    scf.reduce(%3) : f32 {
    ^bb0(%arg2: f32, %arg3: f32):
      %4 = arith.addf %arg2, %arg3 : f32
      scf.reduce.return %4 : f32
    }
    scf.reduce(%3) : f32 {
    ^bb0(%arg2: f32, %arg3: f32):
      %4 = arith.maxf %arg2, %arg3 : f32
      scf.reduce.return %4 : f32
    }
    scf.yield
  }
  %1 = arith.addf %0#0, %cst : f32
  %2 = arith.maxf %0#1, %cst : f32
  return %1, %2 : f32, f32
}
```

There are two `scf.reduce` operations here, but it is unclear to me (both from this output and from the documentation) how to tell which reduce operation corresponds to which result of the `scf.parallel` loop. How do I know which reduce corresponds to which reduction?
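My working assumption (which I haven't found confirmed anywhere) is that the pairing is purely positional, i.e. the n-th `scf.reduce` in the body corresponds to the n-th `init` operand and the n-th loop result. That would be consistent with the dump above:

```
// Hypothetical positional pairing (my reading of the output, not the spec):
//   1st scf.reduce (addf)  <->  1st init operand %cst   (0.0)   <->  result %0#0 (sum)
//   2nd scf.reduce (maxf)  <->  2nd init operand %cst_0 (-inf)  <->  result %0#1 (max)
```

If that reading is right, reordering the two `scf.reduce` ops without also reordering the `init` operands would silently change which result holds which reduction, which is why I'd like this confirmed.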

This question is part of a larger endeavor to map code with `scf.reduce` operations onto GPUs. I'm working on getting parallel-loop tiling to work with loops that have reductions, after which I'll try to map `scf.reduce` to `gpu.all_reduce`.
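For context, the kind of pipeline I'm experimenting with looks roughly like this (pass names taken from the in-tree SCF/GPU passes; exact options and spellings may differ across MLIR versions, and the tile size here is just a placeholder):

```
# Sketch of the intended lowering path, not a working recipe:
mlir-opt input.mlir \
  -scf-parallel-loop-tiling="parallel-loop-tile-sizes=128" \
  -gpu-map-parallel-loops \
  -convert-parallel-loops-to-gpu
```

The tiling step is where I currently get stuck for loops that carry reductions.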

I’m surprised that this doesn’t already exist, given that many MLIR dialects aim to target GPUs. Are these other projects skipping scf and mapping directly to gpu when their code contains reductions (like common linear algebra operations)?