I’m confused about the semantics of `scf.reduce`, in particular when a loop reduces multiple values. Consider the following bit of affine code, which computes a sum and a max over a memref:
```mlir
func.func @reduce2(%input : memref<10xf32>) -> (f32, f32) {
  %zero = arith.constant 0. : f32
  %reduceval, %maxval = affine.for %i = 0 to 10 iter_args(%sum = %zero, %max = %zero) -> (f32, f32) {
    %0 = affine.load %input[%i] : memref<10xf32>
    %1 = arith.addf %0, %sum : f32
    %2 = arith.maxf %0, %max : f32
    affine.yield %1, %2 : f32, f32
  }
  return %reduceval, %maxval : f32, f32
}
```
This loop can get parallelized, but when I lower it to SCF with `-pass-pipeline="builtin.module(func.func(affine-parallelize{parallel-reductions}, lower-affine, canonicalize))"`, I am confused about what’s going on:
```mlir
func.func @reduce2(%arg0: memref<10xf32>) -> (f32, f32) {
  %cst = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  %c10 = arith.constant 10 : index
  %c1 = arith.constant 1 : index
  %cst_0 = arith.constant 0xFF800000 : f32
  %0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1) init (%cst, %cst_0) -> (f32, f32) {
    %3 = memref.load %arg0[%arg1] : memref<10xf32>
    scf.reduce(%3) : f32 {
    ^bb0(%arg2: f32, %arg3: f32):
      %4 = arith.addf %arg2, %arg3 : f32
      scf.reduce.return %4 : f32
    }
    scf.reduce(%3) : f32 {
    ^bb0(%arg2: f32, %arg3: f32):
      %4 = arith.maxf %arg2, %arg3 : f32
      scf.reduce.return %4 : f32
    }
    scf.yield
  }
  %1 = arith.addf %0#0, %cst : f32
  %2 = arith.maxf %0#1, %cst : f32
  return %1, %2 : f32, f32
}
```
There are two `scf.reduce` operations here, but it is unclear to me, both from this output and from the documentation, how to tell which reduce operation corresponds to which reduction value of the original loop. How do I know which reduce corresponds to which reduction?
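To make the ambiguity concrete, here is the same output trimmed down and annotated with my best guess at the pairing. The guess is based purely on the order in which the reduce operations appear in the body and the order of the init values; I haven’t found that rule stated anywhere. The combiner bodies are elided.

```mlir
// Same output as above, trimmed; combiner bodies elided, comments are my guesses.
%0:2 = scf.parallel (%arg1) = (%c0) to (%c10) step (%c1)
         init (%cst, %cst_0) -> (f32, f32) {
  %3 = memref.load %arg0[%arg1] : memref<10xf32>
  scf.reduce(%3) : f32 { ... }  // addf combiner; guess: pairs with init %cst   -> result %0#0 (the sum)
  scf.reduce(%3) : f32 { ... }  // maxf combiner; guess: pairs with init %cst_0 -> result %0#1 (the max)
  scf.yield
}
```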
This question is part of a larger endeavor to map code with `scf.reduce` operations onto GPUs. I’m working on getting parallel loop tiling to work with loops that have reductions, after which I’ll try to get `scf.reduce` mapped to `gpu.all_reduce`.
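For reference, here is a hand-written sketch (not compiler output) of the shape I’m hoping to end up with: each `scf.reduce` becomes a `gpu.all_reduce` inside the launched body. The launch bounds, `%arg0`, and the constants are placeholders borrowed from the example above, and I may well have the exact `gpu.all_reduce` syntax wrong:

```mlir
// Hand-written sketch of the eventual target, not verified: one gpu.all_reduce
// per original reduction, inside a gpu.launch whose bounds are placeholders.
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
           threads(%tx, %ty, %tz) in (%sx = %c10, %sy = %c1, %sz = %c1) {
  %val = memref.load %arg0[%tx] : memref<10xf32>
  // Sum reduction, using the named-op form of gpu.all_reduce.
  %sum = gpu.all_reduce add %val {} : (f32) -> (f32)
  // Max reduction, using the region form with an explicit combiner.
  %max = gpu.all_reduce %val {
  ^bb0(%lhs: f32, %rhs: f32):
    %m = arith.maxf %lhs, %rhs : f32
    gpu.yield %m : f32
  } : (f32) -> (f32)
  // Writing the results back out / combining them with the original inits
  // is omitted here.
  gpu.terminator
}
```

Whether that mapping can be done mechanically depends on being able to tell which reduce feeds which result, hence the question above.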
I’m surprised that this mapping doesn’t exist already, given that many MLIR dialects aim to target GPUs. Are these other projects skipping `scf` and mapping to `gpu` directly when their code contains reductions (as common linear algebra operations do)?