Hi all,

At IREE, we are working on improving the generated code for innermost (fastest-varying dimension) reductions and would like to get feedback and give this work some visibility so that other Linalg users can improve reductions in the same way, if they haven’t done something similar already. Our proposal aligns with what LLVM and other production compilers generate for this kind of reduction and should be more performant than the existing alternatives in Linalg today.

# Running Example

We will use the following `i8` → `i32` reduction example throughout this proposal for illustration purposes, where the innermost tensor dimension is reduced:

```
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#map4 = affine_map<(d0, d1) -> (d0)>
util.global private @__A {noinline} = dense<1> : tensor<384x512xi8>
func.func @main_dispatch_5() -> tensor<384xi32> {
%c0_i32 = arith.constant 0 : i32
%0 = util.global.load @__A : tensor<384x512xi8>
%1 = linalg.init_tensor [384] : tensor<384xi32>
%2 = linalg.fill ins(%c0_i32 : i32) outs(%1 : tensor<384xi32>) -> tensor<384xi32>
%3 = linalg.generic {indexing_maps = [#map3, #map4], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<384x512xi8>) outs(%2 : tensor<384xi32>) {
^bb0(%arg2: i8, %arg3: i32):
%4 = arith.extsi %arg2 : i8 to i32
%5 = arith.addi %4, %arg3 : i32
linalg.yield %5 : i32
} -> tensor<384xi32>
return %3 : tensor<384xi32>
}
```

# Background

We currently have two strategies to code-generate reductions on the innermost tensor dimension in Linalg: *InnerParallel* and *InnerReduction*.

## InnerParallel

The *InnerParallel* strategy transposes the reduction dimension with an outer parallel dimension of the tensor. Vectorization is then applied to the innermost parallel dimension such that each vector lane computes a full independent reduction. No horizontal vector reductions are required by this approach. The following snippet shows the resulting IR after applying the *InnerParallel* strategy to the running example:

```
scf.for %arg0 = %c0 to %c32 step %c8 {
%5 = scf.for %arg1 = %c0 to %c512 step %c64 iter_args(%arg2 = %cst) -> (vector<8xi32>) {
%6 = vector.transfer_read %4[%arg0, %arg1], %c0_i8 {in_bounds = [true, true]} : memref<32x512xi8, affine_map<(d0, d1)[s0] -> (d0 * 512 + s0 + d1)>>, vector<8x64xi8>
%7 = arith.extsi %6 : vector<8x64xi8> to vector<8x64xi32>
%8 = vector.transpose %7, [1, 0] : vector<8x64xi32> to vector<64x8xi32>
%9 = vector.extract %8[0] : vector<64x8xi32>
%10 = arith.addi %9, %arg2 : vector<8xi32>
...
%135 = vector.extract %8[63] : vector<64x8xi32>
%136 = arith.addi %135, %134 : vector<8xi32>
scf.yield %136 : vector<8xi32>
}
vector.transfer_write %5, %3[%arg0] {in_bounds = [true]} : vector<8xi32>, memref<32xi32, affine_map<(d0)[s0] -> (d0 + s0)>>
}
```

*InnerParallel* may offer higher accuracy in floating-point reductions since floating-point elements are reduced in sequential order (assuming that sequential order means higher accuracy). However, transposing the data is very expensive!
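To make the lane structure concrete, here is a minimal scalar model of *InnerParallel* in Python (illustrative only, not IREE's actual code generation; `inner_parallel` and the tile shape are hypothetical names for this sketch). After the transpose, each vector lane owns one independent reduction and every step is a single vertical vector add:

```python
# Scalar model of the InnerParallel strategy (illustrative only).
# Each of the `rows` vector lanes accumulates its own independent
# reduction; every step performs one vertical vector add across lanes
# and no horizontal vector reduction is ever needed.
def inner_parallel(tile):
    rows, cols = len(tile), len(tile[0])
    acc = [0] * rows                  # one accumulator per lane (row)
    for k in range(cols):             # walk the reduction dimension
        # Vertical vector add: lane j consumes element k of row j,
        # i.e. column k of the logically transposed tile.
        acc = [acc[j] + tile[j][k] for j in range(rows)]
    return acc                        # one finished reduction per lane

# 8x512 tile of ones, matching the running example's shape after tiling.
tile = [[1] * 512 for _ in range(8)]
print(inner_parallel(tile))  # each lane independently sums to 512
```

The model makes the trade-off visible: the per-step work is cheap, but obtaining the column-major access pattern is exactly the transposition cost discussed above.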

## InnerReduction

The *InnerReduction* strategy keeps the reduction dimension in the innermost dimension. Vectorization is applied to the reduction dimension and a scalar accumulator is used along the reduced dimension. Using a *scalar* accumulator requires performing horizontal vector reductions per iteration of the innermost loop to add the partial reduction of each iteration to the *scalar* accumulator. The following snippet shows the resulting IR after applying the *InnerReduction* strategy to the running example:

```
scf.for %arg0 = %c0 to %c32 step %c8 {
%5 = scf.for %arg1 = %c0 to %c512 step %c64 iter_args(%arg2 = %cst) -> (vector<8xi32>) {
%6 = vector.transfer_read %4[%arg0, %arg1], %c0_i8 {in_bounds = [true, true]} : memref<32x512xi8, affine_map<(d0, d1)[s0] -> (d0 * 512 + s0 + d1)>>, vector<8x64xi8>
%7 = arith.extsi %6 : vector<8x64xi8> to vector<8x64xi32>
%8 = vector.extract %7[0] : vector<8x64xi32>
%9 = vector.extract %arg2[0] : vector<8xi32>
%10 = vector.reduction <add>, %8, %9 : vector<64xi32> into i32
%11 = vector.insertelement %10, %cst[%c0 : index] : vector<8xi32>
...
%36 = vector.extract %7[7] : vector<8x64xi32>
%37 = vector.extract %arg2[7] : vector<8xi32>
%38 = vector.reduction <add>, %36, %37 : vector<64xi32> into i32
%39 = vector.insertelement %38, %35[%c7 : index] : vector<8xi32>
scf.yield %39 : vector<8xi32>
}
vector.transfer_write %5, %3[%arg0] {in_bounds = [true]} : vector<8xi32>, memref<32xi32, affine_map<(d0)[s0] -> (d0 + s0)>>
}
```

*InnerReduction* may be more performant than *InnerParallel* since data transposition is not required. However, horizontal vector reductions are also somewhat expensive, and they are used in every iteration of the reduction loop. This approach may not reduce the elements in sequential order, so the result of floating-point reductions may be “less accurate”.
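The per-iteration horizontal reduction can be sketched with a scalar model in Python (illustrative only; `inner_reduction` and `vlen` are hypothetical names for this sketch):

```python
# Scalar model of the InnerReduction strategy (illustrative only).
# Each vlen-wide chunk of the reduction dimension is horizontally
# reduced to a scalar on every loop iteration, so the reduction loop
# pays for one horizontal vector reduction per iteration.
def inner_reduction(row, vlen=64):
    acc = 0                            # scalar accumulator
    for k in range(0, len(row), vlen):
        chunk = row[k:k + vlen]        # one vector load
        acc += sum(chunk)              # horizontal reduction, every step
    return acc

print(inner_reduction([1] * 512))  # 512
```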

# Proposal

We want to implement a new reduction strategy that avoids the overhead of transposing data but also minimizes the number of horizontal vector reductions by:

- Using a *vector* accumulator along the vector iterations of the reduction dimension instead of a *scalar* accumulator.
- Computing a *single* horizontal vector reduction per reduction, once all the elements have been reduced to the single vector accumulator.

The following snippet shows the potential resulting IR after applying the new strategy to the running example (handwritten IR, be aware of potential mistakes):

```
%6 = scf.for %arg1 = %c0 to %c32 step %c8 iter_args(%arg2 = %4) -> (tensor<32xi32>) {
// Create a new temporary that will hold the vector partial results that
// will later be horizontally reduced.
%tmp = linalg.init_tensor [8, 16] : tensor<8x16xi32>
%tmp_init = vector.transfer_write %cst8x16, %tmp[%c0, %c0] {in_bounds = [true, true]} : vector<8x16xi32>, tensor<8x16xi32>
%9 = scf.for %arg3 = %c0 to %c512 step %c64 iter_args(%arg4 = %tmp_init) -> (tensor<8x16xi32>) {
%11 = vector.transfer_read %5[%arg1, %arg3], %c0_i8 {in_bounds = [true, true]} : tensor<32x512xi8>, vector<8x64xi8>
%12 = vector.transfer_read %arg4[%c0, %c0], %c0_i32 {in_bounds = [true, true]} : tensor<8x16xi32>, vector<8x16xi32>
%13 = arith.extsi %11 : vector<8x64xi8> to vector<8x64xi32>
// Note 'vector.multi_reduction'’s output type. Each partial reduction is not
// reduced to a scalar but to a 16-wide vector accumulator.
%14 = vector.multi_reduction <add>, %13, %12 [1] : vector<8x64xi32> to vector<8x16xi32>
%15 = vector.transfer_write %14, %arg4[%c0, %c0] {in_bounds = [true, true]} : vector<8x16xi32>, tensor<8x16xi32>
scf.yield %15 : tensor<8x16xi32>
}
// A final horizontal reduction reduces the vector accumulator to a scalar.
%vred = vector.transfer_read %9[%c0, %c0], %c0_i32 {in_bounds = [true, true]} : tensor<8x16xi32>, vector<8x16xi32>
%hred = vector.multi_reduction <add>, %vred, %cst [1] : vector<8x16xi32> to vector<8xi32>
%10 = vector.transfer_write %hred, %arg2[%arg1] {in_bounds = [true]} : vector<8xi32>, tensor<32xi32>
scf.yield %10 : tensor<32xi32>
}
```

As noted in the IR, a 16-element vector accumulator is created for each reduction, which is used within the innermost loop to reduce the full reduction dimension to only 16 elements. We took the liberty of using the `vector.multi_reduction` operation with a `vector<8x16xi32>` return type to illustrate this idea, even though that return type is not supported right now. After the reduction loop, a single horizontal vector reduction produces the final scalar result of each reduction out of the vector accumulator.

This approach should be more performant than *InnerReduction* and *InnerParallel*. Accuracy should be on par with *InnerReduction* since the order in which the elements are reduced is similar to some extent. Extra accuracy might be achieved if we enforced horizontal vector reductions to be computed in sequential order, at the expense of some performance.
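A scalar model of the proposed strategy in Python (illustrative only; `vector_acc_reduction` and `vlen` are hypothetical names for this sketch) shows how the horizontal reduction is hoisted out of the loop:

```python
# Scalar model of the proposed strategy (illustrative only). A vlen-wide
# vector accumulator absorbs the whole reduction dimension using cheap
# vertical adds; a single horizontal reduction runs once, after the loop.
def vector_acc_reduction(row, vlen=16):
    acc = [0] * vlen                       # vector accumulator
    for k in range(0, len(row), vlen):
        chunk = row[k:k + vlen]
        # Vertical vector add only; no horizontal reduction in the loop.
        acc = [a + c for a, c in zip(acc, chunk)]
    return sum(acc)                        # single final horizontal reduction

print(vector_acc_reduction([1] * 512))  # 512
```

Compared with the *InnerReduction* model, the number of horizontal reductions drops from one per loop iteration to one per reduction, while no transposition is introduced.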

# Implementation

We plan to follow the spirit of other transformations in Linalg by decomposing the new reduction approach into simple and composable transformations. For simple cases, we will apply the following steps:

- Split the input `linalg.generic` op into two `linalg.generic` ops with the existing split reduction utility (thanks for the suggestion, @ThomasRaoux!). They will model the loop performing the reduction using a vector accumulator and the final horizontal vector reduction, respectively. The following snippet illustrates how the running example would look after such a split (handwritten example, not the actual output of the split reduction utility):

```
%11 = scf.for %arg3 = %c0 to %c512 step %c64 iter_args(%arg4 = %10) -> (tensor<8x16xi32>) {
%12 = tensor.extract_slice %5[%arg1, %arg3] [8, 64] [1, 1] : tensor<32x512xi8> to tensor<8x64xi8>
%13 = tensor.cast %12 : tensor<8x64xi8> to tensor<*xi8>
%14 = tensor.cast %13 : tensor<*xi8> to tensor<8x4x16xi8>
// First generic op: partial reduction on vector accumulator.
%15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
affine_map<(d0, d1, d2) -> (d0, d2)>],
iterator_types = ["parallel", "reduction", "parallel"]}
ins(%14 : tensor<8x4x16xi8>) outs(%arg4 : tensor<8x16xi32>) {
^bb0(%arg5: i8, %arg6: i32):
%16 = arith.extsi %arg5 : i8 to i32
%17 = arith.addi %16, %arg6 : i32
linalg.yield %17 : i32
} -> tensor<8x16xi32>
scf.yield %15 : tensor<8x16xi32>
}
// Second generic op: final horizontal vector reductions.
%12 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
affine_map<(d0, d1) -> (d0)>],
iterator_types = ["parallel", "reduction"]}
ins(%11 : tensor<8x16xi32>) outs(%8 : tensor<8xi32>) {
^bb0(%arg4: i32, %arg5: i32):
%16 = arith.addi %arg4, %arg5 : i32
linalg.yield %16 : i32
} -> tensor<8xi32>
```

- Vectorize both generic ops with the *InnerReduction* approach, which will lead to vertical vector operations for the outer dimension reduction in the first `linalg.generic` op and horizontal vector reductions for the innermost dimension in the second `linalg.generic` op. No major changes would be needed in the vectorizer. The resulting code would be similar to the one shown in the Proposal section.
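The two-generic decomposition above can be modeled with two scalar Python functions (illustrative only; `partial_reduce` and `final_reduce` are hypothetical names for this sketch), one per `linalg.generic` op:

```python
# Scalar model of the split-reduction decomposition (illustrative only).

def partial_reduce(row, vlen=16):
    # First linalg.generic: view the reduction dimension [n] as
    # [n/vlen, vlen] and reduce the outer dim into vlen partial sums.
    # After vectorization these become vertical vector adds.
    partial = [0] * vlen
    for i in range(0, len(row), vlen):
        for j in range(vlen):
            partial[j] += row[i + j]
    return partial

def final_reduce(partial):
    # Second linalg.generic: horizontally reduce the partial sums
    # to the final scalar result.
    return sum(partial)

print(final_reduce(partial_reduce([1] * 512)))  # 512
```

Keeping the two stages as separate ops is what lets the existing *InnerReduction* vectorization handle each one without major changes.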

As part of this work, we will also extend Linalg’s Code Generation Strategy infrastructure to support this new reduction approach.

Thanks! Any feedback would be appreciated!

@nicolasvasilache, @ThomasRaoux, @MaheshRavishankar, @vmurali