# Horizontal ADD across single vector not profitable in SLP Vectorization

Hi all,

Following Analysis is regarding horizontal add across single vector.

Test case for AARCH64:

``````#include <arm_neon.h>
return a[0] + a[1] + a[2] + a[3];
}
``````

Currently, we emit scalar instructions for above code.

IR for above code will involve -
4 ‘extractelement’ - to extract elements from vector ‘a’.
1 return statement.

Lets say, we somehow vectorize this kind of code.
The IR will probably have something like :

1. Extract a[0] and put it in vec1 <2 x i32>, 0
2. Extract a[1] and put it in vec1 <2 x i32>, 1
3. Extract a[2] and put it in vec2 <2 x i32>, 0
4. Extract a[3] and put it in vec2 <2 x i32>, 1
5. Add vec1 and vec2, sum in vec3 <2 x i32>
6. Extract vec3[0] in sum1
7. Extract vec3[1] in sum2
7 add sum1 and sum2 in sum3
8. return sum3

So overall instructions - 6 ‘extractlement’, 4 ‘insertelement’, 1 vector add, 1 scalar add and 1 return statement. We have vectorized add operation.

This indicates code getting worse than its scalar form (if i am not missing something).

This was related to PR 20035, where it was advised to handle add across single vector in SLP vectorizer.

If my analysis is correct, we can never have a more profitable horizontal add across a single vector in vectorized form (Unless if i am missing something, perhaps may be ‘insertelement and extractelement can be bundled together in single instruction’, not sure on this).

As there is an ARM vector instruction available - ADDV.4S for addition across a sinle vector and if such code cannot be made profitable by vectorizing it in SLP, isn’t it better to handle in SelectionDAG phase?

Please correct me if i am wrong and suggest better form of vectorized IR.

Hi Suyog,

Have a look at the code in HorizontalReduction::getReductionCost and HorizontalReduction::emitReduction.

You don't need 4 extracts. This can be modeled at the IR level as a combination of shufflevector and vector add instruction on a <4 x i32> vector. TargetTransformInfo::getReductionCost can return the appropriate cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if codegen can implement this sequence of instructions more efficiently.

For a <4 x i32> reduction you need only need two vector shuffles, two vector adds and one vector extract to get the scalar result.

<2, 3, x, x> // shuffled
=>

<0+2, 1+3, x, x>

<1+3, x, x x> // shuffled
=>

<0+2+1+3, x, x, x>

What it takes to get your example working in the SLPVectorizer is:

* Get the matching code up to snuff. I think, we should replace the depth first search matcher by explicitly matching the trees we expect in HorizontalReduction::matchReduction. The code should just look for:

(+ (+ (+ v1 v2) v3) v4)
and maybe
(+ ( + v1 v2) (+ v3 v4))

``````explicitly for v1, \.\., vn identical operations\.
``````

* Allow a tree of size of one (the vector loads) if the tree feeds a reduction.

* Adjust the cost model AArch64::getReductionCost

* AArch64 CodeGen would have to recognize the shuffle reduction if it does not do so already

Best,
Arnold

Hi Arnold,

Thanks for cc’ing me on this. As we discussed at the devmtg, my personal view on this is that the reductions might be better represented as an intrinsic - the matching code is rather complex for the system of shuffles, is duplicated in all backends and is not particularly robust due to the complexity of the pattern.

Intrinsics could lower to this pattern if there is no ISA support for a target- in the meantime it keeps the semantics without allowing later passes to muck up the matchable pattern.

I have a patch mostly implementing this but it’s stuck in my copious post-devmtg queue (notably with the LNT improvments I promised…)

Cheers,

James

Have a look at the code in HorizontalReduction::getReductionCost and
HorizontalReduction::emitReduction.

You don't need 4 extracts. This can be modeled at the IR level as a
combination of shufflevector and vector add instruction on a <4 x i32>
vector. TargetTransformInfo::getReductionCost can return the appropriate
cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if
codegen can implement this sequence of instructions more efficiently.

For a <4 x i32> reduction you need only need two vector shuffles, two
vector adds and one vector extract to get the scalar result.

<2, 3, x, x> // shuffled
=>

<0+2, 1+3, x, x>

<1+3, x, x x> // shuffled
=>

<0+2+1+3, x, x, x>

Ahh!! Shuffle vector comes to the rescue. Thanks Arnold for pointing out. I
ignored it completely in above explanation.

What it takes to get your example working in the SLPVectorizer is:

* Get the matching code up to snuff. I think, we should replace the depth
first search matcher by explicitly matching the trees we expect in
HorizontalReduction::matchReduction. The code should just look for:

(+ (+ (+ v1 v2) v3) v4)
and maybe
(+ ( + v1 v2) (+ v3 v4))

explicitly for v1, .., vn identical operations.

* Allow a tree of size of one (the vector loads) if the tree feeds a
reduction.

* Adjust the cost model AArch64::getReductionCost

* AArch64 CodeGen would have to recognize the shuffle reduction if it does