Horizontal ADD across single vector not profitable in SLP Vectorization

Hi all,

The following analysis is regarding a horizontal add across a single vector.

Test case for AARCH64:

#include <arm_neon.h>
unsigned hadd(uint32x4_t a) {
  return a[0] + a[1] + a[2] + a[3];
}

Currently, we emit scalar instructions for the above code.

The IR for the above code will involve -
4 ‘extractelement’ instructions - to extract the elements from vector ‘a’.
3 ‘add’ instructions - to perform the additions.
1 return statement.
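
Roughly, the scalar IR would look something like this (a sketch only; exact value names and attributes may differ from what clang actually emits):

define i32 @hadd(<4 x i32> %a) {
  %e0 = extractelement <4 x i32> %a, i32 0
  %e1 = extractelement <4 x i32> %a, i32 1
  %e2 = extractelement <4 x i32> %a, i32 2
  %e3 = extractelement <4 x i32> %a, i32 3
  %s1 = add i32 %e0, %e1
  %s2 = add i32 %s1, %e2
  %s3 = add i32 %s2, %e3
  ret i32 %s3
}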

Let's say we somehow vectorize this kind of code.
The IR will probably have something like:

  1. Extract a[0] and insert it into vec1 <2 x i32> at index 0
  2. Extract a[1] and insert it into vec1 <2 x i32> at index 1
  3. Extract a[2] and insert it into vec2 <2 x i32> at index 0
  4. Extract a[3] and insert it into vec2 <2 x i32> at index 1
  5. Add vec1 and vec2, giving the sum in vec3 <2 x i32>
  6. Extract vec3[0] into sum1
  7. Extract vec3[1] into sum2
  8. Add sum1 and sum2 into sum3
  9. Return sum3

So overall instructions - 6 ‘extractelement’, 4 ‘insertelement’, 1 vector add, 1 scalar add and 1 return statement. We have vectorized the add operation.
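
For concreteness, a sketch of what this hypothetical IR could look like (value names are made up for illustration):

define i32 @hadd_naive(<4 x i32> %a) {
  %e0 = extractelement <4 x i32> %a, i32 0
  %e1 = extractelement <4 x i32> %a, i32 1
  %e2 = extractelement <4 x i32> %a, i32 2
  %e3 = extractelement <4 x i32> %a, i32 3
  %t1 = insertelement <2 x i32> undef, i32 %e0, i32 0
  %vec1 = insertelement <2 x i32> %t1, i32 %e1, i32 1
  %t2 = insertelement <2 x i32> undef, i32 %e2, i32 0
  %vec2 = insertelement <2 x i32> %t2, i32 %e3, i32 1
  %vec3 = add <2 x i32> %vec1, %vec2
  %sum1 = extractelement <2 x i32> %vec3, i32 0
  %sum2 = extractelement <2 x i32> %vec3, i32 1
  %sum3 = add i32 %sum1, %sum2
  ret i32 %sum3
}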

This indicates the code getting worse than its scalar form (unless I am missing something).

This is related to PR 20035, where it was advised to handle an add across a single vector in the SLP vectorizer.

If my analysis is correct, we can never make a horizontal add across a single vector more profitable in vectorized form (unless I am missing something; perhaps ‘insertelement’ and ‘extractelement’ can be bundled together into a single instruction, but I am not sure about this).

Since there is an AArch64 vector instruction available (ADDV.4S) for addition across a single vector, and if such code cannot be made profitable by vectorizing it in the SLP vectorizer, isn't it better to handle it in the SelectionDAG phase?

Please correct me if I am wrong and suggest a better form of vectorized IR.

Suggestions/comments/corrections are most welcome!

Hi Suyog,

Have a look at the code in HorizontalReduction::getReductionCost and HorizontalReduction::emitReduction.

You don't need 4 extracts. This can be modeled at the IR level as a combination of shufflevector and vector add instructions on a <4 x i32> vector. TargetTransformInfo::getReductionCost can return the appropriate cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if codegen can implement this sequence of instructions more efficiently.

For a <4 x i32> reduction you only need two vector shuffles, two vector adds and one vector extract to get the scalar result.

vadd <0, 1, 2, 3>
     <2, 3, x, x>    // shuffled
=>

<0+2, 1+3, x, x>

vadd <0+2, 1+3, x, x>
     <1+3, x, x, x>  // shuffled
=>

<0+2+1+3, x, x, x>
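
In IR, a rough sketch of this two-shuffle reduction (value names made up for illustration) could look like:

define i32 @hadd_reduce(<4 x i32> %a) {
  ; add the high half onto the low half
  %sh1 = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %add1 = add <4 x i32> %a, %sh1
  ; add lane 1 onto lane 0
  %sh2 = shufflevector <4 x i32> %add1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %add2 = add <4 x i32> %add1, %sh2
  ; lane 0 now holds a[0]+a[1]+a[2]+a[3]
  %res = extractelement <4 x i32> %add2, i32 0
  ret i32 %res
}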

What it takes to get your example working in the SLPVectorizer is:

* Get the matching code up to snuff. I think we should replace the depth-first search matcher by explicitly matching the trees we expect in HorizontalReduction::matchReduction. The code should just look for:

(+ (+ (+ v1 v2) v3) v4)
and maybe
(+ ( + v1 v2) (+ v3 v4))

explicitly for v1, .., vn identical operations.

* Allow a tree of size one (the vector loads) if the tree feeds a reduction.

* Adjust the cost model AArch64::getReductionCost

* AArch64 CodeGen would have to recognize the shuffle reduction if it does not do so already

Best,
Arnold

Hi Arnold,

Thanks for cc’ing me on this. As we discussed at the devmtg, my personal view on this is that the reductions might be better represented as an intrinsic - the matching code is rather complex for the system of shuffles, is duplicated in all backends and is not particularly robust due to the complexity of the pattern.

Intrinsics could lower to this pattern if there is no ISA support on a target; in the meantime it keeps the semantics without allowing later passes to muck up the matchable pattern.
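
For illustration, such an intrinsic form might look something like the sketch below (the intrinsic name here is only illustrative; the actual naming would be part of the proposal):

declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)

define i32 @hadd_intrinsic(<4 x i32> %a) {
  ; a single call carries the reduction semantics, so later passes
  ; cannot perturb a shuffle pattern the backend would have to re-match
  %r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %a)
  ret i32 %r
}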

I have a patch mostly implementing this but it’s stuck in my copious post-devmtg queue (notably with the LNT improvements I promised…)

What’s your opinion on this?

Cheers,

James

Have a look at the code in HorizontalReduction::getReductionCost and
HorizontalReduction::emitReduction.

You don't need 4 extracts. This can be modeled at the IR level as a
combination of shufflevector and vector add instructions on a <4 x i32>
vector. TargetTransformInfo::getReductionCost can return the appropriate
cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if
codegen can implement this sequence of instructions more efficiently.

For a <4 x i32> reduction you only need two vector shuffles, two
vector adds and one vector extract to get the scalar result.

vadd <0, 1, 2, 3>
         <2, 3, x, x> // shuffled
=>

<0+2, 1+3, x, x>

vadd <0+2, 1+3, x, x>
         <1+3, x, x, x> // shuffled
=>

<0+2+1+3, x, x, x>

Ahh!! Shufflevector comes to the rescue. Thanks Arnold for pointing this out. I
ignored it completely in the above explanation.

What it takes to get your example working in the SLPVectorizer is:

* Get the matching code up to snuff. I think we should replace the depth-first
search matcher by explicitly matching the trees we expect in
HorizontalReduction::matchReduction. The code should just look for:

   (+ (+ (+ v1 v2) v3) v4)
    and maybe
    (+ ( + v1 v2) (+ v3 v4))

    explicitly for v1, .., vn identical operations.

* Allow a tree of size one (the vector loads) if the tree feeds a
reduction.

* Adjust the cost model AArch64::getReductionCost

* AArch64 CodeGen would have to recognize the shuffle reduction if it does
not do so already

Seems everything boils down to properly identifying the reduction chain. I
did look at your patch provided in an earlier thread (a similar discussion); it
was working for a reduction chain of 4 elements (+ (+ (+ v1 v2) v3) v4).
However, when I tried it for 8 elements, it was asserting. I will look into
it. Thanks for getting back on this.