[RFC] Introducing a vector reduction add instruction.

Hi

When a reduction loop is vectorized, the scalar reduction instruction
is turned into an instruction with vector operands of the same
operation type. This new instruction has a special property that can
give us more flexibility during instruction selection later: the
operation is valid as long as the reduction of all elements of the
result vector is identical to the reduction of all elements of its
operands.

One example that can benefit from this property is SAD (sum of
absolute differences) pattern detection on SSE2, which provides the
psadbw instruction, described below:

'''
psadbw: Compute the absolute differences of packed unsigned 8-bit
integers in a and b, then horizontally sum each consecutive 8
differences to produce two unsigned 16-bit integers, and pack these
unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
'''
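For reference, here is a scalar C model of that description, assuming 128-bit operands (two 64-bit lanes of 8 unsigned bytes each); the function name is mine, not a real intrinsic:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of psadbw on 128-bit operands: for each 64-bit lane,
   sum the absolute differences of its 8 unsigned bytes; the 16-bit
   sum occupies the low bits of the lane, the upper bits are zeroed. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint64_t dst[2]) {
  for (int lane = 0; lane < 2; ++lane) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i) {
      int d = a[lane * 8 + i] - b[lane * 8 + i];
      sum += (uint16_t)(d < 0 ? -d : d);
    }
    dst[lane] = sum; /* low 16 bits hold the sum; bits 16..63 are zero */
  }
}
```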

In LLVM IR, a SAD loop will have two v4i8 inputs and one v4i32
output. However, psadbw actually produces one i32 result for four
pairs of 8-bit integers (an already-reduced result), and this result
is stored in the first element of the v4i32. If we properly zero out
the other three elements of the v4i32, and we know that a reduction
add is performed on this result, then we can safely use psadbw here
for much better performance. This can be done during DAG combine.
Another similar example is dot product, and I think there may be many
other scenarios that can benefit from this property, such as
eliminating redundant shuffles.

The question is: how do we let the DAG combiner know that a vector
operation is a reduction?

Here I propose introducing a "reduction add" instruction for vectors.
This would be a new instruction with vector operands only. Normally
it is treated as a normal ADD operation, but the SelectionDAG
combiner can make use of it to generate better instructions. The new
instruction would be generated when the loop vectorizer vectorizes a
reduction add.

I would like to hear more comments on this proposal or suggestions of
better alternative implementations.

thanks,
Cong

Reductions can be implemented in two ways with SIMD:

  1. partial reduction in the loop followed by a full reduction outside the loop
  2. use full SIMD reduction instruction in the loop

Having a way for the vectorizer to express patterns for 2) sounds useful.

The hsum intrinsic proposed by Shahid looks like a full reduction for the 'ADD' operation. The questions are: 1) do we need to add a new instruction for this? 2) do we need to generalize vector reduction to cover other scalar opcodes such as min/max?

As regards the SAD implementation: the IR idiom for SAD involves integer promotion/widening, so it seems to me that what is missing is a widening diff operation/intrinsic. With that, the SAD pattern can be expressed by the vectorizer as:

for (…) {
   v16i8 v1 = vect_load(…)
   v16i8 v2 = vect_load(…)
   v16i16 vdiff = vect_widen_diff(v1, v2)
   v16i16 vabsdiff = vect_abs(vdiff)
   i16 ss = vect_hsum(vabsdiff)
   r += ss
}
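In scalar terms, one iteration of that pattern would compute something like the sketch below (plain C; the vect_* names above are proposed intrinsics, not an existing API):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of one vector iteration: widen the i8 inputs to i16
   and subtract (vect_widen_diff), take the absolute value (vect_abs),
   and sum the 16 lanes horizontally (vect_hsum). */
static int hsum_absdiff16(const uint8_t v1[16], const uint8_t v2[16]) {
  int sum = 0;
  for (int i = 0; i < 16; ++i) {
    int16_t diff = (int16_t)v1[i] - (int16_t)v2[i];
    sum += diff < 0 ? -diff : diff;
  }
  return sum;
}
```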

David

I think doing a full reduction with hadd inside the loop is
unnecessary: in most cases it is less efficient than a partial
reduction in the loop plus a full reduction outside of it. But SSE's
SAD and dot-product instructions perform horizontal adds by nature,
and if we want to combine instructions into them, we need the
information that we are doing a reduction. That is why I am proposing
a reduction-add IR operation.

Cong

After some attempts to implement reduce-add in LLVM, I found an
easier way to detect reduce-add without introducing new IR
operations. The basic idea is annotating the phi node instead of the
add (so that it is easier to handle other reduction operations). In
the PHINode class, we can add a flag indicating whether the phi node
is a reduction one (the flag can be set in the loop vectorizer for
vectorized phi nodes). Then, when we build SDNodes for instruction
selection, we detect those reduction phi nodes and annotate the
reduction operations, which requires an additional flag in
SDNodeFlags. We can then check this flag when combining instructions
to detect reduction operations.

In this approach, I have managed to let LLVM compile a SAD loop into
psadbw instructions.

Source code:

const int N = 1024;
unsigned char a[N], b[N];

int sad() {
  int s = 0;
  for (int i = 0; i < N; ++i) {
    int res = a[i] - b[i];
    s += (res > 0) ? res : -res;
  }
  return s;
}

Emitted instructions on X86:

# BB#0: # %entry
pxor %xmm0, %xmm0
movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
pxor %xmm1, %xmm1
.align 16, 0x90
.LBB0_1: # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movd b+1024(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero
movd a+1024(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero
psadbw %xmm2, %xmm3
paddd %xmm3, %xmm0
movd b+1028(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero
movd a+1028(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero
psadbw %xmm2, %xmm3
paddd %xmm3, %xmm1
addq $8, %rax
jne .LBB0_1
# BB#2: # %middle.block
paddd %xmm0, %xmm1
pshufd $78, %xmm1, %xmm0 # xmm0 = xmm1[2,3,0,1]
paddd %xmm1, %xmm0
pshufd $229, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3]
paddd %xmm0, %xmm1
movd %xmm1, %eax
retq
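For readers unfamiliar with the shuffle encodings: the pshufd/paddd sequence in %middle.block is a log2-step horizontal sum of the v4i32 accumulator. In scalar C it is roughly:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the %middle.block reduction: pshufd $78 / paddd
   adds the upper 64-bit half onto the lower half; pshufd $229 / paddd
   then adds lane 1 onto lane 0; movd extracts lane 0. */
static int32_t hsum_v4i32(const int32_t v[4]) {
  int32_t lo = v[0] + v[2];
  int32_t hi = v[1] + v[3];
  return lo + hi;
}
```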

Note that due to the smaller VF we are using now (currently 4), we
cannot exploit the full benefit of psadbw. The patch in
http://reviews.llvm.org/D8943 enables us to use bigger VFs based on
the smallest type in a loop. The follow-up work is refining the cost
model so that bigger VFs have lower cost. For the example above the
best result comes from VF >= 16.

The draft of the patch is here: http://reviews.llvm.org/D14840

I will refine the patch later and submit it for review.

thanks,
Cong

Hi Cong,

You hit the right place. It seems I did not get your question
regarding this earlier. This is what we did to find the reduction
pattern at the IR level in the loop vectorizer. Adding an SDNode flag
seems to be the right thing to do.

> For the example above the best result is from VF >= 16.

That is right. I also observed that for a constant loop bound < 16,
the pattern gets unrolled to straight-line code. This may be due to
cost-model inefficiency, so after the improvement of the cost model
this case would also be handled.

Regards,
Shahid

This sounds like the right approach and the result is very promising. The VF tuning patch is already in but just not enabled by default, right?

David

> This sounds like the right approach and the result is very promising. The VF
> tuning patch is already in but just not enabled by default, right?

Yes, but the cost model is not ready yet; it depends on the patch
that optimizes promotion/demotion between vectors of i8 and i32
(http://reviews.llvm.org/D14588). Once that patch is checked in, I
will update the cost model to let bigger VFs be chosen.

Cong

Hi Cong,

After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it have something to do with the ordering of the reduction operations?

Thanks again,
Hal

> After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it have something to do with the ordering of the reduction operations?

The flag is only useful for vectorized reductions for now. I'll give
you an example of how this flag is used.

Given the following reduction loop:

int a[N];
int s = 0;
for (int i = 0; i < N; ++i)
  s += a[i];

After it is vectorized, we have a reduction add operation whose
operands and result are vectors. Suppose it is represented as:

[s0, s1, s2, s3] = [s0, s1, s2, s3] + [a0, a1, a2, a3].

If we know this operation is a reduction, then we have some
flexibility in how the elements of the result are organized, as long
as their sum stays the same (the precondition is that the only use of
the reduction result is the reduction phi node, which is usually true
as long as a reduction loop can be vectorized). For example, if we let
the result be [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3, 0], the
reduction result won't change. This enables us to detect SAD or
dot-product patterns and use SSE's psadbw and pmaddwd instructions.
Please see my response to your other email for more details.
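A quick sanity check of that invariant in plain C (illustration only, not LLVM code): the rearranged results all reduce to the same scalar.

```c
#include <assert.h>

/* Only the sum of the lanes matters for a reduction-add result, so
   [s0+s1, 0, s2+s3, 0] and [0, 0, s0+s1+s2+s3, 0] are both valid
   replacements for [s0, s1, s2, s3]. */
static int lane_sum(const int v[4]) {
  return v[0] + v[1] + v[2] + v[3];
}
```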

Thanks!

Cong

Hal is probably not questioning the usefulness of reduction recognition and a way to represent it, but asking for the clear semantics of the flag. You can probably draw some ideas from the OMP SIMD reduction clause or Intel's SIMD pragma's reduction clause.

https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40

David

From: "Xinliang David Li" <davidxl@google.com>
To: "Cong Hou" <congh@google.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Wednesday, November 25, 2015 5:17:58 PM
Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.

> Hal is probably not questioning about the usefulness of reduction
> recognition and a way to represent it, but the clear semantics of
> the flag. You can probably draw some ideas from OMP SIMD reduction
> clause, or intel's SIMD pragma's reduction clause.

True, but nevertheless, Cong's reply was useful. Here's my interpretation so far:

Placing this flag on a PHI node, which is only valid for a vector-valued PHI, indicates that only the sum of vector elements is meaningful. This could easily be extended to cover any associative operation (only the product is useful, only the maximum or minimum value is useful, etc.).

Now I completely understand why the flag is useful at the SDAG level. Because SDAG is basic-block local, we can't examine the loop structure when doing instruction selection for the relevant operations composing the psadbw (and friends). We also need to realize, when lowering the horizontal reduction at the end of the loop, to lower it in some more-trivial way (right?).

Regarding the metadata at the IR level: the motivation here is that, without it, the SDAG builder would need to examine the uses of the PHI, determine that the only uses were shuffles and extracts representing a horizontal reduction (add, etc.), and then deduce that it could add the flag from that.

If this matching is practical (and I suspect that it is, given that we already do it in the backends to match horizontal adds), this seems better than using vectorizer-added metadata. In this way, IR generated using vector intrinsics (etc.) can receive the same good code generation as that generated by the vectorizer.

Thanks again,
Hal


> True, but nevertheless, Cong's reply was useful. Here's my interpretation so far:
>
> Placing this flag on a PHI node, which is only valid for a vector-valued PHI, indicates that only the sum of vector elements is meaningful. This could easily be extended to cover any associative operation (only the product is useful, only the maximum or minimum value is useful, etc.).

Right.

> Now I completely understand why the flag is useful at the SDAG level. Because SDAG is basic-block local, we can't examine the loop structure when doing instruction selection for the relevant operations composing the psadbw (and friends). We also need to realize, when lowering the horizontal reduction at the end of the loop, to lower it in some more-trivial way (right?).

The benefit of collecting the result outside of the loop is trivial
unless it is in an outer loop. And I am not sure the reduction info
helps here, as the way the results are collected is already
determined once the loop has been vectorized.

> Regarding the metadata at the IR level: the motivation here is that, without it, the SDAG builder would need to examine the uses of the PHI, determine that the only uses were shuffles and extracts representing a horizontal reduction (add, etc.), and then deduce that it could add the flag from that.

The reduction pattern detection for the vectorized loop at this point
can be difficult for the following reasons:

1. The vectorized code outside of the loop, which collects the results
in the vector into a scalar, can be messy. For example, there may be
many shuffles, which are not straightforward to understand (although
it is still possible to detect them).

2. After loop unrolling, the reduction operation is duplicated several
times but the phi node is not. We should be able to detect all those
unrolled reductions, but this task is also challenging.

3. We need to make sure there are no other uses of the result of the
reduction operation (operations after loop unrolling are an exception,
where the result can be used by another copy of the same reduction
operation).

However, all this information can easily be obtained during loop
vectorization. So why not record it at that stage?

> If this matching is practical (and I suspect that it is, given that we already do it in the backends to match horizontal adds), this seems better than using vectorizer-added metadata. In this way, IR generated using vector intrinsics (etc.) can receive the same good code generation as that generated by the vectorizer.

This is a good point. But I am not sure it is worth the effort to
write a reduction detector for intrinsics. The whole idea is flexible
enough to accept any reduction detector, though.

Cong

From: "Cong Hou" <congh@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Xinliang David Li" <davidxl@google.com>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Wednesday, November 25, 2015 6:33:04 PM
Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.

>> Hal is probably not questioning about the usefulness of reduction
>> recognition and a way to represent it, but the clear semantics of
>> the flag. You can probably draw some ideas from OMP SIMD reduction
>> clause, or intel's SIMD pragma's reduction clause.
>
> True, but nevertheless, Cong's reply was useful. Here's my
> interpretation so far:
>
> Placing this flag on a PHI node, which is only valid for a
> vector-valued PHI, indicates that only the sum of vector elements
> is meaningful. This could easily be extended to cover any
> associative operation (only the product is useful, only the
> maximum or minimum value is useful, etc.).

> Right.

>
> Now I completely understand why the flag is useful at the SDAG
> level. Because SDAG is basic-block local, we can't examine the
> loop structure when doing instruction selection for the relevant
> operations composing the psadbw (and friends). We also need to
> realize, when lowering the horizontal reduction at the end of the
> loop, to lower it in some more-trivial way (right?).

> The benefit of collecting the result outside of the loop is trivial
> unless it is in an outer loop. And I am afraid if the reduction info
> helps here as the way how the results are collected is determined
> already after the loop is vectorized.

Yes, but "unless it is in an outer loop" is an important case in practice. The underlying situation is: Given that the target has lowered the instructions producing the PHI value to ones that always leave certain vector elements zero, we should be able to use that information when simplifying/lowering the horizontal sum in the exit block.

One way to handle this is to enhance the SelectionDAGISel::ComputeLiveOutVRegInfo() function, and related infrastructure, to understand what is going on (with target help), so that by the time we get to the exit block, the necessary information is known and can be used to simplify the incoming vector values. This can certainly be considered follow-up work.

>
> Regarding the metadata at the IR level: the motivation here is
> that, without it, the SDAG builder would need to examine the uses
> of the PHI, determine that the only uses were shuffles and
> extracts representing a horizontal reduction (add, etc.), and then
> deduce that it could add the flag from that.

> The reduction pattern detection for the vectorized loop at this point
> can be difficult for the following reasons:
>
> 1. The vectorized code outside of the loop which collects results in
> the vector into a scalar can be messy. For example, there may be many
> shuffles, which is not straightforward to understand (although it is
> still possible to detected them).
>
> 2. After loop unrolling, the reduction operation is duplicated several
> times but the phi node is not. We should be able to detect all those
> unrolled reductions but this task is also challenging.
>
> 3. We need to make sure there is no other uses of the result of the
> reduction operation (those operations after loop unrolling is an
> exception, where the result can be used by another copy of the same
> reduction operation).
>
> However, all those information can be easily obtained during loop
> vectorization. So why not record it in that stage?

Because it adds yet-another place (PHI nodes) that we need to worry about preserving metadata, and it adds information redundant with that already contained in the IR. Generally, we try not to do that.

If it were the case that re-deriving that information during SDAG building would be non-trivial (because, for example, it would require pulling in some expensive analysis), I'd consider that a valid reason. That does not seem to be the case, however.

We need to be very careful not to fall into the trap of having a "magic vectorizer", meaning a vectorizer that knows how to induce special code-generation functionality not otherwise available to generic input code. It is trivial to write, using completely generic LLVM IR with vector types, the code that the vectorizer would produce. A frontend might generate this directly, for example. It is important that we treat such code as a "first-class citizen" within our framework. Thus, I'd highly prefer that we pattern match the necessary IR in the SDAG builder instead of adding metadata in the vectorizer.

>
> If this matching is practical (and I suspect that it is, given that
> we already do it in the backends to match horizontal adds), this
> seems better than using vectorizer-added metadata. In this way, IR
> generated using vector intrinsics (etc.) can receive the same good
> code generation as that generated by the vectorizer.

> This is a good point. But I am not sure if it is worth the effort to
> write a reduction detector for intrinsics. But the whole idea is
> flexible to accept any reduction detector.

When I say "intrinsics", I mean using native LLVM vector types.

Thanks again,
Hal


> Yes, but "unless it is in an outer loop" is an important case in practice. The underlying situation is: Given that the target has lowered the instructions producing the PHI value to ones that always leave certain vector elements zero, we should be able to use that information when simplifying/lowering the horizontal sum in the exit block.
>
> One way to handle this is to enhance the SelectionDAGISel::ComputeLiveOutVRegInfo() function, and related infrastructure, to understand what is going on (with target help), so that by the time we get to the exit block, the necessary information is known and can be used to simplify the incoming vector values. This can certainly be considered follow-up work.

Sounds great. It is also possible to move the collection of the
reduction vector from the outer loop to outside of it when the
collected result is also a reduction variable of the outer loop.


> Because it adds yet-another place (PHI nodes) that we need to worry about preserving metadata, and it adds information redundant with that already contained in the IR. Generally, we try not to do that.
>
> If it were the case that re-deriving that information during SDAG building would be non-trivial (because, for example, it would require pulling in some expensive analysis), I'd consider that a valid reason. That does not seem to be the case, however.
>
> We need to be very careful not to fall into the trap of having a "magic vectorizer", meaning a vectorizer that knows how to induce special code-generation functionality not otherwise available to generic input code. It is trivial to write, using completely generic LLVM IR with vector types, the code that the vectorizer would produce. A frontend might generate this directly, for example. It is important that we treat such code as a "first-class citizen" within our framework. Thus, I'd highly prefer that we pattern match the necessary IR in the SDAG builder instead of adding metadata in the vectorizer.

I generally agree with you. We can retrieve the information that an
operation is a reduction by pattern matching, although that is a
little more expensive. As it eliminates unnecessary or redundant
information recording in the loop vectorizer, I think this idea is
reasonable. Loop unrolling makes the case more complicated, but it is
still doable. I will try to implement this approach.

Thank you very much for the suggestion!

Cong