From: "Cong Hou" <congh@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Xinliang David Li" <davidxl@google.com>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Wednesday, November 25, 2015 6:33:04 PM
Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.
>> From: "Xinliang David Li" <davidxl@google.com>
>> To: "Cong Hou" <congh@google.com>
>> Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev"
>> <llvm-dev@lists.llvm.org>
>> Sent: Wednesday, November 25, 2015 5:17:58 PM
>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
>> instruction.
>>
>>
>> Hal is probably not questioning the usefulness of reduction
>> recognition or a way to represent it, but rather the precise
>> semantics of the flag. You can probably draw some ideas from the
>> OMP SIMD reduction clause, or Intel's SIMD pragma's reduction
>> clause.
>
> True, but nevertheless, Cong's reply was useful. Here's my
> interpretation so far:
>
> Placing this flag on a PHI node, which is only valid for a
> vector-valued PHI, indicates that only the sum of vector elements
> is meaningful. This could easily be extended to cover any
> associative operation (only the product is useful, only the
> maximum or minimum value is useful, etc.).
Right.
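For example, here is a minimal sketch (names and types are illustrative only, not from any actual patch) of the kind of vector PHI the flag would annotate -- a vector accumulator whose individual lanes have no meaning on their own:

```llvm
loop:
  ; Vector accumulator: only the sum of its four lanes is meaningful,
  ; so a target is free to redistribute partial sums among the lanes.
  %vec.phi = phi <4 x i32> [ zeroinitializer, %entry ], [ %vec.add, %loop ]
  %wide.load = load <4 x i32>, <4 x i32>* %ptr, align 4
  %vec.add = add <4 x i32> %vec.phi, %wide.load
  ; ... induction update and backedge ...
```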
>
> Now I completely understand why the flag is useful at the SDAG
> level. Because SDAG is basic-block local, we can't examine the
> loop structure when doing instruction selection for the relevant
> operations composing the psadbw (and friends). We also need to
> know, when lowering the horizontal reduction at the end of the
> loop, that we can lower it in some more-trivial way (right?).
The benefit of collecting the result outside of the loop is trivial
unless it is in an outer loop. And I am afraid the reduction info
does not help here, as the way the results are collected is already
determined once the loop is vectorized.
Yes, but "unless it is in an outer loop" is an important case in practice. The underlying situation is: Given that the target has lowered the instructions producing the PHI value to ones that always leave certain vector elements zero, we should be able to use that information when simplifying/lowering the horizontal sum in the exit block.
One way to handle this is to enhance the SelectionDAGISel::ComputeLiveOutVRegInfo() function, and related infrastructure, to understand what is going on (with target help), so that by the time we get to the exit block, the necessary information is known and can be used to simplify the incoming vector values. This can certainly be considered follow-up work.
>
> Regarding the metadata at the IR level: the motivation here is
> that, without it, the SDAG builder would need to examine the uses
> of the PHI, determine that the only uses were shuffles and
> extracts representing a horizontal reduction (add, etc.), and then
> deduce that it could add the flag from that.
The reduction pattern detection for the vectorized loop at this point
can be difficult for the following reasons:
1. The vectorized code outside of the loop, which collects the results
in the vector into a scalar, can be messy. For example, there may be
many shuffles, which are not straightforward to understand (although
it is still possible to detect them).
2. After loop unrolling, the reduction operation is duplicated several
times but the phi node is not. We should be able to detect all those
unrolled reductions, but this task is also challenging.
3. We need to make sure there are no other uses of the result of the
reduction operation (the copies created by loop unrolling are an
exception, where the result can be used by another copy of the same
reduction operation).
However, all of this information can be easily obtained during loop
vectorization. So why not record it at that stage?
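To illustrate point 1, the exit-block code that collects the vector into a scalar is typically a shuffle/add tree like the following (a hand-written sketch; the vectorizer's actual output may differ in details):

```llvm
exit:
  ; log2(VF) rounds of shuffle + add, then extract lane 0.
  %rdx.shuf = shufflevector <4 x i32> %vec.add, <4 x i32> undef,
                            <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = add <4 x i32> %vec.add, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef,
                             <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
  %sum = extractelement <4 x i32> %bin.rdx2, i32 0
```

With unrolling, several copies of the in-loop add feed this tree, which is what makes the matching harder.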
Because it adds yet another place (PHI nodes) where we need to worry about preserving metadata, and it adds information redundant with what is already contained in the IR. Generally, we try not to do that.
If it were the case that re-deriving that information during SDAG building would be non-trivial (because, for example, it would require pulling in some expensive analysis), I'd consider that a valid reason. That does not seem to be the case, however.
We need to be very careful not to fall into the trap of having a "magic vectorizer", meaning a vectorizer that knows how to induce special code-generation functionality not otherwise available to generic input code. It is trivial to write, using completely generic LLVM IR with vector types, the code that the vectorizer would produce. A frontend might generate this directly, for example. It is important that we treat such code as a "first-class citizen" within our framework. Thus, I'd highly prefer that we pattern match the necessary IR in the SDAG builder instead of adding metadata in the vectorizer.
>
> If this matching is practical (and I suspect that it is, given that
> we already do it in the backends to match horizontal adds), this
> seems better than using vectorizer-added metadata. In this way, IR
> generated using vector intrinsics (etc.) can receive the same good
> code generation as that generated by the vectorizer.
This is a good point. But I am not sure it is worth the effort to
write a reduction detector for intrinsics. However, the whole idea is
flexible enough to accept any reduction detector.
When I say "intrinsics", I mean using native LLVM vector types.
Thanks again,
Hal