sum elements in the vector

My target has an instruction that adds up all elements of a vector and stores the result in a register. I’m trying to implement it in my compiler, but I’m not even sure where to start.

I did look at other targets, but they don’t seem to have anything like it (I could be wrong; my experience with LLVM is limited, so if I missed it, I’d appreciate it if someone could point it out).

My understanding is that if an SDNode for such an instruction doesn’t exist, I have to define one. Unfortunately, I don’t know how to do that, or even where to start looking. Would someone care to point me in the right direction?

Any help is appreciated.

This is roughly along the lines of the x86 hadd* instructions, though the semantics of hadd* may not exactly match what you are looking for. It is probably more in line with the x86/ARM SAD-like instructions, but I don’t think LLVM generates SAD without intrinsics.
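For intuition, here is a rough scalar model (my own sketch, not the exact ISA semantics) of how two pairwise hadd-style operations on a 4-lane vector leave the full sum in lane 0:

```c
#include <assert.h>

/* Scalar model of a pairwise horizontal add in the style of x86 haddps,
 * with the same vector used as both operands: each output lane holds
 * the sum of an adjacent pair of input lanes. */
static void hadd4(const int in[4], int out[4]) {
    out[0] = in[0] + in[1];
    out[1] = in[2] + in[3];
    out[2] = in[0] + in[1];  /* lanes from the second operand (same vector) */
    out[3] = in[2] + in[3];
}

/* Two rounds of pairwise hadd reduce a 4-lane vector to its full sum. */
static int sum4_via_hadd(const int v[4]) {
    int t[4], u[4];
    hadd4(v, t);  /* {v0+v1, v2+v3, v0+v1, v2+v3} */
    hadd4(t, u);  /* lane 0 now holds v0+v1+v2+v3  */
    return u[0];
}
```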

Thanks for the pointers. I looked at the hadd instructions, and they seem to do something very similar to what I need. Unfortunately, as I said before, my LLVM experience is limited. My understanding is that when I create a new type of SDNode, I need to specify a pattern for it, so that when LLVM analyzes the code and sees that pattern, it creates this particular node. I’m really struggling to understand how that is done. Here are the problems I’m having:

  1. How do I identify the pattern that should be used?
  2. How do I specify a given pattern?

Do you (or someone else) mind helping me out?

Any help is appreciated.

Forwarding to the LLVM mailing list, because I forgot to include it again.

I’m a little confused. Here is why.

I was able to add a vector add instruction to my target without using any intrinsics and without adding any new instructions to LLVM. So here is my question: how come I managed to add a new vector instruction without adding an intrinsic, and why, in order to add this particular instruction (sum elements in a vector), do I need to add an intrinsic?

Another question I have is whether the compiler will be able to target this new instruction (sum elements in a vector) if it is implemented as an intrinsic, or whether the user will have to invoke the intrinsic explicitly.

Pardon me if these questions seem dumb; I’m still learning.

Any help is appreciated.

why in order to add this particular instruction (sum elements in a vector) I need to add an intrinsic?

Adding an intrinsic is not the only way; it is one of the ways, and the user will NOT be required to invoke it specifically.

Currently LLVM does not have any instruction to directly represent “sum of elements in a vector” and generate your particular instruction. However, you can do it without an intrinsic by pattern matching the LLVM IR representing “sum of elements in a vector” to your particular instruction in DAGCombiner.

Regards,

Shahid
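For reference, the “sum of elements in a vector” IR shape that such matching typically targets is a log2-depth tree of shuffles and adds. A scalar sketch of that reduction tree (illustrative only, not LLVM code):

```c
#include <assert.h>

/* Scalar sketch of the shuffle-tree reduction that backends try to
 * recognise: at each step, add the upper half of the vector onto the
 * lower half, halving the active width until one lane remains.
 * Each inner loop models: add(vec, shufflevector(vec, <upper half>)).
 * n must be a power of two; v is modified in place. */
static int reduce_add_tree(int v[], int n) {
    for (int width = n / 2; width >= 1; width /= 2)
        for (int i = 0; i < width; i++)
            v[i] += v[i + width];
    return v[0];  /* lane 0 holds the full sum */
}
```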

I’m starting to think we should directly implement horizontal operations on vector types.

My suspicion is that coming up with a nice model for this would help us a lot with things like:

  • Idiom recognition of reduction patterns that use horizontal arithmetic
  • Ability to use horizontal operations in SLPVectorizer
  • Significantly easier cost modeling of vectorizing loops with reductions in LoopVectorize
  • Other things I’ve not thought of?

Curious what others think?

-Chandler

This would be really cool. We have several instructions that perform horizontal vector operations, and we have to use built-ins to select them, as there is no easy way of expressing them in a TD file. Some, like SUM for a ‘v4i32’, are easy enough to express with a pattern fragment; SUM for ‘v8i16’ takes TableGen a long time to compute; but SUM for ‘v16i8’ resulted in TableGen disappearing into itself for hours trying to reduce the patterns, before I gave up and cancelled it.

If there were ISD nodes for these, then it would be far simpler to express in TableGen, and also, the pattern fragments only match a very specific form of IR to the desired instruction.

The horizontal operations are particularly useful for finalising a vectorised operation. For example, I may want to compute the scalar MAX, MIN or SUM of a large number of items. If the number of items is divisible by the number of vector lanes (e.g. 4, 8 or 16 in our case), then 4, 8 or 16 items at a time can be computed using normal vector operations, and the final scalar value can then be computed using a single horizontal operation.

MartinO
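The vector-body-plus-horizontal-tail scheme Martin describes can be sketched in plain C (a scalar model; the 4-lane width and helper names are invented for illustration):

```c
#include <assert.h>

static int max_of(int a, int b) { return a > b ? a : b; }

/* Sketch of a vectorised MAX reduction: keep a 4-lane running maximum
 * across the array body (the "normal vector operations"), then collapse
 * the accumulator lanes with a single horizontal MAX at the end.
 * Assumes n > 0 and n is a multiple of 4. */
static int vector_max(const int *a, int n) {
    int acc[4] = { a[0], a[1], a[2], a[3] };
    for (int i = 4; i < n; i += 4)            /* vector body: lane-wise max */
        for (int lane = 0; lane < 4; lane++)
            acc[lane] = max_of(acc[lane], a[i + lane]);
    /* one horizontal operation collapses the accumulator to a scalar */
    return max_of(max_of(acc[0], acc[1]), max_of(acc[2], acc[3]));
}
```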

We are also thinking about the necessity of horizontal intrinsics: sum, mul, min, max for floating point and integers, and logical operations for integers.

This would allow applying target-specific optimizations to the reduction tail.

Hi,

I’ve long been an advocate of intrinsics at least for horizontal operations; first class instructions I could also get behind.

Anything that isn’t pattern matching power-two-shuffles in the backend!

James

This would be really cool. We have several instructions that perform horizontal vector operations, and have to use built-ins to select them as there is no easy way of expressing them in a TD file. Some like SUM for a ‘v4i32’ are easy enough to express with a pattern fragment,

Do you mind sharing how to do it with a pattern fragment? I'm not new to TD files but all the work I've done was very simple.

Hi Rail,

We used a very simple pattern expansion (actually, not a pattern fragment). For example, for AND, ADD (horizontal sum), OR and XOR of 4 elements we use something like the following TableGen structure:

class HORIZ_Op4<SDNode opc, RegisterClass regVT, ValueType rt, ValueType vt, string asmstr> :
  SHAVE_Instr<(outs regVT:$dst), (ins VRF128:$src),
              !strconcat(asmstr, " $dst $src"),
              [(set regVT:$dst,
                (opc (rt (vector_extract (vt VRF128:$src), 0)),
                  (opc (rt (vector_extract (vt VRF128:$src), 1)),
                    (opc (rt (vector_extract (vt VRF128:$src), 2)),
                      (rt (vector_extract (vt VRF128:$src), 3))))))]>;

This is okay for 4 element vectors, and it will get selected if the programmer writes something like:

vec[0] & vec[1] & vec[2] & vec[3]

but not with a simple variant like:

vec[0] & vec[2] & vec[1] & vec[3]

If this was properly represented by an ISD node, the other permutations could be more easily handled through normalisation. We “could” write patterns for each of the permutations, but it is verbose, and in practice most people only write it one way anyway.
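To make the matching constraint concrete, here is a hypothetical C sketch (function names invented for illustration): both expressions compute the same value, but only the first has the right-nested (op e0, (op e1, (op e2, e3))) shape that a pattern fragment like the one above expects.

```c
#include <assert.h>

/* The association order the pattern fragment matches. */
static int and_matched(const int v[4]) {
    return v[0] & (v[1] & (v[2] & v[3]));
}

/* Same value by associativity/commutativity, but a different
 * expression tree, so the fragment would not match it. */
static int and_permuted(const int v[4]) {
    return v[0] & (v[2] & (v[1] & v[3]));
}
```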

The 8-lane equivalent leaves TableGen thinking for quite a long time, and the 16-lane equivalent seems to hang TableGen.

MartinO


Martin,

Thanks for the reply. If I read the pattern correctly (and I'm not sure I do), you are extracting data from the vector first and then performing an operation. What I'm trying to do is place data into a vector and then perform an operation. Here is what I'm talking about:

Convert the following:

int a[] = {1, 2, 3, 4};
int sum = 0;
for (int i = 0; i < 4; i++)
  sum += a[i];

into

vector.load vector.register.0, addressOfA
horizontal.add gpr.0, vector.register.0

My original thought was to match the pattern of adds and then use the insert_elt instruction, but that solution doesn't produce a good result, since it uses more instructions than a chain of adds. Do you think there is a simple solution to the problem I'm describing, or do I have to make some major code changes? As I said before, my experience with all of this is very limited, so any help is greatly appreciated.

P.S. I wasn't able to find anything related to SHAVE_Instr in the LLVM trunk. Are you guys not committing your work?

That’s how the pattern is expressed, but it selects a single instruction. I'm not sure how TableGen does this, but it does match that particular IR pattern to the single instruction without actually performing the element-by-element extracts. It is crude, though, and very particular about specific IR structures. If it were elevated to a true ISD node supported by C++-expressed IR normalisation, it would catch a wider range of variants, such as the classic loop form you show. Other patterns simply reduce to element extracts and scalar ADDs, but my pattern catches a common case. Horizontal patterns as first-class citizens would catch a broader range of IR forms.

MartinO

Just out of curiosity, how do I specify multiple patterns for the same instruction? I tried putting the patterns in the pattern list, but LLVM complained that I can't set the same register multiple times.

Hi Chandler,

Regardless of the canonical form we choose, we need code to match non-canonical associated shuffle sequences and convert them into the canonical form. We also need code to match the pattern where we extractelement on all elements and sum them into this canonical form. This code needs to exist somewhere, so we need to decide whether it exists in the frontend or the backend.

Having an intrinsic is obviously smaller, in terms of IR memory overhead, than these instructions. However, I'm not sure how many passes we'll need to teach about the new intrinsic. Obviously there are many passes that understand integer addition, but likely many fewer would really learn anything useful by looking through the reduction. We would need to add code to InstCombine in order to pull apart the reduction intrinsic when we learn that the vector has only one contributing element.

In short, I don't have a strong opinion on this, because we need the matching code somewhere regardless. Using the intrinsic means not matching multiple times, but it means adding extra code to handle the intrinsic. Regarding issues such as idiom recognition, use by the SLP vectorizer, etc. these seem independent of whether the canonical form is an intrinsic or a composite, and I don't think it makes the vectorizer cost model easier one way or the other.

In any case, we now have relevant pattern-matching code in SDAGBuilder (although it is currently somewhat specific to reductions after loops), so we already have a better infrastructure to help backends with the shuffle-matching problem. There is a corresponding 'VectorReduction' SDNode flag.

-Hal

Hi Chandler,

Regardless of the canonical form we choose, we need code to match non-canonical associated shuffle sequences and convert them into the canonical form. We also need code to match the pattern where we extractelement on all elements and sum them into this canonical form. This code needs to exist somewhere, so we need to decide whether it exists in the frontend or the backend.

Agreed. However, we also need to choose where it lives within the “backend” or LLVM more generally.

I think putting it late will end up with less powerful matching than having it fairly early and run often in instcombine.

Consider – how should the inliner or the loop unroller evaluate code sequences containing 16 vector shuffles that all amount to expanding a horizontal operation?

That’s why I think we should model these patterns as first-class citizens in the IR. Then backends can lower them however is necessary.

Having an intrinsic is obviously smaller, in terms of IR memory overhead, than these instructions. However, I’m not sure how many passes we’ll need to teach about the new intrinsic.

An intrinsic already moves this into the IR instead of the code generator, which I like.

I think the important distinction is between a target specific intrinsic and a generic one that we expect every target to support. The latter I think will be much more useful.

Then we can debate the relative merits of using a generic intrinsic versus an instruction, which I think are fairly mundane. I suspect this is a place where we should use instructions, but that’s a much more mechanical discussion IMO.

(And in case it isn’t clear, I’m not arguing we should avoid doing the vector reduction matching and other patterns at all. I’m just trying to start the discussion about the larger set of issues here.)

-Chandler

From: "Chandler Carruth" <chandlerc@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Chandler Carruth"
<chandlerc@gmail.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Monday, May 23, 2016 2:48:13 PM
Subject: Re: [llvm-dev] sum elements in the vector

> Hi Chandler,

> Regardless of the canonical form we choose, we need code to match non-canonical associated shuffle sequences and convert them into the canonical form. We also need code to match the pattern where we extractelement on all elements and sum them into this canonical form. This code needs to exist somewhere, so we need to decide whether it exists in the frontend or the backend.

Agreed. However, we also need to choose where it lives within the "backend" or LLVM more generally.

I think putting it late will end up with less powerful matching than having it fairly early and run often in instcombine.

Consider -- how should the inliner or the loop unroller evaluate code sequences containing 16 vector shuffles that all amount to expanding a horizontal operation?

I agree this is an issue. This is why I said that if we choose a composite canonical form, we'll end up matching the pattern multiple times. Regarding this kind of cost modeling, however, we already have this problem (and it's sometimes quite bad), and it has nothing to do with reductions. Targets, in general, need to be better about providing "user costs" for composite sequences that will end up being cheap. Last I looked at this in the context of loop unrolling, addressing modes were a common issue here as well.

That's why I think we should model these patterns as first class citizens in the IR. Then backends can lower them however is necessary.

> Having an intrinsic is obviously smaller, in terms of IR memory overhead, than these instructions. However, I'm not sure how many passes we'll need to teach about the new intrinsic.

An intrinsic already moves this into the IR instead of the code generator, which I like.

I think the important distinction is between a *target specific* intrinsic and a generic one that we expect every target to support. The latter I think will be much more useful.

Then we can debate the relative merits of using a generic intrinsic versus an instruction, which I think are fairly mundane. I suspect this is a place where we should use instructions, but that's a much more mechanical discussion IMO.

Sure. If we're going to go with a dedicated representation, I'm fine with either, and I agree an instruction here might be mechanically easier.

(And in case it isn't clear, I'm not arguing we should avoid doing the vector reduction matching and other patterns at all. I'm just trying to start the discussion about the larger set of issues here.)

Understood.

In some mechanical sense, however, we're discussing:

  if (isVectorReduction(I) == Instruction::Add)

vs.

  if (auto *VRI = dyn_cast<VectorReductionInst>(I))
    if (VRI->getReductionOp() == Instruction::Add)

-- bikeshedding aside -- and so it might be better to start with the utility-function implementation, which is essentially what we should have now given the current implementation, and then see if we want to do more based on compile-time impacts or other issues.

Thanks again,
Hal

Hi Shahid.

Do you mind providing a concrete example of X86 code where an intrinsic was added (preferably with filenames and line numbers)? I’m having difficulty tracking down the steps you provided.

Any help is appreciated.

Hi Rail,

The two revisions below might be of interest to you; they detect SAD patterns and emit psadbw instructions on X86:

http://reviews.llvm.org/D14840
http://reviews.llvm.org/D14897

Revisions related to the absdiff intrinsics:

http://reviews.llvm.org/D10867
http://reviews.llvm.org/D11678

Hope this helps.

Regards,
Suyog