SelectionDAG scalarizes vector operations.

Duncan,

I had a few thoughts regarding our short discussion yesterday.

I am not sure how we can lower SEXT into the vpmovsx family of instructions. I propose the following strategy for the ZEXT and ANYEXT family of operations. At first, we let the Type Legalizer/VectorOpLegalizer scalarize the code. Next, we allow the dag-combiner to convert the BUILD_VECTOR node into a shuffle. This is possible because all of the inputs of the build vector come from two values (src and (undef or zero)). Finally, the shuffle lowering code lowers the new shuffle node into UNPCKLPS. This sequence should be optimal for all of the sane types.
Once we implement ZEXT and ANYEXT, we could issue an INREG_SEXT node to support SEXT. Unfortunately, v2i64 SRA is not supported by the hardware and the code will be scalarized ...
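To make the proposed sequence concrete, here is a lane-level Python model (my own illustration; the function names are made up and this is not SelectionDAG code): ZEXT of <4 x i16> to <4 x i32> behaves like an unpack of the source lanes with zeros, and SEXT can then be built on top of it with an INREG_SEXT-style shift pair.

```python
# Lane-level model of the proposed lowering (illustration only, not
# SelectionDAG code; lanes are plain Python ints).

def zext_v4i16_to_v4i32(src):
    """ZEXT as an interleave with zeros: each i32 result lane holds the
    i16 source lane in its low half and zeros in its high half, which
    is what unpacking the source with a zeroed register produces."""
    return [lane & 0xFFFF for lane in src]

def sra32(x, n):
    """Arithmetic (sign-preserving) 32-bit shift right."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:        # negative in two's complement
        x -= 1 << 32
    return (x >> n) & 0xFFFFFFFF

def sext_inreg_16_in_32(lane):
    """INREG_SEXT modeled as a shift pair: shl by 16, then an
    arithmetic sra by 16, replicating bit 15 into the high half."""
    return sra32((lane << 16) & 0xFFFFFFFF, 16)

def sext_v4i16_to_v4i32(src):
    """SEXT = ZEXT (unpack with zeros) followed by INREG_SEXT."""
    return [sext_inreg_16_in_32(lane) for lane in zext_v4i16_to_v4i32(src)]
```

For i32 elements this works because 32-bit arithmetic shifts exist; as noted above, there is no v2i64 SRA, so the same trick does not carry over to i64 elements.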

Currently we promote vector elements to the widest possible type until we hit the _first_ legal register type. For AVX, where YMM registers extend XMM registers, it is not clear to me why we stop at XMM-sized registers. In some cases, masks of type <4 x i1> are legalized to <4 x i32> in XMM registers even if they are the result of a vector-compare of <4 x i64> types.

I also had a second observation, which contradicts the first one: in many cases we 'over-promote'. Consider the <2 x i32> type. Promoting the elements to <2 x i64> forces us to use types which are not supported by the instruction set. For example, not all of the shift operations are implemented for vector i64 types. Maybe a different strategy would be to promote vector elements up to i32, which is the common element type for most processors, and widen the vector from that point onwards. I am not sure how we could implement vector compare/select with this approach.
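A toy Python model of the two legalization strategies for an illegal <2 x i32> (my own sketch; not LLVM code, and the (type, lane) tuples are just a stand-in for typed vectors):

```python
# Two ways to legalize an illegal <2 x i32> (illustration only).

UNDEF = None  # stand-in for an undef lane

def promote_elements(vec):
    """Current strategy: <2 x i32> -> <2 x i64>. The lane count is
    kept and each element is widened, so later operations (e.g. i64
    vector shifts) may have no hardware support."""
    return [("i64", lane) for lane in vec]

def widen_vector(vec, target_lanes=4):
    """Alternative strategy: <2 x i32> -> <4 x i32>. The element type
    stays i32, which most ISAs support well; the extra lanes are
    undef."""
    return ([("i32", lane) for lane in vec]
            + [("i32", UNDEF)] * (target_lanes - len(vec)))
```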

Thanks,
Nadav

Hi Nadav,

I had a few thoughts regarding our short discussion yesterday.

  I am not sure how we can lower SEXT into the vpmovsx family of instructions. I propose the following strategy for the ZEXT and ANYEXT family of functions.

what I would like to understand first is why there are any vector xEXT nodes
at all! As I tried to explain on IRC, I don't think you ever get these from
the GCC autovectorizer except as part of a shuffle-extend pair. Where do you
get these nodes from? Does the intel auto-vectorizer produce them more often
than the GCC one?

Ciao, Duncan.

  At first, we let the Type Legalizer/VectorOpLegalizer scalarize the code. Next, we allow the dag-combiner to convert the BUILD_VECTOR node into a shuffle. This is possible because all of the inputs of the build vector come from two values (src and (undef or zero)). Finally, the shuffle lowering code lowers the new shuffle node into UNPCKLPS. This sequence should be optimal for all of the sane types.

We generate xEXT nodes in many cases. Unlike GCC, which vectorizes inner loops, we vectorize the implicit outermost loop of data-parallel workloads (also called whole-function vectorization). We vectorize code even if the user uses xEXT instructions, mixes types, etc. We choose a vectorization factor which is likely to generate more legal vector types, but if the user mixes types then we are forced to make a decision. We rely on the LLVM code generator to produce quality code. To my understanding, the GCC vectorizer does not vectorize code if it thinks that even a single operation is missing.

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

We generate xEXT nodes in many cases. Unlike GCC which vectorizes
inner loops, we vectorize the implicit outermost loop of data-parallel
workloads (also called whole function vectorization). We vectorize
code even if the user uses xEXT instructions, uses mixed types, etc.
We choose a vectorization factor which is likely to generate more
legal vector types, but if the user mixes types then we are forced to
make a decision. We rely on the LLVM code generator to produce
quality code. To my understanding, the GCC vectorizer does not
vectorize code if it thinks that it misses a single operation.

My experience is similar to Nadav's. The Cray vectorizer vectorizes
much more code than the gcc vectorizer. Things are much more
complicated than gcc vector code would lead one to believe.

                             -Dave

Hi Dave,

We generate xEXT nodes in many cases. Unlike GCC which vectorizes
inner loops, we vectorize the implicit outermost loop of data-parallel
workloads (also called whole function vectorization). We vectorize
code even if the user uses xEXT instructions, uses mixed types, etc.
We choose a vectorization factor which is likely to generate more
legal vector types, but if the user mixes types then we are forced to
make a decision. We rely on the LLVM code generator to produce
quality code. To my understanding, the GCC vectorizer does not
vectorize code if it thinks that it misses a single operation.

My experience is similar to Nadav's. The Cray vectorizer vectorizes
much more code than the gcc vectorizer. Things are much more
complicated than gcc vector code would lead one to believe.

I think it is important we produce non-scalarized code for the IR produced by
the GCC vectorizer, since we know it can be done (otherwise GCC wouldn't have
produced it). It is of course important to produce decent code in the most
common cases coming from other vectorizers too. However it seems sensible to
me to start with the case where you know you can easily get perfect results
(GCC vectorizer output) and then try to progressively extend the goodness to
the more problematic cases coming from other vectorizers.

Ciao, Duncan.

Duncan Sands <baldrick@free.fr> writes:

I think it is important we produce non-scalarized code for the IR produced by
the GCC vectorizer, since we know it can be done (otherwise GCC wouldn't have
produced it). It is of course important to produce decent code in the most
common cases coming from other vectorizers too. However it seems sensible to
me to start with the case where you know you can easily get perfect results
(GCC vectorizer output) and then try to progressively extend the goodness to
the more problematic cases coming from other vectorizers.

Of course. I was simply supporting Nadav's explanation that there's a
lot of pessimization in the current lowering that doesn't even appear
for code generated by gcc.

We have a number of lowering modifications here to handle many of these
cases. As always, I am slogging through trying to get them moved
upstream. It's a long process, unfortunately. But don't be surprised
to see changes that might look "unnecessary" but are very important for
various compilers.

                              -Dave

Hi David!

I'd be interested in hearing about the places that you had to fix. It seems like there are a number of people who are starting to look at the quality of the generated vector code. Maybe we should report our findings in bug reports, so that we can share the work and discuss possible solutions. I also plan to file a few bug reports with suboptimal code.

Thanks,
Nadav

Hi Nadav,

I'd be interested in hearing about the places that you had to fix. It seems like there are a number of people who are starting to look at the quality of the generated vector code. Maybe we should report our findings in bug reports, so that we can share the work and discuss possible solutions. I also plan to file a few bug reports with suboptimal code.

yes, it would be great to have collection of important testcases that are
handled poorly.

Ciao, Duncan.

I'd like to throw my backing behind this as well. We see some IR that our internal passes have vectorized and that SelectionDAG then scalarizes.

Micah

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Hi David!

I'd be interested in hearing about the places that you had to fix. It
seems like there are a number of people who are starting to look at the
quality of the generated vector code. Maybe we should report our
findings in bug reports, so that we can share the work and discuss
possible solutions. I also plan to file a few bug reports with
suboptimal code.

That's a good idea. I confess that I don't have a laundry list at hand.
These were fixes I made two years ago. I'll have to do a giant diff
against the upstream LLVM. :-)

But yes, I fully intend to send these fixes up.

                               -Dave

Would it make sense to start something like a lib/Target/README-Vectors.txt file with a collection of small examples of pessimizations? We've successfully used text files like this to track such "laundry lists". High-priority problems (e.g. the sorts that Duncan is pointing out) make sense to track in bugzilla, of course.

-Chris

Chris Lattner <clattner@apple.com> writes:

Would it make sense to start something like a
lib/Target/README-Vectors.txt file with a collection of small examples
of pessimizations? We've successfully used text files like this to
track such "laundry lists". High-priority problems (e.g. the sorts
that Duncan is pointing out) make sense to track in bugzilla, of
course.

Sounds fine to me, but is there any reason *not* to file bugs other than
increasing the number of bugs in Bugzilla? Bugzilla is always where I
go first to see if a problem has already been identified.

                          -Dave

Hi All,

Hi Dave,

>> We generate xEXT nodes in many cases. Unlike GCC which vectorizes
>> inner loops, we vectorize the implicit outermost loop of
>> data-parallel workloads (also called whole function vectorization).

Just to clarify, GCC vectorizes innermost and next-to-innermost (aka outer) loops, packing instances of the same original scalar instruction across different iterations into a vector instruction. It also vectorizes within basic blocks (aka SLP), packing distinct scalar instructions into vectors. And, it does the latter while considering a (possible) enclosing loop -- in order to place loop-invariant code outside, and also to unroll the enclosing loop if/as needed to fill the vectors. But, in any event, it creates fully vectorized code regions, with scalar code used only in supporting computations such as addressing, loop induction variable handling, reduction epilogs etc.

>> We vectorize code even if the user uses xEXT instructions, uses mixed
>> types, etc.

GCC does vectorize code which contains multiple data types, by choosing the vectorization factor according to the smallest type, and using multiple vectors to hold larger types.
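As a sketch of that policy (my own toy arithmetic, assuming 128-bit vector registers; not GCC code):

```python
# Toy model: VF chosen by the smallest element type, wider types then
# spread across multiple vector registers (128-bit vectors assumed).

def vectorization_factor(element_bits, vector_bits=128):
    """The VF is fixed by the smallest element type: how many of the
    smallest elements fit in one vector register."""
    return vector_bits // min(element_bits)

def vectors_needed(bits, vf, vector_bits=128):
    """A wider type then needs several vector registers to hold VF
    elements."""
    return (bits * vf) // vector_bits
```

With mixed i8/i32 data and 128-bit vectors this gives VF = 16, and each i32 value then occupies 16*32/128 = 4 registers.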

>> We choose a vectorization factor which is likely to generate more
>> legal vector types, but if the user mixes types then we are forced to
>> make a decision. We rely on the LLVM code generator to produce
>> quality code. To my understanding, the GCC vectorizer does not
>> vectorize code if it thinks that it misses a single operation.
>

Right. It queries whether the target supports a vectorized form (of the desired vectorization factor) for each scalar instruction in the loop or region. There is no scalarization -- code is either fully vectorized in a way that survives code generation, or else the vectorizer gives up and avoids modifying the relevant scalar code. This may indeed not be an optimal decision; but even then, there are cases where it's better not to vectorize.
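That all-or-nothing decision can be summarized in a few lines of Python (my own sketch; the predicate and naming scheme are made up, not GCC's):

```python
def try_vectorize(scalar_ops, has_vector_form, vf):
    """Model of the all-or-nothing policy: vectorize only if every
    scalar op in the region has a vector form at the chosen VF;
    otherwise give up and leave the scalar code untouched (there is
    no partial scalarization)."""
    if all(has_vector_form(op, vf) for op in scalar_ops):
        return [f"v{vf}.{op}" for op in scalar_ops]
    return None  # give up; keep the original scalar code
```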

> My experience is similar to Nadav's. The Cray vectorizer vectorizes
> much more code that the gcc vectorizer. Things are much more
> complicated than gcc vector code would lead one to believe.

I think it is important we produce non-scalarized code for the IR produced by
the GCC vectorizer, since we know it can be done (otherwise GCC wouldn't
have produced it). It is of course important to produce decent code in the
most common cases coming from other vectorizers too. However it seems
sensible to me to start with the case where you know you can easily get
perfect results (GCC vectorizer output) and then try to progressively extend
the goodness to the more problematic cases coming from other vectorizers.

BTW, the GCC vectorizer can also tell you why it did not vectorize; e.g., if some instruction was not available in vector form.

So the vectorizer takes care of any desired unrolling on its own, and does not rely on a separate unroll pass. It does rely on a separate if-conversion pass, especially designed to eliminate if-then-else hammocks in relevant regions (loops) right before the vectorizer kicks in. This part may require undoing when an if-converted loop is not vectorized and the target does not support the resulting predicated scalar instructions.
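The effect of if-conversion, modeled per lane in Python (my own illustration, not the actual pass):

```python
def if_convert(cond, then_vals, else_vals):
    """If-conversion per lane: the if-then-else hammock becomes a
    select, turning the loop body into straight-line code that the
    vectorizer can pack into vector instructions."""
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]
```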

Hope this helps. Had the pleasure of working with the GCC autovect guys (or rather gals) from the start, before joining Nadav et al. recently.

Ayal.

Personally, I think that bugzilla is overkill (i.e. the wrong answer) for a long tail of random micro-optimizations. Much of the stuff in X86/README has been in there for years. Having it in bugzilla would just clutter things up.

-Chris

Hi Ayal, thanks for this interesting information.

We choose a vectorization factor which is likely to generate more
legal vector types, but if the user mixes types then we are forced to
make a decision. We rely on the LLVM code generator to produce
quality code. To my understanding, the GCC vectorizer does not
vectorize code if it thinks that it misses a single operation.

Right. It queries whether the target supports a vectorized form (of the desired vectorization factor) for each scalar instruction in the loop or region. There is no scalarization -- code is either fully vectorized in a way that survives code generation, or else the vectorizer gives up and avoids modifying the relevant scalar code. This may indeed not be an optimal decision; but even then, there are cases where it's better not to vectorize.

The problem right now is that LLVM's codegen takes the vector IR produced by
GCC and often scalarizes it.

Ciao, Duncan.