Extending SLP Vectorizer to deal with aggregates?

I’m looking for a sanity check on extending SLP Vectorizer to deal with aggregates.

I’d like to vectorize Julia tuple operations. The Julia compiler lowers tuples to LLVM arrays, not LLVM vectors. I’ve tried making Julia lower tuples to LLVM vectors, but that hurt performance when SLP Vectorizer was not applicable, because of extraction/insertion overhead. I.e., the Julia lowering is too early to make the right choice on vector vs. array representation of tuples. So instead I’d like to make SLP Vectorizer vectorize idioms involving aggregates. A sample of the idiom is at https://gist.github.com/ArchRobison/e762cd55b8cfc0e019a3 .

  1. Identify stores of arrays. E.g. “store [4 x float]”.

  2. Walk chains backwards from the array stores, similar to the way SLPVectorizer already walks chains.

  3. If vectorization is applicable, replace array construction/load/store with vector construction/load/store. Vector load/stores will be unaligned.
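For concreteness, here is a sketch of the rewrite steps 1-3 describe. This is a hypothetical example, not the IR from the gist (which may differ); the names and the use of opaque `ptr` are illustrative:

```llvm
; Before: a 4-tuple of floats built as an LLVM array and stored.
define void @store_tuple(ptr %dest, float %a, float %b, float %c, float %d) {
  %t0 = insertvalue [4 x float] undef, float %a, 0
  %t1 = insertvalue [4 x float] %t0, float %b, 1
  %t2 = insertvalue [4 x float] %t1, float %c, 2
  %t3 = insertvalue [4 x float] %t2, float %d, 3
  store [4 x float] %t3, ptr %dest
  ret void
}

; After: the same construction and store rewritten on a vector.
; Only element alignment is known, hence "align 4" rather than 16.
define void @store_tuple_vec(ptr %dest, float %a, float %b, float %c, float %d) {
  %v0 = insertelement <4 x float> undef, float %a, i32 0
  %v1 = insertelement <4 x float> %v0, float %b, i32 1
  %v2 = insertelement <4 x float> %v1, float %c, i32 2
  %v3 = insertelement <4 x float> %v2, float %d, i32 3
  store <4 x float> %v3, ptr %dest, align 4
  ret void
}
```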

Does this sound like a reasonable approach? If it sounds too Julia-specific, I could do it as a custom Julia pass.

- Arch D. Robison


Hi Arch,

This kind of vectorization would be useful for the AMDGPU backend,
especially for allocas.

We already have a custom pass in the backend called AMDGPUPromoteAllocas
which replaces loads and stores to alloca pointers with vector
operations in some very simple cases. It would be great to have something more
generic that handles more cases.
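If I understand the pass correctly, the flavor of transformation is roughly the following. This is a hypothetical sketch of the idea, not the pass's actual output:

```llvm
; Before: element accesses through an alloca'd array.
%buf = alloca [4 x float]
%p = getelementptr [4 x float], ptr %buf, i32 0, i32 2
store float %x, ptr %p
%y = load float, ptr %p

; After: the alloca is given a vector type, and each element access
; becomes insertelement/extractelement on a whole-vector load/store.
%vbuf = alloca <4 x float>
%v0 = load <4 x float>, ptr %vbuf
%v1 = insertelement <4 x float> %v0, float %x, i32 2
store <4 x float> %v1, ptr %vbuf
%v2 = load <4 x float>, ptr %vbuf
%y2 = extractelement <4 x float> %v2, i32 2
```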

-Tom

I like this idea, too. But unaligned access may need to be constrained
for targets that don't support it (or support it only slowly). Not a big deal.

Another potential problem is the cost model. Right now, it's tuned to
make scalar vs. vector loads and shuffles "sensible", but array
patterns are slightly different. With luck, you can add GEP costs to
balance individual allocas.

That IR you sent could be reduced to a few vector loads, one mul-add,
and one vector store. :-)

IIRC, AVX has the indirect shuffle, so this could even reduce the
number of loads.

Seems like a good candidate for vectorizing!

cheers,
-renato

Will the cost model used by SLPVectorizer shoot down unaligned loads/stores on targets that don't support them (or support them slowly)?

I started working out the details of my design, and for a first round I think it's safest to address only cases where the leaf of the chain is an array load that can be rewritten as an unaligned vector load. I.e., similar to the CanReuseExtract idea in SLPVectorizer, but further constrained to chains that originate from an array load.
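In IR terms, the constrained first round would target only chains whose leaf looks like the following. This is a hedged sketch under the design described above; the names are illustrative:

```llvm
; Leaf of the chain: a whole-tuple load as an array, elements extracted.
%t = load [4 x float], ptr %src
%e0 = extractvalue [4 x float] %t, 0
%e1 = extractvalue [4 x float] %t, 1

; Rewritten: an unaligned (element-aligned) vector load, with the
; extractvalues turned into extractelements.
%v = load <4 x float>, ptr %src, align 4
%f0 = extractelement <4 x float> %v, i32 0
%f1 = extractelement <4 x float> %v, i32 1
```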

- Arch

I don’t see anything that is special to Julia in your example. I think this can be part of the SLPVectorizer.

It will need a pattern to start its tree building when it sees a store of an array built by insertvalue instructions:

  %partial = insertvalue [2 x double] undef, double %v1, 0
  %full = insertvalue [2 x double] %partial, double %v2, 1
  store [2 x double] %full, ptr %dest

Start a tree at (%v1, %v2).

A peephole to merge

  %e0 = extractelement <2 x double> %v, i32 0
  %e1 = extractelement <2 x double> %v, i32 1
  %partial = insertvalue [2 x double] undef, double %e0, 0
  %full = insertvalue [2 x double] %partial, double %e1, 1
  store [2 x double] %full, ptr %dest

into

  store <2 x double> %v, ptr %dest

And adjust the cost modeling in the SLPVectorizer to account for this peephole.