An update on scalable vectors in LLVM

Hi all,

It's been a while since we've given an update on scalable vector support in LLVM. Over the last 12 months a lot of work has been done to make LLVM cope with scalable vectors. This effort is now starting to bear fruit: LLVM has gained more capabilities, including an intrinsics interface for AArch64 SVE/SVE2, LLVM IR codegen for scalable vectors, and several loop-vectorization prototypes that show the ability to vectorize with scalable vectorization factors (VFs).
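For anyone who hasn't seen the notation yet, scalable vector types in LLVM IR are written with a vscale multiplier, where vscale is a positive integer that is unknown at compile time but fixed at run time (for SVE it is the number of 128-bit granules in a vector register). A minimal example:

  ; <vscale x 4 x i32> holds (4 * vscale) i32 elements
  define <vscale x 4 x i32> @add_nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
    %sum = add <vscale x 4 x i32> %a, %b
    ret <vscale x 4 x i32> %sum
  }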

Not everyone follows this effort closely, but many people will undoubtedly have seen some of the related changes land in the code base, so here is a brief update to give the work some wider visibility.

This email is structured as follows:
* Regular Sync-up meetings
* Changes made to represent scalable vectors in the C++ codebase
  * ElementCount and VectorType type hierarchy
  * Migrating to TypeSize
  * StackOffset
* What works for scalable vectors today?
* What’s next?
* Concluding
* Acknowledgements

Regular Sync-up meetings:

Hi Sander,

Awesome work from everyone involved. Thank you very much for your efforts!

I know some people wanted it to go a lot faster than it did, but now we have an infrastructure that has reached consensus across different companies and industries.

We’re finally discussing high level vectorisation strategies without having to worry about the mechanics of scalable vector representation. This is a big long term win.

> We (Arm) prefer starting out with adding support for 1 in upstream LLVM, because it is the easiest to support and gives a lot of ‘bang for buck’ that will help us incrementally add more scalable auto-vec capabilities to the vectorizer. A proof of concept of what this style of vectorization requires was shared on Phabricator recently: https://reviews.llvm.org/D90343.
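For concreteness, style (1) is an unpredicated vector body plus a scalar tail: the vector loop steps through (4 * vscale) elements at a time and the scalar loop mops up the remainder. A sketch for "for (i = 0; i < n; ++i) a[i] += 1", with illustrative names rather than code taken from D90343:

  vector.body:
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %gep = getelementptr inbounds i32, i32* %a, i64 %index
    %addr = bitcast i32* %gep to <vscale x 4 x i32>*
    %wide = load <vscale x 4 x i32>, <vscale x 4 x i32>* %addr, align 4
    ; scalable vectors have no literal constant vectors, so splat 1 explicitly
    %ins = insertelement <vscale x 4 x i32> undef, i32 1, i32 0
    %one = shufflevector <vscale x 4 x i32> %ins, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    %sum = add <vscale x 4 x i32> %wide, %one
    store <vscale x 4 x i32> %sum, <vscale x 4 x i32>* %addr, align 4
    %vscale = call i64 @llvm.vscale.i64()
    %vf = shl i64 %vscale, 2                 ; VF = 4 * vscale elements
    %index.next = add nuw i64 %index, %vf
    %done = icmp eq i64 %index.next, %n.vec  ; %n.vec = n rounded down to a multiple of VF
    br i1 %done, label %middle.block, label %vector.body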

> Barcelona Supercomputer Centre shared a proof of concept for style 2 that uses the Vector Predication Intrinsics proposed by Simon Moll (VP: https://reviews.llvm.org/D57504, link to the POC: https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi). In the past Arm has shared an alternative implementation of 2 which predates the Vector Predication intrinsics (https://reviews.llvm.org/D87056).
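With the VP intrinsics, every operation takes a governing mask and an explicit vector length, which is what lets style (2) fold the tail into the vector body instead of needing a scalar epilogue. A minimal sketch, assuming the nxv4i32 variant from the D57504 proposal; the surrounding values are illustrative:

  declare <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>,
                                                  <vscale x 4 x i1>, i32)

  ; masked-off lanes and lanes at or beyond %evl yield an unspecified value
  %sum = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %a,
                                                      <vscale x 4 x i32> %b,
                                                      <vscale x 4 x i1> %mask,
                                                      i32 %evl)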

I think both are equally good. The third one seems a bit too restrictive to me (but I’m probably missing something).

I have previously recommended (1) for the sake of simplicity in implementation (one step at a time), but I don’t see anything wrong in us trying both, even at the same time. Or even a merged way where you first vectorise, then predicate, then fuse the tail.

We have enough interested parties that we can try out multiple solutions and pick the best ones, or all of them. And as you say, they’ll all use the same plumbing, so it’s more sharing than competing.

> Hopefully in a couple of months we’ll be able to slowly enable more scalable vectorization and work towards building LNT with scalable vectors enabled. When that becomes sufficiently stable, we can consider gearing up a BuildBot to help guard any new changes we make for scalable vectors.

This would be great, even before it’s enabled by default.

cheers,
–renato

Thank you for the update. I really appreciate the high level summary of an effort like this.

Philip

Hi All,

@Sander, thanks a lot for the clear and concise summary of the whole effort.

>> We (Arm) prefer starting out with adding support for 1 in upstream LLVM, because it is the easiest to support and gives a lot of ‘bang for buck’ that will help us incrementally add more scalable auto-vec capabilities to the vectorizer. A proof of concept of what this style of vectorization requires was shared on Phabricator recently: https://reviews.llvm.org/D90343.

>> Barcelona Supercomputer Centre shared a proof of concept for style 2 that uses the Vector Predication Intrinsics proposed by Simon Moll (VP: https://reviews.llvm.org/D57504, link to the POC: https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi). In the past Arm has shared an alternative implementation of 2 which predates the Vector Predication intrinsics (https://reviews.llvm.org/D87056).

> I think both are equally good. The third one seems a bit too restrictive to me (but I’m probably missing something).

> I have previously recommended (1) for the sake of simplicity in implementation (one step at a time), but I don’t see anything wrong in us trying both, even at the same time. Or even a merged way where you first vectorise, then predicate, then fuse the tail.

I should have mentioned this earlier, but our first implementation also used the first approach (unpredicated vector body, scalar tail). It gave us a good base for implementing the second approach on top, which mostly involved modifying parts of the existing tail-folding infrastructure and using a TTI hook to decide when to emit VP intrinsics. It does make a lot of sense to start with the first approach upstream. It will also let everyone get a taste of auto-vectorization for scalable vectors and give us a base for more insightful discussions on the best way to support the other approaches on top of it.
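For reference, the upstream tail-folding machinery this builds on predicates the body with an active-lane mask and masked memory intrinsics, so the final iteration can run with a partial vector. A minimal sketch (the intrinsic names are upstream ones; the surrounding values are illustrative):

  ; lanes whose element index (%index + lane) is >= %n are disabled
  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index, i64 %n)
  %wide = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(
              <vscale x 4 x i32>* %addr, i32 4, <vscale x 4 x i1> %mask,
              <vscale x 4 x i32> undef)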

> We have enough interested parties that we can try out multiple solutions and pick the best ones, or all of them. And as you say, they’ll all use the same plumbing, so it’s more sharing than competing.

Thanks and Regards,

Vineet
