Scalable Vector Types in IR - Next Steps?

I did see that reply.

While, like Hal, I do understand some concerns about introducing a
radical new concept to IR (the reason why I started this thread), I'm
unaware (mainly from not having been at that meeting) of the individual
issues and how controversial they were with those involved.

Furthermore, the current state is uncertain, and people still need to
be convinced of what will work by means of hacking up more intrinsics
and more kludges into the current IR.

This means that, even if we are to implement it natively in IR, it
won't come *before* we implement it with intrinsics, which will
hopefully convince people that this makes sense, and by which time,
the code will look completely different and we'll need a completely
new patch.

I.e. the current series is already dead, no matter what we do.

cheers,
--renato

I have no opinion on the technical aspects here, not having researched this topic at all.

But this last statement seems odd. So far, there appears to be a fairly good consensus from a number of experienced LLVM developers that the approach seems like a good idea, both on this thread and from skimming the earlier threads you linked from your original message.

Doesn’t that mean that the reasonable next step is to continue moving forward with the existing patch set?

I.e. the current series is already dead, no matter what we do

But this last statement seems odd. So far, there appears to be a fairly good consensus from a number of experienced LLVM developers that the approach seems like a good idea, both on this thread and from skimming the earlier threads you linked from your original message.

Doesn't that mean that the reasonable next step is to continue moving forward with the existing patch set?

It depends.

The previous public consensus was, indeed, that the native proposal
makes a lot of sense. I think this is ultimately where we want to be,
but I'm not clear on what the path really is.

From what Graham said, and from his current work, I guess the "new"
(not public) consensus seems to be to go with intrinsics first, then
move to native support, which is a valid path.

If the public agreement becomes that this is the path we want to take,
then that specific patch-set is dead, because even if we do native, it
will be a different set.

If the end result is that we'll stop at intrinsics (I really hope
not), the patch-set is also dead.

However, if people want to continue pushing for native support now,
the patch-set is not dead. But then we'd need to re-do the meeting that
happened at the US dev meeting with everyone involved, which won't
happen.

So, while I would also prefer to have native support first, and work
out the wrinkles between releases (as I proposed in this thread), I'm
OK with going the intrinsics way first, as long as the aim is to not
stop there.

Makes sense?

cheers,
--renato

PS: Until someone writes up what happened, who was involved, what the
issues were, and why the current consensus has changed, we can only
guess...

Renato Golin <rengolin@gmail.com> writes:

I've talked with a number of people about this as well, and I think that
I understand the objections. I'm happy that ARM followed through with
the alternate set of patches. Regardless, however, unless those who had
wished to object still wish to object, and then actually do so, we now
clearly have a good collection of contributors actively desiring to do
code review, and we should move forward (i.e., start committing patches
once they're judged ready).

Let's start by closing the three in-flight revisions, so that people who
weren't involved in the discussion don't waste time looking at them.

See the reply I just posted to Hal. I am not sure we've made a decision
to abandon the current patches. We may in fact decide that, but I
haven't seen consensus for doing so yet. In fact I've seen the opposite
-- that people want to move forward with the scalable types.

I agree with David. We should move forward with native support for
scalable types.

-Hal

Hi,

From what Graham said, and from his current work, I guess the "new"
(not public) consensus seems to be to go with intrinsics first, then
move to native support, which is a valid path.

There wasn't a consensus, just a proposal for a different option to
present to the community for feedback and discussion to get things
moving (whether for the full scalable IR proposal, the opaque types one,
or something in between). Sorry if I didn't make that clear enough.

Arm felt it was worth investing some time in investigating an alternative
if there was a possibility of making progress on upstreaming, then presenting
the findings for discussion.

If the public agreement becomes that this is the path we want to take,
then that specific patch-set is dead, because even if we do native, it
will be a different set.

If the end result is that we'll stop at intrinsics (I really hope
not), the patch-set is also dead.

However, if people want to continue pushing for native support now,
the patch-set is not dead. But then we'd need to re-do the meeting that
happened at the US dev meeting with everyone involved, which won't
happen.

While there was a roundtable at the devmeeting last year, there weren't
that many people in attendance to talk purely about SVE or scalable
types -- most of the discussion revolved around predication support,
from what I remember.

The main feedback I had that led to changes in the RFC came in side chats
when people had a few minutes of spare time (since they had other sessions
to attend which clashed with the roundtable slots).

-Graham

Hi Graham,

By the extent of your “further work”, I assumed you had quite a strong pushback and that this was more of an official session. That’s why I wanted clarity over which reviews we should be looking at.

Honestly, so far, no one in this thread has pointed to any concrete request for non-native support, so unless someone does so, the official consensus is still native.

So, with apologies to all bystanders, I repeat my original proposal: it’s now time to try and push native support on trunk, in time for the next release.

If anyone has concerns, either about the current proposal (the reviews in my first email) or about the general idea of native scalable types, please speak up.

As David said, it’s a bit silly that GCC has had support for this for over a year and we’re still arguing about very basic stuff.

Cheers,
Renato

Sorry I haven’t been as available as usual for the past few weeks, but FWIW, I still am unconvinced that scalable vector types belong in the IR.

I think this adds complexity to LLVM’s IR to serve a niche use case without proven benefit to a broad spectrum of hardware or software. I think the complexity is significant and will be a net drag on all parts of the IR and IR-level transformations. But I don’t really think it is useful to re-hash all these debates. Nothing relevant has changed in the years this has been discussed.

That said, if I’m the only one who feels this way (and is willing to actually state this publicly), I’m not going to stop progress.

-Chandler

I think this adds complexity to LLVM’s IR to serve a niche use case without proven benefit to a broad spectrum of hardware or software. I think the complexity is significant and will be a net drag on all parts of the IR and IR-level transformations.

I view the situation differently, but I’m still relatively new to llvm-dev and may be too unfamiliar with the threshold for inclusion in trunk, so please help educate me. I see people from more than one organization saying that they’d like to see this in trunk. No one wants it to be a drag on all transforms because no one wants to have to rewrite a ton of code. So it would seem that there are two possible outcomes: it gets merged into trunk and the interested parties try hard to not adversely impact all of LLVM because that’s in their best interest, too, OR it isn’t merged and then we have potentially the same multiple organizations maintaining support for this on the side, out of trunk. This community tries to avoid the latter situation, right? I’ve always thought of LLVM trunk as where code ends up that is useful to multiple orgs and then each org maintains their own local patches for stuff that no one else would want (or that they can’t share for competitive reasons).

-Troy

"Finkel, Hal J." <hfinkel@anl.gov> writes:

You're not, and I'm in the same position here. I don't think there's a
really good answer for how this is going to affect a lot of the IR and
IR-level transformations from a maintainability perspective. It mostly
seems like a "we need this for the new ISA support" situation; I don't
see a lot of compelling use cases here, and I do see a lot of downside.

-eric

Three ISAs at present:

- SVE in AArch64
- MVE in ARM Cortex-M (quite different from SVE)
- RVV in RISC-V

It would not surprise me if other ISAs implement similar vector
extensions in future.

We’re planning on implementing scalable vector support in the SimpleV ISA extension as well. Admittedly, we will most likely need additional IR modifications (allowing vectors of vectors), but I think having scalable vector support already built in will help greatly.

Jacob Lifshay

As Simon Moll also wrote, please add the NEC SX-Aurora vector engine to
the list of architectures aiming at and eagerly awaiting the SVE/AVL/VP
changes in LLVM. We have had long vectors (256 x 64-bit) and a vector
length register for many years; the latest CPU, aimed mainly at HPC and
AI, has been available on the market for a year.

We're working on an LLVM backend and intend to open it and post an RFC
on its inclusion soon. Progress with AVL/VP is very important for this
backend and we rely on LLVM moving forward on these.

Regards,
Erich Focht

Hi David,

I'll need to update the reviews (and rebase). I'll do that this week.

https://reviews.llvm.org/D32530 is the key patch. The backend codegen
patches can be safely ignored, I think -- we would want better isel
patterns.

It seems there's still some discussion to be had in this thread though.

-Graham

Hi Bruce,

Three ISAs at present:

- SVE in AArch64
- MVE in ARM Cortex-M (quite different from SVE)
- RVV in RISC-V

MVE isn't scalable in terms of registers (it's fixed at 128b, IIRC), so it won't be using these types.

It can use execution units that are narrower than 128b, and start execution on partially
completed vectors in subsequent cycles (a bit like an RVV implementation might do when using
VLMul > 1, or the established vector supercomputer architectures).

Just a different set of design constraints.

As Simon and Erich point out though, SX-Aurora would like to use scalable vectors too, so
we still have three architectures intending to use the feature.

-Graham

Hi Eric and Chandler,

I appreciate your concerns; I don't think the impact will be that great, but then it's
rather easy for me to keep SVE in mind when working on other parts of the codebase
given how long I've spent working on it.

Are there any additional constraints on the scalable types you think would alleviate
your concerns a little? At the moment we will prevent scalable vectors from being
included in structs and arrays, but we could add more (at least to start with) to
avoid potential hidden problems.
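
To make that concrete for bystanders, here is a minimal sketch of the
native form; I'm using the "vscale" spelling below, but treat the exact
syntax as illustrative rather than as what the final patches will use:

  ; A scalable vector of vscale x 4 x i32 elements, where vscale is a
  ; hardware-dependent multiple unknown at compile time.
  define <vscale x 4 x i32> @add(<vscale x 4 x i32> %a,
                                 <vscale x 4 x i32> %b) {
    %r = add <vscale x 4 x i32> %a, %b
    ret <vscale x 4 x i32> %r
  }

  ; Under the constraint above, an aggregate such as
  ; { <vscale x 4 x i32>, i32 } would be rejected by the verifier.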

I'm also trying to come up with an idea of how much impact we have in our downstream
implementation; most places where there is divergence are in the AArch64 backend (as you'd
expect), followed by the generic SelectionDAG code -- but lowering and legalization for
current instructions should (hopefully) be a one-off.

Do you have any specific parts of the codebase you're interested in
seeing a report on, regarding the extent of the changes?

-Graham

Hi Eric and Chandler,

I appreciate your concerns; I don’t think the impact will be that great, but then it’s
rather easy for me to keep SVE in mind when working on other parts of the codebase
given how long I’ve spent working on it.

Are there any additional constraints on the scalable types you think would alleviate
your concerns a little? At the moment we will prevent scalable vectors from being
included in structs and arrays, but we could add more (at least to start with) to
avoid potential hidden problems.

While the constraints you mention are good, and important, I don’t think there are more that matter.

I’m also trying to come up with an idea of how much impact we have in our downstream
implementation; most places where there is divergence are in the AArch64 backend (as you’d
expect), followed by the generic SelectionDAG code – but lowering and legalization for
current instructions should (hopefully) be a one-off.

Do you have any specific parts of the codebase you’re interested in
seeing a report on, regarding the extent of the changes?

This is not about the changes required. It is about the long-term (think 10 years) complexity forced onto the IR.

We now have vectors that are unlike all other vectors in the IR. They’re basically unlike all other types. I believe we will be finding bugs with this special case ~forever. Will it be an untenable burden? Definitely not. We can manage.

But the question is: does the benefit outweigh the cost? IMO, no.

I completely understand the benefit of this for the ISA, and I would encourage every ISA to adopt some vector instruction set with similar aspects.

However, the more I talk with and work with my users doing SIMD programming (and my entire personal experience doing it), the more I believe this will be of extremely limited utility to model in the IR. There will be a small number of places where it can be used. All of those where performance matters will end up being tuned for specific widths anyway to get the last few % of performance. Those that aren’t performance critical won’t provide any substantial advantage over just being 128-bit vectorized or left scalar. At that point, we pay the complexity and maintenance cost of this completely special type in the IR for no material benefit.

I’ve said this several times in various discussions. My opinion has not changed. No new information has been presented by others or by me. So I think debating this technical point is not really interesting at this point.

That said, it is entirely possible that I am wrong about the utility. If the consensus in the community is that we should move forward, I’m not going to block forward progress. It sounds like Hal, the Cray folks, and many ARM folks are all positive. So far, only Eric and I have said anything to the contrary. If there really isn’t anyone else concerned with this, please just move forward. I think the cost of continuing to debate this is rapidly becoming unsustainable all on its own.

To me, this is nothing like SIMD programming. I've done that, with
VMX/Altivec and NEON.

I've been working with a number of kernels implemented on RISC-V
vectors recently. At least for the things we've been looking at so
far, the code is almost exactly the same as you'd use to implement the
same algorithm (possibly pipelined, unrolled etc) using 32 normal FP
registers, it's just that you work on some unknown-at-compile-time
number of different outer-loop iterations in parallel.

For example, maybe you've got a whole lot of 3x3 matrices to invert.
You load each element of the first matrix into nine registers, then
calculate the determinant, then permute the input values into their new
positions while dividing them by the determinant, and write them all
out. It's exactly the same with the vector ISA, except you might be
loading and working on 1, 2, 4, ... 1000 of the matrices in parallel.
You just don't know, and it doesn't matter.

The same goes for sgemm. You work on strips eight (say) wide/high. In
one dimension you have normal loads/stores, and in the other dimension
you have strided loads/stores. You're working on rectangular blocks 8
high/wide and some unknown-at-compile-time amount wide/high -- on some
small machine it might be 1 (i.e. basically a standard FP register
file, but the vector ISA works on it correctly), but presumably on most
it will be something like 4 or 8 or 16 elements.

If you unroll either of these kernels once (or software pipeline it)
then you're going to pretty much saturate your memory system or your
FMA units or both, depending on the particular kernel's ratio of
compute-to-bytes, how many functional units you have, and the width of
your memory bus.

Maybe you're right and hand-tuned SIMD code with explicit knowledge of
the vector length might get you single-digit percentage better
performance, but it probably won't be more than that and it's a lot of
work.

As for LLVM IR support .. I don't have a firm opinion on whether this
scalable type proposal is sufficient, insufficient, or overkill.

My own gut feeling is that the existing type system is fine for
describing vector data in memory, and that all we need (at least for
RISC-V) is a new register file that is very similar to any machine
with a unified int/fp register file. LLVM needs to manage register
allocation in this register file just as it does for regular int or fp
register files. Spills and reloads of these registers would be
undesirable, but if they are needed then the compiler would have to
allocate the space for this using alloca (or malloc).

The biggest thing needed I think is understanding one unusual
instruction: vsetvl{i}. At the head of each loop you explicitly use
the vsetvl{i} instruction to set the register width (the vector
element width) to something between 8 bits and 1024 bits. The vsetvl
instruction returns an integer which you normally use only to scale by
the element width that you just set, using the result to bump your
input and output pointers by N elements instead of 1 element.

So, you kind of need a new type for the registers, but it's purely for
the registers. Not only can you not include it in arrays or structs,
you also can't load it from memory or store it to memory.

The plan for RISC-V is also that all 32 vector registers will be
caller-save/volatile. If you call a function then when it returns you
have to assume that all vector registers have been trashed. There are
no functions using the standard ABI that take vector registers as
arguments or return vector registers as results. The only apparent
exception is the compiler's runtime library that will have things the
compiler explicitly knows about such as transcendental functions --
but they don't use the standard ABI.

I am of the opinion that handling scalable vectors (SV)
as builtins and an opaque SV type is a good option:

1. The implementation of SV with builtins is simpler than changing the IR.

2. Most of the transforms in opt are scalar opts; they do not optimize
vector operations and will not deal with SV either.

3. With builtins there are fewer places to pay attention to,
as most of the compiler is already dealing with builtins in
a neutral way.

4. The builtin approach is more targeted and confined: it allows
amending one optimizer at a time.
In the alternative of changing the IR, one has to touch all the
passes in the initial implementation.

5. Optimizing code written with SV intrinsic calls can be done
with about the same implementation effort in both cases
(builtins and changing the IR). I do not believe that changing
the IR to add SV types makes any optimizer work magically all
of a sudden: no free lunch. In both cases we need to amend
all the passes that remove inefficiencies in code written with
SV intrinsic calls.

6. We will need a new SV auto-vectorizer pass that relies less on
if-conversion, runtime disambiguation, and unroll for the prolog/epilog,
as the HW is helping with all these cases and expands the number
of loops that can be vectorized.
Having native SV types or just plain builtins is equivalent here
as the code generator of the vectorizer can be improved to not
generate inefficient code.

7. This is my point of view; I may be wrong,
so don't let me slow you down in getting it done!

Sebastian

I am of the opinion that handling scalable vectors (SV)
as builtins and an opaque SV type is a good option:

1. The implementation of SV with builtins is simpler than changing the IR.

2. Most of the transforms in opt are scalar opts; they do not optimize
vector operations and will not deal with SV either.

3. With builtins there are fewer places to pay attention to,
as most of the compiler is already dealing with builtins in
a neutral way.

4. The builtin approach is more targeted and confined: it allows
amending one optimizer at a time.
In the alternative of changing the IR, one has to touch all the
passes in the initial implementation.

Interestingly, with similar considerations, I've come to the opposite
conclusion. While in theory the intrinsics and opaque types are more
targeted and confined, this only remains true *if* we don't end up
teaching a bunch of transformations and analysis passes about them.
However, I feel it is inevitable that we will:

1. While we already have unsized types in the IR, SV will add more of
them, and opaque or otherwise, there will be some cost to making all of
the relevant places in the optimizer not crash in their presence. This
cost we end up paying either way.

2. We're going to end up wanting to optimize SV operations. If we have
intrinsics, we can add code to match (a + b) - b => a, but the question
is: can we reuse the code in InstCombine which does this? We can make
the answer yes by adding sufficient abstraction, but the code
restructuring seems much worse than just adjusting the type system.
Otherwise, we can't reuse the existing code for these SV optimizations
if we use the intrinsics, and we'll be stuck in the unfortunate situation
of slowly rewriting a version of InstCombine just to operate on the SV
intrinsics. Moreover, the code will be worse because we need to
effectively extract the type information from the intrinsic names. By
changing the type system to support SV, it seems like we can reuse
nearly all of the relevant InstCombine code (see the sketch after point
3 below).

3. It's not just InstCombine (and InstSimplify, etc.), but we might
also need to teach other passes about the intrinsics and their types
(GVN?). It's not clear that the problem will be well confined.
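
To make point 2 concrete, here is a sketch; the scalable type syntax is
illustrative, and the intrinsic names in the trailing comment are
hypothetical:

  ; With native scalable types, the existing (a + b) - b => a fold
  ; applies unchanged, because the matcher only cares about opcodes:
  define <vscale x 4 x i32> @f(<vscale x 4 x i32> %a,
                               <vscale x 4 x i32> %b) {
    %t = add <vscale x 4 x i32> %a, %b
    %r = sub <vscale x 4 x i32> %t, %b   ; simplifies to %a
    ret <vscale x 4 x i32> %r
  }

  ; With opaque types and intrinsics, the same computation becomes
  ;   %t = call @llvm.sv.add(%a, %b)
  ;   %r = call @llvm.sv.sub(%t, %b)
  ; which the existing matchers cannot see through without new code.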

5. Optimizing code written with SV intrinsic calls can be done
with about the same implementation effort in both cases
(builtins and changing the IR). I do not believe that changing
the IR to add SV types makes any optimizer work magically all
of a sudden: no free lunch. In both cases we need to amend
all the passes that remove inefficiencies in code written with
SV intrinsic calls.

6. We will need a new SV auto-vectorizer pass that relies less on
if-conversion, runtime disambiguation, and unroll for the prolog/epilog,

It's not obvious to me that this is true. Can you elaborate? Even with
SV, it seems like you still need if-conversion and pointer checking, and
unrolling the prologue/epilogue loops is handled later anyway by the
full/partial unrolling pass; I don't see any fundamental change there.

What is true is that we need to change the way that the vectorizer deals
with horizontal operations (e.g., reductions) - these all need to turn
into intrinsics to be handled later. This seems like a positive change,
however.

as the HW is helping with all these cases and expands the number
of loops that can be vectorized.
Having native SV types or just plain builtins is equivalent here
as the code generator of the vectorizer can be improved to not
generate inefficient code.

This does not seem equivalent because while the mapping between scalar
operations and SV operations is straightforward with the adjusted type
system, the mapping between the scalar operations and the intrinsics
will require extra infrastructure to implement the mapping. Not that
this is necessarily difficult to build, but it needs to be updated
whenever we otherwise change the IR, and thus adds additional
maintenance cost for all of us.

Thanks again,

Hal