[RFC] Supporting ARM's SVE in LLVM

No. Let's make one thing clear now: we don't expect the VL to be
changed on the fly; once the process is started, it's fixed. Otherwise
things like stack frames with SVE objects would be invalid.

This then forbids different lengths on shared objects, which in turn
forces all objects in the same OS to have the same length. It could
still be a kernel option (like VMA or page table sizes), but not a
per-process one.

Like ARM's ABI, it only makes sense to go that far on public interfaces.

Given how conservative distros are in their deployments, this can force
"normal" people using SVE (ie. not super-computers) to either accept
the lowest common denominator or to optimise local code better.

Or distros will base that on the hardware default value (reported via
a kernel interface) and the deployment will be per-process... or there
will be multilib.

In any case, SVE is pretty cool, but deployment is likely to be *very*
complicated. Nothing we haven't seen with AArch64, or worse, on ARM,
so...

1. The vectorizer will have to deal with loop-carried dependences as
normal: if it doesn't have a guarantee about the VL, it has to
either avoid vectorizing some loops, or cap the effective
vectorization factor by restricting the loop predicate to a safe
value.

This should be easily done via the predicate register.

2. Yes the cost model is more complicated, but it's not necessarily
the case that we assume the smallest VL. We can cross that bridge when
we get to it though.

Ok.

cheers,
--renato

Libraries do not require the vector length to be encoded inside them,
so there is no restriction here. You don't pick a VL for them, the
point of vector length agnosticism is that the code generated runs
across ALL vector lengths. However, once your process starts then the
VL can be assumed to be constant, whatever it is. You could still
theoretically have two different processes using the same shared
libraries running with different VLs.

Bringing the discussion back onto the IR proposals:

One way to make your "seriesvector" concept show up *before* any spec
is out is to apply it to non-scalable vectors.

Today, we have the "zeroinitializer", which is very similar to what
you want. You can even completely omit the "vscale" if we get the
semantics right.

There is nothing to stop other targets from using
stepvector/seriesvector. In fact for wide vector targets, often the IR
constant for representing a step vector is explicitly expressed as
<i32 0, i32 1, i32 2..> and so on (this gets really cumbersome when
your vector length is 512 bits, for example). That could be replaced by
a single "stepvector" constant, and it works the same for both
fixed-length and scalable vectors.

Amara

I see. That would only work well if the arguments are passed directly
via SVE vectors. Encoding that in a soft-float variant would be
madness.

But also, it would require one to pass the predicate, as it's possible
that the original vectors are restricted. This also fits nicely with
the data/predicate pairs that are always required for SVE vectors.

A simple way would be to mandate z0/p0, z1/p1 to be paired for PCS
purposes. Interesting...

cheers,
--renato

Indeed! For this particular point, I think we should start there.

Also, on a more general comment regarding David's point about Hwacha,
maybe we could get some traction on the RISC-V front, to see if the
proposal is acceptable on their end, since they're likely to be using
this in the future in LLVM.

Alex, any comments?

cheers,
--renato

Does this make sense? I am not after agreement, I just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser’s chosen VF.

As much sense as is possible, I guess.

I’ll take that. Let's move on to the relationship between scalable vectors and VL. VL is very much a hardware centric value that we'd prefer not to expose at the IR level, beyond the requirements for a sufficiently accurate cost model.

An initial attempt to represent scalable vectors might be <n x Ty>. The problem with this approach is there's no perfect interpretation as to what the following type definitions mean:

  <n x i8>
  <n x i16>
  <n x i32>
  <n x i64>

[Interpretation 1]

A vector of "n" elements of the specified type. Here "n" is likely to be scaled based on the largest possible element type. This fits well with the following loop:

(1) for (0..N) { bigger_type[i] += smaller_type[i]; }

but becomes inefficient when the largest element type is not required.

[Interpretation 2]

A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type. This fits well with the following loop:

(2) for (0..N) { type[i] += type[i]; }

Neither interpretation is ideal with implicit knowledge required to understand the relationship between different vector types. Our proposal is a vector type where that relationship is explicit, namely <n x M x Ty>.

Reconsidering the above loops with this type system leads to IR like:

(1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32> ; bigger_type=i32, smaller_type=i8
(2) <n x 16 x i8> += <n x 16 x i8>

Here the value of "n" is the same across both loops and, more importantly, the bit-width of the largest vectors within both loops is the same. The relevance of the second point is that we now have a property that can be varied based on a cost model. This results in a predictable set of types that should lead to performant code, whilst allowing types outside that range to work as expected, just like non-scalable vectors.

All that remains is the ability to reference the isolated value of the "n" in "<n x M x Ty>", which is where the "vscale" constant proposal comes in.

    %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")

I see what you mean.

Quick question: Since you're saying "vscale" is an unknown constant,
why not just:
    %index.next = add nuw nsw i64 %index, i64 vscale

Hopefully the answer to this is now clear. Our intention is for a single constant to represent the runtime part of a scalable vector's length. Using the same loop examples from above, the induction variable updates become:

(1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
(2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

The runtime part of the scalable vector lengths remains the same with the second loop processing 4x the number of elements per iteration.

Does this make sense? Is this sufficient argument for the new type and associated "vscale" constant, or is there another topic that needs covering first?

As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation with the loop-vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal but it gets the ball rolling.

  Paul!!!

I don’t really get why the “naive” <n x Ty> wouldn't be enough for the loops you mentioned:

1) <n x i32> += zext <n x i8> as <n x i32>
2) <n x i8> += <n x i8>

The RISC-V vector proposal is still in the development stage, but it
will inevitably be vector length agnostic much like Hwacha. Krste gave
a talk about his proposal for the 'V' extension last year
<https://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>
and I'm looking forward to his update at the RISC-V Workshop this
Wednesday, not least because I'm hoping he'll have done my homework
for me and contrast his proposal to what is publicly known about SVE.
The proposal includes a vsetvl instruction (slide 20) which returns
the minimum of the hardware vector length and requested vector length.

I'll wait for the update on the proposed RISC-V vector extension
before really digging in to the SVE RFC on handling vector lengths. In
the mean time, are there any other vector-length agnostic
architectures that either have existing out-of-tree backends or may
have them in the future?

Best,

Alex

Reconsidering the above loops with this type system leads to IR like:

(1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32> ; bigger_type=i32, smaller_type=i8
(2) <n x 16 x i8> += <n x 16 x i8>

Hi Paul,

I'm with Mehdi on this... these examples don't look problematic. You
have shown what the different constructs would be good at, but I still
can't see where they won't be.

I originally thought that the extended version "<n x m x Ty>" was
required because SVE needs all vector lengths to be a multiple of
128-bits, so they'd be just "glorified" NEON vectors. Without it,
there is no way to make sure it will be a multiple.

(1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
(2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

The runtime part of the scalable vector lengths remains the same with the second loop processing 4x the number of elements per iteration.

Right, but this is a "constant", and LLVM would be forgiven for asking
its "size". With that proposal, there's no way to know whether it's
a <16 x i8> or a <16 x i32>.

The vectorizer concerns itself mostly with number of elements, not raw
sizes, but these types will survive the whole process, especially if
they come from intrinsics.

As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation with the loop-vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal but it gets the ball rolling.

Right, got that. Baby steps, safety first.

cheers,
--renato

(1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
(2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

The runtime part of the scalable vector lengths remains the same with the second loop processing 4x the number of elements per iteration.

Right, but this is a "constant", and LLVM would be forgiven for asking
its "size". With that proposal, there's no way to know whether it's
a <16 x i8> or a <16 x i32>.

The vectorizer concerns itself mostly with number of elements, not raw
sizes, but these types will survive the whole process, especially if
they come from intrinsics.

What is the relevance of the vector’s element type? The induction variable update is purely in terms of elements; it doesn’t care about the type. If you need to reference the vector length in bytes you would simply multiply it by the size of the vector’s element type, just as we do for non-scalable vectors.

Paul!!!

An initial attempt to represent scalable vectors might be <n x Ty>. The problem with this approach is there's no perfect interpretation as to what the following type definitions mean:

<n x i8>
<n x i16>
<n x i32>
<n x i64>

[Interpretation 1]

A vector of "n" elements of the specified type. Here "n" is likely to be scaled based on the largest possible element type. This fits well with the following loop:

(1) for (0..N) { bigger_type[i] += smaller_type[i]; }

but becomes inefficient when the largest element type is not required.

[Interpretation 2]

A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type. This fits well with the following loop:

(2) for (0..N) { type[i] += type[i]; }

I'm with Mehdi on this... these examples don't look problematic. You
have shown what the different constructs would be good at, but I still
can't see where they won't be.

I'll apply the loops to their opposite interpretation assuming bigger_type=i64, smaller_type=type=i8:

[Interpretation 1]

(2) for (0..N) { bytes[i] += other_bytes[i]; } ====> <n x i8> += <n x i8>
(2) for (0..N) { int64s[i] += other_int64s[i]; } ====> <n x i64> += <n x i64>

Because this interpretation requires "n" to be the same for all scalable vectors, the int64 loop clearly involves vectors that are 8x bigger than the byte loop's. Structurally this is fine from the IR's point of view, but in hardware they'll operate on vectors of the same length. The code generator will either split the int64 loop's instructions, thus planting 8 adds, or promote the byte loop's instructions, thus utilising only an 8th of the lanes.

[Interpretation 2]

(1) for (0..N) { int64s[i] += bytes[i]; } ==> <n x i64> += zext <??? x i8> as <n x i64>

This interpretation falls down at the IR level. If <n x i8> represents a vector full of bytes, how do you represent a vector that's an 8th full of bytes, ready to be zero-extended?

I originally thought that the extended version "<n x m x Ty>" was
required because SVE needs all vector lengths to be a multiple of
128-bits, so they'd be just "glorified" NEON vectors. Without it,
there is no way to make sure it will be a multiple.

Surely this is true of most vector architectures, hence the reason for costing vectors across a range of element counts to determine which the code generator likes best. Scalable vectors are no different with SVE's cost model preferring scalable vectors whose statically known length component (i.e. "M x sizeof(Ty)") is 128-bits because they'll better match the way the code generator models SVE registers.

Paul!!!

“If the above holds true then the the length would be only variable
between different hardware implementations…”

This seems related to a problem that has independently hit several different projects around the world for a while now, and only recently have people understood what the problem is.

These projects are all doing things that require cache invalidation, for example JIT compilation. They have hit a problem that the size of the cache block is changing unexpectedly underneath the program when the program is migrated from a big processor to a LITTLE. The program might start on a CPU with a 64 byte cache block and then suddenly find itself on a CPU with a 32 byte cache block, but it’s still doing cache flushes with a 64 byte stride. So half the cache blocks don’t get flushed.

As far as I’m aware, there is no defined time at which this happens. Maybe it could be between one instruction and the next! We don’t even know a good way to enumerate all cache block sizes present in the system at runtime (and always use the smallest one as the stride). So we’re for the moment hard-coding a value which we hope will always be small enough, and taking the (minor) hit from trying to flush the same cache block multiple times. A 32 byte stride, say, on a machine with 128 byte cache blocks is still a lot better than using a stride of 1 or 4 bytes.

If there is a defined time when these changes can happen e.g. at a system call then we’d really love to know about it!

Not having seen any actual designs for SVE, it seems possible to me that the vector width could also change on migration between core types. So perhaps the answer is the same.

The RISC-V vector proposal is still in the development stage, but it
will inevitably be vector length agnostic much like Hwacha. Krste gave
a talk about his proposal for the 'V' extension last year
<https://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>
and I'm looking forward to his update at the RISC-V Workshop this
Wednesday, not least because I'm hoping he'll have done my homework
for me and contrast his proposal to what is publicly known about SVE.

Thanks! This is really helpful!

The proposal includes a vsetvl instruction (slide 20) which returns
the minimum of the hardware vector length and requested vector length.

I haven't seen a similar instruction in SVE yet, but the compulsory
predicate on all instructions kinda makes that redundant, since you can
always use it to calculate the number of "affected" lanes, and thus
only increment the "right" amount per iteration without relying on
additional instructions. But this also seems to fit the concept of
"vscale", so if you say:

  %scale = i64 vscale

In RISC-V, this would literally translate to:

  vsetvl t0, a0

Then you could use it to increment the induction variable(s):

  %index.next = add nuw nsw i64 %index, mul (i64 %scale, i64 4)
  %index2.next = add nuw nsw i64 %index, mul (i64 %scale, i64 16)

If using "vscale" directly, the back-end would have to know which
instructions it's pertinent to:

  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
  %index2.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

If we assume that "vscale" is constant throughout the module, then
it's irrelevant. If there could be some change, this becomes a
problem, as you'd need a validity domain, which %scale would give you
for free.

Paul,

This is one of the issues we have to get right before any IR changes
are in effect, to make sure we won't need to change it again soon. In
SVE, such an operation would be a NOP, as the back-end is already
tracking it via the predicate registers.

cheers,
--renato

For pointer inductions, you have to add the total size, not the index
count. Wouldn't that need the final vector size?

I'm just trying to figure out if there's any pass that needs to know
the vector's actual length. I'm not saying there is... :slight_smile:

cheers,
--renato

I haven't seen a similar instruction in SVE yet, but the compulsory
predicate on all instructions kinda makes that redundant, since you can
always use it to calculate the number of "affected" lanes, and thus
only increment the "right" amount per iteration without relying on
additional instructions. But this also seems to fit the concept of
"vscale", so if you say:

%scale = i64 vscale

In RISC-V, this would literally translate to:

vsetvl t0, a0

This is one of the issues we have to get right before any IR changes
are in effect, to make sure we won't need to change it again soon. In
SVE, such an operation would be a NOP, as the back-end is already
tracking it via the predicate registers.

SVE has a similar instruction that returns the current vector length (rdvl). The reason you don’t see it in our example loop’s instruction output is because in that example we are able to use “incw x2” that increments x2 by the number of i32s a vector can hold.

None of our proposals are syntactic sugar with all having relevance to directing efficient code generation.

Right, of course! A <n x i8> vector can occupy any number of lanes.

So, for vscale = 4, <4 x 4 x i8> would use 16 lanes (out of a possible
64), while <4 x 16 x i8> would use all 64 lanes. The instructions that
are needed are also different: an extend + copy or just a copy.

All that matters here is the actual number of lanes, which is directly
obtained by (n * m) from <n x m x Ty>. If the number of lanes is
different, and the types can be converted (extend/truncate), then
you'll need additional pre-ops to fudge the data between the moves /
ops.

I think I'm getting the idea, now. :slight_smile:

cheers,
--renato

Ok, makes sense. Your proposal is also not affected by using "vscale"
or %vscale on the induction variable.

cheers,
--renato

Hi Renato,

I’m just revisiting the seriesvector question you had a couple of days back.

%const_vec = call <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

This intrinsic matches the seriesvector instruction we originally proposed. However, on reflection we didn't like how it allowed multiple representations for the same constant.

Can you expand how this allows multiple representations for the same constant?

This is a series, with a start and a step, and will only be identical
to another which has the same start and step.

Just like C constants can "appear" different...

const int foo = 4;
const int bar = foo;
const int baz = 2 + 2;

With the original "seriesvector" proposal the sequence 0, 2, 4, 6, 8... can be built in multiple ways:

(1) seriesvector(i32 0, i32 2)
(2) mul(seriesvector(i32 0, i32 1), splat(2))

Both patterns are likely to occur and so must be matched within InstCombine and friends. The solution could be to first simplify seriesvector related IR to a canonical form so only a single pattern requires matching against. The "stepvector" proposal asks why we would bother introducing the problem in the first place and instead force a canonical representation from the outset.

Obviously there will exist new idioms to canonicalise ((2*stepvector) + stepvector <==> 3*stepvector) but at least the most common cases are handled automatically.

We made this change when creating the upstream patch for "seriesvector" and observed its effect on the code base. When switching to "stepvector" the resulting patch was about 1/5 the size. That said if a constant intrinsic is the way to go then I imagine said patch will be even smaller.

One way to make your "seriesvector" concept show up *before* any spec
is out is to apply it to non-scalable vectors.

That is my intention with the stepvector patch (D27105: [Constants] Add "stepvector" to represent the sequence 0,1,2,3... [IR support for SVE scalable vectors 4/4]). You can see that the interface is common, but for non-scalable vectors the result is its equivalent ConstantVector. Once an agreed form is available, LoopVectorize::getStepVector can be converted to become compatible with scalable vectors very easily, albeit being only a small step on a long road.

  Paul!!!

(1) seriesvector(i32 0, i32 2)
(2) mul(seriesvector(i32 0, i32 1), splat(2))

Both patterns are likely to occur and so must be matched within InstCombine and friends. The solution could be to first simplify seriesvector related IR to a canonical form so only a single pattern requires matching against. The "stepvector" proposal asks why we would bother introducing the problem in the first place and instead force a canonical representation from the outset.

Obviously there will exist new idioms to canonicalise ((2*stepvector) + stepvector <==> 3*stepvector) but at least the most common cases are handled automatically.

I guess this is true for all constants, and it falls to things like
constant folding to bring it to a canonical representation.

We made this change when creating the upstream patch for "seriesvector" and observed its effect on the code base. When switching to "stepvector" the resulting patch was about 1/5 the size. That said if a constant intrinsic is the way to go then I imagine said patch will be even smaller.

If we use an intrinsic, we'll need to add this special case to the
vectoriser and maybe other passes, to treat that as a constant, but it
will be an addition, not a replacement.

If we create a new constant type, we'll need to change *a lot* of
existing tests (providing we want to use it across the board), which
will be a more complex change overall.

I'd leave that for later, once we have settled the semantics between
SVE and RISC-V, and just have an intrinsic for now. Once the RISC-V
implementation comes, we can compare it to SVE, and if it's identical,
it'll be easier to implement as an intrinsic (for quick enablement);
then later on, it'll be a no-brainer refactor.

But this is just my personal (and reasonably uninformed) opinion. I'll
let other people broaden the discussion a bit. :slight_smile:

cheers,
--renato

(This is somewhat of a digression from the topic of SVE, but…)