Questions about vscale

Hi,

In the RISC-V V extension, an instruction can operate on a group of vector registers; the group size is called LMUL. If LMUL equals 2, the instruction operates on 2 vector registers at the same time. So we have the following combinations of types.

        | LMUL = 1         | LMUL = 2         | LMUL = 4          | LMUL = 8
int64_t | vscale x 1 x i64 | vscale x 2 x i64 | vscale x 4 x i64  | vscale x 8 x i64
int32_t | vscale x 2 x i32 | vscale x 4 x i32 | vscale x 8 x i32  | vscale x 16 x i32
int16_t | vscale x 4 x i16 | vscale x 8 x i16 | vscale x 16 x i16 | vscale x 32 x i16
int8_t  | vscale x 8 x i8  | vscale x 16 x i8 | vscale x 32 x i8  | vscale x 64 x i8

We have another architecture parameter, ELEN, which means the maximum size of a single vector element in bits.

We hope the type system can be consistent under ELEN = 32 and ELEN = 64. However, vscale may be a fractional value under ELEN = 32 in the above type system. When ELEN = 32, i64 is an invalid type (we could ignore the first row for ELEN = 32) and vscale may become 1/2 at run time to fit the architecture (if the vector register only has 32 bits). Is there any problem with assuming vscale to be fractional under some circumstances? vscale is an unknown value at compile time, so it should have no impact on code generation and optimization. The relationships between types are correct regardless of vscale's value. Is there anything I missed?
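A minimal sketch of the rule behind the table, assuming it is built for ELEN = 64 so that <vscale x 1 x i64> fills exactly one base register. The helper names `min_elements` and `implied_vscale` are hypothetical and only serve the illustration; the second one shows where the value 1/2 comes from on a machine with 32-bit vector registers.

```python
from fractions import Fraction

BASE_ELEN = 64  # assumption: the table is anchored on 64-bit elements


def min_elements(sew: int, lmul: int, base_elen: int = BASE_ELEN) -> int:
    """Return n in <vscale x n x iSEW> for element width `sew` and group size `lmul`."""
    return lmul * base_elen // sew


def implied_vscale(vlen: int, base_elen: int = BASE_ELEN) -> Fraction:
    """vscale implied by mapping <vscale x 1 x i64> onto one VLEN-bit register."""
    return Fraction(vlen, base_elen)


# Rebuild the rows of the table.
for sew in (64, 32, 16, 8):
    row = [f"vscale x {min_elements(sew, lmul)} x i{sew}" for lmul in (1, 2, 4, 8)]
    print(f"int{sew}_t | " + " | ".join(row))

print(implied_vscale(128))  # 2
print(implied_vscale(32))   # 1/2, the fractional case in question
```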

Thanks!

Hsiangkai

Hi,

For my own education, some quick questions:

1. is LMUL always a multiple of ELEN?
2. Is this fixed on the hardware, depending on the actual lengths, or
is this dynamically set by software (on a register or status flag)?
2a. If dynamic, can it change from program to program? Function to function?

> We hope the type system can be consistent under ELEN = 32 and ELEN = 64. However, vscale may be a fractional value under ELEN = 32 in the above type system. When ELEN = 32, i64 is an invalid type (we could ignore the first row for ELEN = 32) and vscale may become 1/2 at run time to fit the architecture (if the vector register only has 32 bits).

Do you mean ELEN=32 like this?
int32_t | vscale x 1 x i32 | vscale x 2 x i32 | vscale x 4 x i32  | vscale x 8 x i32
int16_t | vscale x 2 x i16 | vscale x 4 x i16 | vscale x 8 x i16  | vscale x 16 x i16
int8_t  | vscale x 4 x i8  | vscale x 8 x i8  | vscale x 16 x i8  | vscale x 32 x i8

If the type is invalid, you would need to legalise it, and in that
case create some cluttered accessors (via insert/extract element) and
possibly use intrinsics to expose underlying instructions that can
deal with it.

Perhaps I'm not clear on what you need, but vscale is supposed to be
the number of valid elements (lanes), and given i64 is invalid, vscale
wouldn't apply?

> Is there any problem with assuming vscale to be fractional under some circumstances? vscale is an unknown value at compile time, so it should have no impact on code generation and optimization. The relationships between types are correct regardless of vscale's value. Is there anything I missed?

I believe the assumption was always that vscale is an integer.
Representing it as a fraction would certainly need code changes, and
we would also have to re-evaluate the assumptions built on it.

I'm copying some SVE and LV people to give a more informed opinion.

cheers,
--renato

Hi all,

> 1. is LMUL always a multiple of ELEN?

This happens to be true (at least in the current spec, disregarding
some in-progress proposals) just because both are powers of two and
the largest possible LMUL equals the smallest possible ELEN (8), but I
don't think there is any meaning to be found in this observation. The
two values govern unrelated aspects of the vector unit.

> 2. Is this fixed on the hardware, depending on the actual lengths, or
> is this dynamically set by software (on a register or status flag)?
> 2a. If dynamic, can it change from program to program? Function to function?

It's not clear whether by "this" you mean ELEN, LMUL, or something
else. ELEN is fixed in hardware. LMUL is a property of each individual
instruction. Most instructions take it from a control register, a few
encode it in the instruction as an immediate, but in any case it needs
to be statically determined (on a per-instruction basis) to be able to
allocate registers. This is not just a constraint for
compiler-generated code, but also for all hand-written assembly code
I've seen or can imagine.

> > We hope the type system can be consistent under ELEN = 32 and ELEN = 64. However, vscale may be a fractional value under ELEN = 32 in the above type system. When ELEN = 32, i64 is an invalid type (we could ignore the first row for ELEN = 32) and vscale may become 1/2 at run time to fit the architecture (if the vector register only has 32 bits).
>
> Do you mean ELEN=32 like this?
> int32_t | vscale x 1 x i32 | vscale x 2 x i32 | vscale x 4 x i32  | vscale x 8 x i32
> int16_t | vscale x 2 x i16 | vscale x 4 x i16 | vscale x 8 x i16  | vscale x 16 x i16
> int8_t  | vscale x 4 x i8  | vscale x 8 x i8  | vscale x 16 x i8  | vscale x 32 x i8
>
> If the type is invalid, you would need to legalise it, and in that
> case create some cluttered accessors (via insert/extract element) and
> possibly use intrinsics to expose underlying instructions that can
> deal with it.
>
> Perhaps I'm not clear on what you need, but vscale is supposed to be
> the number of valid elements (lanes), and given i64 is invalid, vscale
> wouldn't apply?

I don't know what "vscale wouldn't apply" is supposed to mean. Whether
it's legal or not, you can write LLVM IR using (for example) the type
<vscale x 1 x i64> even if the target doesn't natively support it. The
purpose of legalization is to make sure that this results in the behavior
the type is supposed to have. For <vscale x 1 x i64>, this means among
other things:

- it has the same number of elements as <vscale x 1 x i32>, but each
element is twice as big
- it has half as many elements (each of the same size) as <vscale x 2 x i64>
- its total size in bits is the same as <vscale x 2 x i32>
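The three relationships listed for <vscale x 1 x i64> can be checked mechanically for any integer vscale. The following is only a sketch: `ScalableVec` is a hypothetical model of a scalable vector type, not an LLVM API.

```python
from typing import NamedTuple


class ScalableVec(NamedTuple):
    min_elts: int  # n in <vscale x n x iW>
    elt_bits: int  # W

    def num_elements(self, vscale: int) -> int:
        return vscale * self.min_elts

    def size_bits(self, vscale: int) -> int:
        return self.num_elements(vscale) * self.elt_bits


nxv1i64 = ScalableVec(1, 64)
nxv1i32 = ScalableVec(1, 32)
nxv2i64 = ScalableVec(2, 64)
nxv2i32 = ScalableVec(2, 32)


def relationships_hold(vscale: int) -> bool:
    """Check the three properties of <vscale x 1 x i64> for one value of vscale."""
    return (
        # same number of elements as <vscale x 1 x i32>, each twice as big
        nxv1i64.num_elements(vscale) == nxv1i32.num_elements(vscale)
        and nxv1i64.elt_bits == 2 * nxv1i32.elt_bits
        # half as many elements as <vscale x 2 x i64>
        and 2 * nxv1i64.num_elements(vscale) == nxv2i64.num_elements(vscale)
        # same total size in bits as <vscale x 2 x i32>
        and nxv1i64.size_bits(vscale) == nxv2i32.size_bits(vscale)
    )


print(all(relationships_hold(v) for v in (1, 2, 4, 8)))  # True
```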

I think that focusing on the completely illegal i64 might obscure the
real problem I see with the fractional vscale concept. Let's look at
<vscale x 1 x i32> instead. The elements are clearly legal in this
context, even in some vector types, but the <vscale x 1 x i32> type is
absent from Kai's table. This makes sense: the same vector register
fits 2x as many i32 elements as i64 elements, so if you start with
<vscale x 1 x i64> mapping to a single register, then <vscale x 2 x
i32> is the same size and fits in the same register class, while
<vscale x 1 x i32> is too small and must be legalized somehow.

But how? If we take Kai's table as gospel and look at a VLEN = ELEN =
32 machine, the vector type <vscale x 2 x i32> is supposed to map to a
single vector register, which is 32b small, and thus <vscale x 2 x
i32> would have just one element in this context (matching the "vscale
= 1/2" intuition). To be consistent with this, <vscale x 1 x i32>
would have to contain just *half* an element. This is not something
any legalization strategy can achieve, because it is a fundamentally
impossible notion. So we end up in a situation where some types are
not just illegal and have to be legalized, but are contradictory and
can't be legalized in any meaningful way.
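The arithmetic behind the "half an element" problem, as a small sketch. The function names are hypothetical; the only assumption is that the element count of <vscale x n x iW> is vscale * n, as everywhere else in this thread.

```python
from fractions import Fraction


def element_count(min_elts: int, vscale) -> Fraction:
    """Number of elements in <vscale x min_elts x iW> for a possibly fractional vscale."""
    return Fraction(vscale) * min_elts


def is_realizable(min_elts: int, vscale) -> bool:
    """A vector can only exist if it holds a whole number of elements."""
    return element_count(min_elts, vscale).denominator == 1


half = Fraction(1, 2)  # VLEN = 32 under a table anchored on 64-bit elements

print(element_count(2, half), is_realizable(2, half))  # 1 True
print(element_count(1, half), is_realizable(1, half))  # 1/2 False
```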

I don't think LLVM can/should support this kind of contradiction. Some
types have to be legalized, sometimes the legalization is not
efficient, sometimes it's not even implemented, that's all fine. But
letting some targets decide that <vscale x 1 x i32> is a fundamentally
impossible type to even assign a meaning to... that seems
unprecedented and contrary to the philosophy of LLVM IR as reasonably
target-independent IR.

The obvious solution is to use a different set of legal vector types
(and thus, a different interpretation of vscale) depending on the
largest legal element type (ELEN in RISC-V jargon). This corresponds
to the table for ELEN=32 that Renato gave above. Kai's proposal is
intended to avoid this, and I can understand the desire for that, but
it really seems like the lesser evil to me.

Best regards
Hanna

Hi,

thanks Hanna for pointing at the contradiction under this modeling.

I wonder if HwModes can help us here. I feel that in some way ELEN is playing a role similar to RISC-V's XLEN. This way we could assign different value types to the register classes associated with the different LMUL values.

E.g.
ELEN=32 the register class for base registers (i.e. LMUL=1) could include nxv1i32, nxv1f32, nxv2i16, nxv2i8, etc.
ELEN=64 the register class for base registers could include nxv1i64, nxv1f64, nxv2i32, nxv2f32, … but does not have to include nxv1f32, nxv1i32, nxv2i16, etc. (my understanding is that there is an ongoing proposal to efficiently allow manipulating such values as if they were subregisters of the base registers, but I’m ignoring that for now).

Kind regards,

Message from Hanna Kruppe via llvm-dev <llvm-dev@lists.llvm.org> on Tue, 7 Apr 2020 at 13:52:

Hi Roger,

HwModes indeed seem a nice way to model "different legal types
depending on ELEN" in the backend. However, at the moment it seems
there's still no consensus that this is the route we should/need to
take. Maybe we should settle that question before giving these
implementation details more thought?

Kind regards,
Hanna

> > 1. is LMUL always a multiple of ELEN?
> This happens to be true (at least in the current spec, disregarding
> some in-progress proposals) just because both are powers of two and
> the largest possible LMUL equals the smallest possible ELEN (8), but I
> don't think there is any meaning to be found in this observation. The
> two values govern unrelated aspects of the vector unit.

Sorry, I meant multiple of basic types. But you have answered my question. :)

> > 2. Is this fixed on the hardware, depending on the actual lengths, or
> > is this dynamically set by software (on a register or status flag)?
> > 2a. If dynamic, can it change from program to program? Function to function?
> It's not clear whether by "this" you mean ELEN, LMUL, or something
> else. ELEN is fixed in hardware. LMUL is a property of each individual
> instruction.

Sorry again, "this" as in both ELEN and LMUL and their relationship. Ack.

> I don't know what "vscale wouldn't apply" is supposed to mean.

Legalisation-wise, you got it right: something like <n x 0.5 x i64> is
invalid and gets converted to <n x 1 x i32>, which is what it is.

"Wouldn't apply" as in "what would be the point of having half-scale
on a type that needs to be broken in half", and thus making it whole.
You explain better below, so ignore it for now.

> But how? If we take Kai's table as gospel and look at a VLEN = ELEN =
> 32 machine, the vector type <vscale x 2 x i32> is supposed to map to a
> single vector register, which is 32b small, and thus <vscale x 2 x
> i32> would have just one element in this context (matching the "vscale
> = 1/2" intuition). To be consistent with this, <vscale x 1 x i32>
> would have to contain just *half* an element. This is not something
> any legalization strategy can achieve, because it is a fundamentally
> impossible notion. So we end up in a situation where some types are
> not just illegal and have to be legalized, but are contradictory and
> can't be legalized in any meaningful way.

Right, we have faced that problem before on non-scalable vector extensions.

For example, vectorising 3 operations in a 4-wide vector and adding an
undef in the last lane.

It didn't use to be possible to do that, many years ago, as a general
case. But if you look at register aliasing (VFP and NEON in ARMv7), we
had the idea of different number of elements on the same register,
depending on how you look.

I'm not proposing to create all combinations of half-vscale shadowing,
but perhaps adding half-length types as valid and lowering them in a
special way could work much simpler than changing the interpretation
of vscale.

Also, I'm acting like devil's advocate, so don't take my comments as a
rejection of your proposal, I'm just trying to understand where you
are coming from.

cheers,
--renato

Hi,

Looking at the language reference, vscale is an integer. This might pose a problem for fractional vscale. Furthermore, I believe that vscale is constant throughout the life of the program; so if RISC-V vscale can vary from instruction to instruction that may also be problematic unless you can just commit to one specific value of vscale.

Also, I had a question about your table. Based on your description of how LMUL works, I’d expect that LMUL == vscale, and that each column in your table would be the same:

int64_t | vscale x 1 x i64
int32_t | vscale x 2 x i32
int16_t | vscale x 4 x i16
int8_t  | vscale x 8 x i8

… which is basically equivalent to:

int64_t | LMUL x 1 x i64
int32_t | LMUL x 2 x i32
int16_t | LMUL x 4 x i16
int8_t  | LMUL x 8 x i8

… is this not the case?

Thanks,

Chris Tetreault

Hi Chris,

vscale is constant in RISC-V as well. It doesn’t change from instruction to instruction.

There are two implementation parameters of the RISC-V V-extension: VLEN, the number of bits of a (base) vector register, and ELEN, the maximum width in bits of an element in a vector register (i.e. if ELEN=32 one can't operate on vectors of i64 or f64). Ideally we'd want to make the compiler as oblivious to these parameters as possible. VLEN itself is not that different from the SVE situation. ELEN is a bit different because it affects which vector types are legal in the code generator, which in turn affects other parts such as the loop vectorizer.

It seems reasonable to make vscale = VLEN / ELEN.

In this model, assume ELEN=64: legal types include <vscale x 1 x i64> and <vscale x 2 x i32>. As the ISA stands now, it is not efficient to operate on a value of type <vscale x 1 x i32> when ELEN=64, so it seems reasonable not to make such types legal. However, with ELEN=32, <vscale x 1 x i32> is a sensible type.

LMUL is a mechanism the ISA provides to operate on groups of registers (which is useful in a number of cases that I won't go into here). There are 32 (base) vector registers (i.e. LMUL=1). LMUL=2 means grouping them in groups of two, LMUL=4 in groups of four, and so on, up to LMUL=8. So under LMUL=2 there are 16 groups, each twice the length of a base vector register; under LMUL=4 there are 8 groups, each four times the length of a base register. The ISA allows using these groups as registers, so they can be modelled as "super registers" of the base registers (or, conversely, base registers can be seen as subregisters of group registers).

Continuing with the model above (ELEN=64), an LMUL=2 group would be able to represent IR values of <vscale x 2 x i64> and <vscale x 4 x i32>. An LMUL=4 group can represent IR values of <vscale x 4 x i64>, <vscale x 8 x i32>, and so on.
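A sketch of this model, assuming vscale = VLEN / ELEN and listing only integer element types. The function names are hypothetical, and the `nxv<n>i<w>` spelling just abbreviates <vscale x n x iW>. The last line checks that a type assigned to an LMUL=2 group really occupies two base registers.

```python
from typing import List


def legal_int_types(elen: int, lmul: int) -> List[str]:
    """Integer vector types that exactly fill a register group of size `lmul`."""
    names = []
    sew = elen
    while sew >= 8:  # element widths ELEN, ELEN/2, ..., 8
        names.append(f"nxv{lmul * elen // sew}i{sew}")
        sew //= 2
    return names


def group_bits(vlen: int, lmul: int) -> int:
    """Size in bits of a register group made of `lmul` base registers."""
    return vlen * lmul


def type_bits(min_elts: int, sew: int, vlen: int, elen: int) -> int:
    """Run-time size of <vscale x min_elts x i(sew)> with vscale = VLEN / ELEN."""
    vscale = vlen // elen
    return vscale * min_elts * sew


print(legal_int_types(64, 1))  # ['nxv1i64', 'nxv2i32', 'nxv4i16', 'nxv8i8']
print(legal_int_types(64, 2))  # ['nxv2i64', 'nxv4i32', 'nxv8i16', 'nxv16i8']
print(type_bits(4, 32, vlen=128, elen=64) == group_bits(128, 2))  # True
```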

Kai's original question is related to the fact that vscale depends on ELEN, which in turn affects the legal types. So one modelling option is to try to remove that dependence on ELEN by fixing it, say, to 64. Under this approach, the legal types would always be, regardless of ELEN, <vscale x 1 x i64>, <vscale x 2 x i32>. As Hanna pointed out in an earlier answer, this may lead to nonsensical (regardless of legalization) IR types.

Hope this helps.

Kind regards,

Message from Chris Tetreault via llvm-dev <llvm-dev@lists.llvm.org> on Tue, 7 Apr 2020 at 18:39:

OK, I see now. The original email did not mention VLEN, so I assumed LMUL was VLEN. This makes sense.

I guess my original point about vscale being an integer still stands though. If VLEN / ELEN is computed using integer arithmetic, then the result rounded towards 0 is 0. Unless we make floating point vscale a thing, it should be impossible for vscale to be fractional.

Thanks, Hanna.

> I don't think LLVM can/should support this kind of contradiction. Some
> types have to be legalized, sometimes the legalization is not
> efficient, sometimes it's not even implemented, that's all fine. But
> letting some targets decide that <vscale x 1 x i32> is a fundamentally
> impossible type to even assign a meaning to… that seems
> unprecedented and contrary to the philosophy of LLVM IR as reasonably
> target-independent IR.

If we apply the type system pointed out by Renato, is the vector type <vscale x 1 x i16> legal? If we decide that <vscale x 1 x i16> is a fundamentally impossible type, is that contrary to the philosophy of LLVM IR as a reasonably target-independent IR? I do not get the point of your argument.

> The obvious solution is to use a different set of legal vector types
> (and thus, a different interpretation of vscale) depending on the
> largest legal element type (ELEN in RISC-V jargon). This corresponds
> to the table for ELEN=32 that Renato gave above. Kai's proposal is
> intended to avoid this, and I can understand the desire for that, but
> it really seems like the lesser evil to me.

The problem with defining a different type system depending on the largest legal element type (ELEN in RISC-V jargon) is that the type systems are not compatible. I assume that programs compiled for ELEN = 32 can be run on ELEN = 64 machines, and that it should be possible to link ELEN = 32 objects with ELEN = 64 objects. If we use the type <vscale x 1 x i32> under ELEN = 32, there is no corresponding type under ELEN = 64 (see my table); it appears to be an illegal type under ELEN = 64. Does that follow the philosophy of a target-independent IR?

I hope we can design a unified type system for different ELEN values. However, in my proposal vscale may be fractional at run time under some circumstances (VLEN = 32, ELEN = 32). That is why I would like to know whether a fractional vscale matters or not.

Thanks,
Kai

Hi Chris,

> I guess my original point about vscale being an integer still stands though. If VLEN / ELEN is computed using integer arithmetic, then the result rounded towards 0 is 0. Unless we make floating point vscale a thing, it should be impossible for vscale to be fractional.

I forgot to mention that, as the spec stands now, VLEN and ELEN are powers of two and ELEN <= VLEN, so ELEN should always evenly divide VLEN (this would make 1 the minimum “runtime” value of vscale if we define vscale = VLEN / ELEN).
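A quick check of both points, as a sketch. The range of VLEN and ELEN values below is an arbitrary illustration, and `vscale_for` is a hypothetical helper. With ELEN taken from the actual hardware the division is always exact and at least 1; the value 0 only shows up if the divisor is pinned to 64 on a VLEN = 32 machine.

```python
def vscale_for(vlen: int, elen: int) -> int:
    """vscale = VLEN / ELEN, for powers of two with ELEN <= VLEN."""
    for name, value in (("VLEN", vlen), ("ELEN", elen)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two, got {value}")
    if elen > vlen:
        raise ValueError("expected ELEN <= VLEN")
    return vlen // elen


powers = [2 ** k for k in range(3, 11)]  # 8, 16, ..., 1024
pairs = [(v, e) for v in powers for e in powers if e <= v]

# ELEN always divides VLEN evenly, and vscale is never below 1.
print(all(v % e == 0 and vscale_for(v, e) >= 1 for v, e in pairs))  # True
print(vscale_for(32, 32))  # 1

# Integer division with the divisor fixed to 64 on a 32-bit register:
print(32 // 64)  # 0
```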

Kind regards,

> If we apply the type system pointed out by Renato, is the vector type <vscale x 1 x i16> legal? If we decide that <vscale x 1 x i16> is a fundamentally impossible type, is that contrary to the philosophy of LLVM IR as a reasonably target-independent IR? I do not get the point of your argument.

Hi Kai,

Don't worry about target-independent IR in your design of intermediate
passes or lowering.

By the time the front-end lowers to LLVM IR, it already has, often
irreversible, target-specific knowledge in it.

If by some stroke of luck that doesn't happen, then using "<vscale x
...>" is enough indication that you should not try to lower that
onto a target that it wasn't specifically aimed at.

No one expects the middle-end to be target-neutral. That's the whole
point of constantly asking target-specific machinery (like TTI) about
what's possible or what's "good" and what's not.

More importantly, the closer you are to the end of the pass pipeline,
the closer the IR is to machine IR. It's not uncommon, and often
expected, to see "just the right amount of shuffles" to match lowering
patterns into MIR and then Asm.

IIRC, OpenCL or some other parallel-compute/graphic oriented pipeline
does use odd vector shapes and legalise them later on.

I may be severely outdated in my opinion, and happy to be corrected,
but I don't think it would be totally egregious to carry on with
(whole numbered) vector shapes that aren't strictly legal, as long as
you guarantee that *any* such pattern gets correctly legalised by the
lowering.

If you can also make the adversarial cases perform well on top of
that, it's a bonus, not a target.

Hope this helps.

cheers,
--renato

Hi Renato,

Thanks for your reply.
IIUC, if I can lower LLVM IR to Asm correctly under my type system design, there should be no problem with it.
Another concern I have is that the llvm.vscale intrinsic is designed to return an integer value. Should we relax it to return a float or something else? Thanks.

Kai

Hi Kai,

> IIUC, if I can lower LLVM IR to Asm correctly under my type system design, there should be no problem with it.

It depends on what you mean by "no problem". :)

The design of the IR is target independent, so that we can represent
language constructs in a generic way and optimise them with a common
infrastructure.

However, there are three main sources of target-dependence:

1. Front-ends may lower different IR depending on different targets,
for example valid types, call ABI, etc.
2. The middle-end takes decisions based on target validity, which
changes the shape of IR, for example specific sequences of
instructions on specific types.
3. Intrinsic functions. Those can be generated by the front-end or a
pass optimises code with them, for example, the vectoriser.

The general trend is that the IR becomes more target-dependent as more
passes change it, but it also means passes are less and less able to
recognise patterns and therefore are less useful on your code. That's
why pass ordering matters.

A good example is GPU/accelerator code: if you pass that IR through
the normal pipeline, it comes out the other end unrecognisable and
impossible to lower, so those targets tend to have their own pass
pipelines.

You don't want to have a special pipeline, so you need to make sure
your IR is as generic as possible, or you won't profit as much, or
worse, things will break apart.

More specifically, the entirety of vscale design has been assuming
integral scales, so anything that is not integral will very likely not
work with existing (and future) standard passes.

By making your target more special, you have also made it harder to
optimise. Neither you nor the rest of the community wants to add
special cases for odd targets in every optimisation pass, so we need
to find a common ground.

> Another concern I have is that the llvm.vscale intrinsic is designed to return an integer value. Should we relax it to return a float or something else?

I personally believe that this would open a can of worms we don't want
to handle. Not right now, at the very least.

I would strongly oppose a float value, because of all the problems FP
representation has and because existing instructions expect integer
values.

But it could be doable to have a fractional value, like two integers,
where the numerator doesn't have to be greater than nor a multiple of
the denominator (with default value as 1).
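To make the "two integers" idea concrete, here is a rough sketch. `RationalVScale` is a hypothetical type invented for this illustration and is not anything that exists in LLVM. It also shows that the rational form does not remove the earlier problem: a type whose element count is not a whole number still has to be rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RationalVScale:
    """vscale as numerator/denominator; the denominator defaults to 1."""
    numerator: int
    denominator: int = 1

    def __post_init__(self) -> None:
        if self.numerator <= 0 or self.denominator <= 0:
            raise ValueError("numerator and denominator must be positive")

    def element_count(self, min_elts: int) -> int:
        """Elements in <vscale x min_elts x iW>; fails if not a whole number."""
        total = self.numerator * min_elts
        if total % self.denominator != 0:
            raise ValueError(
                f"<vscale x {min_elts} x ...> has no whole element count "
                f"for vscale = {self.numerator}/{self.denominator}"
            )
        return total // self.denominator


print(RationalVScale(2).element_count(4))     # 8, ordinary integer vscale
print(RationalVScale(1, 2).element_count(2))  # 1, vscale = 1/2
```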

Again, I'm not the authority here, I'm just giving some context. Other
scalable vector developers should chime in with their opinion.

cheers,
--renato

Hi Renato,

Thanks for your explanation. It is very helpful. Now I totally understand your concern.

GCC has a similar representation for scalable vector types. It uses a data structure called poly_int [1] to represent the run-time part. It is also used to represent the relationships (half size, double size, etc.) between types. It looks similar to (a * X + b) * type, where X represents the run-time value. This is equivalent to the LLVM representation vscale * n * type.

Similarly, GCC has a target hook to calculate (a * X + b), called TARGET_ESTIMATED_POLY_VALUE [2]. The difference between GCC and LLVM is that GCC uses the whole polynomial expression to calculate the actual run-time vector length, instead of exposing only the run-time variable, i.e., X in GCC and vscale in LLVM. Maybe it is not a good idea to return a float from the vscale intrinsic. It would be more reasonable to pass the element count to the intrinsic and have it calculate the actual run-time length of the scalable vector type. That is,

declare i32 @llvm.vscale.i32(i32 %ElementCount)
declare i64 @llvm.vscale.i64(i32 %ElementCount)
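A simplified model of the (a * X + b) form and of the proposed intrinsic, as a sketch only. `Poly` is a hypothetical stand-in and not GCC's actual poly_int API, and `vscale_elements` merely mimics the proposed signature above; it does not describe the existing llvm.vscale intrinsic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Poly:
    """The value a * X + b, where X is only known at run time."""
    b: int  # constant term
    a: int  # coefficient of the run-time value X

    def __add__(self, other: "Poly") -> "Poly":
        return Poly(self.b + other.b, self.a + other.a)

    def evaluate(self, x: int) -> int:
        return self.a * x + self.b


def scalable_count(n: int) -> Poly:
    """Element count of <vscale x n x iW>: n * X with no constant part."""
    return Poly(b=0, a=n)


def vscale_elements(element_count: int, x: int) -> int:
    """Hypothetical model of the proposed llvm.vscale(ElementCount)."""
    return scalable_count(element_count).evaluate(x)


print(vscale_elements(4, 2))                 # 8
print((Poly(3, 0) + Poly(0, 2)).evaluate(4)) # 11
```

Because the whole expression is evaluated in one place, the target decides the final length, and the IR never needs to observe a non-integer intermediate value.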

Does it make sense?

[1] https://gcc.gnu.org/onlinedocs/gccint/Overview-of-poly_005fint.html#Overview-of-poly_005fint
[2] https://github.com/gcc-mirror/gcc/blob/master/gcc/targhooks.c#L1703

> > 1. is LMUL always a multiple of ELEN?
> This happens to be true (at least in the current spec, disregarding
> some in-progress proposals) just because both are powers of two and
> the largest possible LMUL equals the smallest possible ELEN (8), but I
> don't think there is any meaning to be found in this observation. The
> two values govern unrelated aspects of the vector unit.

Sorry, I meant multiple of basic types. But you have answered my question. :slight_smile:

> > 2. Is this fixed on the hardware, depending on the actual lengths, or
> > is this dynamically set by software (on a register or status flag)?
> > 2a. If dynamic, can it change from program to program? Function to function?
> It's not clear whether by "this" you mean ELEN, LMUL, or something
> else. ELEN is fixed in hardware. LMUL is a property of each individual
> instruction.

Sorry again, "this" as in both ELEN and LMUL and their relationship. Ack.

> I don't know what "vscale wouldn't apply" is supposed to mean.

Legalisation-wise, you got right, like <n x 0.5 x i64> is invalid and
gets converted to <n x 1 x i32>, which it is.

"Wouldn't apply" as in "what would be the point of having half-scale
on a type that needs to be broken in half", and thus making it whole.
You explain better below, so ignore it for now.

> But how? If we take Kai's table as gospel and look at a VLEN = ELEN =
> 32 machine, the vector type <vscale x 2 x i32> is supposed to map to a
> single vector register, which is 32b small, and thus <vscale x 2 x
> i32> would have just one element in this context (matching the "vscale
> = 1/2" intuition). To be consistent with this, <vscale x 1 x i32>
> would have to contain just *half* an element. This is not something
> any legalization strategy can achieve, because it is a fundamentally
> impossible notion. So we end up in a situation where some types are
> not just illegal and have to be legalized, but are contradictory and
> can't be legalized in any meaningful way.

Right, we have faced that problem before on non-scalable vector extensions.

For example, vectorising 3 operations in a 4-wide vector and adding an
undef in the last lane.

It didn't use to be possible to do that, many years ago, as a general
case. But if you look at register aliasing (VFP and NEON in ARMv7), we
had the idea of a different number of elements on the same register,
depending on how you look at it.

I'm not proposing to create all combinations of half-vscale shadowing,
but perhaps adding half-length types as valid and lowering them in a
special way could be much simpler than changing the interpretation
of vscale.

[re-sending because I dropped the list -- sorry for the extra copy, Renato!]

I don't see how the situation you mention is comparable. Legalization
for e.g. <3 x i32> was not implemented at first, but as demonstrated
by the fact that it *was* implemented later, there's no conceptual
problem with legalizing that kind of type. You don't even have to
legalize them in vector registers, three scalar registers work fine
(you can even do that on the IR level).
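
As a small illustration of that last point, here is a Python sketch of an add on <3 x i32> legalized into three independent scalar adds. The function name is made up for this example, and the 32-bit wraparound is modelled with a mask.

```python
MASK32 = 0xFFFFFFFF


def add_3xi32_scalarized(a, b):
    """Model of `add <3 x i32>` lowered to three scalar i32 adds."""
    # Each lane is handled by its own scalar operation; no vector register
    # (and no padding lane) is needed.
    r0 = (a[0] + b[0]) & MASK32
    r1 = (a[1] + b[1]) & MASK32
    r2 = (a[2] + b[2]) & MASK32
    return (r0, r1, r2)
```

The point is only that the element count (3) is a whole number, so some correct mapping always exists.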

For <vscale x 1 x i32> with a fractional value of vscale, there are
several conceivable ways to "legalize" this type, but none of them
work. Legalization (codegen in general) does not know if the machine
code will eventually run on a chip with vector registers so small that
vscale works out to 1/2, but it has to choose some legalization
strategy. I can imagine several approaches to this, but since the
actual value of vscale is not known at this time, it will have to map
the illegal scalable vector types to the vector registers in some way,
to ensure there's enough space even when vscale is very large in some
executions of the program.

Depending on how you do that exactly, the generated code might have
different behavior when running on a vscale == 1/2 machine, e.g. you
might end up with a vector register holding *one* i32 element or a
vector register holding *zero* i32 elements (i.e., the sole lane of
the 32-bit vector register is masked out). There might be other
approaches that result in yet another behavior, such as a hardware
fault, but crashes and other immediate problems aside, you're going to
end up with a certain discrete number of i32 values. That's a problem.
If <vscale x 1 x i32> ends up having one element, and <vscale x 2 x
i32> also has one (= 2 * 0.5) element, then that's wrong: the latter
type must have twice as many elements as the former (one example where
this matters: split_low / split_high / concat shuffle patterns). The
second option, a vector with *zero* elements, is just as wrong if not
worse.

It's not that a correct legalization exists but it's too annoying to
implement, or that one might exist but I'm too lazy to work it out.
We're also not running into a limitation or oddity of the RISC-V vector
ISA in particular. It's simply that, if you set vscale == 0.5, then by
the way scalable vector types work (vscale * const elements), some
vector types that can be written in the IR would need to have a
fractional number of elements to be consistent with the other scalable
vector types. As that is not possible (not even conceptually),
whatever code you emit to try to legalize that type will end up being
wrong in some respect.
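
A minimal Python sketch of the arithmetic behind this argument, assuming vscale == 1/2 and using exact fractions. The helper names are only for illustration.

```python
from fractions import Fraction
import math


def element_count(n: int, vscale: Fraction) -> Fraction:
    """Number of elements in <vscale x n x ty> for a given vscale."""
    return vscale * n


half = Fraction(1, 2)
one_x = element_count(1, half)  # <vscale x 1 x i32>: half an element
two_x = element_count(2, half)  # <vscale x 2 x i32>: one element


def doubling_holds(round_fn) -> bool:
    """Does <vscale x 2 x i32> still have twice the elements of
    <vscale x 1 x i32> after forcing both counts to whole numbers?"""
    return round_fn(two_x) == 2 * round_fn(one_x)


results = {
    "round_up": doubling_holds(math.ceil),     # 1 element vs 1 element
    "round_down": doubling_holds(math.floor),  # 1 element vs 0 elements
}
```

Both entries of `results` come out False: whichever way the half element is rounded, the "twice as many elements" relationship that split/concat patterns rely on is broken.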

So if we'd decide to support fractional vscale, we can't say these
types are "illegal". In LLVM parlance, illegal types can be used in
LLVM IR and targets aspire to turn them into something that works
correctly, even if it's very inefficient. Sometimes a legalization is
unimplemented or buggy, but these problems can be patched and this has
often happened in the past. With fractional vscale, the situation is
quite different: nobody will ever be able to use certain scalable
vector types on the target in question, because they can't be
legalized even in principle.

In contrast, scalable vector types that are illegal because they're
too large (e.g. <vscale x 32 x i64>) can be legalized just fine. For
example, you could split them across a sufficiently large (fixed)
number of vector registers and maybe spill them to the stack for
inserts/extracts/shuffles/etc. that cross lanes or access elements at
data-dependent positions. Implementing this will probably not be a
priority for any targets, but it can be implemented whenever it does
become important to someone.
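
For illustration, a Python sketch of why the number of registers needed is fixed at compile time. It assumes vscale = VLEN / 64 (so <vscale x 1 x i64> fills exactly one register) and a maximum LMUL of 8. The function name is hypothetical.

```python
from typing import List


def split_into_register_groups(n: int, sew_bits: int, max_lmul: int = 8) -> List[int]:
    """Split <vscale x n x i(sew_bits)> into register groups of at most max_lmul.

    Assumption: vscale = VLEN / 64, so the type occupies n * sew_bits / 64
    vector registers regardless of the actual VLEN.
    """
    bits_per_vscale = n * sew_bits
    if bits_per_vscale % 64 != 0:
        raise ValueError("type is smaller than one register; not handled here")
    registers = bits_per_vscale // 64
    groups = []
    while registers > 0:
        group = min(max_lmul, registers)
        groups.append(group)
        registers -= group
    return groups
```

For <vscale x 32 x i64> this yields four groups of LMUL = 8, a count that does not depend on the run-time value of vscale.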

I hope this lengthy explanation helps you see where I'm coming from.

Thanks,
Hanna

[snip]

> If we apply the type system pointed out by Renato, is the vector type <vscale x 1 x i16> legal? If we decide that <vscale x 1 x i16> is a fundamentally impossible type, does that run contrary to the philosophy of LLVM IR as a reasonably target-independent IR? I do not get the point of your argument.

<vscale x 1 x i16> would be illegal, but like other illegal types, it
can be legalized. It does not run into the problem I see with
fractional vscale, since the number of elements each vector type is
supposed to have is still a whole number. For example, it could be
legalized by using a full vector register as for <vscale x 1 x i32>
and sign-extending or zero-extending each element as needed by the
operations performed on it. Another possibility is using a full vector
register with SEW=16, but computing `vl` as for SEW=32 which
effectively means using only the lower half of the vector register.
Both options always ensure we correctly (if slowly) emulate a vector
containing as many i16 elements as <vscale x 1 x i32> has i32
elements.
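
A rough Python sketch of the two options, assuming <vscale x 1 x i32> fills exactly one vector register (i.e. vscale = VLEN / 32 in this type system). The names are illustrative only.

```python
def vlmax(vlen_bits: int, sew_bits: int) -> int:
    """Elements of width sew_bits that fit in one vector register."""
    return vlen_bits // sew_bits


def legalize_vscale_x1_i16(vlen_bits: int) -> dict:
    """Two ways to emulate <vscale x 1 x i16> in a single vector register."""
    # <vscale x 1 x i16> must have as many elements as <vscale x 1 x i32>.
    wanted = vlmax(vlen_bits, 32)
    return {
        # Option 1: keep each i16 in an i32 lane and extend as needed.
        "widen_to_i32": {"sew": 32, "vl": wanted},
        # Option 2: use SEW=16 but limit vl, leaving the upper half unused.
        "narrow_vl": {"sew": 16, "vl": wanted, "capacity": vlmax(vlen_bits, 16)},
    }
```

With VLEN = 128, both options process 4 elements, and the second one uses 4 of the 8 available i16 lanes.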

To be clear: in LLVM jargon, a type being "illegal" does not mean that
the type is not supposed to be used. It only means that the type isn't
directly supported by the hardware, but can be mapped to the things
the hardware does support with extra effort and at the expense of some
performance. For example, i64 is illegal on typical 32 bit targets,
but clang happily uses i64 for C types like `long` or `long long`
(depending on ABI), and backends support this. Other examples are the
float and double types on any target without an FPU (including RISC-V
without the F/D extensions).
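
As a sketch of what "mapped with extra effort" means for i64 on a 32-bit target, here is a 64-bit add built only from 32-bit operations, modelled in Python. This is the general split-and-carry idea, not the code any particular backend emits.

```python
MASK32 = 0xFFFFFFFF


def add_i64_with_i32_ops(a: int, b: int) -> int:
    """Add two 64-bit values using only 32-bit adds and a carry."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = (a_lo + b_lo) & MASK32
    # The low add wrapped around exactly when the result is below an operand.
    carry = 1 if lo < a_lo else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return (hi << 32) | lo
```

It is slower than a native 64-bit add, but always correct, which is the defining property of a legalizable type.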

I hope this clarifies the distinction I made before.

Kind regards,
Hanna

PS: I'm ignoring the "fractional LMUL" proposal in this discussion, I
hope that's okay for you. If it was adopted it would give us a larger
set of legal types, but all the concepts we're discussing here would
still apply to other types, so let's stick with LMUL > 1 to avoid
confusion.

> I don't see how the situation you mention is comparable. Legalization
> for e.g. <3 x i32> was not implemented at first, but as demonstrated
> by the fact that it *was* implemented later, there's no conceptual
> problem with legalizing that kind of type. You don't even have to
> legalize them in vector registers, three scalar registers work fine
> (you can even do that on the IR level).

That was the point I was trying to make, but in my head that fused
with register shadowing, which derailed the point.

To be clear, yes, "invalid" register configurations can usually
be legalised easily in multiple ways at lowering.

Not all will be optimal, though, and that is where the problem lives.

> Legalization (codegen in general) does not know if the machine
> code will eventually run on a chip with vector registers so small that
> vscale works out to 1/2, but it has to choose some legalization
> strategy.

This is interesting, I had not realised that from the descriptions of
the problem so far. I thought it was just due to non-power-of-two
lengths.

A "vector" register that is smaller than 64 bits wouldn't make much
sense, unless this is a DSP-type extension on very small types. In
those cases, every clock cycle and every instruction counts,
especially inside the inner loop.

I'm struggling to see how this can be optimally executed from
generic scalable code, which usually profits from the fact that
vscale >> 1.

> If <vscale x 1 x i32> ends up having one element, and <vscale x 2 x
> i32> also has one (= 2 * 0.5) element, then that's wrong: the latter
> type must have twice as many elements as the former (one example where
> this matters: split_low / split_high / concat shuffle patterns). The
> second option, a vector with *zero* elements, is just as wrong if not
> worse.

Right, that was the idea behind vscale from the beginning. I don't
know how many elements either has, but I know the latter has twice as
many as the former.

I see why you would want half-length, because that truth still holds:
the latter has twice as many halves as the former.

But how do you handle the last half? Do you ignore? Do you load /
store half? Do you always mask it out? Do you fuse with the next
iterations' first half?

If the semantics is not clear on how the back-ends are supposed to use
that extra half, then extending the IR in such a way can make it very
hard for generic optimisations to understand anything about the
ranges, validity of operations, alignment, masks, undefined behaviour,
etc.

> It's not that a correct legalization exists but it's too annoying to
> implement, or that one might exist but I'm too lazy to work it out.

I never meant to imply that. Apologies if that's what came through.

> We're also not running into a limitation or oddity of the RISC-V vector
> ISA in particular. It's simply that, if you set vscale == 0.5, then by
> the way scalable vector types work (vscale * const elements), some
> vector types that can be written in the IR would need to have a
> fractional number of elements to be consistent with the other scalable
> vector types. As that is not possible (not even conceptually),
> whatever code you emit to try to legalize that type will end up being
> wrong in some respect.

Honestly, I'm running out of breath in this discussion. :)

I don't know a lot about SVE and even less about RISC-V, so I'll leave
the more in-depth technical discussions for Florian/Sander and others
to chime in.

> So if we'd decide to support fractional vscale, we can't say these
> types are "illegal". In LLVM parlance, illegal types can be used in
> LLVM IR and targets aspire to turn them into something that works
> correctly, even if it's very inefficient. Sometimes a legalization is
> unimplemented or buggy, but these problems can be patched and this has
> often happened in the past. With fractional vscale, the situation is
> quite different: nobody will ever be able to use certain scalable
> vector types on the target in question, because they can't be
> legalized even in principle.

I have not spent the time you guys have on this, but if I understood
your problem correctly, I too can't think of a way to represent this
in non-fractional ways.

I'm not saying this is a good idea, and I think you're not saying it
is either, but perhaps the only idea.

If that's the case, then I have proposed to use a different
flag/integer to mean half-scale instead of floating points, and
hopefully that can be transparent to the rest of scalable vector code.

But I'd really like to get other people's point of view, as I'm not
confident in my appraisal.

> I hope this lengthy explanation helps you see where I'm coming from.

It did, thanks!

--renato

Hi Hanna,

Thanks Hanna. I got your point.
You mean that even if the type does not exist in the type system, we still need to legalize it.
I support the following four kinds of i32 scalable vector types. I also do not know how to reason about vscale x 1 x i32 under this type system.

        | LMUL = 1         | LMUL = 2         | LMUL = 4         | LMUL = 8
int32_t | vscale x 2 x i32 | vscale x 4 x i32 | vscale x 8 x i32 | vscale x 16 x i32

Could we just support the types in the table on the RISC-V target? I mean, do not legalize vscale x 1 x i32, and just issue an error message for it.

In my latest reply, I do not propose fractional vscale. I propose that “vscale x n” be an integer. Under that assumption, I could not reason about vscale x 1 x i32. However, I could reason about vscale x 2 x i32 even when vscale = 1/2. We only care that the “vscale x n” part is an integer.

The original problem is that the type system proposed by Hanna under ELEN = 64 is

        | LMUL = 1         | LMUL = 2         | LMUL = 4         | LMUL = 8
int32_t | vscale x 2 x i32 | vscale x 4 x i32 | vscale x 8 x i32 | vscale x 16 x i32

while under ELEN = 32 it is

        | LMUL = 1         | LMUL = 2         | LMUL = 4         | LMUL = 8
int32_t | vscale x 1 x i32 | vscale x 2 x i32 | vscale x 4 x i32 | vscale x 8 x i32

The problem is that there are multiple type systems for the RISC-V RVV implementation, and they are not compatible across different ELEN configurations. AFAIK, there are no such compatibility problems in the GCC implementation. (In GCC, they reason about the whole “poly_int” instead of “X”.)
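
The incompatibility can be seen with a small Python sketch that derives the IR type from SEW, LMUL and ELEN. It assumes the mapping used in the two tables above (n = LMUL * ELEN / SEW), and the function name is only for illustration.

```python
def scalable_type(sew_bits: int, lmul: int, elen_bits: int) -> str:
    """IR type for a given element width and LMUL, assuming
    <vscale x 1 x i(ELEN)> corresponds to LMUL = 1."""
    if sew_bits > elen_bits:
        raise ValueError(f"i{sew_bits} is not supported when ELEN = {elen_bits}")
    n = lmul * elen_bits // sew_bits
    return f"<vscale x {n} x i{sew_bits}>"
```

For int32_t at LMUL = 1 this gives `<vscale x 2 x i32>` with ELEN = 64 but `<vscale x 1 x i32>` with ELEN = 32, so the same source-level type maps to different IR types depending on ELEN.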

If llvm.vscale(i32 ElementCount) is not the way we want to go, is there any proposal to solve the compatibility problems in your type system?

> I don't see how the situation you mention is comparable. Legalization
> for e.g. <3 x i32> was not implemented at first, but as demonstrated
> by the fact that it *was* implemented later, there's no conceptual
> problem with legalizing that kind of type. You don't even have to
> legalize them in vector registers, three scalar registers work fine
> (you can even do that on the IR level).

That was the point I was trying to make, but in my head that fused
with register shadowing, which derailed the point.

To be clear, yes, "invalid" register configurations can usually
be legalised easily in multiple ways at lowering.

Not all will be optimal, though, and that is where the problem lives.

> Legalization (codegen in general) does not know if the machine
> code will eventually run on a chip with vector registers so small that
> vscale works out to 1/2, but it has to choose some legalization
> strategy.

This is interesting, I had not realised that from the descriptions of
the problem so far. I thought it was just due to non-power-of-two
lengths.

A "vector" register that is smaller than 64 bits wouldn't make much
sense, unless this is a DSP-type extension on very small types. In
those cases, every clock cycle and every instruction counts,
especially inside the inner loop.

I'm struggling to see how this can be optimally executed from
generic scalable code, which usually profits from the fact that
vscale >> 1.

I can understand this kind of concern, but the specification permits
it and this entire thread is predicated on needing to target those
cores too. If we'd decide we're okay with LLVM-based toolchains only
supporting hardware with e.g. VLEN >= 64 (but see [*] below) then
there's no problem to begin with and no need for ideas like fractional
vscale or types that can't be legalized. Alternatively, we could treat
support for vector registers smaller than 64b as a separate ABI, like
with soft-float ABI vs hard-float ABI.

However, the current aspiration among the ISA designers and
software/toolchain developers is different. Cores with tiny VRF are
expected to be useful for some markets, and it's hoped that the V
extension can "scale down" well enough to avoid the need for a second
vector extension specifically for those cores. Personally, I have some
doubts about how well this will work out in practice, but of course
software and toolchain developers (including myself) would prefer to
keep everything as portable as possible. Binary portability across
wildly different vector register sizes is an explicit goal of the ISA;
adding an exception to this for no good reason would be very
unfortunate.

[*] If we settled on requiring VLEN >= 64, we'd still face the same
problem again if we ever want to add support for vector elements
larger than 64b, such as quad-precision floats or 128 bit integers. I
don't really expect those to be commonly implemented for a long time,
but once again: it would be great to avoid the need for a separate and
incompatible target triple if and when such cores become relevant.

> If <vscale x 1 x i32> ends up having one element, and <vscale x 2 x
> i32> also has one (= 2 * 0.5) element, then that's wrong: the latter
> type must have twice as many elements as the former (one example where
> this matters: split_low / split_high / concat shuffle patterns). The
> second option, a vector with *zero* elements, is just as wrong if not
> worse.

Right, that was the idea behind vscale from the beginning. I don't
know how many elements either has, but I know the latter has twice as
many as the former.

I see why you would want half-length, because that truth still holds:
the latter has twice as many halves as the former.

But how do you handle the last half? Do you ignore? Do you load /
store half? Do you always mask it out? Do you fuse with the next
iterations' first half?

If the semantics is not clear on how the back-ends are supposed to use
that extra half, then extending the IR in such a way can make it very
hard for generic optimisations to understand anything about the
ranges, validity of operations, alignment, masks, undefined behaviour,
etc.

> It's not that a correct legalization exists but it's too annoying to
> implement, or that one might exist but I'm too lazy to work it out.

I never meant to imply that. Apologies if that's what came through.

Oh no, not at all! Sorry for the confusion, I should avoid rhetoric
that can create this impression.

> We're also not running into a limitation or oddity of the RISC-V vector
> ISA in particular. It's simply that, if you set vscale == 0.5, then by
> the way scalable vector types work (vscale * const elements), some
> vector types that can be written in the IR would need to have a
> fractional number of elements to be consistent with the other scalable
> vector types. As that is not possible (not even conceptually),
> whatever code you emit to try to legalize that type will end up being
> wrong in some respect.

Honestly, I'm running out of breath in this discussion. :)

I don't know a lot about SVE and even less about RISC-V, so I'll leave
the more in-depth technical discussions for Florian/Sander and others
to chime in.

Fair, thanks for the discussion so far :)

Best regards
Hanna