Vector instructions

Hi,

I have some questions as to the definition of various vector
instructions. In particular, I believe there are some gaps and
inconsistencies in the vector instructions, and I'm interested in
hearing whether you agree that these should be improved or whether
there are other ways to solve these problems.

===
1. Shufflevector only accepts vectors of the same type

Shufflevector seems overly restrictive right now in terms of its type
requirements. In particular, it requires (this could be clearer in the
language reference, but is supported by equivalent assertions in the
source code) that the types of the sources and destinations match
exactly. Obviously it makes sense for the element types to match, but
the requirement that the number of elements in each source operand
matches the number of elements in the destination appears overly
restrictive.

I would propose to change the syntax from:

<result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32>
<mask> ; yields <n x <ty>>

to:

<result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32>
<mask> ; yields <d x <ty>>

With the requirement that the entries in the (still constant) mask are
within the range of [0, a + b - 1].

This allows things like taking a <2 x i32> and duplicating it into a
<4 x i32>.

This is very useful for frontends that provide general vector
functionality. I think this is more consistent with the relaxed rules
on numbers of vector elements.

The alternative is to have the frontend synthesize the needed
operations with extracts, inserts, and possibly shuffles if needed.
LLVM is actually fairly well prepared to optimize code like this.
I recommend giving this a try, and reporting any problems you
encounter.

2. vector select

[...]

3. vector trunc, sext, zext, fptrunc, fpext

[...]

4. vector shl, lshr, ashr

[...]

We agree that these would be useful. There are intentions to add them
to LLVM; others can say more.

5. vfcmp, vicmp return types

This topic came up recently on llvm-dev (thread "VFCmp failing when
unordered or UnsafeFPMath on x86"). Having vector compares return a
vector of integer elements with bit width equal to the element types
being compared seems unnecessarily inconsistent with the icmp and fcmp
instructions. Since only one bit is well-defined anyways, why not just
return a vector of i1 elements? If after codegen this is expanded into
some other width, that's fine -- the backend can do this since only
one bit of information is exposed at the IR level anyways.

Having these instructions return vectors of i1 would also be nicely
consistent with a vector select operation. Vector bit shifts and trunc
would make this easier, but it seems to me that the entire IR would be
far more consistent if these operations simply returned i1 vectors.

It turns out that having them return vectors of i1 would be somewhat
complicated. For example, a <4 x i1> on an SSE2 target could expand
either to 1 <4 x i32> or 2 <2 x i64>s, and the optimal thing would
be to make the decision based on the comparison that produced them,
but LLVM isn't yet equipped for that.

vicmp and vfcmp are very much aimed at solving practical problems on
popular architectures without needing significant new infrastructure.
They're relatively new, and as you say, they'll be more useful when
combined with vector shifts and friends and we start teaching LLVM
to recognize popular idioms with them.

One interesting point here is that if we ever do want to have vector
comparisons that return vectors of i1, one way to do that would be to
overload plain icmp and fcmp, which would be neatly consistent with
the way plain mul and add are overloaded for vector types instead of
having a separate vadd and vmul.

Dan

Hi Dan,

Thanks for your comments. I've responded inline below.

===
1. Shufflevector only accepts vectors of the same type

I would propose to change the syntax from:

<result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32>
<mask> ; yields <n x <ty>>

to:

<result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32>
<mask> ; yields <d x <ty>>

With the requirement that the entries in the (still constant) mask are
within the range of [0, a + b - 1].

The alternative is to have the frontend synthesize the needed
operations with extracts, inserts, and possibly shuffles if needed.
LLVM is actually fairly well prepared to optimize code like this.
I recommend giving this a try, and reporting any problems you
encounter.

That certainly appears to be the only option at the moment, and we'll have a look to see how that works out. However, note that a sufficiently generalized shufflevector would remove the need for insertelement and extractelement to exist completely.

2. vector select
3. vector trunc, sext, zext, fptrunc, fpext
4. vector shl, lshr, ashr

[...]

We agree that these would be useful. There are intentions to add them
to LLVM; others can say more.

OK. I'd love to hear more, especially if someone is planning to do this in the short term.

5. vfcmp, vicmp return types

This topic came up recently on llvm-dev (thread "VFCmp failing when
unordered or UnsafeFPMath on x86"). Having vector compares return a
vector of integer elements with bit width equal to the element types
being compared seems unnecessarily inconsistent with the icmp and fcmp
instructions. Since only one bit is well-defined anyways, why not just
return a vector of i1 elements? If after codegen this is expanded into
some other width, that's fine -- the backend can do this since only
one bit of information is exposed at the IR level anyways.

Having these instructions return vectors of i1 would also be nicely
consistent with a vector select operation. Vector bit shifts and trunc
would make this easier, but it seems to me that the entire IR would be
far more consistent if these operations simply returned i1 vectors.

It turns out that having them return vectors of i1 would be somewhat
complicated. For example, a <4 x i1> on an SSE2 target could expand
either to 1 <4 x i32> or 2 <2 x i64>s, and the optimal thing would
be to make the decision based on the comparison that produced them,
but LLVM isn't yet equipped for that.

Can you expand on this a bit? I'm guessing you're referring to specific mechanics in LLVM's code generation framework making this difficult to do?

From a theoretical point of view (keeping in mind current architectures) there can definitely be a need to change the representation at various points in the program. For example, a <2 x i1> generated from comparing some <2 x float> types could later be used to select amongst two <2 x double> types, and on most current architectures this would require a mask to be expanded. There are three cases of instructions at which representations for i1 vectors must be chosen: instructions which generate vectors of i1s (e.g. vfcmp), instructions which consume them (e.g. a vector select) and instructions which do as both (e.g. shufflevector, phi statements). I would expect generating statements to generally have only one fixed choice (since common ISAs generally create masks of the size of what's being compared), and consumers to generally dictate a representation, possibly necessitating conversion code to be inserted. For instructions that transform vectors of i1 in some way, there may be some additional optimization opportunity to avoid conversion code. I would also presume that standard optimizations such as a form of CSE or VN and code motion would help reduce the number of conversions.

In fact, this issue of representing i1 vectors isn't even constrained to vector types. Scalar code which stores i1 values in regular registers needs to have the same considerations (e.g. on Cell SPU where there is no condition register).

Perhaps all that was way off topic, and you're just referring to a simple mechanical issue preventing this right now. My point is just that the representation of booleans in registers is an important architecture-specific issue to handle, and the frontend is (in my humble opinion) probably not the best place for it.

vicmp and vfcmp are very much aimed at solving practical problems on
popular architectures without needing significant new infrastructure.
They're relatively new, and as you say, they'll be more useful when
combined with vector shifts and friends and we start teaching LLVM
to recognize popular idioms with them.

Can you give me examples of how they are used today at all? I'm having a really hard time figuring out a good use for them (that doesn't involve effectively scalarizing them immediately) without these other instructions.

One interesting point here is that if we ever do want to have vector
comparisons that return vectors of i1, one way to do that would be to
overload plain icmp and fcmp, which would be neatly consistent with
the way plain mul and add are overloaded for vector types instead of
having a separate vadd and vmul.

Agreed, this would be nice.

<result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32>
<mask> ; yields <d x <ty>>

With the requirement that the entries in the (still constant) mask are
within the range of [0, a + b - 1].

The alternative is to have the frontend synthesize the needed
operations with extracts, inserts, and possibly shuffles if needed.
LLVM is actually fairly well prepared to optimize code like this.
I recommend giving this a try, and reporting any problems you
encounter.

That certainly appears to be the only option at the moment, and we'll
have a look to see how that works out. However, note that a
sufficiently generalized shufflevector would remove the need for
insertelement and extractelement to exist completely.

You should look into how this works with clang. Clang allows you to do things like this, for example:

typedef __attribute__(( ext_vector_type(2) )) float float2;
typedef __attribute__(( ext_vector_type(4) )) float float4;

float2 vec2, vec2_2;
float4 vec4, vec4_2;
float f;

void test2() {
     vec2 = vec4.xy; // shorten
     f = vec2.x; // extract elt
     vec4 = vec4.yyyy; // splat
     vec4.zw = vec2; // insert
}

etc. It also offers operators to extract all the even or odd elements of a vector, do arbitrary two-input-vector shuffles with __builtin_shuffle etc.

2. vector select
3. vector trunc, sext, zext, fptrunc, fpext
4. vector shl, lshr, ashr

[...]

We agree that these would be useful. There are intentions to add them
to LLVM; others can say more.

OK. I'd love to hear more, especially if someone is planning to do
this in the short term.

Most of the extensions you suggest are great ideas, but we need more than ideas: we need someone to help implement the ideas ;-).

It turns out that having them return vectors of i1 would be somewhat
complicated. For example, a <4 x i1> on an SSE2 target could expand
either to 1 <4 x i32> or 2 <2 x i64>s, and the optimal thing would
be to make the decision based on the comparison that produced them,
but LLVM isn't yet equipped for that.

Can you expand on this a bit? I'm guessing you're referring to
specific mechanics in LLVM's code generation framework making this
difficult to do?

I'm not sure that it really is a matter of what is easy or hard to do in LLVM. This model more closely matches the model implemented by common SIMD systems like Altivec, SSE, CellSPU, Alpha, etc. The design of the system was picked to model the systems that we know of well, we obviously can't plan to handle systems that we don't know about.

vicmp and vfcmp are very much aimed at solving practical problems on
popular architectures without needing significant new infrastructure.
They're relatively new, and as you say, they'll be more useful when
combined with vector shifts and friends and we start teaching LLVM
to recognize popular idioms with them.

Can you give me examples of how they are used today at all? I'm having
a really hard time figuring out a good use for them (that doesn't
involve effectively scalarizing them immediately) without these other
instructions.

They can be used with target-specific intrinsics. For example, SSE provides a broad range of intrinsics to support instructions that LLVM IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for more details.

If you're interested in helping shape the direction of LLVM vector support, my advice is that "patches speak louder than words" :). I'd love to see improved vector support in LLVM, but unless someone is willing to step forward and implement it, it is all just idle talk.

-Chris

You should look into how this works with clang. Clang allows you to
do things like this, for example:

typedef __attribute__(( ext_vector_type(2) )) float float2;
typedef __attribute__(( ext_vector_type(4) )) float float4;

float2 vec2, vec2_2;
float4 vec4, vec4_2;
float f;

void test2() {
    vec2 = vec4.xy; // shorten
    f = vec2.x; // extract elt
    vec4 = vec4.yyyy; // splat
    vec4.zw = vec2; // insert
}

etc. It also offers operators to extract all the even or odd elements
of a vector, do arbitrary two-input-vector shuffles with
__builtin_shuffle etc.

Right, our frontend supports all these things as well. I think it's a pretty small change to shufflevector to be able to support any of these with a single LLVM IR instruction. We'll have a look at how clang handles this.

OK. I'd love to hear more, especially if someone is planning to do
this in the short term.

Most of the extensions you suggest are great ideas, but we need more
than ideas: we need someone to help implement the ideas ;-).

Understood. I didn't mean to imply that I expect someone to do this :). If someone was already working on something like this it would be valuable to know though. I really just want to understand if these are real gaps or there's something I'm missing.

It turns out that having them return vectors of i1 would be somewhat
complicated. For example, a <4 x i1> on an SSE2 target could expand
either to 1 <4 x i32> or 2 <2 x i64>s, and the optimal thing would
be to make the decision based on the comparison that produced them,
but LLVM isn't yet equipped for that.

Can you expand on this a bit? I'm guessing you're referring to
specific mechanics in LLVM's code generation framework making this
difficult to do?

I'm not sure that it really is a matter of what is easy or hard to do
in LLVM. This model more closely matches the model implemented by
common SIMD systems like Altivec, SSE, CellSPU, Alpha, etc. The
design of the system was picked to model the systems that we know of
well, we obviously can't plan to handle systems that we don't know
about.

Those are precisely the architectures I have in mind as well. I was curious to hear more about why "LLVM isn't yet equipped for that".

Take the SPU for example. In the quote above, Dan mentioned that it's necessary to decide what to actually use as the representation for an i1 based on the comparison that generated it. On SSE this is true for vectors of i1. On the SPU, this is true even for scalars, since there's no actual condition register, and booleans must be represented as bitmasks when it comes to selecting from them. Thus this problem must already be solved, with the IR as it is today, to map a regular scalar cmp followed by a scalar select instruction to efficient SPU code.

Of course an option is to make this a problem for the frontend instead of the backend, but in the case of vectors that's difficult to do right now without vector shifts, and there are cases where the backend will be able to make a smarter choice about the representation of booleans (e.g. using bit operations vs CMOV on x86).

vicmp and vfcmp are very much aimed at solving practical problems on
popular architectures without needing significant new infrastructure.
They're relatively new, and as you say, they'll be more useful when
combined with vector shifts and friends and we start teaching LLVM
to recognize popular idioms with them.

Can you give me examples of how they are used today at all? I'm having
a really hard time figuring out a good use for them (that doesn't
involve effectively scalarizing them immediately) without these other
instructions.

They can be used with target-specific intrinsics. For example, SSE
provides a broad range of intrinsics to support instructions that LLVM
IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for
more details.

Ah, now that makes sense. I can see how these instructions can be useful if one is also using intrinsics.

If you're interested in helping shape the direction of LLVM vector
support, my advice is that "patches speak louder than words" :). I'd
love to see improved vector support in LLVM, but unless someone is
willing to step forward and implement it, it is all just idle talk.

Absolutely understood. As a past contributor to and maintainer of a number of open source projects, I hear you! I just want to make sure I'm understanding the existing semantics, and the reasons behind them, correctly, before I venture way off course! :)