Extending vector operations

Hi,

We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.

All of these changes are intended to make target-independent IR (i.e. IR without machine specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).

If you have any insight into how to best get started with any of these changes, and whether they are feasible and sensible, please let me know. We're mostly interested in x86 as a target in the short term, but obviously want these to apply to other LLVM targets as well. We're prepared to implement these changes, but would like to hear any suggestions and objections you might have.

Below are the specific additions we have in mind.

1) Vector shl, lshr, ashr

I think these are no-brainers. We would like to extend the semantics
of the shifting instructions to naturally apply to vectors as well.
One issue is that these operations often only support a single shift
amount for an entire vector. I assume it should be fairly
straightforward to select on this pattern, and scalarize the general
case as necessary.

I have a rough draft of a patch for this that works reasonably well
for simple cases... I don't think I really have the time to finish it
properly, but I'll clean it up a bit and send it to you as a starting
point.

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know
if you expect any issues with vector operations that change element
sizes from the RHS to the LHS, e.g. around legalization.

These are tricky to generate efficient code for because they change
the size of the vector and most likely introduce illegal vectors.
It'll be a lot of work to get these working well for non-trivial
cases.

5) Vector comparisons that return <N x i1>

This is maybe not a must-have, and perhaps more a question of
preference. I understand the current vfcmp/vicmp semantics, returning
a vector of iK where K matches the bitwidth of the operands being
compared with the high bit set or not, are there for pragmatic
reasons, and that these instructions exist to aid code emitted using
machine-specific intrinsics.

For code that does not use machine intrinsics, I believe it would be
cleaner, simpler, and potentially more efficient, to have a vector
compare that returns <N x i1> instead. For example, in conjunction
with the above-mentioned vector select, this would allow a max to be
expressed simply as a sequence of compare and select.

I realize this is probably the most controversial change amongst
these. I gather there is some concern about representing "variable
width" i1s, but I would contend that that's the case even for i1s
which are not vectors.

It makes sense; the difficulty is making CodeGen generate efficient
code for these constructs. Legalization has to transform illegal
types into legal types, and as far as I know, none of the current
transformations varies the type transformation based on what is using
the value. I don't know exactly how hard it would be, though... maybe
someone who's more familiar with the legalization infrastructure could
say more?

-Eli

Hi,

We would like to extend the vector operations in llvm a bit. We’re
hoping to get some feedback on the right way to go, or some starting
points. I had previously had some discussion on this list about a
subset of the changes we have in mind.

All of these changes are intended to make target-independent IR (i.e.
IR without machine specific intrinsics) generate better code or be
easier to generate from a frontend with vector support (whether from
manual or autovectorization).

If you have any insight into how to best get started with any of these
changes, and whether they are feasible and sensible, please let me
know. We’re mostly interested in x86 as a target in the short term,
but obviously want these to apply to other LLVM targets as well. We’re
prepared to implement these changes, but would like to hear any
suggestions and objections you might have.

Below are the specific additions we have in mind.

===

1) Vector shl, lshr, ashr

I think these are no-brainers. We would like to extend the semantics
of the shifting instructions to naturally apply to vectors as well.
One issue is that these operations often only support a single shift
amount for an entire vector. I assume it should be fairly
straightforward to select on this pattern, and scalarize the general
case as necessary.

That seems reasonable.

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know
if you expect any issues with vector operations that change element
sizes from the RHS to the LHS, e.g. around legalization.

Is the proposed semantics here that the number of elements stays the same, and the overall vector width changes?

3) Vector intrinsics for floor, ceil, round, frac/modf

These are operations that are not trivially specified in terms of
simpler operations. It would be nice to have these as overloaded,
target-independent intrinsics, in the same way as llvm.cos etc. are
supported now.

It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection dag.

4) Vector select

We consider a vector select extremely important for a number of
operations. This would be an extension of select to support an <N x i1>
vector mask to select between elements of <N x T> vectors for some
basic type T. Vector min, max, sign, etc. can be built on top of this
operation.

How is this anything other than AND/ANDN/OR on any integer vector type? I don’t see what adding this to the IR gets you for vectors, since “vector of i1” doesn’t mean “vector of bits” necessarily.

5) Vector comparisons that return <N x i1>

This is maybe not a must-have, and perhaps more a question of
preference. I understand the current vfcmp/vicmp semantics, returning
a vector of iK where K matches the bitwidth of the operands being
compared with the high bit set or not, are there for pragmatic
reasons, and that these instructions exist to aid code emitted using
machine-specific intrinsics.

I totally disagree with this approach; A vector of i1 doesn’t actually match what you want to do with the hardware, unless you had say, 128 x i1 for SSE, and it’s strange when you have to spill and reload it. The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

For code that does not use machine intrinsics, I believe it would be
cleaner, simpler, and potentially more efficient, to have a vector
compare that returns <N x i1> instead. For example, in conjunction
with the above-mentioned vector select, this would allow a max to be
expressed simply as a sequence of compare and select.

Having gone down this path, I’d have to disagree with you.

In addition to the above suggestions, I’d also like to hear what
others think about handling vector operations that aren’t powers of
two in size, e.g. <3 x float> operations. I gather the status quo is
that only POT sizes are expected to work (although we’ve found some
bugs for things like <2 x float> that we’re submitting). Ideally
things like <3 x float> operands would usually be rounded up to the
size supported by the machine directly. We can try to do this in the
frontend, but it would of course be ideal if these just worked. I’m
curious if anyone else out there has dealt with this already and has
some suggestions.

Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

Nate

Consider something like <4 x i32> -> <4 x double>. On a target with 128-bit vectors, this could be custom expanded/split into a convert hi and convert low operation. This is real work to support, but I don't think this is too crazy.

-Chris

One interesting point here is that clang follows the same semantics that LLVM IR does for comparisons. This means you can use "V1 < V2" and the result is defined to be a vector of integers the same width with only the sign bits defined.

-Chris

1) Vector shl, lshr, ashr

I have a rough draft of a patch for this that works reasonably well
for simple cases... I don't think I really have the time to finish it
properly, but I'll clean it up a bit and send it to you as a starting
point.

That's great. Please do send it over, it'd be great to have a starting point like this.

2) Vector trunc, sext, zext, fptrunc and fpext

These are tricky to generate efficient code for because they change
the size of the vector and most likely introduce illegal vectors.
It'll be a lot of work to get these working well for non-trivial
cases.

Right, though it seems they can generally be split up as necessary (as Chris pointed out as well). I just don't know how well this fits the existing infrastructure. More pointers would be welcome here!
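
For what it's worth, here's roughly the kind of splitting I'd expect
for the <4 x i32> -> <4 x double> case Chris mentions, written as IR
only to show the idea (the real splitting would happen during
legalization, and the value names are made up):

   %lo  = shufflevector <4 x i32> %v, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
   %hi  = shufflevector <4 x i32> %v, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
   %dlo = sitofp <2 x i32> %lo to <2 x double>
   %dhi = sitofp <2 x i32> %hi to <2 x double>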

5) Vector comparisons that return <N x i1>

It makes sense; the difficulty is making CodeGen generate efficient
code for these constructs. Legalization has to transform illegal
types into legal types, and as far as I know, none of the current
transformations varies the type transformation based on what is using
the value. I don't know exactly how hard it would be, though... maybe
someone who's more familiar with the legalization infrastructure could
say more?

Right, the type that needs to be used at the birthpoint of the value will generally be determined by the operation being performed to define it. At uses, that representation will have to be converted. There are some operations which may not provide a unique representation at a definition. Phi statements are the most obvious ones - if a phi statement joins two boolean vectors of different representations, the optimal choice of the result may depend on where that result is used.

Conversion code that may need to be inserted can also probably benefit from optimizations like CSE.
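
As a trivial example of the kind of conversion code I mean (names are
hypothetical): a mask born as <2 x i32> from a float compare, but
consumed by a select on <2 x double> values, would need a widening
step inserted at the use:

   %m64 = sext <2 x i32> %m32 to <2 x i64>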

Hi Nate,

1) Vector shl, lshr, ashr

That seems reasonable.

Thanks.

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know
if you expect any issues with vector operations that change element
sizes from the RHS to the LHS, e.g. around legalization.

Is the proposed semantics here that the number of elements stays the same, and the overall vector width changes?

Yes.

3) Vector intrinsics for floor, ceil, round, frac/modf

These are operations that are not trivially specified in terms of
simpler operations. It would be nice to have these as overloaded,
target-independent intrinsics, in the same way as llvm.cos etc. are
supported now.

It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection dag.

Right, that's what we were thinking too. Glad to hear that makes sense!
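
Concretely, we're imagining declarations along these lines, mirroring
the existing overloaded intrinsics (these exact names are just the
proposal, not something that exists today):

   declare <4 x float> @llvm.floor.v4f32(<4 x float>)
   declare <2 x double> @llvm.round.v2f64(<2 x double>)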

4) Vector select

We consider a vector select extremely important for a number of
operations. This would be an extension of select to support an <N x
i1> vector mask to select between elements of <N x T> vectors for some
basic type T. Vector min, max, sign, etc. can be built on top of this
operation.

How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

Note that I don't mean a "select bits", but rather a "select components" operation. In other words, a straightforward vectorization of the existing "select" IR operation.

You can implement the existing LLVM select instruction with bitwise operations, but it's still convenient to have. For one, as I mentioned, it provides an obvious idiomatic way to express operations like min and max.

Vector selection is a common operation amongst languages that support vectors directly, and vector code will often avoid branches by performing some form of predication instead.

I'm really not that concerned about how it's expressed, but there needs to be a well-understood way to lower something that looks like "vector ?:", vector max, etc in a frontend to something that will actually generate good code in the backend. If the idiom for vector float max is "vfcmp, ashr, bitcast, and, xor, and, or, bitcast" and that generates a single maxps from the x86 backend, great. If the idiom is "vfcmp, select", even better.
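
In other words, with items 4 and 5 together, a vector float max would
just be (a sketch, assuming the proposed semantics):

   %m   = fcmp ogt <4 x float> %a, %b                        ; <4 x i1>
   %max = select <4 x i1> %m, <4 x float> %a, <4 x float> %b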

5) Vector comparisons that return <N x i1>

This is maybe not a must-have, and perhaps more a question of
preference. I understand the current vfcmp/vicmp semantics, returning
a vector of iK where K matches the bitwidth of the operands being
compared with the high bit set or not, are there for pragmatic
reasons, and that these instructions exist to aid code emitted using
machine-specific intrinsics.

I totally disagree with this approach; A vector of i1 doesn't actually match what you want to do with the hardware, unless you had say, 128 x i1 for SSE, and it's strange when you have to spill and reload it.

I definitely am not thinking of <128 x i1>. The intent is really just to express "here is a vector of values where only one bit matters in each element" and have codegen map that to a representation appropriate to the machine being targeted.

Of course these eventually need to be widened to an appropriately sized integer vector, and that vector may be a mask or a 0/1 value, or whatever. The responsibility for doing so can be placed at pretty much any level of the stack, all the way up to making the user worry about it.
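
For an SSE-like target, I'd expect the widening to amount to something
like this (just a sketch; in practice codegen would pick the width
based on the uses):

   %m    = fcmp olt <4 x float> %a, %b     ; <4 x i1>
   %mask = sext <4 x i1> %m to <4 x i32>   ; all-ones / all-zero per element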

For us, making the user worry about it isn't an option; we have first-class bools in our frontend, support vectors of these, and naturally define comparisons, selection, etc. to be consistent with this. So that leaves it to either the part generating LLVM IR, or the LLVM backend/mid-end. My inclination is that this job is best done in an SSA representation, and with a specific machine in mind. If this is completely infeasible, we'll have to worry about it during generation of LLVM IR, which is fine. I'd just rather see it done in LLVM, where it can hopefully benefit others with the same issues.

To me this isn't much different whether we're talking about vectors or scalars; it's about the utility of an "i1" type generally, and about whose responsibility it is to map such a type to hardware.

The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

OK. My understanding of this was based on an email from Chris to llvm-dev. I had asked how these were used today, especially given the lack of vector shifts. Here's his response:

They can be used with target-specific intrinsics. For example, SSE
provides a broad range of intrinsics to support instructions that LLVM
IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for
more details.

If you have examples on how these are expected to be used today without machine intrinsics, that would really help - e.g. for expressing something like a vector max.

For code that does not use machine intrinsics, I believe it would be
cleaner, simpler, and potentially more efficient, to have a vector
compare that returns <N x i1> instead. For example, in conjunction
with the above-mentioned vector select, this would allow a max to be
expressed simply as a sequence of compare and select.

Having gone down this path, I'd have to disagree with you.

OK. Hopefully my comments above will make my reasoning clearer. If not please let me know. You have a lot more experience with this in LLVM than me so pardon my ignorance :). I'm not suggesting that these semantics are absolutely the way to go, but we do need some way to address these issues.

In addition to the above suggestions, I'd also like to hear what
others think about handling vector operations that aren't powers of
two in size, e.g. <3 x float> operations. I gather the status quo is
that only POT sizes are expected to work (although we've found some
bugs for things like <2 x float> that we're submitting). Ideally
things like <3 x float> operands would usually be rounded up to the
size supported by the machine directly. We can try to do this in the
frontend, but it would of course be ideal if these just worked. I'm
curious if anyone else out there has dealt with this already and has
some suggestions.

Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

Right, we've dealt with this in the past by effectively scalarizing loads and stores (since simply extending a load might go beyond a page boundary, etc.), while extending all register operations. I think this addresses the address calculation and alignment issues?
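
Roughly, for a <3 x float> operation (value names are just for
illustration; %x, %y and %z come from the three scalar loads):

   %v0 = insertelement <4 x float> undef, float %x, i32 0
   %v1 = insertelement <4 x float> %v0, float %y, i32 1
   %v2 = insertelement <4 x float> %v1, float %z, i32 2
   %r  = fadd <4 x float> %v2, %w    ; register ops use the full width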

Thanks for the feedback, I hope to hear more.

Stefanus

We would like to extend the vector operations in llvm a bit. We're
hoping to get some feedback on the right way to go, or some starting
points. I had previously had some discussion on this list about a
subset of the changes we have in mind.

Woohoo! We've been interested in talking about this for some time.

All of these changes are intended to make target-independent IR (i.e.
IR without machine specific intrinsics) generate better code or be
easier to generate from a frontend with vector support (whether from
manual or autovectorization).

Very excellent.

===
1) Vector shl, lshr, ashr

I think these are no-brainers. We would like to extend the semantics
of the shifting instructions to naturally apply to vectors as well.
One issue is that these operations often only support a single shift
amount for an entire vector. I assume it should be fairly

So you're assuming a shift of a vector by a scalar? What about the
general vector-by-vector version?

straightforward to select on this pattern, and scalarize the general
case as necessary.

Yep.

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know
if you expect any issues with vector operations that change element
sizes from the RHS to the LHS, e.g. around legalization.

Is the assumption that all elements are changed in the same way?

4) Vector select

We consider a vector select extremely important for a number of
operations. This would be an extension of select to support an <N x
i1> vector mask to select between elements of <N x T> vectors for some
basic type T. Vector min, max, sign, etc. can be built on top of this
operation.

Yes, merge/blend is a very important operation. Also, it would be nice to
think about generalizing this to apply masks to all vector operations,
particularly loads and stores.

5) Vector comparisons that return <N x i1>

This is maybe not a must-have, and perhaps more a question of
preference. I understand the current vfcmp/vicmp semantics, returning
a vector of iK where K matches the bitwidth of the operands being
compared with the high bit set or not, are there for pragmatic
reasons, and that these instructions exist to aid code emitted using
machine-specific intrinsics.

As long as we have some way of representing the result of a vector
compare, I'm not too worried. As you say, target-specific code can
convert these to masks or whatever form the target architecture has.

I don't have any experience trying to do such a conversion, though, so
there may be gotchas we're not aware of.

For code that does not use machine intrinsics, I believe it would be
cleaner, simpler, and potentially more efficient, to have a vector
compare that returns <N x i1> instead. For example, in conjunction
with the above-mentioned vector select, this would allow a max to be
expressed simply as a sequence of compare and select.

True enough. It also leads naturally to generalized masked vector operations,
which are _extremely_ handy.

Vector bitshifts would actually help with the amount of code generated
for something like a vectorized max, but makes the patterns for
recognizing these a lot longer.

Right.

I realize this is probably the most controversial change amongst
these. I gather there is some concern about representing "variable
width" i1s, but I would contend that that's the case even for i1s
which are not vectors.

What do you mean by "variable width?"

In addition to the above suggestions, I'd also like to hear what
others think about handling vector operations that aren't powers of
two in size, e.g. <3 x float> operations. I gather the status quo is

It's goodness.

that only POT sizes are expected to work (although we've found some
bugs for things like <2 x float> that we're submitting). Ideally
things like <3 x float> operands would usually be rounded up to the
size supported by the machine directly. We can try to do this in the

You might need mask support as well, especially if the operation can trap.

frontend, but it would of course be ideal if these just worked. I'm
curious if anyone else out there has dealt with this already and has
some suggestions.

It's something we've thought about but not really delved into yet.

Please let me know what you think,

Let's connect at the dev meeting along with others interested in this stuff
and start thinking about how to proceed.

                                                   -Dave

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know
> if you expect any issues with vector operations that change element
> sizes from the RHS to the LHS, e.g. around legalization.

These are tricky to generate efficient code for because they change
the size of the vector and most likely introduce illegal vectors.
It'll be a lot of work to get these working well for non-trivial
cases.

It depends on the architecture. For something like SSE, I can see that
this would be a real pain (the vector length changes). For a traditional Cray
vector machine (as an example), it's trivial.

> 5) Vector comparisons that return <N x i1>

It makes sense; the difficulty is making CodeGen generate efficient
code for these constructs. Legalization has to transform illegal
types into legal types, and as far as I know, none of the current
transformations varies the type transformation based on what is using
the value. I don't know exactly how hard it would be, though... maybe
someone who's more familiar with the legalization infrastructure could
say more?

Can you elaborate with a specific example? I'm having a hard time imagining
what transformations need to do. When I think of a vector mask, the length
of the mask is the same as the vector length of the operands. So a compare
of two vectors with four elements results in a four bit mask. If the input
types change to a different vector length, then the mask type changes as well.

                                               -Dave

Right, the type that needs to be used at the birthpoint of the value
will generally be determined by the operation being performed to
define it. At uses, that representation will have to be converted.

Why would this ever happen? If the mask doesn't match the vector
length of the operation, isn't the operation undefined?

I suppose a mask could be longer than the vector length and still
have defined behavior, but this kind of mismatch strikes me as odd.

There are some operations which may not provide a unique
representation at a definition. Phi statements are the most obvious
ones - if a phi statement joins two boolean vectors of different
representations, the optimal choice of the result may depend on where
that result is used.

Again, why would this happen? Does a phi today ever join two vectors
of different type/length?

                                                -Dave

> 4) Vector select
>
> We consider a vector select extremely important for a number of
> operations. This would be an extension of select to support an <N x
> i1> vector mask to select between elements of <N x T> vectors for some
> basic type T. Vector min, max, sign, etc. can be built on top of this
> operation.

How is this anything other than AND/ANDN/OR on any integer vector
type? I don't see what adding this to the IR gets you for vectors,
since "vector of i1" doesn't mean "vector of bits" necessarily.

Yes, you can implement the operation with bit twiddling but it's less
convenient. Since we have a scalar select, it seems only natural to
provide a vector version. Some vector hardware has direct support
for this operation. AVX is just one example.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of
> preference. I understand the current vfcmp/vicmp semantics, returning
> a vector of iK where K matches the bitwidth of the operands being
> compared with the high bit set or not, are there for pragmatic
> reasons, and that these instructions exist to aid code emitted using
> machine-specific intrinsics.

I totally disagree with this approach; A vector of i1 doesn't actually
match what you want to do with the hardware, unless you had say, 128 x
i1 for SSE, and it's strange when you have to spill and reload it.

I don't follow. "What you want to do with the hardware" depends on what
hardware you have. Don't be restricted in thinking SSE-style only. And
for spill-reload on SSE, why couldn't you do a bitcast to some SSE-supported
type and just dump it to memory?

The current VICMP and VFCMP instructions do not exist for use with
machine intrinsics; they exist to allow code written using C-style
comparison operators to generate efficient code on a wide range of
both scalar and vector hardware.

Well, either you convert the current model to a bitvector for some hardware
or you convert a bitvector to the current model for other hardware. Perhaps
this can all be done by isel without too much trouble. I honestly don't know.

> For code that does not use machine intrinsics, I believe it would be
> cleaner, simpler, and potentially more efficient, to have a vector
> compare that returns <N x i1> instead. For example, in conjunction
> with the above-mentioned vector select, this would allow a max to be
> expressed simply as a sequence of compare and select.

Having gone down this path, I'd have to disagree with you.

Can you elaborate? What were your experiences? What architecture were
you targeting? Real-world experience is valuable.

Handling NPOT vectors in the code generator ideally would be great; I
know some people are working on widening the operations to a wider
legal vector type, and scalarizing is always a possibility as well.
The main problem here is what to do with address calculations, and
alignment.

With a vector select one could potentially use it to insert "safe" values
into the unaffected vector elements on fixed-VL architectures like SSE.
This would allow us to do NPOT operations on vector hardware that
requires POT vector lengths without scalarizing.
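
For example, a <3 x float> divide widened to <4 x float> could force a
harmless divisor into the unused lane so it can't trap (a sketch; %num
and %den are the widened operands, and I'm using insertelement since
the lane is fixed, though a vector select with a constant mask would do
the same job):

   %den4 = insertelement <4 x float> %den, float 1.0, i32 3
   %q    = fdiv <4 x float> %num, %den4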

                                                -Dave

We would like to extend the vector operations in llvm a bit. We're
hoping to get some feedback on the right way to go, or some starting
points. I had previously had some discussion on this list about a
subset of the changes we have in mind.

Woohoo! We've been interested in talking about this for some time.

Glad to hear it!

===
1) Vector shl, lshr, ashr

I think these are no-brainers. We would like to extend the semantics
of the shifting instructions to naturally apply to vectors as well.
One issue is that these operations often only support a single shift
amount for an entire vector. I assume it should be fairly

So you're assuming a shift of a vector by a scalar? What about the
general vector-by-vector version?

No, I am proposing a general vector-by-vector version, just pointing out that many architectures do not support this natively, but that this should be a simple matter of legalization.
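
To make that concrete, a sketch of the proposed semantics (value names
are just for illustration):

   ; general element-wise shift; both operands are <4 x i32>
   %r1 = ashr <4 x i32> %v, %amt

   ; the single-shift-amount case is just a splatted operand, which a
   ; target could pattern-match to its vector-by-scalar shift
   %r2 = shl <4 x i32> %v, <i32 3, i32 3, i32 3, i32 3>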

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know
if you expect any issues with vector operations that change element
sizes from the RHS to the LHS, e.g. around legalization.

Is the assumption that all elements are changed in the same way?

Since these only depend on the types of the source and destination operand, and vectors are homogeneous, yes.

The intention is that for any of these conversion ops,

   %A = op <N x S> %B to <N x T>

is semantically equivalent to this pseudo-IR:

   %A = undef <N x T>
   for i = 0 .. N - 1:
     %t1 = extractelement <N x S> %B, i
     %t2 = op S %t1 to T
     %A = insertelement <N x T> %A, %t2, i

4) Vector select

We consider a vector select extremely important for a number of
operations. This would be an extension of select to support an <N x
i1> vector mask to select between elements of <N x T> vectors for some
basic type T. Vector min, max, sign, etc. can be built on top of this
operation.

Yes, merge/blend is a very important operation. Also, it would be nice to
think about generalizing this to apply masks to all vector operations,
particularly loads and stores.

Can you elaborate about what you mean by this? I've built vector IRs in the past that have writemasking effectively built into all IR ops, and I don't know if I would recommend this approach generally. But I'm not sure that's what you mean.

I realize this is probably the most controversial change amongst
these. I gather there is some concern about representing "variable
width" i1s, but I would contend that that's the case even for i1s
which are not vectors.

What do you mean by "variable width?"

At some point the i1 vectors may need to be converted to masks. The sizes for these masks will often depend on how they're used. On ISAs like SSE and CellSPU, a comparison generates a mask corresponding to the size of the operands being compared, and a selection requires a mask corresponding to the size of the operands being selected.

The "simple" way to deal with this is to insert appropriate conversion code at uses of i1s, and pick a representation for a given i1 based on its SSA birthpoint. It's a little more ambiguous when you start adding phis into the mix, e.g.:

a:
  %a1 = fcmp olt <2 x float> %f1, %f2 ; yields <2 x i1>
  br label %c

b:
  %a2 = fcmp olt <2 x double> %d1, %d2 ; yields <2 x i1>
  br label %c

c:
  %a3 = phi <2 x i1> [%a1, %a], [%a2, %b]
  %r1 = select <2 x i1> %a3, <2 x i16> %s1, <2 x i16> %s2
  %r2 = select <2 x i1> %a3, <2 x i8> %c1, <2 x i8> %c2

The representation for %a1 may be <2 x i32>, for %a2 <2 x i64>, but for %a3 it's less clear.

I don't think this is a huge problem, but it's something to be aware of. Note the same issue can affect even scalar i1 values. In fact, Scott Michel posted to llvm-dev recently about dealing with this issue for setcc in the CellSPU backend.

that only POT sizes are expected to work (although we've found some
bugs for things like <2 x float> that we're submitting). Ideally
things like <3 x float> operands would usually be rounded up to the
size supported by the machine directly. We can try to do this in the

You might need mask support as well, especially if the operation can trap.

Hmm. Yes, divisions by zero etc. are something we should think about.

Please let me know what you think,

Let's connect at the dev meeting along with others interested in this stuff
and start thinking about how to proceed.

Sounds good.

The issue is not joining vectors of different types or length, but that at some point an <N x i1> will (on many architectures) have to be turned into a mask that's something like <N x i8>, <N x i32>, <N x i64> etc. This internal type will depend on the operation producing the mask, and phi nodes may join such <N x i1>s which are internally going to be differently sized.

I've included an example in another response to you that I just sent. Hopefully that makes things clear. Let me know if it's not.

Stefanus

No, I am proposing a general vector-by-vector version, just pointing
out that many architectures do not support this natively, but that
this should be a simple matter of legalization.

Cool.

It would be handy to have LLVM IR handle vector/scalar generally as well.
For example, multiply a vector by a scalar, add a scalar to a vector, etc.
isel/codegen would have to expand the scalar on architectures that
don't have hardware to do these operations.
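
Today a frontend has to spell that out with an explicit splat,
something like (a sketch):

   %t     = insertelement <4 x float> undef, float %s, i32 0
   %splat = shufflevector <4 x float> %t, <4 x float> undef, <4 x i32> zeroinitializer
   %r     = fmul <4 x float> %v, %splat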

>> 2) Vector trunc, sext, zext, fptrunc and fpext

The intention is that for any of these conversion ops,

   %A = op <N x S> %B to <N x T>

is semantically equivalent to this pseudo-IR:

   %A = undef <N x T>
   for i = 0 .. N - 1:
     %t1 = extractelement <N x S> %B, i
     %t2 = op S %t1 to T
     %A = insertelement <N x T> %A, %t2, i

Yep, that's what I would expect.

> Yes, merge/blend is a very important operation. Also, it would be
> nice to
> think about generalizing this to apply masks to all vector operations,
> particularly loads and stores.

Can you elaborate about what you mean by this? I've built vector IRs
in the past that have writemasking effectively built into all IR ops,
and I don't know if I would recommend this approach generally. But I'm
not sure that's what you mean.

That's basically what I mean. What about this gives you pause? Generating
code for an architecture without these mask operations might get interesting
(you'd have to do reverse if-conversion). But right now, there's no way to
effectively support machines that _do_ have this hardware.

The if-conversion could be done depending on whether or not the vector
hardware supports generalized masked operations. Right now the only
way to get this support is to define a whole set of MachineInstructions
that describe these operations and write the if-conversion passes to make
the transformation on machine IR. It would be nice to provide one model that
can be reused.
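
For contrast, about the best you can do without masked memory
operations is something like this (a sketch leaning on the proposed
vector select, and obviously without the trap/store-avoidance
semantics of a real hardware masked store):

   %old = load <4 x i32>* %p
   %new = select <4 x i1> %m, <4 x i32> %v, <4 x i32> %old
   store <4 x i32> %new, <4 x i32>* %p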

Obviously, a lot more discussion about this needs to happen.

>> I realize this is probably the most controversial change amongst
>> these. I gather there is some concern about representing "variable
>> width" i1s, but I would contend that that's the case even for i1s
>> which are not vectors.
>
> What do you mean by "variable width?"

At some point the i1 vectors may need to be converted to masks. The
sizes for these masks will often depend on how they're used. On ISAs
like SSE and CellSPU, a comparison generates a mask corresponding to
the size of the operands being compared, and a selection requires a
mask corresponding to the size of the operands being selected.

Right.

The "simple" way to deal with this is to insert appropriate conversion
code at uses of i1s, and pick a representation for a given i1 based on
its SSA birthpoint. It's a little more ambiguous when you start adding
phis into the mix, e.g.:

a:
  %a1 = fcmp olt <2 x float> %f1, %f2 ; yields <2 x i1>
  br label %c

b:
  %a2 = fcmp olt <2 x double> %d1, %d2 ; yields <2 x i1>
  br label %c

c:
  %a3 = phi <2 x i1> [%a1, %a], [%a2, %b]
  %r1 = select <2 x i1> %a3, <2 x i16> %s1, <2 x i16> %s2
  %r2 = select <2 x i1> %a3, <2 x i8> %c1, <2 x i8> %c2

The representation for %a1 may be <2 x i32>, for %a2 <2 x i64>, but
for %a3 it's less clear.

As I said in a previous message, I'm not sure that code like this makes
sense even without i1. I would assume you'd just generate two different
masks, one for <2 x double> and one for <2 x float>. If the masks can
later be combined into one register, great! But that's a machine-dependent
decision.

I'm trying to think of what kind of source code would produce this. The
codes I imagine would either generate two vector loops with different
masks corresponding to checking doubles or floats or would generate
a loop that sets a vector of integers based on either doubles or floats
and then another loop that uses the vector of integers to mask operations.

In either code, I don't imagine a phi like this would be generated. Because
vector masks generally result from if-conversion, it's strange to even see a
phi here.

Perhaps I'm thinking small. Can you give a source code example you
could see mapping to this?

My gut tells me we should declare this kind of thing illegal.

I don't think this is a huge problem, but it's something to be aware
of. Note the same issue can affect even scalar i1 values. In fact,
Scott Michel posted to llvm-dev recently about dealing with this issue
for setcc in the CellSPU backend.

Thanks for the pointer. I'll go read that thread.

I should think conversion at the point of use as you describe would be
workable. But I still really question whether we have to worry about this
at all.

>> that only POT sizes are expected to work (although we've found some
>> bugs for things like <2 x float> that we're submitting). Ideally
>> things like <3 x float> operands would usually be rounded up to the
>> size supported by the machine directly. We can try to do this in the
>
> You might need mask support as well, especially if the operation can
> trap.

Hmm. Yes, divisions by zero etc. are something we should think about.

You can always scalarize, but that kinda defeats the point of non-POT.
Still, for vector hardware that supports non-POT, this support is a win.

                                                        -Dave