Vectors of Pointers and Vector-GEP

Hi,

Following the discussion in last week’s LLVM developers conference I started working on support for vectors-of-pointers. Vectors of pointers are needed for supporting scatter/gather operations and are the first step in the direction of supporting predicated architectures. In the attached patch, I change the LLVM-IR in order to support vectors-of-pointers and added basic support for vector-gep. In following patches I plan to extend the vector-gep support to more indices and add scatter/gather intrinsics.

I started by implementing vector-gep support for a simple case where there is a single index. The reason for this limitation, as noted by Dan Gohman, is that pointers may point to structs, and vectors of indices may point to different members in the struct. I am aware of the fact that supporting multiple indices is an important feature and I do intend to add support for multiple indices in the future.

Please review the attached patch.

Thanks,

Nadav

vector_gep.patch (13.9 KB)

Hi,

Following the discussion in last week’s LLVM developers conference I
started working on support for vectors-of-pointers. Vectors of
pointers are needed for supporting scatter/gather operations and are
the first step in the direction of supporting predicated
architectures.

This is also important for vanilla (auto-)vectorization. At least in the
context of my basic-block autovectorizer, the fact that there is no
vector GEP results in a lot of missed vectorization opportunities.

-Hal

Duncan,

Thanks for the quick review! Here is a short description (design) of where I am going with this patch:

1. Motivation: Vectors-of-pointers is the first step in supporting scatter/gather instructions (available in AVX2, for example). I believe that this feature was requested on the mailing list before. As mentioned by Hal Finkel earlier today, this feature is desired by autovectorizers as it simplifies the abstraction. I can also mention the Intel OpenCL Autovectorizer and ISPC as two other vectorizers which would benefit from this type.

2. The type Vectors-of-pointers is similar to the pointer type, in the sense that it cannot be bit-casted and it has no size. It supports the following conversions:
  a. Bitcast to other vector-of-pointers, as long as the element count is the same.
  b. No other bit-cast operations are allowed.
3. The type Vectors-of-pointers can be used by the following instructions:
  a. Insertelement/extractelement/shuffle - just like any other vector
  b. Getelementptr - for address calculations, with the restrictions mentioned below.
  c. Intrinsics - to implement the scatter/gather operations
  d. IntToPtr, PtrToInt - for conversion to vectors of integers [This is not implemented in this patch].
4. The getelementptr instruction can be used to access structs, which may cause a problem for vector-geps when accessing different fields. Currently I limit the use of vector-gep to vectors of pointers with a single index. In the future, I may add support for more indices if users of the vector-gep would request for it. Matt Pharr from ISPC also requested support for vectors of arrays, which are not supported at this point.
5. The getelementptr indices must all be vectors, if the pointer operand is a vector of pointers.

Regarding your specific comments, Vectors of pointers are a new type in LLVM, so I believe that any predictable and sensible behavior is acceptable. Much like pointers, 'getBitWidth' should return zero. In my current patch bitcasting of one vector pointer to another vector pointer of the same length is okay, but any other bitcast is invalid. I added an assert to getInteger to prevent calling 'getInteger' on vectors of pointers. I agree that IntToPtr and PtrToInt need to be extended in order to support vectors of pointers and I plan to add this.

Thank you,
Nadav

Duncan,

Thanks for the quick review! Here is a short description (design) of where I am going with this patch:

1. Motivation: Vectors-of-pointers is the first step in supporting scatter/gather instructions (available in AVX2, for example). I believe that this feature was requested on the mailing list before. As mentioned by Hal Finkel earlier today, this feature is desired by autovectorizers as it simplifies the abstraction. I can also mention the Intel OpenCL Autovectorizer and ISPC as two other vectorizers which would benefit from this type.

2. The type Vectors-of-pointers is similar to the pointer type, in the sense that it cannot be bit-casted and it has no size. It supports the following conversions:
  a. Bitcast to other vector-of-pointers, as long as the element count is the same.
  b. No other bit-cast operations are allowed.
3. The type Vectors-of-pointers can be used by the following instructions:
  a. Insertelement/extractelement/shuffle - just like any other vector
  b. Getelementptr - for address calculations, with the restrictions mentioned below.
  c. Intrinsics - to implement the scatter/gather operations
  d. IntToPtr, PtrToInt - for conversion to vectors of integers [This is not implemented in this patch].
4. The getelementptr instruction can be used to access structs, which may cause a problem for vector-geps when accessing different fields. Currently I limit the use of vector-gep to vectors of pointers with a single index. In the future, I may add support for more indices if users of the vector-gep would request for it. Matt Pharr from ISPC also requested support for vectors of arrays, which are not supported at this point.

So long as coverage will be incomplete, I request that you add some kind
of utility function that can be called on a type and set of GEP indicies
(or maybe a GEP instruction itself) to determine whether it can be
vectorized.

I think it would be easiest on the user if all GEPs can be vectorized.
The default can just be to break up the vector GEP into some scalar GEPs
and some vector insert/extracts.

Regardless, this feature should really help with vectorization, and I
think it is great that you're implementing it.

5. The getelementptr indices must all be vectors, if the pointer operand is a vector of pointers.

Regarding your specific comments, Vectors of pointers are a new type in LLVM, so I believe that any predictable and sensible behavior is acceptable. Much like pointers, 'getBitWidth' should return zero.

Having getBitWidth return 0 makes sense to me. We'll need to grep the
code and add or change a bunch of if statements, but I think that will
be the case however we do this.

-Hal

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Hi,

Following the discussion in last week’s LLVM developers conference I started working on support for vectors-of-pointers. Vectors of pointers are
needed for supporting scatter/gather operations and are the first step in the direction of supporting predicated architectures. In the attached
patch, I change the LLVM-IR in order to support vectors-of-pointers and added basic support for vector-gep. In following patches I plan to extend
the vector-gep support to more indices and add scatter/gather intrinsics.

I started by implementing vector-gep support for a simple case where there is a single index. The reason for this limitation, as noted by Dan
Gohman, is that pointers may point to structs, and vectors of indices may point to different members in the struct. I am aware of the fact that
supporting multiple indices is an important feature and I do intend to add support for multiple indices in the future.

So glad to see this happening!

But I'm unclear about your description. A vector GEP produces a vector
of pointers, right? And a scatter/gather operation takes a vector of
pointers as the list of addresses to fetch?

What does the non-unit stride case look like? I have two ways I think
about a gather/scatter or non-unit stride access.

1. Vector-of-indices

This looks something like:

V1 <- Base + V2

where Base is a base address and V2 is a vector of offsets from the
base.

2. Vector-of-addresses

This looks omsething like:

V1 <- Offset + V2

Where Offset is an offset value applied to each address in V2.

The first form is somewhat more convenient for striding through an array
while the second is somewhat more convenient for doing gather/scatter on
a set of "random" address locations (think operating on a field in
several randomly-allocated structs).

The multiple-index case (if I understand what you mean) is just a
special case of the above, where the values in V2 have been adjusted to
point to the various different fields.

Can you explain with some example what your proposal provides and how it
relates to #1 and #2 above? I'm just trying to understand the overall
scheme.

Thanks!

                           -Dave

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Duncan,

Thanks for the quick review! Here is a short description (design) of
where I am going with this patch:

Did I miss Duncan's feedback? It didn't come across to me.

                              -Dave

David,

Thanks for the support! I sent a detailed email with the overall plan. But just to reiterate, the GEP would look like this:

  %PV = getelementptr <4 x i32*> %base, <4 x i32> <i32 1, i32 2, i32 3, i32 4>

Where the index of the GEP is a vector of indices. I am not against having multiple indices. I just want to start with a basic set of features.

Thanks,
Nadav

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

David,

Thanks for the support! I sent a detailed email with the overall
plan. But just to reiterate, the GEP would look like this:

  %PV = getelementptr <4 x i32*> %base, <4 x i32> <i32 1, i32 2, i32 3, i32 4>

Where the index of the GEP is a vector of indices. I am not against
having multiple indices. I just want to start with a basic set of
features.

Ah, I see. I actually think multiple indices as in multiple vectors of
indices to the GEP above would be pretty rare.

                             -Dave

greened@obbligato.org (David A. Greene) writes:

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Following the discussion in last week’s LLVM developers conference I
started working on support for vectors-of-pointers. Vectors of
pointers are needed for supporting scatter/gather operations and are
the first step in the direction of supporting predicated
architectures.

I just want to make clear that gather/scatter and prediction are
orthogonal concepts. You don't need one to do the other. You can use
scatter/gather to vectorize conditional code but it's not the same as
true predication, which is generally much more efficient.

                               -Dave

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

It went to llvm-commits.

This is a problem. I can't possibly sort through all of the stuff on
llvm-commits to find design documents like this.

It would be easier if llvm-commits were restricted to, say, commits and
higher-level design and development discussions occurred on, say, -dev.
:slight_smile:

                                 -Dave

David,

I agree with you, scatter/gather and predication are orthogonal concepts. In theory, you could implement load/store predication using scatter/gather instructions by loading/storing to dedicated 'dead' stack slots. Having said that, I don't know of any architecture which has scatter/gather but does not have predicated instructions.

Looking onward, I definitely want to see LLVM support predicated architectures. Dan Gohman had some excellent ideas on different way this can be done. As soon as I finish stabilizing the pointer-vector patch, and implement scatter/gather intrinsics, I would like to start a discussion on the implementation of predicated instructions in LLVM.

Nadav

I agree that a single vector index is sufficient for many cases. Matt Pharr (from the ISPC compiler), showed me an interesting case where there is a single pointer into an array. In this case we need to have two indices, where the first index is zero. Once the basic patch is in, we can start looking at adding support for arrays and multiple indices.

Nadav

Nadav, David,

I'd like to understand a bit better the final role of these pointer vector types in 64bit architectures, where the pointers are often bigger than the elements stored/fetch (e.g, 32bits floats/ints).

Will 64bits backends be forced to actually operate with 64bit pointer vectors all the time? Or will they be able to retain operations on base + 32bit offsets as such?

In particular, an important use case for 3D software rendering is to be able to gather <4 x i32> values, from a i32* scalar base pointer in a 64bit address space, indexed by <N x i32> offsets. [1] And it is important that the intermediate <N x i32*> pointer vectors is actually never instanced, as it wouldn't fit in the hardware SIMD registers, and therefore would require two gather operations.

It would be nice to see how this use case would look in the proposed IR, and get assurance that backends will be able to emit efficient code (i.e., a single gather instruction) from that IR.

Jose

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-June/040825.html

Hi Jose,

The proposed IR change does not contribute nor hinder the usecase you mentioned. The case of a base + vector-index should be easily addressed by an intrinsic. The pointer-vector proposal comes to support full scatter/gather instructions (such as the AVX2 gather instructions).

Nadav

Yes, indeed I can always fallback to intrinsics.

But still, I believe that the case I described is in its essence quite common-place, so it should be a first-class citizen in the LLVM IR. AVX2 is the target ISA I'm thinking of too BTW.

Let's forget 3D, and imagine something as trivial as a vectorized i32 => float table look up. I'd expect that the IR would look something like:

  ; Look Up Table with precomputed values
  declare float* @lut;

  define <8 x float> @foo(<8 x float> %indices) {
    %pointer = getelementptr float* @lut, <8 x i32> %indices
    %values = load <8 x float*> %pointer
    ret <8 x float> %values;
  }

And the final AVX2 code I'd expect would consist of a single VGATHERDPS, both on 64bits and 32bits addressing mode:

foo:
  VPCMPEQB ymm1, ymm1, ymm1 ; generate all ones
  VGATHERDPS ymm0, DWORD PTR [ymm0 * 4 + lut], ymm1
  RET

Jose

Jose,

The scenario you described is probably the most important/common case. Supporting GEPs with a scalar base pointer and multiple indices can indeed assist IR-level optimizations in detecting these patterns and replace them with intrinsics. But even without a single scalar base pointers, optimizations can detect that the base pointer is broadcasted from a scalar. Having said that, I am still not sure how to add codegen support for AVX2 scatter/gather of base + 32bit-indices. The problem is that the GEP would return a vector of pointers, which need to be reversed back to the 'base+index' form. I think that replacing the GEP/LOAD sequence with an intrinsic if probably the best choice.

Nadav

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

I agree that a single vector index is sufficient for many cases. Matt Pharr (from the ISPC compiler), showed me an interesting case where there is a single pointer into an array. In this case we need to have two indices, where the first index is zero. Once the basic patch is in, we can start looking at adding support for arrays and multiple indices.

Ah, I see. I forgot about the GEP zero offset case. This is indeed a
common case. In fact I would like to see a couple of additions:

%PV = getelementptr i32* %base, <4 x i32> <i32 0, i32 0, i32 0, i32 0>, <4 x i32> <i32 w, i32 x, i32 y, i32 z>

That is, make the base address a scalar and all vector indices get
scaled and added to it to produce the vector of pointers. This is the
case Matt gave you and is very common. It's used for striding through
an array or G/S where all of the elements are in the same array.

This avoids the need to broadcast the base address to a vector. We
could match the broadcast in codegen and revert it to an instruction
that takes a scalar base, but it's a bit trickier. We'd probably have
to make special target ISD nodes _a_la_ the X86 shuffle stuff. The
above would make this more target-independent. It is easy enough for a
target to lower the base address to a broadcast if necessary.

%PV = getelementptr i32* %base, <4 x i32> <i32 0, i32 0, i32 0, i32 0>, i32 %index, i32 %stride

This is for common strided accesses, for example:

x[i], x[i+2], x[i+4], x[i+6]

The stride component would be multiplied against an incremented index:

%PV = getelementptr i32* %base, <4 x i32> <i32 0, i32 0, i32 0, i32 0>, i32 %i, i32 %stride

=> <4 x i32 *><0+%i*%stride*bitwidth, 0+(%i+1)*%stride*bitwidth, 0+(%i+2)*%stride*bitwidth, 0+(%i+3)*%stride*bitwidth>

This avoids the need to generate a vector of indices that are a regular
strided pattern. Similar to the above, it simplifies matching on vector
architectures that directly support strided accesses.

I would really like to see these two versions of GEP added at some
point. They would simplify isel quite a bit.

Some other ideas:

An alternative is to introduce a cidx (compressed index) instruction:

%IV = cidx i32 %stride, i32 %vector_length

(if %vector_length is 4) => <4 x i32><%stride*0, %stride*1, %stride*2, %stride*3>

This generates the strided index pattern which could be fed into the GEP
as an index vector.

Or:

%IV = cidx i32 %i, i32 %stride, i32 %vector_length

(if %vector_length is 4) => <4 x i32><%stride*%i, %stride*(%i+1), %stride*(%i+2), %stride*(%i+3)>

You can see examples of these and other instructions from the Cray X1
here:

http://docs.cray.com/books/S-2314-51/S-2314-51-manual.pdf

2.5.4 is the section of interest for index generation.

In particular, the scan() and cmprss() (compress) instructions are also
very useful to generate index sets.

Note that the X1 index generation instructions also take a mask
(predicate) operand. The difference between cidx() and scan() is how
the mask and index are interpreted. scan() in particular is useful for
vector compress/expand operations. This is something else to think
about if you plan to tackle predication in the future.

These are just some additional ideas to think about as you continue to
design this.

                            -Dave

Jose Fonseca <jfonseca@vmware.com> writes:

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

> David,
>
> Thanks for the support! I sent a detailed email with the overall
> plan. But just to reiterate, the GEP would look like this:
>
> %PV = getelementptr <4 x i32*> %base, <4 x i32> <i32 1, i32 2, i32
> 3, i32 4>
>
> Where the index of the GEP is a vector of indices. I am not against
> having multiple indices. I just want to start with a basic set of
> features.

Ah, I see. I actually think multiple indices as in multiple vectors
of
indices to the GEP above would be pretty rare.

Nadav, David,

I'd like to understand a bit better the final role of these pointer
vector types in 64bit architectures, where the pointers are often
bigger than the elements stored/fetch (e.g, 32bits floats/ints).

The pointers are addresses. On a 64-bit address machine they will be 64
bits. On a 32-bit address machine they will be 32 bits.

For a situation like PTX that has multiple addresses sizes, we will need
additional LLVM support. Right now a pointer can only have one size per
target.

Will 64bits backends be forced to actually operate with 64bit pointer
vectors all the time? Or will they be able to retain operations on
base + 32bit offsets as such?

Are you talking about 32-bit pointers? If so, Nadav has talked about
vector inttoptr and ptrtoint instructions which I think can address the
need you're getting at. But I'm a little unclear on what you want.

In particular, an important use case for 3D software rendering is to
be able to gather <4 x i32> values, from a i32* scalar base pointer in
a 64bit address space, indexed by <N x i32> offsets. [1] And it is
important that the intermediate <N x i32*> pointer vectors is actually
never instanced, as it wouldn't fit in the hardware SIMD registers,
and therefore would require two gather operations.

By "fit" are you worried about vector length? If so, legalize would
have to break up the <N x i32*> vector into two or more smaller vectors.

If you are worried about element size (there are only 32-bit elements)
then inttoptr/ptrtoint should handle it, I think.

                              -Dave

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

The scenario you described is probably the most important/common
case. Supporting GEPs with a scalar base pointer and multiple indices
can indeed assist IR-level optimizations in detecting these patterns
and replace them with intrinsics. But even without a single scalar
base pointers, optimizations can detect that the base pointer is
broadcasted from a scalar.

I just wrote an extensive reply that covers this. I think we do want a
scalar base GEP to make isel easier and the IR more target-independent.
We should also consider a strided GEP (also covered in my reply) for the
same reason.

Having said that, I am still not sure how to add codegen support for
AVX2 scatter/gather of base + 32bit-indices. The problem is that the
GEP would return a vector of pointers, which need to be reversed back
to the 'base+index' form. I think that replacing the GEP/LOAD sequence
with an intrinsic if probably the best choice.

In the same reply I mentioned various index generation instructions like
cidx. This allows you to retain the index set separate from the final
addresses. You'd then match load (gep(base, <0>, index set)) as the
gather operation in isel, similar to how memops are done for X86 today
(i.e. manual lowering would put the GEP information in special address
match data structures).

The same can be done without cidx, but you need a more complex
instruction sequence to generate the index set and match that in the
address manual lowering code. This does make the manual address
matching code more complicated.

There are lots of options here but I do think we need something beyond
the simple "all vector" GEP.

                             -Dave

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Looking onward, I definitely want to see LLVM support predicated
architectures. Dan Gohman had some excellent ideas on different way
this can be done. As soon as I finish stabilizing the pointer-vector
patch, and implement scatter/gather intrinsics, I would like to start
a discussion on the implementation of predicated instructions in LLVM.

Me too. This is going to be very important going forward as we're going
to have to extract and represent parallelism in the new age of the clock
ceiling.

                           -Dave