Indexed Load and Store Intrinsics - proposal

Hi,

Recent Intel architectures (AVX-512 and AVX2) provide vector gather and/or scatter instructions.
Gather/scatter instructions allow read/write access to multiple memory addresses. The addresses are specified using a base address and a vector of indices.
We'd like vectorizers to tap this functionality, and we propose to do so by introducing new intrinsics:

VectorValue = @llvm.sindex.load (BaseAddr, VectorOfIndices, Scale)
VectorValue = @llvm.uindex.load (BaseAddr, VectorOfIndices, Scale)
VectorValue = @llvm.sindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)
VectorValue = @llvm.uindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)

Semantics:
For i=0,1,…,N-1: if (Mask[i]) { VectorValue[i] = *(BaseAddr + VectorOfIndices[i]*Scale); } else { VectorValue[i] = PassThruVal[i]; }

void @llvm.sindex.store (BaseAddr, VectorValue, VectorOfIndices, Scale)
void @llvm.uindex.store (BaseAddr, VectorValue, VectorOfIndices, Scale)
void @llvm.sindex.masked.store (BaseAddr, VectorValue, VectorOfIndices, Scale, Mask)
void @llvm.uindex.masked.store (BaseAddr, VectorValue, VectorOfIndices, Scale, Mask)

Semantics:
For i=0,1,…,N-1: if (Mask[i]) {*(BaseAddr + VectorOfIndices[i]*Scale) = VectorValue[i];}

VectorValue: any float or integer vector type.
BaseAddr: a pointer; may be zero if the full address is placed in the index.
VectorOfIndices: a vector of i32 or i64 signed or unsigned integer values.
Scale: a compile-time constant 1, 2, 4, or 8.
VectorValue, VectorOfIndices and Mask must have the same vector width.

An indexed store instruction with complete or partial overlap in memory (i.e., two indices with the same or close values) produces a result equivalent to that of serial scalar stores issued from the least to the most significant vector element.
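For example (a sketch using the proposed syntax; the vector length, element type, index values and name mangling are chosen here only for illustration):

call void @llvm.sindex.store.v4f32.v4i32(i8* %ptr,
    <4 x float> <float 1.0, float 2.0, float 3.0, float 4.0>,
    <4 x i32> <i32 0, i32 2, i32 0, i32 3>, i32 4)

Elements 0 and 2 both address %ptr + 0; element 2 is stored last, so the final value at that address is 3.0, exactly as if the four scalar stores had been issued in order.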

The new intrinsics are common to all targets, like the recently introduced masked load and store intrinsics.

Examples:

<16 x float> @llvm.sindex.load.v16f32.v16i32 (i8 *%ptr, <16 x i32> %index, i32 %scale)
<16 x float> @llvm.sindex.masked.load.v16f32.v16i32 (i8 *%ptr, <16 x i32> %index, i32 %scale, <16 x float> %passthru, <16 x i1> %mask)
void @llvm.sindex.masked.store.v16f32.v16i64(i8* %ptr, <16 x float> %value, <16 x i64> %index, i32 %scale, <16 x i1> %mask)

Comments?

Thank you.

- Elena

Hi Elena,

I think that in general this proposal makes sense and is consistent with discussions that we've had in the past. These new intrinsics can be very useful for vectorization. I have a few comments below.

Hi,

Recent Intel architectures (AVX-512 and AVX2) provide vector gather and/or scatter instructions.
Gather/scatter instructions allow read/write access to multiple memory addresses. The addresses are specified using a base address and a vector of indices.
We'd like vectorizers to tap this functionality, and we propose to do so by introducing new intrinsics:

VectorValue = @llvm.sindex.load (BaseAddr, VectorOfIndices, Scale)
VectorValue = @llvm.uindex.load (BaseAddr, VectorOfIndices, Scale)
VectorValue = @llvm.sindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)
VectorValue = @llvm.uindex.masked.load (BaseAddr, VectorOfIndices, Scale, PassThruVal, Mask)

It looks like the proposed intrinsic is very specific to the x86 implementation of gather/scatter. Would it be possible to remove the PassThrough value from the intrinsic and define the masked-out value to be undef? You would still be able to pattern match it if you use a masked load + select.
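For example, something along these lines (a rough sketch; the exact name and operand order are not meant to be definitive), where the masked-out lanes of the load are undef and the select supplies the pass-through values:

%g = call <16 x float> @llvm.sindex.masked.load.v16f32.v16i32(i8* %ptr, <16 x i32> %index, i32 4, <16 x i1> %mask) ; masked-out lanes are undef
%res = select <16 x i1> %mask, <16 x float> %g, <16 x float> %passthru

The backend could then match the intrinsic together with the feeding select onto a single hardware gather that merges into the pass-through register.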

Can we remove the masked version of the intrinsic altogether and pattern match it using the non-masked version somehow?

Can we infer the scale value based on the loaded element type?

Semantics:
For i=0,1,…,N-1: if (Mask[i]) { VectorValue[i] = *(BaseAddr + VectorOfIndices[i]*Scale); } else { VectorValue[i] = PassThruVal[i]; }

void @llvm.sindex.store (BaseAddr, VectorValue, VectorOfIndices, Scale)
void @llvm.uindex.store (BaseAddr, VectorValue, VectorOfIndices, Scale)
void @llvm.sindex.masked.store (BaseAddr, VectorValue, VectorOfIndices, Scale, Mask)
void @llvm.uindex.masked.store (BaseAddr, VectorValue, VectorOfIndices, Scale, Mask)

Semantics:
For i=0,1,…,N-1: if (Mask[i]) {*(BaseAddr + VectorOfIndices[i]*Scale) = VectorValue[i];}

VectorValue: any float or integer vector type.

We should also support loading and storing pointer values.

BaseAddr: a pointer; may be zero if full address is placed in the index.
VectorOfIndices: a vector of i32 or i64 signed or unsigned integer values.
Scale: a compile time constant 1, 2, 4 or 8.

Why do we need to limit the scale values?

VectorValue, VectorOfIndices and Mask must have the same vector width.

An indexed store instruction with complete or partial overlap in memory (i.e., two indices with same or close values) will provide the result equivalent to serial scalar stores from least to most significant vector elements.

Thanks,
Nadav

"Demikhovsky, Elena" <elena.demikhovsky@intel.com> writes:

Semantics:
For i=0,1,…,N-1: if (Mask[i]) {*(BaseAddr + VectorOfIndices[i]*Scale)
= VectorValue[i];}
VectorValue: any float or integer vector type.
BaseAddr: a pointer; may be zero if full address is placed in the
index.
VectorOfIndices: a vector of i32 or i64 signed or unsigned integer
values.

What about the case of a gather/scatter where the BaseAddr is zero and
the indices are pointers? Must we do a ptrtoint? llvm.org is down at
the moment but I don't think we currently have a vector ptrtoint.

Scale: a compile time constant 1, 2, 4 or 8.

This seems a bit too Intel-focused. Why not allow arbitrary scales? Or
alternatively, eliminate the Scale and do a vector multiply on
VectorOfIndices. It should be simple enough to write matching TableGen
patterns. We do it now for the x86 memop stuff.

VectorValue, VectorOfIndices and Mask must have the same vector width.

From your example, you mean they must have the same number of vector elements, not the same bit width, right? I'm used to "width" meaning a specific bit length and "vector length" meaning "number of elements." With that terminology, I think you mean they must have the same vector length.

An indexed store instruction with complete or partial overlap in
memory (i.e., two indices with same or close values) will provide the
result equivalent to serial scalar stores from least to most
significant vector elements.

Yep, they must be ordered. Do we want to provide unordered scatters as
well? Some (non-LLVM) targets have them. We don't need to add them
right now but it's worth thinking about.

The new intrinsics are common for all targets, like recently
introduced masked load and store.
Examples:
<16 x float> @llvm.sindex.load.v16f32.v16i32 (i8 *%ptr, <16 x i32>
%index, i32 %scale)
<16 x float> @llvm.sindex.masked.load.v16f32.v16i32 (i8 *%ptr, <16 x i32> %index, i32 %scale, <16 x float> %passthru, <16 x i1> %mask)
void @llvm.sindex.masked.store.v16f32.v16i64(i8* %ptr, <16 x float> %value, <16 x i64> %index, i32 %scale, <16 x i1> %mask)
Comments?

I think it's definitely a good idea to introduce them, but let's make
them a little more target-neutral if we can.

                           -David

Nadav Rotem <nrotem@apple.com> writes:

It looks like the proposed intrinsic is very specific to the x86
implementation of gather/scatter. Would it be possible to remove the
PassThrough value from the intrinsic and define the masked-out value
to be undef? You would still be able to pattern match it if you use a
masked load + select.

This makes sense to me.

Can we remove the masked version of the intrinsic altogether and
pattern match it using the non-masked version somehow?

I don't think that's possible. The masking needs to be atomic with the
memory operation. A non-masked memory operation could fault. This:

vec = masked_gather(base, indices, mask)

is not semantically equivalent to

vec1 = unmasked_gather(base, indices)
vec = select(mask, vec1, othervec)

The gather could fault. It is not sufficient to pattern-match this
because some pass before isel could look at this code and assume the
gather doesn't fault, rearranging code in illegal ways.

Can we infer the scale value based on the loaded element type?

No, I don't think we can do that. Consider the case where Base is zero
and VectorOfIndices contains pointers.

[As an aside, LLVM does indeed have vector ptrtoint, so we could always
use that, though another intrinsic allowing vectors of pointers might be
cleaner.]

I think requiring a multiply of the VectorOfIndices before the
gather/scatter is the most flexible course.
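A rough sketch of what I have in mind (the intrinsic name follows Elena's proposal but with the Scale operand dropped, so treat the exact form as hypothetical):

%scaled = mul <4 x i32> %index, <i32 4, i32 4, i32 4, i32 4>
%v = call <4 x float> @llvm.sindex.load.v4f32.v4i32(i8* %base, <4 x i32> %scaled)

A target whose gather hardware scales automatically can fold the splat multiply back into the addressing mode during selection; targets without hardware scaling simply use %scaled as-is.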
    

    Semantics:
    For i=0,1,…,N-1: if (Mask[i]) { VectorValue[i] = *(BaseAddr + VectorOfIndices[i]*Scale); } else { VectorValue[i] = PassThruVal[i]; }
    
    void @llvm.sindex.store (BaseAddr, VectorValue, VectorOfIndices,
    Scale)
    void @llvm.uindex.store (BaseAddr, VectorValue, VectorOfIndices,
    Scale)
    void @llvm.sindex.masked.store (BaseAddr, VectorValue,
    VectorOfIndices, Scale, Mask)
    void @llvm.uindex.masked.store (BaseAddr, VectorValue,
    VectorOfIndices, Scale, Mask)
    
    Semantics:
    For i=0,1,…,N-1: if (Mask[i]) { *(BaseAddr + VectorOfIndices[i]*Scale) = VectorValue[i]; }
    
    VectorValue: any float or integer vector type.

We should also support loading and storing pointer values.

Yes, though ptrtoint could also be used here.
     
                              -David

Hi Nadav,

Hi David,

What about the case of a gather/scatter where the BaseAddr is zero and the indices are pointers? Must we do a ptrtoint? llvm.org is down at the moment but I don't think we currently have a vector ptrtoint.

[Demikhovsky, Elena] From the site:
The ‘ptrtoint‘ instruction converts the pointer or a vector of pointers value to the integer (or vector of integers) type ty2.
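So the vector form is available; for example (a minimal sketch, assuming 64-bit pointers):

%ints = ptrtoint <16 x float*> %ptrs to <16 x i64>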

Scale: a compile time constant 1, 2, 4 or 8.

This seems a bit too Intel-focused. Why not allow arbitrary scales? Or alternatively, eliminate the Scale and do a vector multiply on VectorOfIndices. It should be simple enough to write matching TableGen patterns. We do it now for the x86 memop stuff.

[Demikhovsky, Elena] As I wrote to Nadav, maybe two intrinsics would be more general. I'm just looking at the usage model. If the index is a pointer, scale = 1 and base = 0. If it is an index inside an array, the scale covers all basic types from char to double.
Do you think that two intrinsics would be less Intel-focused?
(1) a non-zero base and a vector of indices (each index relative to the base), with an implicit scale based on the element type.
(2) no base, no scale, just a vector of pointers.
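In IR these two forms would look roughly like this (the intrinsic names below are placeholders, not part of the proposal):

; (1) base + indices; the scale is implied by the element type (4 bytes for float)
%v1 = call <8 x float> @llvm.index.load.v8f32.v8i32(float* %base, <8 x i32> %index)
; (2) no base, no scale, just a vector of pointers
%v2 = call <8 x float> @llvm.ptr.load.v8f32(<8 x float*> %ptrs)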

We should also support loading and storing pointer values.

Does this mean that at the vectorization stage we don't know the size of a pointer?

I think that masked load / store / gather / scatter should handle the same data types.

I’m not against supporting vector of pointers in all intrinsics.

"Demikhovsky, Elena" <elena.demikhovsky@intel.com> writes:

Hi David,

What about the case of a gather/scatter where the BaseAddr is zero
and the indices are pointers? Must we do a ptrtoint? llvm.org is
down at the moment but I don't think we currently have a vector
ptrtoint.

[Demikhovsky, Elena] From the site:
The ‘ptrtoint‘ instruction converts the pointer or a vector of
pointers value to the integer (or vector of integers) type ty2.

Yep, found it after llvm.org came back up. :-)

Scale: a compile time constant 1, 2, 4 or 8.

This seems a bit too Intel-focused. Why not allow arbitrary scales?
Or alternatively, eliminate the Scale and do a vector multiply on
VectorOfIndices. It should be simple enough to write matching
TableGen patterns. We do it now for the x86 memop stuff.

[Demikhovsky, Elena] As I wrote to Nadav, may be two intrinsics will
be more general. I'm just looking at usage model. If the index is a
pointer, scale = 1, base = 0. If it is an index inside array, scale
covers all basic types from char to double.
Do you think that 2 intrinsics will be less-Intel-focused?
(1) non-zero base and vector of indices (index is relative to base)
and implicit scale based on element type.
(2) without base, without scale, just vector of pointers

I don't really have an opinion as to one or two intrinsics as long as
the vector-of-pointers idiom is supported. Does making the scaling
implicit hide optimization opportunities on architectures that don't
have instructions that automatically scale? I don't know if there are
any architectures supporting gather/scatter that don't auto-scale but I
can imagine such architectures. I'm just throwing out things to think
about.

I usually tend toward making everything explicit, meaning the scaling
would be done as a multiply feeding the gather/scatter intrinsic, but
that's just personal preference. I don't have strong data to argue one
way or the other.

                              -David

I would be opposed to any representation which required the introduction of ptrtoint casts by the vectorizer. If it were the only option available, I could be argued around, but I think we should try to avoid this.

More generally, I'm somewhat hesitant of representing a scatter with explicit base and offsets at all. Why shouldn't the IR representation simply be a load from a vector of arbitrary pointers? The backend can pattern match the actual gather instructions it supports and scalarize the rest. The proposal being made seems very specific to the current generation of x86 hardware.

p.s. Where is the documentation for the existing masked load intrinsics? I can't find it with a quick search through the LangRef.

Philip

Why shouldn't the IR representation simply be a load from a vector of arbitrary pointers?

Such a load could indeed serve as a general form of a gather or scatter. As Elena responded, we can propose two distinct intrinsics: one with a vector of pointers, and another with (non-zero) base, a vector of indices, and a scale implicitly inferred from the element type.

The motivation for the latter stems from vectorizing a load or store to "b[i]", where b is invariant. Broadcasting b and using a vector gep to feed a vector of pointers, to be pattern matched and folded later, may work. The alternative intrinsic proposed keeps b scalar and uses a vector of indices for i. In any case, it's important to recognize such common patterns, at least for x86, so it could deserve an x86 intrinsic. But it's a general pattern that could potentially serve other implementations; any other gathers to consider at the moment?

Documentation indeed needs to be provided.

Ayal.

From: "Ayal Zaks" <ayal.zaks@intel.com>
To: "Philip Reames" <listmail@philipreames.com>, dag@cray.com, "Elena Demikhovsky" <elena.demikhovsky@intel.com>
Cc: "Robert Khasanov" <robert.khasanov@intel.com>, llvmdev@cs.uiuc.edu
Sent: Monday, December 22, 2014 8:05:43 AM
Subject: Re: [LLVMdev] Indexed Load and Store Intrinsics - proposal

> Why shouldn't the IR representation simply be a load from a vector
> of arbitrary pointers?

Such a load could indeed serve as a general form of a gather or
scatter. As Elena responded, we can propose two distinct intrinsics:
one with a vector of pointers, and another with (non-zero) base, a
vector of indices, and a scale implicitly inferred from the element
type.

The motivation for the latter stems from vectorizing a load or store
to "b[i]", where b is invariant. Broadcasting b and using a vector
gep to feed a vector of pointers, to be pattern matched and folded
later, may work.

I would like you to explore this direction, where we use a vector GEP and the intrinsic simply takes a vector of pointers. The backend should pattern-match this as appropriate. I see no reason why we can't make this work, especially because we don't have any real uses of vector GEPs now, so we can *define* the canonical optimized form of them to be conducive to the kind of pattern matching we'd like to perform in the backends.

This, I imagine, will require some additional infrastructure work. Currently, GEPs, including vector GEPs, are expanded very early during SDAG building, and the form produced may not be appropriate for reliable pattern matching during later lowering phases. The way this is done is not set in stone, however, and we can certainly change it (including via the introduction of new SDAG nodes) to keep the necessary information together in compact form.
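As a concrete sketch of the kind of canonical form I mean (the gather intrinsic name here is a placeholder, and the broadcast is the usual insertelement/shufflevector idiom):

%b.ins = insertelement <8 x float*> undef, float* %b, i32 0
%b.splat = shufflevector <8 x float*> %b.ins, <8 x float*> undef, <8 x i32> zeroinitializer
%ptrs = getelementptr <8 x float*> %b.splat, <8 x i64> %i
%v = call <8 x float> @llvm.gather.v8f32(<8 x float*> %ptrs, <8 x i1> %mask)

The splat base and the vector of indices remain syntactically visible here, so the backend (or an SDAG node carrying the same information) can recover the base-plus-scaled-index form when the target supports it.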

Thanks again,
Hal

For the non-zero-base case, most of the time (e.g. a[i]) the "base" is uniform, and this property helps to identify and perform neighboring gather/scatter optimizations in some applications. If this information is not preserved explicitly, it may not be easy to recover the "uniform" property later. The non-zero-base case is not designed only for x86; basically, it helps to preserve certain program information.

Thanks,
Xinmin

From: "Xinmin Tian" <xinmin.tian@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Ayal Zaks" <ayal.zaks@intel.com>
Cc: dag@cray.com, "Robert Khasanov" <robert.khasanov@intel.com>, llvmdev@cs.uiuc.edu
Sent: Tuesday, December 23, 2014 7:36:44 PM
Subject: RE: [LLVMdev] Indexed Load and Store Intrinsics - proposal

For non-zero case, most of time (e.g. a[i]), the "base" is uniform,

Agreed.

the property helps to identify and perform neighboring
gather/scatter optimizations in some applications. If the
information is preserved, later, it may not easy to recover this
"uniform" information.

I don't understand what you mean. Why do you feel the information will be difficult to recover later from vector GEPs feeding the scatter/gather intrinsics?

-Hal

I don't understand what you mean. Why do you feel the information will be difficult to recover later from vector GEPs feeding the scatter/gather intrinsics?

As the intrinsic only holds a vector of pointers, we would have to trace back to the GEPs and decompose the address calculation of each element into base and offset, etc., right? For the vector-of-pointers case, the process is to take the base and offset, compose them, feed the result into the intrinsic, and later decompose them when needed. If we keep the base and offset separate, it does simplify the compose and decompose processes in some cases, right?
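For instance, a hypothetical sketch of what the vectorizer would emit for a[i] under the vector-of-pointers scheme (the gather name is a placeholder; %a.splat is the scalar base %a broadcast via insertelement/shufflevector):

; compose: base + offsets into a vector of pointers
%ptrs = getelementptr <8 x float*> %a.splat, <8 x i64> %i
%v = call <8 x float> @llvm.gather.v8f32(<8 x float*> %ptrs, <8 x i1> %mask)

To exploit the uniform base later, a pass has to walk back from %ptrs to the GEP and to the splat of %a, i.e. decompose what was just composed; keeping the base and offsets as separate intrinsic operands would avoid that round trip in such cases.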

Xinmin

From: "Xinmin Tian" <xinmin.tian@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: dag@cray.com, "Robert Khasanov" <robert.khasanov@intel.com>, llvmdev@cs.uiuc.edu, "Ayal Zaks"
<ayal.zaks@intel.com>
Sent: Tuesday, December 23, 2014 8:47:12 PM
Subject: RE: [LLVMdev] Indexed Load and Store Intrinsics - proposal

>>>I don't understand what you mean. Why do you feel the information
>>>will be difficult to recover later from vector GEPs feeding the
>>>scatter/gather intrinsics?

As the intrinsic only hold vector of pointers, we would have to trace
back to GEPs, and decompose the address calculation of each
elements, for base and offset ... etc, right?

Correct.

For vector of pointers
case, the process is to get base and offset, and compose them, feed
it into intrinsic, later decompose them when needed.

Yes.

If we keep base
and offset, it does simplify the compose and decompose process in
some cases, right?

It does, but it also makes the infrastructure less generic (and duplicates, within the intrinsic, functionality already provided by the vector GEP), and it complicates any code that needs to analyze the intrinsic. It is not clear that it makes things sufficiently simpler to justify that complexity.

-Hal

Why shouldn't the IR representation simply be a load from a vector of arbitrary pointers?

Such a load could indeed serve as a general form of a gather or scatter. As Elena responded, we can propose two distinct intrinsics: one with a vector of pointers, and another with (non-zero) base, a vector of indices, and a scale implicitly inferred from the element type.

The motivation for the latter stems from vectorizing a load or store to "b[i]", where b is invariant. Broadcasting b and using a vector gep to feed a vector of pointers, to be pattern matched and folded later, may work. The alternative intrinsic proposed keeps b scalar and uses a vector of indices for i. In any case, it's important to recognize such common patterns, at least for x86, so it could deserve an x86 intrinsic. But it's a general pattern that could potentially serve other implementations; any other gathers to consider at the moment?

I think ARM supports a very limited form of gather/scatter through its VLD2/3/4 and VST2/3/4 interleaved load and store instructions.

For example, this instruction performs one gather load to d0, and another one to d1:

VLD2.8 {d0, d1}, [r0]

Using the proposed 'vector index' intrinsics, this would translate to IR like:

%scale = i32 1
%indices0 = <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
%indices1 = <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
%d0 = @llvm.uindex.load(i8* %addr.r0, <8 x i32> %indices0, i32 %scale) ; <8 x i8>
%d1 = @llvm.uindex.load(i8* %addr.r0, <8 x i32> %indices1, i32 %scale) ; <8 x i8>

Same for the 3-element variant:

VLD3.8 {d0, d1, d2}, [r0]

%scale = i32 1
%indices0 = <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
%indices1 = <8 x i32> <i32 1, i32 4, i32 7, i32 10, i32 13, i32 16, i32 19, i32 22>
%indices2 = <8 x i32> <i32 2, i32 5, i32 8, i32 11, i32 14, i32 17, i32 20, i32 23>
%d0 = @llvm.uindex.load(i8* %addr.r0, <8 x i32> %indices0, i32 %scale) ; <8 x i8>
%d1 = @llvm.uindex.load(i8* %addr.r0, <8 x i32> %indices1, i32 %scale) ; <8 x i8>
%d2 = @llvm.uindex.load(i8* %addr.r0, <8 x i32> %indices2, i32 %scale) ; <8 x i8>

This pattern comes up with code that converts data from AoS to SoA, for example when doing whole-function vectorization (e.g. if b is an array of vectors, due to scalarization).

It is quite limited though (sequential indices, fixed scale), and probably more difficult to match than single intrinsics.

Pierre-Andre

I think ARM supports a very limited form of gather/scatter through
VLD2/3/4 and VST2/3/4 interleaved load and stores instructions.

Yes, such *strided* gathers/scatters are an important special case that deserves a separate discussion.

Thanks,
Ayal.

Hi Elena,

I think such intrinsics are very useful.
Do you have any plans to upstream them?

Thanks,
-Hao

hi Hao,

I started upstreaming them, and the second patch is currently stalled in review.

- Elena

From: "Elena Demikhovsky" <elena.demikhovsky@intel.com>
To: "Hao Liu" <haoliuts@gmail.com>
Cc: llvmdev@cs.uiuc.edu
Sent: Sunday, March 15, 2015 5:21:14 AM
Subject: Re: [LLVMdev] Indexed Load and Store Intrinsics - proposal

hi Hao,

I started to upstream and the second patch is stalled under review
now.

Is this the patch that Craig was working on? I thought we had hit some problem with the TableGen changes. If it is just awaiting review, please ping it.

-Hal