Hi all,
Intel has just revealed its AVX2 instruction set, to be supported by the 2013 Haswell architecture, and it’s looking quite revolutionary: http://software.intel.com/en-us/forums/showthread.php?t=83399&o=a&s=lr
It includes powerful ‘gather’ instructions, which allow reading multiple vector elements from non-contiguous memory locations. It also extends all integer vector instructions to 256-bit, and can shift vector elements by independent counts. This offers tremendous opportunity for auto-vectorizing loops, since practically every scalar operation will have a direct vector equivalent. It also facilitates implementing throughput computing languages like OpenCL.
So I was wondering whether in LLVM a gather operation is best represented with a ‘load’ instruction taking vector operands, or whether it’s better to define it as a separate ‘gather’ instruction. What would be the pros and cons of each approach, and what do you think should be the long-term goals for the LLVM instruction set?
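For anyone not familiar with the two operations mentioned above, here is a minimal scalar sketch in C of what a gather and a per-element shift compute. The function names and signatures are hypothetical and only illustrate the semantics; they are not AVX2 intrinsics or LLVM constructs. Treating shift counts of 32 or more as producing zero is an assumption of this sketch, made to avoid undefined behaviour in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for a gather (hypothetical helper):
   each destination lane reads from its own, possibly
   non-contiguous, location: dst[i] = base[idx[i]]. */
void gather_i32(int32_t *dst, const int32_t *base,
                const int32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = base[idx[i]];
}

/* Scalar reference for a shift by independent counts
   (hypothetical helper): each lane uses its own count.
   Counts >= 32 give 0 here; that is an assumption of this
   sketch, since shifting a 32-bit value by 32 or more is
   undefined in C. */
void shift_left_var_u32(uint32_t *dst, const uint32_t *src,
                        const uint32_t *count, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (count[i] < 32) ? (src[i] << count[i]) : 0u;
}
```

A vectorizer would in effect replace each of these loops by one vector operation per group of lanes, which is why the representation of the gather in the IR matters.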
Cheers,
Nicolas
Jose Fonseca <jfonseca@vmware.com> writes:
> The important thing IMO, is to not represent the gather operation as
> an instruction which takes a vector of pointers, because that's too
> restrictive for architectures with 64bits pointers.
How is it restrictive?
> What one most frequently wants to do in those architectures is to specify a
> 64bit scalar base pointer with a vector of 32bit offsets.
Or 64-bit offsets. We should not restrict offsets to 32 bits.
-Dave
"Nicolas Capens" <nicolas.capens@gmail.com> writes:
> Hi all,
> Intel has just revealed its AVX2 instruction set, to be supported by the 2013 Haswell architecture, and it's looking quite
> revolutionary: http://software.intel.com/en-us/forums/showthread.php?t=83399&o=a&s=lr
Hooray!
But boo! No 64-bit integer multiply yet. 
In any case, once I get the major AVX changes in, it should be almost
trivial to add HNI support. I've got some more patches ready to go in
ASAP. We're very close to sending "the big one" up. 
-Dave
greened@obbligato.org (David A. Greene) writes:
> Jose Fonseca <jfonseca@vmware.com> writes:
>> The important thing IMO, is to not represent the gather operation as
>> an instruction which takes a vector of pointers, because that's too
>> restrictive for architectures with 64bits pointers.
> How is it restrictive?
Ah, I think you mean you don't want it ONLY to allow a vector of
pointers. I absolutely agree with this view.
>> What one most frequently wants to do in those architectures is to specify a
>> 64bit scalar base pointer with a vector of 32bit offsets.
> Or 64-bit offsets. We should not restrict offsets to 32 bits.
To reiterate, a base address + vector of indices gets my vote. If the
base happens to be zero and the indices happen to be pre-scaled pointer
values, so be it.
This raises the question of whether indices get scaled by the vector
element type size. This would complicate the semantics of load, I
think, because getelementptr is really the instruction that does the
scaling. If we have a gather operation (whether a load with vector of
indices or a special instruction) it seems that we will need some kind
of vector getelementptr as well.
-Dave
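To make the "base address + vector of indices" form and the scaling question concrete, here is a minimal scalar sketch in C. The helper `gather_f32` and its explicit `scale` parameter are hypothetical and chosen only for illustration; they do not correspond to any existing LLVM instruction or intrinsic. The sketch assumes a flat address space in which a pointer round-trips through `uintptr_t`, which holds on mainstream platforms but is implementation-defined in C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: lane i loads a float from the byte
   address base + index[i] * scale. Indices are 64-bit so
   that offsets are not restricted to 32 bits. */
void gather_f32(float *dst, uintptr_t base,
                const int64_t *index, size_t scale, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uintptr_t addr = base + (uintptr_t)index[i] * scale;
        /* memcpy avoids alignment and aliasing assumptions. */
        memcpy(&dst[i], (const void *)addr, sizeof(float));
    }
}
```

With `base` set to the start of an array and `scale` set to `sizeof(float)`, the indices are element indices and the scaling is what getelementptr would normally do. With `base` set to 0 and `scale` set to 1, the indices are pre-scaled absolute addresses, so the same operation also covers the vector-of-pointers case.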
greened@obbligato.org (David A. Greene) writes:
>> Jose Fonseca <jfonseca@vmware.com> writes:
>>
>>> The important thing IMO, is to not represent the gather operation as
>>> an instruction which takes a vector of pointers, because that's too
>>> restrictive for architectures with 64bits pointers.
>>
>> How is it restrictive?
>
> Ah, I think you mean you don't want it ONLY to allow a vector of
> pointers. I absolutely agree with this view.
>
>>> What one most frequently wants to do in those architectures is to
>>> specify a 64bit scalar base pointer with a vector of 32bit offsets.
>>
>> Or 64-bit offsets. We should not restrict offsets to 32 bits.
>
> To reiterate, a base address + vector of indices gets my vote. If the
> base happens to be zero and the indices happen to be pre-scaled pointer
> values, so be it.
Precisely.
> This raises the question of whether indices get scaled by the vector
> element type size. This would complicate the semantics of load, I
> think, because getelementptr is really the instruction that does the
> scaling. If we have a gather operation (whether a load with vector of
> indices or a special instruction) it seems that we will need some kind
> of vector getelementptr as well.
>
> -Dave
Good question.
At any rate, I agree with everybody here that we should start with intrinsics.
Jose