Vector select/compare support in LLVM

Hello,

I started working on adding vector support for the SELECT and CMP instructions in the codegen (bugs: 3384, 1784, 2314).

Currently, the codegen scalarizes vector CMPs into multiple scalar CMPs. It is easy to add similar scalarization support to the SELECT instruction. However, using multiple scalar operations is slower than using vector operations.
In LLVM, vector-compare operations generate a vector of i1s, and the vector-select instruction consumes these vectors. In between, these values (masks) can be manipulated (XORed, ANDed, etc.).
For x86, I would like the codegen to generate the ‘pcmpeq’ and ‘blend’ families of instructions. SSE masks use a 32-bit word per element, where the most significant bit holds the predicate and the remaining bits are ignored. I believe that PPC Altivec and ARM Neon masks work the same way.
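For reference, here is a minimal IR sketch of the flow in question (function and value names are illustrative, not from any real test case):

```llvm
; A vector compare yields a <4 x i1> mask, the mask can be
; manipulated with bitwise ops, and a vector select consumes it.
define <4 x float> @masked_pick(<4 x float> %a, <4 x float> %b,
                                <4 x float> %x, <4 x float> %y) {
  %m1 = fcmp ogt <4 x float> %a, %b       ; <4 x i1>
  %m2 = fcmp one <4 x float> %a, %b       ; <4 x i1>
  %m  = and <4 x i1> %m1, %m2             ; mask manipulation
  %r  = select <4 x i1> %m, <4 x float> %x, <4 x float> %y
  ret <4 x float> %r
}
```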

I can think of two ways to represent masks in x86: sparse and packed. In the sparse method, the masks are kept in <4 x 32-bit> registers, which are mapped to xmm registers. This is the ‘native’ way of using masks.
In the second representation, the packed method, the MSBs are collected from the xmm register into a packed general purpose register. Luckily, SSE has the MOVMSKPS instruction, which converts sparse masks to packed masks. I am not sure which representation is better, but both are reasonable: the former may cause register pressure in some cases, while the latter adds packing/unpacking overhead.
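To make the packed direction concrete (sketch only; the intrinsic name below is the one from IntrinsicsX86.td, the function around it is illustrative):

```llvm
; MOVMSKPS collects the MSB of each 32-bit element into the low
; four bits of a GPR, i.e. it converts a sparse mask to a packed one.
declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

define i32 @pack_mask(<4 x float> %sparse) {
  %packed = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %sparse)
  ret i32 %packed
}
```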

_Sparse_
After my discussion with Duncan, last week, I started working on the promotion of type <4 x i1> to <4 x i32>, and I ran into a problem. It looks like the codegen term ‘promote’ is overloaded. For scalars, the ‘promote’ operation converts scalars to larger bit-width scalars. For vectors, the ‘promote’ operation widens the vector to the next power of two. This is reasonable for types such as ‘<3 x float>’. Maybe we need to add another legalization operation which will mean widening the vectors? In any case, I estimated that implementing this per-element promotion would require major changes and decided that this is not the way to go.

_Packed_
I followed Duncan’s original suggestion which was packing vectors of i1s into general purpose registers.
I started by adding several new types to ValueTypes (.td and .h): ‘v4i1, v8i1, v16i1 … v64i1’. For x86, I mapped the v8i1 .. v64i1 types to general purpose x86 registers. I started playing with a small program which performed a vector CMP on 4 elements. The legalizer promoted v4i1 to the next legal power-of-two type, which was v8i1. I changed WidenVecRes_SETCC and added a new method, WidenVecOp_Select, to handle the legalization of these types. Widening the Select and SETCC ops was simple, since I only widened the operands which needed widening. I am not sure if this is correct, but I ran into more problems before I could test it.
Another problem I had was that i1 types are still promoted to i8. So if I have a vector such as ‘<4 x i1>: <0, 0, 1, 1>’, it is mapped to a ‘BUILD_VECTOR’ DAG node which accepts 4 i8s and returns a single v4i1. This fails somewhere because the cast is illegal. The desired result is that the above vector be translated into the (packed) scalar value ‘3’. I hacked TargetLowering::ReplaceNodeResults and added minimal support for BUILD_VECTOR.

I’d be interested in hearing your suggestions on which direction(s) to proceed.

Thank you,
Nadav

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

I can think of two ways to represent masks in x86: sparse and
packed. In the sparse method, the masks are kept in <4 x 32bit>
registers, which are mapped to xmm registers. This is the ‘native’ way
of using masks.

This argues for the sparse representation, I think.

_Sparse_ After my discussion with Duncan, last week, I started working
on the promotion of type <4 x i1> to <4 x i32>, and I ran into a
problem. It looks like the codegen term ‘promote’ is overloaded.

Heavily. :-/

For scalars, the ‘promote’ operation converts scalars to larger
bit-width scalars. For vectors, the ‘promote’ operation widens the
vector to the next power of two. This is reasonable for types such as
‘<3 x float>’. Maybe we need to add another legalization operation which
will mean widening the vectors?

You mean widening the element type, correct? Yes, that's definitely a
useful concept.

In any case, I estimated that implementing this per-element promotion
would require major changes and decided that this is not the way to
go.

What major changes? I think this will end up giving much better code in
the end. The pack/unpack operations could be very expensive.

There is another huge cost in using GPRs to hold masks. There will be
fewer GPRs to hold addresses, which is a precious resource. We should
avoid doing anything that uses more of that resource unnecessarily.

                             -Dave

Hi David,

The MOVMSKPS instruction is cheap (2 cycles). Not to be confused with VMASKMOV, the AVX masked move, which is expensive.

One of the arguments for packing masks is that it reduces vector-register pressure. Auto-vectorizing compilers maintain multiple masks for different execution paths (one per loop nesting level, etc.). Keeping masks in xmm registers may create vector-register pressure, which will cause spilling of those registers. I agree with you that GP registers are also a precious resource.
I am not sure what the best way to store masks is.

In my private branch, I added the [v4i1 .. v64i1] types. I also implemented a new type of target lowering: "PACK". This lowering packs vectors of i1s into integer registers. For example, the <4 x i1> type would get packed into the i8 type. I modified LegalizeTypes and LegalizeVectorTypes and added legalization for SETCC, XOR, OR, AND, and BUILD_VECTOR. I also changed the x86 lowering of SELECT to prevent lowering of selects with a vector condition operand. Next, I am going to add new patterns for SETCC and SELECT which use i8/i16/i32/i64 as a condition value.
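As a sketch of what "PACK" would do (the IR and types below are illustrative):

```llvm
; Input IR: the select condition is an <8 x i1> vector.
define <8 x i16> @sel(<8 x i16> %a, <8 x i16> %b,
                      <8 x i16> %x, <8 x i16> %y) {
  %c = icmp eq <8 x i16> %a, %b           ; <8 x i1>
  %r = select <8 x i1> %c, <8 x i16> %x, <8 x i16> %y
  ret <8 x i16> %r
}
; Under the packed lowering, the <8 x i1> value of %c would be
; carried in an i8 register, and the SETCC/SELECT patterns would
; be matched against an i8 condition operand.
```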

I also plan to experiment with promoting <4 x i1> to <4 x i32>. At this point I can't really say what needs to be done. Implementing this kind of promotion also requires adding legalization support for strange vector types such as <4 x i65>.
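To make the sparse alternative concrete as well (illustrative IR, not from my branch): after per-element promotion, the <4 x i1> mask would live in a <4 x i32> whose elements are all-ones or all-zeros, and the select would only need each element's sign bit:

```llvm
define <4 x float> @promoted(<4 x i32> %mask, <4 x float> %x,
                             <4 x float> %y) {
  ; recover the <4 x i1> predicate from the sign bit of each element
  %c = icmp slt <4 x i32> %mask, zeroinitializer
  %r = select <4 x i1> %c, <4 x float> %x, <4 x float> %y
  ret <4 x float> %r
}
```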

-Nadav

After I implemented a new type of legalization (the packing of i1 vectors), I found that x86 does not have a way to load packed masks into SSE registers. So, I guess that legalizing <4 x i1> to <4 x i32> is the way to go.

Cheers,
Nadav

Hey,

I am currently forced to create the BLENDVPS intrinsic as an external call (via Intrinsic::x86_sse41_blendvps) which has the following signature (from IntrinsicsX86.td):

def int_x86_sse41_blendvps :
        GCCBuiltin<"__builtin_ia32_blendvps">,
        Intrinsic<[llvm_v4f32_ty], [llvm_v4f32_ty, llvm_v4f32_ty,
                   llvm_v4f32_ty], [IntrNoMem]>;

Thus, it expects the mask (the first operand, if I recall correctly) to be a <4 x float>.
It would be great to have this mirrored in the IR, meaning one should be able to create a SelectInst with 3 <4 x float> operands, which would then generate this intrinsic.
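In other words (sketch; operand order taken from the .td signature above), today I have to emit the explicit intrinsic call rather than a plain select:

```llvm
declare <4 x float> @llvm.x86.sse41.blendvps(<4 x float>, <4 x float>,
                                             <4 x float>)

; What I have to emit today:
define <4 x float> @today(<4 x float> %x, <4 x float> %y,
                          <4 x float> %mask) {
  %r = call <4 x float> @llvm.x86.sse41.blendvps(<4 x float> %x,
                                                 <4 x float> %y,
                                                 <4 x float> %mask)
  ret <4 x float> %r
}
```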
Is there anything that speaks against this?

I think I also recall something similar for ICmp/FCmp instructions...

Best,
Ralf

P.S. I am not up-to-date on the latest status of "direct" support of vector instructions; the corresponding part of my system was written over a year ago.

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

One of the arguments for packing masks is that it reduces
vector-registers pressure. Auto-vectorizing compilers maintain
multiple masks for different execution paths (for each loop nesting,
etc). Saving masks in xmm registers may result in vector-register
pressure which will cause spilling of these registers. I agree with
you that GP registers are also a precious resource.

GPRs are more precious than vector registers in my experience. Spilling
a vector register isn't that painful. Spilling a GPR holding an address
is disastrous.

In my private branch, I added the [v4i1 .. v64i1] types. I also
implemented a new type of target lowering: "PACK". This lowering packs

Is PACK in the X86 namespace? It seems a pretty target-specific thing.

I also plan to experiment with promoting <4 x i1> to <4 x i32>. At
this point I can't really say what needs to be done. Implementing
this kind of promotion also requires adding legalization support for
strange vector types such as <4 x i65>.

How often do we see something like that? Baby steps, baby steps... :-)

                                -Dave

Ralf Karrenberg <Chareos@gmx.de> writes:

Hey,

I am currently forced to create the BLENDVPS intrinsic as an external
call (via Intrinsic::x86_sse41_blendvps) which has the following
signature (from IntrinsicsX86.td):

def int_x86_sse41_blendvps :
GCCBuiltin<"__builtin_ia32_blendvps">,
Intrinsic<[llvm_v4f32_ty],[llvm_v4f32_ty, llvm_v4f32_ty,
llvm_v4f32_ty],[IntrNoMem]>

Thus, it expects the mask (first operand if i recall correctly) to be a
<4 x float>.
It would be great to have this mirrored in the IR, meaning one should be
able to create a SelectInst with 3 <4 x float> operands which would
generate this intrinsic.
Is there anything that speaks against this?

To me a v4i1 makes more sense as an IR mask type. The fact that on X86
the native mask type is v4i32 should be handled by the X86 codegen, I
think.

Another option is to rewrite the intrinsic to take a v4i1. Or more
correctly, create a new intrinsic to live alongside the existing one,
since we want the existing one for gcc compatibility.

                                   -Dave

David,

The problem with the sparse representation is that it is word-width dependent. For 32-bit data types the mask is the 32nd bit of each element, while for 64-bit types it is the 64th bit.

How would you legalize the mask for the following code?

%mask = cmp nge <4 x float> %A, %B ; <4 x i1>
%val = select <4 x i1> %mask, <4 x double> %X, %Y ; <4 x double>

Moreover, in some cases the generator of the mask and the consumer of the mask are in different basic blocks. The legalizer works on one basic block at a time. This makes it impossible for the legalizer to find the 'native' representation.

I wrote down some of the comments which were made in this email thread:

http://wiki.llvm.org/Vector_select

Cheers,
Nadav

Hi Nadav,

The problem with the sparse representation is that it is word-width dependent. For 32-bit data types the mask is the 32nd bit of each element, while for 64-bit types it is the 64th bit.

How would you legalize the mask for the following code?

%mask = cmp nge <4 x float> %A, %B ; <4 x i1>
%val = select <4 x i1> %mask, <4 x double> %X, %Y ; <4 x double>

I would expect this to become

%mask = cmp nge<4 x float> %A, %B with result type <4 x i32>
%mask_lo = extract elements 0, 1 from %mask, result type <2 x i64>
%mask_hi = extract elements 2, 3 from %mask, result type <2 x i64>
%val_lo = select <2 x i64> %mask_lo, <2 x double> %X_lo, %Y_lo
%val_hi = select <2 x i64> %mask_hi, <2 x double> %X_hi, %Y_hi

Moreover, in some cases the generator of the mask and the consumer of the mask are in different basic blocks. The legalizer works on one basic block at a time. This makes it impossible for the legalizer to find the 'native' representation.

I don't understand what you are saying here.

Ciao, Duncan.