Is it possible to load a value into a vector register and broadcast it in LLVM?

For example, for the following address %x

%x = getelementptr inbounds %struct._Ray* %ray, i32 0, i32 0, i32 0

instead of loading the value at %x into a scalar register %0:

%0 = load double* %x, align 4, !tbaa !0

I want to load it into a <2 x double> vector register %1 and make both of the two elements in %1 be the value at %x.

I guess one way to do this is to make getelementptr return a <2 x i32>* address, where the two addresses in the <2 x i32> are the same. But I don’t know whether this is possible in LLVM.

Any help would be appreciated.

Best,

Zhi

The canonical way to do it would be to load into a scalar, and then broadcast the scalar using a shufflevector.

Hopefully, the backend will be smart enough to match this as a single load+broadcast, if the platform has such an instruction.
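As a source-level illustration of the load-then-broadcast pattern described above, here is a minimal C++ sketch. It assumes a compiler that supports the GCC/Clang `vector_size` extension, and the helper name `load_splat2` is hypothetical. Whether the compiler ends up emitting a single load-and-broadcast instruction depends on the target and the enabled features.

```cpp
// Assumption: GCC or Clang, which provide the vector_size extension.
// A 16-byte vector type holding two doubles.
typedef double v2f64 __attribute__((vector_size(16)));

// Load one scalar from memory, then place it in both lanes.
// The name load_splat2 is hypothetical, chosen for this sketch.
static v2f64 load_splat2(const double *p) {
    double s = *p;      // scalar load
    v2f64 v = {s, s};   // broadcast the scalar to both lanes
    return v;
}
```

Calling `load_splat2(&x)` with `x = 3.5` yields a vector whose two lanes both compare equal to 3.5.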

(The trick you’re suggesting with getelementptr will probably be possible soon, now that gather intrinsics are being introduced, but it will probably produce worse code, not better.)

Michael

Hi Zhi,

If I understand your question correctly, yes, you can do it by using the IRBuilder’s CreateVectorSplat() API.

/// \brief Return a vector value that contains \arg V broadcasted to \p
/// NumElts elements.
Value *CreateVectorSplat(unsigned NumElts, Value *V, const Twine &Name = "")

In your case, the Value V will be your loaded value %0 and NumElts will be 2.

So after %0 = load double* %x, align 4, !tbaa !0

you will get a sequence of LLVM-IR

%1= insertelement <2 x double > %0, …

%2= shufflevector <2 x double > %1, …

%2 will be your desired value.
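To make it clearer why an insertelement followed by a shufflevector produces a splat, here is a small model of the two instructions in plain C++. This is only a sketch of their semantics, not the LLVM API: all function names are hypothetical, the shuffle is simplified to a single source vector, and `undef` is modeled as a zero-initialized array.

```cpp
#include <array>
#include <cstddef>

// Plain C++ model of the two IR instructions; this is NOT the LLVM API.
// All names are hypothetical and exist only for illustration.

// Models `insertelement`: returns a copy of vec with lane idx replaced by s.
template <std::size_t N>
std::array<double, N> insert_element(std::array<double, N> vec, double s,
                                     std::size_t idx) {
    vec[idx] = s;
    return vec;
}

// Models `shufflevector` restricted to one source vector:
// result lane i takes source lane mask[i].
template <std::size_t N>
std::array<double, N> shuffle_vector(const std::array<double, N> &src,
                                     const std::array<std::size_t, N> &mask) {
    std::array<double, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = src[mask[i]];
    return out;
}

// Splat: put s in lane 0, then select lane 0 for every result lane.
template <std::size_t N>
std::array<double, N> splat(double s) {
    std::array<double, N> undef{};           // stand-in for `undef` (zeros here)
    std::array<std::size_t, N> zero_mask{};  // all-zero shuffle mask
    return shuffle_vector(insert_element(undef, s, 0), zero_mask);
}
```

Because every mask entry is 0, only lane 0 of the intermediate vector is ever read, which is why the other lanes can be left undefined in the IR.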

Regards,

Shahid

Hi Shahid,

Thank you so much for your response. Your suggested approach is what I am using right now. However, the overhead seems a bit high because it introduces two more instructions. I was wondering if there is a cheaper way to do it.

Best,

Zhi

Hi Kuperstein,

It seemed that the backend would generate a VMOVAPD and a VPERMILPD. However, it also introduced many spills. I don’t quite understand why.

Best,

Zhi

Hi Zhi,

At the IR level, yes, there is an overhead of two more instructions. However, as Michael has pointed out, the backend may fold them into a single instruction wherever such an instruction is available.

Regards,

Shahid

Zhi -

If your IR is not ending up as the expected splat instructions (simple AVX examples below), please file a bug.

$ cat broadcast.ll
define <2 x double> @v2f64(double* %d) {
  %ld = load double, double* %d
  %v = insertelement <2 x double> undef, double %ld, i32 0
  %sh = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 0, i32 0>
  ret <2 x double> %sh
}

define <4 x double> @v4f64(double* %d) {
  %ld = load double, double* %d
  %v = insertelement <4 x double> undef, double %ld, i32 0
  %sh = shufflevector <4 x double> %v, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 0, i32 0>
  ret <4 x double> %sh
}

$ ./llc broadcast.ll -o - -mattr=avx
_v2f64: ## @v2f64
  vmovddup (%rdi), %xmm0 ## xmm0 = mem[0,0]
  retq

_v4f64: ## @v4f64
  vbroadcastsd (%rdi), %ymm0
  retq

Hi Snayay,

It now produces the vmovddup and vbroadcastsd instructions when I add the -mattr=avx option. Thanks.

Best,

Zhi