Question about llvm vectors


I love LLVM vectors, yet I wonder why some advanced vector operations are specific to certain CPU targets.

Let me take an example:

/// Horizontally adds the adjacent pairs of values contained in two
/// 128-bit vectors of [4 x float].
/// \headerfile <x86intrin.h>
/// This intrinsic corresponds to the VHADDPS instruction.
/// \param __a
/// A 128-bit vector of [4 x float] containing one of the source operands.
/// The horizontal sums of the values are stored in the lower bits of the
/// destination.
/// \param __b
/// A 128-bit vector of [4 x float] containing one of the source operands.
/// The horizontal sums of the values are stored in the upper bits of the
/// destination.
/// \returns A 128-bit vector of [4 x float] containing the horizontal sums of
/// both operands.
static inline __m128 __DEFAULT_FN_ATTRS
_mm_hadd_ps(__m128 __a, __m128 __b)
{
  return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
}

Here clang will translate _mm_hadd_ps to a CPU-specific feature.
Why not create __builtin_vector_hadd(a, b), which would select the CPU-specific instruction or a generic fallback implementation?

Many thanks,

I’m not sure everyone would agree that a __builtin_vector_hadd should do what the X86 instruction does. It takes two vectors and produces a result with elements from both vectors. Someone might argue that a horizontal add should just take one source and produce a vector with half the number of elements. Someone else might argue that a horizontal add should sum all the elements to a single scalar value. With different implementation choices like that, it’s hard to say it should be a generic operation when the behavior might only make sense for one target’s instruction set.

The behavior of the 256-bit vhaddps instruction on X86 is also weird since it treats the upper and lower 128 bits of the sources and destination independently. That quirk wouldn’t make sense in a generic operation.

You can emulate __builtin_ia32_haddps generically using __builtin_shufflevector and the + operator. The X86 backend should recognize it and use haddps.
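As a minimal sketch of that emulation, assuming GCC-style vector types: the helper name generic_hadd and the macro SKETCH_HAS_SHUFFLEVECTOR are made up for illustration, and the lane-by-lane fallback is only there for compilers that lack __builtin_shufflevector. Whether the backend actually selects haddps for this pattern depends on the target and compiler version.

```cpp
// Sketch only: generic_hadd is a hypothetical helper, not a Clang builtin.
typedef float v4f __attribute__((vector_size(16)));

// Detect __builtin_shufflevector; some older compilers do not provide it.
#ifdef __has_builtin
#  if __has_builtin(__builtin_shufflevector)
#    define SKETCH_HAS_SHUFFLEVECTOR 1
#  endif
#endif

static inline v4f generic_hadd(v4f a, v4f b) {
#ifdef SKETCH_HAS_SHUFFLEVECTOR
    // Indices 0-3 select lanes of a, indices 4-7 select lanes of b.
    v4f even = __builtin_shufflevector(a, b, 0, 2, 4, 6);  // a0 a2 b0 b2
    v4f odd  = __builtin_shufflevector(a, b, 1, 3, 5, 7);  // a1 a3 b1 b3
#else
    // Fallback with the same lane layout, written lane by lane.
    v4f even = {a[0], a[2], b[0], b[2]};
    v4f odd  = {a[1], a[3], b[1], b[3]};
#endif
    // Lane layout of the 128-bit haddps result: a0+a1, a2+a3, b0+b1, b2+b3.
    return even + odd;
}
```

Calling generic_hadd on {1, 2, 3, 4} and {10, 20, 30, 40} gives {3, 7, 30, 70}, which matches the layout described in the _mm_hadd_ps comment above.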


Hi Craig,

Thank you very much for your answer.

I did not want to discuss the exact semantics and name of one operation, but rather to raise the question “would it be beneficial to have more vector builtins?”.

You wrote that the compiler will recognize a pattern and replace it with __builtin_ia32_haddps when possible, but how can I be sure of that? I would have to disassemble the generated code, right? That is very impractical, isn’t it? It also leads me to understand that each CPU target has a bank of patterns which it can recognize. Wouldn’t that be very similar to having advanced generic vector operations with a CPU-specific implementation for each of those builtins?

Regarding hadd: I agree, the name does not describe very well what it is doing. And yes, hadd could be summing all the vector elements, but I think the usual terminology for that is reduce_add.

In my case I use it for computing the mono signal of a stereo interleaved signal:

a = load(in);
b = load(in + K);
l = shuffle(a, b, 0, 2, 4, 6, …); // l and r have the same size as a
r = shuffle(a, b, 1, 3, 5, 7, …);
m = .5 * (l + r); // m has the same size as a and b which is maybe optimal for memory I/O?
store(m, out);
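The pseudocode above could be written with GCC-style vector types roughly as follows. This is a sketch under assumptions: K is taken to be 4 floats, the function name stereo_to_mono and its frame-count parameter are made up for illustration, and memcpy is used for loads and stores so that no alignment is assumed. A lane-by-lane fallback covers compilers without __builtin_shufflevector.

```cpp
#include <cstddef>
#include <cstring>

typedef float v4f __attribute__((vector_size(16)));

#ifdef __has_builtin
#  if __has_builtin(__builtin_shufflevector)
#    define SKETCH_HAS_SHUFFLEVECTOR 1
#  endif
#endif

// Hypothetical helper: averages interleaved L/R samples into a mono signal.
// `in` holds 2 * frames floats, `out` receives `frames` floats.
void stereo_to_mono(const float* in, float* out, std::size_t frames) {
    const v4f half = {0.5f, 0.5f, 0.5f, 0.5f};
    std::size_t i = 0;
    for (; i + 4 <= frames; i += 4) {
        v4f a, b;
        std::memcpy(&a, in + 2 * i, sizeof a);      // L0 R0 L1 R1
        std::memcpy(&b, in + 2 * i + 4, sizeof b);  // L2 R2 L3 R3
#ifdef SKETCH_HAS_SHUFFLEVECTOR
        v4f l = __builtin_shufflevector(a, b, 0, 2, 4, 6);  // L0 L1 L2 L3
        v4f r = __builtin_shufflevector(a, b, 1, 3, 5, 7);  // R0 R1 R2 R3
#else
        v4f l = {a[0], a[2], b[0], b[2]};
        v4f r = {a[1], a[3], b[1], b[3]};
#endif
        v4f m = half * (l + r);
        std::memcpy(out + i, &m, sizeof m);
    }
    // Scalar tail for frame counts that are not a multiple of 4.
    for (; i < frames; ++i)
        out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);
}
```

Which instructions this compiles to, and whether it beats the half-width variant, would still have to be checked in the generated assembly for the target at hand.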

As you said, I could have m be half the size of a, and I would not need to load b. Which approach would deliver the best performance? Does the compiler recognize both? Maybe there is another valid approach; will the compiler recognize it?

I would also like to discuss reduce_add. There might be multiple correct ways of doing it, but is there one that is faster? Is the same approach always the best, or does it depend on the CPU? I believe that those questions are best answered by the compiler.

Then a side note regarding the clang documentation: __builtin_shufflevector is not referenced there.

Best regards,
Alexandre Bique

__builtin_shufflevector was supposed to be linked here, but due to a mistake in the source file it is generated from, a link was made to __builtin_shufflevector instead. I’ve fixed that, and it should hopefully update in the next day or two.

We have internal intrinsics for reduce_add that are used by the autovectorizers. I could see it making sense to expose those to C as a builtin. For X86 I think we always reduce at each stage by moving the upper half of the vector to the lower half with a shuffle and then adding it to the lower half. I think on some CPUs we use haddps/haddpd to do the last stage of combining element 1 with element 0, but on most CPUs we use a shuffle and an addps/addpd. Intel CPUs use 2 shuffles and an addps/addpd internally to implement haddps/haddpd, and on Intel CPUs there’s only one execution unit that can do the 2 shuffles, so they execute serially before the addps/addpd. So for reductions it is better to just emit a single shuffle in assembly than to use haddps/pd.
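The halving scheme described above can be sketched for a 4 x float vector like this. The function name reduce_add_sketch is hypothetical, and the code only mirrors the shuffle-then-add structure at the source level; it makes no claim about the instructions a given backend emits.

```cpp
typedef float v4f __attribute__((vector_size(16)));

#ifdef __has_builtin
#  if __has_builtin(__builtin_shufflevector)
#    define SKETCH_HAS_SHUFFLEVECTOR 1
#  endif
#endif

// Hypothetical helper: sums the four lanes in two shuffle + add stages.
static inline float reduce_add_sketch(v4f v) {
#ifdef SKETCH_HAS_SHUFFLEVECTOR
    // Stage 1: bring the upper half (lanes 2, 3) down next to the lower half.
    v4f hi = __builtin_shufflevector(v, v, 2, 3, 0, 1);
    v4f s = v + hi;  // lane 0 = v0+v2, lane 1 = v1+v3
    // Stage 2: bring lane 1 down to lane 0.
    v4f sw = __builtin_shufflevector(s, s, 1, 0, 3, 2);
#else
    // Fallback with the same data movement, written lane by lane.
    v4f hi = {v[2], v[3], v[0], v[1]};
    v4f s = v + hi;
    v4f sw = {s[1], s[0], s[3], s[2]};
#endif
    v4f t = s + sw;  // lane 0 = (v0+v2) + (v1+v3)
    return t[0];
}
```

Note that the additions are grouped as (v0+v2) + (v1+v3), so for floating-point inputs the result can differ in rounding from a strict left-to-right sum.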

Thank you very much for the explanation.

I have one more question: it is possible in LLVM IR to call sin() on a vector. Yet I did not find how to do it with clang and I’ve tried various things:


using vec = float __attribute__((vector_size(4 * 4)));

vec fct(vec a)
{
  vec b = std::exp(a);
  //vec b = __builtin_exp(a);
  //vec b{std::exp(a[0]), std::exp(a[1]), std::exp(a[2]), std::exp(a[3])};
  //vec b{__builtin_expf(a[0]), __builtin_expf(a[1]), __builtin_expf(a[2]), __builtin_expf(a[3])};
  return b;
}

Do you know how to do that?
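For reference, a portable fallback that does compile is the lane-by-lane form, sketched below. The name exp_lanes is made up for illustration. This does not by itself produce a vector exp call in the IR; whether the compiler turns the loop into one depends on the compiler version and the options used.

```cpp
#include <cmath>

typedef float vec __attribute__((vector_size(4 * 4)));

// Hypothetical helper: applies std::exp to each of the four lanes.
vec exp_lanes(vec a) {
    vec b = {};
    for (int i = 0; i < 4; ++i)
        b[i] = std::exp(a[i]);  // float overload of std::exp
    return b;
}
```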