Failure to optimize vector select


I've found a case that I would expect to optimize easily, but it doesn't. Here is a simple implementation of vector select:

float4 simple_select(float4 a, float4 b, int4 c)
{
    float4 result;

    result.x = c.x ? a.x : b.x;
    result.y = c.y ? a.y : b.y;
    result.z = c.z ? a.z : b.z;
    result.w = c.w ? a.w : b.w;

    return result;
}

I would expect this to be optimized to:

%bool = icmp ne <4 x i32> %c, zeroinitializer
%result = select <4 x i1> %bool, <4 x float> %a, <4 x float> %b
ret <4 x float> %result
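
For reference, the lane-wise behavior being asked for can be modeled in plain C. This is only a sketch: the float4/int4 structs, select_lane, vector_select_model and check_example are stand-ins invented here for illustration, not the OpenCL built-in types or anything from LLVM. The scalar function treats any nonzero lane of c as true, so the model compares each lane with != 0 before selecting.

```c
#include <assert.h>

/* Hypothetical stand-ins for the OpenCL vector types. */
typedef struct { float x, y, z, w; } float4;
typedef struct { int x, y, z, w; } int4;

/* Scalar form, as written in the original post. */
static float4 simple_select(float4 a, float4 b, int4 c)
{
    float4 result;

    result.x = c.x ? a.x : b.x;
    result.y = c.y ? a.y : b.y;
    result.z = c.z ? a.z : b.z;
    result.w = c.w ? a.w : b.w;

    return result;
}

/* One lane of a vector compare-against-zero followed by a select. */
static float select_lane(int c, float a, float b)
{
    int cond = (c != 0);   /* models: icmp ne i32 %c, 0 */
    return cond ? a : b;   /* models: select i1 %cond, float %a, float %b */
}

/* All four lanes at once: the shape a single vector icmp + select would have. */
static float4 vector_select_model(float4 a, float4 b, int4 c)
{
    float4 r;

    r.x = select_lane(c.x, a.x, b.x);
    r.y = select_lane(c.y, a.y, b.y);
    r.z = select_lane(c.z, a.z, b.z);
    r.w = select_lane(c.w, a.w, b.w);

    return r;
}

/* Usage: both forms agree, including for negative (nonzero) conditions. */
static void check_example(void)
{
    float4 a = { 1.0f, 2.0f, 3.0f, 4.0f };
    float4 b = { -1.0f, -2.0f, -3.0f, -4.0f };
    int4 c = { 1, 0, -5, 0 };
    float4 s = simple_select(a, b, c);
    float4 v = vector_select_model(a, b, c);

    assert(s.x == v.x && s.y == v.y && s.z == v.z && s.w == v.w);
}
```

The two functions compute the same result, which is why the four scalar compare/select pairs are a candidate for a single vector compare and select.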

However, it actually ends up as four separate extractelement/icmp/select sequences.

Where would be the best place to fix this? Should InstCombine be taking care of this or the vectorizer?


Have you tried running SLP vectorizer pass (-vectorize-slp)?


Yes. That was the first thing I tried, and it didn't do anything. I was looking at the vectorizer, but then I saw some things that made me wonder whether it was even supposed to do this.

Can you send the IR of the function?

I suspect that in the IR you will see a sequence of inserts. At the moment the SLP-vectorizer does not look at “insert” sequences, but it should be really easy (and beneficial) to make it do so.

Attached are the -O0 and -O3 IR files.

vselect_optimized.ll (1.51 KB)

vselect_unoptimized.ll (4.32 KB)

Hi Matt,

This code maintains a vector of float4 and it inserts and extracts values from this vector. The ’select’ operations are already vectorized. Maybe a sequence of inst-combines (or DAG-combines) can help. If you re-write this code using scalars then the slp-vectorizer, with some tweaks, will be able to catch it.


I think what Matt was asking is why the SLP-vectorizer is not vectorizing the booleans. To me it seems like the vectorizer got the first step right (vectorizing the operands), but not the second step (vectorizing the comparison operation). I would actually expect a single icmp ne <4 x i32> %c, <4 x i32><i32 0, i32 0, i32 0, i32 0> instead of four icmps.


I've tried manually scalarizing the arguments so that the other select operands are scalars, but the vectorizer still doesn't change it. Here is the scalarized IR.

manual_scalarize.ll (1.7 KB)

Hi Matt,

We are really close. :) Now, all you have to do is teach the SLP-vectorizer to start looking at “trees” that start with this pattern:

  %ra = insertelement <4 x float> undef, float %s0, i32 0
  %rb = insertelement <4 x float> %ra, float %s1, i32 1
  %rc = insertelement <4 x float> %rb, float %s2, i32 2
  %rd = insertelement <4 x float> %rc, float %s3, i32 3
  ret <4 x float> %rd

It’s really easy to do. Look at the code in runOnFunction in SLPVectorizer.cpp: just put %s0, %s1, %s2 and %s3 in a list and call tryToVectorize(…).
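
The "put the scalars in a list" step can be illustrated with a toy model in C. This is a sketch under stated assumptions, not LLVM code: the InsertElt struct, the integer scalar ids and collect_build_vector are all hypothetical names made up here, and the real pass works on LLVM's own instruction classes. The idea shown is only the walk from the last insertelement back to undef, gathering one scalar per lane so that the resulting list could be handed to the vectorizer.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

/* Toy stand-in for an insertelement instruction (hypothetical, not an LLVM class). */
typedef struct InsertElt {
    const struct InsertElt *vec;  /* vector operand; NULL models undef */
    int scalar_id;                /* identifies the inserted scalar, e.g. 0 for %s0 */
    unsigned index;               /* constant lane index */
} InsertElt;

/*
 * Walk the chain from the last insert back to undef and record which scalar
 * lands in each lane. Returns 1 only if every lane is written exactly once,
 * meaning the chain builds the whole vector and scalars[] is a candidate list.
 * Chains that write a lane twice are rejected to keep the sketch simple.
 */
static int collect_build_vector(const InsertElt *last, int scalars[LANES])
{
    int seen[LANES] = { 0 };
    unsigned count = 0;

    for (const InsertElt *ie = last; ie != NULL; ie = ie->vec) {
        if (ie->index >= LANES || seen[ie->index])
            return 0;
        seen[ie->index] = 1;
        scalars[ie->index] = ie->scalar_id;
        ++count;
    }
    return count == LANES;
}

/* Usage: models the %ra..%rd chain from the pattern, with ids 0..3 for %s0..%s3. */
static void collect_example(void)
{
    InsertElt ra = { NULL, 0, 0 };
    InsertElt rb = { &ra, 1, 1 };
    InsertElt rc = { &rb, 2, 2 };
    InsertElt rd = { &rc, 3, 3 };
    int scalars[LANES];

    assert(collect_build_vector(&rd, scalars));
    assert(scalars[0] == 0 && scalars[3] == 3);
}
```

Starting the walk at the returned value (%rd in the pattern) is what makes the insert chain a natural seed: the lanes give the order of the scalars in the list.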