Thanks so much for your feedback, Simon.

I am not sure that what I am proposing here is at odds with what you’re referring to (here and in the PR you linked). The key difference AFAICT is that the pattern I am referring to is more aptly described as “reducing scalarization” than as “vectorization”. I say that because the inputs are vectors and the output is also a vector; we just perform the operation on extracted elements rather than on the input vectors themselves.

In the PR you linked, there is an example that shows the difference (simplified to <2 x double> for brevity):

```
define dso_local <2 x double> @test(i64 %a, i64 %b) {
entry:
  %conv = uitofp i64 %a to double
  %conv1 = uitofp i64 %b to double
  %vecinit = insertelement <2 x double> undef, double %conv, i32 0
  %vecinit2 = insertelement <2 x double> %vecinit, double %conv1, i32 1
  ret <2 x double> %vecinit2
}
```

The inputs here are scalars, so I suppose it is quite possible (perhaps likely) that on some targets, doing the insert with integers and then converting the vector is cheaper (although this is definitely not the case on PPC).

But what I am proposing here is actually handling something like this:

```
define dso_local <2 x double> @test(<2 x i64> %a) {
entry:
  %vecext = extractelement <2 x i64> %a, i32 0
  %vecext1 = extractelement <2 x i64> %a, i32 1
  %conv = sitofp i64 %vecext to double
  %conv2 = sitofp i64 %vecext1 to double
  %vecinit = insertelement <2 x double> undef, double %conv, i32 0
  %vecinit3 = insertelement <2 x double> %vecinit, double %conv2, i32 1
  ret <2 x double> %vecinit3
}
```

With this type conversion, InstCombine will actually simplify this as expected. And I think that is the right thing to do: I can’t see the scalarized version being cheaper on any target. Since we already do something quite similar in InstCombine, I would assume it would be rather uncontroversial to do it in the SDAG as well.
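
For reference, the vector form I would expect for the function above is roughly the following. This is a hand-written sketch rather than pasted InstCombine output, and the name @testv is only illustrative:

```
define dso_local <2 x double> @testv(<2 x i64> %a) {
entry:
  ; One vector conversion replaces two extracts, two scalar conversions and two inserts.
  %conv = sitofp <2 x i64> %a to <2 x double>
  ret <2 x double> %conv
}
```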

Now, a reasonable question might be “why do it in the SDAG if we already do it in InstCombine?” The short answer is that legalization may well introduce scalarization code, and a subsequent DAG combine can then create an opportunity to remove that scalarization. Here is an example of that:

```
define dso_local <2 x i64> @testv(<2 x i64> %a, <2 x i64> %b) {
entry:
  %sexta = sext <2 x i64> %a to <2 x i128>
  %sextb = sext <2 x i64> %b to <2 x i128>
  %mul = mul nsw <2 x i128> %sexta, %sextb
  %shift = lshr <2 x i128> %mul, <i128 64, i128 64>
  %trunc = trunc <2 x i128> %shift to <2 x i64>
  ret <2 x i64> %trunc
}
```

On PPC, the legalizer will scalarize this since we do not have v2i128. Then the DAG combiner will produce the pattern I am referring to in this RFC:

```
(v2i64 build_vector (mulhs (extractelt %a, 0), (extractelt %b, 0)),
                    (mulhs (extractelt %a, 1), (extractelt %b, 1)))
```

And if the target has mulhs legal for the vector type, this is strictly worse. So no matter what we do in InstCombine or the SLP vectorizer, we will end up with non-optimal code.
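
To be concrete about what I would hope to get instead: on a target where mulhs is legal for v2i64, the build_vector above could presumably be folded into a single vector node. In the same informal notation (a sketch, not an actual DAG dump):

```
(v2i64 mulhs %a, %b)
```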

If we also handle shuffles of input vectors, we can catch things such as the following:

```
define dso_local <4 x float> @test(<4 x i32> %a, <4 x i32> %b) {
entry:
  %vecext = extractelement <4 x i32> %a, i32 0
  %vecext1 = extractelement <4 x i32> %a, i32 1
  %vecext4 = extractelement <4 x i32> %b, i32 2
  %vecext7 = extractelement <4 x i32> %b, i32 3
  %conv = sitofp i32 %vecext to float
  %conv2 = sitofp i32 %vecext1 to float
  %conv5 = sitofp i32 %vecext4 to float
  %conv8 = sitofp i32 %vecext7 to float
  %vecinit = insertelement <4 x float> undef, float %conv, i32 0
  %vecinit3 = insertelement <4 x float> %vecinit, float %conv2, i32 1
  %vecinit6 = insertelement <4 x float> %vecinit3, float %conv5, i32 2
  %vecinit9 = insertelement <4 x float> %vecinit6, float %conv8, i32 3
  ret <4 x float> %vecinit9
}
```

This is equivalent to:

```
define dso_local <4 x float> @testv(<4 x i32> %a, <4 x i32> %b) {
entry:
  %shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %0 = sitofp <4 x i32> %shuffle to <4 x float>
  ret <4 x float> %0
}
```

Of course, this is something we can handle in InstCombine, but I am wondering whether we may again be missing situations where it is the DAG legalizer that creates the scalarization code.