Codegen for vector float->double cast fails on x86 above SSE3

I've isolated a bug in SSE codegen to the attached example.

  define void @f(<2 x float>* %in, <2 x double>* %out) {
    %0 = load <2 x float>* %in, align 8
    %1 = fpext <2 x float> %0 to <2 x double>
    store <2 x double> %1, <2 x double>* %out, align 1
    ret void
  }

The code should load a <2 x float> vector from %in, fpext cast it to a
<2 x double>, and do an unaligned store (movupd) of the result to
%out. This works as expected on earlier SSE targets, generating this
with llc -mcpu=core2:

  movss (%rdi), %xmm1
  movss 4(%rdi), %xmm0
  cvtss2sd %xmm0, %xmm0
  cvtss2sd %xmm1, %xmm1
  unpcklpd %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[0]
  movupd %xmm1, (%rsi)

Load both floats, convert each to double (cvtss2sd), pack them into one vector (unpcklpd), and store.

But with llc -mcpu=penryn or greater, it yields nonsense:

  movq (%rdi), %xmm0
  pshufd $16, %xmm0, %xmm0 ## xmm0 = xmm0[0,0,1,0]
  movdqu %xmm0, (%rsi)

vec_cast.ll (406 Bytes)

vec_cast.sse3.s (368 Bytes)

vec_cast.sse4.s (303 Bytes)

Hi Jonathan,

Great bug report!

Please file it in Bugzilla: