Vector promotions for calling conventions

The X86-64 calling convention (annoyingly) specifies that "struct x { float a,b,c,d; }" is passed or returned in the low two elements of two separate XMM registers. For example, returning one returns "a,b" in the low elements of XMM0 and "c,d" in the low elements of XMM1. Both llvm-gcc and clang currently generate atrocious IR for these structs, which you can see if you compile this:

struct x { float a,b,c,d; };
struct x foo(struct x *P) { return *P; }

The machine code generated by llvm-gcc[*] for this is:

_foo:
  movl (%rdi), %eax
  movl 4(%rdi), %ecx
  shlq $32, %rcx
  addq %rax, %rcx
  movd %rcx, %xmm0
  movl 8(%rdi), %eax
  movl 12(%rdi), %ecx
  shlq $32, %rcx
  addq %rax, %rcx
  movd %rcx, %xmm1
  ret

when we really just want:

_foo:
  movq (%rdi), %xmm0
  movq 8(%rdi), %xmm1
  ret

I'm looking at having clang generate IR for this by passing and returning the two halves as v2f32 values, which they are, and doing insert/extracts in the caller/callee. However, at the moment, the x86 backend passes each element of the v2f32 as a separate f32, instead of promoting the type and passing the v2f32 as the low two elements of a v4f32. In the example above, this means the four elements come back in XMM0, XMM1, XMM2, and XMM3 instead of just XMM0 and XMM1.
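
To make the split described above concrete, here is a rough C-level sketch of the insert/extract step and of the promotion. It is only an illustration, not what clang or the backend actually emits: the v2f32 and v4f32 types are modelled as plain structs, the helper names (split_x, join_x, promote_v2f32) are made up for this example, and zeroing the upper two lanes on promotion is an assumption (the post only says the value lives in the low two elements).

```c
#include <assert.h>

struct x { float a, b, c, d; };

/* Hypothetical stand-ins for the v2f32 and v4f32 vector types. */
typedef struct { float e[2]; } v2f32;
typedef struct { float e[4]; } v4f32;

/* "Extract" side: split the struct into the two halves that the
   calling convention assigns to XMM0 (a,b) and XMM1 (c,d). */
void split_x(const struct x *s, v2f32 *lo, v2f32 *hi) {
  assert(s && lo && hi);
  lo->e[0] = s->a; lo->e[1] = s->b;
  hi->e[0] = s->c; hi->e[1] = s->d;
}

/* "Insert" side: rebuild the struct from the two halves. */
struct x join_x(v2f32 lo, v2f32 hi) {
  struct x s = { lo.e[0], lo.e[1], hi.e[0], hi.e[1] };
  return s;
}

/* Promotion: carry a v2f32 in the low two lanes of a v4f32.
   Assumption: the upper lanes are don't-care; they are zeroed here. */
v4f32 promote_v2f32(v2f32 v) {
  v4f32 w = { { v.e[0], v.e[1], 0.0f, 0.0f } };
  return w;
}
```

With this model, each half occupies one register-sized v4f32 after promotion, so the whole struct needs two registers rather than four.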

We already do this sort of vector promotion for operators in type legalization. Is there any reason not to do it for the calling convention case? Is there anyone interested in working on this? :)

-Chris

[*] Clang happens to generate good machine code for this case, but the IR is still awful and it falls down hard on other similar cases.