Win64 Calling Convention problem

Hi!

I have discovered a problem with LLVM's interpretation of the Win64
calling convention w.r.t. passing of aggregates as arguments. The
following code is part of my host application that is compiled with
Visual Studio 2005 in 64-bit debug mode. noise4 expects a structure of
four floats as its first and only argument, which is - in accordance
with the specs of the Win64 calling convention - passed by pointer.

--- snip ---
struct float4 { float x, y, z, w; };

float noise4(float4 v)
{
0000000140067AE0 mov qword ptr [rsp+8],rcx
0000000140067AE5 push rdi
0000000140067AE6 sub rsp,10h
0000000140067AEA mov rdi,rsp
0000000140067AED mov rcx,4
0000000140067AF7 mov eax,0CCCCCCCCh
0000000140067AFC rep stos dword ptr [rdi]
0000000140067AFE mov rcx,qword ptr [rsp+20h]
  return v.x + v.y;
0000000140067B03 mov rax,qword ptr [v]
0000000140067B08 mov rcx,qword ptr [v]
0000000140067B0D movss xmm0,dword ptr [rax]
0000000140067B11 addss xmm0,dword ptr [rcx+4]
0000000140067B16 add rsp,10h
0000000140067B1A pop rdi
0000000140067B1B ret
}
--- snip ---

noise4 is supposed to be called by jitted LLVM code, just like in the
following example.

--- snip ---
target datalayout =
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-n8:16:32:64"
target triple = "x86_64-pc-win32"

%0 = type opaque
%float4 = type { float, float, float, float }

define void @main(%float4* noalias nocapture, %0* noalias nocapture) nounwind {
  %3 = call float @"noise$float4"(%float4 { float 1.000000e+000, float 2.000000e+000, float 3.000000e+000, float 4.000000e+000 }) ; <float> [#uses=4]
  %4 = insertvalue %float4 undef, float %3, 0 ; <%float4> [#uses=1]
  %5 = insertvalue %float4 %4, float %3, 1 ; <%float4> [#uses=1]
  %6 = insertvalue %float4 %5, float %3, 2 ; <%float4> [#uses=1]
  %7 = insertvalue %float4 %6, float %3, 3 ; <%float4> [#uses=1]
  store %float4 %7, %float4* %0
  ret void
}

declare float @"noise$float4"(%float4) nounwind readnone
--- snip ---

When compiling this module with llc (Intel assembler syntax) I get the
following code. As you can see, the float4 argument is not passed to
the noise-function by pointer. Instead, noise is treated as if it
expected four individual floats as arguments, which are passed in the
registers XMM0-XMM3.

--- snip ---
  .data
  ALIGN 4
$CPI1_0: ; constant float
  dd 1065353216 ; float 1.000000e+000
$CPI1_1: ; constant float
  dd 1073741824 ; float 2.000000e+000
$CPI1_2: ; constant float
  dd 1077936128 ; float 3.000000e+000
$CPI1_3: ; constant float
  dd 1082130432 ; float 4.000000e+000
  .text
  ALIGN 16
  .globl _main
_main: ; @main
; BB#0:
  sub RSP, 40
  mov QWORD PTR [RSP + 32], RSI ; Spill
  mov RSI, RCX
  movss XMM0, DWORD PTR [RIP + ($CPI1_0)]
  movss XMM1, DWORD PTR [RIP + ($CPI1_1)]
  movss XMM2, DWORD PTR [RIP + ($CPI1_2)]
  movss XMM3, DWORD PTR [RIP + ($CPI1_3)]
  call _noise$float4
  movss DWORD PTR [RSI + 12], XMM0
  movss DWORD PTR [RSI + 8], XMM0
  movss DWORD PTR [RSI + 4], XMM0
  movss DWORD PTR [RSI], XMM0
  mov RSI, QWORD PTR [RSP + 32] ; Reload
  add RSP, 40
  ret
--- snip ---

This clearly doesn't work and I'd be glad if someone could look into
this issue. Other than that I'm pleased to say that my experiences
with 64-bit code generation on Windows have been very positive. Great
job!

Best regards,
Stephan

Hello

> When compiling this module with llc (Intel assembler syntax) I get the
> following code. As you can see, the float4 argument is not passed to
> the noise-function by pointer. Instead, noise is treated as if it
> expected four individual floats as arguments, which are passed in the
> registers XMM0-XMM3.

That's correct behaviour. The ABI in question is the C/C++ ABI, and it is the
frontend's responsibility to lower things like struct-by-value into
ABI-compliant IR.

So, in short: you need to pass a pointer to the struct in your IR.
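
To make the advice above concrete, here is a minimal C++ sketch of what "pass a pointer to the struct" amounts to at the source level. The names `noise4_lowered` and `call_noise4` are hypothetical and only model the rule discussed in this thread for a 16-byte aggregate: the caller makes a temporary copy and hands its address to the callee. In the IR this would correspond to declaring the callee with a `%float4*` parameter instead of a `%float4` one.

```cpp
// Same layout as the float4 from the original post.
struct float4 { float x, y, z, w; };

// Hypothetical "lowered" form of noise4: the aggregate arrives by pointer,
// which is what the MSVC-compiled callee in the disassembly expects.
float noise4_lowered(const float4* v) {
    return v->x + v->y;
}

// Hypothetical caller-side lowering: copy the by-value argument into a
// temporary and pass the temporary's address.
float call_noise4(float4 arg) {
    float4 tmp = arg;              // caller-owned copy
    return noise4_lowered(&tmp);   // pass by reference
}
```

Calling `call_noise4` with a `float4` holding 1, 2, 3, 4 adds the first two members, just like the original `noise4`.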

Hi!

>> When compiling this module with llc (Intel assembler syntax) I get the
>> following code. As you can see, the float4 argument is not passed to
>> the noise-function by pointer. Instead, noise is treated as if it
>> expected four individual floats as arguments, which are passed in the
>> registers XMM0-XMM3.
>
> That's correct behaviour. The ABI in question is the C/C++ ABI, and it is the
> frontend's responsibility to lower things like struct-by-value into
> ABI-compliant IR.
>
> So, in short: you need to pass a pointer to the struct in your IR.

I don't know. I feel reluctant to generate different IRs for Win32 and
for Win64.

Since the C calling convention is the default for LLVM functions, I
thought that it would map to *the* Win64 calling convention (since
cdecl, fastcall and stdcall are all the same) when a 64-bit build was
used. This is quite confusing ...

Hello

> I don't know. I feel reluctant to generate different IRs for Win32 and
> for Win64.

Unfortunately, you should. Think about the differences between the
_Complex type and struct {double, double}.

From LLVM's point of view these are the same type; however, many ABIs have
special rules for passing / returning _Complex, and this can only be
handled in the frontend.

> Since the C calling convention is the default for LLVM functions, I
> thought that it would map to *the* Win64 calling convention (since
> cdecl, fastcall and stdcall are all the same) when a 64-bit build was
> used. This is quite confusing ...

Yes, the default calling convention is C. But you're not using an
ABI-compliant C compiler to generate the IR.

Thanks, Anton!

I didn't know about exceptions like _Complex that you mentioned. The
only way to support them is to place the burden of correct parameter
passing on the front-end, I understand that now.

So, today I created a new transformation pass that makes sure that
LLVM IR, which works alright with the default Win32 calling
conventions, also plays nice with Win64 code within the limited scope
of my language (that requires packing of aggregates with sizes of 8, 16,
32, and 64 bits into an integer argument, and passing of larger
aggregates by reference).
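
A minimal sketch of the classification rule described above, written as plain C++ rather than as an LLVM pass. The names `ArgClass`, `classifyAggregate`, `packSmallAggregate` and `float2` are hypothetical, and treating every size other than 1, 2, 4 or 8 bytes as by-reference is an assumption based on the rule as stated in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class ArgClass { PackedInteger, ByReference };

// Assumed rule: aggregates of exactly 8, 16, 32 or 64 bits (1, 2, 4 or
// 8 bytes) are packed into an integer argument; all others go by reference.
inline ArgClass classifyAggregate(std::size_t sizeInBytes) {
    switch (sizeInBytes) {
    case 1: case 2: case 4: case 8:
        return ArgClass::PackedInteger;
    default:
        return ArgClass::ByReference;
    }
}

// Copy the raw bytes of a small aggregate into a zero-initialized 64-bit
// integer (they end up in the low-order bytes on little-endian x86-64).
template <typename T>
std::uint64_t packSmallAggregate(const T& value) {
    static_assert(sizeof(T) <= sizeof(std::uint64_t),
                  "aggregate too large to pack into one integer");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof(T));
    return bits;
}

struct float2 { float x, y; };        // 8 bytes: packed into an integer
struct float4 { float x, y, z, w; };  // 16 bytes: passed by reference
```

With these definitions, `classifyAggregate(sizeof(float4))` selects the by-reference path, while a `float2` would be packed with `packSmallAggregate`.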

Best regards,
Stephan

Hello, Stephan

> I didn't know about exceptions like _Complex that you mentioned. The
> only way to support them is to place the burden of correct parameter
> passing on the front-end, I understand that now.

Right. Especially when the ABI dictates quite "weird" rules (read e.g. the
x86-64 ABI or ARM EABI).
We do support one "exception" for win64, though: __m128 vectors. This
was done since they are native types and thus do not get changed by the
type legalizer. I have no idea why MS decided not to pass them in
registers... This is silly...

> So, today I created a new transformation pass that makes sure that
> LLVM IR, which works alright with the default Win32 calling
> conventions, also plays nice with Win64 code within the limited scope
> of my language (that requires packing of aggregates with sizes of 8, 16,
> 32, and 64 bits into an integer argument, and passing of larger
> aggregates by reference).

Sounds like a pretty sane approach :)

Ahoi Anton,

> We do support one "exception" for win64, though: __m128 vectors. This
> was done since they are native types and thus do not get changed by the
> type legalizer. I have no idea why MS decided not to pass them in
> registers... This is silly...

Hmm, maybe I can explain this: The Microsoft docs state that for
unprototyped functions floating-point values need to be passed as
double precision in both the floating-point registers (XMMn) and the
integer registers (RCX, etc.). That obviously doesn't work for SSE
vectors because they don't fit into the integer registers. So, there's
only pass-by-reference of the first four vectors, which they might
have generalized to all functions (even with complete signatures) for
consistency. Still weird though ...
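
The size argument above can be sketched in a few lines of C++. This is only an illustration: `Vec4f` is a hypothetical stand-in for `__m128`, and `integerRegisterImage` is a made-up helper that shows the bit pattern a double would have if it were duplicated into a 64-bit integer register.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 128-bit SSE vector such as __m128.
struct Vec4f { float lanes[4]; };

// A double has the same size as a 64-bit integer register ...
static_assert(sizeof(double) == sizeof(std::uint64_t),
              "a double fits into a 64-bit integer register");
// ... but a 128-bit vector does not, so it cannot be duplicated there.
static_assert(sizeof(Vec4f) > sizeof(std::uint64_t),
              "a 128-bit vector does not fit into a 64-bit integer register");

// Bit pattern a double would have when copied into an integer register.
std::uint64_t integerRegisterImage(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}
```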

> Sounds like a pretty sane approach :)

... which I just modified to *not* massage SSE vectors, as you
mentioned that LLVM would take care of this. :)

Have a great weekend and thanks for your help,
Stephan

Hello, Stephan

> Hmm, maybe I can explain this: The Microsoft docs state that for
> unprototyped functions floating-point values need to be passed as
> double precision in both the floating-point registers (XMMn) and the
> integer registers (RCX, etc.). That obviously doesn't work for SSE
> vectors because they don't fit into the integer registers. So, there's
> only pass-by-reference of the first four vectors, which they might
> have generalized to all functions (even with complete signatures) for
> consistency. Still weird though ...

Right. This is definitely a waste of resources (i.e. registers).
Looks like they didn't care about performance at all!