Is it legal to pass a half by value on x86_64?

Hello,

I am attempting to understand an anomaly I am seeing when dealing with half on Windows and could use some help.

Using LLVM 8 or 10, if I have IR of the flavor below:
define void @foo(i8, i8, i8, i8, half) {
%6 = alloca half

store half %4, half* %6, align 1

ret void
}

Using x86_64-pc-linux, we convert the float passed in with __gnu_f2h_ieee.
Using x86_64-pc-windows I do not get the conversion, so we end up with incorrect math operations.

While investigating I noticed clang gave me the error below:

error: parameters cannot have __fp16 type; did you forget * ?
void foo(int dc1, int dc2,int dc3,int dc4, __fp16 in)

So, this got me wondering if "define void @foo(i8, i8, i8, i8, half) " is even legal to use or if I should rather pass by ref? I have yet to find documentation to convince me one way or the other. Thus, I was hoping someone here might be able to shed some light on the issue.

Thank you in advance!

Cheers,

JP

I’m not sure how robust the half support is on X86; clang should never generate it. I believe in LLVM 11 it changed to pass half in the lower 16 bits of an integer register instead of in a float register. What does “incorrect math operations” mean? We’re emulating half precision with floats and a conversion function, so I don’t think it will always match exactly what native half would do.

Do you have a more complete IR file for Windows that I can take a look at?
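
To make "emulating half precision with floats and a conversion function" concrete, here is a minimal Python sketch. It is not the __gnu_f2h_ieee implementation; it assumes only that the standard library's struct "e" format (IEEE 754 binary16) is a fair stand-in for the float-to-half conversion. The helper names f2h_bits, h2f and half_add are made up for illustration.

```python
import struct

def f2h_bits(x: float) -> int:
    """Round x to IEEE 754 binary16 and return the 16-bit pattern as an int.

    Hypothetical stand-in for a float-to-half helper such as __gnu_f2h_ieee.
    """
    return struct.unpack("<H", struct.pack("<e", x))[0]

def h2f(bits: int) -> float:
    """Widen a 16-bit binary16 pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def half_add(a_bits: int, b_bits: int) -> int:
    """Emulated half add: widen both operands, add, round the result back."""
    return f2h_bits(h2f(a_bits) + h2f(b_bits))

if __name__ == "__main__":
    print(hex(f2h_bits(2.0)))               # the constant written as 0xH4000 in IR
    print(h2f(f2h_bits(0.1)))               # 0.1 is not exactly representable in half
    print(hex(half_add(f2h_bits(1.0), f2h_bits(1.0))))
```

The point of the sketch is only that every half value passes through a wider type and an explicit rounding step; if that step is skipped, the bits that reach the store are float bits, not half bits.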

Hi Jason,

__fp16 is a pure storage format. You cannot pass it by value, because only types the ABI permits can be passed by value, and __fp16 is not one of them.

  • if "define void @foo(i8, i8, i8, i8, half) " is even legal to use

half, as a target-independent type, is legal in LLVM IR. It is not a legal type on a target without native support, such as X86, so the behavior depends on how we lower it. But I don’t know why there are differences between Linux and Windows. Maybe because “__gnu_f2h_ieee” is a Linux-only function?

__fp16 is a pure storage format. You cannot pass it by value, because only ABI permissive types can be passed by value while __fp16 is not one of them.

Yep. Any specific reason to use a pure storage format? The native type is _Float16 and would give some benefits, but this is not yet supported on x86, see also:

https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point

Cheers,
Sjoerd.

I guess it’s designed for language portability: you can use this type across different platforms. That said, I’m not a front-end expert, so I can’t think of other intentions.

_Float16 is a primitive type in the latest x86 ABI, but no X86 target supports it yet, so you cannot use it on X86 for now. I think that is how it differs from __fp16 and why one should use it.

We also have some discussion here. https://reviews.llvm.org/D97318

Hi All,

Thank you very much for all the great information. This is awesome!

To circle back on Craig’s questions.
I did notice that LLVM 11 behaves very differently.

** Per: What does “incorrect math operations” mean?
The half is passed to the function as a float. The function does operations with other half numbers. On Windows, when we don’t get the float-to-half conversion, the input is always truncated to 0.0.

** Per: “Do you have a more complete IR file for Windows that I can take a look at?”
I can get you our IR if you want, but I think it is more convoluted than required. I was working on a unit test and I think all one needs to see the anomaly is:

define void @foo(i8, i8, i8, i8, half) {
; CHECK-I686: callq __gnu_f2h_ieee

%6 = alloca half
store half %4, half* %6, align 1
ret void
}

x86_64-pc-windows gives:
push rax
.seh_stackalloc 8
.seh_endprologue
movss xmm0, dword ptr [rsp + 48] # xmm0 = mem[0],zero,zero,zero
movss dword ptr [rsp + 4], xmm0 # 4-byte Spill
pop rax
ret
.seh_handlerdata
.text
.seh_endproc

What I find extremely interesting is that the behavior seems to have something to do with the stack. If I drop one of the inputs, even Windows generates the conversion.

define void @foo(i8, i8, i8, half) {
; CHECK-I686: callq __gnu_f2h_ieee

%5 = alloca half
store half %3, half* %5, align 1
ret void
}

x86_64-pc-windows gives:

sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
movabs rax, offset __gnu_f2h_ieee
movaps xmm0, xmm3
call rax
mov word ptr [rsp + 38], ax
add rsp, 40
ret
.seh_handlerdata
.text
.seh_endproc

** If interested, here is a dissection of our real asm.
For both Windows and Linux our IR calls c2_foo() with a half(2):

call void @c2_foo(i8* %S_6, [21 x i8*]* %ptr_gvar_instance_7, %emlrtStack* %c2_b_st_, [18 x float]* @15, half 0xH4000, [18 x i8]* %t10)

They both register this in c2_foo as:

%c2_in2_ = alloca half
store half %c2_in2, half* %c2_in2_, align 1

When we compile them, they both send 0x40000000 to c2_foo (a single).
The Linux c2_foo() asm addresses this with a float2half conversion:

mov qword ptr [rsp + 448], rdi
mov qword ptr [rsp + 440], rsi
mov qword ptr [rsp + 432], rdx
mov qword ptr [rsp + 424], rcx
movabs rcx, offset __gnu_f2h_ieee # <—Convert Here
mov qword ptr [rsp + 336], r8 # 8-byte Spill
call rcx
mov word ptr [rsp + 422], ax
mov rcx, qword ptr [rsp + 336] # 8-byte Reload
mov qword ptr [rsp + 408], rcx
mov qword ptr [rsp + 392], 0
mov qword ptr [rsp + 384], 0
mov qword ptr [rsp + 376], 0
mov qword ptr [rsp + 368], 0
mov rdx, qword ptr [rsp + 432]
mov qword ptr [rsp + 360], rdx
mov rdx, qword ptr [rsp + 432]
mov rdx, qword ptr [rdx + 8]
mov qword ptr [rsp + 352], rdx
mov rdx, qword ptr [rsp + 440]
mov rdx, qword ptr [rdx + 56]
mov qword ptr [rsp + 344], rdx
mov dword ptr [rsp + 400], 0
jmp .LBB9_9

The Windows c2_foo() asm is missing this conversion but treats the value as if it has been converted.

mov rax, qword ptr [rsp + 424]
movss xmm0, dword ptr [rsp + 416] # xmm0 = mem[0],zero,zero,zero # ← moves the data like it wants to convert but never does
mov qword ptr [rsp + 344], rcx
mov qword ptr [rsp + 336], rdx
mov qword ptr [rsp + 328], r8
mov qword ptr [rsp + 320], r9
mov qword ptr [rsp + 304], 0
mov qword ptr [rsp + 296], 0
mov qword ptr [rsp + 288], 0
mov qword ptr [rsp + 280], 0
mov rcx, qword ptr [rsp + 328]
mov qword ptr [rsp + 272], rcx
mov rcx, qword ptr [rsp + 328]
mov rcx, qword ptr [rcx + 8]
mov qword ptr [rsp + 264], rcx
mov rcx, qword ptr [rsp + 336]
mov rcx, qword ptr [rcx + 56]
mov qword ptr [rsp + 256], rcx
mov dword ptr [rsp + 312], 0
mov qword ptr [rsp + 248], rax # 8-byte Spill
movss dword ptr

Hi Jason,

The different behavior between Linux and Windows comes from the difference in calling convention. Windows uses 4 registers for argument passing, while Linux uses 6.

https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160#parameter-passing
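
As a rough illustration of why the argument count matters, the Python sketch below assigns scalar arguments to locations under the two conventions. It is a deliberate simplification: it handles only plain integer and floating-point scalars, ignores aggregates and varargs, and assumes (as the LLVM 8/10 output in this thread suggests) that the half argument is passed like a float. The function names are hypothetical.

```python
WIN64_INT = ["rcx", "rdx", "r8", "r9"]
WIN64_FP = ["xmm0", "xmm1", "xmm2", "xmm3"]
SYSV_INT = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
SYSV_FP = ["xmm%d" % i for i in range(8)]

def win64_locations(kinds):
    """Windows x64: the first four arguments share four positional slots."""
    out = []
    for pos, kind in enumerate(kinds):
        if pos < 4:
            out.append(WIN64_FP[pos] if kind == "fp" else WIN64_INT[pos])
        else:
            out.append("stack")
    return out

def sysv_locations(kinds):
    """System V x86-64: integer and FP arguments are counted separately."""
    out = []
    n_int = n_fp = 0
    for kind in kinds:
        if kind == "fp" and n_fp < len(SYSV_FP):
            out.append(SYSV_FP[n_fp])
            n_fp += 1
        elif kind == "int" and n_int < len(SYSV_INT):
            out.append(SYSV_INT[n_int])
            n_int += 1
        else:
            out.append("stack")
    return out

if __name__ == "__main__":
    five = ["int", "int", "int", "int", "fp"]   # foo(i8, i8, i8, i8, half)
    four = ["int", "int", "int", "fp"]          # foo(i8, i8, i8, half)
    print(win64_locations(five), sysv_locations(five))
    print(win64_locations(four), sysv_locations(four))
```

Under these assumptions the five-argument foo receives its half on the stack on Windows, which matches the "movss xmm0, dword ptr [rsp + 48]" load, while the four-argument variant receives it in xmm3, which matches "movaps xmm0, xmm3". On Linux both variants receive it in xmm0.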

For this code the half store from the IR appears to have been removed because it is a local variable that was never read from. The store that says “4-byte Spill” is a different store and seems to be some -O0 artifact. With -O2 the whole thing becomes just a ret.

define void @foo(i8, i8, i8, i8, half) {
; CHECK-I686: callq __gnu_f2h_ieee

%6 = alloca half
store half %4, half* %6, align 1

ret void
}

x86_64-pc-windows gives:
push rax
.seh_stackalloc 8
.seh_endprologue
movss xmm0, dword ptr [rsp + 48] # xmm0 = mem[0],zero,zero,zero
movss dword ptr [rsp + 4], xmm0 # 4-byte Spill
pop rax
ret
.seh_handlerdata
.text
.seh_endproc

As an experiment, I tried the following, which does produce a call to __gnu_f2h_ieee on Windows with LLVM 8.0 and LLVM 10.0:

define void @foo(half*, i8, i8, half) {
store half %3, half* %0, align 1
ret void
}

For this assembly you provided, I don’t see any reads from xmm0 or any word stores, so it’s hard for me to determine what might be going wrong. Can you provide the assembly where xmm0 is eventually used?

mov rax, qword ptr [rsp + 424]
movss xmm0, dword ptr [rsp + 416] # xmm0 = mem[0],zero,zero,zero # ← moves the data like it wants to convert but never does
mov qword ptr [rsp + 344], rcx
mov qword ptr [rsp + 336], rdx
mov qword ptr [rsp + 328], r8
mov qword ptr [rsp + 320], r9
mov qword ptr [rsp + 304], 0
mov qword ptr [rsp + 296], 0
mov qword ptr [rsp + 288], 0
mov qword ptr [rsp + 280], 0
mov rcx, qword ptr [rsp + 328]
mov qword ptr [rsp + 272], rcx
mov rcx, qword ptr [rsp + 328]
mov rcx, qword ptr [rcx + 8]
mov qword ptr [rsp + 264], rcx
mov rcx, qword ptr [rsp + 336]
mov rcx, qword ptr [rcx + 56]
mov qword ptr [rsp + 256], rcx
mov dword ptr [rsp + 312], 0
mov qword ptr [rsp + 248], rax # 8-byte Spill
movss dword ptr

Hi Craig,

I am sorry for my poor example; it is probably better to take me out of the middle.
I have attached the complete IR for the example on which I am working. c2_foo() is where we break down.

Cheers.

JP

Attachment: halfOpAnom.ll (52.3 KB)

I think that because the argument was passed in memory and immediately stored to a local variable, it triggered some copy elision code, and something went wrong. In the basic blocks where the %c2_in2_ alloca is loaded from, I see 16-bit loads in the assembly that read the same location %xmm0 is loaded from in the entry block. So those loads are accessing the argument directly instead of a local copy. I guess something about this copy elision lost the knowledge that the value needed to be converted, or maybe it shouldn’t be eligible for copy elision.
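
A small Python sketch of why 16-bit loads from the float's stack slot would be consistent with the "truncated to 0.0" symptom. It assumes a little-endian layout and uses the 2.0 constant from the thread (0x40000000 as a single, 0xH4000 as a half); the helper names are made up, and this models only the bit reinterpretation, not what the backend actually emits.

```python
import struct

def low16_as_half(x: float) -> float:
    """Store x as a 32-bit float, then read only the low 2 bytes as a half.

    This mimics a 16-bit load from the 4-byte argument slot with no
    float-to-half conversion in between (little-endian assumed).
    """
    raw = struct.pack("<f", x)
    return struct.unpack("<e", raw[:2])[0]

def converted_half(x: float) -> float:
    """What the value would be after a proper float-to-half conversion."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

if __name__ == "__main__":
    print(struct.pack("<f", 2.0).hex())   # byte layout of the single in memory
    print(low16_as_half(2.0))             # value seen by the unconverted 16-bit load
    print(converted_half(2.0))            # value seen after conversion
```

For a float like 2.0 the low 16 bits are all zero, so the unconverted load yields 0.0 while the converted value stays 2.0. Floats with nonzero low mantissa bits would instead produce unrelated half values rather than 0.0.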