[PATCH] Add optional _Float16 support

H.J_Lu · July 1, 2021, 9:05pm

1. Pass _Float16 and _Complex _Float16 values on stack.
2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.

Joseph_Myers1 · July 1, 2021, 10:10pm

That restricts use of _Float16 to processors with SSE. Is that what we
want in the ABI, or should _Float16 be available with base 32-bit x86
architecture features only, much like _Float128 and the decimal FP types
are? (If it is restricted to SSE, we can of course ensure relevant libgcc
functions are built with SSE enabled, and likewise in glibc if that gains
_Float16 functions, though maybe with some extra complications to get
relevant testcases to run whenever possible.)

H.J_Lu · July 1, 2021, 10:27pm

> > 2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
>
> That restricts use of _Float16 to processors with SSE. Is that what we
> want in the ABI, or should _Float16 be available with base 32-bit x86
> architecture features only, much like _Float128 and the decimal FP types

Yes, _Float16 requires XMM registers.

> are? (If it is restricted to SSE, we can of course ensure relevant libgcc
> functions are built with SSE enabled, and likewise in glibc if that gains
> _Float16 functions, though maybe with some extra complications to get
> relevant testcases to run whenever possible.)

_Float16 functions in libgcc should be compiled with SSE enabled.

BTW, _Float16 software emulation may require more than just SSE
since we need to do _Float16 load and store with XMM registers.
There is no 16bit load/store for XMM registers without AVX512FP16.

Joseph_Myers1 · July 1, 2021, 10:40pm

You should be able to make the move go via general-purpose registers (for
example) if you can't do a direct 16-bit load/store for XMM registers.

H.J_Lu · July 1, 2021, 11:01pm

There is no 16bit move between GPRs and XMM registers without
AVX512FP16.

topperc · July 1, 2021, 11:05pm

> > > BTW, _Float16 software emulation may require more than just SSE
> > > since we need to do _Float16 load and store with XMM registers.
> > > There is no 16bit load/store for XMM registers without AVX512FP16.
> >
> > You should be able to make the move go via general-purpose registers (for
> > example) if you can’t do a direct 16-bit load/store for XMM registers.
>
> There is no 16bit move between GPRs and XMM registers without
> AVX512FP16.

Isn’t PINSRW supported since SSE1?

programmerjake · July 1, 2021, 11:33pm

Umm, if you just need to load/store 16-bit scalars in XMM registers you can use pextrw and pinsrw which don’t require AVX. f16x8 can use any of the standard full-register load/stores.

https://gcc.godbolt.org/z/ncznr9TM1

Jacob

Richard_Biener1 · July 2, 2021, 7:45am

> > > 2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
> >
> > That restricts use of _Float16 to processors with SSE. Is that what we
> > want in the ABI, or should _Float16 be available with base 32-bit x86
> > architecture features only, much like _Float128 and the decimal FP types
>
> Yes, _Float16 requires XMM registers.
>
> > are? (If it is restricted to SSE, we can of course ensure relevant libgcc
> > functions are built with SSE enabled, and likewise in glibc if that gains
> > _Float16 functions, though maybe with some extra complications to get
> > relevant testcases to run whenever possible.)
> >
>
> _Float16 functions in libgcc should be compiled with SSE enabled.
>
> BTW, _Float16 software emulation may require more than just SSE
> since we need to do _Float16 load and store with XMM registers.
> There is no 16bit load/store for XMM registers without AVX512FP16.
>

> Umm, if you just need to load/store 16-bit scalars in XMM registers you can
> use pextrw and pinsrw which don't require AVX. f16x8 can use any of the
> standard full-register load/stores.

It looks like that requires SSE2; with SSE only, inserts/extracts
to/from MMX registers are supported. But of course GPR half-word loads and
full-size GPR->XMM moves would work.

Hongtao_Liu · July 2, 2021, 8:03am

> > > > 2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
> > >
> > > That restricts use of _Float16 to processors with SSE. Is that what we
> > > want in the ABI, or should _Float16 be available with base 32-bit x86
> > > architecture features only, much like _Float128 and the decimal FP types
> >
> > Yes, _Float16 requires XMM registers.
> >
> > > are? (If it is restricted to SSE, we can of course ensure relevant libgcc
> > > functions are built with SSE enabled, and likewise in glibc if that gains
> > > _Float16 functions, though maybe with some extra complications to get
> > > relevant testcases to run whenever possible.)
> > >
> >
> > _Float16 functions in libgcc should be compiled with SSE enabled.
> >
> > BTW, _Float16 software emulation may require more than just SSE
> > since we need to do _Float16 load and store with XMM registers.
> > There is no 16bit load/store for XMM registers without AVX512FP16.
> >
>
> Umm, if you just need to load/store 16-bit scalars in XMM registers you can
> use pextrw and pinsrw which don't require AVX. f16x8 can use any of the
> standard full-register load/stores.

> It looks like that requires SSE2; with SSE only, inserts/extracts
> to/from MMX registers are supported. But of course GPR half-word loads and
> full-size GPR->XMM moves would work.

movd between SSE registers and GPRs also requires SSE2.

Jakub_Jelinek · July 2, 2021, 9:21am

Loads can be done in SSE2 directly with PINSRW, which supports a 16-bit load
from memory into an XMM register. But SSE2 PEXTRW only supports extraction
into a GPR, and one needs SSE4.1 for PEXTRW into memory. So, for stores with
SSE2 one needs a secondary reload...

Jakub
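
A minimal sketch of the SSE2-only load and store described above, written with intrinsics rather than assembly. The helper names `load_f16_bits` and `store_f16_bits` are hypothetical, and the _Float16 value is treated as a raw 16-bit pattern. Which instructions the compiler actually emits is its own choice; the comments only describe the typical mapping.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: place the 16-bit pattern at *p into element 0 of an
   XMM value and zero the other elements. Compilers typically use PINSRW
   for this, possibly with a memory operand. */
static inline __m128i load_f16_bits(const uint16_t *p)
{
    return _mm_insert_epi16(_mm_setzero_si128(), *p, 0);
}

/* Hypothetical helper: write element 0 of v to *p. With only SSE2 the value
   first goes to a general-purpose register (PEXTRW r32, xmm) and is then
   written with an ordinary 16-bit store. */
static inline void store_f16_bits(uint16_t *p, __m128i v)
{
    *p = (uint16_t)_mm_extract_epi16(v, 0);
}
```

On a 32-bit target this would need SSE2 enabled at compile time (for example `-msse2` with GCC or Clang).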

phoebe · July 13, 2021, 6:19am

(Forward to llvm-dev)

> Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.

Can you please explain the behavior here? Is there a difference between _Float16 and _Complex _Float16 when they are returned? I.e.,
1. In which case will _Float16 values be returned in both %xmm0 and %xmm1?
2. For a single _Complex _Float16 value, are both the real part and the imaginary part returned in %xmm0, or are they returned in %xmm0 and %xmm1 respectively?

Thanks
Pengfei

H.J_Lu · July 13, 2021, 2:26pm

> > Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
>
> Can you please explain the behavior here? Is there a difference between _Float16 and _Complex _Float16 when they are returned? I.e.,
> 1. In which case will _Float16 values be returned in both %xmm0 and %xmm1?
> 2. For a single _Complex _Float16 value, are both the real part and the imaginary part returned in %xmm0, or are they returned in %xmm0 and %xmm1 respectively?

Here is the v2 patch to add the missing _Float16 bits. The PDF file is at

Wiki · x86 psABIs / i386 and IAMCU psABIs · GitLab

v2-0001-Add-optional-_Float16-support.patch (7.14 KB)

phoebe · July 13, 2021, 2:48pm

Hi H.J.,

Our LLVM implementation currently uses %xmm0 for both the real part and the imaginary part of a _Complex _Float16 value. Is there a special reason to use two registers?
We are using one register on X64. Considering performance, especially register pressure, would it be better to use one register for _Complex _Float16 on 32-bit targets?

Thanks
Pengfei

H.J_Lu · July 13, 2021, 3:04pm

> Hi H.J.,
>
> Our LLVM implementation currently uses %xmm0 for both the real part and the imaginary part of a _Complex _Float16 value. Is there a special reason to use two registers?
> We are using one register on X64. Considering performance, especially register pressure, would it be better to use one register for _Complex _Float16 on 32-bit targets?

The x86-64 psABI is unrelated to the i386 psABI. Using a pair of registers is
more natural for complex _Float16. Since it is only used for the function
return value, I don't think there is a register pressure issue.

Joseph_Myers1 · July 13, 2021, 3:41pm

This PDF shows _Complex _Float16 as having a size of 2 bytes (should be
4-byte size, 2-byte alignment).

It also seems to change double from 4-byte to 8-byte alignment, which is
wrong. And it's inconsistent about whether it covers the long double =
double (Android) case - it shows that case for _Complex long double but
not for long double itself.
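
As a rough illustration of the 4-byte size and 2-byte alignment mentioned above, here is a sketch that models _Complex _Float16 as two 16-bit patterns. The struct name `complex_f16_bits` is a hypothetical stand-in and not part of the psABI; it only mirrors the C rule that a complex type is laid out like an array of two elements of the corresponding real type, real part first.

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h> /* offsetof */
#include <stdint.h>

/* Hypothetical stand-in for _Complex _Float16: real part, then imaginary
   part, each stored as a raw 16-bit pattern. */
struct complex_f16_bits {
    uint16_t re;
    uint16_t im;
};

/* Compile-time checks of the expected layout. */
static_assert(sizeof(struct complex_f16_bits) == 4, "size should be 4 bytes");
static_assert(_Alignof(struct complex_f16_bits) == 2, "alignment should be 2 bytes");
static_assert(offsetof(struct complex_f16_bits, im) == 2, "imaginary part follows the real part");
```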

H.J_Lu · July 13, 2021, 4:24pm

> >
> > > Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
> >
> > Can you please explain the behavior here? Is there difference between _Float16 and _Complex _Float16 when return? I.e.,
> > 1, In which case will _Float16 values return in both %xmm0 and %xmm1?
> > 2, For a single _Float16 value, are both real part and imaginary part returned in %xmm0? Or returned in %xmm0 and %xmm1 respectively?
>
> Here is the v2 patch to add the missing _Float16 bits. The PDF file is at
>
> Wiki · x86 psABIs / i386 and IAMCU psABIs · GitLab

> This PDF shows _Complex _Float16 as having a size of 2 bytes (should be
> 4-byte size, 2-byte alignment).
>
> It also seems to change double from 4-byte to 8-byte alignment, which is
> wrong. And it's inconsistent about whether it covers the long double =
> double (Android) case - it shows that case for _Complex long double but
> not for long double itself.

Here is the v3 patch with the fixes. I also updated the PDF file.

v3-0001-Add-optional-_Float16-support.patch (7.17 KB)

H.J_Lu · July 29, 2021, 1:39pm

Here is the final patch I checked in. _Complex _Float16 is changed to be
returned in the XMM0 register. The new PDF file is at

0001-Add-optional-_Float16-support.patch (6.83 KB)

rjmccall · August 24, 2021, 5:55am

This should be explicit that the real part is returned in bits 0…15 and the imaginary part is returned in bits 16…31, or however we conventionally designate subcomponents of a vector.

John.

H.J_Lu · August 25, 2021, 12:35pm

How about this?

diff --git a/low-level-sys-info.tex b/low-level-sys-info.tex
index 860ff66..8f527c1 100644
--- a/low-level-sys-info.tex
+++ b/low-level-sys-info.tex
@@ -457,6 +457,9 @@ and \texttt{unions}) are always returned in memory.
     & \texttt{__float128} & memory \\
     \hline
     & \texttt{_Complex _Float16} & \reg{xmm0} \\
+ & & The real part is returned in bits 0..15. The imaginary part is
+ returned \\
+ & & in bits 16..31.\\
     \cline{2-3}
     Complex & \texttt{_Complex float} & \EDX:\EAX \\
     floating- & & The real part is returned in \EAX. The imaginary part is

rjmccall · August 25, 2021, 8:32pm

Looks good to me, thanks.

John.
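
For illustration, the agreed layout (real part in bits 0..15, imaginary part in bits 16..31 of %xmm0) can be sketched on the low 32 bits as plain integer packing. The function names below are hypothetical, and the halves are handled as raw 16-bit patterns rather than as _Float16 values.

```c
#include <stdint.h>

/* Hypothetical helper: combine the two 16-bit patterns the way the low
   32 bits of %xmm0 would hold them. */
static inline uint32_t pack_complex_f16(uint16_t re_bits, uint16_t im_bits)
{
    return (uint32_t)re_bits | ((uint32_t)im_bits << 16);
}

/* Hypothetical helper: real part, bits 0..15. */
static inline uint16_t real_bits(uint32_t packed)
{
    return (uint16_t)(packed & 0xFFFFu);
}

/* Hypothetical helper: imaginary part, bits 16..31. */
static inline uint16_t imag_bits(uint32_t packed)
{
    return (uint16_t)(packed >> 16);
}
```

For example, with 1.0 encoded as 0x3C00 and -2.0 encoded as 0xC000 in IEEE binary16, `pack_complex_f16(0x3C00, 0xC000)` yields 0xC0003C00.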
