Add initial support for half precision builtins (take 3)

Hi,

I've tried to post this patch but it's too big for the ML (and I don't
know who to bug to get it released).

It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
(especially the sw implementations of math functions), but that needs
fixed conversion routines.

This should be enough to make clpeak happy (BZ: 96897)

Jan

0001-Add-initial-support-for-half-precision-builtins.patch.gz (6.82 KB)

Hi,

I've tried to post this patch but it's too big for the ML (and I don't
know who to bug to get it released).

It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
(especially the sw implementations of math functions), but that needs
fixed conversion routines.

This should be enough to make clpeak happy (BZ: 96897)

In: amdgcn/lib/math/fmax.cl
+ return (y < x) ? y : x;

Isn't this backwards? This is returning the minimum of the two values,
not the max. It looks like you did fmin first, and then copy-pasted
to fmax, but forgot to change the ternary. It is correct in
generic/lib/math/fmax.cl at least, so just the GCN codepath is
affected.

Otherwise, I've skimmed my way through the rest. I noted a bunch of
your TODOs for things that are reliant on atan2/sin/cos
implementations and a few functions that are disabled for
half-precision. Nothing really stands out as blatantly wrong other
than that amdgcn fmax bit I noted up above.

--Aaron

Hi Jan,

Hi,

I've tried to post this patch but it's too big for the ML (and I don't
know who to bug to get it released).

It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
(especially the sw implementations of math functions), but that needs
fixed conversion routines.

Is that going to fly? Wouldn’t you run into double rounding issues?

In addition to Aaron’s comments:

+#define HALF_RADIX 2
+#define HALF_MAX 0x1.ffcp15h
+#define HALF_MIN 0x1.0p-14h
+#define HALF_EPSILON 0x1.0p-10h

There seems to be both a space and a tab between the macro name and its definition. I also see a few places where indentation is done with a tab, e.g., generic/lib/relational/isunordered.cl.

From nan.inc:

#if __CLC_FPSIZE == 64
#define __CLC_NATN __CLC_XCONCAT(ulong, __CLC_VECSIZE)
-#else
+#elif __CLC_FPSIZE == 32
#define __CLC_NATN __CLC_XCONCAT(uint, __CLC_VECSIZE)
+#else
+#define __CLC_NATN __CLC_XCONCAT(ushort, __CLC_VECSIZE)
#endif

Wouldn’t it make more sense to also use “#elif __CLC_FPSIZE == 16”, and maybe #error otherwise? Similar in a few other places.

In generic/lib/math/acos.inc:

+#else
+#define __CLC_CONST(x) xh

I think there should be a ## between the x and h.

In generic/include/math/clc_ldexp.h, the guarded text in indented, while indentation is not used anywhere else.

I see some cases where the "+#if __CLC_FPSIZE > 16” guard does not have a TODO attached to it, is that on purpose?

In generic/lib/math/modf.inc casts 0.0f, while generic/lib/math/fract.inc introduces a ZERO macro. Maybe this should be made more uniform (doesn’t have to be as part of this patch).

Near the end of the patch the int/vector of short difference is explained in a comment "returns an int, but the vector versions return short”. Maybe add a similar comment to generic/include/clc/relational/floatn.inc?

Jeroen

Hi,

thank you both, for fast feedback. I've attached v2 that should address
most of the comments. I left larger whitespace cleanup for another day.

Hi Jan,

>
> Hi,
>
> I've tried to post this patch but it's too big for the ML (and I don't
> know who to bug to get it released).
>
> It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
> (especially the sw implementations of math functions), but that needs
> fixed conversion routines.

Is that going to fly? Wouldn’t you run into double rounding issues?

It probably will, but it's the fastest solution. The alternative is to
adapt all sw routines to half precision (including tables). At least
GCN llvm backend does this for some ops anyway.

Fixing conversion routines is higher priority. We can revisit this
after.

regards,
Jan

v2-0001-Add-initial-support-for-half-precision-builtins.patch.gz (7.1 KB)

Hi,

thank you both, for fast feedback. I’ve attached v2 that should address
most of the comments. I left larger whitespace cleanup for another day.

Hi Jan,

Hi,

I’ve tried to post this patch but it’s too big for the ML (and I don’t
know who to bug to get it released).

It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
(especially the sw implementations of math functions), but that needs
fixed conversion routines.

Is that going to fly? Wouldn’t you run into double rounding issues?

It probably will, but it’s the fastest solution. The alternative is to
adapt all sw routines to half precision (including tables). At least
GCN llvm backend does this for some ops anyway.

Fixing conversion routines is higher priority. We can revisit this
after.

Yeah, if the aim is just to get something working, then just wrapping
the fp32 code is fine.

One further remark: I think generic/lib/math/modf.inc needs to #undef ZERO.

One question about generic/lib/math/clc_sqrt_impl.inc:

+#define __CLC_NAN (half)NAN

I assume the reason for this is that there’s __builtin_nanh(“”), or something
like that? (The ‘h’ in there is coming from my imagination.)

Otherwise, I don’t have any further comments.

Jeroen

>
> Hi,
>
> thank you both, for fast feedback. I've attached v2 that should address
> most of the comments. I left larger whitespace cleanup for another day.
>
>
> > Hi Jan,
> >
> > >
> > > Hi,
> > >
> > > I've tried to post this patch but it's too big for the ML (and I don't
> > > know who to bug to get it released).
> > >
> > > It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
> > > (especially the sw implementations of math functions), but that needs
> > > fixed conversion routines.
> >
> > Is that going to fly? Wouldn’t you run into double rounding issues?
>
> It probably will, but it's the fastest solution. The alternative is to
> adapt all sw routines to half precision (including tables). At least
> GCN llvm backend does this for some ops anyway.
>
> Fixing conversion routines is higher priority. We can revisit this
> after.

Yeah, if the aim is just to get something working, then just wrapping
the fp32 code is fine.

One further remark: I think generic/lib/math/modf.inc needs to #undef ZERO.

Thanks. I've fixed this locally. Guess since it was the same definition
the compiler didn't complain.

One question about generic/lib/math/clc_sqrt_impl.inc:

+#define __CLC_NAN (half)NAN

I assume the reason for this is that there’s __builtin_nanh(“”), or something
like that? (The ‘h’ in there is coming from my imagination.)

yes. there's no __builtin_nanh(), this relies on the conversion
preserving NaN.

Otherwise, I don’t have any further comments.

thanks,
Jan

>
> Hi,
>
> thank you both, for fast feedback. I've attached v2 that should address
> most of the comments. I left larger whitespace cleanup for another day.
>
>
> > Hi Jan,
> >
> > >
> > > Hi,
> > >
> > > I've tried to post this patch but it's too big for the ML (and I don't
> > > know who to bug to get it released).
> > >
> > > It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
> > > (especially the sw implementations of math functions), but that needs
> > > fixed conversion routines.
> >
> > Is that going to fly? Wouldn’t you run into double rounding issues?
>
> It probably will, but it's the fastest solution. The alternative is to
> adapt all sw routines to half precision (including tables). At least
> GCN llvm backend does this for some ops anyway.
>
> Fixing conversion routines is higher priority. We can revisit this
> after.

Yeah, if the aim is just to get something working, then just wrapping
the fp32 code is fine.

One further remark: I think generic/lib/math/modf.inc needs to #undef ZERO.

Thanks. I've fixed this locally. Guess since it was the same definition
the compiler didn't complain.

One question about generic/lib/math/clc_sqrt_impl.inc:

+#define __CLC_NAN (half)NAN

I assume the reason for this is that there’s __builtin_nanh(“”), or something
like that? (The ‘h’ in there is coming from my imagination.)

yes. there's no __builtin_nanh(), this relies on the conversion
preserving NaN.

Otherwise, I don’t have any further comments.

thanks,
Jan

One more thing.

I was running through the "quick" CTS test profile yesterday, just to
see where we're at, and I noticed the following failures
from conformance_Test_half:

Testing vstorea_half
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rte
Testing vstorea_half_rte
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtz
Testing vstorea_half_rtz
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtp
Testing vstorea_half_rtp
268435452) Failure at 0x1.fffff8p-96: *0x0001 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtn
Testing vstorea_half_rtn
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global

Something about the 3-component vstore_half is off, but I'm not sure
if this was already broken before, or if the "#if __CLC_FPSIZE > 16"
change in vstore_half.inc broke it.

--Aaron

> > >
> > > Hi,
> > >
> > > thank you both, for fast feedback. I've attached v2 that should address
> > > most of the comments. I left larger whitespace cleanup for another day.
> > >
> > >
> > > > Hi Jan,
> > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I've tried to post this patch but it's too big for the ML (and I don't
> > > > > know who to bug to get it released).
> > > > >
> > > > > It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
> > > > > (especially the sw implementations of math functions), but that needs
> > > > > fixed conversion routines.
> > > >
> > > > Is that going to fly? Wouldn’t you run into double rounding issues?
> > >
> > > It probably will, but it's the fastest solution. The alternative is to
> > > adapt all sw routines to half precision (including tables). At least
> > > GCN llvm backend does this for some ops anyway.
> > >
> > > Fixing conversion routines is higher priority. We can revisit this
> > > after.
> >
> > Yeah, if the aim is just to get something working, then just wrapping
> > the fp32 code is fine.
> >
> > One further remark: I think generic/lib/math/modf.inc needs to #undef ZERO.
>
> Thanks. I've fixed this locally. Guess since it was the same definition
> the compiler didn't complain.
>
> >
> > One question about generic/lib/math/clc_sqrt_impl.inc:
> >
> > +#define __CLC_NAN (half)NAN
> >
> > I assume the reason for this is that there’s __builtin_nanh(“”), or something
> > like that? (The ‘h’ in there is coming from my imagination.)
>
> yes. there's no __builtin_nanh(), this relies on the conversion
> preserving NaN.
>
> >
> > Otherwise, I don’t have any further comments.
>
> thanks,
> Jan

One more thing.

I was running through the "quick" CTS test profile yesterday, just to
see where we're at, and I noticed the following failures
from conformance_Test_half:

Testing vstorea_half
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rte
Testing vstorea_half_rte
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtz
Testing vstorea_half_rtz
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtp
Testing vstorea_half_rtp
268435452) Failure at 0x1.fffff8p-96: *0x0001 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtn
Testing vstorea_half_rtn
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global

Something about the 3-component vstore_half is off, but I'm not sure
if this was already broken before, or if the "#if __CLC_FPSIZE > 16"
change in vstore_half.inc broke it.

Hi,

this was broken before. vstore/vload_half exist irrespective of
cl_khr_fp16. The "#if" part is there to prevent vstore_half(half,
size_t, half*), there's vstore(half, size_t, half*) for that.
vload_half has similar guard but it only exists for float.

I've actually started looking into this as a simpler problem to fix
than the conversion generator.
I found the issue to be in CTS numVecs function(cl_utils.c:327) which
aligns up for all cases but (aligned && vecsize == 3).
Thus the last few values are simply not run and return the poison value
(0xdead). You can check whether "count" is divisible by 4.

A simple patch to numVecs function:

- return count/4;
+ return (count + 3)/4;

should fix the issue, I've only checked vstorea_half.
I'm pretty sure it technically results in OOB access, but thanks to GPU
buffer size alignment to 64 bytes it shouldn't corrupt memory.

Does the patch look OK to you now? v3 adds "ZERO" undef. Let me know if
you'd prefer I sent it out as well.

thanks,
Jan

> > >
> > > Hi,
> > >
> > > thank you both, for fast feedback. I've attached v2 that should address
> > > most of the comments. I left larger whitespace cleanup for another day.
> > >
> > >
> > > > Hi Jan,
> > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I've tried to post this patch but it's too big for the ML (and I don't
> > > > > know who to bug to get it released).
> > > > >
> > > > > It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
> > > > > (especially the sw implementations of math functions), but that needs
> > > > > fixed conversion routines.
> > > >
> > > > Is that going to fly? Wouldn’t you run into double rounding issues?
> > >
> > > It probably will, but it's the fastest solution. The alternative is to
> > > adapt all sw routines to half precision (including tables). At least
> > > GCN llvm backend does this for some ops anyway.
> > >
> > > Fixing conversion routines is higher priority. We can revisit this
> > > after.
> >
> > Yeah, if the aim is just to get something working, then just wrapping
> > the fp32 code is fine.
> >
> > One further remark: I think generic/lib/math/modf.inc needs to #undef ZERO.
>
> Thanks. I've fixed this locally. Guess since it was the same definition
> the compiler didn't complain.
>
> >
> > One question about generic/lib/math/clc_sqrt_impl.inc:
> >
> > +#define __CLC_NAN (half)NAN
> >
> > I assume the reason for this is that there’s __builtin_nanh(“”), or something
> > like that? (The ‘h’ in there is coming from my imagination.)
>
> yes. there's no __builtin_nanh(), this relies on the conversion
> preserving NaN.
>
> >
> > Otherwise, I don’t have any further comments.
>
> thanks,
> Jan

One more thing.

I was running through the "quick" CTS test profile yesterday, just to
see where we're at, and I noticed the following failures
from conformance_Test_half:

Testing vstorea_half
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rte
Testing vstorea_half_rte
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtz
Testing vstorea_half_rtz
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtp
Testing vstorea_half_rtp
268435452) Failure at 0x1.fffff8p-96: *0x0001 vs 0xdead vector_size =
3 address_space = global
Testing vstore_half_rtn
Testing vstorea_half_rtn
268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead vector_size =
3 address_space = global

Something about the 3-component vstore_half is off, but I'm not sure
if this was already broken before, or if the "#if __CLC_FPSIZE > 16"
change in vstore_half.inc broke it.

Hi,

this was broken before. vstore/vload_half exist irrespective of
cl_khr_fp16. The "#if" part is there to prevent vstore_half(half,
size_t, half*), there's vstore(half, size_t, half*) for that.
vload_half has similar guard but it only exists for float.

I've actually started looking into this as a simpler problem to fix
than the conversion generator.
I found the issue to be in CTS numVecs function(cl_utils.c:327) which
aligns up for all cases but (aligned && vecsize == 3).
Thus the last few values are simply not run and return the poison value
(0xdead). You can check whether "count" is divisible by 4.

A simple patch to numVecs function:

- return count/4;
+ return (count + 3)/4;

should fix the issue, I've only checked vstorea_half.
I'm pretty sure it technically results in OOB access, but thanks to GPU
buffer size alignment to 64 bytes it shouldn't corrupt memory.

Does the patch look OK to you now? v3 adds "ZERO" undef. Let me know if
you'd prefer I sent it out as well.

Yeah, I'm ok with this in its current state. I figured there was a
good chance that vstore_half was already failing before your changes,
but I just wanted to check.

Regarding testing, I ran through the CTS quick profile and didn't see
anything obvious going wrong with regards to the half-precision
support. I also ran the clpeak performance tests successfully without
issue on my RX580.

I previously went through v1 of this change, and I just went through
the diff between v1 and v2.

Looks good to me, and the TODOs left make sense. We'll get there
eventually... there's no reason to block getting most of the support
in on the few things that aren't finished yet.

Tested/Reviewed-By: Me

--Aaron