[PATCH 01/14] half_rsqrt: Switch implementation to native_rsqrt

Passes CTS on carrizo

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

This one also passes CTS on carrizo when applied together with 16/14 of
the native series.

Jan
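
For readers without the diff handy, the change boils down to something like
the sketch below. This is not the patch itself, just an illustration assuming
libclc's usual _CLC_OVERLOAD/_CLC_DEF conventions, with the vector overloads
assumed to come from the usual unary-builtin macro expansion:

  /* Illustrative only, not the actual patch: the scalar half_rsqrt
   * overload simply forwards to native_rsqrt; the vector overloads
   * are assumed to be generated by libclc's unary-builtin macros. */
  _CLC_OVERLOAD _CLC_DEF float half_rsqrt(float x) {
      return native_rsqrt(x);
  }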

This assumes that native_rsqrt is more accurate than half_rsqrt, which is not guaranteed by the OpenCL spec as far as I know.

Jeroen

> This assumes that native_rsqrt is more accurate than half_rsqrt,
> which is not guaranteed by the OpenCL spec as far as I know.

Yes. The entire series assumes that native ops are accurate enough for
half_* (8192 ulps). I've tested this on carrizo, and AFAIK it should be
generally OK for both GCN and EG+.**
It'd be safer to redirect half_ ops to full ops and include per-target
overrides, but since I expect both nvidia and amdgpu to have those
overrides, it'd just be a bunch of dead code in the generic directory.

Jan

**cos/sin/tan have issues with large inputs, but I think that can be
fixed in llvm by improving the initial scaling op.
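
To make the 8192-ulp budget concrete, a host-side spot check could compare
the device result of native_rsqrt against a double-precision reference and
count ulps roughly as below. This is only a rough approximation of the CTS
tolerance test, not the CTS itself; the helper names (ordered, ulp_distance)
are made up here:

  #include <math.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  /* Map a float's bit pattern onto a monotonically increasing integer
   * scale so that adjacent representable floats differ by 1. */
  static int64_t ordered(float f) {
      uint32_t u;
      memcpy(&u, &f, sizeof u);
      return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
  }

  /* Rough ulp distance between a device result and a reference value
   * computed in double precision; half_rsqrt allows up to 8192 ulps. */
  static int64_t ulp_distance(float result, double reference) {
      return llabs(ordered(result) - ordered((float)reference));
  }

  /* Usage: ulp_distance(device_native_rsqrt(x), 1.0 / sqrt((double)x))
   * should stay <= 8192 for the inputs the CTS exercises. */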

> > This assumes that native_rsqrt is more accurate than half_rsqrt,
> > which is not guaranteed by the OpenCL spec as far as I know.
>
> Yes. The entire series assumes that native ops are accurate enough for
> half_* (8192 ulps). I've tested this on carrizo, and AFAIK it should be
> generally OK for both GCN and EG+.**
> It'd be safer to redirect half_ ops to full ops and include per-target
> overrides, but since I expect both nvidia and amdgpu to have those
> overrides, it'd just be a bunch of dead code in the generic directory.

Maybe. Then again, no one is currently testing this on Nvidia.

In general I would be worried about edge cases, but these are
apparently fine on AMD platforms.

Jeroen

> > > This assumes that native_rsqrt is more accurate than half_rsqrt,
> > > which is not guaranteed by the OpenCL spec as far as I know.
> >
> > Yes. The entire series assumes that native ops are accurate enough for
> > half_* (8192 ulps). I've tested this on carrizo, and AFAIK it should be
> > generally OK for both GCN and EG+.**
> > It'd be safer to redirect half_ ops to full ops and include per-target
> > overrides, but since I expect both nvidia and amdgpu to have those
> > overrides, it'd just be a bunch of dead code in the generic directory.
>
> Maybe. Then again, no one is currently testing this on Nvidia.
>
> In general I would be worried about edge cases, but these are
> apparently fine on AMD platforms.

I took a look at the CUDA math intrinsics[0], which should give us an
idea of the PTX opcode error bounds:
sqrt, rsqrt, recip look to be properly rounded
divide, log, log10, log2 look to be OK for half_ and even regular ops
exp, exp10 are not good enough
sin, cos: can't tell

I can add special overloads for the last 4 ops for nvptx.

Jan

[0] http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions
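
To illustrate what such a per-target override could look like (purely a
sketch, not a proposal for the actual tree layout), an nvptx-specific
implementation could fall back to the full-precision op where the native
PTX intrinsic is not accurate enough, following the same
_CLC_OVERLOAD/_CLC_DEF conventions as the generic directory:

  /* Hypothetical nvptx override: use the full-precision exp instead of
   * native_exp, since the note above suggests the PTX exp intrinsic is
   * not accurate enough for the half_ error budget. */
  _CLC_OVERLOAD _CLC_DEF float half_exp(float x) {
      return exp(x);
  }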