[PATCH 1/2] integer: Add popcount implementation using ctpop intrinsic

Also copy/modify the unary_intrin.inc from math/ to make the
intrinsic declaration somewhat reusable.

Passes CL CTS integer_ops/test_integer_ops popcount tests for CL 1.2

Tested on GCN 1.0 (Pitcairn)

Signed-off-by: Aaron Watry <awatry@gmail.com>


Passes piglit on Turks.
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>

Jan


Signed-off-by: Aaron Watry <awatry@gmail.com>
---
generic/include/clc/clc.h               | 1 +
generic/include/clc/math/native_recip.h | 1 +
2 files changed, 2 insertions(+)
create mode 100644 generic/include/clc/math/native_recip.h

diff --git a/generic/include/clc/clc.h b/generic/include/clc/clc.h
index a93c8ef..9c0e00c 100644
--- a/generic/include/clc/clc.h
+++ b/generic/include/clc/clc.h
@@ -106,6 +106,7 @@
#include <clc/math/native_log.h>
#include <clc/math/native_log2.h>
#include <clc/math/native_powr.h>
+#include <clc/math/native_recip.h>
#include <clc/math/native_sin.h>
#include <clc/math/native_sqrt.h>
#include <clc/math/native_rsqrt.h>
diff --git a/generic/include/clc/math/native_recip.h b/generic/include/clc/math/native_recip.h
new file mode 100644
index 0000000..5187661
--- /dev/null
+++ b/generic/include/clc/math/native_recip.h
@@ -0,0 +1 @@
+#define native_recip(x) ((1) / (x))

I think this would need the fast/unsafe math flag to actually generate
a reciprocal instruction, but all our native_* functions are just
macros for now anyway.
Acked-by: Jan Vesely <jan.vesely@rutgers.edu>

I think it'd be a good idea to come up with a plan to properly support
native_* functions. To me it looks like the options are:
a) Treat llvm intrinsics as native precision and use those.
b) Treat llvm intrinsics as properly rounded and export device-specific
__builtins via clang.

a) is less work (both in LLVM and libclc), but I think b) is what is
expected of us.

Given that the llvm cos/sin intrinsics are already of insufficient
precision for CL purposes, I'm OK with treating the intrinsics as
sufficient for native_*, where precision is up to us. The only real
requirement for native_* is that a result is returned; the spec gives
us no accuracy requirements. And since llvm.sin.* is already not
accurate enough for us, I don't think we can do your option b) without
improving the accuracy of some of the AMDGPU intrinsics anyway.

And yeah, for now native_recip generates a floating-point divide of
1.0f/x. I didn't find an llvm intrinsic for reciprocal, and I didn't
find a __builtin_recip/rcp/etc. when I looked either. I could
hand-code something in llvm assembly, but that was more effort than I
wanted to sink into something that just happened to be needed to get
the blender cycles kernels to compile.

Note: I think that with this function we've got everything needed in
libclc for blender, but llvm asserts/crashes out with either
SelectionDAG or instruction selection issues later on. I did have to
hack up blender a bit to get clover to be a recognized platform, based
on a patch from Edward O'Callaghan [0].

--Aaron

[0] - Blender Archive - developer.blender.org


> Given that the llvm cos/sin intrinsics are already of insufficient
> precision for CL purposes, I'm ok with the idea that the intrinsic
> functions are sufficient for native_* where precision is up to us.
> The only real requirement for native_* is that a result is returned,
> with no accuracy requirements given to us by the spec. Given that
> llvm.sin.* is already not accurate enough for us, I don't think we can
> do your option b without improving the accuracy of some of the AMDGPU
> intrinsic functions anyway.

I think this goes beyond the amdgpu target. It has been decided not to
have an amdgpu target in compiler-rt, and I'm not sure it's correct to
implement llvm.cos using GPU instructions rather than introducing
amdgcn.cos/r600.cos.
We might want to look at other architectures (nvptx, x86 for llvmpipe
OCL) even if the chance anybody will use those is low.
I want to avoid a situation where llvm.cos results in a single
inaccurate instruction on one target and 1000 instructions of an
accurate algorithm on another.

> And yeah, for now, the native_recip generates a floating divide of
> 1.0f/x. I didn't find an llvm intrinsic for reciprocal, and I don't
> think I found one for __builtin_recip/rcp/etc when I looked either. I
> guess I could hand-code something in llvm assembly, but that was more
> effort than I wanted to sink into something that just happened to be
> needed to get the blender cycles kernels to compile.

Feel free to go ahead with my ack on this patch. A proper solution in
the form of a clang __builtin (either generic or target-specific) can
come later.

> Note: I think that with this function, we're to the point that we've
> got everything needed in libclc for blender, but llvm asserts/crashes
> out with either SelectionDAG or instruction selection issues later on.
> I did have to hack up blender a bit to get clover to be a recognized
> platform based on a patch from Edward O'Callaghan [0].

Sounds like removing __CL_USE_NATIVE__ from the blender patch would
also be an option.

https://bugs.freedesktop.org/show_bug.cgi?id=100212
mentions that vstore_half_rtn is also needed. Is that not the case?

Anyway, I get "*** Couldn't join subrange!" in RegisterCoalescer when
trying to run blender on Carrizo/LLVM-5.0

Jan


Note for reference... while this generates .ll of:
  %1 = load float, float addrspace(1)* %arrayidx1, align 4, !tbaa !7
  %div2 = fdiv float 1.000000e+00, %1, !fpmath !11

The amdgpu back-end generates:
    s_load_dword s2, s[4:5], 0x0
    s_waitcnt lgkmcnt(0)
    v_rcp_f32_e32 v0, s2

So while we don't have an intrinsic or built-in for this, amdgpu is
smart enough to recognize the pattern.

--Aaron