[PATCH 1/2] amdgcn: Implement {read_, write_, }mem_fence builtin

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

Specs require using fences when barrier() is invoked:
"The barrier function will either flush any variables stored in local memory
or queue a memory fence to ensure correct ordering of memory operations to local memory."
and
"The barrier function will queue a memory fence to ensure correct ordering
of memory operations to global memory."

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---

__builtin_amdgcn_s_waitcnt path is compile tested only. I currently
don't have machine with GCN hw and LLVM > 4

Tested on GCN 1.0 (PITCAIRN) with LLVM 6.0 svn (from sometime last week).

I ran the local-memory/global-memory piglit tests, and the conformance
basic/test_basic vload*, vstore*, and barrier tests. No change in
pass-rate.

The only ones that fail are vload_private with char/uchar/short/ushort
data types, but those failed before your series.

See below for one question/clarification.

Jan

amdgcn/lib/SOURCES | 2 ++
amdgcn/lib/mem_fence/fence.cl | 32 ++++++++++++++++++++++
amdgcn/lib/mem_fence/waitcnt.ll | 11 ++++++++
generic/include/clc/clc.h | 3 ++
.../clc/explicit_fence/explicit_memory_fence.h | 3 ++
5 files changed, 51 insertions(+)
create mode 100644 amdgcn/lib/mem_fence/fence.cl
create mode 100644 amdgcn/lib/mem_fence/waitcnt.ll
create mode 100644 generic/include/clc/explicit_fence/explicit_memory_fence.h

diff --git a/amdgcn/lib/SOURCES b/amdgcn/lib/SOURCES
index 1ff5fd1..24f5949 100644
--- a/amdgcn/lib/SOURCES
+++ b/amdgcn/lib/SOURCES
@@ -1,4 +1,6 @@
math/ldexp.cl
+mem_fence/fence.cl
+mem_fence/waitcnt.ll
synchronization/barrier_impl.ll
workitem/get_global_offset.cl
workitem/get_group_id.cl
diff --git a/amdgcn/lib/mem_fence/fence.cl b/amdgcn/lib/mem_fence/fence.cl
new file mode 100644
index 0000000..f64c6e2
--- /dev/null
+++ b/amdgcn/lib/mem_fence/fence.cl
@@ -0,0 +1,32 @@
+#include <clc/clc.h>
+
+void __clc_amdgcn_s_waitcnt(unsigned flags);
+
+// Newer clang supports __builtin_amdgcn_s_waitcnt
+#if __clang_major__ >= 5
+# define __waitcnt(x) __builtin_amdgcn_s_waitcnt(x)
+#else
+# define __waitcnt(x) __clc_amdgcn_s_waitcnt(x)
+#endif
+
+_CLC_DEF void mem_fence(cl_mem_fence_flags flags)
+{
+ if (flags & CLK_GLOBAL_MEM_FENCE) {
+ // scalar mem is counted with LGKM but we don't know whether
+ // the compiler turned any loads/stores to scalar
+ __waitcnt(0);

Silly question: does waitcnt(0) also synchronize local memory?

It's possible for someone to call write_mem_fence(CLK_GLOBAL_MEM_FENCE

CLK_GLOBAL_MEM_FENCE), and I want to make sure that doing an else-if

here is acceptable.

If the waitcnt(0) works to synchronize both global/local, then this series is:
Reviewed-By: Aaron Watry <awatry@gmail.com>
Tested-By: Aaron Watry <awatry@gmail.com>

Also, thanks for taking care of this. Both the barrier and mem_fence
functions have been bugging me for a while, and I've never gotten
around to cleaning them up.

--Aaron

> Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> ---
>
> __builtin_amdgcn_s_waitcnt path is compile tested only. I currently
> don't have machine with GCN hw and LLVM > 4

Tested on GCN 1.0 (PITCAIRN) with LLVM 6.0 svn (from sometime last week).

I ran the local-memory/global-memory piglit tests, and the conformance
basic/test_basic vload*, vstore*, and barrier tests. No change in
pass-rate.

The only ones that fail are vload_private with char/uchar/short/ushort
data types, but those failed before your series.

This sounds like type extension gone wrong/missing. I'll be upgrading
my GCN machine in the coming weeks.

See below for one question/clarification.

>
> Jan
>
> amdgcn/lib/SOURCES | 2 ++
> amdgcn/lib/mem_fence/fence.cl | 32 ++++++++++++++++++++++
> amdgcn/lib/mem_fence/waitcnt.ll | 11 ++++++++
> generic/include/clc/clc.h | 3 ++
> .../clc/explicit_fence/explicit_memory_fence.h | 3 ++
> 5 files changed, 51 insertions(+)
> create mode 100644 amdgcn/lib/mem_fence/fence.cl
> create mode 100644 amdgcn/lib/mem_fence/waitcnt.ll
> create mode 100644 generic/include/clc/explicit_fence/explicit_memory_fence.h
>
> diff --git a/amdgcn/lib/SOURCES b/amdgcn/lib/SOURCES
> index 1ff5fd1..24f5949 100644
> --- a/amdgcn/lib/SOURCES
> +++ b/amdgcn/lib/SOURCES
> @@ -1,4 +1,6 @@
> math/ldexp.cl
> +mem_fence/fence.cl
> +mem_fence/waitcnt.ll
> synchronization/barrier_impl.ll
> workitem/get_global_offset.cl
> workitem/get_group_id.cl
> diff --git a/amdgcn/lib/mem_fence/fence.cl b/amdgcn/lib/mem_fence/fence.cl
> new file mode 100644
> index 0000000..f64c6e2
> --- /dev/null
> +++ b/amdgcn/lib/mem_fence/fence.cl
> @@ -0,0 +1,32 @@
> +#include <clc/clc.h>
> +
> +void __clc_amdgcn_s_waitcnt(unsigned flags);
> +
> +// Newer clang supports __builtin_amdgcn_s_waitcnt
> +#if __clang_major__ >= 5
> +# define __waitcnt(x) __builtin_amdgcn_s_waitcnt(x)
> +#else
> +# define __waitcnt(x) __clc_amdgcn_s_waitcnt(x)
> +#endif
> +
> +_CLC_DEF void mem_fence(cl_mem_fence_flags flags)
> +{
> + if (flags & CLK_GLOBAL_MEM_FENCE) {
> + // scalar mem is counted with LGKM but we don't know whether
> + // the compiler turned any loads/stores to scalar
> + __waitcnt(0);

Silly question: does waitcnt(0) also synchronize local memory?

It's possible for someone to call write_mem_fence(CLK_GLOBAL_MEM_FENCE
> CLK_GLOBAL_MEM_FENCE), and I want to make sure that doing an else-if

here is acceptable.

Yes. I'll expand the comment. s_waitcnt takes 16bit argument with a
combined number of allowed pending operations:
[12:8] LGKM -- LDS, GDS, Konstant (scalar loads if I understood the
specs correctly), Messages (s_sendmsg, I don't think there's a way for
us to use this in OCL, otherwise very useful instruction)
[7] -- undefined
[6:4] -- exports and mem write (I understand that these are scalar mem
writes)
[3:0] -- vmem ops

using waitcnt(0) waits until there is no pending operation of any kind

If the waitcnt(0) works to synchronize both global/local, then this series is:
Reviewed-By: Aaron Watry <awatry@gmail.com>
Tested-By: Aaron Watry <awatry@gmail.com>

thanks, I'll add more detailed comment and datalayout to the .ll file
before pushing.

Jan

do you remember if those failures are new in LLVM 6? I just posted new
vload/vstore piglit tests, and those ran OK* on carrizo/iceland system
using LLVM 5.

*mostly. vload_half is missing from libclc to those test failed.
Unlike Turks, which fails ~40% of them (LLVM 6).

regards,
Jan

[SNIP]

> Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> ---
>
> __builtin_amdgcn_s_waitcnt path is compile tested only. I currently
> don't have machine with GCN hw and LLVM > 4

Tested on GCN 1.0 (PITCAIRN) with LLVM 6.0 svn (from sometime last week).

I ran the local-memory/global-memory piglit tests, and the conformance
basic/test_basic vload*, vstore*, and barrier tests. No change in
pass-rate.

The only ones that fail are vload_private with char/uchar/short/ushort
data types, but those failed before your series.

do you remember if those failures are new in LLVM 6? I just posted new
vload/vstore piglit tests, and those ran OK* on carrizo/iceland system
using LLVM 5.

*mostly. vload_half is missing from libclc to those test failed.
Unlike Turks, which fails ~40% of them (LLVM 6).

I don't recall running this test specifically before the 5.0rc
branch-point, so I don't know if this is a new failure. All of the
vload tests for local/global/constant seem ok, and vload private is
working for anything that has a 32-bit gentype or larger
(int/long/float/double). It's just the vloads of
char/uchar/short/ushort that are failing. I've been looking at the
vload code, and at least the CL C code we've got looks correct.

The CTS test allows you to tweak the number of data points that are
tested in individual threads, and curiously, when the global size goes
from 15 to 16, things start failing (at least for the char2 test).

I've diffed the clover_dump.cl and .ll files for both, and they're
identical. I copied all of the test data into a piglit test, and the
same CL code and input/output data passes in piglit... The piglit CL
and .ll files are identical to what is executed by the CTS other than
the giant piglit comment section at the top related to buffer/test
setup. Possibly a difference in API usage, or buffer alignments, or
something.

I'm not entirely sure that the vload code is to blame here, but it's
possible. It's working just fine for global/local/constant vload
(which doesn't rule out any weirdness with private memory in GCN
having possible alignment/register-size restrictions that I'm not
aware of). I think I'm going to spend a little more time working on
making sure that the create/write/read buffer pieces in clover (and
all of our synchronization pieces) are solid before I spend too much
more time here.

Ooh, fun fact... I just tried to run the test on my Barts (6850/NI),
and it hung the machine on the first test. At least my GCN just fails
the test...

Anyway, I'll be out of communication for a few days. New kid arriving
tomorrow, so I'll be buried in parents/in-laws and lacking in sleep.

If you want to take a look at the .cl/ll files I used from piglit, I'm
attaching them.

--Aaron

clover_dump.cl (2.09 KB)

clover_dump.link-0.ll (6.3 KB)

clover_dump.ll (6.97 KB)

> > > Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> > > ---
> > >
> > > __builtin_amdgcn_s_waitcnt path is compile tested only. I currently
> > > don't have machine with GCN hw and LLVM > 4
> >
> > Tested on GCN 1.0 (PITCAIRN) with LLVM 6.0 svn (from sometime last week).
> >
> > I ran the local-memory/global-memory piglit tests, and the conformance
> > basic/test_basic vload*, vstore*, and barrier tests. No change in
> > pass-rate.
> >
> > The only ones that fail are vload_private with char/uchar/short/ushort
> > data types, but those failed before your series.
>
> do you remember if those failures are new in LLVM 6? I just posted new
> vload/vstore piglit tests, and those ran OK* on carrizo/iceland system
> using LLVM 5.
>
> *mostly. vload_half is missing from libclc to those test failed.
> Unlike Turks, which fails ~40% of them (LLVM 6).

I don't recall running this test specifically before the 5.0rc
branch-point, so I don't know if this is a new failure. All of the
vload tests for local/global/constant seem ok, and vload private is
working for anything that has a 32-bit gentype or larger
(int/long/float/double). It's just the vloads of
char/uchar/short/ushort that are failing. I've been looking at the
vload code, and at least the CL C code we've got looks correct.

The CTS test allows you to tweak the number of data points that are
tested in individual threads, and curiously, when the global size goes
from 15 to 16, things start failing (at least for the char2 test).

If it's intermittent I'd suspect buffer manipulation rather than GPU
execution, but I might be wrong.
I mostly wanted to find out whether the new piglit tests hit the issue,
or the situation is more complicated.
Modified vstore tests managed to uncover fail in vstore-{u,}char-
private, but that's assertion failure, which I assume is different from
what you're seeing in CTS.

http://paul.rutgers.edu/~jv356/piglit/gcn-latest-3/problems.html

I've diffed the clover_dump.cl and .ll files for both, and they're
identical. I copied all of the test data into a piglit test, and the
same CL code and input/output data passes in piglit... The piglit CL
and .ll files are identical to what is executed by the CTS other than
the giant piglit comment section at the top related to buffer/test
setup. Possibly a difference in API usage, or buffer alignments, or
something.

I'm not entirely sure that the vload code is to blame here, but it's
possible. It's working just fine for global/local/constant vload
(which doesn't rule out any weirdness with private memory in GCN
having possible alignment/register-size restrictions that I'm not
aware of).

VI+ GCN parts support i16 instructions, and there was a similar bug
before: https://reviews.llvm.org/D30281

I think I'm going to spend a little more time working on
making sure that the create/write/read buffer pieces in clover (and
all of our synchronization pieces) are solid before I spend too much
more time here.

thanks. clover on GCN is not really high on my list. my gcn machine is
setup with rocm.
Moreover, upgrading to LLVM 5 regressed both pyrite (can't select
device) and bfgminer (flood of acpi errors in dmesg).

Ooh, fun fact... I just tried to run the test on my Barts (6850/NI),
and it hung the machine on the first test. At least my GCN just fails
the test...

this looks related to the ongoing clpeak reported bug, the kernel
driver is hit or miss when it comes to recovering hung GPU.

Anyway, I'll be out of communication for a few days. New kid arriving
tomorrow, so I'll be buried in parents/in-laws and lacking in sleep.

congratz, and no worries, there's no rush when it comes to clover :slight_smile:

If you want to take a look at the .cl/ll files I used from piglit, I'm
attaching them.

thanks,
Jan

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

__builtin_amdgcn_s_waitcnt path is compile tested only. I currently
don’t have machine with GCN hw and LLVM > 4

Tested on GCN 1.0 (PITCAIRN) with LLVM 6.0 svn (from sometime last week).

I ran the local-memory/global-memory piglit tests, and the conformance
basic/test_basic vload*, vstore*, and barrier tests. No change in
pass-rate.

The only ones that fail are vload_private with char/uchar/short/ushort
data types, but those failed before your series.

do you remember if those failures are new in LLVM 6? I just posted new
vload/vstore piglit tests, and those ran OK* on carrizo/iceland system
using LLVM 5.

*mostly. vload_half is missing from libclc to those test failed.
Unlike Turks, which fails ~40% of them (LLVM 6).

I don’t recall running this test specifically before the 5.0rc
branch-point, so I don’t know if this is a new failure. All of the
vload tests for local/global/constant seem ok, and vload private is
working for anything that has a 32-bit gentype or larger
(int/long/float/double). It’s just the vloads of
char/uchar/short/ushort that are failing. I’ve been looking at the
vload code, and at least the CL C code we’ve got looks correct.

The CTS test allows you to tweak the number of data points that are
tested in individual threads, and curiously, when the global size goes
from 15 to 16, things start failing (at least for the char2 test).

If it’s intermittent I’d suspect buffer manipulation rather than GPU
execution, but I might be wrong.
I mostly wanted to find out whether the new piglit tests hit the issue,
or the situation is more complicated.
Modified vstore tests managed to uncover fail in vstore-{u,}char-
private, but that’s assertion failure, which I assume is different from
what you’re seeing in CTS.

Good point about the assertions, I don’t know that I’m actually running a debug+assertions build right now. That being said the failures aren’t transient. The piglit test I sent consistently passes and the cts one passes when working on ≤ 15 threads. It always fails at or over 16 threads… I’ll see if I can track it down at some point. With libclc/clover there’s enough bugs and missing features to chase down that I might just go for some other low-hanging fruit first.

Unfortunately I didn’t get a chance to run those new tests last night. I’ll let you know how it goes when I get a chance to try them.

–Aaron