[PATCH 1/3] Implement wait_group_events builtin

This is a simple default implemetation which just calls barrier().

This is a simple implementation which just copies data synchronously.

This is a simple implementation which just copies data synchronously.

Why not barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)?

The spec says "This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy" although this doesn't sound like a really enforceable restriction

This should use size_t instead of unsigned to match the bound type.

The __builtins not really working with address spaces is a general problem that keeps coming up. How should it be fixed? I looked into fixing this for the __builtin_nanf problem. The problem seems to be clang conflates address space 0 with the lack of an address space in some places, and not others. The builtins all have no address space specified for their types, which is de factor considered to be 0, although Qualifiers/QualType can distinguish between having a specified address space 0 and not having an address space set. I think clang needs to move to considering the private address space as a distinct address space like the others, and address space checks should consistently check if there is an explicitly set address space before checking if they match.

size_t like the other one

This is a simple implementation which just copies data synchronously.
---
generic/include/clc/async/async_work_group_copy.h | 15 +++++++++++++++
generic/include/clc/async/async_work_group_copy.inc | 5 +++++
generic/include/clc/clc.h | 1 +
generic/lib/SOURCES | 1 +
generic/lib/async/async_work_group_copy.cl | 21 +++++++++++++++++++++
generic/lib/async/async_work_group_copy.inc | 16 ++++++++++++++++
6 files changed, 59 insertions(+)
create mode 100644 generic/include/clc/async/async_work_group_copy.h
create mode 100644 generic/include/clc/async/async_work_group_copy.inc
create mode 100644 generic/lib/async/async_work_group_copy.cl
create mode 100644 generic/lib/async/async_work_group_copy.inc

diff --git a/generic/include/clc/async/async_work_group_copy.h b/generic/include/clc/async/async_work_group_copy.h
new file mode 100644
index 0000000..39c637b
--- /dev/null
+++ b/generic/include/clc/async/async_work_group_copy.h
@@ -0,0 +1,15 @@
+#define __CLC_DST_ADDR_SPACE local
+#define __CLC_SRC_ADDR_SPACE global
+#define __CLC_BODY <clc/async/async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
+
+#define __CLC_DST_ADDR_SPACE global
+#define __CLC_SRC_ADDR_SPACE local
+#define __CLC_BODY <clc/async/async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
diff --git a/generic/include/clc/async/async_work_group_copy.inc b/generic/include/clc/async/async_work_group_copy.inc
new file mode 100644
index 0000000..d85df6c
--- /dev/null
+++ b/generic/include/clc/async/async_work_group_copy.inc
@@ -0,0 +1,5 @@
+_CLC_OVERLOAD _CLC_DECL event_t async_work_group_copy(
+ __CLC_DST_ADDR_SPACE __CLC_GENTYPE *dst,
+ const __CLC_SRC_ADDR_SPACE __CLC_GENTYPE *src,
+ size_t num_gentypes,
+ event_t event);
diff --git a/generic/include/clc/clc.h b/generic/include/clc/clc.h
index f499e6d..ed741b1 100644
--- a/generic/include/clc/clc.h
+++ b/generic/include/clc/clc.h
@@ -125,6 +125,7 @@
#include <clc/synchronization/barrier.h>

/* 6.11.10 Async Copy and Prefetch Functions */
+#include <clc/async/async_work_group_copy.h>
#include <clc/async/prefetch.h>
#include <clc/async/wait_group_events.h>

diff --git a/generic/lib/SOURCES b/generic/lib/SOURCES
index 3e847fe..e7dbca5 100644
--- a/generic/lib/SOURCES
+++ b/generic/lib/SOURCES
@@ -1,3 +1,4 @@
+async/async_work_group_copy.cl
async/prefetch.cl
async/wait_group_events.cl
atomic/atomic_impl.ll
diff --git a/generic/lib/async/async_work_group_copy.cl b/generic/lib/async/async_work_group_copy.cl
new file mode 100644
index 0000000..31c71d6
--- /dev/null
+++ b/generic/lib/async/async_work_group_copy.cl
@@ -0,0 +1,21 @@
+#include <clc/clc.h>
+
+#ifdef cl_khr_fp64
+#pragma OPENCL EXTENSION cl_khr_fp64 : enable
+#endif
+
+#define __CLC_DST_ADDR_SPACE local
+#define __CLC_SRC_ADDR_SPACE global
+#define __CLC_BODY <async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
+
+#define __CLC_DST_ADDR_SPACE global
+#define __CLC_SRC_ADDR_SPACE local
+#define __CLC_BODY <async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
diff --git a/generic/lib/async/async_work_group_copy.inc b/generic/lib/async/async_work_group_copy.inc
new file mode 100644
index 0000000..dd3db3f
--- /dev/null
+++ b/generic/lib/async/async_work_group_copy.inc
@@ -0,0 +1,16 @@
+_CLC_OVERLOAD _CLC_DEF event_t async_work_group_copy(
+ __CLC_DST_ADDR_SPACE __CLC_GENTYPE *dst,
+ const __CLC_SRC_ADDR_SPACE __CLC_GENTYPE *src,
+ size_t num_gentypes,
+ event_t event) {
+
+ // __builtin_memcpy doesn't work with address spaces, so we need to
+ // implement the copy using a loop.
+
+ unsigned i;
+ for (i = 0; i < num_gentypes; ++i) {
+ dst[i] = src[i];
+ }

If I understand this correctly, this lets every thread in the workgroup do the copy.
So this code has a data races if executed by more than one thread. OpenCL 1.2/1.1
does not say anything about the behaviour of racy code, so I’m not sure whether
the behaviour of this code is properly defined. If the intention is to eventually support
OpenCL 2.0, then the behaviour of is definitely undefined (due to the data races).

Jeroen

+_CLC_DEF void wait_group_events(int num_events, event_t *event_list) {
+ barrier(CLK_LOCAL_MEM_FENCE);
+ barrier(CLK_GLOBAL_MEM_FENCE);
+}

Why not barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)?

Technically, I think that using a barrier here is too strong, as the function just has to guarantee that the copying has finished. It does not have to guarantee that all memory operations induced by the copying have finished:

http://www.khronos.org/message_boards/showthread.php/5875-Profiling-Code/page2

However, I guess that with a pure OpenCL implementation using a barrier is the only option.

The spec says "This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy" although this doesn't sound like a really enforceable restriction

I think this applies to async_work_group_copy, not to wait_groups_events. Also, it doesn’t seem something that needs to be enforced, it’s something the user can expect. Her or she better insert fences/barriers before starting a copy.

Jeroen

> This is a simple implementation which just copies data synchronously.
> ---
> generic/include/clc/async/async_work_group_copy.h | 15 +++++++++++++++
> generic/include/clc/async/async_work_group_copy.inc | 5 +++++
> generic/include/clc/clc.h | 1 +
> generic/lib/SOURCES | 1 +
> generic/lib/async/async_work_group_copy.cl | 21 +++++++++++++++++++++
> generic/lib/async/async_work_group_copy.inc | 16 ++++++++++++++++
> 6 files changed, 59 insertions(+)
> create mode 100644 generic/include/clc/async/async_work_group_copy.h
> create mode 100644 generic/include/clc/async/async_work_group_copy.inc
> create mode 100644 generic/lib/async/async_work_group_copy.cl
> create mode 100644 generic/lib/async/async_work_group_copy.inc
>
> diff --git a/generic/include/clc/async/async_work_group_copy.h b/generic/include/clc/async/async_work_group_copy.h
> new file mode 100644
> index 0000000..39c637b
> --- /dev/null
> +++ b/generic/include/clc/async/async_work_group_copy.h
> @@ -0,0 +1,15 @@
> +#define __CLC_DST_ADDR_SPACE local
> +#define __CLC_SRC_ADDR_SPACE global
> +#define __CLC_BODY <clc/async/async_work_group_copy.inc>
> +#include <clc/async/gentype.inc>
> +#undef __CLC_DST_ADDR_SPACE
> +#undef __CLC_SRC_ADDR_SPACE
> +#undef __CLC_BODY
> +
> +#define __CLC_DST_ADDR_SPACE global
> +#define __CLC_SRC_ADDR_SPACE local
> +#define __CLC_BODY <clc/async/async_work_group_copy.inc>
> +#include <clc/async/gentype.inc>
> +#undef __CLC_DST_ADDR_SPACE
> +#undef __CLC_SRC_ADDR_SPACE
> +#undef __CLC_BODY
> diff --git a/generic/include/clc/async/async_work_group_copy.inc b/generic/include/clc/async/async_work_group_copy.inc
> new file mode 100644
> index 0000000..d85df6c
> --- /dev/null
> +++ b/generic/include/clc/async/async_work_group_copy.inc
> @@ -0,0 +1,5 @@
> +_CLC_OVERLOAD _CLC_DECL event_t async_work_group_copy(
> + __CLC_DST_ADDR_SPACE __CLC_GENTYPE *dst,
> + const __CLC_SRC_ADDR_SPACE __CLC_GENTYPE *src,
> + size_t num_gentypes,
> + event_t event);
> diff --git a/generic/include/clc/clc.h b/generic/include/clc/clc.h
> index f499e6d..ed741b1 100644
> --- a/generic/include/clc/clc.h
> +++ b/generic/include/clc/clc.h
> @@ -125,6 +125,7 @@
> #include <clc/synchronization/barrier.h>
>
> /* 6.11.10 Async Copy and Prefetch Functions */
> +#include <clc/async/async_work_group_copy.h>
> #include <clc/async/prefetch.h>
> #include <clc/async/wait_group_events.h>
>
> diff --git a/generic/lib/SOURCES b/generic/lib/SOURCES
> index 3e847fe..e7dbca5 100644
> --- a/generic/lib/SOURCES
> +++ b/generic/lib/SOURCES
> @@ -1,3 +1,4 @@
> +async/async_work_group_copy.cl
> async/prefetch.cl
> async/wait_group_events.cl
> atomic/atomic_impl.ll
> diff --git a/generic/lib/async/async_work_group_copy.cl b/generic/lib/async/async_work_group_copy.cl
> new file mode 100644
> index 0000000..31c71d6
> --- /dev/null
> +++ b/generic/lib/async/async_work_group_copy.cl
> @@ -0,0 +1,21 @@
> +#include <clc/clc.h>
> +
> +#ifdef cl_khr_fp64
> +#pragma OPENCL EXTENSION cl_khr_fp64 : enable
> +#endif
> +
> +#define __CLC_DST_ADDR_SPACE local
> +#define __CLC_SRC_ADDR_SPACE global
> +#define __CLC_BODY <async_work_group_copy.inc>
> +#include <clc/async/gentype.inc>
> +#undef __CLC_DST_ADDR_SPACE
> +#undef __CLC_SRC_ADDR_SPACE
> +#undef __CLC_BODY
> +
> +#define __CLC_DST_ADDR_SPACE global
> +#define __CLC_SRC_ADDR_SPACE local
> +#define __CLC_BODY <async_work_group_copy.inc>
> +#include <clc/async/gentype.inc>
> +#undef __CLC_DST_ADDR_SPACE
> +#undef __CLC_SRC_ADDR_SPACE
> +#undef __CLC_BODY
> diff --git a/generic/lib/async/async_work_group_copy.inc b/generic/lib/async/async_work_group_copy.inc
> new file mode 100644
> index 0000000..dd3db3f
> --- /dev/null
> +++ b/generic/lib/async/async_work_group_copy.inc
> @@ -0,0 +1,16 @@
> +_CLC_OVERLOAD _CLC_DEF event_t async_work_group_copy(
> + __CLC_DST_ADDR_SPACE __CLC_GENTYPE *dst,
> + const __CLC_SRC_ADDR_SPACE __CLC_GENTYPE *src,
> + size_t num_gentypes,
> + event_t event) {
> +
> + // __builtin_memcpy doesn't work with address spaces, so we need to
> + // implement the copy using a loop.
> +
> + unsigned i;
> + for (i = 0; i < num_gentypes; ++i) {
> + dst[i] = src[i];
> + }

If I understand this correctly, this lets every thread in the workgroup do the copy.
So this code has a data races if executed by more than one thread. OpenCL 1.2/1.1
does not say anything about the behaviour of racy code, so I’m not sure whether
the behaviour of this code is properly defined. If the intention is to eventually support
OpenCL 2.0, then the behaviour of is definitely undefined (due to the data races).

This isn't a bug in the implementation though, right? Isn't
it up the user to make sure the memory being written by each
thread doesn't overlap?

-Tom

This is a simple implementation which just copies data synchronously.
---
generic/include/clc/async/async_work_group_copy.h | 15 +++++++++++++++
generic/include/clc/async/async_work_group_copy.inc | 5 +++++
generic/include/clc/clc.h | 1 +
generic/lib/SOURCES | 1 +
generic/lib/async/async_work_group_copy.cl | 21 +++++++++++++++++++++
generic/lib/async/async_work_group_copy.inc | 16 ++++++++++++++++
6 files changed, 59 insertions(+)
create mode 100644 generic/include/clc/async/async_work_group_copy.h
create mode 100644 generic/include/clc/async/async_work_group_copy.inc
create mode 100644 generic/lib/async/async_work_group_copy.cl
create mode 100644 generic/lib/async/async_work_group_copy.inc

diff --git a/generic/include/clc/async/async_work_group_copy.h b/generic/include/clc/async/async_work_group_copy.h
new file mode 100644
index 0000000..39c637b
--- /dev/null
+++ b/generic/include/clc/async/async_work_group_copy.h
@@ -0,0 +1,15 @@
+#define __CLC_DST_ADDR_SPACE local
+#define __CLC_SRC_ADDR_SPACE global
+#define __CLC_BODY <clc/async/async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
+
+#define __CLC_DST_ADDR_SPACE global
+#define __CLC_SRC_ADDR_SPACE local
+#define __CLC_BODY <clc/async/async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
diff --git a/generic/include/clc/async/async_work_group_copy.inc b/generic/include/clc/async/async_work_group_copy.inc
new file mode 100644
index 0000000..d85df6c
--- /dev/null
+++ b/generic/include/clc/async/async_work_group_copy.inc
@@ -0,0 +1,5 @@
+_CLC_OVERLOAD _CLC_DECL event_t async_work_group_copy(
+ __CLC_DST_ADDR_SPACE __CLC_GENTYPE *dst,
+ const __CLC_SRC_ADDR_SPACE __CLC_GENTYPE *src,
+ size_t num_gentypes,
+ event_t event);
diff --git a/generic/include/clc/clc.h b/generic/include/clc/clc.h
index f499e6d..ed741b1 100644
--- a/generic/include/clc/clc.h
+++ b/generic/include/clc/clc.h
@@ -125,6 +125,7 @@
#include <clc/synchronization/barrier.h>

/* 6.11.10 Async Copy and Prefetch Functions */
+#include <clc/async/async_work_group_copy.h>
#include <clc/async/prefetch.h>
#include <clc/async/wait_group_events.h>

diff --git a/generic/lib/SOURCES b/generic/lib/SOURCES
index 3e847fe..e7dbca5 100644
--- a/generic/lib/SOURCES
+++ b/generic/lib/SOURCES
@@ -1,3 +1,4 @@
+async/async_work_group_copy.cl
async/prefetch.cl
async/wait_group_events.cl
atomic/atomic_impl.ll
diff --git a/generic/lib/async/async_work_group_copy.cl b/generic/lib/async/async_work_group_copy.cl
new file mode 100644
index 0000000..31c71d6
--- /dev/null
+++ b/generic/lib/async/async_work_group_copy.cl
@@ -0,0 +1,21 @@
+#include <clc/clc.h>
+
+#ifdef cl_khr_fp64
+#pragma OPENCL EXTENSION cl_khr_fp64 : enable
+#endif
+
+#define __CLC_DST_ADDR_SPACE local
+#define __CLC_SRC_ADDR_SPACE global
+#define __CLC_BODY <async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
+
+#define __CLC_DST_ADDR_SPACE global
+#define __CLC_SRC_ADDR_SPACE local
+#define __CLC_BODY <async_work_group_copy.inc>
+#include <clc/async/gentype.inc>
+#undef __CLC_DST_ADDR_SPACE
+#undef __CLC_SRC_ADDR_SPACE
+#undef __CLC_BODY
diff --git a/generic/lib/async/async_work_group_copy.inc b/generic/lib/async/async_work_group_copy.inc
new file mode 100644
index 0000000..dd3db3f
--- /dev/null
+++ b/generic/lib/async/async_work_group_copy.inc
@@ -0,0 +1,16 @@
+_CLC_OVERLOAD _CLC_DEF event_t async_work_group_copy(
+ __CLC_DST_ADDR_SPACE __CLC_GENTYPE *dst,
+ const __CLC_SRC_ADDR_SPACE __CLC_GENTYPE *src,
+ size_t num_gentypes,
+ event_t event) {
+
+ // __builtin_memcpy doesn't work with address spaces, so we need to
+ // implement the copy using a loop.
+
+ unsigned i;
+ for (i = 0; i < num_gentypes; ++i) {
+ dst[i] = src[i];
+ }

If I understand this correctly, this lets every thread in the workgroup do the copy.
So this code has a data races if executed by more than one thread. OpenCL 1.2/1.1
does not say anything about the behaviour of racy code, so I’m not sure whether
the behaviour of this code is properly defined. If the intention is to eventually support
OpenCL 2.0, then the behaviour of is definitely undefined (due to the data races).

This isn't a bug in the implementation though, right? Isn't
it up the user to make sure the memory being written by each
thread doesn't overlap?

I think it is a bug in the implementation. All work-items in a work-group
need to reach the the function call with exactly the same arguments:

"The async copy is performed by all work-items in a work-group and this
built-in function must therefore be encountered by all work-items in a
workgroup executing the kernel with the same argument values; otherwise
the results are undefined.”

With the code you propose, this would mean that all work-items in a work-group
will execute exactly the same copy, which means the code is racy.

Jeroen