[PATCH 1/1] r600: Add fence implementation, rework barrier

HW flushes caches after each mem clause (Ch. 8.2 of the EG manual),
so we need to pass along the information that indicates the requested
clause boundaries.
Use the existing LLVM fence instruction for that.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

Hi Jan,

Just a quick note. I've been working on an alternative implementation
of this (reworking barriers/fences, and getting global barriers to not
crash on radeonsi), and I plan on reviewing your change... Just need
to find time to do it.

--Aaron

HW flushes caches after each mem clause (Ch. 8.2 of the EG manual),
so we need to pass along the information that indicates the requested
clause boundaries.
Use the existing LLVM fence instruction for that.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---

I'm not sure whether using 'acquire' ordering for reads and 'release'
for writes is the correct thing with respect to the OpenCL specs.
It seems to get the read-write relation correct, but I'm not sure about
read ordering or write ordering. Making everything use the strongest
mem_fence might be safer.

> HW flushes caches after each mem clause (Ch. 8.2 of the EG manual),
> so we need to pass along the information that indicates the requested
> clause boundaries.
> Use the existing LLVM fence instruction for that.
>
> Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> ---
>
> I'm not sure whether using 'acquire' ordering for reads and 'release'
> for writes is the correct thing with respect to the OpenCL specs.
> It seems to get the read-write relation correct, but I'm not sure about
> read ordering or write ordering. Making everything use the strongest
> mem_fence might be safer.
>

It looks like the LLVM memory orderings are designed for C++11
memory ordering:
http://en.cppreference.com/w/cpp/atomic/memory_order

The OpenCL spec is really unclear as to what mem_fence means,
so I think we may have to use seq_cst, but I think you should
pose this question to the cfe-dev mailing list to see what others
do.
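
For reference, a rough sketch of how those orderings look as LLVM IR
fences (illustrative only; semantics paraphrased from the LangRef):

; acquire: later loads/stores may not be hoisted above earlier loads
fence acquire
; release: earlier loads/stores may not sink below later stores
fence release
; seq_cst: full two-way barrier, part of the single total order
fence seq_cst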

Other than that, the patch looks good to me.

> This patch changes global-memory.cl piglit crash from:
> i32 = GlobalAddress<void ()* @llvm.AMDGPU.barrier.global> 0 [ORD=11]
> Undefined function
> to:
> LLVM ERROR: Cannot select: 0x24b9f58: i32,ch = load 0x2090108, 0x24b9718, 0x24b9c40<LD1[@barrier_test_local_int.test(addrspace=3)], zext from i1> [ORD=4] [ID=13]
> 0x24b9718: i32 = Constant<0> [ID=7]
> 0x24b9c40: i32 = undef [ID=2]
> In function: barrier_test_local_int
>
> Without the fences, the piglit test just fails.

I'm seeing this failure with this patch:
LLVM ERROR: Cannot select: 0x2b8ae60: ch = AtomicFence 0x2b8caa0, 0x2b8ad58, 0x2b8a938 [ORD=11] [ID=37]
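
Presumably even a module as small as this would hit the same selection
path (an untested guess):

define void @f() {
  fence seq_cst
  ret void
}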

Do you have an LLVM patch to handle AtomicFence?

-Tom

HW flushes caches after each mem clause (Ch. 8.2 of the EG manual),
so we need to pass along the information that indicates the requested
clause boundaries.
Use the existing LLVM fence instruction for that.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---

I'm not sure whether using 'acquire' ordering for reads and 'release'
for writes is the correct thing with respect to the OpenCL specs.
It seems to get the read-write relation correct, but I'm not sure about
read ordering or write ordering. Making everything use the strongest
mem_fence might be safer.

This patch changes global-memory.cl piglit crash from:
i32 = GlobalAddress<void ()* @llvm.AMDGPU.barrier.global> 0 [ORD=11]
Undefined function
to:
LLVM ERROR: Cannot select: 0x24b9f58: i32,ch = load 0x2090108, 0x24b9718, 0x24b9c40<LD1[@barrier_test_local_int.test(addrspace=3)], zext from i1> [ORD=4] [ID=13]
  0x24b9718: i32 = Constant<0> [ID=7]
  0x24b9c40: i32 = undef [ID=2]
In function: barrier_test_local_int

Without the fences, the piglit test just fails.

r600/lib/SOURCES | 1 +
r600/lib/synchronization/barrier_impl.ll | 25 ++----------------
r600/lib/synchronization/fence_impl.ll | 44 ++++++++++++++++++++++++++++++++
3 files changed, 47 insertions(+), 23 deletions(-)
create mode 100644 r600/lib/synchronization/fence_impl.ll

diff --git a/r600/lib/SOURCES b/r600/lib/SOURCES
index d9fc897..d93ec0f 100644
--- a/r600/lib/SOURCES
+++ b/r600/lib/SOURCES
@@ -7,4 +7,5 @@ workitem/get_local_id.ll
workitem/get_global_size.ll
synchronization/barrier.cl
synchronization/barrier_impl.ll
+synchronization/fence_impl.ll
shared/vload.cl
diff --git a/r600/lib/synchronization/barrier_impl.ll b/r600/lib/synchronization/barrier_impl.ll
index 3d8ee66..2f2d865 100644
--- a/r600/lib/synchronization/barrier_impl.ll
+++ b/r600/lib/synchronization/barrier_impl.ll
@@ -1,29 +1,8 @@
-declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
-declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
+declare void @mem_fence(i32 %flags) nounwind alwaysinline
declare void @llvm.AMDGPU.barrier.local() nounwind noduplicate
-declare void @llvm.AMDGPU.barrier.global() nounwind noduplicate

define void @barrier(i32 %flags) nounwind noduplicate alwaysinline {
-barrier_local_test:
- %CLK_LOCAL_MEM_FENCE = call i32 @__clc_clk_local_mem_fence()
- %0 = and i32 %flags, %CLK_LOCAL_MEM_FENCE
- %1 = icmp ne i32 %0, 0
- br i1 %1, label %barrier_local, label %barrier_global_test
-
-barrier_local:
+ call void @mem_fence(i32 %flags)
   call void @llvm.AMDGPU.barrier.local() noduplicate
- br label %barrier_global_test
-
-barrier_global_test:
- %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
- %2 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
- %3 = icmp ne i32 %2, 0
- br i1 %3, label %barrier_global, label %done
-
-barrier_global:
- call void @llvm.AMDGPU.barrier.global() noduplicate
- br label %done
-
-done:
   ret void
}
diff --git a/r600/lib/synchronization/fence_impl.ll b/r600/lib/synchronization/fence_impl.ll
new file mode 100644
index 0000000..a72e70e
--- /dev/null
+++ b/r600/lib/synchronization/fence_impl.ll
@@ -0,0 +1,44 @@
+declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
+declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
+
+define void @mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
+ %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
+ %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
+ %2 = icmp ne i32 %1, 0
+ br i1 %2, label %global_fence, label %exit_fence
+
+global_fence:
+ fence seq_cst
+ br label %exit_fence
+
+exit_fence:
+ ret void
+}
+
+define void @read_mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
+ %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
+ %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
+ %2 = icmp ne i32 %1, 0
+ br i1 %2, label %global_fence, label %exit_fence
+
+global_fence:

The label global_fence is defined three times in this file. That may
be OK (the compiler isn't yelling about it, probably because they're in
different functions), but it does rub me the wrong way...
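
If it helps, one way to sidestep the repeated names is a per-function
label prefix. A purely cosmetic sketch (label names made up, and
assuming the acquire ordering mentioned in the patch comments for the
read fence):

define void @read_mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
  %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %rf_global_fence, label %rf_exit

rf_global_fence:
  fence acquire
  br label %rf_exit

rf_exit:
  ret void
}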

I'll echo Tom's comments. For now, we might end up needing to use
seq_cst, but if we can get clarification, that'd be good. I also get
the same AtomicFence selection error on my machines as Tom is
getting... but that could be because we're running a different LLVM
version than you are (maybe?).

Once we get your changes in, I have a bit of refactoring of most of
barrier/[write|read]_mem_fence that I want to get finished up (given
that working mem fences are sort of required for barriers, I was
trying to implement those standalone functions first).

Other than the label issue, and Tom's comments, this patch looks ok to me.

--Aaron

> HW flushes caches after each mem clause (Ch. 8.2 of the EG manual),
> so we need to pass along the information that indicates the requested
> clause boundaries.
> Use the existing LLVM fence instruction for that.
>
> Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> ---
>
> I'm not sure whether using 'acquire' ordering for reads and 'release'
> for writes is the correct thing with respect to the OpenCL specs.
> It seems to get the read-write relation correct, but I'm not sure about
> read ordering or write ordering. Making everything use the strongest
> mem_fence might be safer.
>
> This patch changes global-memory.cl piglit crash from:
> i32 = GlobalAddress<void ()* @llvm.AMDGPU.barrier.global> 0 [ORD=11]
> Undefined function
> to:
> LLVM ERROR: Cannot select: 0x24b9f58: i32,ch = load 0x2090108, 0x24b9718, 0x24b9c40<LD1[@barrier_test_local_int.test(addrspace=3)], zext from i1> [ORD=4] [ID=13]
> 0x24b9718: i32 = Constant<0> [ID=7]
> 0x24b9c40: i32 = undef [ID=2]
> In function: barrier_test_local_int
>
> Without the fences, the piglit test just fails.
>
>
> r600/lib/SOURCES | 1 +
> r600/lib/synchronization/barrier_impl.ll | 25 ++----------------
> r600/lib/synchronization/fence_impl.ll | 44 ++++++++++++++++++++++++++++++++
> 3 files changed, 47 insertions(+), 23 deletions(-)
> create mode 100644 r600/lib/synchronization/fence_impl.ll
>
> diff --git a/r600/lib/SOURCES b/r600/lib/SOURCES
> index d9fc897..d93ec0f 100644
> --- a/r600/lib/SOURCES
> +++ b/r600/lib/SOURCES
> @@ -7,4 +7,5 @@ workitem/get_local_id.ll
> workitem/get_global_size.ll
> synchronization/barrier.cl
> synchronization/barrier_impl.ll
> +synchronization/fence_impl.ll
> shared/vload.cl
> diff --git a/r600/lib/synchronization/barrier_impl.ll b/r600/lib/synchronization/barrier_impl.ll
> index 3d8ee66..2f2d865 100644
> --- a/r600/lib/synchronization/barrier_impl.ll
> +++ b/r600/lib/synchronization/barrier_impl.ll
> @@ -1,29 +1,8 @@
> -declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
> -declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
> +declare void @mem_fence(i32 %flags) nounwind alwaysinline
> declare void @llvm.AMDGPU.barrier.local() nounwind noduplicate
> -declare void @llvm.AMDGPU.barrier.global() nounwind noduplicate
>
> define void @barrier(i32 %flags) nounwind noduplicate alwaysinline {
> -barrier_local_test:
> - %CLK_LOCAL_MEM_FENCE = call i32 @__clc_clk_local_mem_fence()
> - %0 = and i32 %flags, %CLK_LOCAL_MEM_FENCE
> - %1 = icmp ne i32 %0, 0
> - br i1 %1, label %barrier_local, label %barrier_global_test
> -
> -barrier_local:
> + call void @mem_fence(i32 %flags)
> call void @llvm.AMDGPU.barrier.local() noduplicate
> - br label %barrier_global_test
> -
> -barrier_global_test:
> - %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
> - %2 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
> - %3 = icmp ne i32 %2, 0
> - br i1 %3, label %barrier_global, label %done
> -
> -barrier_global:
> - call void @llvm.AMDGPU.barrier.global() noduplicate
> - br label %done
> -
> -done:
> ret void
> }
> diff --git a/r600/lib/synchronization/fence_impl.ll b/r600/lib/synchronization/fence_impl.ll
> new file mode 100644
> index 0000000..a72e70e
> --- /dev/null
> +++ b/r600/lib/synchronization/fence_impl.ll
> @@ -0,0 +1,44 @@
> +declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
> +declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
> +
> +define void @mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
> + %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
> + %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
> + %2 = icmp ne i32 %1, 0
> + br i1 %2, label %global_fence, label %exit_fence
> +
> +global_fence:
> + fence seq_cst
> + br label %exit_fence
> +
> +exit_fence:
> + ret void
> +}
> +
> +define void @read_mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
> + %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
> + %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
> + %2 = icmp ne i32 %1, 0
> + br i1 %2, label %global_fence, label %exit_fence
> +
> +global_fence:

The label global_fence is defined three times in this file. That may
be OK (the compiler isn't yelling about it, probably because they're in
different functions), but it does rub me the wrong way...

I'll echo Tom's comments. For now, we might end up needing to use
seq_cst, but if we can get clarification, that'd be good. I also get
the same AtomicFence selection error on my machines as Tom is
getting... but that could be because we're running a different LLVM
version than you are (maybe?).

That's the intended behavior for now (see the comments section of the
patch).
I don't have the LLVM part yet; it will take me some time to understand
how memory instructions are put into clauses, and how to use fences in
the process.

The reason I posted the patch was to get some feedback on whether it
makes sense to use the existing LLVM fence instruction to pass the
information to LLVM, and whether it makes sense to use separate
intrinsics for read/write fences, since we'll need to treat them the
same way anyway (at least that's my understanding of the memory part of
the ISA manual).
I should have marked the patch as RFC/WIP.

My original approach implemented read/write fences by just calling
mem_fence(). That should take care of both using only seq_cst and the
duplicate labels. I can post it as a v2 if you are ok with a patch that
is not really useful on its own.
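
For concreteness, the v2 bodies would collapse to (untested):

define void @read_mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
  call void @mem_fence(i32 %flags)
  ret void
}

define void @write_mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
  call void @mem_fence(i32 %flags)
  ret void
}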

regards,
Jan

You can't use the LLVM atomic fence instruction for this. For reasons
I do not understand, the LLVM fence instruction only impacts the
ordering of other atomic instructions, but OpenCL mem_fence applies to
all memory accesses.

Plus, the LLVM fence instruction doesn't specify the address space(s).

From examining the instructions generated by CodeXL/KernelAnalyzer for
BARTS (6850), it looks like any barrier(CLK_GLOBAL_MEM_FENCE) we do has
to request acknowledgement at the time we issue the write
(MEM_RAT_CACHELESS_STORE_RAW_ACK), and then use WAIT_ACK to wait until
the number of pending acknowledgements falls to <= 0, after which we
then also issue a GROUP_BARRIER instruction. So basically the
mem_fence gets issued/processed, followed by the barrier instruction.

I'm not sure if this implies that we need to look back in the command
stream to change the store instruction to request acknowledgement...
or if there's a way around that.

If anyone wants the source kernel and binary dumps of the IL/ISA, I'm
perfectly willing to share. Pretty simple stuff, though: just declare
a private int, set it, store it globally, and then
barrier(CLK_GLOBAL_MEM_FENCE).
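
Roughly this, as hand-written IR (not a clang dump; the kernel name and
the stored constant are made up):

declare void @barrier(i32)
declare i32 @__clc_clk_global_mem_fence()

define void @store_then_barrier(i32 addrspace(1)* %out) {
  %flag = call i32 @__clc_clk_global_mem_fence()
  store i32 1, i32 addrspace(1)* %out
  call void @barrier(i32 %flag)
  ret void
}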

--Aaron

Hi,

My original approach implemented read/write fences by just calling
mem_fence(). That should take care of both using only seq_cst and the
duplicate labels. I can post it as a v2 if you are ok with a patch that
is not really useful on its own.

regards,
Jan

You can't use the LLVM atomic fence instruction for this. For reasons I do not understand, the LLVM fence instruction only impacts the ordering of other atomic instructions, but OpenCL mem_fence applies to all memory accesses.

I don't think this is totally accurate: the LLVM atomic orderings (and
the fence takes an ordering argument) are intended to implement the
C++11 atomic orderings, and those put requirements on the writes to
non-atomic memory locations. The IR language reference leans heavily on
the C++11 spec for this reason [0]. However, the OpenCL mem_fence also
has a requirement on the reads that were issued before the fence, which
is something that is not required by any of the C++11 orderings. I'm
not totally confident in stating this, but because of this read 'issue'
it seems that the LLVM fence cannot be used to implement OpenCL's
mem_fence.

Jeroen

[0] http://llvm.org/docs/LangRef.html#ordering

Hi,

>> My original approach implemented read/write fences by just calling
>> mem_fence(). That should take care of both using only seq_cst and the
>> duplicate labels. I can post it as a v2 if you are ok with a patch
>> that is not really useful on its own.
>>
>> regards,
>> Jan
>
> You can't use the LLVM atomic fence instruction for this. For reasons
> I do not understand, the LLVM fence instruction only impacts the
> ordering of other atomic instructions, but OpenCL mem_fence applies to
> all memory accesses.

I don't think this is totally accurate: the LLVM atomic orderings (and
the fence takes an ordering argument) are intended to implement the
C++11 atomic orderings, and those put requirements on the writes to
non-atomic memory locations. The IR language reference leans heavily on
the C++11 spec for this reason [0]. However, the OpenCL mem_fence also
has a requirement on the reads that were issued before the fence, which
is something that is not required by any of the C++11 orderings. I'm
not totally confident in stating this, but because of this read 'issue'
it seems that the LLVM fence cannot be used to implement OpenCL's
mem_fence.

The memory model in [1] defines a 'happens-before' relation that uses
both program order and the 'synchronizes-with' relation used in the
fence semantics definition [2].
Using the fence therefore introduces ordering between instructions
across multiple execution threads, i.e. every instruction A that was
executed in any thread before (PO) the fence 'happens before' every
instruction B in any thread executed after (PO) the fence.

According to the rules in [1], a READ after (PO) the fence sees the
last WRITE to the memory location from every thread (earlier writes are
blocked by happens-before within single-thread PO). It may also see any
WRITE after (PO) the fence that it is not in a 'happens-before'
relation with, i.e. any from other threads, or from the same thread
(before the read in PO).

This matches the requirement that stores before the fence are committed
before the loads that follow it.

The situation with reads is a bit simpler. All reads before (PO) the
fence are in a 'happens-before' relation with all writes after (PO) the
fence, hence they are not allowed to see them. This matches the
requirement that all reads before the fence must complete before any
write after (PO) the fence.

I think using acq_rel or seq_cst should be OK for OpenCL mem_fence. If
all the loads had acquire semantics and all the stores had release
semantics (on the respective memory locations), it'd be OK to use only
acquire/release fences for read_mem_fence()/write_mem_fence()
respectively.
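
Under that (unverified) assumption, the fence bodies would reduce to:

; read_mem_fence: prior loads ordered before later accesses
fence acquire
; write_mem_fence: prior stores ordered before later stores
fence release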

Matt, is having no AS argument a problem? OpenCL allows fences only on
the local and global AS, and since we don't need any fence for LDS, I
thought having an implicit global AS was OK.

regards,
Jan

[1] http://llvm.org/docs/LangRef.html#memmodel
[2] http://llvm.org/docs/LangRef.html#i-fence

Yes. Fencing all address spaces all the time would be unfortunate. I
would be most concerned about not allowing private loads and stores to
be moved across fences; moving those should always be OK, and would
help reduce register usage.

I think the only proper solution would be to extend all the
atomics/fences with an address space argument. This would be needed in
any case if there are plans to implement OpenCL 2.0, which roughly does
C++11 atomics with address spaces bolted on.
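
Purely hypothetical syntax, just to illustrate the idea (nothing like
this exists in LLVM IR today):

; hypothetical: a fence constrained to the global address space
fence addrspace(1) seq_cst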

Jeroen