[PATCH] R600: Use new barrier intrinsic

0001-R600-Use-new-barrier-intrinsic.patch (1.3 KB)

From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00 2001

Re-adding libclc to CC list… oops.

No, but the old intrinsics didn’t either. There is only one synchronization instruction, the memory fence is a different problem. I don’t think it actually matters with the constraint that global memory is only consistent within a workgroup, although I”m less sure how the old hardware caches worked

-Matt

>
>
> Re-adding libclc to CC list.... oops.
>
> From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00
> 2001
> From: Matt Arsenault <matthew.arsenault@Amd.com>
> Date: Wed, 13 Jul 2016 21:07:53 -0700
> Subject: [PATCH] R600: Use new barrier intrinsic
>
> ---
> r600/lib/synchronization/barrier_impl.ll | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/r600/lib/synchronization/barrier_impl.ll
> b/r600/lib/synchronization/barrier_impl.ll
> index 825b2eb..9b8fefb 100644
> --- a/r600/lib/synchronization/barrier_impl.ll
> +++ b/r600/lib/synchronization/barrier_impl.ll
> @@ -1,7 +1,6 @@
> declare i32 @__clc_clk_local_mem_fence() #1
> declare i32 @__clc_clk_global_mem_fence() #1
> -declare void @llvm.AMDGPU.barrier.local() #0
> -declare void @llvm.AMDGPU.barrier.global() #0
> +declare void @llvm.r600.group.barrier() #0
>
> define void @barrier(i32 %flags) #2 {
> barrier_local_test:
> @@ -11,7 +10,7 @@ barrier_local_test:
> br i1 %1, label %barrier_local, label %barrier_global_test
>
> barrier_local:
> - call void @llvm.AMDGPU.barrier.local()
> + call void @llvm.r600.group.barrier()
>
> Please pardon my ignorance, but does the new intrinsic handle
> synchronizing writes/reads to local/global memory along with thread
> synchronization?
>
> If that is the case, could we modify the barrier_local portion to
> jump straight to done, so that we don't end up inserting two
> barriers if someone calls
> barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
>
> I don't have my NI card installed at the moment, but could if
> necessary to test things out.
>
> --Aaron
>

No, but the old intrinsics didn’t either. There is only one
synchronization instruction, the memory fence is a different problem.
I don’t think it actually matters with the constraint that global
memory is only consistent within a workgroup, although I”m less sure
how the old hardware caches worked

EG/NI both reload/flush L1 (VC and TC) caches at the beginning and end
of every fetch clause. We can make the barrier instruction wait for
previous CF instructions (bit 31 CF_WORD0), but afaik it only waits for
instruction to complete (i.e I don't think it waits for the flush
complete).

There are _ACK versions of VC/TC fetch clauses, and WAIT_ACK
instruction as fence, although I haven't found decisive info on whether
these do wait for flushes to complete (or how they differ from using
barrier bit).

I agree with the point about issuing two barrier instructions. IMO the
correct implementation of this functions for r600 would be:

void barrier(cl_mem_fence_flags flags) {
mem_fence(flags);
__builtin_r600_barrier();
}

the current implementation also fails to issue any barrier if flags ==
0, so reducing it to:

define void @barrier(i32 %flags) #2 {
call void @llvm.r600.group.barrier()
ret void
}

would be an improvement

jan

> >
> >
> > Re-adding libclc to CC list.... oops.
> >
> > From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00
> > 2001
> > From: Matt Arsenault <matthew.arsenault@Amd.com>
> > Date: Wed, 13 Jul 2016 21:07:53 -0700
> > Subject: [PATCH] R600: Use new barrier intrinsic
> >
> > ---
> > r600/lib/synchronization/barrier_impl.ll | 7 +++----
> > 1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/r600/lib/synchronization/barrier_impl.ll
> > b/r600/lib/synchronization/barrier_impl.ll
> > index 825b2eb..9b8fefb 100644
> > --- a/r600/lib/synchronization/barrier_impl.ll
> > +++ b/r600/lib/synchronization/barrier_impl.ll
> > @@ -1,7 +1,6 @@
> > declare i32 @__clc_clk_local_mem_fence() #1
> > declare i32 @__clc_clk_global_mem_fence() #1
> > -declare void @llvm.AMDGPU.barrier.local() #0
> > -declare void @llvm.AMDGPU.barrier.global() #0
> > +declare void @llvm.r600.group.barrier() #0
> >
> > define void @barrier(i32 %flags) #2 {
> > barrier_local_test:
> > @@ -11,7 +10,7 @@ barrier_local_test:
> > br i1 %1, label %barrier_local, label %barrier_global_test
> >
> > barrier_local:
> > - call void @llvm.AMDGPU.barrier.local()
> > + call void @llvm.r600.group.barrier()
> >
> > Please pardon my ignorance, but does the new intrinsic handle
> > synchronizing writes/reads to local/global memory along with thread
> > synchronization?
> >
> > If that is the case, could we modify the barrier_local portion to
> > jump straight to done, so that we don't end up inserting two
> > barriers if someone calls
> > barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
> >
> > I don't have my NI card installed at the moment, but could if
> > necessary to test things out.
> >
> > --Aaron
> >
>
> No, but the old intrinsics didn’t either. There is only one
> synchronization instruction, the memory fence is a different problem.
> I don’t think it actually matters with the constraint that global
> memory is only consistent within a workgroup, although I”m less sure
> how the old hardware caches worked

EG/NI both reload/flush L1 (VC and TC) caches at the beginning and end
of every fetch clause. We can make the barrier instruction wait for
previous CF instructions (bit 31 CF_WORD0), but afaik it only waits for
instruction to complete (i.e I don't think it waits for the flush
complete).

There are _ACK versions of VC/TC fetch clauses, and WAIT_ACK
instruction as fence, although I haven't found decisive info on whether
these do wait for flushes to complete (or how they differ from using
barrier bit).

I agree with the point about issuing two barrier instructions. IMO the
correct implementation of this functions for r600 would be:

void barrier(cl_mem_fence_flags flags) {
  mem_fence(flags);
  __builtin_r600_barrier();
}

the current implementation also fails to issue any barrier if flags ==
0, so reducing it to:

define void @barrier(i32 %flags) #2 {
  call void @llvm.r600.group.barrier()
  ret void
}

would be an improvement

Yeah, agreed. I've written code in the past that attempted to do
barrier(0) to synchronize threads in a group without requiring
synchronization of local/global memory... I'm not saying it was good code,
but the barrier was required in that case.

I'd be fine with changing to the implementation you suggested. I know it's
not the first time we've discussed fixing our barrier implementation.
Hopefully we actually finish it this time :slight_smile:

--Aaron