No, but the old intrinsics didn’t either. There is only one synchronization instruction; the memory fence is a different problem. I don’t think it actually matters, given the constraint that global memory is only consistent within a workgroup, although I’m less sure how the old hardware caches worked.
-Matt
>
>
> Re-adding libclc to CC list.... oops.
>
> From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00
> 2001
> From: Matt Arsenault <matthew.arsenault@Amd.com>
> Date: Wed, 13 Jul 2016 21:07:53 -0700
> Subject: [PATCH] R600: Use new barrier intrinsic
>
> ---
> r600/lib/synchronization/barrier_impl.ll | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/r600/lib/synchronization/barrier_impl.ll
> b/r600/lib/synchronization/barrier_impl.ll
> index 825b2eb..9b8fefb 100644
> --- a/r600/lib/synchronization/barrier_impl.ll
> +++ b/r600/lib/synchronization/barrier_impl.ll
> @@ -1,7 +1,6 @@
> declare i32 @__clc_clk_local_mem_fence() #1
> declare i32 @__clc_clk_global_mem_fence() #1
> -declare void @llvm.AMDGPU.barrier.local() #0
> -declare void @llvm.AMDGPU.barrier.global() #0
> +declare void @llvm.r600.group.barrier() #0
>
> define void @barrier(i32 %flags) #2 {
> barrier_local_test:
> @@ -11,7 +10,7 @@ barrier_local_test:
> br i1 %1, label %barrier_local, label %barrier_global_test
>
> barrier_local:
> - call void @llvm.AMDGPU.barrier.local()
> + call void @llvm.r600.group.barrier()
>
> Please pardon my ignorance, but does the new intrinsic handle
> synchronizing writes/reads to local/global memory along with thread
> synchronization?
>
> If that is the case, could we modify the barrier_local portion to
> jump straight to done, so that we don't end up inserting two
> barriers if someone calls
> barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
>
> I don't have my NI card installed at the moment, but could if
> necessary to test things out.
>
> --Aaron
> No, but the old intrinsics didn’t either. There is only one
> synchronization instruction; the memory fence is a different problem.
> I don’t think it actually matters, given the constraint that global
> memory is only consistent within a workgroup, although I’m less sure
> how the old hardware caches worked.

EG/NI both reload/flush the L1 (VC and TC) caches at the beginning and
end of every fetch clause. We can make the barrier instruction wait for
previous CF instructions (bit 31 of CF_WORD0), but AFAIK it only waits
for the instructions to complete (i.e. I don't think it waits for the
flush to complete).
There are _ACK versions of the VC/TC fetch clauses, and a WAIT_ACK
instruction that acts as a fence, although I haven't found decisive
info on whether these wait for flushes to complete (or how they differ
from using the barrier bit).
I agree with the point about issuing two barrier instructions. IMO the
correct implementation of this function for r600 would be:

void barrier(cl_mem_fence_flags flags) {
    mem_fence(flags);
    __builtin_r600_barrier();
}

The current implementation also fails to issue any barrier at all if
flags == 0, so reducing it to:

define void @barrier(i32 %flags) #2 {
  call void @llvm.r600.group.barrier()
  ret void
}

would be an improvement.
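
Put together with the existing declaration, the whole file would then
look roughly like this (a sketch only; it assumes the #0/#2 attribute
group definitions already present in barrier_impl.ll stay as they are):

declare void @llvm.r600.group.barrier() #0

define void @barrier(i32 %flags) #2 {
entry:
  ; single unconditional work-group barrier, always emitted regardless
  ; of the flags; memory fences would still need to be handled
  ; separately (e.g. via mem_fence()), per the discussion above
  call void @llvm.r600.group.barrier()
  ret void
}
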
jan
> >
> >
> > Re-adding libclc to CC list.... oops.
> >
> > From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00
> > 2001
> > From: Matt Arsenault <matthew.arsenault@Amd.com>
> > Date: Wed, 13 Jul 2016 21:07:53 -0700
> > Subject: [PATCH] R600: Use new barrier intrinsic
> >
> > ---
> > r600/lib/synchronization/barrier_impl.ll | 7 +++----
> > 1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/r600/lib/synchronization/barrier_impl.ll
> > b/r600/lib/synchronization/barrier_impl.ll
> > index 825b2eb..9b8fefb 100644
> > --- a/r600/lib/synchronization/barrier_impl.ll
> > +++ b/r600/lib/synchronization/barrier_impl.ll
> > @@ -1,7 +1,6 @@
> > declare i32 @__clc_clk_local_mem_fence() #1
> > declare i32 @__clc_clk_global_mem_fence() #1
> > -declare void @llvm.AMDGPU.barrier.local() #0
> > -declare void @llvm.AMDGPU.barrier.global() #0
> > +declare void @llvm.r600.group.barrier() #0
> >
> > define void @barrier(i32 %flags) #2 {
> > barrier_local_test:
> > @@ -11,7 +10,7 @@ barrier_local_test:
> > br i1 %1, label %barrier_local, label %barrier_global_test
> >
> > barrier_local:
> > - call void @llvm.AMDGPU.barrier.local()
> > + call void @llvm.r600.group.barrier()
> >
> > Please pardon my ignorance, but does the new intrinsic handle
> > synchronizing writes/reads to local/global memory along with thread
> > synchronization?
> >
> > If that is the case, could we modify the barrier_local portion to
> > jump straight to done, so that we don't end up inserting two
> > barriers if someone calls
> > barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
> >
> > I don't have my NI card installed at the moment, but could if
> > necessary to test things out.
> >
> > --Aaron
> >
>
> > No, but the old intrinsics didn’t either. There is only one
> > synchronization instruction; the memory fence is a different problem.
> > I don’t think it actually matters, given the constraint that global
> > memory is only consistent within a workgroup, although I’m less sure
> > how the old hardware caches worked.
>
> EG/NI both reload/flush the L1 (VC and TC) caches at the beginning and
> end of every fetch clause. We can make the barrier instruction wait for
> previous CF instructions (bit 31 of CF_WORD0), but AFAIK it only waits
> for the instructions to complete (i.e. I don't think it waits for the
> flush to complete).
>
> There are _ACK versions of the VC/TC fetch clauses, and a WAIT_ACK
> instruction that acts as a fence, although I haven't found decisive
> info on whether these wait for flushes to complete (or how they differ
> from using the barrier bit).
>
> I agree with the point about issuing two barrier instructions. IMO the
> correct implementation of this function for r600 would be:
>
> void barrier(cl_mem_fence_flags flags) {
>     mem_fence(flags);
>     __builtin_r600_barrier();
> }
>
> The current implementation also fails to issue any barrier at all if
> flags == 0, so reducing it to:
>
> define void @barrier(i32 %flags) #2 {
>   call void @llvm.r600.group.barrier()
>   ret void
> }
>
> would be an improvement.
Yeah, agreed. I've written code in the past that attempted to do
barrier(0) to synchronize threads in a group without requiring
synchronization of local/global memory... I'm not saying it was good code,
but the barrier was required in that case.
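
At the IR level that pattern boils down to something like this (a
hypothetical caller, not the original code):

declare void @barrier(i32)

define void @sync_threads_only() {
entry:
  ; barrier(0): execution synchronization only, no memory fence requested.
  ; With the current flag-testing implementation of @barrier, both
  ; fence-flag tests fail and the function just returns, so no barrier
  ; instruction is ever executed; with the unconditional version it
  ; still lowers to a single llvm.r600.group.barrier.
  call void @barrier(i32 0)
  ret void
}
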
I'd be fine with changing to the implementation you suggested. I know it's
not the first time we've discussed fixing our barrier implementation.
Hopefully we actually finish it this time.
--Aaron