[PATCH 1/4] Fix and improvements to barrier() for R600 targets

This patch introduces two new intrinsics and therefore
  must be used in conjunction with the patches to the LLVM backend. It fixes the
  behaviour of barrier(0) : previously no barrier was generated, this is fixed
  by making a call to the new intrinsic barrier.nofence(). The patch also
  changes the behaviour of barrier( CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE
  ) : previously two barriers were generated, now just one with the new
  intrinsic barrier.localglobal().

Signed-off-by: Damien Hilloulin <damien.hilloulin@supelec.fr>

This patch introduces two new intrinsics and therefore
must be used in conjunction with the patches to the LLVM backend.

Hi,

It looks like your mail client has automatically wrapped long lines,
which makes the commit comment hard to read. You should use
git send-email to submit your patches.

It fixes the
behaviour of barrier(0) : previously no barrier was generated, this
is fixed

I think the old behavior is correct. Does the specification say we need
to emit a barrier when the argument is 0?

by making a call to the new intrinsic barrier.nofence(). The patch also
changes the behaviour of barrier( CLK_LOCAL_MEM_FENCE |
CLK_GLOBAL_MEM_FENCE
) : previously two barriers were generated, now just one with the new
intrinsic barrier.localglobal().

I don't think this is the place to put this optimization. libclc
shouldn't know that llvm.AMDGPU.barrier.local and llvm.AMDGPU.barrier.global
are lowered to the same instruction. We should be doing this optimization
in the LLVM backend.

-Tom

> This patch introduces two new intrinsics and therefore
> must be used in conjunction with the patches to the LLVM backend.

Hi,

It looks like your mail client has automatically wrapped long lines,
which makes the commit comment hard to read. You should use
git send-email to submit your patches.

> It fixes the
> behaviour of barrier(0) : previously no barrier was generated, this
> is fixed

I think the old behavior is correct. Does the specification say we need
to emit a barrier when the argument is 0?

I believe it does. The specs mentions that: "All work-items in a
work-group executing the kernel
on a processor must execute this function before any
are allowed to continue execution beyond the barrier.
This function must be encountered by all work-items in
a work-group executing the kernel."

without even considering the arguments.

> by making a call to the new intrinsic barrier.nofence(). The patch also
> changes the behaviour of barrier( CLK_LOCAL_MEM_FENCE |
> CLK_GLOBAL_MEM_FENCE
> ) : previously two barriers were generated, now just one with the new
> intrinsic barrier.localglobal().
>

I don't think this is the place to put this optimization. libclc
shouldn't know that llvm.AMDGPU.barrier.local and llvm.AMDGPU.barrier.global
are lowered to the same instruction. We should be doing this optimization
in the LLVM backend.

I also understand the specs in a way that the parameter only refers to
the address space that should be flushed (local/global/both/none). The
specs explicitly allow queuing mem fences (rw) to achieve this.
It's my understanding that issuing multiple barrier instructions is not
against the specs, if we choose to add memory fence semantics to barrier
intrinsic.

IMO having one barrier intrinsic, and implementing barrier as:

barrier(flags)
{
  mem_fence(flags)
  @llvm.r600.barrier()
}

is the most straightforward way. This is what my old patch
implemented.[0] Unfortunately I never found time to address Matt's
comments.

jan

[0] http://www.pcc.me.uk/pipermail/libclc-dev/2014-April/000284.html