[PATCH] Make ptx barrier work irrespective of the cl_mem_fence_flags

This generates a "bar.sync 0" instruction, which not only causes the
threads to wait, but also acts as a memory fence, as required by
OpenCL. The fence does not differentiate between local and global
memory.

Index: ptx-nvidiacl/lib/synchronization/barrier.cl

This sounds a bit heavy. There's no need for a memory fence if the flags
== 0. Is there a PTX instruction that only synchronizes thread progress
without a memory fence?
It's not wrong to insert an extra fence, so:
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>

I was just wondering if there is a lighter implementation.

Jan

PTX only has this, unfortunately.

It's good that you remarked on this, though, because it does apply to the fence patch.
There the call can be wrapped.
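
As a rough illustration of the difference (a host-compilable C mock, not the
libclc sources): the barrier always issues the synchronization whatever the
flags are, while a fence can skip its call when flags == 0. The sketch_* and
mock_* names are hypothetical stand-ins, and reading "wrapped" as "guarded by
a check on the flags" is an assumption.

```c
/* Host-side mock only. Everything here is a stand-in: the real code is
   OpenCL C and would call target builtins instead of these counters. */
typedef unsigned int cl_mem_fence_flags;
#define CLK_LOCAL_MEM_FENCE  0x1u
#define CLK_GLOBAL_MEM_FENCE 0x2u

static int bar_sync_count; /* times the stand-in for "bar.sync 0" ran */
static int membar_count;   /* times the stand-in for a PTX fence ran */

/* Hypothetical stand-in for the builtin that lowers to "bar.sync 0". */
static void mock_bar_sync(void) { bar_sync_count++; }

/* Hypothetical stand-in for the builtin that lowers to a memory fence. */
static void mock_membar(void) { membar_count++; }

/* Barrier: always synchronize, since the flags cannot select a
   lighter instruction. */
void sketch_barrier(cl_mem_fence_flags flags) {
  (void)flags; /* deliberately ignored */
  mock_bar_sync();
}

/* Fence: the call is wrapped, so nothing is emitted when flags == 0. */
void sketch_mem_fence(cl_mem_fence_flags flags) {
  if (flags != 0)
    mock_membar();
}
```

Calling sketch_barrier(0) still bumps bar_sync_count, whereas
sketch_mem_fence(0) leaves membar_count untouched.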

Jeroen