This generates a "bar.sync 0" instruction, which not only causes the
threads to wait, but also acts as a memory fence, as required by
OpenCL. The fence does not differentiate between local and global
memory.
Index: ptx-nvidiacl/lib/synchronization/barrier.cl
This sounds a bit heavy. There's no need for a memory fence if the flags
== 0. Is there a PTX instruction that only synchronizes thread progress
without a memory fence?
It's not wrong to insert an extra fence, so:
Reviewed-by: Jan Vesely <jan.vesely@rutgers.edu>
I was just wondering if there is a lighter implementation.
Jan
PTX only has this, unfortunately.
It’s good that you remarked on this, though, because it does apply to the fence patch.
There, the call can be wrapped.
Jeroen
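
For illustration, the wrapping mentioned for the fence patch might look
roughly like the sketch below. It is plain C so that it compiles on a
host. The names target_fence() and mem_fence_sketch() are hypothetical,
target_fence() only counts calls in place of a real PTX fence, and the
flag values are placeholders for this sketch rather than those of any
particular header. It is not the code from either patch.

```c
/* Placeholder flag type and values, assumed for this sketch only. */
typedef unsigned int cl_mem_fence_flags;
#define CLK_LOCAL_MEM_FENCE  0x1u
#define CLK_GLOBAL_MEM_FENCE 0x2u

/* Hypothetical stand-in for the target's fence instruction.
 * It only counts how often a fence would have been emitted. */
static int fence_count = 0;
static void target_fence(void) { fence_count++; }

/* Hypothetical wrapped fence: skip the fence entirely when no flag
 * is set, otherwise issue a single fence. One fence is enough in this
 * model because it does not distinguish local from global memory. */
static void mem_fence_sketch(cl_mem_fence_flags flags) {
    if (flags & (CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE))
        target_fence();
}
```

With flags == 0 the sketch issues no fence at all. The barrier itself
cannot be wrapped this way, since the threads still have to wait there
whatever the flags are.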