My use case
The use case I would like is SPIR-V's OpControlBarrier with Workgroup for the Execution operand, Workgroup for the Memory operand, and WorkgroupMemory or WorkgroupMemory | CrossWorkgroupMemory for the Semantics operand. Hopefully this also fits what others want.
OpControlBarrier waits for all active invocations (threads) within the Execution scope to reach the operation, so it acts as a thread barrier, but it also acts as if there were an additional OpMemoryBarrier when the Semantics operand is not None. SPIR-V's OpMemoryBarrier ensures that memory accesses issued before this instruction are observed before memory accesses issued after this instruction.
OpMemoryBarrier has two operands, Memory and Semantics. Memory defines the scope of invocations that will observe the memory changes; so far changing this has not been discussed and the value is constrained to Workgroup (and I would like to continue with that constraint). The Semantics operand is a flag with many options, importantly SubgroupMemory and WorkgroupMemory. Semantics controls the address spaces in which changes are observed, which is what I would like to control.
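For comparison, CUDA's fence intrinsics choose the scope of the observers (block vs. device) but always order accesses to every address space, whereas the SPIR-V Semantics flags choose which address spaces are ordered, which is the axis I want to expose. A minimal CUDA sketch (the kernel and names are my own, purely illustrative):

__global__ void fence_scopes(int *global_buf) {
    // Assumes a single block of at most 256 threads, just to keep the sketch small.
    __shared__ int shared_buf[256];
    int tid = threadIdx.x;
    shared_buf[tid] = tid;   // write to workgroup (shared) memory
    global_buf[tid] = tid;   // write to global memory
    // Orders both writes above, as observed by other threads in this block.
    __threadfence_block();
    // Same ordering, but observable by every thread on the device.
    __threadfence();
}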
The two options I’m interested in could be written as:
gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>] (or just gpu.barrier)
gpu.barrier memfence [#gpu.address_space<workgroup>]
I think it's important to note that the default gpu.barrier keeps its existing semantics; this proposal only adds the option to weaken the memory fence.
The design of the PR I put up also allowed for these operations:
gpu.barrier memfence []
gpu.barrier memfence [#gpu.address_space<private>]
They both describe a thread-barrier-only operation with no memory fence, since a memory fence within the private address space should have no effect. These inclusions are incidental.
Conflation of thread barriers and memfence
gpu.barrier already conflates the ideas of thread barriers and memory fencing, but I don't think that is necessarily bad, since barriers are often used for communication between threads.
I agree that gpu.barrier matches the semantics of __syncthreads from CUDA, but not amdgpu.lds_barrier, since that does not control how global memory is observed.
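As a rough illustration of that equivalence (a hypothetical kernel of my own, not taken from the PR or any existing lowering): __syncthreads is both the thread barrier and the fence, and it covers shared and global memory alike, which is exactly the extra guarantee amdgpu.lds_barrier does not give.

__global__ void block_sum(const int *in, int *out) {
    // Assumes blockDim.x <= 256 and in has blockDim.x * gridDim.x elements.
    __shared__ int tile[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = in[gid];          // stage one element in shared (workgroup) memory
    // Thread barrier *and* memory fence: after this, every thread in the block
    // observes the shared-memory writes above. Prior global-memory writes are
    // also made visible block-wide, even though this kernel never relies on that.
    __syncthreads();
    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; ++i)
            sum += tile[i];       // read the other threads' shared-memory writes
        out[blockIdx.x] = sum;
    }
}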
Add a memfence operation
I generally support the inclusion of a memfence operation, but that alone is not enough for my situation. I want both a workgroup-wide thread barrier and a guarantee that memory accesses to workgroup memory (shared memory in CUDA) made by an invocation/thread/work-item in a workgroup before the operation are observed by invocations/threads/work-items in the same workgroup after the operation, importantly with no guarantee about global memory.
Lowering for backends
I would like guarantees about the observability of specified address spaces, but I'm not concerned about what happens in other address spaces, so the existing lowering of gpu.barrier will suffice in all cases. Different lowerings, like amdgpu.lds_barrier, may be preferable when applicable. I was unable to find a CUDA operation equivalent to lds_barrier.
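To make that concrete, a hedged sketch of the conservative mapping I have in mind for the CUDA/NVVM path (my own illustration, not the actual lowering code): every weaker fence request can reuse the existing barrier, because it gives strictly stronger guarantees than what was asked for.

__device__ void barrier_with_workgroup_fence_only() {
    // Requested semantics: block-wide thread barrier plus ordering of
    // workgroup (shared) memory accesses only. __syncthreads also orders
    // global memory, which is allowed (strictly stronger), just not required.
    __syncthreads();
}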