According to the OpenCL standard, threads cannot depend on the result of another thread at any point, because the standard makes no assumptions about the behavior of the scheduler.
Right; in ISO C++ and other models, such programs are well-defined, unlike in OpenCL.
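To make the contrast concrete, here is a minimal sketch of the kind of cross-thread dependency being discussed: one thread spins on a flag that another thread sets. It's written in CUDA (since that's what the rest of the thread is about) with made-up names; on hardware/models that guarantee forward progress for started threads it terminates, while OpenCL's execution model gives no such promise, so the equivalent kernel is allowed to spin forever.

```cpp
#include <cstdio>

// The consumer (first lane of the second warp) spins on a flag that the
// producer (first lane of the first warp) sets. Both warps are co-resident
// in one block, so a scheduler with a forward-progress guarantee will
// eventually run the producer.
__global__ void flag_dependency(volatile int *flag, int *out) {
    if (threadIdx.x == 0) {
        *flag = 1;                  // producer publishes its result
    } else if (threadIdx.x == 32) {
        while (*flag == 0) { }      // consumer depends on another thread's result
        *out = 42;
    }
}

int main() {
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * sizeof(int));         // dev[0] = flag, dev[1] = out
    cudaMemset(dev, 0, 2 * sizeof(int));
    flag_dependency<<<1, 64>>>(dev, dev + 1);  // two warps, one block
    cudaDeviceSynchronize();
    int result = 0;
    cudaMemcpy(&result, dev + 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("out = %d\n", result);              // expect 42
    cudaFree(dev);
    return 0;
}
```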
I think the original blog post on the Volta thread scheduling only stated that mutex locks can only work if they are guaranteed to be ‘starvation free’, i.e., each thread must eventually succeed in taking the lock.
ITS guarantees that starvation-free algorithms make progress, i.e., any thread that has started running will eventually be scheduled again if no other thread is making progress (e.g. because the others have exited, or are blocked on, say, a mutex). In particular, a thread that takes a mutex and thereby prevents the algorithm from making progress (other threads blocked, no other threads running, etc.) is guaranteed to eventually be scheduled again, allowing it to release the mutex. This suffices to implement C++'s Parallel Forward Progress, take locks, etc.
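As a concrete illustration, here's a minimal sketch (hypothetical names, plain CUDA intrinsics) of the per-thread spin lock this guarantee enables: every lane contends for the lock individually, which only has guaranteed forward progress on Volta and newer, because ITS ensures the current holder is eventually rescheduled so it can reach the release.

```cpp
#include <cstdio>

// Per-thread spin lock. Pre-Volta, the lane that wins the CAS can be
// starved forever by its spinning warp-mates; under ITS the holder is
// guaranteed to be scheduled again and release. Build with -arch=sm_70+.
__global__ void its_lock_demo(int *lock, int *counter) {
    while (atomicCAS(lock, 0, 1) != 0) {   // acquire
        __nanosleep(100);                  // backoff; needs sm_70 or newer
    }
    __threadfence();                       // see the previous holder's writes
    int v = *(volatile int *)counter;      // volatile: bypass non-coherent L1
    *(volatile int *)counter = v + 1;      // critical section
    __threadfence();                       // publish before releasing
    atomicExch(lock, 0);                   // release
}

int main() {
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * sizeof(int));     // dev[0] = lock, dev[1] = counter
    cudaMemset(dev, 0, 2 * sizeof(int));
    its_lock_demo<<<4, 64>>>(dev, dev + 1);
    cudaDeviceSynchronize();
    int result = 0;
    cudaMemcpy(&result, dev + 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (expected %d)\n", result, 4 * 64);
    cudaFree(dev);
    return 0;
}
```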
So basically you end up with a situation where a mutex lock “works” until it doesn’t. You could have a warp claim a lock, then some graphics job gets launched on the GPU and boots it out, and the scheduler is under no contract to ever schedule the evicted thread that owns the lock back in.
NVIDIA GPUs do provide this guarantee. I think the graphics job should eventually be booted out, the compute job rescheduled, and the thread holding the mutex given the chance to release it. (That doesn’t mean the implementation is bug-free; if you ran into a bug and managed to hang the system, a reproducer would be helpful.)
The keywords are “time slice” (for GPU context time slicing). The programming guide, best practice guide, etc. all cover some of it, but I think the best place to read about it is probably the MPS Architecture docs. If you have more concrete questions, feel free to ping me on Discord (@gonzalob over there).
It’s nice that C++ has specified a mutex. OpenMP does too.
AMDGPU can totally implement a spin lock at wavefront granularity. It won’t necessarily ever unlock, since it isn’t backed by a fair scheduler, but that’s sort of an application problem.
It can’t implement one at SIMT-thread granularity, for the same reason pre-Volta NVIDIA can’t: the threads in a warp/wavefront all do the same thing because they share an instruction pointer. All the threads in the wavefront do the CAS, and at most one of them wins. Then you either do the masking to make it a per-wavefront control-flow thing, or you try to send N-1 threads back to the CAS and 1 thread onward, and discover that you can’t have that since all N threads have the same instruction pointer.
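For what it's worth, here's a sketch of that first option, the masking / per-wavefront approach, written in CUDA for a 32-wide warp (the helper names are made up): one leader lane per warp takes and releases the lock on behalf of the whole warp, so the warp never diverges while holding it, and the pattern works even without ITS (subject to the scheduler-fairness caveat above).

```cpp
#include <cstdio>

// Warp-granularity lock: only the leader lane touches the lock word, so
// locking is per-warp control flow rather than per-SIMT-thread.
__device__ void warp_lock_acquire(int *lock) {
    if (threadIdx.x % 32 == 0) {              // leader lane (1D block assumed)
        while (atomicCAS(lock, 0, 1) != 0) { }
    }
    __syncwarp();    // other lanes wait; implicit pre-Volta, explicit under ITS
    __threadfence(); // see the previous holder's writes
}

__device__ void warp_lock_release(int *lock) {
    __threadfence(); // publish this warp's writes
    __syncwarp();    // every lane must be done with the critical section
    if (threadIdx.x % 32 == 0) {
        atomicExch(lock, 0);                  // leader releases for the warp
    }
}

__global__ void warp_lock_demo(int *lock, int *counter) {
    warp_lock_acquire(lock);
    if (threadIdx.x % 32 == 0) {              // one update per warp
        int v = *(volatile int *)counter;     // volatile: bypass non-coherent L1
        *(volatile int *)counter = v + 1;
    }
    warp_lock_release(lock);
}

int main() {
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * sizeof(int));        // dev[0] = lock, dev[1] = counter
    cudaMemset(dev, 0, 2 * sizeof(int));
    warp_lock_demo<<<4, 64>>>(dev, dev + 1);  // 4 blocks x 2 warps each
    cudaDeviceSynchronize();
    int result = 0;
    cudaMemcpy(&result, dev + 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (expected %d)\n", result, 4 * 2);
    cudaFree(dev);
    return 0;
}
```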
Fork/join I think you actually can do, with sufficient hassle in the compiler; it’s the unstructured concurrency that doesn’t map onto the SIMT model.