According to the OpenCL standard, threads cannot depend on the result of another thread at any point, because the standard makes no assumptions about the behavior of the scheduler.
Right; in ISO C++ and other models, such programs are well-defined, unlike in OpenCL.
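To make the contrast concrete, here is a minimal sketch of the kind of cross-thread dependency being discussed: one thread spins on a flag that another thread sets. It's written in CUDA (since that's what the rest of the thread is about) with made-up names; on hardware/models that guarantee forward progress for started threads it terminates, while OpenCL's execution model gives no such promise, so the equivalent kernel is allowed to spin forever.

```cpp
#include <cstdio>

// The consumer (first lane of the second warp) spins on a flag that the
// producer (first lane of the first warp) sets. Both warps are co-resident
// in one block, so a scheduler with a forward-progress guarantee will
// eventually run the producer.
__global__ void flag_dependency(volatile int *flag, int *out) {
    if (threadIdx.x == 0) {
        *flag = 1;                  // producer publishes its result
    } else if (threadIdx.x == 32) {
        while (*flag == 0) { }      // consumer depends on another thread's result
        *out = 42;
    }
}

int main() {
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * sizeof(int));         // dev[0] = flag, dev[1] = out
    cudaMemset(dev, 0, 2 * sizeof(int));
    flag_dependency<<<1, 64>>>(dev, dev + 1);  // two warps, one block
    cudaDeviceSynchronize();
    int result = 0;
    cudaMemcpy(&result, dev + 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("out = %d\n", result);              // expect 42
    cudaFree(dev);
    return 0;
}
```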
I think the original blog post on the Volta thread scheduling only stated that mutex locks can only work if they are guaranteed to be ‘starvation free’, i.e., each thread must eventually succeed in taking the lock.
ITS guarantees that starvation-free algorithms make progress, i.e., any thread that has started running will eventually be scheduled again if no other thread is making progress (e.g. because the others have exited, or are blocked on, say, a mutex). In particular, a thread that takes a mutex and thereby prevents the algorithm from making progress (other threads blocked, no other threads running, etc.) is guaranteed to eventually be scheduled again, allowing it to release the mutex. This suffices to implement C++'s Parallel Forward Progress, take locks, etc.
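As a concrete illustration, here's a minimal sketch (hypothetical names, plain CUDA intrinsics) of the per-thread spin lock this guarantee enables: every lane contends for the lock individually, which only has guaranteed forward progress on Volta and newer, because ITS ensures the current holder is eventually rescheduled so it can reach the release.

```cpp
#include <cstdio>

// Per-thread spin lock. Pre-Volta, the lane that wins the CAS can be
// starved forever by its spinning warp-mates; under ITS the holder is
// guaranteed to be scheduled again and release. Build with -arch=sm_70+.
__global__ void its_lock_demo(int *lock, int *counter) {
    while (atomicCAS(lock, 0, 1) != 0) {   // acquire
        __nanosleep(100);                  // backoff; needs sm_70 or newer
    }
    __threadfence();                       // see the previous holder's writes
    int v = *(volatile int *)counter;      // volatile: bypass non-coherent L1
    *(volatile int *)counter = v + 1;      // critical section
    __threadfence();                       // publish before releasing
    atomicExch(lock, 0);                   // release
}

int main() {
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * sizeof(int));     // dev[0] = lock, dev[1] = counter
    cudaMemset(dev, 0, 2 * sizeof(int));
    its_lock_demo<<<4, 64>>>(dev, dev + 1);
    cudaDeviceSynchronize();
    int result = 0;
    cudaMemcpy(&result, dev + 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (expected %d)\n", result, 4 * 64);
    cudaFree(dev);
    return 0;
}
```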
So basically you end up with a situation where a mutex lock “works” until it doesn’t. You could have a warp claim a lock, then some graphics job gets launched on the GPU and boots it out, and the scheduler is under no contract to ever schedule the evicted thread that owns the lock back in.
NVIDIA GPUs do provide this guarantee. I think the graphics job should eventually be booted out, the compute job rescheduled, and the thread holding the mutex given the chance to release it. (That doesn’t mean the implementation is bug-free; if you ran into a bug and managed to hang the system, a reproducer would be helpful.)
The keywords are “time slice” (for GPU context time slicing). The programming guide, best practice guide, etc. all cover some of it, but I think the best place to read about it is probably the MPS Architecture docs. If you have more concrete questions, feel free to ping me on Discord (@gonzalob over there).
It’s nice that C++ has specified a mutex. OpenMP does too.
AMDGPU can totally implement a spin lock at wavefront granularity. It won’t necessarily ever unlock, since it isn’t backed by a fair scheduler, but that’s sort of an application problem.
It can’t implement one at SIMT-thread granularity, for the same reason pre-Volta NVIDIA can’t: the threads in a warp/wavefront all do the same thing because they share an instruction pointer. All the threads in the wavefront do the CAS, and at most one of them wins. Then you either do the masking to make it a per-wavefront control-flow thing, or you try to send N-1 threads back to the CAS and 1 thread onward, and discover that you can’t have that since all N threads have the same instruction pointer.
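For what it's worth, here's a sketch of that first option, the masking / per-wavefront approach, written in CUDA for a 32-wide warp (the helper names are made up): one leader lane per warp takes and releases the lock on behalf of the whole warp, so the warp never diverges while holding it, and the pattern works even without ITS (subject to the scheduler-fairness caveat above).

```cpp
#include <cstdio>

// Warp-granularity lock: only the leader lane touches the lock word, so
// locking is per-warp control flow rather than per-SIMT-thread.
__device__ void warp_lock_acquire(int *lock) {
    if (threadIdx.x % 32 == 0) {              // leader lane (1D block assumed)
        while (atomicCAS(lock, 0, 1) != 0) { }
    }
    __syncwarp();    // other lanes wait; implicit pre-Volta, explicit under ITS
    __threadfence(); // see the previous holder's writes
}

__device__ void warp_lock_release(int *lock) {
    __threadfence(); // publish this warp's writes
    __syncwarp();    // every lane must be done with the critical section
    if (threadIdx.x % 32 == 0) {
        atomicExch(lock, 0);                  // leader releases for the warp
    }
}

__global__ void warp_lock_demo(int *lock, int *counter) {
    warp_lock_acquire(lock);
    if (threadIdx.x % 32 == 0) {              // one update per warp
        int v = *(volatile int *)counter;     // volatile: bypass non-coherent L1
        *(volatile int *)counter = v + 1;
    }
    warp_lock_release(lock);
}

int main() {
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * sizeof(int));        // dev[0] = lock, dev[1] = counter
    cudaMemset(dev, 0, 2 * sizeof(int));
    warp_lock_demo<<<4, 64>>>(dev, dev + 1);  // 4 blocks x 2 warps each
    cudaDeviceSynchronize();
    int result = 0;
    cudaMemcpy(&result, dev + 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (expected %d)\n", result, 4 * 2);
    cudaFree(dev);
    return 0;
}
```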
Fork/join I think you actually can do, with sufficient hassle in the compiler; it’s the unstructured concurrency that doesn’t map onto the SIMT model.