LLVM Convergence Semantics

Hello @nhaehnle, @ssahasra

I’ve recently been studying the document on convergent semantics in LLVM and have some questions regarding the experimental intrinsics that were introduced.

  1. My primary question revolves around the purpose of these intrinsics:
  • Are these new intrinsics designed primarily as a tool for users to explicitly express the convergent semantics they expect in the final assembly instructions? In other words, are they meant to ensure that the programmer’s intentions regarding convergence are accurately reflected in the compiled code?
  • Or, are these intrinsics intended more as hints or guidance for the compiler during the optimization and code generation phases? Specifically, I’m curious whether these intrinsics might not directly translate to target instructions but rather influence the LLVM IR phase before potentially being discarded.
  2. I looked through the LLVM codebase and don’t see any IR-level optimizations that use or check these experimental intrinsics. It seems that only the verifier has been updated, to ensure the experimental intrinsics adhere to the specified rules. Are there any plans to update the optimization passes as well?

  3. Also, convergent operations are currently marked with the “convergent” attribute, which ensures that such operations are not moved across basic blocks, since that could alter the set of threads executing them and affect the expected behavior.
    However, there are some instructions, like __syncwarp() in CUDA, which do not modify control flow but can still alter the group of threads that execute concurrently (i.e., the sets of threads before and after the operation are different). This raises a question about the representation of such behaviors within LLVM:
    Do we currently have mechanisms or intrinsics in LLVM that accurately represent the behavior of instructions which, while not altering control flow, can change the executing thread set?
    For example, in the code below, is there a way we could prevent convergent_op() from being moved above __syncwarp()?
    bb_1:
    __syncwarp()
    convergent_op()

Thanks
Divya


Welcome to LLVM Discourse, Divya!

It’s good to see more interest in convergence and particularly in the convergence control intrinsics.

Yes, precisely: they are a tool for users to explicitly express the convergent semantics they expect, so that the programmer’s intentions regarding convergence are accurately reflected in the compiled code.

No, they are not intended as mere hints to be discarded along the way.

Yes, that is the current state: only the verifier checks them so far.

Yes. But there is more work pending on getting the complete infrastructure in place. For example, I am currently working on translating these convergence control tokens to Machine IR for AMDGPU. We also have patches to various optimizations that are waiting to be upstreamed; we are working through them, alongside enhancements to the AMDGPU backend that will take advantage of these tokens.
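
To make this concrete, here is a minimal sketch of how the tokens look in IR. The intrinsics and the “convergencectrl” bundle are the real ones; convergent_op() is just a stand-in for any convergent call:

declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.loop() convergent
declare void @convergent_op() convergent   ; stand-in for a real convergent call

define void @kernel(i1 %cond) convergent {
entry:
  ; captures the set of threads that entered the function together
  %t = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  ; one "heart" per loop iteration, anchored to the function-entry token
  %l = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t) ]
  ; the bundle pins which threads communicate in this call
  call void @convergent_op() [ "convergencectrl"(token %l) ]
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}

The verifier already enforces the static placement rules here, for example that the entry intrinsic may only appear at the start of the entry block.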

That’s not an entirely accurate description of __syncwarp(). It merely causes threads to wait for other threads to reach that point. Beyond the __syncwarp(), these threads are free to make independent forward progress again, and the next convergent op cannot rely on these threads to be “converged”.

In fact, barriers like __syncwarp() are orthogonal to our definition of convergence. Whether or not two threads are related by the “maximal convergence” property is determined entirely by control flow, not by barriers. Maximal convergence should not be interpreted as having an actual meaning such as “the threads are in lock step” or “the threads are executing together”; any such interpretation can create the wrong expectations.

There are no control-flow barriers in the LLVM IR as far as I know. Some targets use their own intrinsics to represent them.

I don’t see why this is important. Someone with more expertise on CUDA’s independent thread scheduling may want to weigh in, but this transformation does not seem to be relevant to that. What matters is the mask argument that you provide to the CUDA convergent op.

Hope that helps!
Sameer.


Actually, I don’t understand this comment. An excerpt from the doc:

The outcome of a convergent operation is sensitive to the set of threads that executes it “together”, i.e., convergently.

The doc calls out “the threads are executing together”, which you discard as part of the convergence property?

It also says:

When threads diverge at a divergent branch, they may later reconverge at a common program point. Subsequent operations are performed convergently, but the inputs may be non-uniform, thus producing divergent outputs.

The “Subsequent operations are performed convergently” part is the interesting one here. What’s not clear to me is “they may later reconverge at a common program point”: this “common program point” is the syncwarp, I believe.

It’s been a while since I dug into the convergence definition, and I see that it actually changed completely last summer. I can’t find in Convergence And Uniformity — LLVM 19.0.0git documentation a definition of reconvergence that does not involve cycles. Can you help me understand how the doc addresses this simple situation:

if (divergent_condition) {
} else {
}
// are the threads converged at this program point?

I find the definition of “converged-with” very vague here:

Converged-with is a transitive symmetric relation over dynamic instances produced by different threads for the same static instance. Informally, two threads that produce converged dynamic instances are said to be converged, and they are said to execute that static instance convergently, at that point in the execution.

It seems like an almost circular definition: I can’t tell what it means for “dynamic instances” to be “converged”.

This and the next quote are from the introduction, which does not claim to define anything, but it is an attempt to capture common intuition. That is why the word “together” is in quotes.

The section titled “Convergence” has this informational note:

Converged dynamic instances need not be executed at the same time or even on the same resource. Converged dynamic instances of a convergent operation may appear to do so but that is an implementation detail.

But I can see that there is a gap between what we want to say and what the introduction is conveying. I’ll start working on an improvement.

Note that this is still in the introduction, where it’s trying to be informal, but it ended up being ambiguous. I’ll work on it. The common program point is not any specific operation. It is just any program point, like the end of the if/then/else in your example below.

The text is meant to say that if a static instance is not in any cycle, then its dynamic instances are always converged. In other words, two threads executing a DAG of control flow always produce converged dynamic instances at each static instance. In our eagerness to remove redundant statements, it seems we overdid the pruning here.

This is happening due to an over-abundance of the word “converged”. The sentence tries to bridge the gap between the “converged-with” relation over dynamic instances (where “converged dynamic instances” are the ones related by “converged-with”) on the one hand, and the notion that “threads are converged at a given static instance” on the other. It might even help to just remove this paragraph; I don’t see that it accomplishes anything useful, actually.

Sameer.


I think you’re trying to distinguish the actual execution (SIMD / lock-step) from the conceptual model of convergence, where “together” means that the threads participate in the same dynamic instance of a convergent operation (regardless of how that is implemented).

I assume this distinction isn’t really useful when we’re just discussing the convergence model: we only care about the convergence definition of “together”. Whether the threads are in lock-step or not isn’t a visible aspect of the convergence model as far as I can tell, so it shouldn’t affect anything in the discussion here, right?

The main problem to me isn’t so much the informal aspect as the disconnection from the rest of the document. I would expect the informal definition to get “formalized”, or at least be given an unambiguous definition, later in the doc. Maybe there is just a missing connection (mapping the terminology when you move from informal to more formal?).

The confusion comes from the post-Volta model, where we don’t get any guarantee about the non-divergence of threads, even in straight-line code without any conditionals, I believe.

I suspect that for the purpose of:

__syncwarp()
convergent_op()

we don’t support a convergent_op() that depends on the __syncwarp() reconvergence: the __syncwarp() only matters for synchronizing shared memory accesses, not for convergent operations. Instead, the convergent operation has to implicitly synchronize the threads itself if needed. The abstract model of convergence in LLVM does not deal with this (and that’s fine, I think).
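
In IR-ish terms, something like this (I’m writing the NVVM intrinsic names from memory, so treat them as approximate): the barrier orders memory, while the shuffle names and synchronizes its own participants through the mask operand:

; __syncwarp(): waits for the named lanes and orders shared-memory accesses
call void @llvm.nvvm.bar.warp.sync(i32 -1)
; the shuffle synchronizes its own participants via its mask operand,
; independently of any barrier that precedes it
%r = call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 -1, i32 %v, i32 0, i32 31)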

Thanks for the help! And thanks for driving this work forward, happy to iterate on revisions of the doc.

Thanks Sameer and Mehdi.

  1. I agree with Mehdi that in the post-Volta model, most of the instructions seem to have an explicit mask on them and thus necessitate explicit synchronization.
    However, how about the following case:
    __syncthreads()
    mask = __activemask()
    Given that the output of __activemask() is influenced by the convergence of threads up to that point, shouldn’t it be treated as a convergent op? If so, would it then be necessary to establish its dependency on instructions like __syncwarp() or __syncthreads(), which have the potential to modify the active thread set?

  2. Also, do the convergence semantics encompass textually unaligned convergent operations?
    For example, say we have

if (even_threads)
  shfl.sync.unaligned r0 = a1, b1, mask=all_threads   // shfl_1

shfl.sync.unaligned r0 = a2, b2, mask=all_threads     // shfl_2
shfl.sync.unaligned r0 = a3, b3, mask=all_threads     // shfl_3

if (odd_threads)
  shfl.sync.unaligned r0 = a4, b4, mask=even_threads  // shfl_4

The anticipated execution pattern is as follows:

even_threads executing shfl_1 should pair with odd_threads executing shfl_2
even_threads executing shfl_2 should pair with odd_threads executing shfl_3
even_threads executing shfl_3 should pair with odd_threads executing shfl_4

However, altering the order of shfl_2 and shfl_3 introduces a significant concern, as it could change which threads interact with each other: specifically, even threads may end up interacting with odd threads in shfl_3 directly after shfl_1, potentially leading to different outcomes.

To be honest, I don’t even understand how to use __activemask() in the Volta programming model, since as far as I can tell from reading about Volta, threads are free to diverge at any time, even in straight-line code: that means there is no guarantee that immediately after __syncwarp() you would still have the warp fully converged.

Actually the doc calls it out:

Note that threads that are convergent at an __activemask() call are not guaranteed to be convergent at subsequent instructions unless those instructions are synchronizing warp-builtin functions.

The example in the doc is that the result of __activemask() is fed as the mask to the convergent operation itself.
As such I don’t quite see how we’d break __activemask(): it should be valid to reorder it with __syncwarp() as far as I understand.
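
Concretely, the pattern the doc describes is something like this (activemask() here is a hypothetical stand-in for whatever __activemask() lowers to; the shuffle intrinsic name is again from memory):

%mask = call i32 @activemask()   ; hypothetical stand-in for __activemask()
%r = call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %mask, i32 %v, i32 0, i32 31)

Whatever threads happen to be active produce the mask, and that same mask names the participants of the shuffle, so moving the pair across a __syncwarp() changes nothing that the model can observe.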

I don’t understand this, actually: my understanding is that odd_threads should wait for even_threads at shfl_2?

Also, your shfl_1 does not seem to be valid CUDA to me, according to this doc, which states:

all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

Yes, in the scenario I described, all threads which participate in the shfl operation (as dictated by their mask) will eventually end up executing the same shfl instance. Specifically, the even_threads may encounter the first shuffle operation (shfl_1), and the odd_threads may encounter the second (shfl_2). However, in practice, all threads will execute only one of these instances together. Such instances of textually unaligned synchronization operations occur frequently.

Ah, I missed the “unaligned” part of shfl.sync.unaligned; that seems interesting!
It means that different static instances of a convergent operation are supposed to have converged dynamic instances, if I follow correctly?
It’s not clear to me that the LLVM model supports this right now. @ssahasra, @nhaehnle: thoughts on this?

There is no builtin in CUDA or PTX that “modifies the thread set”. The semantics of all these intrinsics is defined only at each particular call and does not extend beyond that call. The only way that the user can rely on the convergence of any specific threads (inside a warp) is by mentioning them in the mask. This mask works only within a warp. There is no way to guarantee that any specific threads from different warps will “execute convergent_op together” in real hardware and in real time.

That’s right.

No, the current LLVM model does not support this. The LLVM model is only able to reason about dynamic instances of the same static instance. This is not a useful concept for uniformity analysis anyway. There is an impact on convergence control, though. Convergence control tokens are unable to model textually unaligned convergence: convergence is separately defined for each static instance that uses a token, whereas textually unaligned static instances need a definition that works over a set of uses instead. It was briefly discussed in the original RFC long ago.
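
To sketch why (shfl() here is a hypothetical convergent declaration): a token can pin the convergence of dynamic instances of one static call site, but there is no way to write a token that says “this call in the even threads converges with that other call in the odd threads”:

%t = call token @llvm.experimental.convergence.anchor()
...
even:
  %a = call i32 @shfl(i32 %x) [ "convergencectrl"(token %t) ]   ; static instance 1
  br label %join
odd:
  %b = call i32 @shfl(i32 %y) [ "convergencectrl"(token %t) ]   ; static instance 2
  br label %join

Even though both uses name the same token, convergence is only defined among dynamic instances of the same static instance, so these two calls can never be converged with each other under the current rules.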


It sounds like those “unaligned” intrinsics no longer have any implicit dependency on control flow, and therefore I suspect we simply shouldn’t consider them to be convergent. Yes, they involve inter-thread communication, but with an explicitly listed set of participating threads presumably under an assumption of fully independent forward progress of individual threads. There’s simply no need for convergent to model this.

(I’d be curious to see concrete examples of where this is useful…)


Well, in general, the compiler still can’t transform the program so that one of the listed threads fails to produce a dynamic instance, or produces an unintended dynamic instance, right? In other words, these operations cannot be hoisted, sunk, or eliminated. It’s best to treat them as convergent in the most conservative way possible, unless the compiler can reason about specific threads. Maybe treat them as having volatile side effects “that must match the abstract machine”?

Right, but as long as those constraints are independent of control flow, I think they are more accurately modeled as memory(inaccessible: readwrite) or something else along those lines.
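
For instance, a declaration along these lines (a sketch; memory(inaccessible: readwrite) is the real LLVM attribute, while the function itself is hypothetical):

; communicates between threads, but with an explicitly named set of
; participants, so plain memory-effect reasoning keeps it in place
declare i32 @shfl_unaligned(i32 %mask, i32 %val, i32 %src) memory(inaccessible: readwrite)

That prevents the call from being deleted, speculated, or reordered with other side-effecting operations, without dragging convergence into it.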


Posted a PR that addresses some of the comments from here: