[RFC] Adding thread group semantics to LangRef (motivated by GPUs)

Hi all,

LLVM needs a solution to the long-standing problem that the IR is unable to express certain semantics expected by high-level programming languages that target GPUs.

Solving this issue is necessary both for upstream use of LLVM as a compiler backend for GPUs and for correctly supporting LLVM IR <-> SPIR-V roundtrip translation. It may also be useful for compilers targeting SPMD-on-CPU a la ispc -- after all, some GPU hardware really just looks like a CPU with extra-wide SIMD.

After thinking and talking about the problem on and off for more than two years now, I'm convinced that it can only be solved by adding dedicated semantics to LLVM IR, which take the shape of:

- a new function (and call) attribute similar to `convergent`,
- explicitly adding language to the LangRef that talks about groups of threads and their communication with each other via special functions,
- including how these groups of threads are impacted (split and merged) by branches etc., and
- new target-independent intrinsic(s) which manipulate subgroups of threads, mostly by marking locations in the program where threads reconverge.

Details to be determined, of course.

In this mail, I would mostly like to make initial contact with the larger community. First, to explain the problem to CPU-centric folks and maybe help get over the initial shock. Second, to reach out to other people thinking about GPUs: Has anybody else given this issue much thought? How can we proceed to get this solved?

The Problem

Hi from one of the people who works/worked on the NVPTX backend.

The key issue is that the set of threads participating in the exchange of data is implicitly defined by control flow.

I agree that this is the root of the problem.

The way nvidia GPUs “solve” this is that in Volta and newer, it’s up to the user to track explicitly which set of threads participate in warp-synchronous functions. You cannot safely ballot(). Problem “solved”. :-)

We already have the notion of “convergent” functions like syncthreads(), to which we cannot add control-flow dependencies. That is, it’s legal to hoist syncthreads out of an “if”, but it’s not legal to sink it into an “if”. It’s not clear to me why we can’t have “anticonvergent” (terrible name) functions which cannot have control-flow dependencies removed from them? ballot() would be both convergent and anticonvergent.

Would that solve your problem?

However, the basic block containing the ballot call in the natural lowering to LLVM IR is not part of the loop at all. The information that it was intended to be run as part of the loop is currently lost forever.

Sounds like the natural lowering of this example is not respecting anticonvergence and just needs to be made more unnatural?

I also think it’s worthwhile to consider the reasons behind nvidia’s move away from functions like ballot() towards explicit tracking of which threads are active. +Olivier Giroux told me a while ago that he was working on a paper which showed that the old way of doing things is even more fraught/broken/difficult-to-specify than I thought; I’m not sure if anything ended up being published.

Hi Justin, Nicolai, LLVM-dev,

The new *_sync() collective operations, and the newer & higher-level Cooperative Groups API, both addressed difficulties that users faced and removed obstacles that prevented architectural innovation with SIMT going forward. It was tough work, and deprecation is a tough sell, but we have a much better programming system because of it now.

While I won’t write long-form here, you can refer to the IEEE Micro article on Volta (Volume 38, Issue 2, Mar./Apr. 2018) where I wrote something about the reasons why this change happened:

  • Avoiding both deadlock and silent data corruption with user-authored synchronization.
  • Simplifying the legality rules for collective operations. (What Justin was alluding to, I think.)
  • Moving beyond simple Dom/PDom reconvergence methods, to extract more convergence.

And I have a fourth reason that isn’t in IEEE Micro, because it’s only interesting to this audience: being free from having to audit every front-end and transformation, by requiring only single-threaded semantics be preserved. This might be less relevant to users, except for the fact that it improved reliability.

Now I can’t get into a deep discussion of each reason, and for this I apologize in advance.

Sincerely, and wishing you happy holidays,

Olivier

We already have the notion of "convergent" functions like syncthreads(), to which we cannot add
control-flow dependencies. That is, it's legal to hoist syncthreads out of an "if", but it's not
legal to sink it into an "if". It's not clear to me why we can't have "anticonvergent" (terrible
name) functions which cannot have control-flow dependencies removed from them? ballot() would be
both convergent and anticonvergent.

Would that solve your problem?

One could fold both these constraints into the single existing "convergent" attribute, and say that the control-flow dependencies of a convergent function call cannot be trifled with. But there is more: is it okay to sink a syncthreads() call into an "if", if the "if" (heh!) is known to be uniform, i.e., all threads are guaranteed to take the same side of the branch? This may not be important for syncthreads, but it may be useful for calls like ballot() that produce or use values. That would mean that the convergent attribute should really constrain /divergent/ control-flow dependencies. It would be nice to have a single-threaded way to say all of this.
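The uniform-versus-divergent distinction above can be sketched in a toy single-threaded model (Python purely for illustration; `ballot` and the thread sets are hypothetical stand-ins, not LLVM API):

```python
# Toy SIMT model: a ballot simply reports the currently-active threads.
# Hypothetical sketch; real semantics live in the target's ISA.

def ballot(active):
    """Return the set of threads participating in the ballot."""
    return frozenset(active)

def run_original(conds):
    # The convergent call executed by all threads, outside the "if".
    threads = frozenset(conds)
    return ballot(threads)

def run_sunk(conds):
    # The call sunk into the "if": only threads whose condition holds run it.
    threads = frozenset(conds)
    taken = frozenset(t for t in threads if conds[t])
    return ballot(taken)

# Uniform condition: every thread takes the same side, so sinking is harmless.
uniform = {0: True, 1: True, 2: True, 3: True}
assert run_original(uniform) == run_sunk(uniform)

# Divergent condition: sinking changes the set of participating threads.
divergent = {0: True, 1: False, 2: True, 3: False}
assert run_original(divergent) != run_sunk(divergent)
assert run_sunk(divergent) == frozenset({0, 2})
```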

However, the basic block containing the ballot call in the natural lowering to LLVM IR is not

part of the loop at all. The information that it was intended to be run as part of the loop is
currently lost forever.

Sounds like the natural lowering of this example is not respecting anticonvergence and just needs
to be made more unnatural?

Right. We probably can't change the way control flow is represented in LLVM. But we could have a way to mark the extra blocks at the exit of the loop as being special. Optimizations will need to be aware of this when moving convergent functions. This is similar to using header/merge blocks in SPIR-V to demarcate the "extended" body of a loop. The tricky part is that the loop condition will now need to be a control-flow dependency for any convergent calls in those extra blocks!

I also think it's worthwhile to consider the reasons behind nvidia's move away from functions like
ballot() towards explicit tracking of which threads are active.

Ack for Olivier's email about *_sync and co-operative groups. Having a new programming model is quite useful. But what we are looking for is a generic way to support existing programming models in LLVM while limiting how invasive the changes will need to be.

Sameer.

Hi Justin,

Hi from one of the people who works/worked on the NVPTX backend.

> The key issue is that *the set of threads participating in the exchange of data is implicitly defined by control flow*.

I agree that this is the root of the problem.

The way nvidia GPUs "solve" this is that in Volta and newer, it's up to the user to track explicitly which set of threads participate in warp-synchronous functions. You cannot safely ballot(). Problem "solved". :-)

Yeah. Though there are costs associated with this approach, no matter how you implement it exactly, which is something I'd like to avoid.

The costs come from handling the case where the user tracks the set of threads "incorrectly", i.e. the set of threads conflicts with what would be naturally implied by control flow. "Incorrectly" in quotation marks because one could imagine this happening on purpose. But the more common case is for the programmer to intend to match their set of threads to align with control flow, and then we shouldn't have to pay the additional cost.

Now, if part of a clean solution was to somehow represent the set of threads explicitly in LLVM IR as values, I wouldn't be fundamentally opposed to that. The important point is that doing so isn't sufficient, because you also want to express the tie-in back to control flow somehow.

We already have the notion of "convergent" functions like syncthreads(), to which we cannot add control-flow dependencies. That is, it's legal to hoist syncthreads out of an "if", but it's not legal to sink it into an "if". It's not clear to me why we can't have "anticonvergent" (terrible name) functions which cannot have control-flow dependencies removed from them? ballot() would be both convergent and anticonvergent.

Would that solve your problem?

Right, thanks for reminding me of this line of thought. I think that already came up in a failed (and in hindsight misguided) review of mine on Phabricator a long time ago that you participated in.

There's a mixture of reasons why I have given up on this kind of approach.

The first is that "control-flow dependencies" aren't a particularly standard concept for unstructured control flow graphs in the first place, so forcing CPU folks to think about it is a bit problematic. Related to this is that all the definition attempts I'm aware of suffer from spooky action at a distance. An example of what I mean:

   flag = true;
   if (cond) {
     ...
     flag = ...;
   }
   if (flag) {
     ballot();
   }

There's a natural jump threading here that you would want to do for a CPU. Is that jump threading legal for a GPU?

The jump threading *should* be legal if the ballot wasn't there -- after all, it's legal and desired on CPUs. However, the jump threading pass does not naturally look at the basic block that contains the ballot, so it cannot make that judgment without an unnatural compile-time impact. Hence "spooky action at a distance".

And there are definitely implementations for which the jump threading here is illegal (classic PDOM reconvergence, for one). FWIW, the AMDGPU backend would currently just undo the jump threading entirely if the program is as simple as this, although I'm not comfortable with how reliably that works in more complex control flow (e.g., if there are loop nests involved as well). Since correctness requires reconverging anyway, it's preferable that jump threading is not legal here as long as the ballot is present.
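A toy model of the jump-threading example (hypothetical names; a ballot's participants are approximated as the threads that reach the same static call site together, as under classic PDOM-style reconvergence):

```python
# Hypothetical sketch: jump threading clones the ballot block, so the single
# dynamic ballot gets split across two static call sites.
from collections import defaultdict

def run(threads, cond, flag_update, threaded):
    """Return {static_site: participants} for the ballot() call."""
    ballots = defaultdict(set)
    for t in threads:
        flag = True
        if cond[t]:
            flag = flag_update[t]        # "flag = ...;" inside the first if
        if flag:
            # After jump threading, threads with cond == false branch straight
            # to a cloned copy of the ballot block.
            site = "ballot.clone" if (threaded and not cond[t]) else "ballot"
            ballots[site].add(t)
    return dict(ballots)

threads = [0, 1, 2, 3]
cond = {0: True, 1: False, 2: True, 3: False}
flag_update = {0: True, 2: True}         # threads that ran the first "if"

# One ballot with all four threads before threading...
assert run(threads, cond, flag_update, threaded=False) == {"ballot": {0, 1, 2, 3}}
# ...split across two static sites after threading.
assert run(threads, cond, flag_update, threaded=True) == \
    {"ballot": {0, 2}, "ballot.clone": {1, 3}}
```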

But then, if `cond` is uniform, the jump threading can be both legal and desirable -- this goes in the direction of what Sameer wrote in his email.

An `anticonvergent` attribute alone is simply insufficient to cover all this, but explicitly placing strategic "reconverging points" (in the form of some intrinsic) can solve all these problems.

At a higher level, all of the `convergent` / `anticonvergent` approaches are ultimately cop-outs. There are reasons why they developed the way they did, but I think it's time to face the reality of what's happening, which is that there is communication / synchronization between threads. If we just modeled that honestly, a lot of the confusion should disappear.

> However, the basic block containing the ballot call in the natural lowering to LLVM IR is not part of the loop at all. The information that it was intended to be run as part of the loop is currently lost forever.

Sounds like the natural lowering of this example is not respecting anticonvergence and just needs to be made more unnatural?

Right, though two points:

1. The optional structured control flow elements in SPIR-V point a way towards annotating an otherwise natural lowering with intrinsics in a way that models the required semantics.

2. If some other way was used that involved an unnatural lowering, we must be careful to guarantee that control flow transforms are unable to transform it back to the natural lowering.

I also think it's worthwhile to consider the reasons behind nvidia's move away from functions like ballot() towards explicit tracking of which threads are active. +Olivier Giroux <mailto:ogiroux@gmail.com> told me a while ago that he was working on a paper which showed that the old way of doing things is even more fraught/broken/difficult-to-specify than I thought; I'm not sure if anything ended up being published.

I'm curious about that, too.

I understand how the Volta architecture helps to avoid deadlocks with independent locking of threads, and it seems to me that co-operative groups mostly just fall out as a relatively easily implementable feature on top of that. IMHO they're mostly a move in the wrong direction though, at least from a programmer ergonomics point of view. They make it easier to shoot yourself in the foot for some of the typical use cases of cross-lane operations.

There are difficulties with specifying semantics of the implicit groups, sure, but I'm rather convinced they can be solved, so I'm curious to read some well-stated arguments stating the opposite. :-)

Cheers,
Nicolai

Hi from one of the people who works/worked on the NVPTX backend.

The key issue is that the set of threads participating in the exchange of data is implicitly defined by control flow.

I agree that this is the root of the problem.

The way nvidia GPUs “solve” this is that in Volta and newer, it’s up to the user to track explicitly which set of threads participate in warp-synchronous functions. You cannot safely ballot(). Problem “solved”. :-)

Well, that certainly works for Nvidia, but from what I understand, you can’t efficiently do this without special hardware support to properly synchronize the threads when you get to the ballot or whatever, so the rest of us are unfortunately out of luck :slight_smile:

We already have the notion of “convergent” functions like syncthreads(), to which we cannot add control-flow dependencies. That is, it’s legal to hoist syncthreads out of an “if”, but it’s not legal to sink it into an “if”. It’s not clear to me why we can’t have “anticonvergent” (terrible name) functions which cannot have control-flow dependencies removed from them? ballot() would be both convergent and anticonvergent.

Would that solve your problem?

I think it’s important to note that we already have such an attribute, although with the opposite sense - it’s impossible to remove control flow dependencies from a call unless you mark it as “speculatable”. However, this doesn’t prevent

if (…) {
} else {
}
foo = ballot();

from being turned into

if (…) {
  foo1 = ballot();
} else {
  foo2 = ballot();
}
foo = phi(foo1, foo2)

and vice versa. We have a “noduplicate” attribute which prevents transforming the first into the second, but not the other way around. Of course we could keep going this way and add a “nocombine” attribute to complement noduplicate. But even then, there are still problematic transforms. For example, take this program, which is simplified from a real game that doesn’t work with the AMDGPU backend:

while (cond1 /* uniform */) {
  ballot();
  ...
  if (cond2 /* non-uniform */) continue;
  ...
}

In SPIR-V, when using structured control flow, the semantics of this are pretty clearly defined. In particular, there’s a continue block after the body of the loop where control flow re-converges, and the only back edge is from the continue block, so the ballot is in uniform control flow. But LLVM will get rid of the continue block since it’s empty, and re-analyze the loop as two nested loops, splitting the loop header in two, producing a CFG which corresponds to this:

while (cond1 /* uniform */) {
  do {
    ballot();
    ...
  } while (cond2 /* non-uniform */);
  ...
}

Now, in an implementation where control flow re-converges at the immediate post-dominator, this won’t do the right thing anymore. In order to handle it correctly, you’d effectively need to always flatten nested loops, which will probably be really bad for performance if the programmer actually wanted the second thing. It also makes it impossible when translating a high-level language to LLVM to get the “natural” behavior which game developers actually expect. This is exactly the sort of “spooky action at a distance” which makes me think that everything we’ve done so far is really insufficient, and we need to add an explicit notion of control-flow divergence and reconvergence to the IR. We need a way to say that control flow re-converges at the continue block, so that LLVM won’t eliminate it, and we can vectorize it correctly without penalizing cases where it’s better for control flow not to re-converge.
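The difference between the two reconvergence rules can be sketched abstractly (a hypothetical toy model, not the exact game shader: it only tracks which threads participate in each dynamic execution of ballot(), with the non-uniform back edge taken by thread t while its trip count is below t):

```python
# Toy model of ballot participants under two reconvergence rules.

def spirv_continue_form(threads, trips):
    """Loop with a continue block: threads reconverge there every iteration,
    so each execution of ballot() sees all threads still in the loop."""
    return [frozenset(threads) for _ in range(trips)]

def pdom_nested_form(threads, trips):
    """Nested-loop form with immediate-post-dominator reconvergence: a thread
    that exits the inner loop waits at the post-dominator, so the inner
    ballot() runs with a shrinking active set."""
    ballots = []
    active = set(threads)
    trip = 0
    while active and trip < trips:
        ballots.append(frozenset(active))
        active = {t for t in active if trip < t}   # non-uniform back edge
        trip += 1
    return ballots

threads = [0, 1, 2]
assert spirv_continue_form(threads, 3) == [frozenset({0, 1, 2})] * 3
assert pdom_nested_form(threads, 3) == [
    frozenset({0, 1, 2}), frozenset({1, 2}), frozenset({2})]
```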

    We already have the notion of "convergent" functions like
    syncthreads(), to which we cannot add control-flow dependencies. That is, it's legal to hoist syncthreads out of an "if", but it's
    not legal to sink it into an "if". It's not clear to me why we
    can't have "anticonvergent" (terrible name) functions which cannot
    have control-flow dependencies removed from them? ballot() would be
    both convergent and anticonvergent.

    Would that solve your problem?

I think it's important to note that we already have such an attribute, although with the opposite sense - it's impossible to remove control flow dependencies from a call unless you mark it as "speculatable".

This isn't actually true. If both sides of an if/else have the same non-speculative function call, it can still be moved out of control flow.

That's because doing so doesn't change anything at all from a single-threaded perspective. Hence why I think we should model the communication between threads honestly.

However, this doesn't prevent

if (...) {
} else {
}
foo = ballot();

from being turned into

if (...) {
  foo1 = ballot();
} else {
  foo2 = ballot();
}
foo = phi(foo1, foo2)

and vice versa. We have a "noduplicate" attribute which prevents transforming the first into the second, but not the other way around. Of course we could keep going this way and add a "nocombine" attribute to complement noduplicate. But even then, there are still problematic transforms. For example, take this program, which is simplified from a real game that doesn't work with the AMDGPU backend:

while (cond1 /* uniform */) {
  ballot();
  ...
  if (cond2 /* non-uniform */) continue;
  ...
}

In SPIR-V, when using structured control flow, the semantics of this are pretty clearly defined. In particular, there's a continue block after the body of the loop where control flow re-converges, and the only back edge is from the continue block, so the ballot is in uniform control flow. But LLVM will get rid of the continue block since it's empty, and re-analyze the loop as two nested loops, splitting the loop header in two, producing a CFG which corresponds to this:

while (cond1 /* uniform */) {
  do {
    ballot();
    ...
  } while (cond2 /* non-uniform */);
  ...
}

Now, in an implementation where control flow re-converges at the immediate post-dominator, this won't do the right thing anymore. In order to handle it correctly, you'd effectively need to always flatten nested loops, which will probably be really bad for performance if the programmer actually wanted the second thing. It also makes it impossible when translating a high-level language to LLVM to get the "natural" behavior which game developers actually expect. This is exactly the sort of "spooky action at a distance" which makes me think that everything we've done so far is really insufficient, and we need to add an explicit notion of control-flow divergence and reconvergence to the IR. We need a way to say that control flow re-converges at the continue block, so that LLVM won't eliminate it, and we can vectorize it correctly without penalizing cases where it's better for control flow not to re-converge.

Well said!

Cheers,
Nicolai

I was looking into ballot() and whether it is possible to keep a single-threaded view of the code, but add some extra conditions that must hold after a transformation. My initial idea was that each call to ballot() in a single-threaded program can be seen as a partial write to a memory location, where the memory location is unique for every call site, plus some externally observable side effect. We can abstract this away by tagging the calls, e.g. by using aliases.

For example:

if (…) {
  foo1 = ballot();
} else {
  foo2 = ballot();
}

simply becomes:

if (…) {
  foo1 = ballot_1();
} else {
  foo2 = ballot_2();
}

and

if (…) {
} else {
}
ballot();

becomes

if (…) {
} else {
}
ballot_1();

In the first case it would prevent combining the two calls into one
after the if. In the second example there is generally nothing that
says it could not be transformed into the first example with two
calls to ballot_1(), which should not be allowed.
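The tagging step itself is mechanical; a minimal sketch (with the program represented, purely hypothetically, as a list of instruction strings):

```python
# Hypothetical sketch of the tagging idea: rewrite each ballot() call site
# into a uniquely tagged alias, so a later check can tell whether a
# transform merged or duplicated distinct call sites.

def tag_ballots(instructions):
    """Give every 'ballot()' call site its own ballot_N alias."""
    count = 0
    tagged = []
    for inst in instructions:
        if "ballot()" in inst:
            count += 1
            inst = inst.replace("ballot()", f"ballot_{count}()")
        tagged.append(inst)
    return tagged

program = ["if (...) {", "  foo1 = ballot()", "} else {", "  foo2 = ballot()", "}"]
tagged = tag_ballots(program)

# The two call sites get distinct tags, so merging them after the "if"
# would be visible as an (illegal) combination of ballot_1 with ballot_2.
assert tagged[1] == "  foo1 = ballot_1()"
assert tagged[3] == "  foo2 = ballot_2()"
```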

Another form of duplication that we must allow is loop transforms,
like unrolling or peeling. These might seem similar to the example
above, since we are cloning code under conditions, etc. But they are
different, since the calls are in different loop iterations.

The condition that needs to be met is that:

There must be a single path between all cloned ballot_n() functions.

The reason for this condition is that if we clone the same call, then
the copies must be mutually exclusive, but if they are cloned from
a loop, there must be a path, or we would skip iterations.

If we want to be more sophisticated we can add:

If there is no such path, the calls must be separated by uniform branches.

After the transform everything should be re-tagged, since we already
checked the calls and we don’t want to check them again. Also, not all
transforms need (or should) have the tagging checked. One example is
inlining, where multiple copies are created, but they are clearly different
calls. The tagging can be done temporarily for a single pass, and then
eliminated. This could be a good tool for debugging as well, since it can
detect if a transform is suspect.

The code would of course have to make sense as far as control flow goes. If
we have:

for (;;) {
  if (…) {
    ballot();
    break;
  }
}

This would have to be translated to:

for (;;) {
  if (…) {
    ballot();
    if (UnknownConstant) { // Not a uniform condition, but later on translated to “true”
      break;
    }
  }
}

However, I think that this is the way the code is generated today anyway.
There would have to be some attribute that indicates that these calls (or functions)
contain ballot or other cross-lane operations so they could be tagged and
checked. The attribute would be used by the passes to know that the special
conditions exist for those calls.

As far as what it means to have a path, it could be complicated.
For example:

x = …
ballot_1();

could be transformed to:

if (x < 4711) {
  ballot_1();
}

if (x >= 4711) {
  ballot_1();
}

So a simple path check would say there is a path, and the transform is legal,
but if we examine the conditions, there is no path, and the transform should not be legal.
It could be made even more obscure of course, but I don’t see any optimizations really
doing this kind of thing.
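The caveat can be checked directly for this example (a small sketch; the guard predicates are the hypothetical ones from the snippet above):

```python
# A purely structural check sees a path from the first ballot_1 clone to the
# second, but the guarding conditions are mutually exclusive, so no thread
# can actually execute both.

def reaches_first(x):
    return x < 4711          # guard of the first ballot_1 clone

def reaches_second(x):
    return x >= 4711         # guard of the second ballot_1 clone

# No value of x satisfies both guards, so the "path" is never taken.
assert not any(reaches_first(x) and reaches_second(x) for x in range(10000))
```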

  • Jan


I don't see how this would fix the continue vs. nested loop problem I
explained earlier. That is, how would this prevent turning:

for (...) {
    ballot();
    if (... /* non-uniform */) continue;
}

into

for (...) {
    do {
        ballot();
    } while (... /* non-uniform */);
}

and vice versa? Note that there's no duplication going on here, and
the single-threaded flow of control is exactly the same.

Another reason this isn't so great is that it prevents e.g. CSE on
ballots that actually should be combined, since you're modelling it as
a write. It seems like the optimizer is going to need some special
knowledge of convergent things that fake memory constraints can't give
us.

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

I’m not sure if I follow this example; could you explain a bit more?
It looks to me that the condition in the “if” must be false (if the
same condition is used in the while), or we would
call ballot the wrong number of times.

About the CSE, when would that be legal? I can imagine with uniform
branches that it could work, but would like to see an example to
fully understand this.

I agree that it would be more conservative than if we model the threading,
but I’m not sure about the cost/benefit. I am mostly curious if it is
possible to have a single-thread view or not. Then we would have to
see if it is adequate.

  • Jan

I don’t see how this would fix the continue vs. nested loop problem I
explained earlier. That is, how would this prevent turning:

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

and vice versa? Note that there’s no duplication going on here, and
the single-threaded flow of control is exactly the same.

Another reason this isn’t so great is that it prevents e.g. CSE on
ballots that actually should be combined, since you’re modelling it as
a write. It seems like the optimizer is going to need some special
knowledge of convergent things that fake memory constraints can’t give
us.

I just remembered that I had put some thought into moving and eliminating ballots before. I was looking into
using cyclic equivalence classes, where a ballot could only be moved to a location with the same equivalence
class, and two ballot() calls could be merged. I didn’t think much more about it after that, because there was
no obvious way to handle uniform branches, which is important. I will see if I can figure something out that
makes sense.

  • Jan

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

I’m not sure if I follow this example; could you explain a bit more?
It looks to me that the condition in the “if” must be false (if the
same condition is used in the while), or we would
call ballot the wrong number of times.

About the CSE, when would that be legal? I can imagine with uniform
branches that it could work, but would like to see an example to
fully understand this.

I agree that it would be more conservative than if we model the threading,
but I’m not sure about the cost/benefit. I am mostly curious if it is
possible to have a single-thread view or not. Then we would have to
see if it is adequate.

  • Jan

I don’t see how this would fix the continue vs. nested loop problem I
explained earlier. That is, how would this prevent turning:

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

and vice versa? Note that there’s no duplication going on here, and
the single-threaded flow of control is exactly the same.

Another reason this isn’t so great is that it prevents e.g. CSE on
ballots that actually should be combined, since you’re modelling it as
a write. It seems like the optimizer is going to need some special
knowledge of convergent things that fake memory constraints can’t give
us.

I don’t want to hold up the discussion too long. The basic idea for handling
merging/code motion would be to use equivalence classes as described in
“Finding Regions Fast: Single Entry Single Exit and Control Regions in Linear Time”
by Richard Johnson, David Pearson, and Keshav Pingali.

Uniform branches would allow the two sets from a branch to be the same as the
incoming edge. That would allow for LICM if a loop is uniform, and it would allow
eliminating redundant ballots in sequential regions. I’m not sure if it can be
integrated directly into the existing algorithm, or if it is easier to handle it as a pass
afterwards.

The labeling could use the information to make two calls to ballot() with the same
equivalence class have the same tag (call the same alias function).

The rules for ballot would be:

  • A ballot cannot be moved to a basic block with a different equivalence class.
  • A ballotN can be eliminated if it is dominated by another ballotN.

There will always be cases where this will be conservative. As long as useful
transforms/optimizations are not prevented, that should be satisfactory.
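The second rule can be sketched as a small pass over tagged call sites (hypothetical representation; in practice `dominates` would come from LLVM's dominator tree):

```python
# Hypothetical sketch: a tagged ballot_N may be removed if another call with
# the same tag dominates it (same participants, hence the same value).

def eliminate_dominated(calls, dominates):
    """calls: list of (site, tag); dominates(a, b): does site a dominate b?"""
    kept = []
    for site, tag in calls:
        if any(tag == t and dominates(s, site) for s, t in kept):
            continue                    # dominated by an equivalent ballot
        kept.append((site, tag))
    return kept

# Straight-line block: site 0 dominates site 1, and both carry tag 1, so the
# second ballot_1 is redundant; ballot_2 has a different tag and survives.
calls = [(0, 1), (1, 1), (2, 2)]
dominates = lambda a, b: a < b          # trivial dominance in one basic block
assert eliminate_dominated(calls, dominates) == [(0, 1), (2, 2)]
```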

  • Jan

I just remembered that I had put some thought into moving and eliminating ballots before. I was looking into
using cyclic equivalence classes, where a ballot could only be moved to a location with the same equivalence
class, and two ballot() calls could be merged. I didn’t think much more about it after that, because there was
no obvious way to handle uniform branches, which is important. I will see if I can figure something out that
makes sense.

  • Jan

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

I’m not sure if I follow this example; could you explain a bit more?
It looks to me that the condition in the “if” must be false (if the
same condition is used in the while), or we would
call ballot the wrong number of times.

About the CSE, when would that be legal? I can imagine with uniform
branches that it could work, but would like to see an example to
fully understand this.

I agree that it would be more conservative than if we model the threading,
but I’m not sure about the cost/benefit. I am mostly curious if it is
possible to have a single-thread view or not. Then we would have to
see if it is adequate.

  • Jan

I don’t see how this would fix the continue vs. nested loop problem I
explained earlier. That is, how would this prevent turning:

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

and vice versa? Note that there’s no duplication going on here, and
the single-threaded flow of control is exactly the same.

Another reason this isn’t so great is that it prevents e.g. CSE on
ballots that actually should be combined, since you’re modelling it as
a write. It seems like the optimizer is going to need some special
knowledge of convergent things that fake memory constraints can’t give
us.

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

I’m not sure if I follow this example, could you explain a bit more?
It looks to me that the condition in the “if” must be false (if the
same condition is used in the while), or we would call ballot the
wrong number of times.

Yes, the idea is that the same condition is used in the if and the do-while. I think I messed up the example a little… in the second snippet, we’re supposed to break out of the inner loop if the outer loop’s exit condition is true. Here’s a more concrete example:

for (int i = 0; i < 2; i++) {
foo = ballot(true); // ballot 1

if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;

bar = ballot(true); // ballot 2
}

versus:

int i = 0;

while (true) {
do {
if (i == 2) break;

foo = ballot(true); // ballot 1
i++;

} while (threadID % 2 == 0);

if (i == 2) break;
bar = ballot(true); // ballot 2
i++;

}

From a single-threaded perspective, these two snippets are identical, even if ballot() writes to arbitrary memory. The first can easily get transformed to something like the second when LLVM decides to duplicate the final i++ through jump forwarding, and then re-interprets the loop as two nested loops and splits the loop header in two. This is what currently happens with DOOM when we try to enable subgroup operations with it. Let’s say there are two threads in a wavefront. Then the execution trace mandated by SPIR-V for the first looks like:

thread 0 | thread 1
ballot 1 = 0b11 | ballot 1 = 0b11
skipped | ballot 2 = 0b10

ballot 1 = 0b11 | ballot 1 = 0b11
skipped | ballot 2 = 0b10

Now, contrast this with the execution trace that programmers would expect for the second example:

thread 0 | thread 1
ballot 1 = 0b11 | ballot 1 = 0b11
ballot 1 = 0b01 | skipped
skipped | ballot 2 = 0b10
skipped | ballot 1 = 0b10
skipped | ballot 2 = 0b10
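The first (mandated) trace can be reproduced with a small lockstep sketch. The modelling below is entirely my own: threads simply re-converge at the loop header on every iteration, which is what the structured semantics requires for the first snippet.

```python
def ballot(active, wave_size):
    # Bit t is set iff thread t participates in this call together
    # with the other currently-active threads.
    return sum(1 << t for t in active)

def run_first_snippet(wave_size=2):
    trace = []
    threads = set(range(wave_size))
    for i in range(2):
        # All threads reach ballot 1 together at the top of the loop body.
        trace.append(("ballot 1", ballot(threads, wave_size)))
        # Even-numbered threads take the `continue`.
        stayed = {t for t in threads if t % 2 != 0}
        if stayed:
            trace.append(("ballot 2", ballot(stayed, wave_size)))
    return trace
```

With two threads this yields ballot 1 = 0b11 and ballot 2 = 0b10 on both iterations, matching the first table. The second trace cannot be produced this way, because the nested-loop version keeps thread 0 inside the inner loop after thread 1 has already left it.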

Nicolai’s proposal solves this by having the frontend emit a merge intrinsic before the i++ is emitted. This prevents the jump forwarding from occurring.

for (int i = 0; i < 2; i++) {
foo = ballot(true); // ballot 1

if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;

bar = ballot(true); // ballot 2
}

versus:

int i = 0;

while (true) {
do {
if (i == 2) break;

foo = ballot(true); // ballot 1
i++;

} while (threadID % 2 == 0);

if (i == 2) break;
bar = ballot(true); // ballot 2
i++;

}

I think you can remove the second “i++”, otherwise we can increment “i” twice
if threadID % 2 != 0, but I see the issue. Using the equivalence classes would
prevent this transform, since we have re-arranged the control flow in that way.
I’m not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.

Nicolai’s proposal solves this by having the frontend emit a merge intrinsic
before the i++ is emitted. This prevents the jump forwarding from occurring.

I was thinking about getting through the single-thread view and the issues with
that first, then I will think more about the multi-thread and explicit convergence.

If we are done with the single-thread stuff for now, these are the questions I
have been thinking about with the multi-threaded view:

What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
where this could happen.

Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?

How are uniform branches handled? Do they affect the convergence model?

  • Jan

for (…) {
ballot();
if (… /* non-uniform */) continue;
}

into

for (…) {
do {
ballot();
} while (… /* non-uniform */);
}

I’m not sure if I follow this example, could you and explain a bit more?
It looks to me that the condition in the “if” must be false (if the
same condition is used in the while), or we would
call ballot the wrong number of times.

Yes, the idea is that the same condition is used in the if and the do-while. I think I messed up the example a little… in the second snippet, we’re supposed to break out of the inner loop if the outer loop’s exit condition is true. Here’s a more concrete example:

for (int i = 0; i < 2; i++) {
foo = ballot(true); // ballot 1

if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;

bar = ballot(true); // ballot 2
}

versus:

int i = 0;

while (true) {
do {
if (i == 2) break;

foo = ballot(true); // ballot 1
i++;

} while (threadID % 2 == 0);

if (i == 2) break;
bar = ballot(true); // ballot 2
i++;

}

From a single-threaded perspective, these two snippets are identical, even if ballot() writes to arbitrary memory. The first can easily get transformed to something like the second when LLVM decides to duplicate the final i++ through jump forwarding, and then re-interprets the loop as two nested loops and splits the loop header in two. This is what currently happens with DOOM when we try to enable subgroup operations with it. Let’s say there are two threads in a wavefront. Then the execution trace mandated by SPIR-V for the first looks like:

thread 0 | thread 1
ballot 1 = 0b11 | ballot 1 = 0b11
skipped | ballot 2 = 0b10

ballot 1 = 0b11 | ballot 1 = 0b11
skipped | ballot 2 = 0b10

Now, contrast this with the execution trace that programmers would expect for the second example:

thread 0 | thread 1
ballot 1 = 0b11 | ballot 1 = 0b11
ballot 1 = 0b01 | skipped
skipped | ballot 2 = 0b10
skipped | ballot 1 = 0b10
skipped | ballot 2 = 0b10

Nicolai’s proposal solves this by having the frontend emit a merge intrinsic before the i++ is emitted. This prevents the jump forwarding from occurring.

> for (int i = 0; i < 2; i++) {
> foo = ballot(true); // ballot 1
>
> if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;
>
> bar = ballot(true); // ballot 2
> }
>
> versus:
>
> int i = 0;
> while (true) {
> do {
> if (i == 2) break;
> foo = ballot(true); // ballot 1
> i++;
> } while (threadID % 2 == 0);
>
> if (i == 2) break;
> bar = ballot(true); // ballot 2
> i++;
> }

I think you can remove the second "i++", otherwise we can increment "i" twice
if threadID % 2 != 0, but I see the issue. Using the equivalence classes would
prevent this transform, since we have re-arranged the control flow in that way,

What do you mean? Note that the transforms that lead from the first
example to the second are actually desirable if there aren't any
ballots or other convergent operations, so banning them entirely is a
no-go. Otherwise, note that the ballot here can be nested arbitrarily
deep, which means that jump forwarding/tail duplication has to be
aware of the entire loop unless we have some kind of merge intrinsic.

Also, I should say that while interesting, control equivalence classes
can be part of the *implementation*, but not the *specification*. We
need to define the semantics of the IR -- that is, what a program
compiled from any given IR is allowed to do when executed (and *not*
what it looks like when compiled) -- *first*, and then which
transforms are allowed/not allowed will fall out of that. We can't
start by listing disallowed transforms, because then when someone
comes along and writes a new optimization, it might be technically
"allowed" even though it breaks something in practice. The only way to
conclusively prove that transforms will never break code that's
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don't follow that rule, your
patch will probably be rejected.

Now, to define a semantics for ballot(), we need to define what it's
allowed to return for any given execution of any given program, and to
do that, we need to define which threads must be active together when
it's reached, which in turn means we need to define when control flow
can re-converge somehow. Any proposal for how to handle ballot() must
start here.

I'm not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.

> Nicolai's proposal solves this by having the frontend emit a merge intrinsic
> before the i++ is emitted. This prevents the jump forwarding from occurring.

I was thinking about getting through the single-thread view and the issues with
that first, then I will think more about the multi-thread and explicit convergence.

If we are done with the single-thread stuff for now, these are the questions I
have been thinking about with the multi-threaded view:

What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
where this could happen.

Hopefully, it's clear from the above how this all falls out. The
frontend for e.g. GLSL or SPIR-V would have to insert the merge
intrinsics to preserve the semantics of the source language. Any
transforms must refine the semantics of the IR, although I can't think
of a scenario where that would involve emitting any new merge
intrinsics. Usually, structurized IRs have their own semantics about
when control flow converges, so a code structurizer should respect the
original semantics. AMDGPU has its own code structurizer that runs
very late in the pipeline (although there are plans to remove it), so
we might have to change that to make it respect the merge
intrinsics, and then we’ll correctly implement them “for free”.

Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?

They'd only have to be present when ballot or other convergent
functions are called, since otherwise it doesn't matter when control
flow re-converges. However, we may want to keep them around for
performance reasons (usually destroying convergence points is bad for
performance).

How are uniform branches handled? Do they affect the convergence model?

We may be able to remove convergence points if branches are uniform.
In Nicolai's proposal, I believe we'd want to remove a merge intrinsic
when all the branches it post-dominates, except those already
post-dominated by some other merge intrinsic, are uniform.
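A sketch of how such a cleanup might look (all names here are mine, and I’m assuming a prior analysis has already mapped each divergence-capable branch to the merge intrinsic responsible for re-converging it):

```python
from collections import defaultdict

def removable_merges(branch_to_merge, is_uniform):
    # Group branches under the merge intrinsic that re-converges them.
    merged_branches = defaultdict(list)
    for branch, merge in branch_to_merge.items():
        merged_branches[merge].append(branch)
    # A merge can be removed iff every branch it is responsible for
    # is uniform, i.e. no actual divergence reaches it.
    return {m for m, branches in merged_branches.items()
            if all(is_uniform[b] for b in branches)}
```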

for (int i = 0; i < 2; i++) {
foo = ballot(true); // ballot 1

if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;

bar = ballot(true); // ballot 2
}

versus:

int i = 0;
while (true) {
do {
if (i == 2) break;
foo = ballot(true); // ballot 1
i++;
} while (threadID % 2 == 0);

if (i == 2) break;
bar = ballot(true); // ballot 2
i++;
}

I think you can remove the second “i++”, otherwise we can increment “i” twice
if threadID % 2 != 0, but I see the issue. Using the equivalence classes would
prevent this transform, since we have re-arranged the control flow in that way,

What do you mean? Note that the transforms that lead from the first
example to the second are actually desirable if there aren’t any
ballots or other convergent operations, so banning them entirely is a
no-go. Otherwise, note that the ballot here can be nested arbitrarily
deep, which means that jump forwarding/tail duplication has to be
aware of the entire loop unless we have some kind of merge intrinsic.

Yes, if there weren’t any calls to a ballot, then the transform would
be legal. What I was saying in the beginning was that ballot() would
have some special rules attached to it. It is of course undesirable to
have flags to enforce correctness, but that is the point we are at
right now.

Also, I should say that while interesting, control equivalence classes
can be part of the implementation, but not the specification. We need
to define the semantics of the IR – that is, what a program
compiled from any given IR is allowed to do when executed (and not
what it looks like when compiled) – first, and then which
transforms are allowed/not allowed will fall out of that. We can’t
start by listing disallowed transforms, because then when someone
comes along and writes a new optimization, it might be technically
“allowed” even though it breaks something in practice.

I think you misunderstand me if you think I was listing disallowed
transforms. My question was if it is possible to have multi-threaded
semantics in the source language, but in the compiler have a
single-threaded view, where some properties of the CFG would determine
what is legal and not for some functions with a special flag. I agree
it is more desirable to have the semantics specified in the
IR. However, I am exploring this from a practical point of view, since
ballot() is very rare compared to all code that is being compiled for
all targets. These intrinsics would always have to be considered when
writing a pass. They seem to me harder to think about, and
test, for someone who is working on a single-threaded target, compared
to looking at a flag. If we allow ourselves to look at which
transforms might violate these properties, we would free up
the rest of the compiler (and developers) to not have to worry about
these things. Intrinsics would have to be maintained throughout the
entire compilation process in every pass.

The only way to
conclusively prove that transforms will never break code that’s
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don’t follow that rule, your
patch will probably be rejected.

I’m trying to figure out if we are in the “almost never” territory
here or not.

Now, to define a semantics for ballot(), we need to define what it’s
allowed to return for any given execution of any given program, and to
do that, we need to define which threads must be active together when
it’s reached, which in turn means we need to define when control flow
can re-converge somehow. Any proposal for how to handle ballot() must
start here.

I’m not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.

What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
where this could happen.

Hopefully, it’s clear from the above how this all falls out. The
frontend for e.g. GLSL or SPIR-V would have to insert the merge
intrinsics to preserve the semantics of the source language. Any
transforms must refine the semantics of the IR, although I can’t think
of a scenario where that would involve emitting any new merge
intrinsics. Usually, structurized IRs have their own semantics about
when control flow converges, so a code structurizer should respect the
original semantics. AMDGPU has its own code structurizer that runs
very late in the pipeline (although there are plans to remove it), so
we might have to change that to make it respect the merge
intrinsics, and then we’ll correctly implement them “for free”.

Any transform that re-arranges control flow would potentially have to
know about the properties of ballot(), and the rules with respect to
the CFG (and maybe consider the target) to know where to insert the
intrinsics. I had the impression that the control flow convergence was
in part specified by what the target architecture can handle. One of
the more complicated cases would be linearization where the control
flow is completely rewritten, and is encoded in a variable that says
which basic block is the next one to execute. Another case is DCE,
where a ballot() could be eliminated, and it would potentially have to
remove a number of intrinsics to enable later optimizations (unless it
would affect performance?), so the intrinsics will require some
non-local updates.

Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?

They’d only have to be present when ballot or other convergent
functions are called, since otherwise it doesn’t matter when control
flow re-converges. However, we may want to keep them around for
performance reasons (usually destroying convergence points is bad for
performance).

So we might have them without a ballot(), which would seem to make it
more difficult for structurizers or other transforms to maintain the
semantics and insert intrinsics.

How are uniform branches handled? Do they affect the convergence model?

We may be able to remove convergence points if branches are uniform.
In Nicolai’s proposal, I believe we’d want to remove a merge intrinsic
when all the branches it post-dominates, except those already
post-dominated by some other merge intrinsic, are uniform.

I couldn’t quite understand the last sentence, but I assume the
conditions would prevent removing convergence points that help
performance. Post-domination might not be adequate if there are loops
btw.

  • Jan

I think you misunderstand me if you think I was listing disallowed
transforms. My question was if it is possible to have multi-threaded
semantics in the source language, but in the compiler have a
single-threaded view, where some properties of the CFG would determine
what is legal and not for some functions with a special flag.

But that’s practically the same as listing allowed and disallowed transforms – you’re defining what the final IR is allowed to look like, not how it’s allowed to execute. If at all possible, the former should be derived from the latter.

I agree
it is more desirable to have the semantics specified in the
IR. However, I am exploring this from a practical point of view, since
ballot() is very rare compared to all code that is being compiled for
all targets. These intrinsics would always have to be considered when
writing a pass. They seem to me harder to think about, and
test, for someone who is working on a single-threaded target, compared
to looking at a flag. If we allow ourselves to look at which
transforms might violate these properties, we would free up
the rest of the compiler (and developers) to not have to worry about
these things. Intrinsics would have to be maintained throughout the
entire compilation process in every pass.

The only way to
conclusively prove that transforms will never break code that’s
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don’t follow that rule, your
patch will probably be rejected.

I’m trying to figure out if we are in the “almost never” territory
here or not.

I don’t think so at all. In addition to being much, much harder to reason about and prove that the approach is sound, I think it’ll be more intrusive.

Now, to define a semantics for ballot(), we need to define what it’s
allowed to return for any given execution of any given program, and to
do that, we need to define which threads must be active together when
it’s reached, which in turn means we need to define when control flow
can re-converge somehow. Any proposal for how to handle ballot() must
start here.

I’m not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.

What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
where this could happen.

Hopefully, it’s clear from the above how this all falls out. The
frontend for e.g. GLSL or SPIR-V would have to insert the merge
intrinsics to preserve the semantics of the source language. Any
transforms must refine the semantics of the IR, although I can’t think
of a scenario where that would involve emitting any new merge
intrinsics. Usually, structurized IRs have their own semantics about
when control flow converges, so a code structurizer should respect the
original semantics. AMDGPU has its own code structurizer that runs
very late in the pipeline (although there are plans to remove it), so
we might have to change that to make it respect the merge
intrinsics, and then we’ll correctly implement them “for free”.

Any transform that re-arranges control flow would potentially have to
know about the properties of ballot(), and the rules with respect to
the CFG (and maybe consider the target) to know where to insert the
intrinsics.

But the same is true for basically any approach to handling this. In fact, adding the merge intrinsics makes this much easier. Beyond the usual problems with hoisting ballots, which passes currently avoid via the current convergent + speculatable attributes, we’d only have to make passes aware that they can’t duplicate/combine merge intrinsics or move them past convergent intrinsics, which is a local property and hence easily checkable. In the example I explained, without some kind of merge intrinsic, tail duplication has to look at the entire loop to know whether the transform is legal. Of course, you could have some kind of “no convergent calls” flag on a function, but that doesn’t eliminate the nastiness when there are convergent calls.
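The “local property” point can be illustrated with a toy legality check. This is my own sketch, not an actual pass, and `call @merge` stands in for whatever the hypothetical merge intrinsic would be called:

```python
def can_tail_duplicate(block_insts):
    # Duplicating a block that contains a merge intrinsic would turn one
    # re-convergence point into two, so refuse. Only the block's own
    # instructions are inspected; no whole-loop analysis is needed.
    return all(inst != "call @merge" for inst in block_insts)
```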

I had the impression that the control flow convergence was
in part specified by what the target architecture can handle.

This is true, although hopefully we can agree on something that everyone can implement.

One of
the more complicated cases would be linearization where the control
flow is completely rewritten, and is encoded in a variable that says
which basic block is the next one to execute.

It’s certainly interesting to think about how to maintain correctness in the face of ballots() with such a pass, but a) it’s certainly no harder with merge intrinsics than merges being implicit and b) I doubt that’s useful for anything you’d want to do with a GPU.

Another case is DCE,
where a ballot() could be eliminated, and it would potentially have to
remove a number of intrinsics to enable later optimizations (unless it
would affect performance?), so the intrinsics will require some
non-local updates.

Removing merge intrinsics if it’s profitable and allowed is a separate optimization that can be done relatively quickly in a single pass without impacting the performance of other optimizations. The problem is requiring expensive non-local checks in optimizations which modify control flow.

Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?

They’d only have to be present when ballot or other convergent
functions are called, since otherwise it doesn’t matter when control
flow re-converges. However, we may want to keep them around for
performance reasons (usually destroying convergence points is bad for
performance).

So we might have them without a ballot(), which would seem to make it
more difficult for structurizers or other transforms to maintain the
semantics and insert intrinsics.

It’s not any more difficult code-wise, since the case where there is a ballot needs to be handled anyway. And while it might take longer to process stuff when this hypothetical pass encounters a merge intrinsic (I can’t think of a real-world case where it would matter), it should result in better generated code.

How are uniform branches handled? Do they affect the convergence model?

We may be able to remove convergence points if branches are uniform.
In Nicolai’s proposal, I believe we’d want to remove a merge intrinsic
when all the branches it post-dominates, except those already
post-dominated by some other merge intrinsic, are uniform.

I couldn’t quite understand the last sentence, but I assume the
conditions would prevent removing convergence points that help
performance. Post-domination might not be adequate if there are loops
btw.

It should be – Nicolai’s proposal is that a merge intrinsic merges all the divergences caused by branches post-dominated by the intrinsic, so if all the divergent branches are merged by some other intrinsic earlier, then there’s no divergence and the merge intrinsic is a no-op.

This is a very important point, and I believe this is why we went with the convergent attribute a few years ago.
I passively followed this thread from the beginning, but something that I think would really be key to start considering this “RFC” is an actual LangRef proposal.
It seems hard to me to get an idea about the proposed trade-off of adding the burden of this new “thread-aware” semantic on the IR semantic itself (and on every existing and future transformation) without seeing the concrete proposal.

Thanks,

I think you misunderstand me if you think I was listing disallowed
transforms. My question was if it is possible to have multi-threaded
semantics in the source language, but in the compiler have a
single-threaded view, where some properties of the CFG would determine
what is legal and not for some functions with a special flag.

But that’s practically the same as listing allowed and disallowed
transforms – you’re defining what the final IR is allowed to look
like, not how it’s allowed to execute. If at all possible, the former
should be derived from the latter.

It does at least say something about transforms that haven’t been written
yet, but from that viewpoint, sure you can say that.

The only way to
conclusively prove that transforms will never break code that’s
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don’t follow that rule, your
patch will probably be rejected.

I’m trying to figure out if we are in the “almost never” territory
here or not.

I don’t think so at all. In addition to being much, much harder to
reason about and prove that the approach is sound, I think it’ll be
more intrusive.

It is much harder to come up with rules that are correct and not
too restrictive, for sure. I am not convinced that it is more
intrusive, since it will only affect some transforms.

Any transform that re-arranges control flow would potentially have to
know about the properties of ballot(), and the rules with respect to
the CFG (and maybe consider the target) to know where to insert the
intrinsics.

But the same is true for basically any approach to handling this. In
fact, adding the merge intrinsics makes this much easier. Beyond the
usual problems with hoisting ballots, which passes currently avoid via
the current convergent + speculatable attributes, we’d only have to
add awareness to passes that they can’t duplicate/combine merge
intrinsics or move them past convergent intrinsics, which is a local
property and hence easily checkable. In the example I explained, without
some kind of merge intrinsic, tail duplication has to look at the
entire loop to know whether the transform is legal. Of course, you
could have some kind of “no convergent calls” flag on a function, but
that doesn’t eliminate the nastiness when there are convergent calls.

We will have to determine if the intrinsics are worth more compared to
scanning the code.

I had the impression that the control flow convergence was
in part specified by what the target architecture can handle.

This is true, although hopefully we can agree on something that
everyone can implement.

Yes, hopefully there is an execution model that works for everyone.

One of
the more complicated cases would be linearization where the control
flow is completely rewritten, and is encoded in a variable that says
which basic block is the next one to execute.

It’s certainly interesting to think about how to maintain correctness
in the face of ballots() with such a pass, but a) it’s certainly no
harder with merge intrinsics than merges being implicit and b) I doubt
that’s useful for anything you’d want to do with a GPU.

Irreducible control flow has to be handled somehow, and linearization
is the only transform I know of, that will handle everything. I’m not
sure what the execution model says about irreducible control flow.

Another case is DCE,
where a ballot() could be eliminated, and it would potentially have to
remove a number of intrinsics to enable later optimizations (unless it
would affect performance?), so the intrinsics will require some
non-local updates.

Removing merge intrinsics if it’s profitable and allowed is a
separate optimization, that can be done relatively quickly in a single
pass without impacting the performance of other optimizations. It’s
requiring expensive non-local checks in optimizations which modify
control flow that’s a problem.

There are certainly a lot of tradeoffs. My point was simply that the
intrinsics are not strictly local.

So we might have them without a ballot(), which would seem to make it
more difficult for structurizers or other transforms to maintain the
semantics and insert intrinsics.

It’s not any more difficult code-wise, since the case where there is
a ballot needs to be handled anyways. And while it might take longer
to process stuff when this hypothetical pass encounters a merge
intrinsic (I can’t think of a real-world case where it would matter),
it should result in better code generated.

I was thinking of linearization or something similar, where an
intrinsic by itself might be hard to preserve, compared to looking at
a ballot() and deriving where intrinsics should be inserted.

How are uniform branches handled? Do they affect the convergence model?

We may be able to remove convergence points if branches are uniform.
In Nicolai’s proposal, I believe we’d want to remove a merge intrinsic
when all the branches that it post-dominates that aren’t also
post-dominated by some other merge intrinsic are uniform.

I couldn’t quite understand the last sentence, but I assume the
conditions would prevent removing convergence points that help
performance. Post-domination might not be adequate if there are loops
btw.

It should be – Nicolai’s proposal is that a merge intrinsic merges
all the divergences caused by branches post-dominated by the
intrinsic, so if all the divergent branches are merged by some other
intrinsic earlier, then there’s no divergence and the merge intrinsic
is a no-op.

Makes sense.

  • Jan

for (int i = 0; i < 2; i++) {
    foo = ballot(true); // ballot 1

    if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;

    bar = ballot(true); // ballot 2
}

versus:

int i = 0;
while (true) {
    do {
        if (i == 2) break;
        foo = ballot(true); // ballot 1
        i++;
    } while (threadID % 2 == 0);

    if (i == 2) break;
    bar = ballot(true); // ballot 2
    i++;
}

I think you can remove the second “i++”, otherwise we can increment “i” twice
if threadID % 2 != 0, but I see the issue. Using the equivalence classes would
prevent this transform, since we have re-arranged the control flow in that way.

What do you mean? Note that the transforms that lead from the first
example to the second are actually desirable if there aren’t any
ballots or other convergent operations, so banning them entirely is a
no-go. Otherwise, note that the ballot here can be nested arbitrarily
deep, which means that jump forwarding/tail duplication has to be
aware of the entire loop unless we have some kind of merge intrinsic.

Yes, if there weren’t any calls to a ballot, then the transform would
be legal. What I was saying in the beginning was that ballot() would
have some special rules attached to it. It is of course undesirable to
have flags to enforce correctness, but that is the point we are at
right now.

Also, I should say that while interesting, control equivalence classes
can be part of the implementation, but not the specification. We
need
to define the semantics of the IR – that is, what a program
compiled from any given IR is allowed to do when executed (and not
what it looks like when compiled) – first, and then which
transforms are allowed/not allowed will fall out of that. We can’t
start by listing disallowed transforms, because then when someone
comes along and writes a new optimization, it might be technically
“allowed” even though it breaks something in practice.

I think you misunderstand me if you think I was listing disallowed
transforms. My question was if it is possible to have multi-threaded
semantics in the source language, but in the compiler have a
single-threaded view, where some properties of the CFG would determine
what is legal and not for some functions with a special flag.

But that’s practically the same as listing allowed and disallowed transforms – you’re defining what the final IR is allowed to look like, not how it’s allowed to execute. If at all possible, the former should be derived from the latter.

I agree
it is more desirable to have the semantics specified in the
IR. However, I am exploring this from a practical point of view, since
ballot() is very rare compared to all code that is being compiled for
all targets. These intrinsics would always have to be considered when
writing a pass. They seem to me harder to think about, and
test, for someone who is working on a single-threaded target, compared
to looking at a flag. If we allow ourselves to look at which
transforms might violate these properties, we would free up
the rest of the compiler (and developers) to not have to worry about
these things. Intrinsics would have to be maintained throughout the
entire compilation process in every pass.

The only way to
conclusively prove that transforms will never break code that’s
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don’t follow that rule, your
patch will probably be rejected.

I’m trying to figure out if we are in the “almost never” territory
here or not.

I don’t think so at all. In addition to being much, much harder to reason about and prove that the approach is sound, I think it’ll be more intrusive.

Now, to define a semantics for ballot(), we need to define what it’s
allowed to return for any given execution of any given program, and to
do that, we need to define which threads must be active together when
it’s reached, which in turns means we need to define when control flow
can re-converge somehow. Any proposal for how to handle ballot() must
start here.

I’m not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.

What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
where this could happen.

Hopefully, it’s clear from the above how this all falls out. The
frontend for e.g. GLSL or SPIR-V would have to insert the merge
intrinsics to preserve the semantics of the source language. Any
transforms must refine the semantics of the IR, although I can’t think
of a scenario where that would involve emitting any new merge
intrinsics. Usually, structurized IR’s have their own semantics about
when control flow converges, so a code structurizer should respect the
original semantics. AMDGPU has its own code structurizer that runs
very late in the pipeline (although there are plans to remove it), so
we might have to change that to make it respect the merge
intrinsics, and then we’ll correctly implement them “for free”.
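As a pseudocode illustration of where a frontend might place such a marker in the earlier loop example, consider the sketch below. The name merge() and its exact placement are assumptions for illustration only, not proposal text; the idea is that the loop latch post-dominates the `continue` branch, so a merge there re-joins both paths before the next iteration’s ballot:

```
for (int i = 0; i < 2; i++) {
    foo = ballot(true);                  // ballot 1
    if (threadID % 2 == 0) continue;
    bar = ballot(true);                  // ballot 2
    // merge();  <- hypothetical intrinsic at the latch: threads taking
    //              the 'continue' path and threads falling through
    //              re-join here, so ballot 1 sees the full group again
}
```

A pass like tail duplication would then only need the local rule “don’t duplicate a merge()” to stay correct, instead of analyzing the whole loop.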

Any transform that re-arranges control flow would potentially have to
know about the properties of ballot(), and the rules with respect to
the CFG (and maybe consider the target) to know where to insert the
intrinsics.

But the same is true for basically any approach to handling this. In fact, adding the merge intrinsics makes this much easier. Beyond the usual problems with hoisting ballots, which passes currently avoid via the current convergent + speculatable attributes, we’d only have to add awareness to passes that they can’t duplicate/combine merge intrinsics or move them past convergent intrinsics, which is a local property and hence easily checkable. In the example I explained, without some kind of merge intrinsic, tail duplication has to look at the entire loop to know whether the transform is legal. Of course, you could have some kind of “no convergent calls” flag on a function, but that doesn’t eliminate the nastiness when there are convergent calls.

I had the impression that the control flow convergence was
in part specified by what the target architecture can handle.

This is true, although hopefully we can agree on something that everyone can implement.

One of
the more complicated cases would be linearization where the control
flow is completely rewritten, and is encoded in a variable that says
which basic block is the next one to execute.

It’s certainly interesting to think about how to maintain correctness in the face of ballots() with such a pass, but a) it’s certainly no harder with merge intrinsics than merges being implicit and b) I doubt that’s useful for anything you’d want to do with a GPU.

Another case is DCE,
where a ballot() could be eliminated, and it would potentially have to
remove a number of intrinsics to enable later optimizations (unless it
would affect performance?), so the intrinsics will require some
non-local updates.

Removing merge intrinsics if it’s profitable and allowed is a separate optimization, that can be done relatively quickly in a single pass without impacting the performance of other optimizations. It’s requiring expensive non-local checks in optimizations which modify control flow that’s a problem.

Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?

They’d only have to be present when ballot or other convergent
functions are called, since otherwise it doesn’t matter when control
flow re-converges. However, we may want to keep them around for
performance reasons (usually destroying convergence points is bad for
performance).

So we might have them without a ballot(), which would seem to make it
more difficult for structurizers or other transforms to maintain the
semantics and insert intrinsics.

It’s not any more difficult code-wise, since the case where there is a ballot needs to be handled anyways. And while it might take longer to process stuff when this hypothetical pass encounters a merge intrinsic (I can’t think of a real-world case where it would matter), it should result in better code generated.

How are uniform branches handled? Do they affect the convergence model?

We may be able to remove convergence points if branches are uniform.
In Nicolai’s proposal, I believe we’d want to remove a merge intrinsic
when all the branches that it post-dominates that aren’t also
post-dominated by some other merge intrinsic are uniform.

I couldn’t quite understand the last sentence, but I assume the
conditions would prevent removing convergence points that help
performance. Post-domination might not be adequate if there are loops
btw.

It should be – Nicolai’s proposal is that a merge intrinsic merges all the divergences caused by branches post-dominated by the intrinsic, so if all the divergent branches are merged by some other intrinsic earlier, then there’s no divergence and the merge intrinsic is a no-op.

This is a very important point, and I believe this is why we went with
the convergent attribute a few years ago. I passively followed this
thread from the beginning, but something that I think would really be
key to start considering this “RFC” is an actual LangRef proposal. It
seems hard to me to get an idea about the proposed trade-off of adding
the burden of this new “thread-aware” semantic on the IR semantic
itself (and on every existing and future transformation) without
seeing the concrete proposal.

Yes, it is difficult to know what the consequences are for the
intrinsics. There are a number of questions that I have. Do we need
better machine descriptions so that various resources can be
considered? Do we need the capability to reason about the machine
state for the cross-lane operations to enable more optimizations? Are
intrinsics the right representation, or should there be a set of new
instructions, since we are changing the execution model in a
fundamental way? I’m personally leaning towards new instructions, and
a more comprehensive machine model that can capture more, and
hopefully allow more optimizations in the future.

  • Jan