[RFC] Refinement of convergent semantics

Hi all,

In light of recent discussions regarding updating passes to respect convergent semantics, and whether or not it is sufficient for barriers, I would like to propose a change in convergent semantics that should resolve a lot of the identified problems regarding loop unrolling, loop unswitching, etc. Credit to John McCall for talking this over with me and seeding the core ideas.

Today, convergent operations may only be moved into control-equivalent locations, or, in layman’s terms, a convergent operation may neither be sunk into nor hoisted out of a condition. This causes problems for full loop unrolling, as the control dependence on the loop counter is eliminated, but our intuition indicates that this dependence was somehow trivial. More concretely, all known uses of convergent are OK with full unrolling, making this semantic undesirable. Related problems arise in loop unswitching as well.

The proposed change is to split the semantics of convergent into two annotations:
  convergent - this operation may not be made control dependent on any additional values (aka may not be sunk into a condition)
  nospeculate - this operation may not be added to any program trace on which it was not previously executed (same as notrap?)

Most of today’s convergent operations (barriers, arithmetic gradients) would continue to be marked only as convergent. The new semantics would allow full loop unrolling, and provide clarity on which loop unswitching operations are allowed, examples below.

The one case where nospeculate would also be needed is in the case of texture fetches that compute implicit gradients. Because the computed gradient forms part of the addressing mode, gibberish gradients here can cause invalid memory dereferences.

—Owen

+1

The proposed change is to split the semantics of convergent into two
annotations:

convergent - this operation may not be made control dependent on any
additional values (aka may not be sunk into a condition)

Does every unknown function need to be conservatively considered as
containing a convergent operation? IOW, I'd want LLVM to unswitch
this:

for (…) {
   if (c) { … }
   call (*func_ptr)();
}

but (*func_ptr)() may contain a convergent operation (and it may
later get devirtualized and inlined).

Also, how about transforms like these:

  if (a & b)
    convergent();

==>

  if (a)
    if (b)
      convergent();

?

nospeculate - this operation may not be added to any program trace
on which it was not previously executed (same as notrap?)

How is this different from, say, a store to the heap? Or a volatile
load? If not, maybe nospeculate operations can just be modeled as
writing to the heap?

-- Sanjoy

The proposed change is to split the semantics of convergent into two
annotations:

convergent - this operation may not be made control dependent on any
additional values (aka may not be sunk into a condition)

Does every unknown function need to be conservatively considered as
containing a convergent operation? IOW, I'd want LLVM to unswitch
this:

for (…) {
  if (c) { … }
  call (*func_ptr)();
}

but (*func_ptr)() may contain a convergent operation (and it may
later get devirtualized and inlined).

I expect SPMD implementations built on top of LLVM will need to conservatively mark external or indirect calls as convergent.

Also, how about transforms like these:

if (a & b)
   convergent();

==>

if (a)
   if (b)
     convergent();

?

This should be allowed, as the call was already control dependent on both a and b.

nospeculate - this operation may not be added to any program trace
on which it was not previously executed (same as notrap?)

How is this different from, say, a store to the heap? Or a volatile
load? If not, maybe nospeculate operations can just be modeled as
writing to the heap?

They’re more akin to integer divisions. They may in fact be purely arithmetic operations (say, computing a cross-thread gradient), and treating them like heap stores will significantly over-pessimize the program.

—Owen

+llvm-commits as well.

—Owen

Ping?

—Owen

Hi all,

In light of recent discussions regarding updating passes to respect convergent semantics, and whether or not it is sufficient for barriers, I would like to propose a change in convergent semantics that should resolve a lot of the identified problems regarding loop unrolling, loop unswitching, etc. Credit to John McCall for talking this over with me and seeding the core ideas.

Today, convergent operations may only be moved into control-equivalent locations, or, in layman’s terms, a convergent operation may neither be sunk into nor hoisted out of, a condition. This causes problems for full loop unrolling, as the control dependence on the loop counter is eliminated, but our intuition indicates that this dependence was somehow trivial. More concretely, all know uses of convergent are OK with full unrolling, making this semantic undesirable. Related problems arise in loop unswitching as well.

I don't understand this point. Loop unrolling specifically won't change which indices actually run. It might result in code duplication with a subset of indices taking one of two paths. Does today's convergent also imply no-duplicate? Is that what you're trying to relax?

The proposed change is to split the semantics of convergent into two annotations:
  convergent - this operation may not be made control dependent on any additional values (aka may not be sunk into a condition)

To be clear, this is a restriction of current semantics only, right?

  nospeculate - this operation may not be added to any program trace on which it was not previously executed (same as notrap?)

Isn't this already true of all instructions? Unless we can *prove* that speculating an instruction can't introduce faults, we can't ever speculate it. An unknown call or intrinsic should already have this property, right?

This part of the proposal doesn't feel mature to me. In particular, I think we need to talk about what other cases we want to handle w.r.t. speculation attributes and what our model is.

One case I want to support is for small functions which only read argument memory (i.e. argmemonly readonly nounwind) which are guaranteed (by the frontend) to fault only if a) the pointer passed in is null, or b) the memory state on entry is different from the one the context should have ensured. (The second part is standard. The first allows speculation in more cases.)

I'd suggest promoting this to its own thread. Once we settle on a workable model for safe speculation attributes, we can revisit how we want to change the convergent attribute.

Hi all,

In light of recent discussions regarding updating passes to respect convergent semantics, and whether or not it is sufficient for barriers, I would like to propose a change in convergent semantics that should resolve a lot of the identified problems regarding loop unrolling, loop unswitching, etc. Credit to John McCall for talking this over with me and seeding the core ideas.

Today, convergent operations may only be moved into control-equivalent locations, or, in layman’s terms, a convergent operation may neither be sunk into nor hoisted out of a condition. This causes problems for full loop unrolling, as the control dependence on the loop counter is eliminated, but our intuition indicates that this dependence was somehow trivial. More concretely, all known uses of convergent are OK with full unrolling, making this semantic undesirable. Related problems arise in loop unswitching as well.

I don't understand this point. Loop unrolling specifically won't change which indices actually run. It might result in code duplication with a subset of indices taking one of two paths. Does today's convergent also imply no-duplicate? Is that what you're trying to relax?

The definition today says that we cannot remove a control dependence. Since the loop counter is eliminated entirely, one can argue that we have eliminated the control dependence on it. I agree that there’s an intuitive argument that the dependence on the loop counter was trivial, but I have no idea how to formalize that.

While resolving the question re: loop unrolling is nice, I actually think providing answers on which loop unswitching transforms are legal is the more novel part of this change.

The proposed change is to split the semantics of convergent into two annotations:
  convergent - this operation may not be made control dependent on any additional values (aka may not be sunk into a condition)

To be clear, this is a restriction of current semantics only, right?

That depends on what you mean by restriction. It allows strictly more code motion than the current semantics do.

  nospeculate - this operation may not be added to any program trace on which it was not previously executed (same as notrap?)

Isn't this already true of all instructions? Unless we can *prove* that speculating an instruction can't introduce faults, we can't ever speculate it. An unknown call or intrinsic should already have this property, right?

Possibly? We probably need a safetospeculate attribute, then.

This part of the proposal doesn't feel mature to me. In particular, I think we need to talk about what other cases we want to handle w.r.t. speculation attributes and what our model is.

One case I want to support is for small functions which only read argument memory (i.e. argmemonly readonly nounwind) which are guaranteed (by the frontend) to fault only if a) the pointer passed in is null, or b) the memory state on entry is different from the one the context should have ensured. (The second part is standard. The first allows speculation in more cases.)

I'd suggest promoting this to its own thread. Once we settle on a workable model for safe speculation attributes, we can revisit how we want to change the convergent attribute.

We can certainly start a separate conversation about safe speculation attributes. If, as you suggest, we already treat unknown calls and intrinsics conservatively w.r.t. speculation, then I don’t think we need to block progress on convergent on that.

—Owen

Hi all,

In light of recent discussions regarding updating passes to respect convergent semantics, and whether or not it is sufficient for barriers, I would like to propose a change in convergent semantics that should resolve a lot of the identified problems regarding loop unrolling, loop unswitching, etc. Credit to John McCall for talking this over with me and seeding the core ideas.

Today, convergent operations may only be moved into control-equivalent locations, or, in layman’s terms, a convergent operation may neither be sunk into nor hoisted out of a condition. This causes problems for full loop unrolling, as the control dependence on the loop counter is eliminated, but our intuition indicates that this dependence was somehow trivial. More concretely, all known uses of convergent are OK with full unrolling, making this semantic undesirable. Related problems arise in loop unswitching as well.

I don't understand this point. Loop unrolling specifically won't change which indices actually run. It might result in code duplication with a subset of indices taking one of two paths. Does today's convergent also imply no-duplicate? Is that what you're trying to relax?

The definition today says that we cannot remove a control dependence. Since the loop counter is eliminated entirely, one can argue that we have eliminated the control dependence on it. I agree that there’s an intuitive argument that the dependence on the loop counter was trivial, but I have no idea how to formalize that.

I'm still not getting the challenge. I can write a loop unroll as a two-step transform: 1) Introduce another predicated copy of the loop, halving the iteration count. 2) Try to prove the predicated version always executes. (1) involves code duplication, but doesn't change the dynamic branches executed along any path or the state of any index when hitting the convergent operation. (2) is merely constant folding and/or CSE. While it reduces the dynamic branches, it does so in a very structured way.

Is the problem purely that our current definition is a purely static one? Does our current definition allow (1) above?

While resolving the question re: loop unrolling is nice, I actually think providing answers on which loop unswitching transforms are legal is the more novel part of this change.

I think you can probably break down the argument the same as I did for loop unswitch. Step 1 - Introduce a new control dependency which doesn't change the state of any index along any path. Step 2 - Prove away a trivial branch.

Having said that, I don't care enough about the "convergent" attribute to block you here. Do what you want. :)

The proposed change is to split the semantics of convergent into two annotations:
  convergent - this operation may not be made control dependent on any additional values (aka may not be sunk into a condition)

To be clear, this is a restriction of current semantics only right?

That depends on what you mean by restriction. It allows strictly more code motion than the current semantics do.

Ok. That's what I was trying to say, but said badly.

  nospeculate - this operation may not be added to any program trace on which it was not previously executed (same as notrap?)

Isn't this already true of all instructions? Unless we can *prove* that speculating an instruction can't introduce faults, we can't ever speculate it. An unknown call or intrinsic should already have this property, right?

Possibly? We probably need a safetospeculate attribute, then.

The tricky part comes in defining what set of preconditions we want to encode in "safe". An unconditionally safe-to-speculate function isn't very common. So far, most of the cases I've seen come down to "safe if X" where X is "single argument which is non-null". But that's pretty specific to my frontend. I suspect we want to gather other inputs here.

This part of the proposal doesn't feel mature to me. In particular, I think we need to talk about what other cases we want to handle w.r.t. speculation attributes and what our model is.

One case I want to support is for small functions which only read argument memory (i.e. argmemonly readonly nounwind) which are guaranteed (by the frontend) to fault only if a) the pointer passed in is null, or b) the memory state on entry is different from the one the context should have ensured. (The second part is standard. The first allows speculation in more cases.)

I'd suggest promoting this to its own thread. Once we settle on a workable model for safe speculation attributes, we can revisit how we want to change the convergent attribute.

We can certainly start a separate conversation about safe speculation attributes. If, as you suggest, we already treat unknown calls and intrinsics conservatively w.r.t. speculation, then I don’t think we need to block progress on convergent on that.

Great! Looking at isSafeToSpeculate, we definitely return false for unknown targets. If you find a place we're not conservative here, please let me know. It'd be a potentially serious bug.

Hi all,

In light of recent discussions regarding updating passes to respect convergent semantics, and whether or not it is sufficient for barriers, I would like to propose a change in convergent semantics that should resolve a lot of the identified problems regarding loop unrolling, loop unswitching, etc. Credit to John McCall for talking this over with me and seeding the core ideas.

Today, convergent operations may only be moved into control-equivalent locations, or, in layman’s terms, a convergent operation may neither be sunk into nor hoisted out of a condition. This causes problems for full loop unrolling, as the control dependence on the loop counter is eliminated, but our intuition indicates that this dependence was somehow trivial. More concretely, all known uses of convergent are OK with full unrolling, making this semantic undesirable. Related problems arise in loop unswitching as well.

I don't understand this point. Loop unrolling specifically won't change which indices actually run. It might result in code duplication with a subset of indices taking one of two paths. Does today's convergent also imply no-duplicate? Is that what you're trying to relax?

The definition today says that we cannot remove a control dependence. Since the loop counter is eliminated entirely, one can argue that we have eliminated the control dependence on it. I agree that there’s an intuitive argument that the dependence on the loop counter was trivial, but I have no idea how to formalize that.

I'm still not getting the challenge. I can write a loop unroll as a two-step transform: 1) Introduce another predicated copy of the loop, halving the iteration count. 2) Try to prove the predicated version always executes. (1) involves code duplication, but doesn't change the dynamic branches executed along any path or the state of any index when hitting the convergent operation. (2) is merely constant folding and/or CSE. While it reduces the dynamic branches, it does so in a very structured way.

Is the problem purely that our current definition is a purely static one? Does our current definition allow (1) above?

While resolving the question re: loop unrolling is nice, I actually think providing answers on which loop unswitching transforms are legal is the more novel part of this change.

I think you can probably break down the argument the same as I did for loop unswitch. Step 1 - Introduce a new control dependency which doesn't change the state of any index along any path. Step 2 - Prove away a trivial branch.

Having said that, I don't care enough about the "convergent" attribute to block you here. Do what you want. :)

The problem in both proposed transformation sequences is that you have to add a new control dependence, which is not allowed by either the old or new definitions of convergent. Coming up with a definition that allows it will be very difficult without introducing more details of SPMD programming models, since the “real” constraint is that only uniform control dependences can be added. That happens to be true in both cases outlined here (since the loop count is thread-invariant), but would not necessarily be true in other transformations.

However, we can achieve the same end result without needing to introduce more details of SPMD programming into LLVM IR by removing the rule preventing the removal of control dependences. That’s basically what I’m proposing here.

  nospeculate - this operation may not be added to any program trace on which it was not previously executed (same as notrap?)

Isn't this already true of all instructions? Unless we can *prove* that speculating an instruction can't introduce faults, we can't ever speculate it. An unknown call or intrinsic should already have this property, right?

Possibly? We probably need a safetospeculate attribute, then.

The tricky part comes in defining what set of preconditions we want to encode in "safe". An unconditionally safe-to-speculate function isn't very common. So far, most of the cases I've seen come down to "safe if X" where X is "single argument which is non-null". But that's pretty specific to my frontend. I suspect we want to gather other inputs here.

I actually have plenty of intrinsics that are universally safe to speculate, because they are essentially convergent math operations.

—Owen

Hi all,

In light of recent discussions regarding updating passes to respect convergent semantics, and whether or not it is sufficient for barriers, I would like to propose a change in convergent semantics that should resolve a lot of the identified problems regarding loop unrolling, loop unswitching, etc. Credit to John McCall for talking this over with me and seeding the core ideas.

Today, convergent operations may only be moved into control-equivalent locations, or, in layman’s terms, a convergent operation may neither be sunk into nor hoisted out of a condition. This causes problems for full loop unrolling, as the control dependence on the loop counter is eliminated, but our intuition indicates that this dependence was somehow trivial. More concretely, all known uses of convergent are OK with full unrolling, making this semantic undesirable. Related problems arise in loop unswitching as well.

I don't understand this point. Loop unrolling specifically won't change which indices actually run. It might result in code duplication with a subset of indices taking one of two paths. Does today's convergent also imply no-duplicate? Is that what you're trying to relax?

The definition today says that we cannot remove a control dependence. Since the loop counter is eliminated entirely, one can argue that we have eliminated the control dependence on it. I agree that there’s an intuitive argument that the dependence on the loop counter was trivial, but I have no idea how to formalize that.

I'm still not getting the challenge. I can write a loop unroll as a two-step transform: 1) Introduce another predicated copy of the loop, halving the iteration count. 2) Try to prove the predicated version always executes. (1) involves code duplication, but doesn't change the dynamic branches executed along any path or the state of any index when hitting the convergent operation. (2) is merely constant folding and/or CSE. While it reduces the dynamic branches, it does so in a very structured way.

Is the problem purely that our current definition is a purely static one? Does our current definition allow (1) above?

While resolving the question re: loop unrolling is nice, I actually think providing answers on which loop unswitching transforms are legal is the more novel part of this change.

I think you can probably break down the argument the same as I did for loop unswitch. Step 1 - Introduce a new control dependency which doesn't change the state of any index along any path. Step 2 - Prove away a trivial branch.

Having said that, I don't care enough about the "convergent" attribute to block you here. Do what you want. :)

The problem in both proposed transformation sequences is that you have to add a new control dependence, which is not allowed by either the old or new definitions of convergent. Coming up with a definition that allows it will be very difficult without introducing more details of SPMD programming models, since the “real” constraint is that only uniform control dependences can be added. That happens to be true in both cases outlined here (since the loop count is thread-invariant), but would not necessarily be true in other transformations.

However, we can achieve the same end result without needing to introduce more details of SPMD programming into LLVM IR by removing the rule preventing the removal of control dependences. That’s basically what I’m proposing here.

Ok. I have no objection to that part of the change.

  nospeculate - this operation may not be added to any program trace on which it was not previously executed (same as notrap?)

Isn't this already true of all instructions? Unless we can *prove* that speculating an instruction can't introduce faults, we can't ever speculate it. An unknown call or intrinsic should already have this property, right?

Possibly? We probably need a safetospeculate attribute, then.

The tricky part comes in defining what set of preconditions we want to encode in "safe". An unconditionally safe-to-speculate function isn't very common. So far, most of the cases I've seen come down to "safe if X" where X is "single argument which is non-null". But that's pretty specific to my frontend. I suspect we want to gather other inputs here.

I actually have plenty of intrinsics that are universally safe to speculate, because they are essentially convergent math operations.

Ah, the joys of different starting assumptions. :)

If these are only intrinsics, do we need an attribute? Or can we just bake this into an intrinsic property? I'm all for an attribute long term, but if you want a short term "fix" while we figure out the generic use case... In fact, we already hard code this list inside isSafeToSpeculativelyExecute for some intrinsics. Moving that into the TD file seems like a clear win.

Hi Owen,

This is very interesting.

How different is “convergent” from “uniform”? An instruction is uniform if threads in the same SIMT unit (e.g. warp) do not diverge when executing this instruction.

I ask this because Bjarke recently came up with a mathematical definition of uniformity. I wonder if that is a foundation “convergent” needs as well. AFAICT, Bjarke’s definition of “uniformity” is less restrictive than “convergent”. For example, it allows loop unswitching the following code if “c” is uniform, which seems a case you ideally want to allow.

DISALLOWED:
for (…) {
if (c) { … }
convergent();
}

Jingyue

Hi Jingyue,

I consider it a very important element of the design of convergent that it does not require baseline LLVM to contain a definition of uniformity, which would itself pull in a definition of SIMT/SPMD, warps, threads, etc. The intention is that it should be a conservative (but hopefully not too conservative) approximation, and that implementations of specific GPU programming models (CUDA, OpenCL, individual GPU vendors, etc) may layer more permissive semantics on top of it in code that is specific to that programming model.

—Owen

The intention sounds reasonable. Given they are common and motivate the convergent attribute, I don’t object to introducing some implementation-independent SIMT concepts. But I understand that could be more controversial.

My general concern is still that some terms in this definition are not well-defined for arbitrary transformations, especially duplication. For example, regarding

convergent - this operation may not be made control dependent on any additional values (aka may not be sunk into a condition)

Is LLVM allowed to unroll

for (int i = 0; i < 4; ++i) {
if (i < c) // c is loop invariant
convergent();
}

to

if (0 < c)
convergent();

if (1 < c)
convergent();

if (2 < c)
convergent();

if (3 < c)
convergent();

?

Is “0 < c” considered “an additional value”? I’d vote no, but one can argue the other way.

One approach (Bjarke’s idea) to work around such ambiguities is to define: a program is convergent-correct if everything marked convergent is indeed convergent. Then a transformation is convergent-correct unless it transforms a convergent-correct program into a convergent-incorrect one. However, defining “convergent-correct” involves SIMT concepts which you want to avoid here.

Is LLVM allowed to unroll

for (int i = 0; i < 4; ++i) {
if (i < c) // c is loop invariant
convergent();
}

to

if (0 < c)
convergent();

if (1 < c)
convergent();

if (2 < c)
convergent();

if (3 < c)
convergent();

?

Is “0 < c” considered “an additional value”? I’d vote no, but one can argue the other way.

My intuition agrees with you here, but I don’t know how to formalize it.

One approach (Bjarke’s idea) to work around such ambiguities is to define: a program is convergent-correct if everything marked convergent is indeed convergent. Then a transformation is convergent-correct unless it transforms a convergent-correct program into a convergent-incorrect one. However, defining “convergent-correct” involves SIMT concepts which you want to avoid here.

There are situations where this is over-conservative as well. Some convergent operations do not require uniformity per se. For example, a gradient operation in graphics programming models requires that all four threads in a given quad are either all executing or all not executing, though a warp is generally larger than that. Restricting convergent code motion to only be between uniform control flow points in the program would penalize that use case.

—Owen

Introducing intermediate steps in the transformation? If the first transformation is valid, it seems trivial to get the end one:

for (int i = 0; i < 1; ++i) {
if (i < c) // c is loop invariant
convergent();
}

for (int i = 1; i < 2; ++i) {
if (i < c) // c is loop invariant
convergent();
}

for (int i = 2; i < 3; ++i) {
if (i < c) // c is loop invariant
convergent();
}

for (int i = 3; i < 4; ++i) {
if (i < c) // c is loop invariant
convergent();
}

Hi Owen,

One approach (Bjarke's idea) to work around such ambiguities is to define: a program is convergent-correct if everything marked convergent is indeed convergent. Then a transformation is convergent-correct unless it transforms a convergent-correct program into a convergent-incorrect one. However, defining "convergent-correct" involves SIMT concepts which you want to avoid here.

There are situations where this is over-conservative as well. Some
convergent operations do not require uniformity per se. For example, a
gradient operation in graphics programming models requires that all four
threads in a given quad are either all executing or all not executing,
though a warp is generally larger than that. Restricting convergent code
motion to only be between uniform control flow points in the program would
penalize that use case.

Sorry to be unresponsive on this; I'm very busy with another matter at the moment. I should write up what I was thinking of with uniformity, though for now I'll say that the definition would be flexible in terms of what threads need to execute with each other. I could see e.g. a uniform-N property where I think N=4 would fit what you're describing.

I think that a property defined in terms of what transformations are allowed can be difficult to reason about in the face of previously unconsidered transformations. Could there be a property that would serve your purposes that concerns a program by itself, rather than concerning a transformation of one program into another?

Bjarke