[RFC] Introducing convergence control bundles and intrinsics

Hi all,

Please see https://reviews.llvm.org/D85603 and its related changes for our most recent and hopefully final attempt at putting the `convergent` attribute on a solid theoretical foundation in a way that is useful for modern GPU compiler use cases. We have a clear line of sight to enabling a new control-flow implementation in the AMDGPU backend that is built on this foundation, and I have recently started upstreaming some of the groundwork. We expect this foundation to also be usable by non-GPU whole-program vectorization environments, if they choose to do so.

What follows is a (not quite) brief overview of what this is all about, but do see the linked review and its "stack" for a fuller story.

The Problem

Hi, Nicolai,

Thanks for sending this. What do you think that we should do with the existing convergent attribute?

-Hal

Hi Hal,

> Thanks for sending this. What do you think that we should do with the
> existing convergent attribute?

My preference, which is implicitly expressed in the review, is to use
`convergent` both for the new and the old thing. They are implicitly
distinguished via the "convergencectrl" operand bundle.
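
To illustrate the distinction, here is a rough sketch (@convergent_op
stands in for any callee declared `convergent`; the precise rules are
spelled out in D85603):

   ; "old-style" call: only the `convergent` attribute on the callee
   ; constrains transforms
   call void @convergent_op()

   ; "new-style" call: the "convergencectrl" bundle carries a token that
   ; makes the intended convergence relation explicit
   %tok = call token @llvm.experimental.convergence.anchor()
   call void @convergent_op() [ "convergencectrl"(token %tok) ]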

The main reason for going down that route is that `convergent` is
really an odd duck in that it is the only attribute in LLVM that
_adds_ new constraints on program transforms. All other attributes
_remove_ constraints on program transforms by expressing constraints
on what the function does (e.g. readnone, nosync, willreturn, ...).
Matt tried to push for changing that, making the "convergent"
restrictions the default and adding `noconvergent` on most things, but
that effort got stuck. In the meantime, let's avoid adding yet another
odd duck like this :)

Cheers,
Nicolai

Understood. Does the existing attribute have a sound definition, just not one that's useful for (all) GPU use cases? Or is the existing attribute not really sound, so that all use cases eventually need to be replaced?

-Hal

> Hi Hal,
>
>> Thanks for sending this. What do you think that we should do with the
>> existing convergent attribute?
> My preference, which is implicitly expressed in the review, is to use
> `convergent` both for the new and the old thing. They are implicitly
> distinguished via the "convergencectrl" operand bundle.
>
> The main reason for going down that route is that `convergent` is
> really an odd duck in that it is the only attribute in LLVM that
> _adds_ new constraints on program transforms. All other attributes
> _remove_ constraints on program transforms by expressing constraints
> on what the function does (e.g. readnone, nosync, willreturn, ...).
> Matt tried to push for changing that, making the "convergent"
> restrictions the default and adding `noconvergent` on most things, but
> that effort got stuck. In the meantime, let's avoid adding yet another
> odd duck like this :)
>
> Cheers,
> Nicolai

> Understood. Does the existing attribute have a sound definition, just
> not one that's useful for (all) GPU use cases? Or is the existing
> attribute not really sound, so that all use cases eventually need to be replaced?

The existing definition is insufficient for some GPU use cases, but it
also has a problem that I like to call spooky action at a distance:
some program transforms may have to inspect code which they would
normally not look at in order to determine whether they are correct
under the current definition of `convergent`. The really problematic
one that I'm aware of in practice is jump threading -- its
implementation is incorrect with respect to `convergent` -- but
perhaps others are affected as well.

Cheers,
Nicolai

I agree that the new scheme seems better for several reasons. So it seems like your recommendation is that we consider the existing attribute usage deprecated? Or should we auto-upgrade it to something?

-Hal

Hi Hal,

> I agree that the new scheme seems better for several reasons. So it
> seems like your recommendation is that we consider the existing
> attribute usage deprecated?

Yes.

Some form of auto-upgrade could be attempted via the
ConvergenceControlHeuristic of https://reviews.llvm.org/D85609, though
that's slightly big for on-the-fly upgrading in the bitcode reader. So
while the groundwork for potentially doing it is there, I don't see a
strong need or desire for it at the moment.

Cheers,
Nicolai

Hi all,

As things have been quiet on this topic for quite some time now, I'm
pinging this thread to see whether we can just go ahead with the
current proposal, or if/how more discussion is required.

Quick pointer to the proposed LangRef changes for the `convergent`
attribute is here: https://reviews.llvm.org/D85603

The presentation from the dev meeting three weeks ago is here:
https://www.youtube.com/watch?v=_Z5DuiVCFAw

Thanks,
Nicolai

Hi Nicolai,

I watched the talk on the design, and it all made sense to me. I can’t claim to have a deep knowledge of the requirements of GPU architectures, but I can say that this is basically the kind of stuff we had in mind when the token type was designed. What you are saying about modelling these special GPU operations as accessing inaccessible memory makes sense to me, but again, I am not an expert.

One of the challenges we’ve faced when trying to create regions for unpacked call sequences[1] is that unreachable code elimination can often “break” a region by deleting the token consumer. It’s not clear if your proposal suffers from this problem, but it’s something to keep in mind, since optimizers can often discover that a plain store is storing to a null pointer, that can become unreachable, and there goes the entire rest of the function, leaving you with a half-open region. I can’t imagine a plausible series of transforms on a reasonable GPU program would end up in this situation, though.
[1] http://lists.llvm.org/pipermail/llvm-dev/2020-January/138627.html

Hi Owen,

>> The presentation from the dev meeting three weeks ago is here:
>> https://www.youtube.com/watch?v=_Z5DuiVCFAw
>
> I have not studied the new proposal yet, but I wanted to ask a question about the JumpThreading example in your dev meeting talk: does allowing this transformation break programs?
>
> I’m asking because my reading of the semantic correctness of the jump-threading transformation is that it *is* allowed by the current convergent annotation: the execution of the barrier() intrinsic depends on the value of `condition` in both the before and after cases. In the former, it is a transitive dependency via `flag`, but still a dependency. This seems to differ from your interpretation of the example, so I’m wondering to what degree clarifying the definition to include transitive dependencies would eliminate some or all spooky action, and whether doing so would result in broken programs.

Perhaps part of the problem is the notion of "control dependence" that the current definition of `convergent` relies on. I don't actually know if there's a universally accepted definition in the literature. The one I kept finding, and that made sense to me as usable in practice in a compiler, is defined only in terms of divergent terminators and post-dominance. The data dependencies of `flag` in the original program aren't relevant to it.

With that definition of control dependence, the barrier() is initially not dependent on the branch on `condition`, because that branch is post-dominated by the intermediate basic block and therefore hidden.

In more practical terms, the transform really could result in a broken program, depending on how the underlying implementation does divergence and reconvergence. Threading this particular example shouldn't change anything for the current AMDGPU implementation, but it would break an implementation that reconverges at post-dominators (an approach that is also documented in the literature).
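
For reference, the example has roughly this shape (a simplified
sketch; block names are mine):

   ; before: %merge post-dominates the branch on %condition, so under the
   ; post-dominance notion of control dependence the barrier is *not*
   ; control-dependent on it -- all threads reach it together
   entry:
     br i1 %condition, label %then, label %merge
   then:
     br label %merge
   merge:
     %flag = phi i1 [ true, %then ], [ false, %entry ]
     call void @barrier()              ; convergent
     br i1 %flag, label %a, label %b

   ; after jump threading on %flag: the barrier is duplicated, and which
   ; copy a thread executes is now decided directly by %condition, so an
   ; implementation that reconverges at post-dominators may execute each
   ; copy with only a subset of the original threads
   entry:
     br i1 %condition, label %merge.a, label %merge.b
   merge.a:
     call void @barrier()
     br label %a
   merge.b:
     call void @barrier()
     br label %b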

Cheers,
Nicolai

> I watched the talk on the design, and it all made sense to me. I can't claim to have a deep knowledge of the requirements of GPU architectures, but I can say that this is basically the kind of stuff we had in mind when the token type was designed. What you are saying about modelling these special GPU operations as accessing inaccessible memory makes sense to me, but again, I am not an expert.
>
> One of the challenges we've faced when trying to create regions for unpacked call sequences[1] is that unreachable code elimination can often "break" a region by deleting the token consumer. It's not clear if your proposal suffers from this problem, but it's something to keep in mind, since optimizers can often discover that a plain store is storing to a null pointer, that can become unreachable, and there goes the entire rest of the function, leaving you with a half-open region. I can't imagine a plausible series of transforms on a reasonable GPU program would end up in this situation, though.
> [1] http://lists.llvm.org/pipermail/llvm-dev/2020-January/138627.html

Thanks for the feedback, Reid. I try not to make assumptions about what is a "reasonable" GPU program ;)

One advantage that we have here is that there is no need to explicitly "end" a token region. Rather, we have a single token producer and zero or more consumers. Deleting a consumer doesn't change the semantic guarantees we make for any of the non-deleted code. At least, I've thought about this a fair bit, including some very incomplete proof sketches, and I'm pretty confident about it at this point.

For what it's worth, `unreachable` and multiple return statements are things we see in GPU code in practice, and simple support of dead-code elimination was an explicit goal.
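
To make the scenario you describe concrete in terms of this proposal,
a rough sketch (the null store stands in for whatever the optimizer
proves to be undefined behavior):

   %tok = call token @llvm.experimental.convergence.anchor()

   ...

   store i32 0, i32* null   ; folded to unreachable by the optimizer

   ...

   call void @convergent_op() [ "convergencectrl"(token %tok) ]

Once the store is turned into `unreachable`, the consumer below it is
deleted along with the rest of the code, and %tok simply ends up with
fewer (possibly zero) uses; the guarantees for the surviving code are
unchanged.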

One interesting effect can occur, and needs to be kept in mind, when the `anchor` intrinsic is used, because its behavior is largely implementation-defined. If you have:

   %tok = call token @llvm.experimental.convergence.anchor()

   ... lots of code with control flow ...

   call void @convergent_op() [ "convergencectrl"(token %tok) ]

   ...

   %tok2 = call token @llvm.experimental.convergence.anchor()
   call void @second_op() [ "convergencectrl"(token %tok2) ]

Deleting the call to @convergent_op could cause a different set of threads to arrive together at the second anchor, and hence at @second_op, during execution in practice. Whether the deletion makes a difference depends on the details of the underlying divergence/reconvergence mechanisms.

That's okay though, because it just ends up being a different refinement of the original IR semantics. Allowing this kind of freedom (even including non-determinism) is exactly the point of the `anchor` intrinsic, and if the programmer/frontend didn't want that, some alternative structure (typically based on `entry`) would have to be used instead.
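
For completeness, a minimal sketch of such an entry-based structure
(intrinsic names as in the proposal):

   define void @kernel() convergent {
     ; must be in the entry block; the token refers to the set of
     ; threads that entered @kernel together
     %tok = call token @llvm.experimental.convergence.entry()

     ... lots of code with control flow ...

     call void @convergent_op() [ "convergencectrl"(token %tok) ]
     ret void
   }

With this structure, transforms are not free to change which threads
execute @convergent_op together relative to how they entered the
function -- which is exactly the guarantee that `anchor` deliberately
does not give.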

Cheers,
Nicolai