RFC: make calls "convergent" by default

CC'ing some more people who got dropped when sending the previous mail.

Sameer.

Sameer Sahasrabuddhe via llvm-dev writes:

On 2 Jun 2021, at 2:02, Sameer Sahasrabuddhe wrote:

Sameer Sahasrabuddhe via llvm-dev writes:

TL;DR

We propose the following changes to LLVM IR in order to better support
operations that are sensitive to the set of threads that execute them
together:

  • Redefine “convergent” in terms of thread divergence in a
    multi-threaded execution.
  • Fix all optimizations that examine the “convergent” attribute to also
    depend on divergence analysis. This avoids any impact on CPU
    compilation since control flow is always uniform on CPUs.
  • Make all function calls “convergent” by default (D69498). Introduce a
    new “noconvergent” attribute, and make “convergent” a nop.
  • Update the “convergence tokens” proposal to take into account this new
    default property (D85603).

I would suggest a slightly different way of thinking of this.

It’s not really that functions are defaulting to convergence,
it’s that they’re defaulting to not participating in the convergence
analysis. A function that does participate in the analysis should have
a way to mark itself as being convergent. A function that participates
and isn’t marked convergent should probably default to being
non-convergent, because that’s the conservative assumption (I believe).
But if a function doesn’t participate in the analysis at all, well,
it just doesn’t apply.

At an IR level, there are a couple of different ways to model this.
One option is to have two different attributes, e.g.
hasconvergence convergent. But the second attribute would be
meaningless without the first, and clients would have to look up
both, which is needlessly inefficient. The other option is to
have one attribute with an argument, e.g. convergent(true).
Looking up the attribute would give you both pieces of information.
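
To make the single-attribute option concrete, here is a minimal Python sketch of the lookup. It is a toy model, not LLVM code: the `Convergence` enum, the `lookup` helper, and the use of a plain dictionary for a function's attributes are all hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Optional


class Convergence(Enum):
    NOT_PARTICIPATING = "attribute absent"
    CONVERGENT = "convergent(true)"
    NOT_CONVERGENT = "convergent(false)"


def lookup(attrs: dict) -> Convergence:
    """A single lookup answers both questions: does the function
    participate in the analysis, and if so, is it convergent?"""
    value: Optional[bool] = attrs.get("convergent")
    if value is None:
        # No attribute at all: the function does not participate.
        return Convergence.NOT_PARTICIPATING
    return Convergence.CONVERGENT if value else Convergence.NOT_CONVERGENT
```

With two separate attributes, a client would need two lookups and would have to decide what a `convergent` marker means when the participation marker is missing; the three-state result avoids that case.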

GPU targets would presumably require functions (maybe just
definitions?) to participate in the convergence analysis. Or maybe
they could have different default rules for functions that don’t
participate than CPU targets do. Either seems a reasonable choice
to me.

If the inliner wants to inline non-participating code into participating
code, or vice-versa, it either needs to refuse or to mark the resulting
function as non-participating.
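
A rough sketch of that inliner rule, again as a toy Python model rather than anything taken from LLVM. The function name `inline_result` and the `allow_demotion` switch are assumptions made for the example.

```python
from typing import Optional


def inline_result(caller_participates: bool,
                  callee_participates: bool,
                  allow_demotion: bool = True) -> Optional[bool]:
    """Return whether the caller participates after inlining,
    or None if the inliner should refuse.

    Mixing participating and non-participating code either demotes
    the result to non-participating or blocks the inlining.
    """
    if caller_participates == callee_participates:
        # Same status on both sides: nothing changes.
        return caller_participates
    return False if allow_demotion else None
```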

I know this is a little bit more complex than what you’re describing,
but I think it’s useful complexity, and I think it’s important to set
a good example for how to handle this kind of thing. Non-convergence
is a strange property in many ways because of its dependence on the
exact code structure rather than simply the code’s ordinary semantics.
But if you consider it more abstractly in terms of the shape of the
problem, it’s actually a very standard example of an “effect”, and
convergence analysis is just another example of an “effect analysis”,
which is a large class of analyses with the same basic structure:

  • There’s some sort of abstract effect.
  • There are some primitive operations that have the effect.
  • The effect normally propagates through abstractions: if code calls other code that has the effect, the calling code also has the effect.
  • The propagation is disjunctive: a code sequence has the effect if any part of the sequence has the effect.
  • Often it is rare to see the primitive operations explicitly in code, and the analysis is largely about propagation. Sometimes the primitive operations aren’t even modeled in IR at all, and the only source of the effect in the model is that calls to unknown functions have to be treated conservatively.
  • Sometimes there are ways of preventing propagation; this is usually called “handling” the effect. But a lot of effects don’t have this, and the analysis is purely about whether one of the primitive operations is ever performed (directly or indirectly).
  • Clients are usually trying to prove that code doesn’t have the effect, because that gives them more flexibility.
  • Code has to be assumed to have the effect by default, but if you can prove that a function doesn’t have the effect, you can often propagate that information.
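
The structure listed above can be sketched as a small fixpoint over a call graph. This is only an illustration under simplifying assumptions: there is no "handling" of the effect, the call graph is a plain dictionary, and the function and variable names (`infer_effect_free`, `calls`, `primitive`) are hypothetical.

```python
def infer_effect_free(calls, primitive):
    """Return the set of defined functions proven NOT to have the effect.

    calls:     dict mapping each defined function to a list of its callees.
    primitive: set of functions that perform a primitive effectful
               operation directly.
    Callees that are not keys of `calls` are unknown code and are
    conservatively assumed to have the effect.
    """
    has_effect = set(primitive)
    for callees in calls.values():
        for callee in callees:
            if callee not in calls:
                has_effect.add(callee)  # unknown callee: assume the worst

    # Disjunctive propagation: a caller has the effect if any callee does.
    changed = True
    while changed:
        changed = False
        for func, callees in calls.items():
            if func not in has_effect and any(c in has_effect for c in callees):
                has_effect.add(func)
                changed = True

    return {f for f in calls if f not in has_effect}


# Hypothetical call graph: "shuffle" is the primitive operation and
# "extern_printf" is a call into code the analysis cannot see.
calls = {
    "kernel": ["reduce", "helper"],
    "reduce": ["shuffle"],
    "helper": [],
    "shuffle": [],
    "logger": ["extern_printf"],
}
print(infer_effect_free(calls, {"shuffle"}))  # {'helper'}
```

Only `helper` is proven free of the effect: `kernel` and `reduce` reach the primitive, and `logger` calls unknown code.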

The thing is, people are constantly inventing new effect analyses.
LLVM has some built-in analyses that are basically effect analyses,
like “does this touch global memory” or “does this have any
side-effects”. Maybe soon we’ll want to do a new general analysis
in LLVM to check whether a function synchronizes with other threads
(in the more standard atomics/locks sense, not GPU thread
communication). Maybe somebody will add a language-specific analysis
to track if a function ever runs “unsafe” code. Maybe somebody will
want to do an environment-specific analysis that checks whether a
function ever makes an I/O call. Who knows? But they come up a lot,
and LLVM doesn’t deal with them very well when it can’t make nice
assumptions like “all the code came from the same frontend and
is correctly participating in the analysis”.

Convergence is important enough for GPUs that maybe it’s worthwhile
for all GPU frontends — and so all functions in a module — to
participate in it. A lot of these other analyses, well, probably
not. And we shouldn’t be totally blocked from doing interprocedural
optimization in LLVM just because we’re combining things from
different frontends.

So my interest here is that I’d like the IR for convergence to set
a good example for how to model this kind of effect analysis.
I think that starts with acknowledging that maybe not all
functions are participating in the analysis and that that’s okay.
And I think that lets us more neatly talk about what we want for
convergence: either you want to require that all functions in the
module participate in the analysis, or you want to recognize
non-participating code and treat it more conservatively.

Other than that, I don’t much care about the rest of the details;
this isn’t my domain, and you all know what you’re trying to
do better than I do.

John.

A function that participates and isn’t marked convergent should probably default to being non-convergent, because that’s the conservative assumption (I believe). But if a function doesn’t participate in the analysis at all, well, it just doesn’t apply.

Based on the name, I agree it does feel like “non-convergent” would be the conservative assumption. But IIUC, that’s unfortunately not the case, and is why this proposal to change the default is being made.

“Convergent” actually means something like “This function might depend on the alignment of control-flow across multiple threads in order to exchange data with the implicit set of neighboring threads which are executing the same instruction at the same time”. Non-convergent means the opposite: that the code – including any transitive functions it calls – is known to NOT have any such cross-thread interaction.

When you know the code doesn’t have these cross-thread interactions, you have additional optimizations available that are unsafe when there is such cross-thread interaction.
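
A toy simulation may help show why such optimizations are unsafe. The Python sketch below models a hypothetical cross-lane operation, `lane_sum`, whose result depends on which lanes execute it together; it is not a real GPU intrinsic, and the four-lane setup is an assumption made for the example.

```python
def lane_sum(values, active):
    """Toy cross-lane operation: every active lane receives the sum of
    the values held by all lanes that are active along with it."""
    total = sum(v for v, a in zip(values, active) if a)
    return [total if a else None for a in active]


values = [1, 2, 3, 4]
all_lanes = [True] * 4
# A divergent condition: only the lanes holding even values take the branch.
cond = [v % 2 == 0 for v in values]

before_branch = lane_sum(values, all_lanes)  # [10, 10, 10, 10]
inside_branch = lane_sum(values, cond)       # [None, 6, None, 6]
```

Moving the call from before the branch to inside it changes what the participating lanes observe (10 versus 6), even though the single-threaded semantics of the code look unchanged.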

James Y Knight writes:

2. D69498 will be updated so that the convergent property is made
   default, but the new requirements on CPU frontends will be retracted.

If I understand correctly, this means the Clang frontend will no longer
need to add any convergent NOR noconvergent attributes. Instead, LLVM
analysis passes can infer the new noconvergent attribute as appropriate.
And on non-GPU platforms, the analysis can be skipped, and the optimization
passes can simply ignore convergence attributes, because the hardware has
no operations which can observe convergence. (So, it doesn't matter whether
a function is marked noconvergent or not.)

Which is to say: this proposal should make things *easier* for "naive"
frontends, not harder. By default, a reasonable thing should happen when
the frontend does nothing -- both on GPU platforms and non-GPU platforms.
Especially if all of the code is in a single module (e.g., via LTO), there
will be few calls to unknown destinations, so noconvergent can be inferred
basically everywhere that is not *in fact* dependent on convergence.

This is more or less the intention, but I would reword it as follows:

- Frontends are no longer required to add a convergent or noconvergent
  attribute if all they seek is correct execution on GPUs.

- A frontend or a suitable optimization may add the noconvergent
  attribute to gain performance where it can prove that the execution of
  the call or callee is not sensitive to which threads are executing
  such a call.

- The convergence property is never "ignored" ... it is now built into
  LLVM itself. But optimizations may rely on additional knowledge such
  as divergence analysis to reason about whether a particular control
  flow transformation is safe. This is not an explicitly target-specific
  definition. It just so happens that on CPUs, there is no divergent
  control flow, and hence it is "as if" the default convergent property
  has no effect.

- This use of divergence analysis also benefits GPUs. For example, the
  current implementation of sinking always bails out on a convergent
  call. In reality, a convergent call can be sunk across a uniform
  (non-divergent) branch. This is disallowed by the current overly
  conservative definition of convergent, but will be allowed by the new
  definition.
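
The difference between the two rules for sinking can be written down as two predicates. This is a hedged sketch in Python, not the actual sinking code in LLVM; the names `may_sink_current` and `may_sink_proposed` are hypothetical, and the divergence flag is assumed to come from a divergence analysis.

```python
def may_sink_current(call_is_convergent: bool) -> bool:
    """Current behaviour as described above: always bail out on a
    convergent call, regardless of the branch."""
    return not call_is_convergent


def may_sink_proposed(call_is_convergent: bool,
                      branch_is_divergent: bool) -> bool:
    """Proposed behaviour: a convergent call is blocked only when the
    branch it would be sunk across is divergent."""
    return not (call_is_convergent and branch_is_divergent)
```

On a CPU the branch is never divergent, so `may_sink_proposed` always returns true there, which matches the "as if" remark above.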

Sameer.

A function that participates and isn’t marked convergent should probably
default to being non-convergent, because that’s the conservative assumption
(I believe). But if a function doesn’t participate in the analysis at all,
well, it just doesn’t apply.

Based on the name, I agree it does feel like "non-convergent" would be the
conservative assumption. But IIUC, that's unfortunately not the case, and
is why this proposal to change the default is being made.

"Convergent" actually means something like "This function *might* depend on
the alignment of control-flow across multiple threads in order to exchange
data with the implicit set of neighboring threads which are executing the
same instruction at the same time". Non-convergent means the opposite: that
the code -- including any transitive functions it calls -- is known to NOT
have any such cross-thread interaction.

Oh, sorry, I have the name backwards, then. If “convergent” means
“this can interact with neighboring threads”, then “convergent” is the
conservatively correct default.

2. D69498 will be updated so that the convergent property is made
   default, but the new requirements on CPU frontends will be retracted.

If I understand correctly, this means the Clang frontend will no longer
need to add any convergent NOR noconvergent attributes. Instead, LLVM
analysis passes can infer the new noconvergent attribute as appropriate.
And on non-GPU platforms, the analysis can be skipped, and the optimization
passes can simply ignore convergence attributes, because the hardware has
no operations which can observe convergence. (So, it doesn't matter whether
a function is marked noconvergent or not.)

Which is to say: this proposal should make things *easier* for "naive"
frontends, not harder. By default, a reasonable thing should happen when
the frontend does nothing -- both on GPU platforms and non-GPU platforms.
Especially if all of the code is in a single module (e.g., via LTO), there
will be few calls to unknown destinations, so noconvergent can be inferred
basically everywhere that is not *in fact* dependent on convergence.

Well, if you really don’t need this to be explicit in IR at all,
that’s different. But my goal is not to reduce the work required
for frontends when they’re intending to support a particular
feature. My goal is to better support both mixed-frontend modules
and high-level optimizations that require frontend cooperation.
I just think that intersects nicely here because non-GPU frontends
can be thought of as not cooperating with convergence analysis, and so
their output just needs to be treated conservatively in contexts
that do honor convergence.

John.

Sameer Sahasrabuddhe writes:

CC'ing some more people who got dropped when sending the previous mail.

Sameer.

Sameer Sahasrabuddhe via llvm-dev writes:

TL;DR

We propose the following changes to LLVM IR in order to better support
operations that are sensitive to the set of threads that execute them
together:

- Redefine "convergent" in terms of thread divergence in a
  multi-threaded execution.
- Fix all optimizations that examine the "convergent" attribute to also
  depend on divergence analysis. This avoids any impact on CPU
  compilation since control flow is always uniform on CPUs.
- Make all function calls "convergent" by default (D69498). Introduce a
  new "noconvergent" attribute, and make "convergent" a nop.
- Update the "convergence tokens" proposal to take into account this new
  default property (D85603).

Here's an RFC designed to look like an incremental change over Nicolai's
original spec for convergence control intrinsics (Phabricator is pretty
awesome that way).

RFC: Update token semantics with default convergent attribute
https://reviews.llvm.org/D104504

This RFC has two parts:

LangRef:

    Define the "convergent" property in LLVM IR and introduce the
    "noconvergent" attribute. This is independent of convergence control
    intrinsics and tokens. This part is intended to be submitted first
    and replaces D69498 (IR: Invert convergent attribute handling)

ConvergentOperations:

    Updates the semantics of convergence control intrinsics and tokens
    to account for the new default convergent property. This part is
    intended to be merged into D85603 (IR: Add convergence control
    operand bundle and intrinsics)

Sameer.