[RFC] LLVM Coroutines

Hi all:

Below is a proposal to add experimental coroutine support to LLVM. Though this
proposal is motivated primarily by the desire to support C++ Coroutines [1],
the llvm representation is language neutral and can be used to support
coroutines in other languages as well.

Clang + llvm coroutines allow you to take this code:

  generator<int> range(int from, int to) {
     for (int i = from; i < to; ++i)
        co_yield i;
  }
  int main() {
     int sum = 0;
     for (auto v: range(1,100))
        sum += v;
     return sum;
  }

And translate it down to this:

  define i32 @main() #5 {
  entry:
    ret i32 4950
  }
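
The `generator<int>` type above is not provided by LLVM or by this proposal;
it is an ordinary library type built on top of the coroutine machinery. For
illustration only, a minimal sketch using the standardized C++20 spellings
(the prototype itself targets the TS-era std::experimental names, so treat
this as an approximation rather than the exact type used above) could look
like:

  #include <coroutine>
  #include <exception>
  #include <iterator>
  #include <utility>

  template <typename T>
  struct generator {
    struct promise_type {
      T current;
      generator get_return_object() {
        return generator{std::coroutine_handle<promise_type>::from_promise(*this)};
      }
      std::suspend_always initial_suspend() noexcept { return {}; }
      std::suspend_always final_suspend() noexcept { return {}; }
      std::suspend_always yield_value(T v) { current = v; return {}; }
      void return_void() {}
      void unhandled_exception() { std::terminate(); }
    };

    struct iterator {
      std::coroutine_handle<promise_type> h;
      bool operator!=(std::default_sentinel_t) const { return !h.done(); }
      iterator& operator++() { h.resume(); return *this; }      // run to the next co_yield
      T operator*() const { return h.promise().current; }
    };

    iterator begin() { h.resume(); return {h}; }                // run to the first co_yield
    std::default_sentinel_t end() { return {}; }

    explicit generator(std::coroutine_handle<promise_type> h) : h(h) {}
    generator(generator&& other) noexcept : h(std::exchange(other.h, {})) {}
    ~generator() { if (h) h.destroy(); }

    std::coroutine_handle<promise_type> h;
  };

With a library type along these lines, the loop in main() resumes the
coroutine once per iteration, and the whole thing is simple enough for the
optimizer to fold down to the constant shown above.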

I prototyped the llvm changes to support this proposal and extended the clang
coroutine implementation [2] to emit the proposed intrinsics, in order to
smoke-test the design and to get simple generator and async-task style
coroutines working. See "More Attention Required" in doc/Coroutines.rst for
details on what additional work is required beyond cleanup and bug fixes.

I would like to start a discussion on the overall design and representation,
and on a roadmap for upstreaming the changes, with the intention of continuing
development in-tree incrementally.

Roadmap:

I’m sorry if this is off-topic, but has anyone in WG21 yet come up with a sane
proposal about how coroutines interact with thread-local storage?

In LLVM-compiled code, it is possible that the address of a thread-local
variable will be cached to avoid repeated lookups. In an environment that
contains both coroutines and TLS, it is possible that the coroutine will be
migrated to a different thread between suspension and resumption. Does this
mean that, for correctness, we must regard these suspension points as barriers
that prevent loads or stores of thread-local variables and prevent caching of
loads of addresses of thread-local variables?

Do we have to treat every function call as a suspension point, as no function
can know whether it is executing in a coroutine or whether a function that it
calls may trigger a thread migration? This sounds as if it would cause
noticeable performance regressions in code that makes heavy use of
thread-local variables.

David

I looked over the proposal... a few comments:

1. This adds a lot of intrinsics, and they have weird semantics. This is
bad in general; the more weirdness you add to the IR, the more you'll
struggle to make the optimizer actually understand what you're doing. And
magic pattern matching *will* break. You've already pointed out the issue
with branches. (It's actually worse than you think: there isn't any
guarantee there will even be a branch when you've thrown the optimizer at
it.) The save intrinsic says "Its return value should be consumed by
exactly one `coro.suspend` intrinsic.", but it might have zero uses by the
time your lowering pass runs. You could end up with code getting inserted
before llvm.experimental.coro.fork.
2. If you need some sort of unusual control flow construct, make it a
proper terminator instruction; don't try to glue your intrinsic to a normal
branch instruction. A new kind of terminator instruction might be
appropriate. Or maybe you can make "invoke" work for you.
3. Maybe you could reduce the amount of magic involved by splitting the
initialization into a separate function? So the initialization is always
just something like "f(int x) { frame = init(); f_inner(x, frame); return
frame; }", and you don't have to deal with the weirdness of fork().

-Eli

Hi David:

> has anyone in WG21 yet come up with a sane proposal about how coroutines interact with thread-local storage?

There is a discussion in SG1 subgroup (Concurrency and Parallelism) on
the meaning of TLS in the lightweight thread of execution. I don't
think there is a resolution yet. Whether to think of a coroutine of
this kind as a thread of execution or just an "unusual" function is
not decided either.

> Does this mean that, for correctness, we must regard these suspension points as barriers that prevent loads or stores of thread-local variables and prevent caching of loads of addresses of thread-local variables?

You are right! P0057 implies (we probably need to add a note about it) that
when a coroutine is executing, you get the thread_local data of the current
thread. For your first point (caching of TLS loads), I think I am getting it
for free: the coro.suspend intrinsic, from llvm's standpoint, may read and
write any memory it can get access to; therefore, if a value was loaded from
TLS before coro.suspend, it has to be reloaded after the call to coro.suspend.
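
To make those semantics concrete, here is a small standalone sketch. It uses
C++20 spellings, and the `detached_task` and `hop_to_new_thread` types are
invented purely for this illustration (they are not part of the proposal);
the point is that a thread_local read after the suspend point refers to
whichever thread resumed the coroutine, so the pre-suspend load cannot be
reused:

  #include <chrono>
  #include <coroutine>
  #include <cstdio>
  #include <thread>

  // Minimal fire-and-forget coroutine type, just enough to compile the example.
  struct detached_task {
    struct promise_type {
      detached_task get_return_object() { return {}; }
      std::suspend_never initial_suspend() noexcept { return {}; }
      std::suspend_never final_suspend() noexcept { return {}; }
      void return_void() {}
      void unhandled_exception() {}
    };
  };

  // Awaitable that resumes the coroutine on a freshly created thread.
  struct hop_to_new_thread {
    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) const {
      std::thread([h] { h.resume(); }).detach();
    }
    void await_resume() const noexcept {}
  };

  thread_local int tls_counter = 0;

  detached_task f() {
    int before = tls_counter;     // reads the starting thread's copy (42)
    co_await hop_to_new_thread{}; // suspend point: the thread changes here
    int after = tls_counter;      // must be re-read: this is the resuming
                                  // thread's copy (0), not a cached value
    std::printf("before=%d after=%d\n", before, after);
  }

  int main() {
    tls_counter = 42;             // main thread's copy
    f();                          // runs up to the suspend point on this thread
    std::this_thread::sleep_for(std::chrono::milliseconds(200)); // crude wait
  }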

With respect to caching of the addresses of TLS variables:

1) if a user does the caching, we should let him:

  thread_local int a;
  task f() {
     auto x = &a; // captures the address of a in Thread 1
     co_await some_async_call();
     *x = *x + 1; // updates the value in Thread 1, whatever the current thread is
  }

2) If llvm does it, then, absolutely, we should "prevent caching of loads of
addresses across suspend points". I poked around a little and was not able to
come up with a case where it would, but I'll keep looking. If you know of such
a case, let me know.
Coroutines are split into normal functions in the middle optimizer, so
if there is any caching done during lowering, coroutines won't be
affected.

> Do we have to treat every function call as a suspension point, as no function can know whether it is executing in
> a coroutine or whether a function that it calls may trigger a thread migration? This sounds as if
> it would cause noticeable performance regressions in code that makes heavy use of thread-local variables.

That should not be necessary. The proposed coroutines are "stackless",
they can only get suspended when control is in the body of the
coroutine at explicitly marked suspend points. Thus thread migration
can only happen at suspend point boundaries.

With stackful coroutines (aka fibers, green threads, user-mode threads, etc.),
it is a big concern, since none of the functions that run on a fiber are aware
that they run on a fiber, and they can potentially get a thread switch at
every call. Yep. TLS is pretty much busted there.

Thank you very much for your comments!!!

Hi Eli:

Thank you very much for your comments!

> If you need some sort of unusual control flow construct, make it a
> proper terminator instruction

I would love to. I was going by the advice in "docs/ExtendingLLVM.rst":

       "WARNING: Adding instructions changes the bitcode format, and it will
        take some effort to maintain compatibility with the previous version.
        Only add an instruction if it is absolutely necessary.

Making coro.fork and coro.suspend proper terminators (which is what they
effectively are) would be much cleaner. Something like:

  corobegin to label %start
            suspend label %retblock

  corosuspend [final] [save %token]
              resume label %resume
              cleanup label %cleanup

I did consider "repurposing" 'invoke', but felt it would be too much of an abuse.
If there is support for making corobegin/corosuspend instructions, I would
love to do it. Also, if there is support for the invoke abuse :-), I will be
glad to try it out and come back with a report on how it went.

> Maybe you could reduce the amount of magic involved by splitting the
> initialization into a separate function? So the initialization is always
> just something like "f(int x) { frame = init(); f_inner(x, frame); return
> frame; }", and you don't have to deal with the weirdness of fork().

I think outlining the "start" part will be sufficient (starting at coro.fork
and cutting off all of the branches ending in coro.end). In this case llvm
won't see coro.fork and coro.end, since clang will do the outlining work.
Something like:

  T f(int x, A y) {
    auto mem = coro.elide();
    if (!mem) mem = malloc(llvm.frame.size(&f_inner));
    auto hdl = coro.init(mem);
    T result = <whatever>;

    f_inner(hdl, x, &y);

    return result;
  }

f will be a normal function and f_inner will be a magic function that requires
splitting at suspend points. It will be processed in IPO order, so by the time
f_inner is considered for inlining into f, it will be a normal function.

I think this can work, but we are trading off one complexity for another.

As presented:

a coroutine is defined as a magic function (that has weird intrinsics) and
requires splitting during IPO.

With coro.fork / coro.end removed:

a coroutine is defined as a combination of a normal function plus a magic
function that requires splitting during IPO. If other languages adopt
coroutines, they would need the same outlining pass as clang.

I need to think about it more, but I am currently leaning toward the first
option (with coro.fork and coro.end). The "f_inner" is still part of f,
but nicely delineated with corobegin and coro.end.

  corobegin to label %coro.start
            suspend label %retblock

  corosuspend [final] [save %token]
              resume label %resume
              cleanup label %cleanup

  call void @llvm.coro.end();

Does it look better?

Gor

> Hi Eli:
>
> Thank you very much for your comments!

>> If you need some sort of unusual control flow construct, make it a
>> proper terminator instruction

> I would love to. I was going by the advice in "docs/ExtendingLLVM.rst":
>
>        "WARNING: Adding instructions changes the bitcode format, and it will
>         take some effort to maintain compatibility with the previous version.
>         Only add an instruction if it is absolutely necessary."

The point of this is that we want to strongly encourage people to look for
other solutions first... but on the other hand this isn't a blanket ban on
new instructions; pushing intrinsics too far consistently caused trouble
for exception handling in LLVM for years.

> I need to think about it more, but I am currently leaning toward the first
> option (with coro.fork and coro.end). The "f_inner" is still part of f,
> but nicely delineated with corobegin and coro.end.
>
>   corobegin to label %coro.start
>             suspend label %retblock
>
>   corosuspend [final] [save %token]
>               resume label %resume
>               cleanup label %cleanup
>
>   call void @llvm.coro.end();
>
> Does it look better?

Yes, that seems fine. There's still the potential for non-initialization
instructions sneaking into the initialization block, but you can probably
handle it somehow.

That said, thinking about it a bit more, I'm not really sure why you need
to tie the suspend call to the branch in the first place. What exactly is
your algorithm doing that requires it? I mean, a naive implementation of
CoroSplit based on your llvm.experimental.coro.suspend intrinsic clones the
whole function, replaces the result of suspend calls with "true" in one
version and "false" in the other, and runs SimplifyCFG to kill the dead
code. And even with an algorithm that tries to be more clever, you
probably need to do some cloning anyway: presumably frontends will want to
share code between the destroy and unwinding codepaths.

-Eli

Hi Eli:

>>   corobegin to label %coro.start
>>             suspend label %retblock
>>
>>   corosuspend [final] [save %token]
>>               resume label %resume
>>               cleanup label %cleanup

> Yes, that seems fine. There's still the potential for non-initialization
> instructions sneaking into the initialization block, but you can probably
> handle it somehow.

I don't mind non-initialization instructions sneaking in at all. I want them to.
All of the code that runs until it hits the first suspend can stay in `f`.
Only post-suspend code needs to go to `f.resume` and `f.destroy`.

For example, consider a fire-and-forget coroutine:

  void f(int n) {
    while (n-- > 0) {
      do1();
      <suspend> -- will subscribe to some async continuation
      do2();
    }
  }

Post split, it will look something like this:

  struct f.frame { Fn* ResumeFn; Fn* DestroyFn; int n; };

  void f(int n) {
    f.frame* state = <init-stuff>;
    state->n = n;
    if (state->n-- > 0)
      do1();
    else
      <destroy>
  }

  void f.resume(f.frame* state) {
     do2();
     if (state->n-- > 0)
       do1();
     else
       <destroy-state>
  }

  void f.destroy(f.frame* state) { <destroy-state> }

> That said, thinking about it a bit more, I'm not really sure why you need to
> tie the suspend call to the branch in the first place.

I am not sure I understand the question. The suspend intrinsic is replaced with
a branch to the return block in 'f' and with 'ret void' in 'f.resume'.

> What exactly is your algorithm doing that requires it? I mean, a naive
> implementation of CoroSplit based on your llvm.experimental.coro.suspend
> intrinsic clones the whole function, replaces the result of suspend calls
> with "true" in one version and "false" in the other, and runs SimplifyCFG
> to kill the dead code.

Very close. We do clone the function body twice, once for resume and once for
destroy. We make a new entry block with a switch instruction that branches to
all resume branches in `f.resume` and to all cleanup branches in `f.destroy`,
and then let SimplifyCFG remove the rest.

Changing the subject a little bit. I was thinking we can get rid of
coro.fork altogether if we add a third label to the corosuspend.

Namely:

  corosuspend [final] [save %token] to label %return.block
    resume label %resume
    cleanup label %cleanup

corosuspend is lowered as follows:
  in 'f': corosuspend is replaced with `br %return.block`

  in 'f.resume':
    add a new entry block with a switch jumping to all resume blocks
    corosuspend is replaced with `ret void`

  in 'f.destroy':
    add a new entry block with a switch jumping to all cleanup blocks
    corosuspend is replaced with `ret void`

I think this makes the model clearer to understand. The only negative side
to a corosuspend with three branching targets is that the `to` label block
will very likely be the same in all corosuspends, thus wasting valuable
bits. :-)

Gor

P.S.

The return block in 'f' may contain a lot of stuff, like destructors for
parameters passed by value (if the ABI requires it) and possibly some code
related to the return value.

`f.resume` does not have any of that, so the return block is just:

return.block:
  ret void

P.P.S.

I did consider making the resume part return something other than void. One
scheme would be:

  i1 false - at the final suspend point (or destroyed)
  i1 true  - can be resumed again

In that case:
  corosuspend       => ret i1 true
  corosuspend final => ret i1 false
  control flow reaching the end of `f.resume` => ret i1 false

I wasn't sure that it is a clear win, so I went with 'void' as the return value.
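
For what it's worth, the potential win of the i1 scheme is on the caller side:
a driver could pump the coroutine using nothing but the resume result. A
hypothetical sketch (none of these names exist in the proposal; `f_resume`,
`f_destroy`, and `coro_frame` are stand-ins for the split-off functions and
the frame type):

  // Hypothetical C++ rendering of a driver over an i1-returning resume part.
  struct coro_frame;                      // opaque coroutine frame
  extern "C" bool f_resume(coro_frame*);  // false once the coroutine cannot be resumed again
  extern "C" void f_destroy(coro_frame*); // frees the frame

  void drain(coro_frame* frame) {
    // With a bool result the driver needs no separate "done?" query:
    while (f_resume(frame))
      ;
    f_destroy(frame);
  }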

> Hi Eli:
>
>> That said, thinking about it a bit more, I'm not really sure why you need to
>> tie the suspend call to the branch in the first place.
>
> I am not sure I understand the question. The suspend intrinsic is replaced with
> a branch to the return block in 'f' and with 'ret void' in 'f.resume'.

Right... but that doesn't mean the call to the suspend intrinsic has to be
the last non-terminator instruction in the basic block before you run
CoroSplit. You can split the basic block in CoroSplit so any instructions
after the suspend call are part of a different basic block. Then resume
and destroy both branch to the continuation of the basic block (and run
different codepaths based on the boolean).

>> What exactly is your algorithm doing that requires it? I mean, a naive
>> implementation of CoroSplit based on your llvm.experimental.coro.suspend
>> intrinsic clones the whole function, replaces the result of suspend calls
>> with "true" in one version and "false" in the other, and runs SimplifyCFG
>> to kill the dead code.
>
> Very close. We do clone the function body twice, once for resume and once for
> destroy. We make a new entry block with a switch instruction that branches to
> all resume branches in `f.resume` and to all cleanup branches in `f.destroy`,
> and then let SimplifyCFG remove the rest.

Okay, that helps me visualize it.

> Changing the subject a little bit. I was thinking we can get rid of
> coro.fork altogether if we add a third label to the corosuspend.
>
> Namely:
>
>   corosuspend [final] [save %token] to label %return.block
>     resume label %resume
>     cleanup label %cleanup
>
> corosuspend is lowered as follows:
>   in 'f': corosuspend is replaced with `br %return.block`
>
>   in 'f.resume':
>     add a new entry block with a switch jumping to all resume blocks
>     corosuspend is replaced with `ret void`
>
>   in 'f.destroy':
>     add a new entry block with a switch jumping to all cleanup blocks
>     corosuspend is replaced with `ret void`
>
> I think this makes the model clearer to understand. The only negative side
> to a corosuspend with three branching targets is that the `to` label block
> will very likely be the same in all corosuspends, thus wasting valuable
> bits. :-)

Yes, this makes sense.

-Eli

Sorry for the novice question, but how are coroutines lowered in the backend? It requires making operating system calls to something like pthreads, right?

Hi Josh:

> Sorry for the novice question, but how are coroutines lowered in the
> backend? It requires making operating system calls to something like
> pthreads, right?

No OS support, no target-specific code required: just a simple rewriting of
the function in the middle optimizer. One of the first examples in the RFC
shows a coroutine being optimized down to a constant. Try doing that with
pthreads :-)

  generator<int> range(int from, int to) {
     for (int i = from; i < to; ++i)
        co_yield i;
  }
  int main() {
     int sum = 0;
     for (auto v: range(1,100))
        sum += v;
     return sum;
  }

which gets translated down to this:

  define i32 @main() #5 {
  entry:
    ret i32 4950
  }

Very briefly:

LLVM coroutines are functions that have one or more suspend points.
When a suspend point is reached, the execution of the coroutine is suspended
and control is returned to its caller. A suspended coroutine can be resumed
to continue execution from the last suspend point, or it can be destroyed.

Let's say you have a coroutine that looks something like this:

  void *f(int n) {
     for (;;) {
        bar(n++);
        <suspend> // magic: returns a coroutine handle on first suspend
     }
  }

We will split it up into start, resume and destroy functions and will create
a coroutine frame that will keep values that need to be live across suspension
points.

  struct f.frame { int n; }

  void* f(int n) {
    auto s = new f.frame{n};
    bar(s->n++);
    return s;
  }

  void f.resume(f.frame* s) {
    bar(s->n++);
  }

  void f.destroy(f.frame* s) {
    delete s;
  }
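
To see how these pieces are driven from the outside, here is a hedged
caller-side sketch. It is a C++ rendering of what is really generated LLVM IR;
`f_start` and `f_frame` are stand-ins for the split pieces above (whose real
names contain a '.') rather than the exact symbols the pass emits:

  struct f_frame { int n; };        // stand-in for f.frame

  f_frame* f_start(int n);          // the post-split 'f': runs until the first suspend
  void     f_resume(f_frame* s);    // continues from the suspend point
  void     f_destroy(f_frame* s);   // frees the frame

  void caller() {
    f_frame* h = f_start(3);  // bar(3) has already run when this returns
    f_resume(h);              // runs bar(4), then suspends again
    f_resume(h);              // runs bar(5)
    f_destroy(h);             // done with the coroutine; the frame is freed
  }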

You can read more details in the doc/Coroutines.rst up for review at

http://reviews.llvm.org/D21170

or in the RFC that started this thread.

Cheers
Gor