RFC: LLVM Coroutine Representation, Round 2

Hi all:

Thank you very much for the feedback during the first round! Special thanks to
Eli, Sanjoy, Hal and Chandler for their detailed public and private comments.
The model is simpler now and the intrinsics have nice symmetry to them:
coro.begin + coro.end, coro.alloc + coro.free. Compared to the previous RFC,
coro.fork is gone, coro.init and coro.start are collapsed into one coro.begin,
coro.suspend is three-way as opposed to two-way. coro.elide and coro.delete
renamed coro.alloc and coro.free.

All the changes implemented and tested. Full document, (nicely formatted by
github is at:

  https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst

Below is a summary of a proposal to add coroutine support to LLVM.
The proposal is motivated primarily by the desire to support C++ Coroutines [1],
however the llvm representation is language neutral and can be used
to support coroutines in other languages as well.

Clang + llvm coroutines allows you to take this code:

  generator<int> range(int from, int to) {
     for(int i = from; i < to; ++n)
        co_yield i;
  }
  int main() {
     int sum = 0;
     for (auto v: range(1,100))
        sum += v;
     return sum;
  }

And translate it down to this:

  define i32 @main() #5 {
  entry:
    ret i32 4950
  }

You can also use coroutines in plain C, by calling the builtins mapping to the
intrinsics described by this proposal, so that your coroutine can look like:

    #include "Inputs/coro.h"
    void print(int v);

    void* f(int n) {
      CORO_BEGIN(malloc);

      while (n-- > 0) {
        print(n+1);
        CORO_SUSPEND();
      }

      CORO_END(free);
    }

    // CHECK-LABEL: @main
    int main() {
      void* coro = f(4);
      CORO_RESUME(coro);
      CORO_RESUME(coro);
      CORO_DESTROY(coro);
    }

I prototyped llvm changes to support this proposal and extended clang coroutine
implementation [2] to emit proposed intrinsics to smoke test the proposed
design. See "More Attention Required" in the doc/Coroutines.rst [4]for details
on what additional work is required beyond cleanup and bug fixes.

I would like to continue the discussion on the overall design, representation
and a roadmap to start upstreaming the changes with the intention to continue
the development in tree incrementally.

Roadmap:

Hi,

Hi all:

Thank you very much for the feedback during the first round! Special thanks to
Eli, Sanjoy, Hal and Chandler for their detailed public and private comments.
The model is simpler now and the intrinsics have nice symmetry to them:
coro.begin + coro.end, coro.alloc + coro.free. Compared to the previous RFC,
coro.fork is gone, coro.init and coro.start are collapsed into one coro.begin,
coro.suspend is three-way as opposed to two-way. coro.elide and coro.delete
renamed coro.alloc and coro.free.

All the changes implemented and tested. Full document, (nicely formatted by
github is at:

  https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst

Reading through that document, I have a couple of questions (maybe
overlooking or misunderstanding some important explanations):

- As the caller of a coroutine, how do I know whether a coroutine
  reached its cleanup point or not? This seems essential for
  implementing iterators and looping on them, yet @llvm.coro.done only
  works if a particular suspend point was designated as "final" (which
  doesn't sound possible to do in the general case, i.e. halting
  problem).

- Is the promise type restricted to atomic LLVM types, or can I use
  @llvm.coro.promise.p0i8 (together with a bitcast) if the promise is an
  arbitrary complex structure?

- Is the promise allocated/copied inside the coroutine frame? If not,
  how does the sharing work even if the coroutine is passed to different
  callers?

Regards

Antoine.

cc llvm-dev

Hi Vadim:

Does you design support resumption with parameter(s)? (such as Python's
generator.send(x)). I suppose the "promise" could be used for passing data
both ways,

Correct.

The docs/Coroutines.rst states that:

   "A coroutine author or a frontend may designate a distinguished `alloca`
    that can be used to communicate with the coroutine."

What kind of communication happens is determined by the source language
coroutine constructs. An example later in the section shows how, as a consumer
of a coroutine, you can read the data from the promise. But that communication
can go the other way if you, instead, will be storing into the promise in main,
and loading from it in the coroutine.

coro.promise intrinsic gives you the address of the promise associated with a
particular coroutine instance and you (as frontend writer, or coroutine library
designer is free to do whatever you want with it).

please mention this explicitly in the design doc.

Would you suggest changes to the exact wording to make it clearer? I put it up
for review at: ⚙ D22603 [coroutines] Part 1 of N: Documentation, so you can just mark up the
changes you would like to see.

Also, how is loading/storing to promise going to be lowered?

The coro.promise intrinsics just gives you an address of the coroutine promise.
You, as frontend / library writer, know what is the promise and you emit
appropriate load or stores instructions.

If you have a synchronous scenario and a promise is plain int, you would use
plain loads and stores. If your promise is some complicated data structure
which has atomics and you use it from different threads, you would use
appropriate loads, stores, fences and synchronization primitives.

But all this is out of scope for docs/Coroutines.rst . It is up to the frontend,
library writer to decide how it wants to use the promise. The only special
handling that LLVM does for the coroutine promise is:

1) places the promise at deterministic offset in the coroutine frame
2) allows you to get an address of the promise given a coroutine handle and
   vice versa.

Cheers,
Gor

P.S. If you'd like to see a library implementation of a generator, please see:

Generator itself:

coroutine_handle (a C++ level abstraction mapping to llvm intrinsics)

and the use:

Well, if start looking at those tests, you may enjoy coroutines in C via macros,
lowered to the same coroutine intrinsics :-).

and

What I am getting at, is that ideally the parameters/return value of generator.send() should become parameters/return value of @f.Resume:
define @f.Resume(%f.Frame* %FramePtr, %Param);

They don’t need to be allocated in the coroutine frame at all. Do you think that LLVM will be able to perform such an optimization?

Hi Vadim:

What I am getting at, is that ideally the parameters/return value of
generator.send() should become parameters/return value of @f.Resume:
    define <return type> @f.Resume(%f.Frame* %FramePtr, <input type>
%Param);

Absolutely! I sketched out this approach and it looks like it can be build on
top of the current model. I have not designed nor implemented it yet and it is
not part of this RFC.

When I get to designing it, I can keep you in the loop if you are interested.

Gor

P.S.

BTW, you would need to have as many f.resume's as you have suspend points.
So that if you have a coroutine that looks like:

  task<void> f(AsyncStream& s) {
    int a = await s.ReadInt();
    float b = await s.ReadFloat();
  }

you would have:

  void f.resume.1(%f.Frame*, i32 result);
  void f.resume.2(%f.Frame*, float result);