[RFC] LLVM Coroutines

I love it!!! It is so much simpler. Indeed, we absolutely don't need to care
about the resume and cleanup branches. Let me restate the model as
I understand it.

Lowering of coro.suspend number X:

1) split block at coro.suspend, new block becomes jump point for resumption X
2) RAUW coro.suspend with i1 true in resume clone, i1 false in destroy clone
3) replace coro.suspend with br %return.block in 'f', 'ret void' in clones
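The effect of these three steps can be modeled by hand in ordinary C++: the saved resume index selects which "resumption label" the clone jumps to, and each suspend becomes a plain return to the caller. This is only an illustrative sketch with hypothetical names (`Frame`, `resume`), not the proposed intrinsics:

```cpp
#include <cassert>

// Hand-lowering of a coroutine body with two suspend points.
// 'state' plays the role of the resume index stored in the coroutine
// frame; resume() re-enters at the block created by step 1 ("split
// block at coro.suspend, new block becomes jump point for resumption X").
struct Frame {
    int state = 0;   // which resumption label to jump to next
    int value = 0;   // a coroutine-local that lives in the frame
};

void resume(Frame &f) {
    switch (f.state) {
    case 0:
        f.value = 1;      // code before suspend #0
        f.state = 1;      // record resumption label 0
        return;           // 'ret void' substituted for coro.suspend
    case 1:
        f.value += 10;    // code between suspend #0 and suspend #1
        f.state = 2;
        return;
    case 2:
        f.value += 100;   // code after the last suspend
        f.state = -1;     // done
        return;
    }
}
```

Each `return` is where step 3 replaced the suspend in a clone, and the constant folding of the suspend result (step 2) is what lets each clone keep only its own branch of the body.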

Scratch the proposed corosuspend instruction. I think the intrinsic will work
just fine!

Gor

That sounds right.

If you're going down that route, that still leaves the question of the
semantics of the fork intrinsic... thinking about it a bit more, I think
you're going to run into problems with trying to keep around a return block
through optimizations:

[...]
%first.return = call i1 @llvm.experimental.coro.fork()
%cmp = icmp eq i32 %x, 0
br i1 %cmp, label %block1, label %block2

block1:
[...]
br i1 %first.return, label %coro.return, label %coro.start

block2:
[...]
br i1 %first.return, label %coro.return, label %coro.start

coro.return:
%xxx = phi i32 [ %a, %block1 ], [ %b, %block2 ]
call void @f(i32 %xxx)
ret i8* %frame

(Now you can't trivially insert branches to the return block because of the
PHI node.)

-Eli

Hi Eli:

semantics of the fork intrinsic... thinking about it a bit more, I think
you're going to run into problems with trying to keep around a return block
through optimizations:

How about this? Make all control flow explicit (duh).

    declare i8 coro.suspend([...])

    returns:
      0 - resume
      1 - cleanup
      anything else - suspend

Now we can get rid of the coro.fork altogether.

    [...]
    %0 = call i8 @llvm.coro.suspend([...])
    switch i8 %0, label %suspend [ i8 0, label %resume,
                                   i8 1, label %cleanup ]
  suspend:
    call void @llvm.coro.end()
    br %return.block

We no longer have to do any special handling of return block as coro.end takes
care of it.

  Lowering of coro.suspend X

  1. Split block at coro.suspend, new block becomes the resume label X
  2. RAUW coro.suspend with 0 in f.resume and 1 in f.destroy
  3. Replace coro.suspend with 'br %suspend'

  coro.end expands as before:

  in f => no-op
  in f.resume/f.destroy => ret void (+ fancy things in landing pads)

Gor

If you're going down that route, that still leaves the question of the
semantics of the fork intrinsic... thinking about it a bit more, I think
you're going to run into problems with trying to keep around a return block
through optimizations:

Thinking about it a bit more, it is even simpler: I don't have to involve
coro.end in this case at all and need not worry about branches; RAUW-ing
coro.suspend with an appropriate constant takes care of it all.

Here is an updated version.
1) coro.fork is gone, gone gone!!!!
2) coro.suspend is changed as follows:

    declare i8 coro.suspend([...])

    returns:
      0 - in resume clone
      1 - in destroy clone
      -1 - in the original function

Usage:

    [...]
    %0 = call i8 @llvm.coro.suspend([...])
    switch i8 %0, label %return [ i8 0, label %resume,
                                  i8 1, label %cleanup ]

Lowering of coro.suspend X

  In the original function:
    * RAUW coro.suspend to -1
    * Remove coro.suspend

  In both clones:
    * RAUW coro.suspend to 0 in `f.resume` and 1 in `f.destroy`
    * Split block after coro.suspend, new block becomes the resume label X
    * replace coro.suspend with
      - `ret void` in resume clone
      - @llvm.trap() in destroy clone
        (in the destroy clone none of the suspend points should be reachable;
         if they are, it is a frontend bug)

Much prettier than before :-)

Gor

I'm not sure this quite works the way you want it to in terms of the
optimizer. For example:

block1:
suspend
switch (resume, destroy, return)

resume:
store zero to global @g
[...]

destroy:
store zero to global @g
[...]

return:
store zero to global @g
[...]

Naively, you would expect that it would be legal to hoist the store... but
that breaks your coroutine semantics because the global could be
mutated between
the first return and the resume.

Probably the most natural way to solve this is the two-function approach I
suggested earlier... that way, the first call to suspend isn't special, so
everything just works.

There might be some other way to solve it, but I think it just becomes more
awkward. For example, you could add a boolean that tracks whether the
function has been suspended yet, and explicitly reroute the control flow to
conditionally perform one-time operations before each call to suspend. But
then you need a giant switch statement or a separate one-time cleanup
function to actually route the control flow correctly.
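The flag-plus-rerouting alternative described above can be sketched concretely. This is a model of the control flow only, with hypothetical names; the "giant switch" is the dispatch that routes control back to the correct suspend point after the one-time operations run:

```cpp
#include <cassert>

// A 'first_time' flag gates the one-time (first-pass) operations;
// a switch then routes control back to whichever suspend point we
// came from, so each suspend site stays uniform.
int one_time_runs = 0;   // counts how often the one-time work executes

struct Coro {
    bool first_time = true;

    // Called at suspend point 'which'. On the first pass it performs
    // the one-time work, then the switch re-enters suspend #which.
    void suspend(int which) {
        if (first_time) {
            first_time = false;
            one_time_runs++;         // one-time operations, exactly once
            switch (which) {         // the routing switch
            case 1: break;           // re-enter suspend1
            case 2: break;           // re-enter suspend2
            }
        }
        // the actual suspend (return to caller) happens here
    }
};
```

The awkwardness Eli points out is visible even in this toy: every suspend site has to thread its identity through the flag check so the routing switch can find its way back.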

-Eli

Hi Eli:

Naively, you would expect that it would be legal to hoist the store...
but that breaks your coroutine semantics because the global could be mutated
between the first return and the resume.

Hmmm... I don't see the problem. I think hoisting the store is a perfectly
legal transformation with all semantics preserved.

Let's look at your example:

block1:
  suspend
  switch (resume, destroy, return)

resume:
  store zero to global @g
  doA()
  [...]

destroy:
  store zero to global @g
  doB()
  [...]

return:
  store zero to global @g
  doC()
  [...]

As written the behavior is:

  1) when we encounter a suspend during the first pass through the function,
     store 0 to @g and doC()

  2) when we encounter a suspend after the coroutine was resumed:
     ret void

  3) when the coroutine is resumed:
     store 0 to @g and doA()

  4) when the coroutine is destroyed:
     store 0 to @g and doB()

If this is a coroutine that can be resumed asynchronously from a different
thread there is a data race. For example, if resume happens before 'f' returns,
doA() can write something useful to @g, and 'f' will clobber it with zero.
But, this race is already present in the code and is not introduced by LLVM.

Let's hoist the store and see what happens:

block1:
  suspend
  store zero to global @g
  switch (resume, destroy, return)

resume:
  doA()
  [...]

destroy:
  doB()
  [...]

return:
  doC()
  [...]

Now, let's split it up:
  1. RAUW coro.suspend with -1,0,1 in f, f.resume and f.destroy
  2. remove coro.suspend in f, replace with ret void in f.resume

void f() {
  [...]
  store zero to global @g
  doC();
  [...]
}

void @f.resume() {
entry:
  store zero to global @g
  doA();
  [....]
}

void @f.destroy() {
entry:
  store zero to global @g
  doB();
  [....]
}

Behavior looks exactly as before. What am I missing?

Thank you,
Gor

P.S.

Eli, thank you very much for your input. I struggled with the coro.fork for
months and in just a few days here it seems to be getting so much better!

Err... I guess nothing; sorry, I got myself confused. Essentially
executing a return statement, then jumping back, seems wrong, but I'm
having trouble coming up with a good example of how it would actually break
something. I guess there aren't really any issues on a pure control-flow
basis: you can't execute a global side-effect a different number of times
than it would otherwise happen.

You might run into problems with allocas: LLVM's optimizer will assume the
lifetime of any alloca in the current function ends when you hit a "ret"
instruction, so jumping back from the ret to the middle of the function
could lead to wrong results. Silly example:

x = alloca...

block1:
  suspend
  ; increment could get moved here.
  switch (resume, destroy, return)

resume:
  x += 1
   [...]

destroy:
  x += 1
   [...]

return:
  (don't touch x)
  [...]

The increment is only supposed to execute once, but instead executes twice.

This isn't a great example... and I'm not sure issues are limited to
allocas... but that's the idea, anyway.
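The hazard can be made concrete with a plain state machine. In this sketch (illustrative names, not real intrinsics), the "hoisted" variant models the increment having been moved between the suspend and the three-way dispatch, so it also runs on the first pass that takes the return path, and then again on resumption:

```cpp
#include <cassert>

enum Path { RESUME, DESTROY, RETURN_PATH };

// Original placement: x is only touched on the resume/destroy paths.
int run_original(Path p) {
    int x = 0;              // the alloca from the example
    // suspend; three-way dispatch
    switch (p) {
    case RESUME:      x += 1; break;
    case DESTROY:     x += 1; break;
    case RETURN_PATH: /* don't touch x */ break;
    }
    return x;
}

// After the (invalid) hoist: the increment sits above the dispatch,
// so it executes on the first pass (return path) AND again when the
// coroutine is later resumed at the same suspend point.
int run_hoisted_first_pass_then_resume() {
    int x = 0;
    x += 1;   // hoisted increment: runs on the RETURN_PATH pass
    // ... "ret" happens, coroutine is resumed, control re-enters ...
    x += 1;   // runs again on the RESUME pass
    return x;
}
```

In the original placement the increment executes once per path; after the hoist it executes twice across the return-then-resume sequence, which is exactly the breakage Eli describes.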

-Eli

Hi Eli:

Err... I guess nothing; sorry, I got myself confused.

That is my fault :-). I confused myself in the beginning where I was trying to
treat coro.suspend + cond.branch pair as a terminator and was looking for
branches. You showed me the way... Just ignore the branches and represent
the control flow cleanly and SimplifyCFG will take care of the rest :-).

You might run into problems with allocas:

  x = alloca...

block1:
  suspend
  ; increment could get moved here.
  switch (resume, destroy, return)

resume:
  x += 1 ; load + inc + store
  [...]

destroy:
  x += 1 ; load + inc + store
  [...]

return:
  (don't touch x)
  [...]

Yep. You are right. Need some kind of barrier to prevent the motion.

Okay. New intrinsics:

declare void coro.barrier(i64); // will get a unique number for every insertion

  x = alloca...

block1:
  suspend
  ; nothing can sneak in here??? Right?
  switch (resume, destroy, return)

resume:
  coro.barrier(0)
  x += 1 ; load + inc + store
  [...]

destroy:
  coro.barrier(1)
  x += 1 ; load + inc + store
  [...]

return:
  coro.barrier(2)
  (don't touch x)
  [...]

Now we are good, right?
I am still not giving up hope for an intrinsic approach.

Thanks a lot, Eli, for your help with the design!
Gor

P.S

There is always a backup plan to go to your "two function" idea with a slight
twist.

Almost two functions (coro.fork(), coro.suspend() same as in original proposal)

Okay. New intrinsics:

declare void coro.barrier(i64); // will get a unique number for every insertion

>> x = alloca...
>>
>> block1:
>> suspend
>> ; nothing can sneak in here???, Right?
>> switch (resume, destroy, return)
>>
>> resume:
>> coro.barrier(0)
>> x += 1 ; load + inc + store
>> [...]
>>
>> destroy:
>> coro.barrier(1)
>> x += 1 ; load + inc + store
>> [...]
>>
>> return:
>> coro.barrier(2)
>> (don't touch x)
>> [...]

Now we are good, right?
I am still not giving up hope for an intrinsic approach.

coro.barrier() doesn't work: if the address of the alloca doesn't escape,
alias analysis will assume the barrier can't read or write the value of the
alloca, so the barrier doesn't actually block code movement.

There is always a backup plan to go to your "two function" idea with a slight
twist.

Almost two functions (coro.fork(), coro.suspend() same as in original proposal)

================================================================================

CoroEarly, a pass which runs at EP_EarlyAsPossible, will extract f.start using
coro.fork as a guide (since the false branch bypasses the coroutine body).

The coroutine will now travel as two functions until CoroSplit gets to f.start.
(Probably will use a 'coroutine' attribute on a function to indicate that it
requires CoroSplit to munge on it.)

I am still hoping we can find a solution that does not require separating
f.start out. Seems like a huge hammer to avoid undesirable code motion /
hoisting.

(I really hope that @coro.barrier(i64) approach can work)

If you really don't want two functions, you could probably do something
like this:

block1:
%first_time = load...
br i1 %first_time, label return, label suspend1

suspend1:
suspend
switch (resume1, destroy1)

block2:
%first_time = load...
br i1 %first_time, label return, label suspend2

suspend2:
suspend
switch (resume2, destroy2)

return:
%cont = phi i32 [(block1, 1), (block2, 2)]
; do stuff
switch %cont... (1, label suspend1), (2, label suspend2)

It's a little awkward to generate the return block, but otherwise it's
reasonably clean. Also, if some non-C++ language wants to generate
coroutines, it might not have to generate the return block at all.

-Eli

Hi Eli:

>> coro.barrier() doesn't work: if the address of the alloca doesn't escape,
>> alias analysis will assume the barrier can't read or write the value of
>> the alloca, so the barrier doesn't actually block code movement.

Got it. I am new to this and learning a lot over the course
of this thread. Thank you for being patient with me.

Two questions and one clarification:

Q1: Do we have to have a load here?

>> block1:
>> %first_time = load... <--- What are we loading here?

Just a local alloca, initialized to false, and changed to true in the
return block.

>> br i1 %first_time, label return, label suspend1
>>
>> suspend1:
>> %0 = coro.suspend()
>> switch %0 (resume1, destroy1)

Can we use three way coro.suspend instead?

  Block1:
    %0 = call i8 coro.suspend()
    switch i8 %0, label %suspend1 [i8 0, label %return] ; or icmp + br i1
  Suspend1:
    switch i8 %0, label %resume1 [i8 1, label %destroy1] ; or icmp + br i1

This doesn't look right: intuitively the suspend happens after the return
block runs.

One problem I can see is that someone can write a pass that might merge
two branches / switches into one switch and we are back where we were.
I guess what you meant by load is to call some coro.is.first.time()
intrinsic.
So it looks like:

>> block1:
>> %first_time = call i1 coro.is.first.time()
>> br i1 %first_time, label return, label suspend1
>>
>> suspend1:
>> %0 = coro.suspend()
>> switch %0 (resume1, destroy1)

This looks fine, there may be more uses for this intrinsic in the frontend.
Killing two birds with one stone. Good.

It doesn't really matter whether the bit gets tracked in an alloca or
through intrinsics.

Question 2: Why the switch in the return block?

I would think that **pre-split** return block would be simply:

return:
   <run dtors for parameters, if required>
   <conversion ops for ret value, if required>
   <ret void> or <ret whatever>

Where and why I should put a switch that you mentioned in this return
block?

BTW, I am speaking of the return block as if it is one block,
but, it could be a dominating block over all the blocks that together
run the destructors, do return value conversion, etc.

The best way to be sure the compiler will understand the control flow is if
the coroutine acts like a normal function. Another way to put it is that
it should be possible to lower a coroutine to a thread rather than
performing the state machine transformation.

The switch answers the question of where the control flow actually goes
after the return block runs. Under normal function semantics, the "return"
block doesn't actually return: it just performs the one-time operations,
then jumps back to the suspend call. Therefore, you can't use "ret" there;
you have to connect the control flow back to the correct suspend call. The
switch makes that connection. So the return block looks like this:

   <run dtors for parameters, if required>
   <conversion ops for ret value, if required>
   call coro.first_time_ret_value(value) ; CoroSplit replaces this with a ret
   switch ... ; jump to suspend; this is always dead in the lowered version

The dead switch is there so the optimizer will understand the control flow.

And yes, this would be much more straightforward with a two-function
approach.

Clarification:
==============

>> Also, if some non-C++ language wants to generate coroutines,
>> it might not have to generate the return block at all.

C++ coroutines are flexible. The semantics of a coroutine are defined via
traits, so you may define a coroutine that returns void. It does not have
to return a coroutine handle or some struct that wraps the coroutine handle.

Oh, okay. I haven't actually looked at the spec; I'm mostly just going off
your description of what it does.

-Eli

Hi Eli:

  Block1:
    %0 = call i8 coro.suspend()
    switch i8 %0, label %suspend1 [i8 0, label %return] ; or icmp + br i1
  Suspend1:
    switch i8 %0, label %resume1 [i8 1, label %destroy1] ; or icmp + br i1

This doesn't look right: intuitively the suspend happens after the return
block runs.

Perhaps, but, that is not the intended semantics.

The following two things are distinct concepts:

  * Coroutine is considered suspended

  * Coroutine returns control back to its caller
    (initial caller, resumer, or destroyer)

A coroutine being considered suspended means that it is legal to use the
coro.done, coro.resume, and coro.destroy intrinsics (either in this thread or
in a different one). It may seem strange, but consider this example:

  task<void> f() {
    char buf[N];
    ... fill buff with stuff
    await async_send(buf); // <suspend-point>
    ... do more
  }

IR for await expression will be (conceptually):

  %0 = coro.save()
  async_send(%buf)
  coro.suspend(%0)

coro.save is the point at which the coroutine is considered suspended.
This allows async_send to resume the coroutine even before async_send
returns control back to f.

To make it safe, absolutely nothing should sneak in between coro.save
and coro.suspend beyond what the frontend/user put there.

Aside:

Typo: "and before" should not be in this sentence:

  Since control can reenter the coroutine at any point after coro.save
  and before <<<<, I don't think there is any point where putting the
  switch can help

This sentence should read:

  Since control can reenter the coroutine at any point after coro.save,
  I don't think there is any point where putting the switch can help

I think I got it. Original model (with coro.fork and two-way coro.suspend)
will work with a tiny tweak.

In the original model, I was replacing coro.suspend with br %return in the
original function. The problem was coming from potential phi-nodes introduced
into the return block during optimizations.

Let's make sure that there is only one entry into the return block.
In the original model, ReturnBB had two entries: one from coro.fork,
another from the fall-through from the delete block.

  T coro() {
    %mem = malloc(...);
    if (!coro.fork()) {
      Body Of The Coroutine
      With Suspends and All The Fun
  DeleteBB:
      free(%mem)
      coro.end()
    }
  ReturnBB:
    return some_code_creating_T();
  }

In the original model,
  coro.end is replaced with `no-op` in start, 'ret void' in clones (*)
  suspend is replaced with `br %return` in start, 'ret void' in clones

Let's tweak it. Add an i1 %fallthru parameter to coro.end. Now the behavior is:
  in clones coro.end behaves as before: replaced with 'ret void' (*)
  in original:
    coro.end(false) => no-op (as before). (*)
    coro.end(true) => 'br %return'

Now the body of the coroutine will look like this:

T coro() {
    %mem = malloc(...);
    if (coro.fork()) {
ReturnBB:
      return some_code_creating_T();
    }
    Body Of The Coroutine
    With Suspends and All The Fun
DeleteBB:
    free(%mem)
    coro.end(true)
    UNREACHABLE
  }

Now I don't think any code that has side effects can sneak up from the
"Body of The Coroutine" into ReturnBB. So we are good, right?

Gor

I may have missed something obvious here, but in this case doesn't the
transformed function have a path (DeleteBB -> ReturnBB) that is not
present in the original program? This is problematic since it will
allow the optimizer to do things to the program that will be invalid
once you transform the program. One example is that SSA construction
will be wrong:

T coro() {
    %t = alloca
    %mem = malloc(...);
    (*%t) = 10
    if (coro.fork()) {
ReturnBB:
      %val = *%t
      return some_code_creating_T(%val);
    }
    Body Of The Coroutine
    With Suspends and All The Fun
DeleteBB:
    free(%mem)
    (*%t) = 20
    coro.end(true)
    UNREACHABLE
  }

Now in the above CFG %val can be folded to 10, but in reality you need
a PHI of 10 and 20.

Generally (if I understand your idea), you have a program above whose
ultimate execution cannot be "explained" by these intrinsics executing
like normal function calls; i.e. there is a "magic" edge from DeleteBB
to ReturnBB that isn't present in the CFG.

The thought experiment relevant here is that **could** you implement
your semantics with reasonable bodies for the llvm.coro intrinsics?
It does not ultimately matter whether you do that or you do some fancy
program transformation, but assuming the program **could** have had
the semantics you wanted from the very beginning will (by principle)
prevent the optimizer from breaking you.

-- Sanjoy

Hi Sanjoy:

Now in the above CFG %val can be folded to 10, but in reality you need
a PHI of 10 and 20

Quick answer: folding %val to 10 is OK, but, I would prefer it to be 'undef'
or even catch it in the verifier as it is a programmer/frontend error.

Details are in the second half of my reply. I would like to start
with the thought experiment you suggested, as it might help to explain
the behavior a little better.

The thought experiment relevant here is that **could** you implement
your semantics with reasonable bodies for the llvm.coro intrinsics?

I can emulate **control flow** very closely with user mode thread switching
APIs like make_context, jump_context. Memory behavior not so much.

Library emulation

Small clarification. In coro.* emulation via fibers, the comment on
coro.end(true)

       coro.end(true) // switches back to the thread at ReturnBB <-- this one
       UNREACHABLE // <== never reaches here. So we are honest with optimizer
     }

should read something like:

    coro.end(true) // switches back to the thread that resumed this fiber.

Which will be ReturnBB if we did not hit any suspends on the way to coro.end.
If we did hit a suspend, it is the suspend that will switch back to ReturnBB.

With llvm implementation, it is modeled by replacing coro.suspend and
coro.end(true) with
br ReturnBB in the initial function which represents the first
execution of the coroutine before it reached any suspends.

In the resume function, coro.suspend and coro.end are replaced with
`ret void`, thus modelling returning to whomever resumed us.

To me it looks that it does match fiber behavior exactly.

-- Gor

Hi Gor,

Quick answer: folding %val to 10 is OK, but, I would prefer it to be 'undef'
or even catch it in the verifier as it is a programmer/frontend error.

Ok.

The thought experiment relevant here is that **could** you implement
your semantics with reasonable bodies for the llvm.coro intrinsics?

I can emulate **control flow** very closely with user mode thread switching
APIs like make_context, jump_context. Memory behavior not so much.

Library emulation

Let's build a fiber abstraction out of make_context, jump_context.

i1 fiber_fork(i8* %mem) - clones this function into a fiber
                          returns 0 - in a fiber
                          returns 1 - in a thread
                          uses %mem to store fiber context + fiber stack

I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
setjmp, in that it can "return twice"? If so, I'd urge you to try to
not rely on that kind of behavior. LLVM has unresolved issues around
modeling setjmp like behavior, and if possible, it is preferable to
make all of the control flow explicit from the very beginning.

i1 fiber_suspend() - suspends current fiber, switches back to thread
                          stores the fiber context to be able to resume later

This looks like a normal function call, except that the "return" may
be arbitrarily spaced out from the call?

void fiber_join(i8* %mem, i1 %val) - switches to fiber stack identified by %mem,
                          fiber_suspend will return %val when resumed

void fiber_end() noreturn
                        - switches back to a thread, does not remember fiber
                          context and does not touch %mem.
                          (Thus you can use this after %mem is freed)

Detail: fiber_fork(i8* %mem)
----------------------------
  * Prepares fcontext in %mem utilizing the rest of the memory for stack.
  * Saves current thread context on the current thread stack.
    (We should not store it in %mem, otherwise we won't be
     able to switch back after fiber memory is freed)

As I said above, these following two points can be problematic:

  * Sets up **current thread** context, so that when resumed it will proceed
    as if fiber_fork returned true
  * Sets up **fiber** context so that when resumed, the fiber will proceed
    as if fiber_fork returned false.

Okay, let's map it to our intrinsics:

coro.fork == fiber_fork
coro.suspend == fiber_suspend
coro.end(true) == fiber_end
coro.resume = fiber_join(%mem, i1 false)
coro.destroy = fiber_join(%mem, i1 true)
coro.size == 1,000,000 (or some other big number)

Now let's walk through this example in the fiber model:

   T coro() {
       %t = alloca
       %mem = malloc(...);
       (*%t) = 10
       if (coro.fork()) { // switches to fiber at label Start
   ReturnBB:
         %val = *%t // will be 10, as fiber has its own copy
         return some_code_creating_T(%val);
       }
   Start:
       Body Of The Coroutine
       With Suspends and All The Fun
   DeleteBB:
       (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
       free(%mem)
       (*%t) = 20 ; // undef: as the fiber memory is freed already

Say there was no (*%t) = 20. There is nothing so far that prevents
sinking "(*%t) = 15" from before the free (where it is legal) to after
the free (where it is not). %t is an unescaped alloca, and LLVM can
move around loads from and stores to these (as a first approximation)
as it pleases.

       coro.end(true) // switches back to the thread at ReturnBB
       UNREACHABLE // <== never reaches here. So we are honest with optimizer
     }

Back to the LLVM based implementation

In your example, there will be a path from the block where %t is used to
where %t is defined that crosses a suspend point. Therefore %t will be part of
the coroutine frame. So before the split, the program will be transformed into:

   coro.frame = type { int %t }

   T coro() {
       %t = alloca
       %mem = malloc(...);
       %s = (coro.frame*)coro.init(%mem)
       s->%t = 10
       if (coro.fork()) {
   ReturnBB:
         %val = 10 // const.prop as 10, OK, but really undef
         return some_code_creating_T(%val);
       }
       Body Of The Coroutine
       With Suspends and All The Fun
   DeleteBB:
       free(%mem)
       s->%t = 20 //// BOOM writing to freed memory
       coro.end(true)
       UNREACHABLE
     }

Accessing anything in the coroutine frame after coro.fork returns true is
undefined, as by the time you get to ReturnBB, the coroutine frame could
already be destroyed.

So when you say "some_code_creating_T()" are there restrictions on
what kind of code there could be as part of the creation logic (I've
so far assumed that it can be arbitrary code). If it can't
(semantically) touch anything in the stack frame then why have it as
part of @coro() ?

-- Sanjoy

Hi Sanjoy,

I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
setjmp, in that it can "return twice"?

Yes, user-mode stack switching APIs are somewhat similar to setjmp. Here are
links to a doc page and implementation, just in case you are curious:

http://www.boost.org/doc/libs/1_59_0/libs/context/doc/html/context/context.html
http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_ms_pe_masm.asm
http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_sysv_elf_gas.S

If so, I'd urge you to try to not rely on that kind of behavior. LLVM has
unresolved issues around modeling setjmp like behavior

Absolutely! This was just a thought experiment showing how one could implement
the intrinsics in a library. One of the benefits of compiler-based coroutines
is that they make it possible to avoid all of that mess. Suspend is just a
'ret void', resume is simply a direct (or indirect) call, and the coroutine
state is tiny (assuming you don't use 64K arrays as local variables :-) )
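That "suspend is a ret, resume is a call" shape can be illustrated with a sketch of what a split coroutine might boil down to. The layout and names here (`Frame`, `f_start`, `f_resume`, `f_destroy`) are hypothetical; the real frame layout is decided by the split pass:

```cpp
#include <cassert>
#include <cstdlib>

// A split coroutine as plain data: the frame holds the resume and
// destroy function pointers plus the locals that live across suspends.
// Resuming is a direct (or indirect) call; each suspend inside
// f_resume is just a return to the caller.
struct Frame {
    void (*resume)(Frame *);
    void (*destroy)(Frame *);
    int step;                 // spilled local crossing suspend points
};

void f_resume(Frame *f) {
    f->step++;                // body between two suspend points
    // falling off the end here is the suspend: 'ret void'
}

void f_destroy(Frame *f) {
    std::free(f);             // run cleanups, release the frame
}

Frame *f_start() {            // the ramp: allocate and initialize
    Frame *f = static_cast<Frame *>(std::malloc(sizeof(Frame)));
    f->resume = f_resume;
    f->destroy = f_destroy;
    f->step = 0;
    return f;                 // handed back suspended at suspend #0
}
```

A caller would do `Frame *f = f_start(); f->resume(f); ... f->destroy(f);`, which is the whole runtime machinery: no stack switching, no saved registers, just one heap object and ordinary calls.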

   DeleteBB:
       (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
       free(%mem)
       (*%t) = 20 ; // undef: as the fiber memory is freed already
       coro.end(true)
       UNREACHABLE

Say there was no (*%t) = 20. There is nothing so far that prevents
sinking "(*%t) = 15" from before the free (where it is legal) to after
the free (where it is not). %t is an unescaped alloca, and LLVM can
move around loads from and stores to these (as a first approximation)
as it pleases.

Good point. During CoroSplit I should sink CoroutineFrame deallocation
all the way down to coro.end().

So when you say "some_code_creating_T()" are there restrictions on
what kind of code there could be as part of the creation logic (I've
so far assumed that it can be arbitrary code).

You are right. It *is* arbitrary code. In my reply to you earlier I had a
particular use case in mind, where you should not touch anything in the
coroutine frame in the return block, but that is higher level semantics that
frontend or library imbues coroutine with, not something that LLVM enforces.

The case I had in mind was that a frontend generated an asynchronous coroutine
without RAII semantics. Once launched it will run driven by asynchronous
completions and when it runs to the end it will destroy itself.

Such a coroutine can get suspended, get resumed by a different thread, run to
completion and free the coroutine frame **before** the thread that created
the coroutine manages to get to its return block. In such a coroutine reading
or writing to coroutine frame in return block is undefined behavior.

On the other hand, if it is a synchronous generator that always starts
suspended, then any writes or reads from a coroutine frame in the
return block are perfectly fine.

Since LLVM does not know what kind of coroutine user is trying to build, it
should take a conservative approach. LLVM should not introduce writes or reads
to a coroutine frame in the return block if they weren't there, but, if
frontend put them there, LLVM should not replace them with
`@llvm.traps` and `undef`s either.

Gor

Hi Sanjoy, Eli:

I was thinking a little bit more about your comments related to optimizer
moving code up and down and I think I can generalize the problem.

Background

Hi Eli, Sanjoy:

Quick update. I took your advice to make control flow explicit and
extract parts of the coroutine into subfunctions for protection against
undesirable code motion.

It is working out quite nicely and looks prettier too.

  * all control flow is explicit (or very close to it :-))
  * coro.start now marks the point where all initialization is done and
    coroutine can be suspended;
  * coro.suspend is now three way showing all interesting control flow;
  * things like coro.fork or unreachable after coro.end are no more!

Out of the frontend, IR is structured like this:

Entry:
         Allocation Code
----------- coro.init -----------
         Initialization Code
----------- coro.start ----------
            whatever
               |
               V                     (R) = resume control flow
          coro.suspend               (S) = suspend control flow
          /   |(R)   \(D)            (D) = destroy control flow
     (S) /    V       \
        /  whatever2   |
        |     |        |
        |     V        V
------- | -- coro.delete --------
        |  Deallocation Code
         \____  |
              | |
              V V
----------- coro.end ------------
           whatever 4
           ret Whatever

At EP_EarlyAsPossible, init, delete and return parts are extracted
to protect against any code movement:

      Entry:
              call f.init_part(inputs1, outputs1)
                ...
                ... everything between coro.start and coro.delete
                ... stays put
      DeleteBB:
              f.delete_part(inputs2, outputs2)
      RetBB: %result = f.ret_part(inputs3)
              ret %result

Thank you for all your help!

Gor