RFC: Coroutine Optimization Passes

Hi all:

I've included below a brief description of coroutine related optimization
passes and some questions/thoughts related to them. Looking forward to your
feedback, comments and questions.

Thank you!

Roadmap:

Hi all:

I've included below a brief description of coroutine related optimization
passes and some questions/thoughts related to them. Looking forward to your
feedback, comments and questions.

Thank you!

Roadmap:

1) Get agreement on coroutine representation and overall direction.
  .. repeat 1) until happy

  http://lists.llvm.org/pipermail/llvm-dev/2016-June/100838.html (Initial)
  http://lists.llvm.org/pipermail/llvm-dev/2016-July/102133.html (Round 2)

2) Get agreement on how coroutine transformation passes integrate into the
   optimizer pipeline. (this mail) <=== WE ARE HERE
  .. repeat 2) until happy

3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/LangRef.rst

  https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst

4) trickle in coroutine transformation passes + tests in digestible chunks.
5) get clang changes in (sometime after step 3).
6) fix bugs / remove limitations / add functionality to coroutine passes
.. repeat 6) until happy

Overview:

LLVM coroutines are functions that have one or more `suspend points`.
When a suspend point is reached, the execution of a coroutine is suspended
and
control is returned back to its caller. A suspended coroutine can be
resumed
to continue execution from the last suspend point or it can be destroyed.

CoroSplit Pass: (CGSCC Pass @ new extension point EP_CGSCCOptimizerLate)
---------------
Overall idea is that a coroutine is presented to an llvm as an ordinary
function
with suspension points marked up with intrinsics. We let the optimizer
party
on the coroutine as a single function for as long as possible. Shortly
before
the coroutine is eligible to be inlined into its callers, we outline parts
of
the coroutine that correspond to the code that needs to get executed after
the
coroutine is resumed or destroyed.

Seems reasonable. How do you deal with basic blocks which appear to be
used by multiple parts of the coroutine? We handled this in WinEHPrepare
by cloning any BBs which were shared.

We also create a struct that will keep the
objects that need to persist across suspend points. This transformation is
done by CoroSplit CGSCC pass run as a part of the IPO optimizations
pipeline.

CoroElide Pass: (Function Pass @ EP_ScalarOptimizerLate extension point)
---------------
The coroutine resumption intrinsics get replaced with direct calls to those
outlined functions where possible. Then inliner can inline much leaner and
nicer
coroutine or any of the outlined parts as desired.

If we discover that the lifetime of a coroutine is fully enclosed in the
lifetime of the caller, we remove dynamic allocation of coroutine state and
replace it with an `alloca` on the caller's frame.

These optimizations are done by a function pass CoroElide run by the main
IPO
optimization at EP_ScalarOptimizerLate extension point.

CoroEarly Pass: (Function Pass @ EP_EarlyAsPossible)
---------------
Pretty boring pass that lowers coroutine intrinsics that are not needed for
later coroutine passes.

CoroCleanup Pass: (Function Pass @ EP_OptimizerLast)
--------------
Another boring pass that lowers all remaining coroutine intrinsics that
were not
processed by earlier coroutine passes.

Questions / Concerns / Thoughts:

Coroutine attribute or not.
---------------------------
I added a AttrKind::coroutine to flag the pre-split coroutine. The
intention is
to lessen the impact of CoroSpit pass since it can simply check for an
attribute
to learn if there is any work to be done on a function or not. Without the
attribute, it would need to examine every functions body to see if it has
an
llvm.coro.begin. On the other hand, three other coroutine passes have to
look at
every function body anyway to lower other coroutine related intrinsics, so
we
only saving a little. Another negative aspect of having this attribute is
that
there is a possibility of other passes checking if a function is a
coroutine or
not a doing and doing something differently if it is. I'd like to avoid
this
development and keep all coroutine logic in the coroutine passes and not
affect
other optimizations passes. Currently I am leaning towards removing this
attribute.

I would remove the attribute. There are all sorts of tricks you can do to
avoid scanning the function for calls to the intrinsic. For example, you
can see if a declaration of your intrinsic exists and, if so, if it has an
users in the function in question (under the assumption that there are few).

Hi David:

How do you deal with basic blocks which appear to be used by multiple parts
of the coroutine? We handled this in WinEHPrepare by cloning any BBs which
were shared.

I experimented with several approaches, but, cloning ended up being the simplest
and most reliable. Suspend points express three different control flows that
can happen at the suspend point: a suspension (default), resumption (0) and
destruction (1).

  %0 = call i8 @llvm.coro.suspend([..])
  switch i8 %0, label %suspend [i8 0, label %resume,
                                i8 1, label %destroy]

I slap a switch that jumps to all suspend points in the function. Then I clone
the function twice (one will become resume clone, and another destroy clone).
This switch becomes the entry block in the clones. Then, I RAUW coro.suspends
with -1, 0, or 1 (in original, resume and destroy clones respectively) and let
SimplifyCFG do the rest. (This is slightly simplified explanation, but it should
give the idea).

I would remove the attribute. There are all sorts of tricks you can do to
avoid scanning the function for calls to the intrinsic. For example, you
can see if a declaration of your intrinsic exists and, if so, if it has an
users in the function in question (under the assumption that there are few).

Aye-aye. Will remove the attribute.

With respect to lessening the impact of coroutine passes, one approach I tried
was to look during doInitialize whether there are any uses of coroutine
intrinsics and set a flag if there are any, or maybe build a set of functions
with coroutines intrinsics in doInitialize, so that in runOnFunction, I can just
check whether the function is in the set and skip if it is not.

Then, I scared myself silly that some optimization passes can materialize
new functions or new function bodies and I will miss them. So I stopped doing
that.

I think your approach takes care of my "materialization" concern. (BTW, I don't
even know if that is a real concern). But may not be profitable if there are a
lot of small functions that are faster to scan for coroutine intrinsics then to
scan potentially longer list of coroutine intrinsics users.

BTW, Do you have a preference on how to restart CGSCC pipeline? One of the four
options I listed (repeated in P.S), or even some better way I did not think
about?

Thank you,
Gor

P.S.

Option 1: https://reviews.llvm.org/D21569 (no longer relevant, since we are
                                           removing AttrKind::Coroutine)
Option 2: https://reviews.llvm.org/D21570 (bool& Devirt in runSCC)
Option 3: https://reviews.llvm.org/D21572 (virtual bool restatedRequested())
Option 4: Fake devirtualized call in a function pass, so that RefreshSCC will
          detect devirtualization and restart the pipeline by itself.

Hi!
Sorry for jumping in late, but I have a general question (that I perhaps should have asked during round 1):

This proposal jumps straight into the thick of implementation, but I don’t think I’ve seen a motivation of why coroutines need to be represented at the LLVM IR level. Can’t this transform be performed entirely in the front-end?

Vadim

Hi Vadim

Can't this transform be performed entirely in the front-end?

Absolutely it can. But, such a coroutine becomes unoptimizable. Once
split, we lose SSA-form, don't see the control flow clearly. The
wonderful property of the proposed approach is that a coroutine stays
intact looking like a normal function for as long as possible. We let
the optimizer to clean it up, doing constant propagation, constant
folding, CSE, dead code elimination, heap allocation elision, etc.

The frontend does not have the tools to make coroutines efficient.

Cheers,
Gor

Another possible advantage to seeing this in LLVM IR is so that the coroutines could be used in implementing other features in other languages that end up using the LLVM infrastructure without exposing coroutines in the higher level language. This also has implications for systems that JIT using LLVM IR and would like to get the coroutine functionality even in the JIT'ed code.

Cheers

Hi David:

>> How do you deal with basic blocks which appear to be used by multiple
parts
>> of the coroutine? We handled this in WinEHPrepare by cloning any BBs
which
>> were shared.

I experimented with several approaches, but, cloning ended up being the
simplest
and most reliable. Suspend points express three different control flows
that
can happen at the suspend point: a suspension (default), resumption (0) and
destruction (1).

  %0 = call i8 @llvm.coro.suspend([..])
  switch i8 %0, label %suspend [i8 0, label %resume,
                                i8 1, label %destroy]

I slap a switch that jumps to all suspend points in the function. Then I
clone
the function twice (one will become resume clone, and another destroy
clone).
This switch becomes the entry block in the clones. Then, I RAUW
coro.suspends
with -1, 0, or 1 (in original, resume and destroy clones respectively) and
let
SimplifyCFG do the rest. (This is slightly simplified explanation, but it
should
give the idea).

I like it, sounds nice and simple :slight_smile:

>> I would remove the attribute. There are all sorts of tricks you can do
to
>> avoid scanning the function for calls to the intrinsic. For example,
you
>> can see if a declaration of your intrinsic exists and, if so, if it has
an
>> users in the function in question (under the assumption that there are
few).

Aye-aye. Will remove the attribute.

With respect to lessening the impact of coroutine passes, one approach I
tried
was to look during doInitialize whether there are any uses of coroutine
intrinsics and set a flag if there are any, or maybe build a set of
functions
with coroutines intrinsics in doInitialize, so that in runOnFunction, I
can just
check whether the function is in the set and skip if it is not.

Then, I scared myself silly that some optimization passes can materialize
new functions or new function bodies and I will miss them. So I stopped
doing
that.

I think your approach takes care of my "materialization" concern. (BTW, I
don't
even know if that is a real concern).

Functions can be created from nothing in LLVM.

But may not be profitable if there are a
lot of small functions that are faster to scan for coroutine intrinsics
then to
scan potentially longer list of coroutine intrinsics users.

I find that unlikely but we can always benchmark it if we get concerned.
I'd use a naive approach to start out with.
If it shows up on profiles, we can optimize it.

BTW, Do you have a preference on how to restart CGSCC pipeline? One of the
four
options I listed (repeated in P.S), or even some better way I did not think
about?

I'm not an expert in that area. I think you will want someone like
Chandler or Hal to give advice here.

I also haven’t seen this discussion. Can you provide a pointer to the thread where this was discussed?

p.s. If this discussion hasn’t happened, that would definitely be a blocker for any of the specific work discussed below.

Philip

Hi Philip:

Please see:

http://lists.llvm.org/pipermail/llvm-dev/2016-July/102374.html
http://lists.llvm.org/pipermail/llvm-dev/2016-July/102372.html

Thank you,
Gor