Asking for advice: how to best place the CoroSplit pass

Hi,

The current way of scheduling CoroSplit pass is not ideal, and in some
cases can lead to significant slowdown of the compiler.
I would like to ask for advice/feedback on what might be the best way
to structure the passes so that CoroSplit pass is put in the right
place. First of all, let me explain the current situation:

CoroSplit pass is a pass that splits a coroutine function into
multiple functions, in order to implement the semantics of coroutines.
Right now, CoroSplit pass is added in the InlinerPipeline, right after
the Inliner pass (PassBuilder::buildInlinerPipeline). So it looks like
this:
- ModuleInlinerWrapperPass
    - A few global analysis
    - CGSCCPipeline
        - Inliner
        - CoroSplitPass
        - FunctionSimplificationPipeline

It's important to note that when the first time a coroutine function
goes through CoroSplitPass, it will not be split. Instead, it will
just add its belonging SCC back to the pipeline and hence repeat the
entire CGSCC pipeline. The goal for doing so is to allow coroutines to
be optimized twice, first time before splitting, so that it can have
smaller frame size, and second time after splitting, so that the
splitted functions can get cleaned up.
Furthermore, CoroSplitPass needs to be in CGSCC pipeline in order to
enable coroutine elision (an important coroutine optimization to
reduce dynamic memory allocation), that is, we want to walk through
SCCs postorder so that functions can inline fully processed coroutines
and does CoroElision.
A side-effect of this structure, however, is that all the
non-coroutines that are in the same SCC of any coroutine will run
through Inliner twice. This can lead to unexpected slowdown if an
aggressive inlining threshold is used and generates unexpectedly large
functions. There is also not much benefit for coroutines to run
through Inliner pass twice either.

So the ideal goal looks like this:
For coroutine functions, run through:
  Inliner -> function simplification passes -> CoroSplit -> function
simplification passes
For non-coroutine functions, just run through as usual:
  Inliner -> function simplification passes.

The challenge is of course when they are in the same SCC it's hard to
separate them like this. Any thoughts on how we might be able to
achieve that?

Hi,

Do you have to use one CGSCCPipeline? If not, you could run function simplification passes on coroutines in CoroSplit pass like this:

- ModuleInlinerWrapperPass
    - A few global analysis
    - CGSCCPipeline
        - Inliner
        - FunctionSimplificationPipeline
    - CGSCCPipeline
        - CoroSplitPass
          - FunctionSimplificationPipeline (on coroutines)

- Yuanfang

SCCs containing a coroutine function and a non-coroutine function should be fairly rare right? That means you have mutually recursive functions, which shouldn’t happen a lot in typical code. Are you seeing this come up a lot?

I’m not familiar with how the legacy CGSCC pass manager schedules passes on newly outlined SCCs, but I did notice that for the new PM, CoroSplit doesn’t enqueue any potentially split SCCs, meaning that split functions don’t have the pipeline run on them (I think… I didn’t actually run and check). I think ideally, the pipeline would be

inliner → function simplification → corosplit (without skipping once)

and in corosplit you’d enqueue newly created SCCs, and also the current SCC. The CGSCC order is so all child SCCs of the current SCC are fully optimized, and that may not be the case if we don’t rerun the pipeline on the original SCC after running it on the newly created one. So I think running the pipeline twice is necessary if you really want to follow the CGSCC vision. Running the inliner and other optimizations on the original coroutine function could provide new results now that the function itself has been split. This would also allow the removal of the postSplitCleanup() hack.

If you think that running the CGSCC pipeline after the split wouldn’t really further improve the coroutine function, an alternative is to use the same pipeline above but not enqueue the newly created SCCs and the current one, and do some ad-hoc simplification on them like postSplitCleanup(), as opposed to running the whole CGSCC pipeline. This is fairly similar to the current state, just changing the CoroSplit position in the pipeline.

SCCs containing a coroutine function and a non-coroutine function should be fairly rare right? That means you have mutually recursive functions, which shouldn't happen a lot in typical code. Are you seeing this come up a lot?

I don't know how often this is, the case I was seeing has it and the
code size exploded in the end.
One thing that's important to mention is that in the current setup, a
coroutine will run through Inliner twice before even going through the
real CoroSplit pass. I believe that's unintended and certainly not
beneficial.

I'm not familiar with how the legacy CGSCC pass manager schedules passes on newly outlined SCCs, but I did notice that for the new PM, CoroSplit doesn't enqueue any potentially split SCCs, meaning that split functions don't have the pipeline run on them (I think... I didn't actually run and check). I think ideally, the pipeline would be

For NewPM, the SCC is enqueued back to the pipeline here:
https://github.com/llvm/llvm-project/blob/51457cd50624a5f3f684b80c2ce6feff1a2b3901/llvm/lib/Transforms/Coroutines/CoroSplit.cpp#L2056

  inliner -> function simplification -> corosplit (without skipping once)

and in corosplit you'd enqueue newly created SCCs, and also the current SCC. The CGSCC order is so all child SCCs of the current SCC are fully optimized, and that may not be the case if we don't rerun the pipeline on the original SCC after running it on the newly created one. So I think running the pipeline twice is necessary if you really want to follow the CGSCC vision. Running the inliner and other optimizations on the original coroutine function could provide new results now that the function itself has been split. This would also allow the removal of the postSplitCleanup() hack.

This sounds clean and is likely better than the current approach.
Though it's still hard to imagine there are any benefits running
Inliner on coroutines again after CoroSplit.
Now I am getting a bit confused on what's happening after the second
CoroSplit in the current implementation. When CoroSplit splits the
coroutine and generates new SCCs, do these newly generated SCCs run
through the rest of the CGSCC pipeline today?