CoroSplitPass can get very slow for source files compiled with full debug info (we see 10x to 100x slowdowns in some cases when switching from -g1 to -g2). The slowdown has two causes:
1. Repeated work in metadata processing during coroutine cloning (3 clones per switch coroutine and an arbitrary number of clones for other kinds).
2. Debug info metadata cloning is currently effectively O(Module) rather than O(Function).
Problem #2 was described in this commit together with the idea to revamp metadata ownership so that it is easy to identify metadata owned by a Function and clone it efficiently. This is the right fix conceptually, but it also seems like a larger endeavour.
In the meantime, we can significantly reduce the overhead by pre-calculating and deduplicating some of the work. This is not a fundamental fix to the metadata ownership model, but it seems worthwhile anyway.
I prepared a patch set that makes CoroSplitPass more efficient (see below). The changes are not too invasive but not trivial either, so I’m looking for feedback on the approach.
Each commit in the patch set is individually buildable and reviewable, and I’m happy to submit them as individual PRs in a stack. With that said, I thought that providing high-level context for the whole changeset together with some commentary for groups of commits could be useful.
Anecdata
These numbers are taken from a time trace of a sample C++ source file (it’s a larger one, but this is exactly the kind of file that ends up on the build’s critical path).
Each column corresponds to a certain commit in the patch set (see the commentary below), while rows are time trace scopes.
The final speed up is 18x. However, I think it could be made another 2x faster in a further incremental change if the overall direction makes sense.
| | Baseline / 0 | IdentityMD set / 1 | Prebuilt GlobalDI / 2 | Cached CU DIFinder / 3 |
|---|---|---|---|---|
| CoroSplitPass | 306ms | 221ms | 68ms | 17ms |
| CoroCloner | 101ms | 72ms | 63ms | 0.5ms |
| CollectGlobalDI | - | - | 63ms | 13ms |
| Speed up | 1x | 1.4x | 4.5x | 18x |
The file has hundreds of coroutines, so the effect on the total compile time is dramatic: 2m30s in coroutine processing before vs 9.5s after.
Patch set commentary
Step 0:
The first group of commits is a step-by-step refactoring of CloneFunctionInto, trying to extract reusable pieces out of it. The resulting APIs are not ideal but hopefully good enough / better and a step in the right direction.
(0) [NFC][Coro] Add helpers for coro cloning with a TimeTraceScope
[NFC][Utils] Extract CloneFunctionAttributesInto from CloneFunctionInto
[Utils] Extract ProcessSubprogramAttachment from CloneFunctionInto
[NFC][Utils] Remove DebugInfoFinder parameter from CloneBasicBlock
[NFC][Utils] Clone basic blocks after we’re done with metadata in CloneFunctionInto
[NFC][Utils] Extract BuildDebugInfoMDMap from CloneFunctionInto
[NFC][Utils] Extract CloneFunctionMetadataInto from CloneFunctionInto
[NFC][Utils] Extract CloneFunctionBodyInto from CloneFunctionInto
[Utils] Eliminate DISubprogram set from BuildDebugInfoMDMap
[NFC] Remove adhoc definition of MDMapT in IRMover (<- this one is not strictly necessary)
Step 1:
This commit changes how we communicate global debug info that shouldn’t be cloned to the ValueMapper. Previously, CloneFunctionInto would eagerly identity-map global debug info in a ValueMap to avoid cloning it, but this is expensive and complicates sharing.
With this commit, such global metadata is passed to ValueMapper separately and is identity-mapped on first use. This is needed for the rest of the patch set to work (unless doing it this way is subtly wrong, of course!).
I tried other ways to prime the MD map, but they ended up being a lot slower or harder to manage.
(1) [Utils] Identity map global debug info on first use in CloneFunction*
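To illustrate the difference between eager priming and identity-mapping on first use, here is a minimal standalone sketch. It does not use LLVM's ValueMapper; `Node` and `LazyMapper` are hypothetical stand-ins invented for this example, and the actual patch may be structured quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a metadata node; not an LLVM type.
struct Node {
  std::string Name;
};

// Toy mapper: nodes listed in IdentitySet map to themselves, but the map
// entry is only created when the node is first requested. Nothing is
// inserted up front, so an unused global node costs nothing.
class LazyMapper {
  std::unordered_map<const Node *, const Node *> Map;
  const std::unordered_set<const Node *> &IdentitySet;
  std::vector<std::unique_ptr<Node>> Owned; // storage for cloned nodes

public:
  explicit LazyMapper(const std::unordered_set<const Node *> &S)
      : IdentitySet(S) {}

  const Node *map(const Node *N) {
    assert(N && "expected a non-null node");
    auto It = Map.find(N);
    if (It != Map.end())
      return It->second; // already mapped (identity or clone)
    if (IdentitySet.count(N)) {
      Map.emplace(N, N); // identity-map on first use
      return N;
    }
    Owned.push_back(std::make_unique<Node>(*N)); // clone everything else
    const Node *Clone = Owned.back().get();
    Map.emplace(N, Clone);
    return Clone;
  }

  std::size_t numMapped() const { return Map.size(); }
};
```

Because the identity set is only read, the same set can be handed to several mappers, which is the property the following steps rely on.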
Step 2:
This is a straightforward continuation of Step 1. All coroutine clones share the same set of global debug info, so we build that set once and then pass it directly to the individual CloneFunction* helpers extracted in Step 0.
(2) [Coro] Prebuild a global debug info set and share it between all coroutine clones
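The idea of this step can be sketched without any LLVM types: pay for the module-wide scan once, then reuse the result for every clone. All names below (`Node`, `collectGlobalNodes`, `cloneOnce`, `cloneMany`) are hypothetical and exist only for this illustration; the scan counter is there to make the saving observable.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a metadata node; not an LLVM type.
struct Node {
  bool IsGlobal;
};
using NodeSet = std::unordered_set<const Node *>;

// Stand-in for the expensive, module-sized scan. ScanCount records how
// many times it ran.
NodeSet collectGlobalNodes(const std::vector<Node> &Module,
                           std::size_t &ScanCount) {
  ++ScanCount;
  NodeSet Globals;
  for (const Node &N : Module)
    if (N.IsGlobal)
      Globals.insert(&N);
  return Globals;
}

// Stand-in for cloning one function: returns how many of the nodes it
// uses had to be copied because they are not in the shared global set.
std::size_t cloneOnce(const std::vector<const Node *> &Used,
                      const NodeSet &Globals) {
  std::size_t Copied = 0;
  for (const Node *N : Used) {
    assert(N && "expected a non-null node");
    if (!Globals.count(N))
      ++Copied;
  }
  return Copied;
}

// Build the global set once and share it between all clones, instead of
// rebuilding it inside every cloneOnce call.
std::size_t cloneMany(const std::vector<Node> &Module,
                      const std::vector<const Node *> &Used,
                      unsigned NumClones, std::size_t &ScanCount) {
  const NodeSet Globals = collectGlobalNodes(Module, ScanCount);
  std::size_t TotalCopied = 0;
  for (unsigned I = 0; I < NumClones; ++I)
    TotalCopied += cloneOnce(Used, Globals);
  return TotalCopied;
}
```

With three clones, as for a switch coroutine, the scan runs once rather than three times, while the per-clone work stays proportional to the nodes the function actually uses.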
Step 3:
All global debug info sets from Step 2 share a common core coming from DICompileUnit. We can build it once, cache it, and then reuse it for each run of CoroSplitPass.
I implemented it as a simple module-level analysis, but I’m not sure if the way I wired it is the best (or even the right one!), so I would certainly appreciate input on this.
[Analysis] Add DebugInfoCache analysis
(3) [Coro] Use DebugInfoCache to speed up cloning in CoroSplitPass
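As a rough illustration of the caching idea only, the sketch below memoizes a per-compile-unit set so that repeated requests return the same prebuilt result. It is not the DebugInfoCache analysis from the patch set: `CompileUnit` and `CompileUnitCache` are hypothetical names, node identity is reduced to plain integers, and invalidation when the module changes (which a real analysis has to handle) is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a compile unit and the debug info nodes
// reachable from it (represented here by integer ids).
struct CompileUnit {
  std::string Name;
  std::vector<int> ReachableNodeIds;
};

// Builds the per-compile-unit set on first request and returns the cached
// copy afterwards. std::map keeps references to its values stable across
// later insertions, so callers may hold on to the returned reference.
class CompileUnitCache {
  std::map<const CompileUnit *, std::unordered_set<int>> Cache;
  std::size_t Builds = 0;

public:
  const std::unordered_set<int> &get(const CompileUnit &CU) {
    auto It = Cache.find(&CU);
    if (It == Cache.end()) {
      ++Builds; // the expensive traversal happens only on a cache miss
      It = Cache
               .emplace(&CU, std::unordered_set<int>(
                                 CU.ReachableNodeIds.begin(),
                                 CU.ReachableNodeIds.end()))
               .first;
    }
    assert(It != Cache.end());
    return It->second;
  }

  std::size_t numBuilds() const { return Builds; }
};
```

Each run of the pass would then start from the cached core and only add whatever is specific to the function being split.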
