Reducing build times for single compilation unit

Hi,

Long-time C++ user, first-time compiler-investigator. I’m particularly interested in improving single-compilation-unit latency, so that the development loop of modifying a single .cc file and then testing it is much faster.

I’ve experimented a bit with -ftime-trace, which has already been quite useful for identifying a few issues (primarily header-only libraries, unsurprisingly, since I do not currently use PCH). My hope is that C++ modules will improve some of these issues (waiting for Bazel support).

But beyond those improvements, I’m trying to understand what else can be done.

A few notes/thoughts - with widely varying difficulty:

  • My code makes quite extensive use of coroutines and lambdas, which in the backend results in CoroConditionalWrapper taking ~15% of the compilation time (~6 seconds on my machine). I’m wondering if there are tips to reduce this. (A minimal sketch of the shape of code I mean follows this list.)
  • Some of my compilation units are relatively large, and I realize I could split them to improve single-compilation-unit build latency - but philosophically I’m curious to explore not needing to. This could come in a couple of flavours:
    • Parallelizing (more of?) the single-compilation-unit build (I saw the large discussion on this here) - so that parallelism does not need to rely on compilation-unit-level parallelism achieved by splitting units apart
      • Within-compilation-unit incremental compilation. To take an extreme (albeit simple) example, if a compilation unit has 1000 functions that do not call each other and I modify one of them, I’d like to avoid most of the cost of recompiling the remaining 999 functions (handwave AST diffing).
      • The extreme would be an entire project in a single compilation unit with reasonable incremental build times.
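
To make the coroutine point concrete, here is a minimal sketch of the shape of code I mean - Task below is a made-up stand-in for my actual coroutine type, and the function is obviously trivial; it is just meant to show coroutines whose bodies are built out of lambdas:

```cpp
// Compile with something like: clang++ -std=c++20 -ftime-trace -c example.cc
// (the trace JSON is written next to the .o).
#include <coroutine>

// Minimal fire-and-forget coroutine type, purely illustrative.
struct Task {
  struct promise_type {
    Task get_return_object() { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

// Lots of my functions look roughly like this: a coroutine whose body is
// composed of small lambdas, each of which the frontend and the coroutine
// lowering passes have to process.
Task Example() {
  auto step1 = [](int x) { return x + 1; };
  auto step2 = [](int x) { return x * 2; };
  int v = step2(step1(20));
  (void)v;
  co_return;
}
```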

And in case general stats are useful (to identify an outlier), I’m seeing an -ftime-trace breakdown of:

  • Source - ~15% - modules/PCH/avoiding header-only libraries should help
  • PerformPendingInstantiations - ~25%
  • CodeGen Function - ~10%
  • CoroConditionalWrapper - ~15%
  • CodeGenPasses/OptModule (without explicitly enabling optimizations) - ~25%

For my larger files, my compilation times are ~35 seconds on my hardware.

Thoughts? I’m curious both about opinions on shorter-term improvements and about reducing the need to split files in the long term. Also, if there are simple starter improvements for compilation latency, I’d be interested to see what’s involved - I did see Improve build times with Clang listed as a project area, but did not find specific proposals for it.

Thanks!

Kyle

One of the possible directions would be to sprinkle more instrumentation around Clang, so that -ftime-trace output becomes more detailed.

I also recently did some timings of Clang for build-performance reasons (splitting TUs is fine, but some expensive TUs have gnarly templating that makes it hard to factor the template instantiations out into their own TU…). One thing that stood out to me was that OptFunction doesn’t seem to be run in a threaded way. I don’t know how much cross-function optimization happens that couldn’t be expressed as a build-like pass over an internal dependency graph, but it looked like there could be some easy wins from inserting some parallelism into the compiler itself.
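
For reference, by “factor the template instantiations to their own TU” I mean roughly the explicit-instantiation pattern sketched below (the names are made up) - it works nicely for concrete, known instantiations, but is hard to apply when the expensive instantiations depend on types that differ per TU:

```cpp
// heavy.h - "Heavy" is a stand-in for an expensive-to-instantiate template.
#pragma once

template <typename T>
struct Heavy {
  T Run(T x) { /* imagine a large, slow-to-instantiate body here */ return x; }
};

// Explicit instantiation declaration: TUs that include this header and use
// Heavy<int> will no longer instantiate it themselves.
extern template struct Heavy<int>;
```

```cpp
// heavy_int.cc - the single TU that actually pays for the instantiation.
#include "heavy.h"

template struct Heavy<int>;  // explicit instantiation definition
```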

Of course, threading inside the tools has interesting interactions with higher-level parallelism like make or ninja that also needs to be considered, as it is easy to end up with N² tasks being run instead of N when each tool thinks it can use nproc as a hint to its own parallelism level.

When you explore PCH, make sure to try -fpch-instantiate-templates and -fpch-codegen/-fpch-debuginfo.
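
In case it helps anyone trying this by hand, here is roughly where those flags fit in the manual PCH workflow from the Clang user’s manual (pch.h and foo.cc are just made-up names):

```cpp
// pch.h - made-up "umbrella" header collecting expensive, rarely-changing includes.
#pragma once
#include <map>
#include <string>
#include <vector>

// Build the PCH (this is where -fpch-instantiate-templates applies):
//   clang++ -std=c++20 -x c++-header pch.h -o pch.h.pch -fpch-instantiate-templates
// Use it when compiling each TU:
//   clang++ -std=c++20 -include-pch pch.h.pch -c foo.cc -o foo.o
// -fpch-codegen/-fpch-debuginfo go further: code/debug info for entities in the
// PCH is emitted once, but you then also need to build and link an object file
// for the PCH itself (see the Clang user's manual for the exact steps).
```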