RFC: enabling `UNITY_BUILD`

TL;DR:

Inspired by “[RFC] Use pre-compiled headers to speed up LLVM build by ~1.5-2x” and “Meta-RFC: Long-term vision for improving build times”, and empowered by Claude, I took a shot at implementing CMAKE_UNITY_BUILD=ON. The results are shockingly good:

The plot shows the results of a synthetic experiment in which I build mlir-opt, clang, and opt, alternating between CMAKE_UNITY_BUILD=OFF and CMAKE_UNITY_BUILD=ON, while passing -j$CORES, i.e., restricting the number of concurrent invocations of my CC (which happens to be clang). Note: lower is better. For normal (common) configurations (~16 cores) I observe a 30% improvement in wall-clock time. At higher core counts the effect decays because the build strategy reduces the inherent parallelism of the build (more below). I’ve tested this on a few other machines at my disposal and the results are robust.

The draft PR (passing, but still in sketch form) is here. Before you click and become aghast at the number of files touched, maybe read/skim below :slight_smile: (or maybe just note that 98% of the files touched are CMakeLists.txt files). The current iteration of the PR, which factors all the exceptions out into a single file, is here: [MLIR][CMake] LLVM_UNITY_BUILD support by makslevental · Pull Request #188403 · llvm/llvm-project · GitHub. Note: the PR does test the change by adding CMAKE_UNITY_BUILD=ON to monolithic-linux.sh (which is how we run tests in pre-commit).

Background

UNITY_BUILD (also sometimes known as “Jumbo build”) is a build strategy which batches up source files for faster compilation. CMake implements this strategy by creating a (set of) unity sources which #include the original sources, and then compiling these unity sources instead of the originals:

// ./utils/TableGen/CMakeFiles/llvm-tblgen.dir/Unity/unity_0_cxx.cxx
#include "llvm-project/llvm/utils/TableGen/AsmMatcherEmitter.cpp"
#include "llvm-project/llvm/utils/TableGen/CodeGenMapTable.cpp"
#include "llvm-project/llvm/utils/TableGen/DAGISelMatcher.cpp"
#include "llvm-project/llvm/utils/TableGen/DAGISelMatcherEmitter.cpp"
...

i.e., effectively concatenating all the source files in a target.
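For illustration, here is roughly how a project opts into this mechanism (a minimal sketch; the target name `Demo` and its sources are made up, but the variable and properties are CMake’s actual knobs):

```cmake
# Enable unity builds globally via -DCMAKE_UNITY_BUILD=ON, or per target:
add_library(Demo a.cpp b.cpp c.cpp)
set_target_properties(Demo PROPERTIES
  UNITY_BUILD ON            # compile generated unity_*.cxx instead of a.cpp etc.
  UNITY_BUILD_BATCH_SIZE 8) # how many original sources go into each unity TU
```

`UNITY_BUILD_BATCH_SIZE` is the knob that trades per-TU header savings against build parallelism, which is relevant to the decay at high core counts noted above.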

How can this be faster? One benefit is that headers included in multiple source files do not have to be reparsed or semantically re-analyzed (template instantiation, etc.). Importantly (no free lunch), this can also be slower: merging/concatenating independent source files/TUs into a single file/TU necessarily reduces the inherent parallelism of the build. Aside from this, concatenating/merging the TUs (potentially) runs into issues related to One Definition Rule (ODR) violations.

Foreground (the sketch PR)

The main blocker to turning UNITY_BUILD on is ODR violations. In BC times (before Claude), I had attempted enabling it by rewriting code to actually eliminate the ODR violations. Suffice it to say, that is/was a Sisyphean task. I had intended to ask Claude to do that rewriting, but it actually suggested something much better: SKIP_UNITY_BUILD_INCLUSION! This is the primary approach taken in the sketch PR: simply exclude the problematic/violating files; this is why the PR is overwhelmingly just changes to CMakeLists.txt files. The remainder is a few #undefs and one or two MLIR source files I was comfortable actually rewriting (moving a copy-pasta’ed function to a header). Here’s a summary of the changes as they stand right now:

  1. lots and lots of SKIP_UNITY_BUILD_INCLUSION;
  2. a few CMAKE_UNITY_BUILD=OFF where there were more problematic source files than non-problematic ones;
  3. #undefing macros which either should’ve been #undefed to begin with (e.g., at the bottom of the source file), or adding #undef DEBUG_TYPE where DEBUG_TYPE comes from some header (almost certainly a mistake in and of itself…);
  4. automatic insertion of #undef DEBUG_TYPE and #undef DBGS via UNITY_BUILD_CODE_AFTER_INCLUDE.
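As a sketch of what these CMakeLists.txt changes look like (target and file names are made up for illustration; the properties themselves are standard CMake):

```cmake
# 1. Exclude a file with ODR conflicts from the generated unity sources;
#    it is then compiled as its own TU, exactly as before.
set_source_files_properties(ProblematicFile.cpp PROPERTIES
  SKIP_UNITY_BUILD_INCLUSION ON)

# 2. Opt a whole target out when most of its sources are problematic.
set_target_properties(MostlyProblematicTarget PROPERTIES UNITY_BUILD OFF)

# 4. Automatically undef per-file macros after each #include'd source.
set_target_properties(SomeTarget PROPERTIES
  UNITY_BUILD_CODE_AFTER_INCLUDE "#undef DEBUG_TYPE\n#undef DBGS")
```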

And that’s it! The PR is completely NFC! One caveat: I’ve patched all of the issues on macOS and Linux, but on Windows we have lots of macro issues via minwindef.h/Windows.h. I tried fixing these using UNITY_BUILD_CODE_AFTER_INCLUDE but it didn’t fix everything. I believe it’s doable overall, but I just don’t have easy access to a Windows machine :person_shrugging:.

Landing/merging/testing strategy

My plan for landing this is to split out the source changes (fixing the uses of DEBUG_TYPE which accidentally rely on defines in headers) from the build/CMakeLists.txt changes. I’m also considering (but not completely certain - hence this RFC!) refactoring this into something that works via add_llvm_library - i.e., adding an LLVM_UNITY_BUILD option and then setting the target property UNITY_BUILD=${LLVM_UNITY_BUILD} inside of add_llvm_library/add_llvm_executable (and adding SKIP_UNITY_BUILD_INCLUSION as a list arg to add_llvm_library). The advantage of this approach is that downstream users could turn on UNITY_BUILD for just LLVM without turning it on for their whole project. The disadvantage is that we certainly have a lot of uses of plain add_library and add_executable. Happy to bikeshed here!
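To make that idea concrete, here is a greatly simplified sketch of what I have in mind (the real add_llvm_library does far more than this; LLVM_UNITY_BUILD and the SKIP_UNITY_BUILD_INCLUSION keyword are the proposed additions, not existing options):

```cmake
# Hypothetical new top-level option:
option(LLVM_UNITY_BUILD "Enable unity builds for LLVM targets" OFF)

# ...consumed inside add_llvm_library (and similarly add_llvm_executable):
function(add_llvm_library name)
  cmake_parse_arguments(ARG "" "" "SKIP_UNITY_BUILD_INCLUSION" ${ARGN})
  llvm_add_library(${name} ${ARG_UNPARSED_ARGUMENTS})
  if(LLVM_UNITY_BUILD)
    set_target_properties(${name} PROPERTIES UNITY_BUILD ON)
    if(ARG_SKIP_UNITY_BUILD_INCLUSION)
      # Opt the listed files out of the generated unity sources.
      set_source_files_properties(${ARG_SKIP_UNITY_BUILD_INCLUSION}
        PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON)
    endif()
  endif()
endfunction()
```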

More importantly, what should be discussed here is the maintenance/testing strategy. Ideally we should test this path in pre-commit, but we could of course live with testing it in post-commit. We might consider actually migrating the three main pre-commit bots to use this path, but I worry that would confuse people when things fail for reasons other than the change they’ve made. Thus, realistically, this path gets tested post-commit, and it’s there that I would need to partner/collaborate with someone in the community. I suspect someone out there might appreciate the reduced infra costs :slight_smile:. My employer does have post-commit bots, but they apparently do not actually alert anyone outside of our org/company when there’s a breakage. In the near term I plan to migrate the nightlies at Workflow runs · llvm/eudsl · GitHub to use this path, so I’ll be able to catch (and fix) breakages, but although I’m excited about this change (naturally), I’m not ready to volunteer to maintain it entirely by myself :slight_smile:. The current consensus is that this should be supported “peripherally” (like Bazel). See the discussion below.

Note: I’m not proposing that going forward we should contort code to satisfy this build path (i.e., injudiciously eliminating ODR violations across TUs just to be able to do UNITY_BUILD). That is exactly the value of SKIP_UNITY_BUILD_INCLUSION. But I would propose that if this lands, a best-effort approach is taken: if it’s straightforward to eliminate the ODR violation then do so, and otherwise just chuck the source file into the SKIP_UNITY_BUILD_INCLUSION bucket. Possibly someone (me?) could come back around on a regular basis and perform small refactors when doing so would “unlock” lots of build-time improvement.

P(re-emptively)AQ

  1. Does this break source/debug/line info?
    • No! Just like source can be tracked through any other header #include, this associates the original line info with everything;
  2. Does this break ccache?
    • No! ccache hashes both the #includes and the source file in its default (direct) mode, and it also has a preprocessor mode which runs the preprocessor and then hashes the result;
  3. Does this make the binary slower/less performant?
    • On the contrary - though I haven’t tested this, it should in theory make things faster, because each TU encompasses more of the program and thus more comprehensive analysis can be performed.
  4. Does this make the binary fatter?
    • Although I haven’t measured (it would be easy to do, I just haven’t done it yet…), it shouldn’t - currently, code duplicated across object files is deduped by the linker when the final shared library or executable is linked. In fact, it should make static archives smaller (I believe the linker does not dedup within static archives).
  5. Does this make linking slower?
    • Again, not sure, haven’t measured, but intuitively I believe it should make linking faster because there are fewer object files to link and fewer symbols in those object files.
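(Relatedly, on the ccache point: hooking ccache in as a compiler launcher works the same for unity sources as for ordinary TUs. A typical setup, for anyone unfamiliar:)

```cmake
# The generated unity_*.cxx files are hashed by ccache just like any
# other TU, including the sources they #include.
set(CMAKE_C_COMPILER_LAUNCHER ccache)
set(CMAKE_CXX_COMPILER_LAUNCHER ccache)
```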

Having claimed the last two points based on “intuition”, which I’m not expecting y’all to take on faith (feel free to tell me I’m wrong!), I’ll run some experiments and update this RFC.

Okay, I invite y’all to try the PR, report back if you see improved build times (or worse?), or just leave general comments/questions/concerns.

7 Likes

I don’t see any discussion of incremental build speed. Unity builds are great with a cold compilation cache, but I think what most developers actually care about, build-speed-wise, is the modify-single-file/compile+link/execute loop, which unity builds hurt.

I don’t think I’d be opposed to maintaining this configuration in principle for people who want to enable it on bots/locally if they’re really interested (i.e., maintain it, but don’t default to it). But given the data above, it seems like there are a large number of ODR violations caused by merging TUs, so it seems like this configuration might be kind of brittle. I’m also not sure how much this would actually help buildbots given all of them should be using some form of compiler caching where incremental builds are also sort of the common case.

1 Like

Yea, it’s true, and I think there’s not much to say here - the partial-compilation story gets worse in direct proportion to the cold-compilation story improving, because the two are directly in opposition (more TUs → better partial-compile perf).

Yea, that’s a fair assessment, especially if/when you look at it in terms of source rewrites (I think I spent probably 8 hours straight just renaming stuff when I tried this previously…). But with the advent of SKIP_UNITY_BUILD_INCLUSION, and just how well it has already worked for existing sources (the total number of objects for mlir-opt going down from ~4400 → ~2200, though I can’t attribute that to specific targets), I think it recovers some (enough?) credibility regarding reasonable maintenance costs. With SKIP_UNITY_BUILD_INCLUSION I see this as something that can be incrementally improved over time (bucket some things now and possibly come back to them). At worst it’s a lot of extra no-op CMake, which is not nothing in terms of aesthetics, but it’s neither an “abstraction cost” nor a maintenance cost. Of course that’s just worst-case analysis - if no one else were interested in maintaining this, then I would close the PR and end the RFC.

For pre-commit bots, which keep their caches extremely hot, this is certainly true - (s)ccache is all we’ll ever need. But for other bots I wouldn’t be so sure. Those nightlies I maintain very rarely reap any benefit from ccache, even though I’ve labored and labored to optimize the ccache config. Now, nightly is a very infrequent build periodicity in the universe of LLVM (where we sometimes have 100 commits a day), but even if we did have fewer commits (and therefore more cache-amenable patterns), it doesn’t take much to invalidate the cache for tons (most?) of the objects (just change a header in ADT…). So I think there would be some benefit for builders.

But yes truthfully the people I have in mind who will be most affected by this are those building for the first time, pulling and building infrequently, unaware of solutions like ccache, and just generally having fewer cores (namely students, hobbyists, and just people not “blessed” with a surfeit of compute resources).

FYI, there’s a factoring of this PR (I think - I haven’t actually tried it) which stuffs all of these exclusions into a single file - since all of these are properties on the source files themselves, they don’t actually have to live adjacent to the targets. So conceivably this could be a CMake module entirely unto itself. I was reluctant to start with that version because:

  1. moving the exclusions away from the target makes them harder to reason about;
  2. moving them into their own file I think would all but guarantee that no one ever actually tries to resolve one - i.e., it would become some kind of ritual “just add this file to UnityBuild.cmake”. Though I guess someone could be code-owner of that file and then play hall-monitor on it :person_shrugging:.

EDIT:

For comparison, here is the version which pushes all the exclusions to a single UnityBuild.cmake module: [MLIR][CMake] LLVM_UNITY_BUILD support by makslevental · Pull Request #188403 · llvm/llvm-project · GitHub. Definitely cleaner to land but yea as I said maybe “psychologically” harder to maintain?

I think if there are meaningful benefits to a subset of developers for this option (and the data seems to suggest there are), I’d support this going in, but only if it didn’t impact those who don’t want the option. In other words, I wouldn’t want a build bot to start producing failure notifications because a developer didn’t remember to check this mode. It feels therefore like this belongs in the peripheral tier, in a manner similar to support for Bazel builds. You could even mirror how the Bazel stakeholders have set up an AI agent to make the required fixes to keep the unity build clean, either via adding to the “skip” file or by fixing the ODR issues.

2 Likes

While unity builds could be of value if someone regularly does clean builds, they also expose some unfortunate trade-offs to a much greater extent than PCHs regarding naming collisions. If we add support for unity builds upstream, I agree that it should go into the peripheral support tier, mainly because I feel that we have already too many CMake configurations that we need to care about. From the analysis below, I believe that some of the observed improvements could be achieved by using PCH in a few more places, which we, IMHO, should explore first.


Detailed Perf Analysis

Build Clang Rel+Assert X86;AArch64 dylib

cmake -DLLVM_ENABLE_PROJECTS="llvm;clang" -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -G Ninja -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=ON -DLLVM_ENABLE_DUMP=ON -DLLVM_ENABLE_FFI=ON -DLLVM_USE_LINKER=lld (...) && ninja -j47 clang
no unity: 7218.83user 338.09system 2:51.72elapsed 4400%CPU (0avgtext+0avgdata 1020392maxresident)k
unity:    5123.21user 157.68system 2:26.55elapsed 3603%CPU (0avgtext+0avgdata 1277256maxresident)k

Breakdown of CPU time:

                                    no unity   |     unity      | diff
                                 time    count |  time    count |
ExecuteCompiler                  7278,    3086 |  5071,     763 | -2207 -30.3%
  Frontend                       3930,    6172 |  2046,    1526 | -1884 -47.9%
    Source                       1876,   16993 |   412,    3021 | -1464 -78.0%
    PerformPendingInstantiations 1332,    3086 |   712,     763 |  -620 -46.5%
    CodeGenFunction               337, 5076609 |   254, 3926082 |   -83 -24.6%
  Backend                        3296,    3086 |  3004,     763 |  -292  -8.9%
    Optimizer                    2351,    3086 |  2100,     763 |  -251 -10.7%
    CodeGenPasses                 939,    3086 |   901,     763 |   -38  -4.0%

Compile times per CU:
no unity: avg:2.35s; med:1.27s; max:51.08s
unity:    avg:6.64s; med:3.80s; max:70.73s

Primary benefits are:

  • Less duplicate header parsing and template instantiations; this accounts for >80% of the gains.
  • Fewer generated IR functions due to less duplicated inline functions and therefore also fewer redundant functions to simplify; this is ~15% of the gains.

In Clang, some further reduction potential exists with PCH for clangCodeGen, clangSema, and clangLex - I haven’t implemented this so far, because I don’t build Clang too often. The currently most expensive headers (those over 120s) are:

CPU time  count  header
122s        463  clang/Basic/SourceManager.h
125s         87  clang/lib/CodeGen/CodeGenFunction.h
150s        157  clang/StaticAnalyzer/Core/BugReporter/BugReporter.h
173s        160  clang/Sema/Sema.h
181s        312  clang/Lex/Preprocessor.h

MLIR could also benefit from using PCH, e.g., parsing mlir/Dialect/Arith/IR/Arith.h (and transitive includes) 598 times takes 663s of 8181s in total for building mlir-opt. Also worth noting, instantiating MapVector<TypeID, std::unique_ptr<DialectExtensionBase>>::try_emplace across 1095 CUs takes 47s; the single AST instantiation of mlir::Dialect::addOperations<mlir::spirv::* takes 21.5s. (Aggregated time trace data for mlir-opt (phase breakdown, expensive headers, expensive template instantiations): parsed-time23-mlir.txt (551.2 KB))
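(For reference, wiring up a PCH for one such header would look roughly like the following. The target/header pairing is illustrative only, not a tested configuration, and which target should own the PCH is exactly the kind of thing those PRs would need to work out:)

```cmake
# Sketch: precompile an expensive, widely included header for one target.
# The angle-bracket form treats it like a system include; REUSE_FROM could
# additionally share one PCH across targets with compatible compile flags.
target_precompile_headers(MLIRArithDialect PRIVATE
  <mlir/Dialect/Arith/IR/Arith.h>)
```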

2 Likes

In this case we should focus on the alternative version of the PR I “clauded” up last night (the one that moves all the exceptions into a UnityBuild.cmake).

I think this is probably doable given my org’s recent push to incorporate “GenAI” into our dev workflows but I just currently have no formal relationships with any team that might own this kind of thing. That’s fine - this is a good excuse to start building those relationships! In the meantime, so that we’re not blocked on that for landing, I can maintain this (by catching breakages) via the nightlies at eudsl.

I agree that there is already a panoply of CMake options, so I’m fine with this being “peripherally” supported (as mentioned above), but is there a reason you’d prefer PCH more strongly over UNITY_BUILD (aside from the aforementioned maintenance burden, which I’ve committed to managing)? My intuition is that the two approaches overlap but aren’t completely equivalent, because of the added benefit of better optimization opportunities given bigger TUs (@nikic I think will soon be able to report on this aspect).

By the way, thank you for the much more thorough performance analysis - do you have a script somewhere I could use to repro?

Yea, that one is funny - I too have noticed before that the SPIR-V dialect has way too many ops, but I’m not an expert - I’m not sure why that is the case.

In general, re MLIR specifically, if you share the details on how you produce the trace I can send PRs for PCHs for “offending” headers.

Do we expect that particular difference to be meaningful if you’re building with (Thin)LTO?

On the other hand, it’s basically the same tradeoff that causes incremental builds to get worse, but also for clean cached builds. It doesn’t break caching correctness and if you got 100% hits before you still get 100% hits, but it means that any cache misses that exist have a huge impact since they affect every TU that gets bundled together.

1 Like

I am not super familiar with how LTO actually works (something something keeps around the IR until the link stage?) so I can’t really comment but if you can suggest some representative workloads on the final linked clang I can take a shot at running them and comparing ThinLTO vs UNITY.

Yes, as I said above, if you have ccache at your disposal then you don’t need this or anything else. This solution is for when you don’t always have ccache available: either you don’t know about it or your cache will be cold regardless or you’re simply not building in an environment that allows you to cache (which surprisingly might be the case in some corporate prod build environments…).

LTO keeps all the IR together through the middle end and the backend after doing IR level linking of all the TUs involved. So this should be strictly worse than monolithic LTO from a performance perspective.

ThinLTO runs part of the middle-end on each TU, has each job generate summaries which then get processed in a single thin-link job, and summaries are then distributed back to the TUs, where a second phase of compilation runs the rest of the middle-end and back-end. These summaries allow cross-module optimizations via data propagation, in particular importing functions from other modules for inlining. I would be kind of surprised if unity builds even reached parity with ThinLTO, and extremely surprised if profiles were involved.

2 Likes

I ran the comparison on mlir-opt specifically:

Link time “no unity” (10 runs)

  ┌────────┬──────────┐
  │  Stat  │  Value   │
  ├────────┼──────────┤
  │ Mean   │ 0.5822 s │
  ├────────┼──────────┤
  │ Median │ 0.5795 s │
  ├────────┼──────────┤
  │ Stddev │ 0.0080 s │
  └────────┴──────────┘

Link time “with unity” (10 runs)

  ┌────────┬──────────┐
  │  Stat  │  Value   │
  ├────────┼──────────┤
  │ Mean   │ 0.5683 s │
  ├────────┼──────────┤
  │ Median │ 0.5640 s │
  ├────────┼──────────┤
  │ Stddev │ 0.0128 s │
  └────────┴──────────┘

Lib sizes “with unity” vs. “no unity”

  ┌────────────────────────┬───────────────────┐
  │         Metric         │       Value       │
  ├────────────────────────┼───────────────────┤
  │ Average change per lib │ -19.2 KB (-3.30%) │
  ├────────────────────────┼───────────────────┤
  │ Total no-unity size    │ 301.8 MB          │
  ├────────────────────────┼───────────────────┤
  │ Total unity size       │ 295.2 MB          │
  ├────────────────────────┼───────────────────┤
  │ Total change           │ -6.7 MB (-2.20%)  │
  └────────────────────────┴───────────────────┘

  Unity build produces smaller libs on average. Notable reductions:

  - libMLIROpenACCTransforms: -34.2%
  - libMLIRGPUTransforms: -28.1%
  - libMLIRTransforms: -21.6%
  - libMLIRAffineTransforms: -23.0%
  - libMLIRXeGPUTransforms: -20.0%

Only one lib got larger: libMLIRSPIRVSerialization (+20 KB, +3%). Most others were identical or smaller.

Data

Improved PCH reduces incremental build times, which is what most developers care about and helps with several CI configurations.

That said, for clean builds PCH will always be worse than unity builds and I’m not against landing this, I just think we should do both.

Hardly a script… set up a new build dir, compile with -DCMAKE_C_FLAGS="-ftime-trace -ftime-trace-granularity=100" -DCMAKE_CXX_FLAGS="-ftime-trace -ftime-trace-granularity=100", then run find llvm-build-tmp/ -name \*.c\*.json | xargs -n40000 -s2093416 python3 parse-clang-time-trace.py > parsed. The data presented here is extracted semi-manually.
parse-clang-time-trace.py.txt (1.2 KB)

1 Like

100% agreed. And thanks for the script - I’ll start using/profiling and sending PRs to improve PCH coverage.

Some data from the LLVM compile-time tracker:

  • Stage2 clang builds ~15% faster, relative to PCH baseline. (Note that this is a ThinLTO build, so about 30% of the total time goes into linking.)
  • The clang binary size stays ~same for GCC host compiler (stage1), and becomes 4% larger for Clang host compiler (stage2).
  • The compiler is 2% slower for GCC host compiler (stage1), and stays ~same for Clang host compiler (stage2).

So I guess the answer to the performance and size question is “it depends”.

For reference the PCH build improved time to build clang by 40% and had no impact on binary size or performance.


I think the main thing that’s not super clear to me is what the motivation for unity builds over using PCH is. It looks like the unity build does provide some build time improvement over PCH right now, but I would expect that this gap can largely be closed by defining more PCHs rather than being a fundamental limitation.

The big advantage of PCH over unity builds I see is that they do not impose any undesirable requirements on the codebase. For PCH, all we need is that headers do not define conflicting symbols, which is something that we already want anyway (as otherwise some headers cannot be included together). For unity builds, we need that source files do not define conflicting symbols, including local ones (static / anonymous namespace). Avoiding such conflicts is generally a non-goal, and it would be unreasonable to impose such a requirement. So what we have to do instead is to opt out any files with conflicts.

For that reason, I agree with others that if we support this, it should be supported in the peripheral tier. We shouldn’t make people deal with “I added a static variable to a .cpp file and now the unity build fails”.


cc @AaronBallman who I expect has some opinions on more peripheral tier build options.

1 Like

Maybe I’m interpreting @aengelke’s results incorrectly but I think this part is a fundamental difference:

Fewer generated IR functions due to less duplicated inline functions and therefore also fewer redundant functions to simplify; this is ~15% of the gains

Correct me if I’m wrong but this comes from merging the TUs right? So PCH wouldn’t enable this kind of savings.

More broadly, I do agree PCH is a great solution - as I said, I plan to send a PR for at least MLIR using @aengelke’s handy script as a guide. I simply want to pick up all of the build performance lying on the floor :smiley:, i.e., even with very good PCH coverage there will still be a long tail of headers we don’t promote (to a specific PCH header), and unity build (for its users) will still yield improved build performance on those headers.

I’ve come around to the requirement that this should be peripherally supported - I plan to have some discussions with folks in my org (for whom it might be valuable) about maintaining the UnityBuild.cmake in the current iteration of the PR.

Regarding the “fixing a broken UNITY_BUILD” strategy, how do you (and others) feel about these rules:

  1. duplicated functions which collide can be unified (i.e., deduped and exposed in a common header);
  2. macros can be #undefed;
  3. anything else we want to allow???
  4. everything else (e.g., no renaming) gets “skipped” (i.e., annotated with SKIP_UNITY_BUILD_INCLUSION).

+1, but I personally hope we don’t support it at all. This isn’t the same thing as Bazel, etc., because those don’t require source-code changes to support. As a reviewer, I would not want additional work landing on my plate in support of keeping a unity build working; I’m already drowning in reviews for work which isn’t peripheral. Bikeshedding renames because of internal-linkage conflicts that cannot happen in other build modes, risk of downstream merge conflicts, etc., don’t seem justified unless we think the build mode is sufficiently valuable to make supporting it required, and I don’t think the benefits justify that level of support compared to PCH builds.

You beat me by a couple of minutes :smiley: - see my comment just above yours on the “rules” I’m proposing for “fixing” unity build.

This seems reasonable in general (I like removing code duplication!), but still comes with potential costs. e.g., if the same functionality exists in two siblings in our library layering; now we have to ensure the common header is in the parent library layer, which may expose the interface more broadly than we’d like. But also, “the same” is load bearing; the functionality might exist twice but with very subtle differences that are not easy to tease apart when refactoring. This is all stuff reviewers will have to think about, and maybe that’s reasonable in this case because code duplication is a code smell, but I’m still not certain that supporting unity builds is the reason to do that work.

There’s no contradiction here: PCH can (in principle, and very likely in practice) achieve a substantial part of the observed gains (theoretically up to 80%, realistically maybe 50-60%) - I think this is what nikic is referring to when saying “this gap can largely be closed by defining more PCHs”. PCH cannot, fundamentally, reduce the number of IR-generated inline functions that need to be redundantly optimized (~15% of the gains). So the minimum improvement of unity builds over a theoretically optimal PCH configuration is 15% of 15-25% = 2-4%.