TL;DR:
Inspired by [RFC] Use pre-compiled headers to speed up LLVM build by ~1.5-2x and Meta-RFC: Long-term vision for improving build times, and empowered by Claude, I took a shot at implementing CMAKE_UNITY_BUILD=ON. The results are shockingly good:
The plot shows the result of a synthetic experiment where I build mlir-opt, clang, and opt, alternating between CMAKE_UNITY_BUILD=OFF and CMAKE_UNITY_BUILD=ON, while passing -j$CORES, i.e., restricting the number of concurrent invocations of my CC (which happens to be clang). Note: lower is better. For common configurations (~16 cores) I observe a 30% improvement in wall-clock time. At higher core counts the effect decays because the build strategy reduces the inherent parallelism of the build (more below). I’ve tested this on a few other machines I have at my disposal and the results are robust.
The draft PR (in passing but still sketch form) is here. Before you click and become aghast at the number of files touched, maybe read/skim below (or maybe just note that 98% of the files touched are CMakeLists.txt files).
The current iteration of the PR, which factors all the exceptions out into a single file, is here: [MLIR][CMake] LLVM_UNITY_BUILD support by makslevental · Pull Request #188403 · llvm/llvm-project · GitHub. Note: the PR does test the change by adding CMAKE_UNITY_BUILD=ON to monolithic-linux.sh (which is how we run tests in pre-commit).
Background
UNITY_BUILD (also sometimes known as “Jumbo build”) is a build strategy which batches up source files for faster compilation. CMake implements this strategy by creating a (set of) unity sources which #include the original sources, and then compiling these unity sources instead of the originals:
// ./utils/TableGen/CMakeFiles/llvm-tblgen.dir/Unity/unity_0_cxx.cxx
#include "llvm-project/llvm/utils/TableGen/AsmMatcherEmitter.cpp"
#include "llvm/utils/TableGen/CodeGenMapTable.cpp"
#include "llvm-project/llvm/utils/TableGen/DAGISelMatcher.cpp"
#include "llvm-project/llvm/utils/TableGen/DAGISelMatcherEmitter.cpp"
...
i.e., effectively concatenating all the source files in a target.
How can this be faster? One benefit is that headers included in multiple source files do not have to be reparsed or semantically re-analyzed (template instantiation etc.). Importantly (no free lunch), this can also be slower: merging independent source files/TUs into a single TU necessarily reduces the inherent parallelism of the build. Aside from this, merging the TUs can run into issues related to One Definition Rule (ODR) violations; for example, two sources that each define a file-local helper with the same name compile fine separately but collide once they share a TU.
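To make the collision concrete, here is a minimal sketch of what one unity TU might look like after two sources are pasted together. The file names `A.cpp`/`B.cpp`, the helper `scale`, and the wrapper namespaces are all hypothetical and chosen only for illustration; none of them come from LLVM or from the PR. Without the wrapper namespaces, the second `static int scale(int)` would be a redefinition inside the same TU and the unity source would fail to compile.

```cpp
// Sketch of a single unity TU after two hypothetical sources (A.cpp and
// B.cpp, names invented for this example) have been concatenated.
#include <cassert>

// ---- would-be contents of A.cpp ----
namespace a_detail {  // hypothetical wrapper; without it `scale` clashes below
static int scale(int x) { return x * 2; }
}  // namespace a_detail
int runA(int x) { return a_detail::scale(x); }

// ---- would-be contents of B.cpp ----
namespace b_detail {  // same helper name as in A.cpp, different body
static int scale(int x) { return x * 3; }
}  // namespace b_detail
int runB(int x) { return b_detail::scale(x); }

// Sanity check: each entry point still reaches its own helper.
void checkUnitySketch() {
  assert(runA(2) == 4);
  assert(runB(2) == 6);
}
```

Giving each helper its own scope is only one way to resolve such a clash; leaving the offending file out of the unity batch avoids touching the source at all.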
Foreground (the sketch PR)
The main blocker to turning UNITY_BUILD on is ODR violations. In BC times (before Claude), I had attempted enabling it by rewriting code to actually eliminate the ODR violations. Suffice it to say, that is/was a Sisyphean task. I had intended on asking Claude to do that rewriting but it actually suggested something much better: SKIP_UNITY_BUILD_INCLUSION! This is the primary approach taken in the sketch PR: simply exclude the problematic/violating files; this is why the PR is overwhelmingly just changes to CMakeLists.txt. The remainder is a few #undefs and one or two MLIR source files I was comfortable actually rewriting (moving a copy-pasta’ed function to a header). Here’s a summary of the changes as they are right now:
- lots and lots of SKIP_UNITY_BUILD_INCLUSION;
- A few CMAKE_UNITY_BUILD=OFF where there were more problematic source files than non-problematic source files;
- Basically #undefing macros which either should’ve been undeffed to begin with (e.g., at the bottom of the source file) or #undef DEBUG_TYPE for uses of DEBUG_TYPE from some header (almost certainly a mistake in and of itself…);
- An automatic insertion of #undef DEBUG_TYPE and #undef DBGS via UNITY_BUILD_CODE_AFTER_INCLUDE.
And that’s it! The PR is completely NFC! One caveat: I’ve patched all of the issues on macOS and Linux but on Windows we have lots of macro issues via Minwindef.h/Windows.h. I tried fixing these issues using UNITY_BUILD_CODE_AFTER_INCLUDE but it didn’t fix them completely. I believe overall it’s doable but I just don’t have easy access to a Windows machine.
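The `#undef DEBUG_TYPE` insertion mentioned in the list above can be sketched as follows. This shows roughly what a unity TU looks like once an `#undef` is placed after each included source; the pass names, file names, and accessor functions are hypothetical and exist only so the example can be checked. Without the `#undef` lines, the second `#define DEBUG_TYPE` would redefine the macro with a different value, which compilers diagnose as a macro redefinition.

```cpp
// Sketch of a unity TU in which each hypothetical source defines DEBUG_TYPE.
#include <cassert>
#include <cstring>

// ---- would-be contents of PassA.cpp (hypothetical file) ----
#define DEBUG_TYPE "pass-a"
const char *passADebugType() { return DEBUG_TYPE; }
// ---- the kind of line UNITY_BUILD_CODE_AFTER_INCLUDE would add here ----
#undef DEBUG_TYPE

// ---- would-be contents of PassB.cpp (hypothetical file) ----
#define DEBUG_TYPE "pass-b"
const char *passBDebugType() { return DEBUG_TYPE; }
// ---- and again after the second source ----
#undef DEBUG_TYPE

// Sanity check: each source saw its own DEBUG_TYPE value.
void checkDebugTypeSketch() {
  assert(std::strcmp(passADebugType(), "pass-a") == 0);
  assert(std::strcmp(passBDebugType(), "pass-b") == 0);
}
```

Note that this only helps when every use of the macro sits in the same source as its `#define`; a header that relies on the includer's DEBUG_TYPE is the case the list above flags as needing a source fix.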
Landing/merging/testing strategy
My plan for landing this is to split out the source changes (fixing the uses of DEBUG_TYPE which are accidentally using defines in headers) from the build/CMakeLists.txt changes. I’m also considering (but not completely certain - hence this RFC!) refactoring into something that works via add_llvm_library - i.e., adding LLVM_UNITY_BUILD and then setting the target property UNITY_BUILD=${LLVM_UNITY_BUILD} inside of add_llvm_library/add_llvm_executable (and adding SKIP_UNITY_BUILD_INCLUSION as a list arg to add_llvm_library). The advantage of this approach is downstream users could turn on UNITY_BUILD for just LLVM without turning it on for their whole project. The disadvantage is we certainly have a lot of uses of just add_library and add_executable. Happy to bikeshed here!
More importantly, what should be discussed here is the maintenance/testing strategy. I think ideally we should test this path in pre-commit but we could of course live with testing this in post-commit. We might consider actually migrating the three main pre-commit bots to use this path but I worry that that’ll confuse people when things fail for reasons other than the change they’ve made. Thus, realistically this path gets tested post-commit and it’s there that I would need to partner/collaborate with someone in the community. I suspect someone out there might appreciate the reduced infra costs. My employer does have post-commit bots but they apparently do not actually alert anyone outside of our org/company when there’s a breakage. In the near-term I plan to migrate the nightlies at Workflow runs · llvm/eudsl · GitHub to use this path and so I’ll be able to catch breakages (and fix) but although I’m excited about this change (naturally) I’m not ready to volunteer to maintain it entirely by myself.
The current consensus is this should be supported “peripherally” (like Bazel). See discussion below.
Note: I’m not proposing that going forward we should contort code to satisfy this build path (i.e., injudiciously eliminating ODR violations across TUs just to be able to do UNITY_BUILD). That is exactly the value of SKIP_UNITY_BUILD_INCLUSION. But I would propose that if this lands, a best-effort approach is taken: if it’s straightforward to eliminate the ODR violation then do so, and otherwise just chuck the source file into the SKIP_UNITY_BUILD_INCLUSION bucket. Possibly someone (me?) could come back around on a regular basis and perform small refactors when doing so would “unlock” lots of build time improvement.
P(re-emptively)AQ
- Does this break source/debug/line info?
    - No! Just like source can be tracked through any other header #include, this associates the original line info with everything;
- Does this break ccache?
    - No! ccache hashes both the #includes and the source file in default (direct) mode and also has a preprocessor mode which will run the preprocessor and then hash;
- Does this make the binary slower/less performant?
    - On the contrary, though I haven’t tested, this should in theory make things faster because the TU encompasses more of the program and thus more comprehensive analysis can be performed.
- Does this make the binary fatter?
    - Although I haven’t measured (it would be easy to do, I just haven’t done it yet…), it shouldn’t: currently duplicated code across object files is deduped by the linker when the final shared library or executable is linked. In fact, it should make static archives smaller (I believe the linker does not dedup in static archives).
- Does this make linking slower?
    - Again, not sure, haven’t measured, but intuitively I believe it should make linking faster because there are fewer object files to link and fewer symbols in those object files.
Having claimed the last two points based on “intuition”, which I’m not expecting y’all to take on faith (feel free to tell me I’m wrong!), I’ll run some experiments and update this RFC.
Okay I invite y’all to try the PR, report back if you saw improved build times (or worse?), or just general comments/questions/concerns.
