Meta-RFC: Long-term vision for improving build times

Introduction

This RFC outlines a long-term vision and aggregates ideas for improving build times. It focuses mainly on mechanical and structural changes that can considerably improve build times over the long term.

The short-term (<1 year) objective is to run a full build graph within the same, long-lived LLVM daemon process, moving beyond the traditional model of spawning a new OS process for each build action. This is motivated by the significant process creation and I/O overhead on Windows, but the benefits extend to achieving a truly efficient, modern, and incremental compilation model across all platforms.

The first practical step toward this (sequential in-process execution) is discussed in [14]. The steps following that RFC will be multithreaded (concurrent) in-process execution and a build daemon. Those ideas were presented at past LLVM conferences in [2] and [3].

In the long term (if there’s agreement on the above), this vision aims for a system where overall local build times are directly proportional to the incremental changes done by a local user (on the target / compiled codebase). Changes done elsewhere (by other users on the same codebase) should already be incorporated asynchronously into the local build cache.

Short-term steps

This consists of four practical steps:

  • In-Process Execution: Run build commands sequentially within a single, long-lived process to reduce process overhead.

  • Multithreaded Execution: Extend the previous step to run build commands concurrently within the single process, requiring the removal of global state in LLVM for thread safety.

  • Cached I/O: Introduce a shared, thread-safe Virtual File System (VFS) to bypass disk I/O; pass intermediate object state in shared memory between tools (e.g., Clang and LLD).

  • Build Daemon: Create a long-lived LLVM compilation service to manage entire build graphs, perform on-demand compilation, and shorten the build loop for client applications.

Here we only succinctly describe the high-level intention; later RFCs will go into full detail.

In-Process execution

The first step of this work focuses on enabling sequential in-process execution of LLVM tools. This work is described in detail in RFC: In-process execution of LLVM tools [14].

A visible outcome of this work is the ability to run a sequence of build commands from the Clang driver:

> clang-cl file1.cpp file2.cpp file3.cpp -fuse-ld=lld

Or with a compilation database:

> clang-cl /compilation-database compile_commands.json

This sequential in-process execution could be used, for example, for Bazel persistent workers [1] or for faster Lit execution [17].

Multithreaded in-process execution

This second step extends the previous one to allow concurrent execution of tools within a single process. Traditional build systems like Ninja or Unreal Build Tool (UBT) could then see faster execution by delegating complete execution of build commands to an LLVM tool. This work will also make possible the daemon in step four.

In addition to the LLVM changes to enable this, the Clang driver will be able to execute jobs concurrently. When no explicit dependencies are defined, jobs could run concurrently in a thread pool, with a new flag (-j) controlling the thread count:

> clang-cl file1.cpp file2.cpp file3.cpp -fuse-ld=lld -j4

Similarly,

> clang-cl /compilation-database compile_commands.json -jall
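As a purely illustrative sketch of this dispatch, the driver could fan independent jobs out to worker threads. Job and runToolInProcess are invented names standing in for the driver’s job list and one in-process tool invocation, and std::async stands in for a bounded thread pool:

    #include <functional>
    #include <future>
    #include <string>
    #include <vector>

    struct Job { std::vector<std::string> Args; };

    // Stub standing in for one in-process tool invocation (e.g. a cc1 job).
    static int runToolInProcess(const Job &J) { return J.Args.empty() ? 1 : 0; }

    // Run independent jobs concurrently; a real -jN would bound the worker count.
    static int runJobsConcurrently(const std::vector<Job> &Jobs) {
      std::vector<std::future<int>> Results;
      Results.reserve(Jobs.size());
      for (const Job &J : Jobs)
        Results.push_back(std::async(std::launch::async, runToolInProcess, std::cref(J)));
      int RC = 0;
      for (std::future<int> &R : Results)
        RC |= R.get(); // non-zero if any job failed
      return RC;
    }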

Internalizing execution as proposed here would greatly reduce OS friction on Windows (essentially the time spent in the kernel or OS libraries), and most likely on other systems as well. In our past llvm-buildozer prototype [7] we observed a steady 99% CPU usage in user space while building a large game project, whereas regular compilation without llvm-buildozer today oscillates around 70-85% CPU usage in user space – or less when building non-Jumbo targets such as LLVM or Chromium.

Most notably, this step involves removing some global state throughout LLVM and isolating it on the stack or on the heap instead. Only globals on the “golden path” of build execution are to be removed. Global state in smaller utility programs or libraries unaffected by the new concurrency model will not be modified. Coding guidelines will be changed, and mechanisms for avoiding global state in the future will be added to the test suite. As an example of this work, global state was already removed in LLD, see [8] and [9].

We can identify at least three classes of global state in LLVM: ManagedStatics, cl::opts and function-local statics. These represent the vast majority of globals that would need to be adapted to make most LLVM tools thread-safe for in-process invocations. Other global state exists in the CRT (C runtime libraries) and in other OS libraries. A notable example is the CWD (current directory) pointer, which is stored per-process on Windows and indirectly affects many Win32 API calls.
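To illustrate the kind of mechanical change involved, here is a minimal sketch of migrating a function-local static into a per-invocation context; ToolContext and both functions are invented names for this example only:

    #include <map>
    #include <string>

    // Before: hidden process-wide state, racy under concurrent tool invocations.
    const std::string &getTargetDescBefore(const std::string &Triple) {
      static std::map<std::string, std::string> Cache; // function-local static
      return Cache[Triple];
    }

    // After: the cache lives in a context owned by a single tool invocation,
    // so concurrent invocations each operate on their own copy.
    struct ToolContext {
      std::map<std::string, std::string> TargetDescCache;
    };

    const std::string &getTargetDescAfter(ToolContext &Ctx,
                                          const std::string &Triple) {
      return Ctx.TargetDescCache[Triple];
    }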

We will also need to identify the ManagedStatics (or global variables in general) that do not affect concurrent build execution and must remain process-global. As an example of this, in llvm/Support/Parallel.h we might want to keep the parallel execution state global across all tool invocations. Sharing a global ThreadPool would be useful to better use hardware resources – for example when multiple LLD tools are running in parallel. CMake flags like LLVM_PARALLEL_LINK_JOBS will not be required anymore in this concurrent in-process mode, as long as all jobs are using LLVM tools.

Cached I/O

I/O is a major performance issue on Windows in general, and avoiding any kind of I/O would most likely benefit all platforms. This third step would introduce a shared, thread-safe, in-memory caching VFS across all concurrent tool invocations. In essence, treating the file system view as immutable after a build starts allows for bypassing system I/O, by caching files during the LLVM tools’ execution. External file system changes would be supported by reading the NTFS journal, or using inotify / fanotify on Linux. Recent sandboxing in [10] simplifies this work. While prior art exists in clang-scan-deps and in llvm-cas, the goal would be to extend and reuse these systems across all LLVM tooling during a build.
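LLVM’s existing VFS already provides building blocks for this layering; the new part would be the policy of sharing one such overlay (with appropriate locking) across all concurrent tool invocations. A minimal sketch using the existing llvm::vfs API, with that sharing/thread-safety policy deliberately not shown:

    #include "llvm/ADT/IntrusiveRefCntPtr.h"
    #include "llvm/Support/MemoryBuffer.h"
    #include "llvm/Support/VirtualFileSystem.h"

    using namespace llvm;

    IntrusiveRefCntPtr<vfs::FileSystem> makeCachingFS() {
      auto InMem = makeIntrusiveRefCnt<vfs::InMemoryFileSystem>();
      // Files read once during the build could be materialized here, so that
      // later tool invocations never touch the disk for them again.
      InMem->addFile("/cache/file1.obj", /*ModificationTime=*/0,
                     MemoryBuffer::getMemBufferCopy("...object bytes..."));
      // Overlay: lookups hit the in-memory layer first, then fall back to disk.
      auto Overlay =
          makeIntrusiveRefCnt<vfs::OverlayFileSystem>(vfs::getRealFileSystem());
      Overlay->pushOverlay(InMem);
      return Overlay;
    }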

On a similar topic, we can also take shortcuts between tool executions. Currently, when the Clang tool writes .OBJ files to disk, the LLD tool re-opens and reads them. For instance, in the example below, the .OBJ files contributing to the binary are needed immediately by the linker after compilation:

> clang-cl file1.cpp file2.cpp file3.cpp -fuse-ld=lld

The flow here is suboptimal because it necessitates .OBJ serialization and creates blocking I/O requests for both writing in Clang and subsequent reading in LLD. A better approach would be to pass the unserialized state directly in shared memory between tools, while a background thread concurrently writes the .OBJ files for later usage. In a way, this mirrors how Clang avoids outputting an intermediate .ASM file when producing .OBJ files. In our proposed concurrent build model, .OBJ files are not strictly required for the immediate link step above, only for subsequent builds.
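A hypothetical sketch of that handoff, assuming a shared BuildArtifacts store owned by the in-process session (all names invented for illustration): the compile step publishes object bytes, the link step reads them from memory, and a detached thread persists them for subsequent builds.

    #include <map>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    struct BuildArtifacts {
      std::mutex Lock;
      std::map<std::string, std::vector<char>> Objects; // path -> object bytes

      // Called by the "compiler" side; the "linker" side reads Objects under
      // Lock instead of re-opening files from disk.
      void publish(const std::string &Path, std::vector<char> Bytes) {
        {
          std::lock_guard<std::mutex> G(Lock);
          Objects[Path] = Bytes;
        }
        // Write-behind for later incremental builds; the link does not wait.
        std::thread([Path, Bytes] { /* write Bytes to Path on disk */ }).detach();
      }
    };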

Build daemon

The previous steps focus on efficiently executing a single build graph by invoking different tools (compiler, librarian, linker, etc.) in-process. This fourth step introduces a long-lived LLVM compilation service (similar to clangd) to tie these steps together. While this new service application could live separately from the main LLVM codebase, it may be beneficial for the LLVM project to drive its development.

Invoking the daemon could be as simple as a Clang driver flag:

> clang -s

This would start a detached background Clang process that waits for commands provided by the build system through IPC. The service would then perform compilation or linking on demand. Clients could be typical build systems such as Ninja or Unreal Build Tool (UBT), IDEs, or hotpatching applications like Live++ [13]. Along the way, we could look into progressively keeping compiler or linker state in memory to improve build iterations.
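To make the shape concrete, here is a hypothetical sketch of the daemon’s command loop; the one-command-per-line protocol and executeInProcess are invented for illustration, not a proposed wire format:

    #include <istream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Stub standing in for an in-process compile or link invocation.
    static int executeInProcess(const std::vector<std::string> &Args) {
      return Args.empty() ? 1 : 0;
    }

    static int daemonLoop(std::istream &Channel) { // e.g. a pipe from the build system
      std::string Line;
      while (std::getline(Channel, Line)) { // one build command per line
        if (Line == "shutdown")
          return 0;
        std::vector<std::string> Args;
        std::istringstream Tok(Line);
        for (std::string A; Tok >> A;)
          Args.push_back(A);
        // A real daemon would dispatch to a worker thread and report the job
        // status back over the same channel.
        executeInProcess(Args);
      }
      return 0;
    }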

The daemon’s execution model could be either a pool of LLVM processes (such as llvm.exe from a llvm-driver build) or a pool of threads internally in the Clang daemon process itself. If favoring process isolation, note that processes are heavier on Windows, and context switching might be slower than on Linux. A pool of threads, conversely, would be more beneficial on Windows and could match Linux performance. Both modes could be implemented, allowing selection between absolute process isolation and build performance.

The daemon process could either remain in the background indefinitely, or it could be started with a default timeout, like sccache does. Furthermore, facilities would be provided for dealing with different toolchain versions across branches. For example, if a project branch A uses LLVM 22 and branch B uses LLVM 23, switching compilation between branches will re-launch the daemon – if it was previously started from another branch. We will assume a daemon instance is tied to specific branch(es), codebase, or C++ project.

Long-term ideas

The above foundational short-term steps – focused on internal execution efficiency – pave the way for a more ambitious, long-term strategy aimed at a complete paradigm shift in how LLVM interacts with modern, large-scale C++ codebases.

Our ultimate objective is to deliver a new development model where local build times are directly proportional to the incremental changes made, regardless of the overall codebase size (e.g., Chromium or Unreal Engine). This is a transition to an advanced toolchain featuring out-of-the-box features such as incremental compilation, runtime hotpatching, progressive optimization via live PGO, and on-demand debug information, among other things.

This vision focuses on four major, systemic areas of improvement that require deeper integration and coordination across the toolchain.

Caching and incremental compilation

Probably the most basic form of caching is .OBJ files: they avoid re-running previous build commands whose inputs were not modified.

Another form of front-end caching is .PCH files (or header units, or modules). The work in [15] is greatly improving the situation, at least for the LLVM project. However, for large projects, maintaining a good set of precompiled header files is not trivial. At Ubisoft, a custom ClangTooling pipeline was used to intelligently generate precompiled header sets, aggregating metrics such as AST node weight and build times across different platforms and compilers. This kind of smart PCH management is a crucial feature that should be considered in LLVM, possibly under different forms.

Memoization and more granular semantic caching offer significant opportunities for improvement in both the compiler front-end and back-end. Past prototypes like the program repo [4] or the zapcc LLVM fork [5] have shown interesting improvements in terms of build times. Other proposals around these topics might come later this year.

As far as network caching is concerned, traditionally we delegate this to external tooling like ccache, sccache, FASTBuild or others. Integrating caching as a first-class citizen into LLVM opens possibilities for both more granularity and more efficient asset sharing between users. While llvm-cas [11][12] largely paves the way ahead, several key aspects here are:

  • Generalizing or simplifying the cache key calculation for internal compiler computations (see the sketch after this list).

  • Ensuring computations are deterministic everywhere.

  • Maintaining a live local catalog of already-performed remote computations (without necessarily transferring the assets themselves).

  • Ensuring the caching cycle overhead is less than the original computation time.

  • Developing facilities for testing, such as replacing a computation with its cached version at runtime and vice versa.
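On that first point, a deterministic cache key could, in its simplest form, mix the canonicalized command line with the content (never timestamps) of every input. The sketch below is illustrative only – all names are invented, and a real implementation would use a cryptographic hash such as SHA-256 (as llvm-cas does) rather than std::hash:

    #include <functional>
    #include <string>
    #include <vector>

    using CacheKey = unsigned long long; // stand-in; real keys would be 256-bit digests

    CacheKey computeCacheKey(const std::vector<std::string> &CommandLine,
                             const std::vector<std::string> &InputContents) {
      std::hash<std::string> H;
      CacheKey Key = 14695981039346656037ULL; // FNV-1a style mixing, for illustration
      auto Mix = [&](CacheKey V) { Key = (Key ^ V) * 1099511628211ULL; };
      for (const std::string &Arg : CommandLine)
        Mix(H(Arg)); // flags must be canonicalized first (order, defaults, paths)
      for (const std::string &Content : InputContents)
        Mix(H(Content)); // hash file content, not mtimes, so keys are host-independent
      return Key;
    }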

Caching and incrementality can take many different forms. Even with a remote network cache, the sheer size of today’s codebases can sometimes nullify the cache’s effectiveness. Reducing the amount of compiled data (GBs worth of compressed .OBJ files) to be transferred between hosts could be a great improvement. This reduction can be achieved in two primary ways:

  • Early on, by reducing the input source code to only reflect incremental changes.

  • After the build, by performing artifact diffing with domain knowledge after the .OBJ file is produced.

There is prior art such as ELFShaker [16] or MCCAS [18] as far as artifact diffing is concerned. However, these solutions need more natural integration into LLVM tooling before they can easily be plugged into external network caching/distribution tooling.

The codebase as a whole

There are also opportunities for LLVM (tools, daemon?) to view and acknowledge that a modern C++ codebase is a complete unit with the following characteristics:

  • It evolves in (small) steps, over time.

  • It has branches.

  • It has many users, each syncing the same codebase across multiple machines.

  • It is stored in a VCS (version control system).

Acknowledging these realities allows us to move toward a more incremental, sharable, and distributed build approach. Traditionally, compiler toolchains like LLVM delegate this knowledge to external tooling (the build system). This creates missed opportunities: the compiler treats every input file on every invocation as if it had never seen it before, and equally ignores that the same commands run on the same files across a fleet of machines.

While the original GCC build model has a certain purity in its referentially transparent compiler invocations, the practical complexity of C++ codebases has grown significantly over the past 20 years. A full non-distributed, non-cached Chromium rebuild now exceeds 1 hour 40 minutes on a 32-core/64-thread machine. Many of its developers leverage distributed or network caching systems, but this infrastructure might not be universally available, particularly for those with poor ISP connections. We must ask ourselves what build times are ultimately tolerable before taking action. Similarly, modern game engines such as Unreal Engine are often difficult to build locally without high-end hardware or robust build distribution infrastructure.

We should aim for the ability to take small patches (.DIFFs) as input, while maintaining prior compiler and linker state on disk or in memory. While current technologies like C++ modules, header units, PCH, and Clang precompiled preambles offer ways to achieve this, they are often complex to implement and maintain, placing an undue burden on toolchain users (who may not have the domain expertise or time to properly implement those solutions).

With that domain knowledge, LLVM could more efficiently distribute build actions across a fleet of machines when building the same codebase. This does not require bringing entire build systems, backend caching solutions, or distribution tools into LLVM, but rather providing helpers, APIs, and services to enable these external systems to work more efficiently.

Debug information on-demand

Generation of debug information is one of the reasons for slower build times and one of the primary contributors to the size of the generated artifacts. Large-scale applications generate gigabytes (GBs) worth of debug information. As a motivating example, Chromium’s browser_tests.pdb is over 6.5 GB in size; most of it will never be consumed by users or any automated processes. Having an LLVM build daemon presents an opportunity to generate debug information on-the-fly to serve a debugger (such as LLDB or Visual Studio). A more involved daemon, with an entire view of the codebase and its evolution (across time and branches), could also act like a source server [6] or a debuginfod service, generating debug info only when needed or requested.

Live optimization and hotpatching

There are also opportunities in live optimization of code in running binaries. Because of the long edit-build-run iterations, the game industry typically uses hotpatching tools like Live++ [13], which can build any incremental changes on-the-fly and dynamically patch a running process. Most codebase modifications – including changes to function bodies and class methods – can be made in the C++ source code and then hotpatched. This tremendously improves iteration times for developers and inherently allows for higher-quality games.

Despite the speed of hotpatching, realtime applications like games still need runtime performance at all times, even during development. Therefore, even during production, we can only afford optimized builds. For cases where more involved debugging is required, Live++ offers the ability to un-Jumbo-ify or deoptimize a .CPP file at runtime, by calling the compiler with appropriate flags, then hotpatching the corresponding TU functions in the target running process. This is typically something that could be improved with the help of an LLVM compilation daemon which would keep internal state in memory between runs, drastically reducing overhead for such repetitive compilation tasks. This is also a great opportunity for incrementally optimizing runtime code, to avoid long build times upfront. Advanced features like live PGO could also be possible, where profile data is collected from a running process and injected back into the optimizer, in the daemon.

Conclusion

The vision I am articulating here is two-phased – from the immediate, mechanical gains of in-process execution to the systemic transformation of the long-term strategic vision. I think it provides an actionable roadmap for tackling the challenges of build times in modern C++ development.

By implementing this short-term foundation, my aim is to establish the necessary platform for a toolchain that is deeply integrated, highly incremental, and optimized for the developer’s iteration loop. My ultimate goal is for us to move towards an experience where the frustration of minutes- or hours-long rebuilds is a relic of the past.

I’m looking forward to discussions around these topics! With this RFC my goal is also to aggregate existing efforts in these areas into a common document / thread, and possibly come up with a roadmap.

Thank you for reading!

References

[1] Bazel Persistent Workers

[2] 2019 LLVM Developers’ Meeting: A. Ganea “Optimizing builds on Windows”

[3] 2024 LLVM Dev Mtg - Manifesto for faster build times

[4] 2019 EuroLLVM Developers’ Meeting: R. Gallop “Targeting a statically compiled program repository”

[5] Zapcc

[6] Microsoft Source Server

[7] llvm-buildozer

[8] RFC: Revisiting LLD-as-a-library design

[9] Removing global state from LLD

[10] [RFC] File system sandboxing in Clang/LLVM

[11] RFC: Add an LLVM CAS library and experiment with fine-grained caching for builds

[12] 2022 LLVM Dev Mtg: Using Content-Addressable Storage in Clang for Caching Computations and Eliminating Redundancy

[13] Live++

[14] RFC: In-process Sequential Execution of LLVM Tools

[15] [RFC] Use pre-compiled headers to speed up LLVM build by ~1.5-2x

[16] elfshaker stores binary objects efficiently

[17] RFC: Reducing process creation overhead in LLVM regression tests

[18] Fine-grained Compilation Caching using llvm-cas


I find this rather ambitious, and wonder how many years it would take to implement all of this to a production-ready state. A couple of notes:

  • It seems like users of distcc like me won’t be able to get any improvements from this work (fine by itself, but worth mentioning in the RFC).
  • My understanding is that you envision that the compiler owns and manages the cache, which leaves an open question of integration with external tools like sccache, which are used on our CI to do cluster-wide caching. Can you expand on this?

Hello @Endill,

I find this rather ambitious, and wonder how many years it would take to implement all of this to a production-ready state.

With a small team I think the short-term topics – up to the daemon – could be achieved and landed in about a year. The long-term topics will certainly take longer, but it all depends on who comes on board. Some of these topics are already being worked on, such as llvm-cas at Apple.

It seems like users of distcc like me won’t be able to get any improvements from this work (fine by itself, but worth mentioning in the RFC).

It all depends on the level of changes that you’d want to implement in your build system, in addition to still using distcc (or any distribution tool). If you’re ready to use the proposed LLVM daemon locally with a cache and an index, as well as possibly keeping some state on the remote agent (the index of already built sections), that could reduce the amount of data generated and sent over the network. So I am thinking along the lines of what was proposed with the program repo [4]. There are other private proposals for the front-end which should improve build times even more, but I think they all need some level of process persistence and state on disk. That wouldn’t change your build scripts, with the exception of supplemental commands for launching the daemon and managing state on disk on the remote machines.

My understanding is that you envision that the compiler owns and manages the cache, which leaves an open question of integration with external tools like sccache, which are used on our CI to do cluster-wide caching. Can you expand on this?

llvm-cas already proposed this (owning the cache), but at a more granular level. LLVM wouldn’t own caching at the OBJ level like sccache / ccache / FASTBuild do. These solutions would remain orthogonal to llvm-cas and to what is proposed here. The only improvement that this work can bring to existing OBJ-level caching is possibly a reduction in OBJ size, along with a reduction in compiler back-end times, as mentioned above.

Conversely, if you eventually wished to embrace this proposal in your build system, that would require running the LLVM daemon on all machines that participate in the build, a bit like the existing sccache daemon. The LLVM daemon would have a lot more domain knowledge and would be able to transparently synchronize the indexes between agents, so that they don’t need to build data that is already on the initiating host (or any other host that spawns builds on the same codebase). The index would be a map of anything from generated sections, debug info, header unit ASTs, etc. – a bit like the de-duplication the linker does.


This is certainly a bold proposal. I understand this is a vision document, but there’s quite a lot of things to unpack, and, IMO, many details left undiscussed. I’d think any one of your short term goals is worthy of its own RFC, and indeed some of them have prior discussions. I guess I’d encourage you to use this space for broader discussion and present more detailed RFCs to the community before investing into any of the subprojects you’ve outlined. To be clear I think these are good ideas to discuss as a community, but I also want each part to get the attention it deserves.

You’re proposing pretty significant changes to the basic execution model of all of LLVM. How does this plan fare if any one part is not implemented the way you currently propose? How do these proposed changes scale? I’d guess that last bit varies significantly across the different parts of the proposal, if it can even be estimated. How does your broader vision work in LTO builds? It’s hard to imagine incremental builds working all that well when performing whole program transforms and analysis.

Lastly, while I’m pretty bullish on leveraging llvm-cas within clang and llvm, none of the integration has started yet. There’s a lot of inherent complexity in designing and integrating it into the production compiler, and I don’t expect that to be completed any time soon. I think the timeline is perhaps overly optimistic (e.g. ~1 year) for many of the goals proposed.

CC: @petrhosek


Hello @ilovepi,

Thanks for the feedback, I appreciate it.

I understand this is a vision document, but there’s quite a lot of things to unpack, and, IMO, many details left undiscussed. I’d think any one of your short term goals is worthy of its own RFC, and indeed some of them have prior discussions. I guess I’d encourage you to use this space for broader discussion and present more detailed RFCs to the community before investing into any of the subprojects you’ve outlined.

The “short term steps” (up to, and including, the LLVM daemon) are essentially mechanical plumbing that does not affect the current build model or any of the existing usages. Some of them, like the global state removal, will be tedious to implement, but I am planning to automate that and to ensure that we won’t have to deal with it again afterwards. I was planning on proposing detailed RFCs only after each step lands, and using this current thread to discuss the overall intention(s).

You’re proposing pretty significant changes to the basic execution model of all of LLVM. How does this plan fare if any one part is not implemented the way you currently propose?

I think we should discuss what you have in mind more specifically, but the plan is resilient. For example, if we don’t want multithreaded compilation in-process but we want the LLVM daemon, that’s acceptable. The goal for this meta-RFC is again to discuss and agree on the whole vision before going into the details.

How do these proposed changes scale? I’d guess that last bit varies significantly across the different parts of the proposal, if it can even be estimated.

The changes themselves require a lot of effort, I confess. But if we agree as a community that we should head in that direction, these topics could later be picked up by anyone. Most of what is proposed here could be worked on in parallel, by different teams.

How does your broader vision work in LTO builds? It’s hard to imagine incremental builds working all that well when performing whole program transforms and analysis.

If we’re referring specifically to ThinLTO and not FullLTO, yes, that is an objective. With caching and ThinLTO I’ve seen 40 sec iteration times on some of our game projects, from the source modifications to a generated binary. With an LLVM daemon, I expect it could be possible (involved, but possible) to keep the combined LTO module index that is generated during the ThinLink phase in LLD in memory (I know that there are memory usage implications there as well). That discovery phase is probably the most costly one in a ThinLTO LLD link, since it blocks all parallelization that comes afterwards; and most likely the combined index does not change that much during a developer’s day of work.

I think the timeline is perhaps overly optimistic (e.g. ~1 year) for many of the goals proposed.

To clarify, I mentioned ~1 year only for the short-term steps, if a small team (3-4 people) was dedicated to that purpose.


In summary, the short-term steps propose to introduce an additional, substantially different invocation model for Clang for a perf. improvement of at most 15-30% on Windows only. This would reduce the robustness and add new failure/bug modes that would not typically be observed on non-Windows platforms (where this probably provides much less benefit).

In my opinion, if we care about C++ compile times, we should work on (a) getting C++ modules into a widely usable state in which they provide substantial improvements and (b) a faster, performance-focused, and well-engineered C++ front-end (preferably without an expensive AST). IME, the front-end dominates compile times for larger code bases and a faster front-end could get improvements >2x for everyone.

Shared memory between processes/threads could then be used to e.g. avoid serialization of pcm files should this turn out to provide significant benefits on several platforms. I also think that the true benefits of having a compiler daemon (i.e., sharing of parsed/instantiated/IR-generated code between CUs even in non-module builds) would not really be achievable with the current code architecture and given the limited resources of the Clang project such substantial changes are probably unrealistic in the near future.

There are some bits in this long proposal that I agree with and some that I don’t agree with and it’s good to discuss these ideas, but I (similar to Paul) would rather do that on individual and detailed RFCs.


Like Vlas, I’m concerned about the scale of the effort you are proposing. Very bluntly, who is going to support this work, and who are the people that would be working on it?

More generally, I am wondering where you envision the separation of concerns to be. It seems reasonable that we would facilitate a way to run clang in-process, but do we want to vendor such a tool ourselves rather than, for example, enable ninja to run clang in-process? Did you, for example, consider a fork of ninja supporting in-process compilation? I think I am asking whether you envision LLVM shipping libraries for build system vendors, or becoming a build system itself.

Did you talk to build system vendors?

Can you go into more detail on the cached I/O section? For a large program there may be thousands of object files; we cannot expect so much information to stay in memory. In fact, linker memory usage is usually a bottleneck in build graphs today.

The overhead of compilation is linear in how much code needs to be recompiled. C++ modules would be a way to recompile less code (less parsing, some instantiations are cached). But modules, and any similar technologies, still require anything downstream of a change to be recompiled, so I don’t know that you can ever get to “build times are directly proportional to the incremental changes made” – not without reducing code dependencies.

More memoization along the lines of zapcc seems interesting, but any such solution would represent a large maintenance burden on its own, and you are proposing several. Note that I have mostly a front-end perspective – it might be easier to find more reasonable caching and diffing opportunities in the backend.

You mentioned Chromium and Unity a few times. And I think it makes sense that keeping information in memory would facilitate on-the-fly generation of debug info, etc. But could such an amount of in-memory data exist on commodity hardware, or would such a service only be provided by compilation farms that can already cope with lots of very large object files? Compile times are already both computationally and memory intensive.

As others said, this is mostly my first gut reaction, i.e. “wow, this is a lot”. Some of your ideas certainly are worth exploring 🙂


I’ve thought a decent amount about ways to improve build performance on Windows for premerge testing, and while I never implemented it, I eventually settled on something like this. This is basically llvm-buildozer (although a process pool would work better than a thread pool, due to how we handle options currently). I think it would actually be an achievable effort and could give very large performance improvements on Windows (like ~50%). And if someone wanted to do it entirely out of tree, that should be possible. It would be a substantial amount of work, but less than what is being proposed here.

I was also interested in much more aggressive caching in such a build system through LLVM CAS (something like siso with RBE/caching but without the remote part). That might not be as relevant for the intended use case here, but probably wouldn’t be hard to bolt on.

Hello @aengelke,

In summary, the short-term steps propose to introduce an additional, substantially different invocation model for Clang for a perf. improvement of at most 15-30% on Windows only. This would reduce the robustness and add new failure/bug modes that would not typically be observed on non-Windows platforms (where this probably provides much less benefit).

I’d argue the invocation model will be no different than LibTooling usage or clangd; sooner or later we might discover the same issues there anyway. As for robustness: yes, until this new mode is adopted everywhere, it should be marked as “experimental”. Should we not attempt such improvements?

Concerning the 15-30% build time improvement on Windows only – as of 2026 there are approximately 16M C++ users around the world, and roughly 50% or more are using Windows as a development platform. While the Clang share on Windows might be more marginal, there might be somewhere between 500k and 1M users of Clang on Windows. Even if it’s just 200k users, a 15-30% reduction in build times is substantial.

What is skewing this discussion is that the vast majority of LLVM contributors are using Unix-based systems, so these problems do not affect them. But that is not representative of the reality of actual end-users.

I asked Claude to classify Windows LLVM contributors and commits over the past year:

================================================================================
LLVM CONTRIBUTOR ANALYSIS: WINDOWS vs NON-WINDOWS (Past Year)
Period: 2025-02-15 to 2026-02-15
================================================================================

OVERALL STATISTICS:
  Total commits:                   41,732
  Total unique contributors:       2,727
  Windows-related commits:         1,594 (3.8%)
  Contributors w/ Windows commits: 429 (15.7%)

================================================================================
CONTRIBUTOR CLASSIFICATION
================================================================================

Classification criteria:
  Primary Windows:     >=50% of commits are Windows-related AND >=5 win commits, OR >=20 win commits
  Significant Windows: >=20% win commits OR >=10 win commits (not meeting Primary)
  Occasional Windows:  1+ win commit (below thresholds)
  Non-Windows:         0 win commits

--- PRIMARY WINDOWS CONTRIBUTORS: 27 ---

And some takeaways from that analysis:

  1. Windows is a very small slice of LLVM development – only 3.8% of commits touch Windows-specific code, and only ~27 people (1%) are “primarily” Windows contributors.

  2. The more active a contributor, the more likely they touch Windows code – 80% of very active contributors (100+ commits) have at least some Windows-related commits. This makes sense: prolific contributors tend to work across the whole codebase and fix cross-platform issues.

  3. Only ~177 people (6.5%) are meaningfully focused on Windows when combining Primary and Significant categories.

  4. The top “true” Windows contributors (people where >50% of their work is Windows-focused) include names like Jacek Caban (88% win, COFF/PE linker), yourself (Alexandre Ganea, 91%/57% across two emails), Tomohiro Kashiwada (77%), Akshat Oke (70%), Martin Storsjo (36%, MinGW), Mateusz Mikula (100%, MinGW), and jeremyd2019 (80%).

  5. Some “Primary” entries are misleading – contributors like Simon Pilgrim, Matt Arsenault, Kazu Hirata, and Fangrui Song appear because they have 20+ Windows commits in absolute terms, but Windows is only 1-4% of their work. They’re cross-platform contributors who occasionally touch Windows code.

In my opinion, if we care about C++ compile times, we should work on (a) getting C++ modules into a widely usable state in which they provide substantial improvements and (b) a faster, performance-focused, and well-engineered C++ front-end (preferably without an expensive AST). IME, the front-end dominates compile times for larger code bases and a faster front-end could get improvements >2x for everyone.

I agree, the front-end dominates in our Unreal Engine codebase too. We don’t use PCHs because they can’t be efficiently sent over the network to the remote agents; some .PCH files are over 1 GB. Here are some of our timings, taken with -ftime-trace and ClangBuildAnalyzer:

(build times for all files, aggregated)

Editor Development build:
  Parsing (frontend): 80595.4 sec = 70.9%
  Codegen & opts (backend): 33041.6 sec = 29.1%

Game Development build:
  Parsing (frontend): 35156.0 sec = 71.8%
  Codegen & opts (backend): 13829.7 sec = 28.2%

What I had in mind is a form of front-end caching (which PCH is a part of, and C++ modules to some extent) that is done transparently by the compiler. We shouldn’t ask users to make significant changes (including long-term maintenance) to their codebases to support a feature that will improve their build times. Adding a flag to the build system is fine, but anything more involved seems unrealistic for existing (large) codebases, in my view. I don’t think the problem with C++ modules is that they are in an unusable state, as you say; it’s that they ask too much of users.

There are some bits in this long proposal that I agree with and some that I don’t agree with and it’s good to discuss these ideas, but I (similar to Paul) would rather do that on individual and detailed RFCs.

I can provide RFCs for the short-term propositions, which I am committed to, but not for the long-term ones, which at the moment I am not committing to. I am proposing the long-term ideas simply to spark discussion around build time improvement in general. This is a topic that I feel is underdiscussed in the community. As a user I am extremely happy with LLVM as a toolchain, in terms of features and the runtime performance it provides – however, not when it comes to build performance, despite all the efforts we’ve put into our build systems, including distribution and caching.


For what it is worth, we (at The Browser Company) are looking at enabling LLVM CAS on Windows.

We are looking at modifying clang-cache to support clang-cl invocations (rather than requiring clang and clang++). We have the non-daemonised variant working sufficiently well that we are trying to migrate to it for regular builds. I am working on modifying some of the downstream (Swift) changes for LLVM to support the daemonised execution on Windows, and it feels feasible to support in the near future.

Subsequently, the piece that will remain after that work is enabling support for the MCCAS backend for clang-cache, which requires a new object writer. This also does not seem too far-fetched to support, though it might be a bit more challenging as there are some assumptions around MachO baked into the backend; but I’ve determined (via a PoC) that the shape is quite tractable: it is entirely isolated to a single TU which is responsible for writing the object into the CAS.

Honestly, the most complicated piece for supporting the CAS on Windows seems to be related to debug information. In order to maximize the caching benefits, we would need to ensure that the CAS is able to fragment and re-synthesize the CodeView content. I’ve not looked sufficiently at this to determine how much work this would be, but I feel like there shouldn’t be a strong reason to assume that this would be any worse than it was for implementing support for DWARF.


Hello @cor3ntin,

Like Vlas, I’m concerned about the scale of the effort you are proposing. Very bluntly, who is going to support this work, and who are the people that would be working on it?

I am committing to the short-term goals, up to the LLVM daemon, even if it takes 10 years to land them and bring them to a production state. However, I am not committing to the long-term goals; I am only bringing them up for discussion at the moment.

More generally, I am wondering where you envision the separation of concerns to be. It seems reasonable that we would facilitate a way to run clang in-process, but do we want to vendor such a tool ourselves rather than, for example, enable ninja to run clang in-process? Did you, for example, consider a fork of ninja supporting in-process compilation?

I am only considering minimal changes to external build systems or to build scripts. Nothing should be much different from a build system perspective. An LLVM toolchain would ship as it does now.

If we’re talking about multithreaded in-process compilation or the LLVM daemon, eventually there could be some small changes upstreamed to ninja (or any other build system). That said, most build systems are able to export CDB .json files, so in practical terms not much needs to be done to use some of these proposals. In an ideal world, ninja would speak a simple protocol with the LLVM daemon through a pipe, but that wouldn’t be much different from the commands ninja exports in a CDB file.

I think I am asking whether you envision LLVM shipping libraries for build system vendors, or becoming a build system itself.

No. I am reluctant to bring the build system into LLVM. The contract between “build systems” and the “toolchain” is fine as it is now: “We give you a command line; you give us back compiled assets (.OBJ, .LIB, .EXE, .PDB)”.

Did you talk to build system vendors?

I actually work for Sony Interactive, so yes, I am talking to the SIE toolchain team. However, this is my own proposal, which Sony is not committing to as of now.

Can you go into more detail on the cached I/O section? For a large program there may be thousands of object files; we cannot expect so much information to stay in memory. In fact, linker memory usage is usually a bottleneck in build graphs today.

LLD already loads and keeps all the .OBJ files in memory today, for the entire duration of the link. So this would be no different from the current build flow, except that sections would remain in memory a little longer after compilation. But this is indeed a problem, as you say; for example, see this Chromium ticket.

The memory pressure comes from the fact that we push section de-duplication to very late in the build process (into the linker). One way to solve this is linker-driven compilation, like Apple has experimented with in the past. The linker would start very early and would maintain state in memory just to perform the de-duplication while the compilation is running.

Another way of solving this could be to pre-link files per project or per library. Probably 4/5 of that memory pressure comes from debug information, mainly from repetitive inclusion of the STL or other header files. Instead of linking tens of thousands of .OBJ files into Chromium’s browser_test.exe, we could perhaps link a smaller number of files that were already pre-linked.

The overhead of compilation is linear in how much code needs to be recompiled. C++ modules would be a way to recompile less code (less parsing, some instantiations are cached). But modules, and any similar technologies, still require anything downstream of a change to be recompiled, so I don’t know that you can ever get to “build times are directly proportional to the incremental changes made” – not without reducing code dependencies.

An RFC along those lines might be posted sometime soon (not by me).

You mentioned Chromium and Unity a few times. And I think it makes sense that keeping information in memory would facilitate on-the-fly generation of debug info, etc. But could such an amount of in-memory data exist on commodity hardware, or would such a service only be provided by compilation farms that can already cope with lots of very large object files? Compile times are already both computationally and memory intensive.

Maybe not all state would remain in memory; some would live in SSD-backed memory pages. But at least it would be better in terms of latency than starting a compiler or linker binary and reloading and reprocessing everything from scratch. Even just keeping an in-memory index of what’s “available” – what was already processed – would be better than what we have right now. As said before, this could be anything from sections, partial ASTs, debug information, etc.

My thinking is also along the lines of “is this piece of information available elsewhere, on a different machine?”. If we at least “know” locally that some data contributing to the build exists somewhere, that is already a big win. Right now most .OBJ caching systems do not have that knowledge locally, i.e. a hashtable of existing keys that map to existing remote .OBJs.
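For illustration, such a local catalog could be as simple as the following sketch (all names invented); merely knowing an artifact’s remote location lets an agent skip the computation and fetch the bytes on demand:

    #include <optional>
    #include <string>
    #include <unordered_map>

    struct RemoteLocation {
      std::string Host; // machine that already holds the artifact
      std::string Path; // e.g. a CAS object ID or a remote .OBJ path
    };

    class ArtifactCatalog {
      std::unordered_map<std::string, RemoteLocation> Index; // content key -> location
    public:
      // Entries would be synchronized asynchronously between build agents.
      void record(std::string Key, RemoteLocation Loc) {
        Index.emplace(std::move(Key), std::move(Loc));
      }
      // Knowing the artifact exists elsewhere is enough to skip the work locally.
      std::optional<RemoteLocation> lookup(const std::string &Key) const {
        auto It = Index.find(Key);
        if (It == Index.end())
          return std::nullopt;
        return It->second;
      }
    };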

As others said, this is mostly my first gut reaction, i.e. “wow, this is a lot”. Some of your ideas certainly are worth exploring 🙂

That’s the whole point, let’s discuss! 😄


I’ll just comment on this from a slightly different perspective: as other people also agreed, this will be a relatively big project to push – and it is important not to run out of steam. No one knows what’s going to happen in the future, but we can at least (1) motivate more people to care about your project, and (2) try not to upset too many people who don’t care about your project (i.e. not cause significant breakage unless necessary).

To this end, I think picking a part from your proposal with a wider impact (so more people care) and narrowing down its scope (so it’s going to break less code) could be a good starting point.

Personally I think the VFS part (i.e. Cached I/O) is a good candidate. Because it doesn’t matter whether you are on Linux or Windows: I/O is still going to be a big, if not the primary, overhead in the build process in general. So saving on it is always a good thing (and again, probably more people will care), though I guess it’s still an open question how much I/O overhead we can reduce (on Linux) with this method. And it is also true that having Cached I/O more or less depends on making LLVM thread-safe as you pointed out, which is a big project by itself. But I have a feeling that it is long overdue and we have to deal with it one way or another in the future.


Hello @mshockwave,

I’ll just comment on this from a slightly different perspective: as other people also agreed, this will be a relatively big project to push – and it is important not to run out of steam. No one knows what’s going to happen in the future, but we can at least (1) motivate more people to care about your project, and (2) try not to upset too many people who don’t care about your project (i.e. not cause significant breakage unless necessary).

Apologies to everyone if I gave the impression that this isn’t a large endeavour. It is. However, some of the topics, such as in-process compilation, indeed mostly favor Windows users. That would bring performance on par with Linux fork() and, I think, provide fertile ground for later work.

Personally I think the VFS part (i.e. Cached I/O) is a good candidate.

What is your perspective on this? The caching I was envisioning requires memory page collaboration between tools. In-process compilation is a good candidate, but different approaches could be taken here, such as a standalone resident process that only caches file content (and communicates with compiler processes through RPC).

And it is also true that having Cached I/O more or less depends on making LLVM thread-safe as you pointed out, which is a big project by itself. But I have a feeling that it is long overdue and we have to deal with it one way or another in the future.

I think the llvm-buildozer Phabricator gives a good first-order approximation of what needs to be done to remove global state, keeping in mind that we only remove what is on the “golden path” for compilation and linking. I tried to take a pragmatic stance in llvm-buildozer to avoid too many changes: for example, cl::opts were kept as they are, as “immutable blueprints”, and the cl::location mechanism was used to redirect to a bump allocator that contains all live values, per tool. In the context of llvm-buildozer I took some shortcuts, so per tool means per thread. In another unpublished prototype, something similar was done for ManagedStatics – although some would need tagging, to discriminate what needs to be “per process” from what is “per tool”.
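For reference, the external-storage form of cl::opt mentioned above is existing llvm::cl API; the sketch below shows the blueprint/value split, with the option name chosen arbitrarily and the per-tool redirection (the bump allocator) left out:

    #include "llvm/Support/CommandLine.h"

    using namespace llvm;

    // External storage for the option's live value. In the prototype, this
    // slot would be redirected into a per-tool bump allocator rather than
    // being a plain global.
    static bool VerboseStorage;

    // The cl::opt itself stays an immutable global "blueprint".
    static cl::opt<bool, /*ExternalStorage=*/true>
        Verbose("verbose-output", cl::desc("Enable verbose output"),
                cl::location(VerboseStorage), cl::init(false));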

The other approach (which could also be complementary to the above) is to move all globals into a per-tool context, like what was done in LLD (see llvm-project/lld/COFF/COFFLinkerContext.h at main · llvm/llvm-project · GitHub) – but that requires a lot of shuffling around and context-passing in functions. For non-performance-critical globals, per-thread access as suggested above is enough.
