Meta-RFC: Long-term vision for improving build times

Hello @cachemeifyoucan,

Very good points, thanks for the feedback!

However, the proposal is very ambitious and lacks details (I know it is still meta, and I am just suggesting things I would like to see in future detailed RFCs). I am, in spirit, in favor of seeing RFCs for build performance.

Yeah, this comes up a lot in the comments. I can prepare RFCs for at least each of the short-term steps. In-process compilation already has its own RFC.

On the other hand, multi-threaded in-process execution might involve completely replacing cl::opt with something better or making it thread_local, which is a very big task.

cl::opts have specifically come up a lot in discussions over the past years. I will prepare an RFC just for the global-state removal. Replacing cl::opts with something else seems like way too much work for my taste. The approach I’ve taken here was to force usage of cl::location at all times and redirect the storage to a single thread-local buffer (for all cl::opts in the process). It could be something along those lines, but using a tool-local buffer instead, with a TLS context pointing to that buffer. When a tool starts, it would set its cl::opt buffer in the TLS; when it ends, it would clear the TLS. This mechanism doesn’t require any change to the existing cl::opts, except the ones that already use cl::location.

Secondly, we need to write down the exact methodology to achieve each task so we know how feasible it is and how disruptive it is.

Agreed.

We also might need some initial experiments to collect data to understand the potential benefits.

Absolutely.

Here is a counterexample for your in-process compilation model: some experiments with in-process compilation were run on macOS (where launch time is not as big a problem compared to Windows) years ago, and we found the saving from avoiding process launch was completely overshadowed by the cost of freeing memory (you cannot use -disable-free for in-process compilation). This is not proof that the in-process model doesn’t work; it just means we need to squeeze more savings from elsewhere, which also needs data to back it up.

Yes, I remember that you folks put back CLANG_SPAWN_CC1=ON for macOS when I introduced -fintegrated-cc1. In that regard, the prototype here uses multithreaded in-process compilation and actually disables -disable-free, meaning that heap cleanup occurs between each tool invocation. However, the heap memory pages remain committed (at least when using rpmalloc). I found that a lot of the churn on shutdown comes from the VAD tree cleanup by the OS, which reclaims the physical memory pages. The more short-lived processes you launch with many page allocations (like a compiler), the worse it gets. The OS has to clear the pages before handing them back to another process, and at some point the zero queue fills up and becomes a blocker. This shows a lot when building LLVM or Chromium. In some extreme cases, like linking Chromium’s browser_tests.exe, shutting down the LLD process (after the CRT finishes its cleanup) can take 5-7 seconds. However, when the pages remain mapped in the process, there’s no hit between tool invocations (though this also greatly depends on what your CRT allocator does upon free() on macOS).

It is great to see you have solutions in mind for many of the problems. Looking forward to reading the detailed follow-up RFC. For the meta RFC, it might be good to come up with a timeline (at least in what order, and whether you are depending on some other changes) so we know what to expect.

Yes, I remember that you folks put back CLANG_SPAWN_CC1=ON for macOS when I introduced -fintegrated-cc1.

That is probably unrelated to daemonizing things. I think it is about crash reporting, if I remember correctly.

However, the heap memory pages remain committed (at least when using rpmalloc).

Like I said, I don’t think it is a blocker for what you propose; it is highly dependent on the OS and malloc library, and there are other things we can do to mitigate the impact (like using BumpPtrAllocator more). That is why I would like to see some experiments with data.


Hi folks,

(I share a corporate overlord with Alexandre; although in a different tentacle of the business),

I think this is a future vision of a world I want to live in; it fits a theme of large-scale integration to achieve efficiency. The topic of Windows is coming up a lot – our (the Sony bunch) customers use Windows, we care about it a lot and use it a lot, hence the in-process-testing work [0]. It’s certainly something we’d put effort into maintaining + monitoring. For build system integration, it isn’t something I have a lot of familiarity with, but experimenting with closer integration and seeing what performance results come of that seems feasible to plan out and try.

The only thing that truly makes me nervous is ensuring the steady-state of a long running LLVM daemon doesn’t subtly change behaviour over time: the cleanliness of single processes that terminate is very attractive from a sanity/safety point of view. I think there’s a direct trade-off between compile-time performance and complexity here; but it’s a design space we can explore and measure.

We’re becoming much more interested in compile times – there are various paths to be taken to reduce work, further process integration is one of them. With past learnings from the program-repo project Alexandre mentions, we feel work-reduction from the frontend is another important topic (most game-projects are frontend-dominated). I’d like to chime in with @aengelke that

In my opinion, if we care about C++ compile times, we should work on (a) getting C++ modules into a widely usable state in which they provide substantial improvements and (b) a faster, performance-focused, and well-engineered C++ front-end (preferably without an expensive AST). IME, the front-end dominates compile times for larger code bases and a faster front-end could get improvements >2x for everyone.

This is our experience too. I feel the AST is incredibly powerful, and that’s a double-edged sword because of the corresponding compile times. We’re considering shortcuts we could take, but another full frontend seems infeasible. Incremental compilation would be ideal (but hard); I understand C++ Modules can lead to serious work reduction, but there are few case studies demonstrating clear benefits.

We’d certainly chip in to help prototype and evaluate the ideas that come out of this meta RFC.

[0] https://discourse.llvm.org/t/rfc-reducing-process-creation-overhead-in-llvm-regression-tests/88612


I think the issue of memory cleanup after a tool call terminates can be solved easily enough. As mentioned previously, this is OS- and malloc-library-dependent. There are existing solutions at both the OS and the malloc-library level. At the OS level, Windows offers heapapi.h for easily freeable heaps. At the library level, which is what I would suggest, heaps are a first-class primitive in mimalloc [0]. If each tool call allocated only on one of these heaps, cleanup would be very cheap, as it happens in one go. This would of course mean replacing rpmalloc, and as such would need benchmarking the performance of the different mimalloc versions for LLVM.

[0] mi-malloc: Heaps

I think the issue of memory cleanup after a tool call terminates can be solved easily enough. As mentioned previously, this is OS- and malloc-library-dependent. There are existing solutions at both the OS and the malloc-library level. At the OS level, Windows offers heapapi.h for easily freeable heaps. At the library level, which is what I would suggest, heaps are a first-class primitive in mimalloc [0]. If each tool call allocated only on one of these heaps, cleanup would be very cheap, as it happens in one go. This would of course mean replacing rpmalloc, and as such would need benchmarking the performance of the different mimalloc versions for LLVM.

Yes having separate heaps is something that I’ve suggested in the other RFC: [RFC] In-process execution of LLVM tools - #10 by R-Goc

I don’t think going back to Windows Heaps is a good idea until Microsoft comes up with a more performant allocator. We can also create custom heaps in rpmalloc. mimalloc is a good solution, on par with rpmalloc. See some metrics from a few years ago below:

Thanks Alexandre! I hope I can help make some of what’s proposed here happen. :slight_smile:

I want to emphasize that process startup overhead isn’t just a Windows problem: if you dig just a little bit into Linux loader implementation details, you’ll realize that Linux process startup is slow too. @boomanaiden154-1 migrated our lit tests to the internal shell to avoid one bash invocation per test and picked up a ~10% runtime saving.

These are good goals, and we should 100% do this. I filed an issue in the tracker for this migration. Many LLVM downstreams, especially graphics drivers, provide LLVM-as-a-library and attempt to offer multi-threaded compilation, but they run into races on global state.

I actually think object file serialization is probably still valuable because the unserialized object file representation (MC) is bloated and not dense. Serializing to a flat object file representation is valuable, since you can free and reuse all that assembler heap memory. I think an interesting direction here would be to add some kind of flat, per-function content-hashing layer, since IMO that’s what the linker needs to be redesigned around, and that’s what’s going on in the CAS workstream.

I feel like having the process pool model is valuable, since LLVM contains many fatal error paths. The process pool model can’t share as much cached filesystem state in memory, but it makes error recovery and cache flushing more reliable, since you can just restart the process after any action failure or when it exceeds some memory usage quota.

We never implemented this, but back when I was working with Chrome, I was advocating strongly for distributed ThinLTO, which is finally being implemented today, mostly because I wanted process isolation. Mostly I wanted to reduce the support cost of LTO for us. I didn’t want to get bug reports of the form “the linker crashed non-deterministically after 20 minutes of compilation”, I wanted to get bug reports like “this backend compilation action fails deterministically on this bitcode”. Having the ability to re-execute failed build actions in a clean process is very helpful.

I agree strongly with this. I’m a little bit leery of developing an entire build system inside of LLVM, but I think the entire C/C++ developer community has been held back by our inability to make changes, even incremental ones, that cut across traditional build system boundaries. If LLVM had a build action subgraph executor, that would open up a lot of possibilities.


I think CodeView is actually pretty well-suited for this purpose. If you dig into the global type hashing implementation, it’s all just content hashing all the way down.


As someone who needs to debug compiler toolchains when they inevitably break, determinism and serialization are absolutely critical to developing LLVM.

If we can avoid overhead by passing files between tools without waiting for them to be written to disk, that’s okay, but we need to be very careful to ensure that this is just a performance shortcut which doesn’t affect the semantics of the tools. I’m very afraid of a system where we have some sort of in-memory cached representation of a program, with no corresponding on-disk representation. When that database is corrupted, or has a race condition, it’ll be impossible to debug.

Along the same lines, incremental/live compilation systems are hard to debug. If you’re careful, you can log the user’s actions and store intermediate outputs, but if the system silently behaves differently based on which intermediate files are present, it’ll be impossible to debug.

You don’t need a daemon for this. You just need to encode instructions to invoke the compiler to generate a dwo or equivalent. Actually, I’m not sure what the daemon is even doing here, except to act as a middleman that invokes the compiler.


Yeah, CodeView itself should fit well into it, but I’ve not looked into MCCAS’ handling of the debug info so it is difficult to tell what dragons lie there.

Hello @efriedma-quic,

As someone who needs to debug compiler toolchains when they inevitably break, determinism and serialization are absolutely critical to developing LLVM.

If we can avoid overhead by passing files between tools without waiting for them to be written to disk, that’s okay, but we need to be very careful to ensure that this is just a performance shortcut which doesn’t affect the semantics of the tools. I’m very afraid of a system where we have some sort of in-memory cached representation of a program, with no corresponding on-disk representation. When that database is corrupted, or has a race condition, it’ll be impossible to debug.

Along the same lines, incremental/live compilation systems are hard to debug. If you’re careful, you can log the user’s actions and store intermediate outputs, but if the system silently behaves differently based on which intermediate files are present, it’ll be impossible to debug.

I agree. @jmorse has the same concerns above. My feeling is that we shouldn’t retain raw compiler state in memory, but rather a light serialized format. This goes hand in hand with a CAS and the calculation of hash keys for a “computation” which generates an outcome (the serialized state). We’re talking granular state here: intermediate artifacts such as token streams, AST fragments, types, IR, machine code, sections, debug info, etc. The goal is also to persist this state in an index + CAS when the LLVM daemon is not running. Additionally, we would store alongside it a history of the build actions that were triggered / sent to the daemon during its lifetime.

On one hand, all that would give us a key→value mapping for computations, which can be triggered again to assert their validity / determinism. For example, building a project from scratch twice shouldn’t generate new assets in the CAS. On the other hand, we could achieve reproducibility by knowing the daemon’s initial state on startup (the root index hash of a Merkle tree, as stored in the CAS) + the history of actions that were executed.

I think the same ideas apply to multi-threading: if we share raw state across threads, we’re at risk of non-deterministic behavior.

I wasn’t thinking of monolithically generating debug info for a TU. More along the lines of only generating a subset of debug information on the fly. This ties back a bit to the previous paragraph, where we need to get back quickly into a state where we can generate the debug info from existing generated code.

An option along those lines is for the daemon to act as a DAP debugger server. Another aspect of this is that ultimately I am envisioning the daemon acting as a VM for C++, where it can dynamically build a process in memory and modify it on the fly, i.e. further democratize the use of LLVM ORC (e.g. Julia or Cling). Right now we’re resorting to external tooling like Live++ because there’s no other way to achieve this, but ideally I’d prefer a more integrated solution.

Hello @rnk,

I feel like having the process pool model is valuable, since LLVM contains many fatal error paths. The process pool model can’t share as much cached filesystem state in memory, but it makes error recovery and cache flushing more reliable, since you can just restart the process after any action failure or when it exceeds some memory usage quota.

Yeah, that sounds like a good compromise. Ideally I’d like to have both, at least for testing purposes. From the OS scheduler’s perspective, managing a pool of processes is more expensive than a single process with a pool of threads. With the current build model, context switching alone was significant enough to show up in profiles on Windows. However, if we keep a pool of llvm.exe processes alive and schedule actions on them without shutting them down, we could pin each to a specific core, so context switching shouldn’t be as bad.

I’m a little bit leery of developing an entire build system inside of LLVM

I think only a minimal amount of work needs to be done here. I am also against bringing all of the target management and build scripts inside LLVM. However, managing the dependency graph of actions makes sense, since LLVM has deeper domain knowledge of the actions and their implications.

(aside: If you want to rerun the compiler to generate debug info on the fly, do realize that LLVM isn’t robust against debuginfo-affects-codegen, so you have to produce debug info LLVM IR the first time and /maybe/ the backend can be made (probably isn’t already) robust against debuginfo/emission/-affects-codegen issues - if you want to be able to skip all the debug info handling/merging/etc during the middle end, then substantial work would need to be done to ensure debug info doesn’t affect LLVM’s optimizations, etc, in any way (it’d be good, valuable work - it’d remove substantial sources of heisenbugs, etc, - but it so far hasn’t been important enough for anyone to prioritize the kind of quality level in this area that would be needed to build incremental debug info on top of))

1 Like

Hello @dblaikie,

(aside: If you want to rerun the compiler to generate debug info on the fly, do realize that LLVM isn’t robust against debuginfo-affects-codegen, so you have to produce debug info LLVM IR the first time and /maybe/ the backend can be made (probably isn’t already) robust against debuginfo/emission/-affects-codegen issues - if you want to be able to skip all the debug info handling/merging/etc during the middle end, then substantial work would need to be done to ensure debug info doesn’t affect LLVM’s optimizations, etc, in any way (it’d be good, valuable work - it’d remove substantial sources of heisenbugs, etc, - but it so far hasn’t been important enough for anyone to prioritize the kind of quality level in this area that would be needed to build incremental debug info on top of))

I guess what I am asking between the lines with this RFC is: is this effort worth investing in, from this community’s standpoint? As in, transitioning to an incremental compiler / toolchain? I am ready to invest my time into the first part, but the second part requires more involved community support, which I suppose should also come with strong corporate buy-in. Or maybe I am too optimistic and we should scratch the long-term vision, and just talk about the first part until we have a functional daemon?

To that broader question, here’s my rather pessimistic answer:

It’s pretty hard (read: assume impossible) to get investment from others on a project this experimental. It’s unlikely to align sufficiently with corporate goals for anyone to invest significant resources into it, unless it’s already the thing they’re working on (which it isn’t, or we’d see that in the community) or there’s already some other major investment (see, for instance, Apple’s build-cache work: even that, with strong corporate backing, wasn’t quite interesting enough for Google to engage with, given how well or poorly things align with the current goals of the company/teams, but merely to cheer on, with some reservations, from the sidelines).

So, essentially - assume you’ll be doing this alone/with whoever you’re already working with.

And that, I think, is where some of the concerns in this thread come from: the concern that there’s insufficient backing for a project of this magnitude, that it’ll place a substantial maintenance burden on the project and the use cases won’t pan out, or the investment won’t be enough to complete it, etc. :confused:

So, generally the best way to approach something like this is, as with most of the LLVM project, with small increments that are self-supporting in their value/return-on-complexity, and hopefully appeal to existing users/contributors/use cases without major rework. (that said, I also really get a bit frustrated with solutions that are overly isolated/drop-in - Apple’s implicit modules, then implicit ThinLTO are examples - they work great to drop into a system you can’t change, but don’t scale well (no way to distribute them, for instance - though, admittedly, Google and Apple still got a lot out of sharing the common parts, even if in both cases Apple went with implicit systems that were easy to roll out to users and Google went with explicit systems that were easier to integrate into a very rigid distributed build system))


This is motivated by the significant process creation and I/O overhead on Windows

Do you have numbers to back up the process creation overhead?

I remember that when we made clang no longer spawn a cc1 subprocess, we also thought that’d help a ton (⚙ D69825 [Clang][Driver] Re-use the calling process instead of creating a new process for the cc1 invocation), but the actual measured wins after that went in were pretty small. (Low one-digit percentage, if I remember correctly?) (It was IMHO still a great change for other reasons too, but it was also much much smaller in scope…)