Hello @cachemeifyoucan,
Very good points, thanks for the feedback!
However, the proposal is very ambitious and lacks details (I know it is still meta, and I am just suggesting something I would like to see in future detailed RFCs). I would, in spirit, like to see RFCs for build performance.
Yeah, this comes up a lot in the comments. I can prepare RFCs for each of the short-term steps at least. The in-process compilation already has its own RFC.
On the other hand, multi-threaded in-process execution might involve completely replacing cl::opt with something better or making it thread_local, which is a very big task.
cl::opts have specifically come up a lot in discussions over the past years. I will prepare an RFC just for the global state removal. Replacing cl::opts by something else seems way too much work for my taste. The approach I've taken here was to force usage of cl::location at all times and redirect the storage to a single thread-local buffer (for all cl::opts in the process). It could be something along those lines, but using a tool-local buffer instead, and use a TLS context to point to that buffer. When a tool starts, it would set its cl::opt buffer in the TLS, when it ends, it would clear the TLS. This mechanism doesn't require any change to the existing cl::opts, except the ones that are already using cl::location.
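To make the TLS idea a bit more concrete, here is a minimal sketch in plain C++. It does not use the real cl::opt machinery at all: ToolOptions, ActiveOptions, ToolOptionScope, currentOptions and runTool are hypothetical names made up for illustration, and the two fields in ToolOptions are just stand-ins for whatever the options would store. The only point is the lifetime: the buffer is installed when a tool starts and removed when it ends.

```cpp
#include <cassert>

// Hypothetical tool-local storage for option values. In the scheme above,
// option storage would be redirected into a buffer like this one instead of
// living in process-wide globals.
struct ToolOptions {
  bool DisableFree = false;
  unsigned OptLevel = 0;
};

// One pointer per thread: the option buffer of the tool currently running on
// this thread, or nullptr when no tool is active.
thread_local ToolOptions *ActiveOptions = nullptr;

// RAII guard (hypothetical): installs a tool's buffer in the TLS slot when the
// tool starts and restores the previous value (nullptr at top level) when the
// tool ends.
class ToolOptionScope {
  ToolOptions *Previous;

public:
  explicit ToolOptionScope(ToolOptions &Opts) : Previous(ActiveOptions) {
    ActiveOptions = &Opts;
  }
  ~ToolOptionScope() { ActiveOptions = Previous; }
  ToolOptionScope(const ToolOptionScope &) = delete;
  ToolOptionScope &operator=(const ToolOptionScope &) = delete;
};

// Code deep inside a tool reads its options through the TLS pointer.
ToolOptions &currentOptions() {
  assert(ActiveOptions && "no tool is active on this thread");
  return *ActiveOptions;
}

// Hypothetical tool entry point: owns its buffer for the whole invocation.
unsigned runTool(unsigned OptLevel) {
  ToolOptions Opts;
  Opts.OptLevel = OptLevel;
  ToolOptionScope Scope(Opts);
  return currentOptions().OptLevel;
}
```

Two back-to-back calls such as runTool(1) and runTool(3) each see only their own values, and ActiveOptions is null again once a call returns. Because the pointer is thread_local, tools running on different threads would each see their own buffer.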
Secondly, we need to write down the exact methodology to achieve each task so we know how feasible it is and how disruptive it is.
Agreed.
We also might need some initial experiments to collect data and understand the potential benefits.
Absolutely.
Here is a counterexample for your in-process compilation model: some experiments were run on macOS (where launch time is not as big a problem compared to Windows) for in-process compilation years ago, and we found that the saving from not launching a process is completely overshadowed by the cost of freeing memory (cannot use -disable-free for in-process compilation). This is not proof that the in-process model doesn't work; it just means that we need to squeeze more savings from elsewhere, which also needs data to back it up.
Yes, I remember that you folks put back CLANG_SPAWN_CC1=ON for macOS when I introduced -fintegrated-cc1. In that regard, the prototype here uses multithreaded in-process compilation and is actually disabling -disable-free, meaning that heap cleanup occurs between each tool invocation. However, the heap memory pages remain committed (at least when using rpmalloc). I found that a lot of the churn that happens on shutdown is because of the VAD tree cleanup by the OS, which reclaims the physical memory pages. The more you call short-lived processes with a lot of page allocations (like a compiler), the worse it gets. The OS has to clear the pages before handing them to another process, and at some point the zero queue fills up and becomes a blocker. When building LLVM or Chromium this shows a lot. In some extreme cases, like linking Chromium's browser_tests.exe, shutting down the LLD process (after the CRT has finished its cleanup) can take 5-7 seconds. However, when the pages remain mapped in the process, there's no hit between tool invocations (but this also greatly depends on what your CRT allocator does upon free() on macOS).