Just upgraded from LLVM 18.1.8 to 19.1.7 and noticed that the shipped x86_64-pc-windows-msvc binaries no longer depend on the MSVC runtime DLLs. Although the msvc*.dll and vcruntime*.dll files are still shipped in the “bin” directory, a quick check with Dependency Walker confirms that none of the *.exe and *.dll files depend on them; in the LLVM 18.1.8 release they still did.
While I personally like this change, I have the feeling that it sneaked in unofficially. This impression is reinforced by the fact that the runtime DLLs are still being shipped even though they are no longer needed.
Please have a look and let me know whether this change was intended and if it’s going to stay for the upcoming releases.
It might be interesting to know the time, size, and performance differences this causes. There are security concerns here (statically linking basically puts responsibility for C/C++ runtime updates on the LLVM project). But if there are actual performance wins to be had at the cost of size and maintenance, it might be worth it.
Just for reference, Microsoft suggests: “App-local deployment of the UCRT is supported, though not recommended for both performance and security reasons”
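For context, with a recent CMake the static vs. dynamic CRT choice is a single configure-time switch; a rough sketch (LLVM-specific cache options aside):

rem CMake 3.15+ (policy CMP0091): MultiThreaded = static /MT, MultiThreadedDLL = dynamic /MD.
cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded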
For example, the savings on a Release+asserts+symbols Clang build on a Threadripper PRO 3975WX are approx. 30 seconds on a recent Windows 11 install with no anti-virus, security stack, or other minifilter drivers in the way:
(default Windows Heap)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
Time (mean ± σ): 392.716 s ± 3.830 s [User: 17734.025 s, System: 1078.674 s]
Range (min … max): 390.127 s … 399.449 s 5 runs
(rpmalloc)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
Time (mean ± σ): 360.824 s ± 1.162 s [User: 15148.637 s, System: 905.175 s]
Range (min … max): 359.208 s … 362.288 s 5 runs
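(For reference, the rpmalloc numbers come from replacing the CRT allocator at configure time; the exact cache variables have shifted between LLVM versions, e.g. older trees select the static CRT with -DLLVM_USE_CRT_RELEASE=MT, so take this as a rough sketch with an illustrative rpmalloc path:)

rem Static CRT plus the CRT allocator replacement pointing at an rpmalloc checkout (path illustrative).
cmake -G Ninja -S llvm -B stage1_test2 ^
  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ^
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded ^
  -DLLVM_INTEGRATED_CRT_ALLOC=C:\src\git\rpmalloc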
As for ThinLTO linking with LLD, the reduction is even more significant; the more cores the machine has, the larger the reduction:
| Machine | Allocator | Time to link |
| --- | --- | --- |
| 16c/32t AMD Ryzen 9 5950X | Windows Heap | 10 min 38 sec |
| 16c/32t AMD Ryzen 9 5950X | Rpmalloc | 4 min 11 sec |
| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Windows Heap | 23 min 29 sec |
| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Rpmalloc | 2 min 11 sec |
| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Rpmalloc + /threads:64 | 1 min 50 sec |
| 176 vCPU 2x Intel Xeon Platinum 8481C (fixed clock 2.7 GHz) | Windows Heap | 43 min 40 sec |
| 176 vCPU 2x Intel Xeon Platinum 8481C (fixed clock 2.7 GHz) | Rpmalloc | 1 min 45 sec |
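(The /threads:64 row just sets LLD's worker-thread count explicitly instead of relying on its default heuristic; in a CMake build that can be routed through the linker flags, roughly as follows, with the flag placement being illustrative:)

rem lld-link accepts /threads:N; passing it via CMake's linker flags is one way to set it for the whole build.
cmake ... -DCMAKE_EXE_LINKER_FLAGS="/threads:64" -DCMAKE_SHARED_LINKER_FLAGS="/threads:64"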
Thanks for the details there! I think that the performance improvements are interesting, but also might be conflating the benefits of rpmalloc itself. It would be interesting to compare LLVM with and without static linking only.
FWIW, we (at The Browser Company) found that mimalloc is pretty good and gave a comparable time reduction (~8%?) in build times for our codebase. That still allows for dynamic replacement of the memory allocator.
rpmalloc doesn’t support runtime (IAT I think?) patching like mimalloc does. The patching is done statically at link time, thus the requirement for StaticCRT.
Right, I understand the requirement for rpmalloc. I was thinking more about the C runtime aspect: including rpmalloc would of course throw off the performance characteristics, and I was hoping we could get a clearer understanding of the CRT linkage change on its own.
Another thing that might be interesting to consider is a hybrid CRT approach (which Microsoft itself uses). That gives a partially static, partially dynamic link, so the binaries still pick up at least some security/performance updates from the system libraries.
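As a minimal sketch of the hybrid idea (the flags below are the commonly cited ones, not something the LLVM build does today): keep the VC runtime static, but swap the static UCRT for the OS-serviced dynamic one at link time:

rem Static VC runtime (/MT), then drop the static UCRT and pull in the dynamic, Windows-serviced ucrt instead.
rem (Debug configurations would use libucrtd.lib.)
cmake -G Ninja -S llvm -B build ^
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded ^
  -DCMAKE_EXE_LINKER_FLAGS="/NODEFAULTLIB:libucrt.lib /DEFAULTLIB:ucrt.lib" ^
  -DCMAKE_SHARED_LINKER_FLAGS="/NODEFAULTLIB:libucrt.lib /DEFAULTLIB:ucrt.lib"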
I think /MT over /MD gives about 1-2% runtime improvements overall in my past testing, for example: D55056 [CMake] Default options for faster executables on MSVC. I recall that /GS- used to give the same kind of improvement (in the ~1-2% range), but maybe this has changed in recent Windows versions and with more recent CPUs.
The Hybrid CRT is an interesting idea, thanks for suggesting that!
Interesting numbers. Is “windows heap” here the default old allocator or the newer “segment heap” one? If it’s the old one, would you be able to collect numbers for the segment one?
Last time I tried, there wasn’t much difference with the segment heap (vs. the regular heap), but I can try again. Sadly, LLVM allocates a lot during in-process ThinLTO, and that creates contention if the allocator takes locks.
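IIRC the segment heap can also be forced for a single binary via the Image File Execution Options registry key, which makes it easy to A/B without rebuilding; the exact value is from memory, so treat it as an assumption:

rem FrontEndHeapDebugOptions: 0x08 should opt the named binary into the segment heap (0x04 forces the legacy heap).
reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\lld-link.exe" /v FrontEndHeapDebugOptions /t REG_DWORD /d 8 /f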