LLVM x86_64-pc-windows-msvc binaries no longer need MSVC runtime DLLs since 19.x

Hi all!

Just upgraded from LLVM 18.1.8 to 19.1.7 and noticed that the shipped x86_64-pc-windows-msvc binaries no longer depend on the MSVC runtime DLLs. Although the msvc*.dll and vcruntime*.dll files are still shipped in the “bin” directory, a quick investigation with Dependency Walker confirms that none of the *.exe and *.dll files depend on them. In the LLVM 18.1.8 release, they still did.
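For anyone without Dependency Walker at hand, dumpbin from the MSVC toolchain shows the same thing (the install path below is just an example):

rem list the DLLs a shipped binary imports (example install path)
dumpbin /dependents "C:\Program Files\LLVM\bin\clang.exe"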

I checked the Release Notes and found no official announcement for this change. Git blame points to either [Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit r… · llvm/llvm-project@67226ba · GitHub or Revert "[asan][windows] Eliminate the static asan runtime on windows … · llvm/llvm-project@0a93e9f · GitHub as the likely cause. Both changed the default CMAKE_MSVC_RUNTIME_LIBRARY setting from “MultiThreadedDLL” to “MultiThreaded”.
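For context, this is the CMake setting in question; pinning it explicitly on the configure line (instead of relying on the changed default) would look roughly like this, with the rest of the line only a placeholder:

rem MultiThreadedDLL = /MD (dynamic CRT), MultiThreaded = /MT (static CRT)
cmake -G Ninja -S llvm -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDLL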

While I personally like this change, I have the feeling that it sneaked in unofficially. This is confirmed by the fact that you still ship the runtime DLLs although they are no longer needed.

Please have a look and let me know whether this change was intended and if it’s going to stay for the upcoming releases.

Best regards,

Colin Finck

I think you found the right commit: [Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit r… · llvm/llvm-project@67226ba · GitHub

@aganea can confirm.

While I personally like this change, I have the feeling that it sneaked in unofficially. This is confirmed by the fact that you still ship the runtime DLLs although they are no longer needed.

Yes, I think it was just a side effect of the rpmalloc change. The reason we still ship the UCRT DLLs is that we still pass the cmake flag: llvm-project/llvm/utils/release/build_llvm_release.bat at ee3bccab34f57387bdf33853cdd5f214fef349a2 · llvm/llvm-project · GitHub. I suppose we could drop that now.

Since this shipped in LLVM 19 and nobody complained about it, I assume it will stick.


It might be interesting to know the build-time, size, and performance differences this causes. There are security concerns here (basically it puts responsibility for C/C++ runtime updates on the LLVM project). But if there are actual performance wins to be had at the cost of size and maintenance, it might be worth it.

Just for reference, Microsoft suggests: “App-local deployment of the UCRT is supported, though not recommended for both performance and security reasons.”

In our case LLVM retail builds for Windows now use the StaticCRT, after [Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit r… · llvm/llvm-project@67226ba · GitHub.

The usage of the StaticCRT is required by rpmalloc, since it only supports link-time function replacement (of the CRT malloc/free).
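If anyone needs the old behavior locally, my understanding is that a configure line along these lines opts back out; LLVM_ENABLE_RPMALLOC is the option added by that PR, but take the exact spelling here as an assumption and check the CMake cache:

rem assumed opt-out: disable the vendored rpmalloc and return to the dynamic CRT
cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_RPMALLOC=OFF -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDLL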

Removing the CRT DLLs does not have a big impact on the LLVM installation: -3.5MB saved locally once installed.

As for performance, it is quite important, depending on your configuration. Some figures are highlighted in the above PR description: [Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit release by aganea · Pull Request #91862 · llvm/llvm-project · GitHub

For example the savings on a Release+asserts+symbols Clang build on a Threadripper PRO 3975WX is approx 30 secs on a recent Windows 11 with no anti-virus, security stack or other minifilter drivers in the way:

(default Windows Heap)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     392.716 s ±  3.830 s    [User: 17734.025 s, System: 1078.674 s]
  Range (min … max):   390.127 s … 399.449 s    5 runs

(rpmalloc)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     360.824 s ±  1.162 s    [User: 15148.637 s, System: 905.175 s]
  Range (min … max):   359.208 s … 362.288 s    5 runs

As for ThinLTO linking with LLD, the reduction is even more significant; the more cores the machine has, the bigger the reduction:

Machine                                                       Allocator                 Time to link

16c/32t AMD Ryzen 9 5950X                                     Windows Heap              10 min 38 sec
                                                              Rpmalloc                   4 min 11 sec

32c/64t AMD Ryzen Threadripper PRO 3975WX                     Windows Heap              23 min 29 sec
                                                              Rpmalloc                   2 min 11 sec
                                                              Rpmalloc + /threads:64     1 min 50 sec

176 vCPU 2x Intel Xeon Platinum 8481C (fixed clock 2.7 GHz)   Windows Heap              43 min 40 sec
                                                              Rpmalloc                   1 min 45 sec

Thanks for the details there! I think that the performance improvements are interesting, but also might be conflating the benefits of rpmalloc itself. It would be interesting to compare LLVM with and without static linking only.

FWIW, we (at The Browser Company) found that mimalloc is pretty good and gave a comparable time reduction (~8%?) in build times for our codebase. That still allows for dynamic replacement of the memory allocator.

rpmalloc doesn’t support runtime (IAT I think?) patching like mimalloc does. The patching is done statically at link time, thus the requirement for StaticCRT.

Right, I understand the requirement for rpmalloc. I was thinking more about the C runtime aspect. The rpmalloc being included would of course throw off the performance characteristics and I was hoping that we could get a clearer understanding of that.

Another thing that might be interesting is to consider a hybrid CRT approach (which Microsoft itself also uses). This links the CRT partially statically and partially dynamically, so that at least some security/performance updates still come from the system libraries.
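A rough sketch of the idea, based on Microsoft's published Hybrid CRT setup (static vcruntime/STL, but the UCRT linked dynamically from the OS so it keeps getting serviced); the exact flags are an assumption and would need validating against the release script:

rem Hybrid CRT sketch: keep the static CRT, drop the static UCRT, link the OS-provided one
cmake -G Ninja -S llvm -B build ^
  -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded ^
  -DCMAKE_EXE_LINKER_FLAGS="/NODEFAULTLIB:libucrt.lib /DEFAULTLIB:ucrt.lib" ^
  -DCMAKE_SHARED_LINKER_FLAGS="/NODEFAULTLIB:libucrt.lib /DEFAULTLIB:ucrt.lib"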

I think /MT over /MD gives about a 1-2% runtime improvement overall in my past testing, for example: ⚙ D55056 [CMake] Default options for faster executables on MSVC. I recall that /GS- used to give the same kind of improvement (in the ~1-2% range), but maybe this has changed in recent Windows versions and with more recent CPUs.
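Both of those are plain MSVC options, so they are easy to try on a local build; the line below is just an illustration, not the release configuration:

rem /MT comes from CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded; /GS- disables the buffer security check
cmake -G Ninja -S llvm -B build -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded -DCMAKE_CXX_FLAGS="/GS-"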

The Hybrid CRT is an interesting idea, thanks for suggesting that!

Interesting numbers. Is “windows heap” here the default old allocator or the newer “segment heap” one? If it’s the old one, would you be able to collect numbers for the segment one?

Last time I tried, there wasn’t much difference when using the segment heap (vs the regular heap), but I can try again. Sadly LLVM allocates a lot during in-process ThinLTO, and that creates contention if the allocator takes locks.
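If it helps reproduction: as far as I know, the segment heap can be opted into per executable via an Image File Execution Options registry value, without rebuilding anything. The value and bit below come from Microsoft's segment heap documentation, so double-check them before relying on this:

rem opt lld-link.exe into the segment heap (data 8 enables it, 4 disables it); needs an elevated prompt
reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\lld-link.exe" /v FrontEndHeapDebugOptions /t REG_DWORD /d 8 /f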