RFC: Replacing the default CRT allocator on Windows

Hello,

I was wondering how folks would feel about possibly replacing the default Windows CRT allocator in Clang, LLD, and other LLVM tools.

The CRT heap allocator on Windows doesn’t scale well on large core count machines. Any multi-threaded workload in LLVM that allocates often is impacted by this. As a result, link times with ThinLTO are extremely slow on Windows. We’re observing performance that is inversely proportional to the number of cores: the more cores the machine has, the slower ThinLTO linking gets.
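To make the contention concrete, here’s a toy repro sketch (hypothetical code, not our actual measurement setup): each thread makes a large number of small, short-lived allocations, roughly the pattern ThinLTO produces. On the default CRT heap, wall time grows with the thread count instead of staying flat.

    // Hypothetical micro-benchmark: every malloc/free contends on the
    // single process heap, so adding threads makes it slower, not faster.
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    int main(int argc, char **argv) {
      const unsigned Threads = argc > 1
                                   ? static_cast<unsigned>(std::atoi(argv[1]))
                                   : std::thread::hardware_concurrency();
      const auto Start = std::chrono::steady_clock::now();
      std::vector<std::thread> Pool;
      for (unsigned I = 0; I != Threads; ++I)
        Pool.emplace_back([] {
          for (int J = 0; J != 1000000; ++J)
            std::free(std::malloc(64)); // small, short-lived blocks
        });
      for (auto &T : Pool)
        T.join();
      const std::chrono::duration<double> Elapsed =
          std::chrono::steady_clock::now() - Start;
      std::printf("%u threads: %.2f s\n", Threads, Elapsed.count());
    }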

We’ve replaced the CRT heap allocator with modern lock-free, thread-caching allocators such as rpmalloc (Unlicense), mimalloc (MIT licence), or snmalloc (MIT licence). The runtime performance is an order of magnitude faster.
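For illustration, a minimal sketch of the kind of override involved, assuming rpmalloc is linked in and rpmalloc_initialize() has run at startup; fully replacing the CRT’s malloc/free on Windows additionally requires intercepting the CRT entry points, which is a separate, allocator-provided mechanism:

    // Sketch: route the global C++ allocation operators to rpmalloc
    // instead of the CRT heap. Array forms delegate to the scalar forms.
    #include <cstddef>
    #include <new>
    #include <rpmalloc.h>

    void *operator new(std::size_t Size) {
      if (void *P = rpmalloc(Size))
        return P;
      throw std::bad_alloc();
    }
    void operator delete(void *P) noexcept { rpfree(P); }
    void *operator new[](std::size_t Size) { return ::operator new(Size); }
    void operator delete[](void *P) noexcept { rpfree(P); }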

Time to link clang.exe with LLD and -flto on a 36-core machine:

Windows CRT heap allocator: 38 min 47 sec

mimalloc: 2 min 22 sec

rpmalloc: 2 min 15 sec

snmalloc: 2 min 19 sec

We’ve been running a downstream fork of LLVM + rpmalloc in production for more than a year. However, when cross-compiling for some specific game platforms we use other downstream forks of LLVM that we can’t change.

Two questions arise:

I'd appreciate the speed-up from including an alternative malloc, and the ease of using it if its source were included in the LLVM repository. We already have multiple sources from external projects included in the repository (among them gtest, gmock, ConvertUTF, google benchmark, ISL, ...) and I don't see a reason why it should be different for a malloc implementation.

AFAIK replacing malloc is quite common for Windows projects. The Windows default implementation (called "low fragmentation heap") has different optimization goals.

I'd start with including it in the repository and providing an option to enable it, with the possibility of making it the default after some experience has been collected, maybe even with multiple malloc implementations.
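For example (hypothetical macro names, nothing that exists in LLVM today), the build option could define which implementation a thin wrapper layer compiles against:

    // Hypothetical compile-time selection between bundled allocators; the
    // LLVM_ALLOCATOR_* macros would be set by the build system.
    #if defined(LLVM_ALLOCATOR_RPMALLOC)
    #include <rpmalloc.h>
    #define LLVM_MALLOC(Size) rpmalloc(Size)
    #define LLVM_FREE(Ptr) rpfree(Ptr)
    #elif defined(LLVM_ALLOCATOR_MIMALLOC)
    #include <mimalloc.h>
    #define LLVM_MALLOC(Size) mi_malloc(Size)
    #define LLVM_FREE(Ptr) mi_free(Ptr)
    #else // default: keep the CRT allocator
    #include <cstdlib>
    #define LLVM_MALLOC(Size) std::malloc(Size)
    #define LLVM_FREE(Ptr) std::free(Ptr)
    #endif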

Michael

I'm not against this for the executables, but I just wanted to check, to be 100% sure: is it possible to override malloc/free for clang but not for libclang.dll?

I think it’s fine to make the built executables use a different allocator, but it’d be a bigger pain if we forced an allocator by default on users who link against the LLVM libraries or shared libraries.

Cheers,
-Neil.

These numbers all seem very close (apart from the baseline). How many runs did you do and what was the jitter?

FWIW, I'm using snmalloc on FreeBSD instead of jemalloc and clang is around 2% faster, so it might be worth considering this as an option for all platforms. It's likely to be a big win on anything where dlmalloc is the default allocator.

Snmalloc currently supports macOS, Windows, Linux, FreeBSD, NetBSD, OpenBSD, Haiku, and OpenEnclave (adding other POSIXy systems is fairly trivial, can be completely trivial if you don't want any non-standard-POSIX behaviour) on x86, ARM, and PowerPC (RISC-V and MIPS under review).

I am obviously biased towards snmalloc, since I'm one of the authors, and happy to help out anyone wanting to integrate it with LLVM. Note that snmalloc requires C++17, so it would need to be conditional on LLVM being built with a vaguely modern compiler.

David

Hello David,

Please see below.

-----Original Message-----

Have you tried Microsoft’s new “segment heap” implementation? Only apps that opt in get it at the moment. Reportedly Edge and Chromium are seeing large memory savings from switching, but I haven’t seen performance comparisons.

If the performance is good, it seems like that might be the simplest choice.

https://docs.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype

https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf

Thanks for the suggestion, James. It reduces the commit size by about 900 MB (14.9 GB → 14 GB).

Unfortunately it does not solve the performance problem. The heap is global to the application and thread-safe, so every malloc/free locks it, which evidently doesn’t scale. We could manually create thread-local heaps, but I didn’t want to go there: ultimately allocated blocks need to share ownership between threads, and at that point it amounts to writing a new allocator. I suppose most non-Windows platforms already have lock-free thread-local arenas, which probably explains why this issue has gone (mostly) unnoticed.
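To spell out why per-thread heaps fall short, here is the idea as a sketch (the Win32 calls are real, the wrapper names are made up):

    #include <windows.h>

    // One growable Win32 heap per thread removes contention on allocation...
    thread_local HANDLE ThreadHeap =
        HeapCreate(/*Options=*/0, /*InitialSize=*/0, /*MaximumSize=*/0);

    void *threadAlloc(size_t Size) {
      return HeapAlloc(ThreadHeap, /*Flags=*/0, Size);
    }

    void threadFree(void *Ptr) {
      // ...but this is wrong in general: Ptr may have been allocated from
      // another thread's heap. Tracking each block's owning heap and handling
      // cross-thread frees is exactly what rpmalloc/mimalloc/snmalloc do.
      HeapFree(ThreadHeap, /*Flags=*/0, Ptr);
    }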

Sent: July 2, 2020 6:08 PM

Let’s make sure we’re all working from the same set of assumptions. There are several layers of allocators in a Windows application, and I think it’s worth being explicit, since I’ve already seen comments referring to multiple different “allocators.”

The C++ allocators and new and delete operators are generally built on malloc from the C run-time library. malloc implementations typically rely on process heaps (Win32 HeapAlloc) and/or go directly to virtual memory (Win32 VirtualAlloc). HeapAlloc itself uses VirtualAlloc.

C++ → malloc (CRT) → HeapAlloc (Win32) → VirtualAlloc (Win32)

Hi Adrian!

I completely agree with you; we should be clear on the wording. This proposal is about replacing both the MS CRT malloc layer AND the HeapAlloc layer.

C++ → {mimalloc|rpmalloc|snmalloc} → VirtualAlloc (Win32)
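That is, the replacement allocator manages its own thread caches and only goes to the OS for pages. Roughly (a sketch, not any of the three allocators’ actual code):

    #include <windows.h>

    // Grab address space directly from the OS, bypassing HeapAlloc and its
    // process-wide lock; the allocator carves per-thread caches out of
    // chunks like this one.
    void *osReserveCommit(size_t Bytes) {
      return VirtualAlloc(nullptr, Bytes, MEM_RESERVE | MEM_COMMIT,
                          PAGE_READWRITE);
    }

    void osRelease(void *Ptr) {
      VirtualFree(Ptr, /*Size=*/0, MEM_RELEASE); // size must be 0 with MEM_RELEASE
    }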

The bottom line is that libraries in LLVM allocate a lot, with lots of small allocations. I think that should be solved on its own, perhaps by using BumpPtrAllocator a bit more. But this is fragile: any new system in LLVM could later add lots of allocations and induce heap locking further down the line.
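For reference, the BumpPtrAllocator pattern batches many small allocations into slabs that are freed all at once, e.g.:

    #include "llvm/Support/Allocator.h"

    void buildSomething() {
      llvm::BumpPtrAllocator Alloc;
      // Carved out of the current slab by bumping a pointer; no heap lock
      // is taken per allocation.
      int *One = Alloc.Allocate<int>();
      char *Many = Alloc.Allocate<char>(256);
      (void)One;
      (void)Many;
      // No individual frees: every slab is released when Alloc is destroyed.
    }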

The MS CRT malloc layer is a very thin wrapper over HeapAlloc, but the issue we want to solve is contention at the HeapAlloc level. There’s only one heap by default for the entire process, and every allocating thread needs to “acquire” it, hence the lock. In the short term, I don’t see an easy way to solve this except by bypassing HeapAlloc completely.

Reid mentioned that a Google toolchain/platform team experimented with tcmalloc3 (the one published here: https://github.com/google/tcmalloc - not the one in gperftools) integrated into LLVM and got results similar to the ones in the review below.

Sent: July 6, 2020 11:52 AM

https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf seems to be the paper that goes with the slides I linked before. It says that there’s some sort of adaptive mechanism that allocates per-CPU “affinity slots” if it detects lots of lock contention, which seems like it ought to give good multithreaded behavior.

I see in your initial email that the sample backtrace is in “free”, not allocate. Is that just an example, or is “free” where effectively all the contention is? If the latter, I wonder if we’re hitting some pathological edge-case…e.g. allocating on one thread, and then freeing on different threads, or something along those lines.

Yes, sorry, the callstack was only showing “free”. However, the time is split equally between calls to HeapFree and HeapAlloc, so I don’t think this is a pathological case. We can clearly see in the profile traces threads stalling on the heap’s critical section and then being woken up later when the critical section is released by another thread.

As for the adaptive behavior of the “affinity slots”, I got word that it doesn’t scale above 4 threads. From what I hear, the behavior of the segment heap is similar to the older low-fragmentation heap in terms of multi-threaded performance/contention, although I’d like to hear other opinions if anyone has deeper practical knowledge of Windows’ segment heap.

Sent: July 6, 2020 3:19 PM