RFC: Replacing the default CRT allocator on Windows

For release builds, I think this is fine. However, for debug builds, the Windows allocator provides a lot of built-in functionality for debugging memory issues that I would be very sad to lose. Therefore, I would request that the CRT debug allocator remain available, and on by default, for debug builds.

Note that ASAN support is present on Windows now. Does the Debug CRT provide any features that are not better served by ASAN?

ASan and the Debug CRT take different approaches, but the problems they cover largely overlap.

Both help detect errors like buffer overruns, double frees, use-after-frees, etc. ASan generally gives you more immediate feedback on those, but you pay a higher price in performance. The Debug CRT lets you trade off between the performance hit and how soon it detects problems.

The ASan documentation says leak detection is experimental on Windows, while the Debug CRT’s leak detection is mature and robust (and can be nearly automatic in debug builds). By adding a couple of calls, you can do finer-grained leak detection than just checking what remains when the program exits.

The Debug CRT lets you hook all of the malloc calls if you want, so you can extend it with your own kinds of tracking and bug detection. But I don’t think that feature is often used.

Windows’s Application Verifier (AppVerifier) is cool and powerful. I cannot remember for sure, but I think some of its features might depend on the Debug CRT. One thing it can do is simulate allocation failures so you can test your program’s recovery code, though most programs nowadays assume memory allocation never fails and will just crash if it ever does.

Bear in mind that the ASan allocator isn’t particularly suited to detecting memory corruption unless you compile LLVM/Clang with ASan instrumentation as well. I don’t imagine anybody would propose making the debug build for Windows ASan-ified by default.

+Kostya Kortchinsky

w.r.t. the licensing problem of a new allocator: have you considered using Scudo? The version in compiler-rt is the upstream (and thus fully licensed with LLVM), and it’s what we use as the production allocator in Android. The docs are a little out of date (see the source code in //compiler-rt/lib/scudo/standalone for the bleeding edge), and it doesn’t currently support Windows out of the box, but there have been some successful experiments to get it working. I don’t imagine that getting full support would be more challenging than setting up some sort of frankenbuild. From Kostya (who maintains Scudo), “I don’t think the port is going to be a lot of effort”.

I hadn’t heard this before. If I use clang with -fsanitize=address to build my program, and then run my program, what difference does it make for the execution of my program whether the compiler itself was instrumented or not? Do you mean that ASAN runtime itself should be instrumented, since your program loads that at runtime?

If I use clang with -fsanitize=address to build my program, and then run my program, what difference does it make for the execution of my program whether the compiler itself was instrumented or not?

Right, it doesn’t make a difference to your final executable whether the compiler was built with ASan or not.

Do you mean that ASAN runtime itself should be instrumented, since your program loads that at runtime?

Sanitizer runtimes aren’t instrumented with sanitizers :).

To be clear, we’re talking about replacing the runtime allocator for clang/LLD/etc., right?

This is my understanding. I want to ensure that the CRT debug allocator remains available, and on by default, for debug builds so that I can use it to troubleshoot memory corruption issues in clang/LLVM/etc. itself. The alternative would be instrumenting debug builds of LLVM with ASan to provide similar benefits.

If I’m reading downthread correctly, it takes something like 40 minutes to link clang.exe with LLD using LTO if LLD is using the CRT allocator, and something like 3 minutes if LLD is using some other allocator. Assuming these numbers are correct, and nothing was wrong with the LLD built with the CRT allocator, this certainly seems like a compelling reason to switch allocators. However, I doubt anybody is trying to use an LLD built in debug mode on Windows to link clang.exe with LTO; I imagine it’d take an actual day to finish that build. The main use for a clang.exe built in debug mode on Windows is to build small test programs and lit tests and such with a debugger attached. For this use case, I believe that the CRT debug allocator is the correct choice.

As a side note, these numbers seem very fishy to me. While it’s tempting to say “malloc is a black box: I ask for a pointer, I get a pointer; I shouldn’t have to know what it does internally” and just replace the allocator, I feel like this merits investigation. Why are we allocating so much? Perhaps we should try to find ways to reduce the number of allocations. Are we doing something silly like creating a new std::vector in every iteration of an inner loop somewhere? If we have tons of unnecessary allocations, we could potentially speed up LLD on all platforms. Three minutes is still a really long time; if we could get that down to 30 seconds, that would be amazing. I keep hearing that each new version of LLVM takes longer to compile than the last. Perhaps it is time for us to figure out why? Maybe it’s lots of unnecessary allocations.


Christopher Tetreault

That sounds like an interesting idea. What does it take to complete/land the Windows port? Do you think the performance would be equivalent to that of the allocators mentioned in the review?

Sent: July 7, 2020 5:15 PM

  1. Completing the Windows port requires porting the platform-specific functions to Windows, probably updating some long vs. long long situations, maybe size_t, etc. An old CL (https://reviews.llvm.org/D42519) for the non-standalone version of Scudo shows the gist of it. A good part of the work requires a Windows dev environment for LLVM/compiler-rt, which I no longer have.
  2. Scudo is meant to be more secure by default, and performance might not be fully on par with allocators that don’t check much. The allocator is highly configurable, though, and I am pretty sure we can get close. The rest is all a matter of balancing RSS vs. performance vs. security, which is a tricky business. It offers a couple of models of per-thread caching (either a unique or a shared pool), configurable buckets/bins, and so on.