Using a custom allocator for LLVM

Hi all,

What do folks think about introducing an allocator layer, or allocator template parameters, so that custom allocators can be used in LLVM's data structures and memory allocation APIs?

Motivation
We at Azul noticed that using jemalloc (http://jemalloc.net) as a custom allocator at hot allocation sites in LLVM gave very noticeable compile time improvements on our workloads (in the range of a 10% geomean improvement in total compile time). We achieved this through a series of downstream changes to the LLVM codebase. An obvious question is why we don't just LD_PRELOAD the jemalloc library at runtime. This is very tricky to achieve when LLVM is just one component of the main application (in our case, LLVM is linked into our JVM). We also had to isolate the usage of jemalloc so that it applies only within LLVM and does not touch the remaining components.

Patch Categories
The patches which gave these improvements are divided into three main categories downstream:

  1. Using jemalloc's malloc and free APIs within SmallVector/SmallPtrSet and other data structures
  2. Overloading class-level operator new and operator delete for high-impact classes such as Value and Module
  3. Introducing a custom API for allocation functions and using it at high-impact allocation sites (~2-3 changes)
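To make category #2 concrete, here is a hedged sketch of a class-level operator new/delete overload. DemoValue, CustomMalloc and CustomFree are invented stand-ins (CustomMalloc/CustomFree would be je_malloc/je_free in the jemalloc case); this is not LLVM's actual Value class:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Stand-ins for je_malloc/je_free; any malloc-compatible API works here.
static void *CustomMalloc(std::size_t Size) { return std::malloc(Size); }
static void CustomFree(void *Ptr) { std::free(Ptr); }

// Illustrative high-impact class: every `new DemoValue` / `delete` of this
// type is routed through the custom allocator via class-level overloads,
// without touching any other allocation in the program.
class DemoValue {
public:
  void *operator new(std::size_t Size) {
    if (void *Ptr = CustomMalloc(Size))
      return Ptr;
    throw std::bad_alloc();
  }
  void operator delete(void *Ptr) noexcept { CustomFree(Ptr); }

  int Id = 0;
};
```

Because the overloads are paired on the same class, allocation and deallocation are guaranteed to go through the same allocator, which is the property the downstream patches rely on.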

For now, we plan on getting some traction upstream for #1 and #2. The first point is similar to std::vector having the option to specify a custom allocator (std::vector<T,Allocator>::vector - cppreference.com). We can templatize SmallVector to take a default allocator type. There are some issues with this which we need to address [A], but that is a secondary concern once we get a consensus on:

  1. allocators being a useful option/layer to have in upstream LLVM
  2. other ways to handle this upstream
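As a rough illustration of what an allocator template parameter on a SmallVector-like container could look like, here is a minimal sketch. TinyVector and MallocAllocatorSketch are hypothetical names; this is not LLVM's SmallVector, and note [A] below explains why retrofitting the real class is not straightforward:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stateless allocator policy, defaulted like std::allocator
// in std::vector<T, Allocator>.
struct MallocAllocatorSketch {
  static void *Allocate(std::size_t Size) { return std::malloc(Size); }
  static void Deallocate(void *Ptr) { std::free(Ptr); }
};

// Simplified small-vector: N elements of inline storage, heap growth
// routed through the allocator parameter. Restricted to trivially
// copyable T to keep the sketch short.
template <typename T, unsigned N, typename AllocT = MallocAllocatorSketch>
class TinyVector {
  T Inline[N]; // inline storage for the common small case
  T *Data = Inline;
  unsigned Size = 0, Capacity = N;

public:
  ~TinyVector() {
    if (Data != Inline)
      AllocT::Deallocate(Data);
  }
  void push_back(const T &V) {
    if (Size == Capacity) {
      Capacity *= 2;
      T *NewData = static_cast<T *>(AllocT::Allocate(Capacity * sizeof(T)));
      std::memcpy(NewData, Data, Size * sizeof(T));
      if (Data != Inline)
        AllocT::Deallocate(Data);
      Data = NewData;
    }
    Data[Size++] = V;
  }
  unsigned size() const { return Size; }
  T &operator[](unsigned I) { return Data[I]; }
};
```

A jemalloc-backed variant would just be a second policy struct calling je_malloc/je_free; because the policy is stateless, sizeof(TinyVector) is unchanged.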

Thanks,
Anna

[A] Introducing any member variable in the SmallVector class breaks certain static asserts about the size of SmallVector. Even if these static asserts are fixed, we crash when generating files using TableGen. It looks like there is some hardcoded logic in TableGen which relies on the size of SmallVector?

For Windows there is already a CMake option, LLVM_INTEGRATED_CRT_ALLOC, to support different allocators.

See llvm-project/CMakeLists.txt at main · llvm/llvm-project (github.com)

We use rpmalloc internally and get much better performance, especially with the linker and LTO.

I think it’s well known that LLVM does lots of heap allocations, and that swapping out the memory allocator can greatly improve performance.

I think it makes sense for LLVM to make it easy to use fast, custom allocators (see the aforementioned rpmalloc integration on Windows), but I would also like to minimize the code complexity involved. I don’t think we should go down the road of customizing our data structure allocation paths like you suggest. I think, in the end, the long term code complexity costs are very high.

The LLVM IR allocation codepaths are already highly customized, so I think it would be reasonable to customize them further to direct all IR allocations to a custom allocator. I really don’t want to see SmallVector gain an allocator template parameter. The STL allocator template parameter negatively affects many aspects of developer quality of life:

  • code complexity
  • compile time
  • symbol table size
  • debug info size

Any other solution seems preferable, like hooking the safe_malloc codepath.
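The "hook the safe_malloc codepath" idea could look roughly like the following sketch. The namespace, hook pointers, and safe_free are invented for illustration; LLVM's real safe_malloc has no such hooks (and no safe_free), so this shows the shape, not the current API:

```cpp
#include <cassert>
#include <cstdlib>

namespace hooked {
// Process-wide hooks, defaulted to the system allocator. Swapping in
// jemalloc would mean pointing these at je_malloc/je_free once at startup.
static void *(*AllocFn)(std::size_t) = std::malloc;
static void (*FreeFn)(void *) = std::free;

inline void *safe_malloc(std::size_t Size) {
  void *Ptr = AllocFn(Size);
  if (!Ptr)
    std::abort(); // the real safe_malloc reports a fatal error instead
  return Ptr;
}
// A matching release function so both sides go through the same hook.
inline void safe_free(void *Ptr) { FreeFn(Ptr); }
} // namespace hooked
```

The appeal is that the customization lives at one chokepoint instead of leaking a template parameter into every container signature.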

Why wouldn’t it be better for your app to use jemalloc for all the malloc/free calls, instead of just hand-picking a few hot sites in LLVM? You don’t need to use LD_PRELOAD, you can directly link your main binary against jemalloc (either as a shared library, or link it in statically, if you export the symbols).

+1 that we shouldn’t support a fine-grained pluggable allocator layer upstream.

Any other solution seems preferable, like hooking the safe_malloc codepath.

What stopped us from doing this was that memory allocated with safe_malloc is released with a plain call to free, which means we can quickly run into nasty problems like allocating with one allocator and deallocating with another (there’s no safe_free counterpart).

This can be solved by having a common API used for allocation and deallocation. These functions mostly live under MemAlloc.h/MemAlloc.cpp, and we already use some of them across various parts of LLVM:

  1. allocate_buffer and deallocate_buffer, used in DenseMap and all LLVM allocators (MallocAllocator and friends).
  2. safe_malloc, safe_realloc and safe_calloc, paired with plain free, used by LLVM data structures such as SmallVector, SmallPtrSet and friends.
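For illustration, here is a standalone sketch of the allocate_buffer/deallocate_buffer shape. The Size/Alignment signatures are modeled on llvm/Support/MemAlloc.h, but the bodies and the _sketch suffix are my own simplifications:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <new>

// The key property of this pair: every allocation has a matching
// deallocation function taking the same Size/Alignment, so both sides can
// be retargeted to a custom allocator together, with no stray free() calls.
inline void *allocate_buffer_sketch(std::size_t Size, std::size_t Alignment) {
  return ::operator new(Size, std::align_val_t(Alignment));
}

inline void deallocate_buffer_sketch(void *Ptr, std::size_t Size,
                                     std::size_t Alignment) {
  ::operator delete(Ptr, Size, std::align_val_t(Alignment));
}
```

Carrying Size and Alignment through to deallocation also lets an underlying allocator use sized-delete fast paths.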

We also call regular ::operator new and ::operator delete in various classes such as User, Use, Metadata, etc.

We can route all allocation calls through a common set of APIs, such as allocate_with_malloc, deallocate_with_free, allocate_with_new (and overloaded variants), and deallocate_with_delete. These would be NFCs, and they give two benefits:

  1. Clear APIs describing how allocation and deallocation happen
  2. Whenever any variant of operator new or operator delete is introduced in the LLVM codebase, we have a single point of change in the API, if desired.
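A minimal sketch of what those wrappers could look like; the function names come from the proposal above, while the bodies are my assumption of the simplest NFC version:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// malloc-family pair: one chokepoint for malloc/free-style allocations.
inline void *allocate_with_malloc(std::size_t Size) {
  void *Ptr = std::malloc(Size);
  if (!Ptr)
    throw std::bad_alloc();
  return Ptr;
}
inline void deallocate_with_free(void *Ptr) { std::free(Ptr); }

// new-family pair: one chokepoint for operator new/delete-style allocations.
inline void *allocate_with_new(std::size_t Size) { return ::operator new(Size); }
inline void deallocate_with_delete(void *Ptr) { ::operator delete(Ptr); }
```

Keeping the families separate (malloc/free vs. new/delete) preserves the invariant that no allocation is ever released by the other family's deallocator.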

I’m not against having a custom allocator in LLVM (because, as stated previously in the thread, LLVM allocates like crazy), but assuming we could provide this:

  1. How would you guarantee that all new code goes through the allocator too? When we (Unity) use LLVM with rpmalloc at present, we can safely assume that all allocations go via the globally overridden allocator.
  2. What would the cost be of intercepting all allocations and potentially routing them via something else? I’d guess this would be more expensive than globally replacing new/delete/malloc/free, which a lot of us are doing already, but maybe still less expensive than using the system allocator.
  3. We’d need the allocator not to be stored in some sort of global state (I’m mostly thinking of the existing Linux issues where multiple users share the same .so of LLVM concurrently).

One further shower thought I had: you could route a lot of the allocations (all Value, Instruction, etc.) via an LLVMContext instance and have an allocator attached to that. But that wouldn’t let you catch all the SmallVectors and so on that we use all over the place.
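That shower thought could be sketched as follows; Context, ContextAllocator, and IRNode are hypothetical stand-ins, not LLVM's LLVMContext/Value API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Per-context allocator: no global state, so two contexts in the same
// process (or the same shared library) can use different allocators.
struct ContextAllocator {
  std::size_t BytesAllocated = 0; // simple bookkeeping for the demo
  void *Allocate(std::size_t Size) {
    BytesAllocated += Size;
    return std::malloc(Size);
  }
  void Deallocate(void *Ptr) { std::free(Ptr); }
};

struct Context {
  ContextAllocator Alloc;
};

// IR-like object: creation and destruction are routed through the context,
// mirroring how Value/Instruction creation already takes an LLVMContext.
struct IRNode {
  int Opcode = 0;
  static IRNode *Create(Context &C, int Op) {
    void *Mem = C.Alloc.Allocate(sizeof(IRNode));
    IRNode *N = new (Mem) IRNode; // placement new into context memory
    N->Opcode = Op;
    return N;
  }
  static void Destroy(Context &C, IRNode *N) {
    N->~IRNode();
    C.Alloc.Deallocate(N);
  }
};
```

As noted above, this only covers objects created through the context; stack-local SmallVectors and similar containers would still bypass it.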

MemAlloc.h with its safe_malloc/allocate_buffer functions already serves as an allocation layer. But LLVM code is not very consistent with it. Some allocations go through MemAlloc.h, others are done via the standard allocator (e.g. allocations through new/delete).

If we were more consistent with using MemAlloc.h layer this would provide a convenient place for allocation-related hooks. It can be used to plug a different allocator or to add some bookkeeping like memory usage tracking.

We can make the use of an allocator layer more consistent. We can provide new and delete overloads that go through this layer, and we can make sure that LLVM’s ADTs use it. But I’m not sure we can guarantee that all allocations will be handled by it. For example, if you don’t go out of your way, std collections will use the standard allocator. I think we will need to accept that some of the allocations will bypass the allocator layer.
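"Going out of your way" for std collections would mean something like the following sketch: a minimal STL-compatible allocator routed through a MemAlloc-style layer. LayerAlloc/LayerFree and LayerAllocator are invented names standing in for the common allocation API:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins for the common allocation layer (e.g. a safe_malloc-style API).
inline void *LayerAlloc(std::size_t Size) {
  void *Ptr = std::malloc(Size);
  if (!Ptr)
    throw std::bad_alloc();
  return Ptr;
}
inline void LayerFree(void *Ptr) { std::free(Ptr); }

// Minimal C++11-style allocator: std containers parameterized with this
// route their heap traffic through the layer instead of std::allocator.
template <typename T> struct LayerAllocator {
  using value_type = T;
  LayerAllocator() = default;
  template <typename U> LayerAllocator(const LayerAllocator<U> &) {}
  T *allocate(std::size_t N) {
    return static_cast<T *>(LayerAlloc(N * sizeof(T)));
  }
  void deallocate(T *Ptr, std::size_t) { LayerFree(Ptr); }
};
// All instances are interchangeable (the allocator is stateless).
template <typename T, typename U>
bool operator==(const LayerAllocator<T> &, const LayerAllocator<U> &) {
  return true;
}
template <typename T, typename U>
bool operator!=(const LayerAllocator<T> &, const LayerAllocator<U> &) {
  return false;
}
```

The catch, as the paragraph above says, is that this only helps where the container type is spelled out with the custom allocator; every plain std::vector<T> elsewhere still bypasses the layer.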

This is fine for our (Azul) purposes. Anna’s experiments demonstrate that we only need to intercept several categories of allocation in order to get most of the compile-time gains.

  1. How would you guarantee that all new code goes through the allocator too? When we (Unity) use LLVM with rpmalloc at present, we can safely assume that all allocations go via the globally overridden allocator.

We do not guarantee that, because any new code added can in fact just choose to use malloc or operator new directly. However, what is guaranteed is that we will not use two different allocators for the same allocation (i.e. no nasty allocating through one allocator and freeing through another).

  1. What would the cost be of intercepting all allocations and potentially routing them via something else? I’d guess this would be more expensive than globally replacing new/delete/malloc/free, which a lot of us are doing already, but maybe still less expensive than using the system allocator.

So, we did try using the -Wl,--wrap linker option to guarantee wrapping all the allocations so they resolve to the symbols we want, but it had a high compile-time cost. There are ways to reduce this, but in the end the maintenance burden was too high (once we started looking at wrapping operator new and all of its mangled variants).

At this point, we’re thinking more in terms of “providing a common set of APIs for allocations” rather than “providing the option of a custom allocator for LLVM”. We did start off with the latter, but there have been convincing arguments against it.

Why wouldn’t it be better for your app to use jemalloc for all the malloc/free calls, instead of just hand-picking a few hot sites in LLVM?

I’d really like to see a response to this question. If you use jemalloc for ALL malloc/free in your program, then no changes are needed in LLVM, and all your allocations get faster.

If you use jemalloc for ALL malloc/free in your program, then no changes are needed in LLVM, and all your allocations get faster.

We did try this in a couple of ways. One was the -Wl,--wrap option mentioned earlier in the thread.

You don’t need to use LD_PRELOAD, you can directly link your main binary against jemalloc (either as a shared library, or link it in statically, if you export the symbols)

The only thing we could do is link jemalloc just into the libLLVM.so shared library, which is then linked into the JVM binary.
The issue here is that the LLVM headers are included in the JVM component. Some of the LLVM headers contain calls to free, but we cannot link jemalloc into the JVM binary (as mentioned earlier, there are user code components where we cannot just drop in a custom allocator, so linking jemalloc into the JVM binary does not work for us).

So, we can easily end up with the situation where an allocation was done via jemalloc while the deallocation went through the regular allocator.
We could separate out the part of the JVM which interacts with LLVM (into another library) and link that against jemalloc as well, which would mean the jemalloc allocator is used for LLVM code and for this part of the JVM which includes LLVM headers. However, having two different allocators in the same application is a recipe for nasty errors.
By choosing high-impact regions, we localize the use of the custom allocator.
