[RFC] Attributes for Allocator Functions in LLVM IR

davidchisnall · April 13, 2022, 8:36am

In principle, I am strongly in favour of annotations on allocators, but I retain concerns about inlining. @nlopes was worried about attributes being lost. @nikic suggested that the possibility of losing call attributes is a common misconception, but inlining is exactly the case where call attributes are lost. This is fine for things like byval, because they relate to the calling convention and become irrelevant after inlining. It’s not clear that the same holds for allocator attributes.

The root of the problem is that allocators are always nested. My favourite bit of UB in the C spec is that it’s UB to use a pointer after it has been passed to free, which means that (by a strict reading of the standard) it is impossible to implement free in C. I don’t want allocator attributes to put us in the same situation.

In a trivial object allocator, you have some OS facility (mmap, VirtualAlloc, and so on) as the first-level allocator and then you subdivide the large chunks that this gives you. Consider this trivial malloc implementation:

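#include <cstddef>
#include <sys/mman.h>

// PAGE_SIZE is assumed to be provided by the platform headers.
static void *small_alloc(size_t);
extern "C" void *malloc(size_t size)
{
  if (size > PAGE_SIZE)
  {
    // Large requests go straight to the OS. mmap needs MAP_PRIVATE or
    // MAP_SHARED, and reports failure as MAP_FAILED rather than nullptr.
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
  }
  return small_alloc(size);
}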

If you’re doing whole-program optimisation then you would find that you statically know the size at a lot of call sites and so you’d inline either the small_alloc call or the mmap call. You might even inline the fast path of small_alloc.
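To make that concrete, here is a minimal sketch of what the call sites reduce to once the size is a compile-time constant. Every name in it (my_malloc, small_alloc, large_alloc, kPageSize, last_path) is hypothetical, and both layers simply borrow std::malloc so that the example runs anywhere; it only illustrates the shape of the code after inlining, not any real allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// All names here are hypothetical stand-ins, not a real allocator's API.
constexpr std::size_t kPageSize = 4096; // assumed page size

enum class Path { None, Small, Large };
static Path last_path = Path::None; // records which layer served the last request

// Stand-ins for the two layers; both borrow std::malloc so the sketch runs anywhere.
static void *small_alloc(std::size_t size)
{
  last_path = Path::Small;
  return std::malloc(size);
}
static void *large_alloc(std::size_t size)
{
  last_path = Path::Large;
  return std::malloc(size);
}

// The wrapper that a whole-program optimiser would be expected to inline.
static inline void *my_malloc(std::size_t size)
{
  if (size > kPageSize)
  {
    return large_alloc(size);
  }
  return small_alloc(size);
}

// With a constant argument the branch can fold away after inlining, so each
// of these is effectively a direct call to one layer and the my_malloc call
// itself no longer appears at the call site.
void *alloc_header() { return my_malloc(64); }
void *alloc_buffer() { return my_malloc(std::size_t{1} << 20); }

// Small self-check of which path each call site takes.
void demo()
{
  void *h = alloc_header();
  assert(last_path == Path::Small);
  void *b = alloc_buffer();
  assert(last_path == Path::Large);
  std::free(h);
  std::free(b);
}
```

The point for this discussion is that, after inlining, any attribute that lived only on the call to my_malloc has nothing left to attach to: the optimiser sees a call to small_alloc or large_alloc instead.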

In snmalloc, in the default configuration, we have a load of different layers of allocator:

The platform layer, which returns chunks (some multiple of page size).
The global range layer, which manages a global pool of address ranges that have been allocated by the platform layer. These are power-of-two multiples of a page size.
The per-thread range layer, which manages a smaller pool of address ranges of power-of-two multiples of a page size for allocating chunks and very large allocations.
The per-thread freelists that contain a list of allocations of a specific sizeclass, which are the things that malloc returns for any allocation that isn’t larger than a few pages.

The layers nearer the programmer have fast paths that are amenable to inlining - the fast path for malloc is around 10-15 instructions and so a program that statically links snmalloc would expect to inline that in a lot of places (it should shrink by 2-3 instructions if you statically know the size).
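As a rough illustration of why the fast path is so small, here is a sketch of a per-thread sizeclass freelist with a slow-path fallback. This is not snmalloc's code: the sizeclass scheme (powers of two from 16 to 2048 bytes), the names, and the use of std::malloc as the slow path are all assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch; not snmalloc's actual data structures.
constexpr std::size_t kMinShift = 4;   // assumed smallest class: 16 bytes
constexpr std::size_t kNumClasses = 8; // assumed classes: 16, 32, ..., 2048

struct FreeObject
{
  FreeObject *next;
};

// One freelist per sizeclass, per thread.
thread_local FreeObject *freelists[kNumClasses] = {};

// Smallest power-of-two class that fits `size`. This folds to a constant
// when the size is statically known at the call site.
constexpr std::size_t size_to_class(std::size_t size)
{
  std::size_t cls = 0;
  std::size_t cap = std::size_t{1} << kMinShift;
  while (cap < size)
  {
    cap <<= 1;
    ++cls;
  }
  return cls;
}

constexpr std::size_t class_to_size(std::size_t cls)
{
  return std::size_t{1} << (cls + kMinShift);
}

// Slow path: stand-in for refilling from the per-thread range layer.
void *alloc_slow(std::size_t cls)
{
  return std::malloc(class_to_size(cls));
}

// Fast path: pop the head of the matching freelist, else fall back.
inline void *alloc_fast(std::size_t size)
{
  std::size_t cls = size_to_class(size);
  assert(cls < kNumClasses); // larger sizes belong to another layer
  FreeObject *head = freelists[cls];
  if (head != nullptr)
  {
    freelists[cls] = head->next;
    return head;
  }
  return alloc_slow(cls);
}

// Push the object back on the freelist for its sizeclass.
inline void dealloc_fast(void *p, std::size_t size)
{
  std::size_t cls = size_to_class(size);
  assert(cls < kNumClasses);
  auto *obj = static_cast<FreeObject *>(p);
  obj->next = freelists[cls];
  freelists[cls] = obj;
}
```

When the size is a constant, size_to_class disappears and the fast path is a load, a null check, and a store, which is the part one would expect to see inlined at call sites.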

So where should we put our allocator annotations, and what happens with inlining? Will inlining malloc’s fast path lose the allocator annotations and generate worse code? Will it generate incorrect code if we annotate each layer and, after inlining, an analysis sees a chunk allocated directly from the per-thread range layer (because the allocation call site statically knows that the size is large) but freed with the generic free layer (because the free site can’t statically prove the size)?
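The mismatch in that last question can be sketched as follows. The two layers, the threshold, and the side table used as metadata are all hypothetical, and std::malloc again stands in for the real backing store. The sketch only shows the pairing an analysis would see after inlining: an allocation from the range layer released through the generic free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Hypothetical two-layer allocator; none of these names come from snmalloc.
enum class Layer { Range, Freelist };
constexpr std::size_t kLargeThreshold = 16384; // assumed cut-off for "large"

// Side table standing in for the allocator's real metadata lookup.
static std::unordered_map<void *, Layer> owner;

void *range_alloc(std::size_t size)
{
  void *p = std::malloc(size);
  if (p != nullptr)
  {
    owner[p] = Layer::Range;
  }
  return p;
}

void *freelist_alloc(std::size_t size)
{
  void *p = std::malloc(size);
  if (p != nullptr)
  {
    owner[p] = Layer::Freelist;
  }
  return p;
}

// Generic entry point: picks a layer based on the requested size.
inline void *my_malloc(std::size_t size)
{
  return size > kLargeThreshold ? range_alloc(size) : freelist_alloc(size);
}

// Generic free: the size is unknown here, so the owning layer is found at
// run time. It returns the layer only so that the example can be checked.
Layer my_free(void *p)
{
  auto it = owner.find(p);
  assert(it != owner.end());
  Layer layer = it->second;
  owner.erase(it);
  std::free(p);
  return layer;
}

// What a call site looks like once my_malloc(1 << 20) has been inlined and
// folded: a direct range_alloc call paired with the generic my_free.
Layer large_round_trip()
{
  void *p = range_alloc(std::size_t{1} << 20);
  return my_free(p);
}
```

At run time this is correct, because my_free routes the pointer back to the right layer. The concern is an analysis that assumes a pointer is always released by the deallocator belonging to the layer that was annotated as producing it; if each layer carried its own annotations, large_round_trip is the pattern it would have to handle.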

Should the allocator attributes be a flag that an early inlining pass should not inline the call, but that a later one can, as long as it also strips the allocator attributes on all functions in the module?
