[RFC] Attributes for Allocator Functions in LLVM IR

In principle, I am strongly in favour of annotations on allocators, but I retain concerns about inlining. @nlopes was worried about attributes being lost. @nikic suggested that the possibility of losing call attributes is a common misconception, but inlining is precisely the case where call attributes are lost. This is fine for things like byval because they relate to the calling convention and become irrelevant after inlining. It’s not clear that the same is true of allocator attributes.

The root of the problem is that allocators are always nested. My favourite bit of UB in the C spec is that it’s UB to use a pointer after it has been passed to free, which means that (by a strict reading of the standard) it is impossible to implement free in C. I don’t want us to end up in the same situation with allocator attributes.

In a trivial object allocator, you have some OS facility (mmap, VirtualAlloc, and so on) as the first-level allocator and then you subdivide the large chunks that this gives you. Consider this trivial malloc implementation:

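#include <sys/mman.h> // mmap, MAP_* and PROT_* constants
#include <cstddef>    // size_t

static void *small_alloc(size_t);
extern "C" void *malloc(size_t size)
{
  if (size > PAGE_SIZE)
  {
    // mmap requires exactly one of MAP_PRIVATE or MAP_SHARED.
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
    // mmap reports failure with MAP_FAILED, malloc with a null pointer.
    return p == MAP_FAILED ? nullptr : p;
  }
  return small_alloc(size);
}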

If you’re doing whole-program optimisation then you would find that you statically know the size at a lot of call sites and so you’d inline either the small_alloc call or the mmap call. You might even inline the fast path of small_alloc.
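As a rough sketch of what that looks like at a call site, here is the same shape of allocator with two callers that pass constant sizes. Everything here is hypothetical: toy_malloc, large_alloc, small_alloc, the counters and the 4096-byte page size are stand-ins I made up for illustration, and both layers just forward to the C library so that the example runs anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for the two layers. A real allocator would call
// mmap or carve up a chunk here; the counters only make the chosen path
// observable.
static int large_calls = 0;
static int small_calls = 0;
constexpr std::size_t kPageSize = 4096; // assumed page size

static void *large_alloc(std::size_t size) { ++large_calls; return std::malloc(size); }
static void *small_alloc(std::size_t size) { ++small_calls; return std::malloc(size); }

// Same shape as the trivial malloc, under a different name so that it
// does not clash with the C library.
static inline void *toy_malloc(std::size_t size)
{
  if (size > kPageSize)
  {
    return large_alloc(size);
  }
  return small_alloc(size);
}

// Both sizes are compile-time constants, so once toy_malloc is inlined the
// size check can be folded and each caller is left with a direct call to
// exactly one layer.
void *alloc_header() { return toy_malloc(64); }
void *alloc_buffer() { return toy_malloc(1 << 20); }
```

After inlining, alloc_header no longer contains anything called toy_malloc at all, so any attribute that lived on that call is gone and only whatever is attached to small_alloc remains.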

In snmalloc, in the default configuration, we have a load of different layers of allocator:

  • The platform layer, which returns chunks (some multiple of page size).
  • The global range layer, which manages a global pool of address ranges that have been allocated by the platform layer. These are power-of-two multiples of a page size.
  • The per-thread range layer, which manages a smaller pool of address ranges of power-of-two multiples of a page size for allocating chunks and very large allocations.
  • The per-thread freelists that contain a list of allocations of a specific sizeclass, which are the things that malloc returns for any allocation that isn’t larger than a few pages.

The layers nearer the programmer have fast paths that are amenable to inlining. The fast path for malloc is around 10-15 instructions, so a program that statically links snmalloc would expect to inline it in a lot of places (it should shrink by 2-3 instructions if you statically know the size).

So where should we put our allocator annotations, and what happens with inlining? Will inlining malloc’s fast path lose the allocator annotations and generate worse code? Will it generate incorrect code if we annotate each layer and, after inlining, an analysis sees a chunk allocated directly from the per-thread range layer (because the allocation call site statically knows that the size is large) but freed with the generic free layer (because the free site can’t statically prove the size)?
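To make the mismatch scenario concrete, here is a minimal sketch of a two-layer allocator. This is not snmalloc code: range_alloc, range_free, small_alloc, small_free, toy_malloc, toy_free, the size header and the 4096-byte threshold are all invented for illustration, and both layers are backed by the C library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kLargeThreshold = 4096; // assumed cut-off between layers

// Header stored in front of every allocation so that the free path can
// recover the size at run time. Over-aligned so the user pointer that
// follows it keeps malloc's alignment.
struct alignas(std::max_align_t) Header { std::size_t size; };

static int range_allocs = 0, range_frees = 0;
static int small_allocs = 0, small_frees = 0;

static void *with_header(std::size_t size)
{
  auto *h = static_cast<Header *>(std::malloc(sizeof(Header) + size));
  if (h == nullptr) return nullptr;
  h->size = size;
  return h + 1; // user pointer starts just past the header
}

// Hypothetical inner layers. Imagine each pair carrying its own
// allocator annotations.
static void *range_alloc(std::size_t size) { ++range_allocs; return with_header(size); }
static void range_free(Header *h) { ++range_frees; std::free(h); }
static void *small_alloc(std::size_t size) { ++small_allocs; return with_header(size); }
static void small_free(Header *h) { ++small_frees; std::free(h); }

// Outer layer: the functions the programmer actually calls.
static inline void *toy_malloc(std::size_t size)
{
  return size > kLargeThreshold ? range_alloc(size) : small_alloc(size);
}

static inline void toy_free(void *p)
{
  if (p == nullptr) return;
  Header *h = static_cast<Header *>(p) - 1;
  if (h->size > kLargeThreshold) range_free(h); else small_free(h);
}

// Allocation site: the size is a constant, so after inlining this is a
// direct call to range_alloc.
void *make_big() { return toy_malloc(1 << 16); }

// Free site: the size is only known at run time, so this stays a generic
// free that dispatches on the header.
void release(void *p) { toy_free(p); }
```

At run time the pointer from make_big is correctly returned to the range layer, but an analysis looking at the inlined code sees memory that came from range_alloc being handed to something that is not syntactically range_free. Whether that is treated as a mismatch depends on which layers carry the annotations, which is exactly the question.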

Should the allocator attributes be a flag that an early inlining pass should not inline the call, but that a later one can, as long as it also strips the allocator attributes on all functions in the module?
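At the source level, the nearest thing available today is a blunt approximation of this idea. The sketch below uses the GCC/Clang function attributes malloc, alloc_size and noinline on a hypothetical annotated_alloc (the name and the TOY_ALLOCATOR macro are mine). Note that noinline blocks inlining permanently, so it only models the "early passes must keep the call" half of the suggestion, not the later "inline and strip" half.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// GCC/Clang function attributes; other compilers get an empty macro.
// noinline keeps the annotated call visible to every pass, at the cost
// of never inlining the fast path.
#if defined(__GNUC__) || defined(__clang__)
#define TOY_ALLOCATOR __attribute__((malloc, alloc_size(1), noinline))
#else
#define TOY_ALLOCATOR
#endif

// Hypothetical outermost allocator entry point, backed by the C library
// so that the example is self-contained.
TOY_ALLOCATOR void *annotated_alloc(std::size_t size)
{
  return std::malloc(size);
}
```

A caller would use annotated_alloc exactly like malloc and release the result with free. The staged version would need the inliner itself to understand the attribute, which is not something that can be expressed from the source.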