[RFC] Attributes for Allocator Functions in LLVM IR

(I previously brought this up on the mailing list, with some good comments, but things have evolved enough since that thread I’m starting a fresh conversation)

Background

Rust uses LLVM for codegen, and has its own allocator functions. In order for
LLVM to correctly optimize out allocations we have to tell the optimizer about
the allocation/deallocation functions used by Rust, currently in a patch[0] that
Rust upstream is uncomfortable landing (more on that below). In particular, we
need to:

  1. Support replacement of allocation with an SSA value when the address of an
    allocation doesn’t escape (other than to the deallocator). This requires
    being able to recognize the allocation call (and its kind, alloc or
    realloc), know what argument to a deallocator call is the freed pointer, and
    know that the allocation and deallocation functions match. We also have to
    know that the allocation is one that is permissible to remove (explicit
    calls to C++'s new operator are not removable, as an example). If we can
    know the initial contents of the allocation (e.g. calloc) this optimization
    can apply even when the pointer may be read from prior to being freed. We
    need to know all the properites of the allocator function and its arguments
    except allocation alignment (as the address ceases to exist) to perform this
    optimization.

  2. Heap-to-stack conversion: when the allocation’s address does escape but
    we’re certain that it is freed in this function, rather than by a callee or
    caller we can convert the allocation into a stack allocation. In this case
    we need to know everything above, plus the alignment of the allocation. If
    the alignment is specified as a parameter, we can mark that parameter with
    an attribute and permit this optimization. Note that for a function with an
    implied alignment like malloc we can’t determine the alignment, nor can we
    trust the align attribute because that is semantically an _under_estimate
    of the alignment returned by the function. Rather, we should be using an
    _over_estimate of the alignment promised to users instead. This bug is
    present in the existing heap-to-stack logic in the Attributor, and is not
    addressed or affected by this proposal. Fortunately, the heap2stack logic in
    the Attributor is not included in the set of default passes (though it does
    run for OpenMP code.)

  3. Heap-to-global conversion: this has no additional requirements beyond what
    we’ve enumerated above for heap-to-stack. Note that the alignment defect described in (2) for H2S also appears to happen for H2G, but as best I can tell this hasn’t been a problem because of luck. Fixing the defect is not planned in my work, but the additional information in the IR may make such future work easier in the future.

Rust has tests to ensure it benefits from (1), and I don’t think the rest of the optimizations are significant for them. The other two constraints are so that we don’t paint ourselves into a corner with the current types of relevant optimizations. Languages supported by Clang, such as C and C++, have stable symbol names for their allocation functions, which are hardcoded in LLVM[1][2]. Unfortunately, this strategy does not work for Rust, where developers don’t want to commit to a particular symbol name and calling convention yet.

Proposal

We add several attributes to LLVM IR:

  • allockind(KIND): Marks a function as being a specific allocation function
    kind with various behaviors. KIND is a bit mask that can describe a
    function as returning a new allocation, a new zeroed allocation, resizing an
    allocation, or freeing an allocation. Specifically, the entries in the bit
    mask would be:

    • new: the function returns a new allocation
    • resize: the function resizes the given allocation
    • free: the function frees the given allocation
    • uninit: newly allocated memory is uninitialized
    • zeroed: newly allocated memory is zeroed
    • aligned: newly allocated memory is aligned based on the allocalign
      parameter

    One of the first three should be specified, and then the remaining values
    are flags that can apply to the function types in the first part of the
    mask.

  • "alloc-family"="FAMILY": Marks a function as part of an allocator family,
    named by the “primary” allocation function (e.g. "alloc-family"=“malloc”,
    "alloc-family"=“_Znwm”, or "alloc-family"=“__rust_alloc”).

  • allocalign: Marks a function argument as specifying the minimum alignment
    of the returned allocation.

  • allocatedptr: Marks a function argument as the pointer that the allocator
    will manipulate. The behavior is dependent on the next attribute.

These attributes, combined with the existing allocsize(n[, m]) attribute lets
us annotate alloc, realloc, and free type functions in LLVM IR, which relieves
Rust of the need to carry a patch to describe its allocator functions to LLVM’s
optimizer. Some example IR of what this might look like:

; Function Attrs: nounwind ssp
define i8* @test5(i32 %n) #4 {
entry:
  %0 = tail call noalias dereferenceable_or_null(20) i8* @malloc(i32 20) #8
  %1 = load i8*, i8** @s, align 8
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* noundef nonnull align 1 dereferenceable(10) %0, i8* noundef nonnull align 1 dereferenceable(10) %1, i32 10, i1 false) #0
  ret i8* %0
}

attributes #8 = { nounwind allocsize(0) "alloc-family"="malloc" allockind(new,uninit) }

Similarly, the call free(foo) would get the attributes free(allocatedptr foo)/alloc-family=”malloc” allockind(free) and realloc(foo, N) gets
releaseptr(allocatedptr foo, N)/"alloc-family"=”malloc” allocsize(1) allockind(resize).

Alternatives

Hard-code the Rust function names in LLVM

We don’t really want to do this because the function signatures have changed on
the Rust side a couple of times, and since these are emitted by the compiler
there’s no requirements that they be versioned today. Adding them to LLVM as a
permanent thing would necessitate versioning the symbol names. In fact, when I
started this researching this line of work, it was noticed that __rust_realloc’s
signature had changed and due to lack of test coverage nobody had yet noticed.

Don’t Add allockind()

This was the original approach, but without allockind, we were attempting to
detect whether a call was an alloc, aligned_alloc, realloc, or free kind based
on whether e.g. allocatedptr or allocalign appeared anywhere on the call. As
argument attributes – or entire arguments – may be removed by optimization
passes like dead-arg elimination, this was not correct."

Further, we need to express some extra information on return values (eg memory
is uninitialized or zeroed) that we can’t express without additional
information. Today that information is implicitly contained in the hard-coded
list in MemoryBuiltins.cpp but if we start using attributes to communicate
these behaviors we need to be more circumspect to avoid requiring all allocators
to behave exactly like the standard C allocator functions.

Benefits

In addition to the benefits for Rust, the LLVM optimizer could also be improved
to not optimize away defects like

{
  auto *foo = new Thing();
  free(foo);
}

which would then correctly crash instead of silently “working” until something
actually uses the allocation. Similarly, there’s a potential defect when only
one side of an overridden operator::new and operator::delete is visible to the
optimizer and inlineable, which can look indistinguishable from the above after
inlining.

We can also remove most knowledge of C++ allocator functions from LLVM entirely,
and instead rely on the attributes to communicate that information to LLVM.

Status

I have mostly implemented this already: some of the dead-ends mentioned in the alternatives section are a direct result of previous failed attempts. I’ll be mailing out the changes in reasonably-sized batches as they’re ready, unless I hear objections.

[0]

[1]

[2]

1 Like

Given that it is run we should take this as a bug and seriously. I suppose we have to give up on h2s if getAllocAlignment is returning a nullptr, right? Can we teach it about the default alignment of known runtime functions? So spec seems to say malloc returns a pointer aligned properly for any built-in type, which seems to be something we could know (in clang).

Wrt. the proposal, I think having more freedom in defining allocators makes sense. Making things explicit with attributes in the IR is also something I find more intuitive than hard coded tables that interpret the name. (I know we will still have those but we only consult them once.)

Forking heap2stack bug discussion to GH issue 54747.

This looks generally reasonable to me. A small question:

Why do we need this, and don’t detect it based on presence of allocalign instead? You mention later that attributes may be dropped, but I don’t really see how having allockind(aligned) without allocalign would give us any useful information.

Because then we know that it’s an aligned allocator function even if the argument gets deleted. If the allocalign argument gets deleted and we don’t have the extra tag on the function, we can’t tell aligned_alloc from regular old malloc, and we’d risk under-aligning things in H2S-type optimizations.

But why do you consider that it’s easier for the allocalign tag to get removed than allockind? I would suspect the probability is the same: an optimization that doesn’t know about one, won’t know about the other.

Alas, I strongly disagree with the addition of attributes that cannot be safely removed, as this breaks the design of LLVM that tags/attributes/metadata can be deleted at will. And that’s a good thing, that’s a safer design as optimizations cannot be all updated in one go.

An option would be to invert the semantics: if the attribute is not present, then maximum ABI alignment is requested. If present, it changes the default minimum alignment.
The issue is that this solution is still unsafe wrt to aligned_alloc as an arbitrary alignment can be requested.

So that solution doesn’t work either. You need a new @llvm.aligned_alloc builtin. That is, if you need that functionality. If not, just leave it as-is, with that knowledge on LLVM’s side.

It’s a common misconception, but this assumption just isn’t true when it comes to call attributes. While some call attributes can be safely dropped (optimization attributes like nonnull), others are ABI-affecting and cannot be dropped (for example zeroext or byval).

If we treat allocalign in the latter category, dropping it would not be legal, or at least not without additional reasoning.

I’m not familiar enough with LLVM to know if there’s a way to have an attribute on an argument make that argument ineligible for dead-argument elimination. That would mostly solve the problem, but it’s still been significantly cleaner in the code to explicitly mark what type of allocation function something is than to infer it from the attributes on the function and arguments. Note that if the allockind() attribute gets removed from the function, then it just becomes ineligible for most of the clever allocation-related optimizations in LLVM, as it’s no longer marked as an allocator function. So that’s still pretty safe as I understand it.

zeroext and byval have existed for a looong time. So that’s a bit different.

I agree that function attributes are more difficult to be dropped accidentally than e.g. metadata attached to generic instructions as functions are usually not modified. At most they are moved, but even that is not frequent. But they can be duplicated.
So, safety-wise, function attributes that can’t be dropped are not my biggest concern, but I would still avoid them whenever possible.

I think that’s a good argument for keeping both.
If allockind gets removed, then no alloc-related optimization takes place.
If allocalign gets removed but not allockind, the semantics is that alignment is unknown, thus equal to address space size. So no optimization can happen.

That semantics seems safe to me.

What’s the expected use for the alloc family attribute?
You hint at generating some errors like malloc/free family mismatch. Anything else?

If that’s the only use, I don’t think we want a new attribute just for that IMHO.

Yes, the alloc family attribute replaces logic we already have in MemoryBuiltins.cpp for identifying which allocator each function belongs to, so that we can avoid eliding allocations that are actually UB. If we omit that attribute, then we can’t do that optimization.

I’d offer to fold it into allockind but there’s no way for non-string-type attributes to carry string data, and it somewhat defeats the point of the line of work if allocator families have to be registered with LLVM before they can be used.

Oh, see ⚙ D117356 InstructionCombining: avoid eliding mismatched alloc/free pairs for where we introduced the family logic, it protects against some real bugs.

I don’t have time right now to get into detailed discussion on this, but I did want to note that the new proposal seems to address the concerns I had previously raised. My initial impression of the proposal is that it seems entirely reasonable, so you can consider this a weak +1 from me.

p.s. Good catch on the h2s alignment issue. I had never considered that one, ouch.

The LLVM side of this work (I have some clang follow-ups that I need to finish) is now available as ⚙ D123091 MemoryBuiltins: accept non-TLI funcs with attribs as allocator funcs and the associated stack in Phabricator, in case seeing how things look in an implementation helps.

I agree, I like the proposal.
I still don’t see the need for alloc-family, though. Frontends can issue warnings about mismatch between alloc families.

This is something I’ve hoped would land in LLVM eventually!
I’m working on TinyGo, a Go compiler. It currently has a custom heap-to-stack transformation for allocations that don’t escape the current function, which seems to be equivalent to (1). I would very much prefer this to be in LLVM because it allows better optimizations (for example, if it can be enabled in ThinLTO).

Would this proposal also be able to do a heap-to-stack conversion for garbage collected memory? In that case, the memory is allocated and used (read/written) but never explicitly freed. So as long as the memory doesn’t escape, it can be converted to a stack allocation or an SSA value. It appears that (1) requires there to be a deallocator call so I think a new bitfield (nofree for example) would be useful in this case.

I’m not really sure how this could interact with a garbage collector. You’re right that the current h2s functionality requires a deallocator call…

In principle, I am strongly in favour of annotations on allocators, but I retain concerns about inlining. @nlopes was worried about attributes being lost, @nikic suggested that the possibility of losing call attributes is a common misconception, but inlining is the case where call attributes are lost. This is fine for things like byval because the relate to the calling convention and become irrelevant after inlining. It’s not clear that this is the case for allocator attributes.

The root of the problem is that allocators are always nested. My favourite bit of UB in the C spec is that it’s UB to use a pointer after it has been passed to free, which means that (by a strict reading of the standard) it is impossible to implement free in C. I don’t want to end up in that case.

In a trivial object allocator, you have some OS facility (mmap, VirtualAlloc, and so on) as the first-level allocator and then you subdivide the large chunks that this gives you. Consider this trivial malloc implementation:

static void *small_alloc(size_t);
extern "C" void *malloc(size_t size)
{
  if (size > PAGE_SIZE)
  {
    return mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_ANON, -1, 0);
  }
  return small_alloc(size);
}

If you’re doing whole-program optimisation then you would find that you statically know the size at a lot of call sites and so you’d inline either the small_alloc call or the mmap call. You might even inline the fast path of small_alloc.

In snmalloc, in the default configuration, we have a load of different layers of allocator:

  • The platform layer, which returns chunks (some multiple of page size).
  • The global range layer, which manages a global pool of address ranges that have been allocated by the platform layer. These are power-of-two multiples of a page size.
  • The per-thread range layer, which manages a smaller pool of address ranges of power-of-two multiples of a page size for allocating chunks and very large allocations.
  • The per-thread freelists that contain a list of allocations of a specific sizeclass, which are the things that malloc returns for any allocation that isn’t larger than a few pages.

The layers nearer the programmer have fast paths that are amenable to inlining - the fast path for malloc is around 10-15 instructions and so a program that statically links snmalloc would expect to inline that in a lot of places (it should shrink by 2-3 instructions if you statically know the size).

So where should we put our allocator annotations, and what happens with inlining? Will inlining malloc's fast path lose the allocator annotations and generate worse code? Will it generate incorrect code if we annotate each layer and, after inlining, an analysis sees a chunk allocated directly from the per-thread range layer (because the allocation call site statically knows that the size is large) but freed with the generic free layer (because the free site can’t statically prove the size)?

Should the allocator attributes be a flag that an early inlining pass should not inline the call, but that a later one can, as long as it also strips the allocator attributes on all functions in the module?

It’s always fine for allocation functions to not be annotated, and if a function gets inlined and the “inner” allocation function isn’t annotated then the optimizations around allocations will get skipped. The good news here is that prior to this change, the list of allocation functions was a hard-coded list, and you couldn’t get any magic for your custom allocator without patching LLVM. That means that, in the common case, you won’t see things being worse than before with these patches, but you have the opportunity for things to be slightly better. Nothing should produce incorrect code with the annotations lost (either because you omit them, or because of inlining): all you’re losing is the possibility of some clever allocation-avoiding optimizations.

Does that make you feel better? I don’t think it’s appropriate for the allocator attributes to imply no-inline (whether for some passes or all passes) as that’s a little spooky and is working around a behavior that isn’t a regression (if your malloc-shaped function that was matching the hardcoded list got inlined today and it devolved to mmap, you’d still lose the optimization) and otherwise isn’t harmful.

It seems to me that future work could be taught how to handle mmap et al and integrate it with the rest of the allocation-function system, and you could put your own annotations on small_alloc (in time, if enough things were wired up in clang) and make these optimizations still apply even with everything having been inlined. That work is out of scope for me, but easy to envision.

Summary: no incorrect codegen caused by missing attributes, performance of generated code won’t be worse than the status quo ante, might be better in some cases.

Does that make you feel better?