[RFC] A Framework for Allocator Partitioning Hints

Summary

To enable effective heap partitioning in a hardened memory allocator, the allocator requires semantic information about allocations that is typically lost during compilation. We propose a framework for allocator-partition hint instrumentation, which provides partition ID hints derived from language-level or other static properties. The design enables different partitioning schemes, with a type-based one being our primary focus. The feature can be enabled in Clang with -fsanitize=alloc-partition.

The design’s focus is sanitizer-style instrumentation that transparently rewrites allocation calls (e.g., malloc, new) to include a partition_id. The language frontend infers source-level information and attaches !alloc_partition_hint IR metadata to allocation calls; the middle-end IR pass (AllocPartition) consumes this metadata to rewrite calls based on a configurable partitioning policy. This framework enables various heap organization strategies, with the primary motivation being type-aware hardening deployable across large codebases without requiring source modifications.

Background and Motivation

Heap memory allocators could implement stronger memory-safety hardening features if they had richer semantic information about the allocations they manage. One particularly powerful hardening technique is to partition the heap to isolate different kinds of allocations [PartitionAlloc, ChromeSecurity 2022, Erlingsson 2025].

For example, separating pointer-containing objects from pointerless data allocations can help mitigate certain classes of memory corruption exploits [XZone]: an attacker who gains a buffer overflow on a primitive char array cannot use it to directly corrupt a vtable pointer, function pointer, or other critical metadata in an object residing in a different, isolated heap region. Furthermore, heap isolation can also thwart many data-only attacks that control-flow mitigations cannot address.

It is important to note that heap isolation strategies are best-effort: they do not provide an absolute security guarantee, but they are achievable at relatively low performance cost. The effectiveness of heap isolation varies across libraries and binaries, and depends on the properties of the given allocator implementation.

The fundamental blocker to implementing such strategies is that standard allocators are blind to source-level semantics. A call to malloc(), __builtin_operator_new(), or any of the numerous standard untyped memory-allocation functions provides no information about whether the memory is intended for an array of integers or for a critical object containing pointers.

To apply heap partitioning to large, existing C/C++ codebases, a transparent approach is required. The proposed solution is transparent to source code, meaning no modifications to allocation call sites are needed. This model is directly analogous to other sanitizers, which also pair compiler instrumentation with a runtime library—in this case, a compatible, partition-aware memory allocator (this RFC only discusses the compiler support).

Related Features

Several existing or proposed mechanisms provide allocators with type information, but none are suitable for the goal of transparent, binary-wide heap partitioning.

Language Extensions such as C++ P2719 and the proposed C typed_memory_operation (TMO) attribute allow libraries to define and use type-aware allocation functions. However, they require modification of allocation APIs, and explicit inclusion of these APIs across a codebase. This makes language extensions unsuitable for transparent deployment across large unmodified codebases. The sanitizer-style deployment model is a better fit to provide the option for heap partitioning, without the upfront risk (and cost) of a wholesale conversion.

Hardening vs. Performance Instrumentation. The MemProf framework uses profile data to guide allocation placement for performance. AllocPartition is fundamentally different: it aims to provide deterministic, policy-driven partitioning. Hardening techniques require consistent, predictable behavior that works for all code paths, which is orthogonal to the goals of non-deterministic, profile-based partitioning.

Design

One of our observations is that for heap-partition-based hardening, there is no “one size fits all”. The design aims to provide a configurable and extensible framework. While heap hardening is the initial motivation, the design can support other static partitioning schemes. By adopting a sanitizer-style approach (-fsanitize=alloc-partition, no_sanitize attribute, and ignorelists support), large-scale deployment mirrors the experience of other sanitizers.

Hint Generation and Instrumentation

The feature requires frontend cooperation, while the majority of the heavy lifting is done in a middle-end IR pass. This provides greater flexibility and enables coherent multi-language support for heap partitioning—which is becoming especially relevant, as code generated from different LLVM-based languages is linked into the same binary sharing a heap allocator.

Metadata. The language frontend is responsible for attaching !alloc_partition_hint metadata to allocation call instructions. This metadata currently captures source-level type information that the middle-end IR pass cannot trivially recover otherwise. The !alloc_partition_hint metadata is an MDNode with the following format: !{<type-name>, <contains-pointer-bool>}

  • where <type-name> is the fully qualified name of the inferred type;
  • <contains-pointer-bool> is an i1/boolean constant indicating if the type (recursively) contains a pointer.
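As an illustrative sketch (the exact encoding shown here is an assumption, not the final format), an annotated allocation call in IR might look like:

```llvm
%p = call ptr @malloc(i64 16), !alloc_partition_hint !0
; ...
!0 = !{!"Node", i1 true} ; type "Node" (recursively) contains a pointer
```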

Frontend Hint Generation (Clang). The integration with Clang emits !alloc_partition_hint for allocation calls derived from the allocated type. The allocated type is inferred as follows:

  • For C++ new T and new T[N] expressions, the allocated type T is known syntactically.
  • For untyped allocation calls to functions with the malloc or alloc_size attributes, the type is inferred from a sizeof() expression used in an argument. In other words, for calls to functions such as malloc(), __builtin_operator_new(), or any of the other untyped allocation functions, the type is inferred from common idioms like malloc(sizeof(T)) or calloc(N, sizeof(T)).
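For instance, call sites of the following shape would (or would not) receive hints; Node is a hypothetical type, and the comments sketch the expected hint contents:

```cpp
#include <cstddef>
#include <cstdlib>

struct Node { Node *next; int value; };

// Hypothetical call sites: the first two match the sizeof() idioms the
// frontend can annotate; the third carries no type information.
void *examples(std::size_t len, std::size_t n) {
  void *a = std::malloc(sizeof(Node));    // hint: "Node", contains-pointer = true
  void *b = std::calloc(n, sizeof(int));  // hint: "int", contains-pointer = false
  void *c = std::malloc(len);             // no sizeof(): no hint, fallback applies
  std::free(a);
  std::free(b);
  return c; // caller frees
}
```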

The sizeof()-based type inference is similar to the earlier proposed C-extension typed_memory_operation (TMO). We expect that the core algorithm to infer types based on sizeof-expressions can be shared between the TMO language extension and !alloc_partition_hint generation used for -fsanitize=alloc-partition.

Type Inference Limitations and Diagnostics. The sizeof() inference is a best-effort heuristic. It is known to fail for complex patterns, such as type-erasing containers that request an untyped bag of bytes.

  • Diagnostics: To aid developers in identifying where inference fails, the pass provides optimization remarks (Clang: -Rpass=alloc-partition) to point out allocation sites that could not be associated with accurate type-hint information (missing !alloc_partition_hint metadata). This information can be used to avoid code-patterns that prohibit accurate type inference, or improve frontend hint generation.
  • Fallback: In cases where the frontend cannot generate a hint, the AllocPartition pass falls back to a less-precise analysis of the pointer’s immediate IR uses, or simply assigns a default partition ID.

Instrumentation Pass. The AllocPartition middle-end IR pass, which runs late in the middle-end optimization pipeline, consumes the hints to rewrite allocation calls. By default only known libcalls are covered, but coverage can be extended to custom allocation functions with the -alloc-partition-extended option (Clang: -fsanitize-alloc-partition-extended). Indirect calls to allocation functions (incl. standard ones) are not covered.

Partitioning Modes. The AllocPartition pass is designed to be extensible with different policies for computing a static partition_id. Initial modes include:

  • TypeHashPointerSplit (default): Our initial hardening-focused policy. The partition ID space is split, with one half reserved for types that (recursively) contain pointers and the other half for non-pointer types, avoiding partition ID collisions between the pointer and non-pointer containing categories.

  • TypeHash: Partitions based on a hash of the canonical type name (typedefs resolve to their underlying types).

  • Random / Increment: Simpler modes for testing and other use cases.

The Clang default mode is TypeHashPointerSplit. We are not yet making other modes available via a frontend option, as these may be subject to removal or change. Experimentally, users can change the mode via -mllvm -alloc-partition-mode=<mode>.
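As a rough sketch of what a TypeHashPointerSplit-style policy computes (illustrative only: it uses std::hash instead of xxHash, and the real pass operates on IR metadata rather than strings):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Sketch of a TypeHashPointerSplit-style policy: the ID space
// [0, maxPartitions) is split in half, with the upper half reserved for
// types that (recursively) contain pointers, so pointer-containing and
// pointerless types can never collide. Assumes maxPartitions >= 2.
uint64_t partitionId(const std::string &typeName, bool containsPointer,
                     uint64_t maxPartitions) {
  uint64_t half = maxPartitions / 2;
  uint64_t h = std::hash<std::string>{}(typeName) % half;
  return containsPointer ? half + h : h;
}
```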

Runtime Interface and ABI. The interface between the compiler and the partition-aware allocator is designed to be simple and efficient. The instrumentation rewrites allocation function (<func>) calls to __alloc_partition_<func>(<func args>..., uint64_t partition_id), where partition_id is a compile-time computed constant. All memory builtin libcalls (isAllocationFn()) along with custom allocation functions (see “Instrumenting Non-Standard Allocation Functions” below) are supported.

The choice of an opaque uint64_t partition ID deliberately abstracts semantic information, enabling future enhancements to partitioning modes transparently. One goal was to avoid additional runtime cost and the complexity of parsing structured type information.

The instrumentation provides a default ABI, which appends a partition ID function argument, and a “fast ABI”. The latter is more performant but restricts the ID space to a very small size, controlled with -alloc-partition-max=<max-partitions> (Clang: -fsanitize-alloc-partition-max=<max-partitions>).

ABI      Clang Flag                           Rewritten <func>(<size>) Call            Partition ID Argument
Default  (none)                               __alloc_partition_<func>(<size>, <id>)   Passed as final function argument
Fast     -fsanitize-alloc-partition-fast-abi  __alloc_partition_<id>_<func>(<size>)    Encoded in function name
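The runtime-side counterparts for malloc might then look as follows; these are hypothetical sketches that simply forward to the system allocator, whereas a real partition-aware allocator would select an isolated heap region based on the ID:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical runtime-side entry points for malloc under each ABI.
// Default ABI: the partition ID is the final argument.
extern "C" void *__alloc_partition_malloc(size_t size, uint64_t partition_id) {
  (void)partition_id; // a real allocator would pick a heap region here
  return std::malloc(size);
}

// Fast ABI: the partition ID is encoded in the symbol name, with one
// entry point per partition (here: partition 3).
extern "C" void *__alloc_partition_3_malloc(size_t size) {
  return __alloc_partition_malloc(size, 3);
}
```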

Instrumenting Non-Standard Allocation Functions. To support environments using non-standard allocation functions (e.g. in OS kernels), AllocPartition can instrument such functions with the Clang -fsanitize-alloc-partition-extended option. This enables instrumentation of any call marked with the !alloc_partition_hint metadata. When used with Clang, all calls to functions marked with __attribute__((malloc)) or __attribute__((alloc_size(..))) will therefore be instrumented.

Future Enhancements

__builtin_alloc_partition_id(<type>): For Clang, introduce a builtin helper to query the partition ID (based on the current mode) of a given type. To implement it, a new LLVM intrinsic would be introduced, which is substituted with a constant in the AllocPartition pass. Note, however, that this would make the partition ID no longer opaque: code may start depending on particular properties of partition IDs that we do not (yet) want to guarantee longer-term, so as to allow for improvements to partitioning modes.

Implementation

The current implementation can be found at: GitHub - melver/llvm-project at alloc-partition

Frequently Asked Questions

How can I deal with allocation wrapper functions? One strategy would be to mark wrappers with __attribute__((alloc_size(..))) and compile with -fsanitize-alloc-partition-extended; this requires providing a partition-aware allocation wrapper function.
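A minimal sketch of such a wrapper, assuming a hypothetical my_malloc:

```cpp
#include <cstddef>
#include <cstdlib>

// Annotating the wrapper with alloc_size makes it eligible for
// instrumentation under -fsanitize-alloc-partition-extended.
__attribute__((alloc_size(1))) void *my_malloc(size_t size) {
  return std::malloc(size);
}

// Calls such as my_malloc(sizeof(T)) could then be rewritten to a
// partition-aware variant, e.g. __alloc_partition_my_malloc(sizeof(T), <id>),
// which the user must provide alongside the wrapper.
```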

For the TypeHash* partitioning modes, is the partition ID stable? Yes, it relies on xxHash as the hash function.

Do the TypeHash* modes distinguish between e.g. uint8_t and unsigned char? No, the innermost underlying type of a typedef is looked up. A notable exception is uintptr_t (see below).

Does the TypeHashPointerSplit consider uintptr_t a pointer? Yes, it does. Syntactically uintptr_t is not a pointer; semantically, however, it is very likely used as one.

Are indirect function calls to allocation functions covered? No.


Cc: @vitalybuka @pcc @kees @tsan @Glider @waffles_the_dog @rcvalle


Cool! I’ve forwarded this to Chrome’s memory safety team who have expressed interest in exactly this kind of thing before, and have been experimenting with P2719.

I think this is a very interesting proposal. This style of hint seems very useful, and I can imagine it could be a nice building block for other security or performance work. It’s certainly something I would have appreciated when working on application sandboxing.

You haven’t spoken much about changes on the allocator/runtimes side outside of the ABI. Are there new requirements for the runtime allocators in compiler-rt? Is that out of scope? Or do you plan to upstream support into those allocators?

Another question I have is about frontend support: how specific are the requirements here tied to C and C++? They seem fairly agnostic to me (maybe aside from a few details), so I’m wondering if the frontend part (hint generation from what I gather) could be implemented in llvm/lib/Frontend, so other frontends could leverage this kind of analysis and allocation partitioning. That may require a bit of refactoring, but I imagine other languages (particularly Rust) would be interested in supporting this style of allocation. We’re increasingly seeing mixed C/C++ and Rust projects, so the ability to use a common allocation strategy has obvious benefits, particularly if you’re building with LTO.

Could you clarify what you mean here? Both of these (the standards-tracked P2719 and the TMO extension) are intentionally source transparent and have already been deployed at scale. The allocation APIs are not changed beyond the addition of an annotation that specifies the typed entry point.

The original version of TMO I wrote was based around the idea of having specific allocation function types, but that runs into problems due to the myriad of allocation functions that are not just the core set of malloc, calloc, aligned_alloc, etc. APIs. The reason TMO has the design it does is so that clang does not need to have any specific knowledge of what functions are or are not allocation functions, nor have any knowledge of what the expected APIs for every allocator are. That permits adoption by both wrappers and custom allocators without modifications to clang.

Similarly the p2719-based interface would work for these purposes - once a function declaration is required it is immediately possible to rely on the same mechanism for the introduction of that declaration to include declarations of the appropriate p2719 functions, e.g.:

template <class T> void *operator new(std::type_identity<T>, size_t sz, std::align_val_t align) { return __alloc_partition_xxx(sz, align, partition_id_v<T>); }
...

and so on. For the default global operators (e.g. the system allocator operators implicitly available without including <new>) this has to be directly supported by the compiler, but for your use case, where including function declarations is an option, that specific issue isn’t a problem.

A more fair issue with TMO is that there is not any obvious way to allow arbitrary overrides of the type encoding mechanism, and that’s something we’d really like to resolve. But semantically the type descriptor TMO produces has the same purpose as the partition id in this proposal, and the problem of allowing developer customization of the encoding for a type remains in this proposal. It seems like a problem that just cannot be resolved within the constraints of pure C.

Beyond that the limitations on indirect calls in this proposal also apply to TMO - it’s simply not possible to change the ABI of function pointers and that’s something we’re all stuck with :frowning:

All this aside this does remind me that I need to prepare the PRs to upstream TMO, which would allow us to work together to try and come up with a solution for customizing the type encoding.

How would this work with any abstracted allocation APIs? Let’s take an example from the LLVM codebase:

inline void *safe_malloc(size_t Sz) {
  void *Result = std::malloc(Sz);
  if (Result == nullptr) {
    if (Sz == 0)
      return safe_malloc(1);
    report_bad_alloc_error("Allocation failed");
  }
  return Result;
}
...
llvm::safe_malloc(sizeof(MyType));

would it annotate the call to safe_malloc with metadata (how would it know this is needed, and how would an LLVM pass know how to transform safe_malloc later…), while within the safe_malloc function there is only a size but no type available for the malloc call…

I just realized you hint at this in the FAQ. But that seems to immediately contradict your goal of “no source changes needed” when wrapper functions need to be adapted…

Initially compiler-rt is out of scope. Our initial supported allocator and prototype is using TCMalloc, for which the support would be upstreamed to the open source version of TCMalloc (should this RFC land). Having support in e.g. Scudo would also be nice to have, but we’ve not yet committed to that.

Exactly – the instrumentation part is certainly not tied to C or C++. Multi-language support would become easier with the shared middle-end IR pass, and should this find adoption, I would like to see e.g. Rust support, too.

As for the shared parts for hint generation: The most complex part for Clang at least is inferring the type from syntactically untyped allocations, so most of that is C and C++ specific. There is very little code that could be shared with other frontends (unless I missed something), but we could make some of this easier by e.g. adding a proper MDBuilder helper to construct the metadata.

The language extensions are, in many ways, the optimal way of doing things if it’s feasible to change all the allocator APIs, system libraries, etc. in use (ignoring the ABI performance problem for a second). Without having that luxury, we actually started trying to see how this could work regardless, but changing the APIs (even if the changes required might be small) and including them across a multi-million line codebase is brittle, and not a great way to get maximum coverage as quickly as possible.

Note, we’ve left the door open to custom allocation functions (incl. wrappers), but there are very few such cases where it’s feasible to make source changes at all. This is mentioned in the RFC to deal with some edge cases, but this is not the common case. The common case is code that cannot be touched.

In the C++ case, there is no guarantee everything sees the declarations. Valid C++ code never needs to actually include a system header to do an allocation, and can just do new T. How to deal with that?

In the C case, every standard C library version in use needs to be changed with the TMO annotations to have global effect (however, note that every recent standard C library already properly marks allocation functions with attributes malloc and/or alloc_size though). This is simply infeasible in cases where we do not have the luxury of being able to provide our own with replacement headers – but we can just link our own allocator.

The -include hack works for a subset of code, but is very brittle, as suddenly we’re implicitly including code (and decls for types etc. as well) where that might be wrong. And other fun cases e.g. where code does -Dmalloc=my_malloc or #define malloc, and suddenly our declaration from the -include overrides something it shouldn’t have.

Our conclusion to these problems was “we can’t do declarations in the common case” and “we need the middle end to do the transformations”. There are other problems, such as selective exclusions, for which an ignorelist is the right answer.

Another issue is performance: partition selection must be doable statically, and for some very high-performance code we can’t afford another function argument, which is why we need -fsanitize-alloc-partition-fast-abi for those cases.

I agree this is a tricky problem, and -fsanitize-alloc-partition-extended does aim to solve the same thing here. However, we prefer to have broad coverage of the standard allocation functions first (~99% of the critical coverage needed), and then start to get more coverage for custom allocation functions where required.

Being able to enable this without API changes allows to quickly get that 99% of coverage, even for 3rd party code that can’t be touched at all.

I agree - but being able to include those function declarations is not the common case as explained above. Perhaps I didn’t make that clear enough, apologies for that.

Please do!

We’ve left the door open to be able to cover custom allocation functions, but this is not the common case we want to cover. The common case is “no source changes”, and coverage of the standard allocation functions is sufficient.

Supporting custom allocation functions consistently with the standard allocation functions is still desirable, and is therefore supported, albeit not for free.

I’m curious about the coverage of the proposed partitioning, TypeHashPointerSplit. What does the breakdown of objects by bytes look like in the workloads you care about (or perhaps in SPEC)? Recursively aggregating pointer types might include almost all objects, depending on the workload.

Also is coverage by bytes a meaningful metric when thinking about security? Is frequency of access a better metric?

For one particular server workload we observed a split of 25% / 75% in bytes for pointer-containing and non-pointer-containing allocations respectively. This number might still be skewed towards non-pointer-containing types because of coverage gaps we’re trying to address.

Frequency is likely a better proxy. We also know that certain code is more prone to vulnerabilities, regardless of frequency (such as certain 3rd party dependencies).

An internal security analysis concluded that a significant number of exploitable memory-safety bugs would have become unexploitable by isolating at least pointer-containing from non-pointer-containing types, i.e. with only 2 partitions. Going beyond 2 partitions raises the bar further (however, more partitions trade off performance, though this heavily depends on the allocator implementation). I can also share that we found that the majority of buffer overflows are from primitive-type allocations.

We know that there’s likely no one-size-fits-all, but the default policy is meant to raise the bar for the common case. We also made sure it’s very easy to implement different policies.

Sorry for the delayed response; I have had the plague this week, which meant even reading my screen has been nauseating, so I was unable to do anything useful.

I’m trying to understand the use case here.

Is it to support general - including - system allocators, or is it specifically for use in single binary cases similar to sanitizers?

I’m assuming with “general + system allocators” you mean the whole system and every Clang-built binary on that system. With “single binary” I assume you mean selected binaries where the heap-allocation partitioning feature is explicitly enabled.

I’d say it’s closer to the latter. However, for such binaries we want close to 100% coverage of heap allocations done by that binary, incl. e.g. C++ STL allocations and all dependencies, and ship our own allocator along with that. Binaries and their dependencies are built from source but the build flags can vary between projects.

Because performance is one of the biggest concerns, not all binaries may opt in (feature disabled), or only partially opt in e.g. high-risk codepaths (better than not at all). For the latter case, we need SCL support, which -fsanitize=alloc-partition provides consistently like other sanitizers. Furthermore, at a later stage we may need to integrate PGHO info into the instrumentation pass to selectively instrument based on profiling data.

We concluded with those baseline requirements:

  • Transparent: should not require modification of code
  • Scalable: works across entire codebase
  • Performant, with options to further reduce cost:
    • Configurable guarantees (max partitions, fast ABI)
    • Selective by opt-in/out
    • Selective by profiling information (planned)

Get well soon!

Right, but how are those being built such that the compiler knows to substitute functions, and what they should be substituted for?

From the TMO PoV a problem we have is when developers simply declare the allocator functions themselves rather than using the correct headers, so clang is unaware of the typed interface and they end up with an untyped call (the system allocator interfaces then just use a return address hash as a pseudo type, but they don’t get the same semantic info). Resolving that problem overlaps this.

From the implementation it looks like it just performs a string concatenation on functions using alloc_size or malloc attributes, but I’m not sure how you prevent over application?

What was causing the perf problems? Is it code/data size? Can you compress partition information? We certainly did not have any issues with TMO applying to every allocation call (malloc, calloc, aligned_alloc, posix_memalign, operator new, operator new[], operator delete, operator delete[], malloc_zone_*, …) - it’s possible there’s some low hanging fruit to reduce code impact.

These are things that TMO does achieve, but I have thoughts (below)

To me these seem like properties of the runtime/allocator rather than codegen, but that may be a property of this being an ABI for the system allocators (so encoding that into generated code is not an option).

We have some silly (source level) hackery for the places where we need to prevent TMO substitution (e.g. making sure various allocator implementations don’t try to substitute themselves :smiley: )

It’s interesting that you perform the replacement at the IR level - I was very on the fence about the appropriate layer for the substitution - I think the IR approach is interesting from the PoV of supporting non-C[++] languages as they don’t each have to implement their own codegen. On balance I probably went for the clang side codegen approach because that’s what I’m more familiar with.

The different policy choices you make in your inference path also reinforce the need for these mechanisms (partition alloc and TMO) to have a configurable mechanism to construct the type token (I’m going to try to stick to type token from now on as the stand in for partition id, type descriptor, whatever other new and exciting things come forward).

  • “How built” → by adding the compiler flags (-fsanitize=alloc-partition ...), which tells Clang to attach !alloc_partition_hint to allocation functions (standard attributes + C++ operators), and inserts the AllocPartition middle-end IR pass.
  • “What substituted” → part of the ABI, e.g. malloc will become __alloc_partition_malloc(.., partition_id) (or __alloc_partition_<id>_malloc(..) for the fast ABI), _Znwm → __alloc_partition_Znwm, and so on. (The prefix is configurable via an LLVM cl::opt if necessary, but by default we encourage using the standard ABI.)

To prevent over application – from the top post:

LLVM knows about libfuncs (llvm/include/llvm/Analysis/TargetLibraryInfo.def), and we can make use of that info in the middle-end (unlike Clang frontend). Most are also automatically covered by isAllocationFn(), with some exceptions dealt with explicitly.

So Clang will attach !alloc_partition_hint to all calls to what Clang thinks are allocation functions, which include the standard C++ operators, and also all calls with attributes malloc or alloc_size. The middle-end pass then decides which to actually cover (default being libfuncs only). With -fsanitize-alloc-partition-extended we can cover all, incl. custom allocation functions that have either malloc or alloc_size attributes, as that tells the middle-end pass to ignore the libfunc restriction.

If they are standard names, LLVM recognizes them as libfuncs, and the optimization remarks emitted by the middle-end IR pass (via -Rpass=alloc-partition) would complain.

The “extended” mode, as mentioned, overlaps a bit with TMO. It requires a bit of work to provide partition-aware variants of wrappers, but for certain critical libraries (think OpenSSL and the likes), we’re going to want to cover them, and just have to do the work to provide compatible partition-enabled variants (e.g. __alloc_partition_OPENSSL_malloc for OPENSSL_malloc). Some of these already attach one of the 2 standard attributes; some don’t, and we will have to go and deal with them. However, the common case usage won’t go there, but some security sensitive binaries still want to do so. OS kernels would fall into this bucket, as they likely do not use standard allocation function names.

  1. Register spills due to the additional argument to a partition-aware allocation function. Some workloads are so sensitive that we introduced -fsanitize-alloc-partition-fast-abi, which avoids this.
  2. The fast ABI necessitates a static “max partitions”.
  3. Without the fast ABI, letting the allocator limit partitions would require at least another instruction in the allocator fast path. That itself can be too expensive (“[..] every instruction here is a huge pile of $$$ for us.”). Therefore, even without the fast ABI, we need static max partitions.

This is esp. critical as some binaries are going to become mixed language projects, e.g. C++ + Rust. While it’s conceivable to just always make e.g. Rust allocate in its own partition away from C++, this only works as long as the two sides don’t share data, which is unlikely. Therefore, we need to ensure that we have a relatively straightforward path to cover all allocations, regardless of origin language, to retain the same level of guarantees.

I’m just expecting that there’s no one-size-fits-all. While I agree that “type” is likely the most useful information to construct that token, there are other options for partitioning (or grouping) allocations that are type-independent:

  • Per module
  • Per function
  • Per source location / allocation site (similar to AUTOSLAB)

Restricting it to “type” seemed too restrictive. The AllocPartition pass allows implementing different policies; the choice to implement the policies in the compiler and generate the “token/ID” statically is for the performance reasons mentioned above (the fast ABI especially requires this).

Overall, I believe that both TMO (and P2719) and AllocPartition can be used to solve similar problems, but the context, requirements, and trade-offs in which the problems are solved are not entirely overlapping. Therefore, I think both these solutions can co-exist.

Two sections to this reply: the first is responses and/or questions to your last comment; the second is things we might want to do to get both of these in. They do have slightly different constraints, so having both might be reasonable, but there is also a lot of overlap in the kinds of things we want to do. Where possible we should share those, and for concepts that both systems need but with different semantics, we should try to make the selection and generation more unified.

Responses and questions

Ok, so it sounds like you are ok injecting additional compiler flags into all the build targets? Clarifying this as it adds additional options for this and TMO.

You talk about considering the alloc_size attribute to know where to perform inference, but I think that requires the header definition rather than the builtin allocation knowledge (though it does seem that it would be reasonable to add such information in TargetLibraryInfo.def).

What exactly are you passing (as in ABI type)? We didn’t encounter issues with this despite replacing more or less every call.

This is similarly surprising to me but maybe it’s related to how/what you’re passing for the info? It could also be due to how partition alloc behaves? Out of curiosity how do you determine the number of partitions and the partition selection over multiple TUs?

So are you using this for production builds or testing/sanitizer-esque builds? It’s been a long time since I wrote any code in TCMalloc, and the TCMalloc I was working with was one with a lot of my own security and safety hackery included so I don’t know exactly what modern TCMalloc does. But back then I added many additional branches to the interface without any measurable perf impact, and that is despite TCMalloc making trade-off choices that aren’t appropriate for general system allocation.

When you benchmark, are you measuring real runtime or just allocation/deallocation loops?

With our TMO deployment we're talking about a system allocator, which has to make different trade-offs than a time-sensitive allocator like TCMalloc, but we do much more than a trivial indexed indirection and did not find any measurable perf regressions. That is why I'm concerned about this bit specifically.

Things that we should consider unifying

Both this proposal and TMO try to infer the type of an allocation, and it's kind of silly to have multiple implementations of that inference. I find your inference code somewhat easier to read, as it's just a direct implementation of a recursive tree walk over the expression, whereas ours uses the ASTVisitor model. At the same time, I think the simple recursive approach is much less robust: for example, it looks like malloc((sizeof(Foo))) might fail inference, as might essentially any other case involving unhandled expressions. I think we've avoided those hazards by using the RecursiveVisitors.

From there we get to “what is the result of the inference”:

  • PartitionAlloc seems to be happy with a single QualType?
  • TMO currently does its best to identify cases like malloc(sizeof(Header) + sizeof(Element) * N), sizeof(A) + sizeof(B), and similar.

A shared implementation here does seem entirely reasonable: in cases where multiple types are found, the codegen for the partition-alloc backend would simply choose which to prioritize. For us, TMO attempts to unify what we consider structurally equivalent types by a rather obnoxious process of linearization and hashing, but it's possible that we could simply adopt a model of "what types did we find" for anything more complicated than the most common cases of sizeof(A) and sizeof(Header) + N * sizeof(Element).

Conversion from inferred type to implementation type token can’t be shared - the various constraints are simply too different. For TMO my hope was that I would be able to come up with some way to allow developers to specify their own system inline, but for C the only real option for that kind of thing is macros, and you can’t randomly “evaluate” a macro during sema and/or codegen. For C++ we could use constexpr functions, but in C++ new/delete etc already know the type anyway.

Having looked at what you're doing, I am tempted to see what it would take to move our substitution from clang to IR, as it might allow us to improve how we respond to inference failures: we currently create a hash of the allocation location at the source level, but for allocation wrapper functions that means an inference failure gives all allocations the same hash. If present in the IR, it becomes plausible that we might construct that location hash post-inlining. That may permit a greater degree of isolation between unrelated allocations, which could be helpful.

But in addition to that, I really don’t want multiple implementations of what is fundamentally the same operation.

The last thing that I wanted to consider is the implicit rewriting behavior. You are currently (functionally) hardcoding the remapping selection, but that doesn't really need to be the case: an alternate design would be a table of mappings from allocation function to replacement allocation function. To share the implementation we'd need some additional metadata in this table to deal with ABI differences (for example, your ABI always puts the type at the end of the argument list, whereas ours puts it after the parameter we infer the type from). Given this arbitrary shared list of mappings, we could then add a mechanism that allows a developer to provide a list of additional/custom mappings. For your approach that would provide a mechanism for custom allocation functions; for TMO it would help us resolve/mitigate the local declaration problem.


Correct.

The attributes are only relevant for Clang. Yes, for all C-style allocations the attributes are needed, but as I mentioned before, all recent standard libraries do attach them, so that's not a problem per se.

Clang blindly attaches !alloc_partition_hint for calls to functions with the attribute. TLI only becomes relevant in the IR pass (I don’t think Clang has accurate TLI yet, because it also requires inferattrs IR pass to run), which restricts/filters down to libcalls (not with -fsanitize-alloc-partition-extended, but that’s a deliberate opt-in to avoid overapplication).

The “normal ABI” just passes an opaque uint64_t partition id.

See the ABI table in the top post: if there's no "max partitions" for the "fast ABI", there could be up to UINT64_MAX different partition allocation functions per <alloc_function> (e.g. malloc), where the partition-aware variant becomes __alloc_partition_<partition_id>_<alloc_function>. Therefore, the "fast ABI" only becomes feasible with a small max partitions. (Sure, we could auto-generate all those functions with a script or such, but that's rather wasteful. :wink: )

With the same compiler arguments, the partition-selection algorithm is stable. Like many other codegen features and optimizations, not passing the same flags for different TUs may result in undefined (unexpected) behaviour, incl. ODR violations (for inline functions in headers compiled in multiple TUs with different compiler flags).

Production builds. This feature was designed with production in mind, similar to other “production sanitizers / instrumentation” (e.g. -fsanitize=cfi).

Large end-to-end benchmarks of critical Google server workloads; the measurements include CPU time, cycles, QPS, etc. For some workloads, an additional instruction in TCMalloc's fast path can cost >1% QPS, and that's not acceptable. This paper summarizes the current TCMalloc architecture.

I agree. I'm rather indifferent as to which implementation we end up with for the type-inference, as long as the tests pass. And your more elaborate analysis will likely give us the QualType we need. If there are multiple types inferred (as you say your analysis does), I suspect we could make use of that as well.

In the simplest case, we’re currently relying on function aliases (at the linker level) to map __alloc_partition_<foo> to some remapped implementation function. LLD also gives us tools like --defsym=, so for the simple cases it makes little sense to reinvent the wheel.

From what you write, I suppose a more elaborate mapping mechanism could solve:

  • What to do with the local declaration problem (alternative: for standard allocation functions, TLI libfuncs would allow you to generate pass warnings, which we already do with -Rpass=alloc-partition).
  • Custom allocation functions where we can’t touch the source to add malloc attribute.
  • ABI differences (argument position, function mapping).

This mapping likely still needs to exist in the frontend, because that’s where we need to infer types. But the frontend then encodes all this as part of the IR metadata node, e.g. !alloc_partition_hint <target-fn -or- null> <arg-pos -or- null> <type-str -or- type-token> <contains-ptr> (rough sketch), and the IR pass interprets this MD node appropriately (where null, reverts to a sane default). This would also provide greater flexibility for non-Clang frontends in general.

How to proceed

It would be good if we can start upstreaming parts of this soon.

The exact plan depends on how much infrastructure we want to share (just the type-inference, or also the middle-end pass?). We should probably flesh this out in more detail among us. If we only share the type-inference at the beginning, I think this can happen incrementally as we upstream, and it should e.g. be trivial to replace my type-inference code with your code. It will become more complex if you decide to move parts of your feature to an IR pass, and we’d have to flesh this out in more detail, although perhaps this can also happen incrementally (e.g. by later extending the IR pass with the enhanced MD format).

What I mean is that I thought I read somewhere there was a mention of a fixed number of partitions or slots, but if different TUs are seeing different sets of types I don’t know how you determine the number of partitions.

Similarly for the directly indexed partitions, without knowing ahead of time how many and which slots are used I’m not sure how that works?

Thanks, it’s useful to know how you’re testing perf (I’ve seen many “look how much faster my allocator is” where the benchmark is something like for (....) free(malloc(...));, though obviously I expected something more sensible from you folk :D)

It was just surprising to me given our experience, I guess that it’s a result of a more heavily optimized/directly linked interface without the “I am an OS allocator” tradeoffs increasing the base allocator costs. The choice to directly link to explicitly indexed allocation functions seems like a detail of the implementation backend rather than something that would be intrinsic to the design of the entire feature.

Anyway, my questions about perf here weren't really an opposition to the design, more just surprise at the cost you were seeing vs what we see, but as above that may just be OS-allocator vs usage-specific-allocator constant costs.

Sorry, the ABI difference is how/where the type token is passed: the TMO ABI places it after the size parameter, while your model places it at the end of the parameter list. If we move the function replacement to IR (which I'm liking more and more vs TMO's approach of doing it in clang), we just need to ensure that the metadata generated during codegen includes information about where the token should be placed, which seems like a fairly easy thing to manage.

We’ll also want to come up with some way to communicate the replacement rules from clang to LLVM, as the TMO approach means that we only find that information out in clang.

I think having a general interface to register replacement functions is the right approach? Hmmm, if the registration included the index at which the type token should be inserted, that would resolve the "communicate the ABI" issue above as well.

Agreed.

I need to think about it a bit more, but I like the idea of pushing the TMO’s replacement logic into IR as you have done, but for TMO that would require quite a bit of rewriting. We’ve also got our type encoding in clang, where we need it because we provide a builtin that can be used in clang to generate the encoding.

One thing we could do is upstream our inference logic. I was thinking we could just include a builtin that would allow a developer to do __builtin_infer_allocation_type(some expression), which would give us an intermediate function to include tests, and in principle allow manual adoption (say, in allocation wrapper macros). The problem there is that we try to infer complex types, and I have no idea how that could possibly be represented in source. At the same time, our type linearization logic is not exactly trivial, so I'm loath to block upstreaming of the allocation type inference on it (@AaronBallman, any thoughts on how adding the inference might be approached?).

Oh durrr, I just read through your code for EmitAllocPartitionHint and saw that you're simply placing the type name in the hint (for some reason I thought you'd come up with a way of sending the QualType to llvm and then operating on it there, presumably via some callback, I guess).

We could hypothetically have a pair of builtins: __builtin_infer_type_names(expr) and __builtin_infer_type_token(expr). The former producing a const char*, and the latter producing a mode-dependent encoding of the type (the split type hash or the TMO semantic token, which is {semantic flags, linearized type hash} rather than a name hash).

Starting with __builtin_infer_type_names would give a testable base point for the inference, and in principle could be useful on its own for folk using macro allocation wrappers.

The !alloc_partition_hint metadata carries the information needed to deterministically calculate a partition ID; currently that's the type name (as a string) and whether it contains a pointer. We then just hash the type name, and as long as the hash function is stable, the hash modulo max partitions will always yield the same partition ID.

Of course if a user does:

// foo.c
struct Something { int a; };
// bar.c
struct Something { std::string a; };

… in the same namespace, we consider it UB.

There are two aspects. One is the user wanting to register special replacement functions where some source code can't be changed. That could be a separate file; the format could just be function:replacement:arg, but I'm not a big fan of custom file formats. I don't think this is at all needed for the baseline initial version, and I'm mindful of YAGNI, so I'd postpone it until after the initial version is landed.

Then there's communicating replacements and the arg position from Clang to an LLVM IR pass, such as what you will need for TMO if you choose the IR-pass-based design. This needs to be part of the generated LLVM IR, and the canonical way to do this would be to design an MD node that the IR pass interprets to pick the replacement function and where the arg is placed. Currently AllocPartition knows about !alloc_partition_hint <type-as-str> <contains-bool>, but that can be extended with additional optional parameters (... <token> <token-pos>).

I prefer simplicity, and currently don’t see how I’d want to use such a builtin. I think existing facilities like typeof()/typeid()/decltype() likely cover uses where code can afford to use such builtins. The real value that the proposed features provide is transparency and some degree of type-introspection to construct a token to make allocation decisions. __builtin_infer_type_token could be useful, if anything, to get the inference logic sorted out.

But I think we really need to chart an incremental path before designing new features, especially to keep things manageable and simple.

Planning: Given you're currently investigating how to move TMO to an IR pass, I could imagine something like this working out and providing a coherent overall feature (but I'm not sure that's what you were imagining):

  1. I can rename AllocPartition to AllocToken (also shorter name, which I like). AllocToken will give us the shared LLVM IR infrastructure we need.
  2. I introduce Clang -fsanitize=alloc-token which behaves more or less like the current prototype, and constructs a sensible “token ID” as a default token (currently: TypeHashPointerSplit).
  3. With that baseline infrastructure, TMO can be built on top - you could figure out the extra MD info passed on via !alloc_token_hint (probably “semantic token + target fn + argpos”).
  4. We can teach AllocToken about the more advanced !alloc_token_hint.
  5. You introduce attribute typed_memory_operation (shorter: alloc_typed(F, N)?), and replace the less advanced type-inference logic with your more advanced one. Such functions will be passed the “semantic token”.
  6. At this point AllocToken will take care of both -fsanitize=alloc-token and the attribute. By design they will not conflict (the pass knows about both): calls with the attribute will use TMO semantics, and the rest are only covered if -fsanitize=alloc-token is enabled.
  7. [Future] Other LLVM-based languages can insert the AllocToken pass into their pipeline.

Naming: as always, a hard thing. I think "token" should capture most use cases (incl. per-allocation-site tokens, etc.). As for typed_memory_operation: "memory operation" can also be understood as a "load", "store", etc., and is generally ambiguous.
I'd propose something that mirrors alloc_size, e.g. alloc_typed.

Questions for TMO via IR-pass (should you decide to do so):

  1. Does it have to work without additional compiler flags? If yes, we just need to unconditionally insert the IR pass. Performance-wise I’d prefer an explicit flag for more complex features, but we can probably make the pass return fast if we know the function has no TMO-attributed calls.

This plan could be executed incrementally, and I also don't think we'd need the __builtins, if you are comfortable with ripping out my type-inference code. If you upstream first, then I can rebase (but my reading was that you need more time, and I don't see a point in rushing if the incremental plan works).

Thoughts?