[RFC] Allocators in the libc for the embedded use case

This proposal is likely not complete enough. The goal is to gather feedback and
set up a concrete plan/strategy for embedded allocators in the libc.

The libc project uses the Scudo hardened allocator
as the allocator for server applications. This is a proposal to add a
lightweight allocator implementation in the libc project which will serve
embedded use cases. The allocator requirements of embedded platforms have a
lot of variance - some use cases require a small, lightweight allocator which
can be used in single-threaded applications, while a few others require an
allocator which can also work in multi-threaded contexts. Likewise, some
baremetal applications only ever allocate and never deallocate, while a few
others require a deallocation operation which also performs defragmentation.
This proposal has two main goals:

  1. Identify the variants of embedded allocators that should be supported by the
    libc project.
  2. Present the high-level interface of these allocators and demonstrate how
    that interface can be used to implement functions like malloc and free
    for the various embedded platforms.

A non-goal is to iron out implementation details - they will be discussed
during code review.

The axes of variations

As mentioned above, there are different variations of allocators that serve
embedded use cases. In this section, we identify the main axes of these
variations, highlighting the variants the libc will support along each axis.

Thread safe vs single thread only allocator

An allocator which only needs to work in a single thread context can avoid the
overhead of locking/unlocking operations. This locking/unlocking overhead is not
only a runtime performance overhead, but also a code size overhead for single
threaded baremetal applications. To serve both the multi-threaded embedded use
case as well as the slim no-thread use case, the embedded allocators in the libc
project will provide a thread-safe variant as well as a single-thread-only
variant.

Platform memory allocation

At a high level, an allocator is an agent which manages memory it obtains from
the platform. For example, on Linux, an allocator can be implemented to manage
memory it obtains using the mmap operation. On some other platforms, allocators
use sbrk to expand the single block of memory they manage. The embedded
allocators in the libc will support both of these methods of obtaining memory
from the underlying platform, namely the mmap style and the sbrk style.
Additionally, the allocators will also support the case where it is not
possible to obtain additional memory from the platform - which is to say that
they will manage a fixed blob of memory.
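
To make the fixed-blob case concrete, here is a minimal sketch of how an application might seed such an allocator with a static buffer at startup. The FixedHeap name and layout are assumptions for illustration, not part of the proposed interface.

```cpp
#include <cstddef>

// Illustrative only: in the fixed-blob configuration, the application hands
// the allocator a static buffer and the allocator never asks the platform
// for more memory.
struct FixedHeap {
  std::byte *base;      // start of the managed region
  std::size_t capacity; // total bytes available
  std::size_t used = 0; // bytes handed out so far
};

// A build-time reserved region; its size is fixed for the program's lifetime.
alignas(std::max_align_t) static std::byte heap_storage[1 << 12];
static FixedHeap heap{heap_storage, sizeof(heap_storage)};
```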

No deallocations

The libc project will support use cases where deallocation and defragmentation
are considered code size overheads. It will be up to the application to craft
its allocations carefully and never invoke realloc. For such applications,
the free function will be implemented as a no-op (which should ideally never
be called).
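
For illustration, a no-deallocation allocator can be as small as a bump pointer over a static arena. The names below are hypothetical, not the proposed entrypoints.

```cpp
#include <cstddef>

namespace {
// A fixed arena; in the no-deallocation configuration it is never reclaimed.
alignas(std::max_align_t) std::byte arena[1024];
std::size_t offset = 0;
} // namespace

// Bump-pointer allocation: round up to max alignment and advance the offset.
void *bump_alloc(std::size_t size) {
  constexpr std::size_t align = alignof(std::max_align_t);
  std::size_t aligned = (offset + align - 1) & ~(align - 1);
  if (aligned + size > sizeof(arena))
    return nullptr; // exhausted; nothing can be reclaimed
  offset = aligned + size;
  return arena + aligned;
}

// The free counterpart compiles to a no-op, matching the specialization
// strategy described in the interface section.
void bump_free(void *) {}
```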

Allocator Interface

The allocator will be an instance of the following class template:

template <ExpansionStyle EXPANSION_STYLE, bool DEALLOC>
struct AllocState;

template <typename MutexType, ExpansionStyle EXPANSION_STYLE, bool DEALLOC>
class Allocator {
  MutexType mutex;
  AllocState<EXPANSION_STYLE, DEALLOC> state;

public:
  constexpr Allocator(AllocState<EXPANSION_STYLE, DEALLOC> init_state)
      : state(init_state) {}

  void *alloc(size_t);
  void *realloc(void *, size_t);
  void *aligned_alloc(size_t, size_t);
  void free(void *);
};

The allocator instance can be a global object, or on platforms with
thread_local support, it can be a thread_local object. The allocator methods
will be specialized methods suitable for the corresponding variant. For example,
for the no-deallocation allocator, the free method can be implemented as a specialization:

template <typename MutexType, ExpansionStyle EXPANSION_STYLE>
void Allocator<MutexType, EXPANSION_STYLE, false>::free(void *) {}

Allocator for single-threaded applications

Allocators which are to serve only a single-threaded application should be
instantiated with a passthrough mutex class whose lock and unlock methods do
nothing.
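
A minimal sketch of such a passthrough mutex; the NoOpMutex name and the helper below are illustrative, not part of the proposal.

```cpp
// Single-threaded builds can instantiate the allocator with this class; the
// empty lock/unlock bodies are trivially inlined and optimized away.
struct NoOpMutex {
  void lock() {}
  void unlock() {}
};

// Hypothetical helper showing that code written against a real mutex
// interface works unchanged with the passthrough variant.
template <typename MutexType>
int guarded_increment(MutexType &m, int &counter) {
  m.lock();
  ++counter;
  m.unlock();
  return counter;
}
```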

Thanks for writing this up. Some early thoughts.

In our embedded C library we have a couple of choices of heap, with the default being small and simple, and a more complex one that offers more constant performance over a larger number of blocks (Documentation – Arm Developer). For some applications, a bounded guaranteed worst case is more important than average performance.

Although not necessarily part of the design here, one of the important parts for a general embedded toolchain shipping pre-compiled libraries is how an end user can override the binary implementation. For example, if they implement malloc, free, realloc, and calloc in their own objects, can they replace the library's implementation at static link time? We have some ability to do this in our C library (Documentation – Arm Developer).

Some of the most extreme requirements on allocators come from the automotive industry. In particular, Autosar requires determinism and no fragmentation. A really good talk on this from a C++ perspective was given at CppCon: Practical Memory Pool Based Allocators For Modern C++ - Misha Shalem - CppCon 2020 - YouTube

I like the flexibility this offers. For the GPU use case (@jhuber), we already realized we will need more than one allocator. There, the right allocation strategy can make a huge difference, and we know different apps require different schemes. Single- vs. multi-threaded is one thing; stack-like allocation, a bump allocator, etc. are others.

When I envision this, the user/runtime should be able to choose which allocator to use, potentially per kernel or at least per module.

I think there is a general appreciation of the fact that there is no one-size-fits-all allocator for embedded use cases. To begin with, we can start with some really simple allocators. The ones I plan to work on immediately are the bump allocator, a linear-time free-list allocator, and a logarithmic-time free-list allocator, along with single-threaded and multi-threaded flavors of all of these.
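
As a rough illustration of the linear-time free-list flavor mentioned above, here is a deliberately simplified first-fit sketch. Block splitting, coalescing, and alignment of the requested size are omitted, and all names are assumptions rather than the planned implementation.

```cpp
#include <cstddef>

// Each allocation is preceded by a header linking it into a singly linked
// list of blocks carved from a static pool.
struct Block {
  std::size_t size; // payload bytes available in this block
  bool free;
  Block *next;
};

namespace {
alignas(std::max_align_t) std::byte pool[2048];
Block *head = nullptr;
} // namespace

// Turn the whole pool into one big free block.
void freelist_init() {
  head = reinterpret_cast<Block *>(pool);
  head->size = sizeof(pool) - sizeof(Block);
  head->free = true;
  head->next = nullptr;
}

// First fit: walk the list (O(n)) and take the first block large enough.
void *freelist_alloc(std::size_t size) {
  for (Block *b = head; b != nullptr; b = b->next) {
    if (b->free && b->size >= size) {
      b->free = false;
      return reinterpret_cast<std::byte *>(b) + sizeof(Block);
    }
  }
  return nullptr;
}

// Mark the block free again; a real implementation would also coalesce.
void freelist_free(void *ptr) {
  if (ptr == nullptr)
    return;
  Block *b =
      reinterpret_cast<Block *>(static_cast<std::byte *>(ptr) - sizeof(Block));
  b->free = true;
}
```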

If we restrict ourselves to static linking, which I think is the most common mode of linking used for bare-metal embedded applications, the link order semantics facilitate overriding the libc-provided allocator functions with user-provided allocator functions. The libc project has made a conscious choice to call allocator functions from within the libc code using their public names (such as malloc and free) instead of using internal names, so that the user-provided functions get used.

Thanks for pointing this out; I will study the Autosar requirements in more detail. As I have mentioned above, we can add more complex allocators as required but will start with some simple ones. Also, as the complexity increases, using Scudo itself might actually be better/simpler.

As in, a single application will want to use more than one allocator strategy?

I can imagine that we might want to define it at per-kernel granularity; however, right now per-app granularity is good enough. If we do the dispatch through a compile-time constant switch, having multiple options will not even cost us much, no matter the granularity.
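
A sketch of what such a compile-time switch could look like with if constexpr; the strategy enum and the malloc-backed stand-ins are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

enum class AllocStrategy { Bump, FreeList };

// Stand-ins for real allocator entrypoints; here both just defer to malloc.
void *bump_alloc_impl(std::size_t n) { return std::malloc(n); }
void *freelist_alloc_impl(std::size_t n) { return std::malloc(n); }

// Chosen per module (or per kernel) at build time; the untaken branch is
// discarded entirely, so extra strategies add no runtime cost.
constexpr AllocStrategy kStrategy = AllocStrategy::Bump;

void *dispatch_alloc(std::size_t size) {
  if constexpr (kStrategy == AllocStrategy::Bump)
    return bump_alloc_impl(size);
  else
    return freelist_alloc_impl(size);
}
```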

We have a DSP with the following properties:

  • It is single threaded, but it has several computational vector cores that may work in parallel.
  • Since each core may want to access big chunks of memory at the same time, memory throughput must be sufficient.
  • High throughput is achieved by dividing internal memory into several “memory banks”. Each memory bank allows only two simultaneous accesses, but accesses to different memory banks can be done in parallel. This allows us to read N * 2 vectors at the same time.

The usual memory allocation algorithms do not care about such memory banks, so we implemented our own algorithm and provided the user a way to select the bank malloc should allocate memory in. You can think of it as if malloc had a second argument specifying which memory area to allocate in.
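
A toy sketch of such a two-argument interface; the bank count, sizes, and the malloc_in_bank name are all made up for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kNumBanks = 2;
constexpr std::size_t kBankSize = 512;

namespace {
// Each bank is modeled here as a separate fixed region with its own bump
// pointer; real hardware would map these to distinct memory banks.
alignas(std::max_align_t) std::byte banks[kNumBanks][kBankSize];
std::size_t bank_offset[kNumBanks] = {};
} // namespace

// malloc with a second argument selecting the bank; returns nullptr when the
// bank index is invalid or the bank is exhausted.
void *malloc_in_bank(std::size_t size, std::size_t bank) {
  if (bank >= kNumBanks || bank_offset[bank] + size > kBankSize)
    return nullptr;
  void *p = &banks[bank][bank_offset[bank]];
  bank_offset[bank] += size;
  return p;
}
```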

I doubt that there are many architectures that have these properties, so it is probably not worth adding support for this upstream. If it can be done, cool; otherwise I’d like to have a way to opt out of the in-tree memory allocation algorithms.

I’ve been trying to port LLVM’s libc to our architecture, and one of the main obstacles I ran into is unguarded uses of thread-local variables and atomics (we have neither). It would be nice if these were abstracted away in the implementation of memory allocation routines.
(The other main obstacle is the use of fixed-width integer types like uint8_t, but that’s a different story.)

Not just the allocation functions, the libc project allows you to opt-out of any function you do not want in your libc.

Do you need those parts of the libc which make use of thread-local variables and atomics? If yes, you can now make use of the LIBC_THREAD_LOCAL macro to make variables declared as thread-local not actually be thread-local. If no, just don’t include those parts in your libc, and thread-local and atomic variables should not trouble you. If you are facing problems either way, feel free to file bugs or start discussions here; at the least it will help us develop a better appreciation of the problems you are facing.

To be clear, I want them in libc, but not the generic implementation.
Not all entrypoints in libc are ALIASes, so I have to patch them, add baremetal/arch subdirectories, boilerplate CMakeLists.txt files, etc. That’s fine, apart from potential explicit (hopefully) or implicit merge conflicts. Still, it would be nice to have some way to “override” the generic implementation without patching existing code in CMakeLists.txt. E.g., instead of ALIASing an arch/os-specific entrypoint, only add the generic entrypoint if there is no arch/os-specific one with the same name. (It is probably more difficult than I can imagine, but maybe not.)