Llvm-libc and embedded systems (follow up from LLVM Embedded Toolchains Working Group)

Introduction

At the LLVM Embedded Toolchains working group sync-up we have been discussing llvm-libc and embedded systems, with the aim of sharing some of the experience that toolchain suppliers have with embedded libraries and how that might be applied to llvm-libc. I volunteered, several meetings ago, to write up Arm’s experience with its proprietary C-library; it turns out it is harder to put into words than it is to talk about!

We are very interested to hear from other potential users of llvm-libc in an embedded context; we’re relying on our own Arm-coloured experience, which may not cover all use cases.

General properties of embedded C-libraries

The definition of embedded used here is in the spirit of a freestanding rather than hosted implementation. There is no assumption of an operating system abstracting away the hardware, with services to implement the C-library on top of. Common use cases for such a library involve developing firmware for larger devices and software for microcontrollers and real-time systems.

Common properties of an embedded library:

  • Statically linked (no OS to dynamically load), often for a specific sub-target.
  • No assumption of an OS to provide primitive operations, load the program or initialize the target hardware.
  • Code-size often more important than performance. Desire to only include what is used.
  • IO often redirected via a serial port or emulated by the Host/Debugger.

LLVM-libc alternative implementation of functions

Many embedded systems have a limited amount of read-only and read-write memory. Embedded C-libraries tend to favour small code-size over performance. Some of the functions in llvm-libc are tuned for maximum performance at the expense of code-size; for example, the memmove implementation when compiled for Arm is about 100 times larger than the equivalent implementation in Arm’s C-library, exceeding the total flash size of the smallest microcontrollers. We’ll need a way of building llvm-libc so that alternative implementations with a different code-size/performance trade-off are permitted, particularly in the strings area. Many functions in the C-library are independent, so it shouldn’t be too difficult to offer a choice of implementations at build time. There are areas that are more complicated due to dependencies between components; for example, many functions depend on the definition of opaque types such as FILE. In particular, the locale implementation affects several parts of the library, such as time, ctype and string.

Some embedded developers are willing to trade off standards compliance as well as performance for minimum code-size, particularly on the smallest microcontrollers like the Cortex-M0, which can have implementations with as little as 8 KiB of flash. The library might choose not to conform to IEEE 754, or not to support configurable locales or wide characters. Our experience with the smallest-possible-code-size use case is that it is better to design a C-library with this use case in mind, rather than trying to strip down an existing C-library to meet these constraints. For llvm-libc it may be worth setting out how far an alternative implementation can go. For example, do the build and tests need to support a library without configurable locales?

When implementing components with dependencies it will be good to isolate these so that alternative implementations are possible.

Printf optimization

The printf function in its entirety is large, as it has to handle many different cases in the format string. In an embedded system an optimized printf can omit the code to handle format specifiers that aren’t used. I’ve seen a number of mechanisms, with conditional compilation the most common one (for example https://github.com/mpaland/printf); however, that would be difficult to justify in a toolchain that supplies pre-compiled libraries, as we can’t know what might get used. With compiler support it is possible to arrange a printf implementation so that only the code to handle the format strings actually used is required; this would likely be similar to the existing __builtin_printf transformation. For example, the code to handle the floating-point specifier could be located in a separate object file. The main printf implementation refers to it only via a weak reference, so it is not automatically loaded by the linker; when the compiler determines that it is needed, it emits a non-weak reference to another symbol defined in the object file for just that purpose.

I must confess I’ve not dug too far into the details of the llvm-libc printf implementation, so it is possible that it already supports what would be needed; my guess from https://github.com/llvm/llvm-project/blob/main/libc/src/stdio/printf_core/parser.cpp#L124 is that this would be a compile-time choice?

It is likely that llvm-libc would need to provide a printf implementation that colludes with LLVM optimizations as an alternative implementation and not the default.

Low level hardware abstraction layer

An argument could be made that the low-level hardware abstraction layer is a separate library from llvm-libc, one that the llvm-project could provide a reference implementation of for common platforms. The low-level hardware abstraction library essentially provides the functionality that an OS would provide for a hosted implementation. As an example, newlib’s hardware abstraction layer is called libgloss (see “Embed with GNU”).

Startup Code

The definition of startup code I’m using here is code that runs before main, usually in an object called crt0.o. From the perspective of the C-library this usually requires:

  • A hook for user-supplied code to initialize any hardware: for example, enable the floating-point unit, cache, MPU or MMU, and potentially change to a lower privilege level. This supports the case where the startup code is the first thing in the system to run after reset.
  • Set up a stack pointer, often using a region of memory defined by a linker script.
  • Copy/zero-initialize memory, driven by the linker script (this can be done in assembly before a stack pointer is available).
  • Run initializers such as those in .init_array.
  • Call main, with some code to handle argc and argv if the low-level abstraction layer supports it; for example, semihosting can obtain command-line options from the host.
  • Call exit after main has finished; exit is not usually expected to return in an embedded system.
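The list above can be sketched in C, in the spirit of picolibc’s crt0.c. Everything here is illustrative: the symbol names are hypothetical, and the linker-defined symbols are modelled as plain arrays so that the sketch compiles standalone.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for linker-defined symbols. In a real crt0 these bounds come
   from the linker script (symbols marking the load address and extent of
   .data and .bss); plain arrays with hypothetical names are used here so
   the sketch compiles standalone. */
const unsigned char data_load_image[4] = {1, 2, 3, 4}; /* .data image in flash */
unsigned char data_region[4];                          /* .data in RAM */
unsigned char bss_region[8] = {0xAA, 0xAA, 0xAA, 0xAA,
                               0xAA, 0xAA, 0xAA, 0xAA}; /* .bss, must be zeroed */

int init_ran; /* set by the sample initializer below */
static void sample_init(void) { init_ran = 1; }
typedef void (*init_fn)(void);
static init_fn init_array[] = {sample_init}; /* stand-in for .init_array */

/* The C part of startup: by this point an assembly shim has usually run
   the hardware-init hook and set up the stack pointer. */
void crt0_init(void) {
    memcpy(data_region, data_load_image, sizeof(data_region)); /* copy .data */
    memset(bss_region, 0, sizeof(bss_region));                 /* zero .bss */
    for (size_t i = 0; i < sizeof(init_array) / sizeof(init_array[0]); i++)
        init_array[i]();                                       /* run initializers */
    /* A real _start would now call exit(main()); exit is not expected
       to return on an embedded target. */
}
```

Writing this part in C rather than assembler, as picolibc does, keeps it readable and portable; only the very first instructions (stack setup, hardware init) need to stay in assembly.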

The libgloss crt0 is written in assembler and is quite difficult to follow. The picolibc library has written more of this in C. For example, https://github.com/picolibc/picolibc/blob/main/picocrt/crt0.h has the generic part, and there are machine-specific parts such as https://github.com/picolibc/picolibc/blob/main/picocrt/machine/aarch64/crt0.c and https://github.com/picolibc/picolibc/blob/main/picocrt/machine/arm/crt0.c

The startup code often has assumptions about linker defined symbols to provide the location of the stack and in some cases the heap. It is not common for the library itself to provide the linker script as hardware varies considerably, the toolchain might include some examples to run on a simulator. For example the Arm LLVM embedded toolchain has https://github.com/ARM-software/LLVM-embedded-toolchain-for-Arm/blob/main/ldscript/base_aarch64.ld

Retargeting of IO

Many embedded systems do not have a filesystem, but it is still useful to be able to use these facilities from embedded programs, particularly in testing. With a retargeting layer, IO can be redirected through a peripheral like a serial port, or implemented by a host such as a debugger or model via semihosting (https://github.com/ARM-software/abi-aa/blob/main/semihosting/semihosting.rst).

The libgloss library (newlib’s hardware abstraction library) has some documentation on its retargeting layer (see “Embed with GNU”).

Typically the high-level routines are implemented in terms of a narrow porting layer. This can limit the ability of the implementation to optimize so it is possible that this would need to use an alternative implementation.
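As an illustration of such a narrow porting layer, a newlib-style output hook might look like the sketch below. The function name `_write` follows newlib’s convention rather than llvm-libc’s actual porting API, and the UART is modelled as a plain buffer so the sketch runs standalone; a real port would write to a volatile device register or issue a semihosting call.

```c
#include <stddef.h>

/* Hypothetical UART modelled as a buffer. On real hardware this would be
   a volatile memory-mapped register at an address from the datasheet. */
char uart_log[256];
static size_t uart_pos;

static void uart_putc(char c) {
    if (uart_pos < sizeof(uart_log) - 1)
        uart_log[uart_pos++] = c; /* buffer stays NUL-terminated */
}

/* The narrow porting hook: everything the high-level stdio routines
   write funnels through here. A semihosting build would instead issue
   a SYS_WRITE request to the host. */
int _write(int fd, const char *buf, size_t len) {
    (void)fd; /* no filesystem: every descriptor goes to the serial port */
    for (size_t i = 0; i < len; i++)
        uart_putc(buf[i]);
    return (int)len; /* report everything as written */
}
```

Because every stdio call ends up at this one hook, the high-level routines cannot easily batch or specialize their output, which is the optimization limit mentioned above.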

Retargeting threads

This is similar to libc++, where the default implementation of std::thread is built on top of pthreads, which a bare-metal environment can’t usually assume. For libc++ there is an option to use an external threading header file, which permits an alternative low-level threading implementation: https://github.com/llvm/llvm-project/blob/main/libcxx/docs/DesignDocs/ThreadingSupportAPI.rst. I expect that we’ll need something similar for llvm-libc to work with an OS that has its own threading primitives.

Retargeting memory management

There may not be an OS to provide more heap memory. In principle a hard-coded alternative malloc implementation could be provided, although another way to solve this problem is to require an implementation of sbrk from the hardware abstraction layer. An example of a very simple implementation that uses linker-defined symbols to identify an area of memory for the heap can be found in picolibc: https://github.com/picolibc/picolibc/blob/main/newlib/libc/picolib/picosbrk.c
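A bump-pointer sbrk in the spirit of picolibc’s picosbrk might look like this sketch; the heap bounds are modelled here as a static array, where a real port would use linker-defined symbols, and the `_sbrk` name follows the newlib convention.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for a linker-reserved heap region; a real port would take
   these bounds from linker-defined symbols. The array and its size are
   illustrative. */
static unsigned char heap[1024];
static unsigned char *heap_brk = heap;

/* Bump-pointer sbrk: malloc builds its arena on top of this. Shrinking
   the heap (negative increments) is not supported in this sketch. */
void *_sbrk(ptrdiff_t incr) {
    if (incr < 0 || (size_t)(heap + sizeof(heap) - heap_brk) < (size_t)incr) {
        errno = ENOMEM;    /* out of linker-reserved heap space */
        return (void *)-1;
    }
    void *prev = heap_brk; /* the previous break is the new allocation */
    heap_brk += incr;
    return prev;
}
```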

Layering on top of an existing library

The current layering scheme relies on dynamic linking, so it won’t be suitable for static linking. I can’t think of a clean way to implement layering on top of an existing static library. If llvm-libc is encountered first by the linker and defines all the symbols of the corresponding object(s) in the base libc, then the object from the other library won’t be selected by the static linker. If there are symbols defined by the corresponding object(s) in the base libc but not in llvm-libc, then both objects could get selected, leading to multiple-symbol-definition errors. I don’t think this would be manageable without picking a specific base libc.

POSIX Compatibility

The IEEE 1003.13-2003 specification (“Standardized Application Environment Profile (AEP): POSIX Realtime and Embedded Application Support”) defines subsets of POSIX that are suitable for bare-metal embedded systems.

  • PSE 51 (Minimal, No MMU and no physical filesystem)
  • PSE 52 (Controller, MMU and physical filesystem)

These specifications are primarily requirements on an operating system, but they do include parts of the C-library, and there are some areas of overlap. For example, the C-library will need to provide POSIX extensions and the OS will have to provide the low-level pthread implementation; file descriptors and signals are a bit more ambiguous.

I expect that the OS developer will have to do the majority of the work to integrate llvm-libc. I’m thinking that the majority of the work in llvm-libc is to make sure that the POSIX requirements for the C-library can be met, and that there is documentation for OS vendors to follow.

References

Two OSS RTOSes that have a substantial subset of POSIX implemented:

FreeRTOS appears to have some experimental support FreeRTOS-Plus-POSIX - FreeRTOS

Build configurations

Even within a single LLVM target like Arm, AArch64 or RISC-V there will be a large number of possible sub-architectures and ABI options. For example, the GNU embedded toolchain for Arm has 32 multilib combinations across Arm, Thumb, v5 to v8.1, soft-float, hard-float, and vector unit present. Other architectures will have their own combinations. We’ll want to have a way to:

  • Build all the variants needed for a toolchain via a single runtimes build.
  • Describe the variants to the bare-metal driver via multilib; this is hardcoded as of today. Ideally the library build could populate configuration files that a multilib-aware toolchain could use to configure itself.
  • Run some form of tests for each variant assuming some kind of test runner is provided.

Testing

Buildbots

Running the tests on an embedded system is likely to require something like libc++’s test-executor concept, which can be used to run on an emulator or even to download a test to a dev board. This works, although it can be slow, especially when running many different variants. It is difficult to get around this without a lot of work in llvm-lit or the tests to do all the builds before running the tests in a batch.

The target-specific implementations of the hardware abstraction layer are not going to be tested by existing host-based buildbots. For each new target we should have public buildbots that span the majority of variants supported by conditional compilation. For example, on Arm, v8-m.mainline and v8-m.baseline, v8-r and v8-a are likely to be close enough to be covered by the same builder.

There are, to my knowledge, no public buildbots for the compiler-rt builtins, which has made introducing changes without undetected breakages harder than it would otherwise have been.

Subset tests that require a feature unavailable on the target

I think these can work similarly to how libc++ handles subsets of the library. A feature like C11 threads that requires an external threading implementation may not be desirable to include in a library build, or may be difficult to test without an OS. A way to define build options for such a subset that also disables the tests requiring the excluded feature will be useful. For example, https://github.com/llvm/llvm-project/blob/main/libcxx/cmake/caches/Generic-no-localization.cmake turns off localization support.

Reproducing failures

Setting up all the dependencies to run tests can be quite a bit of work. Is it worth asking bot maintainers to provide docker containers with the necessary models and test execution scripts?

Where can Arm Contribute?

Arm is interested in having llvm-libc support embedded systems. This would make it possible to construct an embedded toolchain entirely from the llvm-project.

Our thoughts are that getting to a stage where we can build the library with enough of a hardware abstraction layer to run the tests on a model would be the best starting point. That would form the basis of a build-bot that could be used to support further development.


This was a very informative read, and I appreciate you writing it up. I wanted to answer a few of the questions you raised as one of the people working on the project.

For alternative implementations of functions there has been some internal discussion around separate implementations for size vs. performance optimized builds. We very much intend to do this, but I believe we are still working on the specifics of the design. Regarding locales and wide characters, we have not implemented them so far and they aren’t a current priority. We will definitely focus on having an implementation that works well without them, since they are unnecessary in most cases. Currently all of our functions are independent in the sense that they don’t call each other’s public entrypoints, although some do share internal functionality (e.g. the printf functions all call into the same internal printf_main function). This means that including any individual function should not pull in anything that is not needed.

Speaking of printf, I’m currently the primary developer for our implementation, so I can talk a bit about the design. We are indeed planning on making floating point support a compile option, as well as a few other pieces of printf that an embedded system might not need, such as POSIX’s index mode and integer writing. The system is designed to be modular, with each conversion specifier having its own header (e.g. int_converter.h). This allows us to simply not include headers that aren’t needed, which you can see for integer writing in converter_atlas.h. Overall the goal is to give the users of our library the knobs to adjust it to fit their needs, so if there are any knobs that you feel we are missing then we would like to add them.

Feel free to reach out if you have questions about anything I’ve said here. I’m looking forward to working with you in the future.

Thank you Peter for such a detailed write-up. I’m really excited to see efforts to make LLVM libc support this use case. It feels to me like a high quality embedded-friendly libc with LLVM’s collaboration model, infrastructure, and licensing could see a lot of uptake.

Regarding build configurations: this is perhaps implicit based on your previous discussion about optimising for size, but I think there’d be a need to expose configurations across the dimension of target hardware/ABI, as well as libc configurations (trading off code size for features / conformance etc). I could be wrong, but the impression I had was that newlib had a standard configuration, the newlib-nano configuration, and then probably also allows more fine-grained control of individual build flags for people who need different tradeoffs. Do you envisage llvm-libc providing something similar?

Re testing and buildbots: if anyone has real-world experience of qemu just not being up for the job then do speak up, but I’d suggest that making a virtual target the primary test target (with buildbots with boards connected as a useful additional check) is probably the way to go. If we provide the right documentation (or containers etc), any contributor can trivially spin up a simulated environment, while the same obviously isn’t true for embedded devboards.

Michael: Peter suggested I might say something here about the modular printf system we developed for Arm Compiler, in case it’s of interest.

The aim is to arrange that a single static library can automatically include only the subset of the full printf subsystem needed by a given application, without the library itself having to be rebuilt with options like ‘no FP formatting’.

In order to do this, the compiler has to analyze printf format strings to figure out what features are used, and transform the printf calls in some way. So this requires code in the compiler to collude with the library.

The basic idea is that printf is divided up into a central core function which I’ll call __printf_core for illustration, and a bunch of independent formatting routines for particular format specifiers like %s and %d. The implementation of __printf_core will contain a switch statement on the formatting directive character, which calls the appropriate formatting subroutine, along the lines of (heavily simplified)

    switch (fmt_char) {
        case 'd': __printf_d(&state, va_arg(ap, int)); break;
        case 's': __printf_s(&state, va_arg(ap, const char *)); break;
        case 'f': __printf_f(&state, va_arg(ap, double)); break;
        // and more
    }

but in that library object file, all of the formatter subroutines like __printf_d are declared with __attribute__((weak)). So when the linker pulls in __printf_core from the library, it doesn’t automatically pull in all of the formatting subroutines too.
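The linker behaviour this scheme relies on can be demonstrated in isolation: an undefined weak reference resolves to address 0 rather than causing a link error, so a caller can test for the formatter’s presence. The formatter name below follows the illustration in this thread and is not a real library symbol.

```c
/* Weak declaration with no definition anywhere in the link: on ELF
   targets the reference resolves to address 0 instead of producing an
   undefined-symbol error. */
__attribute__((weak)) void __printf_f(void *state, double value);

/* The core function can therefore test whether the formatter was pulled
   in. A compiler-emitted strong reference (e.g. via .globl) at a call
   site whose format string uses %f would make this return non-zero. */
int float_formatter_linked(void) {
    return __printf_f != 0;
}
```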

Then, the compiler will transform each individual printf call (if it can) into a call to __printf_core, plus a set of non-weak references to the formatter functions that that particular format string will require. For example, a call like this

printf("there are %d %s\n", 4, "lights");

would be converted into something like this:

__printf_core("there are %d %s\n", 4, "lights");
__asm(".globl __printf_d");
__asm(".globl __printf_s");

which requests from the library only the core printf function, and the specific formatting routines that are needed for this particular format string. In this case, __printf_f is left out, for example.

So the effect is that the formatters for FP, decimal, hex, strings, wide strings, etc, can all be independently included or excluded in a given static link, and it all happens automatically, without anyone having to rebuild the library or write any explicit customization config specific to their application.

The symbol printf itself still exists in the library, as a trivial wrapper on __printf_core which also includes a non-weak reference to every formatter. So if the compiler is unable to optimize any printf call (e.g. because its format string is unavailable), then you get everything included in the link just in case, which is the only safe thing to do in that situation.

Side notes:

  • this need not be limited to format specifier characters. We similarly modularized the code that prints padding for an explicitly specified field width, for example. Anything you can put in a separate subroutine and call via a weak reference, and identify from the format string when a non-weak reference must be inserted, is a candidate for this optimization.
  • if the format string is not statically determinable in a particular printf call but the argument list is, you can still do a limited version of the same optimization. You can’t tell %u from %x, or %s from %p, but you can at least avoid including integer formatters if there are only strings in the argument list.
  • a couple of times we’ve moved further pieces of the core function out into modules, and maintained backwards binary compatibility with existing object files by renaming the core function. So if an object file refers to the old core function name, that automatically inserts non-weak refs to all the modules that were part of the core in the previous version.
  • in the current Arm Compiler libraries, this system involving weak references is replaced by a much more complicated piece of specialist linker magic which allows further space optimization, but that would take much longer to describe and probably isn’t suitable for use in LLD in any case.

Thanks for all the responses.

Yes, there will need to be software-related configuration as well. Arm has the nightmare of the cross-product of (-fshort-enums, -fno-short-enums) * (-fshort-wchar, -fno-short-wchar) * (soft, softfp, hardfp), and more that I’ve forgotten about. Mostly these can be dealt with by compiler options, but yes, there may need to be conditional compilation in places.

I’m expecting things can work like libc++ where there are individual tuning parameters that can be enabled at build time for those that need fine grained control. I’m thinking that no one toolchain can provide all possible binary variants that someone might want to use so having a means to build their own will be useful. I think the CMake cache files that combine multiple options could be used to model standard configurations like newlib/newlib-nano.

I think QEMU support for most, if not all, of the LLVM supported targets is good enough to use for testing. For testing the compiler-rt built-ins I was able to build the tests using a linux target and use the user-mode emulator to test the library, that worked as the user-mode emulator was a superset of the embedded target. For a C-library I think we’d need to use the system-mode emulator, which is a bit more complex to set up. It would be good to have a skeleton QEMU runner template that can be modified.

Thanks for the write-up, this is exciting!

I’m the author of the low-level memory functions (memcpy, memset, bzero, memcmp, bcmp and memmove). As you said, they are currently written for maximum performance at the expense of code size (a few hundred bytes: for x86, memset is 233 B and memcpy is 387 B). Of course they can be trimmed down to a few bytes if needed.

Apart from the code-size/performance trade-off, do you see any other dimensions in the design space?
BTW, is it always “less code is better”? And if not, where do you set the cursor between the two?

There is currently no dependency between the memory primitives by design (we want to be able to ship bzero without memset). But I can imagine having memcpy be an alias of memmove if code size is an absolute necessity. Would that make sense?

Any peculiarities of the embedded world regarding memory primitives? Alignment constraints? Pointer sizes? I don’t see anything special, but I’m not an expert here.

For our libraries it is a blend of performance and code-size. We have a code-size-at-all-costs library where memcpy is effectively a single loop copying a byte at a time, and a standard implementation that is probably not too different in structure from the llvm-libc implementation but without as much unrolling (copy up to the nearest 4-byte boundary, loop copying 4 integers at a time, then 2, then 1, followed by the remaining bytes).
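As a rough sketch (not Arm’s actual library code), the two extremes described above might look like this; the word-wise variant assumes a 4-byte word and uses `__builtin_memcpy` for the word moves so that an unaligned source stays well-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Code-size-at-all-costs variant: a single byte-copy loop. */
void *memcpy_tiny(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Middle-ground variant: byte-copy up to the nearest 4-byte boundary,
   copy a word at a time, then mop up the tail (the library described
   above also steps down through 2-word and 1-word loops). */
void *memcpy_word(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n && ((uintptr_t)d & 3)) { /* head: align the destination */
        *d++ = *s++;
        n--;
    }
    while (n >= 4) {                  /* body: 4 bytes per iteration */
        uint32_t w;
        __builtin_memcpy(&w, s, 4);   /* safe even if src is unaligned */
        __builtin_memcpy(d, &w, 4);
        d += 4; s += 4; n -= 4;
    }
    while (n--)                       /* tail */
        *d++ = *s++;
    return dst;
}
```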

Alignment does come into it, as there are some CPUs that don’t support unaligned accesses, or don’t support unaligned accesses to certain types of memory. As a general rule I’d expect that we’d build with unaligned accesses disabled so that the libraries work regardless of the memory type.

On 32-bit Arm there is a range of memory functions defined in the ABI (abi-aa/rtabi32.rst at main · ARM-software/abi-aa · GitHub) that can be used when the compiler knows the alignment. This can permit the use of special instructions (Arm’s load and store pair instructions LDRD and STRD, at least on older implementations, require 8-byte alignment). Many implementations just alias these to the non-specialized implementation.

At least for Arm and AArch64 the pointer sizes are similar to x86_64, there are some stranger DSP targets that may have more constraints.

Thank you Peter for the write-up and everyone else for the discussion!

When thinking about how projects can best utilize LLVM’s libc, I think it would be interesting to consider a model where the libc code is built “fresh” from sources as-needed. That is, when you are building your project for deployment on a particular microarchitecture, you build only the parts of libc you need, and you build those sources in the same way as your project. Only exactly what is used by your code is built and linked in, and it is built for the specific microarchitecture on which you want to run, and the code is built with whatever compiler flags or options you want (e.g. sanitizers), and you can optimize for speed or size, etc.

This model is somewhat in use today. E.g. many math functions can be built differently for different microarchitectures that do (or do not) support FMA. Of course it has some drawbacks; “How do I actually do that using my project’s build system?” being the obvious one.


I see that we have special code in LLVM to deal with these functions (1, 2).

Does llvm-libc need to provide these functions in order to be compliant?

They have to be available to the bare-metal target, but they don’t necessarily have to be implemented in the C-library. For example, compiler-rt (llvm-project/aeabi_memcpy.S at main · llvm/llvm-project · GitHub) and libgcc (I don’t have an online link, but you can search for aeabi in there) have essentially aliased them to the standard versions.

Hi Simon,

The compiler based printf sounds interesting, and our design is already fairly similar to what you’re describing with the individual conversion functions (see llvm-project/converter.cpp at main · llvm/llvm-project · GitHub). We don’t want to get rid of the existing compile time options since they are useful for situations where you don’t want conversions to even be available (%n being the most obvious example), but if a compiler wanted to support static conversion with our printf it should be fairly simple. Each converter takes a FormatSection with the argument and other state (described here: llvm-project/core_structs.h at main · llvm/llvm-project · GitHub) and a writer that it uses to output its results (described here: llvm-project/writer.h at main · llvm/llvm-project · GitHub), both of which can theoretically be prefilled with data. The argument in the FormatSection has to be set at runtime, although in theory I could make a function that takes an argument and a prefilled FormatSection and combines them. Overall I think it’s very doable, although it would require more compiler knowledge than I have.

Hope this helps,
Michael Jones

Yes, that makes sense – %n is indeed a good example of a thing you might want to ensure you leave out, but FP is another one, not for the danger but for the huge code size.

I don’t see any reason the two systems couldn’t coexist, of course! Once you have the module system in the first place, it’s easy to say that the call site for an optional function calls it (a) via a weak reference, (b) not at all if an #ifdef tells it to leave that part out.

Hi Peter,

First of all, thanks for starting this discussion. Some of the current LLVM libc developers have already responded on the topics of their interest, but to take this forward, I think that each of the topics you touched should be discussed in detail in separate focused threads. We do not have to discuss all of them simultaneously, but I will leave it up to you how you want to do it.

I have some questions about the last section of your post, “Where can Arm Contribute?”. It seems to me like you want to start with building out the hardware abstraction layer? If yes, I want to know more details about this hardware abstraction layer - we can take it up in a more focused thread. If not, can you maybe elaborate/clarify where Arm would want to start and drive?

Thanks,
Siva Chandra

Hello Siva,

It would definitely be worth separating out into a separate thread.

At a high level we’ll follow the usual process of submitting an RFC with a design, and when there is sufficient consensus on the approach we can start posting patches.

I’ll aim to send out a message on Discourse next week with a few more details; it won’t be anything near an RFC by next week, but it will have an outline of some options.

We may be a bit slow to respond over the summer as we’re approaching the European holiday season.

Peter