[RFC][clang/llvm] Allow efficient implementation of libc's memory functions in C/C++

TL;DR:
Defining memory functions in C/C++ results in a chicken-and-egg problem: Clang can rewrite the code into semantically equivalent calls to libc, including a call to the very function being defined. None of -fno-builtin-memcpy, -ffreestanding, or -nostdlib provides a satisfactory answer to the problem.

Goal
Implement libc’s memory functions (memcpy, memset, memcmp, …) in C++ to benefit from the compiler’s knowledge and profile-guided optimizations.

Current state
LLVM is allowed to replace a piece of code that looks like a memcpy with an IR intrinsic that implements the same semantics, namely call void @llvm.memcpy.p0i8.p0i8.i64 (e.g. https://godbolt.org/z/0y1Yqh).

This is a problem when implementing a libc memory function, as the compiler may choose to replace the implementation with a call to itself (e.g. https://godbolt.org/z/eg0p_E).
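To make the problem concrete, here is a minimal sketch of the kind of byte-copy loop in question. The name `naive_memcpy` is hypothetical (a real libc would export this as `memcpy`), and whether the transformation happens depends on the compiler version and optimization level.

```cpp
#include <cstddef>

// Hypothetical name for illustration. In a real libc this function would be
// `memcpy` itself, which is where the self-call hazard arises.
void* naive_memcpy(void* __restrict dst, const void* __restrict src,
                   std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    // A plain byte-by-byte copy. The optimizer may recognize this loop as a
    // memory copy and replace it with the @llvm.memcpy intrinsic, which can
    // in turn be lowered to a call to memcpy.
    for (std::size_t i = 0; i < n; ++i) {
        d[i] = s[i];
    }
    return dst;
}
```

If this function were named `memcpy`, that lowered call would target the function being compiled, giving unbounded recursion.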

Using -fno-builtin-memcpy prevents the compiler from understanding that an expression has memory copy semantics, effectively removing @llvm.memcpy at the IR level: https://godbolt.org/z/lnCIIh. In this specific example, the vectorizer kicks in and the generated code is quite good. Unfortunately this is not always the case: https://godbolt.org/z/mHpAYe.

In addition, -fno-builtin-memcpy prevents the compiler from understanding that a piece of code has memory copy semantics, but it does not prevent the compiler from generating calls to libc’s memcpy, for instance:
Using __builtin_memcpy: https://godbolt.org/z/O0sjIl
Passing big structs by value: https://godbolt.org/z/4BUDc0

In both cases, the generated @llvm.memcpy IR intrinsic is lowered into a libc memcpy call.
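The two cases can be sketched as follows. `Big`, `consume`, `forward`, and `copy_with_builtin` are hypothetical names, and the exact lowering depends on the target and optimization level, so the comments describe what is typically observed rather than a guarantee.

```cpp
#include <cstddef>

// A struct large enough that copying it is unlikely to be expanded inline.
struct Big {
    unsigned char bytes[4096];
};

// Takes its argument by value, so a call site has to materialize a copy.
unsigned consume(Big b) {
    return static_cast<unsigned>(b.bytes[0]) + b.bytes[4095];
}

// Case 1: passing a big struct by value. The copy of `b` is typically
// expressed as @llvm.memcpy, even though no memcpy appears in the source.
unsigned forward(const Big& b) {
    return consume(b);
}

// Case 2: an explicit __builtin_memcpy with a runtime size. With no constant
// size to expand, this is typically lowered to a call to libc's memcpy.
void* copy_with_builtin(void* dst, const void* src, std::size_t n) {
    return __builtin_memcpy(dst, src, n);
}
```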

We would like to use __builtin_memcpy to communicate the semantics to the compiler while preventing it from generating calls to libc.

One could argue that this is the purpose of -ffreestanding, but the standard leaves a lot of freestanding requirements implementation-defined (see https://en.cppreference.com/w/cpp/freestanding).

In practice, making sure that -ffreestanding never calls libc memory functions will probably do more harm than good. People using -ffreestanding now expect the compiler to call these functions, and the code bloat from inlining can be problematic for the embedded world (see comments in https://reviews.llvm.org/D60719).

Proposals
We envision two approaches: an attribute to prevent the compiler from synthesizing calls or a set of builtins to communicate the intent more precisely to the compiler.

  1. A function/module attribute to disable synthesis of calls

1.1 A specific attribute to disable the synthesis of a single call
__attribute__((disable_call_synthesis("memcpy")))
Question: Is it possible to specify the attribute several times on a function to disable many calls?

1.2 A specific attribute to disable synthesis of all libc calls
__attribute__((disable_libc_call_synthesis))
With this one we lose precision and may inline too much. There is also the question of what is considered a libc function; LLVM mainly defines target library calls.

1.3 Stretch - a specific attribute to redirect a single synthesizable function.
This one would help explore the impact of replacing a synthesized function call with another function but is not strictly required to solve the problem at hand.
__attribute__((redirect_synthesized_calls("memcpy", "my_memcpy")))
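
As a sketch of how proposal 1.1 might look at a use site: the attribute `disable_call_synthesis` is only proposed in this RFC and is not assumed to exist in any compiler, so the hypothetical `NO_CALL_SYNTHESIS` macro below expands to nothing unless the compiler reports support for it. `my_memset` is likewise a hypothetical name.

```cpp
#include <cstddef>

// Expand to the proposed attribute only if the compiler reports knowing it.
// The attribute name comes from the proposal and is an assumption here.
#if defined(__has_attribute)
#  if __has_attribute(disable_call_synthesis)
#    define NO_CALL_SYNTHESIS(...) \
        __attribute__((disable_call_synthesis(__VA_ARGS__)))
#  endif
#endif
#ifndef NO_CALL_SYNTHESIS
#  define NO_CALL_SYNTHESIS(...) /* attribute unavailable: no effect */
#endif

// The intent: inside this function the compiler must not synthesize a call
// to memset, even if it recognizes the loop as a memory fill.
NO_CALL_SYNTHESIS("memset")
void* my_memset(void* dst, int value, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    for (std::size_t i = 0; i < n; ++i) {
        d[i] = static_cast<unsigned char>(value);
    }
    return dst;
}
```

Without the attribute the function still compiles and behaves the same; only the guarantee against a synthesized memset call is missing.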

  2. A set of builtins in clang to communicate the intent clearly

__builtin_memcpy_alwaysinline(…)
__builtin_memmove_alwaysinline(…)
__builtin_memset_alwaysinline(…)

To achieve this we may have to provide new IR intrinsics (e.g. @llvm.alwaysinline_memcpy), which can be a lot of work.
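
A sketch of how such a builtin might be used inside a memcpy implementation. `__builtin_memcpy_alwaysinline` is the name proposed above and is not assumed to exist; the hypothetical `COPY_INLINE` macro falls back to `__builtin_memcpy` with a compile-time-constant size, which compilers usually expand inline but are not required to. `copy_block` and `sketch_memcpy` are illustrative names.

```cpp
#include <cstddef>

// Use the proposed builtin only if the compiler reports having it.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_memcpy_alwaysinline)
#    define COPY_INLINE(dst, src, n) \
        __builtin_memcpy_alwaysinline((dst), (src), (n))
#  endif
#endif
#ifndef COPY_INLINE
// Fallback (assumption): a constant-size __builtin_memcpy is usually, but
// not necessarily, expanded inline instead of becoming a libc call.
#  define COPY_INLINE(dst, src, n) __builtin_memcpy((dst), (src), (n))
#endif

// Copies exactly N bytes; N is a compile-time constant.
template <std::size_t N>
inline void copy_block(unsigned char* dst, const unsigned char* src) {
    COPY_INLINE(dst, src, N);
}

// Copies n bytes as a run of 16-byte blocks followed by a byte-wise tail.
void* sketch_memcpy(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (; n >= 16; d += 16, s += 16, n -= 16) {
        copy_block<16>(d, s);
    }
    for (; n > 0; ++d, ++s, --n) {
        copy_block<1>(d, s);
    }
    return dst;
}
```

The fixed-size blocks state the copy semantics explicitly, so the compiler can pick the load and store instructions. The surrounding loops are still ordinary code that the optimizer may recognize as a copy, which is why the fallback does not fully solve the problem described here.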

Target library is probably more relevant than libc. We have a number of issues with libm on tier 2 platforms for FreeBSD without assembly fast paths. This requires work-arounds for the fact that clang likes to say 'oh, this function seems to be calling X on the result of Y, and I know that this can be more efficient if you replace that sequence with Z', ignoring the fact that this case is an implementation of Z.

The same thing is true in Objective-C runtime implementations, where we need to be careful to avoid LLVM performing optimisations on the ARC functions that result in infinite recursion.

There are numerous cases of compiler-rt suffering from the same issue.

TL;DR: This is a really important problem for clang and your proposed solution 1 looks like it is far more broadly applicable.

David

Thx for the feedback David.

So we’re heading toward a broader

__attribute__((disable_call_synthesis))

David, what do you think about the additional version that restricts the effect to a few named functions?

e.g. __attribute__((disable_call_synthesis("memset", "memcpy", "sqrt")))

A warning should be issued if the arguments are not part of RuntimeLibcalls.def.

Also, I’d like to get your take on whether it makes sense to have this attribute apply to functions only, or at module level as well.

Thx,
Guillaume

Nit: the attribute basically just states that there is no runtime support for these functions in this context, so why not directly name it so:

__attribute__((no_runtime_for("memcpy", "memset", "sqrt")))

It still allows the compiler to synthesize calls to builtins that are guaranteed to be expanded inline later (if that is available).

David

I would find that exceptionally useful. For the libm example, preventing LLVM from synthesising calls to other libm functions that may call this one would be the fine-grained control that we want. For an Objective-C runtime, being able to explicitly disable synthesising ARC calls would be similarly useful (though I can no longer construct an example where LLVM does the wrong thing, so maybe this is fixed already in the ARC passes).

David

A proof-of-concept patch is available here for discussion:
https://reviews.llvm.org/D61634