[RFC] libc -ffreestanding / -fno-builtin

I’m starting this thread in response to Monthly LLVM libc meeting - #6 by michaelrj-google (I can’t attend the meeting so let’s discuss this in an RFC)

  • -ffreestanding
    • Need to talk to Guillaume (gchatelet)
    • The original reason for the flag was to avoid the compiler calling builtin memcpy inside memcpy
    • -fno-builtin prevents the memcpy inlining, but it also prevents inlining later on and causes issues with LTO
      • LTO and inlining are very important in GPU builds, but also anyone else who is building from source
    • Original patch: D74162 [Inliner] Inlining should honor nobuiltin attributes
    • It would be nice to have a reproducer so it can be checked if the problem is fixed

I totally agree that these options ( -ffreestanding / -fno-builtin ) prevent almost all inlining possibilities between application code and the libc. This is not desirable in the long run.

Now, the concern about the compiler turning memcpy code into a call to memcpy is not restricted to this particular function. The compiler is allowed to turn any code that looks like a libc function into a call to that function.

e.g. GCC turning custom mystrlen into a call to strlen
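As a minimal sketch of the pattern in question (the name `mystrlen` comes from the line above; the body is an assumption about what such a function typically looks like), a hand-written length loop like this is the kind of code an optimizer may recognize and replace with a call to strlen. Whether a given compiler actually does so depends on its version, optimization level, and flags.

```c
#include <stddef.h>

/* Hypothetical hand-written strlen. An optimizer with loop idiom
   recognition may replace this loop with a call to strlen; the thread
   proposes building such functions with -fno-builtin-strlen. */
size_t mystrlen(const char *s) {
  size_t n = 0;
  while (s[n] != '\0')
    ++n;
  return n;
}
```

If this function were itself named strlen, the replacement would make it call itself, which is the failure mode discussed in this thread.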

We can improve on the current situation though: -ffreestanding implies -fno-builtin, which really prevents all optimizations.

For the production version of libc we can remove -ffreestanding and use finer-grained -fno-builtin-<function> flags. That is, use -fno-builtin-memcpy when compiling libc.src.string.memcpy, use -fno-builtin-strlen when compiling libc.src.string.strlen, and so on. This still prevents a lot of inlining possibilities but it’s a first step.

When the compiler is clang, we can be even more specific and apply an attribute to specific functions when we know their bodies are subject to libc delegation. This would only work for clang, but it’s a much more precise tool, effectively preventing inlining only for a handful of problematic functions.

Let’s look at a contrived example to see what I mean. The following function would need to be compiled with -fno-builtin-memset

void libc_memset(char* ptr, char value, size_t size) {
  for(size_t i=0; i < size; ++i)
    ptr[i] = value;
}

But this could be rewritten into

__attribute__((no_builtin("memset"))) // clang-only attribute
void libc_memset_loop(char* ptr, char value, size_t size) {
  for(size_t i=0; i < size; ++i)
    ptr[i] = value;
}

void libc_memset(char* ptr, char value, size_t size) {
  if(size == 0) return;
  libc_memset_loop(ptr, value, size);
}

With this second version, PGO and LTO would be able to inline the zero size shortcut at the call site if deemed necessary.

Additionally, for the specific case of memcpy, when __builtin_memcpy_inline is available and used to implement the function, the compiler has nothing left to recognize as memcpy semantics, so we could drop -fno-builtin-memcpy completely.
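A minimal sketch of that idea, assuming a fixed 16-byte copy (the helper name `copy_block_16` and the block size are hypothetical, not taken from LLVM libc). Clang documents `__builtin_memcpy_inline` as generating code that is guaranteed not to call any external function; the fallback branch is only there so the sketch compiles with other compilers.

```c
#include <stddef.h>

#ifndef __has_builtin
#define __has_builtin(x) 0 /* for compilers without __has_builtin */
#endif

/* Hypothetical helper: copy one fixed-size 16-byte block. */
static inline void copy_block_16(char *dst, const char *src) {
#if __has_builtin(__builtin_memcpy_inline)
  /* Clang: expanded inline, never lowered to a call to memcpy. */
  __builtin_memcpy_inline(dst, src, 16);
#else
  /* Fallback: a plain loop, which may still be recognized as memcpy
     and would therefore still need -fno-builtin-memcpy. */
  for (size_t i = 0; i < 16; ++i)
    dst[i] = src[i];
#endif
}
```

The size argument is kept a compile-time constant here, since older clang versions require that.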

Now, for integration tests I think it’s still important to use -ffreestanding to make sure we don’t accidentally depend on hosted features.

@michaelrj-google @jhuber6

Thank you for the very timely response. Another benefit of compiling with -ffreestanding is that __STDC_HOSTED__ is defined to 0 when including libraries, which prevents us from going to system paths.

We definitely need some kind of -fno-builtin option to prevent the compiler from optimizing the functions into calls to themselves. I see in gcc this even occurs if you make the function name the name of the builtin, e.g. Compiler Explorer.

I believe the original problem this was solving was specifically related to how the LLVM optimizer was handling these functions. So we would get the case where the function is transformed into a call to itself, which will get inlined forever. Preventing inlining is one solution to this, but I remember in the original discussion of D74162 [Inliner] Inlining should honor nobuiltin attributes there were some other suggestions. Will one of the memset examples exhibit the bug upstream if I compile it without -ffreestanding?

For gcc, I suppose -fno-tree-loop-distribute-patterns is another choice to disable self calls.
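For reference, a sketch of how that GCC option can be scoped to a single function through the `optimize` function attribute, assuming GCC accepts the option in this form. The macro name `NO_LOOP_TO_LIBCALL` and the function name `libc_memset_gcc` are hypothetical; the attribute is guarded because clang does not implement it.

```c
#include <stddef.h>

/* Assumption: only GCC honors this attribute, so other compilers
   get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
  __attribute__((optimize("-fno-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Byte-wise memset whose loop GCC should not turn into a memset call. */
NO_LOOP_TO_LIBCALL
void *libc_memset_gcc(void *ptr, int value, size_t size) {
  unsigned char *p = ptr;
  for (size_t i = 0; i < size; ++i)
    p[i] = (unsigned char)value;
  return ptr;
}
```

This only covers the loop distribution pass; it is not a substitute for the nobuiltin semantics discussed for LLVM.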

Reviving this slightly after the holidays. Do you have an example of some code that would fail if we reverted D74162 [Inliner] Inlining should honor nobuiltin attributes? I’m wondering if we can find a better solution than to prevent all inlining so we can freely use -ffreestanding. I think this is related to a lot of other problems where backends can’t adequately report which runtime functions are supported.

Thx for the follow up, I never sent the answer I wrote and then it slipped my mind… My apologies for that.

I had trouble accessing D74162, the link was timing out but it’s fixed now.

So AFAIR the issue was not about -ffreestanding but about -fno-builtin and as I mentioned earlier (at least in clang) the former implies the latter.

The issue came up during LTO, where functions were in the form of IR in object files. If you perform interprocedural optimization between functions, the imported IR uses @llvm.memcpy, and the original translation unit’s compiler flags do not specify -fno-builtin-memcpy, then the compiler is able to recognize the memcpy semantics and replace it with a call to the libc’s memcpy, leading to an infinite loop. This happened in production. This was because the no-builtin-memcpy semantics lived only in the state of the compiler (through compiler flags) and was not stored next to the function in the IR.

The aforementioned patch would make sure that if -fno-builtin-* is used when compiling libc memcpy, then this “no builtin” semantics will be stored in the IR and will prevent further optimizations.

AFAIU your concern is more about the coupled semantics between -ffreestanding and -fno-builtin rather than making sure that the -fno-builtin semantics is stored in the IR, am I correct?

Thanks for the extra context.

Overall, the desire is to treat the GPU as a standard clang target. Because the GPU is unhosted we like to use -ffreestanding when compiling GPU programs. For performance reasons, we also always target an LTO build when targeting the GPU. The problem is that the current logic states that any program compiled with -ffreestanding cannot be inlined into user code. I would like to be able to use -ffreestanding semantically without the extreme performance hit it currently causes on the GPU.

I remember we discussed the original problem with @arsenm. I’m thinking there should be a way to get the same semantics where it matters without simply turning off all inlining.

@gchatelet I remember we discussed this at the Monthly meeting again. I’m wondering if we could simply go with libcall recognition or some alternate approach that at a minimum allows functions that don’t match intrinsic functions to be used.

From the LLVM perspective, there’s a fundamental split here: any given function has to be either no-builtin (i.e. not recognized as a libcall), or external to the translation unit. The duality here mostly isn’t specific to the way LLVM is implemented; what makes a libcall a “libcall” is that its interface and semantics are hardcoded into the compiler. So if we have a specific implementation visible to the optimizer, it can’t be a libcall: it’s a specific function with the semantics written in the IR.

It might be possible to define some sort of hybrid mode, where C library functions are in the IR module but still have some sort of “libcall” semantics. But I’m not sure how you’d define that; in a lot of cases, LLVM’s understanding of the semantics of a function breaks down once you start inlining bits of libc.

Getting inlining from libc to work should just be a matter of adding an IR transform: when you pull libc into the link, you mark the user code nobuiltin. Once all the code is nobuiltin, the nobuiltin-ness of libc doesn’t block inlining.

Adding @teresajohnson here to get her insights regarding FDO.

For context, LLVM libc is compiled with -ffreestanding, which implies -fno-builtin, which marks all functions in LLVM libc with the nobuiltin attribute.

This in turn prevents inlining LLVM libc functions into the application code, which has dramatic performance implications when targeting GPU applications (for the GPU a large part of the functions are implemented as intrinsics intended to be inlined).

@efriedma-quic was offering to add an LLVM IR pass marking all the functions nobuiltin at link time, hence actually removing the barrier to inlining.

@teresajohnson do you envision any issues with this solution when ThinLTO is used? Or maybe a different solution like selectively discarding the nobuiltin attribute mismatch in TargetLibraryInfo.h when linking? Getting more inlining opportunities for libc would definitely help us as well.

Adding @teresajohnson here to get her insights regarding FDO.

Do you mean ThinLTO? I don’t see any FDO related discussion here.

For context, LLVM libc is compiled with -ffreestanding, which implies -fno-builtin, which marks all functions in LLVM libc with the nobuiltin attribute.

This in turn prevents inlining LLVM libc functions into the application code, which has dramatic performance implications when targeting GPU applications (for the GPU a large part of the functions are implemented as intrinsics intended to be inlined).

@efriedma-quic was offering to add an LLVM IR pass marking all the functions nobuiltin at link time, hence actually removing the barrier to inlining.

From @efriedma-quic’s description “when you pull libc into the link, you mark the user code nobuiltin.” - does this mean run as a standalone pass or would this be implemented in the IR Linker? If the latter, it should apply to all types of LTO. For the former, it would need to be added to both LTO backend pipelines.

Also, does this mean that every function in the module would be marked nobuiltin if any nobuiltin code is linked in? Is that going to be overly conservative?

@teresajohnson do you envision any issues with this solution when ThinLTO is used?

See above. I don’t believe ThinLTO function importing is prevented for these cases currently (we don’t have the nobuiltin attributes in the summary), and the IR Linker is used for all types of LTO. If this is done as a standalone pass it would need to be added to both pass pipelines, but that should be straightforward (if I’m understanding the proposal correctly).

Or maybe a different solution like selectively discarding the nobuiltin attribute mismatch in TargetLibraryInfo.h when linking?

Would that be legal? My recollection of the earlier problem is that we would lose the nobuiltin by inlining and result in the incorrect transformation of inlined callees causing infinite loops. My understanding of the solution proposed here is that the caller would conservatively be marked nobuiltin when inlining nobuiltin callees. I think you would need that to do this safely? But I might not be understanding all the details here.

If someone just uses -fno-builtin on a random file, we probably don’t want to treat it as if they’re trying to LTO libc, I think? Not sure exactly what the condition would be to turn this mode on.

We need to make sure we aren’t making incorrect assumptions about the behavior of libc functions, and any “leaks” of semantics can cause trouble. For example, if we inline malloc anywhere, we have to treat malloc/calloc/strdup/free as nobuiltin across the entire module, so alias analysis doesn’t make incorrect assumptions about aliasing.

To ensure consistency along these lines, the transition from builtin to nobuiltin should happen for the whole module at once; we can’t selectively ignore nobuiltin-ness in specific places.
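To illustrate the kind of assumption at stake, here is a small sketch (the function name `store_and_reload` is hypothetical). While malloc is treated as a builtin, the optimizer may assume its result does not alias any other pointer and simplify the code accordingly. That assumption comes from the abstract libc interface, not from any particular malloc body.

```c
#include <stdlib.h>

/* Returns the value read back from *q, or -1 if allocation fails. */
int store_and_reload(int *q) {
  int *p = malloc(sizeof *p);
  if (p == NULL)
    return -1;
  *q = 1;
  *p = 2;     /* A builtin-aware optimizer may assume p does not alias q. */
  int r = *q; /* So this load may be folded to the constant 1. */
  free(p);
  return r;
}
```

Once the body of malloc is visible and partly inlined, the same reasoning has to be derived from the actual IR instead, which is why the switch to nobuiltin is argued to be module-wide.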

Do you mean ThinLTO? I don’t see any FDO related discussion here.

I meant ThinLTO sorry, I was concerned about the possible interactions between the IR pass and the split link of ThinLTO. I don’t think this would have implications for the profiling part but feel free to correct me as I’m not at all an expert in this field.

It seems to me that this would be the latter.

No indeed you’re right.

Yes indeed.

Would that make sense to use a specific llvm::Triple::EnvironmentType in the target triple to convey this?

I think the fundamental issue is we have 2 different, but similar sounding concepts. There probably should be separate mechanisms for functions with compiler recognized behavior, and those which the compiler is allowed to introduce calls to

Even if we split functions based on whether we’re allowed to introduce calls, that doesn’t really solve the issues we’re discussing here. You still need to distinguish “this call has abstract semantics based on the libc interface definition” vs. “this call calls a specific function defined in IR”. Once we start doing optimizations that see into libc function definitions, we can’t treat libc function definitions as having the abstract libc semantics. This is most obvious for functions like malloc, where alias analysis treats the return value in a special way.

What are the specific issues we need to avoid? My understanding was that the issue was the interfacing between libcall recognition and implementations of said libcalls living in the same module. So, if we inlined memcpy it would then get recognized as memcpy and get turned back into the call / intrinsic.

As far as I’m aware the initial solution was to simply prevent inlining of said libcalls so this never happened. The problem is that this is overly restrictive and prevents us from inlining every single function compiled with -ffreestanding if linked into a project not compiled with -ffreestanding.

My initial thought was that we could do this inline restriction only on some recognized libcalls, but I know the LLVM codebase is very liberal with libcall recognition. That is, every single target assumes it’s x64 as far as I’m aware.

Could anyone offer a TL;DR for why no_builtin is behaving as a function coloring problem that affects LLVM’s inliner?

For libc’s use cases, as long as the original symbol is not dropped, it seems safe to inline a no_builtin colored function into an uncolored one? For example, if LLVM decides that the memcpy loop pattern should be translated into a libcall, it looks fine to me as such symbol will still be provided by libc.

Could anyone offer a TL;DR for why no_builtin is behaving as a function coloring problem that affects LLVM’s inliner?

nobuiltin basically means “there is no C library/this is the C library”. So we don’t recognize the semantics of any existing call to a C library function, and don’t generate calls to any existing library function.

The issue is, once you start mixing no_builtin code with non-no_builtin code, it’s not clear what exactly you’re expecting to happen. If we inline from a no_builtin function to a non-no_builtin function, are we supposed to propagate the no_builtin attribute? Whether the C library exists isn’t really something that should be changing from function to function… so we just conservatively give up.


“memcpy” in particular is an edge case because of the interaction with backend code generation. (In addition to being a C library function, we also treat it as a compiler support routine, like compiler-rt builtins.) When you’re thinking about how things should work in general, better to use a different example, like “malloc”.

-ffreestanding more or less just implies that there is no C library, so if it’s inlined into something that does have a C library we could then assume that the inlined code has access to the C library as well. Realistically, the issue doesn’t seem to be the inlining, it’s the libcall recognition done once it’s inlined. Simply disabling all libcall transforms in a mixed -fbuiltin and -fno-builtin module during LTO would likely have a far less serious performance impact than preventing inlining I would wager.

Right.

This is what I suggested at [RFC] libc -ffreestanding / -fno-builtin - #8 by efriedma-quic .

Would it make sense to simply import the nobuiltin attribute when inlining? As far as I’m aware, the LLVM passes are required to respect this attribute when doing any sort of libcall related transformation. If we simply inherited this attribute when doing inlining it would then prevent the libcall recognition from transforming the inlined code back into the intrinsic call. I think this would have the desired semantics and allow for reasonable optimization in a whole-program LTO scenario. Is there anything wrong with this approach?