[RFC] Instrumented versions of libc++ for different sanitizers

The purpose of this RFC is to discuss shipping instrumented versions of libc++ with LLVM, for different sanitizers. The motivation for this change is to add ASan annotations to libc++, which require full libc++ instrumentation. It is already a good idea to use ASan instrumented library – and for example required by MSan – what currently may be handled at the build system level or by sanitizer*bootstrap build bots, which does that already.

Background

I want to add ASan annotations to libc++, which require full libc++ instrumentation to work (The change is std::basic_string annotations.The discussion about that started here - further in this post, I’m going to use observations made there by @ldionne and @vitalybuka, thank you!).

Therefore, it would be good to ship ASan libc++ with LLVM.

Additional effects

Results of that discussion may be also helpful in different situations like shipping a library with unstable ABI:

I think knowing how to do this would benefit us since we have similar needs for other efforts. For example, it would be conceivable to ship an unstable-ABI version of the library, and a hardened version as well, and doing that would follow the exact same steps.
ldionne

Also, while my main goal is libc++ with ASan, I want to include msan/hwasan/tsan in the patch as well.

What we want to discuss

  1. Multiple CMake invocations or building multiple libraries from a single invocation?

We need to build this new version of the library (do we do that with multiple CMake invocations or do we do like compiler-rt and allow building multiple libraries from a single invocation?)
ldionne

Single invocation would be nice, as it’s better scale on other sanitizers.
vitalybuka

  1. Inform vendors. Is there anything we can do to make it easier?

We need to ship this library as part of the LLVM release and vendors need to also start shipping it with their own releases if they want -fsanitize=address to work
ldionne

CC: @EricWF

  1. How should we handle other components in the LLVM stack that use libc++ and may need to be instrumented as well? Should we provide instrumented versions of them too?

Probably yes, otherwise it’s too easy to create a situation with a false positive while using container annotations (especially important for strings). It’s a known limitation of vector annotations – everything has to be instrumented.

  1. How should we name and locate the instrumented libraries? Should we use prefixes or suffixes (e.g., libc++asan.dylib or asan/libc++.dylib) or some other scheme?

  2. How should we modify the compiler driver to link the correct version of libc++ depending on the sanitizer flags?

  3. Everything else related to the topic.

All suggestions are welcome!

I have experience with building libc++ with different sanitizers, but I am not very familiar with the details of the libc++ build system implementation. I would appreciate any guidance or pointers on where to look and how to implement those changes. That may save me (or whoever is going to implement it, but probably me) a lot of time!

Thank you for your feedback and suggestions!

Summary

Shipping instrumented versions of libc++ for different sanitizers as part of the LLVM release and making the compiler driver automatically link them depending on the sanitizer flags. This would improve the usability and reliability of sanitizers with libc++ and allow ASan ABI breaking changes. That’s a preliminary implementation of this proposal, feedback and suggestions from the community are more than welcome.

CC: @philnik

1 Like

Thanks for looking into this!
I would really love if I could easily use sanitizers without having to recompile libc++.

I am not completely sure about ABI compatibility, though. Do all relevant sanitizers guarantee a stable ABI across clang releases? Could I link a libc++tsan.dylib compiled with clang 15 against my main executable compiled with clang 16?

If I am reading ⚙ D143675 Discussion: Darwin Sanitizers Stable ABI correctly, then ASan is only ABI-stable on Darwin but not on Linux and Windows. Is the plan to ship the annotated libc++ versions only on Darwin? Will we force matching versions between libc++ and clang on the other systems if I want to use sanitizers?

Furthermore, I could not find anything on the ABI stability of TSan, MSan and other sanitizers.

I would propose to use the same naming scheme as libclang_rt. E.g., on my machine I see the following variants of libclang_rt:

$ ls lib/clang/16/lib/darwin
libclang_rt.asan_osx_dynamic.dylib              libclang_rt.lsan_osx_dynamic.dylib              libclang_rt.tsan_osx_dynamic.dylib              libclang_rt.xray-fdr_osx.a
libclang_rt.cc_kext.a                           libclang_rt.osx.a                               libclang_rt.ubsan_minimal_osx.a                 libclang_rt.xray-profiling_osx.a
libclang_rt.fuzzer_interceptors_osx.a           libclang_rt.profile_osx.a                       libclang_rt.ubsan_minimal_osx_dynamic.dylib     libclang_rt.xray_osx.a
libclang_rt.fuzzer_no_main_osx.a                libclang_rt.stats_client_osx.a                  libclang_rt.ubsan_osx_dynamic.dylib             liborc_rt_osx.a
libclang_rt.fuzzer_osx.a                        libclang_rt.stats_osx_dynamic.dylib             libclang_rt.xray-basic_osx.a

The compiler driver apparently already picks the correct version of libclang_rt based on the provided compiler flags. Can this logic be reused?

Making it work only on Darwin is not my goal for sure!

If ASan ABI is not stable already, maybe forcing the same versions for sanitized binaries is not a bad choice? In that case, if I understand correctly, it’s already required just with an additional step of compiling everything manually. At the end, everything has to be compiled with ASan anyway. Ready libc++ makes it a little bit easier, and when Sanitizers ABI becomes stable, shipping sanitized libc++ is in place. I don’t see much of negatives here. @ldionne what’s your opinion about it?

@vitalybuka do you expect any unique issues for other sanitizers than ASan?

It makes sense, thank you!

I will look closer to it, if (re)using it is possible, it would be probably the right choice. @MaskRay do you have any suggestions? I’m not very familiar with the driver.

FWIW I wouldn’t put too much time in ABI stability. Most people don’t ship with any sanitizers, since they aren’t meant for that. I don’t see a problem with just providing static libraries and making it clear that the ABI of the sanitized libraries is not stable.

Furthermore, I could not find anything on the ABI stability of TSan, MSan and other sanitizers.

ABI is not changing very frequently, but I guess it’s good enough if we just require that instrumented libc++ ver. N, is compiled with clang ver. N, so sanitized program should also use the same compiler.
If mismatched versions works for you - good, if not, it’s unsupported anyway.

wouldn’t put too much time in ABI stability.

Agree. We can reconsider if we find common usecase.

Thank you @philnik @vitalybuka! If we don’t need to consider ABI stability, it makes it easier.

Just a small update. I’m working on the patch, hope to publish it next week, but I still have a problem with a single invocation build. I may create PR as is, after a little bit more of testing.

Modifying the driver is relatively easy but tedious. Every libc++ mention has to be changed into if/else tree and it has to be done in many places, for different systems separately. I’m unable to test on every environment, so after publishing the patch, I expect to get some feedback with bugs (hopefully only before merging).

Modifying build files is a little bit more tricky, also because some parts of sanitizer-related cmake code seems to do nothing (I will address it in a different PR). But mainly I struggle with a single invocation build. Let’s look on libc++ lib file.
In general, I went with libc++.asan.<extension> naming as suggested, and it’s easy to modify name based on sanitizer used (ASan, HWASan, MSan etc.) here. But I’m not sure where to create a loop creating libc++ for every sanitizer. It’s possible to just loop the whole file, but I managed to do it rewriting (and producing!) a little too much of code for my liking, I’m trying to reduce code size and its readability.

I’m happy to get some suggestion where this loop can be done easier. Otherwise, I will test it a little bit more and create PR, but maybe building libc++ for every sanitizer at the same time is not what we want to do, and require many invocations as it makes code easy.

PS
Until Friday, I may not be able to answer.

I am unsure whether clang driver needs a built-in support as every vendor may do things differently and implementing this libc++ selection isn’t that hard.

Our scheme is like this:

The compiler is at bin/clang. The MemorySanitizer-instrumented libc++ archive is at lib/$triple/llvm-msan.0/libc++.a.
We use clang instead of clang++ for linking, so the normal lib/$triple/libc++.a location is not used.
Bazel invokes clang ... lib/$triple/llvm-msan.0/libc++.a ... to link in the instrumented libc++ archive.

We use llvm-msan.0/. It is possible that a vendor wants to use different directories for other combination of ABI-significant options. A vendor may remove the llvm- prefix and the .0 suffix. In addition, a vendor may want to use msan-dbg/ for debugging.

FWIW I would really welcome a « standardized » way of doing this. Some vendors may decide to be different, but it would be great if we could at least have something that works out of the box by default. In the long term, I would also suggest that vendors who use different naming schemes today standardize on using the same (upstream) scheme since that just simplifies everything.

1 Like

FYI The PR has been posted: WIP: [libc++] Renaming instrumented versions of libraries by AdvenamTacet · Pull Request #72688 · llvm/llvm-project · GitHub

I second that.
In terms of ASAN the sanitized library will “just” provide enhanced but MSAN is essentially broken without the sanitization when using C++ so built-in support for selecting the correct library seems essential in that case.

See MemorySanitizer `use-of-uninitialized-value` with `std::istringstream::rdstate()` · Issue #57314 · llvm/llvm-project · GitHub about that.

I already raised the packaging issue with the apt.llvm.org maintainers: add instrumented `libc++` to `apt.llvm.org` packages · Issue #59193 · llvm/llvm-project · GitHub.

I’ve been looking at it for a long time and I didn’t come to one good solution, with which I would be happy, so I’m trying to push this discussion with a small commit to propose an important choice: the use of forced suffixes for instrumented libc++ and libc++abi files. Adopting this approach would simplify the implementation significantly (in the driver and in cmake files).

It is essential to maintain compatibility with non-instrumented libc++ and ensuring no changes at all when not instrumented library is created/referenced, is an aspect integral to the proposed approach.

I suggest mandating library file suffixes, resulting in linking “libc++.asan.so” with -fsanitize=address flag (suffixes would be added automatically).

I have at least one more suggestion, which requires a discussion, but to not dilute that discussion, I created the very minimal example necessary to demonstrate the change.

Check: WIP: [libc++] Renaming instrumented versions of libraries by AdvenamTacet · Pull Request #72688 · llvm/llvm-project · GitHub

I suggest going with suffixes, and implementation in which names are updated just before a target is created. It limits flexibility of naming libc++ (the suffix is forced), compared to just using it as a default value of LIBCXX_SHARED_OUTPUT_NAME and LIBCXX_STATIC_OUTPUT_NAME earlier. It should also help when compiling many versions of the library.
At the same time, I had a problem figuring out how to compile many versions of libc++ without a significant code bloat, however, in the PR I updated target names, so when someone figures out how to do it well with different values of LLVM_USE_SANITIZER, there will hopefully be no conflict of target names.
What do you think?

If you have any suggestions how to improve readability of the code, or how to loop over sanitizers, I will appreciate suggestions, as I am learning cmake while working on it.

@vitalybuka thank you for your code review, I will fix the naming!

This RFC is related to extension of ASan which breaks ABI and therefore it’s as essential for it as for MSan.
ABI breaking annotations: [ASan][libc++] std::basic_string annotations by AdvenamTacet · Pull Request #72677 · llvm/llvm-project · GitHub

Thank you for your comment @MaskRay, I was thinking how flexible can be a suggested solution.

I was thinking about adding the suffix at the same time as when the prefix is added, by modifying how -l works. That shouldn’t limit directories which may be used by vendors. But it can be problematic when linking other libraries - probably some kind of heuristic is necessary?

From what I can see, the current common solution of specifying default libc++ to be linked is:

CmdArgs.push_back("-lc++");

It’s possible to create a function doing exactly that, but when compiling with eg. ASan adding the suffix directly here. Then vendors can use the function, but have full flexibility.
That may result in problems of using homemade sanitized versions of the library.

Probably prioritizing libraries with the correct suffix by the driver, if those exist, is the best solution. But I will describe those choices on details, if we have consensus on unified suffixes and if we proceed with them.

Do you have any implementation answering your concerns in mind?

Using filename infixes like libc++.asan.a is inflexible and difficult to support cases mixing two or more sanitizers (e.g. UBSan and another sanitizer). In addition, forcing CmdArgs.push_back("-lc++"); to change would cause a lot of unnecessary changes.

If there is really a push to make different versions of sanitizers, I’d like to see directories, e.g. -Lpath/to/lib/x86_64-unknown-linux-gnu/asan (the driver will append -Lpath/to/lib/x86_64-unknown-linux-gnu -lc++ after user -L). This doesn’t even need any driver change. Everything can be done in the build system.

For example, clang++ --print-file-name=libc++.so derives the path after libc++.so is built. Then you can derive the asan/libc++.so path and build a file there.

The fact that driver change may overlap with Clang’s experimental multilib support makes me nervous. [RFC] Multilib

Every valid combination of sanitizers requires a unique name, but thankfully there is not a lot of them. And it’s not really different from the situation we have now. Just now, burden of compiling all versions and later linking correct ones is on a user.

Probably shipping all possible combinations is not viable and a subset has to be chosen.

That’s one possible solution and I don’t like it as well. Unfortunately, alternative I like only a little bit more.

That would require an automatic change of the directory based on sanitization flag (-fsanitize=…), right? Can it be achieved without driver modification?

@mplatings can multilib code be helpful here, somehow?

Let’s say the default libc++.so is in path/to/lib/x86_64-unknown-linux-gnu and you want to build asan-instrumented libc++.so and place it in path/to/lib/x86_64-unknown-linux-gnu/asan/.

You can specify -fsanitize=address -Lpath/to/lib/x86_64-unknown-linux-gnu/asan. It’s manual but it does affect driver -lc++.

Let’s say we also build asan/ubsan-instrumented libc++.so. The user can specify -fsanitize=address,undefined -Lpath/to/lib/x86_64-unknown-linux-gnu/ubasan.

For certain system libraries, you probably want to use -L anyway.

The fact that driver change may overlap with Clang’s experimental multilib support makes me nervous. [RFC] Multilib

This mechanism is currently BareMetal only. My idea is that it is generic and can be extended to Gnu.cpp. Users may build the sets of instrumented libc++.so (or libc++.a) they want and specify a configuration file, instead of relying on hard-coded driver logic (which I supposed you are trying to do; and I hope that we can hold off doing it).

I appreciate the clarification. Shipping instrumented libraries with LLVM and allowing users to select the appropriate ones is feasible, as manual choice of libraries for sanitizers is already established practice, however problematic.
In that scenario same names of files in different directories make sense.
At the same time, there are ongoing ABI breaking works related to hardening (CC: @varconst) and ASan, while use of MSan already proves to be challenging.

Shipping instrumented libc++ versions would simplify use of sanitizers and hardening, but it still necessitates user awareness of this requirement, otherwise could result in inconspicuous errors.

Permitting users to instrument binaries with sanitizers or ABI-breaking hardening without specifying the corresponding library version essentially makes them to introduce unintentional errors into their programs.

Saying that, I’m not sure that it isn’t the best course of action. I don’t see a good hard-coded logic, so forcing user to manually provide the path may be the best.

However, I still believe that we should hard-code path suffixes in build system to try to standardize them. Eg. normal/path/to/lib/asan as in your example.
And then ship problematic libraries with LLVM. What do you think about it?

I’m also curious about opinions of others.

Yes this is exactly what multilib is for.

The multilib system has existed for a long time with hard-coded paths and that’s what’s used by the Fuchsia toolchain: https://github.com/llvm/llvm-project/blob/1716c5b614c004cf4a890eeaa113d18c8f7cb1f7/clang/lib/Driver/ToolChains/Fuchsia.cpp#L281

If everyone can agree on a standard set of subdirectories then you could go the hard-coded route like Fuchsia. Alternatively you could add code to Gnu.cpp to optionally support multilib.yaml and let vendors choose which library variants to ship.

@petrhosek might have guidance here based on his experience with Fuchsia multilib.

Thank you for letting me know, then I think there is no point in me developing a competitive solution for Sanitizers.

So, we can just build those libraries, ship with LLVM and put the burden of specifying the correct directory on a user, until those are included in multilib (I don’t know specifics of how multilib works).

We can try to standardize paths. Eg. in libc++ directory, directory “instrumented” and in that directory many standardized names like asan.

@ldionne @philnik @vitalybuka @avogelsgesang what do you think about that?

LGTM, if we have tools for that already, we should use them.

In my day job as a developer at Salesforce: LGTM. This will make it simpler to use an annotated libcxx, and the setup effort is manageable given that we have tight control of our toolchain anyway.

In my role as hobbyist / for side projects: I would prefer if we could find a way that a clang which I install through sudo apt-get install clang would just work out-of-the-box with ASAN also for libcxx. Afaict, the current proposal wouldn’t reach that goal.

In summary, I think this would still be a step in the right direction, but it would be amazing if we can also figure out the multilib aspect such that this is simpler to use in smaller projects