Code sharing between compiler-rt and clang/llvm

What is the best way to share code between compiler-rt and clang/llvm (or the rest of llvm-project)? Is there a policy that there should be no dependencies introduced between the two? If so, what is the rationale for that?

There are several cases where ABIs need to remain in sync between the two and would benefit from code sharing. For example, compiler-rt/lib/builtins/cpu_model/aarch64.h and llvm/include/llvm/TargetParser/AArch64TargetParser.h both contain separate definitions of enum CPUFeatures which must be kept in sync. This is obviously tedious and error prone, and it would be good if they could e.g. share a header.

The only case I know of is InstrProfData.inc, which I believe is duplicated in compiler-rt and llvm.

There was an attempt to remove one of the copies in D86890 ([compiler-rt] Remove copy of InstrProfData.inc), but there seemed to be some pushback from people building compiler-rt out of tree. There is supposedly a test that checks that the two files are identical.

This would have been much more common pre-monorepo. Some projects extract LLVM components; for example, the L4Re runtime environment has its own copy of compiler-rt (but not llvm), which is locally patched and used when building L4Re with clang. See kernkonzept/l4re-core on GitHub (the core components of the L4Re operating system).

Header file dependencies on llvm could probably be worked around by people hacking apart the monorepo, but I don’t know how common that practice is or how much disruption it would cause.

There is some use of LLVM header files in compiler-rt/lib/sanitizer_common/symbolizer/sanitizer_symbolize.cpp (on main in llvm/llvm-project), and some in the xray tests.


I hadn’t noticed that it already uses some header files. They were added way back in 2016 and added to more recently.

@vitalybuka @hjyamauchi @davidxl do you have any insights on this? Have things changed since 2020?

There are a few other shared header files that I know of:
compiler-rt/include/profile/{MemProfData,MIBEntryDef}.inc

I’m not really sure why we can’t do something like this in llvm/include/llvm/ProfileData/InstrProfData.inc:

#include "../../../../compiler-rt/include/profile/InstrProfData.inc"

Libraries built from compiler-rt code should have no dependency on LLVM/Clang. This is because applications that link against these libraries (sanitizer runtime, xray runtime, PGO runtime, etc) generally do not want a fixed dependency on LLVM/Clang.

This is a special case. The sanitizer internal symbolizer (sanitizer_symbolize.cpp) uses LTO and internalization (opt -passes=internalize) to build a symbolizer.o that does not export LLVM symbols (even though it uses LLVM internally). symbolizer.o is injected into the sanitizer runtime libclang_rt*san*.a. This is a very special build mode, and other compiler-rt libraries don’t do this.


In general, some code duplication cannot be avoided. The best approach is to add comments so that people who update one file know which other file to update.

It also seems to be reasonable practice to add a test to ensure things stay the same too? Comments are great, but I know I often skip reading the boilerplate at the top of a file, especially when jumping around to different files through code search.

I’m not too familiar with what builds are supported, but it seems like the compiler-rt builds should have access to the LLVM source:

so adding a lit test to ensure that the files stay in sync seems pretty reasonable. I thought it was done in more cases, but it only seems to have been done in the context profile cases. I’ll look into seeing if they can be added.
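For duplicated enum definitions, as opposed to whole files, a compile-time check is another way to catch drift. The following is only a minimal sketch and is not what the existing tests do (as far as this thread says, those check that files are identical). The namespaces rt_copy and llvm_copy, the FEAT_EXAMPLE_* enumerators, and the CHECK_SAME_VALUE macro are all hypothetical stand-ins.

```cpp
// Hypothetical stand-ins for the two duplicated definitions. In a real test
// these would come from including the compiler-rt and LLVM headers.
namespace rt_copy {
enum CPUFeatures { FEAT_EXAMPLE_A, FEAT_EXAMPLE_B, FEAT_EXAMPLE_MAX };
}
namespace llvm_copy {
enum CPUFeatures { FEAT_EXAMPLE_A, FEAT_EXAMPLE_B, FEAT_EXAMPLE_MAX };
}

// Fails the build, naming the enumerator, if the two copies disagree.
#define CHECK_SAME_VALUE(name)                                 \
  static_assert(static_cast<int>(rt_copy::name) ==             \
                    static_cast<int>(llvm_copy::name),         \
                #name " differs between the two copies")

CHECK_SAME_VALUE(FEAT_EXAMPLE_A);
CHECK_SAME_VALUE(FEAT_EXAMPLE_B);
// Comparing the trailing sentinel also catches an enumerator that was
// added to only one of the two copies.
CHECK_SAME_VALUE(FEAT_EXAMPLE_MAX);
```

A check like this needs both headers visible at once, so it would only fit builds where the LLVM source is available next to compiler-rt.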

There are also licensing considerations with compiler-rt depending on LLVM. The legacy LLVM license (which is still the only license covering some code) and the licenses of some code included in LLVM require attribution even when distributed in binary form. Compiler-RT cannot include code that has an attribution requirement.

I’m interpreting this to mean the library should not have a runtime/linktime dependence on llvm/clang. But I don’t see a reason why source level dependencies shouldn’t exist, e.g. stuff you would typically find in headers, like enums, constexprs, etc. For whole functions/templates, I can see there is a risk of accidentally re-exporting some llvm/clang dependencies. Can we add source level dependencies?

Relying on comments to keep things in sync is less than ideal. Testing that copy-pasted code remains identical is better but still feels hacky. Is it possible to actually test the desired behaviour, i.e. that the produced library does not have unexpected undefined symbols?

It would be good if we could have this, and the library dependency restrictions, documented somewhere. Maybe compiler-rt/Readme.txt?

Source level dependencies could exist, just like a C library could be written in C++. Contributors must then take care to use only struct/enum definitions that do not compile to code/data that could lead to conflicts.

An arbitrary #include might lead to undefined symbols. A robust approach is to extract the shared part to a .inc file with no llvm includes, as llvm/llvm-project Pull Request #97777 ([LLVM][compiler-rt][AArch64] Refactor AArch64 CPU features, by boomanaiden154) does.
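To illustrate the general shape of such a shared list, here is a minimal sketch. It is not the code from that pull request: the SHARED_CPU_FEATURES macro, the FEAT_EXAMPLE_* names, and the two namespaces are hypothetical, and the list is inlined as a macro only so the example is self-contained. In a real tree the list would live in its own .inc file that both sides include.

```cpp
// Hypothetical stand-in for the contents of a shared .inc file. It contains
// only a list of names and no LLVM includes, so including it cannot pull in
// LLVM symbols.
#define SHARED_CPU_FEATURES(ENTRY) \
  ENTRY(FEAT_EXAMPLE_A)            \
  ENTRY(FEAT_EXAMPLE_B)            \
  ENTRY(FEAT_EXAMPLE_C)

// Runtime-side view, e.g. what a compiler-rt header might build from the list.
namespace runtime_side {
enum CPUFeatures {
#define ENUM_ENTRY(name) name,
  SHARED_CPU_FEATURES(ENUM_ENTRY)
#undef ENUM_ENTRY
  FEAT_EXAMPLE_MAX
};
} // namespace runtime_side

// Compiler-side view built from the same list, so the values cannot drift.
namespace compiler_side {
enum CPUFeatures {
#define ENUM_ENTRY(name) name,
  SHARED_CPU_FEATURES(ENUM_ENTRY)
#undef ENUM_ENTRY
  FEAT_EXAMPLE_MAX
};
} // namespace compiler_side

static_assert(static_cast<int>(runtime_side::FEAT_EXAMPLE_C) ==
                  static_cast<int>(compiler_side::FEAT_EXAMPLE_C),
              "both views are generated from one list");
```

Because the shared part is only enumerator names, it compiles to no code or data on either side.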


However, stand-alone builds, used by some distributions, probably do not like true header sharing. For example, the following needs to work. Perhaps a new CMake variable is needed to customize the relevant llvm include path.

cmake -GNinja -Scompiler-rt -B/tmp/out/rt-aarch64 \
  -DCMAKE_CROSSCOMPILING=on \
  -DCMAKE_C_COMPILER=/tmp/Rel/bin/clang \
  -DCMAKE_CXX_COMPILER=/tmp/Rel/bin/clang++ \
  -DCMAKE_{ASM,C,CXX}_COMPILER_TARGET=aarch64-unknown-linux-gnu \
  -DCMAKE_{C,CXX}_FLAGS='-fPIC -D_GNU_SOURCE' \
  -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=lld \
  -DLLVM_CMAKE_DIR=/tmp/Rel \
  -DLLVM_APPEND_VC_REV=OFF \
  -DLLVM_ENABLE_PER_TARGET_RUNTIME_DIR=on \
  -DCOMPILER_RT_DEFAULT_TARGET_ONLY=on \
  -DCOMPILER_RT_EMULATOR='qemu-aarch64-static -L /usr/aarch64-linux-gnu' \
  -DCOMPILER_RT_HAS_LLD=on \
  -DCOMPILER_RT_TEST_USE_LLD=on \
  -DCOMPILER_RT_INCLUDE_TESTS=on \
  -DCOMPILER_RT_TEST_COMPILER_CFLAGS=--target=aarch64-unknown-linux-gnu

ninja -C /tmp/out/rt-aarch64
ln -s /tmp/out/rt-aarch64/lib/aarch64-unknown-linux-gnu -t /tmp/Rel/lib/clang/16/lib

In GCC, libsanitizer is based on compiler-rt’s sanitizer libraries. It only uses one file in builtins, so extra LLVM includes in builtins should not be a problem.

% ls libsanitizer/builtins/
assembly.h

We do have the legacy license documented here: LLVM Developer Policy — LLVM 19.0.0git documentation

That section explains why the runtimes were placed under a license without an attribution requirement.

While the project has dropped the requirement that all new code be under both the legacy and new license, we have not made any determination that all code in LLVM is covered under the new license (see: LLVM Developer Policy — LLVM 19.0.0git documentation). That means we still need to be careful about code sharing across parts of the project which had different licenses under the legacy license.

I’m not convinced that building compiler-rt from source without the rest of the llvm source next to it is worth the dev cost. Removing that capability would break some builds, but given that llvm ships in one repo anyway, the fix for those builds is to stop deleting the llvm subdir after cloning. Seems reasonable to me.

compiler-rt builtins exporting symbols from llvm would be bad. That archive gets linked into everything and some things already link against llvm and other things do not want to link against llvm.

I’d suggest putting shared headers somewhere in compiler-rt and #including them from llvm as the reasonable choice. Creating another subproject that they both depend on is arguably better, but it is mostly rearranging deckchairs. With headers in compiler-rt, failing to pay attention to symbols adds pieces of compiler-rt to llvm, which is less likely to cause problems than adding pieces of llvm into compiler-rt.

That also means the standalone build of compiler-rt continues to work. I’m doubtful that anyone deletes the compiler-rt subdir before building llvm.

Lots of users depend on being able to build compiler-rt separately from LLVM. That’s an extremely important workflow for many organizations that distribute toolchains, and it is an important feature for supporting the sanitizer runtimes with non-LLVM-based compilers (MSVC, GCC, etc).

These two things are not equivalent. I don’t think the need to be able to build compiler-rt without building the rest of LLVM is being challenged.

For these users, is it important that they be able to build without the rest of the LLVM source present? If so, why? Do they check out the whole repo then delete most of it? Do they download a tarball of compiler-rt only? Also just for my own understanding, who are these users?

Building compiler-rt without simultaneously building llvm has lots of uses. Cloning llvm and then deleting some subdirs and then building compiler-rt is less compelling. However I note my recommendation for llvm including headers located within compiler-rt would not break that use case.

Building llvm after first deleting the compiler-rt subdir is even less plausible as a genuine requirement, but even then, those few could adjust their delete-compiler-rt to leave a couple of files behind.