[RFC] Fix loose behaviors of Clang --target=

Without --target=, Clang should be able to pick up the system include and library paths. This shall never change.

Clang has some loose behaviors on how --target= selects multiarch include and library paths.

For the example builds below, run ninja clang cxx cxxabi asan to build needed targets.

--target= and libc++/compiler-rt multiarch paths

For a -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_ENABLE_PER_TARGET_RUNTIME_DIR=on -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-unknown-linux-gnu build, the libc++ and compiler-rt multiarch directories have x86_64-unknown-linux-gnu in their paths. However, the mismatching --target=x86_64-linux-gnu still picks up these multiarch directories:

% /tmp/out/custom0/bin/clang++ --target=x86_64-linux-gnu --stdlib=libc++ -fsanitize=address a.cc -o a -v |& sed -E 's/ "?-[iIL]/\n&/g'
...
clang -cc1 version 15.0.0 based upon LLVM 15.0.0git default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include"
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /tmp/out/custom0/bin/../include/x86_64-unknown-linux-gnu/c++/v1
 /tmp/out/custom0/bin/../include/c++/v1
 /tmp/out/custom0/lib/clang/15.0.0/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
 "/usr/bin/x86_64-linux-gnu-ld" --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a /lib/x86_64-linux-gnu/crt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbegin.o
 -L/tmp/out/custom0/bin/../lib/x86_64-unknown-linux-gnu
 -L/usr/lib/gcc/x86_64-linux-gnu/11
 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib64
 -L/lib/x86_64-linux-gnu
 -L/lib/../lib64
 -L/usr/lib/x86_64-linux-gnu
 -L/usr/lib/../lib64
 -L/tmp/out/custom0/bin/../lib
 -L/lib
 -L/usr/lib ...

Proposal: drop the loose behavior
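
To make the strict behavior concrete, here is a minimal sketch. It is not the actual driver code. The helper name runtimeDirs is hypothetical, and the directory layout is simply copied from the -v output above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not a Clang API): builds the per-target runtime
// directories under an install prefix, using the --target= spelling
// verbatim, with no normalization and no alias list.
std::vector<std::string> runtimeDirs(const std::string &prefix,
                                     const std::string &target) {
  assert(!target.empty() && "a target triple is required");
  return {
      prefix + "/include/" + target + "/c++/v1", // libc++ per-target headers
      prefix + "/lib/" + target,                 // libc++/compiler-rt libraries
  };
}
```

Under this model, --target=x86_64-linux-gnu would look for lib/x86_64-linux-gnu, which the build above does not contain, so the mismatch would surface instead of silently resolving to the x86_64-unknown-linux-gnu directory.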

--target= and system (glibc, GCC) multiarch paths

On Debian, the system (glibc, GCC) include and library paths look like /usr/include/x86_64-linux-gnu/c++/11/ and /usr/lib/x86_64-linux-gnu:

% g++ -dumpmachine
x86_64-linux-gnu

Note that the ‘vendor’ part is omitted. This is due to the Debian multiarch scheme: https://wiki.debian.org/Multiarch/Tuples.

Unfortunately, our CMake build system defaults LLVM_DEFAULT_TARGET_TRIPLE to the normalized x86_64-unknown-linux-gnu.
(Modern config.guess uses x86_64-pc-linux-gnu: ~/Dev/gcc/config.guess => x86_64-pc-linux-gnu)

However, a mismatching --target=x86_64-unknown-linux-gnu may still pick up the system x86_64-linux-gnu paths.
I consider this unfortunate, and it would be better if we could drop this behavior.
This mismatch works because X86_64Triples in clang/lib/Driver/ToolChains/Gnu.cpp encodes:

  static const char *const X86_64Triples[] = {
      "x86_64-linux-gnu",       "x86_64-unknown-linux-gnu",
      "x86_64-pc-linux-gnu",    "x86_64-redhat-linux6E",
      "x86_64-redhat-linux",    "x86_64-suse-linux",
      "x86_64-manbo-linux-gnu", "x86_64-linux-gnu",
      "x86_64-slackware-linux", "x86_64-unknown-linux",
      "x86_64-amazon-linux"};

These triple lists are discouraged. Please never add new values to them.
For a Linux x86_64 triple, Clang detects GCC installations in paths constructed from these triples.
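
A rough sketch of how such an alias-based search behaves is below. This is a toy model and not the Gnu.cpp implementation: the function name, the probing order, and the single /lib/gcc/ layout are assumptions, and the alias list is only a subset of the array quoted above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the alias-based search; not the actual Gnu.cpp code.
// The alias list is a subset of the X86_64Triples array quoted above.
static const char *const kX86_64Aliases[] = {
    "x86_64-linux-gnu", "x86_64-unknown-linux-gnu", "x86_64-pc-linux-gnu",
    "x86_64-redhat-linux"};

// Returns every <prefix>/lib/gcc/<triple> directory that would be probed.
// Assumption: the user-provided target is tried first, then every alias.
std::vector<std::string> gccCandidateDirs(const std::string &prefix,
                                          const std::string &target) {
  assert(!prefix.empty());
  std::vector<std::string> dirs{prefix + "/lib/gcc/" + target};
  for (const char *alias : kX86_64Aliases) {
    std::string dir = prefix + "/lib/gcc/" + alias;
    if (dir != dirs.front()) // skip the alias equal to the target itself
      dirs.push_back(dir);
  }
  return dirs;
}
```

Because the aliases are probed regardless of what the user passed, a call such as gccCandidateDirs("/usr", "x86_64-linux-musl") still includes /usr/lib/gcc/x86_64-linux-gnu in this model.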

% /tmp/out/custom0/bin/clang++ --target=x86_64-unknown-linux-gnu --stdlib=libstdc++ a.cc -o a -v |& sed -E 's/ "?-[iIL]/\n&/g'
...
clang -cc1 version 15.0.0 based upon LLVM 15.0.0git default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include"
ignoring nonexistent directory "/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11
 /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11
 /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward
 /tmp/out/custom0/lib/clang/15.0.0/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
 "/usr/local/bin/ld" --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a /lib/x86_64-linux-gnu/crt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbegin.o
 -L/tmp/out/custom0/bin/../lib/x86_64-unknown-linux-gnu
 -L/usr/lib/gcc/x86_64-linux-gnu/11
 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib64
 -L/lib/x86_64-linux-gnu
 -L/lib/../lib64
 -L/usr/lib/x86_64-linux-gnu
 -L/usr/lib/../lib64
 -L/tmp/out/custom0/bin/../lib
 -L/lib
 -L/usr/lib ...

As noted above, an arbitrary Linux x86_64 triple can pick up the system x86_64-linux-gnu paths. This means:

% /tmp/out/custom0/bin/clang++ --target=x86_64-linux --stdlib=libstdc++ a.cc -o a -v |& sed -E 's/ "?-[iIL]/\n&/g'
same output
% /tmp/out/custom0/bin/clang++ --target=x86_64-linux-musl --stdlib=libstdc++ a.cc -o a -v |& sed -E 's/ "?-[iIL]/\n&/g'
same output

Proposal: even if we decide to prolong the support for --target=x86_64-unknown-linux-gnu on Debian (see below), the loose behavior related to x86_64-linux/x86_64-linux-musl should be removed.

Long-term proposal: For Debian and its derivatives, --target=x86_64-unknown-linux-gnu should not pick system x86_64-linux-gnu.
Clang installations should use -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-linux-gnu or --target=x86_64-linux-gnu to pick up system x86_64-linux-gnu.

The long-term proposal requires fixing a few things first. x86_64-redhat-linux may be problematic as well: we may need to make x86_64-redhat-linux-gnu behave similarly to x86_64-redhat-linux.
See D111367 ([Driver] Change -dumpmachine to respect --target/LLVM_DEFAULT_TARGET_TRIPLE verbatim).

Prebuilt Clang works on nearly every Linux distro without --target=

Not sure if anyone does this.
Such a prebuilt Clang definitely does not work on a Linux distro with a triple not listed in clang/lib/Driver/ToolChains/Gnu.cpp:X86_64Triples.
I think such a distributor should specify the correct --target= for their prebuilt Clang, possibly via a wrapper.
It’s not the Clang driver’s job to enumerate the endless different triples.

Such distributors may ship libc++/compiler-rt multiarch directories beside Clang.
This is strongly discouraged since a prebuilt library may not necessarily run on a random different distribution.
If the distributor wants to hack up things, they can rename the multiarch directories to match the system, then use the correct --target=.
Again, it’s not the Clang driver’s job to enumerate the endless different triples.

I agree with the general idea of this proposal. The way we process the targets and search for gcc installs is really complicated and leads to confusing errors. I think the fundamental issue we have now is that the target is used for 2 separate purposes. The first is to instruct clang how to generate code, and the second is to select the runtime libraries (i.e. gcc install) for the compiler to use.

Do you think it would make sense to introduce a new flag, like --gcc-target or --runtime-target, that would be used only for searching the system for runtime libraries? So for the Debian example, you could have
-DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-unknown-linux-gnu and -DLLVM_DEFAULT_RUNTIME_LIB_TARGET_TRIPLE=x86_64-linux-gnu. This would keep things working as they are now while still allowing us to remove the hard-coded target lists that the driver uses to search for runtime libraries.

I think if we can decouple the target from the library searching that will make it much easier to make progress on this.
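
As a sketch of what that decoupling could look like inside the driver (purely illustrative: the struct, its fields, and the function are hypothetical, and no --runtime-target option exists in Clang today):

```cpp
#include <cassert>
#include <string>

// Hypothetical driver state; neither the struct nor a --runtime-target
// option exists in Clang. Names are for illustration only.
struct DriverTriples {
  std::string target;        // code generation, e.g. from --target=
  std::string runtimeTarget; // library search; empty means "not set"
};

// Triple used to search for GCC installs and runtime libraries.
// Falls back to the code-generation target when no override is given,
// which would keep today's behavior as the default.
const std::string &librarySearchTriple(const DriverTriples &t) {
  assert(!t.target.empty());
  return t.runtimeTarget.empty() ? t.target : t.runtimeTarget;
}
```

With target set to x86_64-unknown-linux-gnu and runtimeTarget set to x86_64-linux-gnu, code generation and library lookup would each see the spelling they expect, and no alias list would be needed.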

That sounds good to me. We might want to use the “configuration name” terminology from autoconf somehow.

I think the fundamental issue we have now is that the target is used for 2 separate purposes. The first is to instruct clang how to generate code, and the second is to select the runtime libraries (i.e. gcc install) for the compiler to use.

Yes. I’d add that normalization is part of the fundamental issue.

The two-option direction is a better fit if we decide that we will always normalize the triple in user-facing output (e.g. -dumpmachine, -v). In that case, there will always be an x86_64-unknown-linux-gnu vs (Debian multiarch) x86_64-linux-gnu mismatch.

I have thought about introducing a new option, but haven’t convinced myself that would be clearer.

My current thought is that for Debian we should transition to this behavior (D110663 [Driver] Support Debian multiarch style lib/clang/14.0.0/x86_64-linux-gnu runtime path and include/x86_64-linux-gnu/c++/v1 libc++ path; D111367 [Driver] Change -dumpmachine to respect --target/LLVM_DEFAULT_TARGET_TRIPLE verbatim):

% g++ -dumpmachine
x86_64-linux-gnu
% future-clang++ -dumpmachine
x86_64-linux-gnu

(Note: code generation will always use the normalized target triple, but that is internal usage and unrelated to library and include path selection. None of our driver changes needs to touch that.)

Thanks for the awesome information.

I agree about cleaning up --target=x86_64-unknown-linux-gnu behavior on Debian, but I disagree about handling of --target= for Clang runtime libraries. These are two separate issues and while the first one is specific only to a particular Linux distribution where the goal is compatibility with existing installations, the second affects every target, most of which don’t have the same constraints.

From my perspective, the organization of Clang runtime libraries, that is the content of directories like include and lib, is an internal detail and should use normalized triples the same way that code generation does for consistency across different Clang distributions.

If we want to avoid introducing an entirely new flag, we can consider changing -triple to be also recognized by the driver and use it for constructing paths to Clang runtime libraries.

From my perspective, the organization of Clang runtime libraries, that is the content of directories like include and lib, is an internal detail and should use normalized triples the same way that code generation does for consistency across different Clang distributions.

The runtime file hierarchy does not need to match the normalized IR target triple.

Let’s consider the three involved target triples:

  • IR target triple: currently normalized due to normalized Clang cc1 -triple. If we change cc1 -triple to be unnormalized, IR target triple will be unnormalized
  • system library search paths: normalized with special hacks to behave like unnormalized for Debian multiarch
  • runtime library search paths: normalized

AIUI we all agree that system library search paths using unnormalized triples will simplify things. There is another advantage: Clang doesn’t need to invent its own triples when there is a convention. E.g. for x86_64, x86_64-pc-linux-gnu is conventional nowadays, while x86_64-unknown-linux-gnu is more or less Clang’s own idea (it was the legacy GCC config.guess spelling, but GCC abandoned that at least 10 years ago, I think).

Let’s see whether runtime library search paths should use the normalized or unnormalized triple.
I favor the unnormalized one, because this allows us to have one triple instead of two in the driver in the long term. Having two triples for system libraries and runtime libraries adds complexity IMO.

The runtime files are more closely related to system libraries (Linux: libc.so, libpthread.so, libgcc_s.so.1, etc). When system libraries use x86_64-linux-gnu, using x86_64-linux-gnu for runtime paths simplifies the work the Clang driver needs to do.

If useful, we could change -triple to be unnormalized as well if we want additional consistency among IR target triple/system library search paths/runtime library search paths. Code generators use a normalized target triple to guide some decisions, but that process can be internal, rather than serialized in the IR target triple string.

Another aspect that I didn’t see mentioned here so far, is that the issues don’t only lie in the vendor/environment parts of the triples - the issue also is present in the architecture field. E.g. for x86_32, you may be doing code generation with an i686 arch name, but the triple used for system libraries might be i386-linux-gnu. The same thing goes on ARM; you probably don’t want to do code generation with an arm-linux-gnueabi triple, you most probably want to specify armv[5-7]-*. The same thing goes especially for arm-linux-gnueabihf, where the baseline configuration in Debian implies armv7, but the multiarch triple is plain arm-*.
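
To illustrate the architecture mismatch, here is a small sketch of the mapping one would need from a code-generation arch name to the arch field of the Debian multiarch tuple. The function name is hypothetical and only the cases mentioned above are covered. It is not taken from the Clang driver.

```cpp
#include <cassert>
#include <string>

// Illustrative mapping from a code-generation arch name to the arch field
// of the Debian multiarch tuple. Only the cases discussed above are
// handled; everything else is passed through unchanged.
std::string multiarchArch(const std::string &arch) {
  assert(!arch.empty());
  // i386, i486, i586 and i686 all share the i386-linux-gnu tuple.
  if (arch.size() == 4 && arch[0] == 'i' && arch[1] >= '3' &&
      arch[1] <= '6' && arch.compare(2, 2, "86") == 0)
    return "i386";
  // armv5, armv7, ... use plain "arm" in arm-linux-gnueabi[hf].
  if (arch.compare(0, 4, "armv") == 0)
    return "arm";
  return arch;
}
```

So even a fully verbatim treatment of vendor and environment would not make i686-linux-gnu match i386-linux-gnu. The arch field needs its own handling.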

The same things do crop up in the mingw targets as well, but there, the driver has some amount of (arguably ugly, but it predates me) hardcoded logic to try to locate sysroots based both on the literal triple and <arch>-w64-mingw32, using only the arch name from the user-provided triple.
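
In spirit, that mingw lookup amounts to something like the following sketch. The function name is made up and the real driver logic has more cases. This only shows the "literal triple, then <arch>-w64-mingw32" idea.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified illustration, not the real MinGW toolchain code: returns the
// sysroot directory names to try for a user-provided triple.
std::vector<std::string> mingwSysrootNames(const std::string &triple) {
  assert(!triple.empty());
  // Only the arch name is taken from the user-provided triple.
  std::string arch = triple.substr(0, triple.find('-'));
  std::vector<std::string> names{triple}; // literal spelling first
  std::string canonical = arch + "-w64-mingw32";
  if (canonical != triple)
    names.push_back(canonical);
  return names;
}
```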

Thirdly, I recently fixed another slightly related issue about deducing the right spelling for the arch name for the mingw target. When running a toolchain that defaults to x86_64, but invoking Clang with -m32, it would switch the target triple to i386-w64-mingw32 and fail to find the right system libraries (as the correct target triple for the system libraries was i686-w64-mingw32). I fixed this for the mingw target in D111952 ([clang] [MinGW] Guess the right ix86 arch name spelling as sysroot).

Clang installations should use -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-linux-gnu or --target=x86_64-linux-gnu to pick up system x86_64-linux-gnu.

There’s actually another slightly surprising aspect to this, unrelated to Clang searching for libraries/headers. Currently, using a non-normalized triple as LLVM_DEFAULT_TARGET_TRIPLE breaks code generation in some tools, e.g. llc. See e.g. llc.cpp in llvm/llvm-project at commit 1db4dbba24dd36bd5a91ed58bd9d92dce2060c9f:

      if (!TargetTriple.empty())
        IRTargetTriple = Triple::normalize(TargetTriple);
      TheTriple = Triple(IRTargetTriple);
      if (TheTriple.getTriple().empty())
        TheTriple.setTriple(sys::getDefaultTargetTriple());

Here, if a user specifies a triple with a command line option to llc, it will normalize that triple, and any usual form of writing a triple works out fine. But if you don’t specify a triple manually but let it use the built-in default one, it will use that one as-is without normalization, which easily leads to e.g. interpreting the OS part as the vendor and the environment part as OS.
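
The effect can be shown with a toy model. Both functions below are illustrative stand-ins: splitTriple mimics reading the components purely by position, and toyNormalize handles only the single "vendor omitted before linux" case, nowhere near what Triple::normalize really does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits a triple on '-', i.e. reads the components purely by position.
std::vector<std::string> splitTriple(const std::string &triple) {
  assert(!triple.empty());
  std::vector<std::string> parts;
  std::string::size_type start = 0, dash;
  while ((dash = triple.find('-', start)) != std::string::npos) {
    parts.push_back(triple.substr(start, dash - start));
    start = dash + 1;
  }
  parts.push_back(triple.substr(start));
  return parts;
}

// Toy stand-in for Triple::normalize: if the second component is "linux",
// the vendor was omitted, so insert "unknown" in the vendor slot.
std::string toyNormalize(const std::string &triple) {
  std::vector<std::string> p = splitTriple(triple);
  if (p.size() >= 2 && p[1] == "linux")
    p.insert(p.begin() + 1, std::string("unknown"));
  std::string out = p[0];
  for (std::size_t i = 1; i < p.size(); ++i)
    out += "-" + p[i];
  return out;
}
```

Read positionally, x86_64-linux-gnu has "linux" in the vendor slot and "gnu" in the OS slot. After the toy normalization it becomes x86_64-unknown-linux-gnu and the slots line up again.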

So if we agree that LLVM_DEFAULT_TARGET_TRIPLE can be set to a non-normalized triple, we’d need to fix such cases in tools to normalize it before it is used. (I guess we can’t normalize it within sys::getDefaultTargetTriple() because then we’d lose the ability to specify the literal form we expect to use for e.g. the distro’s system libraries.)