Flang-aarch64-dylib buildbot: need help understanding a regression in clang-tblgen

Hi all,

I need help understanding a regression that only affects the flang-aarch64-dylib build bot and that I wasn’t able to reproduce locally.

The commit in question (which I have reverted for now) is here and is part of the effort to replace ManagedStatic. The failed build is here.

As you can see, clang-tblgen fails with

: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamEi+0x34)[0x595088]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm3sys17RunSignalHandlersEv+0x38)[0x593100]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen[0x5957d8]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffa407e5c0]
/lib/aarch64-linux-gnu/libc.so.6(gsignal+0xe0)[0xffff9fe37d78]
/lib/aarch64-linux-gnu/libc.so.6(abort+0x114)[0xffff9fe24aac]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm18report_fatal_errorERKNS_5TwineEb+0x1e4)[0x538808]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm18report_fatal_errorERKNS_5TwineEb+0x0)[0x538624]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen[0x52b958]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm2cl6Option11addArgumentEv+0x50)[0x51cd48]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm23initDebugCounterOptionsEv+0xb4)[0x533770]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen[0x524754]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen(_ZN4llvm2cl23ParseCommandLineOptionsEiPKPKcNS_9StringRefEPNS_11raw_ostreamES2_b+0x44)[0x521e74]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen[0x5077d0]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe8)[0xffff9fe24e10]
/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/bin/clang-tblgen[0x42d788]

… which shouldn’t be possible because the CommandLine option in question is owned by a static variable, so I’m pretty much at a loss as to what’s going on.

One thing that might help is if the buildbot owner (who?) could provide more details on the bot’s configuration and perhaps even the produced binaries to analyze. Or perhaps somebody else has a brainwave or can reproduce this.

Hi @nhaehnle, sorry that you are hitting this. Indeed, quite confusing!

I’m not the owner of this bot, but I do know that the following flags make it “unique” (at least as far as Flang buildbots are concerned):

-DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=ON

Have you tried using these?

From my experience, this sort of issue is hit when using e.g. LLVMSupport as a library (in CMake’s target_link_libraries) rather than as a component in the CMake rules. But I don’t see LLVMSupport used much in Clang (apart from here, which doesn’t seem relevant).

HTH,
Andrzej

We (Linaro) run this bot. Looking at the failure, I don’t think this is an AArch64-specific issue, unless it is some linker ordering change that happens due to OS differences. If you can’t reproduce it with the same cmake config, we can help debug on the bot machine.

The first failure for your reland appears to be Buildbot. The full cmake command for the build is:

cmake ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=AArch64 -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=ON -DCMAKE_CXX_STANDARD=17 '-DLLVM_ENABLE_PROJECTS=llvm;flang;clang;mlir' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

So Andrzej is right about the flags that are key.

Hi @DavidSpickett, thank you for your response and sorry for taking so long. I finally got a chance to dig deeper into this.

Let me spare you the details, but one suspicious thing I noticed is that on my system with the given cmake options, clang-tblgen is linked against both static libLLVMSupport.a and libLLVM-16git.so:

[1/1] : && /usr/bin/c++  -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -fno-strict-aliasing -O3 -DNDEBUG  -Wl,-rpath-link,/home/nha/amd/src/llvm-project/build-rel/./lib  -Wl,--gc-sections tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ASTTableGen.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangASTNodesEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangASTPropertiesEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangAttrEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangCommentCommandInfoEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangCommentHTMLNamedCharacterReferenceEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangCommentHTMLTagsEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangDataCollectorsEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangDiagnosticsEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangOpcodesEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangOpenCLBuiltinEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangOptionDocEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangSACheckersEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangSyntaxEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/ClangTypeNodesEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/MveEmitter.cpp.o 
tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/NeonEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/RISCVVEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/SveEmitter.cpp.o tools/clang/utils/TableGen/CMakeFiles/clang-tblgen.dir/TableGen.cpp.o  -o bin/clang-tblgen  -Wl,-rpath,"\$ORIGIN/../lib"  lib/libLLVMSupport.a  lib/libLLVMTableGen.a  -lpthread  lib/libclangSupport.a  lib/libLLVMSupport.a  -lrt  -ldl  -lpthread  -lm  /usr/lib/x86_64-linux-gnu/libz.so  /usr/lib/x86_64-linux-gnu/libtinfo.so  lib/libLLVMDemangle.a  lib/libLLVM-16git.so
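
For anyone who wants to check their own link line for the same pattern, here is a minimal sketch. The helper name `check_link_line` is made up for illustration; it reads a link command on stdin and reports whether the command names both a static libLLVMSupport.a and a shared libLLVM (such as libLLVM-16git.so). With a Ninja build, one way to get the command is probably `ninja -t commands bin/clang-tblgen | tail -n 1`, but the exact target name is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: reads a link command on stdin and reports which
# flavours of LLVMSupport the command references.
check_link_line() {
    cmd=$(cat)
    static=no
    dylib=no
    # Static archive named explicitly on the link line.
    if printf '%s\n' "$cmd" | grep -q 'libLLVMSupport\.a'; then
        static=yes
    fi
    # Shared libLLVM (e.g. libLLVM-16git.so), which also contains Support.
    if printf '%s\n' "$cmd" | grep -Eq 'libLLVM(-[^ /]*)?\.so'; then
        dylib=yes
    fi
    echo "static=$static dylib=$dylib"
}
```

If both come back as `yes`, the link line at least allows two copies of the Support code to end up in the process. Whether both are actually loaded still depends on what the linker keeps.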

However, the resulting binary doesn’t have the .so reference anymore:

$ ldd bin/clang-tblgen
        linux-vdso.so.1 (0x00007ffc8ddd0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5c84408000)
        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f5c843d8000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f5c841f6000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5c840a7000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5c8408c000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5c83e9a000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f5c84654000)

My current best theory is that for some reason, the clang-tblgen binary on the buildbot contains both a static copy of libLLVMSupport.a and the dynamically linked libLLVM-16git.so, and both get initialized.

Can you confirm this by looking at the buildbot’s ldd bin/clang-tblgen and nm bin/clang-tblgen (the latter is expected to have DebugCounter symbols)?
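
The two checks could be combined roughly as follows. This is only a sketch: the function names are invented here, and it assumes GNU-style `ldd` and `nm` output. Both helpers read from stdin, so they can be tried on saved logs as well as on a live binary.

```shell
#!/bin/sh
# Hypothetical helpers for checking the "two copies of Support" theory.

# Succeeds if the ldd output on stdin lists a shared libLLVM
# (e.g. libLLVM-16git.so).
links_llvm_dylib() {
    grep -Eq 'libLLVM(-[^ /]*)?\.so'
}

# Prints how many DebugCounter symbols in the nm output on stdin are
# defined in the binary itself (any symbol type other than U, "undefined").
count_local_debugcounter() {
    awk '$NF ~ /DebugCounter/ && $(NF-1) != "U" { n++ } END { print n + 0 }'
}

# Usage on a real binary (the path is an example):
#   ldd bin/clang-tblgen | links_llvm_dylib && echo "links the dylib"
#   nm bin/clang-tblgen | count_local_debugcounter
```

If the dylib is linked and the count is non-zero, the binary carries its own DebugCounter code in addition to whatever the dylib provides, which is what the theory predicts.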

Furthermore, it seems the TableGen-related cmake scripts are designed to never use dynamic linking, presumably for cross compilation. Would you be able to try the clang-tblgen-link branch of nhaehnle/llvm-project on the buildbot and see if that builds correctly? It has the original offending patch as well as a hack to avoid pulling in libLLVM-16git.so.

If that builds successfully, it would confirm my theory as a first step. We’d then have to figure out what the real fix is – presumably, clang-tblgen isn’t supposed to link against the dylib, and that’s the real bug, but the details are tricky.

I’ll try this out tomorrow and get back to you.

With main.

$ ldd bin/clang-tblgen
        linux-vdso.so.1 (0x0000ffff984df000)
        libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff9847e000)
        librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff98466000)
        libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff98452000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff983a7000)
        libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffff9837d000)
        libtinfo.so.6 => /lib/aarch64-linux-gnu/libtinfo.so.6 (0x0000ffff9833f000)
        libLLVM-16git.so => /home/david.spickett/build-llvm-aarch64/bin/../lib/libLLVM-16git.so (0x0000ffff9450b000)
        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff94326000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff94302000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff9418f000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffff984af000)
        libedit.so.2 => /lib/aarch64-linux-gnu/libedit.so.2 (0x0000ffff9414a000)
        libxml2.so.2 => /lib/aarch64-linux-gnu/libxml2.so.2 (0x0000ffff93f90000)
        libbsd.so.0 => /lib/aarch64-linux-gnu/libbsd.so.0 (0x0000ffff93f69000)
        libicuuc.so.66 => /lib/aarch64-linux-gnu/libicuuc.so.66 (0x0000ffff93d7c000)
        liblzma.so.5 => /lib/aarch64-linux-gnu/liblzma.so.5 (0x0000ffff93d48000)
        libicudata.so.66 => /lib/aarch64-linux-gnu/libicudata.so.66 (0x0000ffff92279000)
$ nm ./bin/clang-tblgen | grep -i DebugCounter
0000000000686d40 b _ZGVZN4llvm23initDebugCounterOptionsEvE25RegisterPrintDebugCounter
0000000000686d38 b _ZL17PrintDebugCounter
0000000000686d48 b _ZL18DebugCounterOption
<...>
$ nm ./bin/clang-tblgen | grep -i DebugCounter | wc -l
39

With your branch, which built fine.

$ ldd bin/clang-tblgen
        linux-vdso.so.1 (0x0000ffffbe4d4000)
        libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffbe473000)
        librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffffbe45b000)
        libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffffbe447000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffbe39c000)
        libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffffbe372000)
        libtinfo.so.6 => /lib/aarch64-linux-gnu/libtinfo.so.6 (0x0000ffffbe334000)
        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffbe14f000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffbe12b000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffbdfb8000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffbe4a4000)
$ nm ./bin/clang-tblgen | grep -i DebugCounter
00000000005d5c20 b _ZGVZN4llvm12DebugCounter8instanceEvE1O
00000000004fd5e0 t _ZN12_GLOBAL__N_116DebugCounterListD0Ev
<...>
$ nm ./bin/clang-tblgen | grep -i DebugCounter | wc -l
31

Thank you very much for doing this test! That does seem to confirm my theory. I put up a hopefully cleaner way of solving the underlying issue here: D134637 (clang-tblgen build: avoid duplicate inclusion of libLLVMSupport)

I just ran into a similar issue for MLIR on Windows. My VM seems to be rather slow, so debugging this takes ages. Given that you have been looking into this for a while, I wonder if you would have comments on llvm/llvm-project issue #58015, “LLVM_LINK_LLVM_DYLIB=ON on windows breaks MLIR build”. In particular, I do not understand why we need to use DISABLE_LLVM_LINK_LLVM_DYLIB, as added in D125440 ([mlir][Tablegen-LSP] Add support for a basic TableGen language server). Should llvm_config not add TableGen, but none of the others:
llvm-project/LLVM-Config.cmake at main · llvm/llvm-project · GitHub