[RFC] Cleaning up how we link TableGen tools

Summary

I would like to clean up how TableGen tools are linked by:

  • Adding the LLVMTableGen library to the shared library (libLLVM-*.so) build if enabled
  • Linking tablegen tools other than llvm-tblgen dynamically against libLLVM-*.so in builds where this is enabled

I have a stack of changes towards this goal, ending in D138278 "TableGen: honor LLVM_LINK_LLVM_DYLIB by default". Please review / give feedback: this is a dark but important corner, and in my experience it is typically difficult to get reviews for changes there.

Rationale

Most tablegen tools link against both LLVMSupport and some project-specific support library, like clangSupport or the MLIR PDLL implementation. Those project-specific libraries also tend to link against LLVMSupport directly or indirectly for obvious reasons.

In an LLVM_LINK_LLVM_DYLIB=ON build, the result is that tablegen tools link against LLVMSupport both statically and dynamically, via two different paths:

  • tablegen tool → LLVMSupport (direct path, static)
  • tablegen tool → project support library → LLVMSupport (indirect path, dynamic)

On some toolchains, the tablegen tool process ends up with duplicated global variables from LLVMSupport, which unsurprisingly leads to bugs. (It’s actually more surprising that it hasn’t led to bugs earlier than now.) So ultimately, the goal here is to make our linker story more robust.
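As a rough illustration of the two paths above, here is a toy Python model of the link graph. The node names follow the bullets; the edge modes, the `CURRENT_LINKS` / `PROPOSED_LINKS` tables, and the helper names are assumptions made for this sketch and are not taken from the real CMake dependency graph.

```python
from typing import Dict, List, Set, Tuple

# node -> list of (dependency, link mode). Simplified, hypothetical graph.
Graph = Dict[str, List[Tuple[str, str]]]

# Today with LLVM_LINK_LLVM_DYLIB=ON: direct path static, indirect path dynamic.
CURRENT_LINKS: Graph = {
    "tablegen-tool": [("LLVMSupport", "static"), ("project-support", "static")],
    "project-support": [("LLVMSupport", "dynamic")],
}

# Proposed: the tool's direct path also goes through the shared library.
PROPOSED_LINKS: Graph = {
    "tablegen-tool": [("LLVMSupport", "dynamic"), ("project-support", "static")],
    "project-support": [("LLVMSupport", "dynamic")],
}


def reach_modes(root: str, target: str, links: Graph) -> Set[str]:
    """Return the link modes through which `target` is reached from `root`.

    A path counts as dynamic if any edge on it is dynamic. Assumes the
    graph is acyclic.
    """
    modes: Set[str] = set()

    def walk(node: str, via_dynamic: bool) -> None:
        for dep, mode in links.get(node, []):
            dynamic = via_dynamic or mode == "dynamic"
            if dep == target:
                modes.add("dynamic" if dynamic else "static")
            else:
                walk(dep, dynamic)

    walk(root, False)
    return modes


def has_duplicate_copies(root: str, target: str, links: Graph) -> bool:
    """True if `target` is reached both statically and dynamically."""
    return len(reach_modes(root, target, links)) > 1
```

In this model, `has_duplicate_copies("tablegen-tool", "LLVMSupport", CURRENT_LINKS)` flags the mixed static/dynamic situation, while the same query on `PROPOSED_LINKS` does not.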

Now, Chesterton’s fence: Why have tablegen tools been linked statically so far? I believe the answer is that llvm-tblgen must be linked statically (to avoid a circular build dependency) and then other tools just copied whatever llvm-tblgen did without revisiting that part of it. There simply was no need to revisit this until quite recently.

Then there is the argument, documented in one of the cmake files, that LLVMTableGen is an internal library, so there is no need to include it into libLLVM-*.so. The extensive and good use of tablegen in MLIR proves that this stance has become outdated. There are certainly heavy users of tablegen outside of core LLVM itself, which suggests that downstream use of tablegen is very much in the cards as well. (I’m somewhat biased because I happen to have such a project :) )

The same cmake file also documents that LLVMTableGen wasn’t included to avoid polluting the command-line option namespace, but that’s easily addressed with an established pattern for registering options explicitly, which one of my patches does.
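To make the idea of explicit registration concrete, here is a language-neutral sketch in Python. It is only an analogy: `OptionRegistry`, `register_tablegen_options`, `make_registry`, and the option name `example-tablegen-flag` are all hypothetical and do not correspond to LLVM's actual API or to what the patch does in detail.

```python
class OptionRegistry:
    """Stand-in for a process-wide command-line option table."""

    def __init__(self) -> None:
        self.options: dict = {}

    def add(self, name: str, default: object) -> None:
        if name in self.options:
            raise ValueError(f"option '{name}' registered twice")
        self.options[name] = default


def register_tablegen_options(registry: OptionRegistry) -> None:
    """Hypothetical explicit hook: options exist only once this is called."""
    registry.add("example-tablegen-flag", False)


def make_registry(is_tablegen_tool: bool) -> OptionRegistry:
    """Build the option table a tool would see at startup."""
    registry = OptionRegistry()
    if is_tablegen_tool:
        # Only tablegen tools opt in; other users of the shared library
        # never see these options in their namespace.
        register_tablegen_options(registry)
    return registry
```

The point of the pattern is that merely having the library loaded does not add options; a tool has to ask for them.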

Finally, you may ask about native builds. Tablegen tools are also built as part of a recursive cmake call for cross compilation and for non-Release builds with LLVM_OPTIMIZED_TABLEGEN=ON. Those native builds should be kept small: they shouldn’t build the full libLLVM-*.so. However, that is not a problem because the recursive native builds always use the default Release build, which has LLVM_LINK_LLVM_DYLIB=OFF.

Alternatives

The alternative option that I started with was to create duplicate versions of support libraries: one meant for dynamic links and one meant for static links. This actually landed for clangSupport in D134637 "clang-tblgen build: avoid duplicate inclusion of libLLVMSupport" / commit dce78646f07f. However, the same type of issue also exists in MLIR land and affects many more libraries (mostly around PDLL and LSP). The duplication of libraries quickly got out of hand; I spent more time trying and failing to make this alternative work cleanly than I have spent so far on the proposed solution.

In general, I think the option that I am advocating for is clearly better because it fixes the underlying problem by making our build system simpler. Fixing an issue by making code simpler always makes me happy.

cc @mehdi_amini @TobiasGrosser @River707

Could TableGen be refactored to get rid of the Support dependency and become a true leaf component?

A quick skim of the #include’s in llvm/lib/TableGen suggests that removing the Support dependency would mean a lot of code duplication and/or loss of UI functionality. It relies on Support for command-line handling, error handling, source management, and a few other things.

Agreed. But look at the problems there are with TableGen and the non-obvious solutions proposed.

I really would prefer to not do this. LLVM tools, particularly on Windows, are terribly large due to lack of symbolic links (yes, they exist now, and there has been support for NTFS junctions, but they rely on a specific file system and require elevated privileges to create in the first place). This results in a clang/LLVM distribution being nearly 2.5G. I’d like to enable support for using shared linking on Windows, and this will increase the size of the LLVM DLL (DSO/dylib) to add functionality that is not used in a regular distribution.

My perception is that “regular distribution” is not very well defined: the point of this thread is to realize that this library is part of the use cases served by the LLVM distribution right now.
Having a less monolithic distribution would be highly valuable, but when it comes to this dylib I don’t quite see how to get around the fact that it is monolithic by design right now, and as such should expose every public library from the llvm directory by default (otherwise how do we compose projects? How do we make things work the same across static and dynamic linking?)

This is inherently a problem. Windows has a 16K limit on symbols per DLL. We cannot expose every single interface under a single library. What I suspect that we will need (and even want - at least with my distributor hat on) is that we have multiple libraries, at least 2 (LLVMSupport, LLVM) but likely 3 (LLVMSupport, LLVMBinaryUtilities, LLVMCodeGen) and a separate library for clang.

Again, as a distributor, the general distribution consists of binary tools (which currently are on the order of 35-100 MiB each), clang, lld, (and in my case, IDE tools such as clangd, clang-format), as well as the resource directory. This covers the binary analysis tools, linker, assembler, debugger, and compiler.

I don’t understand how project composition is impacted by there being a handful of libraries rather than a single library. I could understand that it can complicate distribution, but ultimately, the projects can still compose even if there are many libraries.

Therein lies the point - they are different. We need to be cognizant of that, and make it easy to reason about. As a concrete example, one issue that was uncovered by my previous attempt to enable building LLVMSupport as a shared library on Windows was that we are reliant on ODR violations to have cl::opt function properly. Modeling everything as dynamic linking will permit us to link statically, however, modeling everything as static linking will not allow dynamic linking. The dynamic model forces you to consider the boundaries of the libraries and how they interact with each other. I am not claiming that this is simple, but it is part of software engineering.

Right, but it seems to me that you’re hinting at a design problem with this dylib in itself, not a principled argument about the state of the library that we’re discussing right now.

Maybe, and I’m all for going towards more modularity, but in the meantime we have a single monolithic library, right?
Even then, what you mention seems very oriented toward some use cases: why wouldn’t we separate the middle-end from each individual backend, for example?
How would the library for clang differ? Would it be redundant with the other libraries (include code from LLVMCodeGen) or do you intend this as “code inside the clang/ folder”?

We’re talking about different things: you are targeting the monolithic aspect of libLLVM.so as problematic. I don’t disagree with you but I claim that this isn’t the topic at hand here: the problem is that projects right now are using LLVM and trying to link to it whether it is built statically or dynamically with the monolithic libLLVM. Right now there is inconsistency in what LLVM exposes with respect to this tablegen library and that makes it harder for all of these projects to manage this dependency on LLVM.

So breaking apart libLLVM.so is a valuable thing to do, but pending that, we should keep this inconsistency with respect to a single library IMO.

@nhaehnle, I am very supportive of this change. It clearly cleans up some dark corners in our cmake system that have been inconsistent, complex, and buggy. I had no idea that there is such a neat workaround for the command line issue that seemingly motivated this complex setup in the first place. Thank you.

I have also hit the Windows 16K symbol limit myself and would like to see it addressed. However, I agree with @mehdi_amini that we should likely decouple these two issues.

I’ve wondered if we should be moving in the other direction: making LLVMSupport a DLL when LLVM_LINK_LLVM_DYLIB=True. It seems that LLVMSupport is a pretty widely used dependency, and this would remove the common footgun that you’ve encountered, while LLVMTableGen is used relatively rarely by non-tablegen tools, so there is relatively little to gain from bundling it with libLLVM.so?

The limit is 64k symbols, not 16k. But yes, the limit is a practical problem.

For mingw builds, it is (and has been for quite some time) possible to build with the dylib enabled, and it is a massive improvement for distributions.

But we recently did reach the 64k symbol limit for libLLVM. These days, the mingw builds can use hidden visibility just like on ELF though, and that brought us down from ~64k symbols to ~36k symbols, which gave us plenty of margin again.

Thank you for these numbers. It sounds like we’re not in any immediate trouble here, so I can go ahead with the proposed change while the larger question of whether and how to break up libLLVM in a principled way can be discussed separately?

At least from the current dylib builds on Windows with mingw tools, I don’t see a problem.

Do you have rough numbers on how many symbols you have exported in libLLVM on ELF platforms before and after this change?

I don’t know how to get really meaningful numbers there. nm libLLVM-16git.so gives me:

  • Without LLVMTableGen: 987406 lines of defined symbols
  • With LLVMTableGen: 994070 lines of defined symbols

(where lines of defined symbols is just all the lines starting with a hex number)

Both are clearly way beyond the 64k limit. Perhaps the most meaningful aspect of it is that the growth is ~0.7%, and at most 6664 symbols in absolute terms.
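For anyone who wants to reproduce that kind of count, a minimal sketch of the "lines starting with a hex number" rule could look like the following. The function name and the sample text are made up for illustration; the sample only mimics the general shape of default `nm` output and is not taken from a real libLLVM build.

```python
import re

# In default `nm` output a defined symbol line begins with a hex address,
# while undefined symbols begin with whitespace.
_HEX_LINE = re.compile(r"^[0-9a-fA-F]+\s")


def count_defined_lines(nm_output: str) -> int:
    """Count the lines of `nm` output that start with a hex number."""
    return sum(1 for line in nm_output.splitlines() if _HEX_LINE.match(line))


# Hypothetical sample: two defined symbols, two undefined ones.
SAMPLE = """\
0000000000001130 T main
                 U printf
0000000000004010 B some_global
                 w __gmon_start__
"""

print(count_defined_lines(SAMPLE))
```

To apply it to a real library, the text can be obtained with something like `subprocess.run(["nm", path], capture_output=True, text=True, check=True).stdout`.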

Hmm, if it would add ~6k new symbols, that would be 10% of the whole allowance in DLLs which is rather a lot. But I guess not all of those really count here.

What does nm -D --defined-only libLLVM-16.0.0git.so | wc -l say before and after the change? With that, I get 36320 on a current build from git on Linux, which is quite close to what I’m getting on the mingw dylib builds too.

I still get numbers that are a lot higher than yours. In any case, it’s moving closer to what you have:

  • Without LLVMTableGen: 268048
  • With LLVMTableGen: 270777

So that’s an absolute increase of 2729, and a relative increase of ~1%.

I think this is a step in the right direction. Right now Gentoo ebuilds for LLVM need a few hacks to deal with the fact that tablegens are the only out-of-LLVM components requiring static LLVMSupport (we skip all other static libraries), and this is effectively a blocker for relatively clean MLIR packaging.

Weird that your numbers are that much higher. I’ve only got the X86, ARM and AArch64 targets enabled, but those should have mostly hidden symbols anyway.

I tested out your patches. Before, I had 36201 symbols on mingw and 36311 on Linux; after these patches, I have 36602 and 36750 symbols respectively - an increase of 401 and 439 symbols. That doesn’t significantly affect the margins for the PE-COFF case at all, so from that point of view, it seems totally OK to me.
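The numbers quoted in this thread can be cross-checked with a few lines of Python. The counts are copied from the posts above; treating "64k" as 64 * 1024 is an assumption of this sketch, and the exact PE/COFF limit may differ by a few entries.

```python
# (label, count before, count after), as quoted in the thread.
MEASUREMENTS = [
    ("nm, all defined lines (ELF)", 987406, 994070),
    ("nm -D --defined-only (ELF)", 268048, 270777),
    ("exported symbols, mingw", 36201, 36602),
    ("exported symbols, Linux", 36311, 36750),
]

# Assumption: "64k" approximated as 64 * 1024.
DLL_EXPORT_LIMIT = 64 * 1024


def growth(before: int, after: int):
    """Return (absolute increase, relative increase in percent)."""
    delta = after - before
    return delta, 100.0 * delta / before


def headroom(exported: int, limit: int = DLL_EXPORT_LIMIT) -> int:
    """Exports still available before hitting the limit."""
    return limit - exported


for label, before, after in MEASUREMENTS:
    delta, percent = growth(before, after)
    print(f"{label}: +{delta} ({percent:.2f}%)")

print("mingw headroom after the patches:", headroom(36602))
```

Under that assumption, the mingw build still has a bit under 29k exports of headroom after the patches.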

Yes, there is a design problem - but this change does seem to make it more difficult to support the option on PE/COFF targets rather than making it easier.

Excellent question. The reasoning for that is twofold:

  • Additional libraries do incur a load-time cost (which is why something like ClangBuiltLinux uses static linking)
  • It enables additional opportunities for LTO over the library

Correct - I intended that to be the code under the clang directory, and thus not be redundant with the LLVM libraries.

It sounds like the tool is after LLVMSupport and not the rest of LLVM, so why not actually build LLVMSupport as a shared library? Doing that for ELF platforms is relatively easy, and would not make one configuration significantly different from the others.

Considering your argument, wouldn’t it be less intrusive to make TableGen a separate shared library that links to the dylib?