[RFC] Open sourcing and contributing TAPI back to the LLVM community

Hi @ll,

Over the past years I have been looking into how to reduce the size of the SDK that ships with Xcode and how to improve build times for the overall OS inside Apple. The result is a tool called TAPI, which is used at Apple for all things related to text-based dynamic library files (.tbd).

What are text-based dynamic library files?
Text-based dynamic library files (TBDs) are a textual representation of the information in a dynamic library / shared library that is required by the static linker - basically a symbol list of the exported symbols.

Apple’s SDKs originally used Mach-O Dynamic Library Stubs. Mach-O Dynamic Library Stubs are dynamic library files, but with all the text and data stripped out. TBD files were introduced to replace Mach-O Dynamic Library Stub files in the SDK to further reduce its overall size.

Over time the TAPI tool has grown and is used now in a variety of ways.

Dynamic Library Stubbing:
As mentioned above, TAPI reads the content of a dynamic library / shared library and generates a textual representation that can be used by the static linker. The current implementation reads Mach-O files, but it could be extended to provide the same functionality for other object file formats.

Framework / Dynamic Library Verification:
The symbols that are exported from a dynamic library should ideally match, or at least contain, all the API that is specified in the associated header files. TAPI performs this verification by parsing the header files with Clang and comparing the findings to the symbols exported from the library.
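As a rough illustration of the comparison step only (not TAPI's actual implementation, which parses real headers with Clang), the check can be thought of as two set differences. The function name and all symbol names below are hypothetical.

```python
def verify_exports(header_api, exported_symbols):
    """Compare the API declared in headers with the symbols a library exports."""
    header_api = set(header_api)
    exported_symbols = set(exported_symbols)
    return {
        # Declared in a header but not exported: clients would fail to link.
        "missing_from_library": sorted(header_api - exported_symbols),
        # Exported but not declared in any header: possibly unintended API.
        "undeclared_exports": sorted(exported_symbols - header_api),
    }


# Hypothetical inputs standing in for "symbols found in headers" and
# "symbols found in the library's export list".
report = verify_exports(
    header_api=["_foo_create", "_foo_destroy", "_foo_version"],
    exported_symbols=["_foo_create", "_foo_destroy", "_foo_internal_helper"],
)
print(report)
```

Under the "match, or at least contain" rule above, only the first list would be an error; the second is the kind of finding that usually calls for header cleanup.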

InstallAPI:
InstallAPI is a new build phase that generates the TBD file from header files only. This allows a dependency of the library to build concurrently even before the library has been built itself. This can be used to increase parallelism in the build of larger projects or operating systems.
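A toy model may make the parallelism argument concrete. The step names and durations below are invented for illustration; the sketch only computes earliest finish times for a small dependency graph, once where the client waits for the built library and once where it waits only for a TBD generated from headers.

```python
def finish_times(durations, deps):
    """Earliest finish time of every step, given durations and prerequisites."""
    finished = {}

    def visit(step):
        if step not in finished:
            # A step starts once all of its prerequisites have finished.
            start = max((visit(d) for d in deps.get(step, [])), default=0)
            finished[step] = start + durations[step]
        return finished[step]

    for step in durations:
        visit(step)
    return finished


# Hypothetical durations in arbitrary time units.
durations = {"installapi libfoo": 1, "build libfoo": 10, "build app": 8}

# Without InstallAPI: the app needs the fully built library.
serial = finish_times(durations, {"build app": ["build libfoo"]})

# With InstallAPI: the app only needs the TBD file generated from headers.
parallel = finish_times(durations, {"build app": ["installapi libfoo"]})

print(max(serial.values()), max(parallel.values()))
```

In this model the overall build is bounded by the longest chain of dependencies, so replacing the edge to the library build with an edge to the much shorter header-only step shortens the chain.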

Misc:

  • display and operate on TBD files
  • automatically generate API tests from header files
  • libtapi, which is used by the linker (ld64) to parse the TBD files

The functionality of the tool is currently limited to Mach-O object files, but that is not a technical limitation. In making the tool open source I hope others will be able to take advantage of it too and extend its functionality to other object file formats.

I initially developed the project as a CLANG project, but that was mostly for practical reasons (out-of-tree development, separate repo, etc). For the curious ones I pushed the repo to github (https://github.com/ributzka/tapi).

I imagine, for example, that the reading/writing of TBD files is something that would fit better into the LLVM sources, which makes it available to other libraries and tools (e.g. LLVMObject, llvm-nm, lld, …).

I created a small patch that integrates it with llvm-nm and LLVMObject. This patch is not complete and I will split it up into smaller patches for review. I am providing it as a reference to get the discussion started.

Please let me know what you think and bikeshed away :)

Thanks

Cheers,
Juergen

tapi-llvm-nm.patch.tar.bz2 (16.3 KB)

I'm interested in whether you plan to have this integrated in lld as well.
As far as I understand, this is going to be the de-facto way of
shipping for Mach-O binaries (at least, the ones released by Apple).
Please correct me if I'm wrong.
I tried to self-host lld on El Capitan and it fails because lld
doesn't really know about TBD files.
This, unfortunately, makes the linker not really usable for modern Mac
OS releases.

> InstallAPI:
> InstallAPI is a new build phase that generates the TBD file from header
> files only. This allows a dependency of the library to build concurrently
> even before the library has been built itself. This can be used to
> increase parallelism in the build of larger projects or operating systems.

My experience is that headers don't necessarily form the best source of
truth about the API exported from a library. If you follow the Windows
model of marking exported APIs explicitly (__declspec(dllexport) or something)
then okay, but that's a Windows extension and not common in other systems.
Linker scripts seem to be a more popular method; does the tool read linker
scripts to form the content of a TBD file?
Otherwise I'm not seeing a generic improvement in build parallelism.
--paulr

> I'm interested in whether you plan to have this integrated in lld as well.
> As far as I understand, this is going to be the de-facto way of
> shipping for Mach-O binaries (at least, the ones released by Apple).
> Please correct me if I'm wrong.

Yes, this is already the de-facto way of shipping Mach-O files in the SDK.
That means self-hosting LLD against the SDK is currently not possible. The
system itself is obviously still shipping full Mach-O files in /System, so
you should still be able to self-host against those files.

My plan is to integrate support for TBD files into all LLVM tools where it
makes sense (including LLD). This is why I wanted to start to put the basic
support into LLVM first, so it can be used by other tools and libraries.

Hi Paul,

My experience has shown the same when it comes to header files and I am not claiming this is going to work out of the box for all library projects. It usually requires some cleanup first and that is why the tool comes with a verification mode to make sure the headers are the truth. Also keep in mind that you don’t have to parse all the headers, but only the small set that gets installed as part of the library API.

The tool does not read the linker script / export file, because they are not necessarily the truth either and may have wildcards. In my view they are just one way of managing exported symbols. Another way, which I personally prefer, is to build with visibility hidden and annotate only the API with visibility default. That makes the headers the single source of what is API.

Cheers,
Juergen

I think it makes sense to have support for this input format in the tools. Since the macOS SDK is slowly switching to this, having the tools work out of the box is a nice feature. It is rather convenient having a single toolset be sufficient to provide infrastructure for all the targets.

Saleem

Hi Juergen,

At a minimum I think adding the support to libobject, etc. so the various llvm tools can read or even write files from/for OSX should be fairly non-controversial. So how about going ahead and doing that first (I’ll happily review if you’d like), and then we can go from there to do anything else with TAPI and llvm?

Sound good?

-eric

To belatedly second Juergen, yes I think the concept of TBD files is great, and not just useful to the specific Xcode situation of proprietary libraries. For example the mapfiles[1] of Illumos are exactly analogous and are used not because the libc of Illumos is closed source (it isn't) but rather to ensure compatibility across Illumos versions. The libc (shared library) ABI of Illumos is the stable interface, rather than the syscall ABI as on Linux, so ensuring linkers see *only* the export list at build time prevents the abstraction from leaking and unintended incompatibilities from being relied upon.

Beyond my general esteem for using export lists for hermetic builds, I'm working on automated native and cross builds for Darwin, and it would be a lot more convenient if this was part of LLVM.

So what happened to this? I searched TAPI on differential and didn't see anything. Given the general need for something like a standardized export list to feed a linker instead of the shared library itself, and the utility of this specific format given its use by a major platform, I think it definitely should be upstreamed.

John

[1] https://github.com/joyent/illumos-joyent/blob/master/usr/src/lib/README.mapfiles

I’m also going to chime in here pretty late. I’ve been thinking about proposing a tool (which I’ve been calling llvm-abitool) to do many of these same things. I would be willing to contribute the ELF part of this tool and get that up and running.

I’m not really clear on the actual benefits of the TBD file, and why Apple migrated to them in the first place. Shouldn’t a dynamic library containing only the relevant parts (e.g. the dynamic symbol table) be roughly comparable in size? And, much simpler to support? I assume that’s effectively what “Mach-O Dynamic Library Stubs” actually were, before the introduction of TBD files, so presumably there were good reasons for switching?

If anyone wants to do something similar for another platform (that is to say, ELF; COFF already has import libraries), I’d suggest that the sensible way to do so would be to generate actual shared object files which contain only the appropriate interface definitions.

Regardless of any of that, given that TBD files are an integral part of the Apple platform, supporting them is certainly a necessity in order to have a working Apple linker. So, if making LLD work for Apple/MachO is the justification for adding TBD support to LLVM, that seems self-evidently a reasonable thing to do. On the other hand, it looks like the LLD mach-o code is unmaintained and nobody seems to be much interested in it. And having code for reading TBD files in LLVM seems not terribly interesting, unless it is as part of a project to make the LLD MachO linker actually functional and supported.

File size is one reason. A TBD file is typically one third the size of the corresponding stub library for a single architecture. Multiple architectures dramatically increase the TBD advantage: a new architecture in TBD may cost as little as a few bytes if all architectures export the same functions, but each new architecture in a stub library requires duplicating its entire contents.
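The multi-architecture point can be sketched with a toy serializer that groups symbols by the set of architectures exporting them. The output layout below is a simplified stand-in chosen for illustration, not the real TBD schema, and the symbol names are hypothetical.

```python
def describe_exports(exports_by_arch):
    """Render one text section per distinct set of architectures."""
    # Record which architectures export each symbol.
    archs_by_symbol = {}
    for arch, symbols in exports_by_arch.items():
        for symbol in symbols:
            archs_by_symbol.setdefault(symbol, set()).add(arch)

    # Group symbols that share exactly the same architecture set.
    symbols_by_archs = {}
    for symbol, archs in archs_by_symbol.items():
        symbols_by_archs.setdefault(tuple(sorted(archs)), []).append(symbol)

    lines = []
    for archs, symbols in sorted(symbols_by_archs.items()):
        lines.append("archs: [ " + ", ".join(archs) + " ]")
        lines.append("symbols: [ " + ", ".join(sorted(symbols)) + " ]")
    return "\n".join(lines) + "\n"


symbols = ["_foo_create", "_foo_destroy", "_foo_version"]
one_arch = describe_exports({"x86_64": symbols})
two_archs = describe_exports({"x86_64": symbols, "arm64": symbols})

# Only the architecture list grows; the symbol list is shared.
growth = len(two_archs) - len(one_arch)
print(two_archs)
print("extra bytes for second architecture:", growth)
```

When both architectures export the same symbols, the second one costs only its name in the shared `archs` list, whereas a binary stub would carry a second full copy of the symbol table.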

> Regardless of any of that, given that TBD files _are_ an integral part of the apple platform, supporting them is certainly a necessity in order to have a working apple linker. So, if making LLD work for Apple/MachO is the justification for adding TBD support to LLVM, that seems self-evidently a reasonable thing to do. On the other hand, it looks like the LLD mach-o code is unmaintained and nobody seems to be much interested in it. And having code for reading TBD files in LLVM seems not terribly interesting, unless it is as part of a project to make the LLD MachO linker actually functional and supported.

Yes. I hope this can be reason enough. Hobbyists could push for LLD support for Mach-O besides Apple, and if LLD is to displace other linkers this is a necessary component as you say. Better to upstream now before the code diverges than more work later? Conversely if nothing happens, I doubt libtapi would be a greater drag on the codebase than the MachO LLD code, so whatever cost/benefit analysis exists for keeping that around could also apply to this.

>> I'm not really clear on the actual benefits of the TBD file, and why Apple migrated to them in the first place. Shouldn't a dynamic library containing only the relevant parts (e.g. the dynamic symbol table) be roughly comparable in size?...

> File size is one reason...

For the record, other small benefits are

  - The inclusion of the path to the actual library, which as far as I know is not something that can be done with a stub library. This allows easy absolute or relative (with R(UN)PATH) linking. Comparatively, passing the right -rpath and -rpath-link is manual and (in my opinion) harder to understand and cumbersome, and also not a solution for absolute linking. I work with Nixpkgs of NixOS, where absolute path linking is frequently an objective as part of a general principle of avoiding indirection.

  - YAML. The option for line-oriented structure allows for easy diffing with conventional line-based diffing tools, which is useful for debugging compatibility issues (e.g. Why did my new version remove symbols? Why did my security update change anything at all?). Of course one can just objdump and diff, but that wouldn't happen automatically with version control, for example.
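To illustrate the diffing point, here is a minimal sketch that treats an export list as one symbol per line and runs a standard line-based diff over two versions. The file labels and symbol names are made up; a real TBD file carries more fields than a bare symbol list.

```python
import difflib


def export_diff(old_symbols, new_symbols):
    """Unified diff of two export lists, one symbol per line."""
    old_lines = [s + "\n" for s in sorted(old_symbols)]
    new_lines = [s + "\n" for s in sorted(new_symbols)]
    return list(difflib.unified_diff(
        old_lines, new_lines, fromfile="libfoo.v1", tofile="libfoo.v2"))


diff = export_diff(
    ["_foo_create", "_foo_destroy", "_foo_version"],
    ["_foo_create", "_foo_destroy", "_foo_open"],
)

# Skip the "---"/"+++" file header lines and keep only changed symbols.
removed = [l[1:].strip() for l in diff
           if l.startswith("-") and not l.startswith("---")]
added = [l[1:].strip() for l in diff
         if l.startswith("+") and not l.startswith("+++")]

print("".join(diff))
```

Because the input is plain text, the same result falls out of any version control system's built-in diff without a separate dump step.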

John

Speaking for the Zig project here, our goal is to support cross-compilation
for any target, on any target, without requiring installation of any
target-specific SDK. So, for example, these use cases:
* on linux, compile & link a binary targeting macos
* on windows, compile & link a binary targeting macos

This works today, although it depends on a patch to LLD to fix the MACH-O
linker that is not high enough quality to upstream.

So we have a vested interest in improving the MACH-O linker, and in fact a
Zig community member has fixed at least one bug in MACH-O LLD:
reviews.llvm.org/D35387

I don't fully understand how TBD or TAPI works, but I hope that it results
in improvements to the MACH-O linker.

Benefits of TBD:

  1. It’s human readable and diffs on TBDs correspond to changes in the ABI. Diffs can be automatically added to review processes to ensure that changes to the ABI are reviewed. The TBDs also document your precise ABI.
  2. The size is smaller which means they can be shipped in an SDK instead of binaries to reduce the size of an SDK
  3. Stubs are producible from TBDs (or should be) which means stubs for linking can be produced even if we don’t directly support them in LLD. This lets you ship the smaller TBD files in place of larger binaries and still link things without direct linker support (assuming you already ship a toolchain with your SDK or expect your users to have this tool)

Since stubs are producible from TBDs I don’t really see a downside. I think we need both, I was going to propose a yaml based representation for ELF for the above reasons anyhow.

Seems like there are a few of us interested in this then. I’m new around here and don’t really know how decisions are made, so what’s next? Just open a diff with the entire library??

John

Ideally Juergen would cut up the code on github, put up an initial diff for a minimal viable tool, and then we would review it and then continue to copy code from the github repo into llvm and review it. I’m also willing to do that if Juergen doesn’t want to at this point though. I’d like the OK from Juergen on that and I’d also like the OK from someone that the license stuff is all good to go (I’m not sure who should check license stuff).

Best,
Jake

That sounds great to me, thanks Jake. I’m not Juergen either, of course, but I’m happy to assist you if he is unavailable. I’m also not qualified to audit the license, but do note Apple formally also released some code at . If there’s anything else I can do to help, let me know.

Cheers,

John

Also I mainly care about getting the ELF part of this working so it would be nice to have an informal owner of the MachO part.

> Benefits of TBD:
>
>   1. It’s human readable and diffs on TBDs correspond to changes in the ABI. Diffs can be automatically added to review processes to ensure that changes to the ABI are reviewed. The TBDs also document your precise ABI.
>   2. The size is smaller which means they can be shipped in an SDK instead of binaries to reduce the size of an SDK

I’m still skeptical that this is significant.

>   3. Stubs are producible from TBDs (or should be) which means stubs for linking can be produced even if we don’t directly support them in LLD. This lets you ship the smaller TBD files in place of larger binaries and still link things without direct linker support (assuming you already ship a toolchain with your SDK or expect your users to have this tool)
>
> Since stubs are producible from TBDs I don’t really see a downside. I think we need both, I was going to propose a yaml based representation for ELF for the above reasons anyhow.

Yea, a tool which can produce a .so from a textual description is certainly much less concerning than adding linker support for a new textual description format. If it’s an official linker-supported format, it’d be yet another format that potentially needs to be standardized across multiple linkers, and kept compatible for “ever”, etc. I just don’t think that seems worthwhile for ELF.

OTOH, a standalone tool which can convert from a “full” shared-object to an interface shared-object would be great to have. If that tool also has some auxiliary textual I/O format it supports, I guess that’s fine, too. (We do have some existing yaml <-> ELF support, via the “obj2yaml” and “yaml2obj” tools.)

I’d note that reproducing all the things that are required/used from an ELF shared object during linking (symbol type, binding-type, visibility, version, alignment (!), .gnu.warning messages, various important “SHT_NOTE” sections, and whatever other things I’ve forgotten about) will need a significantly different format than what Apple has as their “TBD” format. Apple’s format also has a bunch of special cases in it that make it easier to use for their platform, but make it a rather less generic tool. E.g., symbols starting with “_OBJC_CLASS_$_” are recorded in the “objc-classes” field with the prefix removed, instead of just being recorded as-is.
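The Objective-C special case can be sketched as a small classification step. This is an illustration of the idea only, not libtapi's code; the exact prefix spelling `_OBJC_CLASS_$_` is an assumption about the Mach-O symbol naming, and the symbol names are hypothetical. Only the `objc-classes` field name comes from the discussion above.

```python
# Assumed spelling of the Objective-C class symbol prefix.
OBJC_CLASS_PREFIX = "_OBJC_CLASS_$_"


def classify(symbols):
    """Split symbols into plain exports and prefix-stripped class names."""
    result = {"symbols": [], "objc-classes": []}
    for symbol in sorted(symbols):
        if symbol.startswith(OBJC_CLASS_PREFIX):
            # Recorded without the prefix, which a reader must re-add.
            result["objc-classes"].append(symbol[len(OBJC_CLASS_PREFIX):])
        else:
            result["symbols"].append(symbol)
    return result


classified = classify(["_foo_create", "_OBJC_CLASS_$_FooController"])
print(classified)
```

A format built this way is compact for its home platform, but a consumer has to know each such rule to reconstruct the real symbol names, which is part of why it does not generalize directly to ELF.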

So, I’d also caution that while the project of “import apple’s libtapi into LLVM for LLD/MachO” and “Make a scheme to do interface shared-libs for ELF” might seem superficially related, I’d be very surprised if that actually ended up being the case. I would really not expect it to share just about anything at all other than the concept of being a textual description for a library.

I fully agree TBD files will need to be format specific. Producing stubs (from full shared objects or text) is important and should have a tool to do it properly. Right now that’s an artisanal process unique to each project. Adding linker support is a whole other issue and not one I’m too concerned with. If someone wants to propose that later, they can. I certainly won’t be proposing it.

As an aside I’m actually quite aware of demand for a textual representation. I work on a team producing an operating system and we care about what symbols are exposed in our system libraries. I also know that lots of people use libabigail for these sorts of reasons. The demand for human ABI review exists. Whether a textual representation is important or not is I suppose unclear.