Rewrite IR external symbol names for ThinLTO

Context

We are rearchitecting how we ship Python and C++ interop code at Facebook. Historically, the C/C++ code that serves as the Python/C++ interop layer (we call these python_cxx_extensions) has been compiled into individual shared libraries, while all second-level C/C++ dependencies are compiled into an omnibus library (libomnibus.so). The main problem with this approach is that we need to load thousands of python_cxx_extensions shared libraries at runtime, which can be very expensive. We are trying to reduce this runtime overhead by linking all python_cxx_extensions, their C/C++ dependencies, and the Python runtime together into a static binary.

Since all python_cxx_extensions were previously separate shared libraries, we see a large number of duplicate symbols when we try to link them together. It is not feasible to change the code base to remove the duplicates, so instead we use objcopy to rename external symbols in python_cxx_extensions to <symbol_name>.<unique_identifier>. We’ve successfully linked our binaries this way, but the approach does not work with ThinLTO because there is no easy way to rewrite symbol names in IR.
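
To make the objcopy step concrete, here is a minimal sketch of how the rename table could be generated. It assumes the renaming is driven by objcopy's `--redefine-syms=<file>` option, which reads one `old new` pair per line. The post does not say which objcopy option is actually used, and the helper names, symbol names, and identifier `ext42` are hypothetical.

```python
def build_redefine_map(symbols, unique_id):
    """Return the contents of an objcopy --redefine-syms file.

    Each line has the form "old new"; duplicates in `symbols` are dropped.
    """
    return "".join(f"{name} {name}.{unique_id}\n" for name in sorted(set(symbols)))


def objcopy_command(object_file, map_file, objcopy="objcopy"):
    """Build (but do not run) the objcopy invocation that applies the map in place."""
    return [objcopy, f"--redefine-syms={map_file}", str(object_file)]


if __name__ == "__main__":
    # Hypothetical symbols and identifier, for illustration only.
    print(build_redefine_map(["PyInit_foo", "_ZN3foo3barEv"], "ext42"), end="")
    print(" ".join(objcopy_command("foo.o", "foo.syms")))
```

The same table has to be applied to every object file of the extension that defines or references the symbols, otherwise the renamed definitions and their uses no longer match.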

Our system is such that:

  • We need to be transparent to the user: we cannot change the source code
  • We know in which files the symbols are defined and referenced

Proposal

We’d like to mirror the objcopy approach when using ThinLTO. The UniqueInternalLinkageNames pass already appends an MD5 hash to internal symbol names; we would like to create a similar pass that, given a list of symbols and a list of IR files, rewrites all references to the specified symbols into <symbol_name>.<unique_identifier>.
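
As a rough illustration of the input such a pass would need, the sketch below builds a per-file rename table. Deriving the identifier from a truncated MD5 of the extension name is an assumption made here for illustration, as are the function names; the proposal does not specify how <unique_identifier> is chosen.

```python
import hashlib


def unique_identifier(extension_name, length=8):
    # Hypothetical scheme: a truncated MD5 hex digest of the extension name.
    return hashlib.md5(extension_name.encode("utf-8")).hexdigest()[:length]


def build_rename_plan(extension_name, symbols, ir_files):
    """Map every IR file of one extension to the same old -> new symbol table."""
    suffix = unique_identifier(extension_name)
    table = {name: f"{name}.{suffix}" for name in symbols}
    # Each file gets its own copy so callers can modify one without affecting others.
    return {path: dict(table) for path in ir_files}
```

The identifier is keyed on the extension rather than on each IR file so that a symbol defined in one file and referenced in another file of the same extension receives the same new name in both.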

Questions

  • Aside from extending the UniqueInternalLinkageNames pass, are there any existing tools that we overlooked? There’s https://clang.llvm.org/extra/clang-rename.html but it doesn’t operate at the IR level.
  • Are there alternatives we can use to avoid duplicate symbols at LTO time?

– with @LorenArthur

Our current idea won’t require writing a new pass, but it’s less than efficient:

  1. Compile to bitcode
  2. Run llvm-nm to obtain the mangled symbol names we need to rewrite
  3. Run llvm-dis to obtain the textual IR
  4. Use awk to replace the symbol names
  5. Run llvm-as to obtain the modified bitcode
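
Step 4 is the delicate part, because a plain text substitution can also hit longer names that merely start with the symbol. Below is a minimal sketch of that step, written in Python rather than awk, which replaces only whole global identifiers (`@name` or `@"quoted name"`) in the output of llvm-dis. The function name and rename table are hypothetical. The sketch does not treat comdat names (`$name`), string constants, or comments specially, so real IR may need more care.

```python
import re

# Matches a global identifier in textual LLVM IR: @name or @"quoted name".
_GLOBAL = re.compile(r'@(?:"([^"]*)"|([-a-zA-Z$._][-a-zA-Z$._0-9]*))')


def rewrite_globals(ir_text, renames):
    """Replace whole-identifier occurrences of @old with @new.

    `renames` maps old symbol names to new ones; names not in the table,
    including longer names that share a prefix, are left untouched.
    """

    def substitute(match):
        quoted, bare = match.group(1), match.group(2)
        name = quoted if quoted is not None else bare
        new_name = renames.get(name)
        if new_name is None:
            return match.group(0)  # not in the table: keep the original text
        # Preserve the original quoting style.
        return f'@"{new_name}"' if quoted is not None else f"@{new_name}"

    return _GLOBAL.sub(substitute, ir_text)
```

For example, with the table `{"foo": "foo.abc"}`, `@foo` is rewritten to `@foo.abc` while `@foobar` and `@foo.1` are left alone. The result can then be fed to llvm-as as in step 5.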

You can write an LLVM tool to do that (one that consumes and produces bitcode).