[RFC] Integrated Distributed ThinLTO

kromanova | May 3

Sorry for the late reply, I was on vacation for a few weeks without internet access.

We integrated Distributed ThinLTO with the Buck2 build system (buck2/dist_lto.bzl at main · facebook/buck2 · GitHub) at Meta last year, so distributed ThinLTO enablement is definitely a topic near and dear to my heart. Buck2 first schedules the thin link and calls --thinlto-emit-imports-files to obtain the dynamic dependencies; it then dispatches the dynamic dependencies to a remote executor, which load-balances remote workers for LTO. Finally, the post-optimized bitcode is sent back to the local machine for the final link step.

This sounds similar to the way it was implemented in Bazel, except the final link step is also sent to a remote machine.

We are doing exactly the same thing but within the linker.

Is it possible to integrate dynamic dispatch directly into your build system?

The problem is that our clients (game studios) do not have a specific build system. Most of them use CMake, which generates projects for MSBuild. However, some of them generate projects for Ninja or Make. But in principle our clients could use any build system they need.

So, life is simple for developers who are using Buck or Bazel and want to enable DTLTO. But can you actually estimate what percentage of projects use Bazel or Buck as their build system? I suspect it's not a large share of all existing software projects. We are trying to enable DTLTO adoption for projects that do not use Buck or Bazel, and we are doing it in such a way that it will be very simple: just add one command-line option to the linker!

Conceptually the linker should be transparent to the build environment, just like the compiler, and vice versa.

I totally agree with this statement, but at the same time, giving the linker knowledge about a specific common build environment (or several of them) will allow developers who use those build environments to enable DTLTO very easily. Of course, developers who want to use DTLTO could rewrite their build scripts for Buck or Bazel, but that is not an easy task and requires a lot of time and knowledge, especially for huge projects.

I understand the motivation behind the DTLTO approach. But I have a concern about adding build environment information directly into the linkers. What if this was done such that a script of some sort was passed to the linker to fork with the results of the distributed thin link, and scripts for common build systems could be contributed to the LLVM project repository? The scripts would take the information produced by the thin link step, create the appropriate Makefile/JSON/whatever, and invoke the remote build system, and ensure the native files are in the locations specified to the script by the linker so that it could perform the final link once the forked process completed. This would also make the support easier to add to other linkers.

My worry is that linkers do not have knowledge of the underlying build system and won’t be able to leverage build system characteristics, e.g. caching.
I will reply specifically about caching, but if you are worried about something else, feel free to ask.

We support caching in the linker for DTLTO, and it achieves better performance than the caching provided by the distribution system, because we use internal knowledge about the bitcode files being cached and can compute the cache entry key much faster. We compared our DTLTO caching with the caching provided by SN-DBS (Sony’s proprietary distribution system), and our DTLTO caching is 5x faster.
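
To make the difference concrete, here is a minimal Python sketch, illustrative only (the function names, inputs, and key layout are assumptions, not our implementation), contrasting a content-blind key, which must hash every input byte the way a generic distribution system does, with a summary-aware key of the kind ThinLTO computes from per-module hashes and import lists:

    import hashlib

    def content_blind_key(paths):
        # A generic distribution system has to hash the full contents of
        # every input file to build its cache key.
        h = hashlib.sha256()
        for p in sorted(paths):
            with open(p, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def summary_aware_key(module_hash, import_hashes, codegen_flags):
        # With knowledge of the bitcode, the key can be combined from the
        # precomputed per-module hash, the hashes of imported modules, and
        # the codegen options, without rereading any bitcode. (The real
        # ThinLTO key also folds in symbol resolutions and other state.)
        h = hashlib.sha256()
        h.update(module_hash.encode())
        for ih in sorted(import_hashes):
            h.update(ih.encode())
        h.update(" ".join(codegen_flags).encode())
        return h.hexdigest()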

How does your caching compare to the caching support that is implemented within LTO?

Could someone please point me to official documentation for how Distributed ThinLTO is supposed to be utilized within a build system?

This discussion is interesting to me as a potential user of Distributed ThinLTO, but I cannot find concrete details about the current state of the world (e.g. prior to this RFC) to compare against. The best resources I’ve found are:

Is there an authoritative document that’s up to date?

@teresajohnson - do you have a good pointer for @justincady here?

@justincady I think those are the 2 best resources. Basically @MaskRay’s doc is an updated version of what is in the blog post (with lld and other more advanced options added). That looks up to date to me.

We should probably add a section that discusses this distributed mode to the ThinLTO page in the Clang documentation. I will take an action item to add that. But for now, following the doc @MaskRay wrote should be accurate. Please reach out if you have any questions or issues.

Hi Teresa,
Thank you for your response!

Let me clarify that only a small part of our integrated DTLTO code is located in the linker. We added around 130 LOC to the existing linker files (almost half of it is DTLTO option handling) and around 800 LOC in a new separate file. The majority of the code that we added is on the LLVM side, including an abstract interface for a build script/makefile generator as well as the concrete generator classes for specific build environments.

So, we are not adding build environment information directly to the linker.

I think the approach that you proposed with the script will work. However, we might lose performance because we will not be able to pass generated native object files or cached files from the script to the linker as memory buffers.

Note that if we rewrite our code using the script approach that you suggested, we will still have to add about the same number of lines of code as we added before. So, I don’t completely understand what we would gain.

Knowing all that, do you still think it makes sense to invoke the script from the linker? If so, what is the main motivation for that? We already established that build environment information is not added directly to the linker, so hopefully that is not a concern anymore.

We are using the same algorithm for calculating the cache entry key as ThinLTO. Our distribution system, SN-DBS, doesn’t know anything about the bitcode files and uses the entire file for calculating the cache entry key.

From what I understand after reading all the feedback on our RFC, the biggest obstacle to committing our project to LLVM is that the current implementation keeps the build-platform specifics needed for generating the build scripts in the linker and llvm/lib/LTO.
We discussed it internally and decided to rewrite our implementation following the suggestion Teresa made in one of the comments on this RFC.

Below I’m quoting what Teresa Johnson had proposed. Independently, Tobias Hieta proposed the same thing. “A script of some sort will be passed to the linker to fork with the results of the distributed ThinLink. The scripts would take the information produced by the ThinLink step, create the appropriate Makefile/JSON/etc, and invoke the remote build system, and ensure the native files are in the locations specified to the script by the linker so that it could perform the final link once the forked process completed.”

After making this change, can we start sending our patches upstream?

Sorry for the slow reply. I think that makes sense and is easier to integrate into the linkers.
@MaskRay what are your thoughts from the lld perspective?

Teresa

Hi Teresa,
Thank you so much for your reply!

I’m assuming that your comment can work as an approval for us to start submitting our work upstream… We will need some time to redo our current design and implement a Python script that will be forked from the linker to invoke the distribution system (we are planning to support IceCC), but hopefully in a few weeks we can start submitting patches.

We were also thinking that it’s a good idea to place all the DTLTO functionality into a separate DTLTO subproject (shared library), because in essence, DTLTO is one library with a well-defined interface.
We will need 2 major API functions:

  • A function that converts archives into thin archives (it will need to be called before the scanning phase in the linker);
  • A function that performs the ThinLink and Codegen (it will need to be called after the scanning phase). The input for this function will be a list of preserved symbols and the list of bitcode files; the output will be a list of native object files.

Only a few dozen lines of code will be left in the linker to process the DTLTO parameter, load the shared library, and invoke a few API functions.

Advantages:

  1. All options related to the integrated DTLTO approach (except the one main DTLTO option) will be ignored by the linker and processed by the plugin. That means even less DTLTO option-processing code will be kept in the linker.
  2. All code related to DTLTO functionality (i.e. converting archives of bitcode, caching, and everything else) will be located outside of the linker/compiler projects. Only a few dozen lines of code that load the shared library and invoke its API functions will be in the linker.

Potential problems:
We need to make sure we have a stable API between LLD and the DTLTO plugin.
We don’t anticipate that our simple interface with 2 major API functions will change much. In the unlikely event that it does change, we could do something similar to what is done for LTO.dll (i.e. versioning).
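
To illustrate what such a versioned, stable boundary could look like (all symbol names below are hypothetical and not part of this proposal), here is a small sketch from the linker’s point of view, with Python’s ctypes standing in for the real C++ loader:

    import ctypes

    DTLTO_API_VERSION = 1  # hypothetical version constant

    def load_dtlto_plugin(path):
        # Sketch only: load a hypothetical DTLTO shared library and check
        # its API version before using it, similar to LTO.dll versioning.
        lib = ctypes.CDLL(path)

        # Hypothetical C entry points with stable signatures; the two main
        # functions take response files describing their inputs.
        lib.dtlto_get_api_version.restype = ctypes.c_int
        lib.dtlto_convert_archives.argtypes = [ctypes.c_char_p]
        lib.dtlto_thinlink_and_codegen.argtypes = [ctypes.c_char_p]

        if lib.dtlto_get_api_version() != DTLTO_API_VERSION:
            raise RuntimeError("DTLTO plugin API version mismatch")
        return lib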

Do you approve the idea of putting all DTLTO functionality into a separate DTLTO subproject (shared library)?

@MaskRay can you take a look at the proposal?

@kromanova I’m not the right person to approve the lld proposal, but @MaskRay is a good person to do so.

Hello @MaskRay,
Could you please have a look at the integrated Distributed ThinLTO (DTLTO) proposal and let us know what you think about it? We would prefer to move all DTLTO functionality into a separate DTLTO subproject (shared library), leaving only a few dozen lines of code in the linker (to load the shared library and to invoke a few API functions to perform DTLTO).

I assume you might be busy at work. In that case, maybe you could recommend someone else (who has the authority to approve) to have a look at the proposal?

Sorry, I was out of town and have a large backlog to process.

I am still confused by the proposal. I assume that the initial design has been significantly changed. If so, where is the latest version? Does [RFC] Integrated Distributed ThinLTO - #28 by kromanova contain the gist of the new proposal? Can you please give some commands that end users will invoke?

I hope that my earlier reply ([RFC] Integrated Distributed ThinLTO - #13 by MaskRay) is clear enough about the commands.

Thank you for your reply. Let me rewrite the proposal including all the latest design changes and simplify it where possible.

Hello,
I have updated the RFC for Distributed ThinLTO and simplified it quite a bit. Please have a look and let me know if you have any questions/comments.

Integrated Distributed ThinLTO

Goal

We have customers with LLVM-based toolchains that use a variety of different build and distribution systems including but not limited to our own. We’d like to implement support for an integrated Distributed ThinLTO (DTLTO) approach to improve their build performance, while making adoption of DTLTO seamless, as easy as adding a couple of options on the command line.

1. Challenges

DTLTO is more complex to integrate into existing build systems than ThinLTO, because build rule dependencies are not known in advance; they become available only after DTLTO’s ThinLink phase completes and the list of import files for each bitcode file is known.

1.1 For some high-level build systems (such as Bazel or Buck), integration with DTLTO is not so challenging, since there is a way to overcome the problem of dynamic dependencies by pruning everything that is not needed. These build systems start off with every DTLTO backend compile depending on every input module, but after the ThinLink step is finished and the actual dependencies are known, they use that information to prune down those lists. Unfortunately, very few build systems have this capability.

1.2 For all other build systems, a non-trivial rewrite of a project’s buildscript/makefile is required. Build/Makefile developers must do the following steps to enable DTLTO for a project.

(a) unpack archives and place their members between a --start-lib/--end-lib pair on the linker command line (note: this task is even more challenging than it seems on the surface, since it requires preventing name collisions when the same archives are unpacked by different parallel processes)

(b) invoke the ThinLink step

(c) after the ThinLink step is completed and the dependencies become known (e.g. by parsing the content of the import files), a script needs to be written to generate a set of codegen command lines; this set of command lines then needs to be fed to the distribution system, and all the dependencies have to be copied to the particular remote machines (a minimal sketch of this step is shown after this list)

(d) after the distribution system returns the result of the compilation, the buildscript/makefile has to identify which files failed to compile and redo the compilation for those files on the local machine

(e) perform a final link phase, linking all the native object files.
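
For step (c), a minimal sketch could look like the following. It assumes the usual distributed-ThinLTO file conventions (<module>.thinlto.bc index files and <module>.imports files emitted by --thinlto-emit-imports-files); the output naming and job structure are illustrative only:

    #!/usr/bin/env python3
    # Turn the .imports files produced by the ThinLink into backend codegen
    # command lines plus the per-job list of files to ship to a remote node.
    import sys

    def codegen_jobs(bitcode_files, clang="clang", opt_level="-O2"):
        jobs = []
        for bc in bitcode_files:
            with open(bc + ".imports") as f:
                imports = [line.strip() for line in f if line.strip()]
            native = bc + ".native.o"
            cmd = [clang, opt_level, "-x", "ir", bc, "-c",
                   "-fthinlto-index=" + bc + ".thinlto.bc", "-o", native]
            # Everything the remote node needs: the module itself, its
            # individual summary index, and the modules it imports from.
            inputs = [bc, bc + ".thinlto.bc"] + imports
            jobs.append({"cmd": cmd, "inputs": inputs, "output": native})
        return jobs

    if __name__ == "__main__":
        for job in codegen_jobs(sys.argv[1:]):
            print(" ".join(job["cmd"]))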

We are not aware of any large-scale project that is using DTLTO (with the exception of projects built under Bazel or Buck), simply because modifying existing build scripts/makefiles to do the steps mentioned above would be very difficult.

2. Our solution

Simply put, our DTLTO project will orchestrate all the steps described in section 1.2.

We are planning to place all the DTLTO functionality into a separate DTLTO subproject (shared library), because in essence, DTLTO is one library with a well-defined interface.
We will need 2 major API functions:

  • A function that converts archives into thin archives (it will need to be called before the scanning phase in the linker);
  • A function that performs the ThinLink and Codegen (it will need to be called after the scanning phase). The input for this function will be a list of preserved symbols and the list of bitcode files; the output will be a list of native object files.
    • First, the ThinLink will be invoked.
    • The dynamic dependencies created by the ThinLink step will be determined. A generic JSON file containing the list of compilation command lines will be created; it will also contain the locations of the output native object files and the list of dependencies to be copied to the remote node (a sketch of this file follows below).
    • A custom script of some sort will be passed to the DTLTO shared library, and it will be spawned as a separate process. [Note: each distribution system will require its own custom script.] The script would take the information in the generic JSON file generated in the previous step, convert it to the custom (distribution-system-specific) Makefile/FASTBuild .bff file/JSON/Incredibuild XML/etc., and invoke the remote build system.
    • The final link will be executed once the spawned process completes.
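
For concreteness, the generic job description could look roughly like the following; the exact schema is not part of this RFC, and the field names are purely illustrative:

    import json

    # Illustrative schema for the generic build description handed to the
    # distribution-system-specific script.
    jobs = {
        "version": 1,
        "jobs": [
            {
                "command": ["clang", "-O2", "-x", "ir", "a.bc", "-c",
                            "-fthinlto-index=a.bc.thinlto.bc",
                            "-o", "a.native.o"],
                "inputs": ["a.bc", "a.bc.thinlto.bc", "b.bc"],  # b.bc is imported from
                "output": "a.native.o",
            },
        ],
    }

    with open("dtlto.jobs.json", "w") as f:
        json.dump(jobs, f, indent=2)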

Only a few dozen lines of code will be added to the linker to process the DTLTO parameters, load the shared library, and invoke a few API functions.

Note: if a new distribution system must be supported, interested parties will have to implement their own custom script (or some kind of child process, e.g. an executable) that has knowledge of that particular distribution system. This script or child process will be spawned from the DTLTO shared library.

As part of our upstreaming efforts, we will provide a script that supports the IceCream distribution system. That script can serve as an example for other developers of what needs to be done to support a different distribution system.
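
Such a script can be quite small. The following sketch assumes the illustrative JSON schema above and that the backend compiles can be wrapped with icecc like ordinary compile jobs; it simply runs each command through Icecream and reports which outputs did not come back so the DTLTO library can fall back to local codegen:

    #!/usr/bin/env python3
    # Illustrative Icecream driver: read the generic job file and run each
    # backend compile through icecc, letting the Icecream scheduler pick a
    # node. Non-zero exit codes are reported so the caller can fall back.
    import json
    import subprocess
    import sys
    from concurrent.futures import ThreadPoolExecutor

    def run(job):
        # icecc is used as a compiler wrapper: `icecc <compiler> <args...>`.
        return subprocess.call(["icecc"] + job["command"])

    def main(job_file):
        with open(job_file) as f:
            jobs = json.load(f)["jobs"]
        with ThreadPoolExecutor(max_workers=32) as pool:
            results = list(pool.map(run, jobs))
        failed = [j["output"] for j, rc in zip(jobs, results) if rc != 0]
        if failed:
            print("failed:", " ".join(failed), file=sys.stderr)
        return 1 if failed else 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1]))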

3. Overview of existing popular Open Source & Proprietary systems that could be used for ThinLTO codegen distribution

Some or all of these systems could potentially be supported, bringing a lot of value to ThinLTO customers who have already deployed one of them.

  • Distcc
  • Icecream. We are planning to provide support for IceCream as a part of our upstreaming efforts.
  • FASTBuild. Several parties have shown interest in supporting the integrated DTLTO solution with FASTBuild. Unfortunately, we anticipate that FASTBuild will not perform as well when integrated with DTLTO as the other distribution systems mentioned here, because FASTBuild doesn’t support load balancing. The performance of DTLTO integrated with FASTBuild might get even worse when a project runs several link processes in parallel.
  • Incredibuild. Incredibuild is one of the most popular proprietary build systems.
  • SN-DBS. SN-DBS is a proprietary distributed build system developed by SN Systems, which is part of Sony. SN-DBS uses job description documents in the form of JSON files for distributing jobs across the network. At Sony, we already have a production-level DTLTO implementation using SN-DBS. Several of our customers effortlessly switched from ThinLTO to Distributed ThinLTO.

4. Challenges & problems

This section describes the challenges that we encountered when implementing DTLTO integration with our proprietary distributed build system, SN-DBS. All of these problems apply to DTLTO integration with any distribution system in general. The solutions to these problems are described in detail in Section 6.

4.1 Regular archives handling

Archive files can be huge. It would be too time-consuming to send the whole archive to a remote node. One of the solutions is to convert regular archives into thin archives and send individual thin archive members to the remote machines.

4.2 Synchronizing file system access between processes

Since at any given moment several linker processes can be active on a given machine, they can access the same files at the same time. We need to provide a reliable synchronization mechanism. The existing LLVM file access mutex is not adequate since it does not have a reliable way to detect abnormal process failure.
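
As an illustration of the kind of mechanism we have in mind (a POSIX-only sketch; a real implementation needs Windows support and more careful handling of races), a lock file that records the owner’s PID allows a lock left behind by a crashed process to be detected and reclaimed:

    import errno
    import os
    import time

    class PidLock:
        # Advisory lock file that stores the owning PID so that a stale
        # lock from a crashed process can be detected and broken.
        def __init__(self, path):
            self.path = path

        def _owner_alive(self):
            try:
                pid = int(open(self.path).read())
            except (OSError, ValueError):
                return False
            try:
                os.kill(pid, 0)          # signal 0: existence check only
            except ProcessLookupError:
                return False
            except PermissionError:
                return True              # exists, owned by another user
            return True

        def acquire(self, poll=0.1):
            while True:
                try:
                    fd = os.open(self.path,
                                 os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                    os.write(fd, str(os.getpid()).encode())
                    os.close(fd)
                    return
                except OSError as e:
                    if e.errno != errno.EEXIST:
                        raise
                if not self._owner_alive():
                    try:
                        os.unlink(self.path)   # break the stale lock
                    except FileNotFoundError:
                        pass
                else:
                    time.sleep(poll)

        def release(self):
            os.unlink(self.path)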

4.3 File name clashes

We can have situations where file names can clash with each other. We need to provide file name isolation for each individual link target.

4.4 Remote execution is not reliable and can fail at any time

We need to provide a fallback system that can do code generation on a local machine for those modules where remote code generation failed.

5. Linker execution flow

5.1 Linker options

The following options need to be added:

  • An option that tells the linker to use the integrated DTLTO approach and specifies which distribution system to use: --distribute=[icecream/distcc/fastbuild/incredibuild]
  • Options for debugging and testing.

5.2. The linker will invoke DTLTO’s API function for archive conversion (Pre-SCAN Phase)

This is what this API function will do:

If an input file is a regular archive:

  • Convert the regular archive into a thin archive. If the regular archive contains another regular archive, it will be converted to a thin archive during the next linker scan pass.
  • Replace the path to the regular archive with the path to the thin archive.

After the scan phase has completed, the linker has determined a list of input bitcode modules that will participate in the final link. Also, by now, the linker has collected all symbol resolution information.

5.3. The linker will invoke DTLTO’s API function to perform ThinLink and Codegen (Post-SCAN Phase)

This is what this API function will do:

  • Invoke the ThinLink step. Individual module summary index files and cross-module import files will be produced.
  • Check whether any of the input bitcode files have a corresponding cache entry. If a cache entry exists, that bitcode file is excluded from code generation.
  • Generate a generic build description (JSON file). This JSON file contains the list of compilation command lines, the locations of the output native object files, and the list of dependencies to be copied to the remote node.
  • Invoke the custom script that has knowledge of the specifics of a particular distribution system. That script needs to be written by developers who plan to support DTLTO integration with that distribution system and is passed to the DTLTO shared library as a parameter. The script takes the information in the generic JSON file generated in the previous step, converts it to the custom (distribution-system-specific) Makefile/FASTBuild .bff file/JSON/Incredibuild XML/etc., and invokes the remote build system.
  • Check that the list of expected native object files matches the list of files returned after build script execution. If any native object files are missing, the DTLTO shared library uses the fallback system to perform code generation locally for the missing files.
  • Place native object files into corresponding cache entries.
  • Perform the final link and produce an executable.
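
Tying these steps together, a stripped-down sketch of the post-scan flow might look like the following. This is flow only: the tool invocations, file names, and cache check are illustrative stand-ins (the real logic lives in the DTLTO library, computes proper cache keys, and can pass buffers to the linker in memory):

    #!/usr/bin/env python3
    # Flow-only sketch: thin link, cache check, job generation, remote
    # execution via the distribution-system script, local fallback, final
    # link. All commands and names are illustrative.
    import json
    import os
    import subprocess
    import sys

    def post_scan(bitcode, dist_script, out="program.elf"):
        # 1. ThinLink: emit per-module index and imports files.
        subprocess.check_call(
            ["ld.lld", "--thinlto-index-only", "--thinlto-emit-imports-files"]
            + bitcode + ["-o", os.devnull])

        # 2. Skip modules whose native object is already cached (cache key
        #    details omitted; see the earlier caching sketch).
        todo = [bc for bc in bitcode if not os.path.exists(bc + ".native.o")]

        # 3. Write the generic job description.
        jobs = [{"command": ["clang", "-O2", "-x", "ir", bc, "-c",
                             "-fthinlto-index=" + bc + ".thinlto.bc",
                             "-o", bc + ".native.o"],
                 "output": bc + ".native.o"} for bc in todo]
        with open("dtlto.jobs.json", "w") as f:
            json.dump({"version": 1, "jobs": jobs}, f)

        # 4. Hand the jobs to the distribution-system-specific script.
        subprocess.call([sys.executable, dist_script, "dtlto.jobs.json"])

        # 5. Fallback: compile locally whatever did not come back.
        for job in jobs:
            if not os.path.exists(job["output"]):
                subprocess.check_call(job["command"])

        # 6. Final link of the native object files.
        natives = [bc + ".native.o" for bc in bitcode]
        subprocess.check_call(["ld.lld"] + natives + ["-o", out])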

6. Implementation details

6.1 Regular to Thin Archive converter

In section 4.1 we explained why dealing with regular archives is inefficient and proposed converting regular archives into thin archives, later copying only individual thin archive members to remote nodes.

We implemented a regular-to-thin archive converter based on llvm/Object/Archive.h.

  • The regular to thin archive converter creates or opens an inter-process sync object.
  • It acquires sync object lock.
  • It determines the directory into which to unpack the regular archive members. This decision is based on a command-line option, the system temp directory, or the current process directory (in that priority order).
  • If the thin archive doesn’t exist:
    • Unpack the regular archive
    • Create the thin archive from regular archive members
  • Else:
    • Check the thin archive file modification time
    • If (the thin archive is newer than the regular archive) && (the thin archive integrity is good):
      • Use existing thin archive
    • Else:
      • Unpack the regular archive
      • Create the thin archive from regular archive members.

Note: all thin archive members match the regular archive members.
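
For illustration, here is a rough equivalent of the converter flow that shells out to llvm-ar instead of using llvm/Object/Archive.h directly (the proposed implementation uses the library API; inter-process locking is omitted here, see the lock sketch in section 4.2, and the thin-archive naming is an assumption):

    #!/usr/bin/env python3
    # Illustrative regular-to-thin archive conversion via llvm-ar.
    import os
    import subprocess

    def convert_to_thin(regular, unpack_dir):
        thin = os.path.join(unpack_dir, os.path.basename(regular) + ".thin.a")

        # Reuse an existing thin archive if it is newer than the regular
        # archive and still readable.
        if (os.path.exists(thin)
                and os.path.getmtime(thin) > os.path.getmtime(regular)
                and subprocess.call(["llvm-ar", "t", thin],
                                    stdout=subprocess.DEVNULL) == 0):
            return thin

        # Unpack the regular archive members into the chosen directory ...
        os.makedirs(unpack_dir, exist_ok=True)
        subprocess.check_call(["llvm-ar", "x", os.path.abspath(regular)],
                              cwd=unpack_dir)
        members = [os.path.join(unpack_dir, name) for name in
                   subprocess.check_output(["llvm-ar", "t", regular],
                                           text=True).splitlines()]

        # ... and build a thin archive that refers to them (T = thin).
        subprocess.check_call(["llvm-ar", "rcsT", thin] + members)
        return thin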

6.2 Fallback code generation

In section 4.4 we described the problem that remote execution is not as reliable as local execution and can fail at any time (e.g. the network is down, remote nodes are not accessible, etc.). So, we need to implement a reliable fallback mechanism that performs code generation on the local machine for all modules that failed to build remotely.

  • Check whether the list of missing native object files is non-empty.
  • Create a queue of commands for performing codegen for the missing native object files.
  • Use a process pool to execute the queue of commands.
  • Report a fatal error if some native object files are still missing.
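
A minimal sketch of this fallback, using the same illustrative job schema as above and a process pool to run the failed backend compiles locally:

    import os
    import subprocess
    import sys
    from concurrent.futures import ProcessPoolExecutor

    def local_codegen(jobs):
        # Rerun locally every job whose native object file is missing.
        missing = [j for j in jobs if not os.path.exists(j["output"])]
        if not missing:
            return
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            list(pool.map(subprocess.call, [j["command"] for j in missing]))
        still_missing = [j["output"] for j in missing
                         if not os.path.exists(j["output"])]
        if still_missing:
            sys.exit("fatal error: could not generate: "
                     + ", ".join(still_missing))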

7. Usage

With our integrated DTLTO approach it will be trivial for a build master/Makefile developer to enable DTLTO for any kind of project, no matter how complex it is. All that is required is to add one additional option, --distribute=[icecream/distcc/fastbuild/incredibuild], on the linker command line.

Let’s say we have this rule to perform linking of an executable using ThinLTO:

program.elf: main.bc.o libsupport.a
       lld --lto=thin main.bc.o -lsupport -o program.elf

If we want to start using DTLTO with Icecream as the distribution system, we can simply add --distribute=icecream to the list of linker options. Our integrated DTLTO approach will take care of everything, from handling archives of bitcode files (which are not supported by the current DTLTO implementation), to caching, to handling situations where the distribution system fails to produce a native object file (the fallback mechanism).

program.elf: main.bc.o libsupport.a
       lld --lto=thin --distribute=icecream main.bc.o -lsupport -o program.elf

Thanks for the updated RFC. So the usage is similar to in-process LTO (implicit LTO), but ThinLink and backend compilations are handled by a distributed system.

  2. Our solution

We are planning to place all the DTLTO functionality into a separate DTLTO subproject (shared library), because in essence, DTLTO is one library with a well-defined interface.

Sounds good!

  5. Linker execution flow

An option that tells the linker to use the integrated DTLTO approach and specifies which distribution system to use: --distribute=[icecream/distcc/fastbuild/incredibuild]

I assume that the value of --distribute= is opaque to lld and the distributed system handles icecream/distcc/fastbuild/incredibuild?
Then it looks good to me. We do not want build system customization in the linker.

5.2. The linker will invoke DTLTO’s API function for archive conversion (Pre-SCAN Phase)

5.3. The linker will invoke DTLTO’s API function to perform ThinLink and Codegen (Post-SCAN Phase)

Looks good.

Hello,

So the usage is similar to in-process LTO (implicit LTO), but ThinLink and backend compilations are handled by a distributed system.

From the user’s perspective, the usage for DTLTO is the same as the usage for ThinLTO, with the exception that one additional option (--distribute=) needs to be added on the linker command line to perform DTLTO and to specify the name of the distribution system to use.

The DTLTO shared library will handle the ThinLink. The backend compilations will be handled by the distribution system.

I assume that the value of --distribute= is opaque to lld and the distributed system handles icecream/distcc/fastbuild/incredibuild?
Then it looks good to me. We do not want build system customization in the linker.

Yes, the --distribute= option is opaque to lld. When lld encounters this option, the linker will simply load the DTLTO shared library and call its API functions. The DTLTO shared library will get the name of the distribution system and do the customization for that particular distribution system.


Can we consider the Integrated DTLTO RFC approved, or is something else required to be changed? From whom do we need to get formal approval for the RFC?

Once approved, we will immediately start working on the implementation of DTLTO coupled with Icecream and on redesigning our current implementation to comply with the latest proposed RFC. It’s a few months’ worth of work for our small team of two people, so we want to make sure that the RFC is approved before we commence the work.

So, there are primarily changes to the 3 components:

  • llvm/lib/LTO: seems that @teresajohnson is happy.
  • lld/ELF: approved. I think the direction is good. I assume that in lld/ELF/LTO.cpp, we need a third mode besides createWriteIndexesThinBackend and createInProcessThinBackend.
  • dtlto/ (new for the DTLTO subproject):

We are planning to place all the DTLTO functionality into a separate DTLTO subproject (shared library), because in essence, DTLTO is one library with a well-defined interface.

I think there is general concern when a large subproject wants to be integrated into llvm-project (e.g. flang, bolt); several people have expressed concern about the repo size.
But I assume that DTLTO will not have too much code, so I think this is fine. :)

I guess it is best to notify admins, @tstellar @tra ?

Do I understand correctly that this new library is like a compiler plugin to make the linker/compiler integrate better into distributed build systems?

I was thinking that instead of a dtlto/ folder at the same level as lld/ and llvm/, we should create our subproject in llvm/lib/DTLTO/. What do you think? Maybe someone has a better idea of where to place our project?

This is exactly right.