# [RFC] Distributed ThinLTO Build for Kernel

Authors: Rong Xu

Contributors: Teresa Johnson, Sriraman Tallam, and David Li

## Summary

We propose to add distributed ThinLTO build support to the Linux kernel. This build mode is more user-friendly for developers and offers greater convenience for objtool and livepatch integration.

## Background

ThinLTO is a compiler optimization in LLVM that is scalable and offers better performance [ThinLTO paper]. The ThinLTO build process consists of four phases:

  1. Pre-link compilation - IR objects are generated.
  2. Thin-link - Use program summaries from the pre-link compilation phase for global analyses.
  3. ThinLTO backend (BE) compilation - Backend optimizations are performed in parallel, with an expanded scope (imported modules). We call this BE compilation.
  4. Final link - The final binary is generated.

ThinLTO has two build modes:

  • In-process: Thin-link and ThinLTO BE compilation are invoked through the linker. The BE compilation is multi-threaded.
  • Distributed: Thin-link is invoked through the linker and generates ThinLTO index files. ThinLTO BE compilation is separate from the linker; the build system / makefile invokes it explicitly with the ThinLTO index files and pre-link IR objects.

Note that “distributed” here is simply the term that distinguishes this mode from in-process ThinLTO builds, which invoke the backend compilation through the linker; it does not necessarily mean building in a distributed environment.

For a non-LTO build:

> $CC $CFLAGS <srcs> -c <srcs> # Generate objects
> $LD $LDFLAGS <objs> -o <binary> # Link

An in-process ThinLTO build is:

> $CC $CFLAGS -flto=thin -c <srcs> # Generate IR files
> $LD $LDFLAGS <IR_files> -o <binary> # Thin-link, BE compile, and final link

In contrast, a distributed ThinLTO build breaks the process into multiple steps:

> $CC $CFLAGS -flto=thin -c <srcs> # Generate IR files
> $LD $LDFLAGS <IR_files> --thinlto-index-only # Thin-link
> $CC $CFLAGS -x ir -fthinlto-index=<thinlto_bc_file> <IR_file> # BE compile
> $LD $LDFLAGS <final_objects> -o <binary> # Final link
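
As a concrete illustration, the four steps might look like this for a toy two-file program (a.c and b.c are hypothetical, purely for illustration):

> $CC -O2 -flto=thin -c a.c b.c # Pre-link: emits IR objects a.o and b.o
> $LD -r a.o b.o --thinlto-index-only # Thin-link: writes a.o.thinlto.bc and b.o.thinlto.bc
> $CC -O2 -x ir a.o -fthinlto-index=a.o.thinlto.bc -c -o a.final.o # BE compile
> $CC -O2 -x ir b.o -fthinlto-index=b.o.thinlto.bc -c -o b.final.o # BE compile
> $CC a.final.o b.final.o -o prog # Final link via the compiler driver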

While the in-process mode is generally easier to integrate into the build system, and potentially faster due to reduced file I/O, the distributed build mode offers several advantages:

  • Customization: Each BE compilation job can have specific compiler options, unlike the in-process build’s single, unified BE compilation options.
  • Developer Control and Visibility: Developers have control over each sub-step and access to the final objects.
  • Scalability: Backend compilations can be distributed across multiple machines, accelerating the build process. This can be done in the Blaze build system.
  • Resource Efficiency: It requires less RAM, which is crucial for large applications with high memory demands during linking.

ThinLTO builds at Google predominantly utilize a distributed mode. This build mode has been thoroughly tested with a wide range of applications.

The kernel Kbuild system benefits greatly from the distributed build mode. This is because:

  • Each compilation step is recorded the same way as in non-LTO builds, facilitating easy debugging.
  • Specific compilation options are maintained throughout the process, in contrast to in-process ThinLTO (and full LTO), which use unified BE options.
  • The final object files are available for tools such as objtool and KPatch.

## High level overview

Here is an overview of the Kbuild build. vmlinux is the final kernel image. vmlinux.o is the relocatable ELF object. vmlinux.a is a thin archive of built-in.a, lib/lib.a, and arch/<arch>/lib.a.

vmlinux <- vmlinux.o <- vmlinux.a <- built-in.a <- kernel/built-in.a <- kernel/%.o
                                                <- net/built-in.a <- net/%.o
                                                <- …/built-in.a <- …/%.o
%.o <- %.c

For the in-process build mode, the ld.lld invocation happens between vmlinux.a and vmlinux.o:

> ld.lld -m elf_x86_64 <...> -z noexecstack -r -o vmlinux.o -T .tmp_initcalls.lds --whole-archive vmlinux.a --no-whole-archive

In the ThinLTO distributed build, compilation requires two steps: generating IR files, followed by compiling those IR files into final object (.o) files. This creates a need to differentiate between these intermediate and final files.

While we could use a distinct suffix for IR files (e.g., .ir_o) and keep .o for the final objects, this conflicts with existing Makefile rules. Many rules are structured like:

$(obj)/crc32.o: $(obj)/crc32table.h

These rules assume the .o file depends directly on source or header files. With ThinLTO's two-stage process, this assumption breaks. Therefore, merely renaming the IR files is insufficient; we must modify the Makefile rules to correctly represent the dependency on the intermediate IR file and the new build steps.

Alternatively, we can use a different suffix for the final object files. This is the approach I’ve chosen.

  • .o: the file type for an IR file.
  • .a: the (thin)archive for IR files.
  • .final_o: the file type for the final object.
  • .final_a: the (thin)archive for the final objects.
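
As a rough illustration of how these suffixes could be wired together (a sketch only, not the actual patch; be_flags stands for the filtered backend options discussed under "BE compilation options handling" below):

```make
# Illustrative only: build a final native object from an IR object plus the
# ThinLTO index produced for it by the thin-link.
$(obj)/%.final_o: $(obj)/%.o $(obj)/%.o.thinlto.bc
	$(CC) $(be_flags) -x ir -fthinlto-index=$<.thinlto.bc -c -o $@ $<
```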

The workflow of distributed ThinLTO is the following:

vmlinux
   <- vmlinux.o
        <- vmlinux.final_a
              <- built-in.final_a <- kernel/built-in.final_a <- kernel/%.final_o
                                  <- net/built-in.final_a <- net/%.final_o
                                  <- …/built-in.final_a <- …/%.final_o

%.final_o <- %.o
          <- %.o.thinlto.bc

%.o.thinlto.bc <- vmlinux.a <- built-in.a <- kernel/built-in.a <- kernel/%.o
                |                         <- net/built-in.a <- net/%.o
                |                         <- …/built-in.a <- …/%.o
            thin-link
%.o <- %.c

The main challenge in enabling distributed ThinLTO is its integration into Kbuild. Our approach involves several modifications:

  • A new Kconfig option, CONFIG_LTO_CLANG_THIN_DIST, will be introduced alongside the existing CONFIG_LTO_CLANG_FULL and CONFIG_LTO_CLANG_THIN settings (a rough sketch follows this list).
  • The top-level Makefile will be updated with a new macro specifically for generating vmlinux.o. This macro activates the new distributed ThinLTO workflow, conditional on CONFIG_LTO_CLANG_THIN_DIST being enabled.
  • The core of the implementation leverages the existing Kbuild infrastructure to perform two distinct recursive passes through the sub-directories. The first pass generates LLVM IR object files (the same as in-process ThinLTO). After the thin-link, the second pass compiles these IR files into final native object files. The necessary build rules and actions for this two-pass system are added primarily within scripts/Makefile.build.
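
For illustration only, the new Kconfig entry could be modeled on the kernel's existing LTO_CLANG_THIN option; the exact dependencies, placement, and help text in the actual patch may well differ:

```
config LTO_CLANG_THIN_DIST
	bool "Clang ThinLTO, distributed mode (EXPERIMENTAL)"
	depends on HAS_LTO_CLANG
	select LTO_CLANG
	help
	  Build the kernel with distributed ThinLTO: the thin-link, the
	  BE compilations, and the final link run as separate Kbuild steps
	  instead of being driven from inside the linker.
```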

Note that the current patch focuses solely on the main kernel image (vmlinux) and does not yet support building kernel modules with this method. Module support is planned as a follow-up patch upon acceptance of this initial work.

## Detailed Implementation

The patch, linked here, will be sent to the kernel upstream for review shortly. Specific issues resolved in this patch include:

### 1. Archive object handling in the thin-link

We want to create a separate index file (i.e., .o.thinlto.bc) for each source file. However, distributed ThinLTO's limited support for static archives prevents us from using vmlinux.a directly; doing so would generate index files associated with the archive rather than with its individual members, with names like 'vmlinux.a(fork.o at 1307358).thinlto.bc'. We cannot use these in the BE compilation.

Our solution is to extract a list of all member object files from vmlinux.a into a temporary file and provide this list to the thin-link.

While standard archive linking often uses --start-lib and --end-lib, vmlinux.o is linked using the --whole-archive option. This flag forces the inclusion of all objects within the archive, bypassing standard archive semantics. Consequently, providing an explicit list containing every object file from vmlinux.a is functionally equivalent to using the archive with --whole-archive in this specific context.

The command line for the thin-link:

> llvm-ar t vmlinux.a > .vmlinux_thinlto_bc_files
> ld.lld -m elf_x86_64 <...> -z noexecstack -r --thinlto-index-only -T .tmp_initcalls.lds @.vmlinux_thinlto_bc_files

### 2. BE compilation options handling

Passing all compiler options from the initial IR (front-end) generation directly to the BE compilation results in errors. To prevent this, certain incompatible options must be filtered out. Our implementation removes the following options:

  • -flto=thin
  • -D%
  • %.h.gc and %.h
  • -Wp%
  • <linux-includes>

We also add -Wno-unused-command-line-argument to suppress the warnings.
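
A rough sketch of this filtering in GNU Make, assuming the per-object front-end options are available in a variable such as c_flags (the variable name and patterns here are illustrative, not the actual patch):

```make
# Illustrative only: strip frontend-only arguments before the BE compile.
# The header and <linux-includes> arguments from the list above would be
# filtered out in the same way (omitted here for brevity).
be_flags := $(filter-out -flto=thin -D% -Wp%,$(c_flags))
be_flags += -Wno-unused-command-line-argument
```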

### 3. Non-IR file handling

Assembly files (.S) and source files explicitly compiled with -fno-lto bypass ThinLTO IR generation; they produce standard object files instead. Passing these standard objects to the ThinLTO BE compilation would cause errors.

To prevent this, we implement a check: before invoking the BE compilation on an .o file, we verify its type (using the file command). If the file contains LLVM IR bitcode, we proceed with the BE compilation. Otherwise, we simply create a link from the .o file to the .final_o file.
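
A hedged sketch of this check in shell, using a hypothetical kernel/fork.o as input (the exact string printed by the file command may vary between versions):

```sh
# Illustrative only: BE-compile real IR objects, pass plain ELF objects through.
obj=kernel/fork.o
if file -b "$obj" | grep -q 'LLVM IR bitcode'; then
	clang $be_flags -x ir "$obj" -fthinlto-index="$obj.thinlto.bc" \
		-c -o "${obj%.o}.final_o"
else
	# Assembly or -fno-lto output: just link the .o to its .final_o name.
	ln -sf "$(basename "$obj")" "${obj%.o}.final_o"
fi
```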

### 4. Module path problems

In ThinLTO BE compilation, the compiler imports modules using the module table found in the .thinlto.bc file. It correlates the BE command-line name with the module_id created during the preceding thin-link phase.

Typically, the identifier used for the primary module matches the identifier used when it is referenced for import. This assumption is broken in some kernel compilation cases. For example, the primary module during BE compilation is a path like arch/x86/../../virt/kvm/kvm_main.o, whereas the list used by the thin-link records the same module under a simplified, canonical path: virt/kvm/kvm_main.o.

This mismatch causes build errors because the compiler incorrectly treats the module as non-primary, resulting in undefined symbol errors during the final link.

Whether this behavior constitutes a compiler bug is debatable; I've worked around it by ensuring Kbuild consistently uses the simplified path names to avoid the mismatch.

## Comparing distributed build and in-process build

### Option handling

ThinLTO in-process mode uses a single, unified set of options for all backend compilations. While some options can be passed via metadata from the frontend, conflicting options from different modules force LTO to choose one configuration, potentially changing the options for some modules. The distributed mode addresses this issue by reusing each module's exact options.

For example, CFLAGS_jitterentropy.o = -O0 (for crypto/jitterentropy.o) and ccflags-y += -O3 in lib/lz4/Makefile are not honored in the in-process ThinLTO BE build, but they are preserved in the distributed BE build.

Another difference is -ffunction-sections and -fdata-sections: these two options are unconditionally turned on and cannot be turned off in in-process ThinLTO, which has a large impact on compilation time. In distributed ThinLTO, these options are independent of LTO.

Lastly, in-process ThinLTO ignores the kernel’s -falign-loops=1 compilation option, causing all loops to use the default alignment of 16 bytes (for x86-64 builds) instead. This issue needs to be fixed in LLVM / lld.

Once the issues above are fixed, in-process and distributed ThinLTO produce identical code for each function. However, a section-by-section comparison shows that the function order differs within a few specific sections, namely:

  • .altinstr_aux
  • .altinstr_replacement
  • .init.text (3226 functions)
  • .noinstr.text (117 functions)
  • .text (63 functions)

Note that option -ffunction-sections was used in this comparison. This means most functions are in separate, identical sections. Furthermore, within the executable sections (like .init.text, .noinstr.text and .text) containing multiple functions, the function code itself is identical; only their order differs. Also, the experiments described above were performed using defconfig on the x86_64 architecture. The compilation results may be different for other configurations.

### Build time performance

All the timing data were collected using the following configuration:

  • Intel Xeon @2.60G (Ice Lake), 48 physical cores (96 logical cores)
  • 240GB RAM
  • Clang/ld.lld: LLVM-20.1.0
  • Kernel: commit 1e26c5e28ca5821a824e90dd359556f5e9e7b89f (Mar 25, 2025)
  • Use defconfig (x86_64)
  • Build command: make LLVM=1 vmlinux -j 96
  • Average of 5 runs

Build times in seconds were measured for the non-LTO, in-process ThinLTO, and distributed ThinLTO methods. Distributed ThinLTO increased build time by 33% compared to the non-LTO baseline, while in-process ThinLTO increased it by 168%.

| Time (s) | Non-LTO | In-process ThinLTO | Distributed ThinLTO |
|----------|---------|--------------------|---------------------|
| real     | 44.30   | 118.71             | 58.88               |
| user     | 2222.23 | 2278.43            | 2324.16             |
| sys      | 322.93  | 324.78             | 441.47              |

In-process ThinLTO limits backend jobs to the number of physical CPU cores and enforces the use of function-sections and data-sections. Considering these constraints, the following provides a fairer comparison between in-process and distributed ThinLTO. The total running times were approximately the same, but the distributed method showed higher system time usage, most likely due to increased file I/O.

| Time (s) | In-process (--thinlto-jobs=96) | Distributed, function-sections on |
|----------|--------------------------------|-----------------------------------|
| real     | 119.10                         | 120.65                            |
| user     | 2352.93                        | 2398.40                           |
| sys      | 335.06                         | 447.39                            |

Here is a breakdown of the time taken by different stages within the distributed ThinLTO build.

| Time (s) | FE      | Thin-link | BE     | Final link | Final link (w/ function-sections) |
|----------|---------|-----------|--------|------------|------------------------------------|
| real     | 30.94   | 4.46      | 10.42  | 16.30      | 83.95                              |
| user     | 2053.65 | 8.79      | 308.09 | 19.16      | 84.83                              |
| sys      | 319.34  | 50.02     | 124.78 | 49.69      | 54.61                              |

I noticed that the link to the referenced ThinLTO paper doesn’t work. Here is a link to it in the ACM and IEEE libraries:
https://dl.acm.org/doi/10.5555/3049832.3049845

Teresa

Thanks Teresa. I’ve updated the links to use ACM URLs.

This is really cool! I wish you all the best of luck in upstreaming ThinLTO build support into KBuild. :slight_smile:

What kind of feedback are you looking for from the LLVM community? I’m assuming what we can do to help is to review the proposal, look for rough edges that the kernel community might object to, and suggest changes that would help make it go down smoothly over there.

As an example, do we need to force on -ffunction-sections / -fdata-sections in integrated ThinLTO? I vaguely recall being involved in approving patches to do something like this, and it seemed like a good idea at the time because we don’t encode these settings in IR metadata and function sections are super cheap relative to ThinLTO. If you’re pulling out all the stops to do LTO, you probably want the low-hanging fruit of section GC. I’m surprised you call it out as having a major impact on compile time, but I could see things being different in C vs C++ codebases.

I expect you’ll get some pushback on this point, but you’ve clearly explained why you chose the scheme you have. I’ve certainly run into XNU kernel people surprised that their .o file is an LLVM bitcode file, who wish we wouldn’t do things this way, but this anecdote is about 14 years stale.

Thank you! I have enjoyed your patch in CachyMod [1], applied to the 6.14 and 6.12 kernels with success. At the end of the thinlto patch, I make a one-line change to scripts/Makefile.autofdo and scripts/Makefile.propeller:

-ifdef CONFIG_LTO_CLANG_THIN
+ifeq ($(or $(CONFIG_LTO_CLANG_THIN),$(CONFIG_LTO_CLANG_THIN_DIST)),y)

About time savings: I have two machines, 32-core/64-thread and 8-core/16-thread. From testing, the time savings are more noticeable with more CPU cores. The thinlto patch involves more actions behind the scenes; notice that the user and sys times increase significantly in the description above (48-core/96-thread CPU). The 8-core machine saw little time savings, possibly because the extra user and sys time is shared by a smaller number of CPU cores. On the 32-core machine, build time dropped by about 1 minute for a thindist-LTO + AutoFDO + Propeller kernel: 7.75 minutes down to 6.75 minutes.

AMD Ryzen Threadripper 3970X (32/64), -j56
        thin-LTO        thindist-LTO
real    7m45.892s       6m44.295s
user    240m3.212s      247m18.157s
sys     16m36.359s      18m11.147s

AMD Ryzen 9800X3D (8/16), -j12
        thin-LTO        thindist-LTO
real    11m32.022s      11m14.059s
user    120m41.657s     123m2.249s
sys     6m48.403s       7m19.658s

[1] GitHub - marioroy/cachymod: Build lazy kernel on CachyOS

The main purpose of this post was to share our experience in enabling distributed ThinLTO within a Makefile-based build system.

I think both build modes offer their unique benefits depending on the application. However, implementing distributed build mode, particularly for complex systems like Kbuild, requires more effort and can be challenging, as I experienced.

I hope this post helps others in adopting distributed ThinLTO. Additionally, I'd like to bring to the LLVM community's attention potential improvements for LTO, for example better handling of archive objects in distributed builds and correct passing of BE compilation options.

Teresa pointed me to the original 2017 patch (https://reviews.llvm.org/D35809). I briefly looked at the discussion: this was done for symbol ordering files. I am wondering if enabling -ffunction-sections in the front end upon detecting such a file would be a better strategy than forcing the option within lld.

Also, lld shows poor performance with orphan sections, taking over 70 seconds to process approximately 140,000 sections. This seems to be excessive.

Thanks for trying the patch.

For the Linux kernel, build time (or RAM usage) is not the main goal of this work; compared to other build targets, the kernel is very small. The data in the post shows that if both modes use the same set of options, their build times are comparable. Your data shows the same.

I believe distributed ThinLTO is a better fit for the kernel compared to the in-process mode. Beyond style, there are two key advantages:

  1. The availability of final object files is crucial for tools like objtool and kpatch. I have a kpatch patch for ThinLTO that involves parsing the -save-temp=prelink output, which was complex and fragile.
  2. The control over back-end options is significant, potentially impacting correctness. In addition, some advanced optimizations such as CSFDO seem achievable only with distributed mode due to the need for per-module options (certain modules cannot be instrumented).

Kernel upstream patch linked here.