[RFC] Distributed ThinLTO Build for Kernel
Authors: Rong Xu
Contributors: Teresa Johnson, Sriraman Tallam, and David Li
Summary
We propose to add distributed ThinLTO build support to the Linux kernel. This build mode is more user-friendly for developers and offers greater convenience for objtool and livepatch integration.
Background
ThinLTO is a link-time optimization in LLVM that is scalable and offers better performance [ThinLTO paper]. The ThinLTO build process consists of four phases:
- Pre-link compilation - IR objects are generated.
- Thin-link - Use program summaries from the pre-link compilation phase for global analyses.
- ThinLTO backend (BE) compilation - Backend optimizations are performed in parallel with an expanded scope (imported modules). We call this BE compilation.
- Final link - The final binary is generated.
ThinLTO has two build modes:
- In-process: Thin-link and ThinLTO BE compilation are invoked through the linker. The BE compilation is multi-threaded.
- Distributed: Thin-link is invoked through the linker and generates ThinLTO index files. ThinLTO BE compilation is separated from the linker; the build system / makefile invokes it explicitly with the ThinLTO index files and pre-link IR objects.
Note that “distributed” in this context only distinguishes this mode from in-process ThinLTO builds, where backend compilation is invoked through the linker; it does not necessarily mean building in a distributed environment.
For a non-LTO build:
> $CC $CFLAGS <srcs> -c <srcs> # Generate objects
> $LD $LDFLAGS <objs> -o <binary> # Link
An in-process ThinLTO build is:
> $CC $CFLAGS -flto=thin -c <srcs> # Generate IR files
> $LD $LDFLAGS <IR_files> -o <binary> # Thin-link, BE compile, and final link
In contrast, a distributed ThinLTO build breaks the build into multiple steps:
> $CC $CFLAGS -flto=thin -c <srcs> # Generate IR files
> $LD $LDFLAGS <IR_files> --thinlto-index-only # Thin-link
> $CC $CFLAGS -x ir -fthinlto-index=<thinlto_bc_file> <IR_file> # BE compile
> $LD $LDFLAGS <final_objects> -o <binary> # Final link
While the in-process mode is generally easier to integrate into the build system, and potentially faster due to reduced file I/O, the distributed build mode offers several advantages:
- Customization: Each BE compilation job can have specific compiler options, unlike the in-process build’s single, unified BE compilation options.
- Developer Control and Visibility: Developers have control over each sub-step and access to the final objects.
- Scalability: Backend compilations can be distributed across multiple machines, accelerating the build process. This can be done in the Blaze build system.
- Resource Efficiency: It requires less RAM, which is crucial for large applications with high memory demands during linking.
ThinLTO builds at Google predominantly utilize a distributed mode. This build mode has been thoroughly tested with a wide range of applications.
The kernel Kbuild system benefits greatly from the distributed build mode. This is because:
- Each compilation step is recorded the same way as in non-LTO builds, facilitating easy debugging.
- Specific compilation options are maintained throughout the process, in contrast to in-process ThinLTO (and full LTO), which uses unified BE options.
- The final object files are available for tools such as objtool and KPatch.
High-level overview
Here is an overview of the Kbuild build. vmlinux is the final kernel image. vmlinux.o is the relocatable ELF object. vmlinux.a is a thin archive of built-in.a, lib/lib.a, and arch/<arch>/lib.a.
vmlinux <- vmlinux.o <- vmlinux.a <- built-in.a <- kernel/built-in.a <- kernel/%.o
<- net/built-in.a <- net/%.o
<- …/built-in.a <- …/%.o
%.o <- %.c
For the in-process build mode, the ld.lld invocation happens between vmlinux.a and vmlinux.o:
> ld.lld -m elf_x86_64 <...> -z noexecstack -r -o vmlinux.o -T .tmp_initcalls.lds --whole-archive vmlinux.a --no-whole-archive
In the ThinLTO distributed build, compilation requires two steps: generating IR files, followed by compiling those IR files into final object (.o) files. This creates a need to differentiate between these intermediate and final files.
While we could use a distinct suffix for IR files (e.g., .ir_o) and keep .o for the final objects, this conflicts with existing Makefile rules. Many rules are structured like:
$(obj)/crc32.o: $(obj)/crc32table.h.
These rules assume the .o file depends directly on source or header files. With ThinLTO’s two-stage process, this assumption breaks. Therefore, merely renaming the IR files is insufficient; we must modify the Makefile rules to correctly represent the dependency on the intermediate IR file and the new build steps.
Alternatively, we can use a different suffix for the final object files. This is the approach I’ve chosen.
- .o: the file type for an IR file.
- .a: the (thin)archive for IR files.
- .final_o: the file type for the final object.
- .final_a: the (thin)archive for the final objects.
The workflow of distributed ThinLTO is the following:
vmlinux
<- vmlinux.o
<- vmlinux.final_a
<- built-in.final_a <- kernel/built-in.final_a <- kernel/%.final_o
<- net/built-in.final_a <- net/%.final_o
<- …/built-in.final_a <- …/%.final_o
%.final_o <- %.o
<- %.o.thinlto.bc
%.o.thinlto.bc <- vmlinux.a <- built-in.a <- kernel/built-in.a <- kernel/%.o
| <- net/built-in.a <- net/%.o
| <- …/built-in.a <- …/%.o
thin-link
%.o <- %.c
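To make this workflow concrete, here is a minimal sketch of the pattern rules it implies (illustrative only: the actual rules are added to scripts/Makefile.build; c_flags is Kbuild's existing per-file compiler flag variable, while be_flags is a placeholder for the filtered BE options described later):
  # Pre-link: %.o is an LLVM IR object, same as in the in-process ThinLTO build.
  $(obj)/%.o: $(src)/%.c
          $(CC) $(c_flags) -flto=thin -c -o $@ $<

  # BE compile: %.final_o is the native object, built from the IR object and
  # the per-file index (%.o.thinlto.bc) produced by the thin-link.
  $(obj)/%.final_o: $(obj)/%.o $(obj)/%.o.thinlto.bc
          $(CC) $(be_flags) -x ir -fthinlto-index=$(word 2,$^) -c -o $@ $<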
The main challenge in enabling distributed ThinLTO is its integration into Kbuild. Our approach involves several modifications:
- A new Kconfig option, CONFIG_LTO_CLANG_THIN_DIST, will be introduced alongside the existing CONFIG_LTO_CLANG_FULL and CONFIG_LTO_CLANG_THIN settings (see the Kconfig sketch after this list).
- The top-level Makefile will be updated with a new macro specifically for generating vmlinux.o. This macro activates the new distributed ThinLTO workflow, conditional on CONFIG_LTO_CLANG_THIN_DIST being enabled.
- The core of the implementation leverages the existing Kbuild infrastructure to perform two distinct recursive passes through the sub-directories. The first pass generates LLVM IR object files (the same as in the in-process ThinLTO build). After the thin-link, a second pass compiles these IR files into the final native object files. The necessary build rules and actions for this two-pass system are primarily added within scripts/Makefile.build.
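For illustration, the new Kconfig entry could be modeled on the existing LTO_CLANG_THIN option roughly as follows (a sketch only; the exact dependencies and help text in the patch may differ):
  config LTO_CLANG_THIN_DIST
          bool "Clang ThinLTO (distributed mode)"
          depends on HAS_LTO_CLANG && ARCH_SUPPORTS_LTO_CLANG_THIN
          select LTO_CLANG
          help
            Build the kernel with distributed ThinLTO: IR generation,
            thin-link, and BE compilation run as separate Kbuild steps,
            and the final native objects are kept on disk for tools
            such as objtool.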
Note that the current patch focuses solely on the main kernel image (vmlinux) and does not yet support building kernel modules with this method. Module support is planned as a follow-up patch upon acceptance of this initial work.
Detailed Implementation
The patch, linked here, will be sent upstream for review shortly. Specific issues resolved in this patch include:
1. Handling of archive objects in the thin-link
We want to create a separate index file (i.e., .o.thinlto.bc) for each source file. However, distributed ThinLTO’s limited support for static archives prevents us from using vmlinux.a directly; doing so would likely generate index files associated with the archive rather than with its individual members, e.g. 'vmlinux.a(fork.o at 1307358).thinlto.bc'. We cannot use these in BE compilation.
Our solution is to extract a list of all member object files from vmlinux.a into a temporary file and provide this list to the thin-link.
While standard archive linking often uses --start-lib and --end-lib, vmlinux.o is linked using the --whole-archive option. This flag forces the inclusion of all objects within the archive, bypassing standard archive semantics. Consequently, providing an explicit list containing every object file from vmlinux.a is functionally equivalent to using the archive with --whole-archive in this specific context.
The command line for the thin-link:
> llvm-ar t vmlinux.a > .vmlinux_thinlto_bc_files
> ld.lld -m elf_x86_64 <...> -z noexecstack -r --thinlto-index-only -T .tmp_initcalls.lds @.vmlinux_thinlto_bc_files
2. Handling of BE compilation options
Passing all compiler options from the initial IR (front-end) generation directly to the BE compilation results in errors. To prevent this, certain incompatible options must be filtered out. Our implementation removes the following options:
- -flto=thin
- -D%
- %.h.gc and %.h
- -Wp%
- <linux-includes>
We also add -Wno-unused-command-line-argument to suppress the warnings.
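A minimal sketch of how this filtering might be expressed in Make, assuming hypothetical variable names (c_flags as the front-end options, be_flags as the derived BE options):
  # Illustrative only: drop the ThinLTO flag, macro definitions, preprocessor
  # flags, and header arguments from the front-end options. Handling of the
  # <linux-includes> options and of the '-include' words that precede the
  # filtered header arguments is omitted from this sketch.
  be_flags := $(filter-out -flto=thin -D% -Wp% %.h %.h.gc, $(c_flags)) \
              -Wno-unused-command-line-argument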
3. Handling of non-IR files
Assembly files (.S) and source files explicitly compiled with -fno-lto bypass ThinLTO IR generation; they produce standard object files instead. Passing these standard objects to the ThinLTO BE compilation would cause errors.
To prevent this, we implement a check: before invoking the BE compilation on an .o file, we verify its type (using the file command). If the file contains LLVM IR bitcode, we proceed with the BE compilation. Otherwise, we simply create a link from the .o file to the .final_o file.
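A sketch of this check as it could appear inside the %.final_o recipe, using a hypothetical object kernel/foo.o (recipe line continuations are omitted for readability, and the patch may link or copy non-IR objects differently):
  if file -b kernel/foo.o | grep -q 'LLVM'; then
          # IR object: run the ThinLTO BE compilation with its index file.
          $(CC) $(be_flags) -x ir -fthinlto-index=kernel/foo.o.thinlto.bc \
                  -c -o kernel/foo.final_o kernel/foo.o
  else
          # Native object (from a .S file or a -fno-lto source): just hard-link it.
          ln -f kernel/foo.o kernel/foo.final_o
  fi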
4. Module path problems
In ThinLTO BE compilation, the compiler imports modules using the module table found in the .thinlto.bc file. It correlates the BE command-line name with the module_id created during the preceding thin-link phase.
Typically, the identifier used for the primary module matches the identifier used when it is referenced for import, but this assumption is broken in some kernel compilation cases. For example, the primary module during BE compilation has a path like arch/x86/../../virt/kvm/kvm_main.o, whereas in the list used by the thin-link the same module appears under the simplified, canonical path virt/kvm/kvm_main.o.
This mismatch causes build errors: the compiler incorrectly treats the module as non-primary, resulting in undefined symbol errors during the final link.
While whether this behavior constitutes a compiler bug is debatable, I’ve worked around it by ensuring Kbuild consistently uses the simplified path names to avoid the mismatch.
Comparing the distributed build and the in-process build
Option handling
ThinLTO’s in-process mode uses a single, unified set of options for all backend compilations. While some options can be passed via metadata from the frontend, conflicting options from different modules force LTO to choose one configuration, potentially changing the options for some modules. The distributed mode addresses this issue by reusing the exact options for each module.
For example, CFLAGS_jitterentropy.o = -O0 (for crypto/jitterentropy.o) and ccflags-y += -O3 in lib/lz4/Makefile are not honored in the in-process ThinLTO BE build, but they are honored in distributed BE builds.
Another difference is -ffunction-sections and -fdata-sections: these two options are unconditionally turned on and cannot be turned off in in-process ThinLTO, which has a significant impact on compilation time. In distributed ThinLTO, these options are independent of LTO.
Lastly, in-process ThinLTO ignores the kernel’s -falign-loops=1 compilation option, causing all loops to use the default alignment of 16 bytes (for x86-64 builds) instead. This issue needs to be fixed in LLVM / lld.
Once the issues above are fixed, in-process and distributed ThinLTO produce identical code for each function. However, a section-by-section comparison shows that the function order differs within a few specific sections, namely:
- .altinstr_aux
- .altinstr_replacement
- .init.text (3226 functions)
- .noinstr.text (117 functions)
- .text (63 functions)
Note that the option -ffunction-sections was used in this comparison, which means most functions are in separate, identical sections. Furthermore, within the executable sections containing multiple functions (such as .init.text, .noinstr.text, and .text), the function code itself is identical; only the order differs. Also, the experiments described above were performed using defconfig on the x86_64 architecture; the results may differ for other configurations.
Build time performance
All the timing data were collected using the following configuration:
- Intel Xeon @2.60G (Ice Lake), 48 physical cores (96 logical cores)
- 240GB RAM
- Clang/ld.lld: LLVM-20.1.0
- Kernel: commit 1e26c5e28ca5821a824e90dd359556f5e9e7b89f (Mar 25, 2025)
- Use defconfig (x86_64)
- Build command: make LLVM=1 vmlinux -j 96
- Average of 5 runs
Build times in seconds were measured for the non-LTO, in-process ThinLTO, and distributed ThinLTO methods. Distributed ThinLTO increased the build time by 33% compared to the non-LTO baseline, while in-process ThinLTO increased it by 168%.
| Time | Non-LTO | In-process ThinLTO | Distributed ThinLTO |
|---|---|---|---|
| real | 44.30 | 118.71 | 58.88 |
| user | 2222.23 | 2278.43 | 2324.16 |
| sys | 322.93 | 324.78 | 441.47 |
In-process ThinLTO limits backend jobs to the number of physical CPU cores and enforces the use of function-sections and data-sections. Considering these constraints, the following provides a fairer comparison between in-process and distributed ThinLTO. The total running times were approximately the same, but the distributed method showed higher system time usage, most likely due to increased file I/O.
| Time | In-process (--thinlto-jobs=96) | Distributed with function-sections on |
|---|---|---|
| real | 119.10 | 120.65 |
| user | 2352.93 | 2398.40 |
| sys | 335.06 | 447.39 |
Here is a breakdown of the time taken by different stages within the distributed ThinLTO build.
| Time | FE | Thin-link | BE | Final link | Final link (w/ function-sections) |
|---|---|---|---|---|---|
| real | 30.94 | 4.46 | 10.42 | 16.30 | 83.95 |
| user | 2053.65 | 8.79 | 308.09 | 19.16 | 84.83 |
| sys | 319.34 | 50.02 | 124.78 | 49.69 | 54.61 |