Authors: Wei Wei, Zheng Cheng, Zhongying Liu, Jiaping Mao, Jinjie Huang
1. Motivation
ThinLTO already provides efficient module-level parallelism by distributing optimization across translation units. However, in large-scale production codebases, extremely large LLVM modules can still become major compilation bottlenecks.
In large-scale data-center applications, a single translation unit may contain hundreds of thousands of lines of code. Even under highly parallel distributed build systems, overall build latency is often dominated by several extremely large modules:
Total Build Time ≈ max_i T_compile(Module_i)
The root cause is that optimization and code generation inside a single LLVM module are still largely serialized.
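As a rough illustration of this bound, the sketch below schedules per-module backend jobs greedily (longest first) onto a fixed pool of workers. The module timings and worker count are hypothetical, and the "split" case assumes a perfectly even four-way split with no overhead, which a real partitioner would not achieve.

```python
import heapq

def makespan(task_seconds, workers):
    """Wall-clock time of greedy longest-first scheduling on `workers` workers."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for t in sorted(task_seconds, reverse=True):
        lightest = heapq.heappop(loads)      # worker that becomes free first
        heapq.heappush(loads, lightest + t)
    return max(loads)

# Hypothetical backend times in seconds: one very large module, many small ones.
modules = [3000.0] + [60.0] * 200
# Same work, with the large module split into four equal pieces (idealized).
split_modules = [750.0] * 4 + [60.0] * 200

print(makespan(modules, workers=64))
print(makespan(split_modules, workers=64))
```

With enough workers, the first result is pinned to the largest module regardless of how many small modules exist, which is the long-tail effect described above.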
Existing FullLTO parallel backend infrastructure primarily focuses on parallelizing CodeGen. However, the optimization phase (Opt) can also dominate compilation time for large modules.
As a result, existing FullLTO backend parallelism cannot fully address long-tail compilation latency caused by very large translation units.
This RFC proposes a new backend compilation model to address this issue.
2. Proposal: Multi-Thread Parallel Compilation (MTPC)
This RFC proposes Multi-Thread Parallel Compilation (MTPC), a backend extension for ThinLTO that enables intra-module parallelism.
The design is inspired by the existing FullLTO parallel backend model, which splits modules at function granularity for parallel code generation. MTPC extends this idea by introducing CallGraph-aware partitioning instead of naive function-level splitting.
The key idea is:
Partition a large LLVM module into multiple CallGraph-aware submodules and compile them in parallel.
3. High-Level Design
The MTPC pipeline extends the existing ThinLTO backend flow.
Original ThinLTO Flow
Multi-Thread Parallel Compilation Flow
At a high level:
- Source files are compiled into LLVM bitcode (`.bc`)
- During the ThinLTO backend stage, a large module is partitioned into multiple CallGraph-aware IR partitions
- Each partition is compiled independently and in parallel
- Generated object files are merged into a single relocatable object via `lld -r`
- Each backend partition may also emit its own DWARF object (`.dwo`)
- Multiple `.dwo` files are merged into a final `.dwp` (optional)
- The merged object then participates in the normal final link step to produce the executable
4. Design Overview
4.1 CallGraph-aware Partitioning
A naive function-level split severely hurts interprocedural optimization quality. Many LLVM IPO passes rely on complete CallGraph visibility, including:
- Inlining
- Indirect Call Promotion (ICP)
- Devirtualization
- Constant propagation
MTPC partitions modules based on the CallGraph and its SCC (strongly connected component) hierarchy, attempting to balance:
- optimization quality
- backend parallelism
- partition granularity
PGO profile information may additionally be used to recover indirect call edges.
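The RFC does not spell out the partitioning algorithm, so the following is only a minimal sketch of one way SCC-aware partitioning could look. It finds SCCs with Tarjan's algorithm and then places whole SCCs, largest first, into the currently lightest partition. The function names, the `sizes` cost model, and the greedy balancing are all assumptions for illustration. A real implementation would also weigh call edges between SCCs so that callers and hot callees stay together.

```python
def tarjan_sccs(callgraph):
    """Return the SCCs of a {caller: [callees]} graph using Tarjan's algorithm."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(fn):
        index[fn] = low[fn] = len(index)
        stack.append(fn)
        on_stack.add(fn)
        for callee in callgraph.get(fn, ()):
            if callee not in index:
                visit(callee)
                low[fn] = min(low[fn], low[callee])
            elif callee in on_stack:
                low[fn] = min(low[fn], index[callee])
        if low[fn] == index[fn]:             # fn is the root of an SCC
            scc = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                scc.append(member)
                if member == fn:
                    break
            sccs.append(sorted(scc))

    for fn in callgraph:
        if fn not in index:
            visit(fn)
    return sccs

def partition(callgraph, sizes, num_partitions):
    """Greedily place whole SCCs, largest first, into the lightest partition."""
    sccs = tarjan_sccs(callgraph)
    sccs.sort(key=lambda scc: sum(sizes[f] for f in scc), reverse=True)
    parts = [[] for _ in range(num_partitions)]
    loads = [0] * num_partitions
    for scc in sccs:
        target = loads.index(min(loads))
        parts[target].extend(scc)            # an SCC is never split
        loads[target] += sum(sizes[f] for f in scc)
    return parts

# Hypothetical call graph: "a" and "b" are mutually recursive.
callgraph = {"main": ["a", "b"], "a": ["b"], "b": ["a"], "c": ["d"], "d": []}
sizes = {"main": 10, "a": 40, "b": 30, "c": 50, "d": 5}
print(partition(callgraph, sizes, num_partitions=2))
```

Keeping each SCC intact means mutually recursive functions always land in the same partition, so passes that operate on an SCC still see the whole component.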
4.2 Parallel Backend Compilation
Each partition executes independently through the normal LLVM backend pipeline.
Two implementation strategies were explored to further increase intra-module parallelism:
- CallGraph-level parallelism for both Opt and CodeGen

  In this model, both Opt and CodeGen are executed per CallGraph-aware partition. Each partition is treated as an independent submodule throughout the backend pipeline.

- Hybrid parallelism: CallGraph-level Opt + function-level CodeGen

  In this model, Opt is performed at CallGraph-partition granularity, while the CodeGen phase is further parallelized at function granularity within each partition, similar to existing FullLTO parallel CodeGen strategies.
Both approaches have been implemented and evaluated.
In practice, the first approach already provides sufficient parallelism for the target workloads, and avoids additional complexity in intra-partition scheduling and function-level dispatch. Therefore, the current upstream proposal only retains the first strategy.
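The retained strategy can be modeled as one independent job per partition. The sketch below is a toy model only: `compile_partition` is a hypothetical stand-in for running Opt and CodeGen on a submodule, and the object file naming is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def compile_partition(partition_id, functions):
    """Hypothetical stand-in for running Opt + CodeGen on one partition."""
    # A real backend would run the optimization pipeline on the submodule
    # and emit an object file here; this model only returns a file name.
    return f"part{partition_id}.o"

def compile_all(partitions, max_workers=4):
    """Compile every partition concurrently; results keep partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(compile_partition, i, fns)
                   for i, fns in enumerate(partitions)]
        return [f.result() for f in futures]

print(compile_all([["a", "b"], ["c", "main", "d"]]))
```

Because each job owns its partition end to end, no scheduling is needed inside a partition, which matches the reduced complexity cited for the first strategy.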
4.3 Object Merging
Each partition generates an independent object file. Final merging is performed through `lld -r`.
This design keeps MTPC:
- ABI-compatible
- transparent to existing build systems
- minimally invasive to the LLVM backend pipeline
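
As a small sketch of the merge step, the helper below builds the argument list for a relocatable link of the partition objects. It assumes the ELF driver name `ld.lld` and hypothetical file names. Only the `-r` (relocatable output) and `-o` options are used, and the command is constructed but not executed.

```python
def relocatable_link_command(objects, output, linker="ld.lld"):
    """Build the argv for merging partition objects into one relocatable object."""
    if not objects:
        raise ValueError("at least one partition object is required")
    # -r asks the linker for relocatable output instead of an executable.
    return [linker, "-r", "-o", output, *objects]

# Hypothetical partition outputs for a single original module.
print(relocatable_link_command(["part0.o", "part1.o", "part2.o"], "module.o"))
```

Since the result is an ordinary object file, the build system can pass it to the final link exactly as it would pass the output of an unpartitioned module.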
5. Symbol Handling Strategy
Symbol transformation is performed during the split phase to preserve IR semantic correctness across partitions.
5.1 Transformation Scope
The symbol transformation stage is integrated directly into module partitioning.
The following operations are performed together:
- partition construction
- symbol cloning
- linkage rewriting
- symbol renaming
Pipeline:
Module IR
→ Split Module
→ Change Symbol Attribute
→ Parallel backend execution across CallGraph partitions (Opt + CodeGen)
→ lld -r
→ Final Object
5.2 Function and Global Variable Linkage Rules
| Original Linkage | Transformed Result |
|---|---|
| external (function) | Primary partition: external; other partitions: available_externally |
| internal (function) | If single-use: keep internal; otherwise: promote to external |
| external (global variable) | Primary partition: external; other partitions: declaration only |
| internal (global variable) | Promote to external, then handled as external |
| available_externally | Directly cloned |
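
The linkage rules in this section can be read as a small decision function. The sketch below encodes them under one labeled assumption: an internal function that is promoted to external is then treated like any other external function, mirroring what the rules state explicitly for global variables. The function name and string labels are hypothetical.

```python
def transformed_linkage(kind, linkage, is_primary, single_use=False):
    """Return the per-partition result for a symbol, following the rules above.

    kind:    "function" or "global"
    linkage: "external", "internal", or "available_externally"
    """
    if kind not in ("function", "global"):
        raise ValueError(f"unknown kind: {kind}")
    if linkage not in ("external", "internal", "available_externally"):
        raise ValueError(f"unknown linkage: {linkage}")

    if linkage == "available_externally":
        return "available_externally"        # directly cloned
    if kind == "function":
        if linkage == "internal" and single_use:
            return "internal"                # stays private to its only user
        # External, or internal promoted to external.
        # ASSUMPTION: promoted functions then follow the external-function rule.
        return "external" if is_primary else "available_externally"
    # Globals: internal is promoted to external, then handled as external.
    return "external" if is_primary else "declaration"

print(transformed_linkage("function", "external", is_primary=False))
print(transformed_linkage("global", "internal", is_primary=True))
```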
5.3 Special IR Semantic Handling
LLVM IR contains several entities with non-trivial semantic constraints. Incorrect partitioning may break correctness or introduce inconsistencies.
The current implementation provides explicit handling for:
| IR Entity | Transformation Rule |
|---|---|
| VTable (external) | Primary: external; others: available_externally |
| COMDAT group | Kept whole; primary emits external, others available_externally |
| alias | Strong alias: resolved to aliasee; weak alias: co-located in primary partition |
| ifunc | Resolver kept in primary; others converted to available_externally |
In addition, global constructors/destructors ownership is preserved via primary-partition assignment.
It is possible that other LLVM IR semantic constraints require additional handling. Feedback is welcome.
6. Current Limitations
6.1 CloneModule Serialization Bottleneck
The current implementation still contains serialization bottlenecks:
Serial CloneModule
↓
Serial IR Serialization
↓
Parallel Backend Compilation
Because LLVM IR objects are tightly coupled with their owning LLVMContext, each partition currently requires an independent context. However, CloneModule preserves the original module's LLVMContext in its output Module. As a result, partition isolation cannot be achieved directly, and an explicit serialization/deserialization step is required to reconstruct each partition in a separate LLVMContext.
CloneModule is not thread-safe today, preventing fully parallel cloning without significant LLVM IR infrastructure changes.
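To make the cost of the serial stages concrete, the following toy model adds the serial clone and serialization time of every partition to the longest parallel backend job. It assumes one worker per partition and no overlap between the serial and parallel stages. All timings are hypothetical and not taken from the experiments.

```python
def pipeline_seconds(clone_s, serialize_s, backend_s):
    """Total time: serial clone + serial serialization, then parallel backend."""
    serial = sum(clone_s) + sum(serialize_s)   # done one partition at a time
    parallel = max(backend_s)                  # one worker per partition
    return serial + parallel

def serial_fraction(clone_s, serialize_s, backend_s):
    """Share of total wall-clock time spent in the serial stages."""
    serial = sum(clone_s) + sum(serialize_s)
    return serial / pipeline_seconds(clone_s, serialize_s, backend_s)

# Hypothetical timings in seconds for four partitions.
clone = [5.0, 5.0, 5.0, 5.0]
serialize = [3.0, 3.0, 3.0, 3.0]
backend = [100.0, 100.0, 100.0, 100.0]

print(pipeline_seconds(clone, serialize, backend))
print(serial_fraction(clone, serialize, backend))
```

In this model the serial part grows with the number of partitions while the parallel part shrinks, so finer partitioning eventually stops paying off unless cloning and serialization become cheaper or parallel.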
Future work includes:
- reducing serialization overhead
- making `CloneModule` thread-safe or partially parallelizable
6.2 Debug Information Growth
Module partitioning increases the number of generated compilation units (CUs), which may introduce duplicated DWARF debug information across partitions.
In current experiments, enabling -fdebug-types-section keeps debug information growth within approximately 20%, which is currently considered acceptable for production deployment.
Further optimizations for DWARF deduplication and debug information compaction are planned as future work.
7. Experimental Results
The following workloads are large-scale internal production applications.
Additional validation on open-source workloads is planned in future evaluations.
Production-scale experiments show substantial compilation time reductions:
| Workload | Baseline | MTPC | Reduction |
|---|---|---|---|
| Large TU A | 51m 26s | 13m 58s | 72.85% |
| Large TU B | 24m 07s | 7m 38s | 68.35% |
| Large TU C | 9m 34s | 4m 02s | 57.84% |
| Large TU D | 7m 23s | 2m 47s | 62.30% |
| Full Application | 30m 45s | 20m 42s | 32.68% |
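
The Reduction column follows from the two timing columns as (Baseline - MTPC) / Baseline. The short script below recomputes it from the durations in the table and also prints the equivalent speedup factor. The helper names are illustrative.

```python
def to_seconds(text):
    """Parse a duration written like '51m 26s' into seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))

def reduction_percent(baseline, mtpc):
    """Percentage of baseline compile time removed by MTPC."""
    b, m = to_seconds(baseline), to_seconds(mtpc)
    return 100.0 * (b - m) / b

results = {
    "Large TU A": ("51m 26s", "13m 58s"),
    "Large TU B": ("24m 07s", "7m 38s"),
    "Large TU C": ("9m 34s", "4m 02s"),
    "Large TU D": ("7m 23s", "2m 47s"),
    "Full Application": ("30m 45s", "20m 42s"),
}

for name, (baseline, mtpc) in results.items():
    speedup = to_seconds(baseline) / to_seconds(mtpc)
    print(f"{name}: {reduction_percent(baseline, mtpc):.2f}% reduction, "
          f"{speedup:.2f}x speedup")
```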
The optimization is most beneficial for builds containing a small number of extremely large translation units that underutilize available CPU cores due to insufficient backend parallelism.
For workloads already containing a sufficiently large number of independent translation units relative to available hardware parallelism, the overall benefit is naturally smaller or even negative.
8. Conclusion
MTPC extends ThinLTO with CallGraph-aware intra-module backend parallelism for extremely large LLVM modules.
The approach aims to:
- reduce long-tail compilation latency
- improve distributed build scalability
- preserve IPO effectiveness
- remain compatible with existing LLVM infrastructure
MTPC is intended as a complementary scalability mechanism to existing ThinLTO and distributed build parallelism, rather than a replacement for them.
Feedback and discussion are welcome.
9. References
These patches implement the initial MTPC prototype.
The full design has been completed, and the implementation is being split into multiple smaller patches. These patches will be submitted incrementally to LLVM for review to enable step-by-step validation and easier integration.

