[RFC] [ThinLTO]: Multi-Thread Parallel Compilation for Large Modules

Authors: Wei Wei, Zheng Cheng, Zhongying Liu, Jiaping Mao, Jinjie Huang


1. Motivation

ThinLTO already provides efficient module-level parallelism by distributing optimization across translation units. However, in large-scale production codebases, extremely large LLVM modules can still become major compilation bottlenecks.

In large-scale data-center applications, a single translation unit may contain hundreds of thousands of lines of code. Even under highly parallel distributed build systems, overall build latency is often dominated by several extremely large modules:

Total Build Time ≈ max_i (T_compile(Module_i))

The root cause is that optimization and code generation inside a single LLVM module are still largely serialized.
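To make the approximation above concrete, here is a minimal sketch that schedules whole modules onto workers with a greedy longest-first policy. The module times and worker counts are hypothetical and are not measurements from this RFC. The point is that once one module is much larger than the rest, adding workers no longer shortens the build.

```python
import heapq

def makespan(module_times, workers):
    """Greedy longest-first scheduling of whole (unsplittable) modules."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for t in sorted(module_times, reverse=True):
        lightest = heapq.heappop(loads)      # least-loaded worker so far
        heapq.heappush(loads, lightest + t)
    return max(loads)

# Hypothetical backend times in minutes: one giant module, 200 small ones.
times = [50.0] + [1.0] * 200

for workers in (8, 64, 512):
    print(workers, makespan(times, workers))
```

In this toy setup the build time stays pinned at the 50-minute module for every worker count shown, which is the long-tail effect that intra-module parallelism targets.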

The existing FullLTO parallel backend primarily parallelizes code generation (CodeGen). However, the optimization phase (Opt) can also dominate compilation time for large modules.

As a result, existing FullLTO backend parallelism cannot fully address long-tail compilation latency caused by very large translation units.

This RFC proposes a new backend compilation model to address this issue.


2. Proposal: Multi-Thread Parallel Compilation (MTPC)

This RFC proposes Multi-Thread Parallel Compilation (MTPC), a backend extension for ThinLTO that enables intra-module parallelism.

The design is inspired by the existing FullLTO parallel backend model, which splits modules at function granularity for parallel code generation. MTPC extends this idea by introducing CallGraph-aware partitioning instead of naive function-level splitting.

The key idea is:

Partition a large LLVM module into multiple CallGraph-aware submodules and compile them in parallel.


3. High-Level Design

The MTPC pipeline extends the existing ThinLTO backend flow.

[Figure: Original ThinLTO Flow]

[Figure: Multi-Thread Parallel Compilation Flow]

At a high level:

  1. Source files are compiled into LLVM bitcode (.bc)
  2. During the ThinLTO backend stage, a large module is partitioned into multiple CallGraph-aware IR partitions
  3. Each partition is compiled independently and in parallel
  4. Generated object files are merged into a single relocatable object via lld -r
  5. Each backend partition may also emit its own DWARF object (.dwo)
  6. Multiple .dwo files are merged into a final .dwp (optional)
  7. The merged object then participates in the normal final link step to produce the executable
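The control flow of steps 2 to 4 can be sketched as follows. This is only an illustration in Python, not the LLVM implementation: `run_mtpc` and the three toy stage functions are hypothetical names, and a real backend would run native threads over LLVM modules rather than Python callables.

```python
from concurrent.futures import ThreadPoolExecutor

def run_mtpc(module, split, compile_partition, merge, jobs=4):
    """Split one module, compile its partitions concurrently, merge the results."""
    partitions = split(module)                                   # step 2
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map preserves input order, so the merge step is deterministic.
        objects = list(pool.map(compile_partition, partitions))  # step 3
    return merge(objects)                                        # step 4

# Toy stand-ins for the three stages (hypothetical, for illustration only).
def toy_split(functions):
    return [functions[i::3] for i in range(3)]

def toy_compile(partition):
    return "obj(" + ",".join(partition) + ")"

def toy_merge(objects):
    return " + ".join(objects)

result = run_mtpc(["a", "b", "c", "d", "e"], toy_split, toy_compile, toy_merge)
print(result)
```

Keeping the stages as plain functions mirrors the property described above: each partition is compiled independently, and only the split and merge steps see the whole module.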

4. Design Overview

4.1 CallGraph-aware Partitioning

A naive function-level split severely hurts interprocedural optimization quality. Many LLVM IPO passes rely on complete CallGraph visibility, including:

  • Inlining
  • Indirect Call Promotion (ICP)
  • Devirtualization
  • Constant propagation

MTPC partitions modules based on CallGraph and SCC hierarchy, attempting to balance:

  • optimization quality
  • backend parallelism
  • partition granularity

PGO profile information may additionally be used to recover indirect call edges.
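One possible shape of such a partitioner is sketched below. It is not the algorithm used by the MTPC patches: the heuristic (keep each SCC whole, place it in the partition that already holds most of its callers and callees, respect a size cap) and all names and parameters are assumptions chosen for illustration.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; returns a list of SCCs, each a list of nodes."""
    succ = defaultdict(list)
    for caller, callee in edges:
        succ[caller].append(callee)
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:              # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in nodes:
        if v not in index:
            visit(v)
    return sccs

def partition(nodes, sizes, edges, k, cap):
    """Assign every function to one of k partitions, never splitting an SCC."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    owner, load = {}, [0] * k
    sccs = strongly_connected_components(nodes, edges)
    # Place the heaviest SCCs first so the size cap is easier to respect.
    for scc in sorted(sccs, key=lambda s: -sum(sizes[f] for f in s)):
        weight = sum(sizes[f] for f in scc)
        affinity = [0] * k                  # call edges into each partition
        for f in scc:
            for n in neighbours[f]:
                if n in owner:
                    affinity[owner[n]] += 1
        fits = [p for p in range(k) if load[p] + weight <= cap] or list(range(k))
        best = max(fits, key=lambda p: (affinity[p], -load[p]))
        for f in scc:
            owner[f] = best
        load[best] += weight
    return owner

# Hypothetical call graph: a and b are mutually recursive; d -> e -> f is a chain.
funcs = ["a", "b", "c", "d", "e", "f"]
calls = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "e"), ("e", "f")]
owner = partition(funcs, {f: 10 for f in funcs}, calls, k=2, cap=40)
print(owner)
```

In this example the two call chains end up in different partitions, so every call edge stays inside one partition and inlining across those edges remains possible. The `cap` parameter is where the trade-off between optimization quality and parallelism shows up: a smaller cap forces more cross-partition edges.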


4.2 Parallel Backend Compilation

Each partition executes independently through the normal LLVM backend pipeline.

Two implementation strategies were explored to further increase intra-module parallelism:

  1. CallGraph-level parallelism for both Opt and CodeGen

    In this model, both Opt and CodeGen are executed per CallGraph-aware partition. Each partition is treated as an independent submodule throughout the backend pipeline.

  2. Hybrid parallelism: CallGraph-level Opt + function-level CodeGen

    In this model, Opt is performed at CallGraph-partition granularity, while the CodeGen phase is further parallelized at function granularity within each partition, similar to existing FullLTO parallel CodeGen strategies.

Both approaches have been implemented and evaluated.

In practice, the first approach already provides sufficient parallelism for the target workloads, and avoids additional complexity in intra-partition scheduling and function-level dispatch. Therefore, the current upstream proposal only retains the first strategy.


4.3 Object Merging

Each partition generates an independent object file. Final merging is performed through:

lld -r

This design keeps MTPC:

  • ABI-compatible
  • transparent to existing build systems
  • minimally invasive to LLVM backend pipeline
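As a rough illustration of the merge step, the sketch below only builds the command lines and does not execute them. It assumes the ELF flavor of lld (`ld.lld`) and uses `llvm-dwp` for the optional .dwp packaging; the helper name and all file names are hypothetical.

```python
def merge_commands(objects, merged_object, dwo_files=(), dwp_output=None):
    """Build argv lists for the merge steps; nothing is executed here."""
    # Relocatable link: combine the per-partition objects into one .o file.
    commands = [["ld.lld", "-r", "-o", merged_object, *objects]]
    # Optional: pack the per-partition .dwo files into a single .dwp.
    if dwo_files and dwp_output:
        commands.append(["llvm-dwp", *dwo_files, "-o", dwp_output])
    return commands

cmds = merge_commands(
    ["big.part0.o", "big.part1.o"], "big.o",
    dwo_files=["big.part0.dwo", "big.part1.dwo"], dwp_output="big.dwp",
)
for argv in cmds:
    print(" ".join(argv))
```

Because the output is a single relocatable object with the name the build system already expects, the final link step does not need to know that the module was partitioned.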

5. Symbol Handling Strategy

Symbol transformation is performed during the split phase to preserve IR semantic correctness across partitions.

5.1 Transformation Scope

The symbol transformation stage is integrated directly into module partitioning.

The following operations are performed together:

  • partition construction
  • symbol cloning
  • linkage rewriting
  • symbol renaming

Pipeline:

Module IR
→ Split Module
→ Change Symbol Attribute
→ Parallel backend execution across CallGraph partitions (Opt + CodeGen)
→ lld -r
→ Final Object

5.2 Function and Global Variable Linkage Rules

Original Linkage           | Transformed Result
external (function)        | Primary partition: external; other partitions: available_externally
internal (function)        | If single-use: keep internal; otherwise: promote to external
external (global variable) | Primary partition: external; other partitions: declaration only
internal (global variable) | Promote to external, then handled as external
available_externally       | Directly cloned
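The rules in this table can be read as a small decision function. The sketch below is an illustration, not the code in the patches: the function name, the string encoding of linkages, and the `single_use` flag are hypothetical. It also assumes that a promoted internal function is then treated like an external function, which the table states explicitly only for global variables.

```python
def transformed_linkage(kind, linkage, is_primary, single_use=False):
    """Return the linkage a symbol gets in one partition.

    kind:    "function" or "global"
    linkage: "external", "internal" or "available_externally"
    """
    if kind not in ("function", "global"):
        raise ValueError("unknown kind: " + kind)
    if linkage == "available_externally":
        return "available_externally"        # directly cloned
    if linkage == "internal":
        if kind == "function" and single_use:
            return "internal"                # stays private to its only user
        linkage = "external"                 # promote, then apply the external rule
    if linkage != "external":
        raise ValueError("linkage not covered by the table: " + linkage)
    if is_primary:
        return "external"                    # the one real definition
    # Non-primary copies: functions keep a body for IPO, globals only a declaration.
    return "available_externally" if kind == "function" else "declaration"

print(transformed_linkage("function", "external", is_primary=False))
print(transformed_linkage("global", "internal", is_primary=False))
```

Keeping exactly one external definition per symbol, in the primary partition, is what lets the relocatable link succeed without duplicate-definition errors.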

5.3 Special IR Semantic Handling

LLVM IR contains several entities with non-trivial semantic constraints. Incorrect partitioning may break correctness or introduce inconsistencies.

Current implementation provides explicit handling for:

IR Entity         | Transformation Rule
VTable (external) | Primary: external; others: available_externally
COMDAT group      | Kept whole; primary emits external, others available_externally
alias             | Strong alias: resolved to aliasee; weak alias: co-located in primary partition
ifunc             | Resolver kept in primary; others converted to available_externally

In addition, ownership of global constructors and destructors is preserved by assigning them to the primary partition.

It is possible that other LLVM IR semantic constraints require additional handling. Feedback is welcome.


6. Current Limitations

6.1 CloneModule Serialization Bottleneck

The current implementation still contains serialization bottlenecks:

Serial CloneModule
        ↓
Serial IR Serialization
        ↓
Parallel Backend Compilation

Because LLVM IR objects are tightly coupled with their owning LLVMContext, each partition currently requires an independent context. However, CloneModule preserves the original module’s LLVMContext in its output Module. As a result, partition isolation cannot be achieved directly, and an explicit serialization/deserialization step is required to reconstruct each partition in a separate LLVMContext.

CloneModule is not thread-safe today, preventing fully parallel cloning without significant LLVM IR infrastructure changes.
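The cost of the serial stages can be estimated with Amdahl's law. The sketch below is a generic worked formula, not a model fitted to MTPC: the serial fraction and worker counts are hypothetical inputs.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0 or workers < 1:
        raise ValueError("serial_fraction must be in [0, 1] and workers >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Hypothetical: cloning plus IR serialization take 10% of the single-threaded time.
for workers in (4, 16, 64):
    print(workers, round(amdahl_speedup(0.10, workers), 2))
```

With a 10% serial share, the speedup can never exceed 10x regardless of partition count, which is why reducing the clone and serialization cost matters more as the number of partitions grows.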

Future work includes:

  • reducing serialization overhead
  • making CloneModule thread-safe or partially parallelizable

6.2 Debug Information Growth

Module partitioning increases the number of generated compilation units (CUs), which may introduce duplicated DWARF debug information across partitions.

In current experiments, enabling -fdebug-types-section keeps debug information growth within approximately 20%, which is currently considered acceptable for production deployment.

Further optimizations for DWARF deduplication and debug information compaction are planned as future work.


7. Experimental Results

The following workloads are large-scale internal production applications.

Additional validation on open-source workloads is planned in future evaluations.

Production-scale experiments show substantial compilation time reductions:

Workload         | Baseline | MTPC    | Reduction
Large TU A       | 51m 26s  | 13m 58s | 72.85%
Large TU B       | 24m 07s  | 7m 38s  | 68.35%
Large TU C       | 9m 34s   | 4m 02s  | 57.84%
Large TU D       | 7m 23s   | 2m 47s  | 62.30%
Full Application | 30m 45s  | 20m 42s | 32.68%
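The Reduction column is consistent with (Baseline - MTPC) / Baseline. The short check below recomputes it from the times in the table; the helper names are hypothetical.

```python
def to_seconds(text):
    """Parse a duration such as '51m 26s' into seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))

def reduction_percent(baseline, mtpc):
    """Percentage of the baseline time saved, rounded to two decimals."""
    b, m = to_seconds(baseline), to_seconds(mtpc)
    return round(100.0 * (b - m) / b, 2)

rows = [
    ("Large TU A", "51m 26s", "13m 58s"),
    ("Large TU B", "24m 07s", "7m 38s"),
    ("Large TU C", "9m 34s", "4m 02s"),
    ("Large TU D", "7m 23s", "2m 47s"),
    ("Full Application", "30m 45s", "20m 42s"),
]
for name, baseline, mtpc in rows:
    print(name, reduction_percent(baseline, mtpc))
```

Expressed as a speedup, the largest translation unit goes from 3086 s to 838 s, roughly 3.7x.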

The optimization is most beneficial for builds containing a small number of extremely large translation units that underutilize available CPU cores due to insufficient backend parallelism.

For workloads already containing a sufficiently large number of independent translation units relative to available hardware parallelism, the overall benefit is naturally smaller or even negative.


8. Conclusion

MTPC extends ThinLTO with CallGraph-aware intra-module backend parallelism for extremely large LLVM modules.

The approach aims to:

  • reduce long-tail compilation latency
  • improve distributed build scalability
  • preserve IPO effectiveness
  • remain compatible with existing LLVM infrastructure

MTPC is intended as a complementary scalability mechanism to existing ThinLTO and distributed build parallelism, rather than a replacement for them.

Feedback and discussion are welcome.


9. References

These patches implement the initial MTPC prototype.

The full design has been completed, and the implementation is being split into multiple smaller patches. These patches will be submitted incrementally to LLVM for review to enable step-by-step validation and easier integration.
