[RFC] [ThinLTO]: Multi-Thread Parallel Compilation for Large Modules

Authors: Wei Wei, Zheng Cheng, Zhongying Liu, Jiaping Mao, Jinjie Huang


1. Motivation

ThinLTO already provides efficient module-level parallelism by distributing optimization across translation units. However, in large-scale production codebases, extremely large LLVM modules can still become major compilation bottlenecks.

In large-scale data-center applications, a single translation unit may contain hundreds of thousands of lines of code. Even under highly parallel distributed build systems, overall build latency is often dominated by several extremely large modules:

Total Build Time ā‰ˆ max_i (T_compile(Module_i))

The root cause is that optimization and code generation inside a single LLVM module are still largely serialized.

Existing FullLTO parallel backend infrastructure primarily focuses on parallelizing CodeGen. However, Opt can also dominate compilation time for large modules.

As a result, existing FullLTO backend parallelism cannot fully address long-tail compilation latency caused by very large translation units.

This RFC proposes a new backend compilation model to address this issue.


2. Proposal: Multi-Thread Parallel Compilation (MTPC)

This RFC proposes Multi-Thread Parallel Compilation (MTPC), a backend extension for ThinLTO that enables intra-module parallelism.

The design is inspired by the existing FullLTO parallel backend model, which splits modules at function granularity for parallel code generation. MTPC extends this idea by introducing CallGraph-aware partitioning instead of naive function-level splitting.

The key idea is:

Partition a large LLVM module into multiple CallGraph-aware submodules and compile them in parallel.


3. High-Level Design

The MTPC pipeline extends the existing ThinLTO backend flow.

Original ThinLTO Flow

Multi-Thread Parallel Compilation Flow

At a high level:

  1. Source files are compiled into LLVM bitcode (.bc)
  2. During the ThinLTO backend stage, a large module is partitioned into multiple CallGraph-aware IR partitions
  3. Each partition is compiled independently and in parallel
  4. Generated object files are merged into a single relocatable object via lld -r
  5. Each backend partition may also emit its own DWARF object (.dwo)
  6. Multiple .dwo files are merged into a final .dwp (optional)
  7. The merged object then participates in the normal final link step to produce the executable

4. Design Overview

4.1 CallGraph-aware Partitioning

A naive function-level split severely hurts interprocedural optimization quality. Many LLVM IPO passes rely on complete CallGraph visibility, including:

  • Inlining
  • Indirect Call Promotion (ICP)
  • Devirtualization
  • Constant propagation

MTPC partitions modules based on CallGraph and SCC hierarchy, attempting to balance:

  • optimization quality
  • backend parallelism
  • partition granularity

PGO profile information may additionally be used to recover indirect call edges.


4.2 Parallel Backend Compilation

Each partition executes independently through the normal LLVM backend pipeline.

Two implementation strategies were explored to further increase intra-module parallelism:

  1. CallGraph-level parallelism for both Opt and CodeGen

    In this model, both Opt and CodeGen are executed per CallGraph-aware partition. Each partition is treated as an independent submodule throughout the backend pipeline.

  2. Hybrid parallelism: CallGraph-level Opt + function-level CodeGen

    In this model, Opt is performed at CallGraph-partition granularity, while the CodeGen phase is further parallelized at function granularity within each partition, similar to existing FullLTO parallel CodeGen strategies.

Both approaches have been implemented and evaluated.

In practice, the first approach already provides sufficient parallelism for the target workloads, and avoids additional complexity in intra-partition scheduling and function-level dispatch. Therefore, the current upstream proposal only retains the first strategy.


4.3 Object Merging

Each partition generates an independent object file. Final merging is performed through:

lld -r

This design keeps MTPC:

  • ABI-compatible
  • transparent to existing build systems
  • minimally invasive to LLVM backend pipeline

5. Symbol Handling Strategy

Symbol transformation is performed during the split phase to preserve IR semantic correctness across partitions.

5.1 Transformation Scope

The symbol transformation stage is integrated directly into module partitioning.

The following operations are performed together:

  • partition construction
  • symbol cloning
  • linkage rewriting
  • symbol renaming

Pipeline:

Module IR
→ Split Module
→ Change Symbol Attribute
→ Parallel backend execution across CallGraph partitions (Opt + CodeGen)
→ lld -r
→ Final Object

5.2 Function and Global Variable Linkage Rules

Original Linkage Transformed Result
external (function) Primary partition: external
Other partitions: available_externally
internal (function) If single-use: keep internal
Otherwise: promote to external
external (global variable) Primary partition: external
Other partitions: declaration only
internal (global variable) Promote to external, then handled as external
available_externally Directly cloned

5.3 Special IR Semantic Handling

LLVM IR contains several entities with non-trivial semantic constraints. Incorrect partitioning may break correctness or introduce inconsistencies.

Current implementation provides explicit handling for:

IR Entity Transformation Rule
VTable (external) Primary: external, others: available_externally
COMDAT group Kept whole; primary emits external, others available_externally
alias Strong alias: resolved to aliasee; weak alias: co-located in primary partition
ifunc Resolver kept in primary; others converted to available_externally

In addition, global constructors/destructors ownership is preserved via primary-partition assignment.

It is possible that other LLVM IR semantic constraints require additional handling. Feedback is welcome.


6. Current Limitations

6.1 CloneModule Serialization Bottleneck

The current implementation still contains serialization bottlenecks:

Serial CloneModule
        ↓
Serial IR Serialization
        ↓
Parallel Backend Compilation

Because LLVM IR objects are tightly coupled with their owning LLVMContext, each partition currently requires an independent context. However, CloneModule preserves the original module’s LLVMContext in its output Module. As a result, partition isolation cannot be achieved directly, and an explicit serialization/deserialization step is required to reconstruct each partition in a separate LLVMContext.

CloneModule is not thread-safe today, preventing fully parallel cloning without significant LLVM IR infrastructure changes.

Future work includes:

  • reducing serialization overhead
  • making CloneModule thread-safe or partially parallelizable

6.2 Debug Information Growth

Module partitioning increases the number of generated compilation units (CUs), which may introduce duplicated DWARF debug information across partitions.

In current experiments, enabling -fdebug-types-section keeps debug information growth within approximately 20%, which is currently considered acceptable for production deployment.

Further optimizations for DWARF deduplication and debug information compaction are planned as future work.


7. Experimental Results

The following workloads are large-scale internal production applications.

Additional validation on open-source workloads is planned in future evaluations.

Production-scale experiments show substantial compilation time reductions:

Workload Baseline MTPC Reduction
Large TU A 51m 26s 13m 58s 72.85%
Large TU B 24m 07s 7m 38s 68.35%
Large TU C 9m 34s 4m 02s 57.84%
Large TU D 7m 23s 2m 47s 62.30%
Full Application 30m 45s 20m 42s 32.68%

The optimization is most beneficial for builds containing a small number of extremely large translation units that underutilize available CPU cores due to insufficient backend parallelism.

For workloads already containing a sufficiently large number of independent translation units relative to available hardware parallelism, the overall benefit is naturally smaller or even negative.


8. Conclusion

MTPC extends ThinLTO with CallGraph-aware intra-module backend parallelism for extremely large LLVM modules.

The approach aims to:

  • reduce long-tail compilation latency
  • improve distributed build scalability
  • preserve IPO effectiveness
  • remain compatible with existing LLVM infrastructure

MTPC is intended as a complementary scalability mechanism to existing ThinLTO and distributed build parallelism, rather than a replacement for them.

Feedback and discussion are welcome.


9. References

These patches implement the initial MTPC prototype.

The full design has been completed, and the implementation is being split into multiple smaller patches. These patches will be submitted incrementally to LLVM for review to enable step-by-step validation and easier integration.

3 Likes

Did you use AI to contribute to this? I see that you are new, and per LLVM AI Tool Use Policy — LLVM 23.0.0git documentation, any substantial use must be disclosed. Some of the wordier phrases you used (notably the last sentence) are common expressions from AI that aren’t commonly human generated.

That said, over in Julia, we’ve maintained something exactly like this for a few years: https://github.com/JuliaLang/julia/pull/47797. This may validate your performance finding. Though as you noted, there can be some semantic issues with this also (you had to rewrite the meaning of various linkages to make splitting valid.)

There is also already a copy of this logic in OrcJIT (orc::IRPartitionLayer), so it may be nice to find ways to reduce duplication there.

Update section 5.3

5.3 Special Global Value Handling

LLVM IR contains several global value constructs with non-trivial linkage and runtime semantics. These require special handling during partitioning to preserve correctness.

Current implementation provides explicit handling for:

Global value type What it is Rule in MTPC Reason
VTable Virtual method dispatch table used by C++ for dynamic dispatch Primary: external, others: available_externally Preserve devirtualization optimization
COMDAT group Linker-level grouping mechanism enforcing One Definition Rule (ODR) for weak/inline entities Kept whole; primary emits external, others available_externally Incomplete COMDAT splitting may lead to missing symbol
alias Symbol that refers directly to another global value (aliasee) Strong alias: replaced with aliasee (per globalopt style); weak alias: put alias + aliasee in primary partition ensure correct alias-to-aliasee resolution
ifunc Indirect function whose implementation is selected by a runtime resolver Primary: keep resolver + ifunc together; resolver marked external; others: ifunc becomes declare, resolver becomes available_externally preserve runtime resolution semantics
  • external indicates a symbol with global linkage that is visible across the entire module and participates in normal linking.
  • available_externally indicates a definition that is used for optimization purposes but is discarded during code generation and does not appear in the final linked output.

In addition, global constructors/destructors ownership is preserved via primary-partition assignment.

It is possible that other LLVM IR semantic constraints require additional handling. Feedback is welcome.

Thanks for the comments and suggestions.

  1. Regarding AI usage:
    This RFC did use AI for translation, but the design, implementation, and codes are not generated by AI. Since English is not my native language, AI may translate better than me ~~ I will try to avoid using AI tranlation next time.
  2. Regarding the Julia implementation:
    We will take a closer look and check which parts can be reused.
  3. Regarding special handling of global values:
    I have updated Section 5.3 with more clarification in comments.

Thanks again for the review. Welcome for more discussion.

For the 2nd question, about the difference between JuliaLang/OrcJIT and MTPC:

JuliaLang and OrcJIT do not partition modules at the granularity of call graph . Instead, they partition based on connected components in the call graph.

The new MTPC partitions at the call graph granularity, resulting in finer-grained decomposition and exposing more opportunities for parallel compilation.

      **Original Big Module**

    F1		F6
   /  \    /  \             F9 <-----+
  /    \  /    \             |       |
 v      v       v            v       |  
F2      F4      F7          F10      |   
 |       |      |            |       |
 v       v      v            v       |
F3      F5      F8          F11 ----- 


   **JuliaLang or OrcJit 's Partitioning**


    Partition 1             Partition2

    F1		F6
   /  \    /  \             F9 <-----+
  /    \  /    \             |       |
 v      v       v            v       |  
F2      F4      F7          F10      |   
 |       |      |            |       |
 v       v      v            v       |
F3      F5      F8          F11 ----- 



  **New MTPC Partitioning**

 Partition 1    Partition 2     Partition 3 

    F1		     F6
   /  \         /  \             F9 <-----+
  /    \       /    \             |       |
 v      v     v     v             v       |  
F2      F4    F4    F7           F10      |   
 |       |    |     |             |       |
 v       v    v     v             v       |
F3      F5    F5    F8           F11 -----

Due to the more fine-grained partitioning, it is necessary to handle shared functions and data in a more precise manner, ensuring that functions and data are only placed into their necessary partitions as needed. This avoids duplicating them across multiple partitions, which could lead to code size bloat and duplicate symbol issues during linking.

I’d like the Full and Thin LTO pipelines to be moving closer together rather than to diverge even more. This would be beneficial for Universal and FatLTO. I think it also helps a great deal for maintenance if the pipelines all work basically the same way, and only mildly diverge in key aspects (e.g., Thin Link vs monolithic Bitcode).

Is there any reason your approach is not suitable for FullLTO? You would probably be leaving perf on the table, but it shouldn’t be fundamentally incompatible, right? From the description, it sounds to me like it’s also possible that there could be a way to structure the partitions so that you minimize/eliminate duplicate work in FullLTO and use a different algorithm to support ThinLTO for reduced latency. I’m guessing it may also be possible to add additional information to the summaries to make up for the missing bits that would normally result in less than ideal optimization for FullLTO with multiple partitions.

The numbers of link time reduction looks great. But I’d like to see the impact on performance numbers. IIUC, the new model may have some performance regressions?

For linkage, you mentioned the tool may promote internal to external linkage, this may no be ideal. As new symbol names may be an ABI breaking changer. And also these names may contribute to code size significantly.

Thanks for the RFC. Indeed, the ThinLTO backend compilations can be quite expensive and become a serial bottleneck.

Copying a meta comment I had added to the PR here: LTO should not be directly calling lld. Instead, look at how Regular LTO’s existing split code gen handling adds the split modules to the link, which is via the AddStream callback.

Since the PR initially didn’t have the RFC I was under the initial impression that it was only splitting for codegen, so I had left a comment about leveraging Regular LTO’s existing support for split codegen, but I see now that opt and codegen are both being applied to the split modules. Agree with @ilovepi that if we go this route of creating parallelism in the LTO backends that it would be better to share infra between thin and regular LTO.

I’m unsure that this will work for optimizations that make assumptions during the thin link about what will be exported and use the summary in the backend passes. This includes things like WPD, CFI, and MemProfContextDisambiguation (these are all not enabled by default but under pretty heavy use in practice).

I need to look through the RFC in finer detail, but is there a reason for not splitting the modules before the LTO link? There are some downsides, such as a heavier weight thin link due to the need for importing into a bigger set of modules (and of course this wouldn’t help regular LTO, but that’s not the focus of this RFC anyway). But it will likely be much simpler to implement and fit much more naturally into the existing infrastructure. E.g. in your example, F1 and F6 and their subgraphs could be split and end up in separate modules going into the thin link. I’d rather do something simple if that addresses a big part of the issue rather than add a huge amount of support to the LTO code.

And, on @teresajohnson last comment (why not during thinlink), we already have support there for partitioning based on a call graph (part of the instrumented contextual profile work). You can probably build on that (on that code in thinlink, I mean, not on the instrumented profiling part. Referenced that because it discusses call graph - driven partitioning)

That aside, I think splitting / merging modules in post-thinlink is interesting, e.g. doing all the function optimizations in parallel (not to mention machine IR passes, which are all function-level). @weiweic presented something like that in 2024, meaning there could be some reuse between the solutions here.

And, on @teresajohnson last comment (why not during thinlink), we already have support there for partitioning based on a call graph (part of the instrumented contextual profile work). You can probably build on that (on that code in thinlink, I mean, not on the instrumented profiling part. Referenced that because it discusses call graph - driven partitioning)

I actually meant before the thin link (i.e. towards the end of the initial compile steps), but this is a very valid point - during the thin link would be another option. I would greatly prefer to see either of these options explored before we consider a huge rewrite of the LTO handling, which is likely to have brittleness with thin link assumptions baked into the summaries. Doing this in a place that fits more naturally into the existing infrastructure is greatly preferable.

1 Like

Thanks for your comment.

  1. We also care a lot about performance. So far, no performance regression or improvement has been observed. The number of inlined functions is also almost the same as before.

  2. This design does change symbols from internal to external, but this is also how ThinLTO handles them. Although the symbols become external, they are renamed with a hash ID appended. As a result, the renamed symbols are unlikely to conflict with symbols from other ELF files and therefore should not cause runtime issues.

  3. Regarding the code size issue, it is true that this design increases code size by about 10–15%. However, the symbol table itself does not contribute much to the increase. We checked SPEC2007 as shown below. While the symbol table size itself increases significantly in percentage, its contribution relative to the whole binary size is small.

SPEC2007 Benchmark ThinLTO symtab size (bytes) MTPC symtab size (bytes) Increase Object size (bytes) Symtab Increase / Object Size
500 perlbench 27960 212208 30.88% 2788572 1.795%
502 gcc 903024 1277912 40.61% 13181326 2.782%
525 x264 40752 48816 19.79% 623701 1.293%
531 deepsjeng 12576 15504 23.28% 12234259 0.024%
541 leela 19056 22488 18.01% 135542 2.532%
557 xz 20880 24600 17.82% 159264 2.336%

Thanks for your comment. I think it’s possible to extend this design to fulllto. we will try to refactor the logic into a standalone function to fit both pipelines. And also will check the performance impact on fulllto.

JuliaLang and OrcJIT do not partition modules at the granularity of call graph . Instead, they partition based on connected components in the call graph.

This isn’t really true of either one. OrcJIT partitions modules based upon the minimum set required to maintain IR correctness (see source comment): LLVM: lib/ExecutionEngine/Orc/IRPartitionLayer.cpp Source File

JuliaLang does partition the call graph based on connected components, but then aims to use that to partitions the module with finer-grained decomposition, essentially trying to do a graph min cut, to prevent the size increase you mentioned.

@jdoerfert and I have been discussing an idea for doing an early split, likely at compile time. The idea is that, instead of having N modules for N TUs at link time, we could have M modules, where M > N. This could further improve parallelism while fitting naturally into the linker infrastructure we already have.

Regarding module splitting, AMDGPU already has a module split pass for full LTO that splits the monolithic module after the middle-end and then invokes the backend in parallel. There is also another module split pass implemented by the Intel folks. I think we should definitely consolidate these different module split implementations.

This makes sense and has existing art. Rust does this to parallelize crate compilation by splitting CodeGen into multiple ā€œCodeGen unitsā€ that get thin-linked to avoid pessimizing most optimizations that would otherwise benefit from whole-crate visibility. I believe rustc does this by explicitly creating multiple LLVM modules rather than splitting an existing LLVM module. It looks like there’s also existing art in tree (llvm-project/llvm/lib/Transforms/Utils/SplitModule.cpp at main Ā· llvm/llvm-project Ā· GitHub) if you do want to split an existing module.

Consolidation sounds good to me, but I think that’s mostly orthogonal to what is being proposed in the RFC.

Yes and no. The ā€œnoā€ part is that the first PR in this RFC introduces yet another module split pass, which I don’t think is completely orthogonal. The module split idea proposed in this RFC, however, is orthogonal.

An interesting idea about splitting the modules at the end of compile steps. To me that may make clang driver become complicated. If one doesn’t want to change the build system of a project(with/without thinlto module splitting or different splitted module numbers), clang driver seems can not easily handle this. A file to share the spltting info between lto compile step and lto link step?

thinlink is a thinlto specific functionality, as this RFC is going toward to a goal that can be commonly used between thinlto and fulllto, maybe doing the splitting inside thinlink can not achieve this goal.

  1. Why did we design a new split plan?
    Our splitting implementation shares core similarities with AMD’s approach: both build upon the CallGraph infrastructure for module partitioning. Nevertheless, multiple design choices in AMD’s splitting algorithm make it incompatible with our target use case, detailed as below:
  • AMD’s splitting pass selects kernel functions as split root nodes, which is a GPU-kernel-specific design and cannot generalize to our workloads.

  • AMD constructs its SplitGraph with overly conservative handling for indirect calls: it establishes call edges via a full Cartesian product between all functions containing indirect call sites and all functions whose address is taken, even when no real runtime invocation exists between any pair.
    Example illustration:
    Functions with indirect calls: A, B
    Address-taken candidate functions: C, D

         partition0           partition1
           A                      B
             \                  /   \
         C    D                C     D
           (indirect)  (indirect) (indirect)  

This aggressive edge creation bundles excessive functions into identical partitions, leaving individual split submodules still bloated in size after separation.

  • AMD’s splitting logic lacks IFunc support; post-split verification in LLVM Verifier will emit the diagnostic ifunc must point to resolver and fail validation.
    Given the above incompatibilities, neither unifying our design with AMD’s nor directly reusing AMD’s existing splitting implementation fits our requirements.
  1. If module is splited at compile time as you said, it will introduce non-trivial modifications to the existing distributed ThinLTO build workflow.
    Specifically: splitting N compilation units at the initial compilation phase will generate M split object files, which in turn requires corresponding adjustments for all subsequent build steps (Step 2 ~ Step 6 in the workflow below). Could you elaborate on your proposed solution to resolve this pipeline disruption?
# Step1: Source compile to ThinLTO bitcode objects
clang -flto=thin -O2 t1.c t2.c t3.c -c
# Step2: Generate ThinLTO index via thin-link phase
clang -flto=thin -O2 t1.o t2.o t3.o -fuse-ld=lld -Wl,--thinlto-index-only
# Step3~5: Per-object backend compilation with precomputed index
clang -O2 -o t1.native.o t1.o -c -fthinlto-index=t1.o.thinlto.bc
clang -O2 -o t2.native.o t2.o -c -fthinlto-index=t2.o.thinlto.bc
clang -O2 -o t3.native.o t3.o -c -fthinlto-index=t3.o.thinlto.bc
# Step6: Final link with native objects
clang t1.native.o t2.native.o t3.native.o -o a.out -fuse-ld=lld
  1. I have not yet located Intel’s corresponding split implementation. Could you share the source code repository link for their solution if available?

Thanks for the suggestions. After carefully reviewing the code, it is clear that OrcJIT is not based on connected components. It follows a JIT ā€œcompile-on-demandā€ design, where only the minimal required set is compiled. It only handles two types of Global Values—global variables and aliases—and does not appear to handle ifuncs (I am not sure why this does not cause issues). In addition, its handling of global variables is very conservative: if any global variable is needed, all global variables are included. This guarantees correctness, but inevitably introduces a lot of redundancy, leading to code size growth.

For JuliaLang, I only saw evidence of call-graph partitioning based on connected components, with a priority-queue-based approach to distribute them across partitions. I did not find code corresponding to the ā€œfiner-grained decompositionā€ or ā€œgraph min-cutā€ mentioned above—could you point out where this is implemented? However, there is one aspect that could be reused: in JuliaLang’s module splitting process, deserialization is performed lazily, meaning only the split submodules are deserialized instead of the entire module. This approach should help mitigate high peak memory usage.

I’m not saying the proposed splitting strategy needs to be the same as AMDGPU’s. What I wanted to emphasize is that we already have at least two module-splitting passes in the tree. In fact, Intel’s pass (https://github.com/llvm/llvm-project/pull/131347) is more general-purpose. I believe that, with some additional tailoring, it could fit what you need here.

From the build system’s perspective, steps 2-6 are transparent because they are expected to be handled automatically by the linker. From the linker’s point of view, it is simply receiving M object files.

I do acknowledge that we will end up generating more object files at compile time, potentially a non-deterministic number of them, and then pass those to the linker. That is probably the only part that needs careful consideration.

1 Like