[RFC] [ThinLTO]: Multi-Thread Parallel Compilation for Large Modules

flyahead · May 22, 2026, 9:46am

Authors: Wei Wei, Zheng Cheng, Zhongying Liu, Jiaping Mao, Jinjie Huang

1. Motivation

ThinLTO already provides efficient module-level parallelism by distributing optimization across translation units. However, in large-scale production codebases, extremely large LLVM modules can still become major compilation bottlenecks.

In large-scale data-center applications, a single translation unit may contain hundreds of thousands of lines of code. Even under highly parallel distributed build systems, overall build latency is often dominated by several extremely large modules:

Total Build Time ≈ max_i (T_compile(Module_i))

The root cause is that optimization and code generation inside a single LLVM module are still largely serialized.

Existing FullLTO parallel backend infrastructure primarily focuses on parallelizing CodeGen. However, Opt can also dominate compilation time for large modules.

As a result, existing FullLTO backend parallelism cannot fully address long-tail compilation latency caused by very large translation units.

This RFC proposes a new backend compilation model to address this issue.

2. Proposal: Multi-Thread Parallel Compilation (MTPC)

This RFC proposes Multi-Thread Parallel Compilation (MTPC), a backend extension for ThinLTO that enables intra-module parallelism.

The design is inspired by the existing FullLTO parallel backend model, which splits modules at function granularity for parallel code generation. MTPC extends this idea by introducing CallGraph-aware partitioning instead of naive function-level splitting.

The key idea is:

Partition a large LLVM module into multiple CallGraph-aware submodules and compile them in parallel.

3. High-Level Design

The MTPC pipeline extends the existing ThinLTO backend flow.

Original ThinLTO Flow

Multi-Thread Parallel Compilation Flow

At a high level:

Source files are compiled into LLVM bitcode (.bc)
During the ThinLTO backend stage, a large module is partitioned into multiple CallGraph-aware IR partitions
Each partition is compiled independently and in parallel
Generated object files are merged into a single relocatable object via lld -r
Each backend partition may also emit its own DWARF object (.dwo)
Multiple .dwo files are merged into a final .dwp (optional)
The merged object then participates in the normal final link step to produce the executable

4. Design Overview

4.1 CallGraph-aware Partitioning

A naive function-level split severely hurts interprocedural optimization quality. Many LLVM IPO passes rely on complete CallGraph visibility, including:

Inlining
Indirect Call Promotion (ICP)
Devirtualization
Constant propagation

MTPC partitions modules based on CallGraph and SCC hierarchy, attempting to balance:

optimization quality
backend parallelism
partition granularity

PGO profile information may additionally be used to recover indirect call edges.

4.2 Parallel Backend Compilation

Each partition executes independently through the normal LLVM backend pipeline.

Two implementation strategies were explored to further increase intra-module parallelism:

CallGraph-level parallelism for both Opt and CodeGen

In this model, both Opt and CodeGen are executed per CallGraph-aware partition. Each partition is treated as an independent submodule throughout the backend pipeline.
Hybrid parallelism: CallGraph-level Opt + function-level CodeGen

In this model, Opt is performed at CallGraph-partition granularity, while the CodeGen phase is further parallelized at function granularity within each partition, similar to existing FullLTO parallel CodeGen strategies.

Both approaches have been implemented and evaluated.

In practice, the first approach already provides sufficient parallelism for the target workloads, and avoids additional complexity in intra-partition scheduling and function-level dispatch. Therefore, the current upstream proposal only retains the first strategy.

4.3 Object Merging

Each partition generates an independent object file. Final merging is performed through:

lld -r

This design keeps MTPC:

ABI-compatible
transparent to existing build systems
minimally invasive to LLVM backend pipeline

5. Symbol Handling Strategy

Symbol transformation is performed during the split phase to preserve IR semantic correctness across partitions.

5.1 Transformation Scope

The symbol transformation stage is integrated directly into module partitioning.

The following operations are performed together:

partition construction
symbol cloning
linkage rewriting
symbol renaming

Pipeline:

Module IR
→ Split Module
→ Change Symbol Attribute
→ Parallel backend execution across CallGraph partitions (Opt + CodeGen)
→ lld -r
→ Final Object

5.2 Function and Global Variable Linkage Rules

Original Linkage	Transformed Result
`external` (function)	Primary partition: `external` Other partitions: `available_externally`
`internal` (function)	If single-use: keep `internal` Otherwise: promote to `external`
`external` (global variable)	Primary partition: `external` Other partitions: declaration only
`internal` (global variable)	Promote to `external`, then handled as external
`available_externally`	Directly cloned

5.3 Special IR Semantic Handling

LLVM IR contains several entities with non-trivial semantic constraints. Incorrect partitioning may break correctness or introduce inconsistencies.

Current implementation provides explicit handling for:

IR Entity	Transformation Rule
VTable (`external`)	Primary: `external`, others: `available_externally`
COMDAT group	Kept whole; primary emits `external`, others `available_externally`
`alias`	Strong alias: resolved to aliasee; weak alias: co-located in primary partition
`ifunc`	Resolver kept in primary; others converted to `available_externally`

In addition, global constructors/destructors ownership is preserved via primary-partition assignment.

It is possible that other LLVM IR semantic constraints require additional handling. Feedback is welcome.

6. Current Limitations

6.1 CloneModule Serialization Bottleneck

The current implementation still contains serialization bottlenecks:

Serial CloneModule
        ↓
Serial IR Serialization
        ↓
Parallel Backend Compilation

Because LLVM IR objects are tightly coupled with their owning LLVMContext, each partition currently requires an independent context. However, CloneModule preserves the original module’s LLVMContext in its output Module. As a result, partition isolation cannot be achieved directly, and an explicit serialization/deserialization step is required to reconstruct each partition in a separate LLVMContext.

CloneModule is not thread-safe today, preventing fully parallel cloning without significant LLVM IR infrastructure changes.

Future work includes:

reducing serialization overhead
making CloneModule thread-safe or partially parallelizable

6.2 Debug Information Growth

Module partitioning increases the number of generated compilation units (CUs), which may introduce duplicated DWARF debug information across partitions.

In current experiments, enabling -fdebug-types-section keeps debug information growth within approximately 20%, which is currently considered acceptable for production deployment.

Further optimizations for DWARF deduplication and debug information compaction are planned as future work.

7. Experimental Results

The following workloads are large-scale internal production applications.

Additional validation on open-source workloads is planned in future evaluations.

Production-scale experiments show substantial compilation time reductions:

Workload	Baseline	MTPC	Reduction
Large TU A	51m 26s	13m 58s	72.85%
Large TU B	24m 07s	7m 38s	68.35%
Large TU C	9m 34s	4m 02s	57.84%
Large TU D	7m 23s	2m 47s	62.30%
Full Application	30m 45s	20m 42s	32.68%

The optimization is most beneficial for builds containing a small number of extremely large translation units that underutilize available CPU cores due to insufficient backend parallelism.

For workloads already containing a sufficiently large number of independent translation units relative to available hardware parallelism, the overall benefit is naturally smaller or even negative.

8. Conclusion

MTPC extends ThinLTO with CallGraph-aware intra-module backend parallelism for extremely large LLVM modules.

The approach aims to:

reduce long-tail compilation latency
improve distributed build scalability
preserve IPO effectiveness
remain compatible with existing LLVM infrastructure

MTPC is intended as a complementary scalability mechanism to existing ThinLTO and distributed build parallelism, rather than a replacement for them.

Feedback and discussion are welcome.

9. References

These patches implement the initial MTPC prototype.

The full design has been completed, and the implementation is being split into multiple smaller patches. These patches will be submitted incrementally to LLVM for review to enable step-by-step validation and easier integration.

vtjnash · May 22, 2026, 3:15pm

Did you use AI to contribute to this? I see that you are new, and per LLVM AI Tool Use Policy — LLVM 23.0.0git documentation, any substantial use must be disclosed. Some of the wordier phrases you used (notably the last sentence) are common expressions from AI that aren’t commonly human generated.

That said, over in Julia, we’ve maintained something exactly like this for a few years: https://github.com/JuliaLang/julia/pull/47797. This may validate your performance finding. Though as you noted, there can be some semantic issues with this also (you had to rewrite the meaning of various linkages to make splitting valid.)

There is also already a copy of this logic in OrcJIT (orc::IRPartitionLayer), so it may be nice to find ways to reduce duplication there.

flyahead · May 25, 2026, 12:04pm

Update section 5.3

5.3 Special Global Value Handling

LLVM IR contains several global value constructs with non-trivial linkage and runtime semantics. These require special handling during partitioning to preserve correctness.

Current implementation provides explicit handling for:

Global value type	What it is	Rule in MTPC	Reason
VTable	Virtual method dispatch table used by C++ for dynamic dispatch	Primary: `external`, others: `available_externally`	Preserve devirtualization optimization
COMDAT group	Linker-level grouping mechanism enforcing One Definition Rule (ODR) for weak/inline entities	Kept whole; primary emits `external`, others `available_externally`	Incomplete COMDAT splitting may lead to missing symbol
`alias`	Symbol that refers directly to another global value (aliasee)	Strong alias: replaced with aliasee (per `globalopt` style); weak alias: put alias + aliasee in primary partition	ensure correct alias-to-aliasee resolution
`ifunc`	Indirect function whose implementation is selected by a runtime resolver	Primary: keep resolver + ifunc together; resolver marked `external`; others: ifunc becomes `declare`, resolver becomes `available_externally`	preserve runtime resolution semantics

external indicates a symbol with global linkage that is visible across the entire module and participates in normal linking.
available_externally indicates a definition that is used for optimization purposes but is discarded during code generation and does not appear in the final linked output.

In addition, global constructors/destructors ownership is preserved via primary-partition assignment.

It is possible that other LLVM IR semantic constraints require additional handling. Feedback is welcome.

flyahead · May 25, 2026, 12:16pm

Thanks for the comments and suggestions.

Regarding AI usage:
This RFC did use AI for translation, but the design, implementation, and codes are not generated by AI. Since English is not my native language, AI may translate better than me ~~ I will try to avoid using AI tranlation next time.
Regarding the Julia implementation:
We will take a closer look and check which parts can be reused.
Regarding special handling of global values:
I have updated Section 5.3 with more clarification in comments.

Thanks again for the review. Welcome for more discussion.

flyahead · May 26, 2026, 12:25pm

For the 2nd question, about the difference between JuliaLang/OrcJIT and MTPC：

JuliaLang and OrcJIT do not partition modules at the granularity of call graph . Instead, they partition based on connected components in the call graph.

The new MTPC partitions at the call graph granularity, resulting in finer-grained decomposition and exposing more opportunities for parallel compilation.

      **Original Big Module**

    F1		F6
   /  \    /  \             F9 <-----+
  /    \  /    \             |       |
 v      v       v            v       |  
F2      F4      F7          F10      |   
 |       |      |            |       |
 v       v      v            v       |
F3      F5      F8          F11 ----- 


   **JuliaLang or OrcJit 's Partitioning**


    Partition 1             Partition2

    F1		F6
   /  \    /  \             F9 <-----+
  /    \  /    \             |       |
 v      v       v            v       |  
F2      F4      F7          F10      |   
 |       |      |            |       |
 v       v      v            v       |
F3      F5      F8          F11 ----- 



  **New MTPC Partitioning**

 Partition 1    Partition 2     Partition 3 

    F1		     F6
   /  \         /  \             F9 <-----+
  /    \       /    \             |       |
 v      v     v     v             v       |  
F2      F4    F4    F7           F10      |   
 |       |    |     |             |       |
 v       v    v     v             v       |
F3      F5    F5    F8           F11 -----

Due to the more fine-grained partitioning, it is necessary to handle shared functions and data in a more precise manner, ensuring that functions and data are only placed into their necessary partitions as needed. This avoids duplicating them across multiple partitions, which could lead to code size bloat and duplicate symbol issues during linking.

ilovepi · May 26, 2026, 3:39pm

I’d like the Full and Thin LTO pipelines to be moving closer together rather than to diverge even more. This would be beneficial for Universal and FatLTO. I think it also helps a great deal for maintenance if the pipelines all work basically the same way, and only mildly diverge in key aspects (e.g., Thin Link vs monolithic Bitcode).

Is there any reason your approach is not suitable for FullLTO? You would probably be leaving perf on the table, but it shouldn’t be fundamentally incompatible, right? From the description, it sounds to me like it’s also possible that there could be a way to structure the partitions so that you minimize/eliminate duplicate work in FullLTO and use a different algorithm to support ThinLTO for reduced latency. I’m guessing it may also be possible to add additional information to the summaries to make up for the missing bits that would normally result in less than ideal optimization for FullLTO with multiple partitions.

ChuanqiXu · May 27, 2026, 9:21am

The numbers of link time reduction looks great. But I’d like to see the impact on performance numbers. IIUC, the new model may have some performance regressions?

For linkage, you mentioned the tool may promote internal to external linkage, this may no be ideal. As new symbol names may be an ABI breaking changer. And also these names may contribute to code size significantly.

teresajohnson · May 28, 2026, 2:27pm

Thanks for the RFC. Indeed, the ThinLTO backend compilations can be quite expensive and become a serial bottleneck.

Copying a meta comment I had added to the PR here: LTO should not be directly calling lld. Instead, look at how Regular LTO’s existing split code gen handling adds the split modules to the link, which is via the AddStream callback.

Since the PR initially didn’t have the RFC I was under the initial impression that it was only splitting for codegen, so I had left a comment about leveraging Regular LTO’s existing support for split codegen, but I see now that opt and codegen are both being applied to the split modules. Agree with @ilovepi that if we go this route of creating parallelism in the LTO backends that it would be better to share infra between thin and regular LTO.

I’m unsure that this will work for optimizations that make assumptions during the thin link about what will be exported and use the summary in the backend passes. This includes things like WPD, CFI, and MemProfContextDisambiguation (these are all not enabled by default but under pretty heavy use in practice).

I need to look through the RFC in finer detail, but is there a reason for not splitting the modules before the LTO link? There are some downsides, such as a heavier weight thin link due to the need for importing into a bigger set of modules (and of course this wouldn’t help regular LTO, but that’s not the focus of this RFC anyway). But it will likely be much simpler to implement and fit much more naturally into the existing infrastructure. E.g. in your example, F1 and F6 and their subgraphs could be split and end up in separate modules going into the thin link. I’d rather do something simple if that addresses a big part of the issue rather than add a huge amount of support to the LTO code.

mtrofin · May 28, 2026, 2:41pm

And, on @teresajohnson last comment (why not during thinlink), we already have support there for partitioning based on a call graph (part of the instrumented contextual profile work). You can probably build on that (on that code in thinlink, I mean, not on the instrumented profiling part. Referenced that because it discusses call graph - driven partitioning)

That aside, I think splitting / merging modules in post-thinlink is interesting, e.g. doing all the function optimizations in parallel (not to mention machine IR passes, which are all function-level). @weiweic presented something like that in 2024, meaning there could be some reuse between the solutions here.

teresajohnson · May 28, 2026, 4:04pm

And, on @teresajohnson last comment (why not during thinlink), we already have support there for partitioning based on a call graph (part of the instrumented contextual profile work). You can probably build on that (on that code in thinlink, I mean, not on the instrumented profiling part. Referenced that because it discusses call graph - driven partitioning)

I actually meant before the thin link (i.e. towards the end of the initial compile steps), but this is a very valid point - during the thin link would be another option. I would greatly prefer to see either of these options explored before we consider a huge rewrite of the LTO handling, which is likely to have brittleness with thin link assumptions baked into the summaries. Doing this in a place that fits more naturally into the existing infrastructure is greatly preferable.

flyahead · May 29, 2026, 6:54am

Thanks for your comment.

We also care a lot about performance. So far, no performance regression or improvement has been observed. The number of inlined functions is also almost the same as before.
This design does change symbols from internal to external, but this is also how ThinLTO handles them. Although the symbols become external, they are renamed with a hash ID appended. As a result, the renamed symbols are unlikely to conflict with symbols from other ELF files and therefore should not cause runtime issues.
Regarding the code size issue, it is true that this design increases code size by about 10–15%. However, the symbol table itself does not contribute much to the increase. We checked SPEC2007 as shown below. While the symbol table size itself increases significantly in percentage, its contribution relative to the whole binary size is small.

SPEC2007	Benchmark	ThinLTO symtab size (bytes)	MTPC symtab size (bytes)	Increase	Object size (bytes)	Symtab Increase / Object Size
500	perlbench	27960	212208	30.88%	2788572	1.795%
502	gcc	903024	1277912	40.61%	13181326	2.782%
525	x264	40752	48816	19.79%	623701	1.293%
531	deepsjeng	12576	15504	23.28%	12234259	0.024%
541	leela	19056	22488	18.01%	135542	2.532%
557	xz	20880	24600	17.82%	159264	2.336%

flyahead · May 29, 2026, 9:51am

Thanks for your comment. I think it’s possible to extend this design to fulllto. we will try to refactor the logic into a standalone function to fit both pipelines. And also will check the performance impact on fulllto.

vtjnash · May 29, 2026, 7:08pm

JuliaLang and OrcJIT do not partition modules at the granularity of call graph . Instead, they partition based on connected components in the call graph.

This isn’t really true of either one. OrcJIT partitions modules based upon the minimum set required to maintain IR correctness (see source comment): LLVM: lib/ExecutionEngine/Orc/IRPartitionLayer.cpp Source File

JuliaLang does partition the call graph based on connected components, but then aims to use that to partitions the module with finer-grained decomposition, essentially trying to do a graph min cut, to prevent the size increase you mentioned.

shiltian · May 31, 2026, 4:36am

@jdoerfert and I have been discussing an idea for doing an early split, likely at compile time. The idea is that, instead of having N modules for N TUs at link time, we could have M modules, where M > N. This could further improve parallelism while fitting naturally into the linker infrastructure we already have.

Regarding module splitting, AMDGPU already has a module split pass for full LTO that splits the monolithic module after the middle-end and then invokes the backend in parallel. There is also another module split pass implemented by the Intel folks. I think we should definitely consolidate these different module split implementations.

boomanaiden154-1 · May 31, 2026, 5:39am

This makes sense and has existing art. Rust does this to parallelize crate compilation by splitting CodeGen into multiple “CodeGen units” that get thin-linked to avoid pessimizing most optimizations that would otherwise benefit from whole-crate visibility. I believe rustc does this by explicitly creating multiple LLVM modules rather than splitting an existing LLVM module. It looks like there’s also existing art in tree (llvm-project/llvm/lib/Transforms/Utils/SplitModule.cpp at main · llvm/llvm-project · GitHub) if you do want to split an existing module.

Consolidation sounds good to me, but I think that’s mostly orthogonal to what is being proposed in the RFC.

shiltian · May 31, 2026, 5:46pm

Yes and no. The “no” part is that the first PR in this RFC introduces yet another module split pass, which I don’t think is completely orthogonal. The module split idea proposed in this RFC, however, is orthogonal.

shchenz · June 1, 2026, 12:44am

An interesting idea about splitting the modules at the end of compile steps. To me that may make clang driver become complicated. If one doesn’t want to change the build system of a project(with/without thinlto module splitting or different splitted module numbers), clang driver seems can not easily handle this. A file to share the spltting info between lto compile step and lto link step?

thinlink is a thinlto specific functionality, as this RFC is going toward to a goal that can be commonly used between thinlto and fulllto, maybe doing the splitting inside thinlink can not achieve this goal.

mmjjpp · June 4, 2026, 12:19pm

Why did we design a new split plan?
Our splitting implementation shares core similarities with AMD’s approach: both build upon the CallGraph infrastructure for module partitioning. Nevertheless, multiple design choices in AMD’s splitting algorithm make it incompatible with our target use case, detailed as below:

AMD’s splitting pass selects kernel functions as split root nodes, which is a GPU-kernel-specific design and cannot generalize to our workloads.
AMD constructs its SplitGraph with overly conservative handling for indirect calls: it establishes call edges via a full Cartesian product between all functions containing indirect call sites and all functions whose address is taken, even when no real runtime invocation exists between any pair.
Example illustration:
Functions with indirect calls: A, B
Address-taken candidate functions: C, D

         partition0           partition1
           A                      B
             \                  /   \
         C    D                C     D
           （indirect)  （indirect) （indirect)

This aggressive edge creation bundles excessive functions into identical partitions, leaving individual split submodules still bloated in size after separation.

AMD’s splitting logic lacks IFunc support; post-split verification in LLVM Verifier will emit the diagnostic ifunc must point to resolver and fail validation.
Given the above incompatibilities, neither unifying our design with AMD’s nor directly reusing AMD’s existing splitting implementation fits our requirements.

If module is splited at compile time as you said, it will introduce non-trivial modifications to the existing distributed ThinLTO build workflow.
Specifically: splitting N compilation units at the initial compilation phase will generate M split object files, which in turn requires corresponding adjustments for all subsequent build steps (Step 2 ~ Step 6 in the workflow below). Could you elaborate on your proposed solution to resolve this pipeline disruption?

# Step1: Source compile to ThinLTO bitcode objects
clang -flto=thin -O2 t1.c t2.c t3.c -c
# Step2: Generate ThinLTO index via thin-link phase
clang -flto=thin -O2 t1.o t2.o t3.o -fuse-ld=lld -Wl,--thinlto-index-only
# Step3~5: Per-object backend compilation with precomputed index
clang -O2 -o t1.native.o t1.o -c -fthinlto-index=t1.o.thinlto.bc
clang -O2 -o t2.native.o t2.o -c -fthinlto-index=t2.o.thinlto.bc
clang -O2 -o t3.native.o t3.o -c -fthinlto-index=t3.o.thinlto.bc
# Step6: Final link with native objects
clang t1.native.o t2.native.o t3.native.o -o a.out -fuse-ld=lld

I have not yet located Intel’s corresponding split implementation. Could you share the source code repository link for their solution if available?

flyahead · June 4, 2026, 12:22pm

Thanks for the suggestions. After carefully reviewing the code, it is clear that OrcJIT is not based on connected components. It follows a JIT “compile-on-demand” design, where only the minimal required set is compiled. It only handles two types of Global Values—global variables and aliases—and does not appear to handle ifuncs (I am not sure why this does not cause issues). In addition, its handling of global variables is very conservative: if any global variable is needed, all global variables are included. This guarantees correctness, but inevitably introduces a lot of redundancy, leading to code size growth.

For JuliaLang, I only saw evidence of call-graph partitioning based on connected components, with a priority-queue-based approach to distribute them across partitions. I did not find code corresponding to the “finer-grained decomposition” or “graph min-cut” mentioned above—could you point out where this is implemented? However, there is one aspect that could be reused: in JuliaLang’s module splitting process, deserialization is performed lazily, meaning only the split submodules are deserialized instead of the entire module. This approach should help mitigate high peak memory usage.

shiltian · June 8, 2026, 8:33pm

I’m not saying the proposed splitting strategy needs to be the same as AMDGPU’s. What I wanted to emphasize is that we already have at least two module-splitting passes in the tree. In fact, Intel’s pass (https://github.com/llvm/llvm-project/pull/131347) is more general-purpose. I believe that, with some additional tailoring, it could fit what you need here.

From the build system’s perspective, steps 2-6 are transparent because they are expected to be handled automatically by the linker. From the linker’s point of view, it is simply receiving M object files.

I do acknowledge that we will end up generating more object files at compile time, potentially a non-deterministic number of them, and then pass those to the linker. That is probably the only part that needs careful consideration.

Topic		Replies	Views
[Proposal] Parallelize post-IPO stage. LLVM Dev List Archives	28	344	July 18, 2013
Proposal/patch: simple parallel LTO code generation LLVM Dev List Archives	2	106	August 12, 2015
[LLVM Dev] [Discussion] Function-based parallel LLVM backend code generation LLVM Dev List Archives	17	289	July 17, 2013
[RFC] A Unified LTO Bitcode Frontend IR & Optimizations lto , thinlto , clang	59	5213	May 31, 2023
exploring possibilities for unifying ThinLTO and FullLTO frontend + initial optimization pipeline LLVM Dev List Archives	14	378	April 30, 2018