[RFC] [ThinLTO]: Multi-Thread Parallel Compilation for Large Modules

mmjjpp · June 17, 2026, 2:40am

Apologies for not fully understanding your point earlier. Even so, Intel's module splitting scheme cannot be adapted to our requirements. It is designed specifically to separate the host and device compilation pipelines, a use case very different from ours, so we cannot repurpose it for our workloads.
Below is a concise comparison of all the candidate solutions we have assessed so far. Having weighed their respective merits and limitations, we believe the newly proposed splitting logic fills a real gap and does not duplicate existing functionality.

| Split method | Core splitting mechanism | Limitations for our use case |
|---|---|---|
| AMDGPU | Splits based on the call graph | 1. The splitting pass picks GPU kernel functions as split root nodes, a GPU-kernel-specific design that does not generalize to our workloads. 2. There is no dedicated handling for ifunc symbols. |
| Intel | Category-based splitting | 1. Root nodes are determined by kernel invocations or the presence of the `sycl-module-id` attribute, which is tailored to heterogeneous host-device compilation. 2. ifunc and symbol alias scenarios are not supported. |
| Julia | Splits by connected components | Fine-grained module partitioning is not achievable with this approach. |
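To illustrate the limitation listed in the Julia row, here is a minimal Python model (not the Julia or LLVM implementation). Symbols are nodes, any reference is an edge, and every connected component must land in a single partition. Alias-to-aliasee and ifunc-to-resolver references are included as edges because LLVM IR requires an alias's aliasee and an ifunc's resolver to be definitions in the same module. The symbol names, the union-find helper, and the greedy `pack` heuristic are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy module: symbol names are illustrative only.
SYMBOLS = ["main", "parse", "lex", "emit", "helper",
           "alias_emit", "ifunc_memcpy", "resolve_memcpy"]

# Each edge is a reference that forces both symbols into the same partition:
# ordinary calls, alias -> aliasee, and ifunc -> resolver.
EDGES = [
    ("main", "parse"), ("parse", "lex"), ("main", "emit"),
    ("emit", "helper"), ("helper", "ifunc_memcpy"),
    ("alias_emit", "emit"),              # alias -> aliasee
    ("ifunc_memcpy", "resolve_memcpy"),  # ifunc -> resolver
]


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def connected_components(symbols, edges):
    """Group symbols linked through any chain of edges, largest group first."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups = defaultdict(list)
    for s in symbols:
        groups[uf.find(s)].append(s)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


def pack(components, num_parts):
    """Greedily place each whole component into the currently smallest partition."""
    parts = [[] for _ in range(num_parts)]
    for comp in components:
        smallest = min(range(num_parts), key=lambda i: len(parts[i]))
        parts[smallest].extend(comp)
    return parts


if __name__ == "__main__":
    comps = connected_components(SYMBOLS, EDGES)
    print([len(p) for p in pack(comps, 4)])  # [8, 0, 0, 0]
```

Dropping only the `("main", "emit")` edge yields two components (5 and 3 symbols), so at most two of the four partitions are non-empty. When a few hub functions connect most of a module's symbols, this scheme cannot produce fine-grained partitions no matter how many are requested.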

Please feel free to review the analysis above and comment. We also encourage you to inspect our code for further verification ([ThinLTO][Split] Split module for parallel compilation in backend (1/N) by mmjjpp · Pull Request #198702 · llvm/llvm-project). Since none of the existing splitting approaches fits our use case, we hope you can approve the new splitting design.

In practice, Steps 2–6 are statically hardcoded in user build scripts and cannot easily be modified. A variable number of split objects would break this ThinLTO pipeline entirely. This is what motivates our merging step after codegen: it keeps a single, consistent output artifact for Steps 3–5 and removes the need to adjust build scripts.
If we instead split modules during compilation, achieving full transparency to user build flows would be far harder.
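The reason the merge keeps build scripts untouched can be modeled in a few lines: however many partitions codegen emits per task, grouping them back by task yields exactly one entry per object the build script already expects. This is a hypothetical Python sketch; `SplitObject`, `group_for_merge`, and the `task`/`part` numbering are illustrative assumptions, not names from the PR.

```python
from collections import defaultdict
from typing import NamedTuple


class SplitObject(NamedTuple):
    task: int   # hypothetical: index of the original backend task (one per expected output)
    part: int   # hypothetical: partition index within that task
    path: str   # temporary object file written by codegen


def group_for_merge(split_objects, expected_outputs):
    """Map each object the build script expects to its split objects, in partition order."""
    by_task = defaultdict(list)
    for obj in split_objects:
        by_task[obj.task].append(obj)
    plan = {}
    for task, output in enumerate(expected_outputs):
        parts = sorted(by_task.get(task, []), key=lambda o: o.part)
        if not parts:
            raise ValueError(f"no split objects were produced for task {task}")
        plan[output] = [o.path for o in parts]
    return plan


if __name__ == "__main__":
    objs = [SplitObject(0, 2, "a.p2.o"), SplitObject(1, 0, "b.p0.o"),
            SplitObject(0, 0, "a.p0.o"), SplitObject(0, 1, "a.p1.o")]
    # {'a.o': ['a.p0.o', 'a.p1.o', 'a.p2.o'], 'b.o': ['b.p0.o']}
    print(group_for_merge(objs, ["a.o", "b.o"]))
```

The keys of the plan are exactly the expected outputs, so the number of artifacts downstream steps see does not depend on the split count.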

mmjjpp · June 22, 2026, 12:57pm

Thanks for your reply, @teresajohnson.

We have dropped the original logic that invokes lld inside LTO and switched to the AddStream callback. However, this approach cannot support Distributed ThinLTO (DTLTO) as is, because codegen outputs multiple split submodules. We intend to merge the submodules with lld in the Clang driver to enable a complete DTLTO workflow; the implementation approach is described in [ThinLTO][Split] Split module for parallel compilation in backend (1/N) by mmjjpp · Pull Request #198702 · llvm/llvm-project · GitHub.
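One way the driver-side merge could look, sketched as a Python helper rather than Clang driver code: assuming the split objects are combined with an lld relocatable link (`ld.lld -r`), each expected output is produced from its partitions with a single command. The post only states that merging is done via lld in the Clang driver; the helper names and the choice of `-r` are assumptions.

```python
import shutil
import subprocess


def relocatable_merge_argv(output, split_objects, lld="ld.lld"):
    """Build an lld command line that combines split objects into one object file.

    `-r` requests a relocatable (partial) link, so the result is an ordinary
    object file that later build steps can consume unchanged.
    """
    if not split_objects:
        raise ValueError("no split objects to merge")
    return [lld, "-r", "-o", output, *split_objects]


def merge(output, split_objects):
    """Run the merge; raises if lld is missing or exits with an error."""
    argv = relocatable_merge_argv(output, split_objects)
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]} not found on PATH")
    subprocess.run(argv, check=True)
```

For example, `relocatable_merge_argv("a.o", ["a.p0.o", "a.p1.o"])` returns `["ld.lld", "-r", "-o", "a.o", "a.p0.o", "a.p1.o"]`.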

The MTPC parallel runtime also works for regular LTO; the corresponding fixes are included in the same PR. We built new infrastructure to launch split instances because the callback interfaces they require differ from those of upstream's existing implementation.
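The post does not describe how the MTPC runtime schedules split instances. As a rough model of the general idea only, each partition can be handed to a worker thread, with results collected in partition order so the merge input stays deterministic. `compile_partitions` and `fake_codegen` are hypothetical names. In CPython, threads do not speed up CPU-bound Python code because of the GIL, so this sketch models scheduling and ordering, not performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compile_partitions(partitions, codegen, max_workers=None):
    """Run `codegen` on every partition concurrently; return results in partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order even when later
        # partitions finish first, keeping the merge input deterministic.
        return list(pool.map(codegen, partitions))


def fake_codegen(partition):
    """Stand-in for backend codegen: smaller partitions sleep longer and finish later."""
    time.sleep(0.01 / len(partition))
    return "obj(" + ",".join(partition) + ")"


if __name__ == "__main__":
    # ['obj(f,g,h)', 'obj(i)']
    print(compile_partitions([["f", "g", "h"], ["i"]], fake_codegen, max_workers=2))
```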

As for splitting modules before LTO linking: as @shchenz noted, moving the split ahead of the LTO link stage disrupts build system integration, especially in the DTLTO case.
