[RFC] [ThinLTO]: Multi-Thread Parallel Compilation for Large Modules

My apologies for not fully understanding your viewpoint earlier. Even so, Intel’s module splitting scheme cannot be adapted to our requirements. Its design is tailored specifically to splitting host and device compilation pipelines, a use case very different from ours, so we cannot repurpose the implementation for our workloads.
A concise comparison of all candidate solutions we’ve assessed to date is provided below. Having thoroughly balanced their respective merits and limitations, we maintain that the newly proposed splitting logic serves an indispensable purpose and is not superfluous.

| Split Method | Core Splitting Mechanism | Limitations for Our Use Case |
| --- | --- | --- |
| AMDGPU | Split based on call graph | 1. The splitting pass picks GPU kernel functions as split root nodes, which is a GPU-kernel-specific design and cannot generalize to our workloads.<br>2. No dedicated handling logic for ifunc symbols is implemented. |
| Intel | Category-based splitting | 1. Root nodes are determined by kernel invocations or the presence of the `sycl-module-id` attribute, tailored for heterogeneous host-device compilation.<br>2. ifunc and symbol alias scenarios are not supported. |
| Julia | Split by connected components | Fine-grained module partitioning is not achievable with this approach. |
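
To illustrate the granularity limitation noted for connected-component splitting, here is a minimal Python sketch. It is not Julia's implementation; the function names and the toy call graph are hypothetical, and the module is modeled simply as a list of function names plus caller/callee edges. The point it shows is that any set of functions linked by calls, directly or transitively, ends up in one partition, no matter how large that partition is.

```python
from collections import defaultdict


def connected_components(functions, call_edges):
    """Group functions into partitions by undirected call-graph connectivity.

    `functions` is a list of names; `call_edges` is a list of
    (caller, callee) pairs. Both are hypothetical stand-ins for a module.
    """
    adjacency = defaultdict(set)
    for caller, callee in call_edges:
        # Treat edges as undirected: caller and callee must stay together.
        adjacency[caller].add(callee)
        adjacency[callee].add(caller)

    seen = set()
    components = []
    for fn in functions:
        if fn in seen:
            continue
        # Depth-first walk collecting everything reachable from `fn`.
        stack = [fn]
        seen.add(fn)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Toy module: four functions are tied together through a shared helper.
functions = ["main", "a", "b", "util", "c"]
call_edges = [("main", "a"), ("main", "b"), ("a", "util"), ("b", "util")]
partitions = connected_components(functions, call_edges)
print(partitions)  # [['a', 'b', 'main', 'util'], ['c']]
```

Only `c` can be separated here. The other four functions form a single component, so this strategy has no way to divide them across more threads.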

Please feel free to review the analysis above and offer any comments. We also encourage you to inspect our code for further verification: [ThinLTO][Split] Split module for parallel compilation in backend (1/N) by mmjjpp · Pull Request #198702 · llvm/llvm-project. As none of the existing splitting approaches fits our use case, we hope you can approve our new splitting design.

In practice, Steps 2–6 are hardcoded in user build scripts and cannot easily be modified. A variable number of split objects would break this ThinLTO pipeline entirely. This motivates our merging step after codegen: it preserves a single, consistent output artifact for Steps 3–5 and eliminates the need for script adjustments.
If we instead split modules during compilation, full transparency to user build flows would be far harder to achieve.
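
The split, parallel codegen, and merge flow described above can be sketched roughly as follows. This is a toy Python model under stated assumptions, not the code in the PR: `split_module`, `codegen`, `merge`, and `compile_module` are hypothetical names, the round-robin split is a placeholder for the real splitting logic, and string concatenation stands in for the actual merge of object files.

```python
from concurrent.futures import ThreadPoolExecutor


def split_module(functions, num_parts):
    """Hypothetical splitter: distribute functions round-robin over partitions."""
    parts = [[] for _ in range(num_parts)]
    for index, fn in enumerate(functions):
        parts[index % num_parts].append(fn)
    # Drop empty partitions so each worker has something to compile.
    return [part for part in parts if part]


def codegen(part):
    """Stand-in for backend codegen: emit a fake object blob per partition."""
    return "".join(f"<obj:{fn}>" for fn in part)


def merge(objects):
    """Stand-in for the post-codegen merge: many partial objects become one."""
    return "".join(objects)


def compile_module(functions, num_parts):
    """Split, compile partitions in parallel, and return ONE merged artifact."""
    parts = split_module(functions, num_parts)
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        # pool.map preserves partition order, so the merged output is stable.
        objects = list(pool.map(codegen, parts))
    return merge(objects)


# Whatever the partition count, the caller sees a single output artifact.
for num_parts in (1, 2, 8):
    artifact = compile_module(["f", "g", "h"], num_parts)
    print(num_parts, artifact)
```

Because the number of partitions never leaks past `compile_module`, a build script that expects exactly one object per module keeps working unchanged, which is the property the merging step is meant to preserve.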

Thanks for your reply, @teresajohnson.

We have dropped the original logic that invokes lld inside LTO and switched to the AddStream callback. However, this approach cannot support Distributed ThinLTO (DTLTO), because codegen needs to output multiple split submodules. We intend to handle submodule merging via lld in the Clang Driver to enable a complete DTLTO workflow; the implementation idea is detailed in [ThinLTO][Split] Split module for parallel compilation in backend (1/N) by mmjjpp · Pull Request #198702 · llvm/llvm-project.

The MTPC parallel runtime also works for regular LTO, with corresponding fixes merged in the same PR. We built new infrastructure to launch split instances because the callback interfaces for split instances diverge from upstream’s existing implementation.

Regarding splitting modules before LTO linking: as @shchenz noted, hoisting the split ahead of the LTO link stage disrupts build system integration, especially in the DTLTO case.