This is Part 2 of the Enhanced Machine Outliner, continuing [RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)
Motivation
The machine outliner operates within a single module, and its effectiveness can decrease significantly without costly link-time optimization (LTO), which is often not feasible for large-scale app development. We propose global function outlining that can be used with any separate compilation mode, such as NoLTO or ThinLTO.
We use prior codegen data to outline functions that can be optimized by the conventional linker’s identical code folding (ICF). Local outlining instances are serialized into object files and can be merged into the codegen data, either offline (using llvm-cgdata) or at link time with -fcodegen-data-generate. The subsequent codegen uses the codegen data to outline functions globally with -fcodegen-data-use.
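The generate / merge / use flow described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the actual implementation: instructions are plain strings, candidates are fixed-length windows (`WINDOW` is a made-up parameter), and the helper names (`stable_hash`, `outline`, `generate`, `merge`) are hypothetical.

```python
import hashlib
from collections import Counter

WINDOW = 3  # toy candidate length (assumption); real candidates vary in length


def stable_hash(seq):
    """Hash an instruction sequence so the value is identical in every module."""
    return hashlib.sha256("\n".join(seq).encode()).hexdigest()[:16]


def windows(module):
    """Yield every WINDOW-long instruction sequence in every function."""
    for func in module:
        for i in range(len(func) - WINDOW + 1):
            yield tuple(func[i:i + WINDOW])


def outline(module, cgdata=frozenset()):
    """Sequences this module outlines: locally repeated, or known from cgdata."""
    counts = Counter(windows(module))
    return {seq for seq, n in counts.items()
            if n >= 2 or stable_hash(seq) in cgdata}


def generate(module):
    """'Gen' step: publish the hashes of the local outlining instances."""
    return {stable_hash(seq) for seq in outline(module)}


def merge(*per_module):
    """Merge step: combine per-module codegen data into one set."""
    return frozenset().union(*per_module)


SEQ = ["ldr x0", "add x0", "str x0"]
mod_a = [SEQ + ["ret"], ["nop"] + SEQ]  # SEQ repeats inside module A
mod_b = [SEQ + ["b done"]]              # SEQ appears only once in module B

cgdata = merge(generate(mod_a), generate(mod_b))
first_round = outline(mod_b)            # nothing to outline locally
second_round = outline(mod_b, cgdata)   # SEQ is outlined thanks to cgdata
folded = outline(mod_a, cgdata) | second_round  # identical bodies fold (ICF)
```

In this model, module B outlines `SEQ` only after the merged data tells it the sequence is worth outlining, and the set union at the end stands in for the linker folding the two identical outlined bodies into one.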
Results
We used the self-hosted LLD for arm64-MachO to evaluate the text segment size and the link time. We compiled it with -Oz and applied the conventional linker’s optimizations (-dead_strip and --icf=safe).
- NoOutline: -mno-outline. The machine outliner is disabled.
- Base: The machine outliner is enabled by default under -Oz.
- Part1: [RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)
- Part2: This builds upon Part1. Note that Part2 does not apply to LTO.
  - -fcodegen-data-generate (Gen): There’s effectively no change from Part1 as the custom section doesn’t contribute to the final executable, except for producing the combined codegen data file.
  - -fcodegen-data-use (Use): This uses the codegen data to significantly reduce the code size from the global outlining.
  - -fcodegen-data-thinlto-two-rounds (Rounds): This runs the above two steps by repeating the codegen in place; it is only applicable with -flto=thin. This resulted in the same size saving as the Use case.
The link time with NoLTO (-fno-lto) is fast (~1 sec), and Part2 doesn’t significantly alter this (~2 sec). Therefore, we decided not to include this in the subsequent graph. The following graph presents a comparison of link times between ThinLTO (-flto=thin) and LTO (-flto). It indicates that Part2 ThinLTO requires more time to link than the Base ThinLTO as it handles codegen data and produces more outlining instances. However, the link time for ThinLTO is just 1/7 of that for LTO, yet it’s nearly as effective in size reduction: the size drops from 98.4% to 85.7% in ThinLTO, compared to a decrease from 96.7% to 82.0% in LTO.
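As a quick check of the "nearly as effective" claim, the quoted figures can be compared directly. This small calculation uses only the four percentages above and assumes both pairs are normalized against the same baseline, which the text does not state explicitly.

```python
# Size figures quoted above, in percent: (before Part2, after Part2).
sizes = {"ThinLTO": (98.4, 85.7), "LTO": (96.7, 82.0)}


def points_saved(before, after):
    """Size reduction in percentage points, rounded to one decimal."""
    return round(before - after, 1)


thin_points = points_saved(*sizes["ThinLTO"])
lto_points = points_saved(*sizes["LTO"])

# Share of the LTO saving that ThinLTO achieves.
share = round(thin_points / lto_points, 2)
```

Under that assumption, ThinLTO saves 12.7 percentage points against 14.7 for LTO, so it recovers roughly 86% of the LTO saving at about 1/7 of the link time.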
Discussion
- The initial concept of global outlining, introduced in [7] and expanded in [8], is coupled with the global function merger. This RFC implements the outlining part by utilizing codegen data as a foundation, making it applicable to other cross-module codegen optimizations. New codegen data can be defined for each client optimization and can be easily extended into the current format.
- The implementation is complete and tested for MachO + LLD for both NoLTO and ThinLTO, and can be easily extended to ELF + LLD.
- llvm-cgdata should be used, much as IRPGO uses llvm-profdata, to safely optimize the binary. Unlike profile data, codegen data is a build artifact; its staleness can diminish optimization efficiency but must not affect correctness. Notably, as discussed in [8], the outlining opportunity remains quite stable despite numerous source changes over time.
- Extending the bitcode summary for codegen isn’t viable as it’s designed for IR, not MIR. Applying IR optimization first can invalidate or degrade the MIR summary.
- Serializing MIR isn’t supported and needs significant backing. Using global data with synchronization for cross-module codegen contradicts the parallel design of (distributed) ThinLTO backend (opt/codegen). Our method avoids synchronization during codegen, and merges stable codegen data offline for safe use in future codegen.
- Serializing raw codegen data into separate files for each compilation unit isn’t preferred. Our raw codegen data is private and conveniently located with object files, eliminating the need for extra bookkeeping in the existing compilation pipeline. This private codegen data doesn’t reach the final executable because the conventional linker’s dead-stripping or garbage collection removes it.
- Creating a global suffix tree contradicts the use of separate compilations as it requires mapping the entire IRs. Instead, our approach only tracks the stable hashes of outlined functions locally. We then construct a compact global outlined hash tree using the traditional trie structure.
- The efficiency of global outlining is heavily influenced by the accuracy of stable hashes on global variables/objects. This RFC includes basic improvements on this aspect, but it could be further enhanced by hashing the structural contents.
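To make the trie idea above more concrete, here is a minimal sketch of a hash tree that stores sequences of stable hashes and shares their common prefixes. The class and method names (`HashNode`, `OutlinedHashTree`, `insert`, `find`, `merge`, `node_count`) are chosen for illustration and are not claimed to match the patch; the integer hashes in the example are placeholders for real stable hash values.

```python
class HashNode:
    """One trie node; the edge into it is labeled by a stable hash."""

    def __init__(self):
        self.children = {}   # stable hash -> HashNode
        self.terminals = 0   # outlining instances whose sequence ends here


class OutlinedHashTree:
    def __init__(self):
        self.root = HashNode()

    def insert(self, hashes, count=1):
        """Record an outlined sequence, reusing nodes for any shared prefix."""
        node = self.root
        for h in hashes:
            node = node.children.setdefault(h, HashNode())
        node.terminals += count

    def find(self, hashes):
        """Return how many recorded instances match this exact sequence."""
        node = self.root
        for h in hashes:
            node = node.children.get(h)
            if node is None:
                return 0
        return node.terminals

    def merge(self, other):
        """Fold another tree (e.g. from another module) into this one."""
        stack = [(self.root, other.root)]
        while stack:
            dst, src = stack.pop()
            dst.terminals += src.terminals
            for h, child in src.children.items():
                stack.append((dst.children.setdefault(h, HashNode()), child))

    def node_count(self):
        """Number of nodes, excluding the root."""
        total, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            total += len(node.children)
            stack.extend(node.children.values())
        return total


# Two modules record their local outlining instances, then get merged.
tree_a = OutlinedHashTree()
tree_a.insert([1, 2, 3])
tree_a.insert([1, 2, 4])
tree_b = OutlinedHashTree()
tree_b.insert([1, 2, 3])
tree_a.merge(tree_b)
```

After the merge, the three recorded sequences (nine hashes in total) occupy four nodes, because the prefix `1, 2` is stored once. A lookup only succeeds for complete sequences, so a bare prefix such as `[1, 2]` reports no instances.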
Patches
[1] [CGData] Outlined Hash Tree by kyulee-com · Pull Request #89792 · llvm/llvm-project · GitHub – The outlined hash tree, crucial for global outlining, captures local outlining instances in a compact form by sharing the common prefixes of their stable-hash sequences.
[2] [CGData] llvm-cgdata by kyulee-com · Pull Request #89884 · llvm/llvm-project · GitHub – This introduces the llvm-cgdata tool to manipulate codegen data.
[3-nfc] [MachineOutliner][NFC] Refactor by kyulee-com · Pull Request #90082 · llvm/llvm-project · GitHub – This is NFC for the global function outlining.
[3] [CGData][MachineOutliner] Global Outlining by kyulee-com · Pull Request #90074 · llvm/llvm-project · GitHub – This is the main global function outlining in the machine outliner. Depending on the flags or the availability of codegen data, the machine outliner runs in 3 different modes: None, Write(Gen), or Read(Use).
[4] [CGData] LLD for MachO by kyulee-com · Pull Request #90166 · llvm/llvm-project · GitHub – This supports LLD MachO. It reads custom sections in object files and merges them into the indexed codegen data file.
[5] [CGData] Clang Options by kyulee-com · Pull Request #90304 · llvm/llvm-project · GitHub – This adds new Clang flags to support codegen data.
[6-nfc] [ThinLTO][NFC] Prep for two-codegen rounds by kyulee-com · Pull Request #90934 · llvm/llvm-project · GitHub – This is NFC for the ThinLTO pipeline.
[6] [CGData][ThinLTO] Global Outlining with Two-CodeGen Rounds by kyulee-com · Pull Request #90933 · llvm/llvm-project · GitHub – This supports two-codegen rounds for ThinLTO which repeats the codegen in place when using in-process backends.
References
[7] https://dl.acm.org/doi/10.1145/3497776.3517764
[8] [2312.03214] Optimistic Global Function Merger, LCTES2024 (to appear)