[RFC] Enhanced Machine Outliner – Part 2: ThinLTO/NoLTO

This is Part 2 of the Enhanced Machine Outliner, continuing [RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)

Motivation

The machine outliner operates within a single module, so its effectiveness can drop significantly without costly link-time optimization (LTO), which is often not feasible for large-scale app development. We propose global function outlining that works with any separate-compilation mode such as NoLTO or ThinLTO.
We use prior codegen data to outline functions that can then be folded by the conventional linker’s identical code folding (ICF). Local outlining instances are serialized into object files and can be merged into the codegen data either offline (using llvm-cgdata) or at link time with -fcodegen-data-generate. Subsequent codegen then uses this data to outline functions globally with -fcodegen-data-use.
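To make this flow concrete, here is a minimal sketch of the three-step build using the flags named above. It assumes a MachO target; the exact llvm-cgdata invocation and the default.cgdata file name are illustrative assumptions rather than exact spellings.

```sh
# Writer build: each module serializes its local outlining instances
# into a custom section of its object file.
clang -Oz -fcodegen-data-generate -c foo.cpp -o foo.o
clang -Oz -fcodegen-data-generate -c bar.cpp -o bar.o

# Merge the raw codegen data into an indexed file, either offline with
# llvm-cgdata (shown here) or at link time via the linker.
llvm-cgdata --merge -o default.cgdata foo.o bar.o

# Reader build: subsequent codegen consumes the merged data and
# outlines functions globally, so identical outlined functions across
# modules can later be folded by the linker's ICF.
clang -Oz -fcodegen-data-use=default.cgdata -c foo.cpp -o foo.o
clang -Oz -fcodegen-data-use=default.cgdata -c bar.cpp -o bar.o
```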

Results

We used the self-hosted LLD for arm64-MachO to evaluate the text segment size and the link time. We compiled it with -Oz and applied the conventional linker’s optimizations (-dead_strip and --icf=safe).

  • NoOutline: -mno-outline. The machine outliner is disabled.
  • Base: The machine outliner is enabled by default under -Oz.
  • Part1: [RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)
  • Part2: This builds upon Part1. Note that Part2 does not apply to full LTO.
    • -fcodegen-data-generate (Gen): There’s effectively no change from Part1, apart from producing the combined codegen data file, as the custom section doesn’t contribute to the final executable.
    • -fcodegen-data-use (Use): This uses the codegen data to significantly reduce the code size through global outlining.
    • -fcodegen-data-thinlto-two-rounds (Rounds): This runs the above two steps by repeating the codegen in place; it is only applicable with -flto=thin. It resulted in the same size saving as the Use case.

The link time with NoLTO (-fno-lto) is fast (~1 sec), and Part2 does not significantly alter it (~2 sec), so we decided not to include NoLTO in the graph below. The following graph compares link times between ThinLTO (-flto=thin) and LTO (-flto). It shows that Part2 ThinLTO requires more time to link than Base ThinLTO, as it handles codegen data and produces more outlining instances. However, the link time for ThinLTO is just 1/7 of that for LTO, yet it is nearly as efficient in size reduction: the size drops from 98.4% to 85.7% with ThinLTO, compared to a drop from 96.7% to 82.0% with LTO.

Discussion

  • The initial concept of global outlining, introduced in [7] and expanded in [8], is coupled with the global function merger. This RFC implements the outlining part, using codegen data as a foundation so that it is also applicable to other cross-module codegen optimizations. New codegen data can be defined for each client optimization and easily added to the current format.
  • The implementation is complete and tested for MachO + LLD for both NoLTO and ThinLTO, and can be easily extended to ELF + LLD.
  • llvm-cgdata should be used, similar to how IRPGO uses llvm-profdata, to safely optimize the binary. Unlike profile data, codegen data is a build artifact: staleness can diminish optimization effectiveness, but it must not affect correctness. Notably, as discussed in [8], the outlining opportunities remain quite stable despite numerous source changes over time.
  • Extending the bitcode summary for codegen isn’t viable as it’s designed for IR, not MIR. Applying IR optimization first can invalidate or degrade the MIR summary.
  • Serializing MIR isn’t supported and would require significant investment. Using global data with synchronization for cross-module codegen contradicts the parallel design of the (distributed) ThinLTO backends (opt/codegen). Our method avoids synchronization during codegen and merges stable codegen data offline for safe use in future codegen.
  • Serializing raw codegen data into separate files for each compilation unit isn’t preferred. Our raw codegen data is private and conveniently located with the object files, eliminating the need for extra bookkeeping in the existing compilation pipeline. This private codegen data does not end up in the final executable, thanks to the conventional linker’s dead-stripping or garbage collection.
  • Creating a global suffix tree conflicts with separate compilation, as it requires access to the entire IR. Instead, our approach only tracks the stable hashes of outlined functions locally. We then construct a compact global outlined hash tree using the traditional trie structure (see the sketch after this list).
  • The efficiency of global outlining is heavily influenced by the accuracy of stable hashes on global variables/objects. This RFC includes basic improvements on this aspect, but it could be further enhanced by hashing the structural contents.
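As referenced in the suffix-tree bullet above, the following is a minimal, self-contained C++ sketch of the idea behind the outlined hash tree: a trie keyed by per-instruction stable hashes, in which sequences that share a prefix share nodes, and each node counts how many outlined sequences terminate there. The type and member names are hypothetical; this illustrates the concept only and is not the actual CGData implementation.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical sketch: each edge is the stable hash of one instruction,
// so outlined sequences with a common prefix share a path of nodes.
struct HashNode {
  uint64_t Hash = 0;          // stable hash of the instruction on the incoming edge
  unsigned TerminalCount = 0; // number of outlined sequences ending at this node
  std::map<uint64_t, std::unique_ptr<HashNode>> Successors;
};

struct OutlinedHashTree {
  HashNode Root;

  // Record one local outlining instance, given as the stable-hash
  // sequence of its instructions.
  void insert(const std::vector<uint64_t> &Sequence) {
    HashNode *Cur = &Root;
    for (uint64_t H : Sequence) {
      std::unique_ptr<HashNode> &Next = Cur->Successors[H];
      if (!Next) {
        Next = std::make_unique<HashNode>();
        Next->Hash = H;
      }
      Cur = Next.get();
    }
    ++Cur->TerminalCount; // this exact sequence was outlined once more
  }

  // Query whether this exact sequence has been outlined somewhere before.
  bool contains(const std::vector<uint64_t> &Sequence) const {
    const HashNode *Cur = &Root;
    for (uint64_t H : Sequence) {
      auto It = Cur->Successors.find(H);
      if (It == Cur->Successors.end())
        return false;
      Cur = It->second.get();
    }
    return Cur->TerminalCount > 0;
  }
};
```

In the Read (Use) mode, the outliner can consult such a tree to recognize that a candidate sequence was already outlined in some other module, so emitting an identical outlined function locally is likely to pay off once the linker folds the copies.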

Patches

[1] [CGData] Outlined Hash Tree by kyulee-com · Pull Request #89792 · llvm/llvm-project · GitHub – The outlined hash tree, crucial for global outlining, captures local outlining instances in a compact form by sharing common prefixes of stable-hash sequences.
[2] [CGData] llvm-cgdata by kyulee-com · Pull Request #89884 · llvm/llvm-project · GitHub – This introduces the llvm-cgdata tool to manipulate codegen data.
[3-nfc] [MachineOutliner][NFC] Refactor by kyulee-com · Pull Request #90082 · llvm/llvm-project · GitHub – This is NFC for the global function outlining.
[3] [CGData][MachineOutliner] Global Outlining by kyulee-com · Pull Request #90074 · llvm/llvm-project · GitHub – This is the main global function outlining in the machine outliner. Depending on the flags or the availability of codegen data, the machine outliner runs in 3 different modes: None, Write(Gen), or Read(Use).
[4] [CGData] LLD for MachO by kyulee-com · Pull Request #90166 · llvm/llvm-project · GitHub – This supports LLD MachO. It reads custom sections in object files and merges them into the indexed codegen data file.
[5] [CGData] Clang Options by kyulee-com · Pull Request #90304 · llvm/llvm-project · GitHub – This adds new Clang flags to support codegen data.
[6-nfc] [ThinLTO][NFC] Prep for two-codegen rounds by kyulee-com · Pull Request #90934 · llvm/llvm-project · GitHub – This is NFC for the ThinLTO pipeline.
[6] [CGData][ThinLTO] Global Outlining with Two-CodeGen Rounds by kyulee-com · Pull Request #90933 · llvm/llvm-project · GitHub – This supports two-codegen rounds for ThinLTO which repeats the codegen in place when using in-process backends.

References

[7] https://dl.acm.org/doi/10.1145/3497776.3517764
[8] Optimistic Global Function Merger, LCTES 2024 (to appear). arXiv:2312.03214.

For large binaries (>200 MB) with global machine outlining, one thing we noticed is that the linker may need to insert excessive branch islands on ARM64 platforms when it applies ICF to the identical outlined functions, which can affect both size savings and performance. I wonder if you have observed similar behavior. If so, is there any idea to alleviate such problems?

You seem to be asking about the LD64 case involving textSizeWhenMightNeedBranchIslands in https://opensource.apple.com/source/ld64/ld64-609/src/ld/passes/branch_island.cpp. I think LLD employs a thunk for this purpose – llvm-project/lld/MachO/ConcatOutputSection.cpp at main · llvm/llvm-project · GitHub. However, we haven’t encountered a text size exceeding the ±128 MB branch range yet.

One potential approach could be to place or order outlined functions closer to their callers at link time, using a heuristic or algorithm. Also note that this RFC already names outlined functions based on their content (using a suffix of the form .content.{hash}). If we have to order functions before ICF, we could group these functions together instead of handling them separately.

You/other contributors seem to be landing patches for this RFC, but unless I missed something, there was no discussion with the wider community to actually accept these proposals. A lack of engagement/consensus isn’t tacit approval. You need more people than those who worked on the features to approve the design and PRs.

Thank you for your comment. We appreciate your concern about the lack of engagement and consensus on this RFC. However, we would like to assure you that the key idea behind this proposal has been discussed in various workshops and conferences, including LLVM workshop 2020, Euro LLVM 2023, CC2022, and LCTES 2024. There was strong interest from companies like ByteDance, Uber, Sony, and Google, and we have been trying to engage them in these RFCs as well.

Our team has experience contributing to LLVM in the PGO space, including llvm-profdata, and we are committed to upstreaming rock-solid code. We understand that this is a relatively new concept and it may take some time to gain more attention. However, we are open to feedback and design changes through this RFC and/or patches. We believe that the proposed feature is orthogonal to existing features and can evolve over time.

We hope this information addresses your concerns and look forward to further discussion and collaboration on this proposal. If you know of anyone who might be interested, please feel free to add them to the discussion.

Thank you for your comment. We appreciate your concern about the lack of engagement and consensus on this RFC. However, we would like to assure you that the key idea behind this proposal has been discussed in various workshops and conferences, including LLVM workshop 2020, Euro LLVM 2023, CC2022, and LCTES 2024. There was strong interest from companies like ByteDance, Uber, Sony, and Google, and we have been trying to engage them in these RFCs as well.

My concern is not about the technical merits of your changes. Simply put, decisions are not made at conferences, because not everyone from the community attends. Technical discussions happen there, of course, but decisions are made through discussion and consensus building; previously that was on the llvm-dev mailing lists and now it is on Discourse. The problem is not that your approach is wrong; the problem is that the process wasn’t followed correctly. It would be within our rights as a community to completely revert all changes, although that’s not a step that I’m going to take (yet).

At Uber, we employ full LTO and iterative machine outlining in our release builds. This results in lengthy build times, which significantly impact our CI/CD processes. We are very interested in the changes proposed in this RFC and would like to try them out.

Who from Sony? I’ve been asking around and the strongest interest I’ve heard is “it would be interesting to benchmark” without necessarily thinking it was valuable.

Kyungwoo has reached out privately to ask for more feedback on the approach. My current position on these changes is that we/I do a detailed review of the papers and the patches this week. If it seems OK, we let the changes remain in tree and release the hold on the existing PRs.

However, feedback from others, especially those who have knowledge of the outliner, would be greatly appreciated, @jinlin-bayarea. If any of you have strong objections to this approach, please speak up, because I’m not a domain expert in this area.

Thank you for sharing. Cool proposal.
We have also been interested in inter-module code folding at Google. @tmsriram shared [RFC] Interprocedural Identical Basic Block Folding earlier this year.

I skimmed through your paper and came up with a few questions:

  1. What is the granularity of code being folded? For instance, does your implementation fold code with multiple basic blocks, branches, and calls?
  2. How do you generate the outlined code sections? We thought basic block sections could be used as they support seamless DebugInfo and CFI.
  3. What is the long-term strategy regarding avoiding multiple rounds of CodeGen? Could this fit within a post-link-optimization framework like Bolt or Propeller?

Thank you for your interest and the great feedback! Here is a published (concise) version that includes both the outliner and merger aspects: Optimistic and Scalable Global Function Merging | Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems.

What is the granularity of code being folded? For instance, does your implementation fold code with multiple basic blocks, branches, and calls?

The current RFC focuses on the outliner perspective only (while supplying the codegen data framework), handling straight-line code within a machine basic block. However, I am planning to submit another RFC soon for the merger case, based on the aforementioned paper, which will cover entire functions.

How do you generate the outlined code sections? We thought basic block sections could be used as they support seamless DebugInfo and CFI.

For the outliner, which deals with straight-line code, DebugInfo typically isn’t a concern in practice. Although the machine outliner erases the debug info for the outlined function, it’s usually easy to locate the nearby code, and the outlined function typically has no frame and doesn’t show up in the stack trace when a crash occurs. It’s implemented as an LLVM pass, and CFI alignment is maintained without breaking semantics.

For the upcoming merger RFC, I plan to preserve the original context as much as possible while only parameterizing constants and globals. This should maintain safety in terms of annotations or attributes in IR.

A caveat with both approaches (outliner and merger) is that they tend to increase the number of identical function instances, which the linker eventually folds. We’ve found that improving the debugging experience is crucial, and we are actively working to upstream improvements on this front.

What is the long-term strategy regarding avoiding multiple rounds of CodeGen? Could this fit within a post-link-optimization framework like Bolt or Propeller?

The core design of this approach is to ensure optimizations remain effective and safe even with stale codegen data, similar to how PGO (profile-guided optimization) uses profile data to enhance code in subsequent builds. Here, we use prior codegen data, rather than runtime profile data, to improve code quality later. We can run the writer build infrequently while feeding its codegen data into subsequent (reader) optimization builds. This model fits well with the distributed ThinLTO we’re targeting, without disrupting the existing build mode. As demonstrated in the paper, the code size reduction remains quite stable even with week-old codegen data.

I believe post-link optimization is orthogonal to this approach. However, in practice we’ve found it challenging, particularly for MachO, when considering function splitting using block-level encapsulation, due to constraints in the MachO object format. Moreover, reconstructing debug info, such as separate dSYM/GSYM generation, could be more challenging with post-link optimizations.

At ByteDance we are also planning to design and implement a similar “parameterizing” optimization. Would you mind sharing some of your progress? I am definitely willing to co-contribute if circumstances allow.

Thanks for your interest! I’m excited to see any contributions you might have.

Right now, we’re wrapping up the last part of this RFC that supports a two-round build – [CGData][ThinLTO] Global Outlining with Two-CodeGen Rounds by kyulee-com · Pull Request #90933 · llvm/llvm-project · GitHub. It’s not strictly required for a common use case, but it’s pretty handy since it works on its own (single machine build) without needing extra tools or linker support, even though it does increase the build time a bit. I’m hoping we can get this done soon so we can move on to the next step.

As for the function merger I mentioned, I’m tidying up the code to get ready for the RFC. I’ve managed to reduce the code size by about 2-3% on top of what we already saved with global function outlining for an LLD binary. This could be even more significant for larger mobile apps that use a lot of lambdas, templates or generics. I’m aiming to get this out in the next few weeks.

Thanks for the responses.

Let me rephrase my question. If a post-link optimization is already in the pipeline, could you just use the cached bitcode from the pre-link (the profiled build) to choose the outlined code, instead of a 2-round codegen in the post-link (the final optimized build)?

Let me first clarify my understanding of how the post-link optimization framework operates, which consists of two phases: 1. profiling followed by 2. the final build. It appears that you are not modifying the (bitcode) IR, yet there are changes in code generation (CG) between these two phases. So, you wish to avoid repeating the optimization (OPT) on the IR.

I believe the -codegen-data-thinlto-two-rounds option in PR [CGData][ThinLTO] Global Outlining with Two-CodeGen Rounds by kyulee-com · Pull Request #90933 · llvm/llvm-project · GitHub might achieve what you are looking for.
Without caching, the logical sequence would be [OPT1 + CG1] followed by CG2. With your post-link optimization framework, it would naively run:

  1. Profile build: [OPT1 + CG1] + CG2 + (some profile/instrumentation)
  2. Final build: [OPT1 + CG1] + CG2 + (final post-link optimization)

As discussed in the PR, with caching support, all three outputs are cached:

  1. Optimized bitcode resulting from OPT1
  2. Scratch object file from CG1
  3. Final object file from CG2

The effectiveness might depend on how we compute the key for caching. If there are no changes in OPT1 and CG1, and thus the keys are identical, we effectively skip OPT1 + CG1. For caching to work correctly, the key should reflect any changes on the CG side between the profile and final builds. It’s advisable not to change CG1 (in either the profile or the final build), as the actual change needs to happen in CG2, so that you don’t pollute the caches for CG1.

Hi Kyungwoo,

After playing with -codegen-data-* and taking a few days to look into the implementation, I have come up with two ideas for improvement:

  1. During the generation pass, computeAndPublishHashSequence only takes into account the outlining opportunities that actually materialized in the local module. Some beneficial cross-module outlining opportunities are wasted because they are considered non-beneficial within their original module. Addressing this could bring a significant code size win for NoLTO builds.
  2. Is it safe to implicitly enable OutlineFromLinkOnceODRs for users? With a global hash tree, I assume the outlining algorithm is stable across modules, so linkonce_odr functions will still be identical, allowing the linker to dedup them.

Some beneficial cross-module outlining opportunities are wasted because they are considered non-beneficial within its original module.

Are you referring to a singleton instance that isn’t repeated within a module but might be repeated across different modules? If so, we attempted to use a global suffix tree to identify such instances, but we didn’t observe significant benefits in practice. Additionally, identifying these cases requires publishing all sequences as a form of summary, which undermines the design of a summary-based optimization since it effectively requires the entire set of instructions.

On second thought, it seems that, due to prioritization within a module, some overlapping repeated sequences are discarded and never considered as cross-module outlining opportunities because they weren’t recognized as local outlining instances. If this is the case, it sounds like a promising idea worth exploring.

Is it safe that we implicitly enable OutlineFromLinkOnceODRs for the users?

Frankly, I’m not sure why it’s disabled by default. Even if functions under the One Definition Rule (ODR) end up with different outlining sequences, as long as they are semantically identical, the linker should be able to fold either one without issues. I might be overlooking something, and it could depend on the platform. However, in this global outlining scenario, the same ODR function will undergo the same outlining, so it should definitely be safe.
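For context, linkonce_odr definitions typically come from inline functions and templates defined in headers. A small, hypothetical C++ example of the kind of function in question:

```cpp
// widget.h (hypothetical) -- included by many translation units.
// Every TU that uses clampAdd emits its own linkonce_odr copy of it,
// and the linker normally keeps just one.
inline long clampAdd(long a, long b, long lo, long hi) {
  long s = a + b;
  if (s < lo) return lo;
  if (s > hi) return hi;
  return s;
}
```

The question is whether outlining from such copies could make them diverge across translation units and defeat that deduplication; as noted above, with global outlining every copy sees the same outlining decisions, so the copies stay identical.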

Yes, this is exactly what I mean. We have a private GFO implementation that does the “entire” mapping, and it can shrink the lld binary noticeably more than the one behind this RFC: with LTO off, I observed that our implementation brings ~3% more code size win. Skipping overlapped sequences may also account for part of the difference.

I also noticed a regression where a function is transformed into a jump to the outlined sequence by the outliner, even though it could have been folded by the linker without the global summary. This results in some redundant unconditional jumps in the final linked product. I am not sure whether we should exclude these candidates or add a dedicated linker optimization (extending ICF so that it can fold unconditional jumps into an alias) to work around this, though.

a function is transformed into a jump to the outlined sequence by the outliner

I suspect this might be the issue mentioned above – [RFC] Enhanced Machine Outliner – Part 2: ThinLTO/NoLTO - #2 by gl387, which seems to be a branch island inserted by the linker due to an overly long branch distance, potentially caused by a very large binary.