[RFC] Enhanced Machine Outliner – Part 2: ThinLTO/NoLTO

Thank you for your interest and the great feedback! Here is a published (concise) paper that covers both the outliner and merger aspects: Optimistic and Scalable Global Function Merging | Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems.

What is the granularity of code being folded? For instance, does your implementation fold code with multiple basic blocks, branches, and calls?

The current RFC focuses on the outliner only (while supplying the codegen data framework), handling straight-line code within a single machine basic block. However, I am planning to submit another RFC soon for the merger case, based on the paper above, which will cover entire functions.
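To make the granularity concrete, here is a toy Python sketch of the core outlining idea: find a straight-line instruction sequence that repeats across functions and replace each occurrence with a call to a shared outlined function. This is purely illustrative; the actual MachineOutliner operates on machine instructions and uses a suffix tree to find candidates efficiently.

```python
def find_repeated_sequence(funcs, min_len=2):
    """Return the longest instruction subsequence occurring at 2+ sites.

    Toy O(n^3) enumeration; the real pass uses a suffix tree.
    """
    seen = {}
    for fname, insts in funcs.items():
        n = len(insts)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                seen.setdefault(tuple(insts[i:j]), []).append((fname, i))
    best = None
    for seq, sites in seen.items():
        if len(sites) >= 2 and (best is None or len(seq) > len(best)):
            best = seq
    return best

def outline(funcs):
    """Replace each occurrence of the best candidate with a call."""
    seq = find_repeated_sequence(funcs)
    if seq is None:
        return funcs
    out = {"OUTLINED_0": list(seq)}  # hypothetical name for the new function
    for fname, insts in funcs.items():
        new, i = [], 0
        while i < len(insts):
            if tuple(insts[i:i + len(seq)]) == seq:
                new.append("call OUTLINED_0")
                i += len(seq)
            else:
                new.append(insts[i])
                i += 1
        out[fname] = new
    return out
```

For example, two functions sharing the tail `mov a; add b; mul c; store x; ret` would both shrink to a single `call OUTLINED_0` at that point, with the sequence hoisted into `OUTLINED_0`.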

How do you generate the outlined code sections? We thought basic block sections could be used as they support seamless DebugInfo and CFI.

For the outliner, which deals with straight-line code, DebugInfo typically isn’t a concern in practice. Although the machine outliner erases the debug info for the outlined function, it’s usually easy to locate the nearby code, and the outlined function typically has no frame and doesn’t appear in the stack trace when a crash occurs. It’s implemented as an LLVM pass, and CFI alignment is maintained without breaking semantics.

For the upcoming merger RFC, I plan to preserve the original context as much as possible, parameterizing only constants and globals. This should preserve safety with respect to annotations and attributes in the IR.
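As a rough illustration of that parameterization, the sketch below merges two instruction streams that differ only at constant operands into one shared body, recording the per-caller constants that a thin thunk would pass in. This is a toy model with hypothetical names, not the actual pass:

```python
def try_merge(insts_a, insts_b):
    """Merge two instruction streams that differ only at constants.

    Returns (merged_body, params_a, params_b) on success, where the
    merged body refers to fresh parameters ("p0", "p1", ...) and each
    params list holds the constants a thunk for that caller would pass.
    Returns None if the streams differ anywhere else.
    """
    if len(insts_a) != len(insts_b):
        return None
    merged, params_a, params_b = [], [], []
    for (op_a, *args_a), (op_b, *args_b) in zip(insts_a, insts_b):
        if op_a != op_b or len(args_a) != len(args_b):
            return None  # structural mismatch: not mergeable
        new_args = []
        for x, y in zip(args_a, args_b):
            if x == y:
                new_args.append(x)
            elif isinstance(x, int) and isinstance(y, int):
                # Differing constants become a fresh parameter.
                new_args.append(f"p{len(params_a)}")
                params_a.append(x)
                params_b.append(y)
            else:
                return None  # differ at a non-constant operand: bail out
        merged.append((op_a, *new_args))
    return merged, params_a, params_b
```

So two functions that are identical except for adding `42` versus `7` fold into one body that adds a parameter `p0`, with each original symbol reduced to a thunk supplying its own constant.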

A caveat with both approaches (outliner and merger) is that they tend to increase the number of identical function instances, which the linker eventually folds. We’ve found that a good debugging experience is crucial, and we are actively working to upstream improvements on this front.

What is the long-term strategy regarding avoiding multiple rounds of CodeGen? Could this fit within a post-link-optimization framework like Bolt or Propeller?

The core design of this approach is to ensure optimizations remain effective and safe even with stale codegen data, similar to how PGO (profile-guided optimization) uses profile data from a prior run to enhance code in subsequent builds. Here, we use prior codegen data, rather than runtime profile data, to improve code quality later. We can run the writer build infrequently while feeding its codegen data into subsequent (reader) optimization builds. This model fits well with the distributed ThinLTO we’re targeting, without disrupting the existing build mode. As demonstrated in the paper, code size reduction remains quite stable even with week-old data.
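Concretely, the intended writer/reader workflow looks roughly like the following. The flag spellings follow the proposed codegen-data options and the `llvm-cgdata` tool; exact names and usage may differ by LLVM version, and the source files are placeholders:

```sh
# Writer build (run infrequently): emit codegen data while building.
clang -O2 -fcodegen-data-generate a.c b.c -o writer.out

# Merge the emitted codegen data into a single file.
llvm-cgdata --merge -o default.cgdata writer.out

# Reader builds (regular builds): consume the (possibly stale) data
# to make outlining decisions globally consistent across modules.
clang -O2 -fcodegen-data-use=default.cgdata a.c b.c -o reader.out
```

Because the reader build only treats the data as an optimization hint, a stale `default.cgdata` degrades code size gradually rather than breaking correctness, which is what allows the writer build to run on a much slower cadence.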

I believe post-link optimization is orthogonal to this approach. In practice, however, we’ve found it challenging for Mach-O in particular: function splitting via block-level encapsulation runs into constraints of the Mach-O object format. Moreover, reconstructing debug info, such as separate dSYM/GSYM generation, could be harder with post-link optimizations.
