[RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)

xuanzh-meta · May 2, 2024, 4:52pm

Patches (Prototype):
[1] Efficient Implementation of MachineOutliner::findCandidates() by xuanzh-meta · Pull Request #90260 · llvm/llvm-project · GitHub
[2] Sort by Benefit to Cost Ratio by xuanzh-meta · Pull Request #90264 · llvm/llvm-project · GitHub
[3] Leaf Descendants by xuanzh-meta · Pull Request #90275 · llvm/llvm-project · GitHub

Motivation

Machine Outliner uses the Suffix Tree data structure to find repeated sequences of instructions. On a high level, every internal node of the suffix tree represents a repeated sequence and its leaf descendants represent the repetitions. However, in the current implementation, only leaf children of each internal node are included as candidates for outlining. We modify this implementation in [3] to include all leaf descendants.

Results

Together with [1] and [2] (explained in details later), our change significantly reduces the resulting binary size without affecting the build time or the build performance too much (based on Clang/LLD on MachO).

We compare our proposal with another strategy used to further reduce code size, which is to repeat the outliner pass (i.e., having reruns). One can specify (the number of) reruns by using the flag -machine-outliner-reruns=<number_of_reruns>. For Clang/LLD on MachO, we showed that our proposal dominates the outliner reruns (even up to 5 times) in all aspects –

the outlining process is faster,
the resulting binary is smaller, and
the resulting binary runs faster.

We give more details in the Table below.

                       Build Time | Code Size | Perf Time | # Func Created | # Cand Outlined 
                        (second)      (MiB)     (second)      (thousands)     (millions)
Baseline (0 Rerun)         320        61.14        986            113            1.240
Our proposal (0 Rerun)     328        59.41        1015           107            1.369
—----------------------------------------------------------------------------------------------
Baseline (1 Rerun)         338        59.73        1047           160            1.563
Baseline (3 Reruns)        366        59.55        1059           174            1.625
Baseline (5 Reruns)        393        59.53        1059           174            1.626

Column “Build time” measures the LTO link time when building LLD. Although it is around 3% slower than the baseline build, it is much faster than the outliner reruns.
Column “Code Size” measures the text segment size of the resulting linker binary built with the minimum size optimization (-Oz). It shows around 3% reduction compared to the baseline -Oz binary.
Column “Perf Time” measures the performance of the resulting linker binary, LLD, when building the LLD itself.
Columns “# Func Created” and “# Cand Outlined” are two statistics of the machine outliner. It shows that overall, our proposal outlines candidates that appear frequently in a more effective manner. This leads to the creation of fewer outlined functions, resulting in a lower overall call overhead.

Besides Clang/LLD, our proposal can be used to reduce the size of large real-world mobile apps built with FullLTO. We are also working on further improving the separate compilation scenarios, such as the NoLTO or ThinLTO case, using the CodeGen summary. We have some positive preliminary results and will work on upstreaming these as well.

Details

Below are some additional details for each of the commits.

[3] The main idea is to consider leaf descendants of each internal node, instead of just the leaf children, in the suffix tree. This is enabled on a flag -outliner-leaf-descendants so that the change is in effect only when the Suffix Tree is used by Machine Outliner.

[1] This patch reduces the time complexity from O(n^2) to O(n log n) for the main loop of MachineOutliner::findCandidates(). For small n, the modification does not regress the build time, but it helps significantly when n is large. This patch results in no functional change.

[2] In MachineOutliner::outline(), functions are sorted first and then outlined sequentially in a greedy manner. We showed that sorting the functions by a different value, which we call priority, could be more beneficial.

dtellenbach · May 2, 2024, 7:54pm

Thanks for working on this!

Can you share a bit more on what you see for the non-LTO, ThinLTO cases? I think if using no LTO leads to a regression that wouldn’t be acceptable and your additions would at least need to be hidden behind a flag. You mention a flag to enable leaf descendants but it’s unclear to me which of your three patches are potential problem for LTO.

If I see this right the comparison between Baseline (0 Rerun) and your proposal shows a code-size improvement while creating fewer outlined functions. Do you have an example for a case where the current implementation does a suboptimal job?

kyulee-com · May 2, 2024, 8:14pm

Can you share a bit more on what you see for the non-LTO, ThinLTO cases? I think if using no LTO leads to a regression that wouldn’t be acceptable and your additions would at least need to be hidden behind a flag.

We will follow up more detailed data for non-LTO and ThinLTO cases with the part 2. Even with the part 1 (this RFC only), we’re seeing the code size improvement universally.

xuanzh-meta · May 2, 2024, 8:24pm

I am gonna direct the ThinLTO and non-LTO question to Kyungwoo @kyulee-com!

You mention a flag to enable leaf descendants but it’s unclear to me which of your three patches are potential problem for LTO.

We were being cautious before, but if the patches are not regressing anything, we will remove the flag and set this as default.

Do you have an example for a case where the current implementation does a suboptimal job?

I haven’t seen it doing a very-suboptimal job yet. We tested on a few LLVM test-suites cases and for all, the proposal is better than the baseline:

| run   (CTMark/)  | baseline (1) | full_proposal (2) | diff (1 -> 2)  |
|------------------|--------------|-------------------|----------------|
| lencod           | 349624       | 348564            | -0.3032%       |
| SPASS            | 219672       | 218440            | -0.5608%       |
| kc               | 271956       | 250068            | -8.0484%       |
| sqlite3          | 223920       | 222484            | -0.6413%       |
| 7zip-benchmark   | 405364       | 401244            | -1.0164%       |
| bullet           | 139820       | 138340            | -1.0585%       |
| consumer-typeset | 295684       | 286628            | -3.0627%       |
| pairlocalalign   | 72236        | 71936             | -0.4153%       |
| tramp3d-v4       | 189572       | 183676            | -3.1102%       |

dtellenbach · May 2, 2024, 9:37pm

@barrelshifter fyi

gl387 · May 15, 2024, 1:38am

For
[2] In MachineOutliner::outline() , functions are sorted first and then outlined sequentially in a greedy manner. We showed that sorting the functions by a different value, which we call priority, could be more beneficial.
Do you see any cases where the new priority function degrades the size saving?

xuanzh-meta · May 15, 2024, 6:34pm

So far, we have not observed any regressions. I want to note that although there is no guarantee that the new priority function will yield better size savings, we are quite confident that the new priority function will work better in most scenarios.

Here are some benchmark tests we run besides clang:

run (CTMark/)	baseline (1)	priority (2)	diff (1 → 2)
lencod	349624	349264	-0.1030%
SPASS	219672	219480	-0.0874%
kc	271956	251200	-7.6321%
sqlite3	223920	223708	-0.0947%
7zip-benchmark	405364	402624	-0.6759%
bullet	139820	139500	-0.2289%
consumer-typeset	295684	290196	-1.8560%
pairlocalalign	72236	72092	-0.1993%
tramp3d-v4	189572	189292	-0.1477%

ilovepi · May 16, 2024, 12:40am

cc: @petrhosek @hiraditya

Topic		Replies	Views
[RFC] Enhanced Machine Outliner – Part 2: ThinLTO/NoLTO Code Generation thinlto , llvm	2	396	May 15, 2024
[RFC] PT.2 Add IR level interprocedural outliner for code size. LLVM Dev List Archives	14	90	February 23, 2018
MachinePipeliner refactoring LLVM Dev List Archives	6	77	July 29, 2019
[RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks LLVM Dev List Archives	1	80	August 25, 2017
About Code coverage Algorithms. LLVM Dev List Archives	2	124	October 14, 2014

[RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)

Motivation

Results

Details

Related Topics