[RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come)

Patches (Prototype):
[1] Efficient Implementation of MachineOutliner::findCandidates() by xuanzh-meta · Pull Request #90260 · llvm/llvm-project · GitHub
[2] Sort by Benefit to Cost Ratio by xuanzh-meta · Pull Request #90264 · llvm/llvm-project · GitHub
[3] Leaf Descendants by xuanzh-meta · Pull Request #90275 · llvm/llvm-project · GitHub

Motivation

Machine Outliner uses the Suffix Tree data structure to find repeated sequences of instructions. On a high level, every internal node of the suffix tree represents a repeated sequence and its leaf descendants represent the repetitions. However, in the current implementation, only leaf children of each internal node are included as candidates for outlining. We modify this implementation in [3] to include all leaf descendants.

Results

Together with [1] and [2] (explained in details later), our change significantly reduces the resulting binary size without affecting the build time or the build performance too much (based on Clang/LLD on MachO).

We compare our proposal with another strategy used to further reduce code size, which is to repeat the outliner pass (i.e., having reruns). One can specify (the number of) reruns by using the flag -machine-outliner-reruns=<number_of_reruns>. For Clang/LLD on MachO, we showed that our proposal dominates the outliner reruns (even up to 5 times) in all aspects –

  1. the outlining process is faster,
  2. the resulting binary is smaller, and
  3. the resulting binary runs faster.

We give more details in the Table below.

                       Build Time | Code Size | Perf Time | # Func Created | # Cand Outlined 
                        (second)      (MiB)     (second)      (thousands)     (millions)
Baseline (0 Rerun)         320        61.14        986            113            1.240
Our proposal (0 Rerun)     328        59.41        1015           107            1.369
—----------------------------------------------------------------------------------------------
Baseline (1 Rerun)         338        59.73        1047           160            1.563
Baseline (3 Reruns)        366        59.55        1059           174            1.625
Baseline (5 Reruns)        393        59.53        1059           174            1.626
  • Column “Build time” measures the LTO link time when building LLD. Although it is around 3% slower than the baseline build, it is much faster than the outliner reruns.
  • Column “Code Size” measures the text segment size of the resulting linker binary built with the minimum size optimization (-Oz). It shows around 3% reduction compared to the baseline -Oz binary.
  • Column “Perf Time” measures the performance of the resulting linker binary, LLD, when building the LLD itself.
  • Columns “# Func Created” and “# Cand Outlined” are two statistics of the machine outliner. It shows that overall, our proposal outlines candidates that appear frequently in a more effective manner. This leads to the creation of fewer outlined functions, resulting in a lower overall call overhead.

Besides Clang/LLD, our proposal can be used to reduce the size of large real-world mobile apps built with FullLTO. We are also working on further improving the separate compilation scenarios, such as the NoLTO or ThinLTO case, using the CodeGen summary. We have some positive preliminary results and will work on upstreaming these as well.

Details

Below are some additional details for each of the commits.

[3] The main idea is to consider leaf descendants of each internal node, instead of just the leaf children, in the suffix tree. This is enabled on a flag -outliner-leaf-descendants so that the change is in effect only when the Suffix Tree is used by Machine Outliner.

[1] This patch reduces the time complexity from O(n^2) to O(n log n) for the main loop of MachineOutliner::findCandidates(). For small n, the modification does not regress the build time, but it helps significantly when n is large. This patch results in no functional change.

[2] In MachineOutliner::outline(), functions are sorted first and then outlined sequentially in a greedy manner. We showed that sorting the functions by a different value, which we call priority, could be more beneficial.

6 Likes

Thanks for working on this!

Can you share a bit more on what you see for the non-LTO, ThinLTO cases? I think if using no LTO leads to a regression that wouldn’t be acceptable and your additions would at least need to be hidden behind a flag. You mention a flag to enable leaf descendants but it’s unclear to me which of your three patches are potential problem for LTO.

If I see this right the comparison between Baseline (0 Rerun) and your proposal shows a code-size improvement while creating fewer outlined functions. Do you have an example for a case where the current implementation does a suboptimal job?

Can you share a bit more on what you see for the non-LTO, ThinLTO cases? I think if using no LTO leads to a regression that wouldn’t be acceptable and your additions would at least need to be hidden behind a flag.

We will follow up more detailed data for non-LTO and ThinLTO cases with the part 2. Even with the part 1 (this RFC only), we’re seeing the code size improvement universally.

I am gonna direct the ThinLTO and non-LTO question to Kyungwoo @kyulee-com!

You mention a flag to enable leaf descendants but it’s unclear to me which of your three patches are potential problem for LTO.

We were being cautious before, but if the patches are not regressing anything, we will remove the flag and set this as default.

Do you have an example for a case where the current implementation does a suboptimal job?

I haven’t seen it doing a very-suboptimal job yet. We tested on a few LLVM test-suites cases and for all, the proposal is better than the baseline:

| run   (CTMark/)  | baseline (1) | full_proposal (2) | diff (1 -> 2)  |
|------------------|--------------|-------------------|----------------|
| lencod           | 349624       | 348564            | -0.3032%       |
| SPASS            | 219672       | 218440            | -0.5608%       |
| kc               | 271956       | 250068            | -8.0484%       |
| sqlite3          | 223920       | 222484            | -0.6413%       |
| 7zip-benchmark   | 405364       | 401244            | -1.0164%       |
| bullet           | 139820       | 138340            | -1.0585%       |
| consumer-typeset | 295684       | 286628            | -3.0627%       |
| pairlocalalign   | 72236        | 71936             | -0.4153%       |
| tramp3d-v4       | 189572       | 183676            | -3.1102%       |

@barrelshifter fyi

For
[2] In MachineOutliner::outline() , functions are sorted first and then outlined sequentially in a greedy manner. We showed that sorting the functions by a different value, which we call priority, could be more beneficial.
Do you see any cases where the new priority function degrades the size saving?

So far, we have not observed any regressions. I want to note that although there is no guarantee that the new priority function will yield better size savings, we are quite confident that the new priority function will work better in most scenarios.

Here are some benchmark tests we run besides clang:

run (CTMark/) baseline (1) priority (2) diff (1 → 2)
lencod 349624 349264 -0.1030%
SPASS 219672 219480 -0.0874%
kc 271956 251200 -7.6321%
sqlite3 223920 223708 -0.0947%
7zip-benchmark 405364 402624 -0.6759%
bullet 139820 139500 -0.2289%
consumer-typeset 295684 290196 -1.8560%
pairlocalalign 72236 72092 -0.1993%
tramp3d-v4 189572 189292 -0.1477%

cc: @petrhosek @hiraditya