LOH conflicting with MachineOutliner

nocchijiang · November 22, 2024, 12:24pm

Hi folks,

This post kinda works like a tiny RFC, but I would like to hear your thoughts on this first, since I am not an expert of LOH.

Motivation

Currently the MachineOutliner almost never outlines sequences containing global value accesses on AArch64 backend. This is due to AArch64InstrInfo::getOutliningTypeImpl intentionally rejecting LOH-related machine instructions (see llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp at 490e58a98e0518542c87aa16e326fcb446d7b1cc · llvm/llvm-project · GitHub). Tests on mobile app projects show that disabling LOHs can reduce the code size noticeably (~2%) with machine outliner enabled.

Possible solutions

Do not emit LOHs if MachineOutliner is enabled

This is the most straightforward solution that entirely disables LOHs which comes with performance loss. I believe that the loss brought by the outliner would be significantly worse than the loss of LOHs which could be ignored. This should be acceptable since we want to maximize the code size wins and performance loss is expected. The thing I am not sure about is how much performance loss it would bring and if it is worth trading it with 2% code size reduction.

Move AArch64CollectLOH after MachineOutliner

By letting the outliner run first we can still collect LOHs as usual so we could still benefit from them. However, adjusting the optimization pipeline seems to be a nontrivial change and I am not sure if it is doable or acceptable by the community.

DavidSpickett · November 25, 2024, 9:38am

Might help to explain what LOH is. Though I went looking and no-one seems to know what it stands for ([lld] feasibility of `.loh` linker optimization hints on elf platforms · Issue #60534 · llvm/llvm-project · GitHub) but they do know what it actually does ([lld] feasibility of `.loh` linker optimization hints on elf platforms · Issue #60534 · llvm/llvm-project · GitHub)?

ellishg · November 26, 2024, 5:30pm

I believe it stands for Linker Optimization Hints.

efriedma-quic · November 26, 2024, 10:16pm

For anyone following along, LOH is a MachO-specific directive which directs linker relocation optimizations. ELF uses a different mechanism to achieve a similar result.

I would guess the performance difference from enabling LOH is small; modern cores are very good at integer arithmetic.

Moving AArch64CollectLOH from addPreEmitPass to addPreEmitPass2 is reasonable at first glance… but I might be missing something; I only briefly glanced at the code.

CC @amara

qcolombet · November 27, 2024, 12:27pm

That’s exactly that.

And I think too that running this later is fine.

amara · November 28, 2024, 7:45am

In theory LOHs shouldn’t have any impact on compiler codegen so IMO the fact that it was inhibiting outlining was a bug that moving it LOH computing later is the correct fix for.

Do you have CTMark Oz numbers for that change?

nocchijiang · November 29, 2024, 5:16am

Program                                       size                        size.__gcc_except_tab                size.__text                 size.__unwind_info                size.__eh_frame
                                              before     after      diff  before                after    diff  before      after     diff  before             after    diff  before          after  diff
CTMark/mafft/pairlocalalign                    332992.00  338368.00  1.6%                                      209844.00   205880.00 -1.9%  1256.00            1256.00  0.0% 112.00          112.00  0.0%
CTMark/lencod/lencod                           627288.00  636968.00  1.5%                                      389564.00   384332.00 -1.3%  2056.00            2056.00  0.0% 664.00          664.00  0.0%
CTMark/SPASS/SPASS                             542296.00  550360.00  1.5%                                      305876.00   299888.00 -2.0%  4584.00            4584.00  0.0% 768.00          768.00  0.0%
CTMark/ClamAV/clamscan                         638344.00  644808.00  1.0%                                      341440.00   338812.00 -0.8%  2624.00            2624.00  0.0% 280.00          280.00  0.0%
CTMark/7zip/7zip-benchmark                    1137248.00 1141024.00  0.3% 38064.00              38136.00  0.2% 490924.00   488388.00 -0.5% 17936.00           17952.00  0.1%
CTMark/sqlite3/sqlite3                         382408.00  383528.00  0.3%                                      229680.00   228572.00 -0.5%  2432.00            2432.00  0.0%  88.00           88.00  0.0%
CTMark/Bullet/bullet                          1616632.00 1617480.00  0.1% 13296.00              13308.00  0.1% 374192.00   373828.00 -0.1% 11272.00           11272.00  0.0% 368.00          368.00  0.0%
CTMark/tramp3d-v4/tramp3d-v4                  1202824.00 1203192.00  0.0%                                      345572.00   345488.00 -0.0%    88.00              88.00  0.0%
CTMark/kimwitu++/kc                           1279552.00 1273152.00 -0.5%  3804.00               3804.00  0.0% 316800.00   301708.00 -4.8%  7336.00            7368.00  0.4% 392.00          456.00 16.3%
CTMark/con...sumer-typeset/consumer-typeset    524448.00  507200.00 -3.3%                                      344016.00   323976.00 -5.8%  1400.00            1400.00  0.0%
                           Geomean difference                        0.2%                                 0.1%                       -1.8%                              0.1%                         2.2%

The total size increase are largely contributed by the additional OUTLINED_FUNCTION_s in the symtab which we can get rid of by strip.

kyulee-com · December 2, 2024, 5:44am

We also used to disable LOH to reduce the binary size.

Moving it to addPreEmitPass2 also seems reasonable. Although the uncompressed size might remain similar to when LOH is disabled, this approach can enhance the compressed size because the linker will emit more NOPs.

As a side note, if we aim to reduce the size further in this context (without emitting NOPs), tools like BOLT or other post-link optimization techniques could theoretically accomplish this.

nocchijiang · December 23, 2024, 8:26am

I have identified a couple of issues by moving AArch64CollectLOH to addPreEmitPass2 on my local machine. My plan is to fix them individually and adjust the optimization pipeline once all problems are resolved.

Here is the first patch: [MachineOutliner] Preserve regmasks in calls to outlined functions by nocchijiang · Pull Request #120940 · llvm/llvm-project · GitHub

Topic		Replies	Views
[RFC] Turn the MachineOutliner on by default in AArch64 under -Oz LLVM Dev List Archives	0	76	April 21, 2018
[RFC] Turn the MachineOutliner on by default in AArch64 under -Oz LLVM Dev List Archives	21	122	January 16, 2019
[MachineOutliner][ARM][RFC] Add Machine Outliner support for ARM LLVM Dev List Archives	0	92	January 22, 2019
AArch64 enabling out-of-line atomics by default LLVM Dev List Archives	0	106	December 19, 2020
Code size BoF Minutes LLVM Dev List Archives	0	86	October 6, 2020

LOH conflicting with MachineOutliner

Motivation

Possible solutions

Do not emit LOHs if MachineOutliner is enabled

Move AArch64CollectLOH after MachineOutliner

Related topics