LOH conflicting with MachineOutliner

Hi folks,

This post kinda works like a tiny RFC, but I would like to hear your thoughts on this first, since I am not an expert of LOH.

Motivation

Currently the MachineOutliner almost never outlines sequences containing global value accesses on AArch64 backend. This is due to AArch64InstrInfo::getOutliningTypeImpl intentionally rejecting LOH-related machine instructions (see llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp at 490e58a98e0518542c87aa16e326fcb446d7b1cc · llvm/llvm-project · GitHub). Tests on mobile app projects show that disabling LOHs can reduce the code size noticeably (~2%) with machine outliner enabled.

Possible solutions

Do not emit LOHs if MachineOutliner is enabled

This is the most straightforward solution that entirely disables LOHs which comes with performance loss. I believe that the loss brought by the outliner would be significantly worse than the loss of LOHs which could be ignored. This should be acceptable since we want to maximize the code size wins and performance loss is expected. The thing I am not sure about is how much performance loss it would bring and if it is worth trading it with 2% code size reduction.

Move AArch64CollectLOH after MachineOutliner

By letting the outliner run first we can still collect LOHs as usual so we could still benefit from them. However, adjusting the optimization pipeline seems to be a nontrivial change and I am not sure if it is doable or acceptable by the community.

1 Like

Might help to explain what LOH is. Though I went looking and no-one seems to know what it stands for ([lld] feasibility of `.loh` linker optimization hints on elf platforms · Issue #60534 · llvm/llvm-project · GitHub) but they do know what it actually does ([lld] feasibility of `.loh` linker optimization hints on elf platforms · Issue #60534 · llvm/llvm-project · GitHub)?

I believe it stands for Linker Optimization Hints.

1 Like

For anyone following along, LOH is a MachO-specific directive which directs linker relocation optimizations. ELF uses a different mechanism to achieve a similar result.

I would guess the performance difference from enabling LOH is small; modern cores are very good at integer arithmetic.

Moving AArch64CollectLOH from addPreEmitPass to addPreEmitPass2 is reasonable at first glance… but I might be missing something; I only briefly glanced at the code.

CC @amara

That’s exactly that.

And I think too that running this later is fine.

In theory LOHs shouldn’t have any impact on compiler codegen so IMO the fact that it was inhibiting outlining was a bug that moving it LOH computing later is the correct fix for.

Do you have CTMark Oz numbers for that change?

Program                                       size                        size.__gcc_except_tab                size.__text                 size.__unwind_info                size.__eh_frame
                                              before     after      diff  before                after    diff  before      after     diff  before             after    diff  before          after  diff
CTMark/mafft/pairlocalalign                    332992.00  338368.00  1.6%                                      209844.00   205880.00 -1.9%  1256.00            1256.00  0.0% 112.00          112.00  0.0%
CTMark/lencod/lencod                           627288.00  636968.00  1.5%                                      389564.00   384332.00 -1.3%  2056.00            2056.00  0.0% 664.00          664.00  0.0%
CTMark/SPASS/SPASS                             542296.00  550360.00  1.5%                                      305876.00   299888.00 -2.0%  4584.00            4584.00  0.0% 768.00          768.00  0.0%
CTMark/ClamAV/clamscan                         638344.00  644808.00  1.0%                                      341440.00   338812.00 -0.8%  2624.00            2624.00  0.0% 280.00          280.00  0.0%
CTMark/7zip/7zip-benchmark                    1137248.00 1141024.00  0.3% 38064.00              38136.00  0.2% 490924.00   488388.00 -0.5% 17936.00           17952.00  0.1%
CTMark/sqlite3/sqlite3                         382408.00  383528.00  0.3%                                      229680.00   228572.00 -0.5%  2432.00            2432.00  0.0%  88.00           88.00  0.0%
CTMark/Bullet/bullet                          1616632.00 1617480.00  0.1% 13296.00              13308.00  0.1% 374192.00   373828.00 -0.1% 11272.00           11272.00  0.0% 368.00          368.00  0.0%
CTMark/tramp3d-v4/tramp3d-v4                  1202824.00 1203192.00  0.0%                                      345572.00   345488.00 -0.0%    88.00              88.00  0.0%
CTMark/kimwitu++/kc                           1279552.00 1273152.00 -0.5%  3804.00               3804.00  0.0% 316800.00   301708.00 -4.8%  7336.00            7368.00  0.4% 392.00          456.00 16.3%
CTMark/con...sumer-typeset/consumer-typeset    524448.00  507200.00 -3.3%                                      344016.00   323976.00 -5.8%  1400.00            1400.00  0.0%
                           Geomean difference                        0.2%                                 0.1%                       -1.8%                              0.1%                         2.2%

The total size increase are largely contributed by the additional OUTLINED_FUNCTION_s in the symtab which we can get rid of by strip.

We also used to disable LOH to reduce the binary size.

Moving it to addPreEmitPass2 also seems reasonable. Although the uncompressed size might remain similar to when LOH is disabled, this approach can enhance the compressed size because the linker will emit more NOPs.

As a side note, if we aim to reduce the size further in this context (without emitting NOPs), tools like BOLT or other post-link optimization techniques could theoretically accomplish this.

I have identified a couple of issues by moving AArch64CollectLOH to addPreEmitPass2 on my local machine. My plan is to fix them individually and adjust the optimization pipeline once all problems are resolved.

Here is the first patch: [MachineOutliner] Preserve regmasks in calls to outlined functions by nocchijiang · Pull Request #120940 · llvm/llvm-project · GitHub

1 Like

@nocchijiang I’m curious if you are still planning on merging [MachineOutliner] Preserve regmasks in calls to outlined functions by nocchijiang · Pull Request #120940 · llvm/llvm-project · GitHub and moving forward with this project.

I’m also wondering if you’ve considered this possible solution: when outlining a block, remove all LOHs from those instructions and add them to the corresponding instructions in the outlined function. This might be more complicated than moving AArch64CollectLOH after MachineOutliner, but maybe it’s worth thinking about if that solution isn’t working very well.

I’ve rebased the PR and pinged the key reviewers. I recall that one of the reviewers had significant concerns about the changes around MachineFunction::allocateRegMask, so I tried to find a middle ground based on all the feedback received. Since we ultimately decided to shut down LOHs entirely, this PR slipped my mind for a while. If the current approach still doesn’t address everyone’s concerns, I will close the PR and give up this project.

1 Like

I’ve identified a problem with running LOH analysis after the MachineOutliner. Since outlined function do not follow calling conversions, we can’t even be sure temporary registers don’t live across function calls.

In this example, the MachineOutliner creates the the thunk _OUTLINED_FUNCTION_0 and immediately moves x16 to the first argument of bar or goo.

_OUTLINED_FUNCTION_0:
        adrp    x19, _A@PAGE
        add     x19, x19, _A@PAGEOFF
        ldr     w20, [x19]
        b       _foo
...
        bl      _OUTLINED_FUNCTION_0
        mov     x0, x19
        bl      _bar

If we run LOH now, it will try to add an .loh AdrpAddLdr directive for the adrp, add, ldr triplet, which could turn into this in the linker.

        nop
        nop
        ldr     w20, [x19, <_A offset>]

This leaves x16 undefined.

I don’t think it’s reasonable to expect LOH to analyze register liveness across functions, so it’s probably best to keep LOH before the MachineOutliner. Instead, I think we should teach the MachineOutliner how to handle these LOH directives, instead of allowing the to block outlining. Either by removing LOH directives for outlined functions, or maybe analyzing whether they are valid and fixing them up if possible.

Thank you for your investigation. I have already been able to reproduce the issue locally. Initially, the plan was to skip generating LOH for outlined functions, and this change was intended to be included alongside the pipeline update.
However, as mentioned previously, at ByteDance we have decided to completely disable LOHs for minsize functions. Given this decision, we do not plan to move forward with this project at this time.

LOH is for Mach-O and I haven’t read through the thread, but how does it prevent splitting the following two instructions for Mach-O?

adrp x0, :got:g\index
...
ldr x0, [x0, :got_lo12:g\index]

ELF might also need to prevent the split to resolve an --icf= issue. [lld][ELF] Merge equivalent symbols found during ICF by pranavk · Pull Request #139493 · llvm/llvm-project · GitHub

We have the pass AArch64CollectLOH that runs on MIR to identify where we can add .loh directives. In many cases it will abort if the instructions are not sequential, but I think this analysis could be improved.

For the new linker at Apple we ended up not supporting LOHs. We didn’t find that it was needed any more.

I wonder if it would be simpler to drop LOHs from llvm entirely, unless there are folks using lld who still find its worth it in some cases?

LOH is on by default I believe, which is why we noticed it in the machine outliner. And they are still supported in LLD. I think we would need to measure their performance impact in practice to see if they are worth keeping or not.