This is the most straightforward solution that entirely disables LOHs which comes with performance loss. I believe that the loss brought by the outliner would be significantly worse than the loss of LOHs which could be ignored. This should be acceptable since we want to maximize the code size wins and performance loss is expected. The thing I am not sure about is how much performance loss it would bring and if it is worth trading it with 2% code size reduction.
Move AArch64CollectLOH after MachineOutliner
By letting the outliner run first we can still collect LOHs as usual so we could still benefit from them. However, adjusting the optimization pipeline seems to be a nontrivial change and I am not sure if it is doable or acceptable by the community.
For anyone following along, LOH is a MachO-specific directive which directs linker relocation optimizations. ELF uses a different mechanism to achieve a similar result.
I would guess the performance difference from enabling LOH is small; modern cores are very good at integer arithmetic.
Moving AArch64CollectLOH from addPreEmitPass to addPreEmitPass2 is reasonable at first glance… but I might be missing something; I only briefly glanced at the code.
In theory LOHs shouldn’t have any impact on compiler codegen so IMO the fact that it was inhibiting outlining was a bug that moving it LOH computing later is the correct fix for.
We also used to disable LOH to reduce the binary size.
Moving it to addPreEmitPass2 also seems reasonable. Although the uncompressed size might remain similar to when LOH is disabled, this approach can enhance the compressed size because the linker will emit more NOPs.
As a side note, if we aim to reduce the size further in this context (without emitting NOPs), tools like BOLT or other post-link optimization techniques could theoretically accomplish this.
I have identified a couple of issues by moving AArch64CollectLOH to addPreEmitPass2 on my local machine. My plan is to fix them individually and adjust the optimization pipeline once all problems are resolved.
I’m also wondering if you’ve considered this possible solution: when outlining a block, remove all LOHs from those instructions and add them to the corresponding instructions in the outlined function. This might be more complicated than moving AArch64CollectLOH after MachineOutliner, but maybe it’s worth thinking about if that solution isn’t working very well.
I’ve rebased the PR and pinged the key reviewers. I recall that one of the reviewers had significant concerns about the changes around MachineFunction::allocateRegMask, so I tried to find a middle ground based on all the feedback received. Since we ultimately decided to shut down LOHs entirely, this PR slipped my mind for a while. If the current approach still doesn’t address everyone’s concerns, I will close the PR and give up this project.
I’ve identified a problem with running LOH analysis after the MachineOutliner. Since outlined function do not follow calling conversions, we can’t even be sure temporary registers don’t live across function calls.
In this example, the MachineOutliner creates the the thunk _OUTLINED_FUNCTION_0 and immediately moves x16 to the first argument of bar or goo.
If we run LOH now, it will try to add an .loh AdrpAddLdr directive for the adrp, add, ldr triplet, which could turn into this in the linker.
nop
nop
ldr w20, [x19, <_A offset>]
This leaves x16 undefined.
I don’t think it’s reasonable to expect LOH to analyze register liveness across functions, so it’s probably best to keep LOH before the MachineOutliner. Instead, I think we should teach the MachineOutliner how to handle these LOH directives, instead of allowing the to block outlining. Either by removing LOH directives for outlined functions, or maybe analyzing whether they are valid and fixing them up if possible.
Thank you for your investigation. I have already been able to reproduce the issue locally. Initially, the plan was to skip generating LOH for outlined functions, and this change was intended to be included alongside the pipeline update.
However, as mentioned previously, at ByteDance we have decided to completely disable LOHs for minsize functions. Given this decision, we do not plan to move forward with this project at this time.
We have the pass AArch64CollectLOH that runs on MIR to identify where we can add .loh directives. In many cases it will abort if the instructions are not sequential, but I think this analysis could be improved.
LOH is on by default I believe, which is why we noticed it in the machine outliner. And they are still supported in LLD. I think we would need to measure their performance impact in practice to see if they are worth keeping or not.