[RFC] Shrink-wrapping improvement

Hello,

During my internship, I worked on improving shrink-wrapping in LLVM.

As you might already know, we currently have a shrink-wrapping pass in-tree, announced by Quentin Colombet here: http://lists.llvm.org/pipermail/llvm-dev/2015-May/085220.html. This pass is currently enabled by default at all optimization levels (except O0, obviously) for targets that support it.

The current pass performs really well when the uses of callee-saved registers or stack objects can efficiently share a unique saving point and a unique restoring point. Basically, what it does is "delay" the execution of the prologue and the epilogue until it's really needed.
For that, it uses dominance and post-dominance properties of the CFG, which is great in terms of compile time (we already have dominator trees) and code size (there is no "extra" code, just code motion).
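To make the dominance-based placement concrete, here is a rough Python sketch of how a single save point can be chosen: the save block must dominate every block that uses a CSR, and among those candidates we pick the deepest one. This is illustrative only (my own simplification), not the in-tree C++ pass, which also checks post-dominance and loop depth:

```python
# Illustrative sketch of dominance-based save-point selection.
# Not the actual LLVM implementation.

def dominators(cfg, entry):
    """Iterative dataflow: dom[b] = {b} | intersection of dom over preds."""
    blocks = set(cfg)
    preds = {b: set() for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].add(b)
    dom = {b: set(blocks) for b in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == entry:
                continue
            new = set(blocks)
            for p in preds[b]:
                new &= dom[p]
            new |= {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

def save_point(cfg, entry, use_blocks):
    """Pick the deepest block that dominates every use of a CSR."""
    dom = dominators(cfg, entry)
    common = set.intersection(*(dom[u] for u in use_blocks))
    # The nearest common dominator is the candidate that every other
    # common dominator itself dominates.
    for c in common:
        if all(o in dom[c] for o in common):
            return c

# Early-return example: entry -> ret_early, entry -> body -> ret.
cfg = {"entry": ["ret_early", "body"], "ret_early": [], "body": ["ret"], "ret": []}
print(save_point(cfg, "entry", {"body", "ret"}))  # "body": the prologue sinks past the early return
```

When the CSR uses spread across both branches, the only common dominator left is the entry block, which is exactly the "default placement" limitation described below.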

Here are the main limitations of the current algorithm, which motivated the search for an improvement:

* Requiring a unique save / restore point can very easily result in the default placement points: the entry block and the return blocks. This works great for early returns, but once the CFG gets more complicated we end up using the default placement.

* Shrink-wrapping the whole state as a single unit is good because most of the target-specific code in `emitPrologue` / `emitEpilogue` will still work. The opportunity we're missing here is shrink-wrapping every register separately and exploiting the possibility of splitting up the prologue / epilogue by having separate CSR operations, separate stack setup, etc.

My shrink-wrapping algorithm is inspired by "Minimizing register usage penalty at procedure calls" by F. C. Chow (http://dl.acm.org/citation.cfm?id=53999), which introduces the shrink-wrapping idea based on a dataflow analysis. This analysis is expensive, and if implemented as-is it would have a big compile-time impact.

About my implementation:

* We want to completely avoid placing save / restore instructions inside loops. For that, I only place save / restore code at SCC boundaries, which always end up outside of loops (even irreducible ones). So there is no need for an expensive dataflow analysis, and we can get away with a simple linear traversal of the CFG.

* This algorithm allows us to have multiple save / restore points for a set of uses in a CFG.

* My implementation tracks all the different CSRs separately and chooses as many points as needed so that we *ONLY* save and restore where the register is used.
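The SCC-boundary idea in the first point above can be illustrated with a small sketch: any cycle in the CFG is contained in a single SCC, so an edge between two different SCCs can never be a loop back edge, and save / restore code placed on such edges is guaranteed to be outside every loop. A Python sketch using a plain recursive Tarjan SCC computation (my own illustration, not the actual pass):

```python
# Illustration: cross-SCC edges are always outside loops, because every
# cycle collapses into a single SCC. Not the actual LLVM implementation.

def sccs(cfg):
    """Tarjan's strongly connected components (recursive, fine for a demo)."""
    index, low, on_stack, stack, out = {}, {}, set(), [], []
    counter = [0]
    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in cfg[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop(); on_stack.discard(w); comp.add(w)
                if w == v:
                    break
            out.append(comp)
    for v in cfg:
        if v not in index:
            strongconnect(v)
    return out

# entry -> header <-> latch (a loop), header -> exit
cfg = {"entry": ["header"], "header": ["latch", "exit"], "latch": ["header"], "exit": []}
comps = sccs(cfg)
scc_of = {b: i for i, c in enumerate(comps) for b in c}
boundaries = [(a, b) for a, ss in cfg.items() for b in ss if scc_of[a] != scc_of[b]]
print(sorted(boundaries))  # [('entry', 'header'), ('header', 'exit')] -- never inside the loop
```

The loop blocks (`header`, `latch`) fall into one SCC, so no boundary edge lands between them.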

This algorithm has been generalized so that it can shrink-wrap any CFG with any kind of input. For example, based on the points chosen for the save / restore of the CSRs, I choose the best stack setup / destroy points, where `emitPrologue` / `emitEpilogue` are called. The same algorithm and interface are used, and they can be used for other things.

In order to use this interface, you have to describe what a "use" is (ex: use of a register, use of a frame index, etc). The only information it needs is:

* the number of separate elements we want to shrink-wrap, let's call it `NumElt` (ex: number of callee saved registers)

* a mapping between a machine basic block number and a BitVector of size `NumElt`. If there is a use of an element, the bit is set (ex: if the 3rd callee saved register is used in MBB#9, then `map[9].test(3) == true`)

The output is a mapping between a machine basic block number and a BitVector of size `NumElt` (ex: if `saves[1].test(3) == true`, then the 3rd callee saved register needs to be saved in MBB#1).
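Put together, the interface boils down to two maps of bit vectors, one in and one out. Here is a minimal Python mock just to show the data shapes; the class and method names are invented for illustration, and the real C++ interface lives in the D36109 patch:

```python
# Toy mock of the data flowing through the interface described above.
# Names are made up; see D36109 for the real interface.

class ShrinkWrapInfo:
    def __init__(self, num_elts):
        self.num_elts = num_elts   # e.g. number of callee-saved registers
        self.uses = {}             # MBB number -> bit vector (int used as bitmask)

    def add_use(self, mbb, elt):
        assert 0 <= elt < self.num_elts
        self.uses[mbb] = self.uses.get(mbb, 0) | (1 << elt)

    def has_use(self, mbb, elt):
        return bool(self.uses.get(mbb, 0) & (1 << elt))

# "If the 3rd callee saved register is used in MBB#9, then map[9].test(3) == true"
info = ShrinkWrapInfo(num_elts=8)
info.add_use(mbb=9, elt=3)
print(info.has_use(9, 3))   # True
print(info.has_use(9, 2))   # False

# The output has the same shape: saves[mbb] is a bit vector telling which
# elements must be saved in that block (here: save the 3rd CSR in MBB#1).
saves = {1: 1 << 3}
print(bool(saves[1] & (1 << 3)))  # True
```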

The patch implementing the algorithm and this interface is here: D36109: [CodeGen] Provide an advanced shrink-wrapping interface.

There are a few limitations that need some more work:

* Exception handling support. In order to properly describe the state of the function for every block, we now need more CFI directives. To get that correct, we need more logic when generating CFI (D35844: Correct dwarf unwind information in function epilogue) and some unwinder fixes (D34544: [libunwind] Don't assume the return address register is always saved and has CFI for it). Also, compact unwinding can't work anymore and needs to fall back on DWARF.

* Windows CFI support.

* Sanitizers, crash reporters, other things that assume there is a prologue and don't understand DWARF.

There are some good results on AArch64 on the LLVM test suite + SPEC (2000 + 2006):

* On average, an estimated 8% reduction in executed save / restore instructions, based on the block frequency and the number of saved / restored registers per block.
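That 8% figure is a static estimate: weight the save / restore instructions in each block by that block's frequency. A hedged sketch of the metric as I describe it above (my own formulation, not the exact measurement script):

```python
# Estimate of "executed save/restore instructions": for each block, count
# the registers saved/restored there, weighted by block frequency.
# Toy numbers for illustration only.

def weighted_save_restores(block_freq, regs_per_block):
    return sum(block_freq[b] * len(regs) for b, regs in regs_per_block.items())

# Old scheme: everything in the entry block (frequency 1.0).
old = weighted_save_restores({"entry": 1.0}, {"entry": ["x19", "x20"]})
# New scheme: x20 only saved in a cold branch (frequency 0.1).
new = weighted_save_restores({"entry": 1.0, "cold": 0.1},
                             {"entry": ["x19"], "cold": ["x20"]})
print(old, new)                  # 2.0 1.1: fewer saves execute dynamically
print(round(1 - new / old, 2))   # 0.45 -> a 45% reduction in this toy example
```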

O3:
* Up to 8% execution time improvement.
* 76 improved tests out of 322.
* +0.6% compile-time impact.
* +2.5% code-size impact.

Os:
* Up to 6.5% execution time improvement.
* 58 improved tests out of 322.
* +0.3% compile-time impact.
* +2.8% code-size impact.

We could think of an Os / Oz mode of the algorithm where we don't duplicate the save / restore points.

There are also many regressions:

O3:
* 75 regressions out of 322 tests.
* 33 of them > 1%.
Os:
* 98 regressions out of 322 tests.
* 22 of them > 1%.

Here are some problems I noticed:

* Since we want to save / restore *ONLY* where a register is used, we can end up splitting a pair of registers. Take the following example:

stp reg1, reg2
use reg1
use reg2
if (cond) {
// do stuff
} else {
// do stuff
}
ldp reg1, reg2
ret

This is good. But if reg2 is only used in the if.then block:

str reg1
use reg1
if (cond) {
str reg2
use reg2
ldr reg2
// do stuff
} else {
// do stuff
}
ldr reg1
ret

To fix this, instead of shrink-wrapping registers separately, we can think of building pairs of registers and shrink-wrapping each pair as a single unit.
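One way to express that fix is to merge the two registers' use sets before placement, so the pair becomes a single shrink-wrapped element. A Python sketch of the idea (my own illustration, not code from any patch):

```python
# Sketch: treat a register pair as one shrink-wrap element. The pair's use
# set is the union of both registers' uses, so reg1 and reg2 always get the
# same save/restore points and the stp/ldp stays intact.
# Illustrative only.

def pair_uses(uses_by_reg, pairs):
    merged = {}
    for r1, r2 in pairs:
        merged[(r1, r2)] = uses_by_reg.get(r1, set()) | uses_by_reg.get(r2, set())
    return merged

# reg1 used in the entry block, reg2 only in the if.then block.
uses = {"reg1": {"entry"}, "reg2": {"if.then"}}
merged = pair_uses(uses, [("reg1", "reg2")])
print(sorted(merged[("reg1", "reg2")]))  # ['entry', 'if.then']
# Placement now sees a use in 'entry', so the whole pair gets saved with one
# stp in the entry block instead of being split into two str instructions.
```

The trade-off is that reg2 is now saved even on paths that never use it, which is exactly the code the per-register scheme was trying to avoid.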

* Merging the sp adjustment with loads and stores. We sometimes can't do that if the first registers to be saved / last to be restored are shrink-wrapped away. Example:

stp reg1, reg2, [sp, #-32]! # sp -= 32; store reg1, store reg2;
stp reg3, reg4
use reg1, reg2, reg3, reg4
if (cond) {
// do stuff
} else {
// do stuff
}
ldp reg3, reg4
ldp reg1, reg2, [sp], #32 # load reg1, load reg2; sp += 32;

This is good. The LoadStoreOptimizer can build this easily even if we don't generate this code in PEI. But if reg1, reg2 are used in the if.then block:

sub sp, sp, #32
stp reg3, reg4
use reg3, reg4
if (cond) {
stp reg1, reg2
use reg1, reg2
ldp reg1, reg2
// do stuff
} else {
// do stuff
}
ldp reg3, reg4
add sp, sp, #32

The LoadStoreOptimizer can't merge them because the stack slots are in the wrong order. Fixing this would require more flexibility in the order of the stack objects.

There is also more room for improvement:

* The register allocator problem described here: https://bugs.llvm.org/show_bug.cgi?id=29154 and here: https://reviews.llvm.org/D34608.

* Avoid having "artificial" uses from the target in the entry block (or whatever we call the prologue here). At the time we shrink-wrap, we don't know if we're going to use the frame pointer or the base pointer, where they are going to be used, and where they should be initialized.

Also, all the target-specific code is very hacky. I guess I'm not the only one noticing that the current FrameLowering process is quite messy and needs some rethinking. The main thing that bothered me is that the PrologueEpilogueInserter drives the whole FrameLowering process and goes back and forth with the TargetFrameLowering interface. Some hooks in that interface are there to change the order in which things are done in the PEI pass, and some targets have internal target-specific side effects in some of these hooks. One solution would be to have the whole FrameLowering process driven by the target, calling generic functions, which would allow this pass to integrate much more nicely.

I managed to split this up into 4 (messy) commits. I will post them for review once we have the general algorithm in, but if anyone is interested in trying this out, here are the commits implementing it for AArch64, and a less advanced version of it for X86.

[PEI] Use the ShrinkWrapper interface for placing callee-saved registers (francisvm/llvm@57111b7)
[ShrinkWrapper] Add AArch64 support (francisvm/llvm@c74e923)
[ShrinkWrapper] Add X86 support (francisvm/llvm@53c368f)
[ShrinkWrapper] Add AArch64 support for shrink-wrapping stack setup code (francisvm/llvm@ce787c0)

Thanks,

Also, I would like to add some more things:

* There is some more interest in this: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112623.html

* The current shrink-wrapping implementation is very compile-time efficient and has caught many opportunities so far. Since there were signs of more performance potential, one of the main goals of my internship was to explore the remaining performance headroom.

* I am looking for a way to mix the current and the new implementation, or some kind of fallback to the old one when the new one does too much / not enough, or when things like code size matter more.

* While my internship is ending, I plan to continue working on this and to keep track of the patches until it becomes good enough.

Cheers,