Basic block cloning enhancement for Propeller

TL;DR

Propeller introduced link time function layout optimizations that improves clang, SPEC and internal Google workload run times and relevant PMU metrics. I worked on enhancing Propeller with basic block cloning, which manages to get up to 1% speed up over Propeller on the same loads. This work enhances Propeller with the ability to clone machine basic blocks to take advantage of certain layout opportunities that are currently left on the table.

Motivation

Propeller is a profile guided function layout optimization using LLVM. It attempts to improve cache and branch behaviour by favouring frequently executed edges and making them fall throughs in the optimized binary.

However, in cases where multiple edges are good candidates, Propeller has to pick one option and potentially leave some performance on the table. For instance, when presented with the following CFG fragment, it has to pick one of the edges:

┌───┐ 50000 ┌───┐ 49999 ┌───┐

│ A │ ──────:arrow_forward: │ C │ :arrow_backward:────── │ B │

└───┘ └───┘ └───┘

However, the BC edge is equally hot and having it to be a jump is a missed opportunity since we cannot make it a fallthrough, especially if the execution pattern looks like AC,AC,AC…,AC,BC,BC,BC,…,BC, i.e. the edges execute in separate phases of the program rather than being interleaved. We call blocks that have multiple hot incoming edges such as C hot join blocks.

To be able to make both edges fallthroughs in such cases, I worked on basic block cloning for Propeller during the summer and got some promising results. Part of the changes were in LLVM itself, specifically, adding the ability to clone Machine Basic Blocks in codegen.

Implementation

While LLVM has the ability to clone IR Basic Blocks through CloneBasicBlock, there is no corresponding facility to clone MachineBasicBlocks.

I implemented the support for MBB cloning in BasicBlockSections pass as a 2 step process: making the clone and adjusting its predecessor to make it reachable and have a correct CFG with the new clone.

The cloning step makes a copy of the cloned block in the original block’s MachineFunction verbatim, copying all the instructions and successors. To ensure correctness, if the original block has a layout fallthrough, we insert an unconditional jump to the same block.

After a clone is made, it is made reachable by modifying one of the predecessors. For instance, in the below example, after the clone is made, the predecessor B is modified to now jump to C’ rather than C:

┌───┐ 50000 ┌───┐

│ A │ ──────▶T │ C │

└───┘ └───┘

┌───┐ 49999 ┌───┐

│ B │ ──────▶ │ C’│

└───┘ └───┘

We use TargetInstrInfo::removeBranch and TargetInstrInfo::insertBranch to modify the branches accordingly.

Which blocks to clone are decided in create_llvm_prof. To accommodate the cloning information, we add a new cloning decisions section which contains recipes to make clones for every function. The decision algorithm discovers frequently executed paths by cloning hot join blocks and updating the CFG weights correctly by querying the Last Branch Records (LBR).

Due to the inaccurate nature of profiles, or any other change that renders the profile inaccurate, it is possible to have clone decisions that are impossible to perform in LLVM. Therefore, there also exists the correct handling of bad clone decisions. In such cases where a clone decision is impossible to perform (missing blocks, wrong predecessor information etc.), we back out of that function and make no clones.

Evaluation

With path profile guided cloning decisions, we see consistent performance and branch characteristic improvements across the board and some increased i-cache pressure compared to vanilla Propeller. There is also no significant binary bloat in any programs.

clang

150 Clones




Baseline



BB Cloning



Difference



Run time



135.217



134.023



-0.88%



Taken branches



31051306225



30524253874



-1.7%



Not-taken branches



56745337469



56954834930



+0.37%



I-cache misses



15979272719



16161290846



+1.1%



Binary size



75613543



75615575



+0.002%



(2032 bytes)

SPEC2017

All the numbers are percent changes from propeller. With 100 Clones in all loads.




gcc



perl



mcf



omnetpp



xalancbmk



x264



deepsjeng



leela



xz



Run time



-0.30%



-2.59%



+2.09%



-0.16%



0%



-2.17%



0%



+1.13%



-0.71%



Taken branch



-1.8%



+2.2%



-10.6%



-1.3%



-2%



-1.3%



-6.1%



-9.2%



-11%



Not taken branch



+0.8%



0%



+3.4%



+0.9%



+12%



-4.6%



+1%



+2.7%



+15.2%



I-cache miss



+0.7%



+2%



+6.3%



+15.2%



-2.4%



+6.5%



+63.3%



+131.3%



0%

Google workloads

On the Search2 benchmark with 2000 clones.