Basic block cloning enhancement for Propeller


Propeller introduced link time function layout optimizations that improves clang, SPEC and internal Google workload run times and relevant PMU metrics. I worked on enhancing Propeller with basic block cloning, which manages to get up to 1% speed up over Propeller on the same loads. This work enhances Propeller with the ability to clone machine basic blocks to take advantage of certain layout opportunities that are currently left on the table.


Propeller is a profile guided function layout optimization using LLVM. It attempts to improve cache and branch behaviour by favouring frequently executed edges and making them fall throughs in the optimized binary.

However, in cases where multiple edges are good candidates, Propeller has to pick one option and potentially leave some performance on the table. For instance, when presented with the following CFG fragment, it has to pick one of the edges:

┌───┐ 50000 ┌───┐ 49999 ┌───┐

│ A │ ──────:arrow_forward: │ C │ :arrow_backward:────── │ B │

└───┘ └───┘ └───┘

However, the BC edge is equally hot and having it to be a jump is a missed opportunity since we cannot make it a fallthrough, especially if the execution pattern looks like AC,AC,AC…,AC,BC,BC,BC,…,BC, i.e. the edges execute in separate phases of the program rather than being interleaved. We call blocks that have multiple hot incoming edges such as C hot join blocks.

To be able to make both edges fallthroughs in such cases, I worked on basic block cloning for Propeller during the summer and got some promising results. Part of the changes were in LLVM itself, specifically, adding the ability to clone Machine Basic Blocks in codegen.


While LLVM has the ability to clone IR Basic Blocks through CloneBasicBlock, there is no corresponding facility to clone MachineBasicBlocks.

I implemented the support for MBB cloning in BasicBlockSections pass as a 2 step process: making the clone and adjusting its predecessor to make it reachable and have a correct CFG with the new clone.

The cloning step makes a copy of the cloned block in the original block’s MachineFunction verbatim, copying all the instructions and successors. To ensure correctness, if the original block has a layout fallthrough, we insert an unconditional jump to the same block.

After a clone is made, it is made reachable by modifying one of the predecessors. For instance, in the below example, after the clone is made, the predecessor B is modified to now jump to C’ rather than C:

┌───┐ 50000 ┌───┐

│ A │ ──────▶T │ C │

└───┘ └───┘

┌───┐ 49999 ┌───┐

│ B │ ──────▶ │ C’│

└───┘ └───┘

We use TargetInstrInfo::removeBranch and TargetInstrInfo::insertBranch to modify the branches accordingly.

Which blocks to clone are decided in create_llvm_prof. To accommodate the cloning information, we add a new cloning decisions section which contains recipes to make clones for every function. The decision algorithm discovers frequently executed paths by cloning hot join blocks and updating the CFG weights correctly by querying the Last Branch Records (LBR).

Due to the inaccurate nature of profiles, or any other change that renders the profile inaccurate, it is possible to have clone decisions that are impossible to perform in LLVM. Therefore, there also exists the correct handling of bad clone decisions. In such cases where a clone decision is impossible to perform (missing blocks, wrong predecessor information etc.), we back out of that function and make no clones.


With path profile guided cloning decisions, we see consistent performance and branch characteristic improvements across the board and some increased i-cache pressure compared to vanilla Propeller. There is also no significant binary bloat in any programs.


150 Clones


BB Cloning


Run time




Taken branches




Not-taken branches




I-cache misses




Binary size




(2032 bytes)


All the numbers are percent changes from propeller. With 100 Clones in all loads.










Run time










Taken branch










Not taken branch










I-cache miss










Google workloads

On the Search2 benchmark with 2000 clones.