Overview
We propose a new profile-guided and target-independent cost/benefit analysis for selecting conditional moves over branches when optimizing for performance. The goal is to i) consolidate the decision-making; ii) take ideas from existing passes and other sensible guidelines; and, iii) fully leverage profile information. Our experiments on x86 show 0.5-1.5% improvements on internal Google workloads and 1% performance improvement on clang bootstrap (all with instrumentation-based PGO).
Background
In both the x86 and ARM architectures, conditionally executed moves (e.g., a=c?b:d) can be implemented either with predicated code (i.e., a conditional move) or with a branch (i.e., a conditional jump). Which is optimal (when optimizing for performance) depends on how predictable the branch is and on whether the move is part of a critical and long dependence chain.
Branch:
- short latency by executing only the predicted path [+]
- relies on accurate branch prediction [-]
Conditional move:
- avoids branch mispredictions [+]
- increases the length of dependence chains by waiting on the computation of its operands (i.e., the condition and both paths’ expressions) [-]
- increases common-case resource consumption in the backend by computing both true/false operands [-]
Rules of thumb from Agner Fog’s optimization manual:
A branch is faster than a conditional move if:
- the move is part of a dependence chain, especially a long loop-carried one, and the prediction rate is better than 75%; or
- it can avoid a lengthy calculation of the non-chosen operand.
Current LLVM heuristics
Middle-end (LLVM IR)
- SimplifyCFG
- Converts branches to selects fairly aggressively to canonicalize the IR and simplify the control flow, thus facilitating subsequent LLVM IR optimizations.
- Problem: Not meant to make a final decision on selects vs branches.
- CodeGenPrepare
- Converts select to branch if:
- condition is obviously predictable (one-sided based on profile info); or
- either operand is expensive to compute and sinkable (even if it is the typically chosen one), and the condition is not shared with another select
- Problem: Shallow and conservative heuristics that rarely convert to branches and sometimes do so suboptimally. For example, assume the code
v1=load(x); select c,v1,v2. If c is almost always true, the load is on the common path and is always executed with either a branch or a conditional move. Thus, a branch is not obviously better than a cmov here.
CodeGen (Machine IR)
- X86CmovConversion
- Converts cmov to branch if:
- either value operand is a load; or
- decreased longest dependence chain for innermost loops
- Problem: A reasonable analysis, but it uses no profile information, is a bit aggressive (due to the first condition), and has issues with some low-level computations in the critical-path analysis.
- EarlyIfConversion and IfConversion
- Convert conditional branches into predicated instructions with target-specific heuristics.
- Problem: Effectively disabled for x86 and AArch64 (isProfitableToIfCvt returns false by default and is not overridden for x86 and AArch64).
Motivation
Motivating example 1
Choice of branch over cmov costs ~5% performance loss in x86:
Source code:
auto r = *p;
r = c ? r : 0;
LLVM IR:
%r1 = load i64, i64* %p
%cmp = icmp eq i64 %c, 0
%r2 = select i1 %cmp, i64 0, i64 %r1
Assembly 1 (Current codegen):
movq %rdi, %rax
testq %rdi, %rdi
je .L
movq (%rsi), %rax
.L:
Assembly 2 (Optimal codegen):
movq %rdi, %rax
testq %rdi, %rdi
cmovneq (%rsi), %rax
cmov is preferable here because the branch is unpredictable and the load is cheap (hits L1 based on profile info).
The branch is 50/50 in terms of frequency.
Motivating example 2
Choice of conditional moves over a branch costs 4-5% performance loss in ARM:
Source code:
while (i < n) {
double r = tr[i];
if (x <= r) {
i = 2 * i + 1;
} else {
x -= r;
i = 2 * i + 2;
}
}
Use of conditional moves puts x -= r (a load and a floating-point operation) on the critical path (loop-carried dependence chain), almost doubling the critical path’s latency compared to the average latency with a branch. The branch weights are 50/50 and the branch prediction here is decently accurate.
Design
The goal is to build a well-founded cost/benefit analysis for selecting conditional moves over branches. This analysis should fully leverage profile information and be target-agnostic so as to cater for multiple architectures.
Where to decide
SimplifyCFG is the first candidate. SimplifyCFG aggressively converts branches to selects to canonicalize the IR. A lot of subsequent LLVM IR optimizations (middle-end) rely on this canonical form. Adding more logic to SimplifyCFG and more conservatively converting to selects might have a lot of collateral damage later in the pipeline. Such proposals have been rejected in the past.
Another option is to do a target-specific implementation in Machine IR like X86CmovConversion. Such a pass is hard to extend with profile information since machine-level predicated code does not currently support weights (at least not for cmov in x86). Moreover, AArch64 and potentially other architectures would also benefit from a similar heuristic and re-implementing for them would lead to a lot of redundant logic.
CodeGenPrepare seems to be a good spot:
- end of the middle-end (not affecting middle-end optimizations but still in LLVM IR form)
- access to profile-based branch weights information on select instructions
- could still query target-specific information (used in the existing MachineIR X86CmovConversion pass) such as misprediction penalty and instruction latency, thanks to the instruction scheduling machine model that is available to CodeGen passes
- allows for target-independent implementation
- already contains logic that converts selects to branches for some limited cases
Probably the optimal option: a new CodeGen IR pass just before CodeGenPrepare. This option shares the advantages mentioned above while avoiding concerns about deviating CodeGenPrepare further from its original purpose (it was never meant to be an optimization pass).
How to decide
The employed heuristics attempt to match Agner Fog’s rule of thumb guidelines discussed in the background.
Guideline 1: branch preferable when move is part of a dependence chain, especially long loop-carried one, and prediction rate is better than 75%
Heuristic 1a: loop-level critical-path analysis
For this guideline, the proposed heuristic focuses on innermost loops, similarly to X86CmovConversion. Innermost loops represent hotspots, include critical loop-carried dependence chains, and allow for a well-defined scope. Preliminary results showed that computing dependence chains for other loops or non-loop code was ineffective in predicting good conversion opportunities. These cases are covered, instead, by heuristics 1b and 2.
Computing dependence chains in innermost loops:
Compute the end-to-end latency of two loop iterations for a predicated version (select instructions as conditional moves) and for a non-predicated version (selects converted to branches), same as in the X86CmovConversion pass.
Assumptions:
- Assume infinite resources, allowing the available instruction-level parallelism (ILP) to be fully exploited.
- Instructions outside the loop are not accounted for and cost zero cycles.
The cost of each instruction (in cycles) is the latency of the instruction itself plus the maximum cost among its operands:
InstCost = InstLatency + max(Op1Cost, Op2Cost, …, OpNCost)
For the case of a cmov instruction the cost is:
CmovCost = CmovLatency + max(CondCost, TrueOperandCost, FalseOperandCost)
The cost of a branch instruction with profile-based branch probabilities is:
BranchCost = CorrectPredCost + MispredictCost
where
CorrectPredCost = TrueOperandCost * TrueProb + FalseOperandCost * FalseProb
MispredictCost = max(MispredictPenalty, CondCost) * MispredictRate
Given a select a=c?b:d, the TrueOperandCost, FalseOperandCost, and CondCost are the costs of the true-expression instruction (computing b), the false-expression instruction (computing d), and the condition instruction (computing c), respectively.
The MispredictRate conservatively defaults to 25% (same as in X86CmovConversion). Use of profile information (LBR data) could allow for more accuracy here.
The MispredictPenalty is target-specific and is provided by the instruction scheduling machine model that is available to CodeGen passes (cycles wasted to recover from misprediction).
CondCost is included to account for cases where the computation of the condition is part of a long dependence chain (potentially loop-carried) that would delay detection of a misprediction.
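To make the recurrence concrete, here is a minimal C++ sketch of the cost model described above, under the stated infinite-resources assumption. The struct and function names are hypothetical and only mirror the formulas; this is not the actual pass implementation.

#include <algorithm>
#include <vector>

// Minimal sketch of the latency-based cost model (costs in cycles).
// Values defined outside the loop are assumed to cost zero cycles.
struct SelectCostModel {
  double MispredictRate = 0.25;   // conservative default, as in X86CmovConversion
  double MispredictPenalty = 8.0; // target-specific; 8 cycles in the AArch64 walkthrough below

  // Generic instruction: its latency plus the slowest of its operands.
  static double instCost(double InstLatency, const std::vector<double> &OpCosts) {
    double MaxOp = 0.0;
    for (double C : OpCosts)
      MaxOp = std::max(MaxOp, C);
    return InstLatency + MaxOp;
  }

  // Predicated form of a select (conditional move).
  static double cmovCost(double CmovLatency, double CondCost,
                         double TrueCost, double FalseCost) {
    return CmovLatency + std::max({CondCost, TrueCost, FalseCost});
  }

  // Branch form of a select, using profile-based probabilities.
  double branchCost(double CondCost, double TrueCost, double FalseCost,
                    double TrueProb) const {
    double CorrectPredCost = TrueCost * TrueProb + FalseCost * (1.0 - TrueProb);
    double MispredCost = std::max(MispredictPenalty, CondCost) * MispredictRate;
    return CorrectPredCost + MispredCost;
  }
};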
Conversion criteria (per select group; i.e., consecutive selects with the same condition):
- Conversion to branches reduces the loop’s critical path by some threshold, similar to the X86CmovConversion pass.
- The cost of a given select group is cheaper with the branch version compared to the cmov one.
Assuming infinite resources, the cost of a group of instructions is the cost of the most expensive instruction of the group.
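A minimal sketch of the per-group decision follows, assuming the costs above have already been computed for the whole loop and for the select group; the names are illustrative and the 4-cycle threshold is the default mentioned in the walkthrough below.

// Hypothetical per-group conversion criteria (names illustrative only).
struct SelectGroupCosts {
  double LoopCritPathPred;    // loop critical path with selects as cmovs
  double LoopCritPathNonPred; // loop critical path with selects as branches
  double GroupCostPred;       // most expensive select of the group, cmov form
  double GroupCostNonPred;    // most expensive select of the group, branch form
};

static bool shouldConvertGroupToBranch(const SelectGroupCosts &C,
                                       double LoopGainThreshold = 4.0) {
  // Criterion 1: converting reduces the loop's critical path by the threshold.
  bool LoopGain = (C.LoopCritPathPred - C.LoopCritPathNonPred) >= LoopGainThreshold;
  // Criterion 2: the group itself is cheaper in its branch form.
  bool GroupGain = C.GroupCostNonPred < C.GroupCostPred;
  return LoopGain && GroupGain;
}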
Guideline 1: branch preferable when move is part of a dependence chain, especially long loop-carried one, and prediction rate is better than 75%
Heuristic 1b: highly biased branches assumed to be highly predictable and thus converted to branches
For non-innermost loops and non-loops, properly calculating dependence chains could get complicated or even impractical. Therefore, the proposed work focuses on the extreme (but quite common) case of highly predictable selects that are almost always more performant as branches (rather than conditional moves), even without being part of long dependence chains.
Accurately estimating the prediction rate is hard, even with profile-based branch probabilities. For example, a 50/50 branch could potentially be predicted correctly 100% of the time, but it could also cause a lot of mispredictions. The branches that can safely be assumed to be easily predictable are the heavily biased ones: if a branch is (almost) always taken, one would expect a state-of-the-art branch predictor to predict it correctly most of the time (hopefully more often than 75% of the time). This heuristic already exists in CodeGenPrepare and is kept (the only heuristic kept from CodeGenPrepare). The X86CmovConversion pass also tries to identify a hard-to-predict pattern in tree-like algorithms, where the result of a cmov is used as a pointer to load from; this seems hacky and had a slight negative effect on some Google workloads, so it was not used. It seems unlikely that static analysis of the code would accurately predict misprediction rates. Instead, use of profile-based misprediction information could help for this guideline.
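As a minimal sketch, the bias check could look like the following; the threshold value is only an illustrative placeholder, not the cutoff used by CodeGenPrepare or the proposed pass.

#include <algorithm>

// Hypothetical sketch of heuristic 1b: a heavily biased select is assumed to
// be easily predictable and is therefore converted to a branch.
static bool isSelectHighlyBiased(double TrueProb) {
  const double BiasThreshold = 0.99; // placeholder for the "one-sided" cutoff
  return std::max(TrueProb, 1.0 - TrueProb) >= BiasThreshold;
}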
Guideline 2: branch preferable when avoiding a lengthy calculation of non-chosen operand
Heuristic 2: look for expensive instructions in the one-use chains of the cold path
For the rarely chosen operand (if any) of the select instruction (determined by an adjustable threshold that defaults to 20%), look for expensive instructions in the whole chain of single-use instructions leading to and including the computation of this operand (i.e., instructions used solely for the computation of the select’s true/false expression). If there is at least one expensive instruction in this chain, a branch avoids (in the common case) blocking on this expensive operation (e.g., a long-latency load instruction), so it seems preferable to convert the select to a branch.
This heuristic encapsulates the reasoning behind chandlerc’s patch to convert selects with load operands (in X86CmovConversion). Yet it evolves the idea in two ways: it looks only at rarely chosen expensive operands (if the load operand is chosen most of the time, the select does not do much unnecessary expensive work), and it looks beyond just the computation of the operand to a potentially bigger sequence of instructions.
This heuristic is not used for innermost loops, since those cases already have a relatively accurate cost/benefit analysis that takes expensive instructions in the cold path of selects into account (heuristic 1a).
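A minimal LLVM-flavored sketch of this chain walk follows; the isExpensive predicate stands in for whatever expensiveness query the pass would use (e.g., identifying long-latency loads) and is an assumption, not an existing API.

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Sketch of heuristic 2: starting from the rarely chosen operand of a select,
// walk the chain of single-use instructions that feed it and report whether
// any of them is expensive.
static bool coldOperandHasExpensiveSlice(
    Value *ColdOp, function_ref<bool(const Instruction *)> isExpensive) {
  SmallVector<Instruction *, 8> Worklist;
  SmallPtrSet<Instruction *, 8> Visited;
  if (auto *I = dyn_cast<Instruction>(ColdOp))
    if (I->hasOneUse()) // used only by the select
      Worklist.push_back(I);

  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    if (!Visited.insert(I).second)
      continue;
    if (isExpensive(I))
      return true;
    // Follow operands that exist solely to feed this value.
    for (Value *Op : I->operands())
      if (auto *OpI = dyn_cast<Instruction>(Op))
        if (OpI->hasOneUse())
          Worklist.push_back(OpI);
  }
  return false;
}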
How to transform
If the compiler decides to convert a select instruction such as a=c?op1:op2 to a branch, it needs to decide what to sink into the true and false blocks.
In its simplest form the true and false blocks will not contain any instructions and the merge block will include phi nodes that pick the chosen operand based on the manifested control flow:
op1 = …
op2 = …
br c, bb1, bb2
bb1: jmp bbm
bb2:
bbm: phi (bb1, op1), (bb2, op2)
Originally, CodeGenPrepare would only sink the computation of the operands (op1, op2) into the true/false blocks if:
- operands have a single use
- Safety condition to prevent invalid LLVM IR. It is an overapproximation since two select instructions with the same condition could be using the same operand in the same path (so two uses but it is sound to sink).
- computation of operands is expensive
- Seems unnecessary but inconsequential given that misplacing a cheap computation might not matter that much.
- computation of operands is safe to speculatively execute (isSafeToSpeculativelyExecute)
- Over-restrictive, probably for lack of a better LLVM API (initially the check was mayHaveSideEffects, but some clang errors led to this stricter check). Such a check would be needed for hoisting code to execute on all paths, not for sinking. A check of at least mayHaveSideEffects is still required, since sinkable operations should not have exposed downstream uses (anticipated).
The proposed approach sinks instructions more aggressively:
- amended the third condition with a safe but less strict check (e.g., no need to check for division by zero or dangerous loads when sinking for speed) and dropped the second condition, while keeping the first one. Originally, I was also accounting for cases where two selects use the same operands in the same path, but the extra complexity does not seem justified by the evaluated performance.
- sinking is not limited to just the computation of the operands (op1, op2). Instead, if the operands are one-use, the whole one-use chains leading to these operands are sunk; essentially, any instructions that are used solely to produce these operands (see the sketch below). This is especially beneficial for cold operands that are rarely chosen: in those cases, part of their dependence slices is sunk and no longer computed on the common path. There are also checks to ensure that there is no sinking into inner loops or, in general, into higher-frequency blocks.
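The source-level effect of sinking a whole one-use chain can be illustrated with the following hypothetical C++ example (the functions and constants are made up for illustration):

// Select form: the whole chain feeding the cold operand (the load and the
// multiply) executes unconditionally, even when c is almost always true.
long selectForm(bool c, long x, const long *p) {
  long t1 = *p;       // used only by t2
  long t2 = t1 * 13;  // used only by the select
  return c ? x : t2;
}

// Branch form after sinking: the one-use chain runs only on the rarely
// taken path.
long branchForm(bool c, long x, const long *p) {
  if (c)
    return x;
  long t1 = *p;
  long t2 = t1 * 13;
  return t2;
}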
In the case of multiple select instructions in the same group (same condition), the order of non-dependent instructions in the true/false blocks seems to affect performance. Interleaving the chains seems to have a beneficial effect for a Google workload. This interleaved scheduling allows for more ILP (with the natural downside of increased register pressure) compared to a simpler ordering of one whole chain after another. One would expect this ordering not to matter, since the scheduler in the compiler backend would take care of it, but perhaps backend schedulers become too constrained and fail to deliver optimal ILP when given a naive ordering at the beginning of CodeGen.
Heuristics in action
Walk through the heuristics for the motivating examples.
Motivating example 1
For the first motivating example none of the heuristics apply. Heuristic 1a only applies to innermost loops and thus is not applicable here. The condition in this example is 50/50, so it is not considered highly predictable (heuristic 1b does not apply), and there is no cold path (heuristic 2 does not apply). Therefore, no conversion is made and the select is optimally converted to a cmov.
Motivating example 2
The second motivating example involves an innermost loop.
The cost/benefit analysis of the 1a heuristic will be used.
Here is the dependence graph (with only data flow deps) for the operations within the loop:
[dependence graph figure not reproduced here]
As is clear from this dependence graph, using conditional moves, which respect all the flow dependences, creates long loop-carried dependence chains involving expensive floating-point operations. Using branches breaks some of these dependences and chains, significantly reducing the critical path, as demonstrated by the following detailed analysis.
In detail, as discussed earlier, the compiler computes the cost of all the instructions for two iterations for both the predicated (with cmovs) and non-predicated (branches) version.
Opcode latency for AArch64 in cycles (based on LLVM’s TargetTransformInfo):
iadd: 1, imul: 1, cmp: 1, fsub: 3, load: 4, fcmov: 3, icmov: 1
MispredictionPenalty for AArch64 (based on LLVM’s MCSchedModel): 8 cycles
Branch weights: 54/46
x, r, and s are doubles; i is an integer
Iteration 1:
r: pred_cost = 6, nonpred_cost = 6
c: pred_cost = 7, nonpred_cost = 7
a: pred_cost = 2, nonpred_cost = 2
b: pred_cost = 2, nonpred_cost = 2
i: pred_cost = 8, nonpred_cost = 4
s: pred_cost = 10, nonpred_cost = 6.5
x: pred_cost = 13, nonpred_cost = 9.5
Iteration 2:
r: pred_cost = 14, nonpred_cost = 10
c: pred_cost = 15, nonpred_cost = 11
a: pred_cost = 10, nonpred_cost = 6
b: pred_cost = 10, nonpred_cost = 6
i: pred_cost = 16, nonpred_cost = 8.75
s: pred_cost = 18, nonpred_cost = 10.25
x: pred_cost = 21, nonpred_cost = 13.25
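As a worked check of how these numbers follow from the cost formulas (this derivation is illustrative, using the icmov latency of 1, branch weights of 54/46, misprediction penalty of 8, and misprediction rate of 25% given above), consider the select for i in the first iteration:

pred_cost(i) = icmov latency + max(CondCost, TrueOperandCost, FalseOperandCost) = 1 + max(7, 2, 2) = 8
CorrectPredCost = 2 * 0.54 + 2 * 0.46 = 2
MispredictCost = max(MispredictPenalty = 8, CondCost = 7) * 0.25 = 2
nonpred_cost(i) = 2 + 2 = 4

These values match the table entries for i above.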
The critical path’s cost is 21 cycles for predicated code and 13.25 cycles for the non-predicated code.
This reduction satisfies the default threshold of a 4-cycle reduction in the loop’s critical path.
To ensure that the critical path keeps decreasing as the iteration count increases, there is also a check that the gradient of the gain is at least 25%. Here the gradient of the gain is 53%.
The most expensive select instruction (s) of the single select group here is also more profitable as a branch (10.25 cycles) than as a cmov (18 cycles).
Therefore, the compiler’s decision is to convert these selects to a branch, which is the optimal choice for this scenario.
Experimental Results
Google Workloads
We find that overall throughput for the Search1 Google workload improves by 0.5% and that CPU utilization for Search2 improves by up to 1%; these are significant improvements for these workloads.
The workloads are built with (instrumentation) CSFDO.
Search1 built with AFDO (Sample FDO) also shows a 0.3% improvement in throughput.
It seems that less accurate profiles lead to smaller, but still noticeable, gains.
For a critical Google database workload built with instrumentation FDO, the geomean improvement is 1.5%.
Clang bootstrap
We bootstrap clang, building both a PGO+ThinLTO-optimized version and a PGO+ThinLTO+cmov-opti version (the latter includes the proposed cmov optimization). We compare the performance of building clang on x86 using these two versions and observe a 1% improvement in total runtime with the proposed optimization (diff: -1.00% ±0.231% with p=0.000).
Thank you,
Sotiris Apostolakis
Software Engineer, Google