[RFC] cmov-vs-branch optimization

Overview

We propose a new profile-guided and target-independent cost/benefit analysis for selecting conditional moves over branches when optimizing for performance. The goal is to i) consolidate the decision-making; ii) take ideas from existing passes and other sensible guidelines; and, iii) fully leverage profile information. Our experiments on x86 show 0.5-1.5% improvements on internal Google workloads and 1% performance improvement on clang bootstrap (all with instrumentation-based PGO).

Background

In both the x86 and ARM architectures, conditionally executed moves (e.g., a=c?b:d) could be implemented with either predicated code (i.e., conditional move) or with a branch (i.e., conditional jump). Which is optimal (when optimizing for performance) depends on how predictable the branch is and whether the move is part of a critical and long dependence chain.

Branch:

  • short latency by executing only the predicted path [+]
  • relies on accurate branch prediction [-]

Conditional move:

  • avoids branch mispredictions [+]
  • increases the length of dependence chains by waiting on the computation of its operands
    (i.e., the condition and both paths’ expressions) [-]
  • increase common-case resource consumption in the backend by computing both true/false operands [-]

Rules of thumb in Agner Fog’s optimization manual.
Branch is faster than a conditional move if:

  1. move is part of a dependence chain, especially a long loop-carried one,
    and prediction rate is better than 75%; or
  2. can avoid a lengthy calculation of non-chosen operand

Current LLVM heuristics

Middle-end (LLVM IR)

  • SimplifyCFG
    • Converts fairly aggressively branches to selects to canonicalize the IR and simplify the control flow, and thus facilitate subsequent LLVM IR optimizations.
    • Problem: Not meant to make a final decision on selects vs branches.
  • CodeGenPrepare
    • Converts select to branch if:
      • condition is obviously predictable (one-sided based on profile info); or
      • either operand is expensive-to-compute and sinkable (even if typically chosen one) and the condition is not shared with another select
    • Problem: Shallow and conservative heuristics that rarely convert to branches and sometimes even suboptimally do so. For example, assume code v1=load(x); select c,v1,v2. If c is typically always true, the load is on the common path and always executed with either a branch or a conditional move. Thus, the branch is not obviously better than a cmov here.

CodeGen (Machine IR)

  • X86CmovConversion
    • Converts cmov to branch if:
      • either value operand is a load; or
      • decreased longest dependence chain for innermost loops
    • Problem: Reasonable analysis but no profile information used, a bit aggressive (due to the first condition), and issues with some low-level computations in the critical path analysis.
  • EarlyIfConversion and IfConversion
    • Convert conditional branches into predicated instructions with target-specific heuristics.
    • Problem: Effectively disabled for x86 and AArch64 (isProfitableToIfCvt returns false by default and is not overridden for x86 and AArch64).

Motivation

Motivating example 1

Choice of branch over cmov costs ~5% performance loss in x86:

Source code:

auto r = *p;
r = c ? r : 0;

LLVM IR:

%r1 = load i64, i64* %p
%cmp  = icmp eq i64 %c, 0  
%r2 = select i1 %cmp, i64 0, i64 %r1

Assembly 1 (Current codegen):

  movq    %rdi, %rax
  testq   %rdi, %rdi
  je      .L
  movq    (%rsi), %rax
.L 

Assembly 2 (Optimal codegen):

movq    %rdi, %rax
testq   %rdi, %rdi 
cmovneq (%rsi), %rax

cmov is preferable here because the branch is unpredictable and the load is cheap (hits L1 based on profile info).
The branch is 50/50 in terms of frequency.

Motivating example 2

Choice of conditional moves over a branch costs 4-5% performance loss in ARM:

Source code:

while (i < n) {
   double r = tr[i];
   if (x <= r) {
        i = 2 * i + 1;
   } else {
        x -= r;
        i = 2 * i + 2;
   }
}

Use of conditional moves puts x -= r (a load and a floating-point operation) on the critical path (loop-carried dependence chain), almost doubling the critical path’s latency compared to the average latency with a branch. The branch weights are 50/50 and the branch prediction here is decently accurate.

Design

The goal is to build a well-founded cost/benefit analysis for selecting conditional moves over branches. This analysis should fully leverage profile information and be target-agnostic so as to cater for multiple architectures.

Where to decide

SimplifyCFG is the first candidate. SimplifyCFG aggressively converts branches to selects to canonicalize the IR. A lot of subsequent LLVM IR optimizations (middle-end) rely on this canonical form. Adding more logic to SimplifyCFG and more conservatively converting to selects might have a lot of collateral damage later in the pipeline. Such proposals have been rejected in the past.

Another option is to do a target-specific implementation in Machine IR like X86CmovConversion. Such a pass is hard to extend with profile information since machine-level predicated code does not currently support weights (at least not for cmov in x86). Moreover, AArch64 and potentially other architectures would also benefit from a similar heuristic and re-implementing for them would lead to a lot of redundant logic.

CodegenPrepare seems to be a good spot:

  • end of the middle-end (not affecting middle-end optimizations but still in LLVM IR form)
  • access to profile-based branch weights information on select instructions
  • could still query target-specific information (used in the existing MachineIR X86CmovConversion pass) such as misprediction penalty and instruction latency, thanks to the instruction scheduling machine model that is available to CodeGen passes
  • allows for target-independent implementation
  • already contains logic that converts selects to branches for some limited cases

Probably optimal option: a new Codegen IR pass just before CodegenPrepare. This option will share the advantages mentioned before in addition to avoiding concerns over deviating further CodegenPrepare from its original purpose (never meant to be an optimization pass).

How to decide

The employed heuristics attempt to match Agner Fog’s rule of thumb guidelines discussed in the background.

Guideline 1: branch preferable when move is part of a dependence chain, especially long loop-carried one, and prediction rate is better than 75%
Heuristic 1a: loop-level critical-path analysis

For this guideline, the proposed heuristic focuses on innermost loops similarly to X86CmovConversion. Innermost loops represent hotspots, include critical loop-carried dependence chains, and allow for a well-defined scope. Preliminary results showed that computing dependence chains for other loops or non-loops was ineffective in predicting good conversion opportunities. These cases are covered, instead, by the second and the third heuristic.

Computing dependence chains in innermost loops:

Compute the end-to-end latency of two loop iterations for a predicated version (select instructions as conditional moves) and for a non-predicated version (selects converted to branches), same as in the X86CmovConversion pass.

Assumptions:

  • Assume infinite resources that allow to fully exploit the available instruction-level parallelism (ILP).
  • Instructions outside the loop are not accounted for and cost zero cycles.

The cost of each instruction (in cycles) is the cost of the instruction itself in addition to the maximum cost of each operands:
InstCost = InstLatency + max(Op1Cost, Op2Cost, … OpNCost);

For the case of a cmov instruction the cost is:
CmovCost = CmovLatency + max(CondCost, TrueOperandCost, FalseOperandCost)

The cost of a branch instruction with profile-based branch probabilities is:
BranchCost = CorrectPredCost + MispredictCost
where
CorrectPredCost = TrueOperandCost * TrueProb + FalseOperandCost * FalseProb MispredictCost = max(MispredictPenalty, CondCost) * MispredictRate

Given a select a=c?b:d, the TrueOperandCost, FalseOperandCost, andCondCostare the costs of the true-expression instruction (compute b), the false-expression instruction (compute d) and the condition instruction (compute c), respectively.

The MispredictRate conservatively defaults to 25% (same as in X86CmovConversion). Use of profile information (LBR data) could allow for more accuracy here.

The MispredictPenalty is target-specific and is provided by the instruction scheduling machine model that is available to CodeGen passes (cycles wasted to recover from misprediction).

CondCost is included to account for cases where the computation of the condition is part of a long dependence chain (potentially loop-carried) that would delay detection of a misprediction.

Conversion criteria (per select group; i.e., consecutive selects with the same condition):
  • Conversion to branches reduces the loop’s critical path by some threshold, similar to the X86CmovConversion pass.
  • The cost of a given select group is cheaper with the branch version compared to the cmov one.
    Assuming infinite resources, the cost of a group of instructions is the cost of the most expensive instruction of the group.
Guideline 1: branch preferable when move is part of a dependence chain, especially long loop-carried one, and prediction rate is better than 75%
Heuristic 1b: highly biased branches assumed to be highly predictable and thus converted to branches

For non-innermost loops and non-loops, properly calculating dependence chains could get complicated or even impractical. Therefore, the proposed work focuses on the extreme (but quite common) case of highly predictable selects that are almost always more performant as branches (rather than conditional moves), even without being part of long dependence chains.

Accurate estimation of prediction rate is hard, even with the presence of profile-based branch probabilities. For example, a 50/50 branch could be potentially 100% predicted correctly, but could surely also cause a lot of mispredictions. Branches that could be assumed to be easily predictable are the ones that are heavily biased. If a branch is always taken, one would expect a state-of-the-art branch predictor to predict it correctly at least most of the time (hopefully more often than 75% of the time). This heuristic already exists in the CodeGenPrepare and it is kept (the only one kept from CodeGenPrepare). The X86CmovConversion pass tries to identify a hard-to-predict pattern in tree-like algorithms where the result of a cmov is used as a pointer to load from. This seems hacky and it had a slight negative effect on some Google workloads, so it was not used. It seems unlikely that static analysis of the code would accurately predict misprediction rates. Instead, use of profile-based misprediction information could help for this guideline.

Guideline 2: branch preferable when avoiding a lengthy calculation of non-chosen operand
Heuristic 2: look for expensive instructions in the one-use chains of the cold path

For the rarely chosen operand (if any) of the select instruction (adjustable knob that defaults to 20%), look for expensive instructions in the whole chain of single-use instructions leading to and including the computation of this operand (instructions used solely for the computation of the select’s true/false expressions). If there is at least one expensive instruction in this chain, a branch would avoid (in the common case) blocking on this expensive operation (e.g., a long latency load instruction) and thus it seems preferable to convert the select to a branch.

This heuristic encapsulates the reasoning behind chandlerc’s patch to convert selects with load operands (in X86CmovConversion). Yet, it evolves the idea to look for only rarely-chosen expensive operands (if the load operand is executed most of the time then the select does not do much unnecessary expensive work) and beyond just the computation of the operand to a potentially bigger sequence of instructions.

This heuristic is not used for inner-most loops since those cases already have a relatively accurate cost/benefit analysis that takes into account expensive instructions in the cold path of selects (heuristic 1).

How to transform

If the compiler decides to convert a select instruction such as a=c?op1:op2 to a branch, it needs to decide what to sink into the true and false blocks.

In its simplest form the true and false blocks will not contain any instructions and the merge block will include phi nodes that pick the chosen operand based on the manifested control flow:

     op1 = … 
     op2 = … 
     br c, bb1, bb2 
bb1: jmp bbm  
bb2: 
bbm: phi (bb1, op1), (bb2, op2)

Originally, CodeGenPrepare would only sink to the true/false blocks the computation of the operands (op1, op2) if :

  • operands have a single use
    • Safety condition to prevent invalid LLVM IR. It is an overapproximation since two select instructions with the same condition could be using the same operand in the same path (so two uses but it is sound to sink).
  • computation of operands is expensive
    • Seems unnecessary but inconsequential given that misplacing a cheap computation might not matter that much.
  • computation of operands is safe to speculatively execute (isSafeToSpeculativelyExecute)
    • over-restrictive for probably lack of a better LLVM API (initially it was mayHaveSideEffects but some clang errors led to this check). This would be needed on hoisting to execute on all paths, not on sinking. Use of at least mayHaveSideEffects is required since sinkable operations should not have exposed downstream uses (anticipated).

The proposed approach sinks instructions more aggressively:

  • amended the third condition with a safe but less strict check (e.g., no need to check for division by zero and dangerous loads when sinking for speed) and dropped the second check. Kept though the first condition. Originally, I was accounting for cases where two selects were using the same operands in the same path but the extra complexity does not seem to be justified by the evaluated performance.
  • sinking is not limited to just the computation of the operands (op1, op2). Instead, if the operands are one-use, it sinks the whole one-use chains that lead to these operands. Essentially, any instructions that are solely used to produce these operands. This is especially beneficial for cold operands that are rarely chosen. In those cases, part of their dependence slices are sinked and are no longer computed in the common path. There are also checks to ensure that there is no sinking to inner-loops and in general higher frequency blocks.

In the case of multiple select instructions in the same group (same condition), the order of non-dependent instructions in the true/false blocks seems to affect performance. Interleaving the chains seems to have a beneficial effect for a Google workload. This interleaving scheduling would allow for more ILP (with a natural downside of increasing register pressure) compared to a simpler ordering of one whole chain after another. One would expect that this ordering would not matter since the scheduling in the backend of the compiler would take care of it, but maybe backend schedulers become too restricted and fail to deliver optimal ILP with a naive ordering in the beginning of CodeGen.

Heuristics in action

Walk through the heuristics for the motivating examples.

Motivating example 1

For the first motivating example none of the heuristics apply. The heuristic 1a only applies to innermost loops and thus it is not applicable here. The condition in this example is 50/50, and thus it is not considered highly predictable (heuristic 1b not applicable) and there is not a cold path (heuristic 2 not applicable). Therefore, no conversion will be made and the select will be optimally converted to a cmov.

Motivating example 2

The second motivating example involves an innermost loop.
The cost/benefit analysis of the 1a heuristic will be used.

Here is the dependence graph (with only data flow deps) for the operations within the loop:

dep_graph

As it is clear from this dependence graph, using conditional moves, that would respect all the flow dependences, will create long loop-carried dependence chains involving expensive floating point operations. Using branches will break some of these dependences and chains, significantly reducing the critical path, as demonstrated with the following detailed analysis.

In detail, as discussed earlier, the compiler computes the cost of all the instructions for two iterations for both the predicated (with cmovs) and non-predicated (branches) version.

Opcode latency for AArch64 in cycles (based on LLVM’s TargetTransformInfo):
iadd: 1, imul: 1, cmp: 1, fsub: 3, load: 4, fcmov: 3, icmov: 1

MispredictionPenalty for AArch64 (based on LLVM’s MCSchedModel): 8 cycles
Branch weights 54/46
x, r, s are double and i is integer

Iteration 1:
r: pred_cost = 6, nonpred_cost = 6
c: pred_cost = 7, nonpred_cost = 7
a : pred_cost = 2, nonpred_cost = 2
b : pred_cost = 2, nonpred_cost = 2
i : pred_cost = 8, nonpred_cost = 4
s : pred_cost = 10, nonpred_cost = 6.5
x : pred_cost = 13, nonpred_cost = 9.5

Iteration 2:
r: pred_cost = 14, nonpred_cost = 10
c: pred_cost = 15, nonpred_cost = 11
a : pred_cost = 10, nonpred_cost = 6
b : pred_cost = 10, nonpred_cost = 6
i : pred_cost = 16, nonpred_cost = 8.75
s : pred_cost = 18, nonpred_cost = 10.25
x : pred_cost = 21, nonpred_cost = 13.25

The critical path’s cost is 21 cycles for predicated code and 13.25 cycles for the non-predicated code.
This reduction satisfies the default threshold of 4 cycles reduction in the loop’s critical path.
To ensure that the critical path reduces as the iteration count increases, there is also a check for the gradient of the gain to be at least 25%. Here the gradient of the gain is 53%.

The most expensive select instruction (s) of the one select group here is also more profitable as a branch (10.25) rather than a cmov (18).

Therefore, the compiler’s decision is to convert these selects to a branch, which is the optimal choice for this scenario.

Experimental Results

Google Workloads

We find that overall throughput for Search1 Google workload improves by 0.5% and CPU utilization improves up to 1% for Search2; significant improvements on these workloads.
The workloads are built with (instrumentation) CSFDO.

Search1 built with AFDO (Sample FDO) also shows a 0.3% improvement in throughput.
It seems that less accurate profiles lead to smaller, but still noticeable, gains.

For a critical Google database workload built with instrumentation FDO, the geomean improvement is 1.5%.

Clang bootstrap

We bootstrap clang and build a PGO+ThinLTO optimized version and PGO+ThinLTO+cmov-opti compile (the latter includes the proposed cmov optimization). We compare the performance of building clang on x86 using these two versions and observe a 1% improvement in total runtime with the proposed optimization (diff: -1.00% ±0.231% with p=0.000).

Thank you,
Sotiris Apostolakis
Software Engineer, Google

8 Likes

As we have ML and all these nice cool new things… my more general question:

Did you consider this approach vs ML-based “cmov conversion” pass? Like for inliner, there is ML-based inling advisor. No PGO profile needed for ML-based approach as well.

But overall, +1 for this RFC. PGO based approach makes perfect sense.

Recently I was thinking how to give an user who really cares about performance more control over “cmov vs branch” and I think we can improve situation more or less with something like https://reviews.llvm.org/D118118 (feel free to comment)

1 Like

Like for inliner, there is ML-based inling advisor. No PGO profile needed for ML-based approach as well.

A clarification, not needed currently because it’s only targeting -O{z|s}. We will very much want profiles when we do -O3. We also depend on profile presence and their accuracy for the ML regalloc eviction advisor.

We plan to consider a ML-based approach in the future. It is unclear though how much headroom will be left to justify it, and as Mircea clarified you still need profile information.

That’s another reason for a more sophisticated LLVM IR pass. Having access to this kind of metadata without having to change the format of the instructions.

Yes, this will be a good candidate optimization for MLGO. Profile information will be important for critical path identification.

That looks great! I assume this pre-CodeGenPrepare pass will make the Codegen code simpler by removing heuristics from backend (?)

Thanks for taking a look.
This will depend on the target.
The presented results for x86 disable the current heuristics in CodeGenPrepare and the whole X86CmovConversion MachineIR pass. So, yes the Codegen code is simplified for x86. This is currently done only when profile information is available. However, it might be worth completely deprecating X86CmovConversion and bringing all the logic to the pre-CodeGenPrepare pass even for cases with no profile information and add some target-specific knobs to disable parts of the heuristics.
For AArch64 which will be the second target we will investigate, there is not much logic in the backend making this decision, so I do not expect much simplification in the backend but hopefully this new pass will improve performance.

1 Like

Was there anything done for the profile collection build in AFDO? One thing I’ve come up against is that since CMOV is a canonicalization the profiling build will have CMOV more often than no. Since profiling is LBR based no branch probabilities get collected.

Nothing special was done for the used AFDO profile. Just used an already collected AFDO profile.
The initial effort focuses on instrFDO and AFDO will be investigated more thoroughly next. Some tuning might be needed in the heuristics to address AFDO issues.

Good observation regarding LBR. The lack of branch probabilities for cmovs for LBR-based profiles poses a challenge. Note though that there is still some benefit from converting when profitable (currently emitted) branches to cmovs. After that depending on the non-profile-based heuristics there is a potential issue of oscillations, where the decision to convert to cmov will be reverted, and then reverted again once profile information are once again available etc.
There is though a very interesting advantage with AFDO. LBR data already contain a misprediction bit allowing computation of misprediction rates. So, the decision for whether a branch should be converted to a cmov becomes very well-informed and much more accurate.

In short, there are surely some new challenges with AFDO but also new opportunities. I will be investigating them soon after I finish with InstrFDO.

Yep this makes it tricky. In addition, branches that start out as CMOV will be stuck in that state as well.

That is an exciting advantage. With modern branch predictors biasing is only a loose correlation with predicability.

Awesome, excited to see what you come up with! I can help out with additional evaluation from workloads at Meta.

Yes, exactly. Even a 50/50 branch could have 100% predictability and a somewhat biased one could have surprisingly low predictability.

That would be great. I will keep you in the loop.

1 Like

Hi,

Does this also take into account the cases where the SBB instruction is used to remove the branch entirely? The SBB method is already implemented in LLVM.
E.g.
Consider the following x86_64 assembly:
cmpl $0x12,0x308(%rdi)
sbb %esi,%esi
and $0xffffffffffffffdf,%esi
add $0x5b,%esi
It contains the SBB instruction. SBB cannot be represented in LLVM_IR
cmpl: if ($0x12 < 0x308(%rdi)) set the carry flag.
sbb: if carry flag set: %esi = 0xffffffffffffffff else %esi = 0; // Note: 0xffffffffffffffff = -1
and: if carry flag set: %esi = 0xffffffffffffffdf else %esi = 0; // Note: 0xffffffffffffffdf = -33
add: if carry flag set: %esi = 0x3a else %esi = 0x5b;
This can then be converted into:
if ($0x12 < 0x308(%rdi)) {
%esi = 58; // 0x3a
} else {
%esi = 91; // 0x5b
}
So, the SBB is used here to remove the need for a Branch instruction.
WOW, compilers are clever!!!

Yes the compiler sometimes makes smart decisions in instruction selection, and indeed in some cases it will prefer a sequence including a sbb over a cmov. However, regardless of whether the compiler chooses to lower branchless-code/conditional-moves as predicated code (i.e., cmov) or some arithmetic tricks (e.g., with sbb), it does not change much regarding the dilemma between branchless and branch-ed code. In all the branchless instances including your example, both value operands of the conditional move will need to be computed (here happen to be constant numbers). The heuristics focus on the computation of the value operands and the condition, and decide on whether a branch is preferable. So, if for example one rarely-chosen value operand is a load, it is very likely that you would prefer code with a branch (that will almost never have to execute this load) rather than any form of branchless code.

This looks awesome. There was thought on unifying branch-select logic ⚙ D34769 [X86] X86::CMOV to Branch heuristic based optimization and the prototype looks having achieved it, I’m pleased to test the patch on PowerPC.

I have a question, does this pass take instruction legality into account? E.g., on PowerPC we don’t have FCMOV like instruction for float numbers.

Glad you liked the proposal.
Yes, it seems to me that unifying this decision-making in a target-independent fashion was long overdue.
This proposal will probably be opt-in for new targets and initially only enabled for x86, since it is currently only tested for x86.
Once the initial patches land, it will be great if you test it for PowerPC and enable it in a subsequent patch if it looks okay.
We can also add at that point some extra bailout checks (overridable by targets). I currently only have some basic checks (e.g., whether selects in general are supported by the target). I do not fully check legality. Not sure if it always necessary since I think most targets (including I think PowerPC) will lower unsupported instances of selects to a legal sequence of instructions.

Hi, @apostolakis

As I go through some benchmark, GCC seems better in dealing with this problem even in O2 without PGO. So is there any points or opportunity to enhance this branch/select problem in LLVM even without profiling?

For x86, the X86CmovConversion pass was created (fairly recently) and refined for optimizing non-PGO builds. The new heuristics mostly rely on profile information and I do not expect much improvements for non-PGO builds on x86.
On the other hand, there is much more room for improvement for other architectures and in particular for AArch64. AArch64 backend almost always prefers conditional moves over branches without any reasoning. I have seen some improvements for AArch64 with the new optimization for non-PGO builds.
One other potential optimization for all architectures would be to have SimplifyCFG more aggressively convert to predicated code to facilitate more LLVM IR optimizations and then make all the branch-vs-cmovs decisions just before Codegen.

You mean the pass SelectOptimize you add? It can also be enable in non-PGO?

And I also see that DisableSelectOptimize is true as default, why it’s not enable as default?

Sorry I missed your comment.

Yes it can be enabled for non-PGO builds. In non-PGO, only the loop-level analysis will be used but with conservative assumptions for weights (profile information would have improved the cost modeling). For x86, the non-PGO version will not improve much over the X86CmovConversion pass which already does non-profile-guided loop-level analysis. Yet for AArch64 and other archs, this loop-level analysis should be beneficial.

It has been thoroughly evaluated and tuned only for instr-PGO builds. Not much testing on non-PGO builds and other flavors of PGO yet. Additionally, as I said, for x86 the non-PGO version will not give much of an improvement. It could eventually replace the X86CmovConversion pass and then get enabled by default but that will require a lot of additional testing.

Thank you for your reply.
I test the pass in RISCV hardware. It seems no much improvement as expected (RISCV is similar to AArch64). I am not diving into the result to figure out why it’s coming out. Maybe give you more specific feedback later.