With the increasing deployment of runtime safety checks in native binaries, such as bounds checks (e.g. -fsanitize=bounds, LLVM control flow integrity, Rust array/slice bounds checks) and overflow checks (e.g. -fsanitize=signed-integer-overflow, Swift arithmetic), the runtime cost of these checks is increasingly important. Of course, the most efficient check is one that isn’t executed at all because the compiler has optimized it away, but when we do emit a check, we should make it as efficient as possible. This document focuses on the x86_64 architecture.
Architecturally, we need three instructions to implement a check: a compare instruction (or some other instruction that sets flags), a conditional branch, and a branch target, which must prevent further execution of the program. We focus on the branch itself and the branch target. The status quo is a conditional branch to a UD1 instruction placed at the end of the function, and in this RFC I explore a number of alternatives to it. How best to optimize a check is highly microarchitecture-dependent, so the performance impact is measured on a variety of microarchitectures.
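Schematically, a status quo check looks something like this (the compare and its constant are illustrative, not taken from real output):
cmp $0x1234, %rax ; the compare sets the flags
ja 1f ; conditional branch, taken only if the check fails
; ... rest of the function ...
1: ud1 ; trap, placed at the end of the function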
The following experiments were carried out by measuring the performance impact of LLVM CFI on a large realistic internal Google benchmark. Additional experiments were carried out using the Fleetbench benchmark suite; the benchmark BM_PROTO_Arena was found to be directionally representative of the internal Google benchmark, at least as far as CFI performance impact is concerned. All experiments were controlled for link order measurement bias.
At the risk of burying the lede, I cover the experiments in the order that I performed them, including the negative results. Experiment 3 was the only experiment to show an improvement, so you can skip to that if you like.
Experiment 1: conditional branch to UDB
With the status quo, the UD1 at the end of the function could be, and frequently is, far enough away to trigger a long (6-byte) conditional branch. Long branches can increase code size, which can harm performance.
Recently, the one-byte instruction 0xD6 (UDB) was officially designated as guaranteed to raise an invalid-opcode exception (#UD). This creates a number of interesting possibilities for implementing the trap as a short conditional branch, as long as we can find some way to place a 0xD6 byte near the branch. Let’s assume that our compare instruction will set CF=0 and ZF=0 if the check fails, so that the naive sequence would be a ja instruction that branches to the trap. I evaluated the following options for placing the 0xD6 byte:
Option 1: conditional branch into middle of NOP
ja 1f+2
1: .byte 0x0f, 0x1f, 0xd6 ; this is a 3-byte NOP ending in 0xd6
Option 2: conditional branch past unconditional branch past trap
ja 1f
jmp 2f
1: udb
2:
Option 3: inverted conditional branch past trap
jbe 1f
udb
1:
In the prototype implementation, I emitted one of these forms only if the branch target was far enough away to require a long branch. This allows us to compare the microarchitectural cost of a long branch against that of the embedded UDB instruction sequences.
I also considered a conditional branch to a pre-existing 0xD6 byte that occurs within the encoding of another instruction, but such bytes don’t seem to be common enough to make this worthwhile.
Unfortunately, none of these options resulted in a performance improvement relative to the baseline, according to the internal Google benchmark. The following results (CFI overhead; lower is better) show a regression on all tested microarchitectures:
          Skylake  Rome     Milan
Baseline  2.607%   2.123%   1.946%
Option 1  2.752%   2.1455%  2.046%
Option 2  3.134%   2.5065%  2.192%
Option 3  2.8875%  2.471%   2.487%
Experiment 2: replace conditional branches with CMOV
This experiment was a CFI-specific optimization. With CFI, we need to perform some checks on the target address before branching to it. We can avoid the conditional branches entirely by replacing them with a CMOV of 0 (or some other harmless value) into the register containing the target address, or the register from which the target address is loaded, so that a failed check makes the subsequent indirect call fault. For example, a virtual call:
0: 4d 31 db xor %r11,%r11
3: b9 e0 55 d8 04 mov $0x4d855e0,%ecx
8: 48 29 c1 sub %rax,%rcx
b: 48 c1 c9 05 ror $0x5,%rcx
f: 48 83 f9 21 cmp $0x21,%rcx
13: 49 0f 47 c3 cmova %r11,%rax
17: 48 ba 49 92 24 49 02 movabs $0x249249249,%rdx
1e: 00 00 00
21: 48 0f a3 ca bt %rcx,%rdx
25: 49 0f 43 c3 cmovae %r11,%rax
29: ff 50 08 call *0x8(%rax)
I was conscious of the fact that a CMOV effectively treats both outcomes as equally likely (it is not predicted), which is not what we want for CFI (we would ideally predict the CMOV to be not taken). But given that the target address is likely unpredictable anyway due to the dependent function pointer load, I hoped that existing feedback from the indirect branch predictor would be sufficient to predict the indirect branch.
Unfortunately, this did not yield an improvement either, according to the internal Google benchmark. The following results were obtained, showing a regression on all tested microarchitectures:
          Skylake  Rome
Baseline  2.64%    2.14%
CMOV      3.07%    2.83%
Experiment 3: conditional branch to self
Consider the following instruction:
1:
ja 1b
We call this self-branch to jcc (jcc being the general term for a conditional branch on x86). This instruction is a no-op if the condition is not met, and spins in an infinite loop if the condition is met. This is almost what we want, except that we need it to trap rather than loop forever. We can make it trap with runtime support: in a periodic interrupt handler, we check whether the interrupted program is at such an instruction with its condition met, and trap if so. On Linux, we can register a SIGPROF handler and implement the check there.
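A minimal sketch of that runtime support, assuming Linux/x86_64 with glibc, might look as follows; the function names, the 10 ms period, and the restriction to the JA form of the branch are illustrative choices of mine, not details of the prototype:
// Sketch only: a real implementation would need to recognize every Jcc
// opcode/condition code that the compiler emits, not just "ja .".
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <ucontext.h>

#define FLAG_CF 0x0001ULL /* carry flag */
#define FLAG_ZF 0x0040ULL /* zero flag */

static void prof_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig;
  (void)info;
  ucontext_t *uc = ctx;
  const uint8_t *rip =
      (const uint8_t *)(uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
  uint64_t eflags = (uint64_t)uc->uc_mcontext.gregs[REG_EFL];

  /* "ja ." (a JA branching to itself) encodes as 77 FE: Jcc rel8, rel8 = -2. */
  int at_self_ja = rip[0] == 0x77 && rip[1] == 0xFE;
  /* JA is taken when CF=0 and ZF=0, i.e. when the check has failed. */
  int cond_met = (eflags & (FLAG_CF | FLAG_ZF)) == 0;

  if (at_self_ja && cond_met)
    abort(); /* turn the spin loop into a trap */
  /* Otherwise the interrupt hit ordinary code; just resume it. */
}

static void install_self_branch_trap_support(void) {
  struct sigaction sa = {0};
  sa.sa_sigaction = prof_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGPROF, &sa, NULL);

  /* Deliver SIGPROF every 10 ms of CPU time (period chosen arbitrarily). */
  struct itimerval it = {{0, 10000}, {0, 10000}};
  setitimer(ITIMER_PROF, &it, NULL);
}
With a scheme like this, a failed check spins at the branch until the next SIGPROF arrives, so the trap is delayed by up to one timer period.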
Many recent microarchitectures also implement macro-op fusion, whereby a compare followed by a conditional branch is decoded to a single micro-op. I hypothesized that the above instruction, when appearing after a compare, would additionally trigger a speculative decode of the ja instruction on its own (as the branch target), which may increase micro-op cache pressure, and that if we instead made the conditional branch target the cmp instruction, as below, the pair would be decoded as a single micro-op that branches to itself, without requiring a second decode. We call this self-branch to cmp:
1:
cmp $0x1234, %rax
ja 1b
The self-branch experiment resulted in a performance improvement on AMD and older Intel microarchitectures, and a regression on newer Intel microarchitectures. The following results were obtained using the internal Google benchmark:
                    Skylake  Rome     Milan
Baseline            2.607%   2.123%   1.946%
Self-branch to jcc  2.110%   1.994%   1.801%
And the following results were obtained using Fleetbench BM_PROTO_Arena (the low baseline measurements on AMD are anomalies caused by different inlining decisions with CFI enabled; the deltas between the baseline and the experiments are what matter here):
                    Haswell  Skylake  Rome     Milan    GraniteRapids
Baseline            4.10%    3.84%    0.18%    0.42%    0.78%
Self-branch to jcc  2.89%    3.09%    -0.70%   0.17%    1.26%
Self-branch to cmp  2.89%    3.19%    -0.68%   0.19%    1.30%
Because these results show that self-branch to cmp was either performance neutral or a slight regression compared to self-branch to jcc, I do not consider self-branch to cmp any further.
I propose to implement self-branch to jcc for CFI and UBSan conditional traps as an opt-in feature. Users targeting predominantly AMD and/or older Intel microarchitectures who can meet the runtime requirements will benefit from enabling this feature.