RFC: Optimizing conditional traps

With the increasing deployment of runtime safety checks in native binaries, such as bounds checks (e.g. -fsanitize=bounds, LLVM control flow integrity, Rust array/slice bounds) and overflow checks (e.g. -fsanitize=signed-integer-overflow, Swift arithmetic), the runtime cost of these checks is increasingly important. Of course, the most efficient check is one that isn’t executed at all because the compiler has optimized it away, but when we do emit a check, we should try to make it as efficient as possible. This document focuses on the x86_64 architecture.

Architecturally, we need three instructions to implement a check: a compare instruction (or some other instruction that sets flags), a conditional branch, and an instruction at the branch target that prevents further execution of the program. We focus on the branch itself and the branch target. The status quo is a conditional branch to a UD1 instruction placed at the end of the function, and in this RFC I explore a number of alternatives to this. Optimizing a check is highly microarchitecture-dependent, so the performance impact is measured on a variety of microarchitectures.
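
Schematically, the status quo sequence for a single check looks like this (register, immediate, and label names are illustrative):

cmp $0x1234, %rax    # flag-setting instruction
ja .Ltrap            # conditional branch, taken if the check fails
# ... rest of the function ...
.Ltrap:
ud1 (%rax), %eax     # trapping (#UD) instruction placed at the end of the function; operands shown for illustration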

The following experiments were carried out by measuring the performance impact of LLVM CFI on a large realistic internal Google benchmark. Additional experiments were carried out using the Fleetbench benchmark suite; the benchmark BM_PROTO_Arena was found to be directionally representative of the internal Google benchmark, at least as far as CFI performance impact is concerned. All experiments were controlled for link order measurement bias.

At the risk of burying the lede, I cover the experiments in the order that I performed them, including the negative results. Experiment 3 was the only experiment to show an improvement, so you can skip to that if you like.

Experiment 1: conditional branch to UDB

With the status quo, the UD1 at the end of the function can be, and frequently is, far enough away to require a long (6-byte) conditional branch rather than a short (2-byte) one. Long branches increase code size, which can harm performance (for example, through instruction cache pressure).
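
For reference, the two encodings of ja (the same applies to the other conditional branches):

# Short form, usable only when .Ltrap is within -128..+127 bytes of the next instruction: 77 <rel8> (2 bytes)
ja .Ltrap
# Near form, required when the trap is further away: 0f 87 <rel32> (6 bytes)
ja .Ltrap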

Recently, the one-byte instruction 0xD6 (UDB) was officially designated as an instruction that is guaranteed to be undefined (#UD). This creates a number of interesting possibilities for implementing a trap as a short conditional branch, as long as we can find some way to place a 0xD6 byte near the branch. Let’s assume that our compare instruction sets CF=0 and ZF=0 if the check fails, so the straightforward implementation uses a ja instruction that branches to the trap. I evaluated the following options for placing the 0xD6 byte:

Option 1: conditional branch into middle of NOP

ja 1f+2
1: .byte 0x0f, 0x1f, 0xd6 # this is a 3-byte NOP ending in 0xd6

Option 2: conditional branch past unconditional branch past trap

ja 1f
jmp 2f
1: udb
2:

Option 3: inverted conditional branch past trap

jbe 1f
udb
1:

In the prototype implementation, I emitted one of these forms only when the branch target was far enough away to require a long branch; otherwise the status quo was kept. This allows us to compare the microarchitectural cost of a long branch to the microarchitectural cost of one of the embedded UDB instruction sequences.

I also considered the option of a conditional branch to a pre-existing 0xD6 byte as part of the encoding of another instruction, but these bytes don’t seem to be common enough to make this worthwhile.
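
For illustration, here is a hypothetical example of such a byte (the label and surrounding code are made up), using the immediate operand of a cmp against 0xd6:

ja .Lhas_d6+1       # branch into the middle of the cmp below, landing on its 0xd6 immediate byte
# ...
.Lhas_d6:
cmp $0xd6, %al      # encodes as 3c d6, so the byte at .Lhas_d6+1 is 0xd6 (UDB)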

Unfortunately, none of these options resulted in a performance improvement relative to the baseline, according to the internal Google benchmark. The following results were obtained (each figure is the overhead of the CFI-enabled build relative to a build without CFI), showing a regression on all tested microarchitectures:

            Skylake     Rome    Milan
Baseline    2.607%    2.123%   1.946%
Option 1    2.752%    2.1455%  2.046%
Option 2    3.134%    2.5065%  2.192%
Option 3    2.8875%   2.471%   2.487%

Experiment 2: replace conditional branches with CMOV

This experiment evaluated a CFI-specific optimization. With CFI, we need to perform some checks on the target address before branching to it. We can avoid the conditional branches entirely by replacing each of them with a CMOV of 0 (or some other harmless value) into the register containing the target address, or into a register from which the target address is loaded, so that a failed check turns the subsequent indirect branch into a fault. For example, a virtual call:

   0:	4d 31 db             	xor    %r11,%r11        # %r11 = 0, the value CMOVed in when a check fails
   3:	b9 e0 55 d8 04       	mov    $0x4d855e0,%ecx
   8:	48 29 c1             	sub    %rax,%rcx        # %rax is the vtable pointer the call target is loaded from
   b:	48 c1 c9 05          	ror    $0x5,%rcx
   f:	48 83 f9 21          	cmp    $0x21,%rcx       # combined range and alignment check on the vtable pointer
  13:	49 0f 47 c3          	cmova  %r11,%rax        # check failed: zero %rax (replaces a conditional branch to a trap)
  17:	48 ba 49 92 24 49 02 	movabs $0x249249249,%rdx
  1e:	00 00 00 
  21:	48 0f a3 ca          	bt     %rcx,%rdx        # bit vector test
  25:	49 0f 43 c3          	cmovae %r11,%rax        # bit not set: zero %rax (replaces a conditional branch to a trap)
  29:	ff 50 08             	call   *0x8(%rax)       # the load through %rax faults if it was zeroed

I was conscious of the fact that CMOV is predicted as both paths being equally likely, which is not what we want for CFI (we would ideally predict the CMOV to be not taken). But given that the target address is likely unpredictable anyway due to a dependent function pointer load, existing feedback from the indirect branch predictor might have been sufficient to predict the indirect branch.

Unfortunately, this did not yield an improvement either, according to the internal Google benchmark. The following results were obtained, showing a regression on all tested microarchitectures:

           Skylake     Rome
Baseline    2.64%     2.14%
CMOV        3.07%     2.83%

Experiment 3: conditional branch to self

Consider the following instruction:

1:
ja 1b

We call this self-branch to jcc (jcc being the general term for conditional branches on x86). This instruction is a no-op if the condition is not met, and spins in an infinite loop if it is met. That is almost what we want, except that we need it to trap instead of looping forever. We can make it trap with runtime support: in a periodic interrupt handler, we check whether the interrupted instruction pointer is at such an instruction (for example, by checking for a two-byte jcc whose displacement byte is -2, i.e. a branch to itself) and whether the saved flags satisfy its condition, and trap if so. On Linux, we can register a SIGPROF handler and implement the check there.

Many recent microarchitectures also implement macro-op fusion, whereby a compare followed by a conditional branch is decoded into a single micro-op. I hypothesized that the above instruction, when appearing after a compare, would trigger a speculative decode of the ja instruction on its own (the branch target is the ja, not the fused cmp+ja pair), which may increase micro-op cache pressure. If we instead make the conditional branch go to the cmp instruction, as below, the pair is decoded as a single micro-op that branches to itself, without requiring a second decode; we call this self-branch to cmp:

1:
cmp $0x1234, %rax
ja 1b

The self-branch experiment resulted in a performance improvement on AMD and older Intel microarchitectures, and a regression on newer Intel microarchitectures. The following results were obtained using the internal Google benchmark:

                   Skylake     Rome    Milan
Baseline            2.607%    2.123%   1.946%
Self-branch to jcc  2.110%    1.994%   1.801%

And the following results using Fleetbench BM_PROTO_Arena (the low baseline measurements on AMD are anomalies caused by different inlining decisions with CFI enabled; the delta between the baseline and each experiment is what is significant here):

                     Haswell Skylake  Rome    Milan  GraniteRapids
Baseline              4.10%   3.84%   0.18%   0.42%   0.78%
Self-branch to jcc    2.89%   3.09%  -0.70%   0.17%   1.26%
Self-branch to cmp    2.89%   3.19%  -0.68%   0.19%   1.30%

Because these results show that self-branch to cmp was either performance neutral or a slight regression compared to self-branch to jcc, I do not consider self-branch to cmp any further.

I propose to implement self-branch to jcc for CFI and UBSan conditional traps as an opt-in feature. Users targeting predominantly AMD and/or older Intel microarchitectures who can meet the runtime support requirement (the periodic interrupt handler described above) will benefit from enabling this feature.

> AFAIK, cmov is not predicted at all? Mostly, it should be treated as an arithmetic operation that requires all three inputs to be available.
>
> Did you consider branchless cmov to trap? In contrast to your experiment 2, nothing depends on the result of the cmov, so often this should just be hidden by out-of-order execution.
>
> xor edx, edx             // rdx = 0
> cmp rax, 0x1234
> cmovbe rdx, rsp          // check passed: rdx = rsp, a valid address
> mov dl, byte ptr [rdx]   // NB: value is never used! faults (loads from address 0) if the check failed

That matches my understanding. I was using the word “predicted” a bit loosely.

That’s an interesting option! I think it’s worth trying. It’s more code size than the other options but maybe the other aspects compensate for that.