This is a design document about processing the DW_CFA_AARCH64_negate_ra_state DWARF instruction in BOLT.
DW_CFA_AARCH64_negate_ra_state is also referred to as .cfi_negate_ra_state in assembly, or OpNegateRAState in BOLT sources. In this document, I will use negate-ra-state as a shorthand.
Problem
Currently BOLT does not support AArch64 binaries with pointer authentication, because it cannot correctly handle negate-ra-state.
This has been raised in several issues:
- [BOLT] instrumentation fails on aarch64 due to an unsupported CFI opcode OpNegateRAState #74833
- [BOLT] perf2bolt fails on aarch64 due to an unsupported CFI opcode #80992
Introduction
DW_CFA_AARCH64_negate_ra_state
The negate-ra-state CFI is a vendor-specific Call Frame Instruction defined in the Arm ABI.
The DW_CFA_AARCH64_negate_ra_state operation negates bit[0] of the RA_SIGN_STATE pseudo-register.
This bit indicates to the unwinder whether the current return address is signed or not (hence the name). The unwinder uses this information to strip the Pointer Authentication Code (PAC) bits from pointers (by authenticating the pointer before using it).
There are no DWARF instructions to directly set or clear the RA State. However, two other CFIs can also affect the RA state:
- DW_CFA_remember_state: this CFI stores register rules onto an implicit stack.
- DW_CFA_restore_state: this CFI pops rules from this stack.
Example:
| CFI | Effect on RA state |
|---|---|
| (default) | 0 |
| DW_CFA_AARCH64_negate_ra_state | 0 → 1 |
| DW_CFA_remember_state | 1 pushed to the stack |
| DW_CFA_AARCH64_negate_ra_state | 1 → 0 |
| DW_CFA_restore_state | 0 → 1 (popped from the stack) |
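The table above can be reproduced with a small toy model. This is only a sketch of the rules described here, written in Python for brevity; it is not BOLT or unwinder code. The function name `apply_cfis` and the short CFI strings are made up for illustration, and the model tracks only the RA state, whereas a real remember-state saves all register rules.

```python
# Toy model of how the three CFIs affect the RA state (hypothetical helper).
def apply_cfis(cfis):
    """Return the RA state after each CFI, starting from the default state 0."""
    ra_state = 0   # default: the return address is not signed
    saved = []     # implicit stack used by remember-state / restore-state
    trace = []
    for cfi in cfis:
        if cfi == "negate_ra_state":
            ra_state ^= 1            # negate bit[0] of RA_SIGN_STATE
        elif cfi == "remember_state":
            saved.append(ra_state)   # push the current rule
        elif cfi == "restore_state":
            ra_state = saved.pop()   # pop the saved rule
        else:
            raise ValueError(f"unexpected CFI: {cfi}")
        trace.append(ra_state)
    return trace


trace = apply_cfis([
    "negate_ra_state",   # 0 -> 1
    "remember_state",    # 1 pushed to the stack
    "negate_ra_state",   # 1 -> 0
    "restore_state",     # 0 -> 1 (popped from the stack)
])
print(trace)
```

The last entry shows the point of the example: restore-state changes the RA state without any negate-ra-state being involved.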
Where are these CFIs needed?
To understand why we need non-trivial processing in BOLT for this, we need to look at where exactly negate-ra-state CFIs are generated by compilers.
Case 1: explicitly signing or authenticating instructions
Instructions that sign or authenticate the link register (LR, x30), such as paciasp and autiasp, require a negate-ra-state CFI referring to them. Supporting this case in BOLT is trivial.
Signing happens when the LR is stored to memory (the stack), and authenticating happens when the return address is loaded back to the LR.
Case 2: Two consecutive instructions with different RA state (no explicit signing or authenticating)
In the other case, two consecutive instructions have different RA states, but neither of them signs or authenticates. This means that they are adjacent in the layout but not in control flow: one is part of an execution path with a signed RA, the other is part of a path with an unsigned RA.
These locations have to be on the edges of BasicBlocks.
The unwinder does not follow the control flow graph. It reads unwind information in the layout order.
In the examples below, arrows denote control-flow, and adjacency between blocks indicates layout order.
Before reordering, the function needs four negate-ra-state CFIs in total:
- two for paciasp and autiasp instructions respectively (blocks 2 and 6).
- two for the state change between the unsigned (block 4) and the two signed blocks around it (blocks 3 and 5).
During optimizations, the CFG is reordered in such a way that it only needs two negate-ra-state CFIs. As the reordered layout is 1-4-7-2-3-5-6, block 4 has no signed neighbours, needing no negate-ra-state CFIs.
This example illustrates why the negate-ra-state CFIs on the borders between BasicBlocks need to be removed and regenerated by BOLT: both their locations and the amount generated can change during optimizations.
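The counts in this example can be checked with a block-level toy model. This is a sketch in Python, not BOLT code. The per-block states follow the description above; the states of blocks 1 and 7 are not stated in the text, so treating them as unsigned is my assumption. The names `BLOCKS` and `count_negate_ra_state` are hypothetical.

```python
# (entry state, exit state) per block: 0 = unsigned, 1 = signed.
BLOCKS = {
    1: (0, 0),  # assumed unsigned
    2: (0, 1),  # contains paciasp
    3: (1, 1),
    4: (0, 0),
    5: (1, 1),
    6: (1, 0),  # contains autiasp
    7: (0, 0),  # assumed unsigned
}


def count_negate_ra_state(layout, blocks=BLOCKS):
    """Count the negate-ra-state CFIs a given layout order needs."""
    count = 0
    state = 0  # the unwinder starts from the default (unsigned) state
    for block_id in layout:
        entry, exit_state = blocks[block_id]
        if entry != state:
            count += 1  # CFI on the border between two blocks
        if entry != exit_state:
            count += 1  # CFI belonging to the pac*/aut* inside the block
        state = exit_state
    return count


original = count_negate_ra_state([1, 2, 3, 4, 5, 6, 7])
reordered = count_negate_ra_state([1, 4, 7, 2, 3, 5, 6])
print(original, reordered)
```

The model walks the blocks in layout order only, the same way the unwinder reads the unwind information, which is why the count depends on the layout and not on the CFG edges.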
Why correct CFI placement matters
The unwinder reads the DWARF CFIs attached to the code in several scenarios, including:
- unwinding from C++ exceptions,
- generating stack traces.
If the RA state is incorrectly parsed as unsigned when it should be signed, the unwinder does not authenticate the pointer, leaving the PAC bits in it. The unwinder then uses an invalid address, which typically causes a segmentation fault.
Solution
Original approach (abandoned): Track RA state at BasicBlock level
As mentioned earlier, locations where negate-ra-state CFIs are needed depend on the Control Flow Graph (CFG). In this approach I assigned each BasicBlock an RA state using the CFG, and later iterated on the BasicBlocks in layout order to find BasicBlocks with different RA state.
The problem with this approach is that BOLT lacks information on noreturn functions and therefore cannot determine whether a function call will return or not.
As a result, BOLT generates an incorrect CFG: both the edges between BasicBlocks and the length of BasicBlocks can be different to what the compiler “intended” to emit.
This issue is discussed in depth at #115154. The developed solution required the user to manually input the names of noreturn functions (see PR #117578). Since this approach relies on manual steps, and there is no clear path to solving the noreturn issue, I chose to abandon the PR.
Current approach: Track RA state at the Instruction level
Note: ideas described here are implemented in PR #120064.
Instead of assigning an RA state to each BasicBlock, this approach assigns an RA state to each instruction. We can track state on each instruction during reordering, and emit negate-ra-state CFIs after optimizations.
This approach introduces two new passes:
- MarkRAStatesPass: assigns the RA state to each instruction based on the CFIs in the input binary
- InsertNegateRAStatePass: reads those assigned instruction RA states after optimizations, and emits DW_CFA_AARCH64_negate_ra_state at the correct places: on paciasp/autiasp instructions, and wherever there is a state change between two consecutive blocks in the layout order
To track metadata on individual instructions, the MCAnnotation class was extended.
Saving annotations at CFI reading
CFIs are read and added to BinaryFunctions in CFIReaderWriter::FillCFIInfoFor. At this point, we add MCAnnotations about negate-ra-state, remember-state and restore-state CFIs to the instructions they refer to. This avoids interfering with the CFI processing that already happens in BOLT (e.g. remember-state and restore-state CFIs are removed in normalizeCFIState for reasons unrelated to PAC).
MarkRAStates Pass
This pass runs before optimizations reorder anything.
It processes MCAnnotations generated during the CFI reading stage to check if instructions have any of the three CFIs that can modify RA state:
- negate-ra-state
- remember-state
- restore-state
Then it adds new MCAnnotations to each instruction, indicating their RA state. Those annotations are:
- Signing (the pac* instruction)
- Signed
- Authenticating (the aut* instruction)
- Unsigned
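The annotation step can be sketched as a single walk over the instructions in their original order. This is a simplified Python model and not the code from the PR: `Inst` and `mark_ra_states` are hypothetical names, and the convention that a CFI attached to an instruction takes effect from the next instruction is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    opcode: str
    cfis: list = field(default_factory=list)  # CFIs recorded at CFI reading
    ra_state: str = ""                        # filled in by mark_ra_states


def mark_ra_states(insts):
    """Assign Signing/Signed/Authenticating/Unsigned to every instruction."""
    signed = False  # function entry: return address is unsigned
    saved = []      # stack for remember-state / restore-state
    for inst in insts:
        if inst.opcode.startswith("pac"):
            inst.ra_state = "Signing"
        elif inst.opcode.startswith("aut"):
            inst.ra_state = "Authenticating"
        else:
            inst.ra_state = "Signed" if signed else "Unsigned"
        # Assumption: attached CFIs only affect the instructions that follow.
        for cfi in inst.cfis:
            if cfi == "negate-ra-state":
                signed = not signed
            elif cfi == "remember-state":
                saved.append(signed)
            elif cfi == "restore-state":
                signed = saved.pop()
    return insts


marked = mark_ra_states([
    Inst("paciasp", ["negate-ra-state"]),
    Inst("stp"),
    Inst("cbz", ["remember-state"]),
    Inst("ldp"),
    Inst("autiasp", ["negate-ra-state"]),
    Inst("ret", ["restore-state"]),
    Inst("bl"),  # reached only via the cbz branch, still in signed state
])
print([inst.ra_state for inst in marked])
```

Note how the final `bl` ends up Signed only because of the restore-state CFI, which is why all three CFIs have to be tracked.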
Error handling in MarkRAStates pass:
Whenever the MarkRAStates pass finds inconsistencies in the current BinaryFunction, it marks the function as ignored by calling BF.setIgnored(). This prevents BOLT from optimizing that function, but it will still be emitted as part of the original section (.bolt.org.text) in its original form.
The inconsistencies are as follows:
- finding a pac* instruction when already in signed state
- finding an aut* instruction when already in unsigned state
- finding pac* and aut* instructions without .cfi_negate_ra_state. This indicates poorly written assembly, possibly handwritten, which we cannot safely optimize.
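The three checks can be sketched as follows. This is a toy Python model, not the pass itself: `find_inconsistency` is a hypothetical name, instructions are plain (opcode, CFI list) tuples, and the reason strings are made up. In BOLT, a non-empty result would correspond to the point where the function is marked as ignored.

```python
def find_inconsistency(insts):
    """insts: list of (opcode, cfis) tuples in original order.

    Returns a short reason string for the first inconsistency found,
    or None if the function looks consistent.
    """
    signed = False
    saved = []
    for opcode, cfis in insts:
        is_pac = opcode.startswith("pac")
        is_aut = opcode.startswith("aut")
        if is_pac and signed:
            return "pac* in signed state"
        if is_aut and not signed:
            return "aut* in unsigned state"
        if (is_pac or is_aut) and "negate-ra-state" not in cfis:
            return "pac*/aut* without .cfi_negate_ra_state"
        # Update the tracked state for the following instructions.
        for cfi in cfis:
            if cfi == "negate-ra-state":
                signed = not signed
            elif cfi == "remember-state":
                saved.append(signed)
            elif cfi == "restore-state":
                signed = saved.pop()
    return None


good = [("paciasp", ["negate-ra-state"]), ("stp", []),
        ("autiasp", ["negate-ra-state"]), ("ret", [])]
double_sign = [("paciasp", ["negate-ra-state"]),
               ("paciasp", ["negate-ra-state"])]
print(find_inconsistency(good), find_inconsistency(double_sign))
```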
With these checks and the fallback setIgnored() behaviour, the likelihood of taking a correct function, incorrectly annotating RA states, and creating an incorrect function in the output binary is low. If the original code was indeed incorrect, the resulting binary will not work properly (as expected). But if the input was actually correct, and MarkRAStates failed to annotate it properly, the original function in .bolt.org.text can still be called, and the emitted executable will operate properly.
Users will be informed about the number of functions ignored in the pass, or the exact functions ignored when choosing verbose output.
InsertNegateRAStatePass
This pass runs after the optimizations are done. In essence, it does the inverse of the MarkRAStates pass:
- it reads the RA states attached to instructions, and
- whenever the state changes, it adds a PseudoInstruction that holds an OpNegateRAState CFI.
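A minimal sketch of this inverse step, again as a Python toy model rather than the code from the PR. The names `unwinder_state` and `insert_negate_ra_state` are hypothetical. The sketch relies on one simplifying assumption: at the address of a pac* instruction the return address is not signed yet, and at the address of an aut* instruction it still is, so both the pac*/aut* case and the block-border case reduce to "the state differs from the previous instruction in layout order".

```python
def unwinder_state(ra_state):
    """State the unwinder must see at this instruction's address."""
    return ra_state in ("Signed", "Authenticating")


def insert_negate_ra_state(layout):
    """layout: list of (opcode, ra_state) tuples in final layout order.

    Returns the opcodes with '.cfi_negate_ra_state' markers inserted
    wherever the state changes between two consecutive instructions.
    """
    out = []
    current = False  # default state at function entry: unsigned
    for opcode, ra_state in layout:
        wanted = unwinder_state(ra_state)
        if wanted != current:
            out.append(".cfi_negate_ra_state")
            current = wanted
        out.append(opcode)
    return out


emitted = insert_negate_ra_state([
    ("paciasp", "Signing"),
    ("stp", "Signed"),
    ("ldp", "Signed"),
    ("autiasp", "Authenticating"),
    ("ret", "Unsigned"),
    ("bl", "Signed"),  # block from a signed path, laid out after the ret
])
print(emitted)
```

In this example the CFI before `bl` takes over the role that a remember-state/restore-state pair played in the input.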
Covering newly generated instructions:
Some BOLT passes can add new Instructions. In InsertNegateRAStatePass, we have to know what RA state these have.
The current solution has the fixUnknownStates function to cover these, using a fairly simple strategy: unknown states are inherited from the last known state. Testing so far has shown this implementation is sufficient, but to prove correctness, we would need to examine all passes that insert new instructions.
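The inheritance strategy can be sketched like this. It is a Python toy model and not the actual fixUnknownStates code: the function name, the use of `None` for an unknown state, the `entry_state` parameter, and the rule that an instruction following a pac*/aut* inherits Signed/Unsigned (rather than Signing/Authenticating) are all assumptions made for illustration.

```python
# State an inserted instruction inherits from the last known one (assumption).
INHERITED = {
    "Signing": "Signed",
    "Signed": "Signed",
    "Authenticating": "Unsigned",
    "Unsigned": "Unsigned",
}


def fix_unknown_states(states, entry_state="Unsigned"):
    """states: RA states in layout order, None for instructions added by BOLT."""
    fixed = []
    last_known = entry_state
    for state in states:
        if state is None:
            state = INHERITED[last_known]  # inherit from the last known state
        fixed.append(state)
        last_known = state
    return fixed


fixed = fix_unknown_states(
    ["Signing", None, "Signed", None, "Authenticating", None, "Unsigned"]
)
print(fixed)
```

The sketch also shows the weak spot mentioned above: an inserted instruction placed at the start of a block gets the state of whatever precedes it in the list, which is only right if the inserting pass keeps it on the same path.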
Feedback on this is especially welcome!
This is the tradeoff we make when tracking RA states on instructions instead of BasicBlocks: if we could rely on the CFG, this step would not be necessary.
Test Plan:
The PR contains several lit tests:
- negate-ra-state.s: checks if BOLT is able to generate negate-ra-state CFIs at the same locations where the input binary had them.
- negate-ra-state-incorrect.s: checks how MarkRAStates deals with inconsistent input functions.
- negate-ra-state.cpp: processes a C++ binary with exception handling, and checks that it runs correctly after BOLTing.
- negate-ra-state-flags.cpp: checks that binaries with negate-ra-state need the --allow-experimental-pacret flag.
- pacret-instrument-flags.s: ensures that BOLT bails out when trying to instrument binaries with pacret (I will add support for that in a separate PR).
More lit tests will be added to cover the function splitting and BasicBlock reordering cases.
This approach has also been tested on unit-test binaries from the Chromium project. All scenarios and edge cases seen there are supported, but I think we might need to refine parts of the patch once it has been used “in the wild”.
We intend to add binaries to the out-of-tree repo as well: rafaelauler/bolt-tests
Caveats of putting the feature behind a flag
As the patch is new, relatively big, and could have major consequences if implemented incorrectly, @paschalis.mpeis advised putting the feature behind a flag, so users are aware that they are relying on new/experimental features.
The flag is also useful to disallow modes where the patch does not work yet.
The ideal experience would be the following:
- the user tries to BOLT something that was compiled with -mbranch-protection=pac-ret or =standard,
- BOLT warns them to either pass the --allow-experimental-pacret flag, or turn off branch protection when compiling their input binary,
- user chooses and BOLT produces an optimized binary according to the choice.
As the OpNegateRAState CFIs are read from the binary in CFIReaderWriter::FillCFIInfoFor(BF), that is the earliest point where we can check the flag, and warn users. We can also drop the OpNegateRAState from the input (as its location will become invalid).
This is not possible on all systems however.
During linking, the linker inserts several functions into the binary from the C runtime. These come from platform-specific object files, e.g. from /usr/lib/gcc/aarch64-linux-gnu/13/crtbegin.o.
Functions from such objects may have been already compiled with -mbranch-protection enabled. One example we have seen is the function __do_global_dtors_aux. This affected 39 tests when running on aarch64 in the LLVM CI. See this comment.
Workaround
Instead of dropping OpNegateRAState CFIs from the input binary, we simply read them early on, and check for the flag later, after several functions, like unwindCFIState, have operated on the CFIs. This way the check happens in the same location (as in at the same point in BOLT sources) where BOLT would previously fail because of the unsupported CFI opcode.
This choice might seem to let some input binaries “slip through” the check, but if any BinaryFunction in the input has remember-state and restore-state CFIs, BOLT will run the normalizeCFIState function to remove them, and update other CFIs accordingly. This will call unwindCFIState, and check for the flag. This means that the bigger the input binary, the less likely it is that the flag check is skipped. For any reasonable binary, the chance is virtually zero. The location of the check is relevant for unit tests however.
