[RFC][BOLT][AArch64] Handle OpNegateRAState to enable optimizing binaries with pac-ret hardening

bgergely0 · May 29, 2025, 8:15am

This is a design document about processing the DW_CFA_AARCH64_negate_ra_state DWARF instruction in BOLT.

DW_CFA_AARCH64_negate_ra_state is also referred to as .cfi_negate_ra_state in assembly, or OpNegateRAState in BOLT sources. In this document, I will use negate-ra-state as a shorthand.

Problem

Currently BOLT does not support AArch64 binaries with pointer authentication, because it cannot correctly handle negate-ra-state.

This has been raised in several issues:

[BOLT] instrumentation fails on aarch64 due to an unsupported CFI opcode OpNegateRAState #74833
[BOLT] perf2bolt fails on aarch64 due to an unsupported CFI opcode #80992

Introduction

DW_CFA_AARCH64_negate_ra_state

The negate-ra-state CFI is a vendor-specific Call Frame Instruction defined in the Arm ABI.

The DW_CFA_AARCH64_negate_ra_state operation negates bit[0] of the RA_SIGN_STATE pseudo-register.

This bit indicates to the unwinder whether the current return address is signed or not (hence the name). The unwinder uses this information to strip the Pointer Authentication Code (PAC) bits from pointers (by authenticating the pointer before using it).

There are no DWARF instructions to directly set or clear the RA State. However, two other CFIs can also affect the RA state:

DW_CFA_remember_state: this CFI stores register rules onto an implicit stack.
DW_CFA_restore_state: this CFI pops rules from this stack.

Example:

CFI	Effect on RA state
(default)	0
DW_CFA_AARCH64_negate_ra_state	0 → 1
DW_CFA_remember_state	1 pushed to the stack
DW_CFA_AARCH64_negate_ra_state	1 → 0
DW_CFA_restore_state	0 → 1 (popped from the stack)
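The table above can be reproduced with a small state machine. This is only a sketch: the function name and the short CFI names are made up for illustration, and it models just RA_SIGN_STATE, whereas a real unwinder saves and restores all register rules on remember/restore.

```python
def ra_states(cfis):
    """Return the RA state after each CFI, starting from the default state 0."""
    state = 0
    stack = []  # implicit stack used by remember_state / restore_state
    trace = []
    for cfi in cfis:
        if cfi == "negate_ra_state":
            state ^= 1  # flip bit[0] of RA_SIGN_STATE
        elif cfi == "remember_state":
            stack.append(state)
        elif cfi == "restore_state":
            state = stack.pop()
        else:
            raise ValueError(f"unexpected CFI: {cfi}")
        trace.append(state)
    return trace


example = ["negate_ra_state", "remember_state", "negate_ra_state", "restore_state"]
print(ra_states(example))  # [1, 1, 0, 1]
```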

Where are these CFIs needed?

To understand why we need non-trivial processing in BOLT for this, we need to look at where exactly negate-ra-state CFIs are generated by compilers.

Case 1: explicitly signing or authenticating instructions

Instructions that sign or authenticate the link register (LR, x30), such as paciasp and autiasp, require a negate-ra-state CFI referring to them. Supporting this case in BOLT is trivial.

Signing happens when the LR is stored to memory (the stack), and authenticating happens when the return address is loaded back to the LR.

Case 2: Two consecutive instructions with different RA state (no explicit signing or authenticating)

In the other case, two consecutive instructions have different RA states, but neither of them signs or authenticates. This means they are adjacent in the layout but not in control flow: one is part of an execution path with a signed RA, the other is part of a path with an unsigned RA.

These locations have to be on the edges of BasicBlocks.

The unwinder does not follow the control flow graph. It reads unwind information in the layout order.

In the examples below, arrows denote control-flow, and adjacency between blocks indicates layout order.

Before reordering, the function needs four negate-ra-state CFIs in total:

two for the paciasp and autiasp instructions respectively (blocks 2 and 6).
two for the state change between the unsigned (block 4) and the two signed blocks around it (blocks 3 and 5).

During optimizations, the CFG is reordered in such a way that it only needs two negate-ra-state CFIs. As the reordered layout is 1-4-7-2-3-5-6, block 4 has no signed neighbours, needing no negate-ra-state CFIs.

This example illustrates why the negate-ra-state CFIs on the borders between BasicBlocks need to be removed and regenerated by BOLT: both their locations and the amount generated can change during optimizations.
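The counts in this example can be checked with a short sketch. The per-block entry and exit states below are my reading of the description (blocks 1, 4 and 7 unsigned, blocks 3 and 5 signed, paciasp in block 2, autiasp in block 6), so treat them as assumptions rather than data taken from a real binary.

```python
U, S = "unsigned", "signed"

# (RA state at block entry, RA state at block exit), assumed from the description.
BLOCKS = {
    1: (U, U),
    2: (U, S),  # contains paciasp
    3: (S, S),
    4: (U, U),
    5: (S, S),
    6: (S, U),  # contains autiasp
    7: (U, U),
}


def count_negate_ra_state(layout):
    """Count the negate-ra-state CFIs an unwinder needs for a given layout order."""
    count = 0
    current = U  # the RA is unsigned at function entry
    for block in layout:
        entry, exit_state = BLOCKS[block]
        if entry != current:
            count += 1  # CFI on the border between two blocks
        if exit_state != entry:
            count += 1  # CFI belonging to the pac*/aut* instruction in the block
        current = exit_state
    return count


print(count_negate_ra_state([1, 2, 3, 4, 5, 6, 7]))  # 4
print(count_negate_ra_state([1, 4, 7, 2, 3, 5, 6]))  # 2
```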

Why correct CFI placement matters

The unwinder reads the DWARF CFIs attached to the code in several scenarios, including:

unwinding from C++ exceptions,
generating stack traces.

If the RA state is incorrectly parsed as unsigned when it should be signed, the unwinder does not authenticate the pointer and leaves the PAC bits in it. The unwinder then accesses an incorrect location, causing a segmentation fault.

Solution

Original approach (abandoned): Track RA state at BasicBlock level

As mentioned earlier, locations where negate-ra-state CFIs are needed depend on the Control Flow Graph (CFG). In this approach I assigned each BasicBlock an RA state using the CFG, and later iterated on the BasicBlocks in layout order to find BasicBlocks with different RA state.

The problem with this approach is that BOLT lacks information on noreturn functions and therefore cannot determine whether a function call will return or not.

As a result, BOLT generates an incorrect CFG: both the edges between BasicBlocks and the length of BasicBlocks can be different to what the compiler “intended” to emit.

This issue is discussed in depth at #115154. The developed solution required the user to manually input the names of noreturn functions (see PR #117578). Since this approach relies on manual steps, and there is no clear path to solving the noreturn issue, I chose to abandon the PR.

Current approach: Track RA state at the Instruction level

Note: ideas described here are implemented in PR #120064.

Instead of assigning an RA state to each BasicBlock, this approach assigns an RA state to each instruction. We can track state on each instruction during reordering, and emit negate-ra-state CFIs after optimizations.

This approach introduces two new passes:

MarkRAStatesPass: assigns the RA state to each instruction based on the CFIs in the input binary
InsertNegateRAStatePass: reads those assigned instruction RA states after optimizations, and emits DW_CFA_AARCH64_negate_ra_state at the correct places: on paciasp/autiasp instructions, and wherever there is a state change between two consecutive blocks in the layout order

To track metadata on individual instructions, the MCAnnotation class was extended.

Saving annotations at CFI reading

CFIs are read and added to BinaryFunctions in CFIReaderWriter::FillCFIInfoFor. At this point, we add MCAnnotations about negate-ra-state, remember-state and restore-state CFIs to the instructions they refer to. This is to not interfere with the CFI processing that already happens in BOLT (e.g. remember-state and restore-state CFIs are removed in normalizeCFIState for reasons unrelated to PAC).

MarkRAStates Pass

This pass runs before optimizations reorder anything.

It processes MCAnnotations generated during the CFI reading stage to check if instructions have any of the three CFIs that can modify RA state:

negate-ra-state
remember-state
restore-state

Then it adds new MCAnnotations to each instruction, indicating their RA state. Those annotations are:

Signing (the pac* instruction)
Signed
Authenticating (the aut* instruction)
Unsigned

Error handling in the MarkRAStates pass:

Whenever the MarkRAStates pass finds inconsistencies in the current BinaryFunction, it marks the function as ignored by calling BF.setIgnored(). This prevents BOLT from optimizing that function, but it will still be emitted as part of the original section (.bolt.org.text) in its original form.

The inconsistencies are as follows:

finding a pac* instruction when already in signed state
finding an aut* instruction when already in unsigned state
finding pac* and aut* instructions without .cfi_negate_ra_state. This indicates poorly written assembly, possibly handwritten, which we cannot safely optimize.
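The annotation step and the three checks can be sketched together as follows. This is a simplified model, not the code from the PR: the Inst class, the short CFI names and the boolean return value (standing in for BF.setIgnored()) are hypothetical, and I assume a CFI annotation on a non-pac/aut instruction takes effect from that instruction onwards.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Inst:
    op: str
    cfis: Tuple[str, ...] = ()  # any of "negate", "remember", "restore"
    ra_state: Optional[str] = None


def mark_ra_states(insts):
    """Annotate every instruction; return False if the function should be ignored."""
    signed = False
    stack = []
    for inst in insts:
        has_negate = "negate" in inst.cfis
        if inst.op.startswith("pac"):
            if signed or not has_negate:  # already signed, or missing CFI
                return False
            inst.ra_state, signed = "Signing", True
        elif inst.op.startswith("aut"):
            if not signed or not has_negate:  # already unsigned, or missing CFI
                return False
            inst.ra_state, signed = "Authenticating", False
        else:
            for cfi in inst.cfis:  # applied in the order they are listed
                if cfi == "negate":
                    signed = not signed
                elif cfi == "remember":
                    stack.append(signed)
                elif cfi == "restore":
                    if not stack:
                        return False
                    signed = stack.pop()
            inst.ra_state = "Signed" if signed else "Unsigned"
    return True


func = [
    Inst("paciasp", ("negate",)), Inst("stp"),
    Inst("ldp", ("remember",)), Inst("autiasp", ("negate",)), Inst("ret"),
    Inst("bl", ("restore",)), Inst("autiasp", ("negate",)), Inst("ret"),
]
ok = mark_ra_states(func)
print(ok, [i.ra_state for i in func])
```

A function for which this returns False would be left untouched, as described above.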

With these checks and the fallback setIgnored() behaviour, the likelihood of taking a correct function, incorrectly annotating RA states, and creating an incorrect function in the output binary is low. If the original code was indeed incorrect, the resulting binary will not work properly (as expected). But if the input was actually correct and MarkRAStates failed to annotate it properly, the original function in .bolt.org.text can still be called, and the emitted executable will operate properly.

Users will be informed about the number of functions ignored by the pass, or the exact functions ignored when choosing verbose output.

InsertNegateRAStatePass

This pass runs after the optimizations are done. In essence, it does the inverse of the MarkRAStates pass:

it reads the RA states attached to instructions, and
whenever the state changes, it adds a PseudoInstruction that holds an OpNegateRAState CFI.
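A minimal sketch of this inverse step, under the same simplified model: instructions are (mnemonic, RA state) pairs in final layout order, and the output is a flat list with the CFI interleaved. The names are hypothetical, and placing the CFI after a pac*/aut* instruction follows the usual compiler-generated assembly rather than BOLT's internal representation.

```python
CFI = ".cfi_negate_ra_state"

# RA state before and after executing an instruction with the given annotation.
BEFORE_AFTER = {
    "Signing": (False, True),
    "Signed": (True, True),
    "Authenticating": (True, False),
    "Unsigned": (False, False),
}


def insert_negate_ra_state(insts):
    """insts: list of (mnemonic, ra_state) pairs in final layout order."""
    out = []
    signed = False  # the RA is unsigned at function entry
    for mnemonic, ra_state in insts:
        before, after = BEFORE_AFTER[ra_state]
        if before != signed:
            out.append(CFI)  # state change between two consecutive instructions
        out.append(mnemonic)
        if after != before:
            out.append(CFI)  # CFI referring to the pac*/aut* instruction itself
        signed = after
    return out


layout = [
    ("paciasp", "Signing"), ("bl", "Signed"),
    ("autiasp", "Authenticating"), ("ret", "Unsigned"),
    ("bl", "Signed"), ("autiasp", "Authenticating"), ("ret", "Unsigned"),
]
print(insert_negate_ra_state(layout))
```

Note that the second bl gets a plain negate-ra-state in front of it, so no remember-state/restore-state pair is needed in the output.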

Covering newly generated instructions:

Some BOLT passes can add new Instructions. In InsertNegateRAStatePass, we have to know what RA state these have.

The current solution has the fixUnknownStates function to cover these, using a fairly simple strategy: unknown states are inherited from the last known state. Testing so far has shown this implementation is sufficient, but to prove correctness, we would need to examine all passes that insert new instructions.
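The inheritance strategy could look roughly like the sketch below. It is not the fixUnknownStates code from the PR. Two details are assumptions of mine: an instruction following a Signing one inherits Signed (and Unsigned after Authenticating), and unknown instructions at the very start of a function default to Unsigned.

```python
def fix_unknown_states(states):
    """Replace None entries with the state inherited from the last known one."""
    fixed = []
    last = "Unsigned"  # assumed default at function entry
    for state in states:
        if state is None:
            state = last
        fixed.append(state)
        # After a pac*/aut* instruction, the following code is Signed/Unsigned.
        last = {"Signing": "Signed", "Authenticating": "Unsigned"}.get(state, state)
    return fixed


print(fix_unknown_states(["Signing", None, "Signed", "Authenticating", None]))
# ['Signing', 'Signed', 'Signed', 'Authenticating', 'Unsigned']
```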

Feedback on this is especially welcome!

This is the tradeoff we make when tracking RA states on instructions instead of BasicBlocks: if we could rely on the CFG, this step would not be necessary.

Test Plan:

The PR contains several lit tests:

negate-ra-state.s: checks that BOLT is able to generate negate-ra-state CFIs at the same locations where the input binary had them.
negate-ra-state-incorrect.s: checks how MarkRAStates deals with inconsistent input functions.
negate-ra-state.cpp: processes a C++ binary with exception handling and checks that it runs correctly after BOLTing.
negate-ra-state-flags.cpp: checks that binaries with negate-ra-state need the --allow-experimental-pacret flag.
pacret-instrument-flags.s: ensures that BOLT bails out when trying to instrument binaries with pac-ret (I will add support for that in a separate PR).

More lit tests will be added to cover the function splitting and BasicBlock reordering cases.

This approach has also been tested on unit-test binaries from the Chromium project. All scenarios and edge cases seen there are supported, but I think we might need to refine parts of the patch once it has been used “in the wild”.

We intend to add binaries to the out-of-tree repo as well: rafaelauler/bolt-tests

Caveats of putting the feature behind a flag

As the patch is new, relatively big, and could have major consequences if implemented incorrectly, @paschalis.mpeis advised putting the feature behind a flag, so users are aware that they are relying on new/experimental features.

The flag is also useful to disallow modes where the patch does not work yet.

The ideal experience would be the following:

the user tries to BOLT something that was compiled with -mbranch-protection=pac-ret or =standard,
BOLT warns them to either pass the --allow-experimental-pacret flag, or turn off branch protection when compiling their input binary,
user chooses and BOLT produces an optimized binary according to the choice.

As the OpNegateRAState CFIs are read from the binary in CFIReaderWriter::FillCFIInfoFor(BF), that is the earliest point where we can check the flag, and warn users. We can also drop the OpNegateRAState from the input (as its location will become invalid).

This is not possible on all systems however.

During linking, the linker inserts several functions into the binary from the C runtime. These come from platform-specific object files, e.g. from /usr/lib/gcc/aarch64-linux-gnu/13/crtbegin.o.

Functions from such objects may have been already compiled with -mbranch-protection enabled. One example we have seen is the function __do_global_dtors_aux. This affected 39 tests when running on aarch64 in the LLVM CI. See this comment.

Workaround

Instead of dropping OpNegateRAState CFIs from the input binary, we simply read them early on and check for the flag later, after several functions, like unwindCFIState, have operated on the CFIs. This way the check happens in the same location (that is, at the same point in BOLT sources) where BOLT would previously fail because of the unsupported CFI opcode.

With this choice it may seem that some input binaries can “slip through” the check. However, if any BinaryFunction in the input has remember-state and restore-state CFIs, BOLT runs the normalizeCFIState function to remove them and update other CFIs accordingly. This calls unwindCFIState and checks for the flag. So the bigger the input binary, the less likely it is that the flag check is skipped. For any reasonable binary, the chance is virtually zero. The location of the check is relevant for unittests however.

maksfb · June 11, 2025, 5:58pm

Thank you for the detailed writeup.

Without noreturn info, the instruction annotation approach sounds reasonable. I assume there will be no overhead for functions that don’t change RA state.
Random thought: can CFIs be used to discover noreturn calls and fix the CFG?
Have you considered annotating instructions with RA state for pre-CFG functions using a new MetadataRewriter?
For functions with inconsistent RA state, do you think it’s safe to perform optimizations other than basic block reordering?
For the flag, I’m in favor of having it on by default.

bgergely0 · June 12, 2025, 11:19am

Thanks for the questions!

I assume there will be no overhead for functions that don’t change RA state.

Currently, there is some runtime memory overhead for all functions, if any function requires pac-ret handling. I plan on reducing runtime mem overhead in a follow-up patch (unless it’s crucial for the initial patch).

For functions with inconsistent RA state, do you think it’s safe to perform optimizations other than basic block reordering?

For functions where we cannot parse a consistent RA state, no layout-changing optimizations are safe in my opinion. Such functions are ignored, so BB reordering is skipped.

Have you considered annotating instructions with RA state for pre-CFG functions using a new MetadataRewriter?

I can definitely take a look, if it would be cleaner. I did not know about these yet :)

Random thought: can CFIs be used to discover noreturn calls and fix the CFG?

I’ve also thought of this, and I think yes, taking CFI locations into account can improve the CFG. E.g. if a BB changes RA state after a call, it must be a noreturn call.
This fixup would not cover all cases though.
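To make the idea concrete, here is a rough sketch of such a check over per-instruction RA states. It is purely illustrative and not part of the PR: the function name is hypothetical, and the set of call mnemonics is an assumption.

```python
# RA state expected just before an instruction with the given annotation.
STATE_BEFORE = {
    "Signing": False,
    "Signed": True,
    "Authenticating": True,
    "Unsigned": False,
}


def noreturn_call_candidates(insts):
    """insts: list of (mnemonic, ra_state) pairs in the original layout order.

    Returns the indices of calls whose next instruction expects a different
    RA state, which means that instruction cannot be reached by falling through.
    """
    found = []
    for i in range(len(insts) - 1):
        op, state = insts[i]
        _, next_state = insts[i + 1]
        if op in ("bl", "blr") and state in ("Signed", "Unsigned"):
            if (state == "Signed") != STATE_BEFORE[next_state]:
                found.append(i)
    return found


code = [
    ("paciasp", "Signing"), ("bl", "Signed"), ("autiasp", "Authenticating"),
    ("ret", "Unsigned"), ("bl", "Unsigned"), ("stp", "Signed"),
]
print(noreturn_call_candidates(code))  # [4]
```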

We could probably find more noreturn calls with other probabilistic tricks, like hardcoding PLT entry names, such as abort@PLT, and marking each function in a function chain that only leads to such PLT calls.
