Authors: Zachary Yedidia, Tal Garfinkel, Taehyun Noh, Shravan Narayan, Sharjeel Khan, Nathan Egge
Contributors: Vitaly Buka, Peter Collingbourne, Nick Desaulniers
This RFC follows up on the broader LFI RFC, adding more detail about our approach to supporting the x86-64 LFI target and its specific backend compiler passes. These changes primarily touch the MC layer of LLVM and the x86-64 backend.
Overview
Lightweight Fault Isolation (LFI) is a compiler-based sandboxing technology that confines native code to a restricted region of the virtual address space. LFI’s high-level design, threat model, runtime architecture, and ABI conventions were described in a previous RFC.
This RFC is focused on the details and rationale for our x86-64 version, most notably our use of bundle alignment in the assembler.
Use cases
LFI currently has two primary users we are aiming to support. Other use cases were discussed in the prior RFC, and we expect the community of users with similar challenges to grow as LFI becomes available in LLVM:
- Google has a large number of third-party native libraries that we rely on for a wide range of tasks, including media decoding, compression, file parsing, and image processing. At present, process sandboxing is the only option available for confining these libraries. Unfortunately, this introduces prohibitive overheads for certain use cases; in particular, IPC performance and sandbox creation overheads can be quite limiting. Consequently, a variety of teams within Google are either currently using LFI (Android), building tooling to use LFI when it lands upstream (Google Cloud), or planning to use LFI as x86-64 support becomes available (ChromeOS).
- The Firefox team currently uses Wasm to sandbox third-party libraries with RLBox; they are looking into LFI to overcome the performance and compatibility limitations of Wasm.
The x86-64 port preserves the goals stated in the AArch64 RFC: compatibility with existing C/C++/assembly code, low overhead, a small and verifiable sandboxing primitive, and easy retrofitting of existing libraries.
Background
Here we provide an overview of LFI on x86-64, its major design elements, and the alternatives considered. We discuss how this is instantiated in LLVM in the next section.
LFI in x86-64
The x86-64 sandboxing scheme follows the same shape as the AArch64 scheme: assembly rewrites confine memory accesses and control flow to the sandbox region, and a small set of registers is reserved to maintain sandbox invariants. The reserved registers, context register layout, and runtime-call mechanism are documented in more detail in our implementation documentation. The key registers are:
- %r14: sandbox base address.
- %gs: segment register holding the sandbox base (when available).
- %rsp: always holds an address within the sandbox.
- %r15: context register (thread-local runtime state).
- %r11: scratch register for rewrites.
Memory Sandboxing
Memory accesses are rewritten by changing the addressing mode to use the %gs segment, which the runtime configures to point at the 4GiB-aligned sandbox base. The register holding the index is truncated to its 32-bit form, which (when added to the sandbox base) makes the access provably within the 4GiB region starting at that base address.
movq N(%rX), %rD
->
movq %gs:N(%eX), %rD
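To make the bounds argument concrete, here is a minimal Python sketch of the address that the rewritten access reaches. It is a model, not the implementation: `sandboxed_address` and its parameter names are hypothetical, and it assumes that with a 32-bit base register the effective address wraps modulo 2^32 before the %gs base is added.

```python
MASK32 = 0xFFFF_FFFF
SANDBOX_SIZE = 1 << 32  # 4GiB region starting at the sandbox base


def sandboxed_address(gs_base: int, reg: int, disp: int = 0) -> int:
    """Model the linear address reached by `%gs:disp(%eX)` (hypothetical helper)."""
    # The runtime is described as configuring %gs with a 4GiB-aligned base.
    assert gs_base % SANDBOX_SIZE == 0, "sandbox base must be 4GiB-aligned"
    # Assumption: 32-bit addressing truncates the register and wraps the
    # register + displacement sum to 32 bits before the segment base is added.
    effective = ((reg & MASK32) + disp) & MASK32
    return gs_base + effective


base = 0x7F00 << 32
addr = sandboxed_address(base, 0xDEAD_BEEF_1234_5678, 8)
```

Whatever value the guest places in the upper 32 bits of the register, the resulting address stays in `[base, base + 4GiB)` under this model.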
Control-flow Sandboxing
Bundling
An essential property for SFI schemes, and other rewrites that aim to ensure some security invariant holds, is that all interpretations of the instruction stream maintain that invariant. Determining this on AArch64 is straightforward: all instructions are 4 bytes long and must be 4-byte aligned, so there is only one canonical interpretation of the instruction stream.
However, because x86-64 has variable-length instructions and jumps can target any byte-addressable location, there can be many interpretations of the instruction stream. Without an additional mechanism, malicious code can choose a different interpretation of the instruction stream, e.g., by jumping into the “middle” of an instruction in our canonical encoding to yield an instruction sequence that bypasses the sandbox.
To prevent this, instruction bundling provides a primitive that forces our instruction stream into a canonical encoding to enforce safety in a widely compatible, efficient, simple, and verifiable way. Bundling is the simplest and most well-understood mechanism available for ensuring that x86 machine code can be soundly analyzed. It additionally can handle complex applications like libvips (media codec) and SpiderMonkey/V8 that use JIT compilers, or programs like dav1d that have large amounts of hand-written assembly.
When enabled, the assembler places instructions into bundles of a certain size (e.g., 32 bytes). The assembler ensures that no instruction may cross a bundle boundary by inserting padding in cases where that would happen. Instructions can be locked together to ensure that they are emitted within the same bundle.
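The padding rule can be sketched as a small layout function. This is a simplified model under stated assumptions, not the MC implementation: `layout` is a hypothetical name, instruction sizes are taken as already final, and a bundle-locked group is modeled as a single unit whose size is the sum of its instructions.

```python
BUNDLE_SIZE = 32  # bytes; the bundle size used as the example in the text


def layout(sizes, bundle_size=BUNDLE_SIZE):
    """Return a (start_offset, padding_before) pair for each unit.

    Padding is inserted only when a unit would otherwise cross a
    bundle boundary (hypothetical model of the rule described above).
    """
    placements = []
    offset = 0
    for size in sizes:
        if size > bundle_size:
            raise ValueError("a unit cannot be larger than a bundle")
        room = bundle_size - (offset % bundle_size)  # bytes left in this bundle
        padding = room if size > room else 0
        offset += padding
        placements.append((offset, padding))
        offset += size
    return placements


placements = layout([10, 10, 10, 15])
```

In this example the first three units fit in the first bundle, while the 15-byte unit would cross offset 32 and is therefore pushed to the start of the next bundle with 2 bytes of padding.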
Indirect jumps and calls, as well as returns, must be constrained so that their targets are within the sandbox, and land at a valid instruction boundary.
jmpq *%rX
->
.bundle_lock
andl $0xffffffe0, %eX
addq %r14, %rX
jmpq *%rX
.bundle_unlock
call foo
->
.bundle_lock align_to_end
call foo
.bundle_unlock
ret
->
popq %r11
.bundle_lock
andl $0xffffffe0, %r11d
addq %r14, %r11
jmpq *%r11
.bundle_unlock
Note: The rewrites do not always preserve the exact existing behavior with respect to flags. For example, the rewrite of ret uses instructions that clobber flags, while ret itself does not. This means we rely on the compiler not expecting flags to be preserved across returns/indirect branches, which is currently the case in LLVM. This can cause compatibility issues with hand-written assembly that uses a non-standard calling convention ABI.
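The effect of the and/add mask sequence on a branch target can be modeled as follows. This is an illustrative sketch only: `sandbox_branch_target` is a hypothetical name, and the model assumes the sandbox base in %r14 is 4GiB-aligned, as described for the %gs base above.

```python
BUNDLE_MASK = 0xFFFF_FFE0  # clears the low 5 bits: 32-byte bundle alignment
SANDBOX_SIZE = 1 << 32


def sandbox_branch_target(base: int, target: int) -> int:
    """Model `andl $0xffffffe0, %eX; addq %r14, %rX` (hypothetical helper)."""
    # A 32-bit `andl` zero-extends its result into the full 64-bit register,
    # so the upper 32 bits of the original target are discarded.
    masked = (target & 0xFFFF_FFFF) & BUNDLE_MASK
    # Adding the sandbox base yields an address inside the sandbox.
    return base + masked


base = 0x7F00 << 32
dest = sandbox_branch_target(base, 0x1234_5678_9ABC_DEF3)
```

Any 64-bit input maps to a bundle-aligned address within the 4GiB region, which is why the mask and the jump must sit in the same bundle-locked region: a jump that skipped the mask would not have this guarantee.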
Prefix Padding. Earlier implementations of bundling padded bundles with nop instructions. However, this imposes undesirable backend pressure, as these nop instructions still issue and consume resources in the pipeline. A more efficient and recent approach is to instead pad out bundles with instruction prefixes. Processors can drop these prefixes, which are otherwise unused, at the front end, eliminating this source of overhead. This provides a noticeable improvement in overhead on SPEC 2017, of roughly 1-2 percentage points. LLVM already has infrastructure for placing prefixes on instructions for alignment purposes.
In our benchmarking on SPEC 2017, control-flow rewrites with bundling cost roughly 3% (Intel) to 4.5% (AMD) in overhead. See the performance section for the benchmark breakdown.
Alternatives Considered
Bundling enables coarse-grain control-flow integrity (CFI) for both forward and backward-edge control flow, by allowing control flow to be limited to the sandbox using simple masking (described above). As an alternative to bundling, one could rely on separate schemes for forward and backward-edge protections. For forward-edge protections, we could employ fine-grain label-based CFI schemes (which as a corollary also offer the coarse-grain CFI properties needed for sandboxing). Backward-edge protections can be enforced with a variety of schemes, including label-based CFI and SafeStack-based schemes. We experimented with a variety of these schemes to understand their trade-offs vs. bundling. However, we found that for performance, hardware compatibility, simplicity, and verifiability, bundling still offers the most attractive option in most situations. We discuss some of the alternatives we explored below.
Forward Edge CFI
To provide forward-edge CFI, we could use a mechanism like Intel CET’s Indirect Branch Tracking (IBT), augmented with some software checks. Note that since hardware IBT is not supported in userspace by Linux or Windows (and the hardware is not widespread enough), we must implement it via a software simulation. IBT requires that every indirect branch targets an endbr64 instruction, and the hardware causes a fault otherwise. On its own, this is not sufficient: we must also confine the indirect branch target to the sandbox region, and ensure that the endbr64 target is not embedded within another instruction. To reduce the chance of an embedded endbr64 instruction, we can also require that indirect branch targets are aligned to a certain power of 2 (e.g., 8 or 32). If there is still an embedded endbr64 detected by static verification, the program could be recompiled with padding to move it to an unaligned (unreachable) location, but the chance of this happening is vanishingly small with a 32-byte alignment requirement.
jmpq *%rax
->
andl $0xffffffe0,%eax
.p2align 1
cs cmpl $0xfa1e0ff3,(%r14,%rax,1)
jne _trap
addq %r14,%rax
jmp *%rax
_trap: ud2
In our benchmarking, this transformation for forward-edge CFI costs ~1.5-2% on SPEC 2017. This is acceptable for forward-edge CFI, but cannot also be efficiently used for backward-edge CFI, meaning we would need a separate backward-edge CFI approach. The approaches we considered are discussed in the next section.
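The software check in the sequence above can be modeled in a few lines. This sketch is an illustration under assumptions: `is_valid_target` is a hypothetical name, and the sandbox code is modeled as a byte string indexed by offset from the sandbox base. It shows where the constant 0xfa1e0ff3 comes from: it is the endbr64 encoding (f3 0f 1e fa) read as a little-endian 32-bit word.

```python
ENDBR64 = bytes([0xF3, 0x0F, 0x1E, 0xFA])         # machine encoding of endbr64
ENDBR64_LE = int.from_bytes(ENDBR64, "little")    # the value compared by cmpl
ALIGN_MASK = 0xFFFF_FFE0                          # 32-byte target alignment


def is_valid_target(sandbox_code: bytes, target: int) -> bool:
    """Model the mask + compare check (hypothetical helper).

    `sandbox_code` stands in for sandbox memory, indexed from the base.
    """
    offset = target & ALIGN_MASK                  # andl $0xffffffe0, %eax
    word = sandbox_code[offset:offset + 4]        # load at (%r14,%rax,1)
    # A mismatch corresponds to the `jne _trap` path.
    return len(word) == 4 and int.from_bytes(word, "little") == ENDBR64_LE


code = bytearray(64)
code[32:36] = ENDBR64  # place an endbr64 at the start of the second 32-byte block
```

With this layout, a branch to offset 32 (or to any offset that masks down to 32) passes the check, while a branch to offset 0 takes the trap path.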
Another alternative for forward-edge CFI is to use a table of indirect call targets that is stored outside the sandbox, transform all indirect branch target addresses into offsets within this table, and perform a lookup based on the offset before an indirect branch. Compared to the previous approach, this has more compatibility issues with hand-written assembly, requires more complexity in the compiler and runtime, and also introduces an ABI change when calling functions that take function pointers as arguments, while only providing similar performance to label-based CFI schemes.
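A rough sketch of the table-based idea follows. Everything here is hypothetical and for illustration only (the `BranchTable` class, the `TRAP` address, the power-of-two capacity, and the index masking are assumptions, not details from our prototype): function pointers become table indices, and the index is resolved through a table held outside the sandbox before the branch.

```python
TRAP = 0  # hypothetical address of a trap stub for unused slots


class BranchTable:
    """Table of permitted indirect-branch targets (illustrative model)."""

    def __init__(self, capacity: int):
        # Assumption: a power-of-two capacity lets the lookup mask the index
        # instead of performing a bounds check.
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self._slots = [TRAP] * capacity
        self._used = 0

    def register(self, address: int) -> int:
        """Record a legitimate target and return the index used in its place."""
        if self._used == len(self._slots):
            raise OverflowError("branch table is full")
        index = self._used
        self._slots[index] = address
        self._used += 1
        return index

    def resolve(self, index: int) -> int:
        """Lookup performed before an indirect branch; never leaves the table."""
        return self._slots[index & (len(self._slots) - 1)]


table = BranchTable(4)
handle = table.register(0x1000)
```

Because sandboxed code only ever holds indices, any function pointer that crosses the sandbox boundary has to be translated, which illustrates the ABI change mentioned above.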
Backward Edge CFI
For backward edge CFI, we need a different mechanism. We explored four possibilities. The first two schemes have reasonable performance, but unfortunately have compatibility challenges preventing widespread use. The latter two schemes have significantly higher costs than bundling.
- Hardware CET shadow stack: Rely on Intel CET’s hardware shadow stack feature for backward edge CFI. This works, but requires support from the hardware, operating system, and system libraries. The entire application (including the native host code) must enable the hardware shadow stack. In our measurements, enabling the CET shadow stack costs ~1% on Intel CPUs and ~4% on AMD CPUs. The overhead also applies globally, rather than just to the library/application being sandboxed. For Google, CET hardware support is not widespread enough for this to be a full solution.
- SafeStack: The existing SafeStack pass in LLVM splits the stack into “safe” and “unsafe” stacks, and synergizes well with LFI, since LFI can be used to soundly protect access to the safe stack. The unsafe stack can be placed in the sandbox and used for arbitrary stack variables, such as buffers. The safe stack can be placed outside the sandbox and used for return addresses and simple local variables. While the performance of SafeStack+software endbranch is good, there are two primary issues with SafeStack: compatibility and verifiability.
a. Compatibility: SafeStack is not compatible with hand-written assembly or JIT-generated code without significant re-engineering of those applications. We don’t think it would be feasible to use this approach to sandbox applications like dav1d (tens of thousands of lines of hand-written assembly) or SpiderMonkey/V8 (complex stack usage in JIT-generated code), while doing so with bundling (or CET shadow stack) is feasible.
b. Verifiability: Once we start allowing data to be placed on the safe stack, it becomes difficult to statically verify the security of the sandbox at the machine code level. We must ensure that pushes/pops are always correctly balanced at function entry/exit and that indirect jumps that don’t call a function always remain within the function they originate from. If we skip verification, we begin to include the compiler’s correctness in the trusted code base of the sandboxing system (see the VeriWasm project for an example of the complexity that arises from this).
c. A third, less critical issue is that the current SafeStack pass in LLVM would need to be updated to be used in combination with LFI’s memory masks. Since the memory masks are applied to every access through a general-purpose register, this will break if the safe stack is accessed through a general-purpose register. If we took this path, we would need to upstream changes to the SafeStack pass that make it more secure, by moving things like variadic arguments into the unsafe stack, so that safe stack accesses only originate from rsp (these involve ABI changes for variadic arguments). We have prototyped the changes for this, and they are comparable with the changes needed for bundling.
d. SafeStack is not compatible with acceleration via Intel Memory Protection Keys (MPK): SafeStack (as proposed in the scheme above), requires each thread’s stack to be inaccessible to other threads. If we try to accelerate LFI by removing memory access guards when MPK is available, accesses to the SafeStack will be trapped by MPK. With a bundling-based scheme, the stack and heap are both located within the sandbox region without per-thread restrictions, which is compatible with MPK.
- Software-based shadow call stack (software simulation of CET shadow stack): The performance and code size overhead of this approach was not suitable.
- Control-flow stack: a more restricted version of SafeStack, where the safe stack is used purely for return addresses. All accesses of rsp, including pushes/pops, are rewritten to use a reserved register. This approach is more broadly compatible than SafeStack, but the performance of this approach was not suitable for general use.
Reference for performance benchmarks: lfi-bench/spec2017/CFI.md (master branch) in the lfi-project/lfi-bench repository on GitHub.
LLVM Implementation
Here we discuss the particulars of our implementation of LFI in LLVM.
Scope: Our changes apply to x86-64 only. At present we are only targeting ELF support; however, nothing in our implementation is particularly ELF-specific.
Rewriter Implementation
The x86-64 rewriter takes the same approach as the AArch64 rewriter, implemented as an MCLFIRewriter class instance in the X86 backend. We have opened an initial X86 rewriter PR, and we would like to upstream rewriter components, to allow others to experiment with this feature, even while the control-flow mechanism (bundling) is still under discussion. Similar to the AArch64 implementation, there are minor changes required in compiler-rt to avoid using reserved registers and to enable the build for the x86_64_lfi target.
Bundling Implementation
Link to current bundling PR: https://github.com/llvm/llvm-project/pull/175830
Assembler Directives
LFI requires 3 new bundling directives (.bundle_align_mode, .bundle_lock, and .bundle_unlock), which are the same ones that LLVM supported for NaCl before their removal in 2025. GCC also supports these directives, though it does not support the align_to_end modifier on .bundle_lock.
Interaction with Existing Options
Interaction with --x86-align-branch: This flag aligns branches for the JCC erratum mitigation so that they do not cross an N-byte boundary, which works around a hardware bug in certain Intel processors from 5-10 years ago. This option can coexist with bundling: it just adds additional constraints about whether jump instructions (possibly with fused comparisons) can span a particular boundary, and there is no fundamental incompatibility. Alternatively, we don’t have any requirements for the use of this option in combination with bundling, so generating an error for the combination of both features is reasonable. We do not need the option because the JCC hardware bug is already handled via a microcode update. The microcode fix does cause some performance degradation (0-4% according to the Intel document) for branches/fused branches that span the 32-byte boundary, but when bundling is enabled, the frequency of problematic branches drops so dramatically that we are not concerned with the performance impact of relying purely on the microcode fix.
Interaction with --x86-pad-max-prefix-size: This flag selects the maximum prefix size for existing prefix padding use cases in LLVM (for alignment purposes). We intend to reuse this option to select the maximum number of prefixes to insert per instruction for bundle prefix padding, but we could also introduce a new option, such as --x86-pad-for-bundle-align, to control prefix padding for bundles separately.
Interaction with -mrelax-all: Bundling is performed after relaxation, so -mrelax-all works as expected even with bundling enabled. Since bundling already relaxes bundle-locked instructions, enabling -mrelax-all effectively extends relaxation to all remaining non-bundle-locked sequences. The relax-all option is also a legacy option, and we have no intention to use or rely on it in combination with bundling.
Alternatives Considered
Our current bundling PR leverages the MCAssembler layer, which is the most compelling location because: 1) it is where final code offsets are decided; 2) non-instruction entities (like .nops or .align) have concrete sizes that affect layout; 3) instruction sizes are finalized here after fixups and relaxation; and 4) it already hosts the basic functionality for code resizing.
- Pros: This is the optimal layer for precision.
- Cons: A possible concern would be dependency on existing components in the layer, which may complicate future refactors of the assembler. However, our current PRs point to the code changes being maintainable.
We considered three alternative approaches to this implementation.
- A backend hook instead of generic MC state: The X86 backend could potentially handle bundling internally without touching MCAssembler, MCStreamer, or MCSection. This is worse than the currently proposed bundling implementation because bundle alignment must influence offset assignment and instruction relaxation, both of which are generic MC concerns. Threading X86-specific knowledge through generic layoutSection() to avoid putting the field on MCAssembler is, in our judgement, a worse outcome than the small generic addition.
- Implementation within the LFI rewriter: This is not possible because the LFI rewriter only sees instructions, but not other existing directives used by the compiler such as “align” which would affect how bundling would need to be performed. The rewriter also sees instructions before relaxation is performed.
- Implementing at the MCObjectStreamer layer: An alternative is to expand MCObjectStreamer to create bundles. While this would allow us to not rely on newly added assembler directives, it would require a number of changes which add complexity or worsen performance.
a. This implementation would need to track other emitters like emitCodeAlignment to ensure alignment shifts don’t overflow a bundle. Since instruction sizes aren’t finalized at this stage, this, in turn, would require rewiring MCAssembler utilities to provide preliminary estimates.
b. Even if we implement point a. above, we would end up with worse results. This is because the streamer approach applies bundling preemptively, based on maximum potential instruction sizes. Since this is naturally more conservative, we expect that this would result in larger binary sizes and worse performance than the assembler approach (which uses the final sizes of instructions).
Documentation and Testing
General LFI x86 testing and documentation will follow the AArch64 version, with documentation of all rewrites and the overall sandboxing design in the LFI.rst document, and unit tests under llvm/test/MC/X86/LFI. For bundling in particular, since this proposal is compatible with the prior implementation of bundling, the tests and documentation can be reused. We can also include new test cases for interactions with flags like those mentioned above.
Performance
The performance analysis of the LFI-x86 scheme is shown below, as benchmarked on SPEC 2017. This uses the scheme designed for 4GiB sandboxes, using %gs for memory accesses and bundles with prefix padding for control-flow integrity. We benchmark on both Intel and AMD CPUs since they exhibit slightly different performance characteristics. The “stores” configuration measures the effect of sandboxing control-flow and stores, but not loads, meaning the sandbox can read data outside the sandbox but cannot corrupt the host. This is a reasonable configuration in certain threat models, such as in a browser’s renderer process, which is already process-isolated. The “jumps” configuration only applies control-flow rewrites, which shows the baseline cost of bundling and may be relevant for devices that can support MPK for memory sandboxing.
The three bundling directives (see Assembler Directives) are:
- .bundle_align_mode N: enable bundles of size 2^N. We only plan to use 32-byte bundles, so restricting the implementation to that size would also be reasonable.
- .bundle_lock [align_to_end]: start a “bundle-locked” region. The assembler guarantees that all instructions within the bundle-locked region are placed in the same bundle. If the align_to_end flag is provided, the instructions are aligned so that the end of the last instruction is aligned with the end of the bundle.
- .bundle_unlock: closes the bundle-locked region.
Bundle locking is necessary to ensure that it is not possible for malicious code to jump into the middle of a sequence of instructions that is performing a sandbox mask. The align_to_end modifier is necessary so that call instructions get aligned to the end of a bundle, since the return address must be aligned to a bundle boundary (since it is the target of an indirect branch).
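The placement implied by align_to_end can be sketched as follows. This is a simplified model with a hypothetical helper name (`place_align_to_end`), assuming final instruction sizes and 32-byte bundles. It computes where a call must start so that its end, and therefore its return address, falls on a bundle boundary.

```python
BUNDLE_SIZE = 32


def place_align_to_end(offset: int, size: int, bundle_size: int = BUNDLE_SIZE) -> int:
    """Return the start offset for an instruction emitted at or after `offset`
    so that it ends exactly on a bundle boundary (hypothetical model)."""
    if size > bundle_size:
        raise ValueError("instruction cannot be larger than a bundle")
    # Round the earliest possible end offset up to the next bundle boundary.
    end = -(-(offset + size) // bundle_size) * bundle_size
    return end - size


# A 5-byte call emitted at offset 0 is padded so that it starts at offset 27.
start = place_align_to_end(0, 5)
```

The return address is `start + size`, which is always a multiple of the bundle size in this model, so it survives the `andl $0xffffffe0` mask applied by the rewritten ret unchanged.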
Relation to prior LLVM bundling support: Bundling for x86 architectures was previously supported in LLVM for NaCl but was removed in July 2025, both because NaCl was deprecated and because the prior version imposed performance overheads on non-x86 architectures. Our proposed implementation is cleaner than the prior version: it does not require a virtual emitInstToData() override, NOPs never span bundle boundaries, labels do not move during bundling, and there is no measurable compilation performance regression when bundling is disabled. Our proposed implementation uses the minimal necessary surface. Compared to the previous version of bundling, we have removed bundle nesting, since it is not required. All other functionality is equivalent, so we can reuse the existing tests and documentation.
LFI 4GiB (Intel i7 13700k), SPEC 2017 overhead in %
| Intel i7 13700k | lfi-clang-m64 | lfi-stores-clang-m64 | lfi-jumps-clang-m64 |
|---|---|---|---|
| geomean | 7.89 | 3.787 | 3.164 |
LFI 4GiB (AMD Ryzen 9 7950X), SPEC 2017 overhead in %
| AMD Ryzen 9 7950X | lfi-clang-m64 | lfi-stores-clang-m64 | lfi-jumps-clang-m64 |
|---|---|---|---|
| geomean | 7.088 | 5.288 | 4.316 |
Upstreaming and Ongoing Maintenance Effort
Similar to AArch64, the x86-64 rewriter will be broken up into 3 PRs: system rewrites with the LFI target, control-flow rewrites, and memory rewrites. The system rewrite PR is already up. As mentioned previously, the bundling part is its own PR, but it is necessary for the x86-64 target to be feature complete.
The main implementor and maintainer for LFI is Zachary Yedidia. Additionally, the code will be maintained long-term by Vitaly Buka from the Google Sanitizers team and Nick Desaulniers from the Android toolchain team. We will make sure to respond to any MC refactors touching bundling and will add ourselves to the Maintainers file.
Please let us know your thoughts, and thank you for your review!

