[RFC] Add a New Byte Type to LLVM IR

[RFC] Byte Type for LLVM IR

Authors: Juneyoung Lee (@aqjune), Pedro Lobo (@pedroclobo), Nuno Lopes (@nlopes), George Mitenkov (@george)

Summary

We propose adding a raw byte type (b<N>) to LLVM IR to correctly represent raw memory data in registers. This change:

  • Fixes known correctness bugs in memcpy lowering and load merging/widening
  • Enables the eventual removal of undef from LLVM IR
  • Has minimal performance impact: 0.2% average slowdown with optimizations across 20 benchmarks
  • Requires modest implementation effort: the prototype has ~2.6k lines of code across LLVM & Clang
  • Is backwards compatible

We also propose clarifying the semantics of load to allow type punning, which reduces overhead to near-zero.


Motivation

LLVM IR currently lacks a way to represent raw memory values in registers without prematurely interpreting them as integers or pointers. Today, such values are represented using integer types (i8, i16, i64, …), even when they originate from bytewise memory operations such as memcpy.

This conflation causes both correctness issues and semantic ambiguity:

  • Integer optimizations may be applied to values that are merely copies of memory.
  • Type punning through loads exists in practice but is underspecified.

The byte type addresses this by making raw-memory values explicit in the IR, while leaving integer semantics and optimizations unchanged.


Concrete Problems This Solves

1. Unsound memcpy Lowering

LLVM currently lowers small memcpy calls into integer load/store pairs. This is incorrect when copying pointers:

; Current (incorrect) lowering
call void @llvm.memcpy(ptr %dst, ptr %src, i64 8, i1 false)
=>
%v = load i64, ptr %src    ; loses pointer information!
store i64 %v, ptr %dst

This transformation loses pointer information, which breaks alias analysis and causes miscompilations (see bug 37469).

With the byte type, memcpy can be correctly lowered:

%v = load b64, ptr %src     ; preserves all data correctly
store b64 %v, ptr %dst

This represents a pure memory copy with no implicit reinterpretation.
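
To make the difference concrete, here is a minimal Rust sketch of the two lowerings. It is a toy model, not LLVM's actual memory model: the `MByte` enum, the numeric `prov` provenance tag, the little-endian layout, and all function names are assumptions made for this illustration.

```rust
/// One byte of memory in this toy model (hypothetical type for illustration).
#[derive(Clone, Copy, Debug, PartialEq)]
enum MByte {
    /// A plain integer byte.
    Int(u8),
    /// Byte number `index` of a pointer with address `addr` and provenance tag `prov`.
    Ptr { addr: u64, prov: u32, index: usize },
}

/// Models `load i64` followed by `store i64`: only the address bits survive.
fn copy_via_i64(src: [MByte; 8]) -> [MByte; 8] {
    let mut raw = [0u8; 8];
    for (i, b) in src.iter().enumerate() {
        raw[i] = match *b {
            MByte::Int(v) => v,
            // The integer load sees only the address, not the provenance.
            MByte::Ptr { addr, index, .. } => addr.to_le_bytes()[index],
        };
    }
    // The integer store writes plain integer bytes.
    raw.map(MByte::Int)
}

/// Models `load b64` followed by `store b64`: bytes are copied unchanged.
fn copy_via_b64(src: [MByte; 8]) -> [MByte; 8] {
    src
}

/// Models `load ptr`: returns the address, plus the provenance tag only if
/// all eight bytes are the matching bytes of one pointer.
fn load_ptr(mem: [MByte; 8]) -> (u64, Option<u32>) {
    let mut raw = [0u8; 8];
    let mut prov: Option<u32> = None;
    let mut intact = true;
    for (i, b) in mem.iter().enumerate() {
        match *b {
            MByte::Int(v) => {
                raw[i] = v;
                intact = false;
            }
            MByte::Ptr { addr, prov: p, index } => {
                raw[i] = addr.to_le_bytes()[index];
                if index != i || *prov.get_or_insert(p) != p {
                    intact = false;
                }
            }
        }
    }
    (u64::from_le_bytes(raw), if intact { prov } else { None })
}

/// Builds the eight bytes of a stored pointer.
fn pointer_bytes(addr: u64, prov: u32) -> [MByte; 8] {
    std::array::from_fn(|index| MByte::Ptr { addr, prov, index })
}
```

In this model, `load_ptr(copy_via_i64(pointer_bytes(0x2000, 7)))` yields the address with no provenance, while the `copy_via_b64` path keeps the tag. The lost tag stands in for the pointer information that alias analysis relies on.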

2. Unsound Load Merging/Widening

GVN and other optimizers merge loads but can spread poison incorrectly:

; Two 1-byte loads, one may be poison
%a = load i8, ptr %p
%b = load i8, ptr %q

; Current (incorrect) merging
%v = load i16, ptr %p
%a = trunc i16 %v to i8
%s = lshr i16 %v, 8
%b = trunc i16 %s to i8

Here, poison in one byte can incorrectly taint both results.

With the byte type:

%v = load b16, ptr %p
%c = trunc b16 %v to b8
%a = bytecast b8 %c to i8   ; poison only if %c is poison
%s = lshr b16 %v, 8
%d = trunc b16 %s to b8
%b = bytecast b8 %d to i8   ; poison only if %d is poison

Poison is tracked per byte, restoring correctness.
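
The contrast between the two merged forms can be sketched in Rust as follows. This is an illustrative model only: the `PByte` type, the use of `None` to stand for a poison result, and the assumption that the low byte of the wide load lives at `%p` (little-endian) are not part of the proposal.

```rust
/// A byte in memory for this sketch: a concrete value or poison (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum PByte {
    Val(u8),
    Poison,
}

/// Models `load i16` followed by trunc/lshr/trunc.
/// An integer value is poison as a whole, so one poison byte taints both halves.
/// `None` stands for a poison result.
fn split_via_i16(mem: [PByte; 2]) -> (Option<u8>, Option<u8>) {
    match mem {
        [PByte::Val(lo), PByte::Val(hi)] => (Some(lo), Some(hi)),
        _ => (None, None),
    }
}

/// Models `load b16` followed by per-byte trunc/lshr and `bytecast ... to i8`.
/// Poison stays attached to the byte it came from.
fn split_via_b16(mem: [PByte; 2]) -> (Option<u8>, Option<u8>) {
    let cast = |b: PByte| match b {
        PByte::Val(v) => Some(v),
        PByte::Poison => None,
    };
    (cast(mem[0]), cast(mem[1]))
}
```

With memory `[PByte::Val(1), PByte::Poison]`, the original two `i8` loads would give a defined `%a` and a poison `%b`. `split_via_b16` reproduces that, whereas `split_via_i16` makes both results poison.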

3. Enables Removing undef

The byte type is the final piece needed to eliminate undef values from LLVM IR. Currently, ~18% of Alive2-detected miscompilations occur solely because undef exists. Removing it would:

  • Simplify LLVM IR semantics
  • Make algebraic identities hold (e.g., x + x ≡ 2 * x)
  • Reduce compiler bugs from misunderstanding undef
  • Maintain backward compatibility via IR auto-upgrade

Proposal

1. Add Byte Type b<N>

A new IR type representing raw memory data, where each bit can be:

  • An integer bit (0 or 1)
  • Part of a pointer value
  • Poison

Example: b8, b16, b32, b64

The byte type is:

  • Allowed in alloca, load, and store
  • Distinct from integer types for optimization purposes

2. New bytecast Instruction

Converts between bytes and other types:

; Exact variant: returns poison if type doesn't match exactly
%i = bytecast exact b8 %byte to i8

; Type-punning variant: forces conversion
%i = bytecast b8 %byte to i8     ; allows reading pointer bytes as int
%p = bytecast b64 %byte to ptr   ; allows reading int bytes as ptr

The type-punning variant handles mixed-type data:

  • Pointer bytes → integer: behaves like ptrtoaddr, extracting the address value without escaping the pointer
  • Integer bytes → pointer: produces a pointer without provenance (cannot be dereferenced, but can be compared or used in getelementptr)
  • When bits are consistent with the target type, the conversion is a no-op
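
A rough Rust sketch of how the two variants differ when the target type is an integer is given below. It is only a model under stated assumptions: `MemByte`, the function names, the little-endian layout, and the use of `None` for poison are invented for illustration, and the integer → pointer direction is not modeled.

```rust
/// One byte held in a b64 value in this sketch (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum MemByte {
    /// A plain integer byte.
    Int(u8),
    /// Byte number `index` of a pointer whose address is `addr`.
    PtrByte { addr: u64, index: usize },
    /// A poison byte.
    Poison,
}

/// Models `bytecast exact b64 %v to i64`.
/// Returns `None` (poison) unless every byte is an integer byte.
fn bytecast_exact_to_i64(bytes: [MemByte; 8]) -> Option<u64> {
    let mut out = [0u8; 8];
    for (i, b) in bytes.iter().enumerate() {
        match *b {
            MemByte::Int(v) => out[i] = v,
            _ => return None,
        }
    }
    Some(u64::from_le_bytes(out))
}

/// Models the type-punning `bytecast b64 %v to i64`.
/// Pointer bytes contribute their address bits; poison bytes still yield poison.
fn bytecast_pun_to_i64(bytes: [MemByte; 8]) -> Option<u64> {
    let mut out = [0u8; 8];
    for (i, b) in bytes.iter().enumerate() {
        out[i] = match *b {
            MemByte::Int(v) => v,
            MemByte::PtrByte { addr, index } => addr.to_le_bytes()[index],
            MemByte::Poison => return None,
        };
    }
    Some(u64::from_le_bytes(out))
}

/// Builds the eight bytes of a pointer with the given address.
fn pointer_bytes(addr: u64) -> [MemByte; 8] {
    std::array::from_fn(|index| MemByte::PtrByte { addr, index })
}
```

For all-integer bytes both functions agree, which corresponds to the no-op case in the last bullet. For pointer bytes, the exact variant gives poison while the punning variant gives the address.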

3. Extend Existing Instructions

The following are extended to support byte types:

  • bitcast ty %v to b<N>: Convert any type to byte
  • trunc b<N> %v to b<M>: Truncate byte values
  • lshr b<N> %v, amt: Shift byte values
  • freeze b<N> %v: Freeze on per-bit basis

4. Clarify load Semantics (Important!)

LangRef does not define the semantics of a type-punning load (i.e., loading a region of memory with a type different from the one it was stored with).
We propose that a regular load allows type punning, making load ty, ptr %p equivalent to:

%b = load b64, ptr %p
%v = bytecast b64 %b to ty  ; type-punning bytecast

This is crucial for performance:

  • Existing code continues to work unchanged
  • No bytecast proliferation in normal code
  • Byte types appear only where needed (char variables, memcpy lowering)
  • Average slowdown drops from ~0.8% to ~0.2% when load type punning is enabled.

Implementation Experience

We have a working prototype:

  • LLVM: 1.3k LoC
  • Clang: 1.2k LoC (mostly lowering char to b8 and introducing bytecasts)
  • Alive2: Extended to support byte type for validation

Changes to LLVM

Core Lowerings

  • memcpy/memmove: Lowered to byte load/store pairs
  • Load merging, widening, and forwarding updated for byte semantics
  • Clang: char and std::byte variables → b8 instead of i8

Key Optimizations Implemented

  1. Bytecast constant folding
   %v = bytecast <2 x b8> <b8 1, b8 2> to <2 x i8>
   => %v = <2 x i8> <i8 1, i8 2>
  2. Redundant cast elimination
   %b = bitcast i32 %i to b32
   %c = bytecast b32 %b to i32
   => %c = %i
  3. Store forwarding
   store b32 %x, ptr %p
   %v = load i32, ptr %p
   => %v = bytecast b32 %x to i32
  4. SLP vectorization
   %b0 = load b8, ptr %p0
   %b1 = load b8, ptr %p1
   %x0 = bytecast b8 %b0 to i8
   %x1 = bytecast b8 %b1 to i8
   ...
   => %l = load <2 x b8>, ptr %p0
      %r = bytecast <2 x b8> %l to <2 x i8>
  5. Load combining
   %b = load b8, ptr %p
   %c = bytecast exact b8 %b to i8
   => %c = load i8, ptr %p
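
As an illustration of the redundant cast elimination pattern, the rewrite can be sketched as a small peephole over an expression tree. This is not how the prototype implements it: the `Expr` type and `simplify` function are hypothetical, and the sketch assumes every leaf is an integer SSA value whose width matches the casts.

```rust
/// A tiny expression tree for this sketch (hypothetical, not an LLVM data structure).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    /// An integer-typed SSA value such as `%i`.
    Value(String),
    /// `bitcast iN %x to bN`
    BitcastToByte(Box<Expr>),
    /// `bytecast bN %x to iN`
    BytecastToInt(Box<Expr>),
}

/// Removes `bytecast (bitcast %x)` round trips, simplifying operands first.
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::BytecastToInt(inner) => match simplify(*inner) {
            // The round trip through the byte type is a no-op for an integer source.
            Expr::BitcastToByte(x) => *x,
            other => Expr::BytecastToInt(Box::new(other)),
        },
        Expr::BitcastToByte(inner) => Expr::BitcastToByte(Box::new(simplify(*inner))),
        v => v,
    }
}
```

Applying `simplify` to the tree for `bytecast b32 (bitcast i32 %i to b32) to i32` returns the leaf `%i`, matching the second pattern in the list.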

Performance Evaluation

Evaluated on 20 benchmarks (~6M LoC total, including LLVM itself):

Configuration                    Avg Slowdown   Max Slowdown   Binary Size   Compile Time
Byte type                        0.8%           4.4%           +0.2%         +0.15%
Byte type + load type punning    0.2%           4.4%           +0.1%         +0.19%

Remaining regressions are due to incomplete cost model and pattern updates, expected to be addressed incrementally (similar to freeze).

Correctness Validation

  • 11 LLVM unit tests previously flagged as unsound by Alive2 are now fixed
  • No new test failures introduced
  • All known memcpy-related miscompilations are resolved
  • Alive2 validation on Draco, eSpeak, FLAC, tjbench shows only bug fixes

Comparison to Alternatives

Why Not Use Metadata/Attributes?

Metadata would:

  • Be difficult to preserve across transformations
  • Complicate every pass
  • Fail to represent raw values in SSA
  • Prevent correct store/load forwarding

The byte type makes the property explicit and enforceable.

Why Not Just Make All Integer Types Behave Like Bytes?

This would pessimize all integer operations, as many optimizations (GVN, reassociation, value forwarding) would become unsound. The byte type allows us to:

  • Keep aggressive optimizations for integers
  • Use bytes only where raw memory semantics are needed
  • Make the distinction clear in the IR

Why Not Just Make Memory Untyped?

The issue is not memory but SSA values. We need a type that can hold raw memory data in registers while preserving:

  • Pointer information (for alias analysis)
  • Per-bit poison propagation
  • Compatibility with optimizations

Open Questions

  1. Load semantics: The performance data suggests making regular loads support type punning. Rust folks are supportive of these semantics. They have never been defined in LangRef; should we move forward with them?
  2. Bitwidth restrictions: Our prototype limits bitwidths to multiples of 8. Is this acceptable, or do we need arbitrary bitwidths?
  3. Bytecast variants: Should we support both exact and type-punning variants, or just one?

Proposed Timeline

We propose the following timeline:

  1. After LLVM 22 branches: Commit the implementation to main
  2. 6-month testing period: Allow broad community testing during LLVM 23 development
  3. Address feedback: Fix any issues discovered during testing
  4. LLVM 23 release: Ship with byte type fully integrated

This maximizes real-world testing while minimizing release risk.

Patches

They are available here:

Acknowledgements

Thank you Nikita Popov & Ralf Jung for feedback on earlier drafts of this RFC.

Conclusion

The byte type addresses real correctness bugs in LLVM today, enables the long-term removal of undef, and does so with minimal performance and implementation cost. The prototype demonstrates feasibility, and the proposed timeline allows ample testing.

We look forward to feedback and discussion!

The performance loss here is not inherent to usage of the byte type itself, right? It just stems from a lack of cost model updates/updates to other passes that now need to handle more cases? That seems to be confirmed in the performance analysis section:

but I would expect this to be something we have fixed before we enable it by default.

I remember one of the concerns brought up with the original proposal a couple years back was around the need to lower all integer types to byte types in IR because punning pointers through integers could also change provenance. I haven’t followed any of the recent drafts/papers in WG14/WG21 that have worked on specifying this, but it seems like this is no longer a concern?

This timeline seems a bit ambitious to me. I would expect we might be able to have lowering to the byte type enabled under a flag for LLVM 23, but probably not enabled by default.

General +1 here - while I haven’t thought through the implications in detail, I do think this is a reasonable evolution of the IR and am here for eventually removing undef

I was confused by using the name “byte type” for bN where N might not be 8 but I guess I can get used to it.

I think we will need very clear documentation about the difference between bN and iN, when frontends should use one or the other, and what the rules are for valid IR transformations between them. Do you already have some proposed wording for LangRef, for example?

Right, it’s just a matter of fixing a few more optimizations/cost models.

My understanding from the C/C++ standards is that only chars can hold pointers safely. Real code may abuse this by implementing widened memcpys in C, but that’s technically illegal.
Introducing the byte type won’t make anything worse per se. It’s about how aggressive we will be with optimizations in the future. This patch doesn’t introduce anything that would break such code.

Ah, yes, the goal is to have it under a flag for clang. We can have the byte type in LLVM; the creation of byte types by optimizations can potentially be under a flag as well (though there aren’t many in the patch).

We haven’t written the LangRef patch yet.
What frontend writers need to know is simply: if a value may hold a pointer or another type at run time (and you don’t know which one), use the byte type. Otherwise, use the specific type.
So, if you have a union that can hold a pointer and another type, use a byte type. If you are copying memory (memcpy, etc.) and you don’t know the underlying type, use bytes.

Currently, we use integer types (like i8..) to represent many narrow float types from APFloat.

From your reply, I understand that this is expected to remain the same, since these values are never used to hold pointers. Is my understanding correct?

Correct, there is no differentiation between numeric scalar types.

Thanks for clarifying!

Should there also be instructions for inserting smaller values into byte types, e.g. for lowering the passing of a struct with small fields as a b64 without needing to construct it in memory?

So maybe also add and, shl, and or disjoint to the allowed byte type operations, so you can e.g. lower:

union U {
    a: *mut (),
    b: [MaybeUninit<u8>; 4],
}
let mut v = U { a: some_pointer };
v.b[2] = unsafe { U { a: other_pointer }.b[2] };
return v;

The issue with and/or is the semantics. For example, what does it mean to ‘and’ two pointers? Or even to ‘and’ a pointer with an integer that is not -1 or 0?

We do support zext for bytes. This might be sufficient to pad bytes for unions?

I was thinking more that and would only work between a byte type and an integer.

Maybe it would be better to have an insert-bytes instruction, like insert-element for vector types?

I don’t oppose an insertbyte instruction, but first I would like to understand if it’s really needed. For unions, I think all you need is padding, so zext (plus maybe a shift) is sufficient.
Are there other examples you can think of?

I think maybe you misunderstood the semantics of that Rust code. What I meant is that you can have code like:

union U {
    p: *mut (),
    b: [MaybeUninit<u8>; 4],
}
fn f(a: *mut (), b: *mut ()) -> U {
    let mut v = U { p: a };
    unsafe { v.b[2] = U { p: b }.b[2]; }
    v
}

The result ends up with the bytes:

0 1 2 3 4 5 6 7
a[0] a[1] b[2] a[3] a[4] a[5] a[6] a[7]

Note how the resulting bytes are a well-defined mixture of the two input pointers, unlike C unions, which can only have one active member.

How does this intersect with the semantics for non-integral pointers, specifically pointers with external state, where loads and stores of pointers must be correctly typed?

Are memcpys always lowered to full-width bN accesses, or are they decomposed? The latter would be breaking for CHERI, for example.

Tagging @davidchisnall who had concerns in this area previously.

A working example which you can run in Miri in the menu. Miri is designed to help catch UB and interprets the Rust source without going through LLVM:

Could you explain how this will work?

If I remember correctly, a remaining use of undef values is when reading uninitialized memory. This can occur validly, e.g. bitfield updates may read uninitialized memory words and partially overwrite them (example). Should this be implemented with byte types in the future?

I think CHERI will likely need some dedicated compiler support here. Its architecture and semantics are quite different from what this proposal is targeting, and in practice they already don’t fit cleanly into today’s LLVM semantics.

This proposal isn’t trying to solve that broader mismatch, and I don’t think it should be a prerequisite for moving forward. It seems reasonable to focus first on mainstream architectures, and then look at what additional work would be needed to support CHERI-style models.

We have CHERI working on LLVM today, including merging with TOT multiple times a week, and we are in the process of landing CHERI support in upstream in the form of the RISCV Y extension right now. If this will break CHERI support, it’s a pretty substantial issue for us.

You’re absolutely right that uninitialized memory is one of the remaining major sources of undef, and that this shows up in cases like bitfield updates.

The byte type is intended as a step in the right direction here, since it allows reading and writing individual bytes without immediately propagating poison to neighboring bits.

We do have a separate proposal in mind to add syntactic sugar to make bitfield lowering cleaner, but the idea is to get the byte type in place first and then build on top of it. The byte type has sufficient benefits beyond helping with the eventual removal of undef.