[RFC] Introducing a byte type to LLVM

Let me try to explain the issue around typed memory. It’s orthogonal to TBAA. Plus I agree the allocation type is irrelevant; it’s the type of the load/store operations that matters.

Fact 1: LLVM needs an “universal holder” type in its IR.

Why?

  • Frontends load data from memory and don’t necessarily know the type they are loading. When you load a char in C, it can be part of a pointer or an integer, for example. Clang has no way to know.
  • That’s necessary to express implementations of memcpy in IR, as it copies raw data without knowing what the type is.
  • LLVM lowers small memcpys into load/store.

I hope you agree LLVM needs a type that can contain any other type.

Fact 2: optimizations for this “universal holder” type need to be correct for any LLVM IR type.

Since this type can hold anything, optimizations that apply to this type must be correct for integers, pointers, etc. Otherwise, the results would be incorrect if the data was cast to the “wrong” type.

The corollary of this fact is that GVN is not correct for the “universal holder” type in general. That’s because GVN is not correct for pointers in general (though correct in some cases). GVN doesn’t work for pointers since just because 2 pointers compare equal, it doesn’t mean they have the same provenance. As LLVM’s AA uses provenance information, this information must be preserved.

Hopefully you agree with everything I wrote so far. If not please stop me and ask.

Ok, so what’s the solution?

We have two candidates for this “universal holder” type:

  • integer (i**)
  • byte (b**), a new dedicated type

The advantage of going with integers is that they already exist. The disadvantage? To make LLVM correct, we would need to disable GVN in most cases or stop using provenance information in AA (make pointers == integers).

This sounds pretty bad to me due to the expect performance hit.

Introducing a new type requires work, yes. But it allows us to keep all integer optimizations as aggressive as we want, and only throttle down when encountering the byte type. Another benefit is that loading a byte from memory doesn’t escape any pointer, while with the first solution you need to consider all stored pointers as escaped when loading an integer.

Introducing the byte type allow us to fix all the integer/pointer casts bugs in LLVM IR. And it enables more optimizations as we won’t need to be careful because of the LLVM IR bug.

A 3rd option is to keep everything as-is. That is, keep wrong optimizations in LLVM and keep adding hacks to limit the miscompilations in real-world programs.

That has bitten us multiple times. Last time someone told me my miscompilation example was academic, LLVM ended up miscompiling itself, and then miscompiled Google’s code. Only this year we are planning to fully fix that bug (we have a hack right now).

Last week Alive2 caught a miscompilation in the Linux kernel, in the network stack. The optimization got the pointer arithmetic wrong. Pretty scary, and the bug may have security implications.

Anyway, just to argue that leaving things as-is is not an option. We have momentum and 3 GSoC students ready to help to fix the last few design bugs in LLVM IR.

I know this stuff is complex; please ask questions if something is not clear or if you disagree with anything I wrote above.

Thanks,

Nuno

Perhaps a small concrete example of the GVN (or otherwise, if there
are smaller examples in other optimizations) transformation that you
find to be invalid - showing how it's important for integers, and
invalid for pointers (smallest example that manifests that invalidity
in an observable way) might help ground the conversation a bit.

Hi all,

to add to what Nuno said, I came up with a series of optimizations which demonstrate that dead store elimination, integer "substitution" (GVN replacing 'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in conflict with each other: the current LLVM semantics are inconsistent and can lead to miscompilations.

I posted them here as Rust code: <What about: Pointer-to-integer transmutes? · Issue #286 · rust-lang/unsafe-code-guidelines · GitHub. A direct translation to C wouldn't be correct due to strict aliasing, but if you imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then I hope this example demonstrates that *something* needs to give. This is not just a small bug somewhere, it is a case of the implicit assumptions made by different optimization passes / instructions being mutually incompatible.

Kind regards,
Ralf

Hi Ralf,

to add to what Nuno said, I came up with a series of optimizations which
demonstrate that dead store elimination, integer “substitution” (GVN replacing
‘a’ by ‘b’ after an ‘a == b’), and provenance-based alias analysis, are in
conflict with each other: the current LLVM semantics are inconsistent and can
lead to miscompilations.

I posted them here as Rust code:
<https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>.
A direct translation to C wouldn’t be correct due to strict aliasing, but if you
imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then
I hope this example demonstrates that something needs to give. This is not
just a small bug somewhere, it is a case of the implicit assumptions made by
different optimization passes / instructions being mutually incompatible.

Agreed. Reading your example as LLVM IR, I think we have a choice to make: either memory is typed or it isn’t. By “typed” I mean that an integer load following a pointer store or vice versa returns poison or is immediate UB.

If memory is typed, then your original example program has UB in at least two places:

  1. A pointer stored via storage_ptr is loaded via storage_usize, used in a comparison, and the comparison result is branched on. This would be UB.
  2. An integer is stored via storage_usize and then loaded via storage_ptr (pointer load) and then dereferenced. This would also be UB.

If memory is untyped and you want the original program to have defined behavior, my first thought is that the resolution that makes the most sense is to say that removing the store to *storage_usize is incorrect because that store is not truly redundant: it changes the pointer provenance that is associated with the underlying memory. The other optimizations in that chain of optimizations seem of higher value (keeping in mind that most dead store elimination is still correct, it’s just this particular narrow case which isn’t).

Cheers,
Nicolai

Hi Nicolai,

    to add to what Nuno said, I came up with a series of optimizations which
    demonstrate that dead store elimination, integer "substitution" (GVN replacing
    'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in
    conflict with each other: the current LLVM semantics are inconsistent and can
    lead to miscompilations.

    I posted them here as Rust code:
    <What about: Pointer-to-integer transmutes? · Issue #286 · rust-lang/unsafe-code-guidelines · GitHub
    <What about: Pointer-to-integer transmutes? · Issue #286 · rust-lang/unsafe-code-guidelines · GitHub.

    A direct translation to C wouldn't be correct due to strict aliasing, but if
    you
    imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then
    I hope this example demonstrates that *something* needs to give. This is not
    just a small bug somewhere, it is a case of the implicit assumptions made by
    different optimization passes / instructions being mutually incompatible.

Agreed. Reading your example as LLVM IR, I think we have a choice to make: either memory is typed or it isn't. By "typed" I mean that an integer load following a pointer store or vice versa returns poison or is immediate UB.

If memory is typed, then your original example program has UB in at least two places:

1. A pointer stored via storage_ptr is loaded via storage_usize, used in a comparison, and the comparison result is branched on. This would be UB.
2. An integer is stored via storage_usize and then loaded via storage_ptr (pointer load) and then dereferenced. This would also be UB.

Agreed. However, to "fix" the example, it is enough to make one of these UB. So, the second part could still be easily allowed -- converting integers to pointers is lossless, so I do not think there is a problem with performing that conversion "implicitly" as part of a load at ptr type.

If memory is untyped and you want the original program to have defined behavior, my first thought is that the resolution that makes the most sense is to say that removing the store to *storage_usize is incorrect because that store is not truly redundant: it changes the pointer provenance that is associated with the underlying memory. The other optimizations in that chain of optimizations seem of higher value (keeping in mind that *most* dead store elimination is still correct, it's just this particular narrow case which isn't).

I first was confused what you mean by "narrow" here; this would affect *all* cases of 'removing a store of the value just loaded' since we have no way to know if that value might be a pointer.

However, judging from your other email, you seem to consider this a narrow case, compared to removing the first of two consecutive stores to the same location. That's fair, I am not nearly enough of a compiler person to evaluate these trade-offs. :slight_smile: I can just provide guidance by suggesting formal semantics that support a given set of optimizations, and pointing out when a combination of optimizations becomes internally inconsistent.

Kind regards,
Ralf