[RFC] Load Instruction: Uninitialized Memory Semantics

There are two main optimizations that you lose:

  1. Folding of phis originating from conditional loads (SROA).
    e.g.:
int x;
if (cond)
  x = *p;
use(x);

SROA will convert this into (with nondet semantics):

entry:
  %nondet = freeze poison
  br i1 %cond, label %then, label %else

then:
  %v.load = load i32, ptr %p
  br label %else

else:
  %v = phi i32 [ %v.load, %then ], [ %nondet, %entry ]

We want to fold the phi into the loaded value:

...
else:
  %v = %v.load

But we cannot do this transformation, because if the loaded value is poison, we would be changing nondet into poison when %cond=false. One way around it would be to mark loads with !noundef in clang.
Right now, this transformation is done for phis with undef inputs (by InstSimplify), but that is a miscompilation!
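For concreteness, a sketch of the fold in question (assuming the nondet semantics):

```llvm
; InstSimplify currently folds this phi to %x:
%v = phi i32 [ %x, %then ], [ undef, %entry ]
; Under the nondet semantics that is a miscompilation: if %x is
; poison, the fold turns the more-defined value on the %entry path
; into poison.
```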


  2. Duplicating loads. The non-deterministic semantics prevents duplicating a load that has multiple uses and splitting those uses across the two copies of the load.
    For example, you cannot sink a load into a loop: before the optimization it would execute just once, while afterwards it would execute once per iteration, potentially giving a different value on each iteration.
    I think this issue is solvable with a trick I’ve called “entanglement” of loads (that forces that their non-determinism resolves to the same value), but that is not studied in-depth nor implemented.
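To illustrate the duplication hazard, a sketch in IR (labels and names invented for illustration); sinking a single load into a loop turns one nondeterministic choice into one choice per iteration:

```llvm
; before: %p may point to uninit memory; the nondet value is fixed once
  %v = load i32, ptr %p
  br label %loop
loop:
  call void @use(i32 %v)          ; same value every iteration
  br i1 %c, label %loop, label %exit

; after sinking: a fresh load (and a fresh nondet choice) per iteration
loop:
  %v.i = load i32, ptr %p
  call void @use(i32 %v.i)        ; may differ across iterations
  br i1 %c, label %loop, label %exit
```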

Marking loads with !noundef seems similar to the effort of adding noundef to function parameters, which seems doable.

Would this also be solvable with !noundef loads?

Yes, both issues are solvable with !noundef.
But, !noundef has its own issues :slight_smile: When a load is hoisted, we need to drop !noundef. It’s not terrible because by that time SROA has run.

The other concern is regarding other frontends. While we can use !noundef in clang, I don’t know if that’s ok for Rust @RalfJung?

We can and in fact already do annotate most loads as !noundef since Put `noundef` on all scalars that don't allow uninit by Nilstrieb · Pull Request #106294 · rust-lang/rust · GitHub.


@vitalybuka Would msan be affected if loads of uninit memory returned nondet instead of undef/poison?

Assuming optimizations are mostly not impacted and there aren’t any other major issues, I’m slightly in favor of slapping !noundef on lots of loads and making load uninit -> nondet. load uninit -> nondet is more defined than the current load uninit -> undef, so there’s no need to fix LLVM users for correctness, but to keep performance frontends will need to use !noundef a lot more. This avoids adding another construct to the IR.

As mentioned before, !noundef should be fine to add to most loads as long as the frontend ensures no poison is stored to memory.

We have been working on adding noundef to clang, see D134410.

BTW, I don’t think that adding !noundef is valid for C code under the current semantics of load (where load of uninitialized memory results in undef, and thus triggers UB with !noundef), but it may be under the proposed new semantics (where load of uninit results in a nondeterministic value, which would thus presumably not trigger UB with !noundef).

In C, I believe it is permitted to load and use uninitialized memory, resulting in indeterminate values but not UB – except in two cases:

  1. C2x section 6.2.6.1 p6:
    “Certain object representations need not represent a value of the object type. If such a representation is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a non-value representation”

    Note that this allowance doesn’t apply when there are no non-value representations for a given type. E.g. on platforms LLVM supports, the fundamental integer types do not have any such non-value representations – all representations correspond to valid values.

  2. C2x section 6.3.2.1 p2: “If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.”

C++ has significantly more restrictive rules, in basic.indet.


Thanks for the references.

Just to make sure we are on the same page, C++'s standard allows the indeterminate value to be poison, right?
It seems the spirit of the standard is to have some “tombstone” that triggers UB if used, which is exactly what poison is for.

Are those the intended semantics? It would be great to clarify this. The impression I got is that !noundef (in the context of this proposal) is intended to allow avoiding the insertion of freeze poison, which would not be the case if it does not trigger UB for uninitialized memory.

The semantics of !noundef is to really trigger UB on undef/poison. That’s why it would enable us to skip freeze.

Since that is ruled out, I think we should go back to the original purpose, which is:

  • Load of uninitialized memory becomes poison instead of undef
  • Load with !uninit_is_nondet (or whatever name) gives freeze poison on uninitialized memory

Old/current APIs will be moved to the 2nd option to maintain compatibility, while clang will be moved manually and incrementally to the 1st option where possible (everywhere except unsigned char, std::byte and bitfields).
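In IR, the two options would look something like this (the metadata name is the placeholder used in this thread):

```llvm
; Option 1: plain load -- uninitialized memory yields poison
%a = load i32, ptr %p

; Option 2: annotated load -- uninitialized memory yields freeze poison
%b = load i32, ptr %p, !uninit_is_nondet !{}
```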

Any concerns (or alternatives) regarding this plan?

Thanks!

I assume you meant that it can contain poison, and in that case the loaded value will be poison?

I’d appreciate a concrete example, this is not at all clear to me.

Ouch, that is a bummer. :confused: I was hoping at least an i8 copy loop would be able to do this…

Rust integer types can certainly use !noundef. But Rust also has an explicit type for dealing with uninitialized memory. Loads at type MaybeUninit<u64> are intended to be 8-byte loads where it is okay for some (or all) of the bytes to be uninit, and crucially the initialized bytes (if any) are preserved perfectly when this value is eventually stored elsewhere in memory. Basically Rust allows users to implement a “perfect” memcpy with a MaybeUninit<u64> copy loop.

It looks like LLVM will not have a way to exactly express those semantics, but this type is used in perf-critical code, so having some LLVM encoding of these semantics that does not regress key optimizations compared to the status quo would be crucial.
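As a sketch of why this is hard to express: a MaybeUninit&lt;u64&gt; copy lowered as a plain i64 load would lose per-byte uninit-ness under either proposed semantics, which is why bytewise tracking (e.g. [8 x i8]) comes up later in this thread:

```llvm
; i64 lowering: one uninit byte makes the whole value poison/nondet,
; so the uninit-ness of individual bytes is not preserved as-is
%w = load i64, ptr %src
store i64 %w, ptr %dst

; bytewise lowering: each byte is poison/nondet independently
%b = load [8 x i8], ptr %src
store [8 x i8] %b, ptr %dst
```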

I don’t think LLVM has a correct way of specifying that today.
Without the byte type, it’s very hard to argue that one can use i8 instead as the universal holder, because of pointer escaping and so on.

I meant “must not”… as in, if it does contain poison, you likely end up with UB.

If you have a sequence like:

store i32 %possibly_poison, ptr %p
%reload = load i32, ptr %p, !uninit_is_nondet

We can replace %reload with %possibly_poison. If you have load %p, !freeze_poisoned_bits instead, that isn’t legal; you have to use freeze %possibly_poison instead.

This proposal is mostly orthogonal to the pointer provenance problem. (I mean, if you had a byte type, you could use it to lower bitfields in a different way, but we need some solution for this problem whether or not we have a byte type.)

There’s a simple solution for sinking loads: invent an intrinsic to initialize a chunk of memory (equivalent to %a = load i32, ptr %p, !uninit_is_nondet ; store i32 %a, ptr %p). Then you can duplicate any loads after the intrinsic. (The intrinsic would inhibit further optimizations, but sinking should happen late anyway, so the barrier shouldn’t matter.)
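A sketch of that trick, with a hypothetical intrinsic (names invented for illustration):

```llvm
; @llvm.initialize is hypothetical: it behaves like
;   %a = load i32, ptr %p, !uninit_is_nondet
;   store i32 %a, ptr %p
; i.e. it fixes the memory at %p to some concrete value.
call void @llvm.initialize(ptr %p, i64 4)

; after the intrinsic, this load reads initialized memory, so it can
; be freely duplicated or sunk into a loop
%v = load i32, ptr %p
```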

Oh I see, basically this is exploiting the fact that SSA values like %possibly_poison can never be uninit.

So that situation is common enough to justify adding a whole new kind of value to the LLVM value hierarchy (next to the existing undef and poison, we now have uninit, which doesn’t even exist as an SSA value but can reside in memory and is “more defined” than poison)? This only applies when there is a “possibly uninit” load after a store of a value without noundef.

Though I think I can see the point that it can be good to distinguish poison values (arising from bad arithmetic) from uninit memory. My real problem is that uninit memory is treated rather poorly by LLVM. In recent years, whenever I read “loads from uninit memory will be poison”, in my head this meant “uninit memory is the same as poison memory”, and I was happy since poison at least exists as an SSA value that can be propagated properly – the only issue being that one needs [i8 x 8] to represent an 8-byte value where each byte can be individually poison, which again is not handled well by LLVM. Now I learn this was a misunderstanding and we’re actually getting a new kind of in-memory state that cannot be represented by-value at all. I am not sure if I should be sad since this makes everything more complicated, or happy since it increases the need for a byte type that can represent all in-memory states properly…

There are two reasons why we cannot go full poison:

  1. backwards compatibility: we cannot break all of our existing API users. They should migrate to poison as they see fit.
  2. bitfields. It’s annoying, but we need to solve that issue. For those we could just use a load !freeze, as they are not very common, but for 1. I don’t think it’s ok.

Migrating existing loads to load !freeze would be correct, wouldn’t it? So is the concern that it would be too much of a perf loss for existing API users?

Anyway, the more I think about it the more I conclude that from a Rust perspective it doesn’t even make so much of a difference. If we wanted to avoid the freezing we’d have to start using [i8 x N] to get bytewise poison tracking; the only way we’d be in trouble is if we mixed that strategy with also using load !uninit_is_nondet elsewhere and I can’t see a reason why we would do that.

The more realistic concern is probably that we might want to also have a poison-like value in MIR (the Rust IR used before translating to LLVM IR), once we start getting more serious about doing optimizations ourselves. This would mean that we actually could have poison values in the programs we give to LLVM, so we’d have to be very careful about how we map that – the Rust poison value would be preserved bytewise in types like MaybeUninit<u64> and it’s not clear how we would map that to LLVM.


To switch tracks here for a bit, I am still not convinced by the name, uninit_is_nondet. This is very symmetric with freeze so it seems it could be nice to somehow indicate that, e.g. freeze_uninit or so?

You mean the mem2reg example in the original post? Is having a freeze poison an issue, since it’s basically a constant? Should we have a new constant that represents freeze poison?


Regarding the original proposal, why is the default load uninit -> poison with an optional !uninit_is_nondet instead of load uninit -> nondet with an optional !uninit_is_poison? Most metadata should be droppable. If we drop an !uninit_is_nondet, we may now load a poison instead of nondet which is bad, but if we drop an !uninit_is_poison then we may now load a nondet instead of a poison which is fine. (this also makes the API change more intuitive)
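A sketch of the asymmetry (metadata names as proposed in this thread):

```llvm
; Default uninit -> poison:
%a = load i32, ptr %p, !uninit_is_nondet !{}
; dropping the metadata changes nondet into poison: a miscompile.

; Default uninit -> nondet:
%b = load i32, ptr %p, !uninit_is_poison !{}
; dropping the metadata changes poison into nondet: a refinement,
; so it remains correct.
```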

Yes, exactly!

That works for me.

That doesn’t work. It has to be an instruction. If it was a constant, RAUW would create copies of the constant and suddenly they could have different values.

I’m not religious about this one. One reason is that we want people to use poison as much as possible for perf reasons; making that the default helps. Plus, in terms of efficiency, it’s better if the common case is not having the metadata.