[RFC] Load Instruction: Uninitialized Memory Semantics

I think this code is UB anyway, since the load might go out of bounds. Imagine I have an alloca of size 1 that stores a 0 byte. Calling strlen on that is perfectly legal, but (assuming suitable alignment) that strlen will do a word-sized load from this alloca, which is UB.

On top of that, doing such a load at word type probably violates the effective type rule (strict aliasing), making it UB in C for yet another reason.
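To make the out-of-bounds concern concrete, here is a toy Python model (not LLVM semantics, just a sketch). `Allocation`, `OutOfBounds`, `strlen_bytewise`, `strlen_wordwise` and the 8-byte `WORD` size are all hypothetical names and assumptions of mine; the raised exception stands in for UB.

```python
WORD = 8  # assumed word size in bytes


class OutOfBounds(Exception):
    """Stands in for UB: an access that leaves the allocation."""


class Allocation:
    """Toy model of a single alloca: a fixed-size run of bytes with bounds checks."""

    def __init__(self, data: bytes):
        self.data = bytes(data)

    def load(self, offset: int, size: int) -> bytes:
        if offset < 0 or offset + size > len(self.data):
            raise OutOfBounds(
                f"load of {size} byte(s) at offset {offset} "
                f"from a {len(self.data)}-byte allocation"
            )
        return self.data[offset:offset + size]


def strlen_bytewise(a: Allocation) -> int:
    """Reads one byte at a time, so it never touches bytes past the terminator."""
    n = 0
    while a.load(n, 1) != b"\x00":
        n += 1
    return n


def strlen_wordwise(a: Allocation) -> int:
    """Reads WORD bytes at a time, as an optimized strlen might."""
    n = 0
    while True:
        word = a.load(n, WORD)   # may reach past the end of the allocation
        idx = word.find(0)       # position of the first 0 byte, or -1
        if idx != -1:
            return n + idx
        n += WORD
```

On a 1-byte allocation holding a single 0 byte, `strlen_bytewise` returns 0, while `strlen_wordwise` raises `OutOfBounds` on its very first load, which is the situation described above.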

I brought up this concern in “Remove undef: move uninitialized memory to poison”, and was told that there would be a flag that can be set on a load to add an implicit freeze on each individual byte, so that a single uninit byte does not turn the entire load into poison. It seems to me that !uninit_is_nondet is exactly that flag? I am confused that it does not have ‘freeze’ in the name. It also introduces a new concept to the IR, “uninit”, which is not something LLVM currently has (it has undef and poison, both of which are different kinds of uninit).

Why is it necessary to introduce an entirely new concept to the IR here? How is “uninit” defined, and how does it interact with all the other opcodes? For instance, if I do a (non-freezing) load of uninit and then store that back, is that memory still uninit? Presumably it has to be, because this load-store roundtrip could be optimized away. But this means that SSA values can now be “uninit” in addition to being poison/undef, and we need to define how uninit propagates through arithmetic operations and so on.

We already have enough issues with undef and poison; I don’t think we want more of that. IMO this proposal would be much improved by not adding a third, new concept to the mix. :) Instead it can build on the existing concepts of undef, poison, and freeze, making it much easier to explain and understand what happens.

I would suggest the following semantics: each byte is loaded separately as if by a regular i8 load, then freeze is called on each of them, and then the bytes are put together to form a value of the load type. (There are some subtleties around loads of pointer type and loading provenance that might make a proper description of the semantics more complicated, but the LangRef doesn’t really talk about those aspects of LLVM semantics currently, so it seems fine to also omit that here, for now.)
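As a rough illustration of that suggestion, here is a small Python model of such a load. It is only a sketch under my own assumptions: memory is a list of bytes, `POISON` and `UNDEF` are sentinel objects, the names `freeze` and `load_freezing` are hypothetical, the value is assembled little-endian, and the "arbitrary" value picked by freeze comes from a `choose` callback. Pointer types and provenance are ignored entirely.

```python
POISON = object()  # sentinel: a poison byte in memory
UNDEF = object()   # sentinel: an undef byte in memory


def freeze(byte, choose=lambda: 0):
    """Model of freeze on an i8: a defined byte passes through unchanged;
    poison or undef becomes some arbitrary defined value (from `choose`)."""
    if byte is POISON or byte is UNDEF:
        value = choose()
        if not 0 <= value <= 0xFF:
            raise ValueError("choose() must return a byte value")
        return value
    return byte


def load_freezing(memory, offset, size, choose=lambda: 0):
    """Load `size` bytes one at a time, freeze each byte individually,
    then assemble an integer (little-endian is an assumption here)."""
    if offset < 0 or offset + size > len(memory):
        raise IndexError("out-of-bounds load")
    frozen = [freeze(memory[offset + i], choose) for i in range(size)]
    return int.from_bytes(bytes(frozen), "little")


# Two of the four bytes are not initialized; the defined bytes survive.
memory = [0x34, POISON, 0x12, UNDEF]
result = load_freezing(memory, 0, 4)
```

With the default `choose`, `result` is `0x00120034`: the two defined bytes are preserved and only the poison and undef bytes are replaced, and both are handled by the same code path.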

This behaves the same on poison and undef. We can then eventually drop undef entirely, say that new memory is initially poison, and everything should be coherent.