[RFC] Introducing a byte type to LLVM

Hi George,

Hi all,

Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
type to LLVM to fix miscompilations due to load type punning. Please see
the proposal below. It would be great to hear the
feedback/comments/suggestions!

Motivation

char and unsigned char are considered to be universal holders in C. They
can access raw memory and are used to implement memcpy. i8 is LLVM’s
counterpart, but it does not have these semantics; giving it such semantics
is also not desirable, as it would disable many optimizations.

We therefore propose to introduce a new byte type that would have raw-memory
access semantics and can be differentiated from i8. This can help fix
unsound optimizations and make the lowering of char, unsigned char, or
std::byte correct. Currently, we denote the byte type as b<N>, where N is
the number of bits.

In our semantics, byte type carries provenance and copying bytes does not
escape pointers, thereby benefiting alias analysis. If the memory contains
poison bits, then the result of loading a byte from it is bitwise poison.
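
As a minimal sketch of what this might look like in IR (the b8 notation follows the b<N> syntax above; the exact instruction forms here are only an assumption for illustration):

; Copy one byte of raw memory. If the byte holds part of a pointer, its
; provenance is carried along, and no implicit ptrtoint/inttoptr appears.
; If the source byte holds poison bits, the loaded value is bitwise poison.
%b = load b8, b8* %src, align 1
store b8 %b, b8* %dst, align 1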

Could you elaborate on the motivation a bit? I'm not sure I follow.
Maybe examples would be good. The only one I found is in
https://bugs.llvm.org/show_bug.cgi?id=37469. For that one I'm
wondering how much we would lose in reality if we make the required
ptr2int explicit when people cast a pointer to an integer. In general,
explicit encoding seems to me preferable and overall less work (new
types kinda scare me TBH).

~ Johannes

Hi Johannes,

Sure! The underlying problem is that raw-memory access handlers are treated
as integers, while they are not really integers. This is especially true for
std::byte, which is specified to have raw-memory access semantics. This
semantic mismatch can make AA wrong and cause a pointer to escape.

Consider the following LLVM IR that copies a pointer:

%src8 = bitcast i8** %src to i8*
%dst8 = bitcast i8** %dst to i8*
call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)
%load = load i8*, i8** %dst
%addr = ptrtoint i8* %load to i64
ret i64 %addr

If we optimize the call to memcpy, then the IR becomes

%src64 = bitcast i8** %src to i64*
%dst64 = bitcast i8** %dst to i64*
%addr = load i64, i64* %src64, align 1
store i64 %addr, i64* %dst64, align 1
ret i64 %addr

Since there is no “data” type in LLVM like a byte, the ptrtoint is optimized out and the pointer escapes. If we had used a pair of byte load/store, that would not happen.
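
Roughly, using the b<N> notation from the proposal (the exact syntax is still only a sketch), the optimized copy would instead be:

%src64 = bitcast i8** %src to b64*
%dst64 = bitcast i8** %dst to b64*
%val = load b64, b64* %src64, align 1
store b64 %val, b64* %dst64, align 1

This still lowers to a single 8-byte access, but the loaded value keeps its raw-data semantics, so no implicit pointer-to-integer cast is created.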

The mentioned bug is just an example, but this pattern can be replicated in other cases that optimize memcpy as integer load/stores and drop ptrtoint, undermining alias analysis. To illustrate further, consider loading a pointer like:

%v1 = load i8*, %p
%v2 = load i8*, %p

Instcombine optimizes this into

%v1 = load i64, %p
%v2 = inttoptr %v1

changing the underlying provenance (again, not good for AA). If we had a byte type instead, then we could

  1. preserve provenance
  2. differentiate between integers and the raw data copied

This might be beneficial for some backends (e.g. CHERI) that want to differentiate between true integers and bytes that can be integers. The optimizations for known integers can become more aggressive in this case.

Thanks,

George

Hi Johannes,

Sure! The underlying problem is that raw-memory access handlers are treated
as integers, while they are not really integers. This is especially true for
std::byte, which is specified to have raw-memory access semantics. This
semantic mismatch can make AA wrong and cause a pointer to escape.

I would assume being conservative is a possibility here, right?
So if someone wants to type-pun a pointer, we would do very few
optimizations around it.

Consider the following LLVM IR that copies a pointer:

%src8 = bitcast i8** %src to i8*
%dst8 = bitcast i8** %dst to i8*
call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)
%load = load i8*, i8** %dst
%addr = ptrtoint i8* %load to i64
ret i64 %addr

If we optimize the call to memcpy, then the IR becomes

%src64 = bitcast i8** %src to i64*
%dst64 = bitcast i8** %dst to i64*
%addr = load i64, i64* %src64, align 1
store i64 %addr, i64* %dst64, align 1
ret i64 %addr

Since there is no "data" type in LLVM like a byte, the ptrtoint is optimized out and the pointer escapes. If we had used a pair of byte load/store, that would not happen.

The pointer (load %src) does escape in this example, doesn't it?

I don't get why "that would not happen" with byte load/store and why that would be correct at all.

The mentioned bug is just an example, but this pattern can be replicated in other cases that optimize memcpy as integer load/stores and drop ptrtoint, undermining alias analysis. To illustrate further, consider loading a pointer like:

%v1 = load i8*, %p
%v2 = load i8*, %p

Instcombine optimizes this into

%v1 = load i64, %p
%v2 = inttoptr %v1

changing the underlying provenance (again, not good for AA).

This example looks somewhat broken (I'll assume %v1 has integer type).

I agree this is not a great optimization, but is it wrong? We should have
done the opposite (ptr2int for v1) but even then I'm not sure what AA could
derive here. We load a pointer and then we escape it into an integer. If the
ptr2int is not removed, I'm not sure what AA should do about this. I'm also
not sure how practical this is. Do we have some idea how often such cases exist?

If we had a byte type instead, then we could

1. preserve provenance
2. differentiate between integers and the raw data copied

This might be beneficial for some backends (e.g. CHERI) that want to differentiate between true integers and bytes that can be integers. The optimizations for known integers can become more aggressive in this case.

It sounds to me like the problem is that our optimizations in the presence of
int2ptr and ptr2int are (still) unsound. We should at least not remove ptr2int,
is that it?

That said, I still don't get how a byte type helps for these examples. More concretely:

ref 1) How does a byte type preserve provenance? If it is somehow implicitly attached
would that not mean we need to now be careful doing optimizations on bytes (as we
should now be careful doing optimizations around p2i and i2p)? Literally shifting
the problem from looking for p2i and i2p to looking for bytecast?

ref 2) What optimization can I do if I have the differentiation of integers and raw data?
Can we somehow get an idea of the optimization potential?

~ Johannes

I don’t believe this is correct. LLVM does not have an innate
concept of typed memory. The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access. There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using tbaa
metadata, which are unrelated to the IR type of the access.

John.

If this is all related to https://bugs.llvm.org/show_bug.cgi?id=37469,
I don’t think anything about i8 is the ultimate problem there.

John.

I don’t believe this is correct. LLVM does not have an innate
concept of typed memory. The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access. There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using tbaa
metadata, which are unrelated to the IR type of the access.

It’s debatable whether LLVM considers memory to be typed or not. If we don’t consider memory to be typed, then all integer load operations have to be considered as potentially escaping pointers. Example:
store i32* %p, i32** %q
%q2 = bitcast i32** %q to i64*
%v = load i64, i64* %q2

This program stores a pointer and then loads it back as an integer. So there’s an implicit pointer-to-integer cast, which escapes the pointer. If we allow this situation to happen, then the alias analysis code is broken, as well as several optimizations. LLVM doesn’t consider loads as potential pointer escape sites. It would probably be a disaster (performance wise) if it did!

The introduction of the byte type allows us to make all pointer <-> integer casts explicit, so that we don’t have to treat all integer loads as escaping. It also allows us to say that LLVM optimizations are correct, and we “just” need to create a few new optimizations to get rid of the extra bytecast instructions when they are provably not needed.
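
As a sketch of how the example above might then look (the bytecast spelling here is purely illustrative; we have not fixed the exact syntax):

store i32* %p, i32** %q
%q2 = bitcast i32** %q to b64*
%v = load b64, b64* %q2       ; raw data: no implicit pointer-to-integer cast
%i = bytecast b64 %v to i64   ; illustrative: the escape is now explicit here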

TBAA is unrelated to the problem we are trying to solve here.

Nuno

Hi all,

Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
type to LLVM to fix miscompilations due to load type punning. Please see
the proposal below. It would be great to hear the
feedback/comments/suggestions!

Motivation

char and unsigned char are considered to be universal holders in C. They
can access raw memory and are used to implement memcpy. i8 is LLVM’s
counterpart, but it does not have these semantics; giving it such semantics
is also not desirable, as it would disable many optimizations.

I don’t believe this is correct. LLVM does not have an innate
concept of typed memory. The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access. There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using tbaa
metadata, which are unrelated to the IR type of the access.

It’s debatable whether LLVM considers memory to be typed or not. If we don’t consider memory to be typed, then all integer load operations have to be considered as potentially escaping pointers. Example:
store i32* %p, i32** %q
%q2 = bitcast i32** %q to i64*
%v = load i64, i64* %q2

This program stores a pointer and then loads it back as an integer. So there’s an implicit pointer-to-integer cast, which escapes the pointer. If we allow this situation to happen, then the alias analysis code is broken, as well as several optimizations. LLVM doesn’t consider loads as potential pointer escape sites. It would probably be a disaster (performance wise) if it did!

Huh? It’s not the load that’s the problem in your example, it’s the store. If we have a pointer that’s currently known to not alias anything, and that pointer gets stored somewhere, and we can’t analyze all the uses of where it’s stored, then yes, anything that could potentially load from that memory might load the pointer, and so the pointer can’t just be assumed to not alias anything anymore. We don’t need subtle reasoning about bitcasts for that, that’s just basic conservative code analysis.

If you want to be doing optimizations where we opaquely assume that certain loads don’t carry pointers, then you need frontend involvement for that; you can’t just assume that some random i64 doesn’t carry a pointer. And I’ve gotta tell you, some random i64 is allowed to carry a pointer in C, so you’re not likely to get much mileage out of that. If you make this change, all that’s going to happen is that C-ish frontends will have to make all their integer types b8, b32, b64, etc. to indicate that they can carry pointer data, and then you’ll be right back where you were in terms of optimizability. The frontends that can make stronger statements about integers can probably all make much stronger promises about aliasing than you’d get from this analysis anyway.

The introduction of the byte type allows us to make all pointer <-> integer casts explicit, so that we don’t have to treat all integer loads as escaping. It also allows us to say that LLVM optimizations are correct, and we “just” need to create a few new optimizations to get rid of the extra bytecast instructions when they are provably not needed.

It sounds an awful lot like you’re trying to optimize a different language than the one you’ve got. I can certainly see how this constraint on inttoptr might be useful, but it’s not a constraint that LLVM has historically documented or enforced, and you haven’t made a very compelling case for why it’s worth such a major and incompatible change to the language.

John.

I've never been thoroughly involved in any of the actual optimizations here, but it seems to me that there is a soundness hole in the LLVM semantics that we gloss over when we say that LLVM doesn't have typed memory.

Working backwards from what a putative operational semantics of LLVM might look like (and I'm going to ignore poison/undef because it's not relevant), I think there is agreement that integer types in LLVM are purely bitvectors. Any value of i64 5 can be replaced with any other value of i64 5 no matter where it came from. At the same time, when we have pointers involved, this is not true. Two pointers may have the same numerical value (e.g., when cast to integers), but one might not be replaceable with the other because there's other data that might not be the same. So in operational terms, pointers have both a numerical value and a bag of provenance data (probably other stuff, but let's be simple and call it provenance).
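
As a small, hypothetical illustration of this: two pointers can have the same numerical value while carrying different provenance, for example a pointer to one object and a one-past-the-end pointer into the object that happens to sit just before it:

%A = alloca [4 x i32]
%B = alloca [4 x i32]
%pA = getelementptr inbounds [4 x i32], [4 x i32]* %A, i64 0, i64 0
%pBend = getelementptr inbounds [4 x i32], [4 x i32]* %B, i64 0, i64 4
; %pA and %pBend may compare equal after ptrtoint, yet they carry different
; provenance, so replacing one with the other is not a sound transformation.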

Now we have to ask what the semantics of converting between integers and pointers are. Integers, as we've defined, don't have provenance data. So an inttoptr instruction has to synthesize that provenance somehow. Ideally, we'd want to grab that data from the ptrtoint instruction that generated the integer, but the semantics of integers means we can only launder that data globally, so that an inttoptr has the union of all of the provenance data that was ever fed into an inttoptr (I suspect the actual semantics we use is somewhat more precise than this in that it only considers those pointers that point to still-live data, which doesn't invalidate anything I'm about to talk about).

Okay, what about memory? I believe what most people intend to mean when they say that LLVM's memory is untyped is that a load or store of any type is equivalent to first converting it to an integer and then storing the integer into memory. E.g. these two functions are semantically equivalent:

define void @foo(ptr %mem, i8* %foo) {
   store i8* %foo, ptr %mem
   ret void
}
define void @bar(ptr %mem, i8* %foo) {
   %asint = ptrtoint i8* %foo to i64 ; Or whatever pointer size you have
   store i64 %asint, ptr %mem
   ret void
}

In other words, we are to accept that every load and store instruction of a pointer has an implicit inttoptr or ptrtoint attached to it. But as I mentioned earlier, pointers have this extra metadata attached to them that is lost when converting to an integer. Under this strict interpretation of memory, we *lose* that metadata every time a pointer is stored in memory, as if we did an inttoptr(ptrtoint x). Thus, the following two functions are *not* semantically equivalent in that model:

define i8* @basic(i8* %in) {
   ret i8* %in
}
define i8* @via_alloc(i8* %in) {
   %mem = alloca i8*
   store i8* %in, i8** %mem
   %out = load i8*, i8** %mem
   ret i8* %out
}

In order to allow these two functions to be equivalent, we have to let the load of a pointer recover the provenance data stored by the store of the pointer, and nothing more general. If either one of those were instead an integer load or store, then no provenance data can be communicated, so the integer and the pointer loads *must* be nonequivalent (although loading an integer instead of a pointer would presumably be a pessimistic transformation).
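
For contrast, here is a sketch of the integer round trip; under the model above, this version strips the provenance, so it cannot be equivalent to @basic:

define i8* @via_int(i8* %in) {
   %mem = alloca i64
   %asint = ptrtoint i8* %in to i64   ; provenance is dropped here
   store i64 %asint, i64* %mem
   %outint = load i64, i64* %mem
   %out = inttoptr i64 %outint to i8* ; provenance must be re-synthesized
   ret i8* %out
}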

In short, pointers have pointery bits that aren't reflected in the bitvector representation an integer has. LLVM has some optimizations that assume that loads and stores only have bitvector manipulation semantics, while other optimizations (and most of the frontends) expect that loads and stores will preserve the pointery bits. And when these interact with each other, it's undoubtedly possible that the pointery bits get lost along the way.

I completely agree with John. “i8” in LLVM doesn’t carry any implications about aliasing (in fact, LLVM pointers are going towards being typeless). Any such thing occurs at the accesses, and is part of TBAA.

I’m opposed to adding a byte type to LLVM, as such semantic carrying types are entirely unprecedented, and would add tremendous complexity to the entire system.

-Chris

Hi,

I would also oppose adding a byte type, but mainly because the bug
report mentioned (https://bugs.llvm.org/show_bug.cgi?id=37469) is not
a bug at all.
The example in the bug report is just badly written C code.
Specifically:

int main() {
  int A[4], B[4];
  printf("%p %p\n", A, &B[4]);
  if ((uintptr_t)A == (uintptr_t)&B[4]) {
    store_10_to_p(A, &B[4]);
    printf("%d\n", A[0]);
  }
  return 0;
}

"int B[4];" allows values between 0 and 3 only, and referring to 4 in
&B[4] is undef, so in my view, it is correctly optimised out which is
why it disappears in -O3.

Kind Regards

James

Also, the comment below is wrong. At this point, arr3 is equivalent to
arr2, which is q.

  // Now arr3 is equivalent to arr1, which is p.
  int *r;
  memcpy(&r, (unsigned char *)arr3, sizeof(r));
  // Now r is p.
  *p = 1;
  *r = 10;

Hi George,

I don't think it is a scalable model to add a new type just to benefit an analysis and draw specific conclusions from it. I can argue that if b8 is proposed then why not b16 for half? Why not b32 for some other reason? This won't stop just there, and one can go beyond and introduce types to benefit domain-specific languages.

Given the problem, I'd say we should think about a way to annotate types with attributes, metadata, or flags which optimizations can use to do a better job. The attribute/metadata could carry the semantic meaning for the type. Frontends can generate this "type attribute/metadata" and optimizations can choose to use this extra information to do a better job. It would be a hint, though, and not a mandate for optimizations. This approach is very similar to attributes in LLVM IR: just like an IR function can have attributes, a type can also possess attributes/metadata. (Whether it should be an attribute or metadata is a choice, but that would not deviate from the purpose.)

This approach is far more adoptable and convincing than introducing a whole new type, which would add massive complexity to the type system.

I'll take this opportunity to point out that, at least historically, the reason why a desire to optimize around ptrtoint keeps resurfacing is because:

1. Common optimizations introduce them into code that did not otherwise have them (SROA, for example, see convertValue in SROA.cpp).

2. They're generated by some of the ABI code for argument passing (see clang/lib/CodeGen/TargetInfo.cpp).

3. They're present in certain performance-sensitive code idioms (see, for example, ADT/PointerIntPair.h); a sketch of the idiom follows this list.
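
For (3), a hand-written sketch of the kind of IR such pointer-tagging idioms end up producing (illustrative only; this is not the exact code generated for PointerIntPair):

%addr = ptrtoint i8* %p to i64
%tagged = or i64 %addr, 1          ; stash a flag in the low bit of an aligned pointer
%cleared = and i64 %tagged, -2     ; mask the flag back out
%q = inttoptr i64 %cleared to i8*  ; the pointer round-trips through an integer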

It seems to me that, if there's design work to do in this area, one should consider addressing these now-long-standing issues where we introduce ptrtoint by replacing this mechanism with some other one.

-Hal

Hi Madhur,

I can argue that if b8 is proposed then why not b16 for half? Why not b32 for some other reason?

We do propose to have b<N> for different Ns, as per the proposal:

we denote the byte type as b<N>, where N is the number of bits.

Maybe that was not explicit enough. But note that the only byte bit width produced
by the frontend is b8 (from char, unsigned char, or std::byte). The other bit widths
come from LLVM optimizations, such as the already-mentioned memcpy transformation:

%src8 = bitcast i8** %src to i8*
%dst8 = bitcast i8** %dst to i8*
call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)

is transformed (by instcombine) into

%src64 = bitcast i8** %src to i64*
%dst64 = bitcast i8** %dst to i64*
%val = load i64, i64* %src64
store i64 %val, i64* %dst64

What we propose is to have roughly

%src64 = bitcast i8** %src to b64*
%dst64 = bitcast i8** %dst to b64*
%val = load b64, b64* %src64
store b64 %val, b64* %dst64

Having this just copies memory as-is, and cannot introduce implicit ptr2int/int2ptr casts.

Given the problem, I'd say we should think about a way to annotate types with attributes, metadata, or flags which optimizations can use to do a better job. The attribute/metadata could carry the semantic meaning for the type. Frontends can generate this "type attribute/metadata" and optimizations can choose to use this extra information to do a better job. It would be a hint, though, and not a mandate for optimizations. This approach is very similar to attributes in LLVM IR: just like an IR function can have attributes, a type can also possess attributes/metadata. (Whether it should be an attribute or metadata is a choice, but that would not deviate from the purpose.)

I see your point, and it is definitely easier to fix everything with metadata/attributes.
I am concerned with this approach because it just postpones a bigger problem.
There are already lots of metadata and attributes, and IR/optimizations have become
complicated enough. Instead of fixing the problem while we can, we just add attributes,
then more attributes… I believe that at some point this will just become legacy and
impossible to fix.

Thanks,
George

Taking the address of the “one-past-the-end” element of an array is perfectly legal in both C and C++.
*Dereferencing* that pointer is UB.

— Marshall

Not just dereferencing, but also comparing it to a random other pointer?

Joerg

According to C17 it is legal:

6.5.9.p6

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

This is interesting. I’m curious (we’re a bit off-topic though): how is the part about “that happens to immediately follow the first array object in the address space” defined?

Hi Mehdi,

I couldn’t find a paragraph in the text that explicitly specifies how objects are laid out in memory. It seems to be deliberately underspecified to allow compilation to exotic target machines.

However, I could find a few paragraphs that look interesting:

6.2.4p2: An object exists, has a constant address (footnote 33), and retains its last-stored value throughout its lifetime.

  Footnote 33: The term “constant address” means that two pointers to the object constructed at possibly different times will compare equal. The address may be different during two different executions of the same program.

6.3.2.3p5, footnote 68: The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.

7.22.3p1

… Each such allocation shall yield a pointer to an object disjoint from any other object. The pointer returned points to the start (lowest byte address) of the allocated space. …

https://web.archive.org/web/20181230041359if_/http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf

Going back to the bug report, I don’t think these paragraphs affect the validity of the example.

Thanks,
Juneyoung