[RFC] Introducing a byte type to LLVM

It's true: an alternative to introducing a new byte type is to make alias analysis more conservative. But I think it's way worse than you think.
Making AA conservative means that every single integer load must be considered to be escaping a pointer. Unless you know the last stored value was an integer, you would have to conservatively assume that the load may be escaping some pointer (even if only partially). This sounds terrible to me. Type punning through memory is probably very uncommon, and it seems we would be slowing down all programs because of such an uncommon case.

Therefore I feel it's best if we introduce a way to explicitly mark potential type punning situations, such that AA (and optimizers) only need to be conservative when needed. It's not the case that right now we can detect the few potential type punning cases and be conservative just there. We would need to be conservative in e.g. every function that loads an argument pointer.
I agree it's scary to introduce a new type in the IR; we don't do that very often. But we need to fix the bugs in LLVM.

Nuno

I think the lack of an end-to-end example is making this harder (for me) to understand.

It's true: an alternative to introducing a new byte type is to make alias analysis more conservative. But I think it's way worse than you think.
Making AA conservative means that every single integer load must be considered to be escaping a pointer.

What pointer is escaping if I have an integer load? I'm sorry for my questions but I assume I'm not the only one who might
want to understand this properly. So far, I've assumed that if I store a pointer away w/o knowing what happens to the memory, the
pointer escapes. This is our current model and it's not that bad.

  Unless you know the last stored value was an integer, you would have to conservatively assume that the load may be escaping some pointer (even if only partially). This sounds terrible to me. Type punning through memory is probably very uncommon, and it seems we would be slowing down all programs because of such an uncommon case.

Therefore I feel it's best if we introduce a way to explicitly mark potential type punning situations, such that AA (and optimizers) only need to be conservative when needed. It's not the case that right now we can detect the few potential type punning cases and be conservative just there. We would need to be conservative in e.g. every function that loads an argument pointer.
I agree it's scary to introduce a new type in the IR; we don't do that very often. But we need to fix the bugs in LLVM.

What doesn't add up for me is that on the one hand the argument is that "everything needs to be conservative" w/o byte types, but at the same time we can somehow determine locally when to use byte types? If the latter were not the case, how would we know when we need a byte type?

Why would we not be conservative/explicit about type punning wherever we would introduce byte types with this proposal? I unfortunately don't understand what the byte types give us. My best guess is that they explicitly mark byte operations on pointers, which we could do by other means, I assume (explicit p2i, i2p). If there is more to it, I think we need an end-to-end example that shows how we cannot make type punning explicit with current means and why a new type is conceptually better.

~ Johannes

I think Joshua gave a very nice motivation already. I'll just complement with an alias analysis example. (plus see my reply to Chris)

define void @f() {
  %p = alloca i8
  %q = alloca i8*
  store i8 42, i8* %p
  store i8* %p, i8** %q

  %pp = ... ; some ptr arithmetic; full may-alias
  %v = load i64, i64* %pp
  call void @g(i64 %v)

  %w = load i8, i8* %p ; can we store-forward & RAUW 42?
  ret void
}

So, we pass an integer to some function. If we don't distinguish between pointers and integers (current memory model), we need to assume that %v may carry the address of %p, as the load may have done an implicit pointer-to-integer cast. Therefore, pointer %p is escaped.
Thus, in this model we cannot do store forwarding of 42 to the load of %w. Except that LLVM doesn't consider integer loads as escaping pointers. This is wrong, and that's one of the reasons why we miscompile integer/pointer casts. The other reason is what I explained in the reply to John & Chris, which is that some integer optimizations aren't correct for pointers, so we cannot represent pointers as integers (without appropriate p2i/i2p casts).

Another model is to say that integers can only carry integers. Above there's no pointer-to-integer cast of %p, hence %v cannot have the address of %p, and thus store-forwarding can happen.
This model requires a "universal holder" different than (ab)using integers.

Introducing a byte type means we only need to be careful with char loads, but not with short, int, long, etc. loads. Clang would lower char loads to byte loads, and the remaining integer types straight to i<n> loads. So we can go full speed on integers, and throttle down just on chars, which is all we can do because of the semantics of chars in C.
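To make the lowering concrete, here's a rough sketch, using "b8" as a placeholder spelling for the 8-bit byte type (the final syntax is of course up for discussion):

define i32 @sketch(b8* %cp, i32* %ip) {
  %c = load b8, b8* %cp   ; char load: may carry pointer bytes, AA stays careful
  %i = load i32, i32* %ip ; int load: never carries a pointer, full-speed AA
  ret i32 %i
}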

Nuno

I think Joshua gave a very nice motivation already.

I don't dispute that but I am still not understanding the need for bytes. None of the examples I have seen so far
clearly made the point that it is the byte types that provide a substantial benefit. The AA example below does neither.

I'll just complement with an alias analysis example. (plus see my reply to Chris)

define void @f() {
   %p = alloca i8
   %q = alloca i8*
   store i8 42, i8* %p
   store i8* %p, i8** %q

   %pp = ... ; some ptr arithmetic; full may-alias
   %v = load i64, i64* %pp
   call void @g(i64 %v)

   %w = load i8, i8* %p ; can we store-forward & RAUW 42?
   ret void
}

So, we pass an integer to some function. If we don't distinguish between pointers and integers (current memory model), we need to assume that %v may carry the address of %p, as the load may have done an implicit pointer-to-integer cast. Therefore, pointer %p is escaped. Thus, in this model we cannot do store forwarding of 42 to the load of %w.

The conclusion that we cannot forward 42 is true if %pp has access to %p or %q, which is not clear in the above code.
If no use of %p or %q somehow makes its way into %pp, nor a load of %q, %pp can be shown to be independent of %q already, regardless of whether we already perform this reasoning or not.

  Except that LLVM doesn't consider integer loads as escaping pointers. This is wrong, and that's one of the reasons why we miscompile integer/pointer casts. The other reason is what I explained in the reply to John & Chris, which is that some integer optimizations aren't correct for pointers, so we cannot represent pointers as integers (without appropriate p2i/i2p casts).

At least I (and probably others) am arguing for the appropriate and explicit p2i/i2p casts instead of an entirely new type that does the same job.

Another model is to say that integers can only carry integers. Above there's no pointer-to-integer cast of %p, hence %v cannot have the address of %p, and thus store-forwarding can happen.
This model requires a "universal holder" different than (ab)using integers.

Storing %p into %q causes %p to be escaped by default, IMHO. With the newest definition of escapes, we allow analyses to prove %q is not used in "a weird way" and therefore we can pretend %p is not escaped. This effectively requires us to look at all the loads of %q and assume they are potential copies of %p. If such a load is (potentially) an integer, you'd need to decide if the (potential) integer use can escape %p, same as you would decide any other load of %q would. I fail to see the need for byte types here.

Introducing a byte type means we only need to be careful with char loads, but not with short, int, long, etc. loads. Clang would lower char loads to byte loads, and the remaining integer types straight to i<n> loads. So we can go full speed on integers, and throttle down just on chars, which is all we can do because of the semantics of chars in C.

This will cause a lot of side effects, in addition to the work required to make this happen in the first place.
Even if I assume we really want byte types for AA, which I am not sure of given the lack of a convincing example, we should determine what the impact would be here, e.g., restrict some optimizations in the presence of i8 operations and get some idea of the cost.

~ Johannes

Hi Johannes,

I think Joshua gave a very nice motivation already.

I don't dispute that but I am still not understanding the need for bytes. None of the examples I have seen so far
clearly made the point that it is the byte types that provide a substantial benefit. The AA example below does neither.

I hope <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html> makes a convincing case that under the current semantics, when one does an "i64" load of a value that was stored at pointer type, we have to say that this load returns poison. In particular, saying that this implicitly performs a "ptrtoint" is inconsistent with optimizations that are probably too important to be changed to accommodate this implicit "ptrtoint".

If/once we agree on that, "byte" is a fairly natural proposal IMO: we need some type at which to perform a load that does *not* turn some kinds of data into poison. "byte" is that type.
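For illustration, with a hypothetical "b64" type for a 64-bit byte (the exact spelling does not matter for the argument):

define void @example(i8* %p, i8** %q) {
  store i8* %p, i8** %q
  %qi = bitcast i8** %q to i64*
  %v1 = load i64, i64* %qi     ; returns poison: a pointer read at integer type
  %qb = bitcast i8** %q to b64*
  %v2 = load b64, b64* %qb     ; well-defined: the byte load preserves the data
  ret void
}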

Kind regards,
Ralf

I think it is in fact rather obvious that, if this optimization as currently written is indeed in conflict with the current semantics, it is the optimization that will have to give. If the optimization is too important for performance to give up entirely, we will simply have to find some more restricted pattern that we can still soundly optimize.

Perhaps the clearest reason is that, if we did declare that integer types cannot carry pointers and so introduced byte types that could, C frontends would have to switch to byte types for their integer types, and so we would immediately lose this supposedly important optimization for C-like languages, and so, since optimizing C is very important, we would immediately need to find some restricted pattern under which we could soundly apply this optimization to byte types. That’s assuming that this optimization is actually significant, of course.

John.

Hi Johannes,

I think Joshua gave a very nice motivation already.

I don’t dispute that but I am still not understanding the need for
bytes. None of the examples I have seen so far
clearly made the point that it is the byte types that provide a
substantial benefit. The AA example below does neither.

I hope
<https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html>
makes a convincing case that under the current semantics, when one
does an “i64” load of a value that was stored at pointer type, we have
to say that this load returns poison. In particular, saying that this
implicitly performs a “ptrtoint” is inconsistent with optimizations
that are probably too important to be changed to accommodate this
implicit “ptrtoint”.

I think it is in fact rather obvious that, if this optimization as
currently written is indeed in conflict with the current semantics, it
is the optimization that will have to give. If the optimization is too
important for performance to give up entirely, we will simply have to
find some more restricted pattern that we can still soundly optimize.

I tend to agree. I don’t think Ralf’s example alone is convincing evidence that an integer load of a pointer store must be poison, i.e. that memory must be typed.

FWIW, the least important optimization in that example’s chain, and the one that is most obviously incorrect in an untyped memory world, is eliminating a store of a previously loaded value. How much would we actually lose if we disable this particular optimization? Note that this is only a small special case of dead store elimination. The more common case where there are two stores to memory in a row, and the first one is eliminated, is still correct.
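Concretely, the special case I mean looks roughly like this:

define void @write_back(i64* %p) {
  %v = load i64, i64* %p  ; if a pointer was stored here, %v has lost its
                          ; provenance (assuming integers carry none)
  store i64 %v, i64* %p   ; then this is not a no-op: it overwrites a pointer
                          ; with a provenance-free integer, and deleting it
                          ; is the questionable step
  ret void
}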

Cheers,
Nicolai

Hi,

I don't dispute that but I am still not understanding the need for bytes. None of the examples I have seen so far
clearly made the point that it is the byte types that provide a substantial benefit. The AA example below does neither.

I hope <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html> makes a convincing case that under the current semantics, when one does an "i64" load of a value that was stored at pointer type, we have to say that this load returns poison. In particular, saying that this implicitly performs a "ptrtoint" is inconsistent with optimizations that are probably too important to be changed to accommodate this implicit "ptrtoint".

I think it is in fact rather obvious that, if this optimization as currently written is indeed in conflict with the current semantics, it is the optimization that will have to give. If the optimization is too important for performance to give up entirely, we will simply have to find some more restricted pattern that we can still soundly optimize.

That is certainly a reasonable approach.
However, judging from how reluctant LLVM is to remove optimizations that are much more convincingly wrong [1], my impression was that it is easier to complicate the semantics than to remove an optimization that LLVM already performs.

[1]: https://bugs.llvm.org/show_bug.cgi?id=34548 ("InstCombine cannot blindly
      assume that inttoptr(ptrtoint x) -> x");
      https://bugs.llvm.org/show_bug.cgi?id=35229;
      see "Pointers Are Complicated II, or: We need better language specs" for a
      more detailed explanation

Perhaps the clearest reason is that, if we did declare that integer types cannot carry pointers and so introduced byte types that could, C frontends would have to switch to byte types for their integer types, and so we would immediately lose this supposedly important optimization for C-like languages, and so, since optimizing C is very important, we would immediately need to find some restricted pattern under which we could soundly apply this optimization to byte types. That’s assuming that this optimization is actually significant, of course.

At least C with strict aliasing enabled (i.e., standard C) only needs to use the byte type for "(un)signed char". The other integer types remain unaffected. There is no arithmetic on these types ("char + char" is subject to integer promotion), so the IR overhead would consist of a few "bytecast" instructions next to / replacing the existing sign extensions that convert "char" to "int" before performing the arithmetic.
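For example, "int i = c1 + c2;" might lower along these lines, treating "b8" and "bytecast" as the proposed but not-yet-final spellings:

define i32 @add_chars(b8* %p1, b8* %p2) {
  %c1 = load b8, b8* %p1
  %c2 = load b8, b8* %p2
  %i1 = bytecast b8 %c1 to i8 ; poison if %c1 holds pointer bytes
  %i2 = bytecast b8 %c2 to i8
  %s1 = sext i8 %i1 to i32    ; the usual char-to-int promotion
  %s2 = sext i8 %i2 to i32
  %r  = add i32 %s1, %s2
  ret i32 %r
}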

Kind regards,
Ralf

The semantics you seem to want are that LLVM’s integer types cannot carry information from pointers. But I can cast a pointer to an integer in C and vice-versa, and compilers have de facto defined the behavior of subsequent operations like breaking the integer up (and then putting it back together), adding numbers to it, and so on. So no, as a C compiler writer, I do not have a choice; I will have to use a type that can validly carry pointer information for integers in C.

Since you seem to find this sort of thing compelling, please note that even a simple assignment like char c2 = c1 technically promotes through int in C, and so int must be able to carry pointer information if char can.

John.

In case anyone reading this thread is unaware, there is currently a proposal before the C standards committee that goes into more details on a future provenance model for C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf. It actually details a few different provenance models, especially several variations on integers-do-not-have-provenance, mainly around what provenance is reconstructed on an inttoptr cast. While I’m not on the standards committee, I believe that something along the lines of this proposal will become the official C provenance rules.

While LLVM isn’t required to adopt C rules wholesale, the basic form of the proposal already matches applied LLVM semantics better than our own language reference (we do not consider integers to carry provenance). There are several points in the proposal where extra annotations are suggested to effect various provenance properties (e.g., the ability to write word-level memcpy), and offhand, it looks like the byte proposal would be a viable vehicle for those extra annotations, although I’m not entirely persuaded that it’s the best vehicle.

In case anyone reading this thread is unaware, there is currently a proposal before the C standards committee that goes into more details on a future provenance model for C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf. It actually details a few different provenance models, especially several variations on integers-do-not-have-provenance, mainly around what provenance is reconstructed on an inttoptr cast. While I’m not on the standards committee, I believe that something along the lines of this proposal will become the official C provenance rules.

While LLVM isn’t required to adopt C rules wholesale, the basic form of the proposal already matches applied LLVM semantics better than our own language reference (we do not consider integers to carry provenance). There are several points in the proposal where extra annotations are suggested to effect various provenance properties (e.g., the ability to write word-level memcpy), and offhand, it looks like the byte proposal would be a viable vehicle for those extra annotations, although I’m not entirely persuaded that it’s the best vehicle.

To be clear, while that TR represents enormous progress (and I agree that we should be closely following that discussion), it openly acknowledges that it doesn’t know how representation copies are supposed to work. It’s clear that library calls like memcpy are supposed to preserve provenance exactly (rather than doing the more conservative thing that round-tripping through an integer would require), but code that simply does the same thing inline needs to have the same effect, and nobody knows how that’s supposed to happen if integers don’t carry provenance.
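By code that does the same thing inline, I mean a byte-by-byte representation copy along these lines (a minimal sketch):

#include <stddef.h>

/* Copies a pointer object byte by byte, as user code is allowed to do.
   The TR wants this to preserve provenance exactly, like memcpy, yet
   each unsigned char element is an integer type. */
void copy_ptr_repr(void **dst, void **src) {
  unsigned char *d = (unsigned char *)dst;
  unsigned char *s = (unsigned char *)src;
  for (size_t i = 0; i < sizeof(void *); i++)
    d[i] = s[i];
}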

John.

The semantics you seem to want are that LLVM’s integer types cannot carry information from pointers. But I can cast a pointer to an integer in C and vice-versa, and compilers have de facto defined the behavior of subsequent operations like breaking the integer up (and then putting it back together), adding numbers to it, and so on. So no, as a C compiler writer, I do not have a choice; I will have to use a type that can validly carry pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer types not carry pointer information does not necessarily mean that dereferencing a pointer cast from an integer is UB.

For example, the definition of cast_ival_to_ptrval in the n2676 proposal shows that a pointer’s provenance is reconstructed from an integer.

(Whether n2676’s cast_ival_to_ptrval can be also used for LLVM’s inttoptr semantics is a different question, though)
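In C terms, the reconstruction looks like this sketch:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  int x = 42;
  intptr_t i = (intptr_t)&x; // the ptr-to-int cast exposes x's storage
  int *p = (int *)i;         // provenance is reconstructed at this cast
  printf("%d\n", *p);        // well-defined: prints 42
  return 0;
}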

Since you seem to find this sort of thing compelling, please note that even a simple assignment like char c2 = c1 technically promotes through int in C, and so int must be able to carry pointer information if char can.

IIUC integer promotion is done when it is used as an operand of arithmetic ops or switch’s condition, so I think the assignment operation is okay.

Juneyoung

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back
together), adding numbers to it, and so on. So no, as a C compiler writer,
I do not have a choice; I will have to use a type that can validly carry
pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer cast from an integer is UB.

What exactly is the claimed formal property of byte types, then,
that integer types will lack? Because it seems to me that converting
from an integer gives us valid provenance in strictly more situations
than converting from bytes, since it reconstructs provenance if there’s
any object at that address (under still-debated restrictions),
while converting from bytes always preserves the original provenance
(if any). I don’t understand how that can possibly give us more
flexibility to optimize integers.

Since you seem to find this sort of thing compelling, please note that
even a simple assignment like char c2 = c1 technically promotes through
int in C, and so int must be able to carry pointer information if char
can.

IIUC integer promotion is done when it is used as an operand of arithmetic
ops or switch’s condition, so I think the assignment operation is okay.

Hmm, I was misremembering the rule, you’re right.

John.

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back
together), adding numbers to it, and so on. So no, as a C compiler writer,
I do not have a choice; I will have to use a type that can validly carry
pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer cast from an integer is UB.

What exactly is the claimed formal property of byte types, then,
that integer types will lack? Because it seems to me that converting
from an integer gives us valid provenance in strictly more situations
than converting from bytes, since it reconstructs provenance if there’s
any object at that address (under still-debated restrictions),
while converting from bytes always preserves the original provenance
(if any). I don’t understand how that can possibly give us more
flexibility to optimize integers.

When two objects are adjacent, and an integer is exactly pointing to the location between them, its provenance cannot be properly recovered.

int x[1], y[1];
llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] == 0x104);

int *p = (int*)(intptr_t)&x[1];
// Q: Is p’s provenance x or y?

If it is expected that ‘*(p-1)’ is equivalent to x, p’s provenance should be x.
However, based on llvm.assume, optimizations on integers can replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what happened in the bug report).
Then, ‘*(p-1)’ suddenly becomes an out-of-bounds access, which is UB.
So, p’s provenance isn’t simply x or y; it should be something that can access both x and y.

This implies that, unless there is a guarantee that all allocated objects are one or more bytes apart, there is no type that can perfectly store a pointer byte.
memcpy(x, y, 8) isn’t equivalent to ‘v=load i64 y;store i64 v, x’ because v already lost the pointer information.

The pointer information is perfectly stored in a byte type. But, arithmetic property-based optimizations such as the above one are not correct anymore.
Here is an example with a byte-type version:

int x[1], y[1];
// byte_8 is a 64-bit byte type

llvm.assume((byte_8)&x[0] == 0x100 && (byte_8)&y[0] == 0x104);

int *p = (int*)(byte_8)&x[1];

// p’s provenance is always x.

For a byte type, an equality comparison returning true does not mean that the two values are precisely equal.
Since (byte_8)&x[1] and (byte_8)&y[0] have different provenances, replacing one with another must be avoided.
Instead, we can guarantee that p is precisely equivalent to &x[1].
Another benefit is that optimizations on integers do not need to suffer from this pointer business anymore; e.g., the optimization on llvm.assume above can survive, and it does not need to check whether an integer variable is derived from a pointer value.

Hi,

The semantics you seem to want are that LLVM’s integer types cannot carry information from pointers. But I can cast a pointer to an integer in C and vice-versa, and compilers have de facto defined the behavior of subsequent operations like breaking the integer up (and then putting it back together), adding numbers to it, and so on. So no, as a C compiler writer, I do not have a choice; I will have to use a type that can validly carry pointer information for integers in C.

Integers demonstrably do not carry provenance; see <https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed explanation of why.
As a consequence of this, ptr-int-ptr roundtrips are lossy: some of the original provenance information is lost. This means that optimizing away such roundtrips is incorrect, and indeed doing so leads to miscompilations (https://bugs.llvm.org/show_bug.cgi?id=34548, "InstCombine cannot blindly assume that inttoptr(ptrtoint x) -> x").

The key difference between int and byte is that ptr-byte-ptr roundtrips are *lossless*, all the provenance is preserved. This means some extra optimizations (such as removing these roundtrips -- which implicitly happens when a redundant-store-after-load is removed), but also some lost optimizations (most notably, "x == y" does not mean x and y are equal in all respects; their provenance might still differ, so it is incorrect for GVN to replace one by the other).

It's a classic tradeoff: we can *either* have lossless roundtrips *or* "x == y" implies full equality of the abstract values. Having both together leads to contradictions, which manifest as miscompilations. "byte" and "int" represent the two possible choices here; therefore, by adding "byte", LLVM would close a gap in the expressive power of its IR.
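To sketch the GVN side of that tradeoff (again with a hypothetical "b64" spelling, and assuming equality comparison is permitted on byte types):

define b64 @pick(b64 %x, b64 %y) {
  %c = icmp eq b64 %x, %y
  br i1 %c, label %eq, label %ne
eq:
  ; rewriting %x to %y here (GVN-style) would be unsound: the raw bytes
  ; compare equal, but the two values may carry different provenance
  ret b64 %x
ne:
  ret b64 %x
}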

Kind regards,
Ralf

That article is nice in the way it is trying to describe what the problem is.
Extract from the article:
"How can we fix this?
To fix the problem, we will have to declare one of the three
optimizations incorrect and stop performing it. Speaking in terms of
the LLVM IR semantics, this corresponds to deciding whether pointers
and/or integers have provenance:
1) We could say both pointers and integers have provenance, which
invalidates the first optimization.
2) We could say pointers have provenance but integers do not, which
invalidates the second optimization.
3) We could say nothing has provenance, which invalidates the third
optimization."

This is really the part that I disagree with. There are more possible
alternatives than just 1, 2 and 3.

Kind Regards

James

Hi James,

Integers demonstrably do not carry provenance; see
<https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed
explanation of why.

That article is nice in the way it is trying to describe what the problem is.
Extract from the article:
"How can we fix this?
To fix the problem, we will have to declare one of the three
optimizations incorrect and stop performing it. Speaking in terms of
the LLVM IR semantics, this corresponds to deciding whether pointers
and/or integers have provenance:
1) We could say both pointers and integers have provenance, which
invalidates the first optimization.
2) We could say pointers have provenance but integers do not, which
invalidates the second optimization.
3) We could say nothing has provenance, which invalidates the third
optimization."

This is really the part that I disagree with. There are more possible
alternatives than just 1, 2 and 3.

I am interested to hear those. :)
Certainly, we have to invalidate at least one of the optimizations -- if all 3 optimizations are valid, we have a contradiction.

Kind regards,
Ralf

Hi,

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back together),
adding numbers to it, and so on. So no, as a C compiler writer, I do not have a
choice; I will have to use a type that can validly carry pointer information for
integers in C.

Integers demonstrably do not carry provenance; see
<https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed
explanation of why.
As a consequence of this, ptr-int-ptr roundtrips are lossy: some of the original
provenance information is lost. This means that optimizing away such roundtrips
is incorrect, and indeed doing so leads to miscompilations
(https://bugs.llvm.org/show_bug.cgi?id=34548).

The key difference between int and byte is that ptr-byte-ptr roundtrips are
lossless, all the provenance is preserved. This means some extra optimizations
(such as removing these roundtrips – which implicitly happens when a
redundant-store-after-load is removed), but also some lost optimizations (most
notably, “x == y” does not mean x and y are equal in all respects; their
provenance might still differ, so it is incorrect for GVN to replace one by the
other).

It’s a classic tradeoff: we can either have lossless roundtrips

I think an important part of explaining the motivation for “byte” would be an explanation/demonstration of what the cost of losing “lossless roundtrips” would be.

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back
together), adding numbers to it, and so on. So no, as a C compiler writer,
I do not have a choice; I will have to use a type that can validly carry
pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer casted from integer is UB.

What exactly is the claimed formal property of byte types, then,
that integer types will lack? Because it seems to me that converting
from an integer gives us valid provenance in strictly more situations
than converting from bytes, since it reconstructs provenance if there’s
any object at that address (under still-debated restrictions),
while converting from bytes always preserves the original provenance
(if any). I don’t understand how that can possibly give us more
flexibility to optimize integers.

When two objects are adjacent, and an integer is exactly pointing to the
location between them, its provenance cannot be properly recovered.

int x[1], y[1];
llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] == 0x104);
int *p = (int*)(intptr_t)&x[1];
// Q: Is p’s provenance x or y?

If it is expected that ‘*(p-1)’ is equivalent to x, p’s provenance should
be x.
However, based on llvm.assume, optimizations on integers can
replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what happened in the
bug report).
Then, ‘*(p-1)’ suddenly becomes an out-of-bounds access, which is UB.
So, p’s provenance isn’t simply x or y; it should be something that can
access both x and y.

This is something that the TR does not really have a firm grasp on yet.

This implies that, unless there is a guarantee that all allocated objects
are one or more bytes apart, there is no type that can perfectly store a
pointer byte.
memcpy(x, y, 8) isn’t equivalent to ‘v=load i64 y;store i64 v, x’ because v
already lost the pointer information.

Okay, so let me try to restate and summarize all this. I’ve CC’ed
a bunch of people back into this part of the thread.

C is moving towards a provenance model; you can find the details in
this committee TR that Joshua Cranmer linked:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf

This TR is very clearly a work in progress and contains many
digressions and several possible sets of rules with different
implications. I will try to briefly summarize.

Formally, all storage has a unique provenance: a notional unique
identifier for the specific allocation event of the storage, like a
local variable coming into scope, or a specific call to malloc.
Pointers formed to the storage formally carry that provenance (or
invalid provenance, or in some corner cases ambiguous provenance).
It is undefined behavior to do certain things with mismatched
provenances. For example, it is undefined behavior to access
an object through a pointer whose provenance doesn’t match the
current provenance of the storage containing that object. This is
a new way of formalizing the well-known rule that you can’t access
a local variable after it’s gone out of scope.

In the provenance model that looks most likely to be accepted, an
int-to-pointer cast blesses the resulting pointer with the current
provenance of the storage containing that address. There does not
have to be any sort of data dependence between taking the address
of the storage and the integer that’s used in the cast; it can
simply be a lucky guess. An integer could reasonably hold the
representation of any address in the program, and so an
int-to-pointer cast could bless its result with valid provenance
for any storage in memory. But if the pointed-to memory is
subsequently deallocated, the provenance of the pointer will
become out of date.

Now, that rule as I’ve stated it would be really bad. Allowing a
lucky guess to resolve to absolutely anything would almost
completely block the optimizer from optimizing memory. For example,
if a local variable came into scope, and then we called a function
that returned a different pointer, we’d have to conservatively
assume that that pointer might alias the local, even if the address
of the local was never even taken, much less escaped:

  int x = 0;
  int *p = guess_address_of_x();
  *p = 15;
  printf("%d\n", x); // provably 0?

So the currently favored proposal adds a really important caveat:
this blessing of provenance only works when a pointer with the
correct provenance has been “exposed”. There are several ways to
expose a pointer, including I/O, but the most important is casting
it to an integer.

Again, there’s no requirement of a data dependence between the
exposure and the int-to-pointer cast. If the program casts a
pointer to an integer, and an independently-produced integer
that happens to be the same value is later cast to a pointer,
and the storage hasn’t been reallocated in the meantime, the
resulting pointer will have the right provenance for the memory
and will be valid to use. This implies that pointer-to-int casts
(and other exposures) are semantically significant events in the
program. They don’t have side effects in the normal sense, but
they must be treated by the compiler just like things that do have
side effects: e.g. unless I’m missing something in the TR,
eliminating a completely unused pointer-to-int cast may make
later code UB.

And in fact, it turns out that this is crucially important for
optimization. If the optimizer wants to allow arbitrary
replacement of integers based on conditional equality, like
in GVN, then replacement totally breaks direct data dependence,
and you can easily be left with no remaining uses of a pointer-to-int
cast when the original code would have had a data dependence. So
you cannot reason about provenance through int-to-pointer casts:
the pointer can alias any storage whose provenance has potentially
been exposed, and the optimizer must be conservative about optimizing
memory that has potentially been exposed.

Of course, a lot of this isn’t new. If code passes a pointer to an
opaque function, then yeah, the optimizer has to assume that the
function might expose it by performing a pointer-to-int cast on it.
But that’s okay, because the optimizer already has to assume that
the function can escape the pointer in a more standard way. Exposure
is just another form of escape.

Everything I’ve been talking about so far is a C-level concept:
an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
in LLVM IR. But I think people have an expectation of what these
things mean in LLVM IR, and I haven’t seen it written out explicitly,
so let’s do that now.

The first assumption here is that int-to-pointer and pointer-to-int
casts in C will translate to inttoptr and ptrtoint operations
in IR. Now, this is already problematic, because those operations
do not currently have the semantics they need to have to make the
proposed optimization model sound. In particular:

  • ptrtoint does not have side-effects and can be dead-stripped
    when unused, which as discussed above is not okay.

  • ptrtoint on a constant is folded to a constant expression,
    not an instruction, which is therefore no longer anchored in the
    code and does not reliably record that the global may have escaped.
    (Unused constant expressions do not really exist, and we absolutely
    cannot allow them to affect the semantics of the IR.)

    Of course, this is only significant for globals that don’t already
    have to be treated conservatively because they might have other
    uses. That is, it only really matters for globals with, say,
    internal or private linkage.

  • inttoptr can be reordered with other instructions, which is
    not allowed because different points in a function may have
    different sets of storage with escaped provenance.

  • inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
    aspects of removing the inttoptr, this also potentially
    introduces UB because the original inttoptr “launders” the
    provenance of the pointer to the current provenance of the
    storage, whereas the original pointer may have stale provenance.
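For concreteness, that last peephole is the familiar fold:

  define i8* @launder(i8* %p) {
    %i = ptrtoint i8* %p to i64
    %q = inttoptr i64 %i to i8*  ; folded today to just %p, dropping the
                                 ; provenance refresh the inttoptr performs
    ret i8* %q
  }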

There are probably other problems. There are two ways that I see
of addressing all of this.

The first way is to take them on directly and change the core
semantic rules of LLVM IR. I think this is a huge change, but
maybe it’s necessary in order to get the semantics we want. We
will need to introduce some new IR features to make this possible,
both for performance and for expressivity. For example, the way
we currently express relative pointers in constant initializers
involves ptrtoint and sub constant expressions, so if we
remove ptrtoint constants, we’ll need something else. And
frontends will need to be much more cautious about not introducing
unnecessary int<->pointer casts because they’ll be heavily
pessimizing.
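For reference, the existing relative-pointer idiom is along these lines:

  @target = external global i32
  @rel = global i32 trunc (i64 sub (i64 ptrtoint (i32* @target to i64),
                                    i64 ptrtoint (i32* @rel to i64)) to i32)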

The second way is to say that inttoptr and ptrtoint are just
superficial type conversions and don’t affect provenance, then
add builtins that do affect provenance. But this leaves us still
tracking provenance through integers, so any optimization that’s
not valid on pointers will also not be valid on integers.

All of this is not even considering the need for byte types yet.
Here’s the thinking behind byte types as I understand it.

The claim is that, to make all of the above work, we need
int<->pointer conversions to be explicit in IR. If there’s some
other way of doing an int<->pointer conversion, we’re in trouble,
because maybe we won’t realize that the conversion has happened,
and so we won’t be appropriately conservative. And unfortunately
C provides some other ways to do these conversions, which happen
to all go through memory: copying representations around, doing
aliased loads and stores, whatever. So we need to have some way
to represent conversions that happen through these mechanisms.
And C says that these mechanisms don’t generally affect provenance,
so inserting inttoptr/ptrtoint casts is at the very least
imprecise because it launders provenance. So what we want is a
type that maintains the original provenance, and therefore is
subject to the same optimization restrictions as pointers.

I don’t find either side of this argument very convincing.

First, the compiler already has to be very conservative about
memory. If a pointer is stored into memory, we generally have
to treat it as having fully escaped: unless we can prove that the
memory isn’t loaded by other code, we have to assume that it is,
and that the loading code will do arbitrary other stuff. That
could include doing a pointer-to-int cast and sharing the pointer
back to us as an integer. Therefore, in the general case, our
ability to optimize when a pointer has escaped into memory is at
least as bad as if it had escaped via an int-to-pointer cast. If
we can nonetheless optimize, because we can reason about some of
the uses together or prove that there aren’t any other uses,
then okay, maybe we see that there’s an int<->pointer conversion.
But translating this to ptrtoint/inttoptr should be, at
worst, overly conservative; it’s not unsound, for reasons I’m
about to get into.

Second, adding casts through an integer type is always valid.
Doing so might block the optimizer, but it doesn’t break semantics.
If I have a program that does e.g. *px = 15, and I change it to
do *(int*)(intptr_t)px = 15, my program has become well-defined
in strictly more situations: in any case, there must be valid
storage at px for this not to be UB, but previously px might
have had the wrong provenance, and now it never does as long as
the provenance for that storage has previously escaped.

If we find that we’re generating too many unnecessary casts
through integer types and it’s really blocking the optimizer too
much, then we should consider the best solutions to those problems
as they arise. It may be the case that we need a better solution
for type conversions introduced through manual memcpy-like code
so that we maintain the original provenance instead of adding
explicit int<->pointer conversions that launder provenance.
I don’t know that byte types are the best solution to that, but
we can consider them.

But this whole conversation about byte types seems to be coming
at it the wrong way around. We need to be centering the first
set of problems around int<->pointer casts.

John.

We can’t retroactively weaken inttoptr semantics relative to what frontends rely on; it should continue to have the semantics of a pointer cast in C, which need to be handled more conservatively in LLVM.

We can easily express the weaker memcpy semantics as “!raw” metadata on the load and inttoptr instructions. (I apologize if that was already mentioned in this thread.)

Introducing a new type to communicate pointer semantics seems inconsistent. LLVM types primarily communicate ABI requirements from the frontend to the target. The semantics of pointer loads and integer conversions can be specified on those operations and generated when the compiler emits byte-wise copies, as opposed to semantic casts. We can expose a clang attribute or builtin for manual pointer copying.
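Roughly, and purely as a sketch (the !raw metadata here is the suggestion above, not an existing LLVM feature):

define void @word_copy(i64* %dst, i64* %src) {
  %v = load i64, i64* %src, !raw !0 ; marked: participates in a byte-wise copy,
                                    ; so pointer provenance must survive it
  store i64 %v, i64* %dst, !raw !0
  ret void
}
!0 = !{}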

-Andy