[RFC] Introducing a byte type to LLVM

Hi Ralf,

[..]

To add to what Juneyoung said:
I don't think that experiment has been made. From what I can see, the
alternative you propose leads to an internally consistent model -- one "just"
has to account for the fact that a "load i64" might do some transformation on
the data to actually obtain an integer result (namely, it might to ptrtoint).

However, I am a bit worried about what happens when we eventually add proper
support for 'restrict'/'noalias': the only models I know for that one actually
make 'ptrtoint' have side-effects on the memory state (similar to setting the
'exposed' flag in the C provenance TS). I can't (currently) demonstrate that

For the 'c standard', it is undefined behavior to convert a restrict pointer to
an integer and back to a pointer type.

(At least, that is my interpretation of n2573 6.7.3.1 para 3:
   Note that "based" is defined only for expressions with pointer types.
)

For the full restrict patches, we do not track restrict provenance across a
ptr2int, except for the 'int2ptr(ptr2int %P)' (which we do, as llvm sometimes
introduced these pairs; not sure if this is still valid).

Greetings,

Jeroen Dobbelaere

Hi John,

    So, at least in the paper Juneyoung and me and a few other people wrote (the
    one referenced a couple times in this thread already:
    https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf
    <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf&gt;\), it is a wildcard
    provenance.
    "exposed" is not a thing in the operational semantics, it is a theorem we
    can prove.

Okay, fine, I won’t say “exposure”. :slight_smile: You intend to still be able to
prove that the reconstructed provenance cannot be that of certain
memory locations based on a comprehensive use analysis of those
locations.

Not really. You prove that the *address* can never be guessed by anything else. For pointers with the wrong address, their provenance simply does not matter.

    I don't think that's right, since "a" is still used in the "if", so a bit of
    information about the address is leaked and you can't assume the address was
    not guessed.

Okay, so you have a baseline expectation that pointer comparisons will be treated as escapes.

Yes. This expectation is based on "otherwise the analysis is wrong and I think I can prove that". :wink: You can't just ignore control dependencies and expect things to work out.
Does LLVM currently not treat pointer comparisons as escapes? I naively assumed it would because how could one even justify not doing so?^^

The reason I added a second comparison in my example is just because otherwise in the |a != b| case the access is out of bounds. So I guess you’re suggesting that it’s impossible to construct an example that avoids that problem that wouldn’t ultimately do something uneliminatable that would count as an escape?

Yes, I think the theorems we proved for that paper establish this.
Of course, there is a chance that *some other* optimization LLVM does (not covered by our theorems) is in conflict with that story.

    That said, things get more tricky once you want to account for C 'restrict'
    / LLVM 'noalias', and then it might be necessary to have explicit
    'exposed'-like operations. I haven't seen a properly worked-out model for
    this, but the candidate models I saw would need something like this. So
    maybe it'd not be a bad idea to have such an operation anyway... I just
    don't think it should have any effect for the kind of programs we have been
    discussing here so far.

Honestly, part of why I like this approach is that I think the basic idea — recognizing when we’re doing a dependency-breaking value replacement and using a different replacement sequence — also moves us towards a solution for |consume| dependencies.

What I don't like about it is that it is not operational (or I did not understand it properly; I am not quite sure what the exact semantics of that marker you want to insert is). I think we to get to a state where we have a precise operational semantics / Abstract Machine for LLVM IR (along the lines of what we have in that paper -- not the same semantics, of course, but the same basic style of defining things). Ideally this would even be implemented in an interpreter. Then we can attack the question of "is this analysis/optimization correct" in a principled way.
(We have such an interpreter for Rust: GitHub - rust-lang/miri: An interpreter for Rust's mid-level intermediate representation . It's been tremendously useful, not just for finding bugs.)

The operational version of this that I know is basically the 'exposed' flag of the C provenance TS. Then 'ptrtoint' has a side-effect of setting that flag, and can (in general) not be removed, even if its result is unused. For 'restrict'/'noalias' we need something more flexible, but I think it can be done. (I've done something similar for a Rust memory model in Stacked Borrows: An Aliasing Model for Rust (POPL 2020) .)
If you are okay with 'ptrtoint' having a side-effect, many things become a lot simpler.
But you seem to imagine something more targeted, something that only affects "dependency-breaking value replacements"... I don't have a good enough intuition for what exactly that is, to really evaluate this option. Could you give an example?

Kind regards,
Ralf

Hi Jeroen,

To add to what Juneyoung said:
I don't think that experiment has been made. From what I can see, the
alternative you propose leads to an internally consistent model -- one "just"
has to account for the fact that a "load i64" might do some transformation on
the data to actually obtain an integer result (namely, it might to ptrtoint).

However, I am a bit worried about what happens when we eventually add proper
support for 'restrict'/'noalias': the only models I know for that one actually
make 'ptrtoint' have side-effects on the memory state (similar to setting the
'exposed' flag in the C provenance TS). I can't (currently) demonstrate that

For the 'c standard', it is undefined behavior to convert a restrict pointer to
an integer and back to a pointer type.

(At least, that is my interpretation of n2573 6.7.3.1 para 3:
    Note that "based" is defined only for expressions with pointer types.
)

For the full restrict patches, we do not track restrict provenance across a
ptr2int, except for the 'int2ptr(ptr2int %P)' (which we do, as llvm sometimes
introduced these pairs; not sure if this is still valid).

Interesting. I assumed that doing ptr2int, then doing whatever you want with that value (say, AES encrypt and then decrypt it), and then turning the same value back into a pointer, must always produce a pointer that is "at least as usable" as the one that we started with. I would interpret the parts of the standard that talk about integer-pointer casts that way.
(That's the problem with axiomatic standards: it is very easy to have mutually contradicting axioms...)

FWIW, Rust's use of LLVM 'noalias' pretty much relies on this. It would be rather disastrous for Rust if 'noalias' pointers cannot be cast to integers, cast back (potentially in a different function), and used.

The C standard definition of 'restrict' is based on hypothetical alternative executions of the program with different inputs. I can't even imagine any reasonable way to interpret that unambiguously, so honestly I don't see how that is even a starting point for a precise formal definition that one could prove theorems about.^^
The ideas colleagues and me discussed for this more evolved around the idea of having more than one "provenance" for an allocation (so when a pointer is passed to a function as 'restrict' argument, it gets a fresh "ID" into its provenance), and then ensuring that the different provenances on one allocation are used consistently. But then when you cast a ptr to an int you basically have to mark that particular provenance as 'exposed' (losing all 'restrict' advantages) to have any chance of handling the case of casting the int back to a ptr. That seems fair to me honestly, if you cast a ptr to an int you cannot reasonably expect alias analysis to make heads or tails of what you are doing. But then 'ptrtoint' has a side-effect and cannot be removed even if the result is unused.

Kind regards,
Ralf

Hi again Jeroen,

However, I am a bit worried about what happens when we eventually add proper
support for 'restrict'/'noalias': the only models I know for that one actually
make 'ptrtoint' have side-effects on the memory state (similar to setting the
'exposed' flag in the C provenance TS). I can't (currently) demonstrate that

For the 'c standard', it is undefined behavior to convert a restrict pointer to
an integer and back to a pointer type.

(At least, that is my interpretation of n2573 6.7.3.1 para 3:
    Note that "based" is defined only for expressions with pointer types.
)

After sleeping over it, I think I want to push back against this interpretation a bit more strongly. Consider a program snippet like

int *out = (int*) decrypt(encrypt( (uintptr_t)in ));

It doesn't matter what "encrypt" and "decrypt" do, as long as they are inverses of each other.
"out" is definitely of pointer type. And by the dependency-based definition of the standard, it is the case that modifying "in" to point elsewhere would also make "out" point elsewhere. Thus "out" is 'based on' "in". And hence it is okay to use "out" to access the object "in" points to, even in the presence of 'restrict'.

Kind regards,
Ralf

Hi Ralf,

My interpretation (well not just mine, we did have discussions about this in our group)
wrt to restrict handling, is that the use of decrypt/encrypt
triggers undefined behavior. Aka, we are not forced to try to track the
restrict dependency for this case. That is important if you want to optimize
restrict annotated accesses vs not-annotated accesses.

At the time that I came up with the implementation, that was also a convenient fallback
to avoid some of the pitfalls. It made thinking about the solution 'easier'.
For our customers, getting the pointer based use cases working also had the highest priority.

Now that we are going over the different pieces of the implementation and see how we can use
them in a broader context, the situation is different: instead of just tracking
the 'restrict/noalias' provenance, we now want to use that part of the infrastructure to
track provenance in general. Because of that, it also makes sense to reconsider what 'policy'
we want to use. In that context, mapping a 'int2ptr' to a 'add_provenance(int2ptr(%Decrypt), null)'
indicating that it can point to anything makes sense, but is still orthogonal to the infrastructure.

For this particular example, it would also be nice if we could somehow indicate that the
'decrypt(encrypt(%P))' can only depend on %P. But that is another discussion.

Greetings,

Jeroen

Hi Jeroen,

My interpretation (well not just mine, we did have discussions about this in our group)
wrt to restrict handling, is that the use of decrypt/encrypt
triggers undefined behavior.

Yes, that is exactly what I am pushing back against. :slight_smile: I cannot see a reading of the standard where this is UB. I also don't think it is the intention of the standard to make this UB. Note that the line I showed could be very far away from the 'restrict' annotation. Basically if this is UB then a 'restrict' pointer cannot be passed to other functions unless we know exactly that they do not do ptr-to-int casts.

Now that we are going over the different pieces of the implementation and see how we can use
them in a broader context, the situation is different: instead of just tracking
the 'restrict/noalias' provenance, we now want to use that part of the infrastructure to
track provenance in general. Because of that, it also makes sense to reconsider what 'policy'
we want to use. In that context, mapping a 'int2ptr' to a 'add_provenance(int2ptr(%Decrypt), null)'
indicating that it can point to anything makes sense, but is still orthogonal to the infrastructure.

That is not sufficient though. You also need to know that the provenance of the 'restrict'ed pointer can now be acquired by other pointers created literally anywhere via int2ptr. *That* is what makes this so tricky, I think.

int foo(int *restrict x) {
   *x = 0;
   unk1();
   assert(*x == 0); // can be optimized to 'true'
   unk2((uintptr_t)x);
   assert(*x == 0); // can *not* be optimized to 'true'
}

For this particular example, it would also be nice if we could somehow indicate that the
'decrypt(encrypt(%P))' can only depend on %P. But that is another discussion.

It would be nice if one could express this in the surface language (C/Rust), but I don't think we should allow LLVM to infer this -- that would basically require tracking provenance through integers, which is not a good idea.
Put differently: as the various examples in this thread show, integers can easily acquire "provenance" of other values simply by comparing them for equality -- so in a sense, after "x == y" evaluates to true, now 'x' also has the "provenance" of 'y'. I don't think we want obscure effects like this in the semantics of the Abstract Machine. (I am not even convinced this can be done consistently.)
So then what we are left with are those transformations that are correct without extra support from the abstract machine. And since these dependencies can entirely disappear from the source code through optimizations like GVN replacing 'x' by 'y', there are strong limits to what can be done here.

Kind regards,
Ralf

Hi Ralf,

> My interpretation (well not just mine, we did have discussions about this in
our group)
> wrt to restrict handling, is that the use of decrypt/encrypt
> triggers undefined behavior.

Yes, that is exactly what I am pushing back against. :slight_smile: I cannot see a
reading
of the standard where this is UB. I also don't think it is the intention of
the
standard to make this UB. Note that the line I showed could be very far away
from the 'restrict' annotation. Basically if this is UB then a 'restrict'
pointer cannot be passed to other functions unless we know exactly that they
do
not do ptr-to-int casts.

Sure, this might be a liberal reading of that sentence wrt to restrict.
And that is how it is done today in the full restrict patches, but of course,
that does not mean that this is where we need to settle on when including the
functionality. It is good to have the reviews that steer us to a solution that
is more broadly applicable.

> Now that we are going over the different pieces of the implementation and
see how we can use
> them in a broader context, the situation is different: instead of just
tracking
> the 'restrict/noalias' provenance, we now want to use that part of the
infrastructure to
> track provenance in general. Because of that, it also makes sense to
reconsider what 'policy'
> we want to use. In that context, mapping a 'int2ptr' to a
'add_provenance(int2ptr(%Decrypt), null)'
> indicating that it can point to anything makes sense, but is still
orthogonal to the infrastructure.

That is not sufficient though. You also need to know that the provenance of
the
'restrict'ed pointer can now be acquired by other pointers created literally
anywhere via int2ptr. *That* is what makes this so tricky, I think.

int foo(int *restrict x) {
   *x = 0;
   unk1();
   assert(*x == 0); // can be optimized to 'true'
   unk2((uintptr_t)x);
   assert(*x == 0); // can *not* be optimized to 'true'
}

Also for restrict, escape analysis must be done. So also this case can be handled.

Greetings,

Jeroen Dobbelaere

Hi Jeroen,

My interpretation (well not just mine, we did have discussions about this in

our group)

wrt to restrict handling, is that the use of decrypt/encrypt
triggers undefined behavior.

Yes, that is exactly what I am pushing back against. :slight_smile: I cannot see a
reading
of the standard where this is UB. I also don't think it is the intention of
the
standard to make this UB. Note that the line I showed could be very far away
from the 'restrict' annotation. Basically if this is UB then a 'restrict'
pointer cannot be passed to other functions unless we know exactly that they
do
not do ptr-to-int casts.

Sure, this might be a liberal reading of that sentence wrt to restrict.
And that is how it is done today in the full restrict patches, but of course,
that does not mean that this is where we need to settle on when including the
functionality. It is good to have the reviews that steer us to a solution that
is more broadly applicable.

Fair enough. The standard is certainly not as unambiguous as one would hope.

Having suffered from an endless stream of 'noalias' bugs on the Rust side, I am very excited that this part of LLVM is being overhauled. :slight_smile:
I was hoping at some point to delve into those restrict patches and try to understand them from a PL/semantics perspective, but so far I haven't had the time -- and it's also a large patchset, much of which naturally is about the implementation (which I can't really follow) and not about the high-level description of the LLVM IR spec that makes the new analyses correct.
When/if I find some time -- what would be a good starting point to try to understand the concepts of those patches without having to understand the C++ details?

Now that we are going over the different pieces of the implementation and

see how we can use

them in a broader context, the situation is different: instead of just

tracking

the 'restrict/noalias' provenance, we now want to use that part of the

infrastructure to

track provenance in general. Because of that, it also makes sense to

reconsider what 'policy'

we want to use. In that context, mapping a 'int2ptr' to a

'add_provenance(int2ptr(%Decrypt), null)'

indicating that it can point to anything makes sense, but is still

orthogonal to the infrastructure.

That is not sufficient though. You also need to know that the provenance of
the
'restrict'ed pointer can now be acquired by other pointers created literally
anywhere via int2ptr. *That* is what makes this so tricky, I think.

int foo(int *restrict x) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    unk2((uintptr_t)x);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Also for restrict, escape analysis must be done. So also this case can be handled.

Sure, smarter analyses can handle the easy cases, but I was asking about what part of the spec of these operations forces the analysis to work like that. Defeating the analysis is not that hard, so here's another example:

static int foo(int *restrict x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    uintptr_t addr = (uintptr_t)x;
    if (addr == y)
      unk2(addr);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Now we do GVN integer replacement:

static int foo(int *restrict x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    uintptr_t addr = (uintptr_t)x;
    if (addr == y)
      unk2(y);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Now let us assume there is exactly one call site of this function (and foo is static so we know it can't be called from elsewhere, or maybe we are doing LTO), which looks like

foo(ptr, (uintptr_t)ptr);

This means we know that the "if" in "foo" will always evaluate to true, so we have

static int foo(int *restrict x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    uintptr_t addr = (uintptr_t)x;
    unk2(y);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Now we can (seemingly) optimize away the "addr" variable entirely -- but at that point, there is no clue left for escape analysis to know that "unk2" might legally mutate "x".

That's why I am saying that with 'restrict', we have to treat ptr-to-int casts as side-effecting, and cannot optimize them away even if their result is unused.
They *always* have an "escape" effect, no matter what happens with their result.

Kind regards,
Ralf

Though, of course, dead loads could be replaced with some hypothetical
new instruction that has *only* the ptrtoint side-effect (and doesn't
produce any assembly). And such an instruction could be subject to
further transformations.

I just wanted to say that removing CSE’ing pointers based on an icmp would simplify the -fstrict-vtable-pointers model.
We try to CSE loads and stores from the same pointer (modulo zero GEPs and bitcasts) that have the !invariant.group metadata. But we may have two different pointers with different provenances that compare equal (e.g. the pointer before and after a C++ placement new). In those cases we don’t want to CSE the two different pointers, so we add the llvm.strip.invariant.group intrinsic before an icmp to prevent GVN from CSE’ing the pointers given a true icmp. If we prevented CSE’ing based on icmp, we wouldn’t need these strip intrinsics.

Perhaps to recover some lost optimizations (e.g. I’ve seen (a == b ? a : b => b) missed due to adding intrinsics due to -fstrict-vtable-pointers) we could do a late pass (right before the codegen pipeline?) where we allow CSE’ing pointers based on icmp and ptrtoint/inttoptr simplification but disallow any pointer provenance optimizations, and make sure no later optimizations take advantage of those. The codegen pipeline probably doesn’t have any optimizations based on pointer provenance?