lifetime.start/end

Hello all,

We’re discussing the well-formedness of lifetime.start/end intrinsic here (https://reviews.llvm.org/D93376), deciding what is a (syntactically & semantically) valid use of these intrinsics.

I’d like to gather more context about the intrinsics.

  1. Is there a frontend or framework that introduces lifetime call with non-stack allocated objects?

If exists, which behavior do they expect from it?

  1. Can a block’s lifetime be started bytewise or elementwise?
    I imagine an optimization that allow using stack very compactly, but wonder there is a real-world use case.

Thanks,
Juneyoung

Hi Juneyoung,

Happy new year :slight_smile:

After we had a lengthy discussion on phab last year, I've tried to summarize my thoughts,
especially given that I had some time to think about things over the break.
Still, no promises on the quality :wink:

I start with general questions I asked myself and then go on rambling about a potential design.

Q1: Is lifetime a given property or a derived one, thus is it fixed or modifiable?

This is a question I asked myself a lot recently. I think it is derived and modifiable,
at least I hope it is. Only that would allow transformations I would want
us to do. Here are some examples:
Compiler Explorer
https://godbolt.org/z/obaTc

Q2: Is a pointer comparison, or similar use, extending the lifetime?

Asked differently, can we hoist a pointer comparison into a region where the pointer is dead?
This is an important question which we haven't discussed much as we assumed LICM has to work.

The current behavior is that non-dereferencing uses are not extending the lifetime and are
allowed outside of "lifetime regions" (as indicated by markers). They will always produce valid
results. Though, we might want to think about a lifetime marker that spits out a new pointer
value instead of reusing the old one.

Q3: Can we use lifetime to argue about addresses?

The question here is, can we fold address comparisons based on lifetimes, or not.

So far, we fold comparisons based on "address information". For example, we "know" globals,
allocas, and mallocs cannot be equal to one another. Also, two distinct allocations, for globals
and allocas, are considered unequal. Now, the crux is that we have to be consistent if we do two
comparisons, and, as of now, we are not (bug number missing). Since the backend (or any other place
for that matter) might coalesce allocas, their addresses might not be different after all. If we
already folded a comparison to "unequal" we are doomed if we later have a comparison that results
in "equal". (Note, this is different from aliasing rules as they can be stricter.)

Design:

I would hope we can come up with a model that treats memory "the same", regardless if it is global,
stack, or heap. I want to avoid special casing one of them wrt. lifetime as I believe most optimizations
would apply to any of them, potentially for different reasons and with different gains, but nevertheless.

Proposal (largely based on the conversation in phab):

A1: Lifetime is a concept that talks about memory content *only*. Basically, the memory content is set to
undefined by lifetime markers. It is derived/modifiable as it is just an "as-is" property of the memory
content. The lifetimes of an object, as described by markers, might change during the compilation. They
might become smaller if we deduce the object is not accessed and the memory content is not used, they
might become larger if objects with non-overlapping lifetimes are coalesced. (One could see the latter as
introducing a new object though.)

A2: If we define lifetime as above, it has nothing to do with the address of an object. Consequently, pointer
comparisons and similar operations are valid outside the lifetime. Loads and stores are as well, they can
even not be removed "easily". A store followed by a lifetime marker or a load following a lifetime marker
is dead or results in undef respectively.

A3: We could not use lifetime to argue about addresses. This means we could/should also not argue that overlapping
lifetimes result in different addresses. Thus, a comparison between the address of two allocas could not
immediately be folded. That said, there would be special cases though. Basically, if one of the allocas does
not escape, other than the comparisons to be folded, we can fold them. Afterwards, coalescing or splitting
would still be consistent because it is unobservable. The problem we have in-tree is that we fold even though
the address is still observed (after the fold). It is still unclear to me what the impact of this would be
on real code. I suggested before that we run some experiments first before we make any decision whatsoever.

This is pretty much saying that lifetime markers are `memset(undef)`, as you suggested before (I think).
There are some implementation level differences but at the end of the day they are basically the same.

Happy to hear some thoughts on this, especially if it fixes the problems that lead to D93376 in the first place.

~ Johannes

Hi Johannes,

I read your proposal and thought about the model.

As you concerned in A3, certain programs may be valid only when memory blocks with overlapping lifetimes have disjoint addresses.
Look at this example (I’m using malloc, but alloca also works):

p1 = malloc(4)

p2 = malloc(4) // for brevity, assume that there as enough space & p1 and p2 != null
set<char*> s;
s.insert(p1); s.insert(p2); // If the second insert did nothing, it would be surprise to programmers
work(s);
free(data1)

free(data2)

Clearly, IR semantics should guarantee that escaped blocks are disjoint. It would be great for verification tools on LLVM IR to be able to answer that the second insert will succeed.

The problem is that definition of escapedness is not clear at the semantic level. Describing the IR semantics w.r.t. LLVM’s escaped analysis function isn’t something we want.

Semantic definition of escapedness of a pointer seems hard, I mean in a stepwise manner.
It isn’t a single instruction such as ‘escape i8* ptr’, and we need to look over all instructions in the function. For example, ‘(int)(p+1) - (int)p’ isn’t semantically escaping the pointer p because the result is 1 regardless of the value of p.
Stepwisely defining the semantics of instructions is a desirable direction IMO.

In practice, syntactically checking escapedness + nice engineering might not break optimizations in most cases (as undef/poison did); but it would be great to move to another level, since LLVM IR is used in so many places :slight_smile:

The pointer comparison is another beast. It indeed has a few issues, and solving it might require nontrivial solution.

Thanks,
Juneyoung

Hi Johannes,

I read your proposal and thought about the model.

Cool, thanks!

As you concerned in A3, certain programs may be valid only when memory
blocks with overlapping lifetimes have disjoint addresses.
Look at this example (I'm using malloc, but alloca also works):

p1 = malloc(4)
p2 = malloc(4) // for brevity, assume that there as enough space & p1 and
p2 != null
set<char*> s;
s.insert(p1); s.insert(p2); // If the second insert did nothing, it would
be surprise to programmers
work(s);
free(data1)
free(data2)

Clearly, IR semantics should guarantee that escaped blocks are disjoint. It
would be great for verification tools on LLVM IR to be able to answer that
the second insert will succeed.

I agree, the second insert should succeed, assuming `p1 && p2`.
I don't think my proposal would in any way impact the program above,
if anything the A3 reasoning makes sure such a program with allocas
is not miscompiled.

I'm also not sure I understand what you try to argue for. Maybe
elaborate a bit what it is you think is bad or needs to be changed.

The problem is that definition of escapedness is not clear at the semantic
level. Describing the IR semantics w.r.t. LLVM's escaped analysis function
isn't something we want.

Semantic definition of escapedness of a pointer seems hard, I mean in a
stepwise manner.
It isn't a single instruction such as 'escape i8* ptr', and we need to look
over all instructions in the function. For example, '(int)(p+1) - (int)p'
isn't semantically escaping the pointer p because the result is 1
regardless of the value of p.
Stepwisely defining the semantics of instructions is a desirable direction
IMO.

I'm confused. What in the proposal would prevent us from defining
the semantics of instructions, or force us to do it in an "undesirable way"?

In practice, syntactically checking escapedness + nice engineering might
not break optimizations in most cases (as undef/poison did); but it would
be great to move to another level, since LLVM IR is used in so many places
:slight_smile:

The property under which you can coalesce objects is simple:
It is not observable.

Now, if the address of one of the two objects you coalesce is not
observed, coalescing is not observable. That is a sufficient condition
not a necessary one. Pointer "escaping" is one step further. If the
address doesn't escape it is not observed. This does not mean the
"semantic conditions for coalescing", e.g., for the purpose of translation
validation, is supposed to be build on top of our "definition of escaping
pointers". That said, we use "does not escape" as a precondition for
various transformation and I'm unsure what is any different now. The
entire escape argument is only used in the validity of the pointer folding.
Similarly, we can fold a comparison of a noalias pointer with another one
if the former does not escape (and both are dereferenced and one is
written for sure).

The pointer comparison is another beast. It indeed has a few issues, and
solving it might require nontrivial solution.

I think the main problem of the inconsistencies and such we've seen is
rooted in the erroneous folding of pointer comparisons. Cleaning up the
lifetime marker semantics is actually unrelated and simply not folding
as described in A3 should solve the issue that has been reported. Nevertheless,
we should take a crack at lifetime markers while we are here.

~ Johannes

Stepwisely defining the semantics of instructions is a desirable direction
IMO.

I’m confused. What in the proposal would prevent us from defining
the semantics of instructions, or force us to do it in an “undesirable way”?

I meant it would be great if the output state after executing an instruction can be described using its input state.
(that was the meaning of ‘stepwise semantics’, I should have been more clear about this)

For example, the semantics of ‘z = add x y’ can be defined as follows:
Given an input state s, next state s’ = s[z → s(x) + s(y)]
where s(x) is the value of x in the previous state, and s[z → v] is a state with z updated to v.

Another example that involves memory: the semantics of ‘i = ptrtoint p’ can be defined as follows:
Given an input state s, next state s’ = s[i → s(p).obj.address + s(p).offset]
where obj.address is the begin address of a memory object obj pointed by p & offset is p’s byte offset. (Imagine a pointer to the offset of some char array).
Note that ptrtoint & add can be nicely defined w.r.t the input state.

Now, the instruction that we’re facing is ‘p = alloca’.
To describe the output state after executing ‘p = alloca’, the address of new alloca should be determined.
If observedness is involved, we need to know the future state again. :confused: We don’t know whether the alloca is going to be observed or not without seeing the future.
This is the problem of the current lifetime intrinsics as well.

One possible approach to resolve this is adding an ‘unobserved’ flag to alloca instruction (similar to what was suggested by Nicolai).
And, we can say that if alloca with ‘unobserved’ is used by ptrtoint/icmp, it is UB.
But this makes ptrtoint/icmp make UB-raising instructions, which contradicts with what LLVM does.
Also, existing optimizations like loop versioning can introduce ptrtoint/pointer comparisons too.

I see that there are other questions that I didn’t answer yet, but let me answer this first to limit the length of the text :slight_smile:

Thanks,
Juneyoung

Stepwisely defining the semantics of instructions is a desirable

direction

IMO.

I'm confused. What in the proposal would prevent us from defining
the semantics of instructions, or force us to do it in an "undesirable

way"?

I meant it would be great if the output state after executing an
instruction can be described using its input state.
(that was the meaning of 'stepwise semantics', I should have been more
clear about this)

For example, the semantics of 'z = add x y' can be defined as follows:
Given an input state s, next state s' = s[z -> s(x) + s(y)]
where s(x) is the value of x in the previous state, and s[z -> v] is a
state with z updated to v.

Another example that involves memory: the semantics of 'i = ptrtoint p' can
be defined as follows:
Given an input state s, next state s' = s[i -> s(p).obj.address +
s(p).offset]
where obj.address is the begin address of a memory object obj pointed by p
& offset is p's byte offset. (Imagine a pointer to the offset of some char
array).
Note that ptrtoint & add can be nicely defined w.r.t the input state.

Now, the instruction that we're facing is 'p = alloca'.
To describe the output state after executing 'p = alloca', the address of
new alloca should be determined.
If observedness is involved, we need to know the future state again. :confused: We
don't know whether the alloca is going to be observed or not without seeing
the future.
This is the problem of the current lifetime intrinsics as well.

No, you mix things up here.

Nobody proposed to modify the semantics of `alloca`.
`alloca` provides you with a fresh, unobserved block of
dereferenceable memory that is implicitly freed as the stack
unwinds. That is it. No context necessary.

If you want to modify the IR, you need to argue the observable
semantics which is nothing new. That this might require more than
a peephole view of the program is also not new.

One possible approach to resolve this is adding an 'unobserved' flag to
alloca instruction (similar to what was suggested by Nicolai).
And, we can say that if alloca with 'unobserved' is used by ptrtoint/icmp,
it is UB.

The flag can be added, like we add other attributes. It should not
be required for any optimization we talked about though. It basically
is a way to manifest derived or given information into the IR.

Attribute deduction, as well as frontends with domain knowledge,
can add such information. The flag we discussed in phab was not even
sufficient for all the transformation examples I presented in my mail,
that is why I extended my argument. We could still have a "noescape"
flag for allocas, but I'm not sure how useful that really is. We can
certainly deduce it and manifest it, unsure if we have domain knowledge
we can use for non-trivial cases though.

But this makes ptrtoint/icmp make UB-raising instructions, which
contradicts with what LLVM does.

As with other violation of attributes I would, on first though, suggest
to produce poison, not UB.

Also, existing optimizations like loop versioning can introduce
ptrtoint/pointer comparisons too.

Sure. I am not certain why that is a problem. I get the feeling things
are still mixed up here.

What I proposed is twofold:

1) We stop folding comparisons between different allocas if changing the address
of both might be observable. Thus, if both might have their address "taken"/escaped,
other than the comparisons we want to fold, we cannot proceed.

2) We define lifetime markers to mean `memset(undef)`.

The first should be sufficient for the problem you were trying to solve in the
first place. The second makes lifetime markers less weird. Note that 1) is not changing
the semantics of the IR. We basically just argue there is a bug in our instcombine right
now as we do not check all necessary preconditions.

I see that there are other questions that I didn't answer yet, but let me
answer this first to limit the length of the text :slight_smile:

Sure, we can split the discussion :slight_smile:

~ Johannes

No, you mix things up here.

Nobody proposed to modify the semantics of alloca.
alloca provides you with a fresh, unobserved block of
dereferenceable memory that is implicitly freed as the stack
unwinds. That is it. No context necessary.

It is, alloca must have its integral address decided at its allocation time as well, like 0xFF00… .

For example, if 0xFFFF0000 ~ 0xFF000000 is already allocated, the new alloca cannot be placed there unless it is not observed in the future, according to your proposal.

But how do we know from the current state whether the stack variable is going to be observed or not?

It is like writing an interpreter for IR; we cannot write an interpreter that relies on the future memory state to make the current step.

Making icmp/ptrtoint yield poison will still make loop versioning or pointer rewriting transformations unsound because these operations now can create poison (even if pointers are noundef).

What I proposed is twofold:

  1. We stop folding comparisons between different allocas if changing the
    address
    of both might be observable. Thus, if both might have their address
    “taken”/escaped,
    other than the comparisons we want to fold, we cannot proceed.
  1. We define lifetime markers to mean memset(undef).
  1. is fine, I think the suggestion semantically makes sense perfectly. 1) is something I’m concerned about now.

There are more than pointer foldings, such as rewriting such expression, code motion ptr cmp, introduce ptr cmp, etc. There is also analysis relying on ptr cmp.
Defining the correctness of each of them is something we want to avoid, and maybe that’s why we want to define precise semantics for things.

If the sudden textual change at lifetime intrinsics is a concern, maybe we can have a migration time and make the Verifier raise a warning if lifetime is used with an unknown pointer.

I think this will be less aggressive and may give nice feedback to potential projects that are using lifetime with non-alloca.
At the end, we can implement IR writer that lowers lifetime with non-alloca into memset(undef). WDYT?

Thanks,
Juneyoung

p.s. The reply was late, sorry. I think I can spend more time on this today.

No, you mix things up here.

Nobody proposed to modify the semantics of `alloca`.
`alloca` provides you with a fresh, unobserved block of
dereferenceable memory that is implicitly freed as the stack
unwinds. That is it. No context necessary.

It is, alloca must have its integral address decided at its *allocation*
time as well, like 0xFF00... .

For example, if 0xFFFF0000 ~ 0xFF000000 is already allocated, the new
alloca cannot be placed there unless it is not observed in the future,
according to your proposal.

But how do we know from the current state whether the stack variable is
going to be observed or not?

With that argument you could allocate all but 4 bytes, do a malloc(4)
and know the address of the returned pointer (assuming it is not null).

What I try to say is, either your scenario is part of the model and
everything we do is broken as you could "observe" addresses passively,
*or*, the abstract machine we use for semantic reasoning doesn't permit
the above reasoning. I really hope for the latter.

To argue differently: Who is to say there is a stack, or only one,
or that alloca allocates memory "on the one stack"? That is not part of
the IR, IMHO. I can write a machine on which alloca lowers to malloc,
I don't even need the free during stack unwind but that I could do as
well if I wanted to.

It is like writing an interpreter for IR; we cannot write an interpreter
that relies on the future memory state to make the current step.

Still totally agreed that we can't, but I still don't think we need to.

Making icmp/ptrtoint yield poison will still make loop versioning or
pointer rewriting transformations unsound because these operations now can
create poison (even if pointers are noundef).

I did not say they yield poison, at least I did not try to say that.
What are you referring to exactly?

What I proposed is twofold:

1) We stop folding comparisons between different allocas if changing the
address
     of both might be observable. Thus, if both might have their address
"taken"/escaped,
     other than the comparisons we want to fold, we cannot proceed.

2) We define lifetime markers to mean `memset(undef)`.

2) is fine, I think the suggestion semantically makes sense perfectly. 1)
is something I'm concerned about now.

There are more than pointer foldings, such as rewriting such expression,
code motion ptr cmp, introduce ptr cmp, etc. There is also analysis relying
on ptr cmp.
Defining the correctness of each of them is something we want to avoid, and
maybe that's why we want to define precise semantics for things.

I don't get the point. My proposal does not change the semantics of
pointer comparisons, at all. I explicitly mentioned that in the last
email.

If the sudden textual change at lifetime intrinsics is a concern, maybe we
can have a migration time and make the Verifier raise a warning if lifetime
is used with an unknown pointer.

That (alone) is not my concern.

I think this will be less aggressive and may give nice feedback to
potential projects that are using lifetime with non-alloca.

The lifetime marker debate, basically 2) above, is orthogonal to the problem
you try to solve. It got mixed in as lifetime markers were used by StackColoring
to perform coalescing but that is coincidental. You can (arguably) coalesce stack
allocations regardless of lifetime markers and with 1) such a transformation
(w/ and w/o lifetime markers) would actually be sound.

At the end, we can implement IR writer that lowers lifetime with non-alloca
into memset(undef). WDYT?

Yeah, 2) is orthogonal and we can lower it that way. Unsure if it is helpful
but we can certainly define it that way in the LangRef.

~ Johannes

It is, alloca must have its integral address decided at its allocation
time as well, like 0xFF00… .

For example, if 0xFFFF0000 ~ 0xFF000000 is already allocated, the new
alloca cannot be placed there unless it is not observed in the future,
according to your proposal.

But how do we know from the current state whether the stack variable is
going to be observed or not?

With that argument you could allocate all but 4 bytes, do a malloc(4)
and know the address of the returned pointer (assuming it is not null).

What I try to say is, either your scenario is part of the model and
everything we do is broken as you could “observe” addresses passively,
or, the abstract machine we use for semantic reasoning doesn’t permit
the above reasoning. I really hope for the latter.

That’s a very valid point…!

Actually, I have a memory model that addresses this problem.
The gist is that we can interpret each allocation instruction as creating 2 blocks and nondeterministically returning one of them.
The other one is marked as invalid but not freed immediately. Deallocation frees both blocks.
If there is not enough space for having two blocks, it is treated as out-of-memory.

It makes guessing the address of an allocation according to memory layout invalid UB.

p = malloc(4)

// If p != null, it is guaranteed there was at least two 4 byte slots available, say 0x100~0x104 and 0x110~0x114.
// Two blocks are allocated at 0x110 and 0x114, and one of the addresses is nondeterministically returned.
// By the nature of nondeterminism, all executions below should be well-defined regardless of the address of p.
(int)0x100 = 10; // In an execution where p is 0x110, this raises UB.

The paper link is here FYI.

To argue differently: Who is to say there is a stack, or only one,
or that alloca allocates memory “on the one stack”? That is not part of
the IR, IMHO. I can write a machine on which alloca lowers to malloc,
I don’t even need the free during stack unwind but that I could do as
well if I wanted to.

This is right, the example was for illustrative purposes. IR does not enforce alloca to be placed at ‘stack’.

Making icmp/ptrtoint yield poison will still make loop versioning or
pointer rewriting transformations unsound because these operations now can
create poison (even if pointers are noundef).

I did not say they yield poison, at least I did not try to say that.
What are you referring to exactly?

I was referring to this:

But this makes ptrtoint/icmp make UB-raising instructions, which
contradicts with what LLVM does.

As with other violation of attributes I would, on first though, suggest
to produce poison, not UB.

But it is more about the (imaginary) attribute, so maybe I was slightly out of topic.

  1. is fine, I think the suggestion semantically makes sense perfectly. 1)
    is something I’m concerned about now.

There are more than pointer foldings, such as rewriting such expression,
code motion ptr cmp, introduce ptr cmp, etc. There is also analysis relying
on ptr cmp.
Defining the correctness of each of them is something we want to avoid, and
maybe that’s why we want to define precise semantics for things.

I don’t get the point. My proposal does not change the semantics of
pointer comparisons, at all. I explicitly mentioned that in the last
email.

Oh okay, I thought it was a part of the lifetime proposal, but it seems more like a separate thing.
I agree that this requires performance impact.
Also investigation of existing transformations would be needed; Alive2’s pointer comparison is doing approximation yet, but if it becomes fully precise, it will show something from running LLVM unit tests I believe…! :slight_smile:

I think this will be less aggressive and may give nice feedback to
potential projects that are using lifetime with non-alloca.

The lifetime marker debate, basically 2) above, is orthogonal to the problem
you try to solve. It got mixed in as lifetime markers were used by
StackColoring
to perform coalescing but that is coincidental. You can (arguably)
coalesce stack
allocations regardless of lifetime markers and with 1) such a
transformation
(w/ and w/o lifetime markers) would actually be sound.

At the end, we can implement IR writer that lowers lifetime with non-alloca
into memset(undef). WDYT?

Yeah, 2) is orthogonal and we can lower it that way. Unsure if it is helpful
but we can certainly define it that way in the LangRef.

Okay, thanks!

I think a sensible next step would be to determine the impact of A3 (below)
on real code. Basically, "fixing" test3 here: Compiler Explorer

I doubt I will get to that any time soon. Juneyoung, Nikita, would you be able to
run some experiments?

A3: We could not use lifetime to argue about addresses. This means we could/should also not argue that overlapping
lifetimes result in different addresses. Thus, a comparison between the address of two allocas could not
immediately be folded. That said, there would be special cases though. Basically, if one of the allocas does
not escape, other than the comparisons to be folded, we can fold them. Afterwards, coalescing or splitting
would still be consistent because it is unobservable. The problem we have in-tree is that we fold even though
the address is still observed (after the fold). It is still unclear to me what the impact of this would be
on real code. I suggested before that we run some experiments first before we make any decision whatsoever.

It is, alloca must have its integral address decided at its *allocation*
time as well, like 0xFF00... .

For example, if 0xFFFF0000 ~ 0xFF000000 is already allocated, the new
alloca cannot be placed there unless it is not observed in the future,
according to your proposal.

But how do we know from the current state whether the stack variable is
going to be observed or not?

With that argument you could allocate all but 4 bytes, do a malloc(4)
and know the address of the returned pointer (assuming it is not null).

What I try to say is, either your scenario is part of the model and
everything we do is broken as you could "observe" addresses passively,
*or*, the abstract machine we use for semantic reasoning doesn't permit
the above reasoning. I really hope for the latter.

That's a very valid point..!

Actually, I have a memory model that addresses this problem.
The gist is that we can interpret each allocation instruction as *creating
2 blocks* and nondeterministically returning one of them.
The other one is marked as invalid but not freed immediately. Deallocation
frees both blocks.
If there is not enough space for having two blocks, it is treated as
out-of-memory.

It makes guessing the address of an allocation according to memory layout
invalid UB.

p = malloc(4)
// If p != null, it is guaranteed there was at least two 4 byte slots
available, say 0x100~0x104 and 0x110~0x114.
// Two blocks are allocated at 0x110 and 0x114, and one of the addresses is
nondeterministically returned.
// By the nature of nondeterminism, all executions below should be
well-defined regardless of the address of p.
*(int*)0x100 = 10; // In an execution where p is 0x110, this raises UB.

The paper link is here <https://dl.acm.org/doi/10.1145/3276495&gt; FYI.

Nice! Thanks for the link :slight_smile:

Me neither ATM, but it should be me if one has to do the experiment because I was pushing the patch.

Thanks,
Juneyoung

Hi all,

Thanks for including me in this thread!

I agree that *in principle*, alloca is entirely equivalent to malloc and "automatic free when the function returns". The stack is just a very optimized allocator. However, if the compiler takes advantage of this restricted structure of stack allocations in ways that would not be legal for malloc, then we have to account for this by also making alloca special in the semantics.

A3: We could not use lifetime to argue about addresses. This means we could/should also not argue that overlapping
lifetimes result in different addresses. Thus, a comparison between the address of two allocas could not
immediately be folded. That said, there would be special cases though. Basically, if one of the allocas does
not escape, other than the comparisons to be folded, we can fold them. Afterwards, coalescing or splitting
would still be consistent because it is unobservable. The problem we have in-tree is that we fold even though
the address is still observed (after the fold). It is still unclear to me what the impact of this would be
on real code. I suggested before that we run some experiments first before we make any decision whatsoever.

I hope this question has not been answered yet, but I don't see how that fold could be legal. I asked the same question on Phabricator but it seems you prefer to have the discussion here. Taking your example from there and adjusting it:

p = malloc(1)
q = malloc(1)
%c = icmp eq %p, %q
free(q)
free(p)

I think there is a guarantee that c will always be "false". Every operational model of allocation that I have ever seen will guarantee this, and the same for every program logic that I can imagine. So if the compiler may fold this to "false", then as far as I can see, pointer comparison becomes entirely unpredictable. The only consistent model I can think of is "icmp on pointers may spuriously return 'true' at any time", which I doubt anyone wants. :wink:

Am I misunderstanding what you are proposing here?

Kind regards,
Ralf

Hi all,

Thanks for including me in this thread!

I agree that *in principle*, alloca is entirely equivalent to malloc and "automatic free when the function returns". The stack is just a very optimized allocator. However, if the compiler takes advantage of this restricted structure of stack allocations in ways that would not be legal for malloc, then we have to account for this by also making alloca special in the semantics.

A3: We could not use lifetime to argue about addresses. This means we could/should also not argue that overlapping
lifetimes result in different addresses. Thus, a comparison between the address of two allocas could not
immediately be folded. That said, there would be special cases though. Basically, if one of the allocas does
not escape, other than the comparisons to be folded, we can fold them. Afterwards, coalescing or splitting
would still be consistent because it is unobservable. The problem we have in-tree is that we fold even though
the address is still observed (after the fold). It is still unclear to me what the impact of this would be
on real code. I suggested before that we run some experiments first before we make any decision whatsoever.

I hope this question has not been answered yet, but I don't see how that fold could be legal. I asked the same question on Phabricator but it seems you prefer to have the discussion here. Taking your example from there and adjusting it:

p = malloc(1)
q = malloc(1)
%c = icmp eq %p, %q
free(q)
free(p)

I think there is a guarantee that c will always be "false". Every operational model of allocation that I have ever seen will guarantee this, and the same for every program logic that I can imagine. So if the compiler may fold this to "false", then as far as I can see, pointer comparison becomes entirely unpredictable. The only consistent model I can think of is "icmp on pointers may spuriously return 'true' at any time", which I doubt anyone wants. :wink:

Am I misunderstanding what you are proposing here?

I'm confused. In your text I read: It should be "false" but *not* folded to "false" or else ...

FWIW, I would expect `%c = false` here. The tricky part is, do we see `%c = false` at compile time
or at runtime. If it's the former, we might run into the same coalescing issue we have for allocas:

p = malloc(1)
free(q)
q = malloc(1)
free(p)
%c = icmp eq %p, %q

What is `%c` now? It could go either way, depending on your allocator, agreed? If so, we need to ensure
there is not a second `%c2 = icmp eq %p, %q` somewhere if we want to fold `%c` as it needs to be consistent.

Does that make some sense?

~ Johannes

Hi Johannes,

A3: We could not use lifetime to argue about addresses. This means we could/should also not argue that overlapping
lifetimes result in different addresses. Thus, a comparison between the address of two allocas could not
immediately be folded. That said, there would be special cases though. Basically, if one of the allocas does
not escape, other than the comparisons to be folded, we can fold them. Afterwards, coalescing or splitting
would still be consistent because it is unobservable. The problem we have in-tree is that we fold even though
the address is still observed (after the fold). It is still unclear to me what the impact of this would be
on real code. I suggested before that we run some experiments first before we make any decision whatsoever.

I hope this question has not been answered yet, but I don't see how that fold could be legal. I asked the same question on Phabricator but it seems you prefer to have the discussion here. Taking your example from there and adjusting it:

p = malloc(1)
q = malloc(1)
%c = icmp eq %p, %q
free(q)
free(p)

I think there is a guarantee that c will always be "false". Every operational model of allocation that I have ever seen will guarantee this, and the same for every program logic that I can imagine. So if the compiler may fold this to "false", then as far as I can see, pointer comparison becomes entirely unpredictable. The only consistent model I can think of is "icmp on pointers may spuriously return 'true' at any time", which I doubt anyone wants. :wink:

Am I misunderstanding what you are proposing here?

I'm confused. In your text I read: It should be "false" but *not* folded to "false" or else ...

*oops* sorry for that, I meant "if the compiler may fold this to 'true'".

FWIW, I would expect `%c = false` here.

Good, we agree then.

The tricky part is, do we see `%c = false` at compile time
or at runtime. If it's the former, we might run into the same coalescing issue we have for allocas:

p = malloc(1)
free(q)
q = malloc(1)
free(p)
%c = icmp eq %p, %q

What is `%c` now? It could go either way, depending on your allocator, agreed?

Agreed.

If so, we need to ensure
there is not a second `%c2 = icmp eq %p, %q` somewhere if we want to fold `%c` as it needs to be consistent.

Indeed -- if we constrain the non-deterministic choice that the allocator is making, we need to do so consistently.

We could instead coalesce allocas in the middle end (more aggressively and not only based on lifetime markers) in order to fold comparisons of the "then equal" allocas to "equal".

So, to me this sounded like you wanted to fold `icmp` to `true` in situations like in my example. It looks like I misunderstood.
So is your proposal instead that we can coalesce allocations *after* we have ensured that there definitely is no `icmp` on their pointer any more? Like, for local variables, we first fold away all `icmp` and then we can freely coalesce since now it's not observable that we are reusing memory?

However, if we translate alloca to malloc+free, all (almost all?) of these allocations *will* overlap in their lifetime, so 'false' (inequal) is the only thing we could ever fold anything to (and we don't have to worry about consistency here since this is the only possible choice). I guess after we did that, if there's nothing else observing these addresses, we could indeed actually reuse the underlying storage...

Am I on the right track?

Kind regards,
Ralf

Hi Johannes,

A3: We could not use lifetime to argue about addresses. This means we could/should also not argue that overlapping
lifetimes result in different addresses. Thus, a comparison between the address of two allocas could not
immediately be folded. That said, there would be special cases though. Basically, if one of the allocas does
not escape, other than the comparisons to be folded, we can fold them. Afterwards, coalescing or splitting
would still be consistent because it is unobservable. The problem we have in-tree is that we fold even though
the address is still observed (after the fold). It is still unclear to me what the impact of this would be
on real code. I suggested before that we run some experiments first before we make any decision whatsoever.

I hope this question has not been answered yet, but I don't see how that fold could be legal. I asked the same question on Phabricator but it seems you prefer to have the discussion here. Taking your example from there and adjusting it:

p = malloc(1)
q = malloc(1)
%c = icmp eq %p, %q
free(q)
free(p)

I think there is a guarantee that c will always be "false". Every operational model of allocation that I have ever seen will guarantee this, and the same for every program logic that I can imagine. So if the compiler may fold this to "false", then as far as I can see, pointer comparison becomes entirely unpredictable. The only consistent model I can think of is "icmp on pointers may spuriously return 'true' at any time", which I doubt anyone wants. :wink:

Am I misunderstanding what you are proposing here?

I'm confused. In your text I read: It should be "false" but *not* folded to "false" or else ...

*oops* sorry for that, I meant "if the compiler may fold this to 'true'".

FWIW, I would expect `%c = false` here.

Good, we agree then.

The tricky part is, do we see `%c = false` at compile time
or at runtime. If it's the former, we might run into the same coalescing issue we have for allocas:

p = malloc(1)
free(q)
q = malloc(1)
free(p)
%c = icmp eq %p, %q

What is `%c` now? It could go either way, depending on your allocator, agreed?

Agreed.

If so, we need to ensure
there is not a second `%c2 = icmp eq %p, %q` somewhere if we want to fold `%c` as it needs to be consistent.

Indeed -- if we constrain the non-deterministic choice that the allocator is making, we need to do so consistently.

We could instead coalesce allocas in the middle end (more aggressively and not only based on lifetime markers) in order to fold comparisons of the "then equal" allocas to "equal".

So, to me this sounded like you wanted to fold `icmp` to `true` in situations like in my example. It looks like I misunderstood.
So is your proposal instead that we can coalesce allocations *after* we have ensured that there definitely is no `icmp` on their pointer any more? Like, for local variables, we first fold away all `icmp` and then we can freely coalesce since now it's not observable that we are reusing memory?

Let me start with:

My phab response is "stale" by now. My first email to this thread was written after I had more time to think about all this:
https://lists.llvm.org/pipermail/llvm-dev/2021-January/147594.html

Separation of concerns, yes. If you can't observe we coalesced, we can coalesce, or choose not to. My main argument is that the folding we do
right now is broken, or the coalescing is. We cannot fold, coalesce, and then fold again (which could be runtime eval), as that leads to
inconsistencies.

However, if we translate alloca to malloc+free, all (almost all?) of these allocations *will* overlap in their lifetime, so 'false' (inequal) is the only thing we could ever fold anything to (and we don't have to worry about consistency here since this is the only possible choice). I guess after we did that, if there's nothing else observing these addresses, we could indeed actually reuse the underlying storage...

Am I on the right track?

I guess initially we would only fold to `inequal`, which is fine, until we start to coalesce locations with non-overlapping lifetime early.
Though, now that I said this, I'm not even sure we could ever fold after coalescing, thus we could ever coalesce if the addresses (of both)
are "still observed". Maybe folding is not the problem but coalescing is, though I think both are in the most general sense. I think I can
fabricate a reasonable flow of transformation in which early *and* partial folding is bad:

a = alloc
b = alloc
... // no use of a or b
foo(a)
... // no use of a or b
bar(b)
... // no use of a or b
c = (a == b)

now we assume that the two calls are outlined after we fold `(a == b)` to `false`:

void foocall() {
a = alloc
foo(a)
}
void barcall() {
a = alloc
bar(a)
}

foocall()
barcall()
c = true

now we assume foo stores away `a` and bar compares it to `b`. That would be a problem.

WDYT?

In my understanding of
[expr.rel] or
[expr.rel] the result of this is unspecified.
While this does not necessarily extend to LLVM-IR, LLVM-IR usually
assumes C/C++ semantics unless defined otherwise.

Optimizations such as https://reviews.llvm.org/D53362 and
https://reviews.llvm.org/D65408 assume the equivalence of alloca and
malloc+free. I assume that such optimizations are the reason for this
unspecifiedness, i.e. I think someone might want this. Johannes's
proposal A1 however, seems to forbid StackColoring to exploit lifetime
markers, which it currently does and I am not sure we can afford to do
that.

Michael

Hi Johannes,

The tricky part is, do we see `%c = false` at compile time
or at runtime. If it's the former, we might run into the same coalescing issue we have for allocas:

p = malloc(1)
free(q)
q = malloc(1)
free(p)
%c = icmp eq %p, %q

What is `%c` now? It could go either way, depending on your allocator, agreed?

Agreed.

If so, we need to ensure
there is not a second `%c2 = icmp eq %p, %q` somewhere if we want to fold `%c` as it needs to be consistent.

Indeed -- if we constrain the non-deterministic choice that the allocator is making, we need to do so consistently.

We could instead coalesce allocas in the middle end (more aggressively and not only based on lifetime markers) in order to fold comparisons of the "then equal" allocas to "equal".

So, to me this sounded like you wanted to fold `icmp` to `true` in situations like in my example. It looks like I misunderstood.
So is your proposal instead that we can coalesce allocations *after* we have ensured that there definitely is no `icmp` on their pointer any more? Like, for local variables, we first fold away all `icmp` and then we can freely coalesce since now it's not observable that we are reusing memory?

Let me start with:

My phab response is "stale" by now. My first email to this thread was written after I had more time to think about all this:
https://lists.llvm.org/pipermail/llvm-dev/2021-January/147594.html

Separation of concerns, yes. If you can't observe we coalesced, we can coalesce, or choose not to. My main argument is that the folding we do
right now is broken, or the coalescing is. We cannot fold, coalesce, and then fold again (which could be runtime eval), as that leads to
inconsistencies.

I think this is equivalent to what I said, namely that we can only coalesce when there are no comparison checks left -- it's not just that we cannot fold icmp any more after colaescing, we cannot have any icmp, folded or not, since runtime-checking of ptr equality will yield the wrong result after coalescing.
(I have a hard time conceptualizing runtime eval as a kind of "fold" the way you do.^^ So I prefer to think of this in terms of the conditions that have to hold when coalescing happens: no icmp left on the involved pointers.)

However, if we translate alloca to malloc+free, all (almost all?) of these allocations *will* overlap in their lifetime, so 'false' (inequal) is the only thing we could ever fold anything to (and we don't have to worry about consistency here since this is the only possible choice). I guess after we did that, if there's nothing else observing these addresses, we could indeed actually reuse the underlying storage...

Am I on the right track?

I guess initially we would only fold to `inequal`, which is fine, until we start to coalesce locations with non-overlapping lifetime early.
Though, now that I said this, I'm not even sure we could ever fold after coalescing, thus we could ever coalesce if the addresses (of both)
are "still observed". Maybe folding is not the problem but coalescing is, though I think both are in the most general sense. I think I can
fabricate a reasonable flow of transformation in which early *and* partial folding is bad:

a = alloc
b = alloc
... // no use of a or b
foo(a)
... // no use of a or b
bar(b)
... // no use of a or b
c = (a == b)

now we assume that the two calls are outlined after we fold `(a == b)` to `false`:

void foocall() {
a = alloc
foo(a)
}
void barcall() {
a = alloc
bar(a)
}

foocall()
barcall()
c = true

now we assume foo stores away `a` and bar compares it to `b`. That would be a problem.

WDYT?

Hm, yeah, this shows that even outlining is a problem since it can make previously overlapping allocations non-overlapping. In a sense, there is a step happening before outlining here, which is to move around the alloca and explicit free to make the code more like

... // no use of a or b
a = alloc
foo(a)
free a
... // no use of a or b
b = alloc
bar(b)
free b
... // no use of a or b
c = (a == b)

It is this reordering of (de)allocation that is incorrect; the outlining itself is not the problem.

This also reflects my general concern with your approach: there's a lot of information loss. When I write C (or Rust) code, there is no guarantee that different local variables have different addresses. And yet when I translate that code to LLVM by doing

a = alloca
b = alloca
[...]

now I have generated LLVM IR where a and b will definitely get different addresses (if alloca works anything like a regular allocation operation). This is a correct translation from C to LLVM IR, but it heavily restricts the possible allocation choices compared to the original C program. So at this point frontends would be well-advised to not generate code like this, but instead carefully place the alloca as late as possible. Frontends will also want to have an explicit "free" for stack-slots to indicate when they are not needed any more.
It is my understanding that this is exactly where the lifetime.start and lifetime.end markers come in: they are *prescriptive* (in the terminology of your email: they are a given property, and thus fixed -- modulo the usual "as if" rule, of course); they are how the frontend informs the later stages of the constraints on which alloca may overlap and which may not. If the lifetime markers cannot play this role, then something else should. There needs to be *some* way for the frontend to encode this information.

Kind regards,
Ralf

Hi Michael,

I hope this question has not been answered yet, but I don't see how that fold
could be legal. I asked the same question on Phabricator but it seems you prefer
to have the discussion here. Taking your example from there and adjusting it:

p = malloc(1)
q = malloc(1)
%c = icmp eq %p, %q
free(q)
free(p)

I think there is a guarantee that c will always be "false". Every operational
model of allocation that I have ever seen will guarantee this, and the same for
every program logic that I can imagine. So if the compiler may fold this to
"false", then as far as I can see, pointer comparison becomes entirely
unpredictable. The only consistent model I can think of is "icmp on pointers may
spuriously return 'true' at any time", which I doubt anyone wants. :wink:

In my understanding of
[expr.rel] or
[expr.rel] the result of this is unspecified.
While this does not necessarily extend to LLVM-IR, LLVM-IR usually
assumes C/C++ semantics unless defined otherwise.

The part of the standard you quoted is about "Relational operators", i.a. < <= > >=. The code above is using ==, so I do not think these parts of the standard apply.
[expr.eq] seems to govern == and !=, and it says

Two pointers of the same type compare equal if and only if they are both null, both point to the same function, or both represent the same address ([basic.compound]).

So now the question is if "p" and "q" 'represent the same address', but given that they point to the beginning of different allocated objects, I would say they definitely do not represent the same address.

Kind regards,
Ralf

Hi Johannes,

The tricky part is, do we see `%c = false` at compile time
or at runtime. If it's the former, we might run into the same coalescing issue we have for allocas:

p = malloc(1)
free(q)
q = malloc(1)
free(p)
%c = icmp eq %p, %q

What is `%c` now? It could go either way, depending on your allocator, agreed?

Agreed.

If so, we need to ensure
there is not a second `%c2 = icmp eq %p, %q` somewhere if we want to fold `%c` as it needs to be consistent.

Indeed -- if we constrain the non-deterministic choice that the allocator is making, we need to do so consistently.

We could instead coalesce allocas in the middle end (more aggressively and not only based on lifetime markers) in order to fold comparisons of the "then equal" allocas to "equal".

So, to me this sounded like you wanted to fold `icmp` to `true` in situations like in my example. It looks like I misunderstood.
So is your proposal instead that we can coalesce allocations *after* we have ensured that there definitely is no `icmp` on their pointer any more? Like, for local variables, we first fold away all `icmp` and then we can freely coalesce since now it's not observable that we are reusing memory?

Let me start with:

My phab response is "stale" by now. My first email to this thread was written after I had more time to think about all this:
https://lists.llvm.org/pipermail/llvm-dev/2021-January/147594.html

Separation of concerns, yes. If you can't observe we coalesced, we can coalesce, or choose not to. My main argument is that the folding we do
right now is broken, or the coalescing is. We cannot fold, coalesce, and then fold again (which could be runtime eval), as that leads to
inconsistencies.

I think this is equivalent to what I said, namely that we can only coalesce when there are no comparison checks left -- it's not just that we cannot fold icmp any more after colaescing, we cannot have any icmp, folded or not, since runtime-checking of ptr equality will yield the wrong result after coalescing.
(I have a hard time conceptualizing runtime eval as a kind of "fold" the way you do.^^ So I prefer to think of this in terms of the conditions that have to hold when coalescing happens: no icmp left on the involved pointers.)

No icmp left is not the right condition but I guess it's a shorthand for "address is not observed", then we are in agreement. The runtime eval
is little different from static partial evaluation, except that we have all the values, but maybe that is just me.

However, if we translate alloca to malloc+free, all (almost all?) of these allocations *will* overlap in their lifetime, so 'false' (inequal) is the only thing we could ever fold anything to (and we don't have to worry about consistency here since this is the only possible choice). I guess after we did that, if there's nothing else observing these addresses, we could indeed actually reuse the underlying storage...

Am I on the right track?

I guess initially we would only fold to `inequal`, which is fine, until we start to coalesce locations with non-overlapping lifetime early.
Though, now that I said this, I'm not even sure we could ever fold after coalescing, thus we could ever coalesce if the addresses (of both)
are "still observed". Maybe folding is not the problem but coalescing is, though I think both are in the most general sense. I think I can
fabricate a reasonable flow of transformation in which early *and* partial folding is bad:

a = alloc
b = alloc
... // no use of a or b
foo(a)
... // no use of a or b
bar(b)
... // no use of a or b
c = (a == b)

now we assume that the two calls are outlined after we fold `(a == b)` to `false`:

void foocall() {
a = alloc
foo(a)
}
void barcall() {
a = alloc
bar(a)
}

foocall()
barcall()
c = true

now we assume foo stores away `a` and bar compares it to `b`. That would be a problem.

WDYT?

Hm, yeah, this shows that even outlining is a problem since it can make previously overlapping allocations non-overlapping. In a sense, there is a step happening before outlining here, which is to move around the alloca and explicit free to make the code more like

... // no use of a or b
a = alloc
foo(a)
free a
... // no use of a or b
b = alloc
bar(b)
free b
... // no use of a or b
c = (a == b)

It is this reordering of (de)allocation that is incorrect; the outlining itself is not the problem.

This also reflects my general concern with your approach: there's a lot of information loss. When I write C (or Rust) code, there is no guarantee that different local variables have different addresses. And yet when I translate that code to LLVM by doing

a = alloca
b = alloca
[...]

now I have generated LLVM IR where a and b will definitely get different addresses (if alloca works anything like a regular allocation operation). This is a correct translation from C to LLVM IR, but it heavily restricts the possible allocation choices compared to the original C program. So at this point frontends would be well-advised to not generate code like this, but instead carefully place the alloca as late as possible. Frontends will also want to have an explicit "free" for stack-slots to indicate when they are not needed any more.
It is my understanding that this is exactly where the lifetime.start and lifetime.end markers come in: they are *prescriptive* (in the terminology of your email: they are a given property, and thus fixed -- modulo the usual "as if" rule, of course); they are how the frontend informs the later stages of the constraints on which alloca may overlap and which may not. If the lifetime markers cannot play this role, then something else should. There needs to be *some* way for the frontend to encode this information.

At some point I lost track of what you even want to achieve. Your initial problem goes away if we only coalesce and fold comparisons between allocations for which the addresses are not observed.
Let's do that and thereby benefit from all the good things we can get from lifetime markers and coalescing, for alloca/malloc/...
Let's not go down the brittle road of syntactic limitations. It doesn't allow us to do a lot of things and will come with a price as we continue.

FWIW, I think some part of the misconception is that two allocas "will definitely get different addresses" which you argue "heavily restricts the possible allocation choices compared to the original C program".
The addresses are different but that is only interesting if you observe them. Take the C code: `{int a,b, c = (&a == &b); use(a,c);}`. I can happily coalesce `a` and `b` after the comparison was folded, or `b` and `c` even if the comparison is still in place. For neither transformation I need
lifetime markers to do it in IR.

A lot of my arguments around alloca/malloc/... is that we can augment the information from the frontend by determining if an address is observed or not. If it is not, we can limit its allocation, coalesce it, etc. Lifetime markers can help us do this but they are not the only way.
So, basically, we do not need lifetime markers to coalesce alloca/malloc/... but that they can help.

Let's try to get out of this rabbit hole. If you have a problem with the proposal as outlined in https://lists.llvm.org/pipermail/llvm-dev/2021-January/147594.html , please let me know. I assume dealing with observed addresses different than with non-observede ones (implied by noescape) is the way forward now.

~ Johannes

I hope this question has not been answered yet, but I don't see how that fold
could be legal. I asked the same question on Phabricator but it seems you prefer
to have the discussion here. Taking your example from there and adjusting it:

p = malloc(1)
q = malloc(1)
%c = icmp eq %p, %q
free(q)
free(p)

I think there is a guarantee that c will always be "false". Every operational
model of allocation that I have ever seen will guarantee this, and the same for
every program logic that I can imagine. So if the compiler may fold this to
"false", then as far as I can see, pointer comparison becomes entirely
unpredictable. The only consistent model I can think of is "icmp on pointers may
spuriously return 'true' at any time", which I doubt anyone wants. :wink:

In my understanding of
[expr.rel] or
[expr.rel] the result of this is unspecified.
While this does not necessarily extend to LLVM-IR, LLVM-IR usually
assumes C/C++ semantics unless defined otherwise.

I also thought the above is fine.

Optimizations such as https://reviews.llvm.org/D53362 and
https://reviews.llvm.org/D65408 assume the equivalence of alloca and
malloc+free. I assume that such optimizations are the reason for this
unspecifiedness, i.e. I think someone might want this. Johannes's
proposal A1 however, seems to forbid StackColoring to exploit lifetime
markers, which it currently does and I am not sure we can afford to do
that.

It's not that A1 "forbid[s] StackColoring to exploit lifetime markers", but
A1 says that you cannot use lifetime markers to argue about the address. So
you can still use them for reasoning but only as far as they tell you the
content is undef. If you want to do coalescing, you need to verify more things,
especially that the addresses are not (both) observed. If they are, which they
can be with lifetime markers in place as well, you end up with inconsistent views.

That said, if we want to preserve the property that you cannot access outside of
lifetime ranges, you could "fix" StackColoring by simply verifying one of the two
allocas is not escaping. You'd still need to fix InstCombine but the fix is the same.
I'm not sure we want to declare accesses outside of lifetime ranges UB or not. I imagine
in practice this makes little difference anyway, given that escaping uses are a problem
on their own.

~ Johannes