[RFC] Introduce non-capturing stores [second try]

NOTE: This was originally send in January 2021 [0]. The rational is still the same,
there are two different proposed solutions based on the conversations back then.

TL;DR: A pointer stored in memory is not necessarily captured, let's add a way to express this in IR.

--- Rational (copied mostly from [0]) ---

This would solve PR48475.

Runtime functions, as well as regular functions, might require a pointer
to be passed in memory even though the memory is simply a means to pass
(multiple) arguments. That is, the indirection through memory is only
used on the call edge and not otherwise relevant. However, such pointers
are currently assumed to escape as soon as they are stored in memory
even if the callee only reloads them and use them in a "non-escaping" way.
Generally, storing a pointer might not cause it to escape if all "uses of
the memory" it is stored to all have the "nocapture" property. While the
Attributor is aware of this and tries to determine all "copies" that the
store created, other passes are not and frontends cannot provide this
information for known APIs.

To allow optimizations in the presence of pointers stored to memory we
introduce *two* IR extensions:
  Option A) `!nocapture_store` metadata and `"nocapture_use"` operand
              bundle tags.
  Option B) `!nocapture_storage` metadata and `"nocapture_use"` operand
              bundle tags.
Option A) is what was proposed in [0]. Option B) is slightly different and
based on the discussions from [0] as well as a prototype patch [2].

Semantics Option A)
If a store of a pointer is tagged with `!nocapture_store` it guarantees that
the pointer is not captured by this store. To ensure we still "account" for the
uses of the pointer once it is reloaded we add the `"nocapture_use"` operand
bundle tag with the pointer as argument to callees that will interact with the
pointer loaded from memory.

Semantics Option B)
If a memory allocation is tagged with `!nocapture_storage` it guarantees that
stores of a pointer to that memory are not capturing the pointer. To ensure
we still "account" for the uses of the pointer once it is reloaded we add the
`"nocapture_use"` operand bundle tag with the pointer as argument to callees
that will interact with the pointer loaded from memory.
The difference to Option B) is that we do not tag stores but allocations.

--- Previous Discussion ---

The discussion as part of [0] did evolve around a way to handle yet another use case,
basically what happens if the reloads of the stored away pointer are (partially)
exposed rather than hidden behind a runtime function interface. The short answer is:
That is not supported by this RFC alone. The longer answer contains different possible
extensions to this RFC that would allow us to support such use cases. That said, the
runtime use case seems relevant enough to be handled first, especially since there is
no frontend/pass right now (in LLVM) that would rely on any of the extended use cases.

--- Proposal ---

Resurrect [1], or make [2] into a proper patch.
I still think [1] is the way to go as it is more generic and has less lookup cost.

This seems like a useful feature. I read the comments on the review, and I think the reviewers have a lot more context than I do, but with the limited information I have, tagging stores as in proposal A seems more general.

I have previously discussed the problem of how to make SROA work for C++ objects, in particular, std::vector, and I think that’s a use case worth considering in your design.

The problem with C++ objects is that there is often some out-of-line method (think grow) that blocks SROA by taking the address of the object. Other optimizations such as GVN may do some work to reduce loads and stores to the vector object, but we could probably do more optimizations if a vector looked like three pointers (begin, end, capacity) to the optimizer, rather than a struct in memory. Trying to solve this problem would tend to lead in the direction of option B, where we have a “blessed” alloca that can be used to spill SSA values back to memory in the ABI-expected struct layout when there is register pressure.

This is definitely beyond the scope of what you have described here, since grow updates vector fields, and they then have to be reloaded as new SSA values. The vector storage can even be captured by special members of the stored object, so “nocapture” is not a good name for this extended feature. The idea is more along the lines of, “here is a pointer SSA value, I have to store it in this ABI-prescribed struct layout, but I’m passing it on the side in a bundle using appropriate aliasing annotations to inform optimizations”.

I think there is something to this idea, but please don’t let me derail your proposal by increasing the scope too much.

Johannes,

JFYI, I'm still not convinced of this approach, but also don't have the cycles to engage with this and likely wont for a while. Mentioning this just so you know not to expect feedback from me, and to remove any implicit blockage.

Philip

This seems like a useful feature. I read the comments on the review, and I
think the reviewers have a lot more context than I do, but with the limited
information I have, tagging stores as in proposal A seems more general.

I have previously discussed the problem of how to make SROA work for C++
objects, in particular, std::vector, and I think that's a use case worth
considering in your design.

The problem with C++ objects is that there is often some out-of-line method
(think grow) that blocks SROA by taking the address of the object. Other
optimizations such as GVN may do some work to reduce loads and stores to
the vector object, but we could probably do more optimizations if a vector
looked like three pointers (begin, end, capacity) to the optimizer, rather
than a struct in memory. Trying to solve this problem would tend to lead in
the direction of option B, where we have a "blessed" alloca that can be
used to spill SSA values back to memory in the ABI-expected struct layout
when there is register pressure.

This is definitely beyond the scope of what you have described here, since
`grow` updates vector fields, and they then have to be reloaded as new SSA
values. The vector storage can even be captured by special members of the
stored object, so "nocapture" is not a good name for this extended feature.
The idea is more along the lines of, "here is a pointer SSA value, I have
to store it in this ABI-prescribed struct layout, but I'm passing it on the
side in a bundle using appropriate aliasing annotations to inform
optimizations".

I think there is something to this idea, but please don't let me derail
your proposal by increasing the scope too much.

Can you provide an example for this, some small C++ fragment is fine.
I want to play it through and also compare to a new "similar" idea.

~ Johannes

I’m not convinced that this is the way to go either.

The notion of capture is still mostly an implementation detail of LLVM’s alias analysis. It boils down to the fact whether the current implementation can keep track of all aliases of the pointer. We do have a section about pointer capture in the LangRef (https://llvm.org/docs/LangRef.html#pointer-capture), but it is specific to nocapture attribute on calls. It is defined by explicitly prohibiting specific operations on the pointer within the function.

I find proposed definition of nocapture_store metadata rather vague. From https://reviews.llvm.org/D93189:
“the pointer stored is not captured in the sense that all uses of the pointer are explicitly marked otherwise and the storing can be ignored during capture analysis”.

This definition makes lax use of the term “capture”. This term is only defined in the LangRef with respect to pointers and calls.
It refers to “capture analysis” which is an implementation detail of LLVM’s alias analysis and not defined anywhere.
It also says “all uses of the pointer are explicitly marked otherwise”. The proposed way of marking the uses is the nocapture_use operand bundle. So, essentially, the only way a “nocapture stored” pointer can be used today is in a call (this is because today we can only attach operand bundles to calls and invokes). This is very limiting.

This limitation can be lifted if we introduce operand bundles on arbitrary instructions. Assuming we do that, the definition above suggests that we need to mark all uses of the “nocapture stored” pointers. It means every single instruction, including geps, bitcasts, etc. need to bear this operand bundle. This looks verbose and fragile. Every transform now needs to preserve this marking, otherwise risking a miscompile.

Speaking of possible alternatives, we can probably define “nocapture storage” as memory that doesn’t outlive the scope of the current function (it should probably have some other name, because it doesn’t rely on term capture). Having this property we can extend capture tracking with flow-insensitive analysis to keep track of pointers stored into “nocapture storage”.

The other alternative that was suggested previously was to explicitly mark aliasing properties in the IR similarly to (or using) noalias and alias.scope metadata.

Artur

Let me copy the motivation example given in the previous RFC email:

struct Payload {
   int *a;
   double *b;
};

void fn(void *v) {
   Payload *p = (Payload*)(v);
   // Load the pointers from the payload and then dereference them,
   // this will not capture the pointers.
   int *a = p->a;
   double *b = p->b;
   *a = use(*b);
}

void foo(int *a, double *b) {
   Payload p = {a, b};
   pthread_create(..., &fn, &p);
}

Makes sense that a & b don't escape.
Right now they do, but not during the store to p; they only escape at the pthread_create call site. Storing a pointer to an unescaped alloca buffer doesn't escape anything as that memory is not externally observable.

I think this hints at establishing the property you want at call sites, not at allocas or stores. The nocapture attribute is only for the given pointer, while it seems you need a 2-level nocapture (or just recursive, to simplify?).

Unless you have other examples where annotating calls doesn't work, I would go that route instead.

Nuno

Let me copy the motivation example given in the previous RFC email:

struct Payload {
    int *a;
    double *b;
};

void fn(void *v) {
    Payload *p = (Payload*)(v);
    // Load the pointers from the payload and then dereference them,
    // this will not capture the pointers.
    int *a = p->a;
    double *b = p->b;
    *a = use(*b);
}

void foo(int *a, double *b) {
    Payload p = {a, b};
    pthread_create(..., &fn, &p);
}

Makes sense that a & b don't escape.
Right now they do, but not during the store to p; they only escape at the pthread_create call site. Storing a pointer to an unescaped alloca buffer doesn't escape anything as that memory is not externally observable.

I think this hints at establishing the property you want at call sites, not at allocas or stores. The nocapture attribute is only for the given pointer, while it seems you need a 2-level nocapture (or just recursive, to simplify?).

Unless you have other examples where annotating calls doesn't work, I would go that route instead.

I have examples, basically we can encode what the Attributor can determine when it comes to memory.
Examples that involve heap memory, stack memory, and global memory can be found here
⚙ D109170 [Attributor] Look through allocated heap memory
While those mostly show how we can look through memory, when we fail to propagate the value we could mark the stores as non-capture (assuming they are).

There are two main benefit of marking stores rather than call sites:
1) If we have call sites we need to first go to the storage from the store, which might not be in the same function and therefore really hard for non-IPO analysis.
Once at the storage location we need to traverse all uses again until we verify the storage is used only in a certain way. And, which plays into this,
2) We want a system that composes, e.g., store into a local allocation which is stored into a heap allocation which is stored into a global, but all in all - nothing escapes.

The Attributor can deal with both points above just fine, annotating the stores would make the information trivially available to later passes and for frontends it would
not make an obvious difference if they annotate stores or call sites (in the domain knowledge case).

~ Johannes

I'm not convinced that this is the way to go either.

The notion of capture is still mostly an implementation detail of LLVM's alias analysis. It boils down to the fact whether the current implementation can keep track of all aliases of the pointer. We do have a section about pointer capture in the LangRef (LLVM Language Reference Manual — LLVM 16.0.0git documentation), but it is specific to nocapture attribute on calls. It is defined by explicitly prohibiting specific operations on the pointer within the function.

We don't prohibit operations per se. We prohibit behavior. You can copy a pointer if the copy is never read.
Implementation detail is that we can often not proof a copy is never read and therefore not a capture.

In general all/most attributes limit the possible semantics of a program, I would not call these details of the AA, e.g., inaccessible_mem is not an AA detail either even if we use it (mostly) there.

I find proposed definition of nocapture_store metadata rather vague. From Login
"the pointer stored is not captured in the sense that all uses of the pointer are explicitly marked otherwise and the storing can be ignored during capture analysis".

This definition makes lax use of the term "capture". This term is only defined in the LangRef with respect to pointers and calls.

That is valid criticism but totally addressable. Up until Philip added the lang ref part we never defined what capture is but simply went along with some notion of it.
Defining it further can be part of this for sure.

Arguably, the "or stored in the memory before the call" part of the LangRef right now is simply misleading as it is.
This implies some connection of capturing with calls, which is not a thing, and further sounds like there is a connection wrt. the order of instructions, which is not a thing either.
I can will update the patch with a new lang ref if we agree on the general outline and we can discuss details there.

Summary:

An operation can capture a pointer in a given scope if the execution of the operation makes any bit of that pointer available to an observer that is not explicitly marked as user of the pointer.

So, when u mark an operation as "nocapture", e.g. a store, you need to ensure all copies that are made/used through that operation are explicitly marked, e.g., for the runtime call site use case with operand bundles which we can later also use for loads.

It refers to "capture analysis" which is an implementation detail of LLVM's alias analysis and not defined anywhere.

Being captured is a property with semantic implication. Capture analysis determines that property. While we can certainly rephrase this, it's not a conceptual problem in my book.

It also says "all uses of the pointer are explicitly marked otherwise". The proposed way of marking the uses is the nocapture_use operand bundle. So, essentially, the only way a "nocapture stored" pointer can be used today is in a call (this is because today we can only attach operand bundles to calls and invokes). This is very limiting.

All true but on purpose. You need to make sure to mark copies explicitly if you disable the implicit way (via an attribute).
Let's talk about alternatives and limits:
- Marking call sites with attributes avoids this but introduces extra lookup costs. It is hard to make it compose (pointer, in memory. in memory,...). It works only with calls.
- Marking allocations requires lookup costs (not even less than above as you need to get to the allocation first and then to all uses). It is composable. It works with call and other uses (e.g., loads) either via operand bundles or def-use chain traversal.
- Marking stores (or in general operations) does not require lookup costs. It is composable. It works with call and other uses (again via operand bundles *or* via def-use chain traversal).

So from all options we discussed, marking the operation is the most direct way of encoding with the least drawbacks and all possibilities to extend it as we go.
We can add `nocapture` annotations to other operations (e.g., compares), we can add operand bundles to other uses (e.g., loads), or we can have a `nocapture store` version which requires the user to traverse the def-use chain of the allocation to find the potential users.
The proposal at hand is not restricting us but a setup for most performance in the use case we first tackle.

This limitation can be lifted if we introduce operand bundles on arbitrary instructions. Assuming we do that, the definition above suggests that we need to mark *all* uses of the "nocapture stored" pointers. It means every single instruction, including geps, bitcasts, etc. need to bear this operand bundle. This looks verbose and fragile. Every transform now needs to preserve this marking, otherwise risking a miscompile.

We can extend operand bundles to instructions and there have been use cases before. Regardless of the way we encode it, e.g., as extra/optional pointer operand or operand bundle or intrinsic that ties the load and the stored value together, you never need to annotate everything.
I'm not sure where that idea comes from but the use case in which we do not have a runtime call (behind which all the uses of the stored value are hidden) requires you only to annotate the potential copies of the "nocapture stored" value. As mentioned above, we could even have a
version that requires you to traverse the def-use chains to find all uses but that is only slightly more helpful than no annotation at all. Instead you would do this:

store %p %mem, !nocapture
...
bc = bitcast %mem to ...
copy = load %bc, !nocapture_use(%p)
// or alternatively
copy_val = load %bc
copy = llvm.nocapture_use(%copy2_val, %p)
...
use(%copy)

Now, %copy is marked as potential copy of %p explicitly. If you see the store of %p you can ignore it effectively as you'll see the uses later.
No bitcast, gep, or anything else needs to be marked or is impacted, it's all about the value that goes into the operation (here into the store) and the potential copies that come from the operation performed (here the store).

Speaking of possible alternatives, we can probably define "nocapture storage" as memory that doesn't outlive the scope of the current function (it should probably have some other name, because it doesn't rely on term capture). Having this property we can extend capture tracking with flow-insensitive analysis to keep track of pointers stored into "nocapture storage".

Nocapture storage should not be limited by lifetime. We want this property for globals too.

The other alternative that was suggested previously was to explicitly mark aliasing properties in the IR similarly to (or using) noalias and alias.scope metadata.

This encodes uses of potential copies of values. That scales nicely and is generic, e.g., we can also go away from the "nocapture" part of all this and use a very similar mechanism to track potential copies of any value through memory (and calls).

~ Johannes

I'm not convinced that this is the way to go either.

The notion of capture is still mostly an implementation detail of LLVM's alias analysis. It boils down to the fact whether the current implementation can keep track of all aliases of the pointer. We do have a section about pointer capture in the LangRef (LLVM Language Reference Manual — LLVM 16.0.0git documentation), but it is specific to nocapture attribute on calls. It is defined by explicitly prohibiting specific operations on the pointer within the function.

We don't prohibit operations per se. We prohibit behavior. You can copy a pointer if the copy is never read.
Implementation detail is that we can often not proof a copy is never read and therefore not a capture.

In general all/most attributes limit the possible semantics of a program, I would not call these details of the AA, e.g., inaccessible_mem is not an AA detail either even if we use it (mostly) there.

We don’t yet have a good definition what “capture” means. We have an implementation which we are implicitly referring to. That’s why *at this point* I call it an implementation detail.

Below you make an attempt to formalize it. I think this is a step in the right direction.

I find proposed definition of nocapture_store metadata rather vague. From Login
"the pointer stored is not captured in the sense that all uses of the pointer are explicitly marked otherwise and the storing can be ignored during capture analysis".

This definition makes lax use of the term "capture". This term is only defined in the LangRef with respect to pointers and calls.

That is valid criticism but totally addressable. Up until Philip added the lang ref part we never defined what capture is but simply went along with some notion of it.
Defining it further can be part of this for sure.

Arguably, the "or stored in the memory before the call" part of the LangRef right now is simply misleading as it is.
This implies some connection of capturing with calls, which is not a thing, and further sounds like there is a connection wrt. the order of instructions, which is not a thing either.
I can will update the patch with a new lang ref if we agree on the general outline and we can discuss details there.

Summary:

  An operation can capture a pointer in a given scope if the execution of the operation makes any bit of that pointer available to an observer that is not explicitly marked as user of the pointer.

Before we introduce the nocapture enhancements, let’s test the definition we come up with against the existing CaptureTracking implementation.

Currently, one of the categories of uses handled by capture tracking is uses which don’t capture, but produce a new alias for the original pointer. These are geps/bitcasts/phis/selects/some intrinsics. How do they fit into this definition? They make the pointer ‘available’ for the users of these instructions, but we don’t have to mark these users in any special way. CaptureTracking simply knows how to ’track’ through such aliases. I think formalizing this part will be tricky, because it is just an implementation detail of CaptureTracking today.

With nocapture enhancements you add a requirement to ‘mark’ nocapture users, but you don’t give any guidance which users should be marked and which should not. This is why I ask about marking geps/bitcasts/etc. They are also users of the pointer.

So, when u mark an operation as "nocapture", e.g. a store, you need to ensure all copies that are made/used through that operation are explicitly marked, e.g., for the runtime call site use case with operand bundles which we can later also use for loads.

It refers to "capture analysis" which is an implementation detail of LLVM's alias analysis and not defined anywhere.

Being captured is a property with semantic implication. Capture analysis determines that property. While we can certainly rephrase this, it's not a conceptual problem in my book.

It also says "all uses of the pointer are explicitly marked otherwise". The proposed way of marking the uses is the nocapture_use operand bundle. So, essentially, the only way a "nocapture stored" pointer can be used today is in a call (this is because today we can only attach operand bundles to calls and invokes). This is very limiting.

All true but on purpose. You need to make sure to mark copies explicitly if you disable the implicit way (via an attribute).
Let's talk about alternatives and limits:
- Marking call sites with attributes avoids this but introduces extra lookup costs. It is hard to make it compose (pointer, in memory. in memory,...). It works only with calls.
- Marking allocations requires lookup costs (not even less than above as you need to get to the allocation first and then to all uses). It is composable. It works with call and other uses (e.g., loads) either via operand bundles or def-use chain traversal.
- Marking stores (or in general operations) does not require lookup costs. It is composable. It works with call and other uses (again via operand bundles *or* via def-use chain traversal).

So from all options we discussed, marking the operation is the most direct way of encoding with the least drawbacks and all possibilities to extend it as we go.
We can add `nocapture` annotations to other operations (e.g., compares), we can add operand bundles to other uses (e.g., loads), or we can have a `nocapture store` version which requires the user to traverse the def-use chain of the allocation to find the potential users.
The proposal at hand is not restricting us but a setup for most performance in the use case we first tackle.

The part that you need to mark nocapture users is the most sketchy IMO. As I mentioned above, how we know what should be explicitly marked and what not?

The way I read the original proposal is *every* user of the nocapture stored pointer need to be marked.

This limitation can be lifted if we introduce operand bundles on arbitrary instructions. Assuming we do that, the definition above suggests that we need to mark *all* uses of the "nocapture stored" pointers. It means every single instruction, including geps, bitcasts, etc. need to bear this operand bundle. This looks verbose and fragile. Every transform now needs to preserve this marking, otherwise risking a miscompile.

We can extend operand bundles to instructions and there have been use cases before. Regardless of the way we encode it, e.g., as extra/optional pointer operand or operand bundle or intrinsic that ties the load and the stored value together, you never need to annotate everything.
I'm not sure where that idea comes from but the use case in which we do not have a runtime call (behind which all the uses of the stored value are hidden) requires you only to annotate the potential copies of the "nocapture stored" value. As mentioned above, we could even have a
version that requires you to traverse the def-use chains to find all uses but that is only slightly more helpful than no annotation at all. Instead you would do this:

store %p %mem, !nocapture
...
bc = bitcast %mem to ...
copy = load %bc, !nocapture_use(%p)
// or alternatively
copy_val = load %bc
copy = llvm.nocapture_use(%copy2_val, %p)
...
use(%copy)

Now, %copy is marked as potential copy of %p explicitly. If you see the store of %p you can ignore it effectively as you'll see the uses later.
No bitcast, gep, or anything else needs to be marked or is impacted, it's all about the value that goes into the operation (here into the store) and the potential copies that come from the operation performed (here the store).

Speaking of possible alternatives, we can probably define "nocapture storage" as memory that doesn't outlive the scope of the current function (it should probably have some other name, because it doesn't rely on term capture). Having this property we can extend capture tracking with flow-insensitive analysis to keep track of pointers stored into "nocapture storage".

Nocapture storage should not be limited by lifetime. We want this property for globals too.

The other alternative that was suggested previously was to explicitly mark aliasing properties in the IR similarly to (or using) noalias and alias.scope metadata.

This encodes uses of potential copies of values. That scales nicely and is generic, e.g., we can also go away from the "nocapture" part of all this and use a very similar mechanism to track potential copies of any value through memory (and calls).

Right, lifting the problem to explicit mayalias/noalias properties eliminates the nocapture part of the discussion. Aliasing properties are better defined in the lang ref.

I guess we will only need to annotate memory operations in this approach (while nocapture property will require annotating things like comparisons).

Artur