Motivation
In this RFC, I’d like to discuss three improvements we can make to increase the precision of capture tracking. My primary motivation is the first one, but I’d like to bring up the rest as well, so we can amortize the cost of making changes to the nocapture representation in IR.
Distinguishing address capture and provenance escape
Currently, LLVM’s nocapture attribute and the CaptureTracking analysis conflate two concepts:
- Address capture: This means that information about the identity or bitwise representation of the address may be leaked. For example, the function may convert the pointer to an integer or perform a pointer comparison.
- Provenance escape: This means that memory accesses may be performed through the escaped pointer after the function returns.
These two concepts are relevant in different scenarios. For example, alias analysis cares exclusively about provenance escapes. Conversely, optimizations replacing one allocation with another may only care about address capture.
In practice, with LLVM IR as it currently is, we usually cannot distinguish between these two concepts. If you have a store ptr %p, ptr @global, then this is both an address capture and a provenance escape, as we can’t track how the pointer will be used after this.
One notable exception is pointer icmps, which constitute address captures, but not provenance escapes. For example, if you have icmp eq ptr %p, %q, then you may not be allowed to access the memory behind %p through %q, despite them having the same address.
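To make the distinction concrete, here is a minimal C++ sketch of source-level functions that would fall into each category. The function names and the global `g_slot` are hypothetical and chosen only for illustration; how a given compiler lowers them to IR is not shown.

```cpp
#include <cassert>

// Hypothetical global, standing in for any memory that outlives the call.
static int *g_slot = nullptr;

// Address capture only: the result reveals whether p and q have the same
// address, but no copy of p survives the call.
bool same_address(const int *p, const int *q) { return p == q; }

// Address capture and provenance escape: the pointer is stored, so the
// pointee may be accessed through g_slot after the call returns.
void stash(int *p) { g_slot = p; }

// Neither: the pointee is read during the call, but nothing about the
// pointer itself is observable afterwards.
int load(const int *p) { return *p; }

// What a caller can observe after each kind of call.
void demo_capture_kinds() {
  int a = 1, b = 2;
  assert(same_address(&a, &a) && !same_address(&a, &b));
  assert(load(&b) == 2);
  stash(&a);
  *g_slot = 7;      // write through the escaped pointer after stash() returned
  assert(a == 7);
  g_slot = nullptr; // do not keep a pointer to a dead local around
}
```

Only `stash` forces alias analysis to assume that `a` may be modified behind the caller's back. `same_address` matters only to transformations that care about the address itself.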
Furthermore, I expect that future IR improvements will make the difference more pronounced. In particular, Rust has a ptr.addr() operation, which returns the integral address of the pointer, without exposing its provenance. Ideally, we would represent this as a ptrtoint noescape operation in LLVM IR to enable additional optimization, but this capability currently doesn’t exist, and wouldn’t be particularly useful until we separate the capture/escape concepts.
Distinguishing read-only and read-write provenance escape
The provenance escape case can be further split into two cases: In one, the escaped pointer can only be used in a read-only fashion. In the other, it can be used for reads and writes. This distinction can be quite useful to improve alias analysis results, as many cases only care about potential writes, not reads.
This is a property that we’re unlikely to derive from IR analysis, but that a frontend can provide. For example, if you pass a (non-mut, Freeze) reference to a function in Rust, then the pointer will not only be read-only inside the function, but any escape based on it will also be read-only.
Distinguishing capture via return and other pathways
Finally, it can be useful to distinguish whether a capture can occur only because the pointer is returned, or also for other reasons (like a store into memory). If the capture is only via return, then CaptureTracking can continue recursive analysis on the return value.
The Attributor framework currently represents this using a custom "no-capture-maybe-returned" attribute.
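As a rough illustration of why the return-only case is worth separating, consider the following C++ sketch. The function names and the global `g_last` are hypothetical.

```cpp
#include <cassert>

// Hypothetical global, standing in for any memory that outlives the call.
static int *g_last = nullptr;

// Capture only via the return value: the sole copy of p that survives the
// call is derived from the returned pointer, so an analysis can keep
// following the uses of the result in the caller.
int *skip_first(int *p) { return p + 1; }

// Capture via return and via a store: following the returned pointer is
// not enough, because g_last also holds a copy.
int *remember_and_skip(int *p) {
  g_last = p;
  return p + 1;
}

void demo_return_capture() {
  int buf[2] = {1, 2};
  int *q = skip_first(buf);
  *q = 5; // every access to buf is still visible through q
  assert(buf[1] == 5);
  int *r = remember_and_skip(buf);
  assert(r == buf + 1 && g_last == buf);
  g_last = nullptr; // do not keep a pointer to a dead local around
}
```

In the first case, tracking the uses of `q` in the caller recovers everything that can happen to `buf`. In the second, the analysis has to give up once the store is seen.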
Proposal
I’m somewhat unsure what the best way to solve all of the above problems would be in terms of IR representation, so I’ll build this up in multiple steps. Feedback on what we should actually do here is greatly appreciated!
Distinguishing address capture and provenance escape
This part can be easily addressed by splitting nocapture into two attributes:
- nocapture: Information about the identity or integral representation of the address may not be captured. Makes no statement about the provenance of the pointer.
- noescape: Provenance of the pointer may not escape, and no memory accesses may be performed through it after the function returns. Makes no statement about the address identity or representation.
An interesting question here is whether having only nocapture would make sense under the new semantics. This means that there is a potential provenance escape, but no address capture. I don’t think this is something we can infer (any operation that leaks provenance also leaks the address), but it’s plausible that a frontend could provide the information (e.g. if it just never allows inspecting the identity/address of a pointer). As such, I think it’s best to keep both attributes completely orthogonal and allow all combinations of them.
Distinguishing read-only and read-write provenance escape
To support this, I think it would be best to represent captures using a single attribute that specifies everything that may be captured instead:
- No attribute: Everything may be captured.
- captures(none) (= nocapture noescape)
- captures(address) (= noescape)
- captures(provenance) (= nocapture)
- captures(address, read-provenance) (a Rust & reference)
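One way to picture the single-attribute form is as a small bit set of captured components. The C++ sketch below is purely illustrative: the constant names, the bit layout, and the helper functions are hypothetical, and it assumes that full provenance subsumes read-only provenance, which the proposal does not spell out.

```cpp
#include <cstdint>

// Hypothetical encoding of the components named in captures(...).
using Captures = std::uint8_t;
constexpr Captures kNone = 0;
constexpr Captures kAddress = 1;
constexpr Captures kReadProvenance = 2;
// Assumption: read-write provenance includes read-only provenance.
constexpr Captures kProvenance = kReadProvenance | 4;

// Map the two-attribute form (nocapture / noescape) onto the bit set.
constexpr Captures from_flags(bool nocapture, bool noescape) {
  return static_cast<Captures>((nocapture ? kNone : kAddress) |
                               (noescape ? kNone : kProvenance));
}

// Queries that different clients might ask.
constexpr bool may_capture_address(Captures c) { return (c & kAddress) != 0; }
constexpr bool may_read_after_return(Captures c) {
  return (c & kReadProvenance) != 0;
}
constexpr bool may_write_after_return(Captures c) {
  return (c & kProvenance) == kProvenance;
}

// The table above, checked at compile time.
static_assert(from_flags(true, true) == kNone, "captures(none)");
static_assert(from_flags(false, true) == kAddress, "captures(address)");
static_assert(from_flags(true, false) == kProvenance, "captures(provenance)");
```

Under this encoding, captures(address, read-provenance) is `kAddress | kReadProvenance`: the pointee may still be read after the call, but an analysis that only cares about writes can treat the pointer as not escaping.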
An alternative would be to solve this completely independently of the “capturing” property, by changing the semantics of the readonly attribute instead.
It is my understanding that readonly currently only constrains accesses during the function call. That is, if a readonly pointer escapes, then it is legal to perform write accesses through it after the function returns.
We could instead define readonly as a provenance restriction, in which case any access based on the readonly pointer would have to be read-only, even after the function returns.
I think that overall, this may be the cleaner approach, but it does cause an asymmetry between readonly and writeonly. That is, readonly would be defined in terms of provenance, while writeonly would be defined in terms of effects.
Distinguishing capture via return and other pathways
This is the part I’m least certain about. One way to approach this would be to mirror the memory attribute and specify which locations capture which information. For example captures(none, return: address, provenance) would express that both address and provenance are captured, but only via the return value.
To be honest, I think this is somewhat overkill, at least if “return” is the only location we’re interested in separating. This also allows things like captures(address, return: provenance), which is more fine-grained than we can really use.
Possibly this just needs a one-bit modifier along the lines of captures(address, provenance in return).
Questions
From my side, the two main questions I’d like to have some feedback on are:
- Should the “read-only provenance escape” case be handled as part of the capture representation, or by changing readonly semantics?
- Do we want to distinguish return-only captures, and if so, what should the syntax be?