Treating value categories correctly in the dataflow framework
In the bargain, we eliminate ReferenceValue, SkipPast, and the duplication between StructValue and AggregateStorageLocation
Context
Objective
I propose a change to the dataflow framework to treat values and references in accordance with C++ value categories[1] (which the framework does not do rigorously today). This change serves two purposes:
- Improve code quality and maintainability by eliminating redundant constructs that are a source of bugs, specifically:
- Eliminate
ReferenceValue
andSkipPast
(see section “Background” below for an explanation of why these are redundant / unnecessary) - Eliminate the duplication of functionality between
StructValue
andAggregateStorageLocation
- Eliminate
- Improve precision of the analysis through closer adherence to C++ semantics, particularly in the treatment of result objects, returning class objects by value, and supporting NRVO.
Background
This change will establish a clear dichotomy between glvalues and prvalues. As a reminder, any expression is either a glvalue or a prvalue:
- A glvalue is an expression that establishes the identity of an object or function; it can be thought of as “something with an address”.
- A prvalue is a pure value. It does not have an address, and it is therefore not legal to apply the address-of operator to it.
As a basis for further discussion, consider the following code sample (godbolt link):
void foo() {
int a = 0;
int &b = a;
a = 1;
b = 2;
}
The expressions a
and b
in the example above both have the same type: They are glvalues (or, more specifically, lvalues) of type int
(see also the Clang AST in godbolt). In C++, expressions never have reference type; more precisely, to quote the standard: “If an expression initially has the type “reference to T
”, the type is adjusted to T
prior to any further analysis.”
The expressions 1
and 2
in the example are prvalues of type int
.
The dataflow framework currently treats these expressions as follows:
- The glvalue expressions
a
andb
are both associated with aStorageLocation
that maps to aReferenceValue
; theReferentLoc
of thisReferenceValue
maps to anIntegerValue
. (Note that this is true forDeclRefExpr
s, but for many other kinds of glvalue expression, the framework produces aStorageLocation
that maps directly to the value, without the intermediateReferenceValue
. This inconsistency presents an additional difficulty that needs to be worked around using theSkipPast
functionality – see below.) - The prvalue expressions
1
and2
are both associated with aStorageLocation
that maps to anIntegerValue
. Again, this is one indirection too many.
Both of these representations really contain one indirection too many:
- The
StorageLocation
associated with a glvalueint
could map (consistently) directly to anIntegerValue
. (Again, consider that expressions in C++ cannot have reference type.) - A prvalue
int
could be associated directly withIntegerValue
, as a prvalue does not have an address, i.e. aStorageLocation
.
The essence of this proposal is to change the representation of glvalues and prvalues as described above.
This change would also eliminate the need for the SkipPast::Reference
functionality. This is provided today to deal with the fact that a glvalue may (but need not) involve a double indirection through a ReferenceValue
. Calling Environment::getStorageLocation(E, SkipPast::Reference)
for an expression E
checks whether the StorageLocation
associated with E
maps to a ReferenceValue
; if so, it returns the ReferentLoc
of that ReferenceValue
, otherwise it returns the StorageLocation
associated with E
.
Finally, note that the dataflow framework also currently treats the declarations for a
and b
differently:
- The declaration for
a
is associated with aStorageLocation
that maps to anIntegerValue
. - The declaration for
b
is associated with aStorageLocation
that maps to aReferenceValue
, which in turn maps to anIntegerValue
.
There isn’t any real need to make this distinction. A reference is not itself an object; it is merely an alias to an existing object. The declaration b
could therefore be associated with a StorageLocation
that maps directly to an IntegerValue
.
Design
Mapping expressions to StorageLocations and Values
I propose changing the treatment of glvalues and prvalues as follows:
- A glvalue
Expr
is associated with aStorageLocation
through theEnvironment::ExprToLoc
map, just like today.
However, theValue
associated with thisStorageLocation
(tracked inEnvironment::LocToVal
) is never aReferenceValue
; it is always aValue
that corresponds directly to the expression’s type (e.g. a glvalue of typeint
is always associated with anIntegerValue
). - A prvalue
Expr
must no longer be inserted intoExprToLoc
, as prvalues don’t have a storage location. Instead, a prvalueExpr
is associated directly with aValue
through a newly introduced mapEnvironment::ExprToVal
. It is not legal to insert glvalueExpr
s into this map.
To enforce the invariants associated with ExprToLoc
and ExprToVal
, the accessor functions are changed as follows:
- The newly added function
setValue(const Expr &, Value &)
adds entries toExprToVal
. -
setStorageLocation(const Expr &E, StorageLocation &)
andgetStorageLocation(const Expr &E)
may only be called ifE.isGLValue()
is true. -
setValue(const Expr &E, Value &)
andgetValue(const Expr &E)
may only be called ifE.isPRValue()
is true. (Note that theSkipPast
parameter has been eliminated fromgetValue()
.)
Environment::join()
, Environment::widen()
and Environment::equivalentTo()
are modified to treat ExprToVal
in the same way as LocToVal
.
There is an additional benefit to separating glvalues and prvalues cleanly. Because most AST nodes operate on prvalues, the transfer functions for these nodes no longer need to unpack the StorageLocation
s for their operands and create StorageLocation
s for their results.
Mapping declarations to StorageLocations
I propose associating global and local variables of reference type directly with the StorageLocation
of the referenced object instead of the StorageLocation
of a ReferenceValue
.
The StorageLocation
s for global and local variables are tracked in Environment::DeclToLoc
. So far, this has been a strictly append-only map: New entries can be added, but the StorageLocation
s for existing declarations cannot be changed.
Changing the treatment of references as described above will require weakening this property, as the storage location associated with a reference can change from one loop iteration to the next. For example:
int a = 1;
int b = 2;
for (int i = 0; i < 42; ++i) {
int &r = i % 2 == 0? a : b;
}
However, the storage location associated with the reference can obviously never change during the lifetime of the reference. This means that we will never need to perform any non-trivial join operations on DeclToLoc
. Let’s expand the example to see where joins can happen:
for (int i = 0; i < 42; ++i) { // join 1
int &r = i % 2 == 0? a : b; // join 2
if (some_condition()) {
...
} else {
...
}
// join 3
// ...
}
In this example, we have three joins:
- Join 1 happens at the top of the loop, before the lifetime of the reference begins. Here, only one of the
Environment
s to be joined contains an entry forr
inDeclToLoc
, namely the one that came from the previous loop iteration. - Join 2 happens within the conditional operator but, again, before the lifetime of the reference begins. Neither of the
Environment
s to be joined contains an entry forr
inDeclToLoc
because the result of join 1 doesn’t contain such an entry (see below). - Join 3 happens after the if statement. Here, both
Environment
s to be joined contain an entry forr
inDeclToLoc
, but both entries map to the sameStorageLocation
.
The joined DeclToLoc
should therefore only contain entries for declarations that exist in both input DeclToLoc
s, and we should verify in this case that both declarations map to the same StorageLocation
. Environment::join()
already does exactly this today, except that it does not yet contain an assertion that both declarations map to the same StorageLocation
.
Dealing with AggregateStorageLocation and StructValue
We’ve discussed above how we will map global and local variables of reference type directly to the StorageLocation
of the referenced object. We also need to do this for fields (non-static member variables) of reference type, and to do so we need to change how we deal with values of class type[2].
One difficulty in dealing with values of class type in the dataflow framework is that there are two data structures that are used to represent class values in different situations: AggregateStorageLocation
and StructValue
. Broadly speaking, an AggregateStorageLocation
represents a glvalue of class type, and a StructValue
represents a prvalue of class type (though the framework does not make a clear distinction between these value categories today). The framework contains code for converting between the two representations and keeping them in sync along with tests for this behavior.
AggregateStorageLocation
is a map from a field to its corresponding StorageLocation
. Removing ReferenceValue
from this representation can be done in the same way as for global and local variables by associating a field of reference type directly with the StorageLocation
of the referenced object instead of the StorageLocation
of a ReferenceValue
.
Removing ReferenceValue
from StructValue
is more difficult. StructValue
is a map from a field to its corresponding Value
. There is therefore no way to associate a field of reference type directly with the StorageLocation
of the referenced object; the only option under the current representation of StructValue
is to associate the field with a ReferenceValue
.
I therefore propose to change the implementation of StructValue
so that it is merely a wrapper for an AggregateStorageLocation
:
class StructValue final : public Value {
public:
explicit StructValue(AggregateStorageLocation &Loc)
: Value(Kind::Struct), Loc(Loc) {}
// other public members omitted
private:
AggregateStorageLocation &Loc;
};
This change is motivated by the following C++ language rules:
-
Every prvalue of class type ends up initializing some result object[3], i.e. it eventually assumes identity, a
StorageLocation
. It is therefore justified to associate a storage location with every prvalue of class type. -
The dot member access operator can only be applied to glvalues of class type, not to prvalues. Therefore, when code accesses a class member, it always does so on an object that has identity, i.e. a
StorageLocation
.
These two rules taken together mean that prvalues of class type serve only a very limited role: They convey a value of class type from the place where it was created to the result object it initializes; no operations on the class members can happen along this route[4].
Note that the value of class type isn’t always created in the same function as the result object it initializes. Here is an example (godbolt):
struct A {
int i;
};
A createA(bool b) { return b ? A{0} : A{1}; }
void foo(bool b) { A a = createA(b); }
The result object for the values of type A
created in createA
is the variable a
in foo()
, and the language guarantees that the objects will be directly constructed into a
, with no copying. Clang needs to go to some effort to guarantee this: Assuming no inlining, it needs to pass a hidden argument for the address of the result object to createA()
, and when doing codegen for createA()
it needs to propagate that address downwards through the AST subtree that represents the return statement. (It’s instructive to look at the AST for this example on godbolt.)
However, the dataflow framework can take a simpler approach. We know that a prvalue of class type will ultimately initialize some result object, so the framework can eagerly create an AggregateStorageLocation
at the point where the prvalue is created. It wraps this AggregateStorageLocation
in a StructValue
so that it can pass it through prvalue AST nodes (such as the conditional operator, return statement, and call expression in the example above) and uphold the invariant that prvalues are always associated with a Value
, not a StorageLocation
. Finally, when the analysis reaches the AST node that identifies the result object (in the example, the declaration of a
), it unwraps the StructValue
and associates the result object with the AggregateStorageLocation
. It also adds an entry to LocToVal
to associate the AggregateStorageLocation
with the StructValue
, as custom analyses may still want to attach properties to the StructValue
.
The language rules discussed so far apply to C++, but C has slightly different rules. In C, the dot member access operator can be applied to an rvalue[5] of struct type (example on godbolt, observe the MemberExpr
for createA(b).i
).
The dataflow framework currently supports only C++, and there are no plans to work on C support for the foreseeable future. On the other hand, there are some high-value C code bases, such as the Linux kernel, on which it would be desirable to run dataflow analysis. It therefore seems preferable at least not to make any design decisions that would make it difficult to support C in the future.
Fortunately, I don’t believe that the proposed change would be an obstacle to supporting C’s operations on struct rvalues. The fields of the proposed StructValue
can be accessed through the contained AggregateStorageLocation
. This would establish the fiction that a StructValue
always has a storage location, which isn’t true in C, but this fiction seems harmless as it can be contained within the StructValue
without “leaking” out into any other code.
Possible future extension: Alias sets
A reviewer of an early draft of this proposal remarked that removing ReferenceValue
might remove a potentially useful extension point that could be used to support alias sets.
This document proposes treating references the same as non-reference variables and associating them directly with a StorageLocation
, but references are different in one important way: They may refer to different variables at runtime, and we may not be able to statically determine which they will refer to. For example:
int a = 0;
int b = 1;
int &r = some_condition() ? a : b;
We would want to model this by associating r
with an alias set that has two entries, one for the storage location of a
and one for the storage location of b
.
ReferenceValue
would provide a possible extension point for doing so: Instead of containing just a single StorageLocation
, as it does today, ReferenceValue
could contain a set of StorageLocation
s.
However, I believe there is a simpler and more consistent solution. We introduce a class AliasSet
, which contains a set of StorageLocation
s. Declarations and glvalue expressions map to an AliasSet
(instead of a StorageLocation
as they do today), and likewise PointerValue
contains an AliasSet
. This provides a consistent treatment of alias sets across the framework and avoids the problems incurred by treating references as if they were values.
Implementation experience
There is existing implementation experience for many of the ideas proposed in this document.
Crubit’s lifetime inference tool, which is built on top of the dataflow framework, implements its own object representation and points-to graph as it requires alias sets, which the dataflow framework does not support today. Its transfer function treats value categories as proposed in this document, though it only handles prvalues if they are of pointer type as no other prvalues are relevant to the analysis. It also implements alias sets as described in the section “Possible future extension: Alias sets” above.
In addition, I have begun prototyping the ideas described in this document and have encountered no obvious roadblocks; the prototype code is not however at a point yet where it makes sense to share it publicly.
Multi-step migration to prevent breaking existing clients of the framework
This proposal makes significant changes to the API of the dataflow framework, both syntactically and semantically. The dataflow framework doesn’t provide any real API stability guarantees, but we should still make a best effort to make the changes easy to absorb for existing users. This will require the changes to be rolled out in multiple stages. Here is a high-level description of how I think this can be done.
New value category semantics in Environment
In the a first step, I propose providing additional Environment
member functions that implement the new value category semantics but leaving the existing member functions and the internal implementation of Environment
unchanged. This will allow existing clients to be migrated gradually to the new semantics.
Specifically, we will provide the following new accessors:
-
getStorageLocationStrict(const Expr &E)
- Asserts that
E
is a glvalue - Retrieves the
StorageLocation
Loc
associated withE
- Asserts that
Loc
refers to aReferenceValue
RefVal
- Returns
RefVal.getReferentLoc()
- Asserts that
-
setStorageLocationStrict(const Expr &E, StorageLocation &Loc)
- Asserts that
E
is a glvalue. - Creates a new
ReferenceValue
RefVal
that wrapsLoc
, creates a newStorageLocation
RefValLoc
that refers to theRefVal
, then callssetStorageLocation(E, RefValLoc)
. - To ensure convergence, we always create the same
RefVal
andRefValLoc
for a givenLoc
.
- Asserts that
-
getValueStrict(const Expr &E)
- Asserts that
E
is a prvalue. - Retrieves the
Value
associated withE
, asserts that it is not aReferenceValue
, and returns it.
- Asserts that
-
setValueStrict(const Expr &E, Value &Val)
- Asserts that
E
is a prvalue - Asserts that
Val
is not aReferenceValue
- Creates a new
StorageLocation
Loc
, then callssetValue(Loc, Val)
andsetStorageLocation(E, Loc)
. - To ensure convergence, we always create the same
Loc
for a givenVal
.
- Asserts that
The migration then proceeds as follows:
- Gradually migrate framework code and clients to the
Strict
versions of the functions. - Remove the non-
Strict
versions once all callers have been migrated. - Change the internal implementation of
Environment
to implement the new semantics inExprToLoc
and a newly introducedExprToVal
. - Add synonyms for the
Strict
versions that omit theStrict
suffix. - Remove the
Strict
suffix from all callsites. - Delete the
Strict
versions of the functions.
If issues are encountered in any step of this migration, it should be straightforward to roll back the change that caused the issues.
New semantics for variables of reference type
As discussed above, I propose that variables of reference type will map directly to the StorageLocation
of the referenced object. To enable a gradual migration, I again propose to start by adding new member functions that implement these semantics:
-
Environment::getStorageLocationStrict(const ValueDecl &D)
- Calls
getStorageLocation(D)
to retrieve theStorageLocation
Loc
associated withD
. - If
D
has non-reference type, returnsLoc
. - If
D
has reference type:- Asserts that
Loc
is associated with aReferenceValue
RefVal
. - Returns
RefVal.getReferentLoc()
.
- Asserts that
- Calls
-
Environment::setStorageLocationStrict(const ValueDecl &D, StorageLocation &Loc)
- If
D
has non-reference type, callssetStorageLocation(D, Loc)
. - If
D
has reference type:- Creates a new
ReferenceValue
RefVal
that wrapsLoc
, creates a newStorageLocation
RefValLoc
that refers to theRefVal
, then callssetStorageLocation(D, RefValLoc)
. - To ensure convergence, we may need to guarantee that we always create the same
RefVal
andRefValLoc
for a givenLoc
.
- Creates a new
- If
The rest of the migration then proceeds as described above for expressions.
Changed StructValue API
The changes to StructValue
make it harder to support a gradual migration. Once StructValue
becomes a wrapper for AggregateStorageLocation
, we can no longer support the Value *getChild(const ValueDecl &)
and void setChild(const ValueDecl &, Value &)
APIs; doing so would require StructValue
to possess a reference to the DataflowEnvironment
, and this doesn’t seem desirable.
However, it appears that the typical purpose that dataflow framework clients use StructValue
for (as opposed to operating on an AggregateStorageLocation
) is to attach custom properties to objects of class type. None of the dataflow framework clients that I have examined appear to use the getChild()
or setChild()
APIs. Nevertheless, it may make sense to initially make these methods private, with friend declarations that allow the current callers within the dataflow framework to continue using them. This change could be easily rolled back if it turns out that there are dataflow framework clients that use these APIs. This client code could then be modified so that, instead of operating on a StructValue
, it operates on the corresponding AggregateStorageLocation
, which should always exist in the contexts that client code cares about.
Longer-term, we will probably want to eliminate all uses of StructValue
in client code and replace them with AbstractStorageLocation
. This would require us to add a way to attach properties to StorageLocation
s. At this point, the only remaining uses of StructValue
would be internal to the framework to convey prvalue Expr
s of class type to their result objects.
-
For some C++ language references, I will use cppreference.com instead of the C++ standard itself as the descriptions at cppreference.com are often more concise and easier to understand while still being accurate enough for the purposes of this document. ↩︎
-
Note that in C++, unions are considered to be classes. ↩︎
-
Except if it is the operand of a decltype-specifier; this case isn’t relevant here because the operand of a decltype-specifier is not evaluated. ↩︎
-
But don’t compilers routinely store and manipulate trivially-copyable class types in registers? Yes, they do, but this is just an optimization that they can perform if it’s guaranteed that the code won’t be able to observe the object’s identity. ↩︎
-
C has only two value categories, lvalue and rvalue. A C rvalue corresponds approximately to a C++ prvalue. ↩︎