[RFC][dataflow] Treating value categories correctly in the dataflow framework

Treating value categories correctly in the dataflow framework

In the bargain, we eliminate ReferenceValue, SkipPast, and the duplication between StructValue and AggregateStorageLocation

Context

Objective

I propose a change to the dataflow framework to treat values and references in accordance with C++ value categories[1] (which the framework does not do rigorously today). This change serves two purposes:

  • Improve code quality and maintainability by eliminating redundant constructs that are a source of bugs, specifically:
    • Eliminate ReferenceValue and SkipPast (see section “Background” below for an explanation of why these are redundant / unnecessary)
    • Eliminate the duplication of functionality between StructValue and AggregateStorageLocation
  • Improve precision of the analysis through closer adherence to C++ semantics, particularly in the treatment of result objects, returning class objects by value, and supporting NRVO.

Background

This change will establish a clear dichotomy between glvalues and prvalues. As a reminder, any expression is either a glvalue or a prvalue:

  • A glvalue is an expression that establishes the identity of an object or function; it can be thought of as “something with an address”.
  • A prvalue is a pure value. It does not have an address, and it is therefore not legal to apply the address-of operator to it.

As a basis for further discussion, consider the following code sample (godbolt link):

void foo() {
  int a = 0;
  int &b = a;
  a = 1;
  b = 2;
}

The expressions a and b in the example above both have the same type: They are glvalues (or, more specifically, lvalues) of type int (see also the Clang AST in godbolt). In C++, expressions never have reference type; more precisely, to quote the standard: “If an expression initially has the type “reference to T”, the type is adjusted to T prior to any further analysis.”

The expressions 1 and 2 in the example are prvalues of type int.

The dataflow framework currently treats these expressions as follows:

  • The glvalue expressions a and b are both associated with a StorageLocation that maps to a ReferenceValue; the ReferentLoc of this ReferenceValue maps to an IntegerValue. (Note that this is true for DeclRefExprs, but for many other kinds of glvalue expression, the framework produces a StorageLocation that maps directly to the value, without the intermediate ReferenceValue. This inconsistency presents an additional difficulty that needs to be worked around using the SkipPast functionality – see below.)
  • The prvalue expressions 1 and 2 are both associated with a StorageLocation that maps to an IntegerValue. Again, this is one indirection too many.

Both of these representations really contain one indirection too many:

  • The StorageLocation associated with a glvalue int could map (consistently) directly to an IntegerValue. (Again, consider that expressions in C++ cannot have reference type.)
  • A prvalue int could be associated directly with IntegerValue, as a prvalue does not have an address, i.e. a StorageLocation.

The essence of this proposal is to change the representation of glvalues and prvalues as described above.

This change would also eliminate the need for the SkipPast::Reference functionality. This is provided today to deal with the fact that a glvalue may (but need not) involve a double indirection through a ReferenceValue. Calling Environment::getStorageLocation(E, SkipPast::Reference) for an expression E checks whether the StorageLocation associated with E maps to a ReferenceValue; if so, it returns the ReferentLoc of that ReferenceValue, otherwise it returns the StorageLocation associated with E.

Finally, note that the dataflow framework also currently treats the declarations for a and b differently:

  • The declaration for a is associated with a StorageLocation that maps to an IntegerValue.
  • The declaration for b is associated with a StorageLocation that maps to a ReferenceValue, which in turn maps to an IntegerValue.

There isn’t any real need to make this distinction. A reference is not itself an object; it is merely an alias to an existing object. The declaration b could therefore be associated with a StorageLocation that maps directly to an IntegerValue.

Design

Mapping expressions to StorageLocations and Values

I propose changing the treatment of glvalues and prvalues as follows:

  • A glvalue Expr is associated with a StorageLocation through the Environment::ExprToLoc map, just like today.
    However, the Value associated with this StorageLocation (tracked in Environment::LocToVal) is never a ReferenceValue; it is always a Value that corresponds directly to the expression’s type (e.g. a glvalue of type int is always associated with an IntegerValue).
  • A prvalue Expr must no longer be inserted into ExprToLoc, as prvalues don’t have a storage location. Instead, a prvalue Expr is associated directly with a Value through a newly introduced map Environment::ExprToVal. It is not legal to insert glvalue Exprs into this map.

To enforce the invariants associated with ExprToLoc and ExprToVal, the accessor functions are changed as follows:

  • The newly added function setValue(const Expr &, Value &) adds entries to ExprToVal.
  • setStorageLocation(const Expr &E, StorageLocation &) and getStorageLocation(const Expr &E) may only be called if E.isGLValue() is true.
  • setValue(const Expr &E, Value &) and getValue(const Expr &E) may only be called if E.isPRValue() is true. (Note that the SkipPast parameter has been eliminated from getValue().)

Environment::join(), Environment::widen() and Environment::equivalentTo() are modified to treat ExprToVal in the same way as LocToVal.

There is an additional benefit to separating glvalues and prvalues cleanly. Because most AST nodes operate on prvalues, the transfer functions for these nodes no longer need to unpack the StorageLocations for their operands and create StorageLocations for their results.

Mapping declarations to StorageLocations

I propose associating global and local variables of reference type directly with the StorageLocation of the referenced object instead of the StorageLocation of a ReferenceValue.

The StorageLocations for global and local variables are tracked in Environment::DeclToLoc. So far, this has been a strictly append-only map: New entries can be added, but the StorageLocations for existing declarations cannot be changed.

Changing the treatment of references as described above will require weakening this property, as the storage location associated with a reference can change from one loop iteration to the next. For example:

int a = 1;
int b = 2;
for (int i = 0; i < 42; ++i) {
  int &r = i % 2 == 0? a : b;
}

However, the storage location associated with the reference can obviously never change during the lifetime of the reference. This means that we will never need to perform any non-trivial join operations on DeclToLoc. Let’s expand the example to see where joins can happen:

for (int i = 0; i < 42; ++i) {  // join 1
  int &r = i % 2 == 0? a : b;  // join 2
  if (some_condition()) {
    ...
  } else {
    ...
  }
  // join 3
  // ...
}

In this example, we have three joins:

  • Join 1 happens at the top of the loop, before the lifetime of the reference begins. Here, only one of the Environments to be joined contains an entry for r in DeclToLoc, namely the one that came from the previous loop iteration.
  • Join 2 happens within the conditional operator but, again, before the lifetime of the reference begins. Neither of the Environments to be joined contains an entry for r in DeclToLoc because the result of join 1 doesn’t contain such an entry (see below).
  • Join 3 happens after the if statement. Here, both Environments to be joined contain an entry for r in DeclToLoc, but both entries map to the same StorageLocation.

The joined DeclToLoc should therefore only contain entries for declarations that exist in both input DeclToLocs, and we should verify in this case that both declarations map to the same StorageLocation. Environment::join() already does exactly this today, except that it does not yet contain an assertion that both declarations map to the same StorageLocation.

Dealing with AggregateStorageLocation and StructValue

We’ve discussed above how we will map global and local variables of reference type directly to the StorageLocation of the referenced object. We also need to do this for fields (non-static member variables) of reference type, and to do so we need to change how we deal with values of class type[2].

One difficulty in dealing with values of class type in the dataflow framework is that there are two data structures that are used to represent class values in different situations: AggregateStorageLocation and StructValue. Broadly speaking, an AggregateStorageLocation represents a glvalue of class type, and a StructValue represents a prvalue of class type (though the framework does not make a clear distinction between these value categories today). The framework contains code for converting between the two representations and keeping them in sync along with tests for this behavior.

AggregateStorageLocation is a map from a field to its corresponding StorageLocation. Removing ReferenceValue from this representation can be done in the same way as for global and local variables by associating a field of reference type directly with the StorageLocation of the referenced object instead of the StorageLocation of a ReferenceValue.

Removing ReferenceValue from StructValue is more difficult. StructValue is a map from a field to its corresponding Value. There is therefore no way to associate a field of reference type directly with the StorageLocation of the referenced object; the only option under the current representation of StructValue is to associate the field with a ReferenceValue.

I therefore propose to change the implementation of StructValue so that it is merely a wrapper for an AggregateStorageLocation:

class StructValue final : public Value {
public:
  explicit StructValue(AggregateStorageLocation &Loc)
      : Value(Kind::Struct), Loc(Loc) {}

  // other public members omitted

private:
  AggregateStorageLocation &Loc;
};

This change is motivated by the following C++ language rules:

  • Every prvalue of class type ends up initializing some result object[3], i.e. it eventually assumes identity, a StorageLocation. It is therefore justified to associate a storage location with every prvalue of class type.

  • The dot member access operator can only be applied to glvalues of class type, not to prvalues. Therefore, when code accesses a class member, it always does so on an object that has identity, i.e. a StorageLocation.

These two rules taken together mean that prvalues of class type serve only a very limited role: They convey a value of class type from the place where it was created to the result object it initializes; no operations on the class members can happen along this route[4].

Note that the value of class type isn’t always created in the same function as the result object it initializes. Here is an example (godbolt):

struct A {
    int i;
};

A createA(bool b) { return b ? A{0} : A{1}; }

void foo(bool b) { A a = createA(b); }

The result object for the values of type A created in createA is the variable a in foo(), and the language guarantees that the objects will be directly constructed into a, with no copying. Clang needs to go to some effort to guarantee this: Assuming no inlining, it needs to pass a hidden argument for the address of the result object to createA(), and when doing codegen for createA() it needs to propagate that address downwards through the AST subtree that represents the return statement. (It’s instructive to look at the AST for this example on godbolt.)

However, the dataflow framework can take a simpler approach. We know that a prvalue of class type will ultimately initialize some result object, so the framework can eagerly create an AggregateStorageLocation at the point where the prvalue is created. It wraps this AggregateStorageLocation in a StructValue so that it can pass it through prvalue AST nodes (such as the conditional operator, return statement, and call expression in the example above) and uphold the invariant that prvalues are always associated with a Value, not a StorageLocation. Finally, when the analysis reaches the AST node that identifies the result object (in the example, the declaration of a), it unwraps the StructValue and associates the result object with the AggregateStorageLocation. It also adds an entry to LocToVal to associate the AggregateStorageLocation with the StructValue, as custom analyses may still want to attach properties to the StructValue.

The language rules discussed so far apply to C++, but C has slightly different rules. In C, the dot member access operator can be applied to an rvalue[5] of struct type (example on godbolt, observe the MemberExpr for createA(b).i).

The dataflow framework currently supports only C++, and there are no plans to work on C support for the foreseeable future. On the other hand, there are some high-value C code bases, such as the Linux kernel, on which it would be desirable to run dataflow analysis. It therefore seems preferable at least not to make any design decisions that would make it difficult to support C in the future.

Fortunately, I don’t believe that the proposed change would be an obstacle to supporting C’s operations on struct rvalues. The fields of the proposed StructValue can be accessed through the contained AggregateStorageLocation. This would establish the fiction that a StructValue always has a storage location, which isn’t true in C, but this fiction seems harmless as it can be contained within the StructValue without “leaking” out into any other code.

Possible future extension: Alias sets

A reviewer of an early draft of this proposal remarked that removing ReferenceValue might remove a potentially useful extension point that could be used to support alias sets.

This document proposes treating references the same as non-reference variables and associating them directly with a StorageLocation, but references are different in one important way: They may refer to different variables at runtime, and we may not be able to statically determine which they will refer to. For example:

int a = 0;
int b = 1;
int &r = some_condition() ? a : b;

We would want to model this by associating r with an alias set that has two entries, one for the storage location of a and one for the storage location of b.

ReferenceValue would provide a possible extension point for doing so: Instead of containing just a single StorageLocation, as it does today, ReferenceValue could contain a set of StorageLocations.

However, I believe there is a simpler and more consistent solution. We introduce a class AliasSet, which contains a set of StorageLocations. Declarations and glvalue expressions map to an AliasSet (instead of a StorageLocation as they do today), and likewise PointerValue contains an AliasSet. This provides a consistent treatment of alias sets across the framework and avoids the problems incurred by treating references as if they were values.

Implementation experience

There is existing implementation experience for many of the ideas proposed in this document.

Crubit’s lifetime inference tool, which is built on top of the dataflow framework, implements its own object representation and points-to graph as it requires alias sets, which the dataflow framework does not support today. Its transfer function treats value categories as proposed in this document, though it only handles prvalues if they are of pointer type as no other prvalues are relevant to the analysis. It also implements alias sets as described in the section “Possible future extension: Alias sets” above.

In addition, I have begun prototyping the ideas described in this document and have encountered no obvious roadblocks; the prototype code is not however at a point yet where it makes sense to share it publicly.

Multi-step migration to prevent breaking existing clients of the framework

This proposal makes significant changes to the API of the dataflow framework, both syntactically and semantically. The dataflow framework doesn’t provide any real API stability guarantees, but we should still make a best effort to make the changes easy to absorb for existing users. This will require the changes to be rolled out in multiple stages. Here is a high-level description of how I think this can be done.

New value category semantics in Environment

In the a first step, I propose providing additional Environment member functions that implement the new value category semantics but leaving the existing member functions and the internal implementation of Environment unchanged. This will allow existing clients to be migrated gradually to the new semantics.

Specifically, we will provide the following new accessors:

  • getStorageLocationStrict(const Expr &E)
    • Asserts that E is a glvalue
    • Retrieves the StorageLocation Loc associated with E
    • Asserts that Loc refers to a ReferenceValue RefVal
    • Returns RefVal.getReferentLoc()
  • setStorageLocationStrict(const Expr &E, StorageLocation &Loc)
    • Asserts that E is a glvalue.
    • Creates a new ReferenceValue RefVal that wraps Loc, creates a new StorageLocation RefValLoc that refers to the RefVal, then calls setStorageLocation(E, RefValLoc).
    • To ensure convergence, we always create the same RefVal and RefValLoc for a given Loc.
  • getValueStrict(const Expr &E)
    • Asserts that E is a prvalue.
    • Retrieves the Value associated with E, asserts that it is not a ReferenceValue, and returns it.
  • setValueStrict(const Expr &E, Value &Val)
    • Asserts that E is a prvalue
    • Asserts that Val is not a ReferenceValue
    • Creates a new StorageLocation Loc, then calls setValue(Loc, Val) and setStorageLocation(E, Loc).
    • To ensure convergence, we always create the same Loc for a given Val.

The migration then proceeds as follows:

  • Gradually migrate framework code and clients to the Strict versions of the functions.
  • Remove the non-Strict versions once all callers have been migrated.
  • Change the internal implementation of Environment to implement the new semantics in ExprToLoc and a newly introduced ExprToVal.
  • Add synonyms for the Strict versions that omit the Strict suffix.
  • Remove the Strict suffix from all callsites.
  • Delete the Strict versions of the functions.

If issues are encountered in any step of this migration, it should be straightforward to roll back the change that caused the issues.

New semantics for variables of reference type

As discussed above, I propose that variables of reference type will map directly to the StorageLocation of the referenced object. To enable a gradual migration, I again propose to start by adding new member functions that implement these semantics:

  • Environment::getStorageLocationStrict(const ValueDecl &D)
    • Calls getStorageLocation(D) to retrieve the StorageLocation Loc associated with D.
    • If D has non-reference type, returns Loc.
    • If D has reference type:
      • Asserts that Loc is associated with a ReferenceValue RefVal.
      • Returns RefVal.getReferentLoc().
  • Environment::setStorageLocationStrict(const ValueDecl &D, StorageLocation &Loc)
    • If D has non-reference type, calls setStorageLocation(D, Loc).
    • If D has reference type:
      • Creates a new ReferenceValue RefVal that wraps Loc, creates a new StorageLocation RefValLoc that refers to the RefVal, then calls setStorageLocation(D, RefValLoc).
      • To ensure convergence, we may need to guarantee that we always create the same RefVal and RefValLoc for a given Loc.

The rest of the migration then proceeds as described above for expressions.

Changed StructValue API

The changes to StructValue make it harder to support a gradual migration. Once StructValue becomes a wrapper for AggregateStorageLocation, we can no longer support the Value *getChild(const ValueDecl &) and void setChild(const ValueDecl &, Value &) APIs; doing so would require StructValue to possess a reference to the DataflowEnvironment, and this doesn’t seem desirable.

However, it appears that the typical purpose that dataflow framework clients use StructValue for (as opposed to operating on an AggregateStorageLocation) is to attach custom properties to objects of class type. None of the dataflow framework clients that I have examined appear to use the getChild() or setChild() APIs. Nevertheless, it may make sense to initially make these methods private, with friend declarations that allow the current callers within the dataflow framework to continue using them. This change could be easily rolled back if it turns out that there are dataflow framework clients that use these APIs. This client code could then be modified so that, instead of operating on a StructValue, it operates on the corresponding AggregateStorageLocation, which should always exist in the contexts that client code cares about.

Longer-term, we will probably want to eliminate all uses of StructValue in client code and replace them with AbstractStorageLocation. This would require us to add a way to attach properties to StorageLocations. At this point, the only remaining uses of StructValue would be internal to the framework to convey prvalue Exprs of class type to their result objects.


  1. For some C++ language references, I will use cppreference.com instead of the C++ standard itself as the descriptions at cppreference.com are often more concise and easier to understand while still being accurate enough for the purposes of this document. ↩︎

  2. Note that in C++, unions are considered to be classes. ↩︎

  3. Except if it is the operand of a decltype-specifier; this case isn’t relevant here because the operand of a decltype-specifier is not evaluated. ↩︎

  4. But don’t compilers routinely store and manipulate trivially-copyable class types in registers? Yes, they do, but this is just an optimization that they can perform if it’s guaranteed that the code won’t be able to observe the object’s identity. ↩︎

  5. C has only two value categories, lvalue and rvalue. A C rvalue corresponds approximately to a C++ prvalue. ↩︎

Adding @gribozavr @Xazax-hun @ymand @sam-mccall, who provided feedback on an early version of this proposal.

Thank you for the thorough proposal, it LGTM!

Looks great, can’t wait to review the patches!

While continuing to work on this, I’ve realized that there are more relevant join scenarios that affect DeclToLoc than the ones I had considered in the section “Mapping declarations to StorageLocations”. There can be scenarios in Environment::join() where the same declaration maps to two different storage locations in the two environments to be joined. For example:

    bool &foo();
    bool some_condition();

    void target() {
       do {
        bool& Bar = foo();
        if (Bar) break;
      } while (some_condition());
    }

First, recall that we create a new storage location for Bar in every iteration of the loop.

Now consider the join at the bottom of the loop. This can be reached in two ways: From the break statement or by dropping out of the bottom of the loop (i.e. when some_condition() returns false). In the general case, the two environments that are joined here need not be from the same loop iteration – so Bar may map to two different storage locations in this case.

To deal with this, we will need to remove declarations from DeclToLoc at the end of their scope. This isn’t hard to do – it merely requires a small handler for CFGScopeEnd. Keeping DeclToLoc “tidy” in this way also has benefits for debugging, as it will now only contain entries for variables that are in fact alive.