Reference-counted statements

Daniel, Ted, John and I had a design discussion yesterday about the ownership model for expressions and statements in Clang. The catalyst for this discussion is the generic tree transformation that went into Clang earlier this week, and on the problem of transforming expression trees. In particular, given the expression

  x + y

let's say that the tree transformer transforms x into x' but leaves y alone. It will then attempt to construct a new addition operation

  x' +' y

With today's strict ownership model for expressions, we have a problem: both the + and the transformed +' believe that they own the subexpression y, which will lead to double-frees, use-after-free errors, etc.

One immediate possibility would be to clone the subexpression y (into a y' that is exactly similar to y); then the +' expression will own x' and y' and the strict ownership model is retained. This solution is relatively easy to implement, and is being used already (in a limited way) during template instantiation. However, it has performance costs in both time and memory. Moreover, it might not be the right answer for other clients of Clang's ASTs, which may need to refer to an expression within the the AST that might be prematurely destroyed.

We think that the addition of reference-counting to statements and expressions is the right answer for Clang, and I've attached a patch that adds such reference counting. Essentially, each statement (and, therefore, expression) gets a an embedded reference count; the Destroy method of a statement decrements the reference count, then deallocates the expression and destroys its children if the reference count hits zero.

Most AST clients will still use simple Stmt or Expr pointers and will not need to increase or decrease the reference counts. In those few places where we are explicitly sharing statements or expressions (e.g., as part of a tree transform), we will call the statement's Retain operation to increase the reference count. In the attached patch, I've introduced Retain calls in a few places to illustrate the idea:

  1) When performing template instantiation for non-dependent leaf nodes (e.g., literals), we retain the literal as written in the source code bump its reference count by one. Previously, we were cloning.

  2) When adding a SwitchCase (for a "case" or a "default" statement) to a switch, we retain that SwitchCase. The switch statement itself then Destroys its SwitchCases as part of its own destruction. This fixes a bunch of crash-on-invalid problems we've had with switch statements (see PR3444).

Two performance-related notes: first, the size of statements and expressions did not get any bigger. We reduced the size of the statement class field in Stmt from 32 bits to 8 bits [*] and used the remaining 24 bits for the reference count. Second, the reference counts are modified very, very rarely, and only in response to explicit sharing (as in the two cases I've shown), so we are not concerned about additional overhead introduced by the counting.

Comments, questions, etc. all welcome!

  - Doug

[*] 256 statement/expression types should be enough for anyone! :wink:

ref-counted-stmts.patch (8.01 KB)

Ref counting is fine. One issue I see is when traversing the tree (for whatever reason)
with shared nodes. Need a special flag in the node (or some other mechanism) to not revisit the shared sub-tree.

- Fariborz

Is there a particular use case you have in mind?

  - Doug

Ref counting is fine. One issue I see is when traversing the tree (for whatever reason)
with shared nodes. Need a special flag in the node (or some other mechanism) to not revisit the shared sub-tree.

Is there a particular use case you have in mind?

It depends on who the consumer of the shared trees are. Are we going to see AST trees with
shared nodes because of this? Then maybe the Indexer when caching the AST nodes. If this
is all internal to Sema and Sema has no need to traverse the tree then it is a non-issue.

- Fariborz

Hi Doug,

I'd like to clarify one point:

Most AST clients will still use simple Stmt or Expr pointers and will not
need to increase or decrease the reference counts. In those few places where
we are explicitly sharing statements or expressions (e.g., as part of a tree
transform), we will call the statement's Retain operation to increase the
reference count. In the attached patch, I've introduced Retain calls in a
few places to illustrate the idea:

I think this amounts to an extension of our "strong" ownership model,
right? The precise rule for when clients need to handle the reference
count themselves is when passing an owned expression into another
context (e.g., a constructor) expecting to resolve a strong reference
itself. I haven't thought it through, but this may even be something
we could eventually enforce via the static type system.

- Daniel

AST nodes still have ownership of the statements and expressions they refer to, and there's no way to actually determine whether a node you own is shared by someone else. This is generally okay, since statements and expressions aren't meant to be mutated once they've been constructed [*]. I haven't seen a use case that would need to avoid revising nodes in a shared sub-tree; when such a use case comes up, we'll figure something out.

  - Doug

[*] Sema does some limited mutation of AST nodes, as does the PCH reader, but most clients won't ever see these AST nodes until all mutation has already been performed.

Hi Doug,

I'd like to clarify one point:

Most AST clients will still use simple Stmt or Expr pointers and will not
need to increase or decrease the reference counts. In those few places where
we are explicitly sharing statements or expressions (e.g., as part of a tree
transform), we will call the statement's Retain operation to increase the
reference count. In the attached patch, I've introduced Retain calls in a
few places to illustrate the idea:

I think this amounts to an extension of our "strong" ownership model,
right?

Yes, that is the intent.

The precise rule for when clients need to handle the reference
count themselves is when passing an owned expression into another
context (e.g., a constructor) expecting to resolve a strong reference
itself. I haven't thought it through, but this may even be something
we could eventually enforce via the static type system.

Perhaps we could describe it via the static type system. The pattern within Sema is that we allocate a new statement and place it into a Sema::OwnedStmtResult, using smart pointers to hold ownership of the statement and eventually transferring ownership from the smart pointer into an AST (via take()/release()). The pattern for using Retain() is that you've gotten a statement from the AST and you also want to own it (to eventually put into a new AST, for example), so you call Retain() and build a new AST or put it in a Sema::OwnedStmtResult.

  - Doug

Daniel, Ted, John and I had a design discussion yesterday about the ownership model for expressions and statements in Clang. The catalyst for this discussion is the generic tree transformation that went into Clang earlier this week, and on the problem of transforming expression trees. In particular, given the expression

  x + y

let's say that the tree transformer transforms x into x' but leaves y alone. It will then attempt to construct a new addition operation

  x' +' y

It is unfortunate, but I agree that this is probably the best way to handle this. Go for it! Now we *really* don't want parent pointers in the AST :slight_smile:

-Chris