RFC: Easier AST Matching by Default

I would like to hear the opinions of others on these questions. I think
we've both described our perspectives and made our cases.

Just expanding on my position from earlier in this thread slightly: to
directly address the "where to from here?" question, I would like us to get
back to the state where, *in our default mode, the matcher for AST node X
is always able to match all instances of AST node X*. I think there are
various options for how we get there that seem acceptable, and that don't
sacrifice your noble goals of making AST matching easier and more
intuitive. In particular, I think it's fine (and probably good) if by
default we look through non-matching implicit nodes while looking for a
matching node (so long as we don't look past a matching node just because
it's implicit). And I think it's fine (and probably good) to expose an easy
way to explicitly control whether we automatically look through implicit
nodes or not. But I think if no-one is prepared to do the work to make our
default mode satisfy the above property while still looking through
implicit nodes if (and only if) they don't match, then I think we should
change the default back to the state we had before. (I don't have much of
an opinion on whether to keep or remove 'traverse' and its various modes,
other than that we've already caused a substantial amount of churn with the
changes thus far, and removing them again would cause further churn.)

1. It would be helpful to rehash the examples one final time — give the code & the dump, a proposed matcher, then what option A would match, B, etc. Others should try to consider at each as would a beginner.

2. Perhaps this an opportunity to create a uniform interface to deal with implicit nodes in the AST. I believe any implicit node only ever has at most one child node. If so, consider this:
  (a) Always have "Implicit*" versions of any implicit-able nodes (e.g. ImplicitCXXConstructExpr, in addition to CXXConstructExpr);
  (c) Implement getAs<T>() for Decls and Stmts, the same we do for Types: e.g. E->getAs<CXXConstructExpr>() would skip over the ExprWCleanups, ImplicitCXXConstructExpr, etc. But you could still call E->getAs<ImplicitCXXConstructExpr>() too if desired.

Then, in the case of ASTMatchers, the matchers would always match the implicit nodes as before, but Results.Nodes.getNodeAs<…> would call getAs<…> on each actual result. I believe this aligns with the behavior you suggested Richard. But I have not fully thought this through, so I defer to others.

My perspective on ASTMatchers, FWIW: I never got into them, because I had begun learning the AST first and it seemed redundant since you could do the same stuff with RecursiveASTVisitor, and there seemed to be a lot of extra infrastructure you had to deal with. I now see that the matchers can be considerably tighter and clearer -- but that is the only advantage over doing it the full way with RecursiveASTVisitor, so whatever the solution is, it must keep or increase that clarity advantage while retaining as much of the flexibility of RecursiveASTVisitor as possible. E.g. it sounded like Billy, the guy who had issues with Stmt traversal order (owing to the queueing of child statements, instead of stack processing them as most would expect), decided to just use RecursiveASTVisitor instead of deal with the hassle — that is the outcome to avoid.

I would like to hear the opinions of others on these questions. I think
we’ve both described our perspectives and made our cases.

Just expanding on my position from earlier in this thread slightly: to
directly address the “where to from here?” question, I would like us to get
back to the state where, in our default mode, the matcher for AST node X
is always able to match all instances of AST node X
. I think there are
various options for how we get there that seem acceptable, and that don’t
sacrifice your noble goals of making AST matching easier and more
intuitive. In particular, I think it’s fine (and probably good) if by
default we look through non-matching implicit nodes while looking for a
matching node (so long as we don’t look past a matching node just because
it’s implicit). And I think it’s fine (and probably good) to expose an easy
way to explicitly control whether we automatically look through implicit
nodes or not. But I think if no-one is prepared to do the work to make our
default mode satisfy the above property while still looking through
implicit nodes if (and only if) they don’t match, then I think we should
change the default back to the state we had before. (I don’t have much of
an opinion on whether to keep or remove ‘traverse’ and its various modes,
other than that we’ve already caused a substantial amount of churn with the
changes thus far, and removing them again would cause further churn.)

  1. It would be helpful to rehash the examples one final time — give the code & the dump, a proposed matcher, then what option A would match, B, etc. Others should try to consider at each as would a beginner.

  2. Perhaps this an opportunity to create a uniform interface to deal with implicit nodes in the AST. I believe any implicit node only ever has at most one child node. If so, consider this:
    (a) Always have “Implicit*” versions of any implicit-able nodes (e.g. ImplicitCXXConstructExpr, in addition to CXXConstructExpr);
    (c) Implement getAs() for Decls and Stmts, the same we do for Types: e.g. E->getAs() would skip over the ExprWCleanups, ImplicitCXXConstructExpr, etc. But you could still call E->getAs() too if desired.

Then, in the case of ASTMatchers, the matchers would always match the implicit nodes as before, but Results.Nodes.getNodeAs<…> would call getAs<…> on each actual result. I believe this aligns with the behavior you suggested Richard. But I have not fully thought this through, so I defer to others.

I think that’s a pretty interesting approach, that definitely has some appeal. I think we should also be thinking about how to serve the needs of clients that want a syntactic view or a semantic one – some clients want to skip over implicit AST nodes (eg, nodes that describe semantics but no syntax), and others want to skip over what we call “parentheses” (eg, nodes that describe syntax but whose sole subexpression has the same semantic effect). A single getAs<…> might have a hard time serving both classes of user.

True, ParenExpr is the one oddball case: all syntax, no semantics, the exact opposite of most of the nodes that get in people’s way.

But I think Type::getAs() is again instructive: one can call getAs() on a ParenType to get its underlying type, and on an ElaboratedType etc. — nodes that are purely syntactic. So just allow the single getAs to also get a ParenExpr as its underlying expression. I believe it is the only such oddball case.

What we would probably also want is an analog or two of Type’s getCanonicalType, except for getting the underlying syntax:

Stmt {

T *getAs() {
if (isa(this))
return cast(this);
if (this->isImplicit() || isa(this)) {
assert(children.size() == 1 && “Expected implicit nodes to have exactly one child”);
return (*children.begin())->getAs();
}
return nullptr;
}

Expr *getAsWritten() {
if (this->isImplicit())
return (*children.begin())->getAsWritten());
return this;
}
Expr *getAsWrittenIgnoreParens() {
if (isa(this))
return this;
return getAsWritten();
}

};

Decls too could at least in some cases benefit from the same interface:

struct A {
union { int i; float f; }
};

I believe the union becomes an implicit FieldDecl, whose type is the anonymous union’s CXXRecordDecl; a user may well want to call

if (auto AnonUnion = someFieldDecl->getAs())
//…

to handle this case.

A getAs in every AST node would provide a uniform way of handling implicit nodes across the AST, a definite benefit to beginners without any apparent cost to long time users.

Dave

True, ParenExpr is the one oddball case: all syntax, no semantics, the exact opposite of most of the nodes that get in people’s way.

But I think Type::getAs() is again instructive: one can call getAs() on a ParenType to get its underlying type, and on an ElaboratedType etc. — nodes that are purely syntactic. So just allow the single getAs to also get a ParenExpr as its underlying expression. I believe it is the only such oddball case.

It’s not, there’s a few types of node that we consider to be “parens”, some of them only conditionally. These include things like _Generic and __builtin_choose_expr.

What we would probably also want is an analog or two of Type’s getCanonicalType, except for getting the underlying syntax:

We already have quite a few different flavours of that – look at Stmt::IgnoreImplicit and friends. There doesn’t seem to be a good one-size-fits-all solution here, given the number of different variants of this we’ve invented. But maybe there’s something good enough for most matchers at least.

Stmt {

T *getAs() {
if (isa(this))
return cast(this);
if (this->isImplicit() || isa(this)) {
assert(children.size() == 1 && “Expected implicit nodes to have exactly one child”);
return (*children.begin())->getAs();
}
return nullptr;
}

Note that the analogy to Type::getAs has already broken down here; getAs always returns a type with the same semantics but a possibly different meaning, whereas this can return an expression with different semantics by skipping implicit nodes.

Expr *getAsWritten() {
if (this->isImplicit())
return (*children.begin())->getAsWritten());
return this;
}
Expr *getAsWrittenIgnoreParens() {
if (isa(this))
return this;
return getAsWritten();
}

};

Decls too could at least in some cases benefit from the same interface:

struct A {
union { int i; float f; }
};

I believe the union becomes an implicit FieldDecl, whose type is the anonymous union’s CXXRecordDecl; a user may well want to call

if (auto AnonUnion = someFieldDecl->getAs())
//…

to handle this case.

Taking it this far seems questionable to me. The field and the union are very different entities here, so I don’t think the analogy to type sugar and parentheses really holds.

Stmt {

T *getAs() {
if (isa(this))
return cast(this);
if (this->isImplicit() || isa(this)) {
assert(children.size() == 1 && “Expected implicit nodes to have exactly one child”);
return (*children.begin())->getAs();
}
return nullptr;
}

Note that the analogy to Type::getAs has already broken down here; getAs always returns a type with the same semantics but a possibly different meaning, whereas this can return an expression with different semantics by skipping implicit nodes.

This, I believe, gets to the heart of the issue Stephen has raised. In

struct B { B(int); };
B f() { return 42; }

is it ever valid for a user to treat the 42 as an IntegerLiteral, instead of an ExprWCleanups?

What is interesting about the analogy between the proposed Stmt::getAs() and the existing Type::getAs() is that, considering only their effect on syntax and semantics, they are precisely the opposite (disregarding the extra handling of ParenExpr, which probably should be removed): whereas each desugaring performed in Type::getAs peels back a layer of meaningless syntax to get ever closer to the underlying semantics, Stmt::getAs would peel back layers of invisible semantics to get ever closer to the underlying syntax.

So it might be confusing.

But in another sense, the analogy holds: in both cases, getAs retrieves only the essential information about the node: the information without which the semantics could not be reconstructed. In the case of types, this is the fully de-aliased, de-sugared type. In the case of Stmts (or really, just Exprs), this is the originally written statement.

So, there are two possible ways the user might interpret a universal getAs, such that it would either confuse them or help them. I am satisfied the case has been made, will defer to you and others with greater experience and wisdom to reflect on it and decide it.

Thanks for engaging this issue, and Stephen thanks again for raising this important issue,

Dave