new design of RecursiveASTVisitor

Hi,

While working on improving the RecursiveASTVisitor class, some of us saw an opportunity for a different design that’s cleaner and easier to understand and use. Here’s the proposal. Comments and suggestions welcome!

Clang AST Visitor Design

This document discusses the design of a customizable visitor class for the AST generated by the Clang C++ parser.

Background

Clang comes with two AST visitor classes: ASTVisitor and RecursiveASTVisitor. The ideas behind the two are similar. Both are implemented using the CRTP pattern. The former is considered an internal class and not for use in tools developed on top of Clang. The latter is supposed to be a public API.

At this time, neither class is complete (i.e. they’ll miss some AST nodes) and there are known bugs (e.g. some AST nodes may be visited more than once). We are working on improving RecursiveASTVisitor. Our experience shows that its design has much space for improvement. Therefore it would be a worthwhile exercise to see what we can come up with if we do it from scratch, without being constrained or influenced by the current RecursiveASTVisitor design.

Concepts and Terminology

Muddy concepts lead to poor, confusing designs. Before we explore the design space, it’s important to clarify some key concepts.

  • An AST is a tree-shaped data structure. A node in an AST can have 0 or many other nodes as its children.
  • There a three categories of AST nodes in Clang: statements, declarations/definitions, and types. They are derived from class Stmt, Decl, and TypeLoc, respectively. In general, a node may have children of any or all of the three categories. The class hierarchies for Stmt, Decl, and TypeLoc can be found here.
  • The AST visitor has three distinct tasks:
    1. traverse the entire AST (i.e. go to every node),
    2. at each node, walk up the class hierarchy, starting with the dynamic type of the node,
    3. for a given (node, type) combination (where ‘type’ is some base class of the dynamic type of ‘node’), call a user-overridable function to actually “visit” the node.

Design Goals

  • easy-to-understand API
  • performs a depth-first, preorder traversal on the AST (no child of a node can be visited until all work on the node has been done)
  • complete coverage: each node in the AST must be visited
  • no duplication: a node cannot be visited more than once
  • common use cases should be easy
  • fast

Our Design

The distinction between the AST visitor’s three tasks (see the “Concepts and Terminology” section above) is important for understanding how the AST visitor works and using it effectively. Therefore we use naming conventions to highlight the role of a method of the AST visitor:

  • TraverseFoo(Foo* x) does task #1. It traveses (i.e. recursively visits) the sub-tree rooted at x, where Foo is *x's dynamic type (it’s the caller’s responsibility to ensure the correct type).
  • Traverse(Decl* x) simply dispatches (i.e. forwards) to TraverseFoo(Foo* x) where Foo is the dynamic type of *x. Traverse(Stmt* x) and Traverse(TypeLoc* x) work similarly.
  • WalkUpFromFoo(Foo* x) (better name welcome) does task #2. It does not try to visit any child of x. Instead, it first calls WalkUpFromBar(x) where Bar is the direct parent class of Foo (unless Foo has no parent), and then calls VisitFoo(x) (see the next bullet).
  • VisitFoo(Foo* x) does task #3.

Here’s some pesudo code to show what these methods’ implementation looks like:

// A function returns false if it wants the traversal to abort (note that
// the current RecursiveASTVisitor class returns true for abortion - we
// feel that 'false' is much more intuitive).
// Therefore, each caller needs to check the return code and propagate
// it up appropriately.  Such plumbing code is omitted here for clarify.

// The entry point for visiting an AST rooted at x.
bool Traverse(Decl* x) {

  switch(x->getKind()) {

  case kFoo:
    TraverseFoo((Foo*)x);

    break;
  case kBar:
    TraverseBar((Bar*)x);

    break;
  ...
  }
}

bool TraverseFoo(Foo* x) {

  WalkUpFromFoo(x);
  for each child node n of x {

    Traverse(n)
  }
}

bool WalkUpFromFoo(Foo* x) {

  WalkUpFromBar(x);  // Bar is Foo's parent.

  VisitFoo(x);
}

bool WalkUpFromDecl(Decl* x) {

  // Decl has no parent, so no more walking up.
  VisitDecl(x);
}

// All Visit*() methods do nothing by default.
bool VisitFoo(Foo* x) { return true; }

Usually, a user shouldn’t override a Traverse*() or WalkUpFrom*() function - they are the “skeleton” of the AST visitor and do the mundane plumbing that a normal user is not interested in. We recommend that only people truly understand how the traversal works can override them.

A user is expected to override the Visit*() methods as needed. The default implementation of them does nothing.

Returning a bool to indicate whether the traversal needs to be aborted requires much tedious, error-prone plumbing in the implementation. Exceptions are the most natural solution here, but unfortunately they are banned from Clang code.

One advantage of this design over the existing RecursiveASTVisitor is that when a user overrides VisitFoo(), he only needs to do what he’s interested in, without worrying about calling VisitFoo() or any other methods in the base AST visitor class (so less chance to shoot his own feet).

With this design, all Visit*() methods on an AST ndoe are called before any Visit*() method is called on a child node of it. So the trace of the Visit*() calls is easy to understand and reason about.

One thing people might not like about the design is that it uses more methods. I actually think it’s a good thing that we have separate families of methods for separate roles. RecursiveASTVisitor's approach of lumping all logic (traversal, walking up the type hierarchy, and the custom visiting logic) in the same Visit*() method makes the implementation more tricky and error-prone - that’s probably why there was so much confusion when we were trying to improve RecursiveASTVisitor.

To ensure performance, we would use CRTP as the existing AST visitors do. That means two things:

  1. No need for virtual functions.
  2. The entire code is templated s.t. the compiler can easily inline trivial functions and do cross-function optimization.

Use Cases

Let’s see how some typical use cases can be implemented using our AST visitor.

Dump the AST

Only 3 methods need to be overridden:

VisitDecl(Decl* x) {

  print information about x;
}

VisitStmt(Stmt* x) {

  print information about x;
}

VisitTypeLoc(TypeLoc* x) {

  print information about x;
}

Search for an AST Node of Type Foo

The tool would override VisitFoo(Foo* x) to return false if x is the node it’s looking for.

Find the Parents of an AST Node

The tool would maintain a stack of the nodes that lead to the current node. It can build the stack by overriding exactly 3 methods (plumbing of the return code omitted):

Traverse(Decl* d) {

  Push(d);
  Base::Traverse(d);

  Pop();
}

Traverse(Stmt* s) {

  Push(s);
  Base::Traverse(s);

  Pop();
}

Traverse(TypeLoc* t) {

  Push(t);
  Base::Traverse(t);

  Pop();
}

For proper implementation of tree finite automata I'd suggest you to
consider the (needed) distinction from node entering and node exiting.

  call Enter node handler;
  visit node childs;
  call Exit Node handler;

Another very useful thing you should consider is to add a child
identifier to handler arguments that inform handler about *which* parent
child is currently visited: this information is not otherwise deducible.

To better understand this try to imagine an application that want to
check lhs of assign binary operator.

Thanks for the comments.

For proper implementation of tree finite automata I'd suggest you to
consider the (needed) distinction from node entering and node exiting.

call Enter node handler;
visit node childs;
call Exit Node handler;

This is beyond the scope of what I'd like to do now. The generality
you proposed can be useful (e.g. for implementing postorder
traversal), but all the use cases I have in mind only need preorder
traversal (as stated in my design goal). Note that the "exit node
handler" can always be added later without breaking existing clients,
so we can incorporate your proposal later, when there's actual need
for it.

Another very useful thing you should consider is to add a child
identifier to handler arguments that inform handler about *which* parent
child is currently visited: this information is not otherwise deducible.

Sorry, I don't understand what you mean by "which parent child is
currently visited". Care to clarify?

Another very useful thing you should consider is to add a child
identifier to handler arguments that inform handler about *which* parent
child is currently visited: this information is not otherwise deducible.

Sorry, I don't understand what you mean by "which parent child is
currently visited". Care to clarify?

To better understand this try to imagine an application that want to
check lhs of assign binary operator.

As you've noted in your documentation it might be useful to know the
ancestor chain of current node and for this aim it is possibile to save
a stack of node entered, but what about to know *which* child of its
parent current node is?

This info is not available or deducible *unless* the parent visit (in
its general purpose code) pass down this information.

This make possibile to node visit to answer to the following questions:

"I'm the lhs or the rhs expression?", "I'm the return type of function
decl or the full function type?", "I'm the return type of function type
or an argument type?", etc.

Another very useful thing you should consider is to add a child
identifier to handler arguments that inform handler about *which* parent
child is currently visited: this information is not otherwise deducible.

Sorry, I don't understand what you mean by "which parent child is
currently visited". Care to clarify?

To better understand this try to imagine an application that want to
check lhs of assign binary operator.

As you've noted in your documentation it might be useful to know the
ancestor chain of current node and for this aim it is possibile to save
a stack of node entered, but what about to know *which* child of its
parent current node is?

This info is not available or deducible *unless* the parent visit (in

It is easily deducible. Just search for the current node (a pointer)
in the parent's children.

its general purpose code) pass down this information.

This make possibile to node visit to answer to the following questions:

"I'm the lhs or the rhs expression?", "I'm the return type of function
decl or the full function type?", "I'm the return type of function type
or an argument type?", etc.

Just compare the pointer to the current node with the parent node's
lhs expression child, etc.

I've build my AST traversal on top of llvm's df_iterator.

To get an idea, check

https://gforge.zih.tu-dresden.de/scm/viewvc.php/trunk/Scout/clangAddons/include/clang/ASTProcessing/StmtTraversal.h?root=hicfd&view=markup

stmt_iterator enables the traversal over a particular type, which at least for me is sufficient in most cases. For more complex tasks there is visit_df taking a StmtVisitor.

Best regards
Olaf Krzikalla

Hi, all,

Craig and I implemented most of the new design. I uploaded the patch
to http://codereview.appspot.com/1607042 for easy reviewing - you can
download it from
http://codereview.appspot.com/download/issue1607042_1.diff

In the implementation we made a slight change to the proposed design:
instead of overloading Traverse() for Stmt, Type, and Decl, we prefer
to have separate TraverseStmt, TraverseType, and TraverseDecl
functions. We think that makes the code clearer.

Chandler plan to review it after 3pm today. Doug, would you have some
time to look at this today? Thanks!

Hi, all,

Craig and I implemented most of the new design. I uploaded the patch
to http://codereview.appspot.com/1607042 for easy reviewing - you can
download it from
http://codereview.appspot.com/download/issue1607042_1.diff

In the implementation we made a slight change to the proposed design:
instead of overloading Traverse() for Stmt, Type, and Decl, we prefer
to have separate TraverseStmt, TraverseType, and TraverseDecl
functions. We think that makes the code clearer.

Chandler plan to review it after 3pm today. Doug, would you have some
time to look at this today? Thanks!

I've added my comments up on codereview.appspot.com. Looks good!

  - Doug

Thanks for the excellent comments, Doug and Chandler! I've uploaded a
new patch to the codereview app.