[RFC] Dynamic AST Matcher library

Hello all,

We have been working on a layer on top of the AST Matcher library to support creating arbitrary matchers at runtime.
I have a proof of concept working on the tooling branch and wanted to send a proposal to make it a real thing.
The proposal doc is on google docs at:
https://docs.google.com/a/google.com/document/d/1sUSlxwJsTdU39n1uapTGz7Xl3vejC1aZtri2IXHkNws/edit

Inlined copy below for those that prefer it this way. Feel free to comment on either.
_Sam
# Design: Dynamic AST MatchersThis document contains a design proposal for a dynamic AST matcher layer that supports creating arbitrary matchers at runtime, which will allow for generic tools to be developed.
## ContextThe existing AST matcher framework provides a very powerful and compact way to traverse the AST and find specific nodes by providing complex predicates.
However, this framework requires that all predicates to be used at runtime are fully defined at compile time. The creation of these predicates can be a big challenge in some cases, and this limitation forces a ‘recompile’ step on this creative loop.

Also, without being able to specify matchers at runtime we can’t support more generic search use cases. For example, an editor-based search tool that supports an AST matcher predicates.
## Goals- Runtime parsing of matcher expressions, including literals and predicates. Feedback for syntax/semantic errors in the expressions

  • Create matchers at runtime by name, with dynamically created arguments if required by the matcher
  • Query command line tool that runs the matcher on a specific compilation unit and reports all matches

Non-goals- Dynamic refactoring tools. They might be implemented using dynamic matchers, but it is out of the scope for this library

  • Full C++ expression support for predicates. Values are going to be limited to literals (eg. 1, “foo”). Expressions like 1+2 will not be parsed. Some useful constants (like INT_MAX) might be available

Code locationThe main output of the project will be a library to be used by AST matcher related clang-tools, and should live close to the already existing AST matcher library, but as a separate library

## Matcher implementationOne aspect that simplifies the implementation of dynamic matchers is to have a uniform generic signature for creating them. Otherwise, whatever code that creates them will require explicit knowledge of each matcher, or at least of each signature.
The generic signature is of the form VariantType (*)(const std::vector& args). A variant type instance can contain numbers, booleans, strings and Matchers, as well as error related information for return values.
Instead of reimplementing all matchers to have a generic signature, we implement marshaller functions that provide the generic signature on top of the regular static one. The marshallers verify that the argument count and types are correct and calls the underlying constructor function.
## RegistryAt initialization of the framework, all known matchers are added into a registry. This registry provides a way to create matchers at runtime by name.
This registry is the only component that would require maintenance when the ASTMatcher library is updated. Any new matcher, and new overloads for existing matchers, would have to be added to the registry. In most cases, a single line of code per matcher does the job.
## Matcher expression parsingA simple matcher expression parser, combined with the matcher registry, allows clients to go from a user provided string to a Matcher. The output of the parser is a variant type also, which allows it to return Matchers as well as syntax and semantic errors.
The parser interface provides a way for the client to inject and/or modify the matcher tree as it is being created. For example, a client could insert Id() matcher around specific nodes without having to do it in the string form.

FYI it seems like the google doc has not been "shared".

-- Sean Silva

This almost seems like an indictment of clang. Isn't the whole point
of clang that we have a library which is capable of doing this?

It seems like the fundamental issue that the dynamic matchers are
trying to address is the long iteration time required by recompile.
Effectively, the dynamic matcher infrastructure would then simply be
special purpose compiler for a subset of C++ with hardcoded semantics
just enough to support the ASTMatchers. That really seems like a
band-aid.

There has been quite a bit of work around speeding up this iteration
loop using clang. The big one I can think of is cling
<http://root.cern.ch/drupal/content/cling>. There is also an IDE (I
think intended for game development) which uses Clang+JIT to
dynamically run and recompile programs (e.g. you can change a constant
in the source which represents a color, and in real time the running
program reflects this change). There is even the relatively
unmaintained examples/clang-interpreter which, while being quite crude
and not feature-complete, is a remarkably short proof of concept.

Is there currently some unacceptable "minimum time" needed to process
C++ with Clang? If so, that sounds like a bug :slight_smile:

-- Sean Silva

Sorry about that.
Apparently I can't share that specific document.
This copy should work:
https://docs.google.com/document/d/1H3queuLjEYWdgpHzo7Mj3ReBXJdpSI8vcH9Y2p_4Y-8/edit

_Sam

> Full C++ expression support for predicates. Values are going to be
limited
> to literals (eg. 1, “foo”). Expressions like 1+2 will not be parsed. Some
> useful constants (like INT_MAX) might be available

This almost seems like an indictment of clang. Isn't the whole point
of clang that we have a library which is capable of doing this?

True. You could use clang itself to parse it.
For the proof-of-concept having a simple parser seemed easier.

It seems like the fundamental issue that the dynamic matchers are
trying to address is the long iteration time required by recompile.

There are two use cases for the dynamic matchers:
- Faster development time for someone creating matcher expressions for a
specific purpose.
- Generic services that accept arbitrary matchers at runtime. Eg. a
grep-like tool that uses matchers instead of regex.

The latter was my main motivation for this work.

Effectively, the dynamic matcher infrastructure would then simply be
special purpose compiler for a subset of C++ with hardcoded semantics
just enough to support the ASTMatchers. That really seems like a
band-aid.

I think you are referring to the parser here, which is the one that is
limited to declarative matcher expressions.
The rest of the library (variant type, registry, etc) allows runtime
generation of arbitrary matchers. This can be used independently of the
parser if the generation of the expression is done in some other way.

There has been quite a bit of work around speeding up this iteration
loop using clang. The big one I can think of is cling
<http://root.cern.ch/drupal/content/cling>. There is also an IDE (I
think intended for game development) which uses Clang+JIT to
dynamically run and recompile programs (e.g. you can change a constant
in the source which represents a color, and in real time the running
program reflects this change). There is even the relatively
unmaintained examples/clang-interpreter which, while being quite crude
and not feature-complete, is a remarkably short proof of concept.

I'll look into JIT'n the matchers.
It could be a simpler solution for the problem.

_Sam

Ah, ok.

I think that a grep-like tool is a good idea for running on single
translation units (I once mentioned to klimek how I wanted something
like `clang -ast-print -ast-match=<arbitrary ASTMatcher expression>
foo.cc`).

I have some thoughts about the utility of this approach for searching
a full codebase, which I think is a general (albeit long-term-ish)
goal for everyone:

My concern is that a "grep-like" tool does a linear search, which is
going to take so long that I'm not sure that it will really be a huge
benefit to having "dynamic matchers" compared to just writing C++. I
haven't timed "building" with just -fsyntax-only, but I wouldn't
expect it to be faster than 5 minutes on my machine just to parse all
of LLVM/Clang. Larger code bases will be even more problematic.
Obviously if you have 10000 machines to run this on and the
infrastructure to rapidly distribute jobs to them---as you guys do at
Google---then this could be completely parallelized and have the
critical path being the longest translation unit + dispatching
latency; however, I would be surprised if there were more than a
handful of companies (let alone individual developers) with the
infrastructure in place to permit the creation of an
acceptably-performant tool for codebase-wide search with this approach
(AFAIK most distributed processing that companies have in place is
oriented at batch jobs where nontrivial (e.g. 10 sec, 1 min) minimum
dispatching latency is acceptable). Even if you have the
infrastructure available, the operational expenses that this would
incur might be prohibitive: nowadays any developer can easily spin up
their own EC2/whatever cloud, but the cost would be quite large to get
enough parallelism (utilization would also be extremely low). So from
my perspective the "grep-like" approach for full codebase search seems
like a piece of the puzzle.

Have you considered looking into a general mechanism for indexing the
AST, so that searches can be performed quickly? That would allow for
the creation of a general-purpose tool which, say, every C/C++
developer could have available to them and could run locally on their
machine. For example, this could be a service provided by clangd, and
the index could be updated as a by-product of the build process (e.g.
CXXFLAGS+='-update-index=~/foo.idx'). The index need not necessarily
contain everything needed for an exact match, but just enough to
greatly cut down the dataset that needs to be "grepped". This is very
similar to Russ Cox's description of how google code search used to
work <http://swtch.com/~rsc/regexp/regexp4.html>, except that instead
of node patterns in the clang AST, they dealt with regexes matching
flat text (it is a fantastic article btw, you should read it if you
haven't). I think that approach could be made to work for the AST.

These are longer-term thoughts but they might be worth keeping in mind.

-- Sean Silva

> There are two use cases for the dynamic matchers:
> - Faster development time for someone creating matcher expressions for a
> specific purpose.
> - Generic services that accept arbitrary matchers at runtime. Eg. a
> grep-like tool that uses matchers instead of regex.
>
> The latter was my main motivation for this work.

Ah, ok.

I think that a grep-like tool is a good idea for running on single
translation units (I once mentioned to klimek how I wanted something
like `clang -ast-print -ast-match=<arbitrary ASTMatcher expression>
foo.cc`).

I have some thoughts about the utility of this approach for searching
a full codebase, which I think is a general (albeit long-term-ish)
goal for everyone:

My concern is that a "grep-like" tool does a linear search, which is
going to take so long that I'm not sure that it will really be a huge
benefit to having "dynamic matchers" compared to just writing C++.

It simplifies the usage of matchers.
Writing/compiling/running a matcher directly in C++ takes way longer than
just running a prebuilt binary with a string argument.

I haven't timed "building" with just -fsyntax-only, but I wouldn't

expect it to be faster than 5 minutes on my machine just to parse all
of LLVM/Clang. Larger code bases will be even more problematic.

From what I saw, parsing and creating the AST is way more expensive than

running the matchers.
The extra overhead of the dynamic dispatched layer does not seem to have
any effect on the cost of matching.
The AST could be cached in a serialized form to reduce the effect of
parsing the code.

Obviously if you have 10000 machines to run this on and the
infrastructure to rapidly distribute jobs to them---as you guys do at
Google---then this could be completely parallelized and have the
critical path being the longest translation unit + dispatching
latency; however, I would be surprised if there were more than a
handful of companies (let alone individual developers) with the
infrastructure in place to permit the creation of an
acceptably-performant tool for codebase-wide search with this approach
(AFAIK most distributed processing that companies have in place is
oriented at batch jobs where nontrivial (e.g. 10 sec, 1 min) minimum
dispatching latency is acceptable). Even if you have the
infrastructure available, the operational expenses that this would
incur might be prohibitive: nowadays any developer can easily spin up
their own EC2/whatever cloud, but the cost would be quite large to get
enough parallelism (utilization would also be extremely low). So from
my perspective the "grep-like" approach for full codebase search seems
like a piece of the puzzle.

Have you considered looking into a general mechanism for indexing the
AST, so that searches can be performed quickly?

I have given it some thought.
For example, on a matcher like:

memberCallExpr(thisPointerType(recordDecl(hasName(class_name_)))),
               callExpr(callee(methodDecl(hasName(call_name_)))))

using an index for hasName() would be easy.
Just filtering out compilation units that do not know about a class called
class_name_ can help in most cases.
Indexing common sub-expressions like
memberCallExpr(thisPointerType(recordDecl(hasName(class_name_))))) could
also have some benefit.
I don't think you could have an index that benefits all matchers, but the
most common ones could be taken care of.

That would allow for

As additional context, note that:

  • we can run servers that have all of the source code pre-parsed in memory; being able to run multiple matchers in parallel over a pre-built AST is thus possible as long as we don’t need to recompile
  • jit-ing C++ code inside those servers is both not simple (yay C++) and has security implications that not everybody might be happy with
  • the proposed “language” is something-like-s-expressions with some constants on top of it; that’s hardly C++

Cheers,
/Manuel

Pinging this thread. There is still interest on this on our side.

I’d say that setting up a CL would be a next step; I’d propose to put it into ASTMatchers, perhaps starting with a minimal implementation (I know that you already have a more complete set, but it would probably be easier to review if we can split that up somehow).

Cheers,
/Manuel