RFC: AST Matcher Sub-expression references implementation straw man

Hi all,

I've been looking at implementing sub-expression references in AST Matchers. I'd like to propose an implementation and get feedback, especially if I've overlooked something.

+djasper, alexfh

+djasper, alexfh

Hi all,

I've been looking at implementing sub-expression references in AST
Matchers. I'd like to propose an implementation and get feedback,
especially if I've overlooked something.

======
Quick motivation first:
The goal of sub-expression references is to write matchers that can refer
to nodes tagged by the matcher and use these nodes in more tests. Consider
the following simple example that wants to find variable declarations whose
type exactly matches the initializer type (for simplicity, lets ignore all
the implicit cast stuff that will exist around the initializer):

varDecl(hasType(qualType(...).bind("declType")),
               hasInitializer(hasType(sameAs("declType")))

Minor point: I'd guess we'll need to specify the type here explicitly, like
sameAs<QualType>("declType"). Also, I'm not very happy with the color of
the bike shed called "sameAs" yet :slight_smile:

)

There's potential for other sub-expression reference matchers beyond the
"sameAs()" matcher proposed by this example. Name comparisons for example.

Proposed changes to implement sub-expression references:
- In a sub-ex matcher, look-up in the provided BoundNodesTreeBuilder for
the given id.
  - If the id exists perform matcher logic, return truth value.
  - If an id doesn't exist, record an unresolved sub-expression reference
in the Builder and return true.
    - Info recorded: id we're looking for and some sort of matcher
"future": a callback or functor for executing matcher logic at a later time
along with all necessary data matcher needs.
- When control returns to the top level finder logic
(MatchASTVisitor::match()) we tend to unresolved sub-ex refs. For each
recorded unresolved sub-ex ref, do a search through the BoundNodesTree for
a matching id. If one is found, execute matcher future. If the future
returns false, the match fails and the Callback that would normally get
called will not be. If no id is found, match also fails. If the matcher
future returns true, proceed to calling Callback for the match. The
handling of unresolved sub-ex refs could be handled by modifying
MatchVisitor.

I'm currently opposed to unresolved subexpression matcher handling (but
happy to be convinced otherwise).
Currently, we have:
a) a defined order of execution
b) a well-defined cut-off for logical expressions

That is, if you say anyOf(a(), b()), b() will not be matched if a()
matches. Now a() could contain a sub-expression matcher referring to
something that might be bound in b(); how would you propose to solve this?

And on a more principled note:
Which things can we solve with unresolved subexpression matchers that
cannot also be solved by re-arranging the matcher so that the bind() to the
id must always happen before the sameAs match, otherwise sameAs will not
match?

+djasper, alexfh

** **

** **

*From:* Manuel Klimek [mailto:klimek@google.com]
*Sent:* Wednesday, April 17, 2013 3:45 AM
*To:* Vane, Edwin; Daniel Jasper; Alexander Kornienko
*Cc:* Clang Dev List (cfe-dev@cs.uiuc.edu)
*Subject:* Re: RFC: AST Matcher Sub-expression references implementation
straw man****

** **

On Tue, Apr 16, 2013 at 8:47 PM, Manuel Klimek <klimek@google.com> wrote:*
***

+djasper, alexfh****

** **

****

Hi all,

I've been looking at implementing sub-expression references in AST
Matchers. I'd like to propose an implementation and get feedback,
especially if I've overlooked something.

======
Quick motivation first:
The goal of sub-expression references is to write matchers that can refer
to nodes tagged by the matcher and use these nodes in more tests. Consider
the following simple example that wants to find variable declarations whose
type exactly matches the initializer type (for simplicity, lets ignore all
the implicit cast stuff that will exist around the initializer):

varDecl(hasType(qualType(...).bind("declType")),
               hasInitializer(hasType(sameAs("declType")))****

** **

Minor point: I'd guess we'll need to specify the type here explicitly,
like sameAs<QualType>("declType"). Also, I'm not very happy with the color
of the bike shed called "sameAs" yet :)****

** **

*[Edwin] *I had thought the type could be inferred from which matcher the
sub-ex ref is passed to à la PolymorphicMatchWithParam*. Ambiguities would
be solved the same way we currently have to resolve ambiguities in the
polymorphic matchers; by explicitly inserting a dyncast matcher:****

hasType(qualType(sameAs(“blah”)))

You're right...

I have no attachments to the name. I’m open to suggestions.

Daniel usually has good naming ideas :slight_smile:

If we don’t have to worry about unresolved sub-expression references then the implementation of a sub-ex ref matcher is pretty straight-forward. However, the ‘look up node in the BoundNodesTree’ statement is where all the trickiness is. Using existing machinery to do this could be computationally expensive. I’m going to do a bit of profiling to check this out. I’ve seen some mention of profiling stubs in the code already. Are there any tips or best known methods I should be aware of?

If we don’t have to worry about unresolved sub-expression references
then the implementation of a sub-ex ref matcher is pretty straight-forward.
However, the ‘look up node in the BoundNodesTree’ statement is where all
the trickiness is. Using existing machinery to do this could be
computationally expensive. I’m going to do a bit of profiling to check this
out. I’ve seen some mention of profiling stubs in the code already. Are
there any tips or best known methods I should be aware of?

Not really - once you have something, I'll patch it in and test it over our
100MLOC code base and see how bad it is :wink:
Apart from that, just use a sufficiently large TU that takes multiple
seconds to parse and run over that... Declarations are where most of the
parsing time goes anyway :P...

Do you already have a place to insert this sameAs() matcher then? Any changes I make would probably only affect sub-ex ref matchers.

Why are declarations more expensive?

Do you already have a place to insert this sameAs() matcher then? Any
changes I make would probably only affect sub-ex ref matchers.

I'm not sure I understand what you mean. (apart from: insert next to
equalsNode in ASTMatchers.h :P)

Regarding the name: we can either name overload equalsNode with a string,
or call it equalsBoundNode.

Why are declarations more expensive?

They just are generally hard to parse in C++ (so much can go wrong), and
most of the transitive closure of TUs are declarations...

Cheers,
/Manuel