[RFC] Improving Support for Debugging Matchers

Objective

The goal for this proposal is to improve the experience of debugging Clang’s AST matchers and of understanding what they are doing.

Background

AST Matchers are a powerful tool that allows users to pick out nodes in the Clang AST, provided they match the specified constraints and criteria. Yet, creating a well-formed matcher can be difficult. Matchers can become complex when multiple layers of matchers are nested to create a larger compound matcher, which can be necessary to select parts of the AST that fulfill strict criteria. Since matchers only provide a boolean response as to whether they obtained any matches or did not obtain any matches, the user receives no explanation for when a matcher fails to produce any bindings. Even when a matcher produces bindings, performing introspection to debug false positive bindings can be tedious. When the result of a matcher is not expected, determining how the result was obtained is made difficult by this limited response. Without any other options to interact with matchers, other than trial-and-error based on binary feedback, this lack of transparency makes it difficult to introspect matcher behavior and identify problems (such as which part of a compound matcher is inconsistent with expectations). This often causes the process of debugging a faulty matcher to be slow, confusing, and frustrating.

Design ideas## Proposed Solution

The proposed solution is composed of two parts.

The first part (https://reviews.llvm.org/D113917) is to expose a getName method on the Matcher class. Currently, there is no easy way to access any form of identification information for any matcher. This method will act as a hook that allows for accessing the name of a matcher programmatically. While this method will be helpful for the implementation for the withTag matcher, it could also prove to be useful for other forms of matcher introspection/debugging.

The second part (https://reviews.llvm.org/D113943) is to create an introspection node matcher (named the withTag matcher). This matcher will act as a way to enable verbose logging for matchers, which introduces the possibility for a workflow similar to printf debugging. This matcher is templated - it should accept an inner matcher of any type, and itself be a matcher of that type. For example, if the withTag matcher has an inner matcher of type Matcher, it will return a Matcher. In terms of functionality, the withTag matcher will not affect the outcome of the inner matcher and should directly return whatever the inner matcher returns.

Before:



varDecl(hasName(“x”), hasTypeLoc(typeLoc().bind(“TL”)))

|

  • |

After:



withTag(varDecl(hasName(“x”), hasTypeLoc(typeLoc().bind(“TL”))))

|

  • |

This matcher will be able to provide information such as:

  • Matcher Type

  • Matcher Name

  • Matcher Success

  • Node Kind

  • Node Source

The matcher will provide output by writing to stderr. Though the exact format of the output is subject to change, it could look like this, for the matcher withTag(varDecl(withTag(hasName(“x”)), hasTypeLoc(pointerTypeLoc()))):



:star: Attempting new match



Matcher Name: VarDecl Node



Node Kind: TypedefDecl



Node Value:



<br><br><br><br>typedef char *__builtin_va_list<br><br><br><br>



Node AST:



TypedefDecl 0x479af10 <> implicit __builtin_va_list ‘char *’



-PointerType 0x47e6360 'char *'<br><br><br><br>-BuiltinType 0x47e54d0 ‘char’



Result: Unsuccessful



:heavy_check_mark: Concluding attempt




:star: Attempting new match



Matcher Name: VarDecl Node



Node Kind: VarDecl



Node Value:



<br><br><br><br>void *x<br><br><br><br>



Node AST:



VarDecl 0x479af80 <input.cc:1:1, col:7> col:7 x ‘void *’



:star: Attempting new match



Matcher Name: HasName



Node Kind: VarDecl



Node Value:



<br><br><br><br>void *x<br><br><br><br>



Node AST:



VarDecl 0x479af80 <input.cc:1:1, col:7> col:7 x ‘void *’



Result: Successful



:heavy_check_mark: Concluding attempt




Result: Successful



:heavy_check_mark: Concluding attempt


|

  • |

From this output, a user is able to see things such as information about each node the matcher attempts to match, information about which part of the compound matcher is being matched, and the success of each matcher.

Pros

  • The design revolves around a matcher, which is a familiar pattern/construct

  • The design has low overhead for using and removing - it requires wrapping one or more matchers with the withTag matcher.

  • The design allows for a great degree of control - one can pick out certain matchers within a compound matcher to analyze.

Cons- One withTag matcher will only be able to report information about the inner matcher, and not necessarily matchers within the inner matcher

  • However, it may be possible to provide a hook to access the inner matcher’s inner matcher

  • Printf debugging is just one method of diagnosing issues, and may not be everyone’s preferred approach to debugging. Alternative debugging approaches could include something like steveire’s debugger (https://steveire.wordpress.com/2019/04/16/debugging-clang-ast-matchers/).

Questions- Currently, getName returns a string. Should we consider returning an enum, instead?

  • One reason we have opted for a string, as of now, is because users may define their own private matchers which may not have a value in the enum.

  • If getName were to return a string, what formats/conventions should we adhere to?

  • Currently, the returned strings are not named in a very formal approach. For example, most traversal and narrowing matchers have a getName that returns the name of the matcher exactly, whereas most node matchers have a getName that returns a string containing the name of the node (ie. VarDecl instead of varDecl)

+Stephen Kelly

Thank you for proposing this! I think making AST matchers easier to
use and debug is a fantastic goal. Some thoughts below.

Objective

The goal for this proposal is to improve the experience of debugging Clang’s AST matchers and of understanding what they are doing.

Background

AST Matchers are a powerful tool that allows users to pick out nodes in the Clang AST, provided they match the specified constraints and criteria. Yet, creating a well-formed matcher can be difficult. Matchers can become complex when multiple layers of matchers are nested to create a larger compound matcher, which can be necessary to select parts of the AST that fulfill strict criteria. Since matchers only provide a boolean response as to whether they obtained any matches or did not obtain any matches, the user receives no explanation for when a matcher fails to produce any bindings. Even when a matcher produces bindings, performing introspection to debug false positive bindings can be tedious. When the result of a matcher is not expected, determining how the result was obtained is made difficult by this limited response. Without any other options to interact with matchers, other than trial-and-error based on binary feedback, this lack of transparency makes it difficult to introspect matcher behavior and identify problems (such as which part of a compound matcher is inconsistent with expectations). This often causes the process of debugging a faulty matcher to be slow, confusing, and frustrating.

Design ideas

Proposed Solution

The proposed solution is composed of two parts.

The first part (⚙ D113917 Add infrastructure to support matcher names.) is to expose a getName method on the Matcher class. Currently, there is no easy way to access any form of identification information for any matcher. This method will act as a hook that allows for accessing the name of a matcher programmatically. While this method will be helpful for the implementation for the withTag matcher, it could also prove to be useful for other forms of matcher introspection/debugging.

The second part (⚙ D113943 Add `withIntrospection` matcher.) is to create an introspection node matcher (named the withTag matcher). This matcher will act as a way to enable verbose logging for matchers, which introduces the possibility for a workflow similar to printf debugging. This matcher is templated - it should accept an inner matcher of any type, and itself be a matcher of that type. For example, if the withTag matcher has an inner matcher of type Matcher<TypeLocMatcher>, it will return a Matcher<TypeLocMatcher>. In terms of functionality, the withTag matcher will not affect the outcome of the inner matcher and should directly return whatever the inner matcher returns.

Before:

varDecl(hasName("x"), hasTypeLoc(typeLoc().bind("TL")))

After:

withTag(varDecl(hasName("x"), hasTypeLoc(typeLoc().bind("TL"))))

This matcher will be able to provide information such as:

Matcher Type

Matcher Name

Matcher Success

Node Kind

Node Source

The matcher will provide output by writing to stderr.

I'm not opposed to it, but making this user configurable seems like a
better approach (not everyone wants to write to stderr or go through
redirections to get the data into a file). Also, we may want to
consider what formats we want to provide this information. You give an
example below, but this sounds a lot like a case where we might want
to provide structured output (e.g., JSON) so that other tools can
consume it to provide a better user interface. (That seems like
something that can be done in follow-ups rather than something
required for the initial feature.)

Though the exact format of the output is subject to change, it could look like this, for the matcher withTag(varDecl(withTag(hasName("x")), hasTypeLoc(pointerTypeLoc()))):

:star: Attempting new match

Matcher Name: `VarDecl` Node

Node Kind: TypedefDecl

Node Value:


typedef char *__builtin_va_list

Node AST:

TypedefDecl 0x479af10 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'char *'

`-PointerType 0x47e6360 'char *'

  `-BuiltinType 0x47e54d0 'char'

Result: Unsuccessful

:heavy_check_mark: Concluding attempt

:star: Attempting new match

Matcher Name: `VarDecl` Node

Node Kind: VarDecl

Node Value:


void *x

Node AST:

VarDecl 0x479af80 <input.cc:1:1, col:7> col:7 x 'void *'

:star: Attempting new match

Matcher Name: HasName

Node Kind: VarDecl

Node Value:


void *x

Node AST:

VarDecl 0x479af80 <input.cc:1:1, col:7> col:7 x 'void *'

Result: Successful

:heavy_check_mark: Concluding attempt

Result: Successful

:heavy_check_mark: Concluding attempt

From this output, a user is able to see things such as information about each node the matcher attempts to match, information about which part of the compound matcher is being matched, and the success of each matcher.

Pros

The design revolves around a matcher, which is a familiar pattern/construct

The design has low overhead for using and removing - it requires wrapping one or more matchers with the withTag matcher.

The design allows for a great degree of control - one can pick out certain matchers within a compound matcher to analyze.

Cons

One withTag matcher will only be able to report information about the inner matcher, and not necessarily matchers within the inner matcher

However, it may be possible to provide a hook to access the inner matcher’s inner matcher

Another con that I see is: this is likely to get very chatty when
working on complex matchers. We might need some notion of a verbosity
level for the information.

Printf debugging is just one method of diagnosing issues, and may not be everyone’s preferred approach to debugging. Alternative debugging approaches could include something like steveire’s debugger (Debugging Clang AST Matchers | Steveire's Blog).

FWIW, I loved a lot of the concepts Stephen was developing, so
enabling this kind of technology would be fantastic. One thing I think
is worth considering is that clang-query is currently the go-to tool
for developing complex AST matchers because of the REPL nature of the
tool. While printf-style debugging gets more information, I do wonder
if we should be considering how to integrate this into clang-query
more directly.

Questions

Currently, getName returns a string. Should we consider returning an enum, instead?

We sometimes rename matchers, so I think using an enumeration may have
some nice (or possibly annoying, depending on perspective) properties.
If we rename something and getName() returns a string, it's more of a
silent change; using an enumeration is more likely to produce a
compile error if someone is relying on the name for some reason. I
don't currently have a strong preference because I can see arguments
either way.

One reason we have opted for a string, as of now, is because users may define their own private matchers which may not have a value in the enum.

That's a pretty compelling reason to go with the string approach.

If getName were to return a string, what formats/conventions should we adhere to?

Currently, the returned strings are not named in a very formal approach. For example, most traversal and narrowing matchers have a getName that returns the name of the matcher exactly, whereas most node matchers have a getName that returns a string containing the name of the node (ie. VarDecl instead of varDecl)

Naively, I think users are going to expect getName() to return the
name of the matcher as it is written (that also makes it easier to
visually distinguish between the matcher and the matched node).

~Aaron

Hello Aaron,

Thanks for the feedback! It seems that the goal and comments here can be broken down into two categories:

  • Establishing some infrastructure to provide information about matchers

  • Creating tooling to access this information

Focusing on the first point, the decision regarding what to return from getName has many different feasible options. Potentially, we may want the function to return additional information about the matcher. In that sense, I agree that there may be a better name than getName. Additionally, instead of returning a string or an enum directly, perhaps it would be viable to return a “MatcherInfo” object (exposed through getMatcherInfo). That object could initially carry the name (exposed through getName), but we could later add support for an enum and other useful information (ex. interesting properties, child matchers, etc.).

As for the second point, I agree that there are a lot of options to consider for presenting this information (integrating with clang-query, structuring output, etc). All the ideas you proposed seem to be viable - one of the benefits of breaking this problem down into the two above points is that once the infrastructure is established, all of these solutions can be explored. This should allow us to achieve the goal of supporting multiple debugging workflows. The withTag matcher, specifically, can be useful for cases such as analyzing unit tests which fail due to faulty matchers - wrapping the matcher in a withTag matcher would be a quick way to gain some insight into its behavior. Ultimately, the effort of providing better matcher introspection can be continued by expanding the functionality of the withTag matcher, or by working on different approaches that utilize getMatcherInfo.

Thanks,

James

Hi,

Gentle ping on this proposal. James has finished his internship, but I’m eager to see the work through. :slight_smile:

Thanks,
Yitzhak

Hi,

Very nice proposal! I am very new to AST matchers and clang in general, and having gone through quite a few misunderstandings and debugging lately, I would very much appreciate an improved debugging experience! (At the same time, take my opinions with a grain of salt - I don’t have a good overview about the AST matcher design, yet)

  • I consider the printf-debugging provided by withTag/withIntrospection as useful. Even if there were other, more powerful debugging utilities, I would still consider it useful.

  • I would probably use withTag primarily from clang-query. Is https://reviews.llvm.org/D113943 sufficient for withTag to work also in clang-query? Or is there more plumbing required?

  • Output format:

  • To me, Matcher Name: VarDecl Node feels like unnecessary clutter. I am usually debugging a matcher which I just wrote a couple of minutes ago. I still remember which matcher I used

  • Instead, I would prefer a possibility to provide a custom name. Something like withTag("myName", varDecl(hasName("x"), hasTypeLoc(typeLoc().bind("TL")))). This would allow me to distinguish the output for, e.g., multiple matchers of the same type. (Effectively, this approach might make getName unnecessary?)

  • How would output for nested withTag look like? E.g. for withTag(cxxMethodDecl(ofClass(hasName("C")), forEachOverridden(withTag(cxxMethodDecl()))))? Would the match attempts be nested similar to ``? Or will it all be flattened?
    Cheers,
    Adrian

Adrian,

I’m glad to hear this sounds useful!

Regarding the name: we think it’s important for the case of nested annotations. I agree that if there’s just one, you don’t need the name, but it’s helpful once you have multiple logging matchers. That said, the source of names can be flexible. We actually started w/ custom names (because that’s easier, avoiding any need for infrastructure changes), but moved to auto-naming because it reduced the clutter in the matcher (and user burden). We’re also considering a modal version of withTag just like traverse, where surrounding a given matcher turns on logging for it and its descendants. In that case, auto-naming becomes even more important.

It seems reasonable to allow the user to choose between all the naming options we’ve discussed: auto-name, custom name and no-name. The question will only be what the default choice is.

Can you clarify your question about nesting? I’m not sure what you mean by “similar to ``”. Are you referring to visual indentation?

Can you clarify your question about nesting? I’m not sure what you mean by “similar to ``”. Are you referring to visual indentation?

I just realized that your initial example already answered my question.
Yes, I was referring to visual indentation.
E.g., I would

withTag(cxxMethodDecl(ofClass(hasName(“C”)), forEachOverridden(withTag(cxxMethodDecl()))))
prefer to produce the output
:star: Attempting new match (outer cxxMethodDecl)
:star::star: Attempting new match (inner cxxMethodDecl)
:star::heavy_check_mark: Concluding match (inner cxxMethodDecl)
:star::star: Attempting new match (inner cxxMethodDecl)
:star::heavy_check_mark: Concluding match (inner cxxMethodDecl)

:heavy_check_mark: Concluding attempt (outer cxxMethodDecl)
instead of
:star: Attempting new match (outer cxxMethodDecl)
:star: Attempting new match (inner cxxMethodDecl)
:heavy_check_mark: Concluding match (inner cxxMethodDecl)
:star: Attempting new match (inner cxxMethodDecl)
:heavy_check_mark: Concluding match (inner cxxMethodDecl)

:heavy_check_mark: Concluding attempt (outer cxxMethodDecl
as it would allow me to see the nesting depth more easily.
Your initial example shows that the proposed patch would print the 2nd variant, i.e. without indentation

Agreed – visual nesting would be nice. However, that seems to require support within the context (like traverse) which we haven’t planned for yet.

Hi everyone,

I’ve recently begun working on this, taking over for James. I have built upon James’ initial prototype implementation, and have some changes to the proposal, largely based on some of the earlier review comments and discussion. Just wanted to put feelers out there on ideas/comments on them:

Change getName() to getSpec()

Rather than matchers returning their name (or, in some cases, the type of node matched), a call to getSpec() will return a MatcherSpec, which contains info on the matcher tree.

MatcherSpec contains the following:

  • Name – essentially what was originally returned by getName(), either a specific matcher name, or the type of node that is matched against
  • Children – a list of all the internal matchers’ MatcherSpec
  • Metadata – other potentially useful data about the matcher, such as Type (the type of node matched against, for node-agnostic matchers like has), or values for non-matcher arguments passed in, like Name

For example, a call to functionDecl(has(varDecl(hasName("name")))).getSpec() could return the following MatcherSpec (printed as):

{ "Name" : "`FunctionDecl` Node",
   "Metadata": {
   },
   "Children" : [
     { "Name" : "Has",
       "Metadata": {
         "Type" : "`FunctionDecl` Node", 
       }
       "Children" : [
         { "Name" : "`VarDecl` Node",
           "Metadata": {
           },
           "Children" : [
             { "Name" : "HasName",
               "Metadata": {
                 "Names" : [ "name", ], 
                 "UseUnqualifiedMatch" : "true", 
               },
               "Children" : [
               ],
             },
           ],
         },
       ],
     },
   ]

Cheers,
Aaron