RFC: Easier AST Matching by Default

Stephen_Kelly · December 20, 2019, 9:01pm

Hi,

(Apologies if you receive this twice. GMail classified the first one as spam)

Aaron Ballman and I met by chance in Belfast and we discussed a way forward towards making AST Matchers easier to use, particularly for C++ developers who are not familiar with the details of the Clang AST.

For those unaware, I expanded on this in the EuroLLVM conference this year, and then expanded on it at ACCU:

The Future of AST-Matching refactoring tools (EuroLLVM and ACCU) | Steveire's Blog

One step in the process of getting there is changing the default behavior of AST Matchers to ignore invisible nodes while matching using the C++ API, and while matching and dumping AST nodes in clang-query.

I think this is the most important change in the entire proposal as it sets out the intention of making the AST Matchers easier to use for C++ developers who are not already familiar with Clang APIs.

To that end, I've written an AST to motivate the change:

We're looking for feedback before pressing forward with the change. I already have some patches written to port clang-tidy and unit tests to account for the change of default.

Thanks,

Stephen.

Aaron_Ballman · December 20, 2019, 9:06pm

Hi,

(Apologies if you receive this twice. GMail classified the first one as
spam)

Aaron Ballman and I met by chance in Belfast and we discussed a way
forward towards making AST Matchers easier to use, particularly for C++
developers who are not familiar with the details of the Clang AST.

For those unaware, I expanded on this in the EuroLLVM conference this
year, and then expanded on it at ACCU:

The Future of AST-Matching refactoring tools (EuroLLVM and ACCU) | Steveire's Blog

One step in the process of getting there is changing the default
behavior of AST Matchers to ignore invisible nodes while matching using
the C++ API, and while matching and dumping AST nodes in clang-query.

I think this is the most important change in the entire proposal as it
sets out the intention of making the AST Matchers easier to use for C++
developers who are not already familiar with Clang APIs.

To that end, I've written an AST to motivate the change:

Easier AST Matching by Default - Google Docs

We're looking for feedback before pressing forward with the change. I
already have some patches written to port clang-tidy and unit tests to
account for the change of default.

I'm generally in favor of this path forward. I think this is the
correct default and allows a more gentle introduction to AST matchers
for people new to the project (which helps with introducing new
clang-tidy checks) while still allowing people who need to get into
the nitty gritty details of the AST to do so as needed. Thank you for
the efforts!

CCing Manual, Alex, and Sam because they do a lot of work on the
matcher interfaces as well.

~Aaron

Gabor_Horvath · December 20, 2019, 10:02pm

Hi!

I like the idea of making writing tools/prototypes easier. But if the goal is to get rid of invisible nodes when matching I wonder if this would somewhat overlap with syntax trees 1. Adding Dmitri and Ilya in case they have an opinion on that.

Cheers,
Gabor

NoQ · December 20, 2019, 10:48pm

A stable parse/syntax tree (even if not a particularly great one) and a stable matcher API could allow the major breakthrough of allowing ourselves to encourage ordinary developers (not compiler engineers) to write their own tools (simple static analyses or transformations). I'm very excited to see a movement in this direction.

Stephen_Kelly · December 20, 2019, 11:03pm

Matchers are orthogonal to Syntax Trees as laid out in that document:


## Will we end up needing to reinvent matchers for this representation?

**We want to avoid this.**

Matchers are a complicated beast and having a separate set of those for syntax trees would be unfortunate. Moreover, syntax trees lack the semantic information that the users of AST matchers often rely on. If one wants to take advantage of the concept of matchers and syntax trees, they have options to do so based on existing implementation of ASTMatchers, e.g. they could:

1. use ASTMatchers to find interesting Clang AST nodes,

1. find their syntax tree counterparts,

1. apply transformations to syntax trees.

Thanks,

Stephen.

Aaron_Ballman · January 10, 2020, 3:34pm

I've not seen much to suggest the community is not in favor of this
direction, so I think you're all set to post patches for review. If
there are community concerns, we can address them during the review
process. Thank you for working on this!

~Aaron

ilya · January 13, 2020, 4:27pm

Since this popped up, I wanted to add a bit to the point made in the syntax trees doc.

Matchers for syntax trees have the following downsides:

would incur a maintenance burden,
will lack semantic information (at least some of it, e.g. implicit casts),
may not be as helpful in some use-cases (syntax trees already aim to provide an alternative interface to the AST that is simpler to use for some use-cases, something that AST matchers also try to do),
they incur both runtime and compile-time cost.

However, matchers are also a DSL for pattern-matching the parse tree in C++. In this sense they could also be useful for syntax trees.
Syntax trees are also much more regular than the AST and I believe matchers for syntax trees could even be auto-generated, so they can potentially be added there in the future.

gribozavr · January 13, 2020, 4:33pm

Since this popped up, I wanted to add a bit to the point made in the syntax trees doc.

Matchers for syntax trees have the following downsides:

would incur a maintenance burden,

will lack semantic information (at least some of it, e.g. implicit casts),

may not be as helpful in some use-cases (syntax trees already aim to provide an alternative interface to the AST that is simpler to use for some use-cases, something that AST matchers also try to do),

they incur both runtime and compile-time cost.

However, matchers are also a DSL for pattern-matching the parse tree in C++. In this sense they could also be useful for syntax trees.
Syntax trees are also much more regular than the AST and I believe matchers for syntax trees could even be auto-generated, so they can potentially be added there in the future.

+1, pattern-matching on syntax trees certainly makes sense, and AST matchers are a nice DSL for that. Therefore, AST Matchers for syntax trees directly correspond to the goal in the doc by Stephen Kelly:

Goal: Users should dump ASTs and write matchers which resemble source code syntax, not semantics

Dmitri

gribozavr · January 13, 2020, 4:39pm

Sorry, I just came back from vacation. I’m not sure if it is a good idea, in particular, for the following reasons:

The document specifies the goals in terms of a solution: “Goal: Users should not have to account for implicit AST nodes when writing AST Matchers”.
The complexity for implementers (although it also applies to https://reviews.llvm.org/D61837, which landed a while ago, and I didn’t see it). Allowing to run the same matcher in multiple modes increases our support and testing matrix.
The complexity cliff for users. Many non-trivial matchers need to explicitly manage implicit AST nodes.
The cost of ClangTidy. ClangTidy runs all matchers simultaneously in one AST traversal. If we implement implicit/no-implicit modes as separate AST traversals, ClangTidy would need to run two traversals, one for “easier” matchers (skipping implicit code), and one for “expert” matchers (traversing implicit code). We would also need to build two parent maps.

My best suggestion is to investigate implementing AST Matchers for syntax trees, and allow jumping between syntax and semantic nodes in one matcher – to match syntax first, and then validate the necessary semantic constraints.

Dmitri

Gabor_Horvath · January 17, 2020, 12:21am

Probably I am missing something but wouldn’t it be possible to do the two traversals at once? There is an implementation cost to it for sure, but I do not see anything that would prevent that.

Stephen_Kelly · January 17, 2020, 11:41am

    I've not seen much to suggest the community is not in favor of this
    direction, so I think you're all set to post patches for review. If
    there are community concerns, we can address them during the review
    process. Thank you for working on this!

Sorry, I just came back from vacation. I'm not sure if it is a good idea, in particular, for the following reasons:

- The document specifies the goals in terms of a solution: "Goal: Users should not have to account for implicit AST nodes when writing AST Matchers".

The goal is to make AST Matchers more accessible/easy to use for newcomers/non-experts in the clang AST by no-longer confronting them with 100% of the complexity on first use.

I think it's important to bridge that gap. See my EuroLLVM talk for more.

- The complexity for implementers (although it also applies to ⚙ D61837 Make it possible control matcher traversal kind with ASTContext, which landed a while ago, and I didn't see it). Allowing to run the same matcher in multiple modes increases our support and testing matrix.

I don't think it increases the test matrix more than any other feature. Can you expand on this?

- The complexity cliff for users. Many non-trivial matchers need to explicitly manage implicit AST nodes.

You just described the complexity cliff which is (after my change) not the first thing a newcomer hits.

It's good to move that cliff back so that the newcomer is not confronted with it immediately.

- The cost of ClangTidy. ClangTidy runs all matchers simultaneously in one AST traversal. If we implement implicit/no-implicit modes as separate AST traversals, ClangTidy would need to run two traversals, one for "easier" matchers (skipping implicit code), and one for "expert" matchers (traversing implicit code). We would also need to build two parent maps.

Will Syntax Trees ever be used in clang-tidy? Or will their use be forbidden in clang-tidy for the same reason of adding to a runtime cost when used in a mixture?

If your concern is valid it would seem to apply to Syntax Trees to a greater extent than this patch does.

That said, I think there is some scope for optimization in my approach. The optimizations I have in mind (being more efficient with parent maps) wouldn't apply to Syntax Trees though as that is a completely separate system.

My best suggestion is to investigate implementing AST Matchers for syntax trees, and allow jumping between syntax and semantic nodes in one matcher -- to match syntax first, and then validate the necessary semantic constraints.

My previous email quoted the Syntax Trees RFC saying that the first step is to "use [the current ASTMatchers API] to find interesting Clang AST nodes". You seem to be suggesting that the opposite would be done. Maybe things have evolved since that RFC, or maybe both approaches should work equally well.

As long as the current AST Matchers APIs exist, they should be improved. Perhaps they will be less important in the future when we have Syntax Trees for these use-cases instead. However, for now, I don't think the ease of use improvement in my patches should be prevented based existence of a future alternative system.

Thanks,

Stephen.

gribozavr · January 17, 2020, 12:54pm

I’ve not seen much to suggest the community is not in favor of this
direction, so I think you’re all set to post patches for review. If
there are community concerns, we can address them during the review
process. Thank you for working on this!

Sorry, I just came back from vacation. I’m not sure if it is a good idea, in particular, for the following reasons:

The document specifies the goals in terms of a solution: “Goal: Users should not have to account for implicit AST nodes when writing AST Matchers”.

The goal is to make AST Matchers more accessible/easy to use for newcomers/non-experts in the clang AST by no-longer confronting them with 100% of the complexity on first use.

I think it’s important to bridge that gap. See my EuroLLVM talk for more.

I totally agree with the goal of making AST Matchers, refactoring, and tooling more accessible (which is why I’m helping with libraries like Stencil and Syntax trees). However the goal that I quoted from the proposal on this thread (“Goal: Users should not have to account for implicit AST nodes when writing AST Matchers”) constrains the solution to exactly what is being proposed.

The complexity for implementers (although it also applies to https://reviews.llvm.org/D61837, which landed a while ago, and I didn’t see it). Allowing to run the same matcher in multiple modes increases our support and testing matrix.

I don’t think it increases the test matrix more than any other feature. Can you expand on this?

https://reviews.llvm.org/D61837 makes it possible to run every matcher with different traversal semantics. Our current tests only test one mode for each matcher, the default mode. There are two ways to define custom matchers: the advanced way (through the AST_MATCHER macro) that is needed when you want to traverse the AST yourself, and a simpler way – by combining existing matchers.

Consider the makeIteratorLoopMatcher defined in clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp. The caller can run the newly made matcher in any traversal mode, even though the matcher was probably written with the default mode in mind.

Clang codebase does not have many matchers that are combinations of existing matchers, but code that uses Clang as a library (and often written by engineers who are not compiler experts and just need to get some refactoring done) does that very often in my experience working at Google.

The complexity cliff for users. Many non-trivial matchers need to explicitly manage implicit AST nodes.

You just described the complexity cliff which is (after my change) not the first thing a newcomer hits.

It’s good to move that cliff back so that the newcomer is not confronted with it immediately.

As soon as you pretty-print the AST (which I think most people should do before writing a matcher), you see implicit nodes.

The cost of ClangTidy. ClangTidy runs all matchers simultaneously in one AST traversal. If we implement implicit/no-implicit modes as separate AST traversals, ClangTidy would need to run two traversals, one for “easier” matchers (skipping implicit code), and one for “expert” matchers (traversing implicit code). We would also need to build two parent maps.

Will Syntax Trees ever be used in clang-tidy? Or will their use be forbidden in clang-tidy for the same reason of adding to a runtime cost when used in a mixture?

If your concern is valid it would seem to apply to Syntax Trees to a greater extent than this patch does.

Syntax trees will be used in ClangTidy once they are more fleshed out. You are absolutely right about syntax trees having their cost. However, the runtime cost of syntax trees comes with a benefit of actually providing an API that was designed to model syntax trees. The proposal to visit only explicitly written AST nodes will only make matching look like it matches syntax, but it won’t help with any further inspection of the AST nodes, mutation, or even viewing the AST before you write a matcher.

That said, I think there is some scope for optimization in my approach. The optimizations I have in mind (being more efficient with parent maps) wouldn’t apply to Syntax Trees though as that is a completely separate system.

My best suggestion is to investigate implementing AST Matchers for syntax trees, and allow jumping between syntax and semantic nodes in one matcher – to match syntax first, and then validate the necessary semantic constraints.

My previous email quoted the Syntax Trees RFC saying that the first step is to “use [the current ASTMatchers API] to find interesting Clang AST nodes”. You seem to be suggesting that the opposite would be done. Maybe things have evolved since that RFC, or maybe both approaches should work equally well.

I discussed it with Ilya in person and we concluded that the way it came through as written in the RFC did not reflect our understanding of Syntax Trees. See Ilya’s response on this thread.

As long as the current AST Matchers APIs exist, they should be improved. Perhaps they will be less important in the future when we have Syntax Trees for these use-cases instead. However, for now, I don’t think the ease of use improvement in my patches should be prevented based existence of a future alternative system.

I agree that we should not block improvements on existence of a future system. However, each specific change should be evaluated based on its merits and costs. I think there are non-trivial downsides to offering a different mode in existing AST matchers, which create maintenance and testing costs that we will be paying perpetually. It instantly creates a huge testing gap for existing AST matchers (which we currently test only in one mode). It also creates traps for existing code that will be able to start accidentally calling existing matchers in a traversal mode that they were not designed for.

On the other hand, my objection primarily applies to https://reviews.llvm.org/D61837, which already landed, not to changing the default. Changing the default just exacerbates the problem significantly by affecting the semantics of existing code silently. However, Clang does not aim to provide a stable API even in the tooling space. So maybe it is too late for me to object, and unless we also remove the traverse() matchers, my objection can’t be fully addressed.

If you want to still proceed with changing the default, my request would be to provide detailed guidance (probably as a section in release notes) about updating existing code that uses matchers, including ClangTidy checkers.

Dmitri

Aaron_Ballman · January 18, 2020, 3:09pm

I've not seen much to suggest the community is not in favor of this
direction, so I think you're all set to post patches for review. If
there are community concerns, we can address them during the review
process. Thank you for working on this!

Sorry, I just came back from vacation. I'm not sure if it is a good idea, in particular, for the following reasons:

- The document specifies the goals in terms of a solution: "Goal: Users should not have to account for implicit AST nodes when writing AST Matchers".

The goal is to make AST Matchers more accessible/easy to use for newcomers/non-experts in the clang AST by no-longer confronting them with 100% of the complexity on first use.

I think it's important to bridge that gap. See my EuroLLVM talk for more.

I totally agree with the goal of making AST Matchers, refactoring, and tooling more accessible (which is why I'm helping with libraries like Stencil and Syntax trees). However the goal that I quoted from the proposal on this thread ("Goal: Users should not have to account for implicit AST nodes when writing AST Matchers") constrains the solution to exactly what is being proposed.

FWIW, I think the proposed solution is a valuable goal. Implicit nodes
are *hard* for even experienced users to deal with and one of the more
common pieces of review feedback I find myself giving users boils down
to things like "do you need to ignore paren expressions here" and the
likes, and that feedback is inconsistent because I'm a human and
despite the massive amount of clang-tidy code I review, *I* still
forget about implicit nodes myself. I think that having a traversal
mode which ignores implicit nodes is a valuable feature and that
needing to match implicit nodes in AST matchers is the 5% case rather
than the 95% case.

- The complexity for implementers (although it also applies to ⚙ D61837 Make it possible control matcher traversal kind with ASTContext, which landed a while ago, and I didn't see it). Allowing to run the same matcher in multiple modes increases our support and testing matrix.

I don't think it increases the test matrix more than any other feature. Can you expand on this?

⚙ D61837 Make it possible control matcher traversal kind with ASTContext makes it possible to run every matcher with different traversal semantics. Our current tests only test one mode for each matcher, the default mode. There are two ways to define custom matchers: the advanced way (through the AST_MATCHER macro) that is needed when you want to traverse the AST yourself, and a simpler way -- by combining existing matchers.

Consider the `makeIteratorLoopMatcher` defined in clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp. The caller can run the newly made matcher in any traversal mode, even though the matcher was probably written with the default mode in mind.

Clang codebase does not have many matchers that are combinations of existing matchers, but code that uses Clang as a library (and often written by engineers who are not compiler experts and just need to get some refactoring done) does that very often in my experience working at Google.

clang-tidy uses a fair amount of matchers that are combinations of
existing matchers, but they're localized to the individual checks. The
AST matcher implementation itself composes matchers as well. Composed
matchers is something that the proposed functionality will have to
take into account and I fully agree that this feature will require
additional testing, but I don't see anything here that suggests this
is an undue increase in testing or maintenance burden based on what
you described. Maybe I'm looking at it from the wrong angle still,
though.

- The complexity cliff for users. Many non-trivial matchers need to explicitly manage implicit AST nodes.

You just described the complexity cliff which is (after my change) not the first thing a newcomer hits.

It's good to move that cliff back so that the newcomer is not confronted with it immediately.

As soon as you pretty-print the AST (which I think most people should do before writing a matcher), you see implicit nodes.

I think you mean "when you dump the AST" as opposed to "pretty-print
the AST". When you pretty print, you do not typically see implicit
nodes because we're reconstructing the source. When you AST dump, you
currently see implicit AST nodes but 1) they're typically explicitly
marked as being implicit nodes, and 2) there's no reason we can't give
an option to control AST output generation. In support of point #2,
there's already a need for this with the JSON AST dumper to be able to
dump in compact vs non-compact mode, but we haven't devised a general
way to control AST dump output yet. Selecting whether to dump implicit
or explicit nodes is just another kind of option along those lines.

- The cost of ClangTidy. ClangTidy runs all matchers simultaneously in one AST traversal. If we implement implicit/no-implicit modes as separate AST traversals, ClangTidy would need to run two traversals, one for "easier" matchers (skipping implicit code), and one for "expert" matchers (traversing implicit code). We would also need to build two parent maps.

Will Syntax Trees ever be used in clang-tidy? Or will their use be forbidden in clang-tidy for the same reason of adding to a runtime cost when used in a mixture?

If your concern is valid it would seem to apply to Syntax Trees to a greater extent than this patch does.

Syntax trees will be used in ClangTidy once they are more fleshed out.

Is this a hope, or was this a decision made already?

You are absolutely right about syntax trees having their cost. However, the runtime cost of syntax trees comes with a benefit of actually providing an API that was designed to model syntax trees. The proposal to visit only explicitly written AST nodes will only make matching look like it matches syntax, but it won't help with any further inspection of the AST nodes, mutation, or even viewing the AST before you write a matcher.

That said, I think there is some scope for optimization in my approach. The optimizations I have in mind (being more efficient with parent maps) wouldn't apply to Syntax Trees though as that is a completely separate system.

My best suggestion is to investigate implementing AST Matchers for syntax trees, and allow jumping between syntax and semantic nodes in one matcher -- to match syntax first, and then validate the necessary semantic constraints.

My previous email quoted the Syntax Trees RFC saying that the first step is to "use [the current ASTMatchers API] to find interesting Clang AST nodes". You seem to be suggesting that the opposite would be done. Maybe things have evolved since that RFC, or maybe both approaches should work equally well.

I discussed it with Ilya in person and we concluded that the way it came through as written in the RFC did not reflect our understanding of Syntax Trees. See Ilya's response on this thread.

As long as the current AST Matchers APIs exist, they should be improved. Perhaps they will be less important in the future when we have Syntax Trees for these use-cases instead. However, for now, I don't think the ease of use improvement in my patches should be prevented based existence of a future alternative system.

I agree that we should not block improvements on existence of a future system. However, each specific change should be evaluated based on its merits and costs. I think there are non-trivial downsides to offering a different mode in existing AST matchers, which create maintenance and testing costs that we will be paying perpetually.

I still don't see these costs as being undue, however, I definitely
see the pain point this RFC is intended to lessen. It's something that
comes up with a high degree of frequency and I'm worried about
ignoring that pain because we hope a totally new approach will pan out
in the future despite any practical experience with that new approach.
That said, we've lived with requiring used to match on implicit nodes
for this long, it would not kill us to wait longer if we needed to.

It instantly creates a huge testing gap for existing AST matchers (which we currently test only in one mode). It also creates traps for existing code that will be able to start accidentally calling existing matchers in a traversal mode that they were not designed for.

I think the testing gap can be solved with software solutions
(judicious use of macros and/or templates). The traps for existing
code are a bit of a concern, but truthfully, no more of a concern than
use of StringRef is. If the composed matcher is publicly exposed, it's
something the author definitely has to account for. If the composed
matcher is private to a particular check or other usage (which is the
majority case as best I can tell), then the author really only has to
care about the mode being used for that usage.

On the other hand, my objection primarily applies to ⚙ D61837 Make it possible control matcher traversal kind with ASTContext, which already landed, not to changing the default. Changing the default just exacerbates the problem significantly by affecting the semantics of existing code silently. However, Clang does not aim to provide a stable API even in the tooling space. So maybe it is too late for me to object, and unless we also remove the traverse() matchers, my objection can't be fully addressed.

FWIW, this was something I was concerned by as well, but eventually I
was okay with it specifically because we make no stability guarantees
in this space as it stands. I'd love it if the breakage could be
non-silent though.

If you want to still proceed with changing the default, my request would be to provide detailed guidance (probably as a section in release notes) about updating existing code that uses matchers, including ClangTidy checkers.

Absolutely agreed that this would be necessary.

~Aaron

gribozavr · January 22, 2020, 11:29am

FWIW, I think the proposed solution is a valuable goal.

Goals should not be specified in terms of solutions, because given such framing, alternatives can’t be considered. XY problem etc.

Clang codebase does not have many matchers that are combinations of existing matchers, but code that uses Clang as a library (and often written by engineers who are not compiler experts and just need to get some refactoring done) does that very often in my experience working at Google.

clang-tidy uses a fair amount of matchers that are combinations of
existing matchers, but they’re localized to the individual checks.

Like I said, that is the case in the Clang codebase. However, my experience working at Google is different – most often people compose existing matchers instead of defining new ones using AST_MATCHER.

Syntax trees will be used in ClangTidy once they are more fleshed out.

Is this a hope, or was this a decision made already?

It is a direction towards which we are building libraries like Stencil, Transformer, Syntax trees. We can’t say that we made a decision to use them in ClangTidy, because obviously not all of the infrastructure is there, so we can’t evaluate it and make a justified decision. However, we are building these libraries hoping that they will make common refactoring and linting tooling (like ClangTidy) more accessible for novices, and easier to implement even for experts. If we didn’t think that could be the case, we would not be building these libraries.

It instantly creates a huge testing gap for existing AST matchers (which we currently test only in one mode). It also creates traps for existing code that will be able to start accidentally calling existing matchers in a traversal mode that they were not designed for.

I think the testing gap can be solved with software solutions
(judicious use of macros and/or templates). The traps for existing
code are a bit of a concern, but truthfully, no more of a concern than
use of StringRef is.

It is very different: to catch misuses of StringRef, we already have ASan.

If the composed matcher is publicly exposed, it’s
something the author definitely has to account for. If the composed
matcher is private to a particular check or other usage (which is the
majority case as best I can tell),

Correction: the majority case in llvm-project.git.

Dmitri

Stephen_Kelly · January 26, 2020, 7:54pm

        I've not seen much to suggest the community is not in favor
        of this
        direction, so I think you're all set to post patches for
        review. If
        there are community concerns, we can address them during the
        review
        process. Thank you for working on this!

    Sorry, I just came back from vacation. I'm not sure if it is a
    good idea, in particular, for the following reasons:

    - The document specifies the goals in terms of a solution: "Goal:
    Users should not have to account for implicit AST nodes when
    writing AST Matchers".

    The goal is to make AST Matchers more accessible/easy to use for
    newcomers/non-experts in the clang AST by no-longer confronting
    them with 100% of the complexity on first use.

    I think it's important to bridge that gap. See my EuroLLVM talk
    for more.

I totally agree with the goal of making AST Matchers, refactoring, and tooling more accessible (which is why I'm helping with libraries like Stencil and Syntax trees).

This change affects AST Matchers, but as far as I can see, it does not affect any other effort.

I don't see why you mention those initiatives in this and your other email. We're only talking about AST Matchers. Unless AST Matchers are to be removed, we should improve them, regardless of Syntax Tree, Stencil and Transformer.

However the goal that I quoted from the proposal on this thread ("Goal: Users should not have to account for implicit AST nodes when writing AST Matchers") constrains the solution to exactly what is being proposed.

Perhaps there's a misunderstanding. That's the goal of changing the default.

    - The complexity for implementers (although it also applies to
    ⚙ D61837 Make it possible control matcher traversal kind with ASTContext, which landed a while ago, and I
    didn't see it). Allowing to run the same matcher in multiple
    modes increases our support and testing matrix.

    I don't think it increases the test matrix more than any other
    feature. Can you expand on this?

⚙ D61837 Make it possible control matcher traversal kind with ASTContext makes it possible to run every matcher with different traversal semantics. Our current tests only test one mode for each matcher, the default mode. There are two ways to define custom matchers: the advanced way (through the AST_MATCHER macro) that is needed when you want to traverse the AST yourself, and a simpler way -- by combining existing matchers.

AFAICT, defining a matcher using the macro, or defining it as a composition function doesn't change much?

Consider the `makeIteratorLoopMatcher` defined in clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp. The caller can run the newly made matcher in any traversal mode, even though the matcher was probably written with the default mode in mind.

Clang codebase does not have many matchers that are combinations of existing matchers

It does actually - callee() for example. The fact that is uses a macro doesn't change anything AFAICT. Also alignOfExpr() which does not use a macro, but I don't see how that changes anything.

Am I missing something?

, but code that uses Clang as a library (and often written by engineers who are not compiler experts and just need to get some refactoring done) does that very often in my experience working at Google.

I don't think the traverse() matcher changes so much in this area. "Just need to get some refactoring done" sounds like writing a composition matcher within a check, not writing it for some library code.

    - The complexity cliff for users. Many non-trivial matchers need
    to explicitly manage implicit AST nodes.

    You just described the complexity cliff which is (after my change)
    not the first thing a newcomer hits.

    It's good to move that cliff back so that the newcomer is not
    confronted with it immediately.

As soon as you pretty-print the AST (which I think most people should do before writing a matcher), you see implicit nodes.

If using clang-query, you can change the mode of how clang-query matches and dumps ASTs. But, in the future you won't need to dump the AST anyway, because clang-query will just tell you the available matchers and what they match.

See my EuroLLVM talk for more, in particular the workflow diagram which does not include inspecting the AST: https://steveire.files.wordpress.com/2018/11/all-no-ast.png?w=900&h=590

    - The cost of ClangTidy. ClangTidy runs all matchers
    simultaneously in one AST traversal. If we implement
    implicit/no-implicit modes as separate AST traversals, ClangTidy
    would need to run two traversals, one for "easier" matchers
    (skipping implicit code), and one for "expert" matchers
    (traversing implicit code). We would also need to build two
    parent maps.

As of ⚙ D73388 NFC: Implement AST node skipping in ParentMapContext we no longer create a parent map for each traversal kind used at runtime.

This means I can no longer measure a runtime cost of introducing use of the traverse() matcher in clang-tidy.

    That said, I think there is some scope for optimization in my
    approach. The optimizations I have in mind (being more efficient
    with parent maps) wouldn't apply to Syntax Trees though as that is
    a completely separate system.

Done, per above.

    As long as the current AST Matchers APIs exist, they should be
    improved. Perhaps they will be less important in the future when
    we have Syntax Trees for these use-cases instead. However, for
    now, I don't think the ease of use improvement in my patches
    should be prevented based existence of a future alternative system.

I agree that we should not block improvements on existence of a future system. However, each specific change should be evaluated based on its merits and costs. I think there are non-trivial downsides to offering a different mode in existing AST matchers, which create maintenance and testing costs that we will be paying perpetually.

We may not achieve your agreement on this change, but I think we have consensus already that this change should proceed.

It instantly creates a huge testing gap for existing AST matchers (which we currently test only in one mode).

The traverse() matcher does not introduce a need to test each AST matcher in each mode.

It also creates traps for existing code that will be able to start accidentally calling existing matchers in a traversal mode that they were not designed for.

Yes, true. It is possible to write a matcher which works as someone expects in one mode but not the other, such as a matcher which specifically expects to match a implicitCastExpr().

This can matter for matchers which are in "libraries", but not for matchers which are implementation details inside checks. Unless a particular check exposes the traversal kind as an option, those matchers can not be affected. Any check which does expose that as an option has to do so consciously and explicitly and has to test it. Nothing new there.

On the other hand, my objection primarily applies to ⚙ D61837 Make it possible control matcher traversal kind with ASTContext, which already landed, not to changing the default.

Understood. Hopefully that comes through in my email.

If you want to still proceed with changing the default, my request would be to provide detailed guidance (probably as a section in release notes) about updating existing code that uses matchers, including ClangTidy checkers.

Certainly.

Thanks,

Stephen.

gribozavr · January 26, 2020, 9:36pm

I’ve not seen much to suggest the community is not in favor of this
direction, so I think you’re all set to post patches for review. If
there are community concerns, we can address them during the review
process. Thank you for working on this!

Sorry, I just came back from vacation. I’m not sure if it is a good idea, in particular, for the following reasons:

The document specifies the goals in terms of a solution: “Goal: Users should not have to account for implicit AST nodes when writing AST Matchers”.

The goal is to make AST Matchers more accessible/easy to use for newcomers/non-experts in the clang AST by no-longer confronting them with 100% of the complexity on first use.

I think it’s important to bridge that gap. See my EuroLLVM talk for more.

I totally agree with the goal of making AST Matchers, refactoring, and tooling more accessible (which is why I’m helping with libraries like Stencil and Syntax trees).

This change affects AST Matchers, but as far as I can see, it does not affect any other effort.

I don’t see why you mention those initiatives in this and your other email. We’re only talking about AST Matchers. Unless AST Matchers are to be removed, we should improve them, regardless of Syntax Tree, Stencil and Transformer.

I mention all of these libraries because people don’t use AST matchers just for the sake of it. People use AST matchers to achieve a high-level goal. Therefore, we should be thinking of the whole user journey, not just the individual piece. My point is that, we are working on better solutions for the whole user journey. Those better solutions will allow to achieve the same high-level goals more easily than today’s tools, and (I think) even more easily than any modification restricted to just AST matchers would have allowed us to do.

However the goal that I quoted from the proposal on this thread (“Goal: Users should not have to account for implicit AST nodes when writing AST Matchers”) constrains the solution to exactly what is being proposed.

Perhaps there’s a misunderstanding. That’s the goal of changing the default.

The goal should be something high-level that the user cares about, for example, “more easily write refactoring tools”. Individual technical details are not meaningful as goals, because they don’t have intrinsic value from the perspective of end users, maintainers of Clang, maintainers of libraries etc.

How do we evaluate whether the proposal to change the default traversal is good or bad, if changing the default is the goal? It becomes a circular argument: we want to change it because we want it so. The goal should be stated in terms of a problem that the user is trying to solve.

The complexity for implementers (although it also applies to https://reviews.llvm.org/D61837, which landed a while ago, and I didn’t see it). Allowing to run the same matcher in multiple modes increases our support and testing matrix.

I don’t think it increases the test matrix more than any other feature. Can you expand on this?

https://reviews.llvm.org/D61837 makes it possible to run every matcher with different traversal semantics. Our current tests only test one mode for each matcher, the default mode. There are two ways to define custom matchers: the advanced way (through the AST_MATCHER macro) that is needed when you want to traverse the AST yourself, and a simpler way – by combining existing matchers.

AFAICT, defining a matcher using the macro, or defining it as a composition function doesn’t change much?

When you define a matcher using a macro, you usually traverse the AST yourself, by directly using Clang AST APIs. However, when you define a matcher as a composition of other matchers, the user of your matcher can affect the traversal mode within your matcher by calling your matcher within traverse(). In other words, there is no abstraction barrier between the matcher and the user.

Consider the makeIteratorLoopMatcher defined in clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp. The caller can run the newly made matcher in any traversal mode, even though the matcher was probably written with the default mode in mind.

Clang codebase does not have many matchers that are combinations of existing matchers

It does actually - callee() for example. The fact that is uses a macro doesn’t change anything AFAICT. Also alignOfExpr() which does not use a macro, but I don’t see how that changes anything.

Am I missing something?

I didn’t say such matchers don’t exist in Clang, I said there are not many of them. And yes, these are the types of matchers I am concerned about being affected by traverse().

callee and alignOfExpr() are not affected by traverse() because they only match one level of AST nodes – the call expression, or the alignof expression, and pass down the inner matcher. Had they been traversing multiple levels of AST nodes, they would have been affected by traverse(), if the user called these matchers within traverse().

, but code that uses Clang as a library (and often written by engineers who are not compiler experts and just need to get some refactoring done) does that very often in my experience working at Google.

I don’t think the traverse() matcher changes so much in this area. “Just need to get some refactoring done” sounds like writing a composition matcher within a check, not writing it for some library code.

We have lots of engineers who are not Clang experts and write reusable refactoring code.

The complexity cliff for users. Many non-trivial matchers need to explicitly manage implicit AST nodes.

You just described the complexity cliff which is (after my change) not the first thing a newcomer hits.

It’s good to move that cliff back so that the newcomer is not confronted with it immediately.

As soon as you pretty-print the AST (which I think most people should do before writing a matcher), you see implicit nodes.

If using clang-query, you can change the mode of how clang-query matches and dumps ASTs. But, in the future you won’t need to dump the AST anyway, because clang-query will just tell you the available matchers and what they match.

See my EuroLLVM talk for more, in particular the workflow diagram which does not include inspecting the AST: https://steveire.files.wordpress.com/2018/11/all-no-ast.png?w=900&h=590

The cost of ClangTidy. ClangTidy runs all matchers simultaneously in one AST traversal. If we implement implicit/no-implicit modes as separate AST traversals, ClangTidy would need to run two traversals, one for “easier” matchers (skipping implicit code), and one for “expert” matchers (traversing implicit code). We would also need to build two parent maps.

As of https://reviews.llvm.org/D73388 we no longer create a parent map for each traversal kind used at runtime.

This means I can no longer measure a runtime cost of introducing use of the traverse() matcher in clang-tidy.

Thank you, that’s a great improvement! However, I’m not sure about the absence of runtime cost – I think it would be more fair to say that there is no cost right now. This change adds extra logic (AscendIgnoreUnlessSpelledInSource) that is only called in the case of TK_IgnoreUnlessSpelledInSource traversal. Since most checkers in ClangTidy don’t use that mode, we see no increase in runtime cost today. However, once the default is flipped, or once we start using TK_IgnoreUnlessSpelledInSource more, we could see an increased cost of AST matchers compared to the TK_AsIs traversal.

As long as the current AST Matchers APIs exist, they should be improved. Perhaps they will be less important in the future when we have Syntax Trees for these use-cases instead. However, for now, I don’t think the ease of use improvement in my patches should be prevented based existence of a future alternative system.

I agree that we should not block improvements on existence of a future system. However, each specific change should be evaluated based on its merits and costs. I think there are non-trivial downsides to offering a different mode in existing AST matchers, which create maintenance and testing costs that we will be paying perpetually.

We may not achieve your agreement on this change, but I think we have consensus already that this change should proceed.

I’m not sure how you measured the degree of consensus – could you clarify?

Here’s what I see: so far, there have been few people participating on this thread, and outside of the authors of the proposal (you and Aaron), only I have provided direct feedback about supporting/opposing the proposal, other people (Artem, Gabor, and Ilya) only asked questions, provided clarifications, and provided feedback on the syntax trees developments. If you count me as objecting to the traverse() API in general, and not objecting to this change, then there has been no feedback from the community on this proposal at all. If you count me as objecting to this proposal, then there’s one person objecting. Based on these observations, I don’t see consensus yet, even if you exclude my feedback.

It instantly creates a huge testing gap for existing AST matchers (which we currently test only in one mode).

The traverse() matcher does not introduce a need to test each AST matcher in each mode.

Why do you say so? Matchers composed out of other matchers are affected by the traversal mode that the user invokes them in.

It also creates traps for existing code that will be able to start accidentally calling existing matchers in a traversal mode that they were not designed for.

Yes, true. It is possible to write a matcher which works as someone expects in one mode but not the other, such as a matcher which specifically expects to match a implicitCastExpr().

Exactly.

This can matter for matchers which are in “libraries”, but not for matchers which are implementation details inside checks. Unless a particular check exposes the traversal kind as an option, those matchers can not be affected. Any check which does expose that as an option has to do so consciously and explicitly and has to test it. Nothing new there.

I’m sorry if I have been unclear about this point, but during this whole conversation I have been talking not about specific ClangTidy checkers, but about libraries that provide matchers. Users of Clang (for example, internal users at Google) have such libraries, and use them broadly outside of ClangTidy. Previously in this thread I pointed at a matcher in ClangTidy as an example because that’s the only piece of open source code I could easily find, that has similar complexity to libraries that we have at Google. Introduction of traverse() has increased the testing matrix for these matchers, so it is a new increase in complexity, and increase in technical debt (missing tests).

Let me reiterate: I understand that Clang or Clang Tooling does not provide a stable API, so making breaking changes here is totally within the policy.

However, the specific aspects of this change (silently breaking existing matchers, creating new traps for people who write reusable AST matchers, increasing the testing matrix) have consequences that are very different from a typical API change in Clang AST, where the user of that API will see a build break, and can easily locally fix code. The cost of this change is much higher for users of the API being changed.

Let me point you at similar change that was done previously: the change in elidable constructor representation in Clang AST between C++14 and C++17. (TL;DR: AST built by Clang in C++11 mode has AST nodes for elidable constructor calls, but in C++17 mode they are always elided.) Lots of ClangTidy checkers were broken by this change, but nobody noticed, because ClangTidy was not being tested in all language modes. I closed this testing gap (https://reviews.llvm.org/D62125), and our team has fixed a bunch of ClangTidy checkers, but other checkers still remain broken until this day (run git grep 'FIXME: Fix the checker to work in C++17 mode.') The worst part of this story is that the breakage was noticed a long, long time after the AST change was implemented, and that the cost of the change (expanding testing, fixing checkers) has fallen onto people other than the ones who changed the AST.

The proposal has been implying (correct me if I’m wrong) that the maintenance cost and complexity that is being put onto experts with a codebase that uses AST matchers that they maintain long-term is worth the added convenience for novice users who write one-off tools that don’t need to be maintained. If it is indeed the case, I would prefer this argument to be made explicitly (as in, the proposal should describe the downsides of making this change; right now it does not describe them).

Dmitri

r4nt · May 8, 2020, 10:55am

Update: Stephen updated the proposal with the results / fallout of making the change on clang-tidy’s matchers. I believe that these results indicate that the risk of that change on downstream users is small (5 of > 300 matchers needed fixes).
I’d be curious whether others disagree with that assessment.

Matthew_Fowles_Kuluk · May 8, 2020, 2:47pm

Manuel~

I favor skipping invisible nodes by default. It produces more intuitive results.

Matt

Aaron_Ballman · May 11, 2020, 12:50pm

Update: Stephen updated the proposal with the results / fallout of making the change on clang-tidy's matchers. I believe that these results indicate that the risk of that change on downstream users is small (5 of > 300 matchers needed fixes).
I'd be curious whether others disagree with that assessment.

No disagreement from me.

~Aaron

r4nt · May 11, 2020, 1:19pm

Update: Stephen updated the proposal with the results / fallout of making the change on clang-tidy’s matchers. I believe that these results indicate that the risk of that change on downstream users is small (5 of > 300 matchers needed fixes).
I’d be curious whether others disagree with that assessment.

No disagreement from me.

Given there seems to be consensus on this being a good thing, and me checking with Dmitri that he doesn’t have serious concerns (deferring to my judgement), this proposal looks good to me.
I believe Aaron volunteered to do detailed reviews on specific patches (thanks, Aaron, otherwise let me know if I misunderstood!!)

Topic		Replies	Views
RFC: Easier AST Matching by Default Clang Frontend	4	144	June 24, 2020
[PATCH] New matcher for MaterializeTemporaryExpression. Clang Frontend	2	142	August 21, 2012
Patch: AST Matcher Framwork and an example tool Clang Frontend	10	154	June 24, 2011
[RFC] Improving Support for Debugging Matchers Clang Frontend	9	257	April 6, 2022
RFC: Easier AST Matching by Default Clang Frontend	0	103	December 20, 2019

RFC: Easier AST Matching by Default

Related topics