Support query-based clang-tidy external check

Background

clang-tidy is becoming more and more popular, but its static configuration makes it difficult to implement customized checks for public third-party libraries.
For example: [clang-tidy] Add modernize-nlohmann-json-explicit-conversions check by mikecrowe · Pull Request #126425 · llvm/llvm-project · GitHub
It is difficult to adapt plugins implemented in the form of dynamic link libraries to different versions of clang-tidy, and because it relies on the llvm ecosystem, compiling so consumes a lot of CPU and time. This is very unfriendly to many open source projects or small and medium-sized teams, who do not have the resources to maintain a corresponding tool chain

If we make clang-tidy more flexible and configurable. It will help more C++ projects improve their quality.

Concept

Query Based Custom Check
clang-query is tool to dynamically execute ast-matcher and bind to results. we can reuse it to implement more flexible plugin system in clang-tidy.

High Level Requirement

  1. I want these checks can be treat as built-in checks. They should be enable / disable by command line and config.
  2. User should provide query string and corresponded diagnostics message.
3 Likes

Current Implement

[clang-tidy] Add ClangQueryChecks config option by DeNiCoN · Pull Request #123734 · llvm/llvm-project · GitHub (deprecated)
[clang-tidy] support query based custom check by HerrCai0907 · Pull Request #131804 · llvm/llvm-project · GitHub

I discussed this idea with Yitzhak Mandelbaum in personal email based on his work with the Transformer library.

In the following replies, I’ll repeat our conversation from 2022:

Me:

Since the last time I hacked on clang-tidy, the Transformer library has
been added. From what I can tell in the commit history and email list
archives, you are the primary author.

Thanks for adding this library!

Having written a bunch of clang-tidy checks[*], I have come to the
conclusion that many refactorings we’d like to write could be handled
by a scriptable tool. The DSLs for expressing edits and rewrite rules
in the Transformer lib puts us almost all the way towards having such
a tool. Imagine if we could write the edits and the rewrite rules as
easily as we write matchers in clang-query.

For many organizations, I think writing C++ that directly manipulates
the AST and the source code transformations is too high a burden for
writing their own transformations. Particularly if those
transformations are of a migratory nature and are “fire once and
forget” for their code base.

If we could get a script-like language mechanism in place for writing
transformations, I think it would make this accessible to many more
people.

It appears that you wrote some parsing support in Transformer already.
I’d like to collaborate with you on adding more parsing support for
the remainder of the DSLs so that we can move closer to a scripted
clang-tidy check.

What do you think?

Yitzhak Mandelbaum:

Thanks for reaching out. I’d be happy to work on this together. I already
have a language and parser that covers many of the available Stencil
combinators, I just haven’t had the chance to make the patches for
committing it. This could be a good forcing function. :) That said, you
might find you prefer a different language. This one is designed in “format
string” style, vs the clang-query language and my own range language which
mimic the C++ libraries they’re based on. I’ve pasted the 1-page guide
below so you can see what I mean.

That said, I should note that we’ve had this language available for general
use for > 2 years, integrated into a web UI, and we haven’t seen much
adoption. In practice, I think a lot of migrations end up needing at least
some custom matchers and/or Stencils (more the former than the latter).
Also, learning the AST and matchers remains a very high barrier for many
newcomers

Clang’s Stencil abstraction provides a code-generating object parameterized
by named references to (bound) AST nodes. Stencils are specified in Zwingli
with a specialized format string language. For example, if we wish to
transform code of the form

if (condition) { body; }

to

if (!condition) { LOG(ERROR) << "condition failed"; } else { body; }

Given the matcher that binds the condition to cond and the body to body,
you can express the output as:

if (!($cond)) { LOG(ERROR) << "condition failed"; } else $body

Here, the $ tells the parser to treat the following alphanumeric characters
as an identifier bound to an AST node by the matcher. The format strings
provide additional built-in operators to help you construct your output.
The full grammar for the format strings can be found below.

StencilDescription
$id Add the code segment represented by the AST node bound to the reference
id
$id.member Constructs code that accesses the named member of the object
bound to id. The access is constructed idiomatically: if id is bound to e,
then constructs e->member, when e is a pointer, and e.member otherwise.
Wraps e in parentheses, if needed. Similarly, it will do some basic
simplifications to avoid creating expressions like (&x)->member. Members
can be identified by raw text or operator calls (e.g. $name()).
$name(id) Given a reference to a named declaration id (that is, a node of
type clang::NamedDecl or one of its derived classes), generates the
name. id must
have an identifier name (that is, constructors are not valid arguments to
the name operation).
$callArgs(id) Given a reference to call expression node, generates the
source text of the arguments (all source between the call’s parentheses).
$initListElements(id) Given a reference to an initializer-list expression
node, generates the source text of the elements (all source between the
braces).
$*(id) Renders a node’s source as a value, even if the node is a pointer.
For the following examples, assume something is bound to the x in the
expression foo(x). The rewrite will depend on the type of the bound
variable.

  • T* → foo(*x)
  • T& → foo(x)
  • T → foo(x)

It will also do some basic simplifications to avoid creating expressions
like *&x.
$&(id) Renders a node’s source as an address, even if the node is an
lvalue. For the following examples, assume something is bound to the x in
the expression foo(x). The rewrite will depend on the type of the bound
variable.

  • T* → foo(x)
  • T& → foo(&x)
  • T → foo(&x)

It will also do some basic simplifications to avoid creating expressions
like &*x.
$(id) Given a reference to a node e, generates (e) if e may parse
differently depending on context. For example, a binary operation is always
wrapped, while a variable reference is never wrapped.
$includeHeader(foo/bar.h) Add an include statement (of the form #include
“foo/bar.h”) at the beginning of the file. Can only appear at the beginning
of the stencil.
Grammar

The grammar for the format strings is as follows

FormatString = IncludeOp* Part*
Part = Text | IdOp | UnaryOp | MemberOp
Text = Literal+
Literal = UNESCAPED|ESCAPED
IdOp = '$' (ID | '(' ID ')')
IncludeOp = '$includeHeader(' PATH ')'
UnaryOp = '$' ('*' | '&' | 'name' | 'callArgs' | 'initListElements') '(' ID ')'
MemberOp = IdOp '.' (Text | Part)

UNESCAPED = [^$]
ESCAPED = '\'.
ID = [a-zA-Z0-9_]+
PATH = [a-zA-Z0-9/.]+
1 Like

Me:

Yitzhak wrote:

Given the matcher that binds the condition to cond and the body to body=
,
you can express the output as:

if (!($cond)) { LOG(ERROR) << “condition failed”; } else $body

This is actually very close for what I had in mind! Naturally you
can’t express deletions with a format-string like approach, but it
works great for the most common case of “replace X with Y where Y has
values substituted in from X and/or the outer environment”.

That said, I should note that we’ve had this language available for gene=
ral
use for > 2 years, integrated into a web UI, and we haven’t seen much
adoption.

I assume you’re referring to adoption internally at Google?

In practice, I think a lot of migrations end up needing at least
some custom matchers and/or Stencils (more the former than the latter)=

Right now the barrier to entry is so high that I would like to get
some more practice with a scripted approach before I conclude that
most things are going to need some custom code. My gut feeling is
that with more experience using such a tool we’ll identify more and
more bits of scripting infrastructure (matchers, stencils, what have
you) that fill in the missing pieces.

My expectation is that there will always be some transformations that
are going to require some custom code, because the logic is too
complex to be expressed in script.

Also, learning the AST and matchers remains a very high barrier for many
newcomers.

Yes, I think this is a barrier. Even I have a hard time figuring out
how to write a matcher just from looking at the docs. I think there
are improved UIs for developing checks that could improve here. For
instance, if I had IDE support for “write a matcher expression that
selects the highlighted code” and a clang-query window in my IDE (I
use VS, but you get the idea) that was integrated with my editor.

I don’t think we’re quite at the point for GUI integration though, I
think the parser/script infrastructure needs to be solid first.

Yitzhak:

This is actually very close for what I had in mind! Naturally you
can’t express deletions with a format-string like approach, but it
works great for the most common case of “replace X with Y where Y has
values substituted in from X and/or the outer environment”.

Right – this only covers the replacement itself. We need to extend with
operators for “insert”,“change”,“remove”. Probably worth just sticking to
the syntax of the library, but open to other ideas.

I assume you’re referring to adoption internally at Google?

Ah. Yes, sorry.

Right now the barrier to entry is so high that I would like to get
some more practice with a scripted approach before I conclude that
most things are going to need some custom code. My gut feeling is
that with more experience using such a tool we’ll identify more and
more bits of scripting infrastructure (matchers, stencils, what have
you) that fill in the missing pieces.

Fair enough. In some sense, we’ve been doing that for the libraries.
Regardless, its worth a try. Might still be easier to have scripting
language w/ facilty for user-defined operations, vs having to write
full-blown C++ programs.

Yes, I think this is a barrier. Even I have a hard time figuring out
how to write a matcher just from looking at the docs. I think there
are improved UIs for developing checks that could improve here. For
instance, if I had IDE support for “write a matcher expression that
selects the highlighted code” and a clang-query window in my IDE (I
use VS, but you get the idea) that was integrated with my editor.

Agreed. This would be very cool. We’ve made some progress in this
direction, too, with libraries that take an AST and output potential
matchers. But, here too, I haven’t time to get those committed upstream.

Me:

This is all great to hear, it sounds like we’re very much thinking
along the same lines!

I had reviewed a google docs with some comments along these lines a
couple years ago; that wouldn’t happened to have been your doc was it?

I tried finding the link to it, but I couldn’t find it easily.

I think I’m going to take the existing checks that I’ve created and
refactor them into Transformer checks as a way to get some experience.

Yitzhak:

Not sure if you saw, but I
recently added a tutorial to the clang docs.
Clang Transformer Tutorial — Clang 21.0.0git documentation

Thanks for creating this discussion @HerrCai0907

Original proposal is here [clang-tidy] RFC: Add generic check that takes match expression like clang-query · Issue #107680 · llvm/llvm-project · GitHub
Maybe someone finds that proposal easier to understand.

But in short: We just want to have some way to quickly add clang-query -based check and use it with existing clang-tidy infrastructure.
I have already tested this solution from [clang-tidy] Add ClangQueryChecks config option by DeNiCoN · Pull Request #123734 · llvm/llvm-project · GitHub and it is good enough for our usage at work, but since @HerrCai0907 volunteered to improve that PR - I am happy to support his intention.

Basically I just want to express my support of this PR and I hope other people also support this. In fact I am surprised nobody else came up with this patch before, it is really useful :slight_smile:

Thanks for bringing up the RFC!

At a high level, I think it is a good idea to try to find a way to make it easier for folks to add small, “one-off” checks to clang-tidy. Things like “don’t call function foo” are a pretty common need for various projects, and writing a whole clang-tidy plugin and integrating that can be a challenge (especially because plugin support for Windows is not really there yet).

One concern I have is that clang-query uses the dynamic AST matchers which are different than the C++ DSL used by clang-tidy. The intent is for them to be the same, but the nature of the beast is that the C++ DSL gets way better diagnostics when you mess up the matchers. For example, sometimes clang-query just quietly does nothing instead of telling you about matcher type mismatches. Some things are possible in one that aren’t possible in the other.

It’s also unclear as to whether a clang-query based interface will allow for adding fixits to make code transformations easier.

Some questions I have:

  • How do you plan to handle invalid query strings? Duplicate check names?
  • Do you have plans to support fix-its in the initial implementation? If not, how sure are you that they’re possible to support in the future?
  • What is the performance difference between a clang-tidy plugin and a query string-based approach for the same check? e.g., is clang-query’s performance noticeably different?

CC @ymand as well

1 Like

I’ve seen this too and the divergence certainly makes for a more difficult development/debug experience for check authors as I usually use clang-query to prototype my matcher.

Give configuration error, since it is the reuse clang-query’s grammar, it will not be too hard to debug in clang-query

In current implement I add custom- module prefix for each check. But for the same name custom check, I am not sure they should be fine-grained override in which way. But at least, we can override the whole check.

Consider the complexity for fix-its in current check, it needs to use API from parser / lexers. I don’t think it is easy to implement a powerful replacement framework. IMO, if someone wants to provide fix-its, maybe dynamical library is more suitable.

Parsing string to dynamically create matcher cost some time.
Creating diagnose from binding also waste some time because we need to search binding name in configuration for each matched results. If there are lots of diagnostics pattern in configuration (I hope no one write check like this), it will cost some time (maybe times log(n) since we always need to search string in map).

1 Like

Thank you for raising this.

I was actually surprised how well at least Debian-based Linux distributions supported compiling clang-tidy plugins (see my nlohmann/json CI commit).

The annoying part for me was testing the plugin outside the LLVM repository. I needed to copy check_clang_tidy.py from the LLVM repository in order to make the checks testable. I suppose this could be improved by including the script in distribution packages too.

1 Like