I discussed this idea with Yitzhak Mandelbaum in personal email based on his work with the Transformer library.
In the following replies, I’ll repeat our conversation from 2022:
Me:
Since the last time I hacked on clang-tidy, the Transformer library has
been added. From what I can tell in the commit history and email list
archives, you are the primary author.
Thanks for adding this library!
Having written a bunch of clang-tidy checks[*], I have come to the
conclusion that many refactorings we’d like to write could be handled
by a scriptable tool. The DSLs for expressing edits and rewrite rules
in the Transformer lib put us almost all the way towards having such
a tool. Imagine if we could write the edits and the rewrite rules as
easily as we write matchers in clang-query.
For many organizations, I think writing C++ that directly manipulates
the AST and the source code is too high a burden for writing their own
transformations, particularly if those transformations are of a
migratory nature and are “fire once and forget” for their code base.
If we could get a script-like language mechanism in place for writing
transformations, I think it would make this accessible to many more
people.
It appears that you wrote some parsing support in Transformer already.
I’d like to collaborate with you on adding more parsing support for
the remainder of the DSLs so that we can move closer to a scripted
clang-tidy check.
What do you think?
Yitzhak Mandelbaum:
Thanks for reaching out. I’d be happy to work on this together. I already
have a language and parser that covers many of the available Stencil
combinators; I just haven’t had the chance to make the patches for
committing it. This could be a good forcing function. :)
That said, you might find you prefer a different language. This one is
designed in “format string” style, vs the clang-query language and my own
range language which mimic the C++ libraries they’re based on. I’ve pasted
the 1-page guide below so you can see what I mean.
That said, I should note that we’ve had this language available for general
use for > 2 years, integrated into a web UI, and we haven’t seen much
adoption. In practice, I think a lot of migrations end up needing at least
some custom matchers and/or Stencils (more the former than the latter).
Also, learning the AST and matchers remains a very high barrier for many
newcomers.
Clang’s Stencil abstraction provides a code-generating object parameterized
by named references to (bound) AST nodes. Stencils are specified in Zwingli
with a specialized format string language. For example, if we wish to
transform code of the form
if (condition) { body; }
to
if (!condition) { LOG(ERROR) << "condition failed"; } else { body; }
Given a matcher that binds the condition to cond and the body to body,
you can express the output as:
if (!($cond)) { LOG(ERROR) << "condition failed"; } else $body
Here, the $ tells the parser to treat the following alphanumeric characters
as an identifier bound to an AST node by the matcher. The format strings
provide additional built-in operators to help you construct your output.
The full grammar for the format strings can be found below.
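To make the substitution concrete, here is a minimal sketch in standard C++
(no Clang dependency) of how $id references in a format string might be
expanded. The function name expandIds and the map of pre-extracted source
text are hypothetical stand-ins: the real Stencil machinery works on bound
AST nodes rather than strings, and this sketch handles only the plain $id
form and backslash escapes.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Clang): expands `$id` references in a
// format string using source text already extracted for each bound node.
std::string expandIds(const std::string &format,
                      const std::map<std::string, std::string> &bound) {
  std::string out;
  for (std::size_t i = 0; i < format.size();) {
    char c = format[i];
    if (c == '\\' && i + 1 < format.size()) {
      // Escaped character: emit it verbatim, so "\$" yields a literal '$'.
      out += format[i + 1];
      i += 2;
    } else if (c == '$') {
      // Collect the identifier characters that follow the '$'.
      std::size_t j = i + 1;
      while (j < format.size() &&
             (std::isalnum(static_cast<unsigned char>(format[j])) ||
              format[j] == '_'))
        ++j;
      if (j == i + 1)
        throw std::invalid_argument("'$' must be followed by an identifier");
      // map::at throws std::out_of_range for an unbound identifier.
      out += bound.at(format.substr(i + 1, j - i - 1));
      i = j;
    } else {
      out += c;
      ++i;
    }
  }
  return out;
}
```

With cond mapped to x > 0 and body mapped to { run(); }, the format string
from the example above expands to
if (!(x > 0)) { LOG(ERROR) << "condition failed"; } else { run(); }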
Stencil: Description

$id
Adds the code segment represented by the AST node bound to the reference
id.

$id.member
Constructs code that accesses the named member of the object bound to id.
The access is constructed idiomatically: if id is bound to e, it constructs
e->member when e is a pointer, and e.member otherwise. Wraps e in
parentheses, if needed. Similarly, it will do some basic simplifications to
avoid creating expressions like (&x)->member. Members can be identified by
raw text or operator calls (e.g. $name()).

$name(id)
Given a reference to a named declaration id (that is, a node of type
clang::NamedDecl or one of its derived classes), generates the name. id
must have an identifier name (that is, constructors are not valid arguments
to the name operation).

$callArgs(id)
Given a reference to a call expression node, generates the source text of
the arguments (all source between the call’s parentheses).

$initListElements(id)
Given a reference to an initializer-list expression node, generates the
source text of the elements (all source between the braces).

$*(id)
Renders a node’s source as a value, even if the node is a pointer. For the
following examples, assume something is bound to the x in the expression
foo(x). The rewrite will depend on the type of the bound variable.
- T* → foo(*x)
- T& → foo(x)
- T → foo(x)
It will also do some basic simplifications to avoid creating expressions
like *&x.

$&(id)
Renders a node’s source as an address, even if the node is an lvalue. For
the following examples, assume something is bound to the x in the
expression foo(x). The rewrite will depend on the type of the bound
variable.
- T* → foo(x)
- T& → foo(&x)
- T → foo(&x)
It will also do some basic simplifications to avoid creating expressions
like &*x.

$(id)
Given a reference to a node e, generates (e) if e may parse differently
depending on context. For example, a binary operation is always wrapped,
while a variable reference is never wrapped.

$includeHeader(foo/bar.h)
Adds an include statement (of the form #include "foo/bar.h") at the
beginning of the file. Can only appear at the beginning of the stencil.
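The type-dependent rewrites described for $id.member, $*(id), and $&(id)
can be sketched as follows. This is an illustration in standard C++, not
the Stencil implementation: TypeKind, memberAccess, asValue, and asAddress
are hypothetical names, the bound expression is assumed to be a simple
identifier such as x, and the parenthesization and simplification steps
(avoiding *&x, &*x, or (&x)->member) are left out.

```cpp
#include <string>

// Hypothetical classification of the bound expression's type.
enum class TypeKind { Pointer, Reference, Value };

// Mirrors $id.member: use "->" for pointers and "." otherwise.
std::string memberAccess(const std::string &expr, TypeKind kind,
                         const std::string &member) {
  return expr + (kind == TypeKind::Pointer ? "->" : ".") + member;
}

// Mirrors $*(id): dereference only when the expression is a pointer.
std::string asValue(const std::string &expr, TypeKind kind) {
  return kind == TypeKind::Pointer ? "*" + expr : expr;
}

// Mirrors $&(id): take the address unless the expression is already a
// pointer.
std::string asAddress(const std::string &expr, TypeKind kind) {
  return kind == TypeKind::Pointer ? expr : "&" + expr;
}
```

For x of type T*, asValue yields *x and asAddress yields x; for T& or T,
asValue yields x and asAddress yields &x, matching the two lists above.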
Grammar
The grammar for the format strings is as follows:
FormatString = IncludeOp* Part*
Part = Text | IdOp | UnaryOp | MemberOp
Text = Literal+
Literal = UNESCAPED | ESCAPED
IdOp = '$' (ID | '(' ID ')')
IncludeOp = '$includeHeader(' PATH ')'
UnaryOp = '$' ('*' | '&' | 'name' | 'callArgs' | 'initListElements') '(' ID ')'
MemberOp = IdOp '.' (Text | Part)
UNESCAPED = [^$]
ESCAPED = '\'.
ID = [a-zA-Z0-9_]+
PATH = [a-zA-Z0-9/.]+
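As a rough check on the grammar, the sketch below classifies a single
operator token using regular expressions transcribed from the IdOp,
UnaryOp, and IncludeOp productions. OpKind and classifyOp are hypothetical
names, and this is not the parser described above: it only matches one
whole token at a time and does not handle Text or MemberOp.

```cpp
#include <regex>
#include <string>

// Hypothetical token categories, one per operator production.
enum class OpKind { Id, Unary, Include, Invalid };

// Classifies a complete token such as "$cond" or "$*(x)". Each regex is a
// direct transcription of the corresponding grammar rule.
OpKind classifyOp(const std::string &token) {
  // IncludeOp = '$includeHeader(' PATH ')'
  static const std::regex includeOp(
      R"re(\$includeHeader\([a-zA-Z0-9/.]+\))re");
  // UnaryOp = '$' ('*' | '&' | 'name' | 'callArgs' | 'initListElements')
  //           '(' ID ')'
  static const std::regex unaryOp(
      R"re(\$(\*|&|name|callArgs|initListElements)\([a-zA-Z0-9_]+\))re");
  // IdOp = '$' (ID | '(' ID ')')
  static const std::regex idOp(
      R"re(\$([a-zA-Z0-9_]+|\([a-zA-Z0-9_]+\)))re");

  if (std::regex_match(token, includeOp))
    return OpKind::Include;
  if (std::regex_match(token, unaryOp))
    return OpKind::Unary;
  if (std::regex_match(token, idOp))
    return OpKind::Id;
  return OpKind::Invalid;
}
```

One thing the transcription makes visible: $name on its own is also a valid
IdOp, so a parser working over a full format string would presumably need
to try UnaryOp before IdOp to read $name(fn) as a single operator.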