For loop conversion (was: C++11 migration tools)

Thanks for the input!

Tooling/Refactoring is definitely the right way to go - dumping to stdout was just a holdover from the example on LibTooling. I’ll change it once I figure out how it works - and a clean way to arrange the tests.

As for the use of matchers vs. visitors, I decided to use a visitor because this is a relatively complex transformation. I would happily use matchers if I thought I could - and I think that some other c++11 migrations can easily be written with matchers - but I think the for loop checks need some features that matchers don’t have (correct me if I’m wrong!). For example, the check for statically allocated array-based loops does this:
Given a for loop, determine if:

  • The increment statement increments (via ++) exactly one integral index variable, such that the variable is declared and initialized to zero in the init portion of the loop, and that this variable’s value is a compared (via <, > or !=) to a nonnegative compile-time constant N in the compare portion.
  • The index variable is only ever used in an ArrayIndexExpession indexing a single, statically allocated array A.
  • The array A has exactly N elements.
  • Additionally, if the ArrayIndexExpression A[index] is ever assigned, passed to a function or copied as a non-const reference, or its address taken with & (I still need to add a check for calls to non-const member functions), the loop variable in the converted version needs to be a non-const reference so that the value will be correctly updated (this step adds the most complexity).

The other types of loop (iterator-based, and array-like container) are more complicated to detect, since there are more permitted ways to define and use the index/iterators. What makes this difficult to do entirely with matchers is the number of back- and cross-references, as well as the different local behaviors based on semantic properties. On the other hand, if there were some kind of backreference-enabled matcher that allowed me to locate all matches in a given Stmt, it could be much easier to express some parts of the logic, such as the first step in the above list. I also suspect that a single-Stmt matcher would better way to handle the last step; currently I track whether the visitor is looking at a statement or expression which fits any of the const-disqualifying conditions, and a note is made if I run into A[index].

Does this make the use case clearer? I don’t really see a better way to approach this problem, but you guys know the available toolkit far better than I do.

Thanks for the input!

Tooling/Refactoring is definitely the right way to go - dumping to stdout was just a holdover from the example on LibTooling. I’ll change it once I figure out how it works - and a clean way to arrange the tests.

As for the use of matchers vs. visitors, I decided to use a visitor because this is a relatively complex transformation. I would happily use matchers if I thought I could - and I think that some other c++11 migrations can easily be written with matchers - but I think the for loop

I’m not claiming that the matchers needed to match those constructs are all already written - but if we write the questions you need to ask into matchers, other people who want to match similar things can reuse them, thus amplifying the impact of the code you write :wink:

checks need some features that matchers don’t have (correct me if I’m wrong!). For example, the check for statically allocated array-based loops does this:
Given a for loop, determine if:

  • The increment statement increments (via ++) exactly one integral index variable, such that the variable is declared and initialized to zero in the init portion of the loop, and that this variable’s value is a compared (via <, > or !=) to a nonnegative compile-time constant N in the compare portion.
  • The index variable is only ever used in an ArrayIndexExpession indexing a single, statically allocated array A.
  • The array A has exactly N elements.
  • Additionally, if the ArrayIndexExpression A[index] is ever assigned, passed to a function or copied as a non-const reference, or its address taken with & (I still need to add a check for calls to non-const member functions), the loop variable in the converted version needs to be a non-const reference so that the value will be correctly updated (this step adds the most complexity).

… and the matcher I would want to write for this looks something like that:
ForLoop(
HasInitialization(Declaration(Id(“loopvar”, HasType(IsIntegral())))),
HasCondition(BinaryOperator(
HasAnyOperand(DeclarationReference(Id(“condref”, To(Variable())))),
HasAnyOperand(IntegerLiteral()))),
HasIncrement(UnaryOperator(HasUnaryOperand(DeclarationReference(Id(“incref”, To(Variable()))))), …),
)

In general, the complex stuff can stay complex, but the simple stuff shouldn’t be lots of code.

The other types of loop (iterator-based, and array-like container) are more complicated to detect, since there are more permitted ways to define and use the index/iterators. What makes this difficult to do entirely with matchers is the number of back- and cross-references, as well as the different local behaviors based on semantic properties. On the other hand, if there were some kind of backreference-enabled matcher that

The way to handle the back-refs is to bind the nodes you want, and then pull them out and compare them in the callback.

Thoughts?
/Manuel

Thanks for the input!

Tooling/Refactoring is definitely the right way to go - dumping to stdout
was just a holdover from the example on LibTooling. I'll change it once I
figure out how it works - and a clean way to arrange the tests.

As for the use of matchers vs. visitors, I decided to use a visitor
because this is a relatively complex transformation. I would happily use
matchers if I thought I could - and I think that some other c++11
migrations can easily be written with matchers - but I think the for loop

I'm not claiming that the matchers needed to match those constructs are
all already written - but if we write the questions you need to ask into
matchers, other people who want to match similar things can reuse them,
thus amplifying the impact of the code you write :wink:

checks need some features that matchers don't have (correct me if I'm
wrong!). For example, the check for statically allocated array-based loops
does this:
Given a for loop, determine if:
- The increment statement increments (via ++) exactly one integral index
variable, such that the variable is declared and initialized to zero in the
init portion of the loop, and that this variable's value is a compared (via
<, > or !=) to a nonnegative compile-time constant N in the compare portion.
- The index variable is only ever used in an ArrayIndexExpession
indexing a single, statically allocated array A.
- The array A has exactly N elements.
- Additionally, if the ArrayIndexExpression A[index] is ever assigned,
passed to a function or copied as a non-const reference, or its address
taken with & (I still need to add a check for calls to non-const member
functions), the loop variable in the converted version needs to be a
non-const reference so that the value will be correctly updated (this step
adds the most complexity).

... and the matcher I would want to write for this looks something like
that:
ForLoop(
  HasInitialization(Declaration(Id("loopvar", HasType(IsIntegral())))),
  HasCondition(BinaryOperator(
    HasAnyOperand(DeclarationReference(Id("condref", To(Variable())))),
    HasAnyOperand(IntegerLiteral()))),

HasIncrement(UnaryOperator(HasUnaryOperand(DeclarationReference(Id("incref",
To(Variable()))))), ...),
)

In general, the complex stuff can stay complex, but the simple stuff
shouldn't be lots of code.

Good point - and this is much easier to read than the equivalent code I had
written.

The other types of loop (iterator-based, and array-like container) are
more complicated to detect, since there are more permitted ways to define
and use the index/iterators. What makes this difficult to do entirely with
matchers is the number of back- and cross-references, as well as the
different local behaviors based on semantic properties. On the other hand,
if there were some kind of backreference-enabled matcher that

The way to handle the back-refs is to bind the nodes you want, and then
pull them out and compare them in the callback.

I see - make the matcher slightly more general, then filter the results,
perhaps with a visitor.

Thoughts?
/Manuel

This sounds like it would make at least half the work much easier, so I
think it would definitely be worth it to try switching to a matcher-based
solution. When are matchers supposed to hit mainline (or some extra
cloneable repo) :slight_smile: ?

-Sam

Matchers are currently in ^cfe/branches/tooling/include/clang/ASTMatchers/…

I’m currently working on renaming them to camelCase from CamelCase; there’s a Tool to help with the conversion though, so no problem in starting now …

Cheers,
/Manuel

I’m trying to port my code to take advantage of matchers, now that they are in mainline. Some of the work I want to do involves semantic analysis of the results (i.e. in the callback). What would be the best way to get a Sema or CompilerInstance out of either RefactoringTool or MatchResult? I’m currently playing with changing MatchASTConsumer to inherit from SemaConsumer, so that MatchFinder can track a Sema object the same way it does an ASTContext.

Thanks!

I'm trying to port my code to take advantage of matchers, now that they are
in mainline. Some of the work I want to do involves semantic analysis of the
results (i.e. in the callback). What would be the best way to get a Sema or
CompilerInstance out of either RefactoringTool or MatchResult? I'm currently
playing with changing MatchASTConsumer to inherit from SemaConsumer, so that
MatchFinder can track a Sema object the same way it does an ASTContext.

I'd be interested in what the use cases for the semantic analysis are
first. Often it seems like it would be better to have the information
available in the AST instead of rerunning (potentially expensive)
semantic analysis.

Cheers,
/Manuel

Good point. I currently have a matcher that finds for loops that appear to have the right shape for conversion, and I do some extra checking along with the conversion work in the callback. I can completely handle array-style loops with a Preprocessor, which is used when generating the names of new variables. For iterator-style loops, I use Sema to determine the return type of operator*, which will be the type of the loop variable. Finally (though this isn’t implemented yet), containers that are used as if they’re arrays will require Sema to check if suitable begin() and end() functions exist.

For now, I can force the use of auto rather than an explicit type listing for iterator-based loops and get most of the work done without Sema. The question then becomes the correct way to avoid repeatedly creating a Preprocessor object, though if this is a cheap operation, I can create one in the callback.

Does this make the use clearer?
Thanks!

Good point. I currently have a matcher that finds for loops that appear to
have the right shape for conversion, and I do some extra checking along with
the conversion work in the callback. I can completely handle array-style
loops with a Preprocessor, which is used when generating the names of new

I don't understand yet what the Preprocessor is needed for here...

variables. For iterator-style loops, I use Sema to determine the return type
of operator*, which will be the type of the loop variable. Finally (though

Isn't the return type of the oeprator* already in the AST?

this isn't implemented yet), containers that are used as if they're arrays
will require Sema to check if suitable begin() and end() functions exist.

Ah, I see - basically you want to trigger overload resolution to see
whether the conversion would work. That makes sense. I'm not sure we
want to get all of Sema available though - it's an insanely huge
interface, and it's rather hard to figure out what's safe to do and
what's not.

Perhaps we can create a class that encapsulates Sema for those
"what-if" type questions we'll need to answer for refactoring tools?

For now, I can force the use of auto rather than an explicit type listing
for iterator-based loops and get most of the work done without Sema. The
question then becomes the correct way to avoid repeatedly creating a
Preprocessor object, though if this is a cheap operation, I can create one
in the callback.

Does this make the use clearer?

Yep, thx.

Cheers,
/Manuel

> Good point. I currently have a matcher that finds for loops that appear
to
> have the right shape for conversion, and I do some extra checking along
with
> the conversion work in the callback. I can completely handle array-style
> loops with a Preprocessor, which is used when generating the names of new

I don't understand yet what the Preprocessor is needed for here...

I use Preprocessor in exactly one place right now, namely to create an
IdentifierResolver. When I am generating the name of a new loop variable, I
want to make sure that the identifier is unique, which is an approximation
to checking that the identifier neither shadows an existing definition nor
is shadowed in an inner scope. If there's a better way to do this, I would
be happy to change it.

> variables. For iterator-style loops, I use Sema to determine the return
type
> of operator*, which will be the type of the loop variable. Finally
(though

Isn't the return type of the oeprator* already in the AST?

It's actually possible that operator* is never called. My initial
implementation tried to infer the type from expressions involving the
iterator, but this didn't work correctly for CXXMemberCallExpr. I can
conceivably work around this without checking operator*, but

> this isn't implemented yet), containers that are used as if they're
arrays
> will require Sema to check if suitable begin() and end() functions exist.

Ah, I see - basically you want to trigger overload resolution to see
whether the conversion would work. That makes sense. I'm not sure we
want to get all of Sema available though - it's an insanely huge
interface, and it's rather hard to figure out what's safe to do and
what's not.

Perhaps we can create a class that encapsulates Sema for those
"what-if" type questions we'll need to answer for refactoring tools?

This sounds like a good idea. I can try to identify exactly the features I
need (most likely just overload resolution).

> Good point. I currently have a matcher that finds for loops that appear
> to
> have the right shape for conversion, and I do some extra checking along
> with
> the conversion work in the callback. I can completely handle array-style
> loops with a Preprocessor, which is used when generating the names of
> new

I don't understand yet what the Preprocessor is needed for here...

I use Preprocessor in exactly one place right now, namely to create an
IdentifierResolver. When I am generating the name of a new loop variable, I
want to make sure that the identifier is unique, which is an approximation
to checking that the identifier neither shadows an existing definition nor
is shadowed in an inner scope. If there's a better way to do this, I would
be happy to change it.

The ASTContext has the identifier table. (ASTContext::Idents).

> variables. For iterator-style loops, I use Sema to determine the return
> type
> of operator*, which will be the type of the loop variable. Finally
> (though

Isn't the return type of the oeprator* already in the AST?

It's actually possible that operator* is never called. My initial
implementation tried to infer the type from expressions involving the
iterator, but this didn't work correctly for CXXMemberCallExpr. I can
conceivably work around this without checking operator*, but

did you want to finish that

> this isn't implemented yet), containers that are used as if they're
> arrays
> will require Sema to check if suitable begin() and end() functions
> exist.

Ah, I see - basically you want to trigger overload resolution to see
whether the conversion would work. That makes sense. I'm not sure we
want to get all of Sema available though - it's an insanely huge
interface, and it's rather hard to figure out what's safe to do and
what's not.

Perhaps we can create a class that encapsulates Sema for those
"what-if" type questions we'll need to answer for refactoring tools?

This sounds like a good idea. I can try to identify exactly the features I
need (most likely just overload resolution).

That would sound like most of what we need. It has come up a few
times. Let me know if you identify more things we need.

Cheers,
/Manuel

Here are three kinds of operations that I would want to completely avoid any kind of recursive AST traversal:

  • Optional matchers: In order to ignore certain ignorable parts of the AST, I often want to create

  • Right-recursive matchers:

I have a function that hunts for the expression which is actually passed to a single-parameter constructor, which sheds ImplicitCastExprs, single-parameter CXXConstructExprs, and MaterializeTemporaryExprs until it finds something that is not a single-parameter constructor. Essentially, the matcher would look like
StmtMatcher = expression(rightRecursion(ignoreImplicitCasts(constructorCall(argumentCountIs(1), hasArgument(0, optional(materializeTemporaryExpr(RecursionPoint))), withBaseCase(id(expresion))))

  • The ability to run the equivalent of a MatchFinder on an arbitrary AST (e.g. in the callback of another matcher)

Here are three kinds of operations that I would want to completely avoid any kind of recursive AST traversal:

  • Optional matchers: In order to ignore certain ignorable parts of the AST, I often want to create

Is there something missing?

  • Right-recursive matchers:

I have a function that hunts for the expression which is actually passed to a single-parameter constructor, which sheds ImplicitCastExprs, single-parameter CXXConstructExprs, and MaterializeTemporaryExprs until it finds something that is not a single-parameter constructor. Essentially, the matcher would look like
StmtMatcher = expression(rightRecursion(ignoreImplicitCasts(constructorCall(argumentCountIs(1), hasArgument(0, optional(materializeTemporaryExpr(RecursionPoint))), withBaseCase(id(expresion))))

hasDescendant / forEachDescendant doesn’t work?

  • The ability to run the equivalent of a MatchFinder on an arbitrary AST (e.g. in the callback of another matcher)

That shouldn’t be too hard to implement.

I accidentally sent the email before I was done (the Gmail tab/space combo
strikes again), and you replied before I finished fixing the original :slight_smile:

Here are three kinds of operations that I would want to completely avoid
any kind of recursive AST traversal:
- Optional matchers: In order to ignore certain ignorable parts of the
AST, I often want to create

Is there something missing?

The missing part of optional matchers: I'm just wondering if a shorthand
would be reasonable to implement. instead of something like
  anyOf(unaryOperator(hasOperand(SomeOtherMatcher)), SomeOtherMatcher)
One could use
  optional(unaryOperator(hasOperand), SomeOtherMatcher)

- Right-recursive matchers:

  I have a function that hunts for the expression which is actually
passed to a single-parameter constructor, which sheds ImplicitCastExprs,
single-parameter CXXConstructExprs, and MaterializeTemporaryExprs until it
finds something that is not a single-parameter constructor. Essentially,
the matcher would look like
  StmtMatcher =
expression(rightRecursion(ignoreImplicitCasts(constructorCall(argumentCountIs(1),
hasArgument(0, optional(materializeTemporaryExpr(RecursionPoint))),
withBaseCase(id(expresion))))

hasDescendant / forEachDescendant doesn't work?

I am trying to enforce a particular structure on the way to those
descendants - I only want to match the innermost expression if the AST was
a linear chain of these particular kinds of node, and
{has,forEach}Descendant throws this information away. If I'm wrong, that
would be great!

- The ability to run the equivalent of a MatchFinder on an arbitrary AST
(e.g. in the callback of another matcher)

That shouldn't be too hard to implement.

Excellent!

Thanks,
-Sam

I accidentally sent the email before I was done (the Gmail tab/space combo
strikes again), and you replied before I finished fixing the original :slight_smile:

Here are three kinds of operations that I would want to completely avoid
any kind of recursive AST traversal:
- Optional matchers: In order to ignore certain ignorable parts of the
AST, I often want to create

Is there something missing?

The missing part of optional matchers: I'm just wondering if a shorthand
would be reasonable to implement. instead of something like
  anyOf(unaryOperator(hasOperand(SomeOtherMatcher)), SomeOtherMatcher)
One could use
  optional(unaryOperator(hasOperand), SomeOtherMatcher)

And we arrive at where the problem is with something that looks like a
functional language but doesn't have a really nice way to express
higher order functions without lots of syntactical crap around it.

But it also doesn't strike me as so bad to repeat SomeOtherMatcher in there...

- Right-recursive matchers:

  I have a function that hunts for the expression which is actually
passed to a single-parameter constructor, which sheds ImplicitCastExprs,
single-parameter CXXConstructExprs, and MaterializeTemporaryExprs until it
finds something that is not a single-parameter constructor. Essentially, the
matcher would look like
  StmtMatcher =
expression(rightRecursion(ignoreImplicitCasts(constructorCall(argumentCountIs(1),
hasArgument(0, optional(materializeTemporaryExpr(RecursionPoint))),
withBaseCase(id(expresion))))

hasDescendant / forEachDescendant doesn't work?

I am trying to enforce a particular structure on the way to those
descendants - I only want to match the innermost expression if the AST was a
linear chain of these particular kinds of node, and {has,forEach}Descendant
throws this information away. If I'm wrong, that would be great!

hasDescendant will match at the first occurrence in tree-search order
(RecursiveASTVisitor order)

Cheers,
/Manuel

> I accidentally sent the email before I was done (the Gmail tab/space
combo
> strikes again), and you replied before I finished fixing the original :slight_smile:
>
>>
>>>
>>> Here are three kinds of operations that I would want to completely
avoid
>>> any kind of recursive AST traversal:
>>> - Optional matchers: In order to ignore certain ignorable parts of the
>>> AST, I often want to create
>>
>>
>> Is there something missing?
>
>
> The missing part of optional matchers: I'm just wondering if a shorthand
> would be reasonable to implement. instead of something like
> anyOf(unaryOperator(hasOperand(SomeOtherMatcher)), SomeOtherMatcher)
> One could use
> optional(unaryOperator(hasOperand), SomeOtherMatcher)

And we arrive at where the problem is with something that looks like a
functional language but doesn't have a really nice way to express
higher order functions without lots of syntactical crap around it.

But it also doesn't strike me as so bad to repeat SomeOtherMatcher in
there...

I was just curious if this was possible, though I didn't expect it to be.
It would only have been really useful when combined with the recursion
request.

>>> - Right-recursive matchers:
>>>
>>> I have a function that hunts for the expression which is actually
>>> passed to a single-parameter constructor, which sheds
ImplicitCastExprs,
>>> single-parameter CXXConstructExprs, and MaterializeTemporaryExprs
until it
>>> finds something that is not a single-parameter constructor.
Essentially, the
>>> matcher would look like
>>> StmtMatcher =
>>>
expression(rightRecursion(ignoreImplicitCasts(constructorCall(argumentCountIs(1),
>>> hasArgument(0, optional(materializeTemporaryExpr(RecursionPoint))),
>>> withBaseCase(id(expresion))))
>>
>>
>> hasDescendant / forEachDescendant doesn't work?
>
>
> I am trying to enforce a particular structure on the way to those
> descendants - I only want to match the innermost expression if the AST
was a
> linear chain of these particular kinds of node, and
{has,forEach}Descendant
> throws this information away. If I'm wrong, that would be great!

hasDescendant will match at the first occurrence in tree-search order
(RecursiveASTVisitor order)

I think the difficulty here is matching against a recurring pattern. In
some sense, what I'm really looking for is the first descendant that
doesn't match. To make this as explicit as possible, my current code looks
like this:

const Expr *digThroughConstructors(const Expr *E) {
  if (!E)
    return NULL;
  E = E->IgnoreParenCasts();
  if (const CXXConstructExpr *ConstructExpr =
dyn_cast<CXXConstructExpr>(E)) {
    if (ConstructExpr->getNumArgs() != 1)
      return NULL;
    E = ConstructExpr->getArg(0);
    if (const MaterializeTemporaryExpr *MTE
= dyn_cast<MaterializeTemporaryExpr>(E))
      E = MTE->GetTemporaryExpr();
    return digThroughConstructors(E);
  }
  return E;
}

It's short enough that I don't mind leaving it as its own separate check.
Thanks again for taking the time to help me with this!

-Sam

> I accidentally sent the email before I was done (the Gmail tab/space
> combo
> strikes again), and you replied before I finished fixing the original :slight_smile:
>
>>
>>>
>>> Here are three kinds of operations that I would want to completely
>>> avoid
>>> any kind of recursive AST traversal:
>>> - Optional matchers: In order to ignore certain ignorable parts of
>>> the
>>> AST, I often want to create
>>
>>
>> Is there something missing?
>
>
> The missing part of optional matchers: I'm just wondering if a shorthand
> would be reasonable to implement. instead of something like
> anyOf(unaryOperator(hasOperand(SomeOtherMatcher)), SomeOtherMatcher)
> One could use
> optional(unaryOperator(hasOperand), SomeOtherMatcher)

And we arrive at where the problem is with something that looks like a
functional language but doesn't have a really nice way to express
higher order functions without lots of syntactical crap around it.

But it also doesn't strike me as so bad to repeat SomeOtherMatcher in
there...

I was just curious if this was possible, though I didn't expect it to be. It
would only have been really useful when combined with the recursion request.

>>> - Right-recursive matchers:
>>>
>>> I have a function that hunts for the expression which is actually
>>> passed to a single-parameter constructor, which sheds
>>> ImplicitCastExprs,
>>> single-parameter CXXConstructExprs, and MaterializeTemporaryExprs
>>> until it
>>> finds something that is not a single-parameter constructor.
>>> Essentially, the
>>> matcher would look like
>>> StmtMatcher =
>>>
>>> expression(rightRecursion(ignoreImplicitCasts(constructorCall(argumentCountIs(1),
>>> hasArgument(0, optional(materializeTemporaryExpr(RecursionPoint))),
>>> withBaseCase(id(expresion))))
>>
>>
>> hasDescendant / forEachDescendant doesn't work?
>
>
> I am trying to enforce a particular structure on the way to those
> descendants - I only want to match the innermost expression if the AST
> was a
> linear chain of these particular kinds of node, and
> {has,forEach}Descendant
> throws this information away. If I'm wrong, that would be great!

hasDescendant will match at the first occurrence in tree-search order
(RecursiveASTVisitor order)

I think the difficulty here is matching against a recurring pattern. In some
sense, what I'm really looking for is the first descendant that doesn't
match. To make this as explicit as possible, my current code looks like
this:

const Expr *digThroughConstructors(const Expr *E) {
  if (!E)
    return NULL;
  E = E->IgnoreParenCasts();
  if (const CXXConstructExpr *ConstructExpr = dyn_cast<CXXConstructExpr>(E))
{
    if (ConstructExpr->getNumArgs() != 1)
      return NULL;
    E = ConstructExpr->getArg(0);
    if (const MaterializeTemporaryExpr *MTE =
dyn_cast<MaterializeTemporaryExpr>(E))
      E = MTE->GetTemporaryExpr();
    return digThroughConstructors(E);
  }
  return E;
}

It's short enough that I don't mind leaving it as its own separate check.

Yea, I'm not sure that's easy to express as a match currently. We
should think about adding a linear search matcher construct that
expresses the relationship you want.

On the other hand, I'm not sure yet that we really need this in your
specific case :slight_smile: But only seeing the code around it will answer that
for me...

Cheers,
/Manuel