How to write a matcher?

Hi,

I want to write a tool to remove c_str() calls in lines like

std::cout << s.c_str();

There isn't really enough information about what matchers mean, and how to
use them.

There is no good information on how to write/debug them, and how they need
to be arranged (nested, in an allof, wrapped with to(), on(), callee() etc).

How does one *actually* go about doing it?

Here's my attempt:

My test code is:

int main(int, char **)
{
   std::string s("something");
   std::cout << s << std::endl;
   std::cout << s.c_str() << std::endl;
   return 0;
}

I used

clang++ -Xclang -ast-dump -fsyntax-only -fno-color-diagnostics ../main.cpp

to generate

`-FunctionDecl 0x275cae0 <../main.cpp:9:1, line:16:1> main 'int (int, char
**)'
   >-ParmVarDecl 0x275c9a0 <line:9:10> 'int'
   >-ParmVarDecl 0x275ca10 <col:15, col:21> 'char **'
   `-CompoundStmt 0x275f150 <line:10:1, line:16:1>
     >-DeclStmt 0x275d160 <line:11:3, col:23>
     > `-VarDecl 0x275cc30 <col:3, col:22> s 'std::string':'class
std::basic_string<char>'
     > `-ExprWithCleanups 0x275d148 <col:15, col:22> 'std::string':'class
std::basic_string<char>'
     > `-CXXConstructExpr 0x275d108 <col:15, col:22>
'std::string':'class std::basic_string<char>' 'void (const char *, const
class std::allocator<char> &)'
     > >-ImplicitCastExpr 0x275cd20 <col:17> 'const char *'
<ArrayToPointerDecay>
     > > `-StringLiteral 0x275cc88 <col:17> 'const char [4]' lvalue
"wtf"
     > `-CXXDefaultArgExpr 0x275d0e0 <<invalid sloc>> 'const class
std::allocator<char>':'const class std::allocator<char>' lvalue
     >-CXXOperatorCallExpr 0x275e160 <line:12:3, col:26>
'__ostream_type':'class std::basic_ostream<char>' lvalue
     > >-ImplicitCastExpr 0x275e148 <col:18> '__ostream_type &(*)
(__ostream_type &(*)(__ostream_type &))' <FunctionToPointerDecay>
     > > `-DeclRefExpr 0x275e0c0 <col:18> '__ostream_type &(__ostream_type
&(*)(__ostream_type &))' lvalue CXXMethod 0x26d0410 'operator<<'
'__ostream_type &(__ostream_type &(*)(__ostream_type &))'
     > >-CXXOperatorCallExpr 0x275d5c0 <col:3, col:16> 'basic_ostream<char,
struct std::char_traits<char> >':'class std::basic_ostream<char>' lvalue
     > > >-ImplicitCastExpr 0x275d5a8 <col:13> 'basic_ostream<char, struct
std::char_traits<char> > &(*)(basic_ostream<char, struct
std::char_traits<char> > &, const basic_string<char, struct
std::char_traits<char>, class std::allocator<char> > &)'
<FunctionToPointerDecay>
     > > > `-DeclRefExpr 0x275d528 <col:13> 'basic_ostream<char, struct
std::char_traits<char> > &(basic_ostream<char, struct std::char_traits<char>

&, const basic_string<char, struct std::char_traits<char>, class

std::allocator<char> > &)' lvalue Function 0x2533250 'operator<<'
'basic_ostream<char, struct std::char_traits<char> > &(basic_ostream<char,
struct std::char_traits<char> > &, const basic_string<char, struct
std::char_traits<char>, class std::allocator<char> > &)'
     > > >-DeclRefExpr 0x275d198 <col:3, col:8> 'ostream':'class
std::basic_ostream<char>' lvalue Var 0x275c3a0 'cout' 'ostream':'class
std::basic_ostream<char>'
     > > `-ImplicitCastExpr 0x275d510 <col:16> 'const basic_string<char,
struct std::char_traits<char>, class std::allocator<char> >':'const class
std::basic_string<char>' lvalue <NoOp>
     > > `-DeclRefExpr 0x275d1d0 <col:16> 'std::string':'class
std::basic_string<char>' lvalue Var 0x275cc30 's' 'std::string':'class
std::basic_string<char>'
     > `-ImplicitCastExpr 0x275e0a8 <col:21, col:26> 'basic_ostream<char,
struct std::char_traits<char> > &(*)(basic_ostream<char, struct
std::char_traits<char> > &)' <FunctionToPointerDecay>
     > `-DeclRefExpr 0x275e068 <col:21, col:26> 'basic_ostream<char,
struct std::char_traits<char> > &(basic_ostream<char, struct
std::char_traits<char> > &)' lvalue Function 0x26d4480 'endl'
'basic_ostream<char, struct std::char_traits<char> > &(basic_ostream<char,
struct std::char_traits<char> > &)' (FunctionTemplate 0x26b9180 'endl')
     >-CXXOperatorCallExpr 0x275f0c8 <line:13:3, col:34>
'__ostream_type':'class std::basic_ostream<char>' lvalue
     > >-ImplicitCastExpr 0x275f0b0 <col:26> '__ostream_type &(*)
(__ostream_type &(*)(__ostream_type &))' <FunctionToPointerDecay>
     > > `-DeclRefExpr 0x275f088 <col:26> '__ostream_type &(__ostream_type
&(*)(__ostream_type &))' lvalue CXXMethod 0x26d0410 'operator<<'
'__ostream_type &(__ostream_type &(*)(__ostream_type &))'
     > >-CXXOperatorCallExpr 0x275e610 <col:3, col:24> 'basic_ostream<char,
struct std::char_traits<char> >':'class std::basic_ostream<char>' lvalue
     > > >-ImplicitCastExpr 0x275e5f8 <col:13> 'basic_ostream<char, struct
std::char_traits<char> > &(*)(basic_ostream<char, struct
std::char_traits<char> > &, const char *)' <FunctionToPointerDecay>
     > > > `-DeclRefExpr 0x275e578 <col:13> 'basic_ostream<char, struct
std::char_traits<char> > &(basic_ostream<char, struct std::char_traits<char>

&, const char *)' lvalue Function 0x26d9830 'operator<<'

'basic_ostream<char, struct std::char_traits<char> > &(basic_ostream<char,
struct std::char_traits<char> > &, const char *)'
     > > >-DeclRefExpr 0x275e1c8 <col:3, col:8> 'ostream':'class
std::basic_ostream<char>' lvalue Var 0x275c3a0 'cout' 'ostream':'class
std::basic_ostream<char>'
     > > `-CXXMemberCallExpr 0x275e258 <col:16, col:24> 'const char *'
     > > `-MemberExpr 0x275e228 <col:16, col:18> '<bound member function

' .c_str 0x252c400

     > > `-ImplicitCastExpr 0x275e280 <col:16> 'const class
std::basic_string<char>' lvalue <NoOp>
     > > `-DeclRefExpr 0x275e200 <col:16> 'std::string':'class
std::basic_string<char>' lvalue Var 0x275cc30 's' 'std::string':'class
std::basic_string<char>'
     > `-ImplicitCastExpr 0x275f070 <col:29, col:34> 'basic_ostream<char,
struct std::char_traits<char> > &(*)(basic_ostream<char, struct
std::char_traits<char> > &)' <FunctionToPointerDecay>
     > `-DeclRefExpr 0x275f030 <col:29, col:34> 'basic_ostream<char,
struct std::char_traits<char> > &(basic_ostream<char, struct
std::char_traits<char> > &)' lvalue Function 0x26d4480 'endl'
'basic_ostream<char, struct std::char_traits<char> > &(basic_ostream<char,
struct std::char_traits<char> > &)' (FunctionTemplate 0x26b9180 'endl')
     `-ReturnStmt 0x275f130 <line:14:3, col:10>
       `-IntegerLiteral 0x275f110 <col:10> 'int' 0

It seems useful to writing a matcher.

I modified the removeCStrCalls tool with a debugging matcher, so that I can
start broad, and incrementally narrow what I want to match, while printing
out what I have actually matched:

class Debugging : public ast_matchers::MatchFinder::MatchCallback {
public:
   Debugging(tooling::Replacements *Replace)
     : Replace(Replace) {}

   virtual void run(const ast_matchers::MatchFinder::MatchResult &Result) {

     const Expr *Arg =
         Result.Nodes.getStmtAs<Expr>("match");
     std::string match = getText(*Result.SourceManager, *Arg);
     std::cout << "MATCH " << match << std::endl;

   }

private:
   tooling::Replacements *Replace;
};

Is this the right/sane/only possible approach to the act of writing a
matcher?

From the -ast-dump, it looks like I need to first match a

CXXOperatorCallExpr, so I write this matcher:

  Finder.addMatcher(
      id("match", operatorCallExpr()),
      &Callback);

which gives me lots of output, mostly from the iostream header. I need to
get narrower.

In the -ast-dump output there is a CXXMemberCallExpr nested in the
CXXOperatorCallExpr. I make a logical jump at matching that:

  Finder.addMatcher(
      id("match", operatorCallExpr(memberCallExpr())),
      &Callback);

However, that produces no output, so my logical jump must be incorrect.

What is the correct logical jump here in order to narrow my match?

How can I determine what logical jumps I can make, by looking at the ast-
dump output?

How does one go about *actually* writing matcher code?

What is the understanding one must have, the logic one must have and the
information one must have to hand (is ast-dump output useful or not)?

To be clear (in case it is not already clear), I'm not looking for a
finished solution for the match/replacement I'm trying to make. I'm asking
what to spend time doing between opening my editor and running a tool that
does what I need. How do I figure out what code I need to write?

Note: I have read the docs. The section at

http://clang.llvm.org/docs/LibASTMatchersTutorial.html#step-2-using-ast-matchers

shows some incremental steps, but does not explain how they are chosen.

How does one know to go from

forStmt()

to

forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))

for example? That is the kind of information I am looking for.

Thanks,

Steve.

I have found the AST Matcher Reference to be useful while developing matchers: http://clang.llvm.org/docs/LibASTMatchersReference.html

Instead of writing your own debug matcher you may want to try clang-query.

I don’t think it’s documented yet but it’s quite straightforward to get started. You will have an interactive shell session to build your matcher and see the results in live, I tried it the other day and I suggest to use it with libedit support so it will complete for you the possible matchers as you type.

https://llvm.org/viewvc/llvm-project/clang-tools-extra/trunk/clang-query/

To generate the AST dump you may want to use clang-check instead of clang++ with all the arguments. clang-check allows you to filter the AST and remove the need of specifying things like -fsyntax-only.

I have found the ast-dump useful since it provides the names of the types that you can look for in the AST Matcher Reference.

papin_g wrote:

I have found the AST Matcher Reference to be useful while developing
matchers: http://clang.llvm.org/docs/LibASTMatchersReference.html

Instead of writing your own debug matcher you may want to try clang-query.

To avoid maintaining a build of clang, I'm using the debian packages. That
doesn't seem to be part of it.

To generate the AST dump you may want to use clang-check instead of
clang++ with all the arguments. clang-check allows you to filter the AST
and remove the need of specifying things like -fsyntax-only.

Thanks for the tip.

However, as I wrote, getting the AST is not the primary problem. Properly
making use of it is.

Anyway, lots of mostly-blind experimentation led me to add this to
RemoveCStrCalls:

  Finder.addMatcher(
      operatorCallExpr(
        hasArgument(0, hasType(recordDecl(hasName("basic_ostream")))),
        hasOverloadedOperatorName("<<"),
        hasArgument(1,
          memberCallExpr(
            hasDescendant(
              memberExpr(
// callee(expr().bind("match")),
//
hasObjectExpression(hasType(recordDecl(hasName("basic_string")))),
                member(hasName("c_str"))
              ).bind("member")
            ),
            on(expr().bind("arg"))
          ).bind("call")
        )
      ),
      &Callback);

I still have no idea how to make it match 'c_str' on strings. However, it's
good enough to port CMake, so I guess I'm done until I next encounter a
need.

Thanks,

Steve.

In article <lfmm58$e8i$1@ger.gmane.org>,
    Stephen Kelly <steveire@gmail.com> writes:

I want to write a tool to remove c_str() calls in lines like

std::cout << s.c_str();

I would start with remove-cstr-calls which already gives you an
example of how to write a refactoring tool that matches c_str calls.

Have you looked at that?

You may also find this work in progress I wrote helpful:
<https://github.com/LegalizeAdulthood/remove-void-args>

There isn't really enough information about what matchers mean, and how to
use them.

Yes, they could use some improved documentation. All we really have
right now is the AST matchers reference page.

How does one *actually* go about doing it?

That is the focus of the presentation I'm preparing for C++ Now! 2014.
I wrote the remove-void-args tool based on the remove-cstr-calls. In
my experiments I never needed to actually run through the parsing and
AST matching code in the debugger, but I did make extensive use of
clang to dump the AST.

I didn't know about clang-query at the time, so I haven't updated the
tutorial to reflect this.

I'm also working on a patch to update the packaging to include
additional files in the package to make things available to
refactoring tool authors like clang-query and FileCheck.

My apologies to Stephen for not noticing that he was already doing
what I was suggesting (starting with remove-cstr-calls and modifying).

In article <lfmm58$e8i$1@ger.gmane.org>,
    Stephen Kelly <steveire@gmail.com> writes:

From the -ast-dump, it looks like I need to first match a
CXXOperatorCallExpr, so I write this matcher:

  Finder.addMatcher(
      id("match", operatorCallExpr()),
      &Callback);

which gives me lots of output, mostly from the iostream header. I need to
get narrower.

In my "remove-void-arg" example, I did an extra check to see if
something was declared extern "C", because we don't want to change

    extern "C" {
        int foo(void);
    }

into

    extern "C" {
        int foo();
    }

I'm pretty sure I'd run into the same problem as Stephen once my test
inputs were broadened from simple cases to real-world code.

I think this is a general problem -- namely that when you include header
files, either from the standard library or third party libraries, your
refactoring tool can match all kinds of code that is in their headers.

The Visual Assist X add-on for Visual Studio has a way of specifying a
directory hierarchy of "stable" includes that it ignores for various
operations, including refactoring operations. By default, it puts all
the Visual Studio headers in this stable list.

I was thinking that a simple filter would be to look at the file
associated with the source location of the match and if that file
isn't one of the translation units mentioned in the compilation
database, then we would discount the match. This works well for .cpp
files, but obviously misses header files.

It sounds like it would be useful to have an AST matcher that provided
this sort of filtering so you could just include it as an additional
predicate in your matcher and not have to repeat the work of filtering
out "uninteresting" source locations.

I looked on the matchers page and didn't seem to see any matchers that
filter based on the file origin of a node in the AST.

In the -ast-dump output there is a CXXMemberCallExpr nested in the
CXXOperatorCallExpr. I make a logical jump at matching that:

  Finder.addMatcher(
      id("match", operatorCallExpr(memberCallExpr())),
      &Callback);

However, that produces no output, so my logical jump must be incorrect.

I think the reason you didn't get a match here is that the structure is:

CXXOperatorCallExpr

-ImplicitCastExpr
-DeclRefExpr
-CXXOperatorCallExpr
>-CXXMemberCallExpr

<http://clang.llvm.org/docs/LibASTMatchersReference.html> says that
operatorCallExpr expects a Matcher<CXXOperatorCallExpr> argument, but
you supplied memberCallExpr() which returns type Matcher<Stmt>.

operatorCallExpr() could take hasOverloadedOperatorName() in order to
narrow it to the specific operator.

I think then you need to narrow it further based on the arguments to
the operator.

I would have to verify this hypothesis by reproducing your experiment,
but perhaps another list reader with more experience can confirm my
hypothesis.

How does one go about *actually* writing matcher code?

I did something similar to what you did, but I think this process
still has some bumps in the road and could be made easier still.

The first bump I'm trying to help iron out is that the pre-built
packages don't have everything you need to build refactoring tools.

(aside: this made me laugh:
StringLiteral 0x275cc88 <col:17> 'const char [4]' lvalue "wtf")

In article <E1WOEAb-00071u-8w@shell.xmission.com>,
    Richard <legalize@xmission.com> writes:

I think the reason you didn't get a match here is that the structure is:

CXXOperatorCallExpr
>-ImplicitCastExpr
>-DeclRefExpr
>-CXXOperatorCallExpr
> >-CXXMemberCallExpr

<http://clang.llvm.org/docs/LibASTMatchersReference.html> says that
operatorCallExpr expects a Matcher<CXXOperatorCallExpr> argument, but
you supplied memberCallExpr() which returns type Matcher<Stmt>.

operatorCallExpr() could take hasOverloadedOperatorName() in order to
narrow it to the specific operator.

I think then you need to narrow it further based on the arguments to
the operator.

OK, finally figured out how to get clang-query built (you need libedit
or it won't build) and tried it out on your example file:

clang-query> match operatorCallExpr(hasOverloadedOperatorName("<<"),
hasArgument(1, hasType(asString("const char *"))))

Match #1:

/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../include/c++/4.7/ostream:542:15:
note: "root" binds here
    { return (__out << reinterpret_cast<const char*>(__s)); }
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Match #2:

/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../include/c++/4.7/ostream:547:15:
note: "root" binds here
    { return (__out << reinterpret_cast<const char*>(__s)); }
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Match #3:

/tmp/foo.cpp:7:3: note: "root" binds here
  std::cout << s.c_str() << std::endl;
  ^~~~~~~~~~~~~~~~~~~~~~
3 matches.

If you add in the suggestion about narrowing based on source file
location, then we're getting exactly the one you want from your source
file.

However, more narrowing is needed because you may have other functions
besides std::string::c_str() that return a const char * expression.

My apologies to Stephen for not noticing that he was already doing
what I was suggesting (starting with remove-cstr-calls and modifying).

In article <lfmm58$e8i$1@ger.gmane.org>,
    Stephen Kelly <steveire@gmail.com> writes:

> From the -ast-dump, it looks like I need to first match a
> CXXOperatorCallExpr, so I write this matcher:
>
> Finder.addMatcher(
> id("match", operatorCallExpr()),
> &Callback);
>
> which gives me lots of output, mostly from the iostream header. I need to
> get narrower.

In my "remove-void-arg" example, I did an extra check to see if
something was declared extern "C", because we don't want to change

    extern "C" {
        int foo(void);
    }

into

    extern "C" {
        int foo();
    }

I'm pretty sure I'd run into the same problem as Stephen once my test
inputs were broadened from simple cases to real-world code.

I think this is a general problem -- namely that when you include header
files, either from the standard library or third party libraries, your
refactoring tool can match all kinds of code that is in their headers.

The Visual Assist X add-on for Visual Studio has a way of specifying a
directory hierarchy of "stable" includes that it ignores for various
operations, including refactoring operations. By default, it puts all
the Visual Studio headers in this stable list.

I was thinking that a simple filter would be to look at the file
associated with the source location of the match and if that file
isn't one of the translation units mentioned in the compilation
database, then we would discount the match. This works well for .cpp
files, but obviously misses header files.

It sounds like it would be useful to have an AST matcher that provided
this sort of filtering so you could just include it as an additional
predicate in your matcher and not have to repeat the work of filtering
out "uninteresting" source locations.

I looked on the matchers page and didn't seem to see any matchers that
filter based on the file origin of a node in the AST.

We usually create all changes and filter either in the match callback
(return early if the source location is in an uninteresting file), or
after-the-fact.