About AST rewriting / manipulation

Hello everyone,
I am trying to use Clang as a source-to-source compiler. Through the API I've found the way to rewrite back the syntax tree into source code, and that's not difficult. However, before writing back the syntax tree I would like to manipulate the syntax tree in order to apply some code transformations.

For example I would like to rewrite something like f(a,b) into g(b, a, c)

Now I guess I should create the AST nodes I need (building a new CallExpr... object and so on...) and then substitute the old f(...) with the new g(...). The Clang API for creating AST nodes is nevertheless quite complex to use, it's really too demanding.

Now, I am wonder that It would be very nice if I could write the statement I want to substitute (or to add) as a string and then use the the Clang parser to create the syntax tree of the piece of code I have written in a way it can be easily plugged in the old main syntax tree (of course the new instance of the parser should be invoked considering the previous context...). Is it possible to have this kind of behavior? Or my imagination simply went too far? :slight_smile:

Any pointers / references related to this kind of topic are really welcome!

regards, Simone

Hello Simone,

I am trying to use Clang as a source-to-source compiler. Through the API
I've found the way to rewrite back the syntax tree into source code, and
that's not difficult. However, before writing back the syntax tree I
would like to manipulate the syntax tree in order to apply some code
transformations.

For example I would like to rewrite something like f(a,b) into g(b, a, c)

Okay.

Now I guess I should create the AST nodes I need (building a new
CallExpr... object and so on...) and then substitute the old f(...) with
the new g(...). The Clang API for creating AST nodes is nevertheless
quite complex to use, it's really too demanding.

Interesting. I guess the demanding part of the API is that you need to be careful to ensure that you build semantically-correct ASTs.

Now, I am wonder that It would be very nice if I could write the
statement I want to substitute (or to add) as a string and then use the
the Clang parser to create the syntax tree of the piece of code I have
written in a way it can be easily plugged in the old main syntax tree
(of course the new instance of the parser should be invoked considering
the previous context...). Is it possible to have this kind of behavior?

I believe it is possible to extend Clang to do this, but there is no API to do so right now. The parser can be handed a set of tokens and told to "go parse these" by calling into the appropriate parse function; we do this to implement some C++ semantics, such as inline definitions of member functions.

However, the hard part---that nobody has even thought about how to implement---is that you would need to be able to take an AST node and instruct the parser *and semantic analysis* to set its internal state to the point where that AST node was parsed. That means reconstructing the scope stack, the information about which identifiers bind to which declarations, and so on. Not all of this information is present in the AST, so this is a major undertaking.

  - Doug

Thanks for the response,
I've chosen the simpler way, I just create clang nodes from scratch and replace / insert these nodes with the old one in the syntax tree! At the end it's not so complex as I thought... but now I have another problem! I want to write back the modified syntax tree but no matter which kind of changes I made to the syntax tree... when I use the TokenRewriter:

const LangOptions &LangOpts = Ctx.getLangOptions();
TokenRewriter Rewriter(Ctx.getSourceManager().getMainFileID(), Ctx.getSourceManager(), LangOpts);

// Print out the output.
for (TokenRewriter::token_iterator I = Rewriter.token_begin(), E = Rewriter.token_end(); I != E; ++I)
        out << pp.getSpelling(*I);

I still get the original code! How can I rewrite a syntax tree to Source file? Is there something in the API that I am missing? Should I use the Rewriter? If yes, why the rewriter doesn't provide an InsertStmt method? With the current API I can add text... and replace statements... what about inserting statements? and... remove statements?

thanks, S. Pellegrini

Douglas Gregor wrote:

Thanks for the response,
I've chosen the simpler way, I just create clang nodes from scratch and replace / insert these nodes with the old one in the syntax tree! At the end it's not so complex as I thought... but now I have another problem! I want to write back the modified syntax tree but no matter which kind of changes I made to the syntax tree... when I use the TokenRewriter:

const LangOptions &LangOpts = Ctx.getLangOptions();
TokenRewriter Rewriter(Ctx.getSourceManager().getMainFileID(), Ctx.getSourceManager(), LangOpts);

// Print out the output.
for (TokenRewriter::token_iterator I = Rewriter.token_begin(), E = Rewriter.token_end(); I != E; ++I)
      out << pp.getSpelling(*I);

I still get the original code!

As I understand it, the TokenRewriter is only meant to be used to insert tokens into the token stream as it's being rewritten. It's not designed for AST-level rewrites.

How can I rewrite a syntax tree to Source file? Is there something in the API that I am missing? Should I use the Rewriter? If yes, why the rewriter doesn't provide an InsertStmt method? With the current API I can add text... and replace statements... what about inserting statements? and... remove statements?

I believe the Rewriter can do what you want, but it will take some effort. The rewriter allows you to insert/remove/replace text in the source file with any other text, using SourceLocations to specify where in the source file the modifications will be made. That's one approach to AST-based rewriting, since one can turn a statement/expression/declaration into a string and then insert it at the appropriate place in the source.

Clang has some functionality for pretty-printing statements and expressions as source code, but we haven't put a lot of effort into making sure that the result is well-formed C++. Depending on how complicated your statements and expressions are, this might take no work or it might take a lot of work.

  - Doug

Yes, the Rewriter does the job!
But the main problem with the Rewriter is that any change I apply to the syntax tree (let's say... insert text or node replacement) are not kept in the original syntax tree.

To me the Rewriter seems like a buffer that keeps track of the changes I made and when I say to print it back it merges the syntax tree with the changes I have provided. Is that correct? So, I will explain my problem... I have a source file and I want to analyze it and then apply (if it's worth doing it) a list of transformations considering a particular order! So... I want to be able to apply several transformations on the same AST. Using the rewriter it seems that I can apply one transformation... then produce a source file... parse it again... analyze and apply another transformation... and so on!
And that's it's real inefficient! :frowning:

I guess I could mix the usage of the TokenRewriter and the prettyPrint() method to rewrite portions of the AST I have changed... or have a look at the Rewriter implementation because as now it's not clear to me how it works! :slight_smile:

regards, S. Pellegrini

Douglas Gregor wrote:

Hi Simone,

A challenge with the rewriter is rewriting code that has already been rewritten. Rewriting an expression often involves synthesizing new expressions that don't have a SourceLocation (as a result, they can't be rewritten). The ObjC rewriter plays many "games" to work around this. The code for rewriting ObjCPropertyRefExpr's is particularly tricky.

Another limitation is you can't rewrite code that originates from a macro. If this is a problem, the workaround is to preprocess the code prior to rewriting it.

Another approach is to do all the transformations on the AST's and pretty print the result. The downside with this approach is: (a) The pretty printing is still incomplete (particularly for Decls). Expressions are in fairly good shape. (b) You will loose comments, macros.

The ObjC rewriter uses both rewriting and pretty printing (for expressions) to do it's job. It's currently the most complex usage of AST-based rewriting.

snaroff

We do this in the Caja compiler
(Google Code Archive - Long-term storage for Google Code Project Hosting.) using things we call
quasiliterals.

For the above you would match something like "f(@a, @b)" and output
something like "g(@b, @a, c)".

For some examples, see
Google Code Archive - Long-term storage for Google Code Project Hosting..

Unfortunately you can't reuse our code very easily because a) it is in
Java and b) it compiles Javascript, but I'm sure the ideas are very
similar for C/C++.

BTW, I should add a warning here - in Javascript this would break
evaluation order guarantees - I can't remember for sure if "," is a
sequence point (is that the right term?) in C, but if it is you'd have
to rewrite to something like "t1 = @a; t2 = @b; g(t2, t1, c)" to
preserve order.