AST transformations

I am trying to use Clang to analyze and modify source code at the AST
level. The class that seems most relevant for AST analysis is
RecursiveASTVisitor. I have seen some comments on this list indicating
that the AST is immutable once created. My understanding is that a
RecursiveASTVisitor would be called _after_ AST creation, making it
difficult or impossible to modify the AST using this interface.

What classes should I be looking at instead? I know that the
RewriteObjC class uses ASTConsumer; however, looking at the source
it's not immediately clear to me whether it is transforming the AST or
just inserting extra declarations/statements/etc. at the source level.

Any suggestions/comments would be much appreciated.

Here is more context if anyone is interested:

I am attempting to use Clang to analyze array access patterns. The
basic goal is to extract the array offset expression for each write or
read to an array. This by itself seems pretty trivial (find each
ArraySubscriptExpr node in the AST and print it out). However, an
added complication is that the expressions can only contain certain
types of values that can easily be reasoned about. The specific
context I am interested in is OpenCL, so the types of allowable values
are things like built-in OpenCL functions (get_global_id, etc.),
kernel arguments, constants, etc.

Example (I have removed much of the OpenCL-specific syntax for simplicity):

void kernel(int *output, int width) {
  int index = get_global_id(1) * width + get_global_id(0); //
get_global_id is a function that I can reason about
  output[index] = index;
}

For this kernel/function, the desired output would be something
similar to the following:
offset = get_global_id(1) * kernel_argument_1 + get_global_id(0);

The way I would solve this problem conceptually would be:
- Traverse the AST looking for assignments.
- When an assignment (LHS = RHS) is found, propagate the RHS down the
AST (in other words, replace any future uses of LHS with RHS).
- Continue until the children of all ArraySubscriptExpr nodes are from
a restricted class of values.

Obviously there are some other complications, like how to deal with
loops, etc. For now I am just worried about supporting some simple
kernels.

Thanks,
Michael

I am trying to use Clang to analyze and modify source code at the AST
level. The class that seems most relevant for AST analysis is
RecursiveASTVisitor. I have seen some comments on this list indicating
that the AST is immutable once created. My understanding is that a
RecursiveASTVisitor would be called _after_ AST creation, making it
difficult or impossible to modify the AST using this interface.

True. I've been assured that mutating the AST is very difficult to do
correctly, but I don't fully understand why. Perhaps it would be a
good idea to write up an FAQ explaining why clang doesn't support it,
with an example of why a simple local modification can break other
parts of the AST (ie switching the cmp condition).

What classes should I be looking at instead? I know that the
RewriteObjC class uses ASTConsumer; however, looking at the source
it's not immediately clear to me whether it is transforming the AST or
just inserting extra declarations/statements/etc. at the source level.

The recommended approach for transforming the program source is to
analyze the AST and use the source locations in the AST to make
textual, source-level edits.

Reid

Hi,

I am trying to use Clang to analyze and modify source code at the AST
level. The class that seems most relevant for AST analysis is
RecursiveASTVisitor. I have seen some comments on this list indicating
that the AST is immutable once created. My understanding is that a
RecursiveASTVisitor would be called _after_ AST creation, making it
difficult or impossible to modify the AST using this interface.

What classes should I be looking at instead? I know that the
RewriteObjC class uses ASTConsumer; however, looking at the source
it's not immediately clear to me whether it is transforming the AST or
just inserting extra declarations/statements/etc. at the source level.

Any suggestions/comments would be much appreciated.

I ran into the same problem, when I was looking for such tool. I think the RecursiveASTVisitor is meant to be a tool for traversing the tree not for transforming or mutating it. Thus it is very difficult to use it for such kind of stuff. However, you can have a look at the clang::(Stmt

Decl | Type)Visitor classes. They provide pretty powerful interface

and you can make your own transformation utility.

Here is what I did, I derive from clang::StmtVisitor and clang::DeclVisitor (both of them have RetTy), which I set to clang::Stmt* and clang::Decl* respectively. Then I traverse the TranslationUnit top-down and I can return whatever AST node I want. Like that you can replace the every specific type of AST node you want, when going up. Even you can send to the parent more complex structure (than clang::Stmt*, clang::Decl), containing complex context, which gives hints to the parent how to deal with the returned node.

Note that it is extremely dangerous, because you can break the semantics of the AST. Keep in mind, that if you plan to inject new artificially created nodes they wouldn't have SourceLocation, etc, which causes problem with the accuracy of the diagnostics, for example. Even more, if you inject new node, you should make Sema think that it actually comes from the Parser, which is difficult because you don't have source, right?

Cheers,
Vassil

Hi,

I don't know much about the objc rewriter, but the way that fixits
work is they emit a diagnostic that has a symbolic textual operation
on it, something like insert, replace, or delete text at a given
source range. There is no reparsing.

Reid

Hm... looks interesting. I will have a look. Thanks!

Here is what I did, I derive from clang::StmtVisitor and
clang::DeclVisitor (both of them have RetTy), which I set to
clang::Stmt* and clang::Decl* respectively. Then I traverse the
TranslationUnit top-down and I can return whatever AST node I want. Like
that you can replace the every specific type of AST node you want, when
going up. Even you can send to the parent more complex structure (than
clang::Stmt*, clang::Decl), containing complex context, which gives
hints to the parent how to deal with the returned node.

To a certain degree I've just sketched that approach in the "ConditionalOperator::setCond is gone" thread. However the problem with this approach is, that you you can't use all the neat means (e.g. rewriting, CFG creation) coming with clang anymore. How have you solved it?

Note that it is extremely dangerous, because you can break the semantics
of the AST. Keep in mind, that if you plan to inject new artificially
created nodes they wouldn't have SourceLocation, etc, which causes
problem with the accuracy of the diagnostics, for example.

Well, during and after my AST transformation phase there were no further needs for built-in diagnostics.

Even more, if
you inject new node, you should make Sema think that it actually comes
from the Parser, which is difficult because you don't have source, right?

Why should I have to deal with Sema at all? AST transformation and rewriting don't care about Sema.

Best regards
Olaf

In C++, there's very little that doesn't care about Sema. Any rewrite could require new implicit conversion sequences to be calculated, and unless you want to reimplement that code, you need to go through Sema.

There are, without doubt, several things that you can do without bothering Sema, but I believe that they are very limited in scope.

Sebastian

Here is what I did, I derive from clang::StmtVisitor and
clang::DeclVisitor (both of them have RetTy), which I set to
clang::Stmt* and clang::Decl* respectively. Then I traverse the
TranslationUnit top-down and I can return whatever AST node I want. Like
that you can replace the every specific type of AST node you want, when
going up. Even you can send to the parent more complex structure (than
clang::Stmt*, clang::Decl), containing complex context, which gives
hints to the parent how to deal with the returned node.

To a certain degree I've just sketched that approach in the
"ConditionalOperator::setCond is gone" thread. However the problem with
this approach is, that you you can't use all the neat means (e.g.
rewriting, CFG creation) coming with clang anymore. How have you solved it?

Yes you're right. I haven't solved it, but my case is very narrow and I don't need to deal with rewriting (which I believe it is not possible) and CFG creation.

Even more, if
you inject new node, you should make Sema think that it actually comes
from the Parser, which is difficult because you don't have source, right?

Why should I have to deal with Sema at all? AST transformation and
rewriting don't care about Sema.

Because I think it is much safer to let Sema create the nodes and not me :wink:
You are right that AST transformation and rewriting shouldn't care about sema in principle. However, in my opinion the clang AST is much closer to the source code representation than to an AST (defined in the books), because it contains a lot of information, which in any other case should be kept in a symbol table. I have read somewhere in clang docs that all this is intentional, so there isn't much that can be done.

Considering your other post "ConditionalOperator::setCond is gone" I don't see any easy way of implementing that. Unless there is another phase, which simplifies the current AST and provides mapping between the reduced and the full AST. However, I don't think this is a compiler task. Providing such functionality for few use-cases would be big performance trade-off.

It would be nice if the compiler provide hook for a tool, which enables these kind of transformations. Then the tool might have something like you said. For complex transformations like node removal, the tool can analyze and check if it make sense or not. Complex changes or node substitutions may involve cascade of changes, which the tool can perform. This, however, involves dependency analysis and as Aho says "dependency analysis is not trivial" :). I think that having that implemented would enable making clang AST immutable and even more safe than it is now.

Thanks for raising that discussion!
Cheers,
Vassil

Maybe you already can sense the response: do implicit conversion sequences belong to an AST at all? Of course, this doesn't solve the particular issue. However thinking about the concepts might point in the right direction.

Best Olaf

Without doubt, AST is a bit of a misnomer for the HLIR component of Clang. On the other hand, representing pure syntax leaves so much work to be done, and we would need the annotated tree anyway. At least most of the in-tree components work on the annotated tree and actually require that information.

Sebastian

Without doubt, AST is a bit of a misnomer for the HLIR component of
Clang.

HLIR? I guess "high level intermediate represantation"? The internet (in terms of google and wikipedia :slight_smile: is rather quiet.

On the other hand, representing pure syntax leaves so much work
to be done, and we would need the annotated tree anyway. At least most
of the in-tree components work on the annotated tree and actually
require that information.

I don't argue agains the AST in its current form (beside the naming). And I don't even argue against its immutability, since I fully understand the reasons behind that decision.
But would a lightweight component (cheap construction from original AST is required) be accepted in the clang universum, which emphasizes on syntax and which e.g. in the rewriter or maybe in the XML printer would replace the AST? It depends on the number of people using clang for source-to-source transformations and similar tasks. Judging from the questions on this list there are a few.

Best Olaf

I'm fairly surprised that nobody has brought up Sema's TreeTransform, which is a generic transformation on the Clang AST that allows subclasses to decide how to replace certain nodes with other nodes, and then rebuild the tree around that. It is currently used for template instantiation, rebuilding types for out-of-line definitions of class template members once we know what their semantic scope is, and for replacing "auto" with its deduced type.

Has TreeTransform been deemed inadequate for some reason? Or did people just not know about it?

  - Doug

Just a question in that direction: I am thinking about .ast -> DB -> Transformations -> DB -> .ast

Is it possible/reasonable idea ?

Kind regards,
Alek

It would be nice if the compiler provide hook for a tool, which enables
these kind of transformations. Then the tool might have something like

Just a question in that direction: I am thinking about .ast -> DB -> Transformations -> DB -> .ast

Is it possible/reasonable idea ?

Sorry for the stupid question but, what does DB stands for?

That is good point! I had have a look 6-7 months ago, but it didn't seem appropriate for my use-case then. Today when I looked at it again, I realized how much it has evolved. I found out that I did a lot of things in the exact same manner as it is done there. Yeah yeah..., I always learn thing in the most difficult way :wink: What does "replace certain nodes with other nodes" mean? It is not possible to insert/remove/modify an arbitrary node, is it?
I don't think that TreeTransform.h is public interface. I found it in lib/Sema/

Vassil

I'm fairly surprised that nobody has brought up Sema's TreeTransform, which is a generic transformation on the Clang AST that allows subclasses to decide how to replace certain nodes with other nodes, and then rebuild the tree around that. It is currently used for template instantiation, rebuilding types for out-of-line definitions of class template members once we know what their semantic scope is, and for replacing "auto" with its deduced type.

Has TreeTransform been deemed inadequate for some reason? Or did people just not know about it?

That is good point! I had have a look 6-7 months ago, but it didn't seem appropriate for my use-case then. Today when I looked at it again, I realized how much it has evolved. I found out that I did a lot of things in the exact same manner as it is done there. Yeah yeah..., I always learn thing in the most difficult way :wink: What does "replace certain nodes with other nodes" mean? It is not possible to insert/remove/modify an arbitrary node, is it?

Your subclass could identify the node you want to change, and replace it with a different node.

I don't think that TreeTransform.h is public interface. I found it in lib/Sema/

It isn't in the public interface. If it's going to see more general usage, it can be moved into the public interface.

  - Doug

Vassil Vassilev <vasil.georgiev.vasilev@...> writes:

> Just a question in that direction: I am thinking about .ast -> DB ->
> Transformations -> DB -> .ast
>
> Is it possible/reasonable idea ?
Sorry for the stupid question but, what does DB stands for?

Database. That lets you use a database schema for the AST.

I would have guess that but it seemed a bit strange to import the ast in a database...
I still don't understand. It is possible but the question is what would be the advantages of that? I guess you want to use a database schema for cascade deletion...
And how you would do the transformations in the database? Can you give us more details on what you want to do?

Vassil

Hi Vassil,

Vassil Vassilev<vasil.georgiev.vasilev@...> writes:

Just a question in that direction: I am thinking about .ast -> DB ->
Transformations -> DB -> .ast

Is it possible/reasonable idea ?

Sorry for the stupid question but, what does DB stands for?

Database. That lets you use a database schema for the AST.

I would have guess that but it seemed a bit strange to import the ast in
a database...
I still don't understand. It is possible but the question is what would
be the advantages of that? I guess you want to use a database schema for
cascade deletion...

As .ast I mean BC encoded serialization produced from the ASTWriter (clang -emit-ast).

In the past ten years, there are several projects following this (AST in DB) approach - two successful samples:

  * JTransformer [1] - open source, Eclipse plugin, based on SWI Prolog (the DB is a standard prolog fact store + indexes)
  * SemmleCode/.QL [2] - closed source, complies to SQL

For the proof of concept attempt, my proposal for DB/Query Language would be Berkeley DBXML/XQuery because:
  * XQuery, naturally operates on subtrees
  * Further, I think that some more specialized language (like TXL or Stratego) can be compiled to XQuery.
  * In this JunGL paper [3], the author states that his language is near (in terms of necessary characteristics) to XQuery.
  * XQuery is W3C Standard, there are many implementations, I personally think that for very large code bases, the right engine will be something based on Pathfinder [4]

As Siegfried said, the DB schema (XML schema/Relax NG in XMLDB case) can help for validation of DB import and/or the state of the trees after some transformation processing - this comes out-of-the box, but I am afraid that for full/sound validation, we will need to write additional modules in XQuery (because of need of semantic checks for refs between the nodes at least)

And how you would do the transformations in the database? Can you give
us more details on what you want to do?

I see two forms of transformations:
  * In-place, using XQuery Update
  * Projections using (often recursive) "constructor" functions: insert nodes your-module:ProjFunc1($base-node, $args) into $node)

Advantages:
  * Stratego (or even "low level" :slight_smile: XQuery) transformation can be sketched from almost everyone in several hours - equivalent (let's say final) TreeTransform based one will cost at least days for well trained in LLVM/Clang developer.
  * Relatively easy (customized) unparsing and other query based DB results, like stable C++ XML representation [*] for example.

What you think?

Kind regards,
Alek

[*] Douglas Gregor often says that the XML export of CLang AST need to be in standardized (to stable, not so parallel to current Clang ASTs) schema. I think that this is perfect goal, but can be achieved and supported more easy via XML->XML transformation of native Clang X.Y schema (using XSLT or XQuery), compared to C tree -> XML transformation mixed in phase of XML printing (using C++).

[1] http://sewiki.iai.uni-bonn.de/research/jtransformer/api/java/pefs/2.9/java_pef_overview
[2] Semmle - Wikipedia
[3] http://research.microsoft.com/pubs/79030/DPhil%20Thesis%20-%20Mathieu%20Verbaere.pdf
[4] https://db.inf.uni-tuebingen.de/research/past-projects/pathfinder/