AST XML dump

Hi @clang-community,

I've recently started a C(++)-source-to-source translation project. The project is open-source and thus all the components involved need to be open-source too.
The goal of the project is to semi-automatically rewrite and optimize parts of C(++) sources (C++ instead of C is a 'very-nice-to-have feature').
Currently I'm still in the evaluation phase. Obviously the first thing I need is a parser. Besides gcc I found clang, sparse and the Ravi C parser as open source projects. Are there any others out there?
Of these three projects clang seems to be the most promising for my needs, as it is already able to deal with some C++ (and according to cfe-commits the community is very active in this area).
However, one thing I badly need is a better AST dump. I'm now going to implement my own XML dump in clang since the current AST dump is hard to parse. Any objections? (Of course I can do what I want to do anyway, but in the end I naturally want to contribute my additions to the clang project, so I'd better ask.)

Best regards
Olaf Krzikalla

Having an easy to parse/read XML dump would be great!

The current AST dump is primarily used for debugging.

Thanks,

snaroff

Hi @clang-community,

I've recently started a C(++)-source-to-source translation project. The
project is open-source and thus all the components involved need to be
open-source too.
The goal of the project is to semi-automatically rewrite and optimize
parts of C(++) sources (C++ instead of C is a 'very-nice-to-have feature').
Currently I'm still in the evaluation phase. Obviously the first thing
I need is a parser. Besides gcc I found clang, sparse and the Ravi C
parser as open source projects. Are there any others out there?

If C++ is important to you, GCC and Clang are the only two that are likely to handle the whole C++ language in the "near" future.

From among these three projects clang seems to be the most promising
project for my needs as it is already able to deal with some C++ (and
according to cfe-commits the community is very active in this area).

Clang C++ is very active and support for C++ is improving rapidly. That said, we're still not to the point where we can handle even small applications (because we can't parse the C++ standard library headers yet). We're working on it!

However one thing I badly need is a better AST dump. I'm now going to
implement my own XML dump in clang since the current AST dump is hard to
parse. Any objections? (of course I can do what I want to do anyway, but
in the end I naturally want to contribute my additions to the clang
project - so I'd better ask).

A good XML dump for Clang's representation would be a great addition to Clang. You might consider looking at GCC-XML's output format, since it is a decent representation of the C++ language in XML. GCC-XML is here:

  GCC-XML

Matching the GCC-XML format means that other tools meant to work with GCC-XML (e.g., Pyste) could also work with Clang as a front end.

Since you mentioned source-to-source translation... Clang has a rewriter that allows you to replace certain parts of the source code (based on, e.g., a source range) with other code, without disturbing comments or formatting. If you're doing small, targeted rewrites in your source-to-source translation, it's something else to look at (and you won't need XML output to do it). Check out the Objective-C rewriter in the Clang source tree for an example (tools/clang-cc/RewriterObjC.cpp).
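To make the range-based rewriting idea concrete, here is a minimal, self-contained sketch. It is not Clang's rewriter API: `Replacement` and `applyReplacements` are hypothetical names, and plain character offsets stand in for Clang source ranges. It only illustrates why text outside the replaced ranges (comments, formatting) survives untouched.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type for illustration only; this is not Clang's rewriter API.
struct Replacement {
  std::size_t Begin;   // offset of the first character to replace
  std::size_t Length;  // number of characters to replace
  std::string Text;    // new text for that range
};

// Applies non-overlapping replacements to Source. Everything outside the
// replaced ranges is copied through unchanged.
std::string applyReplacements(std::string Source,
                              std::vector<Replacement> Edits) {
  // Apply from the back of the buffer so earlier offsets stay valid.
  std::sort(Edits.begin(), Edits.end(),
            [](const Replacement &A, const Replacement &B) {
              return A.Begin > B.Begin;
            });
  for (const Replacement &E : Edits)
    Source.replace(E.Begin, E.Length, E.Text);
  return Source;
}
```

Sorting the edits back to front is a design choice of this sketch; it avoids having to adjust offsets after each replacement changes the buffer length.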

  - Doug

Hi @clang-community,

I’ve recently started a C(++)-source-to-source translation project. The
project is open-source and thus all the components involved need to be
open-source too.
The goal of the project is to semi-automatically rewrite and optimize
parts of C(++) sources (C++ instead of C is a ‘very-nice-to-have feature’).
Currently I’m still in the evaluation phase. Obviously the first thing
I need is a parser. Besides gcc I found clang, sparse and the Ravi C
parser as open source projects. Are there any others out there?

ELSA is yet another C/C++ parser. It implements C++ better than the current clang. However, it is limited to a very small group, and its major development has evidently been over since 2006.
FYI, it already has an XML serializer and deserializer for the AST.
http://www.cubewano.org/oink/wiki/ElsaFeatures

Hi @clang-community,

so now the first version is ready for committing. I've attached the appropriate patch.
It would be nice if someone could review it in the very near future and comment on it.
So far I'm going to keep the direction of the current work.

Douglas Gregor wrote:

GCC-XML

Matching the GCC-XML format means that other tools meant to work with GCC-XML (e.g., Pyste) could also work with Clang as a front end.

I tried to keep it as similar as possible to GCC-XML. However, some things I couldn't keep, and other things I won't keep.
For instance, I encapsulate the type and context references at the end of the document in separate sections. But IMHO it should be easy to adapt tools working with GCC-XML to clang-XML, since mainly the document structure has changed (it got more structured ;-).
There are still a lot of things to do, and the patch isn't meant to be complete even for C. But it should give a good starting point for discussions and further work anyway.

Best
Olaf Krzikalla

firstXML.patch (43.1 KB)

Olaf Krzikalla wrote:

Hi @clang-community,

so now the first version is ready for committing. I've attached the
appropriate patch.
It would be nice if someone could review it in the very near future
and comment on it.
So far I'm going to keep the direction of the current work.

I think this should be a visitor, not a modification of the AST classes
themselves.

Sebastian

Thanks for working on this. I have a couple comments/questions:

1) Why do we need to build an in-memory representation of the XML document just to print it? The in-memory representation itself isn't likely to be useful for much, since we're not providing any way of directly manipulating the in-memory representation. Would it be possible, instead, to stream the XML representation to disk directly, rather than building it all in memory? Doing so would save a lot of memory and should improve performance, since we won't be doing so much string copying and manipulation. Also in the same vein: there are quite a few std::map's to strings. These should probably be llvm::DenseMaps, and is it possible to map to something more efficient than a std::string?

2) I see that you've added a dumpXML routine into Stmt. Is there some benefit to having this as a method on Stmt, or can we just separate the XML-dumping functionality completely, putting it all into an AST consumer?

3) I feel that it is very important that we have a proper XML schema before we claim to generate good XML. It's also important for testing: we'd like to be able to, e.g., generate XML from some translation unit and then verify that the XML meets the known schema.

4) I can't build with this patch because of some errors (mentioned below). Could you post some sample input/output so we can get a feel for the structure?

More detailed comments follow:

+//---------------------------------------------------------
+AttributeXML::AttributeXML(const char* pName, unsigned value) :
+ Name(pName)
+{
+ char buffer[32];
+ Value = _ultoa(value, buffer, 10);
+}

include/llvm/ADT/StringExtras.h has some integer-to-string facilities. I suggest using those instead of _ultoa et al.
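For reference, `_ultoa` is a compiler-specific extension and not part of standard C or C++, which is one reason to avoid it. The sketch below shows the same constructor written portably. It uses `std::to_string` from current standard C++ as a stand-in so that it compiles without LLVM; inside Clang the helpers from StringExtras.h would be the natural choice. `AttributeSketch` is a hypothetical stand-in for the class in the patch.

```cpp
#include <string>

// Illustrative stand-in for the attribute class in the patch.
struct AttributeSketch {
  std::string Name;
  std::string Value;

  // Converts the number without a fixed-size buffer or _ultoa.
  AttributeSketch(const char *pName, unsigned value)
      : Name(pName), Value(std::to_string(value)) {}
};
```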

+//---------------------------------------------------------
+PresumedLoc DocumentXML::addLocation(const SourceLocation& Loc)
+{
+ SourceManager& SM = Ctx->getSourceManager();
+ SourceLocation SpellingLoc = SM.getSpellingLoc(Loc);
+ PresumedLoc PLoc;
+ if (!SpellingLoc.isInvalid())
+ {
+ PLoc = SM.getPresumedLoc(SpellingLoc);
+ addSourceFileAttribute(PLoc.getFilename());
+ addAttribute("line", PLoc.getLine());
+ addAttribute("col", PLoc.getColumn());
+ }
+ // else there is no error in some cases (eg. CXXThisExpr)
+ return PLoc;
+}

Do we need to think about source locations that point into macro instantiations here?

+#ifndef mode_t
+typedef unsigned short mode_t;
+#endif

Why do we need this?

+//---------------------------------------------------------
+template<class T, class U, class V>
+bool addToMap(std::map<T, std::string, U>& idMap, const V& value, tIdType idType = ID_NORMAL)
+{
+ std::map<T, std::string, U>::iterator i = idMap.find(value);
+ bool toAdd = i == idMap.end();
+ if (toAdd)
+ {
+ idMap.insert(std::map<T, std::string, U>::value_type(value, getNewId(idType)));
+ }
+ return toAdd;
+}

You need "typename" before std::map<T, std::string, U>::iterator and std::map<T, std::string, U>::value_type.
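A corrected version might look like the sketch below. `tIdType`, `ID_NORMAL`, and `getNewId` are used by the patch but not shown in the excerpt, so the definitions here are hypothetical stand-ins that only exist to make the example compile on its own.

```cpp
#include <map>
#include <string>

// Stand-ins for names the patch uses but the excerpt doesn't define.
enum tIdType { ID_NORMAL };

// Hypothetical id generator: returns "_1", "_2", ...
inline std::string getNewId(tIdType) {
  static unsigned Next = 0;
  return "_" + std::to_string(++Next);
}

template <class T, class U, class V>
bool addToMap(std::map<T, std::string, U> &idMap, const V &value,
              tIdType idType = ID_NORMAL) {
  // "typename" is needed because iterator is a dependent name: the compiler
  // cannot know it is a type until T and U are known.
  typename std::map<T, std::string, U>::iterator i = idMap.find(value);
  bool toAdd = i == idMap.end();
  if (toAdd) {
    // Same for value_type, which also depends on T and U.
    idMap.insert(typename std::map<T, std::string, U>::value_type(
        value, getNewId(idType)));
  }
  return toAdd;
}
```

Some compilers accept the original code without `typename` as an extension, which would explain why it built for the author but not everywhere.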

+ for (std::map<const Type*, std::string>::const_iterator i = BasicTypes.begin(), e = BasicTypes.end(); i != e; ++i)
+ {
+ // don't use the get methods as they strip of typedef infos
+ if (const BuiltinType *BT = dyn_cast<BuiltinType>(i->first)) {
+ addSubNode("FundamentalType");
+ addAttribute("name", BT->getName());
+ }
+ else if (const PointerType *PT = dyn_cast<PointerType>(i->first)) {
+ addSubNode("PointerType");
+ addTypeAttribute(PT->getPointeeType());
+ }

It might be better to use the macro invocations in include/clang/AST/TypeNodes.def to enumerate and switch on the various kinds of type nodes. It's generally cleaner, and protects this code better against changes in the type hierarchy.
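The general technique is sketched below with a made-up node list. `SKETCH_TYPE_NODES`, `TypeKind`, and `xmlTagFor` are hypothetical; the real TypeNodes.def has its own entries and macro parameters, and is consumed via `#include` rather than a macro argument. The point is only that one list generates both the enumeration and the switch.

```cpp
#include <string>

// Hypothetical node list in the style of a .def file: each entry names a
// type node and the XML tag to emit for it.
#define SKETCH_TYPE_NODES(X)      \
  X(Builtin, "FundamentalType")   \
  X(Pointer, "PointerType")       \
  X(Array, "ArrayType")

// First expansion: one enumerator per node.
#define MAKE_ENUMERATOR(Name, Tag) Name,
enum class TypeKind { SKETCH_TYPE_NODES(MAKE_ENUMERATOR) };
#undef MAKE_ENUMERATOR

// Second expansion: one switch case per node. Adding a node to the list
// extends the enum and this switch together, with no dyn_cast chain to edit.
inline std::string xmlTagFor(TypeKind K) {
  switch (K) {
#define MAKE_CASE(Name, Tag) \
  case TypeKind::Name:       \
    return Tag;
    SKETCH_TYPE_NODES(MAKE_CASE)
#undef MAKE_CASE
  }
  return "UnknownType"; // not reached for valid enumerators
}
```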

  - Doug

Hi,

Thanks for your comments. I now understand why Sebastian objected in the first place. I initially used the organization that was used for AST dumping.
But I've changed that, and the patch is now minimally intrusive. Attached is another patch that addresses some of the comments and should now compile out of the box (tested with VC and cygwin) with the current revision. I also attached the output for ast-printing.c as found in the test suite. As you can see, the _Complex type is one of the features not implemented yet.

Douglas Gregor wrote:

1) Why do we need to build an in-memory representation of the XML document just to print it? The in-memory representation itself isn't likely to be useful for much, since we're not providing any way of directly manipulating the in-memory representation. Would it be possible, instead, to stream the XML representation to disk directly, rather than building it all in memory? Doing so would save a lot of memory and should improve performance, since we won't be doing so much string copying and manipulation. Also in the same vein: there are quite a few std::map's to strings. These should probably be llvm::DenseMaps, and is it possible to map to something more efficient than a std::string?

2) I see that you've added a dumpXML routine into Stmt. Is there some benefit to having this as a method on Stmt, or can we just separate the XML-dumping functionality completely, putting it all into an AST consumer?

Most of these things are done by now.

3) I feel that it is very important that we have a proper XML schema before we claim to generate good XML. It's also important for testing: we'd like to be able to, e.g., generate XML from some translation unit and then verify that the XML meets the known schema.

Documentation is indeed the most important TODO. However, I'm not completely convinced by the current format yet. I had some hope that there is already an existing inter-language de-facto standard for ASTs, but besides GENERIC I didn't find anything. And at first glance GENERIC seems to be too low-level, at least for my purposes.

It might be better to use the macro invocations in include/clang/AST/TypeNodes.def to enumerate and switch on the various kinds of type nodes. It's generally cleaner, and protects this code better against changes in the type hierarchy.

That's another TODO. In fact I don't like all the dyn_cast chains either.

Best
Olaf Krzikalla

ast-printing.xml (40 KB)

firstXML.patch (42 KB)

Olaf Krzikalla wrote:

Hi,

Thanks for your comments. I now understand why Sebastian objected in
the first place. I initially used the organization that was used for
AST dumping.
But I've changed that, and the patch is now minimally intrusive.
Attached is another patch that addresses some of the comments and should
now compile out of the box (tested with VC and cygwin) with the current
revision. I also attached the output for ast-printing.c as found in the
test suite. As you can see, the _Complex type is one of the features not
implemented yet.

I like it, but I think you should now move your files to lib/Frontend
instead of lib/AST. Other than that, I think it's ready for commit,
pending approval from Doug.

Sebastian

Hi Olaf,

Thanks for your comments. I now understand why Sebastian objected in the first place. I initially used the organization that was used for AST dumping.
But I've changed that, and the patch is now minimally intrusive.

Great!

Attached is another patch that addresses some of the comments and should now compile out of the box (tested with VC and cygwin) with the current revision. I also attached the output for ast-printing.c as found in the test suite. As you can see, the _Complex type is one of the features not implemented yet.

Okay, looks good. I've committed your patch, after making two small adjustments:

   - Removed order_QualType; we already had a similar QualTypeOrdering in clang/AST/TypeOrdering.h
   - Moved the XML-dumping functionality into Frontend, as Sebastian suggested.

Douglas Gregor wrote:

1) Why do we need to build an in-memory representation of the XML document just to print it? The in-memory representation itself isn't likely to be useful for much, since we're not providing any way of directly manipulating the in-memory representation. Would it be possible, instead, to stream the XML representation to disk directly, rather than building it all in memory? Doing so would save a lot of memory and should improve performance, since we won't be doing so much string copying and manipulation. Also in the same vein: there are quite a few std::map's to strings. These should probably be llvm::DenseMaps, and is it possible to map to something more efficient than a std::string?

This is still an interesting question for me, but it's not critical.

2) I see that you've added a dumpXML routine into Stmt. Is there some benefit to having this as a method on Stmt, or can we just separate the XML-dumping functionality completely, putting it all into an AST consumer?

Most of these things are done by now.

Looks good.

3) I feel that it is very important that we have a proper XML schema before we claim to generate good XML. It's also important for testing: we'd like to be able to, e.g., generate XML from some translation unit and then verify that the XML meets the known schema.

Documentation is indeed the most important TODO. However, I'm not completely convinced by the current format yet. I had some hope that there is already an existing inter-language de-facto standard for ASTs, but besides GENERIC I didn't find anything. And at first glance GENERIC seems to be too low-level, at least for my purposes.

It doesn't look like there is such a standard, unfortunately.

It might be better to use the macro invocations in include/clang/AST/TypeNodes.def to enumerate and switch on the various kinds of type nodes. It's generally cleaner, and protects this code better against changes in the type hierarchy.

That's another TODO. In fact I don't like all the dyn_cast chains either.

Okay.

  - Doug

It might be better to use the macro invocations in include/clang/AST/TypeNodes.def to enumerate and switch on the various kinds of type nodes.

I noticed that the naming of the parents isn't consistent among the defines: some use the Type suffix, others don't. This makes the parent names in TypeNodes.def rather unusable (and currently they aren't used anywhere). As the base type is named "Type", I guess we need the Type suffix everywhere else.

Best
Olaf Krzikalla

Olaf Krzikalla wrote:

I noticed that the naming of the parents isn't consistent among the defines: some use the Type suffix, others don't. This makes the parent names in TypeNodes.def rather unusable (and currently they aren't used anywhere). As the base type is named "Type", I guess we need the Type suffix everywhere else.
  

Patch attached.

Best
Olaf Krzikalla

TypeNodes.patch (832 Bytes)

Committed, thanks!

  - Doug

Hi @clang,

I've created a new patch pushing the XML development forward. I now use visitors and .def files for the XML structure definition. The .def files also serve as documentation.
Automatic creation of an XSD file from the .def files should be possible (maybe we need some small adjustments).
The patch seems to have one small problem with a missing newline at the end of TypeVisitor.h. I checked the file here and the newline is there, so I don't know what causes it. Otherwise it compiles fine with VC and cygwin.
It would be nice if someone could check it, comment on it and eventually commit it so I can keep in sync with the overall development.

Best
Olaf

secondXML.patch (87.1 KB)

Hello Olaf,

Hi @clang,

I've created a new patch pushing the XML development forward. I now use visitors and .def files for the XML structure definition. The .def files also serve as documentation.

Wonderful! This is very, very clean.

Automatic creation of an XSD file from the .def files should be possible (maybe we need some small adjustments).

Excellent. I think this will work out really, really well.

The patch seems to have one small problem with a missing newline at the end of TypeVisitor.h. I checked the file here and the newline is there, so I don't know what causes it. Otherwise it compiles fine with VC and cygwin.
It would be nice if someone could check it, comment on it and eventually commit it so I can keep in sync with the overall development.

Sorry for the long delay; I've committed your changes now.

  - Doug

Douglas Gregor wrote:

Hello Olaf,

The patch seems to have one small problem with a missing newline at
the end of TypeVisitor.h. I checked the file here and the newline is
there, so I don't know what causes it. Otherwise it compiles fine with
VC and cygwin.
It would be nice if someone could check it, comment on it and
eventually commit it so I can keep in sync with the overall
development.
    
Sorry for the long delay; I've committed your changes now.
  

There goes my pipeline. But I'm glad Olaf doesn't have to wait another
two weeks.

Sebastian