Clarification for term "AST"

Hello,

I would like to continue the discussion on open issues which I tried to point
out in a bug tracker.
http://llvm.org/bugs/show_bug.cgi?id=15254#c0

How do you think about to distinguish the involved terms from computer science
and their application in the evolving software a bit more so that potential
misunderstandings can be avoided?

Regards,
Markus

http://llvm.org/bugs/show_bug.cgi?id=15254#c0

How do you think about to distinguish terms (and their relationships) like the
following a bit more in your application programming interfaces?

- TS: token stream
- AST: abstract syntax tree
- ASG: abstract semantic graph
- CFG: control flow graph

Regards,
Markus

Hi, Markus. One thing to take from this lack of response is that we're all very busy and just don't always respond to e-mails in a timely way. But another is probably what you first thought: making this change is not interesting to the majority of active Clang developers.

We don't have "token stream" as a first-class concept. C languages can get their tokens from many places—a file, a macro, a pretokenized header (essentially deprecated but still working, I think). Rather than deal with "token streams", we have PreprocessorLexer and its subclasses, which produce tokens from various kinds of input. The lexers do have different input sources, but these are character buffers (or rather, byte buffers) rather than streams of any kind.

Yes, our AST classes do not form a tree, nor are they strictly for syntax. At the same time, they do not exactly form a general graph either, and they are not strictly for semantics. Some examples:
- We preserve parentheses in our "AST" in the form of ParenExprs.
- We include implicit transformations like lvalue-to-rvalue conversions in the form of ImplicitCastExprs.
- A DeclRefExpr may refer to a Decl that encloses the current DeclContext, such as using a C++ class name in an implemention of one of its methods.
- Some nodes can be accessed through multiple "child" paths, such as the condition and the consequent in the GNU binary ?: extension expression. (These are wrapped in OpaqueValueExprs to prevent from double-traversal from naive tools.)
- Nodes do not have a common type—the Big Three are Decl, Type (and its wrapper QualType), and Stmt, but there are several other classes which may or may not be considered "AST nodes" (CXXBaseSpecifier, CXXCtorInitializer, ObjCDictionaryElement...) but are certainly important parts of the "AST".

What we have doesn't match up to a computer-science-theoretical notion of an AST or an ASG, but we don't have a better name. We're not really looking for one, either, since it doesn't seem to be causing trouble for anyone and no one has complained about it until now.

We have a CFG and it is called CFG. :slight_smile: (Actually, we have two—one for Clang, and one for LLVM's SSA-based IR. The former is only used for smart warnings and the static analyzer, though.)

Sorry, but I think this one is "not to be fixed".
Jordan

Sorry, but I think this one is "not to be fixed".

I thank you very much for your detailed response.

I find that you mention a couple of issues where interfaces from different
abstraction levels are affected in unclear ways. I guess that this situation
results into some challenges for a better understanding of involved
implementation details and increases the potential for unpleasant mistakes.

I hope that further constructive discussions will help to clarify and improve
the software design at various places.

Regards,
Markus

Hi Markus,

The problem as I see it is that building a “pure” AST of C++ (and C) is just impossible because to know whether you have a variable declaration or a functional call you require basic semantic analysis already. This is a defect of the specification of the grammar itself, and it will not be solved now (it would require completely reworking the grammar which is obviously impractical given the number of programs written).

Therefore, you’ll always have “kind-of” a semantic tree, without a dedicated Abstract Syntax Tree level.

– Matthieu

The problem as I see it is that building a "pure" AST of C++ (and C) is just impossible

I have got a different opinion. - Corresponding solutions will need more efforts
for further separation of concerns, won't they?

because to know whether you have a variable declaration or a functional call
you require basic semantic analysis already.

I guess that you refer to ambiguities in the programming language(s) like they
were also described by software researchers "Edward D. Willink" and "Adrian D.
Thurston".

This is a defect of the specification of the grammar itself, and it will not be solved now
(it would require completely reworking the grammar which is obviously impractical given
the number of programs written).

I assume that there happened some discussions on this already.

Therefore, you'll always have "kind-of" a semantic tree,

I would like to suggest a different approach.

without a dedicated Abstract Syntax Tree level.

I guess that it is another software design challenge to ensure clear interfaces
between the involved abstraction levels.

Regards,
Markus

The problem as I see it is that building a "pure" AST of C++ (and C) is just impossible

I have got a different opinion. - Corresponding solutions will need more efforts
for further separation of concerns, won't they?

Yes. A complete separation of concerns is impossible though, and
Clang's implementation is basically at the edge of that limit.

because to know whether you have a variable declaration or a functional call
you require basic semantic analysis already.

I guess that you refer to ambiguities in the programming language(s) like they
were also described by software researchers "Edward D. Willink" and "Adrian D.
Thurston".

I haven't looked in detail at their work, sorry. Maybe you could
summarize their findings?

This is a defect of the specification of the grammar itself, and it will not be solved now
(it would require completely reworking the grammar which is obviously impractical given
the number of programs written).

I assume that there happened some discussions on this already.

Yes, but the existing grammar of the language is not open for debate.
There are billions of lines of code written against the existing
grammar. Grammar modifications at this point are only going to be
additive.

Therefore, you'll always have "kind-of" a semantic tree,

I would like to suggest a different approach.

Please do! We are always interested in improving our codebase.
However, I think that as you gain a better understanding of the
complexities of the task that Clang performs you will come to
appreciate the current design as having quite a clear separation of
concerns.

without a dedicated Abstract Syntax Tree level.

I guess that it is another software design challenge to ensure clear interfaces
between the involved abstraction levels.

Indeed. It has required a lot of effort to improve our interfaces to
the clarity that they are at now.

-- Sean Silva

To be blunt, invasive refactors based on a non-contributor's sense of
abstractly good software-engineering practices are not welcome.
Understand the problems that our data structures are designed to solve
before trying to tell us the right way to do things.

John.

A complete separation of concerns is impossible though,

I get the feeling here that your view presents some limitations which I try to reduce.

and Clang's implementation is basically at the edge of that limit.

I am curious how further adjustments will make corresponding software evolution more interesting.

Maybe you could summarize their findings?

Would you like to consider information from documents like the following again?
* Adrian D. Thurston:
   A Computer Language Transformation System Capable
of Generalized Context-Dependent Parsing
   http://hdl.handle.net/1974/1629
   http://www.complang.org/colm/thurston-phdthesis.pdf

* Edward D. Willink:
   Meta-Compilation for C++
   http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.html
   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.157.1250

* Nicholas A. Kraft , Brian A. Malloy, James F. Power:
   A Tool Chain for Reverse Engineering C++ Applications
   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.4682
   http://g4re.sourceforge.net/

Yes, but the existing grammar of the language is not open for debate.

The debate will continue just by the fact that a lot of software designers and developers struggle with the proper handling of context-sensitive/dependent (programming) languages (like with C++).
http://stackoverflow.com/questions/6088064/context-sensitivity-vs-ambiguity

However, I think that as you gain a better understanding of the
complexities of the task that Clang performs you will come to appreciate
the current design as having quite a clear separation of concerns.

I imagine that further fine-tuning will be useful at various places.

Regards,
Markus

To be blunt, invasive refactors based on a non-contributor's sense of abstractly good software-engineering practices are not welcome.

I understand that you categorise me as a stranger in the Clang software development community. But I am still interested in incremental improvements for your software infrastructure.

Understand the problems that our data structures are designed to solve
before trying to tell us the right way to do things.

I do that to some degree already. I pick up some ideas from various information sources. So I dare to point out open issues even as an "outsider" who is looking for better solutions.

Regards,
Markus

The problem as I see it is that building a "pure" AST of C++ (and C) is just impossible

I have got a different opinion. - Corresponding solutions will need more efforts
for further separation of concerns, won't they?

[snip]

I guess that it is another software design challenge to ensure clear interfaces
between the involved abstraction levels.

To be blunt, invasive refactors based on a non-contributor's sense of
abstractly good software-engineering practices are not welcome.
Understand the problems that our data structures are designed to solve
before trying to tell us the right way to do things.

Great advice.

Now, where are those problems described and
where is the rationale for choosing the clang data structures
to solve those problems described? With that information,
maybe Markus would have a better idea about why his
suggestions wouldn't work or maybe steer him to
an alternative solution.

John.

Regards,
Larry

Let me preface by saying that most of my use of Clang is to do things that are strictly orthogonal from actually compiling C++ code.

I can't say that I can think of a single time where I would ever want the raw parsed AST with no semantic analysis done on it. The kinds of queries you want from a C++ programs tend to start with things like "find all variables" or "find all functions", yet faithfully following the grammar (and not combining levels of the tree) would result in a lot of boilerplate code to do unwrapping, especially when you consider the contortions needed to represent something like "int (*foo(int))(double)" as a proper tree. From my standpoint as a user, anytime I have to duplicate Clang's reasoning of logic, it is a bug that Clang does not give me an API to do that.

Theoretical purity isn't a particularly useful design goal, since true purity is often (IMHO) rather pointless. C++11 defines 9 phases of translations, of which the first 6 can be described as lexing, the 7th "parse, semantic analysis, code generation, and optimization," the 8th linking, and the final one the runtime loader. Yet there is no practical reason to, for example, output the source program exactly as it is right after folding lines that end in a backslash. Indeed, preserving the ability to do this action would unnecessarily complicate later pipelines (you would need to attach cumbersome metadata to source files prior to actual tokenization, the first step of which happens in the subsequent phase).

Another example I can give is that the grammar represents precedence by explicitly doing the factoring itself, which means that you would need, in order to achieve theoretical purity, to wrap every multiplicative expression in a node which is an unary additive expression. Ignoring this does fail to achieve theoretical purity, since you are representing the AST in a manner not identical to the grammar; however, by sacrificing purity, you are making everybody's lives very much easier.

I can't say that I can think of a single time where I would ever want the raw parsed AST with no semantic analysis done on it.

Are you interested in detailed source code transformations after a specific analysis on a control flow graph?
Do you see use cases to drill down from higher abstraction levels to the abstract syntax tree to generate some adjustments?

How do you think about consequences and challenges like they are demonstrated by the semantic patch language from the software "Coccinelle"?
http://coccinelle.lip6.fr/sp.php

Regards,
Markus

I can't say that I can think of a single time where I would ever want the raw parsed AST with no semantic analysis done on it.

Are you interested in detailed source code transformations after a specific analysis on a control flow graph?
Do you see use cases to drill down from higher abstraction levels to the abstract syntax tree to generate some adjustments?

The short answer here is "no", because viewing refactorings as transformations on ASTs is wrong. Refactoring is really transforming source text to source text guided by high-level semantic rules; treating it as AST implementations will cause lots of grief when you try to reserialize that AST, thanks in large part to the complications of macros and templates (macros++).

How do you think about consequences and challenges like they are demonstrated by the semantic patch language from the software "Coccinelle"?
http://coccinelle.lip6.fr/sp.php

Semantic patching is very seductive, but it's a poor substitute when it comes to complicated refactorings. The use case I played with a few years ago was rewriting a C API into C++, which requires some semantic rules cumbersome to produce in an spatch-like format. The best example I have is "make the first argument to be 'derived' from MimeObject be the implicit this argument." Another one is "every object which is the implicit this argument cast to some other type is really just the implicit this argument, if the cast is a cast to the current class type or some superclass thereof".

And the clang AST actually works fine for building tools like that on top -
we're doing similar stuff on a daily basis based on the AST matchers and
Tooling/Refactoring libs.

If anybody really wanted to spend some time "fixing" the AST, I'd suggest
to look at how to introduce an ExpressionStatement instead of having all
Expressions be Statements :slight_smile:

Cheers,
/Manuel

And the clang AST actually works fine for building tools like that on top - we're doing similar stuff on a daily basis based on the AST matchers and Tooling/Refactoring libs.

Would you like to add any news to the plans you described in the presentation "Refactoring C++ with Clang" on a conference?
http://llvm.org/devmtg/2012-04-12/

Can I read a bit more about the domain-specific language(s) you developed for the involved use cases?

Regards,
Markus

And the clang AST actually works fine for building tools like that on top

- we're doing similar stuff on a daily basis based on the AST matchers and
Tooling/Refactoring libs.

Would you like to add any news to the plans you described in the
presentation "Refactoring C++ with Clang" on a conference?
http://llvm.org/devmtg/2012-**04-12/ <http://llvm.org/devmtg/2012-04-12/>

Yes, I'm planning to submit a talk proposal...

Can I read a bit more about the domain-specific language(s) you developed
for the involved use cases?

http://clang.llvm.org/docs/LibASTMatchers.html
http://clang.llvm.org/docs/LibASTMatchersReference.html

You can also take a look at the tests in
unittests/ASTMatchers/ASTMatchersTests.cpp in the clang source tree.

http://clang.llvm.org/docs/LibASTMatchers.html

Would you like to distinguish a bit more if the matching process is performed on a syntactic or semantic abstraction level?

Do you combine such "AST" matchers into a computation tree logic variant?

Regards,
Markus

http://clang.llvm.org/docs/**LibASTMatchers.html<http://clang.llvm.org/docs/LibASTMatchers.html>

Would you like to distinguish a bit more if the matching process is
performed on a syntactic or semantic abstraction level?

The matching is ont he clang AST, which is a fully type-resolved AST. Does
that answer it?

Do you combine such "AST" matchers into a computation tree logic variant?

I'm not sure I understand what exactly you ask, but we're combining
matchers into a tree that than is matched against the AST (which is more of
a graph than a tree :wink:

Cheers,
/Manuel

I'm not sure I understand what exactly you ask, but we're combining matchers into a tree that than is matched against the AST (which is more of a graph than a tree :wink:

Can higher level analysis and software refactoring be performed by the means from variants of the knowledge area "computation tree logic"?

Would you like to map "ASG" matchers to a semantic patch language?

Regards,
Markus