Parser Design?

Hi all,

I’m new to the mailing list (been lurking a few weeks). I just compiled the code and started looking at it.

At this point in life, I’m working with software analysis. In the past I’ve written interpreters for Domain Specific Languages, done code analysis, and written a lot of C and C++.

My desire is to help clang provide for software analysis what Parasoft’s C++Test does, but with a better scripting language than the “symbolic” language that C++Test provides. Most importantly, clang’s licensing will allow anyone to make use of it without costing several body parts. I’d love to be able to write language analysis rules that would be the equivalent of:
“flag any places where there is a missing (copy constructor | assignment operator) when there is in any of (base class | contained class | base-of-contained class) a non-POD data type.”

Once a set of predicates was written in such a scripting language, it could be evaluated against a stored form of the AST, and those parts of software analysis which can be automated could then be put into the test process. Even the parts which are harder and require a CFG could be aided by such an approach, instead of hand-coding each query into a distinct binary.
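
To make the rule above concrete, here is a rough sketch of it as a predicate over a stored class table. Everything in it is an assumption for illustration: ClassInfo, ClassTable, missingCopyOps and the example classes are hypothetical names, not clang's AST API and not the proposed scripting language itself.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified record for one class in a stored AST.
struct ClassInfo {
    bool isPod;                        // class is a POD type
    bool hasCopyCtor;                  // user-declared copy constructor
    bool hasCopyAssign;                // user-declared copy assignment operator
    std::vector<std::string> bases;    // direct base classes, by name
    std::vector<std::string> members;  // class types contained by value, by name
};

using ClassTable = std::map<std::string, ClassInfo>;

// True if the named class is in the table and is not a POD type.
bool isNonPod(const ClassTable& table, const std::string& name) {
    auto it = table.find(name);
    return it != table.end() && !it->second.isPod;
}

// Flag a class that lacks a copy constructor or assignment operator while a
// base class, a contained class, or a base of a contained class is non-POD.
bool missingCopyOps(const ClassTable& table, const std::string& name) {
    auto it = table.find(name);
    if (it == table.end()) return false;
    const ClassInfo& cls = it->second;
    if (cls.hasCopyCtor && cls.hasCopyAssign) return false;

    for (const auto& base : cls.bases)
        if (isNonPod(table, base)) return true;

    for (const auto& member : cls.members) {
        if (isNonPod(table, member)) return true;
        auto mit = table.find(member);
        if (mit == table.end()) continue;
        for (const auto& memberBase : mit->second.bases)
            if (isNonPod(table, memberBase)) return true;
    }
    return false;
}

// Small made-up table to exercise the predicate.
ClassTable exampleTable() {
    ClassTable t;
    t["Point"]  = ClassInfo{true,  false, false, {}, {}};
    t["Buffer"] = ClassInfo{false, true,  true,  {}, {}};          // non-POD, has copy ops
    t["Widget"] = ClassInfo{false, false, false, {}, {"Buffer"}};  // holds a Buffer, no copy ops
    t["Line"]   = ClassInfo{true,  false, false, {}, {"Point"}};   // only POD parts
    return t;
}

void runExample() {
    ClassTable t = exampleTable();
    assert(missingCopyOps(t, "Widget"));   // flagged
    assert(!missingCopyOps(t, "Line"));    // not flagged
    assert(!missingCopyOps(t, "Buffer"));  // already has both operations
}
```

The point of a scripting layer would be to express missingCopyOps as data rather than as compiled code, so that new rules would not each need their own binary.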

Question 1) Does this sound interesting? I’d be working on such features only when I had no specific tasking at work to do - such a tool would aid my work once it was mature, so I’m allowed to use my “down time” to work on it. [And my family time is far too valuable. :) ]

I have a question about the parser design - it looks like it was written by hand: was it?

For C, that’s not so hard, as the C spec is pretty straightforward. But for C++, a hand-written parser seems to me to be a bit more difficult - especially with C++0x’s changes.

Question 2) Is there any interest in using a lex/yacc type of approach?
I’ve both written by-hand parsers and used parser-generators to create front ends - and the latter makes for considerably less work when the language is well-defined.

Question 3) How does clang know when it’s being targeted at C vs C++? There are some areas in which valid C is invalid C++, so the parser (if exact for both languages) would either have to know the difference or have to switch modes. Or is C++ being treated as a superset of C?

thanks,
Brian

Brian Allison wrote:

My desire is to help clang be able to provide for software analysis
what Parasoft's C++Test does, but to make a better scripting language
than the "symbolic" language that C++Test provides.

Question 1) Does this sound interesting?

Yes. You're not the first one to come here with an interest in static
analysis - but you seem to have thought the most about it.

I have a question about the parser design - it looks like it was
written by hand: was it?

Yes.

For C, that's not so hard, as the C spec is pretty straightforward.
But for C++, a hand-written parser seems to me to be a bit more
difficult - especially with C++0x's changes.

Ever tried writing a description of C++'s grammar? It's just about
impossible. A hand-written parser is not only the most performant and
most flexible way to go about it, it's also probably the easiest. Not
that writing a C++ parser is easy under any circumstances. Look at the
test cases in test/Parser/cxx-ambig-paren-expr.cpp for some of the stuff
that we have to cope with.

Question 2) Is there any interest in using a lex/yacc type of approach?

No, I feel pretty confident in saying that there is no interest in
replacing our current lexer and/or parser. (Of course I can't speak for
everyone.) Using a hand-written recursive descent parser was a very
early decision in the project, and since I've been with it, I've seen
nothing to counter the view that it was absolutely the right thing to do.
For reference, the GCC team once had a generated parser for C++, but
they replaced it by a hand-written one - because the old one was not
only too slow, but also too inflexible.

I've both written by-hand parsers and used parser-generators to
create front ends - and the latter makes for considerably less work
when the language is well-defined.

Neither C nor C++ is well-defined in the sense a parser generator
needs. Both are extremely context-sensitive - *especially* C++ - and
that's not a good thing for automatic parser generation.
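
A small illustration of that context sensitivity, as a sketch rather than anything taken from the clang test suite: the token sequence (T)(v) is a cast when T names a type and a function call when T names a function, so the parser cannot pick a grammar rule without knowing what T was declared as. The namespaces and function names here are made up for the example.

```cpp
#include <cassert>

namespace as_type {
    typedef int T;
    // Here (T)(v) is a C-style cast of v to int.
    int eval(double v) { return (T)(v); }
}

namespace as_function {
    int T(double v) { return static_cast<int>(v) + 100; }
    // Here the very same tokens, (T)(v), call the function T with argument v.
    int eval(double v) { return (T)(v); }
}

void runExample() {
    assert(as_type::eval(3.7) == 3);        // cast truncates 3.7 to 3
    assert(as_function::eval(3.7) == 103);  // call returns 3 + 100
}
```

A grammar-driven generator sees identical input in both cases; a hand-written parser can simply ask the semantic layer what T is before deciding.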

Question 3) How does clang know when it's being targeted at C vs C++?
There are some areas in which valid C is invalid C++, so the parser (if
exact for both languages) would either have to know the difference,
would have to switch, or perhaps is C++ being treated as a superset of C?

We have a class LangOptions; an instance of it is filled by the driver.
All the "master" objects (Lexer, Preprocessor, Parser, Sema and
ASTContext) hold a reference to this object. You can query this object
for a lot of flags: C++, Objective-C (if both are enabled, you get
Objective-C++), C++0x, C99, Objective-C2 and several extension flags.

The lexer, parser and sema components often make decisions based on the
state of these flags. If you grep the source for CPlusPlus you'll find
all the places where C++ makes a difference. For instance, look around
line 340 of lib/Parse/ParseExpr.cpp for a case where the C and C++
grammars differ very subtly.
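
To give a feel for the pattern, here is a heavily reduced sketch of one options object being consulted by a component. It is not the real class: only the CPlusPlus flag name comes from the source, while ObjC, C99, TokenKind, classifyWord and isObjCPlusPlus are stand-in names assumed for the illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical, reduced stand-in for the LangOptions class described above.
struct LangOptions {
    bool CPlusPlus = false;  // flag name as it appears in the source
    bool ObjC = false;       // assumed name for the Objective-C flag
    bool C99 = false;        // assumed name for the C99 flag
};

enum class TokenKind { Identifier, Keyword };

// Toy lexer decision: some words are keywords only when C++ is enabled,
// which is one way valid C (a variable named "class") becomes invalid C++.
TokenKind classifyWord(const std::string& word, const LangOptions& opts) {
    if (word == "int" || word == "return" || word == "struct")
        return TokenKind::Keyword;
    if (opts.CPlusPlus &&
        (word == "class" || word == "new" || word == "template"))
        return TokenKind::Keyword;
    return TokenKind::Identifier;
}

// Both flags enabled together means Objective-C++.
bool isObjCPlusPlus(const LangOptions& opts) {
    return opts.CPlusPlus && opts.ObjC;
}

void runExample() {
    LangOptions c;
    LangOptions cxx;
    cxx.CPlusPlus = true;
    assert(classifyWord("class", c) == TokenKind::Identifier);
    assert(classifyWord("class", cxx) == TokenKind::Keyword);
    assert(!isObjCPlusPlus(cxx));
}
```

Because every component holds a reference to the same object, there is one parser with language-dependent branches rather than separate parsers for C and C++.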

Sebastian