clang leveraging Elsa?

Hi,
I just found out about clang from the LLVM 2.1 announcement. It's great
to see someone working on a C++ front-end with an emphasis on
source2source and static analysis. I've been writing tools for that
using the Elsa frontend for the past year.
Elsa is further along in development (fairly complete in that it can
parse most of the C/C++ code that gcc 3.4 accepts). Have the clang
developers considered reusing parts of Elsa? I haven't noticed a
mention of Elsa in the list archives.
Elsa comes with an extensive testsuite and has some design similarities
to clang as described in the clang Internals manual. It also differs in
that the preprocessor is not integrated. For precise source2source
transforms I worked with the MCPP author to produce a mode that
annotates macro expansions.
Would it make sense to refactor Elsa toward the clang design to speed up
clang development?

Elsa homepage: Elsa
Oink suite (the central Elsa repository is within oink):
http://www.cubewano.org/oink/
MCPP preprocessor: http://mcpp.sourceforge.net/
My development blog & some links to my oink fork:
http://blog.mozilla.com/tglek

I am very interested in clang since it will be a C/C++/ObjC frontend
that's suitable for source analysis and also serves as a frontend to a
production compiler.
My biggest gripe with Elsa is that it isn't developed in a transparent
fashion. It also occasionally has bugs that would be caught if it were
used as a frontend for a compiler.

Cheers,
Taras

I just found out about clang from the LLVM 2.1 announcement. It's great
to see someone working on a C++ front-end with an emphasis on
source2source and static analysis. I've been writing tools for that
using the Elsa frontend for the past year.

Nifty!

Elsa is further along in development (fairly complete in that it can
parse most of the C/C++ code that gcc 3.4 accepts). Have the clang
developers considered reusing parts of Elsa? I haven't noticed a
mention of Elsa in the list archives.

We have. The killer problem is that Elsa's implementation will not allow us to achieve the performance goals of the clang project. In addition, Elsa doesn't solve the hard part of C++ parsing (the semantic analysis and type checking), isn't built as a reusable library (in the way clang aims to be), doesn't get the corner cases of the languages it parses correct, etc.

While we could extend Elsa to complete its support for C++ and polish the corner cases, fixing the performance issues would require a complete redesign. As such, reusing elsa is a non-starter. :frowning:

Elsa comes with an extensive testsuite and has some design similarities
to clang as described in the clang Internals manual.

When we get that far, I expect clang to extensively leverage the available test suites, including the GCC test suite and Elsa's, as well as just building tons of open source software.

I am very interested in clang since it will be a C/C++/ObjC frontend
that's suitable for source analysis and also serves as a frontend to a
production compiler.
My biggest gripe with Elsa is that it isn't developed in a transparent
fashion. It also occasionally has bugs that would be caught if it were
used as a frontend for a compiler.

clang is definitely developed in the open and welcomes contributors. However, our C++ support is basically non-existent (and we don't have anyone really working on it), so Elsa is probably a better solution to C++ parsing issues in the short term. Over the next couple years, I expect the clang C++ support to come up to the point where it is both industrial quality and useful for a broad variety of clients. It also has much better ObjC support than elsa :wink:

-Chris

Chris Lattner wrote:

I just found out about clang from the LLVM 2.1 announcement. It's great
to see someone working on a C++ front-end with an emphasis on
source2source and static analysis. I've been writing tools for that
using the Elsa frontend for the past year.

Nifty!

Elsa is further along in development (fairly complete in that it can
parse most of the C/C++ code that gcc 3.4 accepts). Have the clang
developers considered reusing parts of Elsa? I haven't noticed a
mention of Elsa in the list archives.

We have. The killer problem is that Elsa's implementation will not allow us to achieve the performance goals of the clang project. In addition, Elsa doesn't solve the hard part of C++ parsing (the semantic analysis and type checking), isn't built as a reusable library (in the way clang aims to be), doesn't get the corner cases of the languages it parses correct, etc.

Actually, it does do semantic analysis and type checking.

Could you elaborate on what you mean by "isn't built as a reusable library"? API-wise I think it's ok in that regard.

I'm also not sure what you mean by corner cases. Elsa's C support isn't ideal because it pretends that C is a subset of C++. And if by corner cases you mean that it doesn't always store everything it parsed in the AST, that's also correct. However, I've found that filling in missing information is next to trivial, and I am not aware of any other shortcomings. Could you elaborate on that point too?

While we could extend Elsa to complete its support for C++ and polish the corner cases, fixing the performance issues would require a complete redesign. As such, reusing elsa is a non-starter. :frowning:

Elsa's support for C++ is fairly complete. In my view, templates are the only crucial part that needs work. It fails on anything beyond simple template instantiation (luckily for me, it's barely enough for Mozilla). However, template support is a hard part that will have to be dealt with in any C++ frontend.

Elsa was also designed with performance in mind (but I agree that the authors could've done much better). An annoying part of Elsa is its hand-rolled data structures (one is named string!), so I've been considering doing some refactoring to get rid of or change some of the obscure data structures. The type system also needs slight redoing. Perhaps we could redo it in clang's image.

Elsa comes with an extensive testsuite and has some design similarities
to clang as described in the clang Internals manual.

When we get that far, I expect clang to extensively leverage the available test suites, including the GCC test suite and Elsa's, as well as just building tons of open source software.

I am very interested in clang since it will be a C/C++/ObjC frontend
that's suitable for source analysis and also serves as a frontend to a
production compiler.
My biggest gripe with Elsa is that it isn't developed in a transparent
fashion. It also occasionally has bugs that would be caught if it were
used as a frontend for a compiler.

clang is definitely developed in the open and welcomes contributors. However, our C++ support is basically non-existent (and we don't have anyone really working on it), so Elsa is probably a better solution to C++ parsing issues in the short term. Over the next couple years, I expect the clang C++ support to come up to the point where it is both industrial quality and useful for a broad variety of clients. It also has much better ObjC support than elsa :wink:

We have been discussing adding ObjC++ to elsa :slight_smile:

A number of other people and I would like an actively maintained C++ frontend suitable for analysis and source2source. GCC isn't an option, and neither is waiting for someone to write another frontend from scratch, since that would take another half a decade. It would be nice if useful parts of Elsa were absorbed by clang so I wouldn't have to play catch-up with real compilers.

What would you need done to Elsa to consider using it in clang? Perhaps it could serve as a stopgap measure while the faster clang C++ frontend matures.

Taras

For the past few months I have been writing many tools with Roberto Raggi's C++ preprocessor and parser. It is very fast and I have enjoyed messing with it. Since you (Chris) already know exactly what you need/want and what would make a good parser, I am very curious what you can say about it (where it is good/bad, what it is missing, etc.).

The one I have been using can be found in this package:

ftp://ftp.trolltech.com/qtjambi/source/qtjambi-gpl-src-4.3.0_01.tar.gz

Located in: generator/parser/ and the preprocessor is in generator/parser/rpp

-Benjamin Meyer

We have. The killer problem is that Elsa's implementation will not allow us to achieve the performance goals of the clang project. In addition, Elsa doesn't solve the hard part of C++ parsing (the semantic analysis and type checking), isn't built as a reusable library (in the way clang aims to be), doesn't get the corner cases of the languages it parses correct, etc.

Actually, it does do semantic analysis and type checking.

I'm sorry, I should have been more clear. Also, my understanding of Elsa is somewhat dated (about a year old), so it could be obsolete.

What I meant is that Elsa (as I understand it) has enough semantic analysis to parse, but is not enforcing all of the constraints required by the language. For example, it does not (AFAIK) correctly enforce things like integer constant expressions, constraints on VLAs, etc. If your goal is to just parse correct code, this is fine. If you want to correctly enforce the requirements of the language, this isn't OK.
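To make the difference concrete, here is a tiny illustrative C++ snippet (my own example, not taken from Elsa or clang) showing the kind of constraint a parse-only tool can ignore but a conforming front end must diagnose:

// Illustrative example only; not code from Elsa or clang.
const int K = 8;     // K is an integer constant expression
int ok[K];           // well-formed: the array bound is a constant expression

int n = 8;           // n is a plain variable, not a constant expression
// int bad[n];       // a parser that merely needs to accept correct code may
//                   // let this through, but a conforming C++ front end must
//                   // reject it: the bound is not an integer constant
//                   // expression (and C++ has no VLAs).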

Could you elaborate on what you mean by "isn't built as a reusable library"? API-wise I think it's ok in that regard.

Specifically, lexing, parsing, and AST building are not cleanly separable as they are in clang. In clang, it is possible to implement an action module that uses the parser but doesn't necessarily build an AST at all.
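The general pattern looks roughly like the sketch below. This is my own illustration of a parser/action split, with invented names (ParserActions, DeclCounter, ParseTranslationUnit), not clang's actual interface: the parser reports events through an abstract interface, and building an AST is just one possible client.

// Sketch of a parser/action split; the names are invented for illustration
// and are not clang's real API.
#include <cstdio>
#include <string>

// The parser reports syntactic events through this interface instead of
// building a tree itself.
struct ParserActions {
  virtual ~ParserActions() {}
  virtual void ActOnVarDecl(const std::string &Name) = 0;
  virtual void ActOnFunctionDef(const std::string &Name) = 0;
};

// One possible client: build no AST at all, just count declarations.
struct DeclCounter : ParserActions {
  unsigned NumVars = 0, NumFuncs = 0;
  void ActOnVarDecl(const std::string &) override { ++NumVars; }
  void ActOnFunctionDef(const std::string &) override { ++NumFuncs; }
};

// A real parser would drive the callbacks from a token stream; here we just
// fake a couple of them to show the flow of control.
void ParseTranslationUnit(ParserActions &Actions) {
  Actions.ActOnVarDecl("x");
  Actions.ActOnFunctionDef("main");
}

int main() {
  DeclCounter Counter;
  ParseTranslationUnit(Counter);
  std::printf("%u vars, %u functions\n", Counter.NumVars, Counter.NumFuncs);
  return 0;
}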

I'm also not sure what you mean by corner cases. Elsa's C support isn't ideal because it pretends that C is a subset of C++. And if by corner cases you mean that it doesn't always store everything it parsed in the AST, that's also correct. However, I've found that filling in missing information is next to trivial, and I am not aware of any other shortcomings. Could you elaborate on that point too?

See above: accepting a superset of a language is much harder than accepting the language properly.

An additional issue is one of diagnostics. I haven't successfully built elsa, so I can't play with it, but my guess is that the diagnostics produced by elsa are not very good. I would be interested to know if that guess is correct or not. Having *extremely good* diagnostics is a very important goal for clang.

While we could extend Elsa to complete its support for C++ and polish the corner cases, fixing the performance issues would require a complete redesign. As such, reusing elsa is a non-starter. :frowning:

Elsa's support for C++ is fairly complete. In my view, templates are the only crucial part that needs work. It fails on anything beyond simple template instantiation (luckily for me, it's barely enough for Mozilla). However, template support is a hard part that will have to be dealt with in any C++ frontend.

Yep, if elsa was otherwise ok, I would much rather have us extend it instead of reinventing yet another new thing.

Elsa was also designed with performance in mind (but I agree that the authors could've done much better). An annoying part of Elsa is its hand-rolled data structures (one is named string!), so I've been considering doing some refactoring to get rid of or change some of the obscure data structures. The type system also needs slight redoing. Perhaps we could redo it in clang's image.

Reliance on GLR parsing makes Elsa fundamentally slower than a parser that does well-constrained lookahead and backtracking. I am actually a fan of GLR parsing, and I think that Elsa's implementation is a pretty good one. However, GLR parsing requires *building speculative parse trees* and then eliminating the speculation later. In my experience with clang, I've found that anything which does memory allocation or touches the heap is orders of magnitude slower than something that can avoid it.

I don't see how GLR parsing can be done without a liberal amount of heap traffic, but maybe I'm missing something.
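To illustrate the point, here is a toy sketch I wrote for this thread (not Elsa's code): when a GLR parser hits an ambiguity it forks its parse stack and carries both alternatives forward, which inherently means allocating speculative structures that are later thrown away.

// Toy illustration of why GLR parsing implies heap traffic; this is not
// Elsa's implementation, just a sketch of the general technique.
#include <memory>
#include <string>
#include <vector>

// Each speculative parse carries its own stack of partially built nodes.
struct ParseStack {
  std::vector<std::string> nodes;  // heap-allocated speculative state
};

int main() {
  std::vector<std::unique_ptr<ParseStack>> active;
  active.push_back(std::unique_ptr<ParseStack>(new ParseStack));

  // On an ambiguous construct (e.g. "T(x);" as a declaration vs. an
  // expression), GLR duplicates the stack and pursues both readings.
  std::unique_ptr<ParseStack> fork(new ParseStack(*active[0]));  // copy = allocation
  active[0]->nodes.push_back("declaration T(x)");
  fork->nodes.push_back("expression T(x)");
  active.push_back(std::move(fork));

  // Later, semantic information kills one alternative, and its speculative
  // nodes are discarded -- work (and heap churn) that a deterministic parser
  // with bounded lookahead never performs.
  active.pop_back();
  return 0;
}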

clang is definitely developed in the open and welcomes contributors. However, our C++ support is basically non-existent (and we don't have anyone really working on it), so Elsa is probably a better solution to C++ parsing issues in the short term. Over the next couple years, I expect the clang C++ support to come up to the point where it is both industrial quality and useful for a broad variety of clients. It also has much better ObjC support than elsa :wink:

We have been discussing adding ObjC++ to elsa :slight_smile:

Nice!

A number of other people and I would like an actively maintained C++ frontend suitable for analysis and source2source. GCC isn't an option, and neither is waiting for someone to write another frontend from scratch, since that would take another half a decade.

Yep, there is a clear need!

It would be nice if useful parts of Elsa were absorbed by clang so I wouldn't have to play catch-up with real compilers. What would you need done to Elsa to consider using it in clang? Perhaps it could serve as a stopgap measure while the faster clang C++ frontend matures.

I agree, but I don't see how the two can be merged. Do you have any specific suggestion?

-Chris

Interesting, I wasn't aware of this work. I'll take a look at it later tonight. It looks like it will take a while to dig in due to lack of comments :(.

-Chris

For what it is worth, he also hacks on KDevelop, and a version of the preprocessor and parser exists there as well (this isn't the old KDE 3 KDevelop stuff, but the stuff for KDE 4's KDevelop, and it is a little different as the KDE devs have messed around with it). It uses QLALR, a LALR(1) parser generator, which is fast and tiny.

http://labs.trolltech.com/page/Projects/Compilers/QLALR

-Benjamin Meyer

It is somewhat irritating to me that there are almost no comments for this: it seems well thought out and well written. Is there any out-of-line documentation available?

Overall, it is an impressive piece of work. There are some minor strange (to me) design decisions: for example, what is ConditionAST, why does it exist?

The ASTs produced seem to be a bit heavier-weight than the clang ASTs, and they rely on the entire lexed token stream being available to interpret the location info. However, in my first few minutes looking at it, I don't think that it shares the "fatal flaws" (from the clang perspective only, obviously) in its design or implementation that Elsa has. As a matter of fact, while the details differ significantly, its design is somewhat similar to clang's, validating clang's design ;-). One thing that is impossible for me to do from inspection is to determine how complete the parser is.
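As an aside on what "relies on the token stream for location info" means in practice, here is a sketch of the two representations I have in mind; the types are invented for illustration and are not code from either project. Storing a token index forces the whole token vector to stay alive, while storing a source offset keeps the node self-contained:

// Sketch contrasting two ways of representing source locations; the types
// are invented and do not come from clang or the qtjambi parser.
#include <cstddef>
#include <vector>

struct Token { std::size_t offset; unsigned length; };

// Style A: an AST node stores an index into the token stream. Decoding the
// location later requires keeping every lexed token in memory.
struct NodeWithTokenIndex {
  std::size_t firstTokenIndex;
  std::size_t locationOffset(const std::vector<Token> &tokens) const {
    return tokens[firstTokenIndex].offset;
  }
};

// Style B: the node stores the offset directly, so the token stream can be
// discarded after parsing.
struct NodeWithOffset {
  std::size_t offset;
  std::size_t locationOffset() const { return offset; }
};

int main() {
  std::vector<Token> tokens = {{0, 3}, {4, 1}};
  NodeWithTokenIndex a{1};
  NodeWithOffset b{4};
  return static_cast<int>(a.locationOffset(tokens) - b.locationOffset());  // 0
}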

Since I don't have it built and you do, here are some questions for you: :slight_smile:

1) Looking at the preprocessor, the implementation doesn't look particularly speedy. It is using std::strings to push text around (see the sketch after this list for why that matters). Have you timed the preprocessor on large inputs to see how fast it really is?
2) the preprocessor seems to get the 90% case right, but doesn't seem to be fully conformant. Do you have any idea whether it has been tested against the hard cases in the standard? For example, the clang/test/Preprocessor directory has some example hard cases.
3) does the code handle nasty features like trigraphs?
4) how good is the C++ support? It seems like there is significant coverage for a big chunk of the language, but it seems like pieces are missing. Without at least partial template instantiation support you can't correctly parse some C++ code for example. Note that this requires full handling of template specialization etc. Are there known holes/deficiencies?
5) it looks like a lot of semantic checks are missing. Is there anything that talks about the current state of the parser? It also reads and ignores lots of stuff, even simple things like break/continue/goto stmts.
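On point 1, here is the kind of difference I have in mind; this is a generic sketch I wrote, not code from rpp or clang. A preprocessor that copies text into std::string objects pays an allocation per token, while one that hands out pointer/length spans into the original buffer mostly does not:

// Generic sketch of "copying" vs. "span" token representations; the names
// are invented and this is not rpp's or clang's actual code.
#include <string>
#include <vector>

// Copying style: every token owns a heap-allocated copy of its text.
struct CopiedToken {
  std::string text;  // allocation + copy per token
};

// Span style: a token is just a pointer/length pair into the source buffer;
// no per-token allocation.
struct SpanToken {
  const char *start;
  unsigned length;
};

// Split a buffer into whitespace-separated "tokens" both ways.
void lexCopied(const std::string &buf, std::vector<CopiedToken> &out) {
  std::size_t i = 0;
  while (i < buf.size()) {
    while (i < buf.size() && buf[i] == ' ') ++i;
    std::size_t begin = i;
    while (i < buf.size() && buf[i] != ' ') ++i;
    if (i > begin) out.push_back(CopiedToken{buf.substr(begin, i - begin)});
  }
}

void lexSpans(const std::string &buf, std::vector<SpanToken> &out) {
  std::size_t i = 0;
  while (i < buf.size()) {
    while (i < buf.size() && buf[i] == ' ') ++i;
    std::size_t begin = i;
    while (i < buf.size() && buf[i] != ' ') ++i;
    if (i > begin)
      out.push_back(SpanToken{buf.data() + begin,
                              static_cast<unsigned>(i - begin)});
  }
}

int main() {
  const std::string buf = "int x = 42 ;";
  std::vector<CopiedToken> a;
  std::vector<SpanToken> b;
  lexCopied(buf, a);  // 5 tokens, 5 string copies
  lexSpans(buf, b);   // 5 tokens, zero per-token text copies
  return static_cast<int>(a.size() - b.size());  // 0
}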

-Chris

For the past few months I have been writing many tools with Roberto Raggi's C++ preprocessor and parser. It is very fast and I have enjoyed messing with it. Since you (Chris) already know exactly what you need/want and what would make a good parser, I am very curious what you can say about it (where it is good/bad, what it is missing, etc.).

The one I have been using can be found in this package:

ftp://ftp.trolltech.com/qtjambi/source/qtjambi-gpl-src-4.3.0_01.tar.gz

Located in: generator/parser/ and the preprocessor is in generator/
parser/rpp

It is somewhat irritating to me that there are almost no comments for this: it seems well thought out and well written. Is there any out-of-line documentation available?

Asking around there doesn't seem to be any that I could find. Sorry :frowning:

Overall, it is an impressive piece of work. There are some minor strange (to me) design decisions: for example, what is ConditionAST, why does it exist?

Sorry, I am not that familiar with the design choices. I have forwarded this to Roberto, who will hopefully respond shortly with answers to it and the other questions below.

The ASTs produced seem to be a bit heavier-weight than the clang ASTs, and they rely on the entire lexed token stream being available to interpret the location info. However, in my first few minutes looking at it, I don't think that it shares the "fatal flaws" (from the clang perspective only, obviously) in its design or implementation that Elsa has. As a matter of fact, while the details differ significantly, its design is somewhat similar to clang's, validating clang's design ;-). One thing that is impossible for me to do from inspection is to determine how complete the parser is.

Since I don't have it built and you do, here are some questions for you: :slight_smile:

Having it sitting in the middle of other packages is getting annoying. I have pulled it out, made a quick (and very dirty) example app that can be used to generate preprocessed files (./example -E file), put it in a git repository, and put it up online:

http://repo.or.cz/w/rpp.git

1) looking at the preprocessor, the implementation doesn't look particularly speedy. It is using std::strings to push text around. Have you timed the preprocessor on large inputs to see how fast it really is?

I did a quick (not scientific) test against gcc on my MacBook; it is almost twice the speed:

g++
real 0m0.437s
user 0m0.243s
sys 0m0.060s

rpp
real 0m0.248s
user 0m0.197s
sys 0m0.070s

5) it looks like a lot of semantic checks are missing. Is there anything that talks about the current state of the parser? It also reads and ignores lots of stuff, even simple things like break/continue/goto stmts.

I know that there are several groups (KDevelop is one) who wish to use it to parse C++ code beyond Qt, so it will be maintained and improved. I don't know their plans, though. Sorry I can't be more help.

-Benjamin Meyer

It is somewhat irritating to me that there are almost no comments for this: it seems well thought out and well written. Is there any out-of-line documentation available?

Asking around there doesn't seem to be any that I could find. Sorry :frowning:

Ok.

Overall, it is an impressive piece of work. There are some minor
strange (to me) design decisions: for example, what is
ConditionAST, why does it exist?

Sorry, I am not that familiar with the design choices. I have forwarded this to Roberto, who will hopefully respond shortly with answers to it and the other questions below.

Thanks,

The ASTs produced seem to be a bit heavier-weight than the clang ASTs, and they rely on the entire lexed token stream being available to interpret the location info. However, in my first few minutes looking at it, I don't think that it shares the "fatal flaws" (from the clang perspective only, obviously) in its design or implementation that Elsa has. As a matter of fact, while the details differ significantly, its design is somewhat similar to clang's, validating clang's design ;-). One thing that is impossible for me to do from inspection is to determine how complete the parser is.

Since I don't have it built and you do, here are some questions for
you: :slight_smile:

Having it sitting in the middle of other packages is getting annoying. I have pulled it out, made a quick (and very dirty) example app that can be used to generate preprocessed files (./example -E file), put it in a git repository, and put it up online:

http://repo.or.cz/w/rpp.git

Nice, it's easy to get a tarball from it; unfortunately, I don't have qmake.

1) looking at the preprocessor, the implementation doesn't look
particularly speedy. It is using std::strings to push text
around. Have you timed the preprocessor on large inputs to see how
fast it really is?

I did a quick (not scientific) test against gcc on my MacBook; it is almost twice the speed:

g++
real 0m0.437s
user 0m0.243s
sys 0m0.060s

rpp
real 0m0.248s
user 0m0.197s
sys 0m0.070s

User+sys (0.303s vs. 0.267s), it is only about 13% faster. Still, this is impressive :slight_smile:

5) it looks like a lot of semantic checks are missing. Is there
anything that talks about the current state of the parser? It also
reads and ignores lots of stuff, even simple things like break/
continue/goto stmts.

I know that there are several groups (KDevelop is one) who wish to use it to parse C++ code beyond Qt, so it will be maintained and improved. I don't know their plans, though. Sorry I can't be more help.

No problem, I'll delve more into the code. It looks like it is designed and optimized for use by KDevelop, so it hasn't focused much on type analysis and other things that a front-end needs for full diagnostics and code generation. However, it has a pretty nice design and is simple, so it's a good source of inspiration.

Now we just need someone to help push C++ support forward! :slight_smile:

-Chris

My biggest question:

are the resulting generated files free of the license, or are they GPL'd?

Whenever I see a program related to compiler technology under the GPL, all kinds of
warning bells start to chime, especially considering GCC has all manner of
exclusion statements from the GPL in order to exempt non-GPL programs that
are linked with or created by it.

The qt-jambi work is a parser only; it does not include a code generator. The only thing it produces is an in-memory AST, not machine code.

-Chris