clang leveraging Elsa?

Hi,

Daniel Wilkerson pointed me at this thread, and thought I should
respond to answer some questions people have about Elsa. But before I
do so, I want to be clear that I'm not in favor of nor against clang
using Elsa. I'm just hoping to help clear any misconceptions so the
best choice can be made.

>What I meant is that Elsa (as I understand it) has enough semantic
>analysis to parse, but is not enforcing all of the constraints
>required by the language.

This is true, and even stated as such on the Elsa page, but is quite
misleading (for which I am partly to blame). "enough semantic
analysis to parse" is almost the whole language, because virtually
every language feature must be handled properly to get an accurate
parse: one must distinguish names of types from names of variables;
since names can be members of template classes, the right
specialization must be chosen; and since specialization selection
engages almost every other language feature directly or indirectly,
there is no stopping short. Believe me, I tried. :slight_smile:
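
A standard illustration (not from Elsa's test suite) of how
specialization selection decides the parse:

    template <class T> struct A      { static const int y = 1; };
    template <>        struct A<int> { typedef int y; };

    int x = 2;
    int w = A<char>::y * x;   // A<char>::y is a value: a multiplication
    A<int>::y *p = &x;        // A<int>::y is a type: declares p as int*

The parser cannot know which reading of the last two lines is right
without choosing the specialization first.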

The only feature that Elsa is completely lacking, because it genuinely
is not required for a correct parse (thank goodness), is access
control (public, private, etc.). The only hard part about that is
figuring out what the heck 'friend' means (but that *is* hard).
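
For the record, an illustrative example (not Elsa code) of the kind
of question 'friend' forces you to answer:

    class C {
      int secret;
      friend void f(C &);            // which f? found by rules all its own
    };
    void f(C &c) { c.secret = 0; }   // legal only because of the friend decl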

>For example, it does not (AFAIK) correctly enforce things like
>integer constant expressions, constraints on VLAs etc.

It is true there are a lot of language restrictions that are not
currently enforced. However, access control aside, I claim that each
such restriction could be enforced by writing some "leaf" code to do
it; all the information is there, the rule simply needs to be coded.

>Specifically, lexing, parsing, and AST building are not cleanly
>separable as they are in clang.

I would argue that Elsa separates these phases as well as can be done
for C++; separation was among the early goals of the project. Unlike
every other C/C++ implementation I am aware of (I have not looked at
clang), there is no "lexer hack" feedback loop from the symbol table
to the lexer. So lexing and parsing are each totally independent
phases. However, the output of the parser is a forest, not a tree;
the price of separation is that much disambiguation has to be
postponed until type checking.
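
For reference, the canonical case that the feedback loop exists to
resolve: the same token sequence means two different things depending
on what the symbol table says about 'a':

    typedef int a;

    void g()             { a * x; }  // 'a' names a type: declares x as int*
    void h(int a, int x) { a * x; }  // 'a' names a variable: multiplication

A traditional implementation asks the symbol table at lex time; Elsa
instead parses both alternatives and lets type checking discard the
wrong one.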

>In clang, it is possible to implement an action module that uses the
>parser but doesn't necessarily build an AST at all.

Unfortunately, that is simply not possible when parsing C++. You
*have* to do template instantiation, and that requires building an
AST. :frowning:
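
A two-line illustration (made up for this message) of why: the
template body below cannot even be analyzed until T is supplied, so
it must be stored and revisited, and an AST is the natural store:

    template <class T> struct S { typename T::type m; };  // a recipe, not
                                                          // code, until T is known
    struct Traits { typedef int type; };
    S<Traits> s;   // only now can the stored body be elaborated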

>See above: accepting a superset of a language is much harder than
>accepting the language properly.

I think you meant to say that accepting a superset is much easier?

I probably would have agreed with that statement (the "much" part)
when I first began writing Elsa, but I now do not.

>built elsa, so I can't play with it, but my guess is that the
>diagnostics produced by elsa are not very good.

Yes and no. :slight_smile: When it comes to syntax for which there is no
possible syntactic parse (meaning the parser itself is rejecting the
input), the diagnostics are awful. Elkhound (the parser generator
underneath Elsa) has no error recovery and only very primitive
diagnostics ("unexpected token <blah> at <blah>, bailing"). I have a
longstanding TODO to add good diagnosis and recovery (I was planning
to use the Burke+Fisher algorithm) to Elkhound. I would estimate it
would only take a couple weeks; adapting the algorithm to GLR is
nontrivial but should be possible.

However, once you get into the type checker, the diagnostics are
fairly good IMO. I try to print out both what Elsa thinks is wrong
and the names, types, etc. of relevant entities in coherent error
messages. Good diagnostics were also among the early goals. Now,
that's not to say there is no room for improvement, but I haven't seen
much better (again, IMO).

>> are the only crucial part that needs work. It fails on anything
>> beyond simple template instantiation (lucky for me it's barely
>> enough for Mozilla).

I think this characterization is a bit unfair, though I do understand
the frustration users feel when they hit Elsa template bugs, and
certainly acknowledge there are still plenty in there. I knock them
off from time to time.

If I had my druthers, I'd rewrite the entire template mechanism (for
the fourth time). But what's there gets the job done. Just how much
of the job it does is hard to say, but I'm currently finding that
fixing bugs usually involves small-ish tweaks.

>> Elsa was also designed with performance in mind (but I agree that
>> the authors could've done much better). An annoying part of elsa is
>> hand-rolled data structures (one named string!), so I've been
>> considering doing some sort of refactoring to get rid of or change
>> some of the obscure data structures.

It's true I avoided using STL, for a variety of reasons. However, I
don't think that choice affected performance one way or another.

>Reliance on GLR parsing makes Elsa fundamentally slower than a parser
>that does well constrained look ahead and backtracking. I am
>actually a fan of GLR parsing, and I think that Elsa's implementation
>is a pretty good one. However, GLR parsing requires *building
>speculative parse trees* and then eliminating the speculation later.
>In my experience with clang, I've found that anything which does
>memory allocation or touches the heap is orders of magnitude slower
>than something that can avoid it.

First, I don't see how one could separate lexing+parsing from type
checking of C++ w/o using GLR. So you're kind of stuck if you want
both "as fast as possible" but also "clean separation".

Now, as for whether GLR is slow: with some work, I was able to make
the GLR implementation used in Elsa perform within a few percent of a
bison-generated parser, for a deterministic C grammar. This is due
(in part) to a technique that allows the parser to switch between LR
and GLR dynamically depending on the grammar+input:

   http://www.cs.berkeley.edu/~smcpeak/papers/elkhound_cc04.ps
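
In rough outline, the fast path looks like this (a sketch with
invented names, not Elkhound's real code; see the paper for the
actual algorithm):

    #include <vector>

    enum Kind { SHIFT, REDUCE, CONFLICT };
    struct Action { Kind kind; int arg; };

    Action lookup(int, int) { Action a = {SHIFT, 0}; return a; }  // stub tables
    void glrStep(int) {}                                // stub GLR machinery

    void parseStep(std::vector<int> &lrStack, int token) {
      Action a = lookup(lrStack.back(), token);
      if (a.kind == SHIFT)
        lrStack.push_back(a.arg);                // deterministic: plain LR
      else if (a.kind == REDUCE)
        lrStack.resize(lrStack.size() - a.arg);  // pop states; goto push omitted
      else
        glrStep(token);                          // conflict: full GLR only here
    }

The point is that the graph-structured stack and its allocations are
touched only at actual conflict points, which for C are rare.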

For the (nondeterministic) C++ grammar, I measured Elsa's performance
as being around 30% slower than gcc-2 and around 2x faster than gcc-3
(that was against gcc-3.2.x or so; gcc-3.4+ has a totally new parser,
and I haven't compared to it). I do not know if you consider gcc
"fast enough". (See the above citation for details.)

>I don't see how GLR parsing can be done without a liberal amount of
>heap traffic, but maybe I'm missing something.

The GLR algorithm itself uses a graph-structured stack, but the
Elkhound implementation uses a node pool that is nearly as fast as an
ordinary LR stack like bison's. Heap traffic can be fast if the
locality is good and the general-purpose allocator is avoided.
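
A minimal sketch of the idea (not Elkhound's actual code): nodes for
the graph-structured stack come from one contiguous block and are
recycled through a free list, so allocation is a couple of pointer
operations and never a trip to malloc:

    #include <cstddef>
    #include <vector>

    struct StackNode {
      int state;
      StackNode *sibling;    // links within the graph-structured stack
      StackNode *nextFree;   // free-list link while the node is unused
    };

    class NodePool {
      std::vector<StackNode> storage;  // one contiguous block: good locality
      StackNode *freeList;
    public:
      explicit NodePool(std::size_t n) : storage(n), freeList(0) {
        for (std::size_t i = 0; i < n; i++)
          storage[i].nextFree = (i + 1 < n) ? &storage[i + 1] : 0;
        if (n) freeList = &storage[0];
      }
      StackNode *alloc() {             // O(1), no malloc
        StackNode *n = freeList;
        if (n) freeList = n->nextFree;
        return n;                      // null means the pool is exhausted
      }
      void dealloc(StackNode *n) {     // O(1) recycle
        n->nextFree = freeList;
        freeList = n;
      }
    };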

As for building AST, it's not merely speculative; much of the extra
AST actually survives parsing in the form of ambiguous alternatives.
But my measurements show that there are only 30-50% more nodes in the
ambiguous AST than in the unambiguous AST. So while that's not a
trivial cost, you're going to have to work pretty hard to get more
than a factor of two by eliminating the extra AST building.
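
Concretely, the surviving ambiguity looks something like this (a
sketch in the spirit of Elsa's AST, with the fields simplified):

    struct Expression {
      int kind;               // which production built this node
      Expression *ambiguity;  // next alternative parse of the same
                              // tokens; null once disambiguated
      // ... children, source location, etc. ...
    };

Type checking walks the 'ambiguity' chain, keeps the alternative that
checks, and discards the rest.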

At the end of the day, there are basically three approaches to parsing
C++ that I am aware of:

* Lexer hack plus LR parsing. This is what GCC did until 3.4. It is
very painful to make changes b/c LALR(1) is so unpredictable.

* Lexer hack plus hand-crafted parsing. This is what GCC-3.4+ does,
and what EDG has done all along. It ends up being better (IMO) than
the LR approach b/c the unpredictability goes away, but it's still a
lot of work to maintain since the parser does a lot of explicit token
manipulation. It's also difficult to separate the parser from its
AST, which is unfortunate from the perspective of possibly reusing the
GCC parser.

* GLR and post-parse disambiguation. So far Elsa is the only example.
I think it's a good point in the design space, but it does trade some
speed (I think at most a factor of three) in order to gain
maintainability.

-Scott

>Daniel Wilkerson pointed me at this thread, and thought I should
>respond to answer some questions people have about Elsa. But before I
>do so, I want to be clear that I'm not in favor of nor against clang
>using Elsa. I'm just hoping to help clear any misconceptions so the
>best choice can be made.

Welcome Scott!

>> What I meant is that Elsa (as I understand it) has enough semantic
>> analysis to parse, but is not enforcing all of the constraints
>> required by the language.

>This is true, and even stated as such on the Elsa page, but is quite
>misleading (for which I am partly to blame). "enough semantic
>analysis to parse" is almost the whole language, because virtually
>every language feature must be handled properly to get an accurate
>parse: one must distinguish names of types from names of variables;
>since names can be members of template classes, the right
>specialization must be chosen; and since specialization selection
>engages almost every other language feature directly or indirectly,
>there is no stopping short. Believe me, I tried. :slight_smile:

Yep. C++ is extremely annoying like that. Knowing whether even simple things like a<t>::y are a type or a variable does require full template info, and details like SFINAE make it hard to get everything right. I was not trying to understate the effort and value of Elsa, I was just trying to explain how differing goals drive the projects' priorities in different ways.

>The only feature that Elsa is completely lacking, because it genuinely
>is not required for a correct parse (thank goodness), is access
>control (public, private, etc.). The only hard part about that is
>figuring out what the heck 'friend' means (but that *is* hard).

Sure, relative to everything else, it's not a big deal.

>> For example, it does not (AFAIK) correctly enforce things like
>> integer constant expressions, constraints on VLAs etc.

>It is true there are a lot of language restrictions that are not
>currently enforced. However, access control aside, I claim that each
>such restriction could be enforced by writing some "leaf" code to do
>it; all the information is there, the rule simply needs to be coded.

Yep, anything can be added :slight_smile:

>> Specifically, lexing, parsing, and AST building are not cleanly
>> separable as they are in clang.

>I would argue that Elsa separates these phases as well as can be done
>for C++; separation was among the early goals of the project. Unlike
>every other C/C++ implementation I am aware of (I have not looked at
>clang), there is no "lexer hack" feedback loop from the symbol table
>to the lexer. So lexing and parsing are each totally independent
>phases. However, the output of the parser is a forest, not a tree;
>the price of separation is that much disambiguation has to be
>postponed until type checking.

This is not exactly what I meant, allow me to clarify. :slight_smile: Elsa has a nice separation because it allows deferring certain decisions until the (ambiguous) parse tree is pruned down by later semantic analysis. This means that the parser doesn't have to have any idea whether something is a variable, type, template, etc. in order to get the correct answer.

In clang, we use a form of the lexer hack. The lexer itself is not actually "hacked"; it just returns "identifier", never "type" or "variable". Instead of having the lexer make the determination (which would require it to know about scope and many other things in C++), the *parser* actually makes the determination. This design allows the lexer/preprocessor to be completely independent of the parser data structures, as in elsa.

Unlike Elsa, clang does do the "lexer hack" in the parser. This means that the C parser (for example) needs to determine whether an identifier returned by the lexer is a typename or a variable, before it can correctly parse past the token. In the case of C++, much more would be needed of course.

The major difference between clang and other AST-building "parsers" is that clang's parser does not build an AST at all. The simplest way to think of it is that each production in the grammar invokes a virtual method on an "actions" module. This actions module can then build an AST if it wants, or do other light-weight tracking. In C, for example, the 'minimal actions' just keeps track of scoped typedef/variable names: it doesn't build an AST at all otherwise. It is obviously true that a C++ version of this will have to keep track of much more for templates.
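
For concreteness, the shape of that interface is roughly the following (illustrative names and signatures, not clang's actual API):

    #include <set>
    #include <string>

    class Action {       // the parser invokes these as productions complete
    public:
      virtual ~Action() {}
      virtual void ActOnTypedef(const std::string &name) = 0;
      virtual bool isTypeName(const std::string &name) const = 0;
    };

    // "minimal actions": builds no AST, keeps just enough state
    // (typedef names) for the parser to classify identifiers
    class MinimalAction : public Action {
      std::set<std::string> typeNames;  // scope handling omitted for brevity
    public:
      void ActOnTypedef(const std::string &name) { typeNames.insert(name); }
      bool isTypeName(const std::string &name) const {
        return typeNames.count(name) != 0;
      }
    };

An AST-building client implements the same interface but constructs nodes in its callbacks; the parser itself never changes.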

When I said "lexing, parsing, and AST building are not cleanly separable as they are in clang", I primarily meant this separation between parsing and AST building. The specific win of this is that they are in different libraries, which cleanly decouples them (AST building only depends on the parser; there is no circular dependence) and makes both of them simpler and clearer.

>> See above: accepting a superset of a language is much harder than
>> accepting the language properly.

>I think you meant to say that accepting a superset is much easier?

Yep.

>I probably would have agreed with that statement (the "much" part)
>when I first began writing Elsa, but I now do not.

Fair enough. While I think I understand what needs to be done, you have actually done it :slight_smile:

>> Reliance on GLR parsing makes Elsa fundamentally slower than a parser
>> that does well constrained look ahead and backtracking. I am
>> actually a fan of GLR parsing, and I think that Elsa's implementation
>> is a pretty good one. However, GLR parsing requires *building
>> speculative parse trees* and then eliminating the speculation later.
>> In my experience with clang, I've found that anything which does
>> memory allocation or touches the heap is orders of magnitude slower
>> than something that can avoid it.

>First, I don't see how one could separate lexing+parsing from type
>checking of C++ w/o using GLR. So you're kind of stuck if you want
>both "as fast as possible" but also "clean separation".

See above, hopefully that clarifies what I mean by 'clean separation'. The code is separate, in different libraries, but at runtime the execution is intertwined.

>Now, as for whether GLR is slow: with some work, I was able to make
>the GLR implementation used in Elsa perform within a few percent of a
>bison-generated parser, for a deterministic C grammar. This is due
>(in part) to a technique that allows the parser to switch between LR
>and GLR dynamically depending on the grammar+input:
>
>   http://www.cs.berkeley.edu/~smcpeak/papers/elkhound_cc04.ps

Yep, of course, this is basically saying that "GLR is fast when you don't use its power" :slight_smile:

You could clearly refactor the C++ grammar to be completely deterministic and avoid all ambiguous node allocations, but then there is little reason to use GLR.

>For the (nondeterministic) C++ grammar, I measured Elsa's performance
>as being around 30% slower than gcc-2 and around 2x faster than gcc-3
>(that was against gcc-3.2.x or so; gcc-3.4+ has a totally new parser,
>and I haven't compared to it). I do not know if you consider gcc
>"fast enough". (See the above citation for details.)

No, I don't. I consider GCC to be ridiculously slow, particularly when parsing C++. When parsing C, clang is already 2.5x faster than GCC at -fsyntax-only, and about 4.6x faster if you just count the parser (not the preprocessor/lexer) (http://clang.llvm.org/clang_video-07-25-2007.html). I expect clang to be an even higher multiple faster than GCC on C++ code, when we finally support it. This basically reflects the fact that the GCC C++ Front-end has a huge array of known performance problems and N^2 algorithms in it.

>> I don't see how GLR parsing can be done without a liberal amount of
>> heap traffic, but maybe I'm missing something.

>The GLR algorithm itself uses a graph-structured stack, but the
>Elkhound implementation uses a node pool that is nearly as fast as an
>ordinary LR stack like bison's. Heap traffic can be fast if the
>locality is good and the general-purpose allocator is avoided.

Right. The critical thing here is that Elsa's nondeterministic parser trades parse speed and memory (both due to heap allocations) for a more elegant implementation where semantic analysis is completely decoupled from parsing. Unfortunately, for the goals I have for clang, this tradeoff isn't acceptable. :frowning:

>At the end of the day, there are basically three approaches to parsing
>C++ that I am aware of:
>
>* Lexer hack plus hand-crafted parsing. This is what GCC-3.4+ does,
>and what EDG has done all along. It ends up being better (IMO) than
>the LR approach b/c the unpredictability goes away, but it's still a
>lot of work to maintain since the parser does a lot of explicit token
>manipulation. It's also difficult to separate the parser from its
>AST, which is unfortunate from the perspective of possibly reusing the
>GCC parser.

Right, this is the approach that I plan to take :slight_smile:

-Chris