decl/expr ambiguity

Okay, here's another crazy idea. If you boil it down, my objections to preparsing are basically:

1. the perf cost of having to do the prepare in *every* decl case.
2. [minor] the perf cost for qualified expr cases (std::cout << ...)
3. [minor] the maintenance cost of the second parser.

I think that the "parse expr/decl with leading identifier" approach, when generalized to work with any qualified name would adequately solve #2.

Would it be possible to do a similar thing with entire type names? If we had a "parse decl/expr with leading type" then we wouldn't have to decide whether something was a statement/decl until after we read the type. I'm not sure, but if this would let us disambiguate the "definitely a decl" case with current or one token lookahead, then this would fix #1 as well.

Is it possible to determine that something is "definitely a decl" after reading the "int*" in this case?

   int *X = ...

If so, this would solve my primary two objects to the pre-parser. It would mean that we only resolve the leading type once, and only invoke the preparser in extremely rare cases (not *every* decl).

What do you think?

-Chris

Yes; once we've read "int*", we know it must be a decl.

-Eli

Chris Lattner wrote:

Okay, here's another crazy idea. If you boil it down, my objections to preparsing are basically:

1. the perf cost of having to do the prepare in *every* decl case.
2. [minor] the perf cost for qualified expr cases (std::cout << ...)
3. [minor] the maintenance cost of the second parser.
  
1) This is not true, as I explain in this post:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002625.html

2) This is not inherent to the preparser, even if there's a "tentatively parse decl then parse as expr" approach, we still prefer to do such resolutions once; This perf cost needs to be solved in either case.

3) Ummm.. it's [minor] right ? :slight_smile:

As a sidenote, the standard has provisions for the preparser approach, for example:
"If, during parsing, a name in a template parameter is bound differently than it would be bound during a trial parse, the program is ill-formed".

-Argiris

Eli wrote:

Is it possible to determine that something is "definitely a decl"
after reading the "int*" in this case?

int *X = ...

Yes; once we've read "int*", we know it must be a decl.

Ok, but that only helps us if we parse the 'int*' in the unconditional part of the parser. This approach is good for me, and using the preparser to disambiguate hard but uncommon cases is also fine. However, the current patch doesn't do any of this.

Chris Lattner wrote:

Okay, here's another crazy idea. If you boil it down, my objections to preparsing are basically:

1. the perf cost of having to do the prepare in *every* decl case.
2. [minor] the perf cost for qualified expr cases (std::cout << ...)
3. [minor] the maintenance cost of the second parser.

1) This is not true, as I explain in this post:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002625.html

Either I'm very confused (likely!) or your current patch doesn't do this. As I mentioned in the other post, it looks like it runs the preparser for any identifier at top level. It even runs it for the "x = 4" case.

2) This is not inherent to the preparser, even if there's a "tentatively parse decl then parse as expr" approach, we still prefer to do such resolutions once; This perf cost needs to be solved in either case.

By using a "parse expr with leading qualified name" approach, or with something else?

3) Ummm.. it's [minor] right ? :slight_smile:

Yes, if this is the only issue, then I agree it is the best approach :slight_smile:

Thanks for working through this with me, I'm sorry I'm slow/obstinate, :slight_smile:

-Chris

I'm not at all convinced that this is minor. There's a whole lot of
pre-parsing logic to disambiguate this mess of a declaration:

  ::std::vector<int> v(istream_iterator<int>(fin), typename
lambda<istream_iterator<_1>>::template apply<T>::type());

I could have had a bit more fun if I used non-type template arguments
in there, since that requires being able to pre-parse expressions. But
the point is that the maintenance issue looks more minor than it is
because Clang isn't parsing the tricky parts of simple-type-specifier
and elaborated-type-specifier.

Having tentative parsing makes other parts of the C++ compiler easier
to implement, especially error recovery where the parser wants to
check whether tweaking the tokens could produce a valid parse. For
example, one GCC recovers well from is the common error of writing a
<: digraph when starting a template argument list, e.g.,

  namespace N { struct X { }; }
  std::vector<::N::X> vec;

There are also places in C++ where we have to determine whether we're
looking at a type or an expression (e.g., in sizeof(blah)). Will we
have another pre-parser for these? Tentatively parsing solves this
problem as well.

All that said, it might not matter so much whether we pre-parse now or
not. As those trickier parsing bits for types go into the parser, they
could probably be abstracted out to be useful for both the
pre-parser(s?) and the parser, to eliminate unnecessary code/coding
duplication.

  - Doug

Chris Lattner wrote:

Chris Lattner wrote:

Okay, here's another crazy idea. If you boil it down, my objections to preparsing are basically:

1. the perf cost of having to do the prepare in *every* decl case.
2. [minor] the perf cost for qualified expr cases (std::cout << ...)
3. [minor] the maintenance cost of the second parser.

1) This is not true, as I explain in this post:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002625.html

Either I'm very confused (likely!) or your current patch doesn't do this. As I mentioned in the other post, it looks like it runs the preparser for any identifier at top level. It even runs it for the "x = 4" case.

In the second patch posted here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002617.html
No backtracking is enabled for the common cases:

Parser::TentativeParsingResult Parser::isCXXDeclarationStatement() {
.......
    TentativeParsingResult TPR = isCXXDeclarationSpecifier();
    if (TPR != TPR_ambiguous)
      return TPR;

    TentativeParsingAction PA(*this);

    TPR = TryParseSimpleDeclaration();

    PA.Revert();
....
}

isCXXDeclarationSpecifier(), does at most a one token lookahead. (actually, for 'typeof' it tentatively skips through it to see if "typeof(..)" is followed by '(' but this is too uncommon to even discuss it).
If the current token does not indicate a type, isCXXDeclarationSpecifier() does no token consumption.

2) This is not inherent to the preparser, even if there's a "tentatively parse decl then parse as expr" approach, we still prefer to do such resolutions once; This perf cost needs to be solved in either case.

By using a "parse expr with leading qualified name" approach, or with something else?

Here's what I have in mind.
-something like "A::" indicates a scope qualifier.
-there's a parser method with a purpose to resolve scope qualifiers, say "ParseCXXScopeQualifier", it parses them, calls sema actions to resolve them and returns a CXXScopeTy* from Sema (this is used to pass to sema actions that will need it).
-ParseCXXScopeQualifier can cache that CXXScopeTy* result, so that when it is called again for the same token source location, it will return the cached result without doing any sema resolution at all. It will also skip the necessary number of tokens (if it was previously called with "A::b::", it will skip 4 tokens).

Now say that a statement starts with this:

-The preparser sees that 'A::' is a scope qualifier and calls ParseCXXScopeQualifier to do its thing. Then calls Parser::IsTypeName. Both methods cache their results.
-The preparser sees that "A::b::T" is not followed by a '(', backtracks, and returns "it's a declaration"
-The normal parser sees that 'A::' is a scope qualifier and calls ParseCXXScopeQualifier which just returns the previously cached result and skips 4 tokens.
-The normal parser calls Parser::IsTypeName which also just returns the previously cached result.
-sema resolutions are only done once

What do you think ?

PS: Here's a weird test I did. I used the preparser to see how many "ambiguous" declarations (of T(...) style) there are in actual C code. I used a few of GCC files:
gcc.c: declarations in functions: 432 ambiguous: 0
expr.c declarations in functions: 730 ambiguous: 0
combine.c declarations in functions: 564 ambiguous: 0

I think this suggests that it's a really uncommon case.
If anyone wants to give me a preprocessed file for a test, please do!

-Argiris

Doug Gregor wrote:

There are also places in C++ where we have to determine whether we're
looking at a type or an expression (e.g., in sizeof(blah)).

Can you be more specific? The way I see it, you can skip through when expressions are expected without affecting the disambiguation process.
Are you talking about non-type template arguments ? I think in these cases common parsing methods can be used by both a pre-parser and normal parser.

All that said, it might not matter so much whether we pre-parse now or
not. As those trickier parsing bits for types go into the parser, they
could probably be abstracted out to be useful for both the
pre-parser(s?) and the parser, to eliminate unnecessary code/coding
duplication.
  
Certainly! Resolving scope qualifiers and instantiating classes can be shared.

-Argiris

I'm not sure if this has been discussed before (I didn't follow the whole
discussion), so please bear with me if this is repetitive.

A completely different approach would be to build a parser either for a
(non-ambiguous) superset of C++ or building a (GLR like) parser generating
not a single parse tree, but a parse forest covering the ambiguous parts,
and fixing the non-C++ parts/ambiguities in the resulting parse tree(s)
afterwards. As it turns out there are only a handful of quite simple rules
to follow to make sure only valid C++ parse trees remain after this
post-processing. This approach has been used and described in prior art, but
I'm not aware of any comparisons with regard to resource utilization, time
requirements, etc.

I'm aware of the fact that the current approach for writing the C++ parser
is to create a RD parser. For that reason I assume building a GLR parser is
not an option (additionally GLR parser tend to create really huge parse
trees). But the first option is still a possibility: parse a superset of
C++, disambiguating the parse tree during a second (relatively simple) pass.

Here are two papers I know of describing these approaches:

First option: FOG - Flexible Object Generator
Second option:
http://www.lrde.epita.fr/dload/20030521-Seminar/vasseur0503_transformers_rep
ort.pdf

Regards Hartmut

Hi Argiris,

Doug Gregor wrote:

There are also places in C++ where we have to determine whether we're
looking at a type or an expression (e.g., in sizeof(blah)).

Can you be more specific? The way I see it, you can skip through when
expressions are expected without affecting the disambiguation process.

Perhaps sizeof() (and decltype, typeof, typeid) are not a concern,
then; one can skip through and count parentheses in the pre-parser.

Are you talking about non-type template arguments ? I think in these cases
common parsing methods can be used by both a pre-parser and normal parser.

Yes, that was the "fun" part I referred to. Template arguments can be
types or expressions, so the parser needs to be able to disambiguate
those (it resolves in favor of the type-id).

All that said, it might not matter so much whether we pre-parse now or
not. As those trickier parsing bits for types go into the parser, they
could probably be abstracted out to be useful for both the
pre-parser(s?) and the parser, to eliminate unnecessary code/coding
duplication.

Certainly! Resolving scope qualifiers and instantiating classes can be
shared.

How far does "instantiating classes" extend? Will we be sharing the
parsing of typename-specifiers and template-ids, for example?

  - Doug

Argiris Kirtzidis wrote:

PS: Here's a weird test I did. I used the preparser to see how many "ambiguous" declarations (of T(...) style) there are in actual C code. I used a few of GCC files:
gcc.c: declarations in functions: 432 ambiguous: 0
expr.c declarations in functions: 730 ambiguous: 0
combine.c declarations in functions: 564 ambiguous: 0

I think this suggests that it's a really uncommon case.

Just to clarify, IMO this indicates that if we are to enter ambiguity resolution, doing a "tentative parse decl" approach does not offer much of a performance benefit because, in the majority of cases, it will be an expression statement, so backtracking will happen anyway.

-Argiris

Doug Gregor wrote:

Are you talking about non-type template arguments ? I think in these cases
common parsing methods can be used by both a pre-parser and normal parser.
    
Yes, that was the "fun" part I referred to. Template arguments can be
types or expressions, so the parser needs to be able to disambiguate
those (it resolves in favor of the type-id).
  
If the template argument starts with "simple-type-specifier '('" or "typename-specifier '('" the pre-parser can fire off and look for an abstract declarator in order to disambiguate it. The pre-parser still doesn't need to deal with any of the expression syntax.

All that said, it might not matter so much whether we pre-parse now or
not. As those trickier parsing bits for types go into the parser, they
could probably be abstracted out to be useful for both the
pre-parser(s?) and the parser, to eliminate unnecessary code/coding
duplication.

Certainly! Resolving scope qualifiers and instantiating classes can be
shared.
    
How far does "instantiating classes" extend? Will we be sharing the
parsing of typename-specifiers and template-ids, for example?
  
Yes, disambiguation only cares for types (ok and nested-name-specifiers), so methods like ParseTypeNameSpecifier and ParseClassTemplateId can be shared.
They would return a TypeTy* like ParseTypeName currently does (which they would cache to avoid doing instantiations multiple times).

-Argiris

Doug Gregor wrote:

Are you talking about non-type template arguments ? I think in these
cases
common parsing methods can be used by both a pre-parser and normal
parser.

Yes, that was the "fun" part I referred to. Template arguments can be
types or expressions, so the parser needs to be able to disambiguate
those (it resolves in favor of the type-id).

If the template argument starts with "simple-type-specifier '('" or
"typename-specifier '('" the pre-parser can fire off and look for an
abstract declarator in order to disambiguate it.

Okay.

The pre-parser still
doesn't need to deal with any of the expression syntax.

Sure it does. If the template argument isn't a type-id, it's an
expression. The pre-parser doesn't need to parse these expressions,
but it needs to be able to reliably skip over them. That means at
least some parsing to tell when, e.g., '>' is an operator vs. a '>'
that closes a template-argument-list ('>>' is similar, for
error-recovery and C++0x). So we can simplify parsing of expressions,
but it still needs to be there in some form.

All that said, it might not matter so much whether we pre-parse now or
not. As those trickier parsing bits for types go into the parser, they
could probably be abstracted out to be useful for both the
pre-parser(s?) and the parser, to eliminate unnecessary code/coding
duplication.

Certainly! Resolving scope qualifiers and instantiating classes can be
shared.

How far does "instantiating classes" extend? Will we be sharing the
parsing of typename-specifiers and template-ids, for example?

Yes, disambiguation only cares for types (ok and nested-name-specifiers), so
methods like ParseTypeNameSpecifier and ParseClassTemplateId can be shared.
They would return a TypeTy* like ParseTypeName currently does (which they
would cache to avoid doing instantiations multiple times).

If we're sharing ParseTypeNameSpecifier and ParseClassTemplateId (and,
therefore, most of the tricky logic for parsing types), then I'm not
as worried that the pre-parser will require a significant amount of
duplication.

  - Doug

Doug Gregor wrote:

  

If the template argument starts with "simple-type-specifier '('" or
"typename-specifier '('" the pre-parser can fire off and look for an
abstract declarator in order to disambiguate it.
    
Okay.

The pre-parser still
doesn't need to deal with any of the expression syntax.
    
Sure it does. If the template argument isn't a type-id, it's an
expression. The pre-parser doesn't need to parse these expressions,
but it needs to be able to reliably skip over them. That means at
least some parsing to tell when, e.g., '>' is an operator vs. a '>'
that closes a template-argument-list ('>>' is similar, for
error-recovery and C++0x). So we can simplify parsing of expressions,
but it still needs to be there in some form.
  
I wasn't clear enough; I was thinking that parsing template arguments and class instantiations in general would be the job of a shared ParseClassTemplateId, not the responsibility of the pre-parser. So the pre-parser won't actually skip anything, they will be parsed and instantiated properly by ParseClassTemplateId.
ParseClassTemplateId, in turn, may need to fire-off a pre-parser instance just to determine whether a template argument is a type-id or expression.

Yes, disambiguation only cares for types (ok and nested-name-specifiers), so
methods like ParseTypeNameSpecifier and ParseClassTemplateId can be shared.
They would return a TypeTy* like ParseTypeName currently does (which they
would cache to avoid doing instantiations multiple times).
    
If we're sharing ParseTypeNameSpecifier and ParseClassTemplateId (and,
therefore, most of the tricky logic for parsing types), then I'm not
as worried that the pre-parser will require a significant amount of
duplication.
  
Yeah, if the pre-parser was required to properly parse expressions it would definitely be a nightmare ! :slight_smile:

-Argiris

Hi Hartmut,

As you guessed, we have explicitly chosen to not use GLR approaches due to efficiency concerns. Parsing a superset of the language might be possible, but would probably make the common code significantly more complex: it would have to handle the nastiness of decls and exprs in one shared parser.

-Chris

Chris,

> I'm not sure if this has been discussed before (I didn't follow the
> whole
> discussion), so please bear with me if this is repetitive.
>
> A completely different approach would be to build a parser either
> for a
> (non-ambiguous) superset of C++ or building a (GLR like) parser
> generating
> not a single parse tree, but a parse forest covering the ambiguous
> parts,
> and fixing the non-C++ parts/ambiguities in the resulting parse
> tree(s)
> afterwards.

Hi Hartmut,

As you guessed, we have explicitly chosen to not use GLR approaches
due to efficiency concerns.

Sure.

Parsing a superset of the language might
be possible, but would probably make the common code significantly
more complex: it would have to handle the nastiness of decls and exprs
in one shared parser.

I don't follow here. I thought the decls/expr ambiguities are something C++
specific so will have to be handled for C++ and by the C++ specific parts of
the parser only. That means any understood superset of C++ will affect the
C++ specific parser only. No common code with C/ObjC/etc., AFAICS. Could you
elaborate, please?

Regards Hartmut

Two issues: 1) Clang use a unified C and C++ parser, so there is no such thing as a C++-only parser for expressions (for example). 2) the point is that we'd have to maintain code *somewhere* that parses decls and exprs in parallel, it being in the C or C++ front-end doesn't impact the fact that we'd have to write/understand/maintain it.

-Chris