[PATCH] C++ nested-name-specifier (Parser)

Hi,

For the next chapter of the C++ support saga, I've implemented nested-name-specifier support ("foo::bar::").

The attached 'pp-caching' patch, modifies Preprocessor::LookAhead (making it efficient enough to replace LookNext) and adds 'rewinding' functionality.
That is, one can call Preprocessor::EnableRewindAtThisPos, lex and consume any number of tokens, and later call Preprocessor::Rewind to make the preprocessor re-lex the tokens that were previsouly lexed since EnableRewindAtThisPos was called.

This is used by the parser to determine whether a "foo::bar::x" construct represents a typename without consuming any tokens. It goes something like this:

-EnableRewindAtThisPos()
-Parse 'foo::bar::' tokens
-Check whether 'x' is a typename in the 'foo::bar::' scope
-Rewind()

The Preprocessor modifications should have no impact on its efficiency for normal lexing.

The C++ standard does not have a syntactic category for both the global scope specifier ('::') and nested-name-specifier ('foo::'), but I found that it's simpler to regard them both as a 'C++ scope specifier' and have them both treated the same way. For example, "::x", "foo::x", and "::foo::x" will all be treated as "[cxx-scope-spec] x" by the parser.

Comments are welcomed!

-Argiris

pp-caching.patch (13 KB)

nns-parser.patch (34.2 KB)

Hmm, I don't have any low-level comments, but I'm wondering about a few things.

First off, and I guess the most important bit, is how you're planning
to deal with the C++ cast/declaration ambiguity. This ambiguity
affects a lot of the places which are testing ahead for a type, so
adding code that's looking ahead only as far as the end of the type
seems like a temporary step at best. I'm not saying that this patch
is necessarily the wrong way to go (I really have no clue how I'd go
about implementing the ambiguity resolution performantly), but I want
to make sure you have the next step planned so that you don't end up
making extra work for yourself.

This is a more of a side-issue, but would it be worthwhile to make the
parser keep track of enough information to implement isTypeName
itself? It's a relatively large burden on anyone who wants to write
an Action implementation to be required to keep track of enough
information to figure out the answer to this question, especially for
C++, and having the parser use different implementations of identifier
resolution depending on whether we're using Sema or not seems bad.
The obvious arguments against this are that it's sort of duplicating
code from Sema, and it might affect performance for the normal case of
using Sema. That said, it seems worth considering anyway if we really
expect people to be writing their own Action implementations.

-Eli

Hi Eli,

Eli Friedman wrote:

Hmm, I don't have any low-level comments, but I'm wondering about a few things.

First off, and I guess the most important bit, is how you're planning
to deal with the C++ cast/declaration ambiguity. This ambiguity
affects a lot of the places which are testing ahead for a type, so
adding code that's looking ahead only as far as the end of the type
seems like a temporary step at best. I'm not saying that this patch
is necessarily the wrong way to go (I really have no clue how I'd go
about implementing the ambiguity resolution performantly), but I want
to make sure you have the next step planned so that you don't end up
making extra work for yourself.

The parser needs a way to know whether it's standing in front of a type or not. The "look ahead" code that I added is Parser::isTokenStreamTypeName() which only parses ahead of scope specifiers, so that calling isTokenStreamTypeName() can check either "x" or "foo::bar::x". This is useful in general and no matter how exactly the C++ cast/declaration ambiguity gets resolved, I expect isTokenStreamTypeName to still be needed.

This is a more of a side-issue, but would it be worthwhile to make the
parser keep track of enough information to implement isTypeName
itself? It's a relatively large burden on anyone who wants to write
an Action implementation to be required to keep track of enough
information to figure out the answer to this question, especially for
C++, and having the parser use different implementations of identifier
resolution depending on whether we're using Sema or not seems bad.
The obvious arguments against this are that it's sort of duplicating
code from Sema, and it might affect performance for the normal case of
using Sema. That said, it seems worth considering anyway if we really
expect people to be writing their own Action implementations.
  
It's a large burden indeed but I'm not sure it's a good idea to move the burden to the parser (as you said, it's going to overlap with Sema).
I think that a reasonable compromise is MinimalAction which only contains the type checking functionality; other Action implementers can copy from/extend it.

-Argiris

Hi,

For the next chapter of the C++ support saga, I've implemented nested-name-specifier support ("foo::bar::").

Nice!

The attached 'pp-caching' patch, modifies Preprocessor::LookAhead (making it efficient enough to replace LookNext) and adds 'rewinding' functionality.
That is, one can call Preprocessor::EnableRewindAtThisPos, lex and consume any number of tokens, and later call Preprocessor::Rewind to make the preprocessor re-lex the tokens that were previsouly lexed since EnableRewindAtThisPos was called.

I think we will need this functionality (aka 'tentative parsing' or 'backtracking') at some point, but I'm not sure why we need it for this. As a prolog, I haven't dug into the part of the spec, so if I'm missing something obvious, please be gentle on me :slight_smile:

My basic understanding (again, based on intuition, not reading of the spec) of scope resolution was that it could actually be implemented without lookahead. My thought was that if you see ::a::B or A::B that you start by looking up the ::A /A part and decide what it references. when looking up A, consider if it resolves to a namespace. My though would be that the parser invokes an action to look up the decl corresponding to A.

Options are that it could either be:
1) typedef name
2) variable/ivar/etc
3) namespace
4) struct
...

One ::A is looked up, based on what the action returns, my assumption is that the parser would see the next :: and do a subscope lookup. Basically it would parse "::B" and then invoke an action to look up ::B in the decl returned by the ::A lookup. At some point presumably, you end up with a type or a variable etc, which we'd need to treat as the start of a declspec etc. One nice thing about this is that it would not require the parser to build up an explicit 'CXXScopeSpec' object, Sema could build it if it wanted, but it wouldn't be required.

Is this approach achievable? I'd prefer to avoid parsing and rewinding unless absolutely necessary.

-Chris

Hi,

For the next chapter of the C++ support saga, I've implemented
nested-name-specifier support ("foo::bar::").

Hmm, I don't have any low-level comments, but I'm wondering about a few things.

First off, and I guess the most important bit, is how you're planning
to deal with the C++ cast/declaration ambiguity.

You mean the "most vexing parse" issue? I think that we will have to do tentative parsing + backtracking for the general case. With luck, there will be common cases that we can disambiguate early to avoid the cost of backtracking, but I don't know enough about that area to know if that will actually be possible in practice.

This is a more of a side-issue, but would it be worthwhile to make the
parser keep track of enough information to implement isTypeName
itself?

I don't think this is practical unfortunately. In the full generality of C++, you have to do a ton of semantic analysis to be able to handle this. For example:

T<int>::y

If T is some template, you need to know all about members of T to know if Y is a variable or type. Also, T could be partially specialized on int, so you need to handle partial specialization and a bunch of other stuff to handle this. Pragmatically, I don't think we should worry about -parse-noop for C++.

Following up with my previous email to Argiris about scoped lookup, my (again, based on intuition, not the spec) idea of how this would work is that the parser would call:

   ActOnIdentifier("T")

... which returns a decl for the template. It would then see the less than, call an action to determine if the decl is a template or not (if not, '<' is "operator<"). Assuming it is a template, it would parse the template args, calling "act on template args" with them. This would produce a new decl, one for T<int> which Sema would lazily create if needed and handle partial specialization. The parser would then see ::Y and then call the "lookup 'y' in decl" action, which returns yet-another-decl, which is either a type or variable etc.

This doesn't require backtracking or anything crazy in the parser, but it *does* rely on the actions module to keep a large subset of information held in the AST just to be able to parse. I don't see a way around this though.

It's a relatively large burden on anyone who wants to write
an Action implementation to be required to keep track of enough
information to figure out the answer to this question, especially for
C++,

Yes it is. In practice, I don't think -parse-noop will work very well (if at all) for C++. I expect that most clients parsing C++ will end up using the AST.

-Chris

Chris Lattner wrote:

The attached 'pp-caching' patch, modifies Preprocessor::LookAhead (making it efficient enough to replace LookNext) and adds 'rewinding' functionality.
That is, one can call Preprocessor::EnableRewindAtThisPos, lex and consume any number of tokens, and later call Preprocessor::Rewind to make the preprocessor re-lex the tokens that were previsouly lexed since EnableRewindAtThisPos was called.

I think we will need this functionality (aka 'tentative parsing' or 'backtracking') at some point, but I'm not sure why we need it for this. As a prolog, I haven't dug into the part of the spec, so if I'm missing something obvious, please be gentle on me :slight_smile:

This is not a requirement of the spec, it just simplifies the parser in a similar way to how introducing NextToken() simplified it.

My basic understanding (again, based on intuition, not reading of the spec) of scope resolution was that it could actually be implemented without lookahead. My thought was that if you see ::a::B or A::B that you start by looking up the ::A /A part and decide what it references. when looking up A, consider if it resolves to a namespace. My though would be that the parser invokes an action to look up the decl corresponding to A.

Options are that it could either be:
1) typedef name
2) variable/ivar/etc
3) namespace
4) struct
...

One ::A is looked up, based on what the action returns, my assumption is that the parser would see the next :: and do a subscope lookup. Basically it would parse "::B" and then invoke an action to look up ::B in the decl returned by the ::A lookup. At some point presumably, you end up with a type or a variable etc, which we'd need to treat as the start of a declspec etc. One nice thing about this is that it would not require the parser to build up an explicit 'CXXScopeSpec' object, Sema could build it if it wanted, but it wouldn't be required.

As far as the spec is concerned, you basically want to do something like this:

-You have "A::B"
-Parse "A::", not just 'A'. This is because "A::" indicates nested-name-specifier, and you want to do a specialized name lookup for nested-names. This specialized lookup will only consider classes/namespaces and ignore the rest. For example:

namespace A { int x; }
void f() {
   int A;
want the namespace A.
}

-Now parse "B" and lookup 'B' in the scope of 'A'.

Is this approach achievable? I'd prefer to avoid parsing and rewinding unless absolutely necessary.

The advantage of the rewinding part is that there's no need for the Parser or Sema to keep and look after a 'C++ scoping' state (a state that is not just local to functions, but part of the Parser/Sema class).
This also means that we can drop the rewinding part if the Parser or Sema keep and look after a 'C++ scoping' state.
If the Parser keeps the state, it will build and pass 'CXXScopeSpec' objects to action methods, (like in my patch).
If the Sema keeps the state, it will make error recovery awkward. For example:

namespace A{}

-Parser calls ActOnNestedNameSpecifier for "A::"
-Sema goes into "look identifiers in namespace A" mode
-Parsing error at ';'

How should Sema know that it should go back into normal lookup mode ?

And about the preprocessor changes: Is it ok to commit them, and is it more intuitive to change 'Rewind' to 'Backtrack' ?

-Argiris

And about the preprocessor changes: Is it ok to commit them, and is it more intuitive to change 'Rewind' to 'Backtrack' ?

Yes, this does look like a nice cleanup for the previous mechanism. Please improve this comment to explain what the members are and what the invariants are:

+ // Cached tokens state.
+ typedef std::vector<Token> CachedTokensTy;
+ CachedTokensTy CachedTokens;
+ CachedTokensTy::size_type CachedLexPos;
+ CachedTokensTy::size_type CachedRewindPos;

and rename Rewind -> Backtrack and commit it, thanks!

In the future, this will have to be generalized a bit. IIUC, C++ can require nested backtracking in some cases. This would require a stack of speculation. Also, we may eventually need warnings that are emitted while speculating to be cached and discarded if the lookahead is discarded.

Chris Lattner wrote:

The attached 'pp-caching' patch, modifies Preprocessor::LookAhead (making it efficient enough to replace LookNext) and adds 'rewinding' functionality.
That is, one can call Preprocessor::EnableRewindAtThisPos, lex and consume any number of tokens, and later call Preprocessor::Rewind to make the preprocessor re-lex the tokens that were previsouly lexed since EnableRewindAtThisPos was called.

I think we will need this functionality (aka 'tentative parsing' or 'backtracking') at some point, but I'm not sure why we need it for this. As a prolog, I haven't dug into the part of the spec, so if I'm missing something obvious, please be gentle on me :slight_smile:

This is not a requirement of the spec, it just simplifies the parser in a similar way to how introducing NextToken() simplified it.

Ok.

My basic understanding (again, based on intuition, not reading of the spec) of scope resolution was that it could actually be implemented without lookahead. My thought was that if you see ::a::B or A::B that you start by looking up the ::A /A part and decide what it references. when looking up A, consider if it resolves to a namespace. My though would be that the parser invokes an action to look up the decl corresponding to A.

Options are that it could either be:
1) typedef name
2) variable/ivar/etc
3) namespace
4) struct
...

One ::A is looked up, based on what the action returns, my assumption is that the parser would see the next :: and do a subscope lookup. Basically it would parse "::B" and then invoke an action to look up ::B in the decl returned by the ::A lookup. At some point presumably, you end up with a type or a variable etc, which we'd need to treat as the start of a declspec etc. One nice thing about this is that it would not require the parser to build up an explicit 'CXXScopeSpec' object, Sema could build it if it wanted, but it wouldn't be required.

As far as the spec is concerned, you basically want to do something like this:

-You have "A::B"
-Parse "A::", not just 'A'. This is because "A::" indicates nested-name-specifier, and you want to do a specialized name lookup for nested-names. This specialized lookup will only consider classes/namespaces and ignore the rest. For example:

namespace A { int x; }
void f() {
int A;
A::x = 0; // We don't want 'A' lookup to return 'int A' here, we want the namespace A.
}

-Now parse "B" and lookup 'B' in the scope of 'A'.

Ok. Conceptually, I don't see a problem with this. After consuming the identifier, if it is followed by a colon, invoke a different action to do the lookup.

However, I think I am starting to understand what you mean. isDeclarationSpecifier() (for example) contains this code right now:

   switch (Tok.getKind()) {
   ...
     // typedef-name
   case tok::identifier:
     return Actions.isTypeName(*Tok.getIdentifierInfo(), CurScope) != 0;

This means that following my approach would take us back down the route of having a "parse expression with leading identifier expression already eaten" method, and things like that. This is because you'd have to do something like:

case tok::identifier:
   // eats the current identifier and related type stuff as a whole,
   D = ConsumeAndResolveIdentifier();
   if (Actions.isTypeName(D))
     ... it's a decl spec ...
   else
     .. it's an 'identifier expression' ..

Is that the problem? It also means that isDeclarationSpecifier would need backtracking and significant other stuff to work as a predicate. In practice, this means we'd want to refactor all callers to not use it like it does.

Is this approach achievable? I'd prefer to avoid parsing and rewinding unless absolutely necessary.

The advantage of the rewinding part is that there's no need for the Parser or Sema to keep and look after a 'C++ scoping' state (a state that is not just local to functions, but part of the Parser/Sema class).

I'm not sure I believe that. How do you propose to handle more complex things (like the T<int>::x case) in the future? It seems that we're going to need to invoke sema to do template instantiation and other stuff at some point: passing a pre-parsed thing like this to sema as one big action call seems difficult.

This also means that we can drop the rewinding part if the Parser or Sema keep and look after a 'C++ scoping' state.
If the Parser keeps the state, it will build and pass 'CXXScopeSpec' objects to action methods, (like in my patch).
If the Sema keeps the state, it will make error recovery awkward.

I actually think that things are more awkward with having Sema work this out. Specifically, if you have: "A :: B :: C" you might have one reference or it might be "A::B ::C" which can occur in a few places in the grammar:
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/doc/coloncolon.txt

If the parser just passed all of A/B/C to Sema, I'm not sure how it would handle this. It seems really easy if you invoke actions for "A::" then "B" (which returns a type). When the parser got a type, it would naturally ratchet on and parse ::C as a qualified name.

-Chris

Chris Lattner wrote:.

However, I think I am starting to understand what you mean. isDeclarationSpecifier() (for example) contains this code right now:

  switch (Tok.getKind()) {
  ...
    // typedef-name
  case tok::identifier:
    return Actions.isTypeName(*Tok.getIdentifierInfo(), CurScope) != 0;

This means that following my approach would take us back down the route of having a "parse expression with leading identifier expression already eaten" method, and things like that. This is because you'd have to do something like:

case tok::identifier:
  // eats the current identifier and related type stuff as a whole,
  D = ConsumeAndResolveIdentifier();
  if (Actions.isTypeName(D))
    ... it's a decl spec ...
  else
    .. it's an 'identifier expression' ..

Is that the problem?

Yes, exactly! I want to avoid the "leading stuff.." route.

  It also means that isDeclarationSpecifier would need backtracking and significant other stuff to work as a predicate. In practice, this means we'd want to refactor all callers to not use it like it does.

Here's another way to make it work, without any backtracking:
Either Parser or Sema, keeps a scope-spec state (like Parser keeps a 'current scope' state and Sema keeps 'current DeclContext' state), which indicates the Decl that we need to do lookup into.
As you suggested, the Parser, calls actions during parsing of scope specifiers.
For "A::T<int>::":
-Parses 'A::' and calls ActOnNestedNameSpecifier for 'A' (updates scope-spec)
-Parses, instantiates, etc. 'T<int>' through action calls (updates scope-spec)

Sema actions like isTypeName and ActOnIdentifier, check the scope-spec to see whether it needs to lookup a name inside the scope-spec decl, or do a normal lookup.

Now, the questions is whether Sema should keep the scope-spec state, or whether the Parser should keep the scope-spec state and pass it along to the actions (isTypeName, etc.)
When I say passing the scope-spec to actions, I don't mean like the CXXScopeSpec in my patch which contained only parsed tokens, I mean passing the decl that represents the scope where names should be looked up into.

If the Sema keeps the scope-spec state, it is cleaner, but how will the scope-state be cleared in case of an error like "A:: ;" ?
Should the parser call an ActOnErrorAfterNestedName or something ?

Is this approach achievable? I'd prefer to avoid parsing and rewinding unless absolutely necessary.

The advantage of the rewinding part is that there's no need for the Parser or Sema to keep and look after a 'C++ scoping' state (a state that is not just local to functions, but part of the Parser/Sema class).

I'm not sure I believe that. How do you propose to handle more complex things (like the T<int>::x case) in the future? It seems that we're going to need to invoke sema to do template instantiation and other stuff at some point: passing a pre-parsed thing like this to sema as one big action call seems difficult.

Yes, you are right that for templates, passing them as parsed tokens to one action is not a good idea.

This also means that we can drop the rewinding part if the Parser or Sema keep and look after a 'C++ scoping' state.
If the Parser keeps the state, it will build and pass 'CXXScopeSpec' objects to action methods, (like in my patch).
If the Sema keeps the state, it will make error recovery awkward.

I actually think that things are more awkward with having Sema work this out. Specifically, if you have: "A :: B :: C" you might have one reference or it might be "A::B ::C" which can occur in a few places in the grammar:
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/doc/coloncolon.txt

If the parser just passed all of A/B/C to Sema, I'm not sure how it would handle this. It seems really easy if you invoke actions for "A::" then "B" (which returns a type). When the parser got a type, it would naturally ratchet on and parse ::C as a qualified name.

But "B" type could contain "C" as nested type, in which case you may want "A::b::C".
Anyway, I don't think we need to worry about whether it should be "A::B ::C" or "A ::b::C" or whatever.
GCC treats it as "A::b::C" always:

struct S {};
namespace A { S f(); }
S ::a::f(); // error: 'S::A' has not been declared

I think this is correct since the spec says nothing about "'::' associativity".

-Argiris

case tok::identifier:
// eats the current identifier and related type stuff as a whole,
D = ConsumeAndResolveIdentifier();
if (Actions.isTypeName(D))
   ... it's a decl spec ...
else
   .. it's an 'identifier expression' ..

Is that the problem?

Yes, exactly! I want to avoid the "leading stuff.." route.

I've been going back and forth on this issue and did a bit of research.

GCC's C++ front-end handles this in a couple of different ways. To handle the ambiguity between declarations and statements at the top level of compound statements, it basically boils down to:

  cp_parser_parse_tentatively (parser);
  /* Try to parse the declaration-statement. */
  cp_parser_declaration_statement (parser);
  /* If that worked, we're done. */
  if (cp_parser_parse_definitely (parser))
     return;

  /* Look for an expression-statement instead. */
  statement = cp_parser_expression_statement (parser, in_statement_expr);

I don't think there is a good way around this (in its full generality), other than some "peephole" look-ahead to try to disambiguate without fully tentative parsing. This is worthwhile because the code above would run for simple things like { x = y; } because looking at 'x=' you can obviously tell whether it starts a declaration or not.

However, I think that the primary expression case can be better and is more important. The GCC parser (afaict) does a ton of tentative parsing and other weird stuff when trying to resolve identifiers in primary expression and I think this is extremely bad news.

Consider some very common C++ cases:

   std::cout << ....
   std::string X;
   ... std::string::npos ...
   std::string("")
   std::vector<int>::iterator
   etc

The big problem that I have with backtracking is that use of it to handle primary expressions means that you do a huge amount of the same work, over and over again. In this case, analyzing "std::string::npos" will probably start tentatively parsing it as a type (because it could be the start of a 'constructor style cast'. This will require lookup of 'std', then finding the 'string' typedef in it, then finding the definition of the string type (a record_decl for basic_string<char...>) then finding the npos member, then seeing the '(', then realizizing that npos is not a type and aborting/backtracking. In the 'retry' it will start parsing it again, which has to resolve 'std', 'string' and 'npos' all over again, just to continue on as a non-type. It is far worse for things like vector<int> because it requires looking up the template and handling partial specialization *multiple times* which is slow slow slow :slight_smile:

In the 'Old Clang C parser' we had a special case for identifier, and we had "parse expression with leading identifier". You convinced me that you can make the preprocessor fast enough at single token lookahead to allow us to depend on it and simplify the implementation. For C, this is sufficient, but for C++ it clearly is not.

Given the complexities of these C++ issues, I'm strongly inclined to go back to the old design for expressions (however, we should still rely on 1-token lookahead for labels). This means that we resolve the qualified identifier streams exactly once in this case, and don't have to re-resolve them and don't have have backtracking in this (very, very, common case). This does bring back some annoyingness in the clang implementation of the grammar, but it comes into a very simple place that doesn't require constant hacking (the primary/postfix/cast part of the grammar is pretty simple).

Places in the code that use the "is decl spec" predicate will also have to be changed (e.g. the compound statement case). I think it would be interesting to consider making the 'is decl spec' predicate either be tri-state (yes/no/maybe) or see if the callers of this can be made to inline the two cases directly, like we can do for primary expressions. I haven't looked at the clients of 'is decl spec' to see how insane this idea is.

It also means that isDeclarationSpecifier would need backtracking and significant other stuff to work as a predicate. In practice, this means we'd want to refactor all callers to not use it like it does.

Here's another way to make it work, without any backtracking:
Either Parser or Sema, keeps a scope-spec state (like Parser keeps a 'current scope' state and Sema keeps 'current DeclContext' state), which indicates the Decl that we need to do lookup into.
As you suggested, the Parser, calls actions during parsing of scope specifiers.
For "A::T<int>::":
-Parses 'A::' and calls ActOnNestedNameSpecifier for 'A' (updates scope-spec)
-Parses, instantiates, etc. 'T<int>' through action calls (updates scope-spec)

Ok. I like that this means the resolution only has to happen once, which is my goal.

Sema actions like isTypeName and ActOnIdentifier, check the scope-spec to see whether it needs to lookup a name inside the scope-spec decl, or do a normal lookup.

This is interesting: doesn't this mean that everything else will have to check to make sure that this hasn't happened, in order to reject invalid code?

Now, the questions is whether Sema should keep the scope-spec state, or whether the Parser should keep the scope-spec state and pass it along to the actions (isTypeName, etc.)
When I say passing the scope-spec to actions, I don't mean like the CXXScopeSpec in my patch which contained only parsed tokens, I mean passing the decl that represents the scope where names should be looked up into.

Ok.

If the Sema keeps the scope-spec state, it is cleaner, but how will the scope-state be cleared in case of an error like "A:: ;" ?
Should the parser call an ActOnErrorAfterNestedName or something ?

That is possible. I'm more concerned with the fact that Sema for ';' will have to check to make sure there is no current scope-spec that is active. This approach seems to make it so that *all* sema actions would have to check for an empty scope-spec :frowning:

This also means that we can drop the rewinding part if the Parser or Sema keep and look after a 'C++ scoping' state.
If the Parser keeps the state, it will build and pass 'CXXScopeSpec' objects to action methods, (like in my patch).
If the Sema keeps the state, it will make error recovery awkward.

I actually think that things are more awkward with having Sema work this out. Specifically, if you have: "A :: B :: C" you might have one reference or it might be "A::B ::C" which can occur in a few places in the grammar:
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/doc/coloncolon.txt

If the parser just passed all of A/B/C to Sema, I'm not sure how it would handle this. It seems really easy if you invoke actions for "A::" then "B" (which returns a type). When the parser got a type, it would naturally ratchet on and parse ::C as a qualified name.

But "B" type could contain "C" as nested type, in which case you may want "A::b::C".
Anyway, I don't think we need to worry about whether it should be "A::B ::C" or "A ::b::C" or whatever.
GCC treats it as "A::b::C" always:

struct S {};
namespace A { S f(); }
S ::a::f(); // error: 'S::A' has not been declared

I think this is correct since the spec says nothing about "'::' associativity".

Sure, but this can apparently happen in the 'friend' case and some others, according to the Elsa doc. I confess to not being an expert on these matters though.

-Chris

Chris Lattner wrote:

Sema actions like isTypeName and ActOnIdentifier, check the scope-spec to see whether it needs to lookup a name inside the scope-spec decl, or do a normal lookup.

This is interesting: doesn't this mean that everything else will have to check to make sure that this hasn't happened, in order to reject invalid code?

I didn't quite understand what you mean here, can you elaborate ?

Now, the questions is whether Sema should keep the scope-spec state, or whether the Parser should keep the scope-spec state and pass it along to the actions (isTypeName, etc.)
When I say passing the scope-spec to actions, I don't mean like the CXXScopeSpec in my patch which contained only parsed tokens, I mean passing the decl that represents the scope where names should be looked up into.

Ok.

If the Sema keeps the scope-spec state, it is cleaner, but how will the scope-state be cleared in case of an error like "A:: ;" ?
Should the parser call an ActOnErrorAfterNestedName or something ?

That is possible. I'm more concerned with the fact that Sema for ';' will have to check to make sure there is no current scope-spec that is active. This approach seems to make it so that *all* sema actions would have to check for an empty scope-spec :frowning:

This is why I think that having Sema keep the scope-spec state is awkward and a hassle. The parser knows when parsing errors occured, or when it needed to backtrack, etc., so the parser can do a far better job of keeping track of the scope-spec state (when to clear it, etc.).

Sema actions can receive the scope-spec as parameter, which brings me to my next point, that a [scope specifier + identifier] is considered one syntactic construct by the spec, like a qualified-id, or an elaborated-type-specifier, thus it makes sense to pass it along to an action that handles a construct that it may be a part of.
For example, we can consider that ActOnIdentifier receives a qualified-id, and that ActOnTag receives a elaborated-type-specifier ("class A::B {};").

This also means that we can drop the rewinding part if the Parser or Sema keep and look after a 'C++ scoping' state.
If the Parser keeps the state, it will build and pass 'CXXScopeSpec' objects to action methods, (like in my patch).
If the Sema keeps the state, it will make error recovery awkward.

I actually think that things are more awkward with having Sema work this out. Specifically, if you have: "A :: B :: C" you might have one reference or it might be "A::B ::C" which can occur in a few places in the grammar:
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/doc/coloncolon.txt

If the parser just passed all of A/B/C to Sema, I'm not sure how it would handle this. It seems really easy if you invoke actions for "A::" then "B" (which returns a type). When the parser got a type, it would naturally ratchet on and parse ::C as a qualified name.

But "B" type could contain "C" as nested type, in which case you may want "A::b::C".
Anyway, I don't think we need to worry about whether it should be "A::B ::C" or "A ::b::C" or whatever.
GCC treats it as "A::b::C" always:

struct S {};
namespace A { S f(); }
S ::a::f(); // error: 'S::A' has not been declared

I think this is correct since the spec says nothing about "'::' associativity".

Sure, but this can apparently happen in the 'friend' case and some others, according to the Elsa doc. I confess to not being an expert every C++ compiler I've experimented with regards "::" as binding
very tightly, such that any occurrence of an identifier followed
by "::" is always interpreted as an attempt to use the binary "::",
  
I think Clang can be one those pesky compilers too, at least for compatibility's sake :wink:

-Argiris

Chris Lattner wrote:

Sema actions like isTypeName and ActOnIdentifier, check the scope-spec to see whether it needs to lookup a name inside the scope-spec decl, or do a normal lookup.

This is interesting: doesn't this mean that everything else will have to check to make sure that this hasn't happened, in order to reject invalid code?

I didn't quite understand what you mean here, can you elaborate ?

My understanding is that your implementation will build up an instance variable of Sema with the current "scope" If you get something like "X::Y" you set the scope to the looked up version of "X::" and when you parse Y, you use that to qualify it. If you say "X:: 4" then I assume that sema for "4" would have to see that there is qualification happening and reject it. Is this what you meant?

That is possible. I'm more concerned with the fact that Sema for ';' will have to check to make sure there is no current scope-spec that is active. This approach seems to make it so that *all* sema actions would have to check for an empty scope-spec :frowning:

This is why I think that having Sema keep the scope-spec state is awkward and a hassle. The parser knows when parsing errors occured, or when it needed to backtrack, etc., so the parser can do a far better job of keeping track of the scope-spec state (when to clear it, etc.).

Sema actions can receive the scope-spec as parameter, which brings me to my next point, that a [scope specifier + identifier] is considered one syntactic construct by the spec, like a qualified-id, or an elaborated-type-specifier, thus it makes sense to pass it along to an action that handles a construct that it may be a part of.
For example, we can consider that ActOnIdentifier receives a qualified-id, and that ActOnTag receives a elaborated-type-specifier ("class A::B {};").

Ah... so you're saying that Sema *resolves* the scope specifier, but that it returns a cookie *back to the parser*, which then passes it into the next action. If so, this makes a lot of sense to me, I think this can work. This means that only actions that take the scope specifier would have to handle it, and the parser would know how to handle other parser errors.

Sure, but this can apparently happen in the 'friend' case and some others, according to the Elsa doc. I confess to not being an expert on these matters though.

The Elsa doc also says:

every C++ compiler I've experimented with regards "::" as binding
very tightly, such that any occurrence of an identifier followed
by "::" is always interpreted as an attempt to use the binary "::",

I think Clang can be one those pesky compilers too, at least for compatibility's sake :wink:

If you really don't think this will be an issue, I will happily believe you :), you know far more about this than I do. I just want efficiency :slight_smile:

-Chris

Chris Lattner wrote:

Chris Lattner wrote:

Sema actions like isTypeName and ActOnIdentifier, check the scope-spec to see whether it needs to lookup a name inside the scope-spec decl, or do a normal lookup.

This is interesting: doesn't this mean that everything else will have to check to make sure that this hasn't happened, in order to reject invalid code?

I didn't quite understand what you mean here, can you elaborate ?

My understanding is that your implementation will build up an instance variable of Sema with the current "scope" If you get something like "X::Y" you set the scope to the looked up version of "X::" and when you parse Y, you use that to qualify it. If you say "X:: 4" then I assume that sema for "4" would have to see that there is qualification happening and reject it. Is this what you meant?

No, I think it's the parser's job to reject "X:: 4". It should emit a parsing error when it encounters '4'.

That is possible. I'm more concerned with the fact that Sema for ';' will have to check to make sure there is no current scope-spec that is active. This approach seems to make it so that *all* sema actions would have to check for an empty scope-spec :frowning:

This is why I think that having Sema keep the scope-spec state is awkward and a hassle. The parser knows when parsing errors occured, or when it needed to backtrack, etc., so the parser can do a far better job of keeping track of the scope-spec state (when to clear it, etc.).

Sema actions can receive the scope-spec as parameter, which brings me to my next point, that a [scope specifier + identifier] is considered one syntactic construct by the spec, like a qualified-id, or an elaborated-type-specifier, thus it makes sense to pass it along to an action that handles a construct that it may be a part of.
For example, we can consider that ActOnIdentifier receives a qualified-id, and that ActOnTag receives a elaborated-type-specifier ("class A::B {};").

Ah... so you're saying that Sema *resolves* the scope specifier, but that it returns a cookie *back to the parser*, which then passes it into the next action.

Yes! Sema will return a CXXScopeTy* (alias for void*) and the parser will keep track of it and pass it along to other actions.

  If so, this makes a lot of sense to me, I think this can work. This means that only actions that take the scope specifier would have to handle it, and the parser would know how to handle other parser errors.

I'm glad that you agree. I'll post an updated patch later on.

-Argiris

For parsing scope specifiers ("A::b::") I tried out having the parser keep a "parsed scope spec" state, so that parsing functions can check it and act accordingly.
This works and Sema only resolves a scope specifier once, but it's not a very attractive option from the maintainability viewpoint.

You basically insert various "hasParsedCXXScopeSpec()" checks at several points. Any parsing function may need to use such a check in order to proceed correctly and it's not clear which functions need to do that and at what exact point; you have to consider the execution paths that lead to this function and whether a scope specifier may be parsed before this function is called.

To be more specific, here are some examples:

Parser::DeclTy *Parser::ParseDeclaration(unsigned Context) {
  if (hasParsedCXXScopeSpec())
    return ParseSimpleDeclaration(Context);

  switch (Tok.getKind()) {
  case tok::kw_namespace:
    return ParseNamespace(Context);
  default:
    return ParseSimpleDeclaration(Context);
  }
}

Parser::ExprResult Parser::ParseExpression() {
  if (Tok.is(tok::kw_throw) && !hasParsedCXXScopeSpec())
    return ParseThrowExpression();

  ExprResult LHS = ParseCastExpression(false);
  if (LHS.isInvalid) return LHS;
   return ParseRHSOfBinaryExpression(LHS, prec::Comma);
}

A more general approach that deals with the C++ ambiguities, may work for parsing scope specifiers too.

The C++ ambiguities, that function-style casting introduces, are like these mentioned here:

I think that tentative parsing/backtracking is the most clean way to deal with the C++ ambiguities. In general, backtracking does not work well for most Action methods; for example ActOnIdentifierExpr creates new Expr objects and if you backtrack after calling ActOnIdentifierExpr, you have to deal with both the Expr objects and the diagnostics that may occur (cache them ? clean up Expr objects if you have to discard them ?).
But there are a few Action methods that serve mostly as an aid for the parser and do not deal with AST building. "isTypeName" is one of them, and "ActOnNestedNameSpecifier" can be considered one of them too.
ActOnNestedNameSpecifier basically takes "A::" and returns a CXXScopeTy that the parser can use to pass along to other actions, so backtracking can be done after calling it without a lot of fuss.
If you get "C<int>::", you will instantiate a template class but this instantiation will be added to the list of unique types, it won't be discarded, thus backtracking can be done with something like this too.

Preprocessor can now do efficient backtracking, here's my suggestion for having efficient backtracking in the Parser too:

Have a special parsing function with the purpose of determining whether a statement is a declaration.
It will work by using tentative parsing/backtracking.
It will only call Action methods to do typechecking (isTypeName) and resolve scope specifiers ("A::") (the Action methods may instantiate template classes).
During tentative parsing, the result of these Action methods will be cached based on token location (we can assume that on a particular SourceLocation, if you call Action.isTypeName, the result will be the same).
We don't have to cache diagnostics, if an error occurs because of a mistyped scope specifier (or when doing template class instantiation), allow the diagnostic and skip to the next statement.
As soon as we know that the current statement is a declaration or not, do backtracking and call the normal parsing functions. For the more common cases, the necessary tentative parsing will be little or no tentative parsing at all.
When, during normal parsing, the parser needs to call an Action method that was already called during tentative parsing, the parser will use the cached result.
Before moving to the next statement, clear the cached results of Action methods, and repeat.

With the above approach, Sema will do typechecking only once for each case the parser needs typechecking.

Any thoughts about the above ?

-Argiris