[PATCH] C++ nested-names (Parser) and annotation tokens

Hi,

The attached patches implement support for nested-name-specifiers (foo::bar::x) on the Parser utilizing 'annotation tokens' (many thanks to Doug for the idea here: http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002664.html)

About annotation tokens:

These are a special kind of tokens that the parser may use (not the lexer) to replace a stream of lexed tokens with a single one that encapsulates the relevant semantic information.
There are two kinds:
-typename annotation (represents a typedef name in C, and a possibly qualified typename in C++, like "foo::bar::myclass")
-C++ scope annotation (represents a nested-name-specifier, ("foo::bar::")

Annotation tokens contain a void* value that represents semantic information specific to the annotation kind (a TypeTy* for typename and CXXScopeTy* for scope) and the SourceRange of the tokens that they replaced.
As you can see in the attached "annot-token.patch" there were some changes to the Token class to support annotations but its size did not change.

The benefits of the annotation tokens are:

----- 1) Vastly simplified handling of nested-names.
In my previous attempts at nested-names, the main issue was how to keep track of the "C++ scope specifier state" in a way so that introducing nested-names, at "parsing contexts" that don't particularly care about nested-names, won't over-complicate things and cause a lot of code duplication for the parsing code. Here's an example on how annotation tokens handle that:

Assume that we have:
  sizeof( foo::bar::x )

sizeof doesn't particularly care about nested names, it only wants to find out if it has a type or an expression and defer parsing to the appropriate parsing functions.
Here's how it works if "foo::bar::x" is a type.

-sizeof calls Parser::isDeclarationSpecifier
    -Parser::isDeclarationSpecifier at the beginning calls Parser::AnnotateToken,
        -Parser::AnnotateToken parses and resolves both the scope-spec and the typename and sets as current token an annotation type token that indicates the type
    -Parser::isDeclarationSpecifier sees that the current token is an annotation type token and returns true since this is a declaration specifier
-sizeof calls Parser::ParseTypeName
    -When execution reaches Parser::ParseDeclarationSpecifiers, it sees the annotation type token, takes the information from it and "consumes" it from the token stream.

Ok, how about if "foo::bar::x" is an expression ?:

-sizeof calls Parser::isDeclarationSpecifier
    -Parser::isDeclarationSpecifier, at the beginning calls Parser::AnnotateToken,
        -Parser::AnnotateToken parses the scope-spec, sees that 'x' is not a typename and sets as current token an annotation scope token for "foo::bar::" (which is followed by the 'x' identifier token)
    -Parser::isDeclarationSpecifier sees that the current token is not a declaration specifier and returns false
-sizeof calls Parser::ParseExpression
    -When execution reaches Parser::ParseCastExpression, the annotation scope token indicates a qualified-id expression which is handled by Parser::ParseCXXIdExpression
        -Parser::ParseCXXIdExpression takes the information from the annotation scope token and calls Actions.ActOnIdentifierExpr by passing the 'x' identifier and the specific C++ scope that it should be a member of

The important thing to notice about the above is that nested-names didn't affect the parsing logic of contexts that don't directly deal with nested-names.
Sizeof didn't have to do some special check for nested names. If the expression was this:

sizeof( foo::bar:: )

The error would be reported by Parser::ParseCXXIdExpression, sizeof doesn't have to check for this too.

At this point you may think that the side-effects of Parser::isDeclarationSpecifier (changing the token stream) may lead to problems, but in practice, due to how tokens are used, this is highly unlikely.
The parser mostly deals with just what is the current token and how that affects the current parsing logic. It doesn't have some "long term token memory" that can be "unsynchronized" by changing the token stream.

----- 2) Efficient backtracking.
The ambiguity resolution parser can use annotation tokens to spare the Parser from having to re-parse nested-names.
The nested-names (and typenames) will be resolved by the tentative parser once and the normal parser will use the annotation tokens.

----- 3) While annotation tokens bring the most benefits for C++, they are also useful for C too.
Currently, a typename gets looked up twice, once in Parser::isDeclarationSpecifier and then in Parser::ParseDeclarationSpecifiers. By replacing the typename with an annotation token, a typename gets looked up and resolved only once.

Any comments are welcome!

-Argiris

nns-parser.patch (33.3 KB)

annot-token.patch (7.81 KB)

Hi Argiris,

The attached patches implement support for nested-name-specifiers
(foo::bar::x) on the Parser utilizing 'annotation tokens' (many thanks to
Doug for the idea here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002664.html)

Nice job taking a one-sentence description and turning it into a very
clean implementation!

About annotation tokens:

These are a special kind of tokens that the parser may use (not the lexer)
to replace a stream of lexed tokens with a single one that encapsulates the
relevant semantic information.
There are two kinds:
-typename annotation (represents a typedef name in C, and a possibly
qualified typename in C++, like "foo::bar::myclass")
-C++ scope annotation (represents a nested-name-specifier, ("foo::bar::")

Annotation tokens contain a void* value that represents semantic information
specific to the annotation kind (a TypeTy* for typename and CXXScopeTy* for
scope) and the SourceRange of the tokens that they replaced.
As you can see in the attached "annot-token.patch" there were some changes
to the Token class to support annotations but its size did not change.

I'm glad to see that Token remained the same size.

The benefits of the annotation tokens are:

----- 1) Vastly simplified handling of nested-names.
In my previous attempts at nested-names, the main issue was how to keep
track of the "C++ scope specifier state" in a way so that introducing
nested-names, at "parsing contexts" that don't particularly care about
nested-names, won't over-complicate things and cause a lot of code
duplication for the parsing code.

I can definitely see how this helps... it means we can effectively
treat these nested names as a single token when parsing other parts of
the grammar, which will drastically simplify those other pieces.

Here's an example on how annotation tokens

[snip]

----- 2) Efficient backtracking.
The ambiguity resolution parser can use annotation tokens to spare the
Parser from having to re-parse nested-names.
The nested-names (and typenames) will be resolved by the tentative parser
once and the normal parser will use the annotation tokens.

----- 3) While annotation tokens bring the most benefits for C++, they are
also useful for C too.
Currently, a typename gets looked up twice, once in
Parser::isDeclarationSpecifier and then in
Parser::ParseDeclarationSpecifiers. By replacing the typename with an
annotation token, a typename gets looked up and resolved only once.

Any comments are welcome!

First of all, I'm really impressed. This came out much cleaner than I
had anticipated. I have some nitpicks and a question below, but as far
as I'm concerned, if you have tests and some minimal Sema support (so
that all the tests run properly), please feel free to check it in.
Great job!

Some nitpicks follow.

+ /// ActOnCXXEnterDeclaratorScope - Called when a C++ scope specifier (global
+ /// scope or nested-name-specifier) is parsed, part of a declarator-id.
+ /// After this method is called, according to [C++ 3.4.3p3], names should be
+ /// looked up in the declarator-id's scope, until the declarator is
parsed and
+ /// ActOnCXXExitDeclaratorScope is called.
+ /// The 'SS' should be a non-empty valid CXXScopeSpec.
+ virtual void ActOnCXXEnterDeclaratorScope(Scope *S, const CXXScopeSpec &SS) {
+ }

Hi Doug, thanks for reviewing.

Doug Gregor wrote:

First of all, I'm really impressed. This came out much cleaner than I
had anticipated. I have some nitpicks and a question below, but as far
as I'm concerned, if you have tests and some minimal Sema support (so
that all the tests run properly), please feel free to check it in.
Great job!

Some nitpicks follow.

+ /// ActOnCXXEnterDeclaratorScope - Called when a C++ scope specifier (global
+ /// scope or nested-name-specifier) is parsed, part of a declarator-id.
+ /// After this method is called, according to [C++ 3.4.3p3], names should be
+ /// looked up in the declarator-id's scope, until the declarator is
parsed and
+ /// ActOnCXXExitDeclaratorScope is called.
+ /// The 'SS' should be a non-empty valid CXXScopeSpec.
+ virtual void ActOnCXXEnterDeclaratorScope(Scope *S, const CXXScopeSpec &SS) {
+ }
+
+ /// ActOnCXXExitDeclaratorScope - Called when a declarator that previously
+ /// invoked ActOnCXXEnterDeclaratorScope(), is finished. 'SS' is the same
+ /// CXXScopeSpec that was passed to ActOnCXXEnterDeclaratorScope as well.
+ /// Used to indicate that names should revert to being looked up in the
+ /// defining scope.
+ virtual void ActOnCXXExitDeclaratorScope(const CXXScopeSpec &SS) {
+ }

It would be really, really great if we had an RAII class of some sort
to make sure that
ActOnCXXEnterDeclaratorScope/ActOnCXXExitDeclaratorScope calls were
always paired. For example, Parser::ParseDirectDeclarator's control
flow looks right, but the Enter call is pretty far from the Exit call
(and at a different level of "if" nesting).
  
Good idea!

I'm pretty certain that I don't understand the role of these
functions. It looks like they want Sema to keep some additional state
regarding the scope in which name lookup should occur that's somehow
different from the current scope of parsing, but I don't know when
that happens... when we're parsing a direct-declarator, say, when
declaring a member function out-of-line:

  void Sema::MemberFunction(TypeFromInSema Arg) { }

Once we've parse Sema::MemberFunction, we seem to be calling
ActOnCXXEnterDeclaratorScope. I assume that's so that the lookup of
TypeFromInSema will look into 'Sema' first?
  
Yes, and here's another example:

class C {
  static const int num=5;
  static int arr[num];
  void m();
};

int C::arr[num]; // lookup 'num' in 'class C' scope

Could we instead have a different kind of Scope that handles this case
of being inside "Sema" without
  
This is a "enter into an arbitrary already existing scope" kind of scope, and doesn't have much in common with the "enter a new parent-nested scope" kind that the scope mechanism of Parser uses.
Also, name lookup in Sema currently relies on 'CurContext' so the ActOnCXXEnterDeclaratorScope is useful to let it know that the current decl context should change.

Do you have an alternative suggestion in mind ?
Like, eliminating Sema's CurContext and putting that kind of information into Parser's Scope ?

+ /// [C++] template-id
+ ///
+ bool isTokenUnqualifiedId() const {

Please put a [TODO] or [FIXME] on the template-id production.
  
Ok.

+ /// AnnotateToken - If the current token position is on a typename (possibly
+ /// qualified in C++) or a C++ scope specifier not followed by a typename,
+ /// AnnotateToken will replace one or more tokens with a single annotation
+ /// token representing the typename or C++ scope respectively.
+ /// This simplifies handling of C++ scope specifiers and allows efficient
+ /// backtracking without the need to re-parse and resolve nested-names and
+ /// typenames.
+ void AnnotateToken();

I think the name of this routine could be better, for two reasons.
First, it doesn't say what the annotations are doing---resolving
qualified-ids and such. Second, I'd like there to be a "Maybe" or a
"Try" at the beginning, to say that it might change the token stream
(or it might not).
  
TryAnnotateTypeOrScopeToken ?

-Argiris

Hi Argiris,

Doug Gregor wrote:

I'm pretty certain that I don't understand the role of these
functions. It looks like they want Sema to keep some additional state
regarding the scope in which name lookup should occur that's somehow
different from the current scope of parsing, but I don't know when
that happens... when we're parsing a direct-declarator, say, when
declaring a member function out-of-line:

void Sema::MemberFunction(TypeFromInSema Arg) { }

Once we've parse Sema::MemberFunction, we seem to be calling
ActOnCXXEnterDeclaratorScope. I assume that's so that the lookup of
TypeFromInSema will look into 'Sema' first?

Yes, and here's another example:

class C {
static const int num=5;
static int arr[num];
void m();
};

int C::arr[num]; // lookup 'num' in 'class C' scope

Okay, good.

Could we instead have a different kind of Scope that handles this case
of being inside "Sema" without

This is a "enter into an arbitrary already existing scope" kind of scope,
and doesn't have much in common with the "enter a new parent-nested scope"
kind that the scope mechanism of Parser uses.

Well, they have in common that name lookup needs to start from
whatever scope we most recently entered, which is one of the most
important parts of the Parser-Sema interaction. While there's
certainly a difference between "enter a new child scope" and "enter an
arbitrary scope", I think the ideas are close enough that it's just a
boolean flag on the enter/exit routines.

Also, name lookup in Sema currently relies on 'CurContext' so the
ActOnCXXEnterDeclaratorScope is useful to let it know that the current decl
context should change.

Do you have an alternative suggestion in mind ?
Like, eliminating Sema's CurContext and putting that kind of information
into Parser's Scope ?

I was about to say "yes", and then I thought about template
instantiation. With template instantiation, you typically want to
enter into the scope of the template you'll be instantiating so that
name lookup and access checking can be done from the appropriate
place. That means that it is a good thing for Sema to have CurContext
(since the parser need not exist when template instantiation occurs),
and therefore probably shouldn't be completely tied to the Parser's
Scope.

I didn't actually get to see what Sema's ActOnCXXEnterDeclaratorScope
looked like, so I'm not sure if I have a better suggestion. Does it
use a second kind of "current context" beyond CurContext?

+ void AnnotateToken();

I think the name of this routine could be better, for two reasons.
First, it doesn't say what the annotations are doing---resolving
qualified-ids and such. Second, I'd like there to be a "Maybe" or a
"Try" at the beginning, to say that it might change the token stream
(or it might not).

TryAnnotateTypeOrScopeToken ?

Sure.

  - Doug

Doug Gregor wrote:

  

Could we instead have a different kind of Scope that handles this case
of being inside "Sema" without

This is a "enter into an arbitrary already existing scope" kind of scope,
and doesn't have much in common with the "enter a new parent-nested scope"
kind that the scope mechanism of Parser uses.
    
Well, they have in common that name lookup needs to start from
whatever scope we most recently entered, which is one of the most
important parts of the Parser-Sema interaction. While there's
certainly a difference between "enter a new child scope" and "enter an
arbitrary scope", I think the ideas are close enough that it's just a
boolean flag on the enter/exit routines.
  
Currently the assumptions for the Scope is that it contains a set of declarations that were declared in it and has a parent Scope that contains it.
I'm concerned about the Parser passing to Sema a Scope that has an empty declaration set and no real parent.
But it may turn out not a big deal if it doesn't actually matter much.

Also, name lookup in Sema currently relies on 'CurContext' so the
ActOnCXXEnterDeclaratorScope is useful to let it know that the current decl
context should change.

Do you have an alternative suggestion in mind ?
Like, eliminating Sema's CurContext and putting that kind of information
into Parser's Scope ?
    
I was about to say "yes", and then I thought about template
instantiation. With template instantiation, you typically want to
enter into the scope of the template you'll be instantiating so that
name lookup and access checking can be done from the appropriate
place. That means that it is a good thing for Sema to have CurContext
(since the parser need not exist when template instantiation occurs),
and therefore probably shouldn't be completely tied to the Parser's
Scope.
  
If I understand correctly, you mean when instantiating a template that contains
   SomeClass<T>::x // 'T' is a template parameter
the DeclContext for "SomeClass<T>::" will be resolved at instatiation time, not parsing time, so we need to have the 'CurContext' around.
Is this correct ?

I didn't actually get to see what Sema's ActOnCXXEnterDeclaratorScope
looked like, so I'm not sure if I have a better suggestion. Does it
use a second kind of "current context" beyond CurContext?
  
No, ActOnCXXEnterDeclaratorScope will just set 'CurContext' to the context of the nested-name defined in the declarator.
The difference is in how contexts are 'popped' in Sema. Currently, when 'CurContext' is popped we switch to its parent, but this won't work for contexts caused by nested names (when they are popped, 'CurContext' should switch to the declaration context before the nested-name was introduced).

-Argiris

If I understand correctly, you mean when instantiating a template that
contains
SomeClass<T>::x // 'T' is a template parameter
the DeclContext for "SomeClass<T>::" will be resolved at instatiation time,
not parsing time, so we need to have the 'CurContext' around.
Is this correct ?

Yes, that's the idea.

No, ActOnCXXEnterDeclaratorScope will just set 'CurContext' to the context
of the nested-name defined in the declarator.
The difference is in how contexts are 'popped' in Sema. Currently, when
'CurContext' is popped we switch to its parent, but this won't work for
contexts caused by nested names (when they are popped, 'CurContext' should
switch to the declaration context before the nested-name was introduced).

Okay, I like this approach. Thanks!

  - Doug