[PATCH] C++ decl/expr ambiguity resolution

Hi,

The attached patch disambiguates between declaration and expression statements. It uses tentative parsing and backtracking.
For the common cases no tentative parsing is needed; for the ambiguous ones, this results in multiple type checks (once for tentative parsing and once for the actual parse).
As I've mentioned in a previous post, this can be made more efficient in a later patch by caching the type-checking results of the Action module.

About the "ambiguity resoluter":
The C++ standard doesn't define in detail how the ambiguity resolution should work; it only guarantees that if the statement is a valid declaration, it will be parsed as a declaration.
But exactly which parts of the statement should we examine to determine the validity of a declaration? This is left undefined.
In practice this means that for a statement that GCC considers an invalid declaration, Clang may parse it as a valid expression.
Such differences are not important because they only concern erroneous code; the more important guarantee, that correct GCC code should be parsed correctly by Clang, is not compromised:
-If GCC parses a statement without errors, Clang will parse it the same way.
-If GCC parses a statement and emits errors because it treats it as erroneous declaration, Clang may treat it as a correct expression instead.

Here's an example:

FuncType(a)(x,y)->z = 0;

GCC will disambiguate it as a declaration and emit errors, Clang will disambiguate it as an expression instead.

-Argiris

tentative-parse.patch (24.9 KB)

But exactly which parts of the statement should we examine to determine the
validity of a declaration? This is left undefined.

Hmm... I see the following: "Note: To disambiguate, the whole
statement might have to be examined to determine if it is an
expression-statement or a declaration." I think this is pretty clear;
if there were some allowable shortcuts, they would be documented in
the standard.

-If GCC parses a statement without errors, Clang will parse it the same way.

That's always good :slight_smile:

-If GCC parses a statement and emits errors because it treats it as
erroneous declaration, Clang may treat it as a correct expression instead.

We don't want to be accepting incorrect code, but it doesn't look like
that's the case here.

Here's an example:

FuncType(a)(x,y)->z = 0;

GCC will disambiguate it as a declaration and emit errors, Clang will
disambiguate it as an expression instead.

What does the complete testcase look like? Something like the following?

struct S {int z;};
typedef S* (*FuncType)(int,int);
int x,y;
S* a() {
  FuncType(a)(x,y)->z = 0;
  return 0;
}

I think that's pretty clearly an expression; there isn't any
reasonable way to parse the -> operator as part of a declaration. The
Comeau compiler seems to agree, and accepts the given testcase. This
appears to be a bug in gcc; they'd probably appreciate it if you filed
it.
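For reference, here is how the expression reading decomposes, step by step:

// Expression reading of "FuncType(a)(x,y)->z = 0;":
//   FuncType(a)        function-style cast of 'a' to FuncType
//   FuncType(a)(x,y)   call through the cast function pointer, yielding S*
//   ...->z = 0         member access on the S* result, then assignment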

I'll try to take a look at the patch sometime over the weekend.

-Eli

Eli Friedman wrote:

  

But exactly which parts of the statement should we examine to determine the
validity of a declaration? This is left undefined.
    
Hmm... I see the following: "Note: To disambiguate, the whole
statement might have to be examined to determine if it is an
expression-statement or a declaration." I think this is pretty clear;
if there were some allowable shortcuts, they would be documented in
the standard.
  
Yes, but the "might have..." is pretty... ambiguous :slight_smile:
One might say,
"by examining these parts of the statement I can conclude whether it's a declaration or expression",
while another might say,
"these parts are not enough, I'll continue to examine the rest of the statement".
The standard says "statement disambiguated as a declaration may be an ill-formed declaration", so the first one can say "with these parts of the statement I concluded it's a declaration; if the next parts do not compile as a declaration, then it's an ill-formed declaration".

If the standard said "the whole statement *should* be examined to determine..." it would be clearer. But this is very subtle, and I may be wrong.

The issue is mostly how to deal with comma-separated declarators/expressions:

int(x), ++x;

If we examine "++x" we will conclude that this is an expression. If we only examine "int(x)" we will regard it as a declaration (note that the C++ standard does not have such an example).
GCC examines only the first declarator, while MSVC examines all of them.
Given that comma-separated declarators are *much* more common than comma-separated expressions, and that a function-style cast followed by a comma is mostly useless, I chose to go along with GCC and only examine the first declarator.
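To make the two readings concrete, here is a minimal sketch (a hypothetical testcase, not taken from the patch):

void f() {
  int x = 0;
  int(x), ++x;
  // Declaration reading (first declarator only, GCC-style):
  //   "int (x), ++x;" declares x, then rejects "++x" as a declarator.
  // Expression reading (all declarators examined, MSVC-style):
  //   a comma expression: function-style cast of x to int, then ++x.
}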

Another one:

T(x) = 1, ++x;

Again, if we examine "++x" we will conclude that this is an expression. But both GCC and MSVC only take into account "T(x) = 1" and consider the above example a declaration; GCC is at least consistent here, since it always stops at the first declarator.

In general, one might say that it's a bit more consistent to treat statements as declarations by the first declarator:

T x, ++x; // this is malformed declaration
T(x), ++x; // so is this

What do you think about comma-separated declarators vs. comma-separated expressions?

Here's an example:

FuncType(a)(x,y)->z = 0;

GCC will disambiguate it as a declaration and emit errors, Clang will
disambiguate it as an expression instead.
    
What does the complete testcase look like? Something like the following?

struct S {int z;};
typedef S* (*FuncType)(int,int);
int x,y;
S* a() {
  FuncType(a)(x,y)->z = 0;
  return 0;
}

I think that's pretty clearly an expression; there isn't any
reasonable way to parse the -> operator as part of a declaration. The
Comeau compiler seems to agree, and accepts the given testcase. This
appears to be a bug in gcc; they'd probably appreciate it if you filed
it.
  
Ok, will do.

I'll try to take a look at the patch sometime over the weekend.
  
Thanks!

-Argiris

Eli Friedman wrote:

But exactly which parts of the statement should we examine to determine the
validity of a declaration? This is left undefined.

Hmm... I see the following: "Note: To disambiguate, the whole
statement might have to be examined to determine if it is an
expression-statement or a declaration." I think this is pretty clear;
if there were some allowable shortcuts, they would be documented in
the standard.

Yes but the "might have..." is pretty... ambiguous :slight_smile:
One might say,
"by examining these parts of the statement I can conclude whether it's a
declaration or expression"
another one might say
"these parts are not enough, I'll continue to examine the rest of the
statement".
The standard says "statement disambiguated as a declaration may be an
ill-formed declaration", so the first one can say "with these parts of the
statement I concluded it's a declaration, if the next parts do not compile
as a declaration then it's an ill-formed declaration".

If the standard said "the whole statement *should* be examined to
determine.." it would be more clear. But this is very subtle and I may be
wrong.

I agree that part of the standard isn't very well written. That said,
for the cases that are supposed to parse as expressions, the ambiguity
resolution doesn't actually have any effect; the grammar isn't
ambiguous for those cases in the first place. The grammar is just
very annoying for the parser. :slight_smile:

Comeau seems to agree with my interpretation; it accepts all the
examples that you've given.

That said, as long as we accept everything that gcc accepts, it
probably doesn't particularly matter for anything besides
standards-compliance testsuites, so I wouldn't worry about it too
much.

The issue is mostly how to deal with comma separated
declarators/expressions:

int(x), ++x;

if we examine "++x" we will conclude that this is an expression. If we only
examine "int(x)" we will regard this as declaration (note that the C++
standard does not have such an example).
GCC examines only the first declarator, while MSVC examines all of them.
Given that comma separated declarators are *much* more common than comma
separated expressions, and that a function-style cast followed by a comma is
mostly useless, I chose to go along with GCC and only examine the first
declarator.

Another one:

T(x) = 1, ++x;

Again, if we examine "++x" we will conclude that this is an expression. Now
both GCC and MSVC only take into account "T(x) = 1" and consider the above
example as declaration; GCC is more consistent here.

In general, one might say that it's a bit more consistent to treat
statements as declarations by the first declarator:

T x, ++x; // this is malformed declaration
T(x), ++x; // so is this

What do you think about comma-separated declarators vs. comma-separated
expressions ?

Hmm, well, I agree it's very unlikely that real-world code uses a
comma after a function-style constructor...

In any case, it's probably a good idea to emit a warning for code that
ends up falling into ambiguity resolution, i.e. anything that gets
past the call to TryParseDeclarationSpecifier from
TryParseSimpleDeclaration in your patch. It's very likely that such
code doesn't do what the author intends, and it's usually very easy to
make the intention explicit (either by removing parentheses around the
variable declaration, or putting parentheses around the function-style
cast).

Comments on the patch follow:

+ while (Tok.is(tok::kw_const) || Tok.is(tok::kw_volatile))
+ ConsumeToken();

Doesn't the presence of const or volatile prove that this is a
declaration? I don't think this actually affects the semantics, but
it seems cleaner to exit once we can prove something.

Also, shouldn't we be handling restrict here as well? It's not
standard C++, but I think people use it.

+ if (Tok.is(tok::identifier)) {
+ // declarator-id
+ ConsumeToken();

Don't you need some sort of check to see whether this is a type?
Consider something like the following:
class T {public: T(int a) {} T() {} int operator*() {return 0;}};
void a() {T(*T());} // Must be expression

+ case tok::kw_const:
+ case tok::kw_volatile:

Again, handle restrict?

+Parser::TentativeParsingResult
+Parser::TryParseParameterDeclarationClause(SourceLocation LParenLoc) {
+ TentativeParsingResult TPR = TryParseDeclarationSpecifier();
+
+ if (TPR != TPR_false) {

What the heck is this if statement supposed to be testing for? Cases
like empty parameter declarations? This whole area looks rather
strange. I haven't worked out whether this has any visible effects,
but it seems like it would be a lot clearer to explicitly check for
the cases where TryParseDeclarationSpecifier returns false, but there
is in fact a valid parameter list (the only cases I can think of are
an empty parameter list and a list with only "...").

Otherwise, I think it looks fine; that said, someone who knows the
parser better should review it as well.

-Eli

Hmm... I see the following: "Note: To disambiguate, the whole
statement might have to be examined to determine if it is an
expression-statement or a declaration." I think this is pretty clear;
if there were some allowable shortcuts, they would be documented in
the standard.

Yes but the "might have..." is pretty... ambiguous :slight_smile:
One might say,
"by examining these parts of the statement I can conclude whether it's a declaration or expression"
another one might say
"these parts are not enough, I'll continue to examine the rest of the statement".
The standard says "statement disambiguated as a declaration may be an ill-formed declaration", so the first one can say "with these parts of the statement I concluded it's a declaration, if the next parts do not compile as a declaration then it's an ill-formed declaration".

If the standard said "the whole statement *should* be examined to determine.." it would be more clear. But this is very subtle and I may be wrong.

I agree with Eli. The standard is very clear here, and it sounds like these are bugs in GCC. Specifically, if a whole statement can be parsed as a declaration, it should be. If the whole statement can't, and it is legal as an expression, then it must be parsed as an expression. Rejecting a legal expression because it doesn't fit the declaration syntax would be a bug.

The issue is mostly how to deal with comma separated declarators/expressions:

int(x), ++x;

if we examine "++x" we will conclude that this is an expression. If we only examine "int(x)" we will regard this as declaration (note that the C++ standard does not have such an example).
GCC examines only the first declarator, while MSVC examines all of them.
Given that comma separated declarators are *much* more common than comma separated expressions, and that a function-style cast followed by a comma is mostly useless, I chose to go along with GCC and only examine the first declarator.

I agree that this isn't a common case, but it would be very preferable for clang to get it right. :slight_smile:

Another related but purely orthogonal issue is the efficiency concern. It is preferable to avoid backtracking in obvious cases where we know that something is a statement. For example, it is impossible for a declaration to start with the 'do' keyword, so if we see that, we can avoid the cost of starting/reverting backtracking.

-Chris

Hi Argiris,

You have obviously put a lot of time and thought into this code, and it shows in its cleanliness and attention to detail. I have a few specific comments about the patch, but I'd like to first talk about the approach you've chosen to make sure we're on the same page.

The problem is that we don't know whether something is a declaration or a statement without doing a significant amount of parsing to decide. If it is a declaration, we have to accept it as such; otherwise we should try parsing it as a statement. If neither works, it is an error.

At root, your approach basically boils down to a very simple approach: any time when we can't tell whether something is an obvious statement (an important optimization!), you fire off a special parser that looks ahead and sees if the tokens match the declaration grammar. This parser comes back with an answer, and based on that answer, the normal parser goes off and parses with one approach or another.

This is conceptually an extremely clean approach, because it decouples disambiguation from actual parsing. However, this has two big downsides: 1) maintenance: it duplicates a lot of parsing rules, because now we have to be able to "tentatively parse" and "really parse" all the declaration stuff. 2) performance: in addition to the basic cost of backtracking this has to duplicate parsing, and some amount of sema (type lookups etc).

To me, these are pretty big drawbacks, and I want to make sure we're all on the same page and agree that this is the right approach before we go much farther.

Ignoring optimizations for common cases, the only other realistic solution that I know of is to make use of tentative parsing to actually *start parsing* the ambiguous cases as declarations, and then roll back and re-parse if an error occurs. The code basically looks like this:

   // Parse declaration tentatively.
   PP.StartTentativeParsing();
   Res = ParseDeclarationStatement();
   if (!Res.isError) {
      PP.Commit();
      return Res;
   }
   PP.Rollback();

   // Parse statement non-tentatively.
   return ParseStatement();

This approach produces some very interesting tradeoffs. The drawback of this approach is that any action made while parsing and sema'ing the declaration statement *must be able to be rolled back*. This means that warnings need to be buffered up (errors stop the tentative parse) and that Sema needs to be able to undo any changes it makes to its internal state (e.g. remove declared variables, delete initializer expressions that are parsed and built, etc). Also, if the ambiguous statement ends up being an expression, your approach would be superior in space and time.
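To illustrate the kind of state that would have to be undone, here is a rough sketch; every name in it is hypothetical, purely for illustration:

// Hypothetical bookkeeping for a tentative declaration parse.
struct TentativeState {
  std::vector<Decl*>      DeclsAdded; // variables declared tentatively
  std::vector<Diagnostic> Warnings;   // buffered; they can't be "unsaid"
};

void rollback(Sema &S, TentativeState &T) {
  for (unsigned i = 0, e = T.DeclsAdded.size(); i != e; ++i)
    S.RemoveFromScope(T.DeclsAdded[i]); // un-declare the variables
  T.Warnings.clear();                   // drop warnings from the dead parse
  // initializer expressions already built would also be freed here
}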

However, the advantage of this approach is that it is much faster than doing a pre-parse when things turn out to actually be a declaration. Since many statement cases get filtered out before reaching the tentative case, I think this is a win. Additionally, this does not require duplicating parsing code for a bunch of grammar productions, which is my primary objection to your approach.

Did you consider the tentative parsing approach? What do you think of it? Will you be able to reasonably expand your preparser to handle cases like this?:

int X = X;

I think that this sort of example requires doing some amount of sema in the disambiguation parser, though maybe the grammar is regular enough to let you just skip over initializers completely (right now, this doesn't matter for you because you stop lookahead at the =).

Once we agree on the overall approach, I'll hassle you about things like whether we should factor declspec processing out of the tentative part and try to find out if there are other important cases that can get filtered out as expressions early on (e.g. what about a statement that starts with an identifier that is known to be a variable/function, can that ever start a declaration?). :slight_smile:

Thoughts?

-Chris

This is conceptually an extremely clean approach, because it decouples
disambiguation from actual parsing. However, this has two big
downsides: 1) maintenance: it duplicates a lot of parsing rules,
because now we have to be able to "tentatively parse" and "really
parse" all the declaration stuff. 2) performance: in addition to the
basic cost of backtracking this has to duplicate parsing, and some
amount of sema (type lookups etc).

To me, these are pretty big drawbacks, and I want to make sure we're
all on the same page and agree that this is the right approach before
we go much farther.

(1) is definitely an issue... it's a balancing act. I won't try to
comment on it because I don't know the difficulty of the alternative.

For (2), the current code doesn't try to avoid activating
backtracking; we can easily avoid it for common cases depending on how
expensive it is to activate backtracking. The only cases which need
the statement vs. expression disambiguation are those starting with T(
and N::T(, where T is a type. We can catch all the cases which don't
start with a complicated type with only a single token of lookahead.

The current version is already quite optimized in terms of the
distance it looks ahead; it tries to cut out as soon as possible, and
most of the time it shouldn't end up looking very far ahead. This
approach also has the advantage that it doesn't need to completely
parse the code; it can take shortcuts in places where it doesn't
affect the statement vs. expression distinction.

Also, if the ambiguous
statement ends up being an expression, your approach would be superior
in space and time.

If we want to know what the balance is here, we'll really have to
benchmark; hopefully the cases where we have to lookahead more than a
few tokens are rare, in which case performance isn't so much of an
issue.

Did you consider the tentative parsing approach? What do you think of
it? Will you be able to reasonably expand your preparser to handle
cases like this?:

int X = X;

I think that this sort of example requires doing some amount of sema
in the disambiguation parser, though maybe the grammar is regular
enough to let you just skip over initializers completely (right now,
this doesn't matter for you because you stop lookahead at the =).

If an identifier changes from a type to an identifier in the middle of
a statement, and that affects disambiguation, the program is
ill-formed per the standard; we can just assume for disambiguation
purposes that anything that isn't a type is a valid variable. And for
initializers, I'm pretty sure the rules allow skipping forward to the
next comma/semicolon once we see an = sign; what follows the = sign is
just a regular assignment-expression, so it doesn't matter in terms of
disambiguation.
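To visualize it, skipping an initializer can be little more than bracket-matched token skipping. A sketch in the Tok/ConsumeToken style of the patch (the helper itself is hypothetical, and a real version would have to worry about things like commas inside template argument lists):

// Hypothetical helper: after seeing '=', skip the assignment-expression
// without parsing it, stopping at the next top-level ',' or ';'.
void SkipInitializer() {
  unsigned Depth = 0; // nesting of (), [], {}
  while (Tok.isNot(tok::eof)) {
    if (Tok.is(tok::l_paren) || Tok.is(tok::l_square) || Tok.is(tok::l_brace))
      ++Depth;
    else if (Tok.is(tok::r_paren) || Tok.is(tok::r_square) ||
             Tok.is(tok::r_brace)) {
      if (Depth == 0) return; // closes an enclosing bracket: stop here
      --Depth;
    } else if (Depth == 0 && (Tok.is(tok::comma) || Tok.is(tok::semi)))
      return;                 // next declarator, or end of the statement
    ConsumeToken();
  }
}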

-Eli

New patch attached, comments follow below:

Eli Friedman wrote:

I agree that part of the standard isn't very well written. That said,
for the cases that are supposed to parse as expressions, the ambiguity
resolution doesn't actually have any effect; the grammar isn't
ambiguous for those cases in the first place. The grammar is just
very annoying for the parser. :slight_smile:

Comeau seems to agree with my interpretation; it accepts all the
examples that you've given.
  
Chris Lattner wrote:

I agree with Eli. The standard is very clear here, and it sounds like these are bugs in GCC. Specifically, if a whole statement can be parsed as a declaration, it should be. If the whole statement can't, and it is legal as an expression, then it must be parsed as a statement. Rejecting a legal expression because it doesn't fit the declaration syntax would be a bug.

Ok, this sounds reasonable. I was leaning toward the more optimistic view that these are not GCC bugs but "design choices" :slight_smile:
In the new patch, disambiguation follows the Comeau way and all the declarators are checked. Note that "int(x) = {0}, ++x;" will be interpreted as an ill-formed expression.

Another related but purely orthogonal issue is the efficiency concern. It is preferable to avoid backtracking in obvious cases where we know that something is a statement. For example, it is impossible for a declaration to start with the 'do' keyword, so if we see that, we can avoid the cost of starting/reverting backtracking.

It's not that costly to start/revert (without consuming tokens in-between), but we can certainly avoid it, and the new patch does.

Eli Friedman wrote:

In any case, it's probably a good idea to emit a warning for code that
ends up falling into ambiguity resolution, i.e. anything that gets
past the call to TryParseDeclarationSpecifier from
TryParseSimpleDeclaration in your patch. It's very likely that such
code doesn't do what the author intends, and it's usually very easy to
make the intention explicit (either by removing parentheses around the
variable declaration, or putting parentheses around the function-style
cast).
  
The warning seems like a good idea, I've added one and it works like this:

t.cpp:7:3: warning: statement was disambiguated as declaration
  int(x);
  ^~~~~~~

Just to reduce these warnings a bit, I don't emit the warning when the disambiguated declaration starts with 'void', which is almost certainly an obvious function declaration.
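For example, a function pointer declaration like the following starts with 'void' followed by '(', so it reaches the disambiguation code, but the declaration reading is the only sensible one:

void g() {
  void (*fp)(int); // 'void' followed by '(' triggers disambiguation,
                   // but no warning is emitted for the 'void' case
}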

Comments on the patch follow:

+ while (Tok.is(tok::kw_const) || Tok.is(tok::kw_volatile))
+ ConsumeToken();

Doesn't the presence of const or volatile prove that this is a
declaration? I don't think this actually affects the semantics, but
it seems cleaner to exit once we can prove something.
  
Could be that they are misplaced in an expression statement. GCC, MSVC, and Comeau agree:

void f() {
  int x;
  int (* const x), ++x;
}

"ComeauTest.c", line 3: error: expected an expression
    int (* const x), ++x;
           ^

Also, shouldn't we be handling restrict here as well? It's not
standard C++, but I think people use it.
  
Done.

+ if (Tok.is(tok::identifier)) {
+ // declarator-id
+ ConsumeToken();

Don't you need some sort of check to see whether this is a type?
  
It doesn't matter whether it is a type or not. If it is a valid declaration, it shadows the type-name. If the type is in the same scope it may erroneously redeclare it but disambiguation doesn't take scope information into account.

Consider something like the following:
class T {public: T(int a) {} T() {} int operator*() {return 0;}};
void a() {T(*T());} // Must be expression
  
This is actually a function declaration, the same as "T* T();". GCC agrees:

class T {public: T(int a) {} T() {} int operator*() {return 0;}};
void a() {
  T(*T());
  int T;
}

t.cpp: In function 'void a()':
t.cpp:4: error: 'int T' redeclared as different kind of symbol
t.cpp:3: error: previous declaration of 'T* T()'

So is Comeau:

"ComeauTest.c", line 4: error: "T" has already been declared in the current scope
    int T;
        ^

<sarcasm> I love that we are debating whether something is a declaration or an expression, the C++ grammar is so beautiful! </sarcasm>

+ case tok::kw_const:
+ case tok::kw_volatile:

Again, handle restrict?
  
Done.

+Parser::TentativeParsingResult
+Parser::TryParseParameterDeclarationClause(SourceLocation LParenLoc) {
+ TentativeParsingResult TPR = TryParseDeclarationSpecifier();
+
+ if (TPR != TPR_false) {

What the heck is this if statement supposed to be testing for? Cases
like empty parameter declarations? This whole area looks rather
strange. I haven't worked out whether this has any visible effects,
but it seems like it would be a lot clearer to explicitly check for
the cases where TryParseDeclarationSpecifier returns false, but there
is in fact a valid parameter list (the only cases I can think of are
an empty parameter list and a list with only "...").
  
I've improved the readability of this function.

Otherwise, I think it looks fine; that said, someone who knows the
parser better should review it as well.
  
Thank you for your feedback!

-Argiris

tentative-parse-2.patch (29.9 KB)

Chris Lattner wrote:

Hi Argiris,

You have obviously put a lot of time and thought into this code, and it shows in its cleanliness and attention to detail. I have a few specific comments about the patch, but I'd like to first talk about the approach you've chosen to make sure we're on the same page.

The problem is that we don't know whether something is a declaration or a statement without doing a significant amount of parsing to decide. If it is a declaration, we have to accept it as such; otherwise we should try parsing it as a statement. If neither works, it is an error.

At root, your approach basically boils down to a very simple approach: any time when we can't tell whether something is an obvious statement (an important optimization!), you fire off a special parser that looks ahead and sees if the tokens match the declaration grammar. This parser comes back with an answer, and based on that answer, the normal parser goes off and parses with one approach or another.

This is conceptually an extremely clean approach, because it decouples disambiguation from actual parsing. However, this has two big downsides: 1) maintenance: it duplicates a lot of parsing rules, because now we have to be able to "tentatively parse" and "really parse" all the declaration stuff. 2) performance: in addition to the basic cost of backtracking this has to duplicate parsing, and some amount of sema (type lookups etc).

To me, these are pretty big drawbacks, and I want to make sure we're all on the same page and agree that this is the right approach before we go much farther.

About maintenance:
I'd say that having separate "disambiguation parsing" offers flexibility. For example, if we decide that disambiguating comma-separated declarators as comma-separated expressions is almost always not what the programmer intended, and that it leads to less confusing errors to "shortcut" to a declaration by the first declarator, we can do that easily by changing a couple of lines. We can have fine-grained control over disambiguation.
The declaration parsing rules for this 'special' parser are straightforward and restricted to one file, so it doesn't have a huge maintenance cost.

Ignoring optimizations for common cases, the only other realistic solution that I know of is to make use of tentative parsing to actually *start parsing* the ambiguous cases as declarations, and then roll back and re-parse if an error occurs. The code basically looks like this:

  // Parse declaration tentatively.
  PP.StartTentativeParsing();
  Res = ParseDeclarationStatement();
  if (!Res.isError) {
     PP.Commit();
     return Res;
  }
  PP.Rollback();

  // Parse statement non-tentatively.
  return ParseStatement();

This approach produces some very interesting tradeoffs. The drawback of this approach is that any action made while parsing and sema'ing the declaration statement *must be able to be rolled back*. This means that warnings need to be buffered up (errors stop the tentative parse) and that Sema needs to be able to undo any changes it makes to its internal state (e.g. remove declared variables, delete initializer expressions that are parsed and built, etc). Also, if the ambiguous statement ends up being an expression, your approach would be superior in space and time.

However, the advantage of this approach is that it is much faster than doing a pre-parse when things turn out to actually be a declaration. Since many statement cases get filtered out before reaching the tentative case, I think this is a win. Additionally, this does not require duplicating parsing code for a bunch of grammar productions, which is my primary objection to your approach.

With that approach, we will make big architectural changes just to deal with a very uncommon situation. I really don't think the tradeoffs are worth it.
I don't have benchmarks, but I don't think there's such a big performance cost with the 'special parser' approach:

--The situation where disambiguation is required is uncommon.
--We can do shortcuts, like disambiguating to a declaration by the first declarator (GCC style), or when the first declarator has a '=' initializer.
--Lexer does lexing *only once*. The tokens are cached.
--The "disambiguation parser" is a tight "state-machine", no fancy stuff except type checking.
--The only architectural change that I propose is caching type-checks which:
       1) would also be needed for a "normal parser with roll-back"
       2) is useful in general; even in C there are multiple type-checks (i.e. isDeclarationSpecifier calls)

A "normal parser with roll-back" approach would induce it's own time/space overheads but my main concern is that it would induce many-many changes and maintenance headaches for Parser+Sema, just to accommodate a weird part of the C++ grammar.

In general, the "special disambiguation parser" is a very unintrusive approach. If later on, using benchmarks, we determine that it has unacceptable performance cost, we can easily replace it with a more sophisticated "normal parser with rollback" approach.

Did you consider the tentative parsing approach? What do you think of it? Will you be able to reasonably expand your preparser to handle cases like this?:

int X = X;

I think that this sort of example requires doing some amount of sema in the disambiguation parser, though maybe the grammar is regular enough to let you just skip over initializers completely (right now, this doesn't matter for you because you stop lookahead at the =).

As Eli mentioned (and as the standard states), only type-checking is required for disambiguation. (Also, in the new patch I skip over initializers completely; I agree with Eli here too.)

Once we agree on the overall approach, I'll hassle you about things like whether we should factor declspec processing out of the tentative part and try to find out if there are other important cases that can get filtered out as expressions early on (e.g. what about a statement that starts with an identifier that is known to be a variable/function, can that ever start a declaration?). :slight_smile:

"can that ever start a declaration?": No, I don't think so.

-Argiris

+ if (Tok.is(tok::identifier)) {
+ // declarator-id
+ ConsumeToken();

Don't you need some sort of check to see whether this is a type?

It doesn't matter whether it is a type or not. If it is a valid declaration,
it shadows the type-name. If the type is in the same scope it may
erroneously redeclare it but disambiguation doesn't take scope information
into account.

<sarcasm> I love that we are debating whether something is a declaration or
an expression, the C++ grammar is so beautiful! </sarcasm>

Ah, oops. Horrible stuff.

+Parser::TentativeParsingResult
+Parser::TryParseParameterDeclarationClause(SourceLocation LParenLoc) {
+ TentativeParsingResult TPR = TryParseDeclarationSpecifier();
+
+ if (TPR != TPR_false) {

What the heck is this if statement supposed to be testing for? Cases
like empty parameter declarations? This whole area looks rather
strange. I haven't worked out whether this has any visible effects,
but it seems like it would be a lot clearer to explicitly check for
the cases where TryParseDeclarationSpecifier returns false, but there
is in fact a valid parameter list (the only cases I can think of are
an empty parameter list and a list with only "...").

I've improved the readability of this function.

Definitely more readable; although, how is it supposed to deal with an
empty parameter declaration clause?

-Eli

This is conceptually an extremely clean approach, because it decouples
disambiguation from actual parsing. However, this has two big
downsides: 1) maintenance: it duplicates a lot of parsing rules,
because now we have to be able to "tentatively parse" and "really
parse" all the declaration stuff. 2) performance: in addition to the
basic cost of backtracking this has to duplicate parsing, and some
amount of sema (type lookups etc).

To me, these are pretty big drawbacks, and I want to make sure we're
all on the same page and agree that this is the right approach before
we go much farther.

(1) is definitely an issue... it's a balancing act. I won't try to
comment on it because I don't know the difficulty of the alternative.

Right, this whole area is a balancing act... there isn't really a clear "right" way to go, which is why discussing it is useful :slight_smile:

For (2), the current code doesn't try to avoid activating
backtracking; we can easily avoid it for common cases depending on how
expensive it is to activate backtracking.

Activating backtracking isn't hugely expensive, but it certainly isn't free. Also, scanning ahead and buffering tokens certainly isn't free, so we should avoid it when possible. One of the costs is the Sema/Type resolution that has to happen in the preparser. For example, silly things like:

   std::vector<bool>::iterator ...

require the preparser to do template specialization etc to look up what "iterator" in vector<bool> is. Immediately after the preparser decides that it is a type, the decl parser kicks in and has to do all the same resolution stuff.

The only cases which need
the statement vs. expression disambiguation are those starting with T(
and N::T(, where T is a type. We can catch all the cases which don't
start with a complicated type with only a single token of lookahead.

Right.

The current version is already quite optimized in terms of the
distance it looks ahead; it tries to cut out as soon as possible, and
most of the time it shouldn't end up looking very far ahead.

Yep, Argiris did a nice job with his implementation of the approach.

If an identifier changes from a type to an identifier in the middle of
a statement, and that affects disambiguation, the program is
ill-formed per the standard; we can just assume for disambiguation
purposes that anything that isn't a type is a valid variable. And for
initializers, I'm pretty sure the rules allow skipping forward to the
next comma/semicolon once we see an = sign; what follows the = sign is
just a regular assignment-expression, so it doesn't matter in terms of
disambiguation.

Ok, great! Who knew that there would be *some* sanity in C++? :slight_smile:

Also, if the ambiguous
statement ends up being an expression, your approach would be superior
in space and time.

If we want to know what the balance is here, we'll really have to
benchmark; hopefully the cases where we have to lookahead more than a
few tokens are rare, in which case performance isn't so much of an
issue.

It would be very interesting to hack the GCC front-end and run it over some code base to find out how much it ends up speculating wrong.

After thinking about this some more and sleeping on it, I think that GCC's approach (tentatively parsing something as a decl, then falling back to expr on error) is really the best approach. The high-level reason for this is that I think that almost all of the ambiguous cases will end up being declarations in practice, so doing a "lookahead" preparse is wasted effort compared to just parsing as a decl and undoing if it really is a statement.

The basic reason is that I think almost all of the common expression cases can be easily disambiguated without any[1] lookahead. For example, here are some silly but important cases:

   if (...) // decl can't start with 'if' keyword
   X = 4; // X isn't a typename
   foo(4); // foo isn't a typename
   I->foo(); // I isn't a typename
   cout << .. // cout isn't a typename

For fun, I just scanned through the source for instcombine, looking at the top level statements. Almost all of them would be rejected as "not decls" very trivially. This gives me good confidence that backtracking won't even enter the picture for most expressions and statements. However, every variable definition (obviously) starts with a type:

   int X = ...
   Value *V = ..

This means that every variable definition will require starting backtrack bookkeeping, doing a preparse (even if not very far forward), then deciding "yep, it's a decl", backtracking, and then reparsing as a decl. This seems like a pretty significant cost to pay for these common cases, and I think the "speculatively parse as a decl if ambiguous and back off later" approach is better.

There is one more thing I want to bring (back) up, w.r.t. "[1]" above. The problem is that there is a significant class of things where we need more than one token of lookahead to resolve and these are commonly exprs. This is the qualified name case:

   std::cout << ... // is an expr
   UndefValue::get(Ty) // static method is an expr

which is very ambiguous with the decl case:

   std::vector<int> Y
   BasicBlock::iterator X = ...
   ICmpInst::Predicate NewPred =

The LLVM codebase as a whole doesn't use *too* much namespace qualification, because everything is in the llvm namespace and we don't use std::cout at all. However, a lot of codebases use namespaces heavily. With the "speculate it is a decl" approach, we would handle the decl cases well (which are the majority of the cases for the llvm codebase) but we lose (compared to the 'preparse to disambiguate' approach) for the qualified expr case.

I think the solution to this (which I know Argiris doesn't like :slight_smile:) is to resolve the qualified type *before* making a decision about whether it is a statement or a decl. This makes the logic a little more complex (we need to bring back the "parse expr with leading identifier" logic) but it would very neatly solve this problem in just about every real case, and it would mean that we aren't re-resolving and sema-ing types in any of the common cases here.

This is clearly something that we can discuss more and do as a later separate step, but I just wanted to throw it out there for kicks. :slight_smile:

Argiris, what do you think of the "tentatively parse as a decl" approach?

-Chris

This is the approach that GCC's parser uses (as of GCC 3.4), and it
works reasonably well for them. A review of GCC's parser could give us
some sense of how often tentative parsing is needed, although my
impression is that they tentatively parse more often than is
absolutely necessary. The problem is that a lot of C++'s tricky
parsing rules (including lots of template parsing rules) run into the
declaration/expression ambiguity, so there is likely to be a
significant amount of duplication in the special parser. And the
declaration/expression ambiguity isn't the only place where
non-trivial disambiguation is required in C++.

Performance-wise, tentative parsing can be improved by allowing it to
annotate the token stream with new "already parsed as a..." tokens as
it goes through. Good candidates for this kind of annotation are
qualified names (which Chris mentioned) and template-ids (GCC does the
latter).

Having worked with it a bit (and run into a lot of places where
tentative parsing was a natural approach), I'm a big fan of tentative
parsing for ambiguity resolution.

  - Doug

To me, these are pretty big drawbacks, and I want to make sure we're all on the same page and agree that this is the right approach before we go much farther.

About maintenance:
I'd say that having separate "disambiguation parsing" offers flexibility. For example, if we decide that disambiguating comma-separated declarators as comma-separated expressions is almost always not what the programmer intended, and that it leads to less confusing errors to "shortcut" to a declaration by the first declarator, we can do that easily by changing a couple of lines. We can have fine-grained control over disambiguation.

I think there are two separate future costs, which are important to consider. Once all of C++ is implemented (!) we get to maintenance cost, which is the biggest piece of the "software cost" puzzle. For example, one cost is paid when you want to make an extension to the grammar (e.g. C++0x features). I think that having two parsers running around is actually worse for this sort of thing, because it means understanding and maintaining two separate parsers, and making sure they agree on everything.

The decision of how to handle comma isn't really in the same category, because it is a bug, not an extension point for future change.

OTOH, I agree with you that the lookahead parser is very simple and the cost of understanding it and keeping it in sync with the rest of the parser is probably not too bad. We still have the efficiency issues though, see the other email :slight_smile:

However, the advantage of this approach is that it is much faster than doing a pre-parse when things turn out to actually be a declaration. Since many statement cases get filtered out before reaching the tentative case, I think this is a win. Additionally, this does not require duplicating parsing code for a bunch of grammar productions, which is my primary objection to your approach.

With that approach, we will make big architectural changes just to deal with a very uncommon situation. I really don't think the tradeoffs are worth it.

Which architectural changes?

I don't have benchmarks, but I don't think there's such a big performance cost with the 'special parser' approach:

--The situation where disambiguation is required is uncommon.

If that's true, then it doesn't matter which solution we pick :slight_smile:

--We can do shortcuts, like disambiguating to a declaration by the first declarator (GCC style), or when the first declarator has a '=' initializer.

You can do the same thing with tentative parsing if desired.

--Lexer does lexing *only once*. The tokens are cached.

Right.

--The "disambiguation parser" is a tight "state-machine", no fancy stuff except type checking.

Except that type checking pulls in a ton of sema, e.g. to handle std::vector<bool>::iterator.

--The only architectural change that I propose is caching type-checks which:
     1) would also be needed for a "normal parser with roll-back"
     2) is useful in general; even in C there are multiple type-checks (i.e. isDeclarationSpecifier calls)

I don't have a mental model for what this would entail, so I'm not sure if it would be clean or not.

In general, the "special disambiguation parser" is a very unintrusive approach. If later on, using benchmarks, we determine that it has unacceptable performance cost, we can easily replace it with a more sophisticated "normal parser with rollback" approach.

I agree, and the best part of the disambiguation parser is that it is completely separated from the rest of the parser. This means it doesn't make the rest of the parser incrementally more complicated. OTOH, I strongly believe that the time to make this decision is now, before we have template instantiation and other complex stuff. Changing approaches will be much harder later.

-Chris

Eli Friedman wrote:

Definitely more readable; although, how is it supposed to deal with an
empty parameter declaration clause?
  
Oops, it got too readable for its own good :slight_smile:
Thanks for the catch!

-Argiris

Chris Lattner wrote:

Activating backtracking isn't hugely expensive, but it certainly isn't free. Also, scanning ahead and buffering tokens certainly isn't free, so we should avoid it when possible. One of the costs is the Sema/Type resolution that has to happen in the preparser. For example, silly things like:

  std::vector<bool>::iterator ...

require the preparser to do template specialization etc to look up what "iterator" in vector<bool> is. Immediately after the preparser decides that it is a type, the decl parser kicks in and has to do all the same resolution stuff.

The most efficient approach is to do template specialization and other resolution stuff once, right? Even with the "tentatively parse as a decl" approach you'd prefer not to repeat resolutions when going for "expression parsing".
How about caching the resolution stuff? The way it would work is: before going to Sema to do type resolutions, the parser would check whether the type resolution was already made by previous calls to Sema (instead of calling Action.isTypeName directly, there would be a Parser::IsTypename method to take care of this).
The preparser would do qualified type checks and template specializations, but the normal parser would not have to re-do them.
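A rough sketch of what this might look like (Parser::IsTypename is the method name proposed above; the cache member and the exact Action signature are my assumptions):

// Hypothetical caching wrapper; a real version would also have to
// account for scope changes invalidating cached answers.
bool Parser::IsTypename(const IdentifierInfo &II, Scope *S) {
  llvm::DenseMap<const IdentifierInfo*, bool>::iterator
    Cached = TypeNameCache.find(&II); // assumed Parser member
  if (Cached != TypeNameCache.end())
    return Cached->second;            // reuse the preparser's answer
  bool IsType = Actions.isTypeName(II, S) != 0; // resolve via Sema once
  TypeNameCache[&II] = IsType;
  return IsType;
}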

After thinking about this some more and sleeping on it, I think that GCC's approach (tentatively parsing something as a decl, then falling back to expr on error) is really the best approach. The high-level reason for this is that I think that almost all of the ambiguous cases will end up being declarations in practice, so doing a "lookahead" preparse is wasted effort compared to just parsing as a decl and undoing if it really is a statement.

The basic reason is that I think almost all of the common expression cases can be easily disambiguated without any[1] lookahead. For example, here are some silly but important cases:

  if (...) // decl can't start with 'if' keyword
  X = 4; // X isn't a typename
  foo(4); // foo isn't a typename
  I->foo(); // I isn't a typename
  cout << .. // cout isn't a typename

For fun, I just scanned through the source for instcombine, looking at the top level statements. Almost all of them would be rejected as "not decls" very trivially. This gives me good confidence that backtracking won't even enter the picture for most expressions and statements. However, every variable definition (obviously) starts with a type:

  int X = ...
  Value *V = ..

This means that every variable definition will require starting backtrack bookkeeping, doing a preparse (even if not very far forward), then deciding "yep, it's a decl", backtracking, and then reparsing as a decl. This seems like a pretty significant cost to pay for these common cases, and I think the "speculatively parse as a decl if ambiguous and back off later" approach is better.

I think that you have a misunderstanding about what the ambiguous cases are. The ambiguous cases are those where a type is followed by a '('. These cases:
  int X = ...
  Value *V = ..

and the vast majority of declarations are not ambiguous at all, so no preparsing, backtracking etc. is needed for them, just a one-token lookahead.
int X = // not ambiguous
int (X) = // ambiguous
const int (X) // not ambiguous because it starts with 'const'

So your assumption that "almost all of the ambiguous cases will end up being declarations" is not true; in fact, if I had to guess, I'd say that the balance would probably lean towards the expression side.
Most of the declarations that are of the T(..) variety are, in practice, function pointer declarations, and these are not that common inside functions.

There is one more thing I want to bring (back) up, w.r.t. "[1]" above. The problem is that there is a significant class of things where we need more than one token of lookahead to resolve and these are commonly exprs. This is the qualified name case:

  std::cout << ... // is an expr
  UndefValue::get(Ty) // static method is an expr

which is very ambiguous with the decl case:

  std::vector<int> Y
  BasicBlock::iterator X = ...
  ICmpInst::Predicate NewPred =

The LLVM codebase as a whole doesn't use *too* much namespace qualification, because everything is in the llvm namespace and we don't use std::cout at all. However, a lot of codebases use namespaces heavily. With the "speculate it is a decl" approach, we would handle the decl cases well (which are the majority of the cases for the llvm codebase) but we lose (compared to the 'preparse to disambiguate' approach) for the qualified expr case.

I think the solution to this (which I know Argiris doesn't like :slight_smile:) is to resolve the qualified type *before* making a decision about whether it is a statement or a decl. This makes the logic a little more complex (we need to bring back the "parse expr with leading identifier" logic) but it would very neatly solve this problem in just about every real case, and it would mean that we aren't re-resolving and sema-ing types in any of the common cases here.

The "having the parser cache sema resolutions" solves this.

This is clearly something that we can discuss more and do as a later separate step, but I just wanted to throw it out there for kicks. :slight_smile:

Argiris, what do you think of the "tentatively parse as a decl" approach?

As previously said, the preparser approach is non-intrusive and easily replaceable. We can go with a "tentatively parse as a decl" approach and muddle up the Parser+Sema for this C++ corner case (at the expense of the other languages), but we wouldn't know if it was really worth it. If we had the preparser approach in place, we would be able to compare against it and see whether there is a clear performance benefit or not. Or we could run against actual codebases and see what the ambiguous statements resolve to.
How would we feel if we went through the trouble of a "Parse+Sema rollback version" and saw that most of the ambiguous statements resolve to expressions? :slight_smile:

-Argiris

Chris Lattner wrote:

Activating backtracking isn’t hugely expensive, but it certainly isn’t free. Also, scanning ahead and buffering tokens certainly isn’t free, so we should avoid it when possible. One of the costs is the Sema/Type resolution that has to happen in the preparser. For example, silly things like:

std::vector<bool>::iterator …

require the preparser to do template specialization etc to look up what “iterator” in vector<bool> is. Immediately after the preparser decides that it is a type, the decl parser kicks in and has to do all the same resolution stuff.

The most efficient approach is to do template specialization and other resolution stuff once, right?

Right, of course.

Even with the “tentatively parse as a decl” approach you’d prefer not to repeat resolutions when going for “expression parsing”.

If we only do tentative parsing when the result is likely to end up being a decl, and if it does end up being a decl, there is no reanalysis. The trick is to not tentatively parse when it is likely to be an expr.

This means that every variable definition will require starting backtrack bookkeeping, doing a preparse (even if not very far forward), then deciding “yep, it’s a decl”, backtracking, and then reparsing as a decl. This seems like a pretty significant cost to pay for these common cases, and I think the “speculatively parse as a decl if ambiguous and back off later” approach is better.

I think that you have a misunderstanding about what the ambiguous cases are.

I’m sure I do :). The part of your patch that freaks me out is this:

   default: {
     bool isDeclaration = false;
     // If we have an identifier at the top level statement...
     if (!OnlyStatement) {
       TentativeParsingResult TPR = isDeclarationStatement();

   TentativeParsingResult isDeclarationStatement() {
     if (getLang().CPlusPlus)
       return isCXXDeclarationStatement();
     return isDeclarationSpecifier() ? TPR_true : TPR_false;
   }

This means that (if I understand correctly) your current patch uses the preparser for tons of cases, including the “x = 4” and “func(4)” cases. It certainly isn’t reading types and qualified identifiers before deciding.

The ambiguous cases are those where a type is followed by a ‘(’. These cases:
int X = …
Value *V = …

and the vast majority of declarations are not ambiguous at all, so no preparsing, backtracking etc. is needed for them, just a one-token lookahead.
int X = // not ambiguous
int (X) = // ambiguous
const int (X) // not ambiguous because it starts with ‘const’

Ok.

So your assumption that “almost all of the ambiguous cases will end up being declarations” is not true; in fact, if I had to guess, I’d say that the balance would probably lean towards the expression side.
Most of the declarations that are of the T(…) variety are, in practice, function pointer declarations, and these are not that common inside functions.

Ok, so my objection is really to the current implementation, not the approach :slight_smile:

I think the solution to this (which I know Argiris doesn’t like :slight_smile:) is to resolve the qualified type *before* making a decision about whether it is a statement or a decl. This makes the logic a little more complex (we need to bring back the “parse expr with leading identifier” logic) but it would very neatly solve this problem in just about every real case, and it would mean that we aren’t re-resolving and sema-ing types in any of the common cases here.

The “having the parser cache sema resolutions” solves this.

If the plan of record is to eat the type in the unconditional part of the parser, then this is a non-issue.

Argiris, what do you think of the “tentatively parse as a decl” approach?

As previously said, the preparser approach is non-intrusive and easily replaceable. We can go with a “tentatively parse as a decl” approach and muddle up the Parser+Sema for this C++ corner case (at the expense of the other languages), but we wouldn’t know if it was really worth it. If we had the preparser approach in place, we would be able to compare against it and see whether there is a clear performance benefit or not. Or we could run against actual codebases and see what the ambiguous statements resolve to.

I agree that going with the preparser approach is preferable… particularly if it doesn’t kick in very often. Are you really agreeing to the “parse and sema qualified names and types before deciding whether it is a decl or expr” approach?

If so, I think we’re on the same page and the preparser idea works for me, if not, I’m really confused (again) :slight_smile:

-Chris

Doug Gregor wrote:

Performance-wise, tentative parsing can be improved by allowing it to
annotate the token stream with new "already parsed as a..." tokens as
it goes through. Good candidates for this kind of annotation are
qualified names (which Chris mentioned) and template-ids (GCC does the
latter).
  
I think this is a good idea that can be incorporated in Clang. It wouldn't only improve the performance of tentative parsing, but would also provide a way to cleanly parse qualified names.

Parsing qualified names ("A::b::x" - let's say 'scope qualifiers' for the "A::b::" part) without backtracking is tricky.
Chris suggested doing something like ParseWithLeadingScopeQualifier... functions. The problem with this is that it creates a kind of "fork" where parsing a bit of the grammar is duplicated.
This wasn't a big deal with ParseWithLeadingIdentifier for disambiguating labels/identifiers, but it's a much bigger problem with scope qualifiers because these appear in a *lot* of places:

-They are in parsing declaration specifiers
-in declarators
-in id-expressions
-in type-specifiers
-in declarations, expressions, function declarations, initializers, typeof, sizeof, etc..
...

Anyway, you get the picture. This means there is a lot of "forking", which will lead to a lot of duplicated grammar parsing.

Here are a couple of examples of how annotation tokens can be used:
-ParseDeclarationSpecifiers may start parsing "A::b::", and if "A::b::x" turns out not to be a type, it may push a "scope qualifier annotation token" so that ParseDirectDeclarator can act upon it.
-The "disambiguation" pre-parser may use annotation tokens so that this token stream

A::T<int> (A::x);

would turn to this when the normal parser starts parsing the declaration:

'type token' '(' 'scope token' 'x' ')' ';'

The annotation tokens would hold a pointer value, TypeTy* for the "type token" and CXXScopeTy* for the "scope token".
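As a sketch of the mechanics (the annot_typename kind and the Token setter names are assumptions about a new API, not existing interfaces):

// Hypothetical: after resolving "A::T<int>" once, splice a single
// annotation token carrying the result back into the token stream.
Token AnnotTok;
AnnotTok.startToken();
AnnotTok.setKind(tok::annot_typename);     // assumed new token kind
AnnotTok.setLocation(TypeStartLoc);        // location of the 'A' token
AnnotTok.setAnnotationValue(ResolvedType); // the TypeTy* resolved by Sema
PP.EnterToken(AnnotTok);                   // the normal parser sees it next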

Any thoughts?

-Argiris

Hi Argiris,

Doug Gregor wrote:

Performance-wise, tentative parsing can be improved by allowing it to
annotate the token stream with new "already parsed as a..." tokens as
it goes through. Good candidates for this kind of annotation are
qualified names (which Chris mentioned) and template-ids (GCC does the
latter).

I think this is a good idea that can be incorporated in Clang. It wouldn't
only improve the performance of tentative parsing, but would also provide a
way to cleanly parse qualified names.

It should allow re-use of the qualified-name parser inside the
pre-parser and the main parser, without requiring two passes over the
tokens that make up the qualified names.

Parsing qualified names ("A::b::x" - let's say 'scope qualifiers' for the
"A::b::" part) without backtracking is tricky.
Chris suggested doing something like ParseWithLeadingScopeQualifier...
functions. The problem with this is that it creates a kind of "fork" where
parsing a bit of the grammar is duplicated.
This wasn't a big deal with the ParseWithLeadingIdentifier for
disambiguating labels/identifiers, but it's a much bigger problem with scope
qualifiers because these appear in a *lot* of places:

-They are in parsing declaration specifiers
-in declarators
-in id-expressions
-in type-specifiers
-in declarations, expressions, function declarations, initializers, typeof,
sizeof, etc..
...

Anyway, you get the picture. This means there is a lot of "forking", which
will lead to a lot of duplicated grammar parsing.

Yes.

Here are a couple of examples of how annotation tokens can be used:
-ParseDeclarationSpecifiers, may start parsing "A::b::" and if "A::b::x"
turns out not to be a type, it may push a "scope qualifier annotation token"
so that ParseDirectDeclarator can act upon it.
-The "disambiguation" pre-parser may use annotation tokens so that this
token stream

A::T<int> (A::x);

would turn to this when the normal parser starts parsing the declaration:

'type token' '(' 'scope token' 'x' ')' ';'

The annotation tokens would hold a pointer value, TypeTy* for the "type
token" and CXXScopeTy* for the "scope token".

Any thoughts?

I think this is an important step for unifying the common bits of the
pre-parser and main parser, and should improve parsing performance.
The same technique would eventually be used for template-ids, and
could be used in other cases where avoiding additional parsing could
pay off (long decl-specifier-seqs? that might be taking the idea too
far...).

  - Doug

To be honest, I'm not sure exactly how this will shape up, and I have a hard time understanding how it will fit into the rest of the parser. I'm slightly concerned that this is an abuse of the Token abstraction (which represents things coming out of the lexer), and we don't want to make Token any larger. Perhaps a new magic token kind, where the identifier pointer is a pointer to a "DeclTy" or something like that, would be OK.

If you can make it clean, go for it. :slight_smile:

-Chris