This is conceptually an extremely clean approach, because it decouples
disambiguation from actual parsing. However, this has two big
downsides: 1) maintenance: it duplicates a lot of parsing rules,
because now we have to be able to "tentatively parse" and "really
parse" all the declaration stuff. 2) performance: in addition to the
basic cost of backtracking this has to duplicate parsing, and some
amount of sema (type lookups etc).
To me, these are pretty big drawbacks, and I want to make sure we're
all on the same page and agree that this is the right approach before
we go much farther.
(1) is definitely an issue... it's a balancing act. I won't try to
comment on it because I don't know the difficulty of the alternative.
Right, this whole area is a balancing act... there isn't really a clear "right" way to go, which is why discussing it is useful
For (2), the current code doesn't try to avoid activating
backtracking; we can easily avoid it for common cases depending on how
expensive it is to activate backtracking.
Activating backtracking isn't hugely expensive, but it certainly isn't free. Also, scanning ahead and buffering tokens certainly isn't free, so we should avoid it when possible. One of the costs is the Sema/Type resolution that has to happen in the preparser. For example, silly things like:
require the preparser to do template specialization etc to look up what "iterator" in vector<bool> is. Immediately after the preparser decides that it is a type, the decl parser kicks in and has to do all the same resolution stuff.
The only cases which need
the statement vs. expression disambiguation are those starting with T(
and N::T(, where T is a type. We can catch all the cases which don't
start with a complicated type with only a single token of lookahead.
The current version is already quite optimized in terms of the
distance it looks ahead; it tries to cut out as soon as possible, and
most of the time it shouldn't end up looking very far ahead.
Yep, Argiris did a nice job with his implementation of the approach.
If an identifier changes from a type to an identifier in the middle of
a statement, and that affects disambiguation, the program is
ill-formed per the standard; we can just assume for disambiguation
purposes that anything that isn't a type is a valid variable. And for
initializers, I'm pretty sure the rules allow skipping forward to the
next comma/semicolon once we see an = sign; what follows the = sign is
just a regular assignment-expression, so it doesn't matter in terms of
Ok, great! Who knew that there would be *some* sanity in C++?
Also, if the ambiguous
statement ends up being an expression, your approach would be superior
in space and time.
If we want to know what the balance is here, we'll really have to
benchmark; hopefully the cases where we have to lookahead more than a
few tokens are rare, in which case performance isn't so much of an
It would be very interesting to hack the GCC front-end and run it over some code base to find out how much it ends up speculating wrong.
After thinking about this some more and sleeping on it, I think that GCC's approach (tentatively parsing something as a decl, then falling back to expr on error) is really the best approach. The high level reason for this is that I think that almost all of the ambiguous cases will end up being declarations in practice, so doing a "lookahead" prepare is wasted effort compared to just parsing as a decl and undoing if it really is a statement.
The basic reason I think this is that I think almost all of the common expression cases can be easily disambiguated without any lookahead. For example, here are some silly but important cases:
if (...) // decl can't start with 'if' keyword
X = 4; // X isn't a typename
foo(4); // foo isn't a typename
I->foo(); // I isn't a typename
cout << .. // cout isn't a typename
For fun, I just scanned through the source for instcombine, looking at the top level statements. Almost all of them would be rejected as "not decls" very trivially. This gives me good confidence that backtracking won't even enter the picture for most expressions and statements. However, every variable definition (obviously) starts with a type:
int X = ...
Value *V = ..
This means that every variable definition will require starting backtrack bookkeeping, doing a preparse (even if not very far forward) then then deciding "yep, it's a decl", backtracking, and then reparsing as a decl. This seems like a pretty significant cost to pay for these common cases, and I think the "speculatively parse as a decl if ambiguous and back off later" approach is better.
There is one more thing I want to bring (back) up, w.r.t. "" above. The problem is that there is a significant class of things where we need more than one token of lookahead to resolve and these are commonly exprs. This is the qualified name case:
std::cout << ... // is an expr
UndefValue::get(Ty) // static method is an expr
which is very ambiguous with the decl case:
BasicBlock::iterator X = ...
ICmpInst::Predicate NewPred =
The LLVM codebase as a whole doesn't use *too* much namespace qualification, because everything is in the llvm namespace and we don't use std::cout at all. However, a lot of codebases use namespaces heavily. With the "speculate it is a decl" approach, we would handle the decl cases well (which are the majority of the cases for the llvm codebase) but we lose (compared to the 'preparse to disambiguate' approach) for the qualified expr case.
I think the solution to this (which I know Argiris doesn't like is to resolve the qualified type *before* making a decision about whether it is a statement or a decl. This makes the logic a little more complex (we need to bring back the "parse expr with leading identifier" logic) but it would very neatly solve this problem in just about every real case, and it would mean that we aren't re-resolving and sema-ing types in any of the common cases here.
This is clearly something that we can discuss more and do as a later separate step, but I just wanted to throw it out there for kicks.
Argiris, what do you think of the "tentatively parse as a decl" approach?