Token lookahead without the preprocessor

Hi, all. I've been trying to come up with a useful recovery for this case (<rdar://problem/11602405> for Apple folks):

void foo();
{
  // note the spurious semicolon above
}

The trouble is, having a semicolon there is a perfectly good way to end a declaration. It's clear that if there's a brace on the next line, it was actually supposed to be a definition (because C/C++ don't have top-level braces). But we get in trouble in this case (from test/CodeGen/pragma-weak.c):

void __both2(void);
void both2(void) __attribute((alias("__both2"))); // first, wins
#pragma weak both2 = __both2
void __both2(void) {}
// CHECK: @both2 = alias void ()* @__both2
// CHECK: define void @__both2()

The lookahead after the semicolon has to go all the way to the next 'void' to get another token, and meanwhile the Lexer and Preprocessor have seen and recorded the #pragma weak.
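To make the failure concrete, here is a toy model of the situation. It is only a sketch: `EagerLexer`, `lex`, and `weakPragmasSeen` are made-up names, not Clang's real Lexer or Preprocessor interfaces, and the "tokens" are just the first word of each line. The only point is that a pragma handled at lex time fires as a side effect of looking ahead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model, not Clang's real classes: every name here is hypothetical.
struct EagerLexer {
  std::vector<std::string> lines;            // one source line per entry
  std::size_t next = 0;                      // index of the next unread line
  std::vector<std::string> weakPragmasSeen;  // side effects recorded at lex time

  // Return the first word of the next non-pragma line. Any "#pragma weak"
  // line walked over on the way is recorded immediately, even if the caller
  // only wanted to look ahead.
  std::string lex() {
    while (next < lines.size()) {
      const std::string &line = lines[next++];
      if (line.rfind("#pragma weak", 0) == 0) {  // prefix check
        weakPragmasSeen.push_back(line);         // takes effect right now
        continue;
      }
      return line.substr(0, line.find(' '));
    }
    return "<eof>";
  }
};
```

Feeding it the two lines that follow the semicolon in pragma-weak.c (`#pragma weak both2 = __both2`, then `void __both2(void) {}`) and calling `lex()` once returns `void`, but by then `weakPragmasSeen` already holds the pragma, so the lookahead was not free.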

There are similar problems in test/SemaCXX/warn-thread-safety-analysis.cpp, though I haven't specifically tracked them down.

Any ideas on what's the right thing to do here? I'd be fine with "there's a preprocessing directive in the way; don't bother" or "the next token is 'void' but you're gonna have to re-Lex from where you are" but I don't think we have a good way to do either one. (Raw mode /almost/ works except I'm not sure of the right way to go into raw mode from Parser.)

Thanks for any advice,
Jordan

Hi Jordy,

This is PR10101, and was fixed in r145372, but the fix was backed out due to the #pragma weak (and, at the time, #pragma visibility) issue. The problem is that the implementation of this pragma is incorrect, since it takes effect when the pragma is lexed, rather than when it is parsed, and the point at which the pragma occurs has a semantic impact. We shouldn’t be hacking around that in the parser by avoiding lookahead; the right fix is for the lexer to produce an annotation token when it encounters such a pragma, as it does for #pragma visibility, #pragma pack, and #pragma unused.

This issue also makes our parsing of “#pragma weak” accept code which GCC rejects (though I’m hesitant to call it an accepts-invalid since I can’t find a precise spec for this pragma): GCC (as far as I can determine) only accepts the pragma in places where it would parse a declaration.

Incidentally, I wonder whether it’d make sense to provide a more general framework for such cases, rather than adding ad-hoc pragma annotations. Perhaps, for all pragmas which can only appear in specific places in the grammar, we could lex them as a tok::annot_pragma followed by the tokens in the pragma and a tok::eod, and perform the pragma parsing in the parser.
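A rough sketch of the shape this could take, as a toy model rather than a proposal for the actual code: `TokKind`, `Token`, `ToyParser`, `peek`, and `parsePragma` are all hypothetical names, and the pragma "handling" is just recording the body text. The lexer has already produced a marker token, the pragma's own tokens, and an end-of-directive token. Nothing happens until the parser consumes them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only; these names are hypothetical and do not match Clang's API.
enum class TokKind { Identifier, Semi, LBrace, RBrace, AnnotPragma, Eod, Eof };

struct Token {
  TokKind kind;
  std::string text;
};

struct ToyParser {
  std::vector<Token> toks;  // already-lexed stream; lexing had no side effects
  std::size_t pos = 0;      // index of the next unconsumed token
  std::vector<std::string> pragmasApplied;  // filled in at parse time only

  // Lookahead: inspect a token without consuming it or acting on it.
  const Token &peek(std::size_t n = 0) const {
    static const Token eof{TokKind::Eof, ""};
    return pos + n < toks.size() ? toks[pos + n] : eof;
  }

  // Consume "annot_pragma <body tokens> eod" and act on the pragma now.
  // Precondition: the current token is the AnnotPragma marker.
  void parsePragma() {
    ++pos;  // skip the AnnotPragma marker
    std::string body;
    while (peek().kind != TokKind::Eod && peek().kind != TokKind::Eof) {
      if (!body.empty()) body += ' ';
      body += toks[pos++].text;
    }
    if (peek().kind == TokKind::Eod) ++pos;  // skip the end-of-directive token
    pragmasApplied.push_back(body);
  }
};
```

With a stream like `;`, annot_pragma, `weak`, `both2`, eod, `void`, peeking one token past the semicolon yields the annot_pragma marker and leaves `pragmasApplied` empty. A parser attempting the stray-semicolon recovery could see the marker and give up, and the pragma would still take effect at the point where it is actually parsed.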

Richard

Whoops, guess I should have been more proactive in looking for a PR. (I searched "semi" and didn't see it turn up.) Or tracking cfe-commits more closely.

I'd be fine with implementing tok::annot_pragma, but since I'm fairly new to Lex/Parse I'll wait to see if someone else wants to weigh in first.

Thanks for giving me the right context for this,
Jordan

This seems fairly reasonable, but would require picking apart the current logic for parsing function definitions to allow this. That actually doesn't look so bad, at least not for top-level definitions, but it sounds like Richard's refactoring would be a better solution anyway. (It's fairly orthogonal but it would make this unnecessary. The slight difference is that it wouldn't allow recovery in the #pragma weak case, but that seems fair to me.)

Of course, I also still don't have too much experience with Lex/Parse, so I might be missing something as well. :)

Jordan