Clang 3.2 assertion failure reading AST files: TokenID != tok::identifier && "Already at tok::identifier"

The following code causes Clang (3.2 on Linux) to fail an assertion when deserializing an AST from a PCH file. Note that the name of the struct (__is_void) is also a Clang keyword.

struct __is_void {
   int val;
} a = { 42 };

$ clang --version
clang version 3.2 (tags/RELEASE_32/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix

$ clang -c __is_void.cpp
<no error, object file is generated successfully>

$ clang -emit-ast __is_void.cpp
<no error, AST file is generated successfully>

$ clang -c __is_void.ast
clang: include/clang/Basic/IdentifierTable.h:168: void clang::IdentifierInfo::RevertTokenIDToIdentifier(): Assertion `TokenID != tok::identifier && "Already at tok::identifier"' failed.
clang: error: unable to execute command: Segmentation fault (core dumped)
clang: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 3.2 (tags/RELEASE_32/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
clang: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script.
clang: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.

This assertion failure (with a different test case) was previously reported here:
   http://llvm.org/bugs/show_bug.cgi?id=13020
   Bug 13020 - Clang 3.1 assertion failures reading and writing AST files

The assertion failure occurs here:

include/clang/Basic/IdentifierTable.h:
167   void RevertTokenIDToIdentifier() {
168     assert(TokenID != tok::identifier && "Already at tok::identifier");
169     TokenID = tok::identifier;
170     RevertedTokenID = true;
171   }

When called from the AST deserialization code here:

lib/Serialization/ASTReader.cpp:
  461 IdentifierInfo *ASTIdentifierLookupTrait::ReadData(const internal_key_type& k,
  462 const unsigned char* d,
  463 unsigned DataLen) {
  ...
  487 unsigned Bits = ReadUnalignedLE16(d);
  ...
  490 bool HasRevertedTokenIDToIdentifier = Bits & 0x01;
  ...
  502   // Build the IdentifierInfo itself and link the identifier ID with
  503   // the new IdentifierInfo.
  504   IdentifierInfo *II = KnownII;
  505   if (!II) {
  506     II = &Reader.getIdentifierTable().getOwn(StringRef(k.first, k.second));
  507     KnownII = II;
  508   }
  509   Reader.markIdentifierUpToDate(II);
  510   II->setIsFromAST();
  511
  512   // Set or check the various bits in the IdentifierInfo structure.
  513   // Token IDs are read-only.
  514   if (HasRevertedTokenIDToIdentifier)
  515     II->RevertTokenIDToIdentifier();
  ...
  550 }

At line 515, the reader tries to restore the RevertedTokenID flag on the IdentifierInfo instance by calling RevertTokenIDToIdentifier(), but the assertion inside that function fires because the token kind (TokenID) is already tok::identifier.
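
To make the failure mode concrete, here is a minimal standalone sketch of it. It does not use any Clang headers: MiniIdentifierInfo and TokKind are hypothetical stand-ins for clang::IdentifierInfo and tok::TokenKind, and the precondition is reported with an exception instead of assert() so it can be observed without aborting. It assumes the IdentifierInfo that reaches line 515 still has TokenID == tok::identifier, which is what the assertion message indicates.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, heavily simplified stand-in for tok::TokenKind.
enum class TokKind { identifier, kw___is_void };

// Hypothetical, heavily simplified stand-in for clang::IdentifierInfo.
struct MiniIdentifierInfo {
    TokKind TokenID = TokKind::identifier;  // mirrors the constructor default
    bool RevertedTokenID = false;

    // Same precondition as the real method, but signalled by an exception
    // (an assumption made for this sketch) rather than by assert().
    void RevertTokenIDToIdentifier() {
        if (TokenID == TokKind::identifier)
            throw std::logic_error("Already at tok::identifier");
        TokenID = TokKind::identifier;
        RevertedTokenID = true;
    }
};

// Models the reader path: the stored bit says "reverted", but the entry
// being filled in already has TokenID == identifier.
inline bool readerPathTripsPrecondition() {
    MiniIdentifierInfo II;
    try {
        II.RevertTokenIDToIdentifier();
    } catch (const std::logic_error &) {
        return true;  // this is the case that asserts in Clang 3.2
    }
    return false;
}

// Models the parser path: the token is still a keyword when it is reverted.
inline bool parserPathSucceeds() {
    MiniIdentifierInfo II;
    II.TokenID = TokKind::kw___is_void;
    II.RevertTokenIDToIdentifier();
    return II.RevertedTokenID && II.TokenID == TokKind::identifier;
}
```

In this model the parser path succeeds because the token kind is still a keyword at the time of the call, while the reader path violates the precondition because nothing has set TokenID to a keyword kind first.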

The corresponding serialization code is here:

lib/Serialization/ASTWriter.cpp:
2658 class ASTIdentifierTableTrait {
....
2741 void EmitData(raw_ostream& Out, IdentifierInfo* II,
2742 IdentID ID, unsigned) {
....
2750 uint32_t Bits = (uint32_t)II->getObjCOrBuiltinID();
....
2758 Bits = (Bits << 1) | unsigned(II->hasRevertedTokenIDToIdentifier());
....
2760 clang::io::Emit16(Out, Bits);
....
2784 }
2785 };

Lines 1131 and 1132 below revert the token ID and set the token kind to tok::identifier when a keyword is used as a struct name. I suspect this is what sets the stage for the later assertion failure when deserializing the AST, but I haven't debugged further.

lib/Parse/ParseDeclCXX.cpp:
1049 void Parser::ParseClassSpecifier(tok::TokenKind TagTokKind,
1050 SourceLocation StartLoc, DeclSpec &DS,
1051 const ParsedTemplateInfo &TemplateInfo,
1052 AccessSpecifier AS,
1053 bool EnteringContext, DeclSpecContext DSC) {
....
1107 if (TagType == DeclSpec::TST_struct &&
1108 !Tok.is(tok::identifier) &&
1109 Tok.getIdentifierInfo() &&
1110 (Tok.is(tok::kw___is_arithmetic) ||
....
1125 Tok.is(tok::kw___is_void))) {
1126 // GNU libstdc++ 4.2 and libc++ use certain intrinsic names as the
1127 // name of struct templates, but some are keywords in GCC >= 4.3
1128 // and Clang. Therefore, when we see the token sequence "struct
1129 // X", make X into a normal identifier rather than a keyword, to
1130 // allow libstdc++ 4.2 and libc++ to work properly.
1131 Tok.getIdentifierInfo()->RevertTokenIDToIdentifier();
1132 Tok.setKind(tok::identifier);
1133 }
....
1501 }

The problem might also be that the IdentifierInfo constructor initializes TokenID to tok::identifier by default:

lib/Basic/IdentifierTable.cpp:
  31 IdentifierInfo::IdentifierInfo() {
  32 TokenID = tok::identifier;
  ..
  48 }

It isn't clear to me what the preferred fix for this would be. Options include:

1) Remove the assert.

2) Change the default initialization of TokenID in the IdentifierInfo constructor from tok::identifier to tok::unknown and force all instances to be explicitly initialized.

3) Modify ASTIdentifierLookupTrait::ReadData() above to force the TokenID value to something other than tok::identifier before calling RevertTokenIDToIdentifier().

4) Others?
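
For discussion, option 3 could look roughly like the following. This is only a sketch against a simplified model, not a patch against ASTReader.cpp: MiniIdentifierInfo, TokKind and restoreRevertedBit() are hypothetical names, and using an "unknown" token kind as the temporary value is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for tok::TokenKind.
enum class TokKind { unknown, identifier, kw___is_void };

// Hypothetical, heavily simplified stand-in for clang::IdentifierInfo.
struct MiniIdentifierInfo {
    TokKind TokenID = TokKind::identifier;  // mirrors the constructor default
    bool RevertedTokenID = false;

    void RevertTokenIDToIdentifier() {
        assert(TokenID != TokKind::identifier && "Already at tok::identifier");
        TokenID = TokKind::identifier;
        RevertedTokenID = true;
    }
};

// Hypothetical reader-side helper sketching option 3: if the stored bit says
// the identifier was reverted, move TokenID off "identifier" first so the
// precondition of RevertTokenIDToIdentifier() holds, then call it.
inline void restoreRevertedBit(MiniIdentifierInfo &II, bool HasReverted) {
    if (!HasReverted)
        return;
    if (II.TokenID == TokKind::identifier)
        II.TokenID = TokKind::unknown;  // temporary value, overwritten below
    II.RevertTokenIDToIdentifier();
}
```

After restoreRevertedBit(II, true), the model ends up with TokenID == identifier and RevertedTokenID == true, the same state the parser leaves behind. Whether a temporary token kind is acceptable in the real reader, or whether the call should simply be skipped when TokenID is already tok::identifier, is a design question I can't answer.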

Tom.

This is fixed in r176148

-Argyrios