Identifier text in raw lexing

Currently, during lexing in LexingRawMode only the literal text is
collected (setLiteralData/getLiteralData), while the identifier text is
lost.

We'd like to save the start of identifier text in raw lexing mode using
the same shared pointer to permit to raw lexing client to avoid to do a
lot of extra work to get the identifier text.

To implement that:

- we'd insert a raw_identifer token kind
- we'd add Token::isIdentifier method
- we'd add Token::get/setRawIdentiferData methods
- we'd change Token::getIdentifierInfo to return 0 on raw_identifier
- we'd replace occurrences of check for tok::identifier with
Token::isIdentifer()

The performance impact should be definitely zero, but we are willing to
do the benchmark you ask to proof that.

A patch following the above guidelines have some chances to be accepted
upstream?

Currently, during lexing in LexingRawMode only the literal text is
collected (setLiteralData/getLiteralData), while the identifier text is
lost.

Hi Abramo,

I'm not sure what you're trying to accomplish here: it sounds like you want to get the "identifier text" from the token itself without getting the spelling. Is this correct? If so, are you looking for a performance win, a functionality improvement, or something else? What problem are you trying to solve?

-Chris

Yes, the idea is to have access to identifier text just like we have now
access to LiteralData.

We are looking for a performance win (and secondarily for a simmetric
interface wrt literals) for our raw lexer clients (currently they need
to use something like the above to get the text for each identifier):

  token_begin = sm.getSpellingLoc(token_begin);

  const char* current_char = sm.getCharacterData(token_begin);
  const char* end = current_char + length;

  while (current_char < end) {
    char c = clang::Lexer::getCharAndSizeNoWarn(current_char, length, lo);
    output += c;
    current_char += length;
  }

I've attached the working patch for approval.

raw-identifier.patch (51.5 KB)

Hi Abramo,

This approach looks ok in principle, but I have some nit-pics about your patch :slight_smile:

+++ lib/Parse/ParseExprCXX.cpp (working copy)
@@ -130,7 +130,7 @@
       SourceLocation TemplateKWLoc = ConsumeToken();
       
       UnqualifiedId TemplateName;
- if (Tok.is(tok::identifier)) {
+ if (Tok.isIdentifier()) {
         // Consume the identifier.

You seem to have search and replaced all uses of tok::identifier with isIdentifier. However, raw_identifiers can't get into the parser and can't get into most places in the preprocessor. Please keep these as checks of just tok::identifier. Only code that can actually get both should use "isIdentifier". Also, for the same reason, please rename isIdentifier to isAnyIdentifier.

+++ lib/Lex/Preprocessor.cpp (working copy)
@@ -369,11 +369,11 @@
// Lexer Event Handling.
//===----------------------------------------------------------------------===//

-/// LookUpIdentifierInfo - Given a tok::identifier token, look up the
+/// LookUpIdentifierInfo - Given a (raw) identifier token, look up the
/// identifier information for the token and install it into the token.
IdentifierInfo *Preprocessor::LookUpIdentifierInfo(Token &Identifier,
                                                    const char *BufPtr) const {
- assert(Identifier.is(tok::identifier) && "Not an identifier!");
+ assert(Identifier.isIdentifier() && "Not an identifier!");
   assert(Identifier.getIdentifierInfo() == 0 && "Identinfo already exists!");

This gets passed in a raw_identifier if I understand things correctly. Please change the assert to verify that. Given your changes, there is no need to pass in BufPtr either. Please eliminate this argument, getting the pointer from the token instead.

+++ lib/Lex/Lexer.cpp (working copy)
@@ -473,7 +473,7 @@
       // we don't have an identifier table available. Instead, just look at
       // the raw identifier to recognize and categorize preprocessor directives.
       TheLexer.LexFromRawLexer(TheTok);
- if (TheTok.getKind() == tok::identifier && !TheTok.needsCleaning()) {
+ if (TheTok.isIdentifier() && !TheTok.needsCleaning()) {
         const char *IdStart = Buffer->getBufferStart()
                             + TheTok.getLocation().getRawEncoding() - 1;
         llvm::StringRef Keyword(IdStart, TheTok.getLength());

Likewise, this seems like it should only check for a raw_identifier. The code can be dramatically simplified by using the "const char*" in the token. Please do.

Similarly, the code in Preprocessor::SkipExcludedConditionalBlock shouldn't have to use:

    const char *RawCharData = SourceMgr.getCharacterData(Tok.getLocation(),
                                                         &Invalid);

it can now use the const char* in the token. While I was initially skeptical of the approach, I think it will lead to a nice cleanup of existing code (and is the right thing to do), but it is important not to scatter "isIdentifiers" everywhere, and important to make use of the new capabilities!

Thanks for working on this,

-Chris

[...]

I've attached the working patch for approval.

Hi Abramo,

This approach looks ok in principle, but I have some nit-pics about
your patch :slight_smile:

[...]

You seem to have search and replaced all uses of tok::identifier with
isIdentifier. However, raw_identifiers can't get into the parser and
can't get into most places in the preprocessor. Please keep these as
checks of just tok::identifier. Only code that can actually get both
should use "isIdentifier". Also, for the same reason, please rename
isIdentifier to isAnyIdentifier.

[...]

While I was initially
skeptical of the approach, I think it will lead to a nice cleanup of
existing code (and is the right thing to do), but it is important not
to scatter "isIdentifiers" everywhere, and important to make use of
the new capabilities!

Thanks for working on this,

-Chris

Here is the revised patch. It passes all clang tests, but we would
really appreciate to have it reviewed again before commit.

Cheers,
Enea.

revised-raw-identifier.patch (20.2 KB)

Ping.

Can I proceed with commit?

Sorry for the delay, this patch looks really great to me. Please commit!

-Chris