Bug: Lexer::getLocForEndOfToken() returns a position too far for a token that includes backslash-newline pairs

My setup is complicated but I think I nailed down the problem here.
Hopefully you will be able to reproduce it.

Given a token with embedded backslash-newline pairs, such that:
  "foo\
  bar\
  baz"
Lexer::getLocForEndOfToken() returns a location which is too far by
the number of backslash-newline pairs times 2, as if these characters
were counted twice. Looking at the implementation:

SourceLocation Lexer::getLocForEndOfToken(SourceLocation Loc, unsigned Offset,
                                          const SourceManager &SM,
                                          const LangOptions &Features) {
  if (Loc.isInvalid() || !Loc.isFileID())
    return SourceLocation();

  unsigned Len = Lexer::MeasureTokenLength(Loc, SM, Features);
  printf("Token length: %d\n", Len);
  if (Len > Offset)
    Len = Len - Offset;
  else
    return Loc;

  return AdvanceToTokenCharacter(Loc, Len, SM, Features);
}

I guess that MeasureTokenLength() includes any backslash-newline
pairs, but AdvanceToTokenCharacter() skips them. I'm not sure in which
direction this should be fixed.
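
To make the suspected mismatch concrete, here is a minimal toy model. It is not Clang's code: rawTokenLength() and advanceLogical() are hypothetical stand-ins for MeasureTokenLength() and AdvanceToTokenCharacter(), plain buffer offsets stand in for SourceLocation, and the token is assumed to be a terminated string literal without escaped quotes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for MeasureTokenLength(): counts raw bytes from the
// opening quote at `start` through the closing quote, so every
// backslash-newline pair contributes 2 bytes to the result.
std::size_t rawTokenLength(const std::string &buf, std::size_t start) {
  std::size_t i = start + 1;
  while (i < buf.size() && buf[i] != '"')
    ++i;
  return i - start + 1;
}

// Hypothetical stand-in for AdvanceToTokenCharacter(): advances by `n`
// logical characters. A backslash-newline pair is stepped over without
// being counted against `n`.
std::size_t advanceLogical(const std::string &buf, std::size_t start,
                           std::size_t n) {
  std::size_t i = start;
  while (n > 0 && i < buf.size()) {
    if (buf[i] == '\\' && i + 1 < buf.size() && buf[i + 1] == '\n') {
      i += 2; // skipped for free
      continue;
    }
    ++i;
    --n;
  }
  return i;
}
```

With the buffer "\"foo\\\nbar\\\nbaz\" + 1;" the token occupies offsets 0 through 14, so its end is offset 15. rawTokenLength(buf, 0) gives 15, but advanceLogical(buf, 0, 15) gives 19: the two pairs are included in the length and then skipped again while advancing, which is 4 bytes too far.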

I'm not subscribed to the list.

I *think* the right solution here is for getLocForEndOfToken to just use
getFileLocWithOffset instead of AdvanceToTokenCharacter. Would
you mind writing that up and testing it?
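
As a rough sketch of what that suggestion amounts to, again in a toy model rather than Clang's actual code: measureRawLength() and endOfTokenOffset() are hypothetical names, buffer offsets stand in for SourceLocation, and the token is assumed to be a terminated string literal without escaped quotes. The raw length is simply added to the start offset, with the same Offset handling as the function quoted above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for MeasureTokenLength(): raw byte length of the
// string literal starting at `start`, backslash-newline pairs included.
std::size_t measureRawLength(const std::string &buf, std::size_t start) {
  std::size_t i = start + 1;
  while (i < buf.size() && buf[i] != '"')
    ++i;
  return i - start + 1;
}

// Toy version of getLocForEndOfToken(): since the measured length is
// already in raw bytes, a plain addition lands on the end of the token.
std::size_t endOfTokenOffset(const std::string &buf, std::size_t start,
                             std::size_t offset) {
  std::size_t len = measureRawLength(buf, start);
  if (len <= offset)
    return start; // mirrors the "return Loc" branch
  return start + (len - offset);
}
```

For the buffer "\"foo\\\nbar\\\nbaz\" + 1;" endOfTokenOffset(buf, 0, 0) is 15, the offset just past the closing quote, and endOfTokenOffset(buf, 0, 1) is 14, the closing quote itself.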

John.

Yes, results look fine with this change.

--- /tmp/g4-89926/cache/depot/google3/third_party/llvm/trunk/tools/clang/lib/Lex/Lexer.cpp#33 2011-03-22 01:58:24.000000000 +0100
+++ /home/qrczak/qrczak-janitor/google3/third_party/llvm/trunk/tools/clang/lib/Lex/Lexer.cpp 2011-04-05 13:02:09.557044000 +0200
@@ -674,7 +674,7 @@
   else
     return Loc;
 
-  return AdvanceToTokenCharacter(Loc, Len, SM, Features);
+  return Loc.getFileLocWithOffset(Len);
 }
 
 //===----------------------------------------------------------------------===//

What is the impact on performance here (if any)?

I would expect the corrected code to be slightly faster.
getFileLocWithOffset() does mostly an addition, while
AdvanceToTokenCharacter() scans the token character by character.

We can say more than that: the corrected code should really be substantially
faster, because we're basically cutting the amount of work it does in half, but
it doesn't matter because this function is so far from being hot (except maybe
in the rewriter) that it would probably have to intentionally crash the compiler
to have a noticeable impact on compile time.

John.

Committed in r128978, thanks!

John.