[patch] Support for C++0x raw string literals

Still need to write up test cases, but so far this patch seems to
work. The lexer code is getting pretty ugly. I still need to fix up
token concatenation.

raw_strings.patch (11.4 KB)

This adds test cases and fixes a few minor bugs from the first patch.

~Craig

raw_strings.patch (17.3 KB)

Cool. A few comments:

+def err_raw_delim_too_long : Error<
+ "raw string delimiter longer than 16 characters">;

While technically correct, this isn't a very helpful diagnostic. I'm guessing that users are going to write raw string literals and forget the XXX()XXX part of it. The first question they'll ask is, "what's a delimiter?"
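
For readers unfamiliar with the term, here is a small sketch of what the delimiter is. The function names are made up for illustration and are not part of the patch. A raw string literal has the form R"delim(body)delim", where the delimiter may be empty, and no escape processing happens inside the body:

```cpp
#include <cassert>
#include <string>

// Ordinary literal: every backslash and quote has to be escaped.
inline std::string escapedPath() { return "C:\\temp\\\"new\".txt"; }

// Raw literal with an empty delimiter: R"( body )".
inline std::string rawPath() { return R"(C:\temp\"new".txt)"; }

// Raw literal with the delimiter `xy`. The body may contain )" because
// only the sequence )xy" ends the literal.
inline std::string rawWithDelimiter() { return R"xy(f(")"))xy"; }

// Illustrative check that the raw and escaped spellings agree.
inline void checkRawExamples() {
  assert(rawPath() == escapedPath());
  assert(rawWithDelimiter() == "f(\")\")");
}
```

A delimiter is only needed when the body itself contains )" and would otherwise end the literal early.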

Diving into the code, I think there's a better way:

+/// LexRawStringLiteral - Lex the remainder of a string literal, after having
+/// lexed either R".
+void Lexer::LexRawStringLiteral(Token &Result, const char *CurPtr,
+ tok::TokenKind Kind) {
+ char C;
+ const char *Prefix = CurPtr;
+ unsigned PrefixLen = 0;

Here's my new and improved patch. Changes from previous version:

- Fix up the error messages to give example delimiter. Added ending
  delimiter to the unterminated message.
- Fixed the lexing to not do phase 1 and phase 2 conversions per
  lex.pptoken paragraph 3.
- Use CharInfo array to simplify detection of valid delimiter characters.
- On errors, lex forward to the next quote character hoping to get the
  lexer back on track.
- Factored out duplicate code from IsIdentifierStringPrefix and better
  commented exactly what the checks were doing.
- Fixed the comment block above StringLiteralParser to include unicode
  and raw string literals. Added similar block above CharLiteralParser.
- Factored out the code that copies characters into ResultBuf in
  StringLiteralParser to better separate raw and non-raw string
  handling.
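
To make the delimiter handling concrete, here is a rough standalone sketch of the scan. It is not the code from the patch: the patch works on the lexer's buffer and uses the CharInfo table, while this version uses std::string and a plain character list. The names isDelimChar, RawString and scanRawString are hypothetical, the allowed character set is my reading of the d-char rule in the C++0x draft, and the error recovery (skipping ahead to the next quote) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Assumption: d-chars are the basic source characters other than space,
// parentheses, backslash, and the whitespace control characters.
inline bool isDelimChar(char C) {
  static const char Allowed[] =
      "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
      "_{}[]#<>%:;.?*+-/^&|~!=,\"'";
  return C != '\0' && std::strchr(Allowed, C) != nullptr;
}

struct RawString {
  bool Ok;           // false if the literal is malformed or unterminated
  std::string Delim; // text between the opening quote and '('
  std::string Body;  // text between '(' and the closing )delim"
};

// S is the text immediately after the opening R" of a raw string literal.
inline RawString scanRawString(const std::string &S) {
  RawString R = {false, std::string(), std::string()};
  std::size_t N = 0;
  while (N != S.size() && N != 16 && isDelimChar(S[N]))
    ++N;
  // Anything other than '(' here means a bad or over-long delimiter.
  if (N == S.size() || S[N] != '(')
    return R;
  const std::string Closer = ")" + S.substr(0, N) + "\"";
  const std::size_t End = S.find(Closer, N + 1);
  if (End == std::string::npos)
    return R; // unterminated
  R.Ok = true;
  R.Delim = S.substr(0, N);
  R.Body = S.substr(N + 1, End - N - 1);
  return R;
}

// Illustrative use: the body may contain ')' as long as )xx" does not appear.
inline void demoScan() {
  const RawString R = scanRawString("xx(a)b)xx\" tail");
  assert(R.Ok && R.Delim == "xx" && R.Body == "a)b");
}
```

Capping the loop at 16 means a 17-character delimiter fails the '(' check, which is where a "delimiter too long" diagnostic would go.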

~Craig

raw_strings.patch (29.5 KB)

Hi Craig,

This looks great to me. Give Chris Lattner a day to look over this and comment, and assuming no other issues come to light, please go ahead and commit tomorrow with one tiny typo fix:

@@ -1062,6 +1117,26 @@
}

+/// copyStringFragment - This function copies from Start to End into ResultPtr.
+/// Peforms widening for multi-byte characters.

Typo "peforms" here.

FWIW, please send future patches to cfe-commits. Thanks!

  - Doug

Hi Craig,

I don't know enough about the standard to know if this implements the right thing, but here's a general code review:

+++ include/clang/Lex/LiteralSupport.h (working copy)
@@ -197,6 +197,7 @@

private:
   void init(const Token *StringToks, unsigned NumStringToks);
+ void CopyStringFragment(const char *Start, const char *End);

This should probably take a StringRef.

+++ lib/Lex/Lexer.cpp (working copy)

+ while (PrefixLen != 16 && isRawStringDelimBody(CurPtr[PrefixLen]))
+ ++PrefixLen;

Please indent by 2, not 3 spaces.

+ if (C == '"') {
+ break;
+ } else

No need for "else" after break, or the curlies around the break.
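
As a sketch of the style being asked for, here is a made-up helper (findQuote is not from the patch) written with no braces around the lone break and no else after it:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: returns the offset of the first '"' in Ptr,
// or Len if there is none.
inline std::size_t findQuote(const char *Ptr, std::size_t Len) {
  std::size_t I = 0;
  while (I != Len) {
    if (Ptr[I] == '"')
      break; // single statement, so no curlies
    ++I;     // no "else" needed: the break has already left the loop
  }
  return I;
}

// Illustrative use.
inline void demoFindQuote() { assert(findQuote("ab\"c", 4) == 2); }
```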

Given how terrible it could have been ("undoing" phases of translation etc), I'm very pleased with how clean the Lexer code turned out!

+++ lib/Lex/TokenConcatenation.cpp (working copy)

+// IsStringPrefix - Return true if the buffer contains a string prefix
+static inline bool IsStringPrefix(const char *Ptr, unsigned length,

Please make this a proper doxygen comment, and finish your thought in the comment :).

Please make it take a StringRef instead of ptr+length. This may also let you simplify its body, using startsWith etc.

Please change the body of this function to be multiple if statements instead of one big ||/&& conglomeration :)
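
Something along these lines, as a rough sketch only. It uses std::string as a stand-in for StringRef so that it compiles on its own, drops the extra parameter of the real function, and assumes the prefix set is L, u, U, u8, R, LR, uR, UR and u8R:

```cpp
#include <cassert>
#include <string>

/// isStringPrefix - Return true if Str is exactly one of the string literal
/// prefixes L, u, U, u8, R, LR, uR, UR or u8R. Hypothetical stand-in for the
/// helper under review, not the code from the patch.
inline bool isStringPrefix(const std::string &Str) {
  if (Str.empty())
    return false;

  // Strip a trailing 'R' so raw and non-raw prefixes share the checks below.
  std::string Enc = Str;
  if (Enc.back() == 'R')
    Enc.pop_back();

  if (Enc.empty()) // plain "R"
    return true;
  if (Enc == "L" || Enc == "u" || Enc == "U")
    return true;
  return Enc == "u8";
}

// Illustrative use.
inline void demoIsStringPrefix() {
  assert(isStringPrefix("u8R"));
  assert(!isStringPrefix("RR"));
}
```

Each early return handles one case, which is easier to comment than a single boolean expression.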

I'll trust you on the LiteralSupport.cpp part :)

Thanks Craig, this is great work! Feel free to commit with these updates if you agree to them,

-Chris

+++ lib/Lex/TokenConcatenation.cpp (working copy)

+// IsStringPrefix - Return true if the buffer contains a string prefix
+static inline bool IsStringPrefix(const char *Ptr, unsigned length,

Please make this a proper doxygen comment, and finish your thought in the comment :).

Please make it take a StringRef instead of ptr+length. This may also let you simplify its body, using startsWith etc.

Please change the body of this function to be multiple if statements instead of one big ||/&& conglomeration :)

After testing gcc, I wonder if it's worth all this trouble to detect
only string prefixes. gcc seems to not concatenate any identifier with
a string.

+++ include/clang/Lex/LiteralSupport.h (working copy)
@@ -197,6 +197,7 @@

private:
void init(const Token *StringToks, unsigned NumStringToks);
+ void CopyStringFragment(const char *Start, const char *End);

This should probably take a StringRef.

Given the low-level nature of the code inside that method, does
changing it to a StringRef really accomplish much?

It is really a coding standards sort of thing: we prefer StringRef for string slices.

-Chris