Review request: Patch for adding raw string support to clang

Hi,

I mostly implemented the C++0x proposal for raw string. You can declare a string like:

const char* s2 = R"[test]"; // test
const char* s3 = R"azerty[test]azerty"; // test
const char* s4 = R"azerty[\t\\e]azertst]azerty"; // \t\\e]azertst
const char* s5 = R"[\
a
multiline
string
literal]";

This isn't a very important functionnality, but it can be useful (think regex). I have some point on which I would like comment marked with TO REVIEW.
The patch is not yet ready for commit, but handle the mentioned case without problem.

Regards,

Cédric

rawstringliteral_review1.patch (11.8 KB)

test2.cpp (1.12 KB)

test.cpp (226 Bytes)

Hi Cédric,

Please try this procedure when submitting your final version of this patch:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2008-January/011992.html

Hi Cédric,

Please try this procedure when submitting your final version of this
patch:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2008-January/011992.html

Hey Gordon, can you add that info to the developer policy guide?

-Chris

DIAG(err_unterminated_string, ERROR,
     "missing terminating \" character")
+DIAG(err_too_much_dchars_rawstring, ERROR,
+ "The d-chars part of the string is too long (16 max)")

Is a separate diagnostic for a missing d-chars sequence at the end necessary?

TOK(string_literal) // "foo"
TOK(wide_string_literal) // L"foo"
TOK(angle_string_literal)// <foo>
+TOK(raw_string_literal) // R"**[foo]**" (N2442), support for u, U, u8 and
L?

Put the question part of the comment on a separate line.

+void Lexer::LexRawStringLiteral(Token &Result, const char *CurPtr, bool
Wide) {
+/* TO REVIEW: should I let this comment here?

Yes, it's good to have the grammar in the code as a reference for what
the code is trying to do.

+raw-string:
+ "d-char-sequence opt [r-char-sequenceopt ]d-char-sequenceopt "
+
+r-char-sequence:
+ r-char
+ r-char-sequence r-char
+
+r-char:
+ any member of the source character set, except, (1), a backslash \
followed by a u or U, or, (2), a right square bracket ]
+ followed by the initial d-char-sequence (which may
be empty) followed by a double quote ".
+ universal-character-name
+
+d-char-sequence:
+ d-char
+ d-char-sequence d-char
+
+d-char:
+ any member of the basic source character set, except space, the left
square bracket [, the right square bracket ],
+ or the control characters representing horizontal
tab, vertical tab, form feed, or new-line.
+*/
+
+ char dchar_seq[16];
+ int dchar_seq_len = 0;
+
+ // first read the optional d-char-sequence (0 to 16 characters)
+ char C = getAndAdvanceChar(CurPtr, Result);
+
+ while (C != '[') {
+ // FIXME: check the characters are in the allowed set
+ if(dchar_seq_len>=16) {
+ Diag(BufferPtr, diag::err_too_much_dchars_rawstring);
+ // TO REVIEW: should we attempt to recuperate on error here?

It would be ideal if the code supported more than 16 chars, but
otherwise lexing is essentially screwed anyway. Anything sane is
fine.

+ Result.setKind(tok::unknown);
+ FormTokenWithChars(Result, CurPtr-1);
+ return;
+ }
+ dchar_seq[dchar_seq_len++] = C;
+ C = getAndAdvanceChar(CurPtr, Result);

Don't you need to check for EOF/embedded null here?

+ }
+ // skip the '['
+ C = getAndAdvanceChar(CurPtr, Result);
+ while(1) {
+ while (C != ']') {
+ //if( a backslash \ followed by a u or U

Mmm, definitely need to handle this case. And also the escaped-newline case.

+ // TO REVIEW: enable this only for C++0x, or any language with
extension
+ // activated? or add a features like the pascal string?

Let's stick with just C++0x mode for the moment; this has the
potential to break existing code that uses R as a macro.

+// TO REVIEW: how to do this properly? if not followed by a ", need to
+// unconsume the char, perhaps saving the curptr and sizetmp var is enough?
+/* else if(Char == 'R') {
+ ConsumeChar(
+ if(getCharAndSize(CurPtr,SizeTmp)=='"')
+ return LexRawStringLiteral(Result, ConsumeChar(CurPtr, SizeTmp,
Result),
+ true);
+*/

If I'm not mistaken, you can actually just fall through from here, and
it will still work correctly.

+ // FIXME: what about raw string and trigraph, escaped end of line ...??

I don't follow; would these somehow make the string grow? Or is there
some other issue?

    // Check if this is a pascal string
    if (pp.getLangOptions().PascalStrings && ThisTokBuf + 1 != ThisTokEnd &&
        ThisTokBuf[0] == '\\' && ThisTokBuf[1] == 'p') {
+ // TO REVIEW: has a pascal raw string a meaning?

I'd say don't make raw pascal strings; escapes aren't generally
accepted in raw strings, so allowing \p would be confusing.

+ case tok::raw_string_literal: // TO REVIEW: objC and C++0x possible?

What's the question?

Note that this isn't a complete review.

-Eli

Hi,

Eli Friedman a écrit :

  

DIAG(err_unterminated_string, ERROR,
     "missing terminating \" character")
+DIAG(err_too_much_dchars_rawstring, ERROR,
+ "The d-chars part of the string is too long (16 max)")
    
Is a separate diagnostic for a missing d-chars sequence at the end necessary?
  
You can't tell if it is a missing dchar sequence or if it is just part of the string, so I don't see how a dignostic could be emitted. Consider:
R"abc[abc]ab]"c]abc" => legal string "abc]ab]\"c" (in clasical form)

+
+ char dchar_seq[16];
+ int dchar_seq_len = 0;
+
+ // first read the optional d-char-sequence (0 to 16 characters)
+ char C = getAndAdvanceChar(CurPtr, Result);
+
+ while (C != '[') {
+ // FIXME: check the characters are in the allowed set
+ if(dchar_seq_len>=16) {
+ Diag(BufferPtr, diag::err_too_much_dchars_rawstring);
+ // TO REVIEW: should we attempt to recuperate on error here?
    
It would be ideal if the code supported more than 16 chars, but
otherwise lexing is essentially screwed anyway. Anything sane is
fine.
  
I didn't choose 16, it comes from the C++ proposal (accepted): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm:
"A /d-char-sequence/ shall consist of at most 16 characters."
It would be hard to lift this limitation but, since this is in the standard, I think it is better like this?

+ Result.setKind(tok::unknown);
+ FormTokenWithChars(Result, CurPtr-1);
+ return;
+ }
+ dchar_seq[dchar_seq_len++] = C;
+ C = getAndAdvanceChar(CurPtr, Result);
    
Don't you need to check for EOF/embedded null here?
  
yes, done.

+ }
+ // skip the '['
+ C = getAndAdvanceChar(CurPtr, Result);
+ while(1) {
+ while (C != ']') {
+ //if( a backslash \ followed by a u or U
    
Mmm, definitely need to handle this case. And also the escaped-newline case.
  
This is handled latter in fact, in StringLiteralParser::StringLiteralParser (litteralSupport.cpp). In fact, since the UCN aren't handled in the standard string, I didn't handle them in raw string at the moment. it assert if it encouter them.

+ // FIXME: what about raw string and trigraph, escaped end of line ...??
    
I don't follow; would these somehow make the string grow? Or is there
some other issue?

No, I was wondering what to do of a trigaph or an escaped eol in a raw string? translate them like it is actually done or let them as is. I don't know the preprocessor and lexer phases so I am not sure what is the correct way. I think this is ok, so I deleted the comment.

+ case tok::raw_string_literal: // TO REVIEW: objC and C++0x possible?
    
What's the question?
  
I was wondering if we could encounter a C++0x specific token in an objC parser function... Anyways, adding the case is the safe alternative and don't cost much.

Note that this isn't a complete review.

-Eli
  
Thanks for taking the time. I attached a modified patch. And I changed the option for thunderbird, so it should be correctly attached.

rawstringliteral_review2.patch (12.3 KB)

Hi Cédric,

I know very little about this feature and haven't studied the wording, but here are some thoughts. This is actually a pretty nifty feature!

+ case tok::raw_string_literal: // TO REVIEW: objC and C++0x possible?

What's the question?

I was wondering if we could encounter a C++0x specific token in an objC parser function... Anyways, adding the case is the safe alternative and don't cost much.

Yes, you definitely can. The "Objective C++" language contains a superset of C++ and Objective C features. However, I don't think we should allow @R"...", please do not accept this.

Thanks for taking the time. I attached a modified patch. And I changed the option for thunderbird, so it should be correctly attached.

ok:

+DIAG(err_too_much_dchars_rawstring, ERROR,
+ "The d-chars part of the string is too long (16 max)")

What are d-chars? This is great for someone well versed in standardese, but not very useful for an end user. How about "raw string prefix too long"?

+DIAG(err_invalid_dchar, ERROR,
+ "The d-chars part of a raw string should not contain space, left square"
+ " bracket [, right square bracket ], horizontal tab, vertical tab,"
+ " form feed or new-line")

This is too long. A diagnostic should be ~65 chars max with very few exceptions.

Also, diagnostics should not be sentences, they should be all lower case.

+TOK(raw_string_literal) // R"**[foo]**" (N2442)

N2442 should be a more structured reference. How about "C++ working paper N2442"?

+void Lexer::LexRawStringLiteral(Token &Result, const char *CurPtr, bool Wide) {
+/* from http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
+raw-string:
+ "d-char-sequence_opt[r-char-sequence_opt]d-char-sequence_opt"

The comment should go in a doxygen (///) comment above the method like other lexer methods.

+ // TO REVIEW: Is this the corect way to return from error?
+ // should we attempt to recuperate on error here?
+ Result.setKind(tok::unknown);
+ FormTokenWithChars(Result, CurPtr-1);
+ return;

This is the right thing to do.

+ } else if (C == ' ' || C == ']' || C == '\t' || C == '\n' || C == '\f'
+ || C == '\v' || C == '\r') {
+ // TO REVIEW: how to set the error carret on the erroneous char?
+ Diag(BufferPtr, diag::err_invalid_dchar);

This should do the right thing, please write a testcase. If this is one character *after* the intended one, just safe BufferPtr from before the call to getAndAdvanceChar and report it at that "const char*" location.

+ // TO REVIEW: Is this the corect way to return from error?
+ // should we attempt to recuperate on error here?

this is correct.

@@ -1294,6 +1399,18 @@
        return LexStringLiteral(Result, ConsumeChar(CurPtr, SizeTmp, Result),
                                true);

+ if(Char == 'R') {
+ const char* NextCharPtr=CurPtr+SizeTmp;
+ unsigned int NextSizeTmp;

This should have a comment explaining what it is: e.g. LR"**[wide long string]**"

Please add a space after the 'if'.

+ if(NextChar=='"') {

This one too.

+ // TO REVIEW: do we need to consume each char of the token?
+ ConsumeChar(CurPtr, SizeTmp, Result);

Please write a testcase to verify :slight_smile:

+ return LexRawStringLiteral(Result, ConsumeChar(NextCharPtr, NextSizeTmp,Result),
+ true);

80 columns.

+++ lib/Lex/LiteralSupport.cpp (working copy)
@@ -608,7 +608,8 @@
        MaxTokenLength = StringToks[i].getLength();

      // Remember if we see any wide strings.
- AnyWide |= StringToks[i].is(tok::wide_string_literal);
+ AnyWide |= StringToks[i].is(tok::wide_string_literal)

The grammar above the StringLiteralParser ctor should be updated.

It seems that the normal and raw string code in LiteralSupport are pretty distinct. How about splitting them out into two separate methods of StringLiteralParser?

Please add several testcases for this feature to the clang testcase, verifying that the parser accepts valid cases and rejects invalid ones. Also, please add a simple codegen test. If you need help with these, please let me know.

-Chris