[PATCH] C++0x unicode string and character literals now with test cases

Hello,

Attached is a new patch that fixes a few bugs in the patch I posted
last week. It also converts the multiple bool flags I was using into
an enum. I've updated existing char literal and string literal tests
and a few others to cover the new types.

Hopefully someone can review.

Thanks,
~Craig Topper

unicode_strings.patch (75.4 KB)

I took a quick look at it and it looks pretty good. You'll want to watch
long (> 80 character) lines. Someone more familiar with the code will
have to approve it though.

Thanks!

-eric

Thanks for the review! I'll fix the line lengths tonight. Do you know
who is more familiar with the code?

Thanks,
~Craig

Chris or Doug would be the two people most likely to approve it.

Had a couple of thoughts that I figured I'd ask about:

a) the enum for character and string is almost identical, any reason
not to make them identical

b) any reason not to have the string literal creation machinery just
default to ascii and not need the changes in most of the use cases?

-eric

Thanks for the review! I'll fix the line lengths tonight. Do you know
who is more familiar with the code?

Chris or Doug would be the two people most likely to approve it.

Had a couple of thoughts that I figured I'd ask about:

a) the enum for character and string is almost identical, any reason
not to make them identical

I mainly didn't make them identical cause I put them inside the
CharacterLiteral and StringLiteral classes. Wasn't sure were an
appropriate home would be if they were shared.

b) any reason not to have the string literal creation machinery just
default to ascii and not need the changes in most of the use cases?

I don't follow.

Here's a new patch that fixes the line lengths. Also widens to the
token kind in the Identifier table because 8-bits wasn't enough
anymore so kw_unknown_anytype was being truncated from 256 to 0.

unicode_strings.patch (77.1 KB)

IsIdentifierStringPrefix should be language-sensitive. Other than that, nothing more stands out to me.

Sean

Reverted TokenConcatenation::IsIdentifierL and added a new method
specifically for detecting identifiers of "u8", "u", or "U". This new
method is qualified with CPlusPlus0x check so that we won't avoid the
concat when Unicode string literals aren't supported. Used a separate
method just to make it cleaner than scattering C++0x checks throughout
the method.

unicode_strings.patch (77.5 KB)

Previous patch doesn't apply cleanly anymore so here's a new version.

Chris or Doug, if you're reading this can I get a review as suggested by Eric?

Thanks,
~Craig

unicode_strings.patch (77.3 KB)

Index: lib/Sema/SemaExprCXX.cpp

What if the value is outside the range of enumerators? This could
certainly happen due to, say, a PCH bug. I would point out that we
prefer llvm_unreachable for impossible cases though. In either case,
there ought to be no real cost since this will all get elided in
release mode.

Sean

Also, please remove the 'default' case; it's enough that the switch statement is exhaustive.

What if the value is outside the range of enumerators? This could
certainly happen due to, say, a PCH bug.

In that case, we'd find it and fix PCH.

I would point out that we
prefer llvm_unreachable for impossible cases though. In either case,
there ought to be no real cost since this will all get elided in
release mode.

If you leave the 'default' case in there, it silences warnings about missing enumerators when switching on an enum type. That's Very Very Bad.

  - Doug

As far as the const conversion for the new string literal types, gcc
4.5.2 allows it and throws a warning. Looks like the standard may
consider these cases to be illegal now instead of deprecated for even
the old cases.

Google found this page which documents discussion on this topic. Look
for "693. New string types and deprecated conversion "
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2009/n3007.html

~Craig

I think we should follow the CWG's direction here. The reason it was
not added, even as a deprecated conversion, was to avoid people
relying on a brand new deprecated conversion which would make it more
difficult to remove later when it could have been nipped in the bud.
So for consistency, the deprecated conversion for char* was removed.
In my opinion, given that the committee decided on this specific
issue, we should follow the committee's direction and reject this
conversion, certainly for the new types but probably char* as well
given the committee's explicit decision in that regard (and update the
compatibility docs).

Sean

Ok. I'll remove the conversions I added. Once this patch is in I'll
submit something separate for removing the old conversions.

~Craig

I have a sinking feeling that rejecting the string literal to char* conversion is going to break a lot of existing code. We'll have to do it eventually, but we'll need *excellent* recovery when we do.

  - Doug

Ok I've removed the defaults that I could and removed the implicit
const conversion stuff.

~Craig

unicode_strings.patch (77 KB)

Doh! I accidentally warned on UCNs larger than 0xffff in UTF-32
character literals. This patch fixes that.

unicode_strings.patch (77 KB)

Hi Craig,

Doug is definitely the best one to review the semantics of this. Some minor comments:

   IdentifierInfo &get(StringRef Name, tok::TokenKind TokenCode) {
     IdentifierInfo &II = get(Name);
+ assert(TokenCode < 512 && "TokenCode too large");
     II.TokenID = TokenCode;
     return II;

I would suggest instead:

   IdentifierInfo &get(StringRef Name, tok::TokenKind TokenCode) {
     IdentifierInfo &II = get(Name);
     II.TokenID = TokenCode;
+ assert(II.TokenID == TokenCode && "TokenCode too large");
     return II;

to avoid tying the '9' bit bitfield size to the magic 512 constant.

class StringLiteral : public Expr {
...
   unsigned ByteLength;
- bool IsWide;
+ StringKind Kind;
   bool IsPascal;
   unsigned NumConcatenated;

sizeof(StringKind) is likely to be 4, wasting space. I'd suggest making it an 8 bit bitfield or something.

In SemaDeclAttr etc:

- if (Str == 0 || Str->isWide()) {
+ if (Str == 0 || Str->getKind() != StringLiteral::Ascii) {
       S.Diag(Attr.getLoc(), diag::err_attribute_argument_n_not_string)
           << "weakref" << 1;

I'd suggest introducing the proper helper methods and using:

- if (Str == 0 || Str->isWide()) {
+ if (Str == 0 || !Str->isAscii()) {
       S.Diag(Attr.getLoc(), diag::err_attribute_argument_n_not_string)
           << "weakref" << 1;

In Lexer.cpp:

+ // treat U like the start of an identifier.
+ goto StartIdentifier;

Instead of doing this, I'd replace the gotos with just "return LexIdentifier(Result, CurPtr);" since "MIOpt.ReadToken" has already been done.

I would suggest fusing IsIdentifierL and IsIdentifierUTFStringPrefix into one function.

Otherwise, LGTM!

-Chris

Here's a new patch addressing Chris' suggestions.

~Craig

unicode_strings.patch (76.8 KB)