[Review Request] char16_t and char32_t character literals

Hi, all,

I wrote a patch that implements the new char16_t and char32_t character literals introduced in C++0x.
The implementation is based on 2.13.3 Character literals [lex.ccon] of the C++0x draft N3242.
Could you review the patch and send me your comments and requests?

  • New features

When compiling with the -std=c++0x option, clang accepts char16_t and char32_t character
literals.

At this point there is one limitation; see (1) in the TODO list below.

  • Source Example

char16_t a = u'a'; // u'a' has type char16_t. The value is \u0061 in UTF-16.
char32_t b = U'b'; // U'b' has type char32_t. The value is \U00000062 in UTF-32.
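
The types and values stated in the comments above can be checked at compile time. The following is a minimal sketch, assuming a compiler with C++11 support; the names kA and kB are only illustrative and do not come from the patch.

```cpp
#include <type_traits>

// The literal prefix selects the character type at compile time.
static_assert(std::is_same<decltype(u'a'), char16_t>::value, "u'a' is char16_t");
static_assert(std::is_same<decltype(U'b'), char32_t>::value, "U'b' is char32_t");
static_assert(std::is_same<decltype(L'c'), wchar_t>::value, "L'c' is wchar_t");

// For these ASCII characters the literal value equals the Unicode code point.
static_assert(u'a' == 0x0061, "u'a' is U+0061");
static_assert(U'b' == 0x00000062, "U'b' is U+0062");

// Illustrative constants (hypothetical names).
constexpr char16_t kA = u'a';
constexpr char32_t kB = U'b';
```

If any of the static_assert lines fails, the compiler does not yet treat the prefixed literals as the standard draft describes.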

  • Implementation

I added three token kinds: tok::wchar_constant, tok::utf16char_constant, and tok::utf32char_constant.
(include/clang/Basic/TokenKinds.def)

If the Lexer finds a token starting with L', u' or U', it tries to construct a char literal token
whose kind is tok::wchar_constant, tok::utf16char_constant, or tok::utf32char_constant, respectively.
(Lexer::LexTokenInternal(Token& Result), lib/Lex/Lexer.cpp)
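
The prefix dispatch described above can be sketched roughly as follows. This is not the code from the patch or from Lexer.cpp: the enum CharTokKind and the function classifyCharLiteral are hypothetical stand-ins that only mirror the token kind names, and the sketch works on a plain string instead of the lexer's buffer.

```cpp
#include <string>

// Hypothetical stand-in for the token kinds; not clang's tok:: enum.
enum class CharTokKind {
  CharConstant,       // 'a'
  WCharConstant,      // L'a'
  UTF16CharConstant,  // u'a'
  UTF32CharConstant,  // U'a'
  NotCharLiteral      // anything else, e.g. an identifier starting with u
};

// Looks at an optional one-letter prefix and the opening quote.
CharTokKind classifyCharLiteral(const std::string &text) {
  if (text.empty()) return CharTokKind::NotCharLiteral;
  if (text[0] == '\'') return CharTokKind::CharConstant;
  // A prefix only counts if it is immediately followed by a single quote.
  if (text.size() < 2 || text[1] != '\'') return CharTokKind::NotCharLiteral;
  switch (text[0]) {
    case 'L': return CharTokKind::WCharConstant;
    case 'u': return CharTokKind::UTF16CharConstant;
    case 'U': return CharTokKind::UTF32CharConstant;
    default:  return CharTokKind::NotCharLiteral;
  }
}
```

For example, classifyCharLiteral("u'a'") yields UTF16CharConstant, while classifyCharLiteral("ux") yields NotCharLiteral, so that the caller would go on to lex an ordinary identifier.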

To make it easy to set the proper token kind on the char literal token object, I modified the
CharLiteralParser class so that it has a private member holding the token kind, and added an argument
to the constructor that takes the literal kind.
(include/clang/Lex/LiteralSupport.h, lib/Lex/LiteralSupport.cpp)

The parser and Sema set the appropriate type for wchar_t, char16_t and char32_t literals.
(Parser::ParseCastExpression, lib/Parse/ParseExpr.cpp)

  • TODO

(1) No code conversion.
At this point, only ASCII characters are available in char16_t and char32_t constants because
I have not implemented the code conversion logic. I plan to fix this in the next patch, which will
support char16_t and char32_t string literals.

(2) Code generation problem.
When defining an array of char32_t (a 4-byte-aligned type), the array is aligned to a 16-byte
boundary instead of a 4-byte one. I think this is a bug in clang. Lines 33 and 34 of a test program
in my patch, test/CodeGenCXX/cxx0x-char-literal.cpp, demonstrate the problem.

  • Test Environment and Result
    My patch is based on r131788.
    I tested it on Fedora 14 x86_64.

There are no regressions.
However, one problem was found, as described in the TODO list.

cxx0x-char-literal.r131788.patch (17.7 KB)

Clang needs full UTF-8 support, both inside and outside string-literals.
Please bear this in mind when coding support for it, otherwise it will
be just as much work to put full UTF-8 support in as it already is, and
you'll have wasted effort.

The intended design is to convert universal-character-names to UTF-8
internally, which we do not currently do.
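
As an illustration of what converting a universal-character-name to UTF-8 involves, here is a minimal sketch of encoding a single code point. The function name encodeUTF8 is hypothetical; it is neither clang's implementation nor the ConvertUTF API, and it assumes the \uXXXX or \UXXXXXXXX digits have already been parsed into an integer.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUTF8(std::uint32_t cp) {
  std::string out;
  if (cp <= 0x7F) {
    out += static_cast<char>(cp);                          // 0xxxxxxx
  } else if (cp <= 0x7FF) {
    out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
  } else if (cp <= 0xFFFF) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogate: invalid
    out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

With this sketch, \u20AC (the euro sign) becomes the three bytes E2 82 AC, which is the form an internal UTF-8 representation would store.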

Sean

If it helps, libc++ has conversion among all of UTF-8, UTF-16 and UTF-32 (well UCS4 actually). It is in locale.cpp (http://llvm.org/svn/llvm-project/libcxx/trunk/src/locale.cpp). Look for:

static
codecvt_base::result
utf16_to_utf8(const uint16_t* frm, const uint16_t* frm_end, const uint16_t*& frm_nxt,
              uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf16_to_utf8(const uint32_t* frm, const uint32_t* frm_end, const uint32_t*& frm_nxt,
              uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_utf16(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
              uint16_t* to, uint16_t* to_end, uint16_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_utf16(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
              uint32_t* to, uint32_t* to_end, uint32_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
ucs4_to_utf8(const uint32_t* frm, const uint32_t* frm_end, const uint32_t*& frm_nxt,
             uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
             unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_ucs4(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
             uint32_t* to, uint32_t* to_end, uint32_t*& to_nxt,
             unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

etc.

Also I made myself this "cheat sheet" for summarizing the UTF encodings:

http://home.roadrunner.com/~hinnant/utf_summary.html

If this stuff is helpful, great, if not, that's fine too. I just didn't want it to be hidden.

Howard

We already have routines for that; see include/clang/Basic/ConvertUTF.h.

-Eli

2011/5/23 Eli Friedman <eli.friedman@gmail.com>

The intended design is to convert universal-character-names to UTF-8

I’m happy to hear that is the intention.

Could you tell me where I can find the implementation spec and/or future plans for multilingual support?
In particular, I’d like to know which source file character encodings are or will be allowed in clang.

internally, which we do not currently do.

We already have routines for that; see include/clang/Basic/ConvertUTF.h.

OK. I’ll check the implementation.
I’ll update the patch and send it in a day.
Please wait a moment.

I updated the patch.

The patch now implements code conversion from UTF-8 to UTF-16/UTF-32 in character literals.
The remaining issue is the array alignment problem I reported. I’m investigating it.

Could you review it again?

Thank you for your advice.
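
For readers unfamiliar with the conversion involved, the following is a simplified sketch of turning the UTF-8 bytes of one character into a code point and then into UTF-16 code units. The functions decodeUTF8 and encodeUTF16 are hypothetical names for illustration; the patch itself is expected to rely on the routines in ConvertUTF.h rather than on code like this, and the sketch does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the UTF-8 sequence at the start of `s` into `cp`.
// Returns false on empty, truncated, or malformed input.
bool decodeUTF8(const std::string &s, std::uint32_t &cp) {
  if (s.empty()) return false;
  unsigned char b0 = static_cast<unsigned char>(s[0]);
  std::size_t len;
  if (b0 < 0x80)                { cp = b0;        len = 1; }
  else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
  else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
  else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
  else return false;  // stray continuation byte or invalid lead byte
  if (s.size() < len) return false;
  for (std::size_t i = 1; i < len; ++i) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  // Reject values outside the Unicode range and surrogate code points.
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encodes a valid code point as one or two UTF-16 code units.
std::vector<char16_t> encodeUTF16(std::uint32_t cp) {
  if (cp < 0x10000) return {static_cast<char16_t>(cp)};
  cp -= 0x10000;
  return {static_cast<char16_t>(0xD800 | (cp >> 10)),     // high surrogate
          static_cast<char16_t>(0xDC00 | (cp & 0x3FF))};  // low surrogate
}
```

A code point above U+FFFF needs two UTF-16 code units, so it cannot be represented by a single char16_t character literal; a char32_t literal can hold any code point directly.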

2011/5/24 Yusaku Shiga <yusaku.shiga@gmail.com>

cxx0x-char-literal.r131967.patch (22.9 KB)

2011/5/25 Yusaku Shiga <yusaku.shiga@gmail.com>

I updated the patch.

The patch now implements code conversion from UTF-8 to UTF-16/UTF-32 in character literals.
The remaining issue is the array alignment problem I reported. I’m investigating it.

I observed that clang silently extends the alignment to 16 bytes if the array object occupies 16 bytes or more (on the x86_64 target).
I’ll remove the alignment checks from my test cases.
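
The difference between the alignment the language requires and the alignment a particular object actually receives can be seen with a small sketch like the one below. The names gArr and observedAlignment are hypothetical and this is not the test from the patch. As far as I know, the x86-64 System V ABI asks for at least 16-byte alignment on array variables of 16 bytes or more, which would be consistent with the observation above, but the sketch makes no assumption about the value a given compiler produces.

```cpp
#include <cstddef>
#include <cstdint>

// A 16-byte array whose element type is 4-byte aligned on typical targets.
char32_t gArr[4] = {U'a', U'b', U'c', U'd'};

// The alignment requirement of the array type equals that of its element type.
static_assert(alignof(char32_t[4]) == alignof(char32_t),
              "array type alignment matches element alignment");

// Returns the largest power of two, capped at 64, that divides the address.
// The compiler may place an object at a stricter boundary than alignof demands.
std::size_t observedAlignment(const void *p) {
  std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
  std::size_t a = 1;
  while (a < 64 && addr % (a * 2) == 0) a *= 2;
  return a;
}
```

Calling observedAlignment(gArr) returns at least alignof(char32_t); whether it returns 16 or more depends on the target and compiler, which is why a test that hard-codes 4-byte alignment for such an array is fragile.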

I am updating the patch because I found a regression in the previous patch: it failed the warning check for multi-char wchar_t literals.
Here are the test results for cfe/trunk revision 132053 with the patch applied.

unaju[512] llvm/utils/lit/lit.py -sv llvm/tools/clang/test
lit.py: lit.cfg:108: note: using out-of-tree build at '/llvm/l2/bld/tools/clang'
lit.py: lit.cfg:143: note: using clang: '/llvm/l2/bld/Debug+Asserts/bin/clang'
lit.py: main.py:298: warning: test suite ‘Clang-Unit’ contained no tests
Testing Time: 180.05s
Expected Passes : 3096
Expected Failures : 22

2011/5/25 Yusaku Shiga <yusaku.shiga@gmail.com>

cxx0x-char-literal.r132053.patch (22.9 KB)