Patch for UCNs in string literals

The attached patch applies N2170 for C++0x, which broadens the range of
valid UCNs in string literals, but not identifiers.

It does not address any of the outstanding issues such as no support for
UCNs in identifiers or character literals.

It does address an outstanding FIXME to issue an extension warning when
encountering UCNs in non C++/C99 mode.

Attached are two test cases that should go into the directory:

  ...\llvm\tools\clang\test\CXX\lex\lex.charset

For some reason I can't get my patch generator to add new files :¬(

One test is for C, the other for C++0x.
Both are XFAIL for now, due to a number of unimplemented features this
paragraph relies on. However, I have confirmed that n2170 passes the
correct subset of tests.

AlisdairM

n2170.patch (2.02 KB)

p2-cpp0x.cpp (7.67 KB)

p2.cpp (4.41 KB)

For some reason I can't get my patch generator to add new files :¬(

If you're using "svn diff", you can just use "svn add" on the new
files. If not, I highly suggest it :slight_smile:

One test is for C, the other for C++0x.
Both are XFAIL for now, due to a number of unimplemented features this
paragraph relies on. However, I have confirmed that n2170 passes the
correct subset of tests.

Please just include the subset of tests that currently passes; we can
add additional tests as the issues get fixed.

-Eli

Eli Friedman wrote:

Please just include the subset of tests that currently passes; we can
add additional tests as the issues get fixed.

I was under the impression based on the website that XFAIL'd tests were encouraged:

     http://clang.llvm.org/OpenProjects.html

     "Pick a paragraph, write a few tests, and see if they work! Even if
     they don't we'd still like the new tests (with XFAIL'd) so that we
     know what to fix."

- Jim

A patch should be accompanied by a non-XFAIL'd test for each thing it
fixes; otherwise, the tests won't detect regressions. I haven't
thought through what that means for the C++ tests; perhaps splitting
them into p2a/p2b/p2c/etc. would be appropriate.

-Eli

i/ there is a different behaviour between C++ and C. Ideally I would like to test both.
That is why I submitted two versions of tests for the same paragraph.

Right; the C tests should be separate.

In hindsight it looks like we might be better served by a parallel test structure to test/CXX for the C language, although I don't have a copy of that standard myself to know where to submit tests. At least with C++0x we have publicly available working draughts.

There are also working drafts available for C99, for example
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1336.pdf . That
said, we don't have the C tests structured that way, and I'm not sure
it's really worth the effort at this point, so just name it something
reasonable and put it in with the existing tests.

A subset of this issue is the change between C++ and C++0x, which again suggests two tests. I followed precedent elsewhere of tagging the 0x specific tests with the cpp0x suffix. Ideally we should run all C++ tests in both C++ and C++0x mode as (most) conforming C++03 programs should also be conforming C++0x programs. So far, the main incompatibility would be handling of auto.

That's an interesting idea, but it's probably not worth bothering: the
codepaths are mostly identical.

ii/ Eli's original issue that we want to see tests passing to indicate the new feature works, as well as the extensive tests to validate the paragraph. I am happy to break out the passing functionality tests, but I don't want to lose the full paragraph coverage. How should we name these tests and where should we place them?

For example, when working on my type-alias patch I am using a test file unrelated to paragraph numbers, as the controlling paragraph says not a lot more than 'works like a typedef' so I need to test a wide variety of typedef-like uses, which is almost an open set. Could I submit my working tests as n2170a.cpp, while retaining the full XFAILed p2? Likewise, we can incrementally add n2170b,c,d,etc. as support for UCNs in more contexts becomes available.

Hmm, actually, how about splitting it into, for example, "n2170.cpp"
and "n2170-xfail.cpp"? The passing tests go into n2170.cpp, the
failing ones go into n2170-xfail.cpp, and we shuffle them around as
stuff gets implemented.

(Side note: cfe-dev doesn't mess with the reply-to, so if you hit
"reply" instead of "reply all", your response doesn't go to the list.)

-Eli

Thanks. I just found n1362 which seems to be the latest draught. Do we have any plans for an experimental C1x mode yet? For example, I notice that this draught now has support for static_assert and Unicode types and literals.

Given we already implement static_assert for C++0x, I imagine it should be fairly easy to adapt for _Static_assert in C1x mode.

The Unicode characters give me more pause (which I am looking to implement for C++0x). The C standard does not spell out a specific name for these types, but defers them to uint_least_16/32_t. However, I believe the string literals are usable even when this header has not been included, and obviously those typedefs might resolve to different types on different platforms. What would be the best way to support this for C? I had originally thought that C defined two new types, _Char16_t and _Char32_t just as it introduces _Bool, but that does not seem to be the case.

On the subject of C testing of UCNs, where should I put this test? Does test/lexer/universal_char_name.c sound right?

Again, what should I do about the XFAIL wide-char tests? (other than implement the support ;¬)

AlisdairM

From: Eli Friedman [mailto:eli.friedman@gmail.com]
Sent: 05 July 2009 12:45
To: AlisdairM(public)
Cc: clang-dev Developers
Subject: Re: [cfe-dev] Patch for UCNs in string literals

There are also working drafts available for C99, for example
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1336.pdf . That
said, we don't have the C tests structured that way, and I'm not sure
it's really worth the effort at this point, so just name it something
reasonable and put it in with the existing tests.

Thanks. I just found n1362 which seems to be the latest draught. Do we have any plans for an experimental C1x mode yet? For example, I notice that this draught now has support for static_assert and Unicode types and literals.

No plans, but we'd be happy to put in the flags for a C1x mode.

Given we already implement static_assert for C++0x, I imagine it should be fairly easy to adapt for _Static_assert in C1x mode.

How interesting. C1x _Static_assert is identical to C++0x static_assert, then <assert.h> #defines static_assert to _Static_assert.

Should be a trivial matter to support C1x's _Static_assert with the same implementation as C++0x's static_assert. Patch welcome :slight_smile:

The Unicode characters give me more pause (which I am looking to implement for C++0x). The C standard does not spell out a specific name for these types, but defers them to uint_least_16/32_t. However, I believe the string literals are usable even when this header has not been included, and obviously those typedefs might resolve to different types on different platforms.

We should know what uint_least16/32_t are from our target information, so we can just dig out that appropriate integral type. However, it's unfortunate that these types don't have a name that would be recognizable in diagnostics. (What's a character string of type const int*, anyway?) Smells like a design problem.

What would be the best way to support this for C? I had originally thought that C defined two new types, _Char16_t and _Char32_t just as it introduces _Bool, but that does not seem to be the case.

The best way might be to convince the C committee to add _Char16_t and _Char32_t. C1x is still young and incomplete.

On the subject of C testing of UCNs, where should I put this test? Does test/lexer/universal_char_name.c sound right?

Sure.

Again, what should I do about the XFAIL wide-char tests? (other than implement the support ;¬)

We can add a new XFAIL tests (in a separate .c file), and will revisit them when someone gets around to implementing more support for wide characters.

   - Doug

We have exactly the same problem for wchar_t. And we can solve it in
exactly the same way: typedef the underlying type to something like
__wchar_t, and use the typedef for string literals. It's not quite
ideal, but there's only so much we can do.

-Eli

Great, thank you for working on this, some thoughts about the patch:

    // Check UCN constraints (C99 6.4.3p2).
- if ((UcnVal < 0xa0 &&
+ // C++0x has simpler constraints in literals, but not identifiers.
+ // (C++0x 2.3p2 [lex.charset])
+ if ( (UcnVal >= 0xD800 && UcnVal <= 0xDFFF) ||
+ (UcnVal > 0x10FFFF) /* the maximum legal UTF32 value */ ||
+ (!PP.getLangOptions().CPlusPlus0x && UcnVal < 0xa0 &&
        (UcnVal != 0x24 && UcnVal != 0x40 && UcnVal != 0x60 )) // $, @, `
- || (UcnVal >= 0xD800 && UcnVal <= 0xDFFF)
- || (UcnVal > 0x10FFFF)) /* the maximum legal UTF32 value */ {
+ ) {

Please split this out into a static helper function that takes UcnVal and LangOptions. That will allow you to use early exits and make the comments more clear.

+def warn_ucn_extension_in_c89 : Extension<"UCNs are an extension from C99">;

How about "universal character names (UCNs) are supported in c89 mode as an extension"?

Do you have commit-after-approval access? If not, please contact me off-list for information.

-Chris