C99/C++ UCN (Universal Character Name) Support

Folks,

I'm in the process of implementing UCN support in LiteralSupport.cpp.

Part of implementing this is converting UTF-16 (\u) and UTF-32 (\U) to UTF-8 (for insertion into a C-string, say).

Unfortunately, Unix doesn't appear to have any standard support for this type of conversion (which surprised me).

Does anyone have any experience with this type of conversion?

Thanks for any pointers!

snaroff

> Part of implementing this is converting UTF-16 (\u) and UTF-32 (\U) to
> UTF-8 (for insertion into a C-string, say).

It's not very hard; one version of the formula is available in the
UTF-8 article on Wikipedia. And UTF-16 isn't really relevant
here; \u denotes a Unicode code point, not a UTF-16 code unit.
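
For illustration, the encoding formula can be sketched roughly as follows (the function name is made up, and a real implementation would also need to reject surrogate values 0xD800-0xDFFF and code points above 0x10FFFF):

```cpp
#include <cstdint>
#include <string>

// Encode a Unicode code point (the value written by \u or \U) as UTF-8.
// Minimal sketch: no validation of surrogates or out-of-range values.
static std::string EncodeUTF8(uint32_t cp) {
  std::string out;
  if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                              // 4 bytes: 11110xxx 10xxxxxx x3
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For example, U+20AC (the euro sign) encodes to the three bytes E2 82 AC.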

> Unfortunately, Unix doesn't appear to have any standard support for
> this type of conversion (which surprised me).

You could use iconv, although that's overkill here...
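
If you did go the iconv route, the shape would be something like this (a hypothetical helper, not anything in clang; the "UTF-32LE" encoding name is supported by glibc and GNU libiconv but is not guaranteed by POSIX itself):

```cpp
#include <iconv.h>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert one Unicode code point to UTF-8 via
// the POSIX iconv API.
static std::string CodePointToUTF8(uint32_t cp) {
  iconv_t cd = iconv_open("UTF-8", "UTF-32LE");
  if (cd == (iconv_t)-1)
    throw std::runtime_error("iconv_open failed");

  // Serialize the code point as little-endian UTF-32, independent of
  // the host's byte order.
  char in[4] = {
    static_cast<char>(cp & 0xFF),
    static_cast<char>((cp >> 8) & 0xFF),
    static_cast<char>((cp >> 16) & 0xFF),
    static_cast<char>((cp >> 24) & 0xFF),
  };
  char out[8];
  char *inp = in, *outp = out;
  size_t inleft = sizeof(in), outleft = sizeof(out);
  size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
  iconv_close(cd);
  if (rc == (size_t)-1)
    throw std::runtime_error("conversion failed");
  return std::string(out, outp - out);
}
```

Which is indeed a lot of machinery for a conversion the bit-twiddling formula handles in a dozen lines.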

-Eli

>> Part of implementing this is converting UTF-16 (\u) and UTF-32 (\U) to
>> UTF-8 (for insertion into a C-string, say).

> It's not very hard; one version of the formula is available in the
> UTF-8 article on Wikipedia. And UTF-16 isn't really relevant
> here; \u denotes a Unicode code point, not a UTF-16 code unit.

>> Unfortunately, Unix doesn't appear to have any standard support for
>> this type of conversion (which surprised me).

> You could use iconv, although that's overkill here...

I agree. I believe this is what GCC uses.

One of the Unicode guys at Apple pointed me to...

http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c

...which looks good to me.

snaroff

steve naroff wrote:-

> Folks,

> I'm in the process of implementing UCN support in LiteralSupport.cpp.

> Part of implementing this is converting UTF-16 (\u) and UTF-32 (\U) to
> UTF-8 (for insertion into a C-string, say).

> Unfortunately, Unix doesn't appear to have any standard support for this
> type of conversion (which surprised me).

> Does anyone have any experience with this type of conversion?

> Thanks for any pointers!

Are you working on accepting them in identifiers too?

It's nice, both in literals and identifiers, to permit multibyte
characters from the current locale too, interchangeably with UCNs.

Does clang implement execution charsets, or is it mandating it be
UTF-8?

Neil.

The first step is UTF-8 only, and only accepted in string literals.

-Chris

Neil is my personal deity on this, but gcc uses a combination of iconv
and its own converters to deal with various character set issues. For these
two, IIRC, Neil/Zack wrote their own. iconv works really well when you start getting
into the more... esoteric execution character set issues, though.

-eric