Question on character sets and encodings

According to the C++ phases of translation (I don't know C or Objective-C),
source files should be transformed into the 'basic source character set'
before parsing, with any characters outside this set turned into a
universal-character-name representation.

Of course, this is an 'as-if' rule and we are free to implement something
that does such translation on the fly, or be really smart and work with a
different character set/encoding entirely that behaves as a super-set (e.g.
ASCII or UTF8).
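
As a trivial illustration of that phase-1 mapping (assuming the execution
charset can represent the character), the two literals below are equivalent
under the standard's model; the second is the universal-character-name
spelling the mapping notionally produces:

  // 'é' (U+00E9) is outside the basic source character set, so phase 1
  // notionally rewrites it as the UCN \u00E9.
  const char *raw = "café";
  const char *ucn = "caf\u00E9";   // same literal after the notional mapping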

So my question is: What does Clang actually do?

Within a parse function, can I assume any character I meet will be
exclusively from the basic character set? Can I assume ASCII encoding (e.g.
all control characters have value < 32)?

Conversely, what source encodings does Clang accept?
Can I feed it a file with UTF-8/UTF-16/UTF-32 encodings?

I'm going to need to understand this if I'm going to implement
char16_t/char32_t 'right', rather than quickly, and I'm beginning to think I
could have picked an easier starter project after all!

Finally, are there any existing Unicode facilities in the code base I can
call on when trying to transcode into/out-of Unicode?

Thanks
AlisdairM

Of course, this is an 'as-if' rule and we are free to implement something
that does such translation on the fly, or be really smart and work with a
different character set/encoding entirely that behaves as a super-set (e.g.
ASCII or UTF8).

So my question is: What does Clang actually do?

clang currently does nothing in this regard; in practice, this ends up
being roughly equivalent to assuming that both the source and execution
charsets are UTF-8. If you want more discussion, try looking through
the cfe-dev archives.
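
To make the practical consequence concrete (a sketch only, assuming the
source file itself is saved as UTF-8): the bytes of a literal pass straight
through, so a non-ASCII character simply keeps its UTF-8 byte sequence in
the execution-charset string.

  #include <cstdio>

  int main() {
    // 'é' is the two UTF-8 bytes 0xC3 0xA9 in the source file, and the same
    // two bytes end up in the string, so the literal is 6 bytes plus the
    // terminating NUL.
    std::printf("%zu\n", sizeof("héllo"));   // prints 7 on such a setup
    return 0;
  }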

Within a parse function, can I assume any character I meet will be
exclusively from the basic character set? Can I assume ASCII encoding (e.g.
all control characters have value < 32)?

Yes, feel free to assume an ASCII superset; the current plan (once
someone gets around to tackling it) is to translate to UTF-8 any
charset where that doesn't work.
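
For anyone touching the lexer, the reason that assumption is safe with
UTF-8 specifically is that every byte of a multi-byte sequence has its high
bit set, so a byte in the ASCII range really is that ASCII character. A
sketch (not clang's actual lexer code):

  // Any byte < 0x80 can be treated as plain ASCII; bytes >= 0x80 only ever
  // occur inside multi-byte UTF-8 sequences.
  inline bool isPlainASCII(unsigned char c)   { return c < 0x80; }

  // ASCII control characters are exactly 0x00-0x1F plus DEL (0x7F).
  inline bool isASCIIControl(unsigned char c) { return c < 0x20 || c == 0x7F; }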

Conversely, what source encodings does Clang accept?
Can I feed it a file with UTF-8/UTF-16/UTF-32 encodings?

Currently just UTF-8. Actually, it might be a decent first project to
add -finput-charset support: it should just be a matter of making the
source manager do charset translation on the file before starting
lexing.
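
Roughly the shape such a hook could take, as a hypothetical sketch only
(this is not existing clang code; it uses POSIX iconv and a made-up helper
name, with minimal error handling): translate the whole buffer to UTF-8
once, before the lexer sees it.

  #include <iconv.h>
  #include <cerrno>
  #include <stdexcept>
  #include <string>

  // Hypothetical: convert a source buffer from the charset named by a
  // -finput-charset style option into UTF-8 before lexing starts.
  std::string translateToUTF8(const std::string &input, const char *fromCharset) {
    iconv_t cd = iconv_open("UTF-8", fromCharset);
    if (cd == (iconv_t)-1)
      throw std::runtime_error("unsupported input charset");

    std::string out;
    char buf[4096];
    char *in = const_cast<char *>(input.data());
    size_t inLeft = input.size();
    while (inLeft != 0) {
      char *outPtr = buf;
      size_t outLeft = sizeof(buf);
      if (iconv(cd, &in, &inLeft, &outPtr, &outLeft) == (size_t)-1 &&
          errno != E2BIG) {
        iconv_close(cd);
        throw std::runtime_error("invalid byte sequence in source file");
      }
      out.append(buf, sizeof(buf) - outLeft);
    }
    iconv_close(cd);
    return out;
  }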

Finally, are there any existing Unicode facilities in the code base I can
call on when trying to transcode into/out-of Unicode?

See include/Basic/ConvertUTF.h.
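
In case it saves someone some digging, this is roughly how the stock
Unicode, Inc. interface in that header is driven. The type and function
names below (UTF8, UTF16, ConvertUTF8toUTF16, strictConversion, and so on)
are the usual ConvertUTF ones and should be checked against the copy in the
tree, especially while parts of it are commented out; utf8ToUTF16 is just a
sketch helper.

  #include "ConvertUTF.h"   // i.e. the header mentioned above
  #include <vector>

  // Sketch: convert a UTF-8 range to UTF-16, reporting failure on
  // ill-formed input. UTF-16 never needs more code units than the input
  // has bytes, so one resize up front is enough.
  bool utf8ToUTF16(const char *begin, const char *end, std::vector<UTF16> &out) {
    out.resize(end - begin);
    if (out.empty())
      return true;
    const UTF8 *src = reinterpret_cast<const UTF8 *>(begin);
    UTF16 *dst = &out[0];
    ConversionResult status =
        ConvertUTF8toUTF16(&src, reinterpret_cast<const UTF8 *>(end),
                           &dst, &out[0] + out.size(), strictConversion);
    if (status != conversionOK)
      return false;
    out.resize(dst - &out[0]);
    return true;
  }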

-Eli

From: Eli Friedman [mailto:eli.friedman@gmail.com]
Sent: 06 June 2009 12:08
To: AlisdairM(public)
Cc: cfe-dev@cs.uiuc.edu
Subject: Re: [cfe-dev] Question on character sets and encodings

Of course, this is an 'as-if' rule and we are free to implement
something that does such translation on the fly, or be really
smart and work with a different character set/encoding entirely
that behaves as a super-set (e.g. ASCII or UTF8).

So my question is: What does Clang actually do?

clang currently does nothing in this regard; in practice, this ends up
being roughly equivalent to assuming that both the source and execution
charsets are UTF-8. If you want more discussion, try looking through
the cfe-dev archives.

Thanks.

Within a parse function, can I assume any character I meet will be
exclusively from the basic character set? Can I assume ASCII encoding
(e.g. all control characters have value < 32)?

Yes, feel free to assume an ASCII superset; the current plan (once
someone gets around to tackling it) is to translate to UTF-8 any
charset where that doesn't work.

OK, so I should assume a character like '@' will appear as a single
character, and not be translated to the appropriate universal-character-name?

Conversely, what source encodings does Clang accept?
Can I feed it a file with UTF-8/UTF-16/UTF-32 encodings?

Currently just UTF-8. Actually, it might be a decent first project to
add -finput-charset support: it should just be a matter of making the
source manager do charset translation on the file before starting
lexing.

Makes sense, although it is a part of the compiler I had been hoping to
leave for others <g>

I'm not sure about using a compiler switch though, as surely we must cope
with #including a file with a different encoding than the rest of the
project? For example, if we are encoding in UTF-16, it is unlikely that the
standard library, Boost, or other popular libraries were supplied with the
same encoding.

I suggest it might be better to detect a Unicode BOM and transcode
accordingly. In the absence of a BOM, assume UTF-8.
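
Something along these lines, as a sketch of the sniffing I have in mind
(detectEncoding is a hypothetical helper, not existing clang code); note the
UTF-32 patterns have to be tested before the UTF-16 ones, since a UTF-32LE
BOM starts with the bytes of a UTF-16LE one:

  #include <cstddef>

  enum SourceEncoding { SE_UTF8, SE_UTF16BE, SE_UTF16LE, SE_UTF32BE, SE_UTF32LE };

  // Look at the first bytes of the buffer; default to UTF-8 when no BOM is found.
  SourceEncoding detectEncoding(const unsigned char *buf, std::size_t len) {
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
      return SE_UTF32BE;
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
      return SE_UTF32LE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
      return SE_UTF8;                    // UTF-8 BOM: just skip it
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
      return SE_UTF16BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
      return SE_UTF16LE;
    return SE_UTF8;                      // no BOM: assume UTF-8
  }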

Finally, are there any existing Unicode facilities in the code base I
can call on when trying to transcode into/out-of Unicode?

See include/Basic/ConvertUTF.h.

Excellent! I see all the facilities I will want to use are currently
commented out though!

At least there is an implementation of sorts. I assume they are disabled
for lack of tests/validation?

So it looks like the prerequisite to the prerequisite of my self-selected
easy first project should be to implement the UTF-16/32-to-UTF-8 transcoders.
Does anyone on this list have a good set of reference tests I can use?
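
Failing a published suite, one cheap sanity check would be an exhaustive
round trip, sketched below against the usual ConvertUTF entry points (so it
only becomes runnable once those are re-enabled; roundTripAllScalarValues is
just a sketch name): every Unicode scalar value, i.e. every code point
outside the surrogate range, should survive UTF-32 -> UTF-8 -> UTF-32
unchanged.

  #include <cassert>
  #include "ConvertUTF.h"

  // Every scalar value (code points outside U+D800..U+DFFF) must round-trip
  // through UTF-32 -> UTF-8 -> UTF-32 unchanged under strict conversion.
  void roundTripAllScalarValues() {
    for (UTF32 cp = 0; cp <= 0x10FFFF; ++cp) {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        continue;                                  // surrogates are not scalar values

      UTF8 utf8[4];
      const UTF32 *src32 = &cp;
      const UTF32 *src32End = &cp + 1;
      UTF8 *dst8 = utf8;
      ConversionResult r = ConvertUTF32toUTF8(&src32, src32End, &dst8,
                                              utf8 + 4, strictConversion);
      assert(r == conversionOK);

      UTF32 back = 0;
      const UTF8 *src8 = utf8;
      UTF32 *dst32 = &back;
      r = ConvertUTF8toUTF32(&src8, dst8, &dst32, &back + 1, strictConversion);
      assert(r == conversionOK && back == cp);
    }
  }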

AlisdairM