From: Eli Friedman [mailto:firstname.lastname@example.org]
Sent: 06 June 2009 12:08
Subject: Re: [cfe-dev] Question on character sets and encodings
Of course, this is an 'as-if' rule and we are free to implement
something that does such translation on the fly, or be really
smart and work with a different character set/encoding entirely
that behaves as a super-set (e.g. ASCII or UTF8).
So my question is: What does Clang actually do?
clang currently does nothing in this regard; in practice, this ends up
being roughly equivalent to assuming both the source and execution
charset are UTF-8. If you want more discussion, try looking through
the cfe-dev archives.
Within a parse function, can I assume any character I meet will be
exclusively from the basic character set? Can I assume ASCII encoding
(e.g. all control characters have value < 32)?
Yes, feel free to assume an ASCII superset; the current plan (once
someone gets around to tackling it) is to translate to UTF-8 any
charset where that doesn't work.
Ok, so I should assume a character like '@' will appear as a single character,
and not be translated to the appropriate universal-character-name?
Conversely, what source encodings does Clang accept?
Can I feed it a file with UTF-8/UTF-16/UTF-32 encodings?
Currently just UTF-8. Actually, it might be a decent first project to
add -finput-charset support: it should just be a matter of making the
source manager do charset translation on the file before starting
Makes sense, although it is a part of the compiler I had been hoping to
leave for others <g>
I'm not sure about using a compiler switch though, as surely we must cope
with #including a file with a different encoding than the rest of the
project? For example, if we are encoding in UTF-16, it is unlikely that the
standard library was supplied with the same encoding, or Boost, or other popular
I suggest it might be better to detect a Unicode BOM and transcode
accordingly. In the absence of a BOM, assume UTF8.
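
To make the BOM suggestion concrete, here is a minimal sketch of the detection
step. It is not existing Clang code: SourceEncoding, BomResult, hasPrefix and
detectBom are hypothetical names invented for illustration. It only classifies
the buffer and reports how many bytes to skip; the transcoding itself is a
separate step.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical names for illustration; none of this is Clang API.
enum class SourceEncoding { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct BomResult {
  SourceEncoding encoding; // encoding implied by the BOM (UTF-8 if none)
  std::size_t bomLength;   // number of leading bytes to skip
};

// True if the buffer begins with the given byte sequence.
inline bool hasPrefix(const unsigned char *data, std::size_t size,
                      const unsigned char *bom, std::size_t bomSize) {
  return size >= bomSize && std::memcmp(data, bom, bomSize) == 0;
}

inline BomResult detectBom(const unsigned char *data, std::size_t size) {
  static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
  static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
  static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
  static const unsigned char utf16le[] = {0xFF, 0xFE};
  static const unsigned char utf16be[] = {0xFE, 0xFF};

  // UTF-32 LE is tested before UTF-16 LE because its BOM starts with FF FE.
  if (hasPrefix(data, size, utf32le, sizeof utf32le))
    return {SourceEncoding::UTF32LE, sizeof utf32le};
  if (hasPrefix(data, size, utf32be, sizeof utf32be))
    return {SourceEncoding::UTF32BE, sizeof utf32be};
  if (hasPrefix(data, size, utf8, sizeof utf8))
    return {SourceEncoding::UTF8, sizeof utf8};
  if (hasPrefix(data, size, utf16le, sizeof utf16le))
    return {SourceEncoding::UTF16LE, sizeof utf16le};
  if (hasPrefix(data, size, utf16be, sizeof utf16be))
    return {SourceEncoding::UTF16BE, sizeof utf16be};

  // No BOM: assume UTF-8 and skip nothing.
  return {SourceEncoding::UTF8, 0};
}
```

One known ambiguity: FF FE 00 00 could also be a UTF-16 LE BOM followed by
U+0000. The sketch assumes a source file will not start with a NUL character
and treats that sequence as UTF-32 LE.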
Finally, are there any existing Unicode facilities in the code base I
can call on when trying to transcode into/out-of Unicode?
Excellent! I see all the facilities I will want to use are currently
commented out though!
At least there is an implementation of sorts. I assume they are disabled
for lack of tests/validation?
So it looks like the prerequisite to the prerequisite of my self-selected easy
first project should be to implement the UTF-16/32-to-UTF-8 transcoders. Does
anyone on this list have a good set of reference tests I can use?
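
For discussion, this is roughly the shape of UTF-16-to-UTF-8 transcoder I have
in mind. It is a sketch under my own assumptions, not the commented-out code in
the tree: appendUtf8 and utf16ToUtf8 are hypothetical names, the input is
assumed to be already in host byte order with any BOM stripped, and ill-formed
input (an unpaired surrogate) is simply rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
inline void appendUtf8(std::uint32_t cp, std::string &out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical transcoder: appends to out, returns false on an
// unpaired surrogate (out may then hold a partial result).
inline bool utf16ToUtf8(const std::uint16_t *src, std::size_t len,
                        std::string &out) {
  for (std::size_t i = 0; i < len; ++i) {
    std::uint32_t cp = src[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 >= len)
        return false;
      std::uint32_t low = src[i + 1];
      if (low < 0xDC00 || low > 0xDFFF)
        return false;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      ++i; // consumed the low surrogate as well
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    }
    appendUtf8(cp, out);
  }
  return true;
}
```

A few hand-checked vectors would make a starting point for tests: U+0040 gives
40; U+00E9 gives C3 A9; U+20AC gives E2 82 AC; and the surrogate pair D801 DC37
(U+10437) gives F0 90 90 B7. A proper reference suite would still be welcome.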