I'm new to clang. I've been looking at adding support for
-finput-charset, -fexec-charset, and -fwide-exec-charset. I took a
look through the mailing list archives and the code, and I haven't seen a
lot of discussion of this except in a general sense. Has anyone taken
a more serious look at this?
From what I can tell, the general assumption inside clang about
character sets is that both the input and execution character sets are
some form of single byte encoding based on ASCII. Lexer.cpp actually
has an ASCII table in it. Multi-byte characters in encodings like
UTF-8 are incorrectly treated as multiple separate characters.
That is, clang is working like this:
input-charset: Bytes straight from the file, assumed to be ASCII,
with no awareness of multi-byte encodings.
exec-charset: Same as the input.
wide-exec-charset: Not a character set exactly. Each byte from the
input file is copied into its own wide character, with no awareness of
multi-byte encodings. It also seems to assume little-endian byte order.
GCC, by default, works like this:
input-charset: The locale-specified encoding, or UTF-8 if it cannot
detect the system encoding. Picks up byte order marks (sometimes?) as
it relies on iconv for the conversion.
exec-charset: UTF-8.
wide-exec-charset: UTF-16 or UTF-32 (depending on the size of wchar_t)
in native-endian byte order.
Obviously, you can override these values with the command line
options. It supports anything the native iconv library supports.
Given that clang seems to assume single-byte ASCII all over the
place, the best way to go would seem to be to use iconv (or the native
Windows API) to convert the input files into UTF-8 and use that
internally. Existing code will function for character values in the
basic ASCII set (that is, values smaller than 128), and multi-byte
characters should work correctly in comments. The one place that would
need a serious update is the parsing of string literals (i.e.
converting their contents from UTF-8 to the execution character sets).
Any comments, suggestions, etc.? Specifically, input from people who
have looked at supporting universal-character-names or Unicode string
literals would be very welcome.