Here are some starting steps:
1) Document StringLiteral as being canonicalized to UTF-8. We'll
require Sema to translate the input string to UTF-8, and codegen and
other clients to convert it to the character set they want.
Sema itself needs to do the translation to the execution charset, or
at least have some knowledge of it; sizeof("こんにちは") depends on the
execution charset.
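To make that concrete: under a UTF-8 execution charset each of the
five characters encodes to three bytes, while under EUC-JP each
encodes to two, so the same literal has a different size. A rough
illustration (the exact numbers depend on the encodings involved):

  // Five hiragana plus the NUL terminator.
  // UTF-8 execution charset:  5 * 3 + 1 == 16
  // EUC-JP execution charset: 5 * 2 + 1 == 11
  static_assert(sizeof("こんにちは") == 16,
                "assumes a UTF-8 execution charset");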
2) Add -finput-charset to clang. Is iconv generally available (e.g.
on Windows)? If not, we'll need some configury magic to detect it.
It's not particularly difficult to get, but Windows users are unlikely
to have it installed.
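For reference, the conversion itself is only a handful of iconv
calls. A minimal sketch, assuming POSIX iconv is available
(convertToUTF8 is a made-up helper, not an existing clang API, and
real code would loop to handle partial conversions):

  #include <iconv.h>
  #include <stdexcept>
  #include <string>

  std::string convertToUTF8(const std::string &Input,
                            const char *FromCharset) {
    iconv_t CD = iconv_open("UTF-8", FromCharset);
    if (CD == (iconv_t)-1)
      throw std::runtime_error("unsupported input charset");
    std::string Out(Input.size() * 4, '\0'); // worst-case growth
    char *In = const_cast<char *>(Input.data());
    size_t InLeft = Input.size();
    char *OutPtr = &Out[0];
    size_t OutLeft = Out.size();
    if (iconv(CD, &In, &InLeft, &OutPtr, &OutLeft) == (size_t)-1) {
      iconv_close(CD);
      throw std::runtime_error("invalid byte sequence in input");
    }
    iconv_close(CD);
    Out.resize(Out.size() - OutLeft);
    return Out;
  }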
3) Teach Sema about UTF-8 input and iconv. Sema should handle the
common cases (e.g. UTF-8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).
"Good" charsets, if I'm understanding correctly, are those character
sets that are supersets of ASCII and in which an ASCII byte never
appears as part of a multi-byte sequence representing something else.
This includes charsets like UTF-8, ISO-8859-*, and EUC-JP. "Bad"
charsets include UTF-16 (not an ASCII superset) and Shift JIS (which
breaks the multi-byte sequence rule).
Assuming Sema never sees a string in a "bad" charset, conversion can
be skipped if and only if either the input and execution character
sets are the same, or the string contains only ASCII characters.
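Putting those rules into code, the fast-path check might look
something like the following (isGoodCharset, isAllASCII, and
canSkipConversion are made-up names, and the charset list is
illustrative, not exhaustive):

  #include <cstring>
  #include <string>

  // "Good": a superset of ASCII where no ASCII byte ever appears
  // inside a multi-byte sequence.
  static bool isGoodCharset(const char *Name) {
    static const char *const Good[] = {"UTF-8", "ISO-8859-1",
                                       "EUC-JP"};
    for (const char *G : Good)
      if (std::strcmp(Name, G) == 0)
        return true;
    return false; // e.g. UTF-16, Shift JIS: always convert upfront
  }

  static bool isAllASCII(const std::string &S) {
    for (unsigned char C : S)
      if (C > 0x7F)
        return false;
    return true;
  }

  // Conversion is skippable iff the input charset is "good" and
  // either it matches the execution charset or the string is pure
  // ASCII.
  static bool canSkipConversion(const std::string &S,
                                const char *InputCS,
                                const char *ExecCS) {
    if (!isGoodCharset(InputCS))
      return false;
    return std::strcmp(InputCS, ExecCS) == 0 || isAllASCII(S);
  }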
4) Enhance the lexer, if required, to lex strings properly. The
current lexer likely breaks in plenty of other ways for "bad"
charsets... we probably just want to convert input in any of those
charsets upfront, before lexing. The lexer shouldn't need any changes
for "good" charsets, if my understanding of the standard is correct.
-Eli