Chris Lattner wrote:
Sounds great to me. A disclaimer: I don't know anything about this
stuff, Neil, so I'd very much appreciate validation that this approach
makes sense :). Here are some starting steps:
1) Document StringLiteral as being canonicalized to UTF-8. We'll
require Sema to translate the input string to UTF-8, and codegen and
other clients to convert it to the character set they want.
2) Add -finput-charset to clang. Is iconv generally available (e.g.
on Windows)? If not, we'll need some configury magic to detect it.
3) Teach Sema about UTF-8 input and iconv. Sema should handle the
default cases (e.g. UTF-8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available). A rough sketch
of the fallback follows the list.
4) Enhance the lexer, if required, to handle lexing strings properly.
5) Enhance codegen to translate into the execution char set.
6) Start working on character constants.

Does this seem reasonable, Paolo (and Neil)?
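To make step 3 concrete, the iconv fallback might look something like
the sketch below. It's minimal and purely illustrative: ConvertToUTF8
and its error handling are made up, and a real version would also need
to flush stateful encodings and report precise diagnostics.

  /* Convert a source buffer from the -finput-charset encoding to
     UTF-8.  Returns a malloc'd buffer, or NULL on failure. */
  #include <iconv.h>
  #include <stdlib.h>

  char *ConvertToUTF8(const char *charset, const char *in,
                      size_t inlen, size_t *outlen) {
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
      return NULL;                /* unknown charset: diagnose */

    size_t cap = inlen * 4 + 1;   /* generous worst-case expansion */
    char *buf = malloc(cap);
    if (!buf) {
      iconv_close(cd);
      return NULL;
    }

    char *inp = (char *)in, *outp = buf;
    size_t inleft = inlen, outleft = cap - 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
      free(buf);                  /* invalid byte sequence in source */
      iconv_close(cd);
      return NULL;
    }
    *outp = '\0';
    *outlen = (size_t)(outp - buf);
    iconv_close(cd);
    return buf;
  }

The fast path in Sema would skip this call entirely when the input
charset is already UTF-8 or plain ASCII, and codegen's step 5 is
essentially the same call with the arguments to iconv_open swapped.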
It should work, but I expect it will break caret diagnostics.
There's no real need for such flexibility, though - the standard
doesn't permit UTF-16, UTF-32, etc., and I've never heard of anyone
wanting to use them, so why not just require ASCII supersets, like
the standard does (for ASCII hosts)? Then your caret diagnostics
keep working too, and special-casing the extra characters is
straightforward, even for SJIS.
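For SJIS the whole problem is that the second byte of a two-byte
character may be 0x5C, which a naive lexer would treat as a backslash.
A hypothetical sketch of the special case (IsSJISLead and the lexer
shape here are made up for illustration, not actual clang code):

  #include <stdbool.h>

  /* SJIS lead bytes per CP932: 0x81-0x9F and 0xE0-0xFC. */
  static bool IsSJISLead(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
  }

  /* Advance past one "character" in a string literal, assuming
     InputCharsetIsSJIS was set from -finput-charset. */
  const char *AdvanceChar(const char *p, bool InputCharsetIsSJIS) {
    if (InputCharsetIsSJIS && IsSJISLead((unsigned char)*p) && p[1])
      return p + 2;  /* trail byte is data, even if it is 0x5C */
    if (*p == '\\')
      return p + 2;  /* a real escape sequence (simplified) */
    return p + 1;
  }

Everything else in an ASCII superset just passes through untouched.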
The standard also requires input to be in the current locale; is
there any need to be more relaxed? Realistically, all the source
has to be in the same charset, and that charset must be one in
which the system headers can be read. You then just get to use
mbtowc in a few places.
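Concretely, something like this (illustrative only):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
    setlocale(LC_CTYPE, "");    /* the user's locale, per the standard */

    const char *p = "x";        /* stands in for bytes from a source file */
    wchar_t wc;
    int len = mbtowc(&wc, p, MB_CUR_MAX);
    if (len < 0)
      fprintf(stderr, "invalid multibyte sequence\n");
    else
      printf("decoded U+%04lX from %d byte(s)\n", (unsigned long)wc, len);
    return 0;
  }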
Neil.