Wide strings and clang::StringLiteral.

Chris Lattner wrote:-

Sounds great to me. A disclaimer: I don't know anything about this
stuff, Neil, so I'd very much appreciate validation that this approach
makes sense :).

Here are some starting steps:

1) Document StringLiteral as being canonicalized to UTF8. We'll
require sema to translate the input string to utf8, and codegen and
other clients to convert it to the character set they want.
2) Add -finput-charset to clang. Is iconv generally available (e.g.
on Windows)? If not, we'll need some configury magic to detect it.
3) Teach sema about UTF8 input and iconv. Sema should handle the
default cases (e.g. UTF8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).
4) Enhance the lexer, if required, to handle lexing strings properly.
5) Enhance codegen to translate into the execution char set.
6) Start working on character constants.
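
Something along these lines for the step 3 fallback, perhaps; iconv_open,
iconv and iconv_close are the POSIX calls, everything else here (names,
buffer handling) is just an illustrative sketch:

  #include <iconv.h>
  #include <string.h>

  /* Fast path: UTF-8 (or plain ASCII) input is already in the canonical
     form StringLiteral wants, so no conversion is needed at all.  */
  static int NeedsConversion (const char *input_charset)
  {
    return strcmp (input_charset, "UTF-8") != 0
           && strcmp (input_charset, "US-ASCII") != 0;
  }

  /* Slow path: hand the literal's bytes to iconv.  Returns the number of
     bytes written to outbuf, or (size_t) -1 on a bad sequence or a
     missing converter, in which case the caller emits a diagnostic.  */
  static size_t ConvertToUTF8 (const char *input_charset,
                               const char *in, size_t in_len,
                               char *outbuf, size_t out_len)
  {
    iconv_t cd = iconv_open ("UTF-8", input_charset);
    char *inp = (char *) in, *outp = outbuf;
    size_t in_left = in_len, out_left = out_len, r;

    if (cd == (iconv_t) -1)
      return (size_t) -1;
    r = iconv (cd, &inp, &in_left, &outp, &out_left);
    iconv_close (cd);
    return r == (size_t) -1 ? (size_t) -1 : out_len - out_left;
  }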

Does this seem reasonable Paolo (and Neil)?

It should work, but will break caret diagnostics I expect.

There's no real need for such flexibility though - the standard
doesn't permit UTF-16, UTF-32 etc; and I've never heard of anyone
wanting to use them, so why not just require ASCII supersets like
the standard does (for ASCII hosts)? Then your caret diagnostics
keep working too, and special-casing the extra characters is
straightforward, even for SJIS.

The standard also requires input to be in the current locale; is
there any need to be more relaxed? Realistically all the source
has to be in the same charset, and that charset must include the
ability to read the system headers. You then just get to use
mbtowc in a few places.
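
A rough sketch of the mbtowc usage in question, assuming
setlocale (LC_CTYPE, "") has been called at startup; the wrapper name
is only illustrative:

  #include <locale.h>
  #include <stdlib.h>

  /* Decode one extended character from the source buffer using the
     current locale's multibyte encoding.  Returns the number of bytes
     consumed, 0 if it is the null character, or -1 on an invalid
     sequence.  Assumes setlocale (LC_CTYPE, "") was called at startup
     so mbtowc uses the user's locale rather than the "C" locale.  */
  static int decode_extended_char (const char *buf, size_t len, wchar_t *out)
  {
    return mbtowc (out, buf, len);
  }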

Neil.

You have to remember that it's not just a Japanese issue; quite a lot of languages commonly use non-ASCII characters: French, German and Spanish, to name a few. I suspect surprisingly many domain-specific code bases aren't localised to more than one language.

I second Jean-Daniel Dupas' recommendation of ICU; beyond translating between encodings, its extensive support for Unicode normalisation and canonicalisation might be useful. Imagine, for instance, a rewriter which enabled printf() and so on to gracefully degrade smart quotes depending on the runtime encoding :slight_smile:

Sure, I'm not arguing for a lack of functionality, just saying that we shouldn't worry about optimizing for that case yet. The percentage of tokens that are string literals is also very low, and the percentage that would use high characters is even lower, even in French-speaking countries (I suspect).

-Chris

1) Document StringLiteral as being canonicalized to UTF8. We'll
require sema to translate the input string to utf8, and codegen and
other clients to convert it to the character set they want.
2) Add -finput-charset to clang. Is iconv generally available (e.g.
on Windows)? If not, we'll need some configury magic to detect it.
3) Teach sema about UTF8 input and iconv. Sema should handle the
default cases (e.g. UTF8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).
4) Enhance the lexer, if required, to handle lexing strings properly.
5) Enhance codegen to translate into the execution char set.
6) Start working on character constants.

Does this seem reasonable Paolo (and Neil)?

It should work, but will break caret diagnostics I expect.

There's no real need for such flexibility though - the standard
doesn't permit UTF-16, UTF-32 etc; and I've never heard of anyone
wanting to use them, so why not just require ASCII supersets like
the standard does (for ASCII hosts)? Then your caret diagnostics
keep working too, and special-casing the extra characters is
straightforward, even for SJIS.

I have no desire to go above and beyond the standard unless (e.g.) GCC supports some extension and there is a large body of code that depends on it.

Please remember that I know very little about this, so if I suggest something silly, it is probably out of ignorance rather than some devious plan :slight_smile:

The standard also requires input to be in the current locale; is
there any need to be more relaxed?

No.

Realistically all the source
has to be in the same charset, and that charset must include the
ability to read the system headers. You then just get to use
mbtowc in a few places.

Can you give some pseudocode of what you mean?

-Chris

Chris Lattner wrote:

The standard also requires input to be in the current locale; is
there any need to be more relaxed?

No.

Yes. There is no locale under Windows that uses UTF-8 as its encoding, and there never will be. (As Raymond Chen mentions from time to time, the narrow version of the WinAPI cannot, in general, handle UTF-8.) Since we also don't accept encodings that aren't a superset of ASCII, UTF-16 or UTF-32 can't be used either. This leaves Windows users without the option of using Unicode encodings.
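
(For what it's worth, the usual Windows answer is to convert UTF-8 to
UTF-16 explicitly and call the wide API, rather than relying on the
locale at all; a minimal sketch, error handling omitted:)

  #include <windows.h>

  /* The "ANSI" code page tied to the locale is never UTF-8, so UTF-8
     text has to be converted by hand and fed to the wide API.  */
  static int utf8_to_utf16 (const char *utf8, int utf8_len,
                            wchar_t *out, int out_len)
  {
    return MultiByteToWideChar (CP_UTF8, 0, utf8, utf8_len, out, out_len);
  }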

Sebastian

The issue with SJIS in particular is that sometimes ASCII bytes don't
actually represent ASCII. Although, looking at the character set more
carefully, it looks like that doesn't actually affect the lexer unless
we allow Japanese characters in identifiers... that's kind of nice.
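
(For reference, the Shift-JIS layout as I understand it, worth
double-checking: lead bytes are 0x81-0x9F and 0xE0-0xFC, and the trail
byte that follows one can fall anywhere in 0x40-0x7E or 0x80-0xFC, so a
byte in the ASCII range can really be the second half of a two-byte
character. The check itself is trivial:)

  /* Shift-JIS lead bytes; the byte after one of these is a trail byte
     in 0x40-0x7E or 0x80-0xFC, which overlaps the ASCII range.  */
  static int is_sjis_lead_byte (unsigned char c)
  {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
  }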

I don't see where the standard requires an ASCII superset; it
certainly requires a lot of characters from ASCII, but EBCDIC, for
example, appears to be a legal source character set. Oddly, though,
UTF-16 appears to be an illegal source character set... that seems
slightly strange to me, since nothing really depends on the source
character set.

I don't see why any of this affects caret diagnostics, though: column
counts should be the same no matter what encoding is used.

-Eli

UTF-16 sources are not unheard of in Windows-only code bases. cl handles them transparently, so it might be unwise to make any design decisions which preclude them, regardless of what the standard says on the matter. cl supports encoding sniffing to accomplish this, since system headers are in ASCII.
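
(As far as I know the sniffing amounts to a byte-order-mark check at the
top of the file; a rough sketch, with BOM-less files falling back to the
default ASCII-superset handling:)

  #include <string.h>

  enum source_encoding { SE_DEFAULT, SE_UTF8, SE_UTF16LE, SE_UTF16BE };

  /* Look for a byte-order mark at the start of the buffer.  */
  static enum source_encoding sniff_bom (const unsigned char *buf, size_t len)
  {
    if (len >= 3 && memcmp (buf, "\xEF\xBB\xBF", 3) == 0)
      return SE_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
      return SE_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
      return SE_UTF16BE;
    return SE_DEFAULT;
  }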

— Gordon

Eli Friedman wrote:-

I don't see why any of this affects caret diagnostics, though: column
counts should be the same no matter what encoding is used.

If you iconv a whole file, à la GCC, you've got no map from original
bytes to the bytes you're lexing. Caret diagnostics refer to
original source bytes; your lexer is reading the iconv-ed file.

Neil.

Chris Lattner wrote:-

The standard also requires input to be in the current locale; is
there any need to be more relaxed?

No.

Right, and if you want a different locale there's always a command
line switch and setlocale. Locale is at least an out-of-the-box
supported thing on a system with a C compiler.

Realistically all the source
has to be in the same charset, and that charset must include the
ability to read the system headers. You then just get to use
mbtowc in a few places.

Can you give some pseudocode of what you mean?

Below, where I write ASCII, I really mean basic source-charset;
the same logic works if you're on EBCDIC hosts or if your
shift-to-extended-charset character is ASCII 26 (?).

I'm just talking about e.g.

a) the main switch statement of the lexer; the default case
   (assuming you want to accept native charset identifiers,
   which is a nice touch and not hard to do) becomes a
   call to lex_identifier(), and only an "other" token if
   this call doesn't succeed.

b) comments: there's not much to do here but to lex in a
   multibyte aware fashion. The main loop of my mb-aware block
   comment lexer is, for example:

  while (!is_at_eof (lexer))
    {
      prevc = c;
      c = lexer_get_clean_mbchar (lexer, token);

      /* People like decorating comments with '*', so check for '/'
         instead for efficiency.  */
      if (c == '/' && prevc == '*')
        return;
    }

   which is one way of doing it. If you're worried about
   performance, you could either support multibyte charsets
   only as a compile-time option, so people who don't want it
   don't get it, or only fall through to the generic mbchar-aware
   slower code once you've read a non-ASCII character (see the
   sketch after this list). Most comments are pure ASCII so
   won't fall into the slow path.

c) identifiers: again the fast loop, and a slower loop if you
   want to support native identifiers and a non-ASCII character
   is encountered. When hashing you convert to UTF-8 or something;
   you need to do that if there are UCNs too. You can flag these
   non-clean identifiers just like you flag trigraph or escaped-newline
   non-clean identifiers.

d) numbers - if you look at the lexer grammar these are a superset of
   identifiers with +, - and . characters. They could be pasted
   with an identifier to create another identifier, for example.
   Reuse the identifier logic, or cut-n-paste.

e) literals (strings, character constants and header names);
   in my case they use lexer_get_clean_mbchar() function shown
   above, but you could do a fast-track and slow-track thing too.
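
A sketch of the fast/slow split mentioned in (b) and (c), reusing the
function names from the comment lexer above; lexer_peek_byte and
lexer_advance_byte are made-up byte-level helpers, so treat this purely
as an illustration:

  /* Block comments: scan plain bytes while everything is ASCII, and
     only drop into the generic mbchar-aware loop once a non-ASCII
     byte shows up.  */
  static void lex_block_comment (lexer_t *lexer, token_t *token)
  {
    int prevc = 0, c = 0;

    /* Fast path: pure-ASCII comments never leave this loop.  */
    while (!is_at_eof (lexer))
      {
        c = lexer_peek_byte (lexer);
        if (c & 0x80)
          break;                     /* non-ASCII: go multibyte-aware */
        lexer_advance_byte (lexer);
        if (c == '/' && prevc == '*')
          return;
        prevc = c;
      }

    /* Slow path: the multibyte-aware loop shown in (b).  */
    while (!is_at_eof (lexer))
      {
        prevc = c;
        c = lexer_get_clean_mbchar (lexer, token);
        if (c == '/' && prevc == '*')
          return;
      }

    /* Falling out of both loops means an unterminated comment; a real
       lexer would diagnose that here.  */
  }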

In my case I have mbchar support as a compile-time option; if it's
turned off, lexer_get_clean_mbchar is a macro that expands to
lexer_get_clean_char, for example.
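
Roughly this arrangement, with an illustrative option name and made-up
signatures:

  #ifdef MULTIBYTE_CHARS
  extern int lexer_get_clean_mbchar (lexer_t *lexer, token_t *token);
  #else
  /* With multibyte support compiled out, the mb-aware reader collapses
     to the plain single-byte one.  */
  #define lexer_get_clean_mbchar(lexer, token) lexer_get_clean_char (lexer, token)
  #endif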

I've found that a fully mb-char aware lexer is about 25-30% slower
than one with it compiled out. But I've not tried to optimize
comments and identifier lexing to have a fast and slow path. So
25%-30% is probably a worst case slowdown.

Neil.