Wide strings and clang::StringLiteral.

I need to convert string literals to other encodings. I was planning to
use iconv.h's functions, but I need to know the encoding of the input strings.

So the question is: what encoding do the strings returned by
clang::StringLiteral::getStrData() have, especially the wide ones?

Thanks
pb

Hi Paolo,

I really have no idea. We're just reading in the raw bytes from the source file, so I guess it depends on whatever the source encoding is. In practice, this sounds like a really bad idea :).

Clang doesn't have any notion of an input character set at present, and doesn't handle unicode escapes. How do other compilers handle input character sets? Are there command line options to specify it? Should the AST hold the string in a canonical form like UTF8?

-Chris

GCC supports all iconv encodings via the -finput-charset= argument.
It also has -fexec-charset= and -fwide-exec-charset= to specify the encoding of generated constant strings.

"input" defaults to the locale encoding if defined, else falls back to UTF-8.
"exec" defaults to UTF-8.
"wide-exec" defaults to UTF-16 or UTF-32 based on the wchar_t size.

Clang may have to do some string manipulation while compiling (for example, to convert a non-ASCII constant CF/NSString into UTF-16). It will probably be easier to handle if the AST strings use a predefined encoding (UTF-8). It will also be simpler for clients that want to manipulate strings.
Not to mention UTF-16/UTF-32 source files (GCC supports them). It would be very difficult (if not impossible) to keep them in UTF-16 internally, as most functions expect C strings.

IMHO, if someone considers adding charset handling, they might consider writing a converter class based on iconv, for example, rather than calling the iconv functions directly. It would be easier to switch the underlying library if needed (and use ICU, for example).
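For illustration, a minimal sketch of such a converter class, assuming plain iconv as the backend (the class and method names here are invented, not an existing clang API):

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around iconv, as suggested above: callers see only
// this interface, so the underlying library could later be swapped for ICU.
class CharsetConverter {
  iconv_t Handle;
public:
  CharsetConverter(const char *To, const char *From) {
    Handle = iconv_open(To, From);
    if (Handle == (iconv_t)-1)
      throw std::runtime_error("unsupported conversion");
  }
  ~CharsetConverter() { iconv_close(Handle); }

  std::string convert(const std::string &In) {
    // Worst-case growth: iconv targets rarely expand beyond 4x + terminator.
    std::string Out(In.size() * 4 + 4, '\0');
    char *InPtr = const_cast<char *>(In.data());
    size_t InLeft = In.size();
    char *OutPtr = &Out[0];
    size_t OutLeft = Out.size();
    if (iconv(Handle, &InPtr, &InLeft, &OutPtr, &OutLeft) == (size_t)-1)
      throw std::runtime_error("conversion failed");
    Out.resize(Out.size() - OutLeft); // keep only the bytes written
    return Out;
  }
};
```

For example, converting ISO-8859-1 "\xE9" (é) to UTF-8 yields the two bytes 0xC3 0xA9.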

Chris Lattner wrote:-

>
> I need to convert string literals to other encodings. I was
> planning to use iconv.h's functions, but I need to know the
> encoding of the input strings.
>
> So the question is: what encoding do the strings returned by
> clang::StringLiteral::getStrData() have, especially the wide ones?

Hi Paolo,

I really have no idea. We're just reading in the raw bytes from the
source file, so I guess it depends on whatever the source encoding
is. In practice, this sounds like a really bad idea :).

Clang doesn't have any notion of an input character set at present,
and doesn't handle unicode escapes. How do other compilers handle
input character sets? Are there command line options to specify it?
Should the AST hold the string in a canonical form like UTF8?

Clang should have an idea of the encoding of its input, otherwise
it cannot reason about the characters that appear in a string
literal. The standard imposes constraints on those characters,
and requires input source to be in the current locale. Of course
this latter bit could be overridden with a command line switch.

Realistically I don't think there is much alternative to an internal
representation in some form of Unicode, or at least reasoning about
the input in Unicode. This is essentially enforced by requiring
UCNs to be accepted.

As for execution charset, GCC's -fexec-charset seems a very reasonable
approach, with some kind of error character for characters not
representable in said charset.

Note that accepting UCNs in identifiers, as both C99 and C++ require,
mandates converting to some kind of canonical Unicode form for
identifiers internally, before hashing, too.

I've got some experience implementing all the above, so can give some
advice if necessary.

Neil.

I didn't know that C99 supports UCNs in identifiers.
I don't see a lot of information about it in the C99 spec (except that a UCN may appear in an identifier). Does this mean that this code is valid?

---------- test.c -------

int main (int argc, char **argv) {
int h\u00e9 = 0; // hé
return he\u0301; // hé - using decomposed form
}

Jean-Daniel Dupas wrote:-

I didn't know that C99 supports UCNs in identifiers.
I don't see a lot of information about it in the C99 spec (except that
a UCN may appear in an identifier). Does this mean that this code is
valid?

---------- test.c -------

int main (int argc, char **argv) {
  int h\u00e9 = 0; // hé
  return he\u0301; // hé - using decomposed form
}
--------------------------

Actually, GCC does not support combining characters (like COMBINING ACUTE
ACCENT: 0x0301):

test.c:4:9: error: universal character \u0301 is not valid in an
identifier
test.c: In function ‘main’:
test.c:4: error: ‘hé’ undeclared (first use in this function)
test.c:4: error: (Each undeclared identifier is reported only once
test.c:4: error: for each function it appears in.)

Note that the error is correctly displayed anyway.

My front end gives

$ ~/src/c/cfe /tmp/test.c
"/tmp/test.c", line 3: error: universal character name "\u0301" cannot
       be used in an identifier
        return he\u0301; // hé - using decomposed form
               --^^^^^^

1 error found compiling "/tmp/bug.c".

So you've chosen an invalid UCN. The standard lists the acceptable
UCNs; apparently this isn't one.

Just as "one" and "One" might colloquially be considered the same
identifier yet are distinct, the standard as I read it couldn't care
less about combining characters, case, etc.; identifier identity is purely
a function of Unicode code point spelling. So \u00aa and \u00Aa and
\U000000aA are identical simply because they represent the same Unicode
code point.

Neil.

The standard defines some constraints on UCNs, but I don't see why 0x0301 does not work.

From WG14/N1124 CommitteeDraft — May 6, 2005: page 53:

“A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`),
nor one in the range D800 through DFFF inclusive.”
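For illustration, the quoted constraint can be written as a small check (the helper name is hypothetical). Note that this is only the general UCN constraint from 6.4.3; identifiers are further restricted to the ranges listed in Annex D, which is why \u0301 is still rejected even though it passes this check:

```cpp
#include <cstdint>

// Hypothetical sketch of the C99 6.4.3p2 constraint quoted above.
bool isValidUCN(uint32_t CodePoint) {
  // Below U+00A0, only $ (0024), @ (0040), and ` (0060) are allowed.
  if (CodePoint < 0xA0)
    return CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60;
  // The surrogate range D800-DFFF is excluded outright.
  if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
    return false;
  return true;
}
```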

Jean-Daniel Dupas wrote:-

My front end gives

$ ~/src/c/cfe /tmp/test.c
"/tmp/test.c", line 3: error: universal character name "\u0301" cannot
      be used in an identifier
       return he\u0301; // hé - using decomposed form
              --^^^^^^

1 error found compiling "/tmp/bug.c".

So you've chosen an invalid UCN. The standard lists the acceptable
UCNs; apparently this isn't one.

The standard defines some constraints on UCNs, but I don't see why 0x0301
does not work.

From WG14/N1124 CommitteeDraft — May 6, 2005: page 53:

“A universal character name shall not specify a character whose short
identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`),
nor one in the range D800 through DFFF inclusive.”

Annex D.

Neil.

dear cfe-devs,

I think we are worrying about less important details. Universal character
names in identifiers are, of course, important. But I think it is much more
urgent to find a way to manage wide strings correctly.

Personally I have never seen identifiers with extended characters, but I can
easily imagine L"non ascii string" in non-English programs.

So what about focusing on a normalized way to store wide strings, and
thinking about extended characters in identifiers later?

pb

Paolo Bolzoni wrote:

dear cfe-devs,

I think we are worrying about less important details. Universal character
names in identifiers are, of course, important. But I think it is much more
urgent to find a way to manage wide strings correctly.

Universal character identifiers are also easy -- and they provide an alternate way to represent Unicode characters in wide strings.

Personally I have never seen identifiers with extended characters, but I can
easily imagine L"non ascii string" in non-English programs.

So what about focusing on a normalized way to store wide strings, and
thinking about extended characters in identifiers later?

As I see it (speaking naively, I have no real experience here), the "normalized" way to store wide strings would be to abstract the encoding into a class that chooses "on the fly" between UTF-8, UTF-16, UCS-2(sp?), and UTF-32. [Note that the choice of wchar_t depends on what character set one is supporting: it should be either 16+ bits (for UCS-2) or 32+ bits (for UTF-32). This class should be switchable between conversion libraries at compile time. As UTF-8 and UTF-16 are multibyte encodings, they are strictly disallowed for wchar_t (and wide strings), but they would be allowed for Unicode strings if those make it into the next standard.]
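One concrete reason UTF-16 counts as a multibyte encoding here: code points above U+FFFF need a surrogate pair, i.e. two 16-bit units, so a 16-bit wchar_t cannot hold every character in one unit. A minimal sketch of the pairing rule (hypothetical helper name):

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units. Points above U+FFFF
// are split into a high (D800-DBFF) and low (DC00-DFFF) surrogate.
std::vector<uint16_t> encodeUTF16(uint32_t CP) {
  if (CP <= 0xFFFF)
    return {static_cast<uint16_t>(CP)};
  CP -= 0x10000; // 20 bits remain, split 10/10 across the pair
  return {static_cast<uint16_t>(0xD800 | (CP >> 10)),
          static_cast<uint16_t>(0xDC00 | (CP & 0x3FF))};
}
```

For example, U+1F600 becomes the two units 0xD83D 0xDE00, while U+00E9 fits in one.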

dear cfe-devs,

I think we are worrying about less important details. Universal character
names in identifiers are, of course, important. But I think it is much more
urgent to find a way to manage wide strings correctly.

Yes, I agree that this is the right starting point.

So what about focusing on a normalized way to store wide strings, and
thinking about extended characters in identifiers later?

Sounds great to me. A disclaimer: I don't know anything about this stuff; Neil, I'd very much appreciate validation that this approach makes sense :).

Here are some starting steps:

1) Document StringLiteral as being canonicalized to UTF8. We'll require sema to translate the input string to utf8, and codegen and other clients to convert it to the character set they want.
2) Add -finput-charset to clang. Is iconv generally available (e.g. on windows?) if not, we'll need some configury magic to detect it.
3) Teach sema about UTF8 input and iconv. Sema should handle the default cases (e.g. UTF8 and character sets where no "bad" things occur) as quickly as possible, while falling back to iconv for hard cases (or emitting an error if iconv isn't available).
4) Enhance the lexer, if required, to handle lexing strings properly.
5) Enhance codegen to translate into the execution char set.
6) Start working on character constants.

Does this seem reasonable Paolo (and Neil)?

-Chris

Here are some starting steps:

1) Document StringLiteral as being canonicalized to UTF8. We'll
require sema to translate the input string to utf8, and codegen and
other clients to convert it to the character set they want.

Sema itself needs to do the translation to the execution charset, or
at least have some knowledge of it; sizeof("こんにちは") depends on the
execution charset.
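A later-standard illustration of the same point, using the C++11 literal prefixes (which pin a literal's encoding regardless of the input charset): the same five characters occupy different numbers of bytes under different encodings, so the literal's type and sizeof change with the encoding.

```cpp
#include <cstddef>

// Each of these five kana takes 3 bytes in UTF-8 but one 16-bit unit in
// UTF-16, so the array sizes (including the terminator) differ.
constexpr std::size_t UTF8Bytes  = sizeof(u8"こんにちは"); // 5*3 + 1 = 16
constexpr std::size_t UTF16Bytes = sizeof(u"こんにちは");  // 6 units * 2 = 12
```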

2) Add -finput-charset to clang. Is iconv generally available (e.g.
on windows?) if not, we'll need some configury magic to detect it.

It's not particularly difficult to get, but Windows users are unlikely
to have it installed.

3) Teach sema about UTF8 input and iconv. Sema should handle the
default cases (e.g. UTF8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).

"Good" charsets, if I'm understanding correctly, are those character
sets which are a superset of ASCII and where ASCII bytes are never
part of a multi-byte sequence representing something else. This
includes charsets like UTF-8, ISO-8859-*, and EUC-JP. "Bad" charsets
include UTF-16 (not an ASCII superset) and Shift JIS (breaks the
multi-byte sequence rule).
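A concrete instance of the Shift JIS rule-breaking mentioned above: the second byte of a two-byte character can itself be an ASCII byte. For example, '表' (U+8868) encodes as 0x95 0x5C in Shift JIS, and 0x5C is ASCII '\', so a byte-oriented lexer would see a stray backslash in the middle of the character:

```cpp
// Shift JIS encoding of '表' (U+8868): the trail byte collides with '\'.
static const unsigned char HyouSJIS[] = {0x95, 0x5C};
```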

Assuming Sema never sees a string in a "bad" charset, conversion can
be skipped if and only if either the input and execution character
sets are the same, or the string contains only ASCII characters.
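That skip rule could be sketched as follows (the function names are invented for illustration; this is not existing clang code):

```cpp
#include <string>

// A string of pure ASCII bytes reads the same in any ASCII-superset charset.
bool isASCIIOnly(const std::string &S) {
  for (unsigned char C : S)
    if (C > 0x7F)
      return false;
  return true;
}

// Assuming a "good" (ASCII-superset) input charset, conversion is a no-op
// when the charsets match or the spelling is pure ASCII.
bool canSkipConversion(const std::string &Spelling,
                       const std::string &InputCharset,
                       const std::string &ExecCharset) {
  return InputCharset == ExecCharset || isASCIIOnly(Spelling);
}
```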

4) Enhance the lexer, if required, to handle lexing strings properly.

The current lexer likely breaks in a lot of other cases for "bad"
charsets... we probably just want to convert input in any of these
charsets upfront, before lexing. The lexer shouldn't need any changes
for "good" charsets, if my understanding of the standard is correct.

-Eli

Here are some starting steps:

1) Document StringLiteral as being canonicalized to UTF8. We'll
require sema to translate the input string to utf8, and codegen and
other clients to convert it to the character set they want.

Sema itself needs to do the translation to the execution charset, or
at least have some knowledge of it; sizeof("こんにちは") depends on the
execution charset.

Right, I think that should happen through the sema of the string literal (which sets its type, which includes the length).

2) Add -finput-charset to clang. Is iconv generally available (e.g.
on windows?) if not, we'll need some configury magic to detect it.

It's not particularly difficult to get, but Windows users are unlikely
to have it installed.

Ok, it would be nice to not add a dependency so people can get started quickly. Are there any license issues?

3) Teach sema about UTF8 input and iconv. Sema should handle the
default cases (e.g. UTF8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).

"Good" charsets, if I'm understanding correctly, are those character
sets which are a superset of ASCII and where ASCII bytes are never
part of a multi-byte sequence representing something else. This
includes charsets like UTF-8, ISO-8859-*, and EUC-JP. "Bad" charsets
include UTF-16 (not an ASCII superset) and Shift JIS (breaks the
multi-byte sequence rule).

Assuming Sema never sees a string in a "bad" charset, conversion can
be skipped if and only if either the input and execution character
sets are the same, or the string contains only ASCII characters.

I'd be fine with specializing on the "ascii subset && string contains only characters in the range 0-0x7f" and having a slow path for everything else.

4) Enhance the lexer, if required, to handle lexing strings properly.

The current lexer likely breaks in a lot of other cases for "bad"
charsets... we probably just want to convert input in any of these
charsets upfront, before lexing. The lexer shouldn't need any changes
for "good" charsets, if my understanding of the standard is correct.

That makes a lot of sense to me!

-Chris

2) Add -finput-charset to clang. Is iconv generally available (e.g.
on windows?) if not, we'll need some configury magic to detect it.

It's not particularly difficult to get, but Windows users are unlikely
to have it installed.

Ok, it would be nice to not add a dependency so people can get started
quickly. Are there any license issues?

libiconv is LGPL; whether that license is an "issue" probably depends
on the company. There's also apparently a lightweight alternative
called win_iconv which I stumbled upon while I was looking around, but
I don't know much about it.

3) Teach sema about UTF8 input and iconv. Sema should handle the
default cases (e.g. UTF8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).

"Good" charsets, if I'm understanding correctly, are those character
sets which are a superset of ASCII and where ASCII bytes are never
part of a multi-byte sequence representing something else. This
includes charsets like UTF-8, ISO-8859-*, and EUC-JP. "Bad" charsets
include UTF-16 (not an ASCII superset) and Shift JIS (breaks the
multi-byte sequence rule).

Assuming Sema never sees a string in a "bad" charset, conversion can
be skipped if and only if either the input and execution character
sets are the same, or the string contains only ASCII characters.

I'd be fine with specializing on the "ascii subset && string contains only
characters in the range 0-0x7f" and having a slow path for everything else.

Ah, right, you want to store the strings in UTF-8. That seems fine; I
expect non-ASCII in strings is very rare.

Actually, something just occurred to me, though: if I recall
correctly, Shift JIS is the default charset on Japanese Windows
systems. Do we plan to key the default -finput-charset off of the
default system charset? I'm not sure what, if anything, we can/should
do in that situation.

-Eli

SJIS in particular is hard because it's state dependent. What we did
in gcc was either ascii or ebcdic depending on what we had as a host
system. But in general, yes, I agree with you here.

-eric

Frankly, I'd be fine with clang being really really slow on all Japanese Windows installs (if that makes things easier) to start with. If and when someone cares enough, we can figure out the best way to optimize it. :)

-Chris

Chris Lattner wrote:

Frankly, I'd be fine with clang being really really slow on all Japanese Windows installs (if that makes things easier) to start with. If and when someone cares enough, we can figure out the best way to optimize it. :)

From what I've seen, there is no standard encoding for C code with embedded Japanese comments (I've seen EUC, Shift-JIS, and UTF-8), so people are used to paying the performance penalty associated with conversion.

Patrick

2) Add -finput-charset to clang. Is iconv generally available (e.g.
on windows?) if not, we'll need some configury magic to detect it.

It's not particularly difficult to get, but Windows users are unlikely
to have it installed.

Ok, it would be nice to not add a dependency so people can get started
quickly. Are there any license issues?

libiconv is LGPL; whether that license is an "issue" probably depends
on the company. There's also apparently a lightweight alternative
called win_iconv which I stumbled upon while I was looking around, but
I don't know much about it.

You may also consider ICU, which is distributed under a nonrestrictive license:

  Projects – Unicode

It compiles fine on Windows (it's what WebKit and Safari use for Unicode handling), and it is bundled with most modern OSes (Linux, OS X).

1) Document StringLiteral as being canonicalized to UTF8. We'll
require sema to translate the input string to utf8, and codegen and
other clients to convert it to the character set they want.
2) Add -finput-charset to clang. Is iconv generally available (e.g.
on windows?) if not, we'll need some configury magic to detect it.
3) Teach sema about UTF8 input and iconv. Sema should handle the
default cases (e.g. UTF8 and character sets where no "bad" things
occur) as quickly as possible, while falling back to iconv for hard
cases (or emitting an error if iconv isn't available).
4) Enhance the lexer, if required, to handle lexing strings properly.
5) Enhance codegen to translate into the execution char set.
6) Start working on character constants.

And do not forget: enhance the rewriter to preserve the input encoding.

I'd be fine with specializing on the "ascii subset && string contains only
characters in the range 0-0x7f" and having a slow path for everything else.
Ah, right, you want to store the strings in UTF-8. That seems fine; I
expect non-ASCII in strings is very rare.

For French programs, and probably other non-English languages, non-ASCII in strings is *not* very rare. Every accented character is non-ASCII, and most French sentences will have at least one accented character.

I just wanted to point this out, even if it is not very important. (I am French, but I write my programs in English, and I suppose more and more people use internationalization software, so the problem may be small.)

regards,

Cédric


The question is, how many localized applications of significant size don't
manage their strings in some external resource that doesn't affect
compilation?

Sebastian