UCNs/extended characters revisited

OK, while my first Unicode patch is stewing let's consider how to support UCNs.

Currently Clang effectively requires source files to be UTF-8 encoded. In fact, it mostly requires files without any extended characters at all, but translates UCNs in string literals to UTF-8, so I propose that UTF-8 is recognised as the formal internal representation.

Now when lexing, any time we hit an extended character we describe it as an unknown token. My first proposal is that we recognise there is no punctuation to be recognised from characters above 0x7F, so essentially we can add an arbitrarily long string of such characters to an identifier without worrying about breaking the parse. There are two issues at this point though:

i/ The extended characters must form a valid UTF-8 code point
ii/ Not all code points in the basic character plane are valid for use in identifiers. While C++ might not act on punctuation in different alphabets, it should still not allow it in identifiers.

(i) is easily and efficiently checked.
(ii) requires a large database of valid/invalid code points to look up against - and frankly this part ought to be written by a Unicode specialist who can validate all the corner cases. At the moment that database is around 2.5MB, although it contains more info than we need for simple validation of identifiers. We will still need a reasonably sized lookup though, ideally encoded into some kind of sparse bit-vector. I recommend doing this validation in parse or sema, and for performance simply allowing lex to pass tokens along without splitting on the invalid characters.
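
For (i), the well-formedness test is just the usual UTF-8 decode; roughly the following shape (purely a sketch - the name and free-standing form are made up, not the actual lexer hooks):

#include <stdint.h>

// Illustrative only: return the length (2-4) of a well-formed UTF-8
// sequence starting at Ptr, or 0 if the bytes do not form a valid
// extended (non-ASCII) code point.  End marks the end of the buffer.
static unsigned measureExtendedChar(const unsigned char *Ptr,
                                    const unsigned char *End) {
  if (Ptr == End || *Ptr < 0x80)
    return 0;                       // ASCII is handled by the normal lexer path.

  unsigned Len;
  uint32_t CodePoint;
  if ((*Ptr & 0xE0) == 0xC0)      { Len = 2; CodePoint = *Ptr & 0x1F; }
  else if ((*Ptr & 0xF0) == 0xE0) { Len = 3; CodePoint = *Ptr & 0x0F; }
  else if ((*Ptr & 0xF8) == 0xF0) { Len = 4; CodePoint = *Ptr & 0x07; }
  else
    return 0;                       // Stray continuation byte or invalid lead byte.

  if (End - Ptr < (long)Len)
    return 0;                       // Sequence runs off the end of the buffer.

  for (unsigned i = 1; i != Len; ++i) {
    if ((Ptr[i] & 0xC0) != 0x80)
      return 0;                     // Missing continuation byte.
    CodePoint = (CodePoint << 6) | (Ptr[i] & 0x3F);
  }

  // Reject overlong encodings, surrogates, and out-of-range values.
  static const uint32_t MinForLen[] = {0, 0, 0x80, 0x800, 0x10000};
  if (CodePoint < MinForLen[Len] ||
      (CodePoint >= 0xD800 && CodePoint <= 0xDFFF) ||
      CodePoint > 0x10FFFF)
    return 0;

  return Len;
}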

So my suggestion is to pass on (ii) for now, and accept a broader range of identifiers than strictly allowed. We might issue a warning (once per translation unit) the first time we see an extended character in an identifier, noting that extended character support currently permits characters that may become illegal in future versions.

If I can get permission for this approach, UCNs follow fairly easily, simply encoding into the same extended UTF-8 code points. At this point we would have most of the UCN support required for C99/C++98-03, with the exception that we are a little permissive in what we accept. Technically, that's an extension, as we should translate all valid portable programs ;¬)
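
For reference, "encoding into the same extended UTF-8 code points" is nothing more than the standard UTF-8 expansion of the UCN's value; roughly like this (illustrative only, not a proposed interface):

#include <stdint.h>

// Illustrative only: append the UTF-8 encoding of the code point named
// by a UCN to Out, returning the number of bytes written (0 if the value
// is not a valid code point).  A real implementation would also enforce
// the C99/C++ restrictions on which UCNs may appear at all.
static unsigned encodeUCN(uint32_t CodePoint, char *Out) {
  if (CodePoint < 0x80) {
    Out[0] = (char)CodePoint;
    return 1;
  }
  if (CodePoint < 0x800) {
    Out[0] = (char)(0xC0 | (CodePoint >> 6));
    Out[1] = (char)(0x80 | (CodePoint & 0x3F));
    return 2;
  }
  if (CodePoint < 0x10000) {
    if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
      return 0;                       // Surrogates are not valid code points.
    Out[0] = (char)(0xE0 | (CodePoint >> 12));
    Out[1] = (char)(0x80 | ((CodePoint >> 6) & 0x3F));
    Out[2] = (char)(0x80 | (CodePoint & 0x3F));
    return 3;
  }
  if (CodePoint <= 0x10FFFF) {
    Out[0] = (char)(0xF0 | (CodePoint >> 18));
    Out[1] = (char)(0x80 | ((CodePoint >> 12) & 0x3F));
    Out[2] = (char)(0x80 | ((CodePoint >> 6) & 0x3F));
    Out[3] = (char)(0x80 | (CodePoint & 0x3F));
    return 4;
  }
  return 0;                           // Beyond U+10FFFF.
}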

AlisdairM

OK, while my first Unicode patch is stewing let's consider how to support UCNs.

Currently Clang effectively requires source files to be UTF-8 encoded. In fact, it mostly requires files without any extended characters at all, but translates UCNs in string literals to UTF-8, so I propose that UTF-8 is recognised as the formal internal representation.

Right, that's what we've been intending to implement.

Now when lexing, any time we hit an extended character we describe it as an unknown token. My first proposal is that we recognise there is no punctuation to be recognised from characters above 0x7F, so essentially we can add an arbitrarily long string of such characters to an identifier without worrying about breaking the parse. There are two issues at this point though:

i/ The extended characters must form a valid UTF-8 code point
ii/ Not all code points in the basic character plane are valid for use in identifiers. While C++ might not act on punctuation in different alphabets, it should still not allow it in identifiers.

(i) is easily and efficiently checked.
(ii) requires a large database of valid/invalid code points to look up against - and frankly this part ought to be written by a Unicode specialist who can validate all the corner cases. At the moment that database is around 2.5MB, although it contains more info than we need for simple validation of identifiers. We will still need a reasonably sized lookup though, ideally encoded into some kind of sparse bit-vector. I recommend doing this validation in parse or sema, and for performance simply allowing lex to pass tokens along without splitting on the invalid characters.

So my suggestion is to pass on (ii) for now, and accept a broader range of identifiers than strictly allowed. We might issue a warning (once per translation unit) the first time we see an extended character in an identifier, noting that extended character support currently permits characters that may become illegal in future versions.

Sounds like a reasonable proposal, except that we can't delay the
validation until Parser/Sema (consider the case of a macro whose name
contains an extended character).

-Eli

i/ The extended characters must form a valid UTF-8 code point
ii/ Not all code points in the basic character plane are valid for use in identifiers. While C++ might not act on punctuation in different alphabets, it should still not allow it in identifiers.

(i) is easily and efficiently checked.
(ii) requires a large database of valid/invalid code points to look up against - and frankly this part ought to be written by a Unicode specialist who can validate all the corner cases. At the moment that database is around 2.5MB, although it contains more info than we need for simple validation of identifiers. We will still need a reasonably sized lookup though, ideally encoded into some kind of sparse bit-vector. I recommend doing this validation in parse or sema, and for performance simply allowing lex to pass tokens along without splitting on the invalid characters.

As for (ii), I assume you simply want to disallow characters having general category P* (i.e. Pd, Ps, etc.), no? The full list of such characters for Unicode 5.1 can be found in <http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt>. Do you want to embed data from a particular version of Unicode, or would you rather track whatever version of Unicode is supported by the host environment? If the former, for a simple binary test such as this you could always use a simple inversion list; if the latter, you could either call directly into a host API to classify each character or you could preprocess these data to avoid calling the API repeatedly. I'm happy to assist with any of this.
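
For anyone unfamiliar with the term, an inversion list is just a sorted array of code points at which membership in the set toggles, queried with a binary search; a sketch with made-up boundary values (not real data from DerivedGeneralCategory.txt):

#include <stdint.h>
#include <algorithm>

// Illustrative only: entries at even indices start a disallowed range,
// entries at odd indices end it (exclusive).  The values below are
// placeholders.
static const uint32_t DisallowedBoundaries[] = {
  0x00A1, 0x00A8,   // example range of punctuation-like code points
  0x2010, 0x2028,   // example range of dashes and separators
};

static bool isDisallowedInIdentifier(uint32_t CodePoint) {
  const uint32_t *Begin = DisallowedBoundaries;
  const uint32_t *End = Begin + sizeof(DisallowedBoundaries) / sizeof(uint32_t);
  const uint32_t *It = std::upper_bound(Begin, End, CodePoint);
  // An odd number of boundaries at or below CodePoint means we are inside
  // a disallowed range.
  return (It - Begin) % 2 != 0;
}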

Ned

Actually, the data are probably overly sparse for an inversion list. In any event, I'd be glad to help with whatever needs doing.

Ned

Not exactly; see Annex D of the C99 standard.

-Eli

The rule here for C++0x is covered by 2.11p1 [lex.name]

" Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Annex A of TR 10176:2003."

Note that this is one of the ISO standards that are freely available from their web site.

Eli has also pointed out the governing rules for C99, and I really hope they are similar enough not to be an issue!

As I said, my initial plan is to simply accept all code points in the basic character plane, other than those already covered in the basic ASCII range. I'm really hoping someone else (hint hint<G>) will provide that last level of validation, although I can add a validation hook that always returns 'true'.

The other issue once we allow UCNs is tracking of column numbers, which is mostly an issue for formatting our error messages. Internally, I recommend everything stays as now, tracking UTF-8 code *units*. We should convert column numbers to indices based on code *points* at the time we emit an error message, and leave the user's rendering system to cope with combining multiple code points into single glyphs, although that means our ^ and ~~~~~ may be a little out of sync in awkward cases. Fundamentally, I don't think there is any way to resolve that - those markers should ultimately be rendered by an IDE rather than our code anyway.
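
The conversion itself is cheap - something along these lines (purely illustrative), counting only the bytes that start a code point:

// Illustrative only: convert a 1-based column measured in UTF-8 code
// *units* (bytes) into a 1-based column measured in code *points*.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
static unsigned byteColumnToCodePointColumn(const char *LineStart,
                                            unsigned ByteColumn) {
  unsigned Column = 0;
  for (unsigned i = 0; i != ByteColumn; ++i) {
    unsigned char C = (unsigned char)LineStart[i];
    if ((C & 0xC0) != 0x80)   // Count only lead bytes, not continuations.
      ++Column;
  }
  return Column;
}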

AlisdairM

I guess that means I'm not sure what Alisdair meant, then: is there special additional processing needed for punctuation, or is the goal simply an efficient representation of the table in Annex D?

Ned

The rule here for C++0x is covered by 2.11p1 [lex.name]

" Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Annex A of TR 10176:2003."

Note that this is one of the ISO standards that are freely available from their web site.

Eli has also pointed out the governing rules for C99, and I really hope they are similar enough not to be an issue!

As I said, my initial plan is to simply accept all code points in the basic character plane, other than those already covered in the basic ASCII range. I'm really hoping someone else (hint hint<G>) will provide that last level of validation, although I can add a validation hook that always returns 'true'.

The other issue once we allow UCNs is tracking of column numbers, which is mostly an issue for formatting our error messages. Internally, I recommend everything stays as now, tracking UTF-8 code *units*. We should convert column numbers to indices based on code *points* at the time we emit an error message, and leave the user's rendering system to cope with combining multiple code points into single glyphs, although that means our ^ and ~~~~~ may be a little out of sync in awkward cases. Fundamentally, I don't think there is any way to resolve that - those markers should ultimately be rendered by an IDE rather than our code anyway.

AlisdairM

Yes, that is what I mean.
I interpret the excluded character ranges as essentially containing punctuation and similar glyphs from assorted alphabets, which clearly should not be part of an identifier. They are generally 'non-alphabetic' characters anyway, and I probably expressed myself badly by generalising.

AlisdairM

Well, it's true that there's only so much we can do to display a caret
on a terminal for a line of code containing, for example, Arabic text.
There are some simple things we can do, though: for example, we can
special-case Chinese characters to count as two terminal columns.

-Eli

OK, while my first Unicode patch is stewing let's consider how to support UCNs.

Currently Clang effectively requires source files to be UTF-8 encoded. In fact, it mostly requires files without any extended characters at all, but translates UCNs in string literals to UTF-8, so I propose that UTF-8 is recognised as the formal internal representation.

Yep, that's the plan. If we want to support other input formats (e.g. EBCDIC or UTF-16) we can always translate the memory buffer to UTF-8 before the lexer starts going at it. This is a separate project from handling UCNs, of course :).

Now when lexing, any time we hit an extended character we describe it as an unknown token. My first proposal is that we recognise there is no punctuation to be recognised from characters above 0x7F, so essentially we can add an arbitrarily long string of such characters to an identifier without worrying about breaking the parse.

Ok.

There are two issues at this point though:

i/ The extended characters must form a valid UTF-8 code point
ii/ Not all code points in the basic character plane are valid for use in identifiers. While C++ might not act on punctuation in different alphabets, it should still not allow it in identifiers.

(i) is easily and efficiently checked.
(ii) requires a large database of valid/invalid code points to look up against - and frankly this part ought to be written by a Unicode specialist who can validate all the corner cases. At the moment that database is around 2.5MB, although it contains more info than we need for simple validation of identifiers. We will still need a reasonably sized lookup though, ideally encoded into some kind of sparse bit-vector. I recommend doing this validation in parse or sema, and for performance simply allowing lex to pass tokens along without splitting on the invalid characters.

So my suggestion is to pass on (ii) for now, and accept a broader range of identifiers than strictly allowed. We might issue a warning (once per translation unit) the first time we see an extended character in an identifier, noting that extended character support currently permits characters that may become illegal in future versions.

Adding a 2.5M database to clang sounds like a non-starter. When we actually care enough about this, hopefully there will be a better way to go. Neil, do you know of a good way to do this sort of check?

If I can get permission for this approach, UCNs follow fairly easily, simply encoding into the same extended UTF-8 code points. At this point we would have most of the UCN support required for C99/C++98-03, with the exception that we are a little permissive in what we accept. Technically, that's an extension, as we should translate all valid portable programs ;¬)

Makes sense to me. There is another question of canonicalization though: at which stage should a UCN be translated into its corresponding UTF8 character? Should this be done when the lexer forms the IdentifierInfo?

-Chris

I'm working on this now. Worst (simplest) case would be two 32K tables, one for each of the tables in Annex A of TR 10176. Best case will be less, TBD by how much.

Speaking of which, does anyone have a copy of ISOIEC_TR_10176_2003_Table.txt as referenced by the aforementioned public specification? The table is not easily extracted from the PDF due to it being rotated.

Ned

There are some simple things we can do, though: for example, we can
special-case Chinese characters to count as two terminal columns.

This should only happen if you’re outputting to a terminal, and then only as specified in Unicode Standard Annex #11: East Asian Width. You’ll want to treat ambiguous characters as narrow (unless you’re willing to target non-UTF-8 output, but then you’re in for a lot of other trouble anyhow).
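
Something of roughly this shape, with only a few Wide ranges filled in as placeholders (a complete table would be generated from EastAsianWidth.txt):

#include <stdint.h>

// Illustrative only: number of terminal columns a code point occupies,
// treating UAX #11 Wide/Fullwidth characters as 2 and everything else
// (including Ambiguous) as 1.  The ranges below are placeholders, not a
// complete or exact copy of the real data.
struct WideRange { uint32_t Lo, Hi; };

static const WideRange WideRanges[] = {
  {0x1100, 0x115F},   // Hangul Jamo leading consonants
  {0x3000, 0x303E},   // CJK Symbols and Punctuation (partial)
  {0x3041, 0x33FF},   // Hiragana .. CJK Compatibility (partial)
  {0x4E00, 0x9FFF},   // CJK Unified Ideographs
  {0xAC00, 0xD7A3},   // Hangul Syllables
  {0xF900, 0xFAFF},   // CJK Compatibility Ideographs
  {0xFF00, 0xFF60},   // Fullwidth Forms
};

static unsigned terminalWidth(uint32_t CodePoint) {
  for (unsigned i = 0; i != sizeof(WideRanges) / sizeof(WideRanges[0]); ++i)
    if (CodePoint >= WideRanges[i].Lo && CodePoint <= WideRanges[i].Hi)
      return 2;
  return 1;
}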

-Ben

Er, two 8K tables, whoops!

Ned

Chris Lattner wrote:-

Adding a 2.5M database to clang sounds like a non-starter. When we
actually care enough about this, hopefully there will be a better way to
go. Neil, do you know of a good way to do this sort of check?

That would indeed be insane. I don't see anything wrong with GCC's
approach. You need three states, not two: invalid in identifiers,
invalid at the start of identifiers, and valid in identifiers. There
are different rules for C and C++. This is hardly performance-critical
code; a binary lookup should suffice.
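
Something like the following shape, presumably - the ranges here are placeholders rather than the real C99 Annex D / TR 10176 data:

#include <stdint.h>

// Illustrative only: three-state classification of extended characters in
// identifiers, looked up by binary search over sorted, non-overlapping
// ranges, in the style described above.
enum IdentCharKind {
  ICK_Invalid,        // never allowed in an identifier
  ICK_NotInitial,     // allowed, but not as the first character
  ICK_Valid           // allowed anywhere in an identifier
};

struct IdentCharRange {
  uint32_t Lo, Hi;
  IdentCharKind Kind;
};

// Placeholder data only; must be sorted by Lo.
static const IdentCharRange Ranges[] = {
  {0x00AA, 0x00AA, ICK_Valid},
  {0x0300, 0x036F, ICK_NotInitial},   // e.g. combining marks
  {0x2010, 0x2027, ICK_Invalid},      // e.g. dashes and punctuation
};

static IdentCharKind classifyIdentChar(uint32_t CodePoint) {
  unsigned Lo = 0, Hi = sizeof(Ranges) / sizeof(Ranges[0]);
  while (Lo < Hi) {
    unsigned Mid = Lo + (Hi - Lo) / 2;
    if (CodePoint < Ranges[Mid].Lo)
      Hi = Mid;
    else if (CodePoint > Ranges[Mid].Hi)
      Lo = Mid + 1;
    else
      return Ranges[Mid].Kind;
  }
  // Characters not covered by any range default to invalid (or trigger
  // the warning discussed earlier).
  return ICK_Invalid;
}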

Neil.