Musings about UCNs

I've been eyeing UCNs for a while, and so I've got a few musings to share; perhaps they will help whoever gets around to implementing them.

Disclaimer: I'm basing this off the C++ spec. If there are differences/incompatibilities for the C spec, I haven't noticed.

Thoughts:
  - We should probably use UTF-8 internally because it has a bunch of nice features, like not breaking any existing code within clang.
  - We could also accept UTF-8 as the default character encoding and process extended characters directly. The driver should handle other encodings by converting them to UTF-8.
  - Pursuant to that, does clang currently assume it's being compiled on an ASCII system?
  - To reduce performance hits, we should only scan a given identifier once to see if it contains any illegal characters. I'm thinking the Token should store whether it contains a universal-character-name, just as it stores whether or not it needs cleaning, and IdentifierTable::get() gets a default parameter added. If that parameter is set and the identifier is not already in the table, then a check is performed, ideally on a precompiled trie.
  - For literals, UCN processing will occur in the token lexer invoked by Sema later on, including conversion to the execution character set if necessary.
  - How extended characters should be stored in names is unclear. Ancient cxx-abi-dev discussions are undecided on whether simply using UTF-8 is correct. GCC code seems to suggest this is the intent in the long run.
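
As a rough illustration of the one-time check, here is a minimal sketch that swaps the precompiled trie for a sorted range table with binary search, which is simpler to show. The name isAllowedExtendedIdentifierChar and the table contents are assumptions for illustration; the ranges are a small sample, not the spec's full list of permitted characters.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Inclusive range of code points.
struct CodePointRange {
  uint32_t Lo;
  uint32_t Hi;
};

// Illustrative subset only, NOT the standard's full table.
// Must stay sorted by Lo and non-overlapping for the binary search to work.
static const CodePointRange AllowedRanges[] = {
    {0x00C0, 0x00D6}, // Latin-1 letters (illustrative)
    {0x00D8, 0x00F6},
    {0x00F8, 0x00FF},
    {0x0391, 0x03A1}, // some Greek capitals (illustrative)
};

// Hypothetical helper: true if CP falls inside one of the ranges above.
bool isAllowedExtendedIdentifierChar(uint32_t CP) {
  // Find the first range whose upper bound is not below CP.
  const CodePointRange *I = std::lower_bound(
      std::begin(AllowedRanges), std::end(AllowedRanges), CP,
      [](const CodePointRange &R, uint32_t Value) { return R.Hi < Value; });
  // CP is allowed only if that range also starts at or before CP.
  return I != std::end(AllowedRanges) && I->Lo <= CP;
}
```

The check would only need to run when the token is flagged as containing an extended character and the identifier is new to the table, so plain ASCII identifiers never pay for it.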

Sean

> I've been eyeing UCNs for a while, and so I've got a few musings to
> share; perhaps they will help whoever gets around to implementing them.
>
> Disclaimer: I'm basing this off the C++ spec. If there are
> differences/incompatibilities for the C spec, I haven't noticed.
>
> Thoughts:
> - We should probably use UTF-8 internally because it has a bunch of
> nice features, like not breaking any existing code within clang.

Yes.

> - We could also accept UTF-8 as the default character encoding and
> process extended characters directly. The driver should handle other
> encodings by converting them to UTF-8.

We should have SourceMgr do this; the driver doesn't know about all the headers, etc.

> - Pursuant to that, does clang currently assume it's being compiled on
> an ASCII system?

Yes, we don't care about non-ASCII systems. When we do, SourceMgr can translate them as well.

> - To reduce performance hits, we should only scan a given identifier
> once to see if it contains any illegal characters.

Yes, the lexer should just handle this in the identifier lexing code. The common case is "no UCN", so any UCN should cause a branch out of the fast path into the existing slow case of identifier lexing.
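
A minimal sketch of that fast-path/slow-path split, written as standalone C++ rather than against clang's actual Lexer. scanIdentifier, scanIdentifierSlow, and IdentifierScan are hypothetical names; the sketch only recognizes the shape of a UCN, does not validate the code point, and does not check that the first character is a legal identifier start.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// ASCII identifier body characters: [A-Za-z0-9_].
static bool isAsciiIdentifierBody(unsigned char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

struct IdentifierScan {
  size_t Length;     // number of bytes consumed from the start position
  bool HasExtended;  // true if a UCN or non-ASCII byte was consumed
};

// Hypothetical slow path: continues from Pos, additionally accepting
// \uXXXX, \UXXXXXXXX and raw non-ASCII (UTF-8) bytes.
static size_t scanIdentifierSlow(const std::string &Buf, size_t Pos) {
  while (Pos < Buf.size()) {
    unsigned char C = static_cast<unsigned char>(Buf[Pos]);
    if (isAsciiIdentifierBody(C) || C >= 0x80) {
      ++Pos;
      continue;
    }
    if (C == '\\' && Pos + 1 < Buf.size() &&
        (Buf[Pos + 1] == 'u' || Buf[Pos + 1] == 'U')) {
      size_t Digits = Buf[Pos + 1] == 'u' ? 4 : 8;
      if (Pos + 2 + Digits > Buf.size())
        break; // truncated UCN: not part of the identifier
      bool AllHex = true;
      for (size_t I = 0; I < Digits; ++I)
        if (!std::isxdigit(static_cast<unsigned char>(Buf[Pos + 2 + I])))
          AllHex = false;
      if (!AllHex)
        break;
      Pos += 2 + Digits;
      continue;
    }
    break;
  }
  return Pos;
}

IdentifierScan scanIdentifier(const std::string &Buf, size_t Start) {
  size_t Pos = Start;
  // Fast path: tight loop over plain ASCII, the common case.
  while (Pos < Buf.size() &&
         isAsciiIdentifierBody(static_cast<unsigned char>(Buf[Pos])))
    ++Pos;
  // Only a backslash or a high byte can continue the identifier from here.
  if (Pos < Buf.size() &&
      (Buf[Pos] == '\\' || static_cast<unsigned char>(Buf[Pos]) >= 0x80)) {
    size_t End = scanIdentifierSlow(Buf, Pos);
    return {End - Start, End != Pos};
  }
  return {Pos - Start, false};
}
```

An ASCII-only identifier never leaves the first loop, so the extra cost for existing code is one comparison at the end of each identifier.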

> I'm thinking the
> Token should store whether it contains a universal-character as it
> stores whether or not it needs cleaning, and IdentifierTable::get() gets
> a default parameter added; if it's set and the identifier is not already
> in the table, then a check is performed, ideally on a precompiled trie.

I don't think this is necessary. The IdentifierInfo* should contain the canonicalized UTF-8 encoding, and the spelling is whatever is in the code (after SourceMgr switches the character set).
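
To illustrate the idea, a minimal sketch that maps a spelling to a UTF-8 key, so that an identifier written with a UCN and the same name written with the raw character would land on one table entry. canonicalizeIdentifier, appendUTF8, and hexValue are hypothetical helpers, not clang APIs, and "canonicalized" is assumed here to mean only "UCNs replaced by their UTF-8 encoding"; no Unicode normalization is attempted.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of CP to Out.
// Assumes CP is a valid scalar value (<= 0x10FFFF, not a surrogate).
static void appendUTF8(uint32_t CP, std::string &Out) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Value of a hex digit, or -1 if C is not one.
static int hexValue(char C) {
  if (C >= '0' && C <= '9') return C - '0';
  if (C >= 'a' && C <= 'f') return C - 'a' + 10;
  if (C >= 'A' && C <= 'F') return C - 'A' + 10;
  return -1;
}

// Hypothetical helper: rewrite every well-formed \uXXXX or \UXXXXXXXX in
// Spelling to UTF-8 and copy all other bytes through unchanged.
std::string canonicalizeIdentifier(const std::string &Spelling) {
  std::string Out;
  size_t I = 0;
  while (I < Spelling.size()) {
    if (Spelling[I] == '\\' && I + 1 < Spelling.size() &&
        (Spelling[I + 1] == 'u' || Spelling[I + 1] == 'U')) {
      size_t Digits = Spelling[I + 1] == 'u' ? 4 : 8;
      bool Ok = I + 2 + Digits <= Spelling.size();
      uint32_t CP = 0;
      for (size_t D = 0; Ok && D < Digits; ++D) {
        int V = hexValue(Spelling[I + 2 + D]);
        if (V < 0)
          Ok = false;
        else
          CP = (CP << 4) | static_cast<uint32_t>(V);
      }
      // Reject out-of-range values and surrogates; they are copied verbatim.
      if (Ok && CP <= 0x10FFFF && !(CP >= 0xD800 && CP <= 0xDFFF)) {
        appendUTF8(CP, Out);
        I += 2 + Digits;
        continue;
      }
    }
    Out += Spelling[I++];
  }
  return Out;
}
```

With a key like this, the table lookup itself needs no extra parameter: the token keeps its original spelling, and only the key handed to the table is rewritten.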

> - For literals, UCN processing will occur in the token lexer invoked
> by Sema later on, including conversion to the execution character set if
> necessary.

Sure.

> - How extended characters should be stored in names is unclear.
> Ancient cxx-abi-dev discussions are undecided on whether simply using
> UTF-8 is correct. GCC code seems to suggest this is the intent in the
> long run.

Storing canonicalized UTF-8 in the identifiers is the only reasonable thing to do.

-Chris