[REVIEW] UTF-8 in identifiers proof of concept

Hey folks,

Attached is a proof of concept for the handling of UTF-8 in
identifiers. Aside from the terrible isIdentifierBody function, which
should be optimized where possible (possibly into a lookup table for
the BMP, since that would be 8kb, and using the simple bitwise
operation in there for other planes), I think the approach is the
correct one. Given that this is sensitive code, however, I would like
to ensure no one has any issues with this approach before I convert
more of the lexer code over.


utf8-proof-of-concept.patch (3.49 KB)

This patch still applies reasonably cleanly; any feedback?



+/// identifier, which is [a-zA-Z0-9_], or a Unicode character defined by
+/// Annex E of the C++ standard (note: pretty sure this is different in C).
+static inline bool isIdentifierBody(UTF32 c) {

+  if (c <= 127)
+    return (CharInfo[c] & (CHAR_LETTER|CHAR_NUMBER|CHAR_UNDER)) ? true : false;
+  else if (c & 0xffff0000)
+    return ~c & 0xfffd;

This should be split into two functions: isIdentifierBody(), which handles just the c <= 127 case, with the rest moved into an __attribute__((noinline)) function that handles the slow path.

We don’t want the slow path to prevent inlining of the fast path. If we have a macro for __builtin_expect, it would be worth using it on the “c <= 127” branch.
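A minimal sketch of that split, with illustrative names (LIKELY, isIdentifierBodySlow) that are not from the patch, and placeholder logic standing in for the full Annex E check:

```cpp
#include <cstdint>

// LIKELY is assumed to wrap __builtin_expect where the compiler has it.
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#define NOINLINE __attribute__((noinline))
#else
#define LIKELY(x) (x)
#define NOINLINE
#endif

// Slow path: kept out of line so it cannot bloat or block inlining of
// the fast path. Placeholder logic only; the real version would consult
// the Annex E table.
NOINLINE static bool isIdentifierBodySlow(uint32_t c) {
  return (c & 0xffff0000) == 0;  // illustrative: "BMP only"
}

// Fast path: tiny, inlinable, biased toward the ASCII case.
static inline bool isIdentifierBody(uint32_t c) {
  if (LIKELY(c <= 127))
    return c == '_' || (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  return isIdentifierBodySlow(c);
}
```

The point of the shape is that the compiler sees only a compare, a table-free ASCII test, and a call on the cold edge, so the hot lexing loop stays small.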

I would also prefer the loop in Lexer::LexIdentifier to be written something like this, which might make the idea above irrelevant:

do {
  C = *CurPtr++;
  if (C > 127) goto UTFIdentifier;
} while (isNonUTFIdentifierBody(C));

UTFIdentifier:
  … handle the general case here.
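Filled out as a self-contained sketch (helper and function names here are hypothetical, not from the patch), the loop keeps the ASCII case byte-at-a-time and only leaves it when a byte above 127 shows up:

```cpp
// ASCII-only identifier-body test for the hot loop.
static bool isNonUTFIdentifierBody(unsigned char c) {
  return c == '_' || (c >= '0' && c <= '9') ||
         (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Returns a pointer to the first byte past the identifier at CurPtr.
static const char *lexIdentifier(const char *CurPtr) {
  unsigned char C;
  do {
    C = (unsigned char)*CurPtr++;     // consume one byte
    if (C > 127) goto UTFIdentifier;  // non-ASCII: take the general path
  } while (isNonUTFIdentifierBody(C));
  return CurPtr - 1;  // points at the first non-identifier byte
UTFIdentifier:
  // ... decode the UTF-8 sequence and re-check against Annex E here ...
  return CurPtr - 1;  // placeholder: stop at the first non-ASCII byte
}
```

The nice property is that the common all-ASCII loop body contains no Unicode logic at all; the goto target carries everything else.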


If speed is /really/ important, I’d just go for the 8KB lookup table. It’s a one-time memory cost, small compared to the compiler’s total memory use, and it has the advantage that UTF-8-heavy code, if we ever see it, will run quickly as long as the table stays in cache. But I can do it as suggested as well.
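For concreteness, a hedged sketch of that 8KB option: one bit per BMP code point (65,536 bits = 8,192 bytes), indexed with a shift and a mask. The real table would be generated offline from Annex E; this demo fills in only the ASCII entries, and the names are mine, not the patch’s:

```cpp
#include <cstdint>

// One bit per code point in the BMP: 0x10000 / 8 = 8192 bytes.
static uint8_t BMPIdentifierBits[0x10000 / 8];

// Demo initialization: only [a-zA-Z0-9_]. A generated table would set
// every Annex E code point instead.
static void populateDemoTable() {
  auto set = [](uint32_t c) { BMPIdentifierBits[c >> 3] |= 1u << (c & 7); };
  for (uint32_t c = 'a'; c <= 'z'; ++c) set(c);
  for (uint32_t c = 'A'; c <= 'Z'; ++c) set(c);
  for (uint32_t c = '0'; c <= '9'; ++c) set(c);
  set('_');
}

// One load plus a mask for any BMP character; supplementary planes
// would still fall back to the simple bitwise check discussed above.
static inline bool isIdentifierBodyBMP(uint32_t c) {
  return c < 0x10000 && (BMPIdentifierBits[c >> 3] & (1u << (c & 7)));
}
```

Cache-wise, identifier-heavy input keeps hitting the same few cache lines of the table, which is where the "stays in cache" argument comes from.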