Hey folks,
Attached is a proof of concept for handling UTF-8 in identifiers. Aside
from the terrible isIdentifierBody function, which should be optimized
where possible (possibly into an 8KB lookup table for the BMP, keeping
the simple bitwise check for the other planes), I think the approach is
the correct one. Since this is sensitive code, however, I'd like to make
sure no one has any issues with the approach before I convert more of
the lexer code over.
Sean
utf8-proof-of-concept.patch (3.49 KB)
This patch still applies reasonably cleanly; any feedback?
Sean
utf8-proof-of-concept.patch (3.49 KB)
+/// identifier, which is [a-zA-Z0-9_], or a Unicode character defined by
+/// Annex E of the C++ standard (note: pretty sure this is different in C).
+static inline bool isIdentifierBody(UTF32 c) {
+  if (c <= 127)
+    return (CharInfo[c] & (CHAR_LETTER|CHAR_NUMBER|CHAR_UNDER)) ? true : false;
+  else if (c & 0xffff0000)
+    return ~c & 0xfffd;
This should be split into two functions: isIdentifierBody(), which handles just the c <= 127 case, and the rest in an __attribute__((noinline)) function that handles the slow path.
We don't want the slow path to prevent inlining of the fast path. If we have a macro for __builtin_expect, it would be worth using it on the "c <= 127" branch.
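A minimal sketch of the split being described, under stated assumptions: the CharInfo contents here are a stand-in built at startup, and isUnicodeIdentifierBody is a placeholder that only rejects the 0xFFFE/0xFFFF noncharacters rather than doing the real Annex E range check.

```cpp
#include <cassert>
#include <cstdint>

enum { CHAR_LETTER = 0x1, CHAR_NUMBER = 0x2, CHAR_UNDER = 0x4 };

// Tiny stand-in for the patch's CharInfo table, built once at startup.
static unsigned char CharInfo[128];
static bool InitCharInfo() {
  for (int c = 'a'; c <= 'z'; ++c) CharInfo[c] = CHAR_LETTER;
  for (int c = 'A'; c <= 'Z'; ++c) CharInfo[c] = CHAR_LETTER;
  for (int c = '0'; c <= '9'; ++c) CharInfo[c] = CHAR_NUMBER;
  CharInfo['_'] = CHAR_UNDER;
  return true;
}
static const bool CharInfoReady = InitCharInfo();

// Slow path: kept out of line so it can't prevent inlining of the fast
// path. Placeholder logic only; the real version would check the
// Annex E identifier ranges.
__attribute__((noinline))
static bool isUnicodeIdentifierBody(uint32_t c) {
  return (c & 0xFFFE) != 0xFFFE;
}

static inline bool isIdentifierBody(uint32_t c) {
  if (__builtin_expect(c <= 127, 1))  // almost all identifier chars are ASCII
    return CharInfo[c] & (CHAR_LETTER | CHAR_NUMBER | CHAR_UNDER);
  return isUnicodeIdentifierBody(c);  // rare non-ASCII case, out of line
}
```

The fast path compiles to a single table load plus a mask, and the branch hint keeps the common ASCII case on the fall-through path.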
I would also prefer the loop in Lexer::LexIdentifier to be written something like this, which might make the idea above irrelevant:
do {
C = *CurPtr;
if (C > 127) goto UTFIdentifier;
++CurPtr;
} while (isNonUTFIdentifierBody(C));
…
UTFIdentifier:
… handle the general case here.
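The loop sketch above, fleshed out into a self-contained form. The name isNonUTFIdentifierBody comes from the sketch; its body, the function wrapper, and the return-value convention are assumed glue, and the general UTF-8 case is left as a stub.

```cpp
#include <cassert>

// ASCII-only identifier-body check, as in the sketch above.
static bool isNonUTFIdentifierBody(unsigned char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

// Returns a pointer to the first byte past the identifier starting at
// CurPtr. Any byte > 127 is a UTF-8 lead/continuation byte and falls
// through to the slow path.
static const char *lexIdentifierEnd(const char *CurPtr) {
  unsigned char C;
  do {
    C = *CurPtr;
    if (C > 127) goto UTFIdentifier;
    ++CurPtr;
  } while (isNonUTFIdentifierBody(C));
  return CurPtr - 1;  // loop advanced one past the terminating byte

UTFIdentifier:
  // The general case would decode the UTF-8 sequence and consult the
  // Unicode identifier tables; elided here.
  return CurPtr;
}
```

The hot loop touches only ASCII bytes, so the UTF-8 machinery never needs to be inlined into it.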
-Chris
If speed is /really/ important, I'd just go for an 8KB lookup table: it's a one-time memory cost, small next to the compiler's usual footprint, and it has the advantage that UTF-8-heavy code, if it ever shows up, will run quickly as long as the table stays in cache. But I can do as suggested as well.
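For concreteness, a sketch of the 8KB table idea: one bit per BMP code point, 65536 bits = 8192 bytes. The ranges populated below are placeholders; the real table would encode the Annex E identifier ranges, and code points above U+FFFF would still use the separate bitwise check.

```cpp
#include <cassert>
#include <cstdint>

// One bit per code point U+0000..U+FFFF: 0x10000 / 8 = 8192 bytes.
static uint8_t IdentifierBodyBMP[0x10000 / 8];

static void setBit(uint32_t c) {
  IdentifierBodyBMP[c >> 3] |= 1u << (c & 7);
}

// Placeholder population: ASCII identifier chars only. The real table
// would also set the Unicode ranges from Annex E.
static void initTable() {
  for (uint32_t c = 'a'; c <= 'z'; ++c) setBit(c);
  for (uint32_t c = 'A'; c <= 'Z'; ++c) setBit(c);
  for (uint32_t c = '0'; c <= '9'; ++c) setBit(c);
  setBit('_');
}

static inline bool isIdentifierBodyBMP(uint32_t c) {
  // Callers handle c > 0xFFFF separately with the bitwise planes check.
  return IdentifierBodyBMP[c >> 3] & (1u << (c & 7));
}
```

The lookup is one shift, one load, and one mask regardless of which script the identifier uses, which is the cache-friendliness argument above.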
Sean