Unicode identifiers

Hi!

I'd like to use Unicode (UTF-8) identifiers, and for this I simply
patched the CharInfo table in Lexer.cpp to contain CHAR_LETTER
for characters 128 to 255. Does this simple solution differ from
what the standard requires, and if so, what would be
the correct solution for UCN (universal character name)
identifiers?

-Jochen

ENOPATCH :)

-Chris

Does this mean you don't want to support it in mainline?
Hey, even Visual Studio does it ;)
But first I'd like to know whether this is how it's done, or whether you have a link about this
topic.

-Jochen

Hi,

Yes, it is different. The standard has a list of characters allowed in identifiers in Appendix E. We would need to decode the UTF-8 to see whether each character is allowed, as well as ignore invalid UTF-8 sequences.

Sean
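
For illustration, the decode-then-check approach described above could look roughly like the following. This is only a sketch under stated assumptions: the names decodeUTF8, isAllowedExtendedChar and isValidIdentifier are hypothetical, the two ranges in SampleRanges are placeholders (the authoritative list has to be taken from the standard), and malformed UTF-8 is rejected here rather than skipped.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence starting at S[Pos]. On success, store the code
// point in CP, advance Pos past the sequence and return true. On malformed
// input, return false and leave Pos and CP untouched.
bool decodeUTF8(const std::string &S, std::size_t &Pos, std::uint32_t &CP) {
  if (Pos >= S.size()) return false;
  unsigned char B0 = static_cast<unsigned char>(S[Pos]);
  std::size_t Len;
  std::uint32_t Min, Value;
  if (B0 < 0x80)                { Len = 1; Min = 0;       Value = B0; }
  else if ((B0 & 0xE0) == 0xC0) { Len = 2; Min = 0x80;    Value = B0 & 0x1F; }
  else if ((B0 & 0xF0) == 0xE0) { Len = 3; Min = 0x800;   Value = B0 & 0x0F; }
  else if ((B0 & 0xF8) == 0xF0) { Len = 4; Min = 0x10000; Value = B0 & 0x07; }
  else return false; // stray continuation byte or invalid lead byte
  if (S.size() - Pos < Len) return false; // truncated sequence
  for (std::size_t I = 1; I < Len; ++I) {
    unsigned char B = static_cast<unsigned char>(S[Pos + I]);
    if ((B & 0xC0) != 0x80) return false; // not a continuation byte
    Value = (Value << 6) | (B & 0x3F);
  }
  if (Value < Min) return false;                        // overlong encoding
  if (Value >= 0xD800 && Value <= 0xDFFF) return false; // surrogate
  if (Value > 0x10FFFF) return false;                   // beyond Unicode
  CP = Value;
  Pos += Len;
  return true;
}

// Placeholder excerpt only, NOT the standard's full list of allowed ranges.
struct Range { std::uint32_t Lo, Hi; };
static const Range SampleRanges[] = {{0x00C0, 0x00D6}, {0x00D8, 0x00F6}};

bool isAllowedExtendedChar(std::uint32_t CP) {
  for (const Range &R : SampleRanges)
    if (CP >= R.Lo && CP <= R.Hi) return true;
  return false;
}

// Check a whole identifier: ASCII rules for bytes below 0x80, decoded
// code points checked against the range table otherwise.
bool isValidIdentifier(const std::string &Name) {
  if (Name.empty()) return false;
  std::size_t Pos = 0;
  while (Pos < Name.size()) {
    unsigned char C = static_cast<unsigned char>(Name[Pos]);
    if (C < 0x80) {
      bool IsLetter = (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') || C == '_';
      bool IsDigit = C >= '0' && C <= '9';
      if (!IsLetter && !(IsDigit && Pos != 0)) return false;
      ++Pos;
      continue;
    }
    std::uint32_t CP;
    if (!decodeUTF8(Name, Pos, CP)) return false; // malformed UTF-8
    if (!isAllowedExtendedChar(CP)) return false; // not in the allowed list
  }
  return true;
}
```

With these placeholder ranges, "caf\xC3\xA9" (U+00E9) is accepted, while "a\xC3\x97" (U+00D7, the multiplication sign) and the overlong sequence "a\xC0\x80" are rejected. The ASCII checks use explicit comparisons so that the result does not depend on the locale.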

Just attach the patch to your email so it can be reviewed here.

Here you are. Of course, this affects the AsmWriter in LLVM when
it sees characters with the high bit set; for example, I would not use
the locale-dependent function isalnum() there.

-Jochen

Lexer.patch (2.29 KB)
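
The attachment itself is not reproduced here. As a rough idea of the kind of change described (every byte from 128 to 255 classified as a letter in a 256-entry lookup table), a minimal sketch might look like this. The flag values and the helper identifierLength are hypothetical and are not taken from the actual Lexer.cpp.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical flag values; the real Lexer.cpp defines its own set.
enum CharFlags : std::uint8_t {
  CHAR_LETTER = 0x01, // may start or continue an identifier
  CHAR_NUMBER = 0x02, // may continue an identifier, but not start one
  CHAR_UNDER = 0x04   // underscore
};

// Build a 256-entry classification table indexed by byte value.
static std::array<std::uint8_t, 256> makeCharInfo() {
  std::array<std::uint8_t, 256> Table{}; // all zero: not an identifier byte
  for (int C = 'a'; C <= 'z'; ++C) Table[C] = CHAR_LETTER;
  for (int C = 'A'; C <= 'Z'; ++C) Table[C] = CHAR_LETTER;
  for (int C = '0'; C <= '9'; ++C) Table[C] = CHAR_NUMBER;
  Table['_'] = CHAR_UNDER;
  // The change under discussion: every byte with the high bit set counts
  // as a letter, so any UTF-8 multi-byte sequence passes through.
  for (int C = 128; C <= 255; ++C) Table[C] = CHAR_LETTER;
  return Table;
}

static const std::array<std::uint8_t, 256> CharInfo = makeCharInfo();

// Number of leading bytes of Text that form an identifier (0 if none).
std::size_t identifierLength(const std::string &Text) {
  std::size_t I = 0;
  for (; I < Text.size(); ++I) {
    std::uint8_t Info = CharInfo[static_cast<unsigned char>(Text[I])];
    if (Info == 0) break;                      // not an identifier byte
    if (I == 0 && Info == CHAR_NUMBER) break;  // cannot start with a digit
  }
  return I;
}
```

Note that a table like this never decodes anything: it accepts byte sequences that are not valid UTF-8 (for example "\xFF\xFF") just as readily as real characters, which is the main way it can differ from a list of allowed characters.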

I'm afraid the parsing of the text is the smaller part of the problem.
I've only looked into this briefly, but it was enough to realize
you're going to run into platform specific linking/tool issues.

Windows, for example, normally takes any UTF-16 string in functions
that take strings as input, but in the documentation of GetProcAddress
it says this:

    lpProcName [in]

    The function or variable name, or the function's ordinal value. If
    this parameter is an ordinal value, it must be in the low-order word;
    the high-order word must be zero.

Essentially, they constrained the character set of identifier names to
the basic ASCII characters. You probably won't be able to get a
library with Unicode characters in its symbol names to link. Even then, LLVM would need
the brains to convert the UTF-8 string to UTF-16, which Windows
normally expects.

As far as I could tell, you would need to go platform by platform and
see if they demanded any special rules for identifiers in executables
and libraries. Unfortunately, most of them seem to say nothing
explicit about it in the documentation.

The last time I checked, GCC required the use of -fextended-identifiers,
which was marked as experimental.

-Scott

Ahh, never mind. I'm told my Windows example was totally bogus and
that that only applies for ordinal values.

Still, I think you'll have to see what constraints exist on different platforms.

-Scott

Then why not start by supporting it in the front end and LLVM IR,
and report errors in the back ends that state the reason
("unicode identifiers not supported")? For example, variable
names will not be a problem, since the back ends don't see them.

-Jochen
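
A back-end check of the kind suggested above could be as small as the following sketch. Both function names and the TargetSupportsUTF8Symbols flag are hypothetical; nothing like this is claimed to exist in LLVM.

```cpp
#include <string>

// True if any byte of Name has the high bit set, i.e. is outside ASCII.
bool hasNonASCIIByte(const std::string &Name) {
  for (unsigned char C : Name)
    if (C >= 0x80) return true;
  return false;
}

// Returns an empty string if the symbol can be emitted for this target,
// otherwise a diagnostic message stating the reason.
std::string checkExternalSymbol(const std::string &Name,
                                bool TargetSupportsUTF8Symbols) {
  if (!TargetSupportsUTF8Symbols && hasNonASCIIByte(Name))
    return "unicode identifiers not supported: '" + Name + "'";
  return std::string();
}
```

The front end and the IR would then accept such names unconditionally, and only a back end whose object format or linker cannot handle them would call the check on the symbols it actually emits.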

The reason might be bogus, but you're right: DLLs can only expose ASCII
names in their export table, and hence there's no Unicode version of
GetProcAddress. The symbol names in the PE/COFF spec are said to
be UTF-8, though, so Unicode names might theoretically work as long as they're
not dllexport. I have no clue whether any linker supports it, though.

BR,
Niklas

Hi!

Here is my patch for VMCore (against 2.9).
My suggestion would be that clang and LLVM use UTF-8, i.e. chars
128 to 255 are identifier characters, because LLVM is a general virtual machine.
If some linkers don't support it, then an error should be generated in the
corresponding back end.

-Jochen

VMCore.patch (1.75 KB)

"No Unicode version of ..." is Microsoft-speak and means "no UTF-16 version of ...".
So are you sure that GetProcAddress doesn't accept UTF-8? The main advantage
of UTF-8 is that you get Unicode support for free. So if GetProcAddress just takes
the byte sequence and compares it against the PE/COFF table, then everything is OK.
It only fails if it is broken on purpose, like the UTF-8 literals in Visual C.

-Jochen
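
To illustrate the "for free" argument: a lookup that only compares byte sequences needs no knowledge of the encoding at all. The sketch below uses a hypothetical ExportEntry table and findExport helper; it says nothing about what GetProcAddress really does internally.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical export table entry: a NUL-terminated name and an ordinal.
struct ExportEntry {
  const char *Name;
  int Ordinal;
};

// Plain byte-wise lookup. Returns the ordinal, or -1 if Name is not found.
int findExport(const ExportEntry *Table, std::size_t Count, const char *Name) {
  for (std::size_t I = 0; I < Count; ++I)
    if (std::strcmp(Table[I].Name, Name) == 0) // compares raw bytes only
      return Table[I].Ordinal;
  return -1;
}
```

Since UTF-8 never produces a zero byte inside a non-NUL character, strcmp works unchanged on such names. The one caveat is that two strings that look identical but use different code point sequences (for example precomposed versus combining characters) will not compare equal.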

Hi!

By the way, Lua also uses the simple approach of allowing every character
with the MSB set in identifiers:

http://lua-users.org/wiki/UnicodeIdentifers

So I would say that at least for LLVM this is simple and efficient;
for C++ it depends on the standard.

-Jochen