Transcoding UTF-8 to ASCII?

Soliciting your help on the following question ...

XPL uses the UTF-8 encoding for its identifiers. As such it supports
Unicode and many non-ASCII characters.

LLVM uses std::string for identifiers, which is based on a signed
character type that only supports 7-bit ASCII.

Although the size of the characters in both schemes is the same, the bit
encoding is different (UTF-8 is unsigned, while ASCII is 7-bit and so
uses a signed char). I need to support UTF-8 in XPL and am wondering
about transcoding my identifiers for use by LLVM.

There are two approaches here:
     1. just cast unsigned char to signed char and _hope_ it doesn't
        screw up LLVM
     2. encode the extended characters into ASCII somewhat like what
        browsers do with URLs (e.g. space=%20); see the sketch after
        this list.
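
Here's roughly what I have in mind for #2 (escapeIdentifier is just a
name I made up for illustration, not anything from LLVM): every byte
outside printable 7-bit ASCII gets replaced by a %XX escape.

    #include <cstdio>
    #include <string>

    // Hypothetical helper, not part of LLVM: replace every byte outside
    // printable 7-bit ASCII (plus space and '%' itself) with %XX.
    std::string escapeIdentifier(const std::string &utf8) {
      std::string out;
      for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b > 0x20 && b < 0x7F && b != '%') {
          out += static_cast<char>(b);   // plain printable ASCII passes through
        } else {
          char buf[4];
          // e.g. 0xC3 -> "%C3", space -> "%20"
          std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned>(b));
          out += buf;
        }
      }
      return out;
    }

    int main() {
      // "naïve id" in UTF-8 ('ï' is 0xC3 0xAF) becomes na%C3%AFve%20id.
      std::printf("%s\n", escapeIdentifier("na\xC3\xAFve id").c_str());
      return 0;
    }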

Would approach #1 work with LLVM? Are there any character bit patterns
forbidden in LLVM identifiers? This would be the simplest to implement
but I'm unsure of the consequences. If there are, I'll be forced into
approach #2.

Thanks,

Reid.

> There are two approaches here:
>      1. just cast unsigned char to signed char and _hope_ it doesn't
>         screw up LLVM
>      2. encode the extended characters into ASCII somewhat like what
>         browsers do with URLs (e.g. space=%20).
>
> Would approach #1 work with LLVM?

Yes, we never 'interpret' the characters in an identifier for anything
that would cause us to consider the 'sign bit' special.

> Are there any character bit patterns forbidden in LLVM identifiers?

No. I believe everything should work, though you might get into trouble
if you use the '"' character (which can be easily fixed in the lexer and
asmwriter as needed).

> This would be the simplest to implement but I'm unsure of the
> consequences. If there are, I'll be forced into approach #2.

I believe this should work. LLVM identifiers should be completely
unconstrained.

-Chris