Mangling of UTF-8 characters in symbol names

Why is it that high (>127) bytes in symbol names get mangled by LLVM into XX, where XX is the hex representation of the byte? Is this required by ELF or some similar standard? This behavior is inconsistent with GCC.
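
For concreteness, a minimal example (the exact mangled spelling below is only my reading of that behavior, not verified output):

int i\u03bb;   /* the identifier is "iλ"; λ (U+03BB) encodes in UTF-8 as the bytes 0xCE 0xBB */

Both of those bytes are above 127, so as I understand it the emitted symbol comes out as something along the lines of "iCEBB" rather than the literal UTF-8 "iλ" spelled in the source.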

Sean

I think it's just so that we have a way to actually write out the
symbol into the assembly file. What does gcc do?

-Eli

It emits the high bytes literally. The consequence is that UTF-8-encoded identifiers come out in UTF-8:

scshunt@natural-flavours:~$ gcc -fextended-identifiers -std=c99 -x c -c -o test.o -
int i\u03bb;
scshunt@natural-flavours:~$ nm test.o
00000004 C iλ
scshunt@natural-flavours:~$

As you can see, the nm output includes the literal lambda.

Sean

Okay... then we should probably support that as well. Might need to
be a bit careful to make sure the assembly files work correctly.

-Eli

You mean machine assembly and not IR, right?

Sean

Err, yes... LLVM IR already has a scheme for handling this safely.
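
For reference, a rough sketch of that scheme (details from memory, so double-check against the LangRef): names in textual IR can be quoted, with bytes that aren't plain identifier characters written as \XX hex escapes, e.g.

  @"i\CE\BB" = global i32 0

so arbitrary byte sequences already round-trip through the IR; the open question is only what we hand to the assembler and the object file.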

-Eli

GCC's assembly output simply contains the UTF-8 bytes unchanged.
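
For the earlier example, the relevant line of the .s file would look something like this (sketched from the nm output above, not copied from a real compile):

  .comm iλ,4,4

i.e. the raw UTF-8 bytes appear in the symbol name and the assembler passes them straight through to the object file, which is how they end up in nm's output.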

I think that we should do this by default, and if this causes
problems, come up with a setting (possibly in the target?) to control
how we encode Unicode on other platforms.

Sean