Why is it that high (>127) bytes in symbol names get mangled by LLVM into XX, where XX is the hex representation of the character? Is this required by ELF or some similar standard? This behavior is inconsistent with GCC.
Sean
I think it's just so that we have a way to actually write out the
symbol into the assembly file. What does gcc do?
-Eli
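For context: assembly files are plain text, so a symbol name containing bytes above 127 needs some ASCII-only spelling before it can be written out. A rough sketch of that kind of escaping (illustrative only, not LLVM's actual code; the exact spelling LLVM emits is not shown in this thread):

  #include <stdio.h>

  /* Write a symbol name to stdout, replacing any byte above 127 with its
     two-digit hex value so the result stays plain ASCII.
     Purely an illustration of the idea. */
  static void write_escaped(const char *name) {
      for (const unsigned char *p = (const unsigned char *)name; *p; ++p) {
          if (*p > 127)
              printf("%02X", *p);   /* hex spelling of the raw byte */
          else
              putchar(*p);
      }
      putchar('\n');
  }

  int main(void) {
      write_escaped("i\xCE\xBB");   /* the UTF-8 bytes of "iλ" */
      return 0;
  }

Under such a scheme the identifier iλ (UTF-8 bytes 0xCE 0xBB) would come out as something like iCEBB, roughly the hex mangling described above.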
It emits those bytes literally. The consequence is that UTF-8-encoded identifiers come out in UTF-8:
scshunt@natural-flavours:~$ gcc -fextended-identifiers -std=c99 -x c -c -o test.o -
int i\u03bb;
scshunt@natural-flavours:~$ nm test.o
00000004 C iλ
scshunt@natural-flavours:~$
As you can see, the nm output includes the literal lambda.
Sean
Okay... then we should probably support that as well. Might need to
be a bit careful to make sure the assembly files work correctly.
-Eli
You mean machine assembly and not IR, right?
Sean
Err, yes... LLVM IR already has a scheme for handling this safely.
-Eli
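For reference, the scheme being referred to is presumably the quoted-identifier form in textual IR: a name can be wrapped in double quotes and any awkward byte written as a \xx hex escape. The global from the gcc example above might then look something like

  @"i\CE\BB" = common global i32 0, align 4

where \CE\BB are the two bytes of the UTF-8 encoding of λ (U+03BB), so the non-ASCII bytes never appear raw in the .ll file.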
GCC's assembly output simply contains the UTF-8 bytes unchanged.
I think that we should do this by default, and if this causes
problems, come up with a setting (possibly in the target?) to control
how we encode Unicode on other platforms.
Sean