Mangling of UTF-8 characters in symbol names

Why is it that high (>127) bytes in symbol names get mangled by LLVM into XX, where XX is the hex representation of the byte? Is this required by ELF or some similar standard? This behavior is inconsistent with GCC.
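
For concreteness, a minimal example (the exact mangled spelling below is only my reading of that behavior, not verified output):

int i\u03bb;   /* the identifier is "iλ"; λ (U+03BB) encodes in UTF-8 as the bytes 0xCE 0xBB */

Both of those bytes are above 127, so as I understand it the emitted symbol comes out as something along the lines of "iCEBB" rather than the literal UTF-8 "iλ" spelled in the source.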

Sean

I think it's just so that we have a way to actually write out the
symbol into the assembly file. What does gcc do?

-Eli

It emits the high bytes literally. The consequence is that UTF-8-encoded identifiers come out in UTF-8:

scshunt@natural-flavours:~$ gcc -fextended-identifiers -std=c99 -x c -c -o test.o -
int i\u03bb;
scshunt@natural-flavours:~$ nm test.o
00000004 C iλ
scshunt@natural-flavours:~$

As you can see, the nm output includes the literal lambda.

Sean

Okay... then we should probably support that as well. Might need to
be a bit careful to make sure the assembly files work correctly.

-Eli

You mean machine assembly and not IR, right?

Sean

Err, yes... LLVM IR already has a scheme for handling this safely.
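
For reference, a rough sketch of that scheme (details from memory, so double-check against the LangRef): names in textual IR can be quoted, with bytes that aren't plain identifier characters written as \XX hex escapes, e.g.

  @"i\CE\BB" = global i32 0

so arbitrary byte sequences already round-trip through the IR; the open question is only what we hand to the assembler and the object file.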

-Eli

GCC's assembly output simply contains the UTF-8 bytes unchanged.
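
For the earlier example, the relevant line of the .s file would look something like this (sketched from the nm output above, not copied from a real compile):

  .comm iλ,4,4

i.e. the raw UTF-8 bytes appear in the symbol name and the assembler passes them straight through to the object file, which is how they end up in nm's output.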

I think that we should do this by default, and if this causes
problems, come up with a setting (possibly in the target?) to control
how we encode Unicode on other platforms.

Sean