-fhash-long-section-names=N, -fhashed-section-names=map.csv

Hi list,

I'm a bit new to hacking LLVM / Clang, and I wanted to add a new
command line option "-fhash-long-section-names=N". The change will
help to overcome the 16-character limit in section names in macOS[1]
which is currently a bit of a showstopper for a certain feature in one
specific project. The option itself does not necessarily need to be
tied to macOS, although ELF imposes no such limit on section name
length. The default would be to preserve existing behaviour: section
names are not hashed and over-long names continue to produce an
error [2]. The minimum non-zero value for N is chosen to be 16. The
maximum value is arbitrary. A value of 0 indicates "no hashing".

The hashing process will consist of:
* SHA256
* Base64
* Truncate to N

This is already a somewhat common approach to solving this problem on macOS.

The basic idea is this (N = 16):
// this is a short section, so no change
__attribute__((section("foo"))) => "foo"
// this "long" section has been hashed
__attribute__((section("ThisSectionNameIsTooLong"))) => "ip9RNVxH27rCS+Ix"
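
A minimal Python sketch of the three steps above, for illustration only.
It assumes the SHA256 digest is taken over the UTF-8 bytes of the name
and that the standard Base64 alphabet (including '+' and '/') is used.
The function name hash_section_name and the constant are hypothetical,
not part of any patch.

```python
import base64
import hashlib

MACHO_MAX_SECTION_NAME = 16  # Mach-O section names are limited to 16 characters


def hash_section_name(name: str, n: int = MACHO_MAX_SECTION_NAME) -> str:
    """Return name unchanged if it fits in n characters, else a hashed name.

    n == 0 means "no hashing", mirroring the proposed default.
    """
    if n == 0 or len(name) <= n:
        return name
    # Assumption: hash the UTF-8 bytes and Base64-encode the raw digest.
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return encoded[:n]


if __name__ == "__main__":
    print(hash_section_name("foo"))  # short name, printed unchanged
    print(hash_section_name("ThisSectionNameIsTooLong"))  # 16-character hash
```

Because the result is a pure function of the input name, separate
translation units would agree on the hashed name without coordination.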

In the unlikely event of a section name collision, it would be good to
throw an error (a good test point). Also, since hashing is not
trivially reversible, I would like to add another option
-fhashed-section-names=map.csv, which would write the mapping from
original to hashed section names to a file in a format that is easy
for subsequent tooling to read.
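
To make the collision check and the map file concrete, here is a rough
Python sketch. The column names "original" and "hashed", and the helper
names build_section_map and write_map_csv, are assumptions made for
illustration; the proposal does not fix a CSV layout.

```python
import base64
import csv
import hashlib
import io


def hash_section_name(name, n=16):
    # Same SHA256 -> Base64 -> truncate steps listed earlier in this post.
    if n == 0 or len(name) <= n:
        return name
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")[:n]


def build_section_map(names, n=16):
    """Map each original name to its emitted name, raising on a collision."""
    emitted_to_original = {}
    mapping = {}
    for original in names:
        emitted = hash_section_name(original, n)
        previous = emitted_to_original.setdefault(emitted, original)
        if previous != original:
            raise ValueError(
                f"section name collision: {previous!r} and {original!r} "
                f"both map to {emitted!r}"
            )
        mapping[original] = emitted
    return mapping


def write_map_csv(mapping, out):
    # Hypothetical layout: one header row, then one row per hashed name.
    writer = csv.writer(out)
    writer.writerow(["original", "hashed"])
    for original, emitted in mapping.items():
        if original != emitted:  # names that were not hashed are skipped
            writer.writerow([original, emitted])


if __name__ == "__main__":
    buf = io.StringIO()
    write_map_csv(build_section_map(["foo", "ThisSectionNameIsTooLong"]), buf)
    print(buf.getvalue())
```

Note that a collision can also occur between a hashed name and a short
name that happens to be spelled the same as the hash, so the check
covers emitted names rather than only hashed ones.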

For macOS, specifically, patterns like the following would also need
transformation:
section("__DATA,phoo")
extern struct foo foo_start __asm("section$start$__DATA$phoo");
extern struct foo foo_end __asm("section$end$__DATA$phoo");

These are roughly the macOS counterpart of the linker-generated
__start_ and __stop_ symbols in the ELF world.
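
A sketch of how the two Mach-O patterns above might be rewritten so that
the section attribute and the boundary symbols stay in agreement. It
assumes only the section field is hashed and the segment name is left
alone; hash_macho_specifier and hash_macho_boundary_label are
hypothetical helper names, not existing Clang or LLVM functions.

```python
import base64
import hashlib


def hash_section_name(name, n=16):
    # SHA256 -> Base64 -> truncate, as listed earlier; short names pass through.
    if n == 0 or len(name) <= n:
        return name
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")[:n]


def hash_macho_specifier(spec, n=16):
    """Hash the section field of a "segment,section[,...]" specifier."""
    fields = spec.split(",")
    # Assumption: a specifier without a comma is treated as a bare section name.
    index = 1 if len(fields) > 1 else 0
    fields[index] = hash_section_name(fields[index], n)
    return ",".join(fields)


def hash_macho_boundary_label(label, n=16):
    """Hash the section field of section$start$SEG$SECT or section$end$SEG$SECT."""
    fields = label.split("$")
    if len(fields) == 4 and fields[0] == "section" and fields[1] in ("start", "end"):
        fields[3] = hash_section_name(fields[3], n)
        return "$".join(fields)
    return label  # not a section boundary label, leave untouched


if __name__ == "__main__":
    print(hash_macho_specifier("__DATA,phoo"))  # prints "__DATA,phoo"
    print(hash_macho_specifier("__DATA,ThisSectionNameIsTooLong"))
    print(hash_macho_boundary_label("section$start$__DATA$ThisSectionNameIsTooLong"))
```

Since both helpers call the same hashing function on the same section
field, the attribute and the start/end labels end up naming the same
section.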

The Clang frontend changes were fairly straightforward, and it was
quite simple to create the transform itself in Python and in LLVM. I'm
a little unsure of how to proceed from here. Likely there will be some
aspect of the AST and some aspect of Sema involved. I have gone over
the documentation and examples [3][4], and I'm still not entirely
sure.

I have done some brute-forcing and have played around with
MCSectionMachO.cpp and MCSymbolMachO.cpp, but I think that is
definitely the wrong approach.

Finally, my questions:
1. First, is this a feature that upstream would accept?
2. Should I use the AST / Replacement approach mentioned in [5]?
3. Is there another, preferable form of "backend magic" that should be used?
4. Are there any existing tests that would be good examples to borrow from?

Would you be able to point me in the right direction?

Thanks, and hope you are well.

C

[1] See "sectname[16]" here:
https://opensource.apple.com/source/cctools/cctools-921/include/mach-o/loader.h.auto.html
[2]
error: argument to 'section' attribute is not valid for this target:
mach-o section specifier requires a section whose length is between 1
and 16 characters
[3] Hacking on clang
[4] “Clang” CFE Internals Manual — Clang 16.0.0git documentation
[5] The Clang AST - a Tutorial - YouTube

Made some progress and would still very much like to get some feedback
from LLVM devs.

I ended up implementing an AST matcher by following the tutorial[1] and
have a PoC here[2] which works[3]. I still need to move the matcher out
of the custom tool that I created and into Sema. I also added some
regression tests (although I think I might need to move that logic to
unit tests instead).

One thing I've come to realize a bit more though is that I might need
to target LangOptions rather than CodeGenOptions. Is anyone able to
confirm that is the correct location for this option?

Working with the AST and SemaCXX has surprisingly little to do with
specific machine code generation. I guess that's just how LLVM is
architected. There is one exception in this case though; I still need
to somehow determine if the target object file is MachO in order to
decide whether to call MCSectionMachO::ParseSectionSpecifier().

Is there an easy way to determine the target object file format via
the ASTContext?

[1] Tutorial for building tools using LibTooling and LibASTMatchers — Clang 16.0.0git documentation
[2] https://github.com/llvm/llvm-project/compare/main...cfriedt:fhash-long-section-names
[3]
% long-section-converter clang/test/SemaCXX/attr-section-hashed-macos.cpp \
    -- -fhash-long-section-names=16
VarDecl 0x1470cdec0 <.../attr-section-hashed-macos.cpp:1:1, line:2:5> col:5 foo 'int'
`-SectionAttr 0x1470cdf28 <line:1:16, col:59> section "__RODATA,ip9RNVxH27rCS+Ix"
VarDecl 0x1470ce390 parent 0x14702fc08 <.../attr-section-hashed-macos.cpp:8:3, col:24> col:14 used start_foo 'int' extern
`-AsmLabelAttr 0x1470ce408 <col:32> "section$start$__RODATA$ip9RNVxH27rCS+Ix" IsLiteralLabel
VarDecl 0x147808e00 parent 0x14702fc08 <.../attr-section-hashed-macos.cpp:9:3, col:22> col:14 used end_foo 'int' extern
`-AsmLabelAttr 0x147808e78 <col:30> "section$end$__RODATA$ip9RNVxH27rCS+Ix" IsLiteralLabel