Problems with Mach-O address significance table generation

Hi all,

We have been working on implementing ELF’s .addrsig format – originally outlined by @pcc here – for the MC Mach-O backend. However, we’ve run into some issues getting it to work properly. The Mach-O streamer seems to be structured quite differently from the ELF one, and our experience working with MC is limited, so I was hoping we could get some tips on how to handle this. I also have some suggestions for alternative formats of encoding the data, which might be simpler to implement.

The Problem

First, a quick recap of the format: We have symbol table indices stored as as ULEB128-encoded values in an __llvm_addrsig section. These indices tell the linker that those symbols should never be merged during code folding. Since we are using ULEB128, the size of this addrsig section depends on the actual index values, so we can’t determine its final size until we have assigned those indices.

The problem with Mach-O addrsig generation is that MCAssembler::layout() runs before symbol table indices are assigned; this happens in MachObjectWriter::computeSymbolTable(). But layout needs to know the size of the sections to work properly, leaving us in a bit of a circular bind.

I think MC’s ELF backend seems to sidestep this problem because it computes section offsets separately from MCAsmLayout, but the Mach-O backend doesn’t seem to do this. I might be missing something obvious though given my unfamiliarity with the code. Pointers will be highly appreciated!

Possible Alternative Encodings

I’ve been wondering if we could mark address-significant symbols within the symbol table itself, instead of using an auxiliary section. In particular, we could set a bit in the nlist::n_desc field for this. Alternatively, we could maybe reuse the REFERENCED_DYNAMICALLY bit and combine it with N_PEXT to indicate symbols whose addresses are significant, but which can still be removed by llvm-strip. Either way, this will probably need Apple’s stamp of approval. @cachemeifyoucan, do you have any thoughts on this?

1 Like

Make sure I understand your problem correctly. The current implementation after ⚙ D123751 Implement support for __llvm_addrsig for MachO in llvm-mc is not entirely correct because the layout might not be correct because new data are added after we finish layout()?

Personally, I am also not an expert in MC layer. I would prefer a solution with n_desc value but since we already used up all the fields there, we need to really think carefully about how we can add it without breaking the tools that doesn’t know about the new combination of bits. I will talk to our linker team to see if they have any thoughts there.

For a quick experiment. Maybe you can sort the symbol table differently by putting address significant symbols first? Then you just need to encode very few number for start and end index so you don’t need to use ULEB128?

1 Like

The current implementation after :gear: D123751 Implement support for __llvm_addrsig for MachO in llvm-mc is not entirely correct because the layout might not be correct because new data are added after we finish layout() ?

Close. Right now we aren’t setting the size of __llvm_addrsig at all, so layout thinks it has zero size, and we emit an invalid object file if there are any sections following it. (It seems that only DWARF sections are being emitted afterward, so the problem doesn’t crop up unless debug info is added.) ⚙ D127637 [MC][MachO] Change addrsig format + ensure its size is properly set was my attempt to calculate the section size, but then I realized that the point where we create the addrsig section, all sym->getIndex() returns is zero because the symbol indices haven’t been assigned. To sum up, we don’t have enough information to calculate the size correctly within createAddrSigSection().

Maybe you can sort the symbol table differently by putting address significant symbols first? Then you just need to encode very few number for start and end index so you don’t need to use ULEB128?

Hmm, that’s an interesting idea. We would have to have separate ranges for local vs external symbols, of course, but it seems workable.

I will talk to our linker team to see if they have any thoughts there.

Awesome, would love to hear their thoughts!

1 Like

Some quick feedback from our linker team:

  • Using a symbol table bit is definitely a more reasonable solution than a ULEB encoded list
  • Get a new bit is a quite tricky especially you are currently uses it as a “Don’t fold” bit. Because if you don’t read the bit from an old object file, that doesn’t mean you can fold the function. The reverse will make more sense but we still need to work out the design
  • ld64 currently only dedups weak functions and dedup is also very expensive. It would be interesting to find the balance between how many functions we check and code size optimized.
2 Likes

What we did for .cg_profile which also used to store symbol table indices is to instead store
relocations to the function symbols: ⚙ D104080 [LLD][LLVM] CG Graph profile using relocations. This was because any tools that modified the symbol table would break these tables since they won’t be updated (and would solve it for .addrsig as well) but also gets rid of the dependency problem here.

The cost is that the linker needs to be updated to read relocations instead of ULEB128 and .addrsig size goes up by 3x. However, this section is only used by the linker and then discarded for the final binary so this isn’t a big deal.

1 Like

However, this section is only used by the linker and then discarded for the final binary so this isn’t a big deal.

(FWIW, object size is a big deal for Google at least - the number of bytes of object files can be a scaling factor/problem for our larger programs - I imagine that was probably part of why the ULEB encoding was chosen for ELF/by Google. Would be sort of a pity if the MachO choice had a significantly different tradeoff, but if it’s necessary… )

1 Like

This section is tiny relative to the final binary: I’m seeing 118 bytes for .addrsig out of 1.36 MB in lib/Analysis/CMakeFiles/LLVMAnalysis.dir/ScalarEvolution.cpp.o or ~0.01% of total size. This is in Release mode as well with no debug information.

There’s a correctness issue with having it encode symbol table indices: any post-processing tool that changes the symbol table silently invalidates the table contents. We saw strip -g on the object files change the symbol table which caused one of the indices to go out of bounds which resulted in a hard linker error. If none of them went out of bounds, the final program could be silently incorrect from function merging. Similar observation found here: 23817 – strip and ld -r corrupt SHT_LLVM_ADDRSIG section (files built with clang-7) which also suggested using relocations instead which is maintained by post-processing tools.

My suggestion is to switch all instances of .addrsig to use the relocation format. There’s not much pressure from backwards compatibility because these are intermediate binaries. For .cg_profile we stopped recognizing the previous format and used a new section type for it to indicate the newer format and haven’t encountered any reported issues.

What’s the use case for stripping object files? I was under the impression that we usually just strip the final binary

(Nevertheless the ld -r thing definitely seems like an issue)

(Nevertheless the ld -r thing definitely seems like an issue)

Actually, it’s only an issue if we mix usage of ld64 together with LLD. l suppose that since LLD doesn’t yet support -r, there might be a use case for this, but IMO the long term solution is to support -r and teach it to update the addrsig info.

Also, it seems that the Mach-O CG profile implementation uses symbol indices instead of relocations, so we are already subject to this problem.

@cachemeifyoucan does the linker team have any further opinions on this? If using an additional n_desc bit is going to take a few months to hash out, perhaps we can go with a simple workaround of using fixed-size encodings of symbol indices for now, and move to the n_desc solution later to make smaller object files + work better with other symtab-editing tools?

ld -r will internalize visibility hidden symbols, make them private. You can then strip to hide them from users (either avoid duplicated symbols or expose less internals when you ship a static library). So yes, it is common to ld -r and strip object file before making it an archive.

You can totally support that even with symbol index approach, because strip calls into ld64 for certain cases (I think it is a single object ld -r because linker is the source of truth for object file). You can rebuild the symbol table when ld -r to make sure the table is still valid.

I don’t have any comments on this and I don’t have any estimates that I can give. I think even we end up doing the n_desc route, implementing a prototype with indexes is a good data point for arguing for some bits in machO format. We can then schedule a meeting to discuss.

1 Like

@pcc do you have data/recall specifics about alternative encoding schemes here, the cost of using traditional relocations?

1 Like

If the final binary isn’t built with debug information, stripping it out of the intermediate object files keeps them smaller so that the build cache is more efficient.

1 Like

I would suggest using a bit on the symbol table for this if you are able to allocate one. For ELF there were no available bit positions for adding these kinds of attributes and we couldn’t reach agreement on adding a new side table for the symbol table. Relocations were considered too expensive to have the feature turned on by default.

Adding a section with symbol table indexes has its own complications because, as you observed, downstream tools like strip may change the symbol table indexes. In ELF we detect this situation by making sh_link point to the symbol table. At the time I surveyed the available downstream tools and they were all setting sh_link to 0 for unknown sections so we used that as a marker to detect manipulation by an incompatible downstream tool. This was less of a concern for COFF because people don’t typically run these kinds of tools on COFF object files.

If there isn’t anything comparable on Mach-O then it may be best to go with relocations at least as an initial implementation. If you can find an unused bit in the Mach-O relocation format then you might even consider following one of the suggestions that was made and encode the address significance in a bit on the relocations themselves (this was considered impractical with ELF because it would require duplicating pretty much every relocation type). Otherwise you can have an empty section with relocations pointing to the address-significant symbols and later move to the bit on the symbol table to save space.

2 Likes

Thanks for the feedback everyone! I’ve updated ⚙ D127637 [MC][MachO] Change addrsig format + ensure its size is properly set with the symbol relocation approach as suggested. Looking forward to a cleaner n_desc approach down the line.

I wish that we use SHT_REL relocation format for .llvm_addrsig, but there is indeed a noticeable size overhead.
This command computes the total size and the number of .llvm_addrsig symbols.

find . -type f -name '*.[ao]' | ruby -lne 'BEGIN{tot=a=0}; tot+=File.size($_); a+=`llvm-readobj --addrsig #{$_} | grep -c Sym:`.to_i; END{puts "total size: #{tot}\n# of addrsig symbols: #{a}"}'

A default build (no debug) of chromium (2021-12-31)

total size: 4087677552
# of addrsig symbols: 5286068
size proportion with SHT_REL: 0.02069074356381621

A -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=on build of clang (recent)

total size: 486550806
# of addrsig symbols: 48027
size proportion with SHT_REL: 0.0015793458576656844

A large internal project:

total size: 3105872224
# of addrsig symbols: 1402945
size proportion with SHT_REL: 0.007227315063259336

I agree that object file size is a big deal for Google, but 0.7% is likely acceptable.

For clang, files like lib/libLLVM{AArch64,X86}Desc.a have a larger addrsig_syms/size ratio due to many TableGen-generated namespace-local variables referenced by other variables:

static const MCPhysReg ImplicitList85[] = { X86::ECX, X86::EFLAGS, 0 };
static const MCPhysReg ImplicitList86[] = { X86::XMM0, X86::EFLAGS, 0 };
...
  { 2128,       7,      0,      0,      254,    0|(1ULL<<MCID::MayLoad), 0x3098046819ULL, ImplicitList27, ImplicitList85, OperandInfo120 },  // Inst #2128 = PCMPESTRIrm
  { 2131,       3,      0,      0,      257,    0, 0x3018046829ULL, ImplicitList27, ImplicitList86, OperandInfo121 },  // Inst #2131 = PCMPESTRMrr

If we have an source-level attribute to mark a variable address-insignificant, we can lower addrsig_syms/size for lib/libLLVM{AArch64,X86}Desc.a.

It’s a bit off-topic, but my memory is vague as to why we needed an address-significant table for ELF in the first place. We have kind of already solved the problem when making a decision as to whether we need to create a canonical PLT or a regular PLT, no? That is, if a symbol is referenced by an address-taken relocation (such as R_X86_64_64), we need to create a canonical PLT. If there’s no such relocation for a symbol, we can create a regular PLT. So, it looks like we have already answered to the question as to whether a symbol is address-taken or not.

What am I missing?

For text sections, relocation type distinction mostly suffices.
For rodata sections (which can be merged as well), the references all look like R_X86_64_64/R_X86_64_PC32 but the sections may be address insignificant.

gold --icf=safe mostly checks relocation types but it does not suffice, so it uses demangled names for some tests (e.g. vtable). The demangling makes ICF slow. The relocation type matching looks brittle and requires different processing for each architecture.

Are you sure that relocation types are sufficient? I thought that LEA and CALL instructions basically use the same 32-bit PC-relative relocation, so in many cases, the linker can’t really tell what is address-taken. I don’t remember the details, but I settled on the conclusion that these tables are necessary, and I’m comfortable sticking with that.

I also believe the address significance tables are more aggressive because they leverage information from the frontend. C++ virtual methods are not comparable by address, so they can all be marked unnamed_addr in LLVM IR, even though they are all address-taken as far as the linker is concerned.

I’m sure it’ll work for function. It won’t work for data, as we can’t distinguish address-taken relocations from non-address-taken relocations. But for functions, if it is referenced by R_X86_64_PLT32 (or equivalents), that relocation doesn’t take an address, while other types of relocations take an address.

Technically, you can get an address of a function by getting its relative address with R_X86_64_PLT32 and add a PC from that instruction, but I don’t think we don’t have to worry too much about that.

I believe lld is already using that heuristics to decide whether it should create a canonical PLT or non-canonical PLT. It creates a canonical PLT only if a function address is taken. So it’s not entirely new.

Besides that, I actually don’t think that you cannot take an address of a function with R_X86_64_PLT32 followed by an ADD to add a PC to the resulting value, because it always creates an address of a PLT entry, even if we are not using the address of the PLT as the function address. So the only correct way to get an address of a function is to refer it using R_X86_64_PLT32 or via a GOT entry.