“Parse input files” is the major bottleneck in a link. lld should improve this aspect to be faster.
Why isn't ld.lld faster? | MaskRay has some statistics.
Here are some insights on making symbol initialization and resolution parallel:
- Some performance win (initialization of local symbols and sections can be made parallel) does not need parallel symbol resolution.
- Current ELF
ObjFile<ELFT>::parse
couples some tasks which make parallelism hard. - In all ports, symbol initialization and resolution are coupled too heavily. Some tasks (e.g. reporting duplicate definition error) can be postponed.
More concretely, I think we need these changes in order:
- postpone duplicate definition error (in
Symbol::resolveDefined
) to the very end. For each relocatable object file, iterate over its symtab, check whethersymbols[i]->file
andfile
lead to a duplicate definition error. This can be parallel, with the downside that theduplicate symbol
diagnostic order may not be deterministic. I think we should take the sacrifice. - postpone COMDAT group resolution. In
ObjFile<ELFT>::initializeSymbols
, drop thesec == &InputSection::discarded
case. Instead, fix the symbol kind after all files are parsed. - postpone and parallelize initialization of local symbols (
for (size_t i = 0, end = firstGlobal; i != end; ++i) {
). This can happen after we get a list of needed relocatable object files (i.e. ignoring unextracted lazy object files). - postpone and parallelize initialization of sections. This can happen after we get a list of needed relocatable object files.
With these, “Parse input files” should be noticeably faster.
I will strive to keep the relocation refers to a symbol in a discarded section
diagnostic (⚙ D59649 [ELF] Improve error message for relocations to symbols defined in discarded sections), but there is a non-zero probability that we may lose it in the default mode if it does not fit into parallelism well (hurt performance too much).
If it turns out to be too slow, perhaps add an opt-in option for it.
Then think about parallel initialization of non-local symbols.
This part is largely under-specified.
There are two main things.
First is how to parallelize lazy object files (archive member extraction).
The basic idea is that we need to parse lazy object files eagerly and do an object file level mark-and-sweep.
Initially every non-lazy relocatable object file is marked and in a job queue.
If any symbol resolves to an unextracted lazy object, place the lazy object in the job queue.
TODO add more descriptions
set = {all relocatable object files and lazy object files extracted by options like -u}
for file in set
for sym in file's non-local symbols
if !sym.isWeak() && isDefinedByUnextractedLazyObject(sym)
set.insert(sym.file)
Second is how to parallelize file parsing.
Every input file has a priority (relocatable > shared = lazy).
When iterating over its symbol table, consider whether its entry has a higher priority than the existing global symtab entry. If yes, overwrite some properties.
void initializeSymbols {
for (esym,sym) in file's non-local symbols {
if sym.isUndefined() { continue; }
if priority(esym) < sym.priority() {
sym.file = this;
// Replace other fields
}
}
}
Some notes:
- llvm-project does not provide a concurrent unordered map/vector.
- Think hard to keep the order of
SmallVector<Symbol *, 0> symVector;
. Symbol order stability is important. - I have tested that simply replacing
llvm::DenseMap<llvm::CachedHashStringRef, int> symMap;
with sharded symbol maps does not improve performance, because symbol iteration will become slow. - Some symbol properties initialized in
ObjFile<ELFT>::initializeSymbols
can be postponed. Candidates: value/type.
This post focuses on ELF.
For Mach-O:
- it does not have COMDAT, so easier
- it has tricky code like
if (dysym->isWeakDef())
inSymbolTable::addLazyObject
. This appears more complex than ELF, so harder