Parallel input file parsing

“Parse input files” is the major bottleneck in a link. lld should improve this aspect to be faster.
Why isn't ld.lld faster? | MaskRay has some statistics.

Here are some insights on making symbol initialization and resolution parallel:

  • Some performance win (initialization of local symbols and sections can be made parallel) does not need parallel symbol resolution.
  • Current ELF ObjFile<ELFT>::parse couples some tasks which make parallelism hard.
  • In all ports, symbol initialization and resolution are coupled too heavily. Some tasks (e.g. reporting duplicate definition error) can be postponed.

More concretely, I think we need these changes in order:

  • postpone duplicate definition error (in Symbol::resolveDefined) to the very end. For each relocatable object file, iterate over its symtab, check whether symbols[i]->file and file lead to a duplicate definition error. This can be parallel, with the downside that the duplicate symbol diagnostic order may not be deterministic. I think we should take the sacrifice.
  • postpone COMDAT group resolution. In ObjFile<ELFT>::initializeSymbols, drop the sec == &InputSection::discarded case. Instead, fix the symbol kind after all files are parsed.
  • postpone and parallelize initialization of local symbols (for (size_t i = 0, end = firstGlobal; i != end; ++i) {). This can happen after we get a list of needed relocatable object files (i.e. ignoring unextracted lazy object files).
  • postpone and parallelize initialization of sections. This can happen after we get a list of needed relocatable object files.

With these, “Parse input files” should be noticeably faster.
I will strive to keep the relocation refers to a symbol in a discarded section diagnostic (⚙ D59649 [ELF] Improve error message for relocations to symbols defined in discarded sections), but there is a non-zero probability that we may lose it in the default mode if it does not fit into parallelism well (hurt performance too much).
If it turns out to be too slow, perhaps add an opt-in option for it.


Then think about parallel initialization of non-local symbols.
This part is largely under-specified.

There are two main things.

First is how to parallelize lazy object files (archive member extraction).
The basic idea is that we need to parse lazy object files eagerly and do an object file level mark-and-sweep.
Initially every non-lazy relocatable object file is marked and in a job queue.
If any symbol resolves to an unextracted lazy object, place the lazy object in the job queue.
TODO add more descriptions

set = {all relocatable object files and lazy object files extracted by options like -u}
for file in set
  for sym in file's non-local symbols
    if !sym.isWeak() && isDefinedByUnextractedLazyObject(sym)
      set.insert(sym.file)

Second is how to parallelize file parsing.
Every input file has a priority (relocatable > shared = lazy).
When iterating over its symbol table, consider whether its entry has a higher priority than the existing global symtab entry. If yes, overwrite some properties.

void initializeSymbols {
  for (esym,sym) in file's non-local symbols {
    if sym.isUndefined() { continue; }
    if priority(esym) < sym.priority() {
      sym.file = this;
      // Replace other fields
    }
  }
}

Some notes:

  • llvm-project does not provide a concurrent unordered map/vector.
  • Think hard to keep the order of SmallVector<Symbol *, 0> symVector;. Symbol order stability is important.
  • I have tested that simply replacing llvm::DenseMap<llvm::CachedHashStringRef, int> symMap; with sharded symbol maps does not improve performance, because symbol iteration will become slow.
  • Some symbol properties initialized in ObjFile<ELFT>::initializeSymbols can be postponed. Candidates: value/type.

This post focuses on ELF.
For Mach-O:

  • it does not have COMDAT, so easier
  • it has tricky code like if (dysym->isWeakDef()) in SymbolTable::addLazyObject. This appears more complex than ELF, so harder
2 Likes

CC @int3 and @oontvoo for LLD for Mach-O :slight_smile:

Thanks for cc’ing - (I haven’t followed Discourse closely :))

I’ve been chatting with maskray about this for sometime. From my current experiment with parellising input parsing for macho (ie., similar to the what the current ELF’s ObjFile::parse does), the result does not seem very good. The implementation is quite simple at the moment - simply have all the macho::InputFile delay populating SymTab - but instead, just parse the membufs in parallel and store the temporary Symbols’ state, then after everyone is done, populating the symboltab serially.

One of the concern right now is the allocator. (Current bump allocator isn’t thread safe, and switching to simple malloc or even TCMalloc seems to slow things down by ~13-20%, depending on the applications). However, I believe we can do some nfc refactoring in InputFile to ensure it doesn’t allocate memory in the ctor call. Perhaps that could work.

Another thing is that the cost of storing temporary symbols’ state before resolution is high. (maskray suggested interning the symbols - which i’m exploring)

(Simple sketch of my “ideal” goal is


)

1 Like

Any ideas how this could map to PE/COFF? I haven’t studied the format that deeply but I know we could benefit if the linker could be made faster.

However, I believe we can do some nfc refactoring in InputFile to ensure it doesn’t allocate memory in the ctor call. Perhaps that could work.

Or we could make a thread-safe bump allocator :slight_smile: It seems like a generally useful thing to have in LLVM…

Maybe there is more in LLVM than just a ThreadPool …
Maybe there is need in LLVM for more complex concurrent data structures than just a ThreadPool.

1 Like

int3:
Or we could make a thread-safe bump allocator :slight_smile: It seems like a generally useful thing to have in LLVM…

tschuett
Maybe there is more in LLVM than just a ThreadPool …
Maybe there is need in LLVM for more complex concurrent data structures than just a ThreadPool.

I feel those two topics point out again the lack in LLVM of thread-safe, lock-free containers & memory allocator. This boils down to the question of whether we should reference or vendor external packages from LLVM. Right now it seems that only “optional” external libs are referenced by LLVM, and any “required” external library has to duplicated/vendored in the monorepo (and inherently fit the LLVM licence).

Writing a cross-platform, lock-free allocator isn’t trivial (although we have SCUDO), and nor is re-writing all the common containers (map, vector, deque, etc) in a thread-safe, lock-free manner.
If we could rely on IntelTBB, that would solve both raised issues (since IntelTBB has tbbmalloc). Adding in mimalloc or rpmalloc would further improve the situation.

Would it be an option to let LLD rely on external submodules?

2 Likes

Popping in here and don’t intend to distract from the LLD discussion much, but the desire for these lock-free/thread-safe structures is not specific to LLD. MLIR heavily relies on threading and the use is growing. Just want to make sure to keep context that the desired features here are not specific to certain sub projects.

– River

I created a new post for the library shortcoming: Parallel/thread-safe algorithms, allocators, and containers We can move these discussions there:)

1 Like

For the ELF port, symbol resolution is still sequential, but otherwise quite a large portion of input file parsing has been made parallel. Major patches

with lots of minor refactoring.