I spent a week to optimize LLD performance and just wanted to share things what I found. Also if there’s anyone who have a good idea on how to make it faster, I’d like to hear.
My focus is mainly on Windows, but my optimizations are generally platform neutral. I aim both single-thread and multi-thread performance.
r231434 is a change that has the largest impact. It greatly reduced linking time to link large binaries, as it changed the order of number of symbols we need to process for files within --start-group/–end-group. If you have many library files in a group, you would see a significant performance gain. It’s probably not often the case on Unix. On the other hand, on Windows (and IIRC on Darwin), we move all library files to end of input file list and group them together, so it’s effective. It improves single-thread performance.
r231454 is to apply relocations in parallel in the writer using parallel_for. Because number of relocations are usually pretty large, and each application is independent, you basically get linear performance gain by using threads. Previously it took about 2 seconds to apply 10M relocations (the number in the commit message is wrong) on my machine, but it’s now 300 milliseconds, for example. This technique should be applicable to other ports.
r231585 changes the algorithm to create base relocations so that we can use parallel_sort. Unfortunately, base relocations are Windows specific, I don’t think this technique is applicable to other ports.
At this point, the resolver is bottleneck. Readers introduce surprisingly small delay when linking large binaries, probably thanks to parallel file loading and archive member preloading. Or just that file parsing is an easy task. Preloading hit rate is >99%, so when you need a symbol from an archive file, its member is almost always parsed and ready to be used. ArchiveFile::find() returns immediately with a result. Writers and other post-resolver passes seem reasonably fast. The dominant factor is the resolver.
What the resolver does is, roughly speaking, reading files until it resolves all symbols, and put all symbols received from files to a hash table. That’s a kind of tight loop. In r231549 I cut number of hash table lookup, but looks like it’s hard to optimize that more than this.
An idea to make the resolver faster would be to use a concurrent hash map to insert new symbols in parallel. Assuming symbols from the same file don’t conflict each other (I think it’s a valid assumption), this can be parallelized. I wanted single-thread performance gain, though. (Also, concurrent hash maps are not currently available in LLVM.)
Another idea is to eliminate a preprocess to create reverse edges of the atom graph. Some edges in the graph need to be treated as bi-directional, so that all connected edges become live or dead as a group regardless of direction of edges (so that depending symbols are not reclaimed by garbage collector). We probably should always add two edges for each bi-directional edge in the first place to eliminate the need of preprocessing entirely. It’s definitely doable.
An interesting idea that unfortunately didn’t work is interning symbol names. I thought that computing a hash value of a symbol name is bottleneck in hash table, as C++ symbols can be very long. So I wrote a thread-safe string pool to intern (unique-fy) symbol strings, so that we can compare symbol equivalence by pointer comparison. String interning was done in the reader, which is parallelized, and I expected it would improve single-thread performance of the resolver. But I didn’t observe meaningful difference in terms of performance.
Bottlenecks were not there where I expected, as always (I learned about that again and again, but I was surprised every time). Maybe we need to revisit “optimized” code in LLD to see if it’s not really premature optimization. If it is, we should rewrite with simple code.
I’m currently trying two ideas above. This mail is just FYI, but if you have any recommendations or ideas or whatever, hit “reply”.