Storing the hash of each symbol name instead of the name itself (mach-o)

Looking up symbols by name in the symbol table is a major bottleneck in the Mach-O lld linker. One way to speed it up is to store the hash itself as the key, meaning that, whenever keys are tested for equality, there’s no strcmp or pointer chasing required; the uint64_t keys are simply compared byte-for-byte. The risk, of course, is hash collisions. In the 1.2GB binary I’m testing against, with ~6,000,000 unique symbol names, there are no hash collisions when using XXH3. Even when artificially doubling the number of symbol names to 12,000,000 and zeroing out the top 16 bits of each hash, there are still no collisions. Doing this for the main symbol table (symMap) reduces linking time by about 5%, and there are other hash maps we could also speed up once we assume the hashes are unique. This could be nice to have, if not as a default, then at least as a flag that one can toggle depending on their unique symbol count and risk tolerance.
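A minimal sketch of the idea (Python for brevity; lld itself is C++, and the names here are hypothetical — the real table would use XXH3, not the blake2b stand-in below): the map is keyed by the 64-bit hash alone, so every lookup and equality test is an integer compare rather than a string compare.

```python
# Sketch: a symbol map keyed by a 64-bit hash of the name instead of
# the name itself. blake2b truncated to 8 bytes is a stand-in for XXH3.
import hashlib

def hash64(name: str) -> int:
    # 64-bit digest of the symbol name; once this is the map key,
    # comparisons are plain integer compares -- no strcmp needed.
    return int.from_bytes(
        hashlib.blake2b(name.encode(), digest_size=8).digest(), "little")

class HashKeyedSymMap:
    def __init__(self):
        self._map = {}  # uint64 hash -> symbol record

    def insert(self, name, symbol):
        self._map[hash64(name)] = symbol  # assumes hashes are collision-free

    def lookup(self, name):
        return self._map.get(hash64(name))

syms = HashKeyedSymMap()
syms.insert("_main", {"addr": 0x1000})
print(syms.lookup("_main"))  # found without ever comparing the string
```

The trade-off is exactly the one discussed below: two distinct names hashing to the same key would silently alias one symbol record.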

Interesting idea, but the notion that your toolchain may probabilistically and intentionally betray you doesn’t sit right.

My first thought was that we could verify the strings match asynchronously. Symbol hashing is only done once at file load time, meaning that we have the rest of the linker’s running time to verify that there are no hash collisions. In the rare situation where we do have a collision, we could abort and redo the link with full string comparisons.
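A sketch of what that deferred check could look like (hypothetical structure, not lld’s actual code): record the (hash, name) pairs seen at load time, then verify off the critical path that no hash maps to two distinct names, joining before the link is allowed to report success.

```python
# Sketch of deferred collision verification: hashing happens at file
# load time; the check that no two distinct names share a hash runs on
# a background thread while the rest of the link proceeds.
import threading

def verify_no_collisions(pairs, result):
    seen = {}  # hash -> first name observed with that hash
    for h, name in pairs:
        if h in seen and seen[h] != name:
            # Two distinct names, one hash: the fast link is unsound,
            # so we'd abort and redo with full string comparisons.
            result["collision"] = (seen[h], name)
            return
        seen[h] = name
    result["collision"] = None

# Same name seen twice with the same hash is fine; only distinct
# names sharing a hash count as a collision.
pairs = [(1, "_foo"), (2, "_bar"), (1, "_foo")]
result = {}
t = threading.Thread(target=verify_no_collisions, args=(pairs, result))
t.start()
# ... rest of the link runs here ...
t.join()  # must complete before declaring the link good
assert result["collision"] is None
```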

My next thought was that if we are going to leverage parallelism anyway, maybe it would make more sense to first parallelize the input file loading. Perhaps the string comparisons won’t look so costly when done in parallel with file I/O. This is something we’ve discussed previously (and @oontvoo has been pretty interested in building). It will be a good chunk of work though, so probably not something that will land any time soon.

By “parallel with file I/O” do you mean the plan laid out by @oontvoo in Parallel input file parsing? I think the approach in that thread is a good idea (I have a fork with a similar idea, actually), and it looks like it leaves symbol insertion into the map alone, meaning we can look at the two ideas somewhat independently.

Matching strings asynchronously later might be the way to go. I think we’d also have to wait for the verification to finish before reporting any error, to ensure that the cause was not a hash collision between distinct strings. Another possibility: if there’s one specific error that usually occurs in case of a hash collision, we could just check synchronously when we hit that case (I believe that would be a duplicate symbol error). Thoughts?

I think it’s also worth unpacking the idea of probabilistic failure here further, in case I can get some consensus on doing this without the verification step. Assume the hash function is “perfectly random”, so the odds of any two strings having matching hashes are 1 / 2**64. We can then approximate the odds of at least one collision with the birthday-bound formula n(n - 1) / (2M), where n is the number of symbols (let’s say 6,000,000 here) and M is the number of possible hash values (2**64). With those numbers, the odds of at least one collision occurring for Uber are less than 1 in 1,000,000. If we use 96 bits for the hash, the odds go down to less than 1 in 4 quadrillion. We can also analyze how close the hash function is to “perfectly random”, for example by looking at the decrease in hash collisions as we add each bit to the hash backwards and forwards (starting with all bits zeroed out). So, is there maybe a point here where we don’t need duplicate-checking at all?
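The birthday-bound numbers above can be checked directly; a quick sketch with the same inputs as in the text:

```python
# Approximate probability of at least one collision among n uniformly
# random k-bit hashes (birthday bound): p ~= n * (n - 1) / (2 * 2**k).
def collision_prob(n: int, bits: int) -> float:
    return n * (n - 1) / (2 * 2**bits)

n = 6_000_000
p64 = collision_prob(n, 64)  # ~9.8e-7: less than 1 in 1,000,000
p96 = collision_prob(n, 96)  # ~2.3e-16: less than 1 in 4 quadrillion
print(p64, p96)
```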

> By “parallel with file I/O” do you mean the plan laid out by @oontvoo in Parallel input file parsing? I think the approach in that thread is a good idea (I have a fork with a similar idea actually), and it looks like it leaves symbol insertion into the map alone, meaning we can look at the two ideas somewhat independently.

It would be nice to have fully parallel symbol resolution, but yeah that may not be practical / easy to implement. @oontvoo’s proposed design does seem like a reasonable alternative.

I guess there are two possible failure modes here: one is a false duplicate symbol error, and the other is that the linker could conclude that a missing symbol is present and link successfully (but incorrectly). So just double-checking in case of a collision would still leave us open to (admittedly very unlikely) mislinks.

But thinking a bit more about it, I guess gating it behind a flag is fine. You’re right that the actual risk is very low, and as long as the end user is aware that they’re taking on that risk, it’s all good.