By “parallel with file I/O” do you mean the plan laid out by @oontvoo in Parallel input file parsing? I think the approach in that thread is a good idea (I have a fork with a similar idea, actually), and it looks like it leaves symbol insertion into the map alone, so we can consider the two ideas somewhat independently.
Matching strings asynchronously later might be the way to go. I think we’d also have to wait for that check to finish before reporting any error, to make sure the error wasn’t really caused by a hash collision between two distinct strings. Another possibility: if there’s one specific error that usually occurs in case of a hash collision (I believe that would be a duplicate symbol error), we could just check synchronously when we hit that case. Thoughts?
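To make the control flow concrete, here’s a minimal sketch of what I have in mind. None of these names come from LLD; the table, `startVerification`, and `reportDuplicate` are all hypothetical, and the point is only the ordering: insert by hash, verify the strings on another thread, and join that verification before emitting a duplicate-symbol diagnostic.

```cpp
// Hypothetical sketch: symbol table keyed by a 64-bit hash, with deferred
// byte-for-byte verification of the hashed strings.
#include <cstdint>
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Entry {
  std::string name; // first string inserted for this hash
};

class HashedSymbolTable {
public:
  // Insert by hash only; remember every (hash, name) pair so a background
  // pass can confirm that equal hashes really meant equal strings.
  void insert(const std::string &name) {
    uint64_t h = std::hash<std::string>{}(name); // stand-in for the real hash
    map.try_emplace(h, Entry{name});
    pending.emplace_back(h, name);
  }

  // Kick off the string comparison on another thread.
  void startVerification() {
    verifier = std::async(std::launch::async, [this] {
      for (auto &[h, name] : pending)
        if (map.at(h).name != name)
          return false; // two distinct strings collided on the same hash
      return true;
    });
  }

  // Before reporting an error that a collision could explain (e.g. a
  // duplicate symbol), join the verifier to rule out a hash collision.
  void reportDuplicate(const std::string &name) {
    if (verifier.valid() && !verifier.get()) {
      std::cerr << "internal error: hash collision detected\n";
      return;
    }
    std::cerr << "duplicate symbol: " << name << "\n";
  }

private:
  std::unordered_map<uint64_t, Entry> map;
  std::vector<std::pair<uint64_t, std::string>> pending;
  std::future<bool> verifier;
};

int main() {
  HashedSymbolTable table;
  table.insert("foo");
  table.insert("foo"); // genuine duplicate, not a collision
  table.startVerification();
  table.reportDuplicate("foo");
}
```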
I think it’s also worth unpacking the idea of probabilistic failure here further, in case I could get some consensus on doing it without this change. If we assume that the hash function is “perfectly random”, so that the odds of any two strings having matching hashes are `1 / 2^64`, we can approximate the odds of at least one collision occurring with the birthday-bound formula `n * (n - 1) / (2 * M)`, where `n` is the number of symbols (let’s say 6,000,000 here) and `M` is the number of possible hash values (`2^64`). With those numbers, the odds of at least one collision occurring for Uber are less than 1 in 1,000,000. If we use 96 bits for the hash, the odds go down to less than 1 in 4 quadrillion. We can analyze how close the hash function is to “perfectly random”, for example, by looking at the decrease in hash collisions as we add each bit to the hash backwards and forwards (starting with all bits zeroed out). So, is there maybe a point here where we don’t need duplicate-checking at all?
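For reference, a quick sanity check of that arithmetic (my own back-of-the-envelope numbers, nothing from LLD):

```cpp
// Birthday-bound estimate: p ≈ n * (n - 1) / (2 * M).
#include <cmath>
#include <cstdio>

int main() {
  long double n = 6'000'000.0L;                    // symbol count, per above
  long double pairs = n * (n - 1) / 2.0L;          // number of symbol pairs
  long double p64 = pairs / std::ldexp(1.0L, 64);  // M = 2^64
  long double p96 = pairs / std::ldexp(1.0L, 96);  // M = 2^96
  std::printf("64-bit hash: 1 in %.2Le\n", 1.0L / p64); // ~1 in 1.0e6
  std::printf("96-bit hash: 1 in %.2Le\n", 1.0L / p96); // ~1 in 4.4e15
}
```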