By “parallel with file I/O” do you mean the plan laid out by @oontvoo in Parallel input file parsing? I think the approach in that thread is a good idea (I have a fork with a similar idea, actually), and it looks like it leaves symbol insertion into the map alone, so we can consider the two ideas somewhat independently.
Matching strings asynchronously later might be the way to go. I think we’d also have to wait for that check to finish before reporting any error, to make sure the error wasn’t really caused by a hash collision between two distinct strings. Another possibility: if there’s one specific error that usually occurs in case of a hash collision (I believe that would be a duplicate symbol error), we could just check synchronously when we hit that case. Thoughts?
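To make the control flow concrete, here’s a minimal sketch of what I have in mind. None of these names come from LLD; the table, `startVerification`, and `reportDuplicate` are all hypothetical, and the point is only the ordering: insert by hash, verify the strings on another thread, and join that verification before emitting a duplicate-symbol diagnostic.

```cpp
// Hypothetical sketch: symbol table keyed by a 64-bit hash, with deferred
// byte-for-byte verification of the hashed strings.
#include <cstdint>
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Entry {
  std::string name; // first string inserted for this hash
};

class HashedSymbolTable {
public:
  // Insert by hash only; remember every (hash, name) pair so a background
  // pass can confirm that equal hashes really meant equal strings.
  void insert(const std::string &name) {
    uint64_t h = std::hash<std::string>{}(name); // stand-in for the real hash
    map.try_emplace(h, Entry{name});
    pending.emplace_back(h, name);
  }

  // Kick off the string comparison on another thread.
  void startVerification() {
    verifier = std::async(std::launch::async, [this] {
      for (auto &[h, name] : pending)
        if (map.at(h).name != name)
          return false; // two distinct strings collided on the same hash
      return true;
    });
  }

  // Before reporting an error that a collision could explain (e.g. a
  // duplicate symbol), join the verifier to rule out a hash collision.
  void reportDuplicate(const std::string &name) {
    if (verifier.valid() && !verifier.get()) {
      std::cerr << "internal error: hash collision detected\n";
      return;
    }
    std::cerr << "duplicate symbol: " << name << "\n";
  }

private:
  std::unordered_map<uint64_t, Entry> map;
  std::vector<std::pair<uint64_t, std::string>> pending;
  std::future<bool> verifier;
};

int main() {
  HashedSymbolTable table;
  table.insert("foo");
  table.insert("foo"); // genuine duplicate, not a collision
  table.startVerification();
  table.reportDuplicate("foo");
}
```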
I think it’s also worth unpacking the idea of probabilistic failure here further, in case I could get some consensus on doing it without this change. If we assume that the hash function is “perfectly random”, so that the odds of any two strings having matching hashes are `1 / 2^64`, we can approximate the odds of at least one collision occurring with the birthday-bound formula `n * (n - 1) / (2 * M)`, where `n` is the number of symbols (let’s say 6,000,000 here) and `M` is the number of possible hash values (`2^64`). With those numbers, the odds of at least one collision occurring for Uber are less than 1 in 1,000,000. If we use 96 bits for the hash, the odds go down to less than 1 in 4 quadrillion. We can analyze how close the hash function is to “perfectly random”, for example, by looking at the decrease in hash collisions as we add each bit to the hash backwards and forwards (starting with all bits zeroed out). So, is there maybe a point here where we don’t need duplicate-checking at all?
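For reference, a quick sanity check of that arithmetic (my own back-of-the-envelope numbers, nothing from LLD):

```cpp
// Birthday-bound estimate: p ≈ n * (n - 1) / (2 * M).
#include <cmath>
#include <cstdio>

int main() {
  long double n = 6'000'000.0L;                    // symbol count, per above
  long double pairs = n * (n - 1) / 2.0L;          // number of symbol pairs
  long double p64 = pairs / std::ldexp(1.0L, 64);  // M = 2^64
  long double p96 = pairs / std::ldexp(1.0L, 96);  // M = 2^96
  std::printf("64-bit hash: 1 in %.2Le\n", 1.0L / p64); // ~1 in 1.0e6
  std::printf("96-bit hash: 1 in %.2Le\n", 1.0L / p96); // ~1 in 4.4e15
}
```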