[PATCH] FoldingSetNodeID: use MurmurHash2 instead of SuperFastHash

Some additional info can be found at:

  http://murmurhash.googlepages.com/
  MurmurHash - Wikipedia
  Hash Functions: An Empirical Comparison - CodeProject

as well as in the patch description itself. Patch and benchmark attached.
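
For anyone who doesn't want to open the attachment: the core of MurmurHash2,
trimmed to the 'len % 4 == 0' case the patch relies on, looks roughly like
this (Appleby's published constants; a sketch, not the attached code verbatim):

  #include <stddef.h>
  #include <stdint.h>

  /* Sketch only -- see the attached patch for the real implementation. */
  static uint32_t murmur2_words(const uint32_t *data, size_t nwords,
                                uint32_t seed) {
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t)(nwords * 4);   /* length in bytes */

    for (size_t i = 0; i != nwords; ++i) {
      uint32_t k = data[i];      /* whole-word reads, no tail handling */
      k *= m;
      k ^= k >> r;
      k *= m;
      h *= m;
      h ^= k;
    }

    /* Final avalanche. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
  }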

        Gregory

0001-FoldingSetNodeID-use-MurmurHash2-instead-of-SuperFas.patch (4.1 KB)

hash_bench.c (3.56 KB)

While I've not reviewed the patch in too much detail, it looks
promising. Can you run some end-to-end benchmarks to make sure that
cache pressure in the full program or other variables not accounted
for in a micro-benchmark don't dominate performance? Specifically the
nightly tester includes a number of real programs and machinery to
measure total compile time.

Ok, now with some kinda-hard numbers!

          murmurhash2                | superfasthash

- 6.6404 seconds (6.6697 wall clock) | 6.6204 seconds (6.8557 wall clock)
+ 2.6722 seconds (2.7064 wall clock) | 2.7962 seconds (2.7502 wall clock)
+ 8.6725 seconds (8.6662 wall clock) | 8.7526 seconds (8.7162 wall clock)
+ 2.7362 seconds (2.7729 wall clock) | 2.8242 seconds (2.8146 wall clock)
+ 1.4281 seconds (1.4068 wall clock) | 1.4761 seconds (1.4198 wall clock)
+ 4.3163 seconds (4.3392 wall clock) | 4.3683 seconds (4.3969 wall clock)

+ 6.6804 seconds (6.6916 wall clock) | 6.7804 seconds (6.6950 wall clock)
+ 2.7682 seconds (2.7342 wall clock) | 2.7802 seconds (2.7321 wall clock)
- 8.7365 seconds (8.9505 wall clock) | 8.6965 seconds (8.7124 wall clock)
+ 2.7962 seconds (2.7812 wall clock) | 2.8282 seconds (2.8049 wall clock)
+ 1.3881 seconds (1.4103 wall clock) | 1.4001 seconds (1.4103 wall clock)
- 4.3843 seconds (4.3509 wall clock) | 4.3723 seconds (4.3714 wall clock)

+ 6.5724 seconds (6.7971 wall clock) | 6.6684 seconds (6.6465 wall clock)
+ 2.6401 seconds (2.7378 wall clock) | 2.6682 seconds (2.7965 wall clock)
+ 8.7005 seconds (8.7068 wall clock) | 8.7766 seconds (8.8452 wall clock)
+ 2.7282 seconds (2.7680 wall clock) | 2.7762 seconds (2.7904 wall clock)
- 1.4561 seconds (1.4230 wall clock) | 1.4201 seconds (1.4184 wall clock)
- 4.4523 seconds (4.3754 wall clock) | 4.4003 seconds (4.3706 wall clock)

This table was obtained by running 'make TEST=nightly report.html -j2
&& l Output/ | grep info | xargs cat | grep 'Total Execution Time' >> timings
&& make clean' (or something close to that) three times, on
MultiSource/Applications/ClamAV. A '+' marks a row where MurmurHash2 came out
faster, a '-' one where it came out slower.

It looks like FoldingSetNodeID is not the most performance-critical part of
LLVM, but changing the hash algorithm helps a bit.

        Gregory

+/// This version additionally assumes that 'len % 4 == 0'. Below are the
+/// original notes.
+///
+/// Note - This code makes a few assumptions about how your machine behaves -
+///
+/// 1. We can read a 4-byte value from any address without crashing

Doesn't it only need the start of the data to be 4-byte aligned?
That seems to be the case already, since this is a SmallVector of unsigned.

+/// 2. sizeof(int) == 4

How about using uint32_t/int32_t instead of int then?
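
For example (a generic sketch, nothing taken from the patch), reading each
word through a fixed-width type with memcpy would make both assumptions
explicit:

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical helper: uint32_t pins the width regardless of sizeof(int),
     and memcpy sidesteps the alignment question entirely (with optimization
     it typically compiles down to a single load anyway). */
  static uint32_t read_u32(const unsigned char *p) {
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
  }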

Best regards,
--Edwin

Hi Gregory,

These numbers are so noisy that they aren't particularly useful. Could you try instrumenting FoldingSet to keep track of the number of collisions and the number of hash table resizes and compare those? They should be much more stable and still correlate directly with performance.

-Chris
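
A rough illustration of the suggested instrumentation (generic chained-table
code, not FoldingSetImpl's actual internals): bump a counter whenever an
insert lands in a bucket that already holds a node.

  #include <stddef.h>

  /* Illustrative only. */
  struct node {
    struct node *next;
    unsigned hash;
    /* ...payload... */
  };

  static unsigned long num_collisions;
  static unsigned long num_resizes;    /* bump wherever the table grows */

  static void insert_node(struct node **buckets, size_t nbuckets,
                          struct node *n) {
    /* nbuckets is assumed to be a power of two here. */
    struct node **bucket = &buckets[n->hash & (nbuckets - 1)];
    if (*bucket)
      ++num_collisions;                /* bucket already occupied */
    n->next = *bucket;
    *bucket = n;
  }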

You are right in both cases -- these are notes (and source) from the original
implementation, I decided to leave them as-is for easier verification of the
hashing algorithm.

  Gregory

Yup -- they are extremely noisy. Very good idea -- will give it a try.

  Gregory

OK, now with real numbers :)

First, the main thing: SuperFastHash appears to be the hash with the best
distribution. Using MurmurHash instead generates 1.28% more collisions while
running the nightly test in MultiSource/Applications.

Second: I've also tested the lookup3 hash, and its use generated 0.1% more
collisions compared to SFH.

These results were a bit surprising to me!

The number of hash table resizes is independent of the hashing algorithm used,
because the hash table grows when 'nentries > nbuckets * 2'.
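
(Illustrative sketch, not the actual FoldingSet source: the growth test looks
only at the two counts, so every hash function sees exactly the same sequence
of resizes for the same insertions.)

  #include <stddef.h>

  /* Grow (e.g. double) the table once the element count exceeds twice the
     bucket count; nothing here depends on how the elements hashed. */
  static int should_grow(size_t nentries, size_t nbuckets) {
    return nentries > nbuckets * 2;
  }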

  Gregory

Thanks for doing the evaluation! It sounds like we should stay with SFH. In addition to producing fewer collisions, it is cheaper to compute.

-Chris

While I agree that staying with SFH makes sense, I should note that in
microbenchmarks, MMH wins by a big margin (see the original patch).

  Gregory

And now for something really surprising: the Bernstein hash

  #include <stddef.h>   /* size_t */

  /* Bernstein's hash, XOR variant: multiply by 33, then XOR in each byte. */
  unsigned bernstein_hash(const unsigned char *str, size_t len)
  {
    unsigned hash = 5381;
    unsigned c;

    while (len--) {
      c = *str++;
      hash = (hash * 33) ^ c;
    }

    return hash;
  }

gives 1.48% fewer (sic!) collisions than SFH! That means fewer than lookup3,
MMH and FNV, too. I have no idea why this simplest of hashes is so good for
LLVM. Maybe it's because LLVM hashes lots of pointers?
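
To make that guess concrete, the typical input here is a few unsigned words,
often including a pointer split into words. A hypothetical driver, just to
show the shape of the data (bernstein_hash is the function above):

  #include <stddef.h>
  #include <stdio.h>
  #include <string.h>

  unsigned bernstein_hash(const unsigned char *str, size_t len); /* as above */

  int main(void) {
    int dummy;
    void *p = &dummy;

    /* A made-up FoldingSetNodeID-style record: a small tag word followed by
       a pointer copied into unsigned words. Much of the variation between
       keys tends to sit in the pointer bytes. */
    unsigned words[1 + sizeof(void *) / sizeof(unsigned)] = { 42 };
    memcpy(&words[1], &p, sizeof p);

    printf("%08x\n",
           bernstein_hash((const unsigned char *)words, sizeof words));
    return 0;
  }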

But BH is ~2 times slower than SFH.

Overall, it looks like:
1) The number of collisions for the different hashes is within a 5% interval.
2) MMH is the fastest by a big margin; FNV / BH are the slowest by a big margin.
3) Any hash will do for FoldingSet.

  Gregory