While I've not reviewed the patch in too much detail, it looks
promising. Can you run some end-to-end benchmarks to make sure that
cache pressure in the full program or other variables not accounted
for in a micro-benchmark don't dominate performance? Specifically, the
nightly tester includes a number of real programs and machinery to
measure total compile time.
Ok, now with some kinda-hard numbers!
murmurhash2 | superfasthash
- 6.6404 seconds (6.6697 wall clock) | 6.6204 seconds (6.8557 wall clock)
+ 2.6722 seconds (2.7064 wall clock) | 2.7962 seconds (2.7502 wall clock)
+ 8.6725 seconds (8.6662 wall clock) | 8.7526 seconds (8.7162 wall clock)
+ 2.7362 seconds (2.7729 wall clock) | 2.8242 seconds (2.8146 wall clock)
+ 1.4281 seconds (1.4068 wall clock) | 1.4761 seconds (1.4198 wall clock)
+ 4.3163 seconds (4.3392 wall clock) | 4.3683 seconds (4.3969 wall clock)
+ 6.6804 seconds (6.6916 wall clock) | 6.7804 seconds (6.6950 wall clock)
+ 2.7682 seconds (2.7342 wall clock) | 2.7802 seconds (2.7321 wall clock)
- 8.7365 seconds (8.9505 wall clock) | 8.6965 seconds (8.7124 wall clock)
+ 2.7962 seconds (2.7812 wall clock) | 2.8282 seconds (2.8049 wall clock)
+ 1.3881 seconds (1.4103 wall clock) | 1.4001 seconds (1.4103 wall clock)
- 4.3843 seconds (4.3509 wall clock) | 4.3723 seconds (4.3714 wall clock)
+ 6.5724 seconds (6.7971 wall clock) | 6.6684 seconds (6.6465 wall clock)
+ 2.6401 seconds (2.7378 wall clock) | 2.6682 seconds (2.7965 wall clock)
+ 8.7005 seconds (8.7068 wall clock) | 8.7766 seconds (8.8452 wall clock)
+ 2.7282 seconds (2.7680 wall clock) | 2.7762 seconds (2.7904 wall clock)
- 1.4561 seconds (1.4230 wall clock) | 1.4201 seconds (1.4184 wall clock)
- 4.4523 seconds (4.3754 wall clock) | 4.4003 seconds (4.3706 wall clock)
This table was obtained by running 3 times 'make TEST=nightly report.html -j2
&& l Output/ | grep info | xargs cat | grep 'Total Execution Time' >> timings
&& make clean' or something like that, on MultiSource/Applications/ClamAV.
It looks like FoldingSetNodeID is not the most performance-critical part of
LLVM, but changing the hash algorithm helps a bit.
Gregory
+/// This version additionally assumes that 'len % 4 == 0'. Below are the
+/// original notes.
+///
+/// Note - This code makes a few assumptions about how your machine behaves -
+///
+/// 1. We can read a 4-byte value from any address without crashing
Doesn't it only need the start of the data to be 4-byte aligned?
That seems to be the case already, since this is a smallvector of unsigned.
+/// 2. sizeof(int) == 4
How about using uint32_t/int32_t instead of int then?
Best regards,
--Edwin
Hi Gregory,
These numbers are so noisy that they aren't particularly useful. Could you try instrumenting FoldingSet to keep track of the number of collisions and the number of hash table resizes and compare those? They should be much more stable and still correlate directly to performance.
-Chris
You are right in both cases -- these are notes (and source) from the original
implementation, I decided to leave them as-is for easier verification of the
hashing algorithm.
Gregory
Yup -- they are extremely noisy. Very good idea -- will give it a try.
Gregory
OK, now with real numbers!
First, the main thing: SuperFastHash appears to be the hash with the best
distribution. Using MurmurHash instead generates 1.28% more collisions while
running the nightly test in MultiSource/Applications.
Second: I've also tested lookup3 hash, and its use generated 0.1% more
collisions, compared to SFH.
These results were a bit surprising to me!
The number of hash table resizes is independent of the hashing algorithm used,
because the hash table grows when 'nentries > nbuckets * 2'.
Gregory
Thanks for doing the evaluation! It sounds like we should stay with SFH. In addition to fewer collisions, it is cheaper to compute.
-Chris
While I agree that staying with SFH makes sense, I should note that in
microbenchmarks, MMH wins by a big margin (see the original patch).
Gregory
And now something really surprising, Bernstein hash
unsigned bernstein_hash(const unsigned char *str, size_t len)
{
    unsigned hash = 5381;
    unsigned c;
    while (len--) {
        c = *str++;
        hash = (hash * 33) ^ c;
    }
    return hash;
}
gives 1.48% fewer (sic!) collisions than SFH! That is fewer than lookup3, MMH,
and FNV, too. I have no idea why such a simple hash is so good for LLVM. Maybe
because LLVM hashes lots of pointers?
But BH is ~2 times slower than SFH.
Overall, it looks like
1) The number of collisions for different hashes is within a 5% interval.
2) MMH is the fastest by a big margin, FNV / BH are the slowest by a big margin.
3) Any hash will do for FoldingSet.
Gregory