Hi,
I’ve been looking at Chandler Carruth’s talks about internal data structures in LLVM in order to implement my own library. That’s how I ended up looking at the internals of DenseMap and the default hash functions defined in DenseMapInfo.
I have been very surprised by the hash function used for integers, which is hash(k) = 37 * k. This kind of hash function does not change the lowest bits of k, and if your keys all happen to be multiples of 8, you end up never using about 90% of the slots of an open addressing hash table. By the way, does anyone know the rationale behind using 37?
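To make this concrete, here is a minimal sketch (the table size 2^10 and the key count are arbitrary choices of mine):

#include <cstdint>
#include <iostream>
#include <set>

// With hash(k) = 37 * k and 2^10 slots, keys that are all multiples of 8
// reach only 1/8 of the slots: multiplying by an odd constant does not
// change the number of trailing zero bits.
int main() {
  const std::uint64_t num_slots = std::uint64_t{1} << 10;
  std::set<std::uint64_t> used;
  for (std::uint64_t k = 0; k < 100000; k += 8) {
    used.insert((37 * k) % num_slots);
  }
  std::cout << used.size() << " of " << num_slots << " slots used\n";
  // Prints: 128 of 1024 slots used, i.e. 87.5% of the slots are never hit.
}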
In order to improve this, it could be nice to use a better hash function such as the Knuth hash which is defined as:
#include <cassert>
#include <cstddef>

// Knuth multiplicative hash: keep the top p bits of val * knuth.
std::size_t hash_value(std::size_t val, int p) {
  assert(p >= 1 && p <= 64);  // p == 0 would shift by 64: undefined behavior
  const std::size_t knuth = 11400714819323198485u;  // floor(2^64 * (sqrt(5) - 1) / 2)
  return (val * knuth) >> (64 - p);
}
The big advantage of this function is that it shuffles all the bits and does not suffer from the problem given above. It is also quite cheap to compute. Unfortunately, one needs to change the signature of hash_value: to hash an integer k to an integer in {0, 1, 2, …, 2^p - 1}, the hash is no longer of the form hash(k) % 2^p; it needs to be of the form hash(k, p).
As I don’t know anything about the LLVM source code, here are a few questions:
- Is DenseMap used a lot with integers as a key in LLVM? Do you think it would benefit from a better default hash function?
- Is there any benchmark I can run that makes heavy use of those hash functions, so I can see if my changes would improve performance?
Best regards,
François Fayard
Founder & Consultant - Inside Loop
Applied Mathematics & High Performance Computing
Tel: +33 (0)6 01 44 06 93
Web: www.insideloop.io
> I have been very surprised by the hash function used for integers,
> which is hash(k) = 37 * k. This kind of hash function does not change
> the lowest bits of k, and if your keys all happen to be multiples of 8,
> you end up never using about 90% of the slots of an open addressing
> hash table. By the way, does anyone know the rationale behind using 37?
37 is a prime. Multiples of 8 that are otherwise evenly distributed will
map to different slots in the output space.
> In order to improve this, it could be nice to use a better hash
> function such as the Knuth hash which is defined as:
64-bit multiplications are very expensive on many 32-bit platforms. It is
certainly an order of magnitude slower than the existing hash.
Joerg
> 37 is a prime. Multiples of 8 that are otherwise evenly distributed will
> map to different slots in the output space.
As the hash is reduced modulo 2^p (the number of slots), if your hash function is of the form hash(k) = c * k, I understand that you want to be able to fill all the slots. One can show that we get this property if and only if c is odd. So that leaves a lot of choices: 1, 3, 5, 7, 9, etc. I don’t see any reason not to use c = 1 if you want to use that kind of hash function.
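Here is a minimal sketch checking this property (p = 16 is an arbitrary choice of mine):

#include <cstdint>
#include <iostream>
#include <vector>

// Check that hash(k) = c * k modulo 2^p fills every slot when c is odd:
// any odd c is invertible modulo 2^p, so k -> c * k is a bijection.
int main() {
  const int p = 16;
  const std::uint32_t m = std::uint32_t{1} << p;  // number of slots
  for (std::uint32_t c : {1u, 3u, 5u, 37u}) {
    std::vector<bool> used(m, false);
    for (std::uint32_t k = 0; k < m; ++k) {
      used[(c * k) % m] = true;
    }
    unsigned count = 0;
    for (bool b : used) count += b;
    std::cout << "c = " << c << ": " << count << " of " << m << " slots\n";
  }
}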
> 64-bit multiplications are very expensive on many 32-bit platforms. It is
> certainly an order of magnitude slower than the existing hash.
There is also a 32-bit version of the Knuth hash:
#include <cassert>
#include <cstdint>

// 32-bit Knuth multiplicative hash: keep the top p bits of val * knuth.
std::uint32_t hash(std::uint32_t val, int p) {
  assert(p >= 1 && p <= 32);  // p == 0 would shift by 32: undefined behavior
  const std::uint32_t knuth = 2654435769u;  // round(2^32 * (sqrt(5) - 1) / 2)
  return (val * knuth) >> (32 - p);
}
Those numbers come from the following fact. Let alpha = (-1 + sqrt(5)) / 2, which is about 0.618034. Multiply this number by 2^32 (or 2^64 for a 64-bit hash) and round to the nearest integer: you get 2 654 435 769. alpha has been chosen because it is an irrational number between 0 and 1 that is very difficult to approximate by rational numbers. One can show that all irrational numbers that are roots of a degree-2 monic polynomial with integer coefficients have this property (X^2 + X - 1 = 0 for alpha).
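For reference, both constants can be recomputed directly from alpha (a sketch; the last digits of the 64-bit constant depend on long double precision):

#include <cmath>
#include <cstdio>

// Recompute the Knuth multipliers round(alpha * 2^32) and round(alpha * 2^64)
// with alpha = (sqrt(5) - 1) / 2.
int main() {
  const long double alpha = (std::sqrt(5.0L) - 1.0L) / 2.0L;
  std::printf("32-bit: %llu\n",
              (unsigned long long)(alpha * 4294967296.0L + 0.5L));
  std::printf("64-bit: %llu\n",
              (unsigned long long)(alpha * 18446744073709551616.0L + 0.5L));
  // Expected: 2654435769 and (up to rounding) 11400714819323198485.
}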
I wrote my own hash table using open addressing and quadratic probing. I tried to insert n = 2^26 keys generated as follows: generate a random number between 0 and 2^(64-3) and multiply it by 8. Then search for all those keys in the order they were inserted. Here are the timings I get:
- hash(k) = k: insertion 7.42s, searching 4.09s
- hash(k) = 37 * k: insertion 7.49s, searching 5.50s
- Knuth hashing: insertion 6.04s, searching 2.54s
So using Knuth hashing is really worth it when the keys happen to be of the form 2^a * k0 + k1, and multiplying by 37 is not a good idea here. I am still very puzzled, and I don’t really understand the situations where it could be useful.
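For the record, the key generation can be sketched as follows (the generator and seed are arbitrary choices of mine; the hash table itself is not shown):

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// n = 2^26 keys, each a random multiple of 8.
std::vector<std::uint64_t> make_keys() {
  const std::size_t n = std::size_t{1} << 26;
  std::mt19937_64 rng(0);
  std::uniform_int_distribution<std::uint64_t> dist(
      0, (std::uint64_t{1} << 61) - 1);
  std::vector<std::uint64_t> keys(n);
  for (std::uint64_t& k : keys) {
    k = dist(rng) * 8;  // every key is a multiple of 8
  }
  return keys;
}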
François Fayard
As others have pointed out, 37 does have some nice properties (by being prime), but we’ve also had it from the very early days. I wouldn’t say that it has been extremely well considered at all.
-Chris
Thanks for your reply. But I am not sure I understand the last sentence. Does it mean that this hash function has been criticized before? Do you also think that it can be improved?
I am sorry to insist, but I don’t see any advantage in using hash(k) = c * k with c being a prime number. What is important is choosing c such that gcd(c, m) = 1 for any size m of the hash table, and this is the case for c = 1. As a consequence, I don’t see any benefit in using c = 37 over c = 1. Moreover, all those hash functions are poorly regarded for a reason: they behave very poorly when the keys are of the form d * k0 + k1.
I really think that this hash function could be improved. But to know if it’s worth it, I would like to know if having a better hash for integer types would benefit LLVM.
François Fayard
> Some tests I can suggest are to replace the hash function with your favorite and:
Thanks. I’ll give it a try. It will take some time as I need to rewrite DenseMap if I want to use the Knuth multiplicative hash.
> ultimately I believe real-world impact is the best way to get a change in.
That’s exactly what I think. I was just trying to get some hints on what parts of the code could benefit from a better hash function for integers, in order to find a benchmark which is relevant.
> If this is used for hashing pointers, then it is likely gonna be used *everywhere*.
> Even a simple test like “time to deserialize a large bitcode file” would be a good start.
It is not used for pointers, as this function would be truly awful for them: because of alignment, pointers are of the form 2^a * k0. The hash function used for pointers is (k >> 4) ^ (k >> 9). My guess is that the k >> 4 is there because pointers returned by malloc are 16-byte aligned. I don’t have any guess for the 9 though.
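In code, the shape of that pointer hash is roughly (a sketch of the formula above, not a verbatim copy of LLVM’s DenseMapInfo):

#include <cstdint>

// Hash a pointer by xoring two right-shifted copies of its address.
unsigned pointer_hash(const void* ptr) {
  const std::uintptr_t k = reinterpret_cast<std::uintptr_t>(ptr);
  return static_cast<unsigned>((k >> 4) ^ (k >> 9));
}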
So the hashing function for pointers is not flawed like the one for integers.
François Fayard
Both are not very sophisticated.
You should also look at the different MurmurHash versions, and descendants such as CityHash.
They are only very slightly more complex (e.g. two multiplies and a rotate) but of extremely good quality, approaching MD5 or SHA for randomness and independence of resulting bits (but not secure against malicious attacks like they are).
I’ve successfully used simply k (i.e. c = 1) for both integers and pointers in hash tables with prime number sizes. It works fine, and they usually have non-overlapping ranges in practice. Hashing structures and strings is the more interesting challenge.
But if you’re using power of two (or at least non-prime) table sizes then you should munch up the bits a lot more with multipliers with a lot more bits, not just something like 37.
It may be that these hash tables don’t have many entries and don’t have many collisions and 37 is fine, and a more expensive hash function would slow it down. Only testing can tell you.
Thanks for your thoughts.
> Both are not very sophisticated.
> You should also look at the different MurmurHash versions, and descendants
> such as CityHash.
I’ll give those MurmurHash functions a try.
> I’ve successfully used simply k (i.e. c = 1) for both integers and pointers
> in hash tables with prime number sizes. It works fine, and they usually
> have non-overlapping ranges in practice. Hashing structures and strings is
> the more interesting challenge.
Using hash(k) = k when the size of the table is a prime number is OK. The bad news comes when you start using 2^p as the size of your hash table and the keys are of the form 2^a * k0 + k1.
> But if you’re using power of two (or at least non-prime) table sizes then
> you should munch up the bits a lot more with multipliers with a lot more
> bits, not just something like 37.
> It may be that these hash tables don’t have many entries and don’t have
> many collisions and 37 is fine, and a more expensive hash function would
> slow it down. Only testing can tell you.
I’ll first try to make things worse, e.g. using hash(k) = k for pointers, which should be really bad, and see how much of a difference it makes.
François Fayard
The reason for avoiding c = 1 is the trivial aliasing set for power-of-two
differences. Those tend to be fairly common in any data set. Besides
that, 37 works somewhat decently for many small differences, it can be
computed easily without multiplication where necessary, and no one has
bothered enough to compare different hash functions across different
architectures. E.g. the current hash is simple, avoids some of the
obvious bad cases and is itself decently fast. Anything else needs
careful analysis of the data.
Joerg
> The reason for avoiding c = 1 is the trivial aliasing set for power-of-two
> differences. Those tend to be fairly common in any data set.
That’s where I completely disagree: multiplying by 37 won’t help.
Suppose that you have two integers a and b such that a = b (modulo 8). Then you also have 37 * a = 37 * b (modulo 8), so multiplying by 37 does not change anything here. Prime numbers are useful for the size of the hash table, where they kill those trivial aliasing sets, which is why hash(k) = k is a decent hash for such tables. But I really don’t see any benefit in multiplying the keys by a prime number.
> E.g. the current hash is simple, avoids some of the
> obvious bad cases and is itself decently fast. Anything else needs
> careful analysis of the data.
I disagree with the claim that it avoids the obvious bad cases.
François Fayard
I did a few benchmarks this morning, trying to tweak the hashing for pointers (as many people seem to use pointers as keys). The hash function in LLVM is quite simple, but it gives very good results. The benchmark was compiling a C++ file with clang. I ran 10 tests for every version of clang/llvm compiled with each hash function; here are the best timings:
- Original LLVM, hash(k) = (k >> 4) ^ (k >> 9): 3.40s
- Stupid hash, hash(k) = k: 3.55s
- A bit less stupid, hash(k) = k >> unused_bits (with sizeof(T) = 2^unused_bits): 3.47s
- MurmurHash3: 3.47s
So different hash functions do make a difference. For pointers, the one used by LLVM seems quite good: it gives the best performance here. It is quite surprising that MurmurHash3 does not behave better than the “a bit less stupid” hash. My guess is that the LLVM hash function wins because it is tailored to the distribution of pointers returned by malloc. As I said, the (k >> 4) seems to be related to the fact that pointers are 16-byte aligned on most OSes. I still don’t understand the 9.
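For clarity, here are the three simple variants as code (a sketch; the function names are mine, not LLVM’s):

#include <cstddef>
#include <cstdint>

// Original LLVM hash for a pointer value k.
unsigned hash_original(std::uintptr_t k) {
  return static_cast<unsigned>((k >> 4) ^ (k >> 9));
}

// "Stupid" hash: the address itself.
unsigned hash_identity(std::uintptr_t k) {
  return static_cast<unsigned>(k);
}

// "A bit less stupid": shift out the bits that are always zero when the
// pointee type T has sizeof(T) = 2^unused_bits.
template <typename T>
unsigned hash_shifted(std::uintptr_t k) {
  unsigned unused_bits = 0;
  for (std::size_t s = sizeof(T); s > 1; s >>= 1) ++unused_bits;
  return static_cast<unsigned>(k >> unused_bits);
}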
I’ll do some benchmarks for hashing integers tomorrow. This is the hash that really looks suspicious to me, but it does not seem to be used that much in LLVM.
François Fayard
It is not clear what the test-case is (what source, what compiler options).
My suspicion is that your differences are in the noise, and most of the
time is spent doing other things than hashing. Did you profile a run
and check how much of the total time is spent in the hash function? [You may
need to tweak the code a bit to not inline the actual hash function.] Also,
publish the RANGE for each set of tests, since the variation between
"best and worst" may be more important than the overall best time.
Counting the number of collisions or some such would probably also help.
[I have no useful answer as to why `k >> 9` is used. `k >> 4` because
pointers [allocated from the heap] are often aligned to 16 or greater does
make sense. I suspect 9 is just some arbitrary value that gives some more
entropy in the low part of the value [and since the value is later reduced
modulo the number of buckets or some such, having more variation in the
lower bits does help - upper bits are discarded, but also often the same for
large numbers of allocations!] - whether someone spent minutes, hours or
days picking exactly 9 I have no idea, nor whether it still is the "right"
value, or whether there should be some other number there...]
> It is not clear what the test-case is (what source, what compiler
> options). My suspicion is that your differences are in the noise, and most
> of the time is spent doing other things than hashing. Did you profile a
> run and check how much of the total time is spent in the hash function?
> [You may need to tweak the code a bit to not inline the actual hash
> function.] Also, publish the RANGE for each set of tests, since the
> variation between "best and worst" may be more important than the overall
> best time.
I agree that I did not present a clear test case; it was just a random .cpp file that benchmarks different hash tables from LLVM/Google/STL, compiled with the -O3 option. I’ll try to make it better tomorrow, but I can assure you that it was not noise. I have disabled turbo boost and hyper-threading on my machine and ran the benchmark 10 times for every test, and the difference was way larger than the observed standard deviation. But anyway, for the time being, the winner for hashing pointers is the current one from LLVM.
> Counting the number of collisions or some such would probably also help.
Yes, I could do that. If I want to report all the collisions, I might do that in the destructor of DenseMap. Sorry for my ignorance, but is clang/llvm multithreaded? Is it safe to write to a file in a destructor for logging purposes?
François Fayard
LLVM is not multithreaded.
>> It is not clear what the test-case is (what source, what compiler
>> options). My suspicion is that your differences are in the noise, and
>> most of the time is spent doing other things than hashing. Did you
>> profile a run and check how much of the total time is spent in the hash
>> function? [You may need to tweak the code a bit to not inline the actual
>> hash function.] Also, publish the RANGE for each set of tests, since the
>> variation between "best and worst" may be more important than the overall
>> best time.
> I agree that I did not present a clear test case; it was just a random
> .cpp file that benchmarks different hash tables from LLVM/Google/STL,
> compiled with the -O3 option. I’ll try to make it better tomorrow, but I
> can assure you that it was not noise. I have disabled turbo boost and
> hyper-threading on my machine and ran the benchmark 10 times for every
> test, and the difference was way larger than the observed standard
> deviation. But anyway, for the time being, the winner for hashing pointers
> is the current one from LLVM.
Yes, it seems to be "good". But it's always good to understand WHY.
>> Counting the number of collisions or some such would probably also help.
> Yes, I could do that. If I want to report all the collisions, I might do
> that in the destructor of DenseMap. Sorry for my ignorance, but is
> clang/llvm multithreaded? Is it safe to write to a file in a destructor
> for logging purposes?
I don't think you can rely on that ALWAYS, but clang and LLVM don't spawn
threads - someone may use two threads simultaneously to call into llvm and
clang functionality as a library tho'. A limited addition to the code to
report something like `std::cout << "Number of collisions: " << NumCollisions
<< std::endl;` should be safe enough when using clang as the test [and
opening a file in append mode should also be reasonably safe]. If you
wanted to submit such a change as a patch, it would probably require a fair
bit more work to guarantee thread safety... [I have certainly done similar
quick "print something" additions to Clang and LLVM, including in
destructors.]
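Something along these lines is what I mean - a minimal sketch (the names are mine, not actual LLVM code):

#include <cstdint>
#include <fstream>

// Global collision counter, to be incremented at each probe collision in
// the hash table under test.
static std::uint64_t NumCollisions = 0;

// Report the count at program exit by appending to a log file; append mode
// keeps concurrent processes from clobbering each other's output.
struct CollisionReporter {
  ~CollisionReporter() {
    std::ofstream out("collisions.log", std::ios::app);
    out << "Number of collisions: " << NumCollisions << "\n";
  }
};
static CollisionReporter Reporter;  // destroyed after main returns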
I’d be glad to understand how it has been crafted, especially where the 9 comes from. Is it an experimental number, or something that comes from reasoning about malloc?
I have also worked on the hashing for integers, and my first test was to measure the difference in performance with the worst hash function you can think of for integers: hash(k) = 0. It turns out that the difference I measured is below 0.5%, which is below the noise of repeated runs. So, with my benchmark (compiling a .cpp file with clang with -O3), the quality of the hash for integers does not matter at all. It seems that the usage of integers as keys is very low.
François Fayard