RFC: change string hash function in libc++

I recently worked with some other folks at Google to replace the
string-hashing function in libstdc++ with Murmurhash
(https://github.com/aappleby/smhasher), a very fast hash function with
excellent hashing properties.

I'd like to propose changing libc++ to also use Murmurhash, or perhaps
Cityhash (http://code.google.com/p/cityhash/), which hasn't been
around as long but seems to perform even better.

Assuming the maintainers are receptive to such a change, we would be
happy to do the work to create a patch to libc++; we just need to know
the proper procedures for doing so.

Also, is there a preference to using Murmurhash or Cityhash? Both are
excellent hash functions, and are equally portable. In favor of
Murmurhash: it's been around for longer and is already used in other
STL implementations. In favor of Cityhash: it performs a bit better
than Murmurhash, on average.

craig

One point that I think would be interesting to share (if you have any information):

} What are the numbers behind these "performs better"?

Austin Appleby (the murmurhash author) agreed to run some numbers.
Here's his report. The executive summary is that city is *much*
faster for strings >512 bytes, and murmur is a bit faster for strings
<512 bytes.

craig

-- cut here--

SMHasher can generate performance numbers for various hash functions -
below are numbers for Murmur and City generated on my desktop machine
(I added the additional block sizes that chandlerc requested and
tweaked SMHasher's output a bit to make comparison easier).

Generally speaking, City is _much_ faster for large blocks (more loop
unrolling), Murmur is faster for small blocks. The crossover point in
the data below is at 512 bytes.

"Resistance to collisions" is harder to quantify, but any hash
function that passes all the tests in SMHasher's test suite should be
virtually indistiguishable from a random oracle in most applications -
it is quite thorough.

-Austin

Generally speaking, City is _much_ faster for large blocks (more loop
unrolling), Murmur is faster for small blocks.

Curious - is it relevant which compiler/optimization levels were used
for this comparison?

Sorry to nitpick, but http://code.google.com/p/cityhash/ says "we use
variants of CityHash64() mainly in hash tables such as
hash_map<string, int>. ... CityHash128() and similar return a 128-bit
hash and are tuned for strings of at least a few hundred bytes.
Depending on your compiler and hardware, it may be faster than
CityHash64() on sufficiently long strings. It is known to be slower
than necessary on shorter strings, but we expect that case to be
relatively unimportant. Inside Google we use variants of CityHash128()
mainly for code that wants to minimize collisions."

So it's probably better to test City64 than City128 for a hash-table proposal.

Here's the cityhash performance claim from
http://code.google.com/p/cityhash/source/browse/trunk/README:

Performance on short strings

Here are the updated results, including City64, which is probably a
better candidate here than City128. The data is a little noisy, but
it looks like City64 is faster than Murmur3F starting at
around 8-byte strings.

Answering another question, Austin said all tests were performed on an
x86_64 machine with gcc 4.6.x, with code compiled via gcc -O2.
Results can differ on different platforms/compilers/etc, making the
choice of a 'best' hash function difficult. Overall, I think either
cityhash or murmurhash would be great choices, and better than what
exists in libc++ now. I don't want to get too bogged down in choosing
between them.

craig

Agreed, but thanks for at least providing a baseline idea of the performance of these routines.

Based on the data we have, my inclination would be to go with City64 for 64-bit architectures. For 32-bit architectures we could start with an “obvious” port of City64, and potentially improve on it if needed in the future. Let’s see what Howard and others have to say though.

Also, I wonder about the code size impact. Any idea there? I’m not certain it would be terribly relevant (hopefully there would be only one copy of it) but it might…

} Also, I wonder about the code size impact. Any idea there? I'm not
} certain it would be terribly relevant (hopefully there would be only
} one copy of it) but it might...

You mean generated code size? This code is designed to boil down to a
small number of assembly instructions, but I don't know the
particulars I'm afraid.

craig

It would be good to know how LLVM performs on these functions so we
can fix any CodeGen issues concurrent with the changes in libc++.

deep

I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map. I.e. ultimately the number of collisions is determined not only by the hash function, but by the implementation of the unordered container as well.

I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has. A score of 0.f indicates that all values are hashed into the same bucket. A score of 100.f indicates that there are no collisions.

// A quantitative grade [0, 100] of how many collisions there
// are in an unordered_set/map.
// 0 == all elements in the same bucket
// 100 == no collisions
// A lower max_load_factor pushes this score towards 100
// A higher max_load_factor pushes this score towards 0
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 100;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        if (bs > 1)
            score += bs - 1;
    }
    return 100*(1 - score / (c.size() - 1));
}

I'm not at this time proposing putting this code into libc++ nor even in the test suite. I'm simply requesting comments on if this is a reasonable tool for quantifying the goodness of an unordered container's state. I've been working with it for only a day now, but I'm happy with it. But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.

Howard

Out of curiosity, how do these numbers compare to the hash functions currently used by libc++?

-Alexei

Yes, it matches [1]. hmap_agg_index was introduced Aug 19 2009 there.

[1] libHX / Git tools

Proof positive that there are no new ideas. Just recycled ones. ;-)

Howard

I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map. I.e. ultimately the number of collisions is determined not only by the hash function, but by the implementation of the unordered container as well.

And of course benchmarks should also be made in the context of the
hash container.

IIUC these hash functions are designed to be used with containers
whose number of buckets is a power of 2 which can be faster since they
use bit operations to select the bucket. So it might be worth testing
for that case as well as the current implementation. Doing that in an
STL container would require some template thing to determine which
strategy to use. I don't know if you would consider that acceptable.
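
As a concrete illustration of the two bucket-selection strategies being contrasted here, a minimal sketch (the function names are illustrative, not libc++ internals):

#include <cstddef>

// Power-of-two bucket count: selecting a bucket is a single AND, but only
// the low bits of the hash participate, so the hash function must mix well.
inline std::size_t bucket_pow2(std::size_t hash, std::size_t bucket_count)
{
    return hash & (bucket_count - 1);   // valid only when bucket_count == 2^k
}

// Prime bucket count: an integer modulo, slower per lookup but more
// forgiving of hash functions that don't mix their low bits well.
inline std::size_t bucket_prime(std::size_t hash, std::size_t bucket_count)
{
    return hash % bucket_count;
}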

I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has. A score of 0.f indicates that all values are hashed into the same bucket. A score of 100.f indicates that there are no collisions.

// A quantitative grade [0, 100] of how many collisions there
// are in an unordered_set/map.
// 0 == all elements in the same bucket
// 100 == no collisions
// A lower max_load_factor pushes this score towards 100
// A higher max_load_factor pushes this score towards 0
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 100;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        if (bs > 1)
            score += bs - 1;
    }
    return 100*(1 - score / (c.size() - 1));
}

I'm not at this time proposing putting this code into libc++ nor even in the test suite. I'm simply requesting comments on if this is a reasonable tool for quantifying the goodness of an unordered container's state. I've been working with it for only a day now, but I'm happy with it. But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.

This metric underestimates the cost of multiple collisions in a single
bucket. For example, it will give the same result for the case when
you have 98 elements in bucket 1, 2 in bucket 2 as it would for 50
elements in bucket 1, 50 in bucket 2. A better metric might be the
expected number of element lookups to find an element in the table -
it's equal to the sum of (bucket_size(i) * (bucket_size(i) + 1) / 2)
for all buckets divided by the number of elements.
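
To make the difference concrete, here is a small self-contained check of the 98/2 versus 50/50 example above (the bucket layouts are hypothetical; the two functions mirror the current grade() and the proposed expected-lookups metric):

#include <cstddef>
#include <iostream>
#include <vector>

// Current metric: 100 * (1 - sum(max(bs - 1, 0)) / (n - 1))
float old_grade(const std::vector<std::size_t>& buckets, std::size_t n)
{
    float score = 0;
    for (std::size_t bs : buckets)
        if (bs > 1)
            score += bs - 1;
    return 100 * (1 - score / (n - 1));
}

// Proposed metric: sum(bs * (bs + 1) / 2) / n, the expected number of
// comparisons per successful lookup.
float expected_lookups(const std::vector<std::size_t>& buckets, std::size_t n)
{
    float score = 0;
    for (std::size_t bs : buckets)
        score += bs * (bs + 1) / 2;
    return score / n;
}

int main()
{
    std::vector<std::size_t> a = {98, 2};   // 98 elements crammed into one bucket
    std::vector<std::size_t> b = {50, 50};  // the same 100 elements split evenly
    std::cout << old_grade(a, 100) << " vs " << old_grade(b, 100) << '\n';
    // prints ~1.01 for both layouts -- the current metric can't tell them apart
    std::cout << expected_lookups(a, 100) << " vs " << expected_lookups(b, 100) << '\n';
    // prints 48.54 vs 25.5 -- the second layout is clearly better
}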

Another metric would be to place a set of elements in the container,
and then measure the number of lookups required for a separate set of
elements. Using real data is also a good idea; randomly generated data
tends to hash to random buckets.

I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map. I.e. ultimately the number of collisions is determined not only by the hash function, but by the implementation of the unordered container as well.

And of course benchmarks should also be made in the context of the
hash container.

IIUC these hash functions are designed to be used with containers
whose number of buckets is a power of 2 which can be faster since they
use bit operations to select the bucket. So it might be worth testing
for that case as well as the current implementation. Doing that in an
STL container would require some template thing to determine which
strategy to use. I don't know if you would consider that acceptable.

The libc++ std::hash specializations are being designed/tested only with the libc++ unordered containers. The libc++ containers restrict the number of buckets to a prime (which the client can select). Perhaps I'm misunderstanding your point?

I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has. A score of 0.f indicates that all values are hashed into the same bucket. A score of 100.f indicates that there are no collisions.

// A quantitative grade [0, 100] of how many collisions there
// are in an unordered_set/map.
// 0 == all elements in the same bucket
// 100 == no collisions
// A lower max_load_factor pushes this score towards 100
// A higher max_load_factor pushes this score towards 0
template <class C>
float
grade(const C& c)
{
   using namespace std;
   if (c.size() <= 1)
       return 100;
   float score = 0;
   size_t bc = c.bucket_count();
   for (size_t i = 0; i != bc; ++i)
   {
       size_t bs = c.bucket_size(i);
       if (bs > 1)
           score += bs - 1;
   }
   return 100*(1 - score / (c.size() - 1));
}

I'm not at this time proposing putting this code into libc++ nor even in the test suite. I'm simply requesting comments on if this is a reasonable tool for quantifying the goodness of an unordered container's state. I've been working with it for only a day now, but I'm happy with it. But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.

This metric underestimates the cost of multiple collisions in a single
bucket. For example, it will give the same result for the case when
you have 98 elements in bucket 1, 2 in bucket 2 as it would for 50
elements in bucket 1, 50 in bucket 2. A better metric might be the
expected number of element lookups to find an element in the table -
it's equal to the sum of (bucket_size(i) * (bucket_size(i) + 1) / 2)
for all buckets divided by the number of elements.

I like this suggestion a lot! I've switched to it.

Another metric would be to place a set of elements in the container,
and then measure the number of lookups required for a separate set of
elements. Using real data is also a good idea; randomly generated data
tends to hash to random buckets.

So far I've just been testing hash<arithmetic types>, though by the end of the day I hope to have moved on to hash<string>. I'm open to concrete suggestions on data sets I should be testing. For floating point types I've been using:

    mt19937_64 eng;
    uniform_real_distribution<T> dist(numeric_limits<T>::min(),
                                      numeric_limits<T>::max());

And for integral types:

    mt19937_64 eng;
    uniform_int_distribution<T> dist(numeric_limits<T>::min(),
                                     numeric_limits<T>::max());

For example here is one test I'm running:

#include <unordered_set>
#include <random>
#include <iostream>

// Computes the average number of comparisons per lookup.
// A perfect hash will return 1.
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 1;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        score += bs * (bs+1) / 2;
    }
    return score / c.size();
}

int main()
{
    typedef long long T;
    using namespace std;
    mt19937_64 eng;
//  uniform_real_distribution<T> dist(numeric_limits<T>::min(),
//                                    numeric_limits<T>::max());
    uniform_int_distribution<T> dist(numeric_limits<T>::min(),
                                     numeric_limits<T>::max());
    unordered_set<T> table;
    table.max_load_factor(1);
    for (int i = 0; i < 100; ++i)
    {
        for (int j = 0; j < 10000; ++j)
            table.insert(dist(eng));
        cout << "score = " << grade(table) << '\n';
    }
}

One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one. What's checked in is:

        return __u.__a ^ __u.__b;

But I was wondering if it would be better to:

        return __murmur2<size_t>()(&__u, sizeof(__u));

My results so far are indicating that for arithmetic tests the use of murmur isn't significantly better than a simple xor. I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:

    return static_cast<size_t>(__v);

I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .

Howard

IIUC these hash functions are designed to be used with containers
whose number of buckets is a power of 2 which can be faster since they
use bit operations to select the bucket. So it might be worth testing
for that case as well as the current implementation. Doing that in an
STL container would require some template thing to determine which
strategy to use. I don't know if you would consider that acceptable.

The libc++ std::hash specializations are being designed/tested only with the libc++ unordered containers. The libc++ containers restrict the number of buckets to a prime (which the client can select). Perhaps I'm misunderstanding your point?

Probably not, I was mostly blathering on. My main point was that these
hash functions seem to be optimised for a different use case. That
doesn't mean they aren't good for this use case, but there are fewer
benefits.

Another metric would be to place a set of elements in the container,
and then measure the number of lookups required for a separate set of
elements. Using real data is also a good idea; randomly generated data
tends to hash to random buckets.

So far I've just been testing hash<arithmetic types>, though by the end of the day I hope to have moved on to hash<string>. I'm open to concrete suggestions on data sets I should be testing.

For one example, I once read a paper (that I can't find) which used IP
addresses from server logs. I think it was for strings, but converting
IP addresses to integer values would be an interesting test. IP addresses
tend to have patterns that you don't see in random data because of
the way they're allocated.
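
One low-effort way to run such a test is to pack dotted-quad IPv4 addresses into 32-bit keys by hand (the addresses below are made up; real server logs would be the interesting input):

#include <cstdint>
#include <cstdio>

std::uint32_t ipv4_to_key(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16)
         | (std::uint32_t(c) << 8)  |  std::uint32_t(d);
}

int main()
{
    // Addresses allocated from the same block share their high bits -- exactly
    // the kind of pattern uniformly random test data never produces.
    std::printf("%u\n", ipv4_to_key(192, 168, 1, 10));
    std::printf("%u\n", ipv4_to_key(192, 168, 1, 11));
}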

One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one. What's checked in is:

    return __u.__a ^ __u.__b;

That does seem pretty bad. Using a random distribution won't catch the
potential problems with something like that. For integers it's quite
easy to come up with a realistic example of where that could go wrong.
Say you're using it to hash 64-bit integers to 32-bit values. Someone
comes along and stores 32-bit co-ordinate pairs in 64-bit values (i.e.
x * 2^32 + y). Now (x,y) will always hash to the same value as (y,x),
and when x=y, (x,y) will always hash to 0.
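
A tiny demonstration of that failure mode, assuming a 32-bit size_t so that the 64-bit key is combined by xor-ing its two halves (the packing scheme and the values are made up for illustration):

#include <cstdint>
#include <iostream>

// Stand-in for the checked-in combiner on a 32-bit platform: xor of the halves.
std::uint32_t xor_combine(std::uint64_t v)
{
    return static_cast<std::uint32_t>(v >> 32) ^ static_cast<std::uint32_t>(v);
}

int main()
{
    std::uint32_t x = 123, y = 456;
    std::uint64_t xy = (std::uint64_t(x) << 32) | y;   // pack (x, y)
    std::uint64_t yx = (std::uint64_t(y) << 32) | x;   // pack (y, x)
    std::cout << xor_combine(xy) << ' ' << xor_combine(yx) << '\n';  // identical hashes
    std::uint64_t xx = (std::uint64_t(x) << 32) | x;   // the x == y case
    std::cout << xor_combine(xx) << '\n';               // always 0
}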

But I was wondering if it would be better to:

    return __murmur2<size_t>()(&__u, sizeof(__u));

That's actually a bit buggy for floats. On some platforms sizeof is
larger than the actual size of the floating point type and includes
some padding for alignment which can contain junk values (very common
for long double). It isn't always guaranteed that equal floating point
values have the same representation (although IIRC it is for the
IEEE apart from weird edge cases like 0, -0 and maybe the infinity
values). Sometimes you get really weird representations, such as using
the sum of two doubles for a long double. I think Apple used to do
that for PowerPC macs.

You can see my attempt at a portable hash function for floats in boost
(boost/functional/hash/detail/hash_float_*.hpp), but it isn't great.
I'm sure an expert could do better. I only used a binary hash for one
particular setup where it was needed. I considered doing it elsewhere,
but it would have taken too much maintenance. It's such an odd use
case I'm not sure if anyone has ever used it.

My results so far are indicating that for arithmetic tests the use of murmur isn't significantly better than a simple xor. I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:

return static_cast<size_t>(__v);

This is worth a read:

http://www.concentric.net/~ttwang/tech/inthash.htm

I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 ("Due to bad hash functions, unordered_set can easily give horrible performance").

His test seems very contrived. All hash functions are going to have
worst cases; the question is how likely it is that they'll turn up.

IIUC these hash functions are designed to be used with containers
whose number of buckets is a power of 2 which can be faster since they
use bit operations to select the bucket. So it might be worth testing
for that case as well as the current implementation. Doing that in an
STL container would require some template thing to determine which
strategy to use. I don't know if you would consider that acceptable.

The libc++ std::hash specializations are being designed/tested only with the libc++ unordered containers. The libc++ containers restrict the number of buckets to a prime (which the client can select). Perhaps I'm misunderstanding your point?

Probably not, I was mostly blathering on. My main point was that these
hash functions seem to be optimised for a different use case. That
doesn't mean they aren't good for this use case, but there are fewer
benefits.

Now that I've studied the problem more, I think I understand you better here. An identity hash function for a small integral type is a lot less appealing when your hash container has a power of 2 number of buckets. The identity hash function collides "more often" in that environment.
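
A quick illustration of that point with made-up data: identity-hashed keys sharing a common stride all pile into one bucket when the bucket count is a power of two, but spread out when it is prime.

#include <cstddef>
#include <iostream>
#include <set>

int main()
{
    const std::size_t pow2_buckets  = 1024;   // power-of-2 bucket count
    const std::size_t prime_buckets = 1031;   // nearby prime bucket count
    std::set<std::size_t> pow2_used, prime_used;
    for (std::size_t key = 0; key < 100 * 4096; key += 4096)  // identity-hashed, stride 4096
    {
        pow2_used.insert(key & (pow2_buckets - 1));
        prime_used.insert(key % prime_buckets);
    }
    std::cout << "distinct buckets, power of two: " << pow2_used.size() << '\n';  // 1
    std::cout << "distinct buckets, prime:        " << prime_used.size() << '\n'; // 100
}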

One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one. What's checked in is:

       return __u.__a ^ __u.__b;

That does seem pretty bad. Using a random distribution won't catch the
potential problems with something like that. For integers it's quite
easy to come up with a realistic example of where that could go wrong.
Say you're using it to hash 64-bit integers to 32-bit values. Someone
comes along and stores 32-bit co-ordinate pairs in 64-bit values (i.e.
x * 2^32 + y). Now (x,y) will always hash to the same value as (y,x),
and when x=y, (x,y) will always hash to 0.

I've been convinced to use murmur2 when I need to combine 2, 3, or 4 size_t's into a single one, except for the case of long double on x86. My rationale is that the kinds of bit patterns that defeat a simple xor are not likely to occur accidentally in long double data. So I'm sticking with xor for long double for now. If I see cases that could easily occur accidentally, this can always be changed in the future.

I settled on murmur2 instead of murmur3 because of the easy availability of the 32 bit and 64 bit versions. I did not see a 64 bit murmur3, and figured murmur2 is good enough. murmur2 won't actually ever be used when size_t is 64 bits. It will only be activated for (unsigned) long long and double on 32 bit platforms.

Oh, except that I activated murmur2 for basic_string as well. The tests I ran weren't that convincing. However I also admit that there is a lot I don't know here, and people are asking for murmur. And what I had was operating a character at a time and murmur operates a word at a time. So done.

But I was wondering if it would be better to:

       return __murmur2<size_t>()(&__u, sizeof(__u));

That's actually a bit buggy for floats. On some platforms sizeof is
larger than the actual size of the floating point type and includes
some padding for alignment which can contain junk values (very common
for long double). It isn't always guaranteed that equal floating point
values have the same representation (although IIRC it is for the
IEEE apart from weird edge cases like 0, -0 and maybe the infinity
values). Sometimes you get really weird representations, such as using
the sum of two doubles for a long double. I think Apple used to do
that for PowerPC macs.

I hit this bug twice in the past few days (once for float and once for long double). I think it is solved now by initializing to 0 in the right places.
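
For what it's worth, a minimal sketch of what "initializing to 0 in the right places" can look like (the helper names are hypothetical and the byte mixer is only a placeholder for the real binary hash; whether padding bytes survive the member assignment untouched is ultimately implementation-specific):

#include <cstddef>

// Placeholder byte mixer standing in for a real binary hash such as
// __murmur2<size_t> -- only here so the sketch compiles on its own.
inline std::size_t mix_bytes(const unsigned char* p, std::size_t n)
{
    std::size_t h = 0;
    for (std::size_t i = 0; i != n; ++i)
        h = h * 131 + p[i];
    return h;
}

template <class T>
std::size_t hash_scalar_bits(T v)
{
    if (v == 0)
        v = 0;                           // fold -0.0 into +0.0 so equal values hash equally
    union
    {
        unsigned char bytes[sizeof(T)];  // listed first so "= {}" zeroes every byte
        T             value;
    } u = {};                            // zero-init: any padding bytes start out as 0
    u.value = v;                         // store the value; untouched padding stays 0
    return mix_bytes(u.bytes, sizeof(T));
}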

You can see my attempt at a portable hash function for floats in boost
(boost/functional/hash/detail/hash_float_*.hpp), but it isn't great.
I'm sure an expert could do better. I only used a binary hash for one
particular setup where it was needed. I considered doing it elsewhere,
but it would have taken too much maintenance. It's such an odd use
case I'm not sure if anyone has ever used it.

Looked at your hash_combine. Tested it against xor and murmur. murmur impressed me the most with 1 bit being able to influence all other bits in the size_t, for both 32 and 64 bit platforms. Your hash_float when sizeof(float) <= sizeof(size_t) is I believe identical to what is in there now. Except you're using a pointer cast and I'm using a union (I was using a pointer cast too until Dave Zarzycki pointed out a nasty bug just a few days ago -- I don't see the same bug in your implementation).

My results so far are indicating that for arithmetic tests the use of murmur isn't significantly better than a simple xor. I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:

   return static_cast<size_t>(__v);

This is worth a read:

http://www.concentric.net/~ttwang/tech/inthash.htm

Thanks, that was a helpful read.

I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .

His test seems very contrived. All hash functions are going to have
worst cases; the question is how likely it is that they'll turn up.

After much thought and testing, I've commented and closed this bug as "behaves correctly".

Howard

I’ve been convinced to use murmur2 when I need to combine 2, 3, or 4 size_t’s into a single one, except for the case of long double on x86. My rationale is that the kinds of bit patterns that defeat a simple xor are not likely to occur accidentally in long double data. So I’m sticking with xor for long double for now. If I see cases that could easily occur accidentally, this can always be changed in the future.

I settled on murmur2 instead of murmur3 because of the easy availability of the 32 bit and 64 bit versions.

I’m confused. What do you mean by 32-bit and 64-bit versions? If you could clarify what versions you would like, the author of the murmur hashes and some other folks here at Google are offering to help build variants that target exactly what libc++ needs.

I did not see a 64 bit murmur3, and figured murmur2 is good enough. murmur2 won’t actually ever be used when size_t is 64 bits. It will only be activated for (unsigned) long long and double on 32 bit platforms.

Oh, except that I activated murmur2 for basic_string as well.

Why not murmur3 or CityHash? murmur3 is both higher quality and significantly faster than murmur2, and CityHash is faster still. For reference, the original thread was primarily focused on basic_string hashing.

The tests I ran weren’t that convincing. However I also admit that there is a lot I don’t know here, and people are asking for murmur. And what I had was operating a character at a time and murmur operates a word at a time. So done.

At Google, we have found CityHash64 to provide very significant benefits when used to hash strings for unordered containers. The benchmark we posted (SMHasher) was written specifically to showcase the problems hit by real-world hashing of string contents.

The libc++ code is publicly available at http://llvm.org/svn/llvm-project/libcxx/trunk/ . Patches to this code are reviewed regularly by me and others, and are welcome. Please include a patch for http://llvm.org/svn/llvm-project/libcxx/trunk/CREDITS.TXT if you contribute.

Howard