I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map. That is, the number of collisions is ultimately determined not only by the hash function, but by the implementation of the unordered container as well.

And of course benchmarks should also be made in the context of the hash container.

IIUC these hash functions are designed to be used with containers whose number of buckets is a power of 2, which can be faster since they use bit operations to select the bucket. So it might be worth testing for that case as well as the current implementation. Doing that in an STL container would require some template thing to determine which strategy to use. I don't know if you would consider that acceptable.

The libc++ std::hash are being designed/tested only with the libc++ unordered containers. The libc++ containers restrict the number of buckets to a prime (which the client can select). Perhaps I'm misunderstanding your point?

I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has. A score of 0.f indicates that all values are hashed into the same bucket. A score of 100.f indicates that there are no collisions.

// A quantitative grade [0, 100] of how many collisions there
// are in an unordered_set/map.
// 0 == all elements in the same bucket
// 100 == no collisions
// A lower max_load_factor pushes this score towards 100
// A higher max_load_factor pushes this score towards 0
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 100;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        if (bs > 1)
            score += bs - 1;
    }
    return 100*(1 - score / (c.size() - 1));
}

I'm not at this time proposing putting this code into libc++, nor even into the test suite. I'm simply requesting comments on whether this is a reasonable tool for quantifying the goodness of an unordered container's state. I've been working with it for only a day now, but I'm happy with it. But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.

This metric underestimates the cost of multiple collisions in a single bucket. For example, it will give the same result for the case when you have 98 elements in bucket 1 and 2 in bucket 2 as it would for 50 elements in bucket 1 and 50 in bucket 2. A better metric might be the expected number of element lookups to find an element in the table - it's equal to the sum of (bucket_size(i) * (bucket_size(i) + 1) / 2) for all buckets divided by the number of elements.
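The two distributions in that example can be checked numerically. A small sketch (helper names are mine) computing both metrics directly from a list of bucket sizes:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original metric: 100 * (1 - collisions / (n - 1)), in [0, 100].
float collision_grade(const std::vector<std::size_t>& buckets)
{
    std::size_t n = 0, score = 0;
    for (std::size_t bs : buckets)
    {
        n += bs;
        if (bs > 1)
            score += bs - 1;
    }
    return 100 * (1 - float(score) / (n - 1));
}

// Suggested metric: expected number of comparisons to find an element,
// assuming every element is looked up once.
float expected_lookups(const std::vector<std::size_t>& buckets)
{
    std::size_t n = 0, score = 0;
    for (std::size_t bs : buckets)
    {
        n += bs;
        score += bs * (bs + 1) / 2;
    }
    return float(score) / n;
}
```

For {98, 2} and {50, 50} the original metric counts 98 collisions in both cases and so cannot tell them apart, while the expected-lookup metric gives 48.54 versus 25.5 comparisons per lookup.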

I like this suggestion a lot! I've switched to it.

Another metric would be to place a set of elements in the container, and then measure the number of lookups required for a separate set of elements. Using real data is also a good idea; randomly generated data tends to hash to random buckets.
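One way to measure that probe-set cost is to instrument the container's equality predicate so every comparison made during find() is counted. A sketch under that assumption (all names are mine, not from the thread):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Equality predicate that counts how many times it is invoked.
struct counting_eq
{
    static std::size_t count;
    bool operator()(long long a, long long b) const
    {
        ++count;
        return a == b;
    }
};
std::size_t counting_eq::count = 0;

using counted_set =
    std::unordered_set<long long, std::hash<long long>, counting_eq>;

// Average number of key comparisons per lookup over a probe sequence.
template <class Iter>
float probe_cost(const counted_set& table, Iter first, Iter last)
{
    counting_eq::count = 0;
    std::size_t n = 0;
    for (; first != last; ++first, ++n)
        table.find(*first);
    return n == 0 ? 0.f : float(counting_eq::count) / n;
}
```

Probing with elements known to be present gives a lower bound of one comparison per lookup; probing with absent elements measures how quickly unsuccessful searches terminate.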

So far I've just been testing hash<arithmetic types>, though by the end of the day I hope to have moved on to hash<string>. I'm open to concrete suggestions on data sets I should be testing. For floating point types I've been using:

mt19937_64 eng;
uniform_real_distribution<T> dist(numeric_limits<T>::min(),
                                  numeric_limits<T>::max());

And for integral types:

mt19937_64 eng;
uniform_int_distribution<T> dist(numeric_limits<T>::min(),
                                 numeric_limits<T>::max());

For example here is one test I'm running:

#include <unordered_set>
#include <random>
#include <iostream>

// Computes the average number of comparisons per lookup.
// A perfect hash will return 1.
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 1;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        score += bs * (bs+1) / 2;
    }
    return score / c.size();
}

int main()
{
    typedef long long T;
    using namespace std;
    mt19937_64 eng;
    // uniform_real_distribution<T> dist(numeric_limits<T>::min(),
    //                                   numeric_limits<T>::max());
    uniform_int_distribution<T> dist(numeric_limits<T>::min(),
                                     numeric_limits<T>::max());
    unordered_set<T> table;
    table.max_load_factor(1);
    for (int i = 0; i < 100; ++i)
    {
        for (int j = 0; j < 10000; ++j)
            table.insert(dist(eng));
        cout << "score = " << grade(table) << '\n';
    }
}

One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one. What's checked in is:

return __u.__a ^ __u.__b;

But I was wondering if it would be better to:

return __murmur2<size_t>()(&__u, sizeof(__u));

My results so far are indicating that for arithmetic types the use of murmur isn't significantly better than a simple xor. I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:

return static_cast<size_t>(__v);
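For comparison purposes, the xor combiner can be written next to the boost-style combiner that is often suggested as a middle ground between plain xor and running the bytes through murmur. This is only a sketch; neither function is libc++ code:

```cpp
#include <cassert>
#include <cstddef>

// The combiner currently checked in: a plain xor of the two halves.
std::size_t combine_xor(std::size_t a, std::size_t b)
{
    return a ^ b;
}

// Boost-style hash_combine: mixes in a magic constant (2^64 / phi,
// truncated on 32-bit size_t) plus shifted copies of the accumulator.
std::size_t combine_boost(std::size_t a, std::size_t b)
{
    a ^= b + static_cast<std::size_t>(0x9e3779b97f4a7c15ULL)
           + (a << 6) + (a >> 2);
    return a;
}
```

Plain xor maps (a, b) and (b, a) to the same value and sends every equal pair to zero; whether those patterns occur often enough to matter is exactly what the metric above is meant to decide.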

I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .

Howard