[libcxx] Extending the behavior of std::hash in libc++

Hi, I wanted to leave some thoughts here from my perspective as an
application developer. As context, I added heterogeneous lookup
support to Google's internal dense_hash_{set,map}, and my day-to-day
job involves working with lots of high-throughput hashtables.

From my perspective, custom hash algorithms are pretty important, and
I'm concerned that this standard may ossify the state of the world for
applications which could benefit from using a different hash
algorithm.

Even if the standard library randomizes its hash (which I completely
agree is a good idea), I contend that it's still going to be hard for
standard library maintainers to swap out hash functions, because of
the conservative nature of standard library writing. (As evidence,
consider that, last time I checked, glibc still used a malloc
implementation whose throughput dropped by more than half as soon as
you created a thread.)

I expect that the result of this conservatism is that even if a good
hash function is chosen today, it will rather quickly become outdated,
as hashing is an area of active development in the community at large
(see e.g. xxHash, MetroHash). I expect this trend to continue -- it's
kind of the perfect nerdsnipe, and it also matters for lots of
high-performance code.

Moreover, applications don't necessarily control the version of the
standard library they link with; even if one standard library
implementation uses a good hash, that's no guarantee that they all
will. As soon as any one of the standard libraries I support uses a
hash that's too slow or somehow weak (perhaps its randomization is
weak and doesn't prevent flooding), I have to make nontrivial changes
to all of my code.

So I contend that "standard library writers should use a good, fast
hash" implies "at least some standard libraries will use a possibly
bad, probably slow hash" some time in the relatively near future, just
as today some standard libraries use crappy malloc implementations.
And so just as performance-sensitive applications commonly replace
malloc, I expect many performance-sensitive applications will want to
control their default hasher.

Indeed, as the proposal says, "we must plan for hash algorithms that
have a finite (and probably short) operational life," and my
expectation is that a lot of carefully-maintained user code is going
to be better able to play this cat-and-mouse game than the compiler /
standard library headers.

One option is to say, fine, these applications can continue to use a
custom hash functor just as they do today. But as the proposal says,
ergonomics and experience suggest that most people are just going to
use the default. Therefore my suggestion is, it would be really nice
if there were an easy way for application developers to swap in a new
hasher, even if it's only for a set of explicitly-listed types.
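
For concreteness, the workaround available today looks something like
the sketch below: a project-wide hash functor plus container aliases.
(app_hash and the mixing constant are just made up for illustration.)
The catch is that every use site has to remember to opt in to the
alias, which is exactly the ergonomics problem.

  #include <cstddef>
  #include <functional>
  #include <unordered_map>

  // Project-wide default hasher: fall back to std::hash unless the
  // application has specialized app_hash for the type.
  template <class T>
  struct app_hash : std::hash<T> {};

  template <>
  struct app_hash<int> {
    std::size_t operator()(int v) const noexcept {
      // Stand-in for whatever hash the application actually prefers
      // (xxHash, FarmHash, ...).
      return static_cast<std::size_t>(v) * 0x9E3779B97F4A7C15ull;
    }
  };

  // Containers have to be spelled via the alias to pick up the hasher.
  template <class K, class V>
  using app_unordered_map = std::unordered_map<K, V, app_hash<K>>;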

I'll refrain from saying something dumb by not suggesting how this
might be specified. :slight_smile:

-Justin

I'm confused. You start by saying that you're concerned that "this standard
[meaning P0029, I guess?] may ossify the state of the world for
applications which could benefit from using a different hash algorithm",
but I don't see how the proposal is any worse than the status quo with
respect to the problems you describe. Furthermore, you go on to express a
desire for user code to be able to control the default hash algorithm, but
if anything P0029 makes that easier, by giving you one central hash
algorithm to control, instead of dealing with hundreds of independent
per-type hash algorithms.

In any event, "an easy way for application developers to swap in a new
hasher" is basically exactly what I'm asking for here. I take no position
on whether it would be a good idea to put such a mechanism in the standard
itself, but if you want to go that way, surely gaining implementation
experience in libc++ would be a good first step?

I'm confused. You start by saying that you're concerned that "this standard
[meaning P0029, I guess?] may ossify the state of the world for applications
which could benefit from using a different hash algorithm", but I don't see
how the proposal is any worse than the status quo with respect to the
problems you describe.

It's not.

My mail was more to express the position that switching out hashers is
important for many applications. It seems that perhaps if hashing is
being reconsidered in a big way, the ability to switch out the hasher
should be considered at the same time, lest an opportunity be lost.
Even if switching out hashers is out of scope for P0029, it seems that
if it's desirable and important it should be carefully considered, in
the same way that heterogeneous lookups are being considered.

Furthermore, you go on to express a desire for user
code to be able to control the default hash algorithm, but if anything P0029
makes that easier, by giving you one central hash algorithm to control,
instead of dealing with hundreds of independent per-type hash algorithms.

I suppose this is true from one perspective. In particular it's
easier to use a single hasher with complex types -- assuming that
everyone writes their boilerplate using the templated idiom.
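
By "the templated idiom" I mean per-type boilerplate written against
the proposed interface rather than against any concrete algorithm. As
I read the API (the exact customization-point names may be off), that
looks something like:

  struct Point {
    int x;
    int y;

    // The boilerplate only says which members to feed in; it never names
    // a concrete hash function, so whatever algorithm sits behind
    // hash_code gets used for every type written this way.
    friend std::hash_code hash_value(std::hash_code code, const Point& p) {
      return std::hash_combine(std::move(code), p.x, p.y);
    }
  };

and then a single hasher composes through arbitrarily nested types.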

On the other hand, the data structure using the hash -- unordered_map
or whatever -- still controls which hasher we use. My contention in
the earlier e-mail is that because (a) people will largely use the
default and (b) library writers are conservative, we will likely end
up with possibly bad, probably slow standard-library hashers in the
near future. Just as replacing malloc is often a good idea, it seems
to me that replacing the default hasher will also often be a good
idea. So I was asking that we consider whether it's desirable to
design a system where it's easier to replace the default hasher.

Maybe this is the best we can reasonably do in that respect. It would
sort of be a bummer (as I understand the proposal), but I get that
there's a lot of constraints on the problem.

I take no position on
whether it would be a good idea to put such a mechanism in the standard
itself, but if you want to go that way, surely gaining implementation
experience in libc++ would be a good first step?

Sure, and I feel like this thread is part of that process?

> I'm confused. You start by saying that you're concerned that "this
> standard [meaning P0029, I guess?] may ossify the state of the world for
> applications which could benefit from using a different hash algorithm",
> but I don't see how the proposal is any worse than the status quo with
> respect to the problems you describe.

It's not.

My mail was more to express the position that switching out hashers is
important for many applications. It seems that perhaps if hashing is
being reconsidered in a big way, the ability to switch out the hasher
should be considered at the same time, lest an opportunity be lost.
Even if switching out hashers is out of scope for P0029, it seems that
if it's desirable and important it should be carefully considered, in
the same way that heterogeneous lookups are being considered.

I see, that makes sense; I've made a note to discuss that in the next
version of the paper. I don't intend to actually propose it at this
juncture, but I agree it's worth considering.

> Furthermore, you go on to express a desire for user code to be able to
> control the default hash algorithm, but if anything P0029 makes that
> easier, by giving you one central hash algorithm to control, instead of
> dealing with hundreds of independent per-type hash algorithms.

I suppose this is true from one perspective. In particular it's
easier to use a single hasher with complex types -- assuming that
everyone writes their boilerplate using the templated idiom.

Sorry, I don't follow.

On the other hand, the data structure using the hash -- unordered_map
or whatever -- still controls which hasher we use. My contention in
the earlier e-mail is that because (a) people will largely use the
default and (b) library writers are conservative, we will likely end
up with possibly bad, probably slow standard-library hashers in the
near future. Just as replacing malloc is often a good idea, it seems
to me that replacing the default hasher will also often be a good
idea. So I was asking that we consider whether it's desirable to
design a system where it's easier to replace the default hasher.

I get all that. My point is that even if P0029 doesn't directly give you a
way to replace the hash algorithm, it gets you closer: under the status
quo, even if the standard library gave you a way to change the default
hasher, that would just give you a way to change the behavior for each type
individually. Under P0029, by contrast, if the standard library gives you a
way to change the hash algorithm, then you can change it for all types at a
stroke.

Maybe this is the best we can reasonably do in that respect. It would
sort of be a bummer (as I understand the proposal), but I get that
there's a lot of constraints on the problem.

> I take no position on
> whether it would be a good idea to put such a mechanism in the standard
> itself, but if you want to go that way, surely gaining implementation
> experience in libc++ would be a good first step?

Sure, and I feel like this thread is part of that process?

Agreed; I had thought you were objecting to my request, but it sounds like
that's not the case.

I've made a note to discuss that in the next version of the paper. I don't intend to actually propose it at this juncture, but I agree it's worth considering

Fantastic, thank you!

Under P0029, by contrast, if the standard library gives you a way to change the hash algorithm, then you can change it for all types at a stroke.

Strong agree.

I had thought you were objecting to my request, but it sounds like that's not the case.

Sorry, there clearly are communication nuances on this list that I'm
not yet hep to. Thanks for being patient and digging out what I was
trying to say.

-Justin

> I've made a note to discuss that in the next version of the paper. I
> don't intend to actually propose it at this juncture, but I agree it's
> worth considering.

Fantastic, thank you!

> Under P0029, by contrast, if the standard library gives you a way to
> change the hash algorithm, then you can change it for all types at a
> stroke.

Strong agree.

> I had thought you were objecting to my request, but it sounds like
> that's not the case.

Sorry, there clearly are communication nuances on this list that I'm
not yet hep to. Thanks for being patient and digging out what I was
trying to say.

I'm new here too, so this may just be down to my lack of reading
comprehension skills.

The thing I would like (also... a pony...) is a more explicit
discussion of hash function families, and the ability to select from a
family. There are relatively important hash-based data structures that
require 2+ hash functions, notably Bloom filters and Cuckoo hash
tables. Given the big push to revamp std::hash, this seems like a
great opportunity to address the needs of library authors who want to
be able to select randomly from a family of hash functions. Ideally
selecting new hash functions could happen "at will," e.g., at table
rehash in Cuckoo.
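
To make the kind of consumer I have in mind concrete, here's a rough
sketch of a Bloom filter that wants K hash functions. Today it has to
fake a family by folding a per-function seed into one base hash, which
is exactly the sort of ad-hoc workaround that first-class support for
families could replace:

  #include <bitset>
  #include <cstddef>
  #include <functional>
  #include <string>

  template <std::size_t Bits, std::size_t K>
  class BloomFilter {
   public:
    void insert(const std::string& key) {
      for (std::size_t i = 0; i < K; ++i) bits_.set(index(key, i));
    }
    bool maybe_contains(const std::string& key) const {
      for (std::size_t i = 0; i < K; ++i)
        if (!bits_.test(index(key, i))) return false;
      return true;
    }

   private:
    // "Selects" the i-th member of the family by folding a seed into a
    // single base hash; a real family would initialize per-function state.
    static std::size_t index(const std::string& key, std::size_t i) {
      std::size_t h = std::hash<std::string>{}(key);
      return (h ^ (i * 0x9E3779B97F4A7C15ull)) % Bits;
    }

    std::bitset<Bits> bits_;
  };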

Apologies if this is already covered by your proposed changes to the
C++ standard.

thanks,

Geoff

The thing I would like is a more explicit discussion of hash function families, and the ability to select from a family.

How would this be different from the hashtable / Bloom filter hashing
a random number into the beginning or end of the bitstream? Is the
idea that we could be more efficient than this, or that we could get
better hashes if we had explicit support for seeding the hash?

Would the container writer be able to manipulate a bitstream?
Currently, the container just gets a hash function to call, whose
input is a key and whose output is size_t.

Also, the ideal number of bits to prepend, if you're going that route,
would be different for different hash function families. It seems more
logical to me to make the interface more explicit about the existence
of hash function families.

Geoff

Would the container writer be able to manipulate a bitstream?
Currently, the container just gets a hash function to call, whose
input is a key and whose output is size_t.

I don't see why it couldn't manipulate the bitstream? If I'm reading
this API right, it could do something like

  size_t hash(const T& t) {
    return hash_combine(std::hash_code(), seed, t)();
  }

Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.

I guess what you're saying is that the algorithm keeps some number of
bits as internal state, and ideally you'd want to fill up all of those
bits with a seed, rather than just using a size_t?

But in practice we're not usually interested in exploring the full
range of possible hashers, especially given that flooding protection
is already being taken care of by randomizing at startup. If you know
you'll need n different hashers, is it not sufficient to seed with
values in the range [0, n)?

I guess you could just pass initial state to the hash_code constructor
(again if I'm reading this API right). But that's kind of an ugly API
if the constructor takes a different number of bytes depending on the
hash function family. Maybe if this information is really needed it
would be better to expose a typedef on some hasher type that tells you
how many bits of entropy it wants -- then it's up to you to prepend
the right amount.
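
Purely as a sketch of that last idea (none of these names exist
anywhere; whether the seed ends up in the constructor or gets prepended
to the byte stream is a detail), it might look like:

  #include <array>
  #include <cstddef>
  #include <cstdint>

  struct example_hasher {
    // Hypothetical: the hasher advertises how much seed material its
    // family wants, and the caller supplies exactly that much when
    // selecting a member of the family.
    using seed_type = std::array<std::uint64_t, 2>;  // e.g. a 128-bit key

    explicit example_hasher(const seed_type& seed) : seed_(seed) {}

    std::size_t operator()(const unsigned char* data, std::size_t len) const {
      // Placeholder mixing loop, standing in for a real keyed hash.
      std::size_t h = static_cast<std::size_t>(seed_[0] ^ seed_[1]);
      for (std::size_t i = 0; i < len; ++i) h = h * 131 + data[i];
      return h;
    }

   private:
    seed_type seed_;
  };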

This all brings up another concern:

Looking at the sample implementation [1], hashing a small value, e.g.
a pointer, an int64, or even a short string seems necessarily slower
than if I just implemented std::hash<T> to call FarmHash64 directly.
The reason being that the new API doesn't know a priori how many bytes
I'm going to be hashing, and I'm free to split up those bytes into as
many calls to the hasher as I want. Therefore it has to carefully
buffer things until I finalize the hasher.

In contrast, if I just call FarmHash64(some_int), it drops right into
a fast path and doesn't do any buffering and so on.

I like that the proposal lets us hash aggregate types (vectors,
structs, etc.) while preserving the full state of the hasher (that is,
without finalizing down to size_t), but in accordance with the
zero-cost principle of C++, it's probably important that simple types
are hashed as fast as you could hash them if you were invoking
FarmHash64 directly?

[1] https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h

No. Depending on the specific context, you want either a family of
Universal Hash Functions or even pairwise-independent hash functions.
That's a lot stronger than just n different hash functions, and whether
hash combining is enough to qualify is quite questionable.

Joerg

> Would the container writer be able to manipulate a bitstream?
> Currently, the container just gets a hash function to call, whose
> input is a key and whose output is size_t.

I don't see why it couldn't manipulate the bitstream? If I'm reading
this API right, it could do something like

  size_t hash(const T& t) {
    return hash_combine(std::hash_code(), seed, t)();
  }

> Also, the ideal number of bits to prepend, if you're going that route,
> would be different for different hash function families.

I guess what you're saying is that the algorithm keeps some number of
bits as internal state, and ideally you'd want to fill up all of those
bits with a seed, rather than just using a size_t?

But in practice we're not usually interested in exploring the full
range of possible hashers, especially given that flooding protection
is already being taken care of by randomizing at startup. If you know
you'll need n different hashers, is it not sufficient to seed with
values in the range [0, n)?

I guess you could just pass initial state to the hash_code constructor
(again if I'm reading this API right). But that's kind of an ugly API
if the constructor takes a different number of bytes depending on the
hash function family. Maybe if this information is really needed it
would be better to expose a typedef on some hasher type that tells you
how many bits of entropy it wants -- then it's up to you to prepend
the right amount.

This all brings up another concern:

Looking at the sample implementation [1], hashing a small value, e.g.
a pointer, an int64, or even a short string seems necessarily slower
than if I just implemented std::hash<T> to call FarmHash64 directly.
The reason being that the new API doesn't know a priori how many bytes
I'm going to be hashing, and I'm free to split up those bytes into as
many calls to the hasher as I want. Therefore it has to carefully
buffer things until I finalize the hasher.

In contrast, if I just call FarmHash64(some_int), it drops right into
a fast path and doesn't do any buffering and so on.

I like that the proposal lets us hash aggregate types (vectors,
structs, etc.) while preserving the full state of the hasher (that is,
without finalizing down to size_t), but in accordance with the
zero-cost principle of C++, it's probably important that simple types
are hashed as fast as you could hash them if you were invoking
FarmHash64 directly?

The implementation is permitted to specialize std::hash<T> to call
FarmHash64 (or whatever) directly when T doesn't depend on a user-defined
type (and the user is permitted to do so when T does depend on a
user-defined type), so it's possible to fix this as a QOI matter if it
arises in practice. However, I don't expect this to be a practical issue.
The API has been very carefully crafted to ensure that the optimizer can
discover and take advantage of situations where only a fixed number of
bytes are being hashed, and even if it can't eliminate the buffering, the
cost of a single memcpy is quite small compared to the cost of evaluating a
modern hash function.
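
For instance, a user is free to write a specialization along these
lines for a user-defined key (FarmHash64's exact signature is only
assumed here for illustration):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <functional>

  std::uint64_t FarmHash64(const char* data, std::size_t len);  // assumed signature

  struct SmallKey {
    std::uint64_t id;
  };

  namespace std {
  template <>
  struct hash<SmallKey> {
    size_t operator()(const SmallKey& k) const noexcept {
      // Fixed-size input: this goes straight to the raw-byte fast path
      // with no incremental buffering.
      char buf[sizeof k.id];
      memcpy(buf, &k.id, sizeof buf);
      return static_cast<size_t>(FarmHash64(buf, sizeof buf));
    }
  };
  }  // namespace std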

Would the container writer be able to manipulate a bitstream?
Currently, the container just gets a hash function to call, whose
input is a key and whose output is size_t.

I don't see why it couldn't manipulate the bitstream? If I'm reading
this API right, it could do something like

  size_t hash(const T& t) {
    return hash_combine(std::hash_code(), seed, t)();
  }

Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.

I guess what you're saying is that the algorithm keeps some number of
bits as internal state, and ideally you'd want to fill up all of those
bits with a seed, rather than just using a size_t?

Yes. Traditionally, when you select from a hash function family you
initialize some state. (In crypto literature, that's the "key," and
the thing to be hashed is the "message.") The size of the state and
the allowable range for each number in it depend on the family.
Sometimes an allowable range isn't a power of 2. (For a hairy example,
see vmac_set_key:
http://lxr.free-electrons.com/source/crypto/vmac.c#L487 .)
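
For a much less hairy toy example, here's roughly what selecting from
one textbook family (multiply-shift) looks like: the key is a single
64-bit multiplier, but it has to be odd, so even this simple family has
its own rule about what counts as a valid key.

  #include <cstdint>
  #include <random>

  class MultiplyShiftHash {
   public:
    // Selecting a member of the family == choosing an odd 64-bit multiplier.
    explicit MultiplyShiftHash(std::uint64_t key) : key_(key | 1u) {}

    static MultiplyShiftHash random_member(std::mt19937_64& rng) {
      return MultiplyShiftHash(rng());
    }

    // Hash a 64-bit word down to `bits` output bits (1 <= bits <= 32).
    std::uint32_t operator()(std::uint64_t x, unsigned bits) const {
      return static_cast<std::uint32_t>((key_ * x) >> (64 - bits));
    }

   private:
    std::uint64_t key_;
  };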

But in practice we're not usually interested in exploring the full
range of possible hashers, especially given that flooding protection
is already being taken care of by randomizing at startup. If you know

Randomization at startup would offer some protection, but it doesn't
by itself solve the hash-flooding problem; with various hash-function
families it is only conjectured to thwart the worst hash-flooding
attacks. I strongly believe hash-based containers should have some
hash-flooding detection and evasion, and I'm working on that. One idea
that looks promising is to have containers use a quick-and-dirty hash
function most of the time, with judicious use of randomly selected
slower-but-nicer hash functions as needed.
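
Very roughly, that might look something like the sketch below (all
names invented; the final mixer is just a stand-in for a real keyed
hash such as SipHash):

  #include <cstddef>
  #include <cstdint>
  #include <random>

  class HashPolicy {
   public:
    std::size_t hash(std::uint64_t key) const {
      return evading_ ? strong_hash(key) : quick_hash(key);
    }

    // Called by the container when it sees a suspiciously long probe chain;
    // once flooding is suspected, the container rehashes with a randomly
    // keyed, heavier hash.
    void note_probe_length(std::size_t probes) {
      if (!evading_ && probes > kThreshold) {
        std::random_device rd;
        strong_seed_ = (static_cast<std::uint64_t>(rd()) << 32) | rd();
        evading_ = true;
      }
    }

    bool evading() const { return evading_; }

   private:
    static constexpr std::size_t kThreshold = 64;

    static std::size_t quick_hash(std::uint64_t k) {
      return static_cast<std::size_t>(k * 0x9E3779B97F4A7C15ull);  // cheap, unkeyed
    }

    std::size_t strong_hash(std::uint64_t k) const {
      std::uint64_t x = k ^ strong_seed_;
      x ^= x >> 33;  x *= 0xFF51AFD7ED558CCDull;
      x ^= x >> 33;  x *= 0xC4CEB9FE1A85EC53ull;
      x ^= x >> 33;
      return static_cast<std::size_t>(x);
    }

    std::uint64_t strong_seed_ = 0;
    bool evading_ = false;
  };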

you'll need n different hashers, is it not sufficient to seed with
values in the range [0, n)?

I guess you could just pass initial state to the hash_code constructor
(again if I'm reading this API right). But that's kind of an ugly API
if the constructor takes a different number of bytes depending on the
hash function family. Maybe if this information is really needed it
would be better to expose a typedef on some hasher type that tells you
how many bits of entropy it wants -- then it's up to you to prepend
the right amount.

This all brings up another concern:

Looking at the sample implementation [1], hashing a small value, e.g.
a pointer, an int64, or even a short string seems necessarily slower
than if I just implemented std::hash<T> to call FarmHash64 directly.
The reason being that the new API doesn't know a priori how many bytes
I'm going to be hashing, and I'm free to split up those bytes into as
many calls to the hasher as I want. Therefore it has to carefully
buffer things until I finalize the hasher.

In contrast, if I just call FarmHash64(some_int), it drops right into
a fast path and doesn't do any buffering and so on.

I like that the proposal lets us hash aggregate types (vectors,
structs, etc.) while preserving the full state of the hasher (that is,
without finalizing down to size_t), but in accordance with the
zero-cost principle of C++, it's probably important that simple types
are hashed as fast as you could hash them if you were invoking
FarmHash64 directly?

Yes, from a performance standpoint, APIs that let me specify exactly
what I'm doing are probably going to be best. E.g., maybe I want to
initialize and use a few hash functions that map strings to integers
in [0, number of buckets). I'm guessing the counterargument relates to
API complexity.

Geoff