We need better hashing

LLVM currently has a bunch of different hashing algorithms scattered throughout the code base.

There are also a number of places in the code where a FoldingSetNodeID is created just to calculate a hash, and then discarded. From an efficiency standpoint, this isn’t all that bad unless the number of individual items being hashed exceeds 32, at which point the SmallVector overflows and heap memory is allocated.
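(To make the pattern concrete - the FoldingSetNodeID calls below are the real API, but the particular fields being hashed are made up:)

FoldingSetNodeID ID;
ID.AddPointer(Ty);          // e.g. some uniqued Type* (hypothetical field)
ID.AddInteger(NumOperands); // plus an integer or two (hypothetical field)
unsigned Hash = ID.ComputeHash();
// ID - and its SmallVector - is then thrown away; only Hash was wanted.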

I personally want to see a better approach to hashing because of the cleanup work I’ve been doing - I’ve been replacing std::map and FoldingSet with DenseMap in a number of places, and plan to do more of this. The thing is, for complex key types, you really need to have a custom DenseMapInfo, and that’s where having a good hash function comes in.
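(For illustration, a minimal DenseMapInfo sketch for a made-up two-field key - getHashValue is exactly the spot that wants a common, good hash function:)

struct Key { Type *Ty; unsigned Flags; };  // hypothetical key type

namespace llvm {
template<> struct DenseMapInfo<Key> {
  static Key getEmptyKey() { Key K = { 0, ~0U }; return K; }
  static Key getTombstoneKey() { Key K = { 0, ~0U - 1 }; return K; }
  static unsigned getHashValue(const Key &K) {
    // Ad hoc mixing - the part a shared hashing facility should replace.
    return unsigned(uintptr_t(K.Ty)) * 37U ^ K.Flags;
  }
  static bool isEqual(const Key &L, const Key &R) {
    return L.Ty == R.Ty && L.Flags == R.Flags;
  }
};
}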

There are a bunch of hash functions out there (FNV1, SuperFastHash, and many others). The best overall hash function that I am currently aware of is Austin Appleby’s MurmurHash3 (http://code.google.com/p/smhasher/).

For LLVM’s use, we want a hash function that can handle mixed data - that is, pointers, ints, strings, and so on. Most of the high-performance hash functions will work well on mixed data types, but you have to put everything into a flat buffer - that is, an array of machine words whose starting address is aligned on a machine-word boundary. The inner loops of the hash functions are designed to take advantage of parallelism of the instruction pipeline, and if you try feeding in values one at a time it’s possible that you can lose a lot of speed. (Although I am not an expert in this area, so feel free to correct me on this point.) On the other hand, if your input values aren’t already arranged into a flat buffer, the cost of writing them to memory has to be taken into account.

Also, most of the maps in LLVM are fairly small (<1000 entries), so the speed of the hash function itself is probably more important than getting the best possible mixing of bits.

It seems that for LLVM’s purposes, something that has an interface similar to FoldingSetNodeID would make for an easy transition. One approach would be to start with something very much like FoldingSetNodeID, except with a fixed-length buffer instead of a SmallVector - the idea is that when you are about to overflow, instead of allocating more space, you would compute an intermediate hash value and then start over at the beginning of the buffer.
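(A rough sketch of that flush-on-overflow idea - the names and the mixBlock routine are hypothetical:)

// Core block hash, e.g. MurmurHash2's inner loop (not shown here).
uint32_t mixBlock(uint32_t Seed, const uint32_t *Words, unsigned Count);

class IncrementalHash {
  uint32_t Buffer[32];  // fixed-size scratch space, never heap-allocated
  unsigned Count;       // number of words currently buffered
  uint32_t Hash;        // running intermediate hash value
public:
  IncrementalHash() : Count(0), Hash(0) {}
  void addWord(uint32_t Word) {
    if (Count == 32) {                      // about to overflow:
      Hash = mixBlock(Hash, Buffer, Count); // fold the buffer into the hash
      Count = 0;                            // and start over at the front
    }
    Buffer[Count++] = Word;
  }
};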

Another question is whether or not you would want to replace the hash functions in DenseMapInfo, which are designed to be efficient for very small keys - most of the high-performance hash functions have a fairly substantial fixed overhead (usually in the form of a final mixing step) and thus only make sense for larger key sizes.
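(For comparison, the integer hashes in DenseMapInfo are a single multiply - approximately this:)

// Roughly what llvm/ADT/DenseMapInfo.h does for unsigned keys:
static unsigned getHashValue(const unsigned &Val) { return Val * 37U; }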

By the way, the reason I’m bringing this up is that a number of folks are currently working on optimizing the use of hash tables within LLVM’s code base, and unless we can come up with a common hashing facility, there will be an increasing proliferation of cut & paste copies of hash functions. So feedback would be nice.

Here’s my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.

Hashing.h (3.94 KB)

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

Thanks,
Jay.

Incidentally there was talk of using CityHash for LLVM's StringMap
last year, but I don't think it ever came to anything:

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-April/014656.html

Jay.

Here’s my latest version of Hashing.h, which I propose to add to llvm/ADT.
Comments welcome and encouraged.

/// Adapted from MurmurHash2 by Austin Appleby

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there’s no incremental version of 3. If you look at the source, you’ll notice that the very first thing that 3 does is Hash ^= Length, but for the incremental case we don’t know the length until we’re done. Also, 2 has fewer instructions per block hashed than 3; 3 also requires bit rotations, whereas 2 only uses bit shifts.
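(For reference, MurmurHash2’s per-word step really is just multiplies and shift-xors:)

// One round of MurmurHash2: mix the incoming word k into the running hash h.
static inline uint32_t murmur2Round(uint32_t h, uint32_t k) {
  const uint32_t m = 0x5bd1e995; // MurmurHash2 multiplier
  const int r = 24;
  k *= m;  k ^= k >> r;  k *= m; // mix the word
  h *= m;  h ^= k;               // fold it in - shifts only, no rotates
  return h;
}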

Bear in mind that the “flaw” in MurmurHash2 is a fairly esoteric case which (I suspect) LLVM is unlikely to ever encounter in practice. Austin is trying to develop the best possible hash for a wide range of key probability distributions, and his testing methodologies are quite strict.

LLVM’s needs, on the other hand, are fairly modest. I’m guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn’t hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that “Hashing.h” should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I’m working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

At that point, there wasn't a clear win. CityHash (iirc) is optimized for huge strings, which aren't the usual case in the compiler AFAIK.

-Chris

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there’s no incremental version of 3.

I think that that is a great reason.

LLVM’s needs, on the other hand, are fairly modest. I’m guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn’t hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Indeed. It can’t be hard to be better than our existing ad hoc stuff :). That said, please do not change the hash function used by StringMap without doing careful performance analysis of the clang preprocessor. The identifier uniquing is a very hot path in “clang -E” performance.

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that “Hashing.h” should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I’m working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

I think that this is a great way to stage it. I personally care more about the interface than the implementation. Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes? PointerLikeTypeTraits doesn’t seem necessary. Is type_traits?

enum {
  BufferSize = 32,

BufferSize is dead.

/// Add a pointer value
template<typename T>
void add(const T *PtrVal) {
  addImpl(reinterpret_cast<const uint32_t *>(&PtrVal),
          reinterpret_cast<const uint32_t *>(&PtrVal + 1));
}

This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t’s, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.

Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.
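(Presumably something along these lines - a sketch; HashState, HashTraits, and addWord are made-up names:)

// Each hashable type opts in via a specialization rather than having its
// bytes reinterpreted; unsupported types then fail to compile.
template<typename T> struct HashTraits; // intentionally undefined

template<> struct HashTraits<uint32_t> {
  static void addTo(HashState &H, uint32_t V) { H.addWord(V); }
};
template<typename T> struct HashTraits<T *> {
  static void addTo(HashState &H, T *P) {
    H.addWord(uint32_t(uintptr_t(P)));  // low bits of the pointer value
    if (sizeof(P) > sizeof(uint32_t))   // high bits too on 64-bit hosts
      H.addWord(uint32_t(uint64_t(uintptr_t(P)) >> 32));
  }
};

template<typename T> void add(HashState &H, const T &V) {
  HashTraits<T>::addTo(H, V);
}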

/// Add a float
void add(float Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

/// Add a double
void add(double Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

Similarly, these need to go through a union to avoid TBAA problems.

void add(StringRef StrVal) {
  addImpl(StrVal.data(), StrVal.size());
}

I’m contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap).
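(The Bernstein function in question is essentially the following; HashString is a variant of it:)

// Classic Bernstein string hash: one multiply-by-33 and add per byte.
static unsigned bernsteinHash(const char *S, size_t Len) {
  unsigned H = 5381; // traditional djb2 seed
  for (size_t i = 0; i != Len; ++i)
    H = H * 33 + (unsigned char)S[i];
  return H;
}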

// Add a possibly unaligned sequence of bytes.
void addImpl(const char *I, size_t Length) {

This should probably be moved out of line to avoid code bloat.

Overall, the design of the class is making sense to me! Thanks for tackling this!

-Chris

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there’s no incremental version of 3.

I think that that is a great reason.

LLVM’s needs, on the other hand, are fairly modest. I’m guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn’t hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Indeed. It can’t be hard to be better than our existing ad hoc stuff :). That said, please do not change the hash function used by StringMap without doing careful performance analysis of the clang preprocessor. The identifier uniquing is a very hot path in “clang -E” performance.

I’m not planning on touching StringMap.

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that “Hashing.h” should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I’m working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

I think that this is a great way to stage it. I personally care more about the interface than the implementation. Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes? PointerLikeTypeTraits doesn’t seem necessary. Is type_traits?

Ooops, this was a cut & paste error from FoldingSet.cpp.

enum {
  BufferSize = 32,

BufferSize is dead.

/// Add a pointer value
template<typename T>
void add(const T *PtrVal) {
  addImpl(reinterpret_cast<const uint32_t *>(&PtrVal),
          reinterpret_cast<const uint32_t *>(&PtrVal + 1));
}

This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t’s, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.

So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we’re going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to “addPointer” and adding a comment might be in order. Similarly, I can make the ArrayRef version ‘addPointers’ and have it take an ArrayRef<T*>.

Now, as to the 4 bytes issue, I think I can solve that with some clever template methods.

Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.

/// Add a float
void add(float Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

/// Add a double
void add(double Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

Similarly, these need to go through a union to avoid TBAA problems.

I’m not sure how that works. Can you give an example?

void add(StringRef StrVal) {
  addImpl(StrVal.data(), StrVal.size());
}

I’m contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap).

So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it’s designed to do such a thorough job of mixing the bits that it really doesn’t matter what data types you feed it. You are correct that for purely string data, you’d want to use a less expensive algorithm (I’m partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)
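(For reference, 32-bit FNV-1 - multiply by the FNV prime, then xor in each byte:)

static uint32_t fnv1Hash(const char *S, size_t Len) {
  uint32_t H = 2166136261U; // 32-bit FNV offset basis
  for (size_t i = 0; i != Len; ++i) {
    H *= 16777619U;         // 32-bit FNV prime
    H ^= (unsigned char)S[i];
  }
  return H;
}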

// Add a possibly unaligned sequence of bytes.
void addImpl(const char *I, size_t Length) {

This should probably be moved out of line to avoid code bloat.

OK

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there’s no incremental version of 3.

I think that that is a great reason.

LLVM’s needs, on the other hand, are fairly modest. I’m guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn’t hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Indeed. It can’t be hard to be better than our existing ad hoc stuff :). That said, please do not change the hash function used by StringMap without doing careful performance analysis of the clang preprocessor. The identifier uniquing is a very hot path in “clang -E” performance.

I’m not planning on touching StringMap.

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that “Hashing.h” should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I’m working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

I think that this is a great way to stage it. I personally care more about the interface than the implementation. Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes? PointerLikeTypeTraits doesn’t seem necessary. Is type_traits?

Ooops, this was a cut & paste error from FoldingSet.cpp.

enum {
  BufferSize = 32,

BufferSize is dead.

/// Add a pointer value
template<typename T>
void add(const T *PtrVal) {
  addImpl(reinterpret_cast<const uint32_t *>(&PtrVal),
          reinterpret_cast<const uint32_t *>(&PtrVal + 1));
}

This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t’s, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.

So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we’re going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to “addPointer” and adding a comment might be in order. Similarly, I can make the ArrayRef version ‘addPointers’ and have it take an ArrayRef<T*>.

Also, something I forgot to mention: since it’s possible to convert any single T into an ArrayRef of size 1, technically you could just have one method that accepts an ArrayRef, which would work for all cases - all of the other methods are just optimizations. In which case, my question is: do you still need a union? In other words, would the following work as expected?

void add(float Data) {
  addArray(ArrayRef<float>(Data));
}

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there's no incremental version of 3.

I think that that is a great reason.

LLVM's needs, on the other hand, are fairly modest. I'm guessing that most
DenseMaps contain less than a few thousand entries. Even a bad hash function
wouldn't hurt performance that much, and the time taken to calculate the
hash is probably more of a factor in overall performance than getting the
optimum distribution of hash values.

Indeed. It can't be hard to be better than our existing ad hoc stuff :).
That said, please do not change the hash function used by StringMap without
doing careful performance analysis of the clang preprocessor. The identifier
uniquing is a very hot path in "clang -E" performance.

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that "Hashing.h" should contain a variety of
different hashing algorithms for various specific needs. Right now, I am
mainly focusing on hashing of mixed data types - that is, you have some
structure which contains pointers, ints, strings, and you want to calculate
a hash of the entire struct. I need this because I'm working on improving
the uniquing of constants and similar data structures. My next target is to
improve the uniquing of MDNodes, but I want to get the hashing stuff squared
away first.

It is also my intent that some person who is more of an expert in hashing
than I am can do detailed performance analysis under real-world loads (such
as compiling actual programs with clang) and tweak and optimize the hashing
algorithm, without needing to modify the API and/or all of the places that
call it.

I think that this is a great way to stage it. I personally care more about
the interface than the implementation. Someone can tweak and tune it after
enough code is using the API to get reasonable performance numbers.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes? PointerLikeTypeTraits doesn't
seem necessary. Is type_traits?

enum {
  BufferSize = 32,

BufferSize is dead.

/// Add a pointer value
template<typename T>
void add(const T *PtrVal) {
  addImpl(reinterpret_cast<const uint32_t *>(&PtrVal),
          reinterpret_cast<const uint32_t *>(&PtrVal + 1));
}

This violates TBAA rules and looks pretty dangerous to expose as public API.
Is this really needed? Also, addImpl is dereferencing the pointers as
uint32_t's, but there is nothing that guarantees that T is a multiple of 4
bytes. The ArrayRef version has the same problem.

Though it is more verbose, I think it would be better to expose a template
specialization approach to getting the hash_value of T.

/// Add a float
void add(float Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

/// Add a double
void add(double Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

Similarly, these need to go through a union to avoid TBAA problems.

These are just wrong: they'll hash +0 and -0 to different values even
though they compare ==.

void add(StringRef StrVal) {
  addImpl(StrVal.data(), StrVal.size());
}

I'm contradicting my stance above about not caring about the implementation
:), but is MurmurHash a good hash for string data? The Bernstein hash
function works really well and is much cheaper to compute than Murmur.

Have you seen benchmarks saying that, or are you just guessing from
the length of the code? The benchmarks I've seen say the opposite:
http://code.google.com/p/smhasher/wiki/MurmurHash3#Bulk_speed_test,_hashing_an_8-byte-aligned_256k_block
and http://code.google.com/p/cityhash/source/browse/trunk/README.

I think you've misread the code. We're not passing PtrVal to addImpl,
we're passing &PtrVal. So the constraint is that the pointer type
"const T *" must be at least as aligned as a uint32_t, which seems
safe.

Jay.

The compiler's use case is not typically a 256K long string. I'm speaking from experience tuning the clang preprocessor identifier lookup table.

-Chris

/// Add a pointer value
template<typename T>
void add(const T *PtrVal) {
  addImpl(reinterpret_cast<const uint32_t *>(&PtrVal),
          reinterpret_cast<const uint32_t *>(&PtrVal + 1));
}

This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t’s, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem.

So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we’re going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to “addPointer” and adding a comment might be in order. Similarly, I can make the ArrayRef version ‘addPointers’ and have it take an ArrayRef<T*>.

Ah, Jay was right, I misread this code!

Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.

/// Add a float
void add(float Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

/// Add a double
void add(double Data) {
  addImpl(reinterpret_cast<const uint32_t *>(&Data),
          reinterpret_cast<const uint32_t *>(&Data + 1));
}

Similarly, these need to go through a union to avoid TBAA problems.

I’m not sure how that works. Can you give an example?

Just use:

void add(double Data) {
  union {
    double D;
    uint64_t I;
  };
  D = Data;
  add(I);
}

void add(StringRef StrVal) {
  addImpl(StrVal.data(), StrVal.size());
}

I’m contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap).

So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it’s designed to do such a thorough job of mixing the bits that it really doesn’t matter what data types you feed it. You are correct that for purely string data, you’d want to use a less expensive algorithm (I’m partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)

Ok, so what’s the answer? :) We can do different things for ArrayRef and StringRef.

-Chris

These are the numbers I come up with, graphing time (ns) vs. string length (chars), using the current Xcode compiler, 64-bit, -O3.
Bernstein is inlined while Murmur and CityHash are not:

PastedGraphic-5.tiff

OK here’s a patch with the latest, including unit tests. I’ve also tried to make the comments clear that this is intended for the case of “generic” key types, where you are either hashing multiple data types together, or you don’t know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this.

There are a couple of things I don’t like. First, there are too many levels of inlined function calls - my experience is that compilers aren’t as good at inlining nested calls as is often advertised - but I couldn’t figure out a way to get the same effect without the nested calls.

Similarly, the addBitsImpl helper types are a little too complex and magical for my taste, but again I couldn’t think of a better way to automatically detect whether to use the aligned vs. unaligned hashing routine.
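(The shape of it is roughly this - a reconstruction for illustration, not the actual patch; addAlignedWords and addUnalignedBytes are made-up names, and the detection predicate is the questionable part:)

// Dispatch on a compile-time boolean: one specialization hashes whole
// aligned 32-bit words, the other falls back to byte-at-a-time.
template<bool WordAligned> struct AddBitsImpl;

template<> struct AddBitsImpl<true> {
  static void add(GeneralHash &H, const char *B, const char *E) {
    H.addAlignedWords(reinterpret_cast<const uint32_t *>(B),
                      reinterpret_cast<const uint32_t *>(E));
  }
};
template<> struct AddBitsImpl<false> {
  static void add(GeneralHash &H, const char *B, const char *E) {
    H.addUnalignedBytes(B, E - B);
  }
};

template<typename T>
void addBits(const T *First, const T *Last) {
  // Inferring alignment from sizeof is the dubious bit (see the replies):
  AddBitsImpl<sizeof(T) % sizeof(uint32_t) == 0>::add(
      *this, reinterpret_cast<const char *>(First),
      reinterpret_cast<const char *>(Last));
}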

hashing.patch (11.4 KB)

I don't think this is safe, e.g. for:

struct {
  char c[4];
}

Jay.

... so you should probably use the stuff in <llvm/Support/AlignOf.h> instead!

Jay.

OK here's a patch with the latest, including unit tests. I've also tried to make the comments clear that this is intended for the case of "generic" key types, where you are either hashing multiple data types together, or you don't know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this.

Makes sense.

+ /// Add a pointer value.
+ /// Note: this adds pointers to the hash using sizes and endianness that
+ /// depend on the host. It doesn't matter however, because hashing on
+ /// pointer values is inherently unstable.

This makes perfect sense.

+ /// Add an ArrayRef of arbitrary data.
+ template<typename T>
+ GeneralHash& add(ArrayRef<T> ArrayVal) {
+   addBits(ArrayVal.begin(), ArrayVal.end());
+   return *this;
+ }

Doesn't this have the same host-specificity problem, except that it will cause things that *are* stable to vary, such as arrays of char, or is the alignment check enough?

+ /// Add a float
+ GeneralHash& add(float Data) {

It is worth adding a comment here that this does a bitwise hash, so -0.0 and +0.0 will hash to different values even though they compare equal, and two identical NaNs will hash to the same value even though they compare unequal.
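(If equal-compares-equal behavior were wanted instead, one hypothetical fix is to canonicalize zeros before taking the bit pattern:)

GeneralHash& add(double Data) {
  if (Data == 0.0)
    Data = 0.0;  // -0.0 == 0.0, so this folds both zeros to +0.0
  union { double D; uint64_t I; };
  D = Data;      // type-pun through the union, as suggested earlier
  return add(I); // assumes an add(uint64_t) overload exists
}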

The mix logic is inherently a series of 32-bit operations. Would it be possible and make more sense to implement them as 64-bit operations? 64-bit hosts are the vast majority of the machines that run a compiler these days. OTOH, making it depend on the host brings up host instability issues.
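(For reference, MurmurHash3's 64-bit finalizer is the kind of step a 64-bit variant of the mix logic would be built from:)

static uint64_t fmix64(uint64_t K) {
  K ^= K >> 33;
  K *= 0xff51afd7ed558ccdULL;
  K ^= K >> 33;
  K *= 0xc4ceb9fe1a85ec53ULL;
  K ^= K >> 33;
  return K;
}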

Actually, how much do we care about host instability here? If this is used by hashing containers, we can just declare that iteration order is undefined even for non-pointer values. The real problem, I guess, is that we'd have to go check all the existing DenseMaps etc. to make sure people aren't relying on well-defined iteration right now, and fix the code. What do you think?

There are a couple of things I don't like. First, there are too many levels of inlined function calls - my experience is that compilers aren't as good at inlining nested calls as is often advertised - but I couldn't figure out a way to get the same effect without the nested calls.

Is there a specific observed concern here (e.g. when built with Clang) or is this a historical problem? Compilers have gotten much better about inlining stuff that is actually small, if Clang handles it then I think we're good. Marking these attribute(always_inline) is massive overkill.

Overall the code is looking great. I'd recommend checking in the new API separately from switching Constants.cpp to use it, though (just so that any problems don't cause both to get reverted).

-Chris

Sorry for coming late to this thread. I’ve been very active in recent work on hashing routines inside and outside of Google, so I’ve a keen interest in this…

Some initial clarifications: