Proposed implementation of N3333 hashing interfaces for LLVM (and possible libc++)

Hello folks,

TL;DR: This is my proposed hashing interface based on a proposed standard hashing interface. It also is implemented with a much faster and higher quality algorithm than the current one. This is an early draft of the code, looking for initial feedback.

There has been recent interest in improving the quality and consistency of LLVM’s approach to hashing. In particular, getting a single API and a high quality implementation in one place that other parts of the codebase can use to hash custom data structures. As it happens, Jeffrey Yasskin and I have been working on a proposal with similar goals to the the C++ standard library[1].

bernstein_performance.txt (2.92 KB)

Hashing.h (26.9 KB)

superfast_performance.txt (2.93 KB)

old_llvm_performance.txt (2.91 KB)

murmur2a_performance.txt (2.93 KB)

murmur3a_performance.txt (2.93 KB)

city64_performance.txt (2.92 KB)

new_llvm_performance.txt (2.92 KB)

spooky64_performance.txt (2.92 KB)

Hello folks,

TL;DR: This is my proposed hashing interface based on a proposed
standard hashing interface. It also is implemented with a much faster
and higher quality algorithm than the current one. This is an *early
draft* of the code, looking for initial feedback.

There has been recent interest in improving the quality and consistency
of LLVM's approach to hashing. In particular, getting a single API and a
high quality implementation in one place that other parts of the
codebase can use to hash custom data structures. As it happens, Jeffrey
Yasskin and I have been working on a proposal with similar goals to the
the C++ standard library[1].

Hi Chandler,

I just skimmed over the proposal and it seems very interesting to me. Unfortunately I did not have time to look further into it.

One point that I found surprising is that your hashing is in most cases notably faster than the old LLVM hashing, except for key sizes which are a modulo of 4. Here the existing hash function is a lot faster than your new solution. If keys that are multiple of 4 bytes are common in LLVM, it might be worth to check if this change does not cause performance regressions.

old_llvm_performance.txt

-------------------------------------------------------------------------------
--- Testing llvm (LLVM Hashing)

[[[ Speed Tests ]]]

Small key speed test - 4-byte keys - 16.77 cycles/hash 7.39 nanos/hash
Small key speed test - 8-byte keys - 25.93 cycles/hash 11.43 nanos/hash
Small key speed test - 12-byte keys - 30.39 cycles/hash 13.39 nanos/hash
Small key speed test - 16-byte keys - 30.18 cycles/hash 13.30 nanos/hash
Small key speed test - 20-byte keys - 39.33 cycles/hash 17.34 nanos/hash
Small key speed test - 24-byte keys - 44.70 cycles/hash 19.70 nanos/hash
Small key speed test - 28-byte keys - 49.17 cycles/hash 21.68 nanos/hash

new_llvm_performance.txt

-------------------------------------------------------------------------------
--- Testing std64 (LLVM Standard-baed Hashing)

[[[ Speed Tests ]]]

Small key speed test - 4-byte keys - 40.22 cycles/hash 17.73 nanos/hash
Small key speed test - 8-byte keys - 40.24 cycles/hash 17.73 nanos/hash
Small key speed test - 12-byte keys - 44.69 cycles/hash 19.70 nanos/hash
Small key speed test - 16-byte keys - 44.69 cycles/hash 19.70 nanos/hash
Small key speed test - 24-byte keys - 50.95 cycles/hash 22.46 nanos/hash
Small key speed test - 28-byte keys - 50.95 cycles/hash 22.46 nanos/hash

Cheers
Tobi

Does the enclosed implementation implement this part of N3333:

http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2012/n3333.html#per.process.seed

?

That to me seems like potentially the most controversial part. I scanned Hashing.h but did not immediately see that support (I could've easily missed it).

Howard

Will post an updated (and cleaned up, thanks Nadav!) patch shortly, but wanted to answer this question:

Thanks for the feedback thus far!

I’ve updated the header file, and enclosed a complete patch against LLVM. This passes all the regtests, and I’ll be doing more thorough testing of LLVM itself with the patch. I’ve included some basic unit tests, but could probably do more here… Not sure it’s worth delaying the initial submission though, as the best testing is to use a hash testing suite…

I’ve addressed the comments from Nadav, but there may be more minor stylistic issues or typos. Please keep pointing them out! I appreciate the help there.

I’ve also corrected my stub for the execution-seed to be more correct and to include the ability to override it during program startup. It still doesn’t go the final step to make it actually change on each execution, but is now otherwise identical. (In particular, it is no longer a compile-time constant that can be folded.) This regressed the performance a tiny bit, however…

There are several improvements to the implementation of the hashing algorithm itself. The way the hashing included the seed hurt performance quite a bit. I’ve re-worked it to be much lighter weight, and even after the execution seed fix above slowed things down, this speeds everything up more. We shave 4ns off the 4 to 8 byte case, bringing performance above that of essentially every high quality hashing algorithm…

I still think we can do more, but it’s already much faster than the existing LLVM one except for the issue Tobias pointed out w/ modulo-4 key sizes. I’m going to investigate this, but it may be a consequence of a weakness in that algorithm.

I’ve re-run the SMHasher quality testing suite and the results are very good. There are a few problems identified, but these are long-known problems with CityHash in general, and have proven to thus far not be a cause of real-world issues… I’d like to fix them, but it doesn’t seem a high priority yet.

Finally, I’ve run some rough initial numbers for x86 32-bit, and it’s surprisingly good. The relative speeds of this algorithm and others don’t seem to change much. A bit suspicious of that, so going to do more thorough analysis.

Hashing.h (27.9 KB)

hashing.diff.gz (13.2 KB)

I still think we can do more, but it's already much faster than the existing LLVM one except for the issue Tobias pointed out w/ modulo-4 key sizes. I'm going to investigate this

OK, but this is a VERY big exception! Almost any non-string data
anyone wants to hash will be a multiple of 4 bytes in length.

//===-- llvm/ADT/Hashing.h - Utilities for hashing --------------*- C++ -*-===//

Doesn't this belong in ../Support/ ? No-one's ever explained the
difference to me; I just assumed that generic data types go in ADT/
and everything else goes in Support/.

//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file implements the newly proposed standard C++ interfaces for hashing
// arbitrary data and building hash functions for user-defined types. This
// interface was originally proposed in N3333[1] and is currently under review
// for inclusion in a future TR and/or standard.
//
// The primary interfaces provide are comprised of one type and three functions:

"provided"

//
// -- 'hash_code' class is an opaque type representing the hash code for some
// data. It is the intended product of hashing, and can be used to implement
// hash tables, checksumming, and other common uses of hashes. It is not an
// integer type (although it can be converted to one) because it is risky
// to assume much about the internals of a hash_code. In particular, each
// execution of the program has a high probability of producing a different
// hash_code for a given input. Thus their values are not stable to save or
// persist, and should only be used during the execution for the
// construction of hashing datastructures.
//
// -- 'hash_value' is a function designed to be overloaded for each
// user-defined type which wishes to be used within a hashing context. It
// should be overloaded within the user-defined type's namespace and found via
// ADL. Overloads for primitive types are provided by this library.
//
// -- 'hash_combine' and 'hash_combine_range' are functions designed to aid
// programmers in easily and intuitively combining a set of data into a single
// hash_code for their object. They should only logically be used within the
// implementation of a 'hash_value' routine or similar context.
//
// Note that 'hash_combine_range' contains very special logic for hashing
// a contiguous array of integers or pointers. This logic is *extremely* fast,
// on a modern Intel "Gainestown" Xeon (Nehalem uarch) @2.2 GHz, these were
// benchmarked at over 8.5 GiB/s for large keys, and <20 cycles/

20 cycles per what? Don't keep us in suspense!

//
//===----------------------------------------------------------------------===//

#ifndef LLVM_ADT_HASHING_H
#define LLVM_ADT_HASHING_H

#include "llvm/ADT/STLExtras.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/type_traits.h"
#include <cassert>
#include <cstring>
#include <algorithm>
#include <utility>
#include <iterator>

namespace llvm {

/// \brief An opaque object representing a hash code.
///
/// This object represents the result of hashing some entity. It is intended to
/// be used to implement hashtables or other hashing-based data structures.
/// While it wraps and exposes a numeric value, this value should not be
/// trusted to be stable or predictable across processes or executions.
///
/// In order to obtain the hash_code for an object 'x':
/// \code
/// using llvm::hash_value;
/// llvm::hash_code code = hash_value(x);
/// \endcode
///
/// Also note that there are two numerical values which are reserved, and the
/// implementation ensures will never be produced for real hash_codes. These
/// can be used as sentinels within hashing data structures.
class hash_code {
  size_t value;

public:
  /// \brief Default construct a hash_code. Constructs a null code.
  hash_code() : value() {}

  /// \brief Form a hash code directly from a numerical value.
  hash_code(size_t value) : value(value) {}

  /// \brief Convert the hash code to its numerical value for use.
  /*explicit*/ operator size_t() const { return value; }

  /// \brief Get a hash_code object which corresponds to a null code.
  ///
  /// The null code must never be the result of any 'hash_value' calls and can
  /// be used to detect an unset hash_code.
  static hash_code get_null_code() { return 0; }

  /// \brief Get a hash_code object which corresponds to an invalid code.
  ///
  /// The invalid code must never be the result of any 'hash_value' calls. This
  /// can be used to flag invalid hash_codes or mark entries in a hash table.
  static hash_code get_invalid_code() { return -1; }

I'm sure there's a compiler out there just waiting to warn you that
you're implicitly converting -1 to an unsigned type. :slight_smile:

  friend bool operator==(const hash_code &lhs, const hash_code &rhs) {
    return lhs.value == rhs.value;
  }
  friend bool operator!=(const hash_code &lhs, const hash_code &rhs) {
    return lhs.value != rhs.value;
  }

  /// \brief Allow a hash_code to be directly run through hash_value.
  friend size_t hash_value(const hash_code &code) { return code.value; }
};

// All of the implementation details of actually computing the various hash
// code values are held within this namespace. These routines are included in
// the header file mainly to allow inlining and constant propagation.
namespace hashing { namespace detail {

My only comment on the implementation is that, on CPUs that have
different instruction sequences for aligned and unaligned loads, you
want to be careful that the 8-byte memcpys turn into aligned loads
when hashing naturally aligned data. (But if we only care about
running on x86-64, this won't be a problem.)

Jay.

(But if we only care about
running on x86-64, this won't be a problem.)

Please, no. We have a cortex-a9 native buildbot already in lab.llvm.org and
as manufacturers emit faster ARM chips we (ARM) will want to have LLVM run
native on them.

You've also got the OpenCL use case etc.

Please bear ARM in mind.

Cheers,

James

Behalf Of Jay Foad

Right. (But ARMv7 uses the same ldr instruction for aligned and
unaligned loads (doesn't it?) so it's not an issue there.)

Jay.

That is true!

(But if we only care about
running on x86-64, this won't be a problem.)

Please, no. We have a cortex-a9 native buildbot already in lab.llvm.org and
as manufacturers emit faster ARM chips we (ARM) will want to have LLVM run
native on them.

You've also got the OpenCL use case etc.

Please bear ARM in mind.

Right. (But ARMv7 uses the same ldr instruction for aligned and
unaligned loads (doesn't it?) so it's not an issue there.)

Sorta. It's the same instruction, but on some implementations, unaligned loads will cause an exception, so the compiler may need to generate LDRB sequences instead. There's also the VLD* instructions to consider. Depending on what's being loaded, there's all sorts of potential tricks that can be played. We should try to be careful and give the compiler as many hints about alignment as we can.

-Jim

// – ‘hash_code’ class is an opaque type representing the hash code for some
// data. It is the intended product of hashing, and can be used to implement

vs.

// – ‘hash_value’ is a function designed to be overloaded for each
// user-defined type which wishes to be used within a hashing context. It

The first paragraph has a hanging indent but the second and third don’t.

// benchmarked at over 8.5 GiB/s for large keys, and <20 cycles/

Trailing punctuation.

#include
#include
#include
#include
#include

Sort.

/// \brief Form a hash code directly from a numerical value.
hash_code(size_t value) : value(value) {}

why not assert that the value is not the null or invalid value? I realize you’d need to have a private constructor for the null and invalid ones then.

// All of the implementation details of actually computing the various hash
// code values are held within this namespace. These routines are included in
// the header file mainly to allow inlining and constant propagation.
namespace hashing { namespace detail {

One namespace per line please, in LLVM.

[I only skimmed the implementation details of the hashing. It all looks plausible enough.]

} } // End hashing detail namespaces.

Again, one per line. You do this multiple times in Hashing.h, please fix them all.

/// reproduce exactly a specific behavior. To that end, we provide a function
/// which will forcible set the seed to a fixed value. This must be done at the

forcible → forcibly

template struct is_hashable_data
: integral_constant<bool, ((is_integral::value ||
is_pointer::value)
&& 64 % sizeof(T) == 0)> {};

Optional: I object to the leading && and propose this reflowed text:

template struct is_hashable_data
: integral_constant<bool,
((is_integral::value || is_pointer::value) &&
64 % sizeof(T) == 0)> {};

.

#if __has_feature(cxx_variadic_templates)

I’m pretty sure non-clang compilers will bark at that unless you #define’d __has_feature(X) to 0?

— a/unittests/ADT/HashingTest.cpp
+++ b/unittests/ADT/HashingTest.cpp
@@ -13,45 +13,295 @@

#include “gtest/gtest.h”
#include “llvm/ADT/Hashing.h”
+#include “llvm/Support/DataTypes.h”
+#include
+#include
+#include
+#include

Sort.

  • // Helper for test code to print hash codes.
  • void PrintTo(const hash_code &code, ::std::ostream *os) {

What’s with the extra leading :: before std::?

  • namespace hashing { namespace detail {
  • template <> struct is_hashable_data : true_type {};
  • } } // End hashing detail namespace.
    +} // End LLVM namespace.

A reminder about 1 namespace-per-line, but also the comments on ending namespaces are odd. They usually look like “} // end namespace llvm” or “} // namespace llvm”, and I prefer the latter.

Nick

+ // Helper for test code to print hash codes.
+ void PrintTo(const hash_code &code, ::std::ostream *os) {

What's with the extra leading :: before std::?

Have you ever tried:

namespace foo {
class std {};
}
using namespace foo;

#include <vector>

Well, I'm not sure that Chandler is guarding against this possibility,
but most library implementations of the standard use ::std::
everywhere to avoid this potential for ambiguity.

Thanks for all the comments!

I think I’ve addressed all of them, as wel as Duncan’s comments from IRC. Based on your OK Nick, I’m planning to commit this tomorrow. If anyone has objections or serious concerns, please let me know to hold off. Updated patch is attached, as well as the latest version of the header.

For the record, the only concern that has come up thus far is one of performance. The explanation there is that while some algorithms are slightly faster (5-10 cycles max), they are significantly lower quality, and don’t currently show up on profiles. I’d like to get the quality up to remove collisions, and then continue working to improve the actual performance.

Clearly, very sensitive and hot routines like the Clang lexer’s StringMap aren’t about to change without careful benchmarking and numbers on those specific components. =] I think StringMap will be the very last bit of hashing to change considering its requirements.

Hashing.h (28.3 KB)

hashing2.diff.gz (13.6 KB)