ADT/Hashing.h on 32-bit platforms

Hi,

Currently the hashing implementation in ADT/Hashing.h produces hash values on 32-bit platforms that differ from the lower 32 bits of the hash values produced on 64-bit platforms. The only apparent reason for this difference is that the uint64_t integer seed is truncated to size_t. Since the implementation is somewhat inconsistent about using uint64_t versus size_t for seed values, I'm wondering: was this difference deliberately introduced as a performance optimization for 32-bit code, or is it just an implementation artifact?
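For illustration, here is a minimal simulation of the effect (not the actual Hashing.h code; the seed constant is just an example, and uint32_t stands in for a 32-bit size_t so this runs anywhere):

```cpp
#include <cstdint>

// Simulates what a 32-bit platform does to the 64-bit seed: size_t is
// 32 bits there, so converting the uint64_t seed keeps only the low
// half before any mixing even begins.
static uint32_t truncate_seed(uint64_t seed) {
  return static_cast<uint32_t>(seed);
}

// With an example seed, the two platforms already start from different
// values, so every hash derived from them diverges:
//   truncate_seed(0xff51afd7ed558ccd) == 0xed558ccd
```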

I also noted that the implementation relies on implicit conversions from uint64_t to size_t in several places where hash_code values are returned, which may trigger compiler warnings on 32-bit platforms.

- Stephan

Almost all of this was largely incidental on my part. I didn't spend a lot
of time tuning the 32-bit platform performance. If you have ideas that you
think would make things better, I'm certainly interested. =D

One thing I would note is that I would prefer to retain the collision
resistance where possible. So I would be leery of reducing the internal
state used. I would more focus on making the outputs friendly for 32-bit
platforms and the inputs consistent.

I've attached a patch that makes the hashing implementation consistently use a 64-bit seed everywhere. With this change the implementation should produce the same hash codes, modulo SIZE_MAX + 1, for identical values independent of the platform. (Though hash_combine may still produce different results if the size of an argument type differs between platforms.) I suspect the negative performance impact on 32-bit platforms is small, but I didn't do any benchmarking. With atomics one could probably replace the thread-safe local static in get_execution_seed with something that has a little less overhead.
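To sketch what I mean by the atomics idea (a hypothetical variant, not what the patch does; the constant is just a stand-in for the current fixed seed):

```cpp
#include <atomic>
#include <cstdint>

// Racing threads may both compute the seed, but they publish the same
// value, so readers always observe a consistent non-zero seed without
// paying for the compiler's guarded-local-static machinery.
static std::atomic<uint64_t> g_seed{0};

uint64_t get_execution_seed() {
  uint64_t s = g_seed.load(std::memory_order_acquire);
  if (s == 0) {
    // Stand-in for the fixed seed currently used; a real per-execution
    // seed would be computed here instead.
    s = 0xff51afd7ed558ccdULL;
    g_seed.store(s, std::memory_order_release);
  }
  return s;
}
```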

The patch also removes a FIXME from the set_fixed_execution_seed implementation and rewords the documentation string a bit, since using this function in a multi-threaded program requires some kind of external synchronization anyway.

Another attached patch implements llvm::sys::Process::GetRandomNumber() on Windows. (There's already a Unix implementation.) This function could be useful for implementing the missing random per-execution seed functionality. (Though it might take a little care to deal with Process::GetRandomNumberSeed potentially calling back into hash_combine on Unix.)

In a little experiment I made, changing the seed per execution seemed to lead to test suite errors for clang's PCH support and other components.

- Stephan

0001-Use-64-bit-hash-seed-on-32-bit-platforms.patch (13.8 KB)

0002-Implement-llvm-sys-Process-GetRandomNumber-on-Window.patch (2.73 KB)

Very cool. A minor comment:

   /// \brief Form a hash code directly from a numerical value.
- hash_code(size_t value) : value(value) {}
+ ///
+ /// The argument is casted to and stored as a size_t value.
+ hash_code(uint64_t value) : value(static_cast<size_t>(value)) {}

I think it would be better to store the 64-bit value internally and reduce
to size_t when asked. That way we could (perhaps in the future) expose a
higher fidelity uint64_t extraction mechanism.

My idea is basically that the hashing code should be 64-bit input always,
and 64-bit output always, but provide easy methods for consumers which
*need* to only get size_t output to truncate safely and consistently.

Does that make sense?
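Something along these lines, perhaps (an illustrative sketch only; the method names are made up, not LLVM's actual interface):

```cpp
#include <cstddef>
#include <cstdint>

// Keep the full 64-bit value internally and truncate only when a
// consumer explicitly asks for a size_t.
class hash_code {
  uint64_t value;

public:
  explicit hash_code(uint64_t v) : value(v) {}

  // The one sanctioned way to reduce to size_t: identical on every
  // platform modulo SIZE_MAX + 1.
  size_t as_size_t() const { return static_cast<size_t>(value); }

  // Higher-fidelity extraction stays available for future use.
  uint64_t as_uint64() const { return value; }
};
```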

A couple of minor comments about the details of the patch:

- hash_state state = state.create(s_begin, seed);
+ hash_state state = hash_state::create(s_begin, seed);

Why this change? The existing code is correct already I think?

- return ::llvm::hashing::detail::hash_integer_value(value);
+ // the cast is necessary for strong enums and avoids compiler warnings
+ const uint64_t n = static_cast<uint64_t>(value);
+ return ::llvm::hashing::detail::hash_integer_value(n);

Generally LLVM comments should be formed as prose (starting with a capitalized word and ending with appropriate punctuation). I don't think you need a comment here though, or a separate variable. I would just explicitly cast the value in the argument.

Another attached patch implements llvm::sys::Process::GetRandomNumber()
on Windows. (There's already a Unix implementation.) This function could be
useful for implementing the missing random per-execution seed
functionality. (Though it might take a little care to deal with
Process::GetRandomNumberSeed potentially calling back into hash_combine on
Unix.)

I would start another code review thread for this. Totally different set of
folks should be looking at it I suspect.

In a little experiment I made, changing the seed per execution seemed to
lead to test suite errors for clang's PCH support and other components.

Yep. This is both why we want the functionality (to catch such bugs) and
why it isn't enabled (haven't yet fixed the bugs).

Very cool. A minor comment:

Thanks!

    /// \brief Form a hash code directly from a numerical value.
- hash_code(size_t value) : value(value) {}
+ ///
+ /// The argument is casted to and stored as a size_t value.
+ hash_code(uint64_t value) : value(static_cast<size_t>(value)) {}

I think it would be better to store the 64-bit value internally and
reduce to size_t when asked. That way we could (perhaps in the future)
expose a higher fidelity uint64_t extraction mechanism.

My idea is basically that the hashing code should be 64-bit input
always, and 64-bit output always, but provide easy methods for consumers
which *need* to only get size_t output to truncate safely and consistently.

Does that make sense?

Making hash_code 64-bit would definitely be conceptually cleaner.
But I'd be a bit worried that users might directly store `hash_code` values in their data structures, e.g. in hash table buckets, which probably would be a pessimization on 32-bit systems for most use cases. Do you have a good application in mind for an unstable non-cryptographic 64-bit hash code on a 32-bit system?

- Stephan

A couple of minor comments about the details of the patch:

- hash_state state = state.create(s_begin, seed);
+ hash_state state = hash_state::create(s_begin, seed);

Why this change? The existing code is correct already I think?

Is this a common idiom? At first sight this looked like a "self-initialization" bug to me, so I thought it might make the code more readable to clarify that create is a static member function of hash_state. But I probably should have put such a code-style "cleanup" into a separate patch.
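For reference, a toy illustration of the idiom in question (the struct here is not the real hash_state):

```cpp
// A static member function may be called either through the class name
// or through an object expression.
struct hash_state {
  int v;
  static hash_state create(int x) { return hash_state{x}; }
};

// Both calls below resolve to the same static function; the object in
// the second form only participates in name lookup.
inline int demo() {
  hash_state a = hash_state::create(1);
  hash_state b = a.create(2); // legal, but easy to misread as an instance call
  return a.v * 10 + b.v;
}
```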

- return ::llvm::hashing::detail::hash_integer_value(value);
+ // the cast is necessary for strong enums and avoids compiler warnings

Generally LLVM comments should be formed as prose (starting with a
capitalized word and ending with appropriate punctuation). I don't think
you need a comment here though, or a separate variable. I would just
explicitly cast the value in the argument.

+ const uint64_t n = static_cast<uint64_t>(value);
+ return ::llvm::hashing::detail::hash_integer_value(n);

Thanks. Since there previously was no explicit cast, I thought it might be useful to explain why I inserted one. I put in the separate variable so that I'd have a good place for the comment without exceeding the column limit :)

Let me know if you'd like me to update my patch. I'm happy to make any changes, but since I can't commit it myself, I'm not sure what would be easier for you.

Yeah, I just opened a new thread on the commits list. Thanks.

- Stephan