[libc++] ext/hash function

I would like to get everyone's opinion on how ext/hash_set/map should work.

Background: __gnu_cxx::hash_map, __gnu_cxx::hash_multimap, __gnu_cxx::hash_set, and __gnu_cxx::hash_multiset are extension containers introduced to ease migration from gcc's libstdc++. They are not complete re-implementations of the hash containers. They are simply adaptors around the same std::__hash_table that powers the std::unordered containers. They are *similar to* the __gnu_cxx hash containers.

Clarification: I don't really want to discuss whether these extensions should exist or not. That decision has already been made, and it can not be reversed at this point.

What I would like to discuss is: how similar should they be made to the the __gnu_cxx hash containers? The __gnu_cxx hash containers were purposefully not standardized without change because the committee saw defects in their behavior. Should our "migration adaptors" go to extra effort to replicate such defects in the __gnu_cxx hash containers?

For example, Matt Austern writes in N1456 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1456.html):

Some earlier hash table implementations gave char* special treatment: it specialized the default hash function to look at character array being pointed to, rather than the pointer itself. This proposal removes that special treatment. Special treatment makes it slightly easier to use hash tables for C string, but at the cost of removing uniformity and making it harder to write generic code. Since naive users would generally be expected to use std::basic_string instead of C strings, the cost of special treatment outweighs the benefit.

And indeed, std::hash<char*> today is a direct result of Matt's observation from 8 years ago. The char* is hashed as a pointer, not as a null terminated C string.

The actual __gnu_cxx hash containers however /did/ hash a char* as a null terminated null C string.

Originally the libc++ adaptors which emulate the __gnu_cxx hash containers simply reused std::hash<char*>, and thus hashed the Matt proposed so long ago. However recently we've changed our adaptors to more closely emulate the actual behavior of the __gnu_cxx hash containers. This can be detected (for example) with this program:

#include <ext/hash_set>

int main()
{
    __gnu_cxx::hash_set<char*> s;
    s.insert(nullptr);
}

If we have the adaptors just use std::hash, this program runs ok. If we emulate the old __gnu_cxx::hash function, this program crashes (which is an accurate emulation).

This one also may crash spuriously:

#include <ext/hash_set>

char message[2] = {'a', 'b'};

int main()
{
    __gnu_cxx::hash_set<char*> s;
    s.insert(message);
}

(lack of null termination)

For those customers of ours who use these adaptors, which is more valuable: A little more safety, or a little more accurate emulation?

I should point out that to create a completely accurate emulation will require a completely new implementation. std::__hash_table is a completely different data structure than that underlying the actual __gnu_cxx hash containers. For example:

#include <ext/hash_set>
#include <iostream>

int main()
{
    __gnu_cxx::hash_set<int> s;
    s.insert(3);
    s.insert(7);
    s.insert(1);
    s.insert(5);
    for (__gnu_cxx::hash_set<int>::const_iterator i = s.begin(); i != s.end(); ++i)
        std::cout << *i << '\n';
}

libc++ outputs:

5
1
7
3

libstdc++ outputs:

1
3
5
7

(both libs use the identity for hash<int>, so that isn't the cause of this difference)

Thanks for any discussion/opinions.

Howard

My opinion is that they should emulate the SGI/GCC semantics as closely as possible. These containers are not standard containers and thus their only purpose in life is to provide compatibility with non-standard code. Trying to fix their design problems is highly counterproductive (IMO): code should move off of them, not suffer issues moving to libc++ because we tried to "improve" these already-known-broken containers.

FWIW, my opinion is also that we should provide a slist drop in compatible container (that is just as broken as the GNU one) without trying to fix its issues.

-Chris

This reflects my stance as well. In particular, changing interfaces of these containers that can be reasonably expected to work in a predictable fashion (such as the char* hash implementation, or the fact that the hash implementation lives in the ‘__gnu_cxx’ namespace, and can be explicitly specialized within that namespace) make it easier to migrate to libc++ for large codebases.

That said, some things weren’t reasonably part of the interface of even these legacy containers (such as iteration order details) and it doesn’t seem worth it to emulate them at great expense.

For reference, we’ve now done a trial migration of a reasonably large chunk of our codebase to use libc++ and to see what breaks. We were bitten by both the hash specializations and by the char* hash behavior, but not by iteration order to my knowledge.