Question about hashing AST nodes

Dear Clang Community:

A colleague and I need to use AST nodes as keys in a map. For example, having entered a few variable declaration nodes as keys, linking to associated meta-data, and now looking at a member function application node, with the variables as arguments, we’d like to be able to query the map to find the associated meta-data for those variable objects.

To this end, we need to implement hash and == for AST nodes. That’s what’s needed, e.g., by the C++ unordered map class.

There’s little available information as far as we can tell on this topic. Would you be so kind as to let us know the best way to do this, in your view? Is ODRHash the key to unlocking a solution? Are there other approaches you’d recommend?

Kevin Sullivan
University of Virginia

First, it depends on whether you want this storage to be persistent.
In case you only do these in one pass in memory, the memory address of
the AST node seems a key enough. To quote, "The abstract syntax tree
is not abstract, not syntax-only and not a tree" - for every usage of
a variable, e.g., you'll be able (assuming of course that it is found
in the same TU, etc...) to query the effective VarDecl (the definition
of the variable) the DeclRefExpr (e.g. a variable's usage) points to.

I think ODRHash was invented (apart from many other "node
semi-equivalence" checks) only to allow for diagnostics, not to act as
a true key. One thing that comes to mind is Modules, where I think
ODRHash is used to diagnose certain things, but last time I read about
them it was said that even ODRHash is lacking at parts.

The Clang Static Analyzer has a hash utility for bug reports to
distinguish between bugs that are in the same "file-line-col" triad.
But maybe that's also really tailored for this "bugpath" the SA emits.

For a persistent hash that can go into perhaps a database, in
CodeCompass we came up with our own, basically: using FNVHash on
manglednames and identifiers, locations, etc.

In case you want just for the current compilation / memory image,
you'll get no better answer than pointer identity and what Sema
creates for you in the AST.
If you want to store, anyone can invent their hash function. The big
question is, how will you make sure that the hash persists between
invocations.

For ==, there is the "structural equivalence" cheks which are used by
Sema and ASTImporter / CrossTU.

Kevin Sullivan via cfe-dev <cfe-dev@lists.llvm.org> ezt írta (időpont:
2019. jan. 30., Sze, 8:27):

First of all, why do you want to hash them? Like, are you serializing these hashes to disk or sending them over network? Are these hash values allowed to change between runs as long as equal nodes still get equal hashes within a single run?

Just in case you missed it, if the hashes you're looking for are never going to outlive a single Clang invocation, then you can simply compare AST nodes by pointers. For example, if a function call 'foo(x)' refers to a variable 'x', then it's going to have a DeclRefExpr that points to a VarDecl for 'x', and the pointer to VarDecl that you're going to retrieve from that DeclRefExpr is going to be exactly the same pointer that you'll retrieve by looking at the actual declaration of the variable somewhere above the call. Every AST node is stored exactly once in memory and is never moved to a different place.

Well, there are also forward declarations that are different from the definition (the latter may not even exist in the current translation unit), and sometimes some statements may refer to a forward declaration, but it's easy to obtain a "canonical" declaration in all cases.

If you want something that's more stable across runs (pointers would obviously change every time), you can have a look at the getID() method in various AST nodes, which produces an integer that doesn't change across runs but does change across host machines. These stable IDs are mostly for debugging, but if your work is not supposed to interact with other computers, these may do.

If you want a comparison up to a certain property (eg., you don't want to discriminate between "x + 1" and "1 + x" but you do want to discriminate both of them from "x + y"), then you might want to have a look at the CloneDetection framework that conducts a series of passes to find similar statements in different places of the program. One of the existing passes involves some actual hashing. Also the infrastructure for making these passes is re-usable for making different sets of passes in order to look for different similarities.