[RFC] cHash: Integration of AST hashing

Hello!

First, I'd like to give you a little bit of context before I get into
the details of what all of this has to do with the cfe-dev list.

TL;DR: cHash calculates a hash over the AST and, therefore, allows
       the detection of redundant builds.

In the last months, we have been working on a mechanism to avoid
redundant recompilations. In C/C++ projects, we observe that a change
to a central header file can cause a lot of compiler invocations.
However, not all compiler runs will lead to an object file that
differs from the last invocation. For example, if I add a typedef
that is never used in any .c file, none of the object files will
change.

Our approach to this problem is to hook into the compiler front
end, after parsing and semantic analysis, and recursively
calculate an AST hash. We start with all definitions (variables and
functions) and include all relevant AST-node properties and the hashes
of all children and referenced types. Thereby, we get a hash value
that changes if a changed translation unit would result in a different
object file. As a byproduct, we get a hash for every function, every
referenced type, and all global variables.
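
As a rough sketch (with hypothetical data structures, not the actual
plugin code), the recursion looks like this:

  #include <cstdint>
  #include <vector>

  // Hypothetical stand-in for an AST node; the real implementation
  // walks the clang AST instead.
  struct Node {
    uint32_t Kind;                        // node class (function, variable, ...)
    std::vector<uint64_t> Properties;     // relevant node-local fields
    std::vector<Node *> Children;         // syntactic children
    std::vector<Node *> ReferencedTypes;  // referenced type definitions
  };

  static uint64_t combine(uint64_t H, uint64_t V) {
    return (H ^ V) * 0x100000001b3ULL;    // simple FNV-style mixing step
  }

  // Semantic hash: node kind and properties, plus the hashes of all
  // children and referenced types. (Recursive type definitions would
  // need a visited set to terminate; omitted here for brevity.)
  uint64_t semanticHash(const Node &N) {
    uint64_t H = N.Kind;
    for (uint64_t P : N.Properties)
      H = combine(H, P);
    for (const Node *C : N.Children)
      H = combine(H, semanticHash(*C));
    for (const Node *T : N.ReferencedTypes)
      H = combine(H, semanticHash(*T));
    return H;
  }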

So far, we have only implemented cHash for C, not yet for C++. We have
already published our current version of the clang plugin[1] and are
also working on a GCC version.

Furthermore, we presented our approach at USENIX ATC (and got a best
paper award, yay!)[2]. Now, we consider proposing some parts of cHash
for mainline.

Our impression is that the AST hashing mechanism[3], which calculates
a semantic fingerprint for a given AST node, is not only useful for
detecting redundant builds. AST hashes are computationally cheap to
calculate, easy to store, and they capture the full semantics of an AST
node. These properties should ease other techniques:

- partial recompilation: we now know at an early stage that a function
  body has changed, directly or indirectly.
- change impact analysis: we can now quantify which parts of the
  program are actually touched, even if we only modify a type.
- project-wide analysis: do all types with the same name also have the
  same hash, or is there an inconsistency? (See the sketch right below.)
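
For the last point, here is a hypothetical sketch of such a consistency
check, assuming we have already collected (type name, hash) pairs from
all translation units:

  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <set>
  #include <string>
  #include <utility>
  #include <vector>

  // Flag type names that map to more than one hash across the project.
  void reportInconsistentTypes(
      const std::vector<std::pair<std::string, uint64_t>> &TypeHashes) {
    std::map<std::string, std::set<uint64_t>> Seen;
    for (const auto &Entry : TypeHashes)
      Seen[Entry.first].insert(Entry.second);
    for (const auto &Entry : Seen)
      if (Entry.second.size() > 1)
        std::printf("inconsistent definitions of type '%s'\n",
                    Entry.first.c_str());
  }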

Besides the AST hashing mechanism, cHash also includes the caching and
compiler-abort logic. However, in my opinion this is better located in
an external tool, or a plugin. For example, a closer integration of
Clang with ccache could give the compilation speedups that we have
observed in our prototype.

Before I start preparing parts of the cHash plugin to go upstream
(there is definitely some cleanup required), I would love to hear your
opinions on the following topics:

- Would you like to see the AST hashing in clang mainline?
- Would you like to have a standalone tool that calculates AST hashes?
- Should the detection of redundant compilations be part of Clang mainline?
- Should we integrate the test-suite[4] into the repo? If yes, how?

chris

[1] GitHub - luhsra/chash: CLang Plugin for calculating AST hashes
[2] cHash: Detection of Redundant Compilations via AST Hashing | USENIX
[3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
[4] https://github.com/luhsra/chash/tree/master/test
    chash/IfStmt_10.c at master · luhsra/chash · GitHub

Hey,

This looks like an awesome project! It would be great to see something like this in Clang itself.
What impact does the hashing itself have on the build time? I.e. If I were to fully rebuild a project from scratch, how much longer would it take?

Alex

> - partial recompilation: we now know at an early stage that a function
>   body has changed, directly or indirectly.

Do you currently hash the full AST at once, or do you also compute/store hashes for individual functions?

Alex Lorenz via cfe-dev <cfe-dev@lists.llvm.org> writes:

Hey,

What impact does the hashing itself have on the build time? I.e. If I
were to fully rebuild a project from scratch, how much longer would it
take?

We have built over 2300 commits of the musl libc library. In total,
that is 5.6 million compiler invocations (all commits are built from
scratch). On average, the AST hashing took 9ms. In comparison, the
parsing takes 187ms and the -O2 optimization and assembling 1082ms. So
the hashing time alone is almost negligible. Furthermore, the hashing
time does not depend on all syntactic elements in the AST, but only on
the parts that are actually used. So even if you have thousands of type
declarations, a file with a single 'void foo() {}' takes almost no time
to hash.

We use a non-cryptographic hash (MurmurHash2) to speed things up a
little bit. And since we do not have to defend against evil attackers,
we believe a non-cryptographic hash is enough. However, for our current
implementation, we still have a few optimizations that could cut down
the hashing time a little further[1].

chris

[1] We currently hash every type annotation for every AST node. However,
    only the type fields for variable declarations and function
    signatures are relevant, since all other type fields are derived
    from them in a deterministic fashion.

I see, thanks. That sounds promising.

Is the hash stored in a central database or do you use a separate file for each translation unit (or maybe something else)?

Alex Lorenz <aleksei_lorenz@apple.com> writes:

I see, thanks. That sounds promising.

Is the hash stored in a central database or do you use a separate file
for each translation unit (or maybe something else)?

In the current implementation, we store the hash in a separate file
alongside the object file. And since CMake removes the .o file before
the compiler is invoked, we also hardlink the object file to another
name.
So, for a single object file we then have:

foo.o
foo.o.hash.copy <- Hardlink to foo.o
foo.o.hash <- Hexdigest of the hash

For the redundant-compilation avoidance, we could also store the
hexdigest in a separate ELF (or Mach-O) section and parse the header
partially in the cHash pass.

chris

Christian Dietrich via cfe-dev <cfe-dev@lists.llvm.org> writes:

In C/C++ projects, we observe that a change to a central header file can
cause a lot of compiler invocations. However, not all compiler runs will
lead to an object file that differs from the last invocation. For example,
if I add a typedef that is never used in any .c file, none of the object
files will change.

We are doing something similar in build2[1] though at the token stream
level (after preprocessing). One thing that we found to limit the
applicability of this approach is debug info: if you add a typedef
(presumably as a new line), then function definitions (if any) after
this line and in the same translation unit will have different debug
information (line position in the source code).

In build2, when calculating the hash, we include the position information
for each token. Do you do something similar at the AST level? If so, did
you find anything to mitigate such "line shift" changes?

[1] https://build2.org

Boris

That sounds like a really nice tool!

One problem I see is that we get yet another AST hashing
implementation in clang. I'm aware of at least four of them that I
personally work with:

* The normal Stmt::Profile for Stmts
* The ODR hashing for AST nodes in modules.
* The ASTDiff also has one
* The CloneDetector with its StmtDataCollector which can be
specialized for hashing

They all contain similar code and have similar goals. And they all
contain different kinds of bugs. And that's just the implementations
that I'm aware of.

I would very much prefer if we don't add more redundant code here but
instead try to figure out if we can get all this code in a more
central place.

Most of the use cases are just about

I have this node, please forward all interesting data into this stream so I can [store|hash|...] it and I [do|don't] care about having cross-TU stable data and maybe I want to treat special nodes in a different way

so it's not necessarily a difficult problem to solve. If we find a
common baseline for this, we could make one visitor on top of which
people can specialize their specific hashing needs.

I already started doing this for the Stmt hashing when I refactored
the StmtDataCollector. I set a common baseline so that I could turn my
two hashing implementations back then into a single one, and I'm aiming
to also replace the Stmt::Profile implementations with it.

Not sure if anyone started working on something similar for generic
AST nodes (which is probably a bit more complicated than the Stmts
which are relatively simple), but maybe we should see if we can get
some people together to work on that. Otherwise we will have a dozen
hashing implementations in a few years that no one can/will maintain.

- Raphael

Boris Kolpackov <boris@codesynthesis.com> writes:

We are doing something similar in build2[1] though at the token stream
level (after preprocessing). One thing that we found to limit the
applicability of this approach is debug info: if you add a typedef
(presumably as a new line), then function definitions (if any) after
this line and in the same translation unit will have different debug
information (line position in the source code).

The benefit of doing it on the token stream is that you can avoid the
expensive parsing. But I wonder how hashing the token stream can be any
better than doing a textual hash on the preprocessed code before lexing
(as ccache is doing it)?

In the current prototype, we do not include debugging information.
Therefore, the hash does not change if you introduce a newline into the
source. In the future, line numbers/filenames should be included if
debugging information is requested.

In build2, when calculating the hash, we include the position information
for each token. Do you do something similar at the AST level? If so, did
you find anything to mitigate such "line shift" changes?

However, the AST hash should result in fewer recompilations than
hashing the token stream. With the token stream you have to include all
the tokens and, therefore, all the line information/line shifts. The AST
hash includes only the referenced types, enums, and other declarations.
So, if there is a line shift in a header, but none of the declarations
that come after that shift are used in this translation unit, the hash
will not change. However, if any of the declarations after the shift is
referenced, the debug information will change and we cannot call it a
redundant recompilation anymore.
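
To illustrate with a made-up example (the names are hypothetical):

  /* header.h */
  int used_function(int x);
  /* a newly inserted line shifts everything below by one line */
  int unused_function(int x);   /* shifted, but never referenced below */

  /* user.c */
  #include "header.h"
  int f(void) { return used_function(1); }

Since user.c never references unused_function, its AST hash does not
change, while a token-stream hash that includes positions would.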

chris

Raphael Isemann <teemperor@gmail.com> writes:

That sounds like a really nice tool!

I also really like the general principle of avoiding work that is
redundant in the end. Let's save some polar bears! And some developers
from serious injuries stemming from sword fights.

One problem I see is that we get yet another AST hashing
implementation in clang. I'm aware of at least four of them that I
personally work with:

* The normal Stmt::Profile for Stmts

This does indeed look structurally similar to our implementation for
statements. However, types are only included by means of their pointers
and, therefore, the hash is not stable across compilations. I wonder for
which applications this is actually sufficient. And does it really
speed things up that much?

* The ODR hashing for AST nodes in modules.

This is also based on StmtProfilerWithoutPointers as far as I can
tell[1]. And it is stable across invocations as it actually handles
types. However, changing the order of declarations[1] will alter the
hash. Also, one thing we did in cHash is to use pseudorandom identifiers
for different AST node types[3] to avoid some of the possible
collisions. If you only add small integer values (getKind(), etc.), it
is much easier to produce a collision. Therefore, we add something like

<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

to the hashing stream most of the time. Ideally, the text that goes into
the hash should be usable to reconstruct an AST with the same
structure, though not necessarily with the same information.
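
As a small, made-up sketch of what I mean (the constants and the helper
type are hypothetical, not the values we actually use):

  #include <cstdint>
  #include <vector>

  struct HashStream {                        // hypothetical helper
    std::vector<uint64_t> Data;
    void addData(uint64_t V) { Data.push_back(V); }
  };

  // Fixed pseudorandom per-class tags instead of small enum values.
  constexpr uint64_t TAG_FUNCTION_DECL = 0x9e3779b97f4a7c15ULL;
  constexpr uint64_t TAG_VAR_DECL      = 0xc2b2ae3d27d4eb4fULL;

  void addFunctionDecl(HashStream &S, uint64_t ReturnTypeHash,
                       uint64_t NameHash, uint64_t BodyHash) {
    S.addData(TAG_FUNCTION_DECL);  // <UNIQUE-NODE-CLASS-ID>
    S.addData(ReturnTypeHash);     // <FIELD1>
    S.addData(NameHash);           // <FIELD2>
    S.addData(BodyHash);           // <FIELD3>
  }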

* The ASTDiff also has one

Ok, then I was unable to find it. Can you give me a pointer?

* The CloneDetector with its StmtDataCollector which can be
specialized for hashing

They all contain similar code and have similar goals. And they all
contain different kinds of bugs. And that's just the implementations
that I'm aware of.

So, since they all contain some kind of quirks, we really should build
a good test harness around a unified implementation, because nobody
wants to silently break the "give me a unique identifier" functionality.

I would very much prefer if we don't add more redundant code here but
instead try to figure out if we can get all this code in a more
central place.

I like that idea and I do not necessarily stick to our implementation.

Most of the use cases are just about

I have this node, please forward all interesting data into this
stream so I can [store|hash|...] it and I [do|don't] care about
having cross-TU stable data and maybe I want to treat special nodes
in a different way

so it's not necessarily a difficult problem to solve. If we find a
common baseline for this, we could make one visitor on top of which
people can specialize their specific hashing needs.

You are right, it is not a difficult problem in the end. But when doing
it for the whole AST, there are some weird corner cases that should be
handled correctly (recursive type definitions come to mind).

I already started doing this for the Stmt hashing when I refactored
the StmtDataCollector. I set a common baseline so that I could turn my
two hashing implementations back then into a single one, and I'm aiming
to also replace the Stmt::Profile implementations with it.

Not sure if anyone started working on something similar for generic
AST nodes (which is probably a bit more complicated than the Stmts
which are relatively simple), but maybe we should see if we can get
some people together to work on that. Otherwise we will have a dozen
hashing implementations in a few years that no one can/will maintain.

Bringing cHash to mainline is based on the wish that I don't want to
maintain an AST hashing mechanism in parallel to the clang development.
Therefore, I would love to help with that. We should collect some
requirements that the different users of unique identifiers have. And
perhaps we can make them all happy by introducing a few switches to a
unified visitor.

Some switches and requirements come directly to my mind:

- Syntactical vs. Semantic: There should be a switch to include only
  nodes that are actually syntactical children of the given node. Not
  everybody wants to include the transitive hull over all referenced AST
  nodes.

- Multiple Hashes in one pass: With one pass over the AST I want to have
  not only one top-level hash, but also hashes of some interesting nodes
  (top-level definitions, type declarations, global variables). cHash
  already caches hashes for some nodes in a map to speed things up.

- Debug information: I want to turn on/off the inclusion of debugging
  information (line numbers, file names, local identifier names).

- Stable?: Should the hash be stable over compiler invocations? Or can
  we take some shortcuts for types/declarations/things that are not
  syntactically enclosed in the node?

chris

Christian Dietrich <dietrich@sra.uni-hannover.de> writes:

The benefit of doing it on the token stream is that you can avoid the
expensive parsing.

And it works with any compiler that supports the -E option.

But I wonder how hashing the token stream can be any better than doing
a textual hash on the preprocessed code before lexing (as ccache is
doing it)?

Probably not by much (it can ignore some redundant #line directives, etc.).
The reason we do it this way in build2 is to be ready for the day when
we no longer need preprocessing, at least for some translation units
(with C++ modules being the first step in that direction). Such
translation units will still contain comments, line continuations,
etc., which we would want to ignore.

In the current prototype, we do not include debugging information.
Therefore, the hash does not change if you introduce a newline into the
source. In the future, line numbers/filenames should be included if
debugging information is requested.

In our experience, this feature is most useful during development
when debug info is almost always required.

Boris

Raphael Isemann via cfe-dev <cfe-dev@lists.llvm.org> writes:

That sounds like a really nice tool!

One problem I see is that we get yet another AST hashing
implementation in clang. I'm aware of at least four of them that I
personally work with:

* The normal Stmt::Profile for Stmts
* The ODR hashing for AST nodes in modules.
* The ASTDiff also has one
* The CloneDetector with its StmtDataCollector which can be
specialized for hashing

They all contain similar code and have similar goals. And they all
contain different kinds of bugs. And that's just the implementations
that I'm aware of.

I have done some work on making the StmtDataCollector approach
extensible [1] (it was not possible to subclass it due to the
curiously-recurring-template-pattern-style inheritance of ConstStmtVisitor).

I think the DataCollector approach is the best one. Note that it only
collects data from a single node, so it makes sense to combine it with a
RecursiveASTVisitor in order to hash entire subtrees.

It should be possible to rewrite ODR hashing and the others with this!

(Currently the hashing for IntegerLiteral is not stable because it uses
hash_value for APInt, but I imagine this can be fixed easily.)
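
For instance, one way to make that stable (a sketch, untested) might be
to feed the bit width and the raw words instead of hash_value():

  #include "llvm/ADT/APInt.h"
  #include <cstdint>

  // Feed a stable representation of an APInt into an addData-style sink.
  template <typename AddDataFn>
  void addStableAPInt(const llvm::APInt &Value, AddDataFn addData) {
    addData(static_cast<uint64_t>(Value.getBitWidth()));
    for (unsigned I = 0, E = Value.getNumWords(); I != E; ++I)
      addData(Value.getRawData()[I]);
  }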

I would very much prefer if we don't add more redundant code here but
instead try to figure out if we can get all this code in a more
central place.

Most of the use cases are just about

I have this node, please forward all interesting data into this stream so I can [store|hash|...] it and I [do|don't] care about having cross-TU stable data and maybe I want to treat special nodes in a different way

so it's not necessarily a difficult problem to solve. If we find a
common baseline for this, we could make one visitor on top of which
people can specialize their specific hashing needs.

I already started doing this for the Stmt hashing when I refactored
the StmtDataCollector. I set a common baseline so that I could turn my
two hashing implementations back then into a single one, and I'm aiming
to also replace the Stmt::Profile implementations with it.

Not sure if anyone started working on something similar for generic
AST nodes

I am also working on a DeclDataCollector. I still have to come up with
a good way to test that two nodes indeed have a different hash value.

(which is probably a bit more complicated than the Stmts
which are relatively simple), but maybe we should see if we can get
some people together to work on that. Otherwise we will have a dozen
hashing implementations in a few years that no one can/will maintain.

- Raphael

Johannes

[1] ⚙ D36664 [analyzer] Make StmtDataCollector customizable

Christian Dietrich via cfe-dev <cfe-dev@lists.llvm.org> writes:

Raphael Isemann <teemperor@gmail.com> writes:

That sounds like a really nice tool!

I also really like the general principle of avoiding work that is
redundant in the end. Let's save some polar bears! And some developers
from serious injuries stemming from sword fights.

One problem I see is that we get yet another AST hashing
implementation in clang. I'm aware of at least four of them that I
personally work with:

* The normal Stmt::Profile for Stmts

This does indeed look structurally similar to our implementation for
statements. However, types are only included by means of their pointers
and, therefore, the hash is not stable across compilations. I wonder for
which applications this is actually sufficient. And does it really
speed things up that much?

* The ODR hashing for AST nodes in modules.

This is also based on StmtProfilerWithoutPointers as far as I can
tell[1]. And it is stable across invocations as it actually handles
types. However, changing the order of declarations[1] will alter the
hash. Also, one thing we did in cHash is to use pseudorandom identifiers
for different AST node types[3] to avoid some of the possible
collisions. If you only add small integer values (getKind(), etc.), it
is much easier to produce a collision. Therefore, we add something like

<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

We should probably do the same, maybe also for other enum members.

In addition, it may be necessary to add some padding or a separator
when hashing multiple nodes, as they have different numbers of
fields; I am not an expert on this, though.

to the hashing stream most of the time. Ideally, the text that goes into
the hash should be usable to reconstruct an AST with the same
structure, though not necessarily with the same information.

* The ASTDiff also has one

Ok, then I was unable to find it. Can you give me a pointer?

* The CloneDetector with its StmtDataCollector which can be
specialized for hashing

They all contain similar code and have similar goals. And they all
contain different kinds of bugs. And that's just the implementations
that I'm aware of.

So, since they all contain some kind of quirks, we really should build
a good test harness around a unified implementation, because nobody
wants to silently break the "give me a unique identifier" functionality.

I would very much prefer if we don't add more redundant code here but
instead try to figure out if we can get all this code in a more
central place.

I like that idea and I do not necessarily stick to our implementation.

Most of the use cases are just about

I have this node, please forward all interesting data into this
stream so I can [store|hash|...] it and I [do|don't] care about
having cross-TU stable data and maybe I want to treat special nodes
in a different way

so it's not necessarily a difficult problem to solve. If we find a
common baseline for this, we could make one visitor on top of which
people can specialize their specific hashing needs.

You are right, it is not a difficult problem in the end. But when doing
it for the whole AST, there are some weird corner cases that should be
handled correctly (recursive type definitions come to mind).

I already started doing this for the Stmt hashing when I refactored
the StmtDataCollector. I set a common baseline so that I could turn my
two hashing implementations back then into a single one, and I'm aiming
to also replace the Stmt::Profile implementations with it.

Not sure if anyone started working on something similar for generic
AST nodes (which is probably a bit more complicated than the Stmts
which are relatively simple), but maybe we should see if we can get
some people together to work on that. Otherwise we will have a dozen
hashing implementations in a few years that no one can/will maintain.

Bringing cHash to mainline is based on the wish that I don't want to
maintain an AST hashing mechanism in parallel to the clang development.
Therefore, I would love to help with that. We should collect some
requirements that the different users of unique identifiers have. And
perhaps we can make them all happy by introducing a few switches to a
unified visitor.

Yes, it would be good to identify the different kinds of uniqueness that
people need.

Some switches and requirements come directly to my mind:

- Syntactical vs. Semantic: There should be a switch to include only
  nodes that are actually syntactical children of the given node. Not
  everybody wants to include the transitive hull over all referenced AST
  nodes.

Sounds good, so the syntactical one discards any references to nodes
save for its children. I can use this in ASTDiff for checking whether
two matched nodes differ.

- Multiple Hashes in one pass: With one pass over the AST I want to have
  not only one top-level hash, but also hashes of some interesting nodes
  (top-level definitions, type declarations, global variables). cHash
  already caches hashes for some nodes in a map to speed things up.

- Debug information: I want to turn on/off the inclusion of debugging
  information (line numbers, file names, local identifier names).

These two should be easy to do; I think we definitely want a system that
can have user-provided functions for each node class.

- Stable?: Should the hash be stable over compiler invocations? Or can
  we take some shortcuts for types/declarations/things that are not
  syntactically enclosed in the node?

Hopefully it is not too difficult to make it stable.

If there is a node that is referred to by its name, then it suffices to hash the
qualified name, provided that the referenced node is also included in
the hash.

So there are references to other nodes by name.
All other references between nodes I can think of are based on the tree
structure (correct me if I am wrong), so we should be fine if we hash
subtrees.

Johannes

First of all, I'm sorry for the long delay in my answer. There was
vacation and an awful lot of traveling in August and September.

Johannes Altmanninger <aclopte@gmail.com> writes:

<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

We should probably do the same, maybe also for other enum members.

In addition, it may be necessary to add some padding or a separator
when hashing multiple nodes, as they have different numbers of
fields; I am not an expert on this, though.

I would see two basic cases:

1. The number of fields is constant -> the count is derivable from
   the UNIQUE-NODE-CLASS-ID.
2. Variable number of fields (or children) -> just hash in the count as
   a 4-byte integer.

- Stable?: Should the hash be stable over compiler invocations? Or can
  we take some shortcuts for types/declarations/things that are not
  syntactically enclosed in the node?

Hopefully it is not too difficult to make it stable.

If there is a node that is referred to by its name, then it suffices to hash the
qualified name, provided that the referenced node is also included in
the hash.

I see one problem here, at least if we want to make a semantic hash,
and I believe that we have to include not only the qualified name, but
also the hash behind that name.

I think a problem arises if we make one pass over the whole tree
(starting from a TranslationUnit) and record not only the top-level
hash, but also every function-level hash. Let's assume the following two
scenarios, in which hashing only the type name makes both function
hashes of foo come out equal:

: struct bar { int x; };
:
: void foo(struct bar y) {};

hash(translation-unit) = hash(foo) + hash(struct bar)
  where
      hash(foo) = hash(void) + hash(foo)
                  + hash("struct bar") + hash(y) + hash({})
      hash(struct bar) = ....

: struct bar { long x; };
:
: void foo(struct bar y) {};

hash(translation-unit) = hash(foo) + hash(struct bar)
  where
      hash(foo) = hash(void) + hash(foo)
                  + hash("struct bar") + hash(y) + hash({})
      hash(struct bar) = ....

In this case, the top-level hash changes, as struct bar changes, but the
hash for 'foo' does not change. This is OK for a syntactic hash, but not
for a semantic hash.

So there are references to other nodes by name.
All other references between nodes I can think of are based on the tree
structure (correct me if I am wrong), so we should be fine if we hash
subtrees.

It depends. You can have an expression that references a type without
explicitly naming it:

: struct bar { int x; };
: struct foo { struct bar * BAR; };
:
: int foo() {
:   struct foo FOO;
:   FOO.BAR->x = 23;
:   return FOO.BAR->x;
: }
:
: int barfoo() {
:   struct foo *FOO = malloc(23);
: }

Here, we see that:

  a) foo depends on a type it does not name explicitly, but only
     implicitly, by using struct foo.
  b) The binary of barfoo does NOT change if 'struct bar' changes.

Therefore, I believe it is necessary to include all type references with
their hash for the semantic hash.

As I can see, you have done a lot of work on the StmtDataCollectors and
used TableGen to generate the DataCollector. So my idea would be the
following:

- Extend the StmtDataCollectors.td to handle all possible nodes/types in
  the clang AST. Invoking this collector once, we get the hash value for
  one node.

- Derive a recursive AST hashing visitor from RecursiveASTVisitor. This
  recursive AST visitor then has our desired flags:

  - semantic_hash vs syntactic_hash
  - Hash mechanism as a template parameter
  - Include debug flags

One problem I have with RecursiveASTVisitor is that it visits either in
preorder or in postorder. However, to calculate a hash for, let's say, a
type, I have to:

    PreVisitType: H = new Hash
   VisitFuncType: H.addData(FUNC_TYPE)
                  hash(T.getReturnType())
                     -> Recursion
   PostVisitType: Store Hash for function

And I have not seen how I could do this with RecursiveASTVisitor.
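
One possibility might be to wrap the Traverse* entry points instead of
relying on the Visit* hooks; a rough, untested sketch (the visitor name
and the beginHash/finishHash helpers are made up):

  #include "clang/AST/RecursiveASTVisitor.h"

  class HashSketchVisitor
      : public clang::RecursiveASTVisitor<HashSketchVisitor> {
  public:
    bool TraverseDecl(clang::Decl *D) {
      if (!D)
        return true;
      beginHash(D);   // "PreVisit": push a fresh hash context
      bool Continue =
          clang::RecursiveASTVisitor<HashSketchVisitor>::TraverseDecl(D);
      finishHash(D);  // "PostVisit": store the finished hash for D
      return Continue;
    }
    // The same wrapping would be needed for TraverseStmt/TraverseType.

  private:
    void beginHash(clang::Decl *) { /* push a fresh hash context */ }
    void finishHash(clang::Decl *) { /* store the hash for the Decl */ }
  };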

chris

Hi,

so I made some progress in integrating the CHash AST hashing mechanism
into Clang: ⚙ D40731 Integrate CHash into CLang
I cleaned up the mess in our repository and based the implementation on
StmtDataCollectors.

Christian Dietrich <dietrich@sra.uni-hannover.de> writes:

As I can see, you have done a lot of work on the StmtDataCollectors and
used TableGen to generate the DataCollector. So my idea would be the
following:

- Extend the StmtDataCollectors.td to handle all possible nodes/types in
  the clang AST. Invoking this collector once, we get the hash value for
  one node.

Currently the change D40731 is still incomplete, but some important
parts have already been implemented:

- I have extended the StmtDataCollectors.td file (and added tablegen
  files for Decl, Attr, Type).

- Provide a CHashVisitor, based on RecursiveASTVisitor, that captures
  our idea of an AST hash (semantic, stable across invocations).

- Make it all work with the cHash test suite.

- Start to add a (condensed) unit-test suite to validate CHashVisitor.
  Integrating the useful parts of our internal test suite here is still
  to be done.

- Derive a recursive AST hashing visitor from RecursiveASTVisitor. This
  recursive AST visitor then has our desired flags:

  - semantic_hash vs syntactic_hash
  - Hash mechanism as a template parameter
  - Include debug flags

Including these flags for the CHashVisitor is still to be done.

At this point I would appreciate some comments on the current state of
the implementation. My next steps would be to integrate larger parts of
our test suite.

chris

Raphael Isemann <teemperor@gmail.com> writes:

> That sounds like a really nice tool!

I also really like the general principle of avoiding work that is
redundant in the end. Let's save some polar bears! And some developers
from serious injuries stemming from sword fights.

> One problem I see is that we get yet another AST hashing
> implementation in clang. I'm aware of at least four of them that I
> personally work with:
>
> * The normal Stmt::Profile for Stmts

This does indeed look structurally similar to our implementation for
statements. However, types are only included by means of their pointers
and, therefore, the hash is not stable across compilations. I wonder for
which applications this is actually sufficient. And does it really
speed things up that much?

> * The ODR hashing for AST nodes in modules.

This is also based on StmtProfilerWithoutPointers as far as I can
tell[1]. And it is stable across invocations as it actually handles
types. However, changing the order of declarations[1] will alter the
hash. Also, one thing we did in cHash is to use pseudorandom identifiers
for different AST node types[3] to avoid some of the possible
collisions. If you only add small integer values (getKind(), etc.), it
is much easier to produce a collision. Therefore, we add something like

<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

to the hashing stream most of the time. Ideally, the text that goes into
the hash should be usable to reconstruct an AST with the same
structure, though not necessarily with the same information.

I don't understand your need for pseudorandom identifiers. If a specific
type of AST node (for instance, a Decl) is expected, then adding to the
stream the node kind (like from Decl::getKind()) is enough to avoid
collisions. If any node can be expected, then first pass in an enum to
differentiate the different types (Decl, Stmt, Type, Attr, etc.), then add
the node kind, and then continue on. Stmt::Profile and ODRHash use this
method to avoid collisions. I don't see any gain from using new
identifiers here.

Richard Trieu <rtrieu@google.com> writes:

I don't understand your need for pseudorandom identifiers. If a specific
type of AST node (for instance, a Decl) is expected, then adding to the
stream the node kind (like from Decl::getKind()) is enough to avoid
collisions. If any node can be expected, then first pass in an enum to
differentiate the different types (Decl, Stmt, Type, Attr, etc.), then add
the node kind, and then continue on. Stmt::Profile and ODRHash use this
method to avoid collisions. I don't see any gain from using new
identifiers here.

OK, so let me explain my idea here a little bit: Of course, including the
Decl::getKind() also distinguishes node types from each other. However,
these values come from enums and are, therefore, small integers. Small
integers (SIs) occur very often in an AST. So, if a sequence of addData()
operations produces the same sequence of SIs because node types are not
distinguishable from SIs that occur in the AST, we get a collision[1].

We could see the same sequence for different ASTs if an AST node has a
variable number of children/references. Especially when using the
RecursiveASTVisitor, which reduces the implementation burden of such a
hashing method enormously[2], I'm not 100% sure that I find all node
types with a variable number of children/references.

Introducing 32 bits of randomness as a node-type identifier makes the
probability of such a constructed collision much smaller. So it is a
better-safe-than-sorry approach at this point. I also think the enum in
CHashVisitor.h is very, very ugly.

chris

[1] Sure, we use hashing, so there is always the possibility of hash
    collisions, but we should at least avoid feeding the hash algorithm
    the same sequence for different ASTs.

[2] Porting[3] CHash to the RecursiveASTVisitor saved about 1200
    lines of code.

[3] More like throwing it away and building it from scratch using
    fragments of the original code.

Richard Trieu <rtrieu@google.com> writes:

> I don't understand your need for pseudorandom identifiers. If a specific
> type of AST node (for instance, a Decl) is expected, then adding to the
> stream the node kind (like from Decl::getKind()) is enough to avoid
> collisions. If any node can be expected, then first pass in an enum to
> differentiate the different types (Decl, Stmt, Type, Attr, etc.), then add
> the node kind, and then continue on. Stmt::Profile and ODRHash use this
> method to avoid collisions. I don't see any gain from using new
> identifiers here.

OK, so let me explain my idea here a little bit: Of course, including the
Decl::getKind() also distinguishes node types from each other. However,
these values come from enums and are, therefore, small integers. Small
integers (SIs) occur very often in an AST. So, if a sequence of addData()
operations produces the same sequence of SIs because node types are not
distinguishable from SIs that occur in the AST, we get a collision[1].

How does the result of Decl::getKind() being a small integer allow any
chance of collision? I'll admit that arbitrary sub-sequences of different
hash streams can be equal, but we shouldn't be hashing sub-sequences.

Decl
<Decl::getKind()>

Decl with two fields
<Decl::getKind()> <Field 1> < Field 2>

Decl with one sub-Decl
<Decl::getKind()> <Stream of sub-decl>

Decl with arbitrary number of sub-Decl's
<Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
of sub-decl N>

The result of Decl::getKind() will basically specify the format of the
hash stream. I think better control of the input stream is a better way
of avoiding collisions than using magic numbers as identifiers.
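
In code, that layout would look roughly like this (the types below are
hypothetical stand-ins, just to show the stream):

  #include <cstdint>
  #include <vector>

  struct HashStream {                    // hypothetical helper
    std::vector<uint64_t> Data;
    void addData(uint64_t V) { Data.push_back(V); }
  };

  struct FakeDecl {                      // hypothetical stand-in for a Decl
    uint64_t Kind;                       // <Decl::getKind()>
    std::vector<uint64_t> Fields;        // <Field 1> ... <Field k>
    std::vector<FakeDecl> SubDecls;      // sub-Decl's
  };

  void addDecl(HashStream &S, const FakeDecl &D) {
    S.addData(D.Kind);                   // the kind determines the format
    for (uint64_t F : D.Fields)
      S.addData(F);
    S.addData(D.SubDecls.size());        // <Number of sub-Decl's>
    for (const FakeDecl &Sub : D.SubDecls)
      addDecl(S, Sub);                   // <Stream of sub-decl i>
  }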

We could see the same sequence for different ASTs if an AST node has a
variable number of children/references. Especially when using the
RecursiveASTVisitor, which reduces the implementation burden of such a
hashing method enormously[2], I'm not 100% sure that I find all node
types with a variable number of children/references.

Adding the number of children before visiting each child would avoid a
collision.

Richard Trieu <rtrieu@google.com> writes:

Richard Trieu <rtrieu@google.com> writes:

> I don't understand your need for pseudorandom identifiers. If a specific
> type of AST node (for instance, a Decl) is expected, then adding to the
> stream the node kind (like from Decl::getKind()) is enough to avoid
> collisions. If any node can be expected, then first pass in an enum to
> differentiate the different types (Decl, Stmt, Type, Attr, etc.), then add
> the node kind, and then continue on. Stmt::Profile and ODRHash use this
> method to avoid collisions. I don't see any gain from using new
> identifiers here.

OK, so let me explain my idea here a little bit: Of course, including the
Decl::getKind() also distinguishes node types from each other. However,
these values come from enums and are, therefore, small integers. Small
integers (SIs) occur very often in an AST. So, if a sequence of addData()
operations produces the same sequence of SIs because node types are not
distinguishable from SIs that occur in the AST, we get a collision[1].

How does the result of Decl::getKind() being a small integer allow any
chance of collision? I'll admit that arbitrary sub-sequences of different
hash streams can be equal, but we shouldn't be hashing sub-sequences.

Sure, we shouldn't take an arbitrary subsequence of the stream and hash
it. That would be a bad idea.

Decl
<Decl::getKind()>

Decl with two fields
<Decl::getKind()> <Field 1> < Field 2>

Decl with one sub-Decl
<Decl::getKind()> <Stream of sub-decl>

Decl with arbitrary number of sub-Decl's
<Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
of sub-decl N>

The result of Decl::getKind() will basically specify the format of the
hash stream. I think better control of the input stream is a better way
of avoiding collisions than using magic numbers as identifiers.

We could see the same sequence for different ASTs if an AST node has a
variable number of children/references.

Adding the number of children before visiting each child would avoid a
collision.

Sure, this would definitely be a better option. A quick look into the
RecursiveASTVisitor tells me that there are at least 76 for (...) loops,
so we would have to add the length indicators in all of them.

Perhaps we can combine both length indicators and non-small-integers by
using llvm::hash on the <Decl::getKind()> values. As the documentation
says, we have to spend less than 20 cycles per 32-bit value on this.
This way, we avoid the magic numbers. I will start on doing the changes
to incorporate the length fields.

Furthermore, I'm not sure how the clang community handles the "how do I
find a reviewer for my patch" question. Do people add themselves as
reviewers? Or do I just add people who I think could be interested in
reviewing the change?

chris

Christian Dietrich <dietrich@sra.uni-hannover.de> writes:

Sure, this would definitely be a better option. A quick look into the
RecursiveASTVisitor tells me that there are at least 76 for (...) loops,
so we would have to add the length indicators in all of them.

Unfortunately, there is no general way to get the number of elements in
an iterator_range other than calling std::distance. However,

  addData(std::distance(S->attrs().begin(), S->attrs().end()));

is kind of ugly, so we'd better leave that to the implementor of addData()
by using a template matching iterator_range<> and simply give the child
range to the Collector user:

  addData(S->attrs());

In this case, the user of the collector can do fancier things with it.
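
A rough sketch of such an overload (the surrounding Collector type is
hypothetical):

  #include "llvm/ADT/iterator_range.h"
  #include <cstdint>
  #include <iterator>
  #include <vector>

  struct Collector {                     // hypothetical collector
    std::vector<uint64_t> Stream;

    void addData(uint64_t V) { Stream.push_back(V); }

    // Fold a child range into the stream: length indicator first, then
    // each element (this assumes addData overloads exist for the
    // element type).
    template <typename IteratorT>
    void addData(llvm::iterator_range<IteratorT> Range) {
      addData(static_cast<uint64_t>(
          std::distance(Range.begin(), Range.end())));
      for (const auto &Element : Range)
        addData(Element);
    }
  };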

Perhaps we can combine both length indicators and non-small-integers by
using llvm::hash on the <Decl::getKind()> values. As the documentation
says, we have to spend less than 20 cycles per 32-bit value on this.
This way, we avoid the magic numbers. I will start on doing the changes
to incorporate the length fields.

Wait a moment. Thinking about it, the number of children is not a
node-local property, so it should be placed in a different category.
Perhaps we could do something similar to D40781[1]:

class Collector {
  code Local = [{}];
  code CrossRef = [{}];

  code CHash = !codeconcat(Local, CrossRef);
}

class Decl : Collector {
   code Local = [{
      addData(S->getKind());
   }];

   code CrossRef = [{
      addData(S->attrs());
   }];
}

However, D40782[2] has to land first in LLVM for this, since TableGen
has no !codeconcat command at the moment. So for the moment, I will put
it into Code = [{}];

chris

[1] ⚙ D40781 [DataCollection] Allow choosing between collection categories
[2] ⚙ D40782 [tablegen] Replace strconcat, listconcat by concat

Christian Dietrich <dietrich@sra.uni-hannover.de> writes:

> Sure, this would definitely be a better option. A quick look into the
> RecursiveASTVisitor tells me that there are at least 76 for (...) loops,
> so we would have to add the length indicators in all of them.

Unfortunately, there is no general way to get the number of elements in
an iterator_range other than calling std::distance. However,

  addData(std::distance(S->attrs().begin(), S->attrs().end()));

is kind of ugly, so we'd better leave that to the implementor of addData()
by using a template matching iterator_range<> and simply give the child
range to the Collector user:

  addData(S->attrs());

Decl::getAttrs() will return a reference to a SmallVector which can be
queried for the size. Decl::hasAttrs() needs to be checked before
calling.