[GSoC 2017] Clang-based diff tool project

Hello,

I am currently studying Computer Science at TU Eindhoven. I am doing a
course that involves programming assignments on parts of LLVM such as
lowering, scheduling and optimization. For this year's Google Summer of
Code I plan to submit a proposal to implement a clang-based diff tool
[1].

I think it really pays off to have decent developer tools available, as
they can save tons of time. Clang tooling has obviously been very
successful. I think it would be a good idea to develop a diff tool that
considers the structure of the code, as opposed to just the lines. Plain
old diff only thinks in terms of "additions" and "deletions", although
it would be more natural to also consider "updates" and "moves".

So a structural diff would work solely on the AST, hence formatting
changes are ignored. It would allow to highlight the exact location of a
change, and not a whole line. Furthermore, it would allow to compare
pieces of code with the same structure (think subclasses).

Besides some papers with clever AST-matching algorithms, a quick web
search yielded [2], which is a proof-of-concept implementation of a
structural comparison algorithm. I think it demonstrates rather nicely
what could be done: movement of chunks of code can be easily traced.

Anyway, one could make all kinds of nice visualizations using a AST diff
tool, however, I think the initial focus should probably be on creating
one with a similar output to traditional diff, with the difference that
updates and moves are displayed in a easily readable way, which already
could improve developer productivity and happiness.

As of now I have one question: The output of the tool is meant just for
humans to read (and not for actual patching), right?

To sum up, this could be a very interesting project for me to work on,
and the result will hopefully be useful to a wide range of developers. I
would appreciate any feedback. Also, suggestions on how the diff output
should be presented are welcome. Thank you!

Johannes

[1] http://llvm.org/OpenProjects.html#clang-diff-tool
[2] ydiff: a structural program comparison tool – Surely I Am Joking

Hello,

I am currently studying Computer Science at TU Eindhoven. I am doing a
course that involves programming assignments on parts of LLVM such as
lowering, scheduling and optimization. For this year's Google Summer of
Code I plan to submit a proposal to implement a clang-based diff tool
[1].

I think it really pays off to have decent developer tools available, as
they can save tons of time. Clang tooling has obviously been very
successful. I think it would be a good idea to develop a diff tool that
considers the structure of the code, as opposed to just the lines. Plain
old diff only thinks in terms of "additions" and "deletions", although
it would be more natural to also consider "updates" and "moves".

So a structural diff would work solely on the AST, hence formatting
changes are ignored. It would allow to highlight the exact location of a
change, and not a whole line. Furthermore, it would allow to compare
pieces of code with the same structure (think subclasses).

Besides some papers with clever AST-matching algorithms, a quick web
search yielded [2], which is a proof-of-concept implementation of a
structural comparison algorithm. I think it demonstrates rather nicely
what could be done: movement of chunks of code can be easily traced.

There is also a fair amount of literature associated with "XML Diff" tools which also demonstrate this kind of structural comparison. For example, see:

   http://diffxml.sourceforge.net/
   https://www.cs.hut.fi/~ctl/3dm/
   X-Diff: An XML Diff Tool

Anyway, one could make all kinds of nice visualizations using a AST diff
tool, however, I think the initial focus should probably be on creating
one with a similar output to traditional diff, with the difference that
updates and moves are displayed in a easily readable way, which already
could improve developer productivity and happiness.

As of now I have one question: The output of the tool is meant just for
humans to read (and not for actual patching), right?

To sum up, this could be a very interesting project for me to work on,
and the result will hopefully be useful to a wide range of developers. I
would appreciate any feedback. Also, suggestions on how the diff output
should be presented are welcome. Thank you!

In the long term, I'd love to see a semantic diff tool which could help resolve merge conflicts. Merging two branches, both which have added a function to a class, or a member to a class (including updating the constructors), is a common problem. Having some way to "apply both changes" automatically would be a big help. To that end, I'd hope that the output of the tool could indeed be used for actual patching. That does not mean it would need to follow the traditional diff format (in fact, I'd expect that it would not). Moreover, the 'patch' part of the tool might well be out-of-scope for the initial project. I do think, however, we should at least have a mode where the 'diff' output is precise and machine readable so that we might later design patching tools.

  -Hal

(+CC: Greg Clayton who gave me this idea in the first place)

Hello,

I am currently studying Computer Science at TU Eindhoven. I am doing a
course that involves programming assignments on parts of LLVM such as
lowering, scheduling and optimization. For this year's Google Summer of
Code I plan to submit a proposal to implement a clang-based diff tool
[1].

Great! I look forward to see this :slight_smile:

I think it really pays off to have decent developer tools available, as
they can save tons of time. Clang tooling has obviously been very
successful. I think it would be a good idea to develop a diff tool that
considers the structure of the code, as opposed to just the lines. Plain
old diff only thinks in terms of "additions" and "deletions", although
it would be more natural to also consider "updates" and "moves".

So a structural diff would work solely on the AST, hence formatting
changes are ignored. It would allow to highlight the exact location of a
change, and not a whole line. Furthermore, it would allow to compare
pieces of code with the same structure (think subclasses).

Besides some papers with clever AST-matching algorithms, a quick web
search yielded [2], which is a proof-of-concept implementation of a
structural comparison algorithm. I think it demonstrates rather nicely
what could be done: movement of chunks of code can be easily traced.

Anyway, one could make all kinds of nice visualizations using a AST diff
tool, however, I think the initial focus should probably be on creating
one with a similar output to traditional diff, with the difference that
updates and moves are displayed in a easily readable way, which already
could improve developer productivity and happiness.

As of now I have one question: The output of the tool is meant just for
humans to read (and not for actual patching), right?

Yes. But we developed software as libraries usually. Practically I expect the main part of the work to write some piece of API that generate an “in-memory” representation of the diff.

A tool that is generating a textual-human readable output is likely the first client of this API and is likely critical to be able to functionally test it in the early development. In the future I hope it’d enable other graphical diff client to plug-in, or git-merge resolution tools as well.

Best,

So I have been playing around with this some more.

Yes. But we developed software as libraries usually. Practically I
expect the main part of the work to write some piece of API that
generate an “in-memory” representation of the diff.

Yes, of course, I think the representation will be just a set of
changes, where a change is a insert/update/removal of a node.

In the future I hope it’d enable other graphical diff client to
plug-in, or git-merge resolution tools as well.

Yes, merge tools would be a nice application for this, as Hal also
pointed out. Of course it is easy to make the output machine readable or
just use the API in the first place.

Unfortunately it is not possible to (de-)serialize the AST from/to XML
(or some other structured data format). If it was, one could use
existing structural diff tools to make prototypes or help test the final
version.

I have another question as I have never written a clang tool before: One
could use LibClang or LibTooling, the former seems easier and probably
good enough (?)

Furthermore, should I work on a small patch / fix an easy bug to show
that I am comfortable with the development process? I am a bit lost with
the bug tracker, so if you can suggest things to work I will try that.
Other than that I can work on a prototype for the tool but that takes
some time. I found one small typo in clang, I guess I could submit
that..

Johannes

diff --git a/include/clang/AST/RecursiveASTVisitor.h b/include/clang/AST/RecursiveASTVisitor.h
index 1b5850a05b..c31e4b0bb4 100644
--- a/include/clang/AST/RecursiveASTVisitor.h
+++ b/include/clang/AST/RecursiveASTVisitor.h
@@ -83,7 +83,7 @@ namespace clang {
       return false; \
   } while (false)

-/// \brief A class that does preordor or postorder
+/// \brief A class that does preorder or postorder
/// depth-first traversal on the entire Clang AST and visits each node.
///
/// This class performs three distinct tasks:

Mehdi Amini <mehdi.amini@apple.com> writes:

My original idea was to write a semantic diff tool that just does some simple things up front:

create an MD5 from all top level blocks of the code. Start by just finding matching blocks of code ('{' and '}', '(' and ')') and remember the source locations for these and their MD5 values. Run a normal diff on the code and see what blocks the diffs fall into. Then try to figure out where things moved by possibly delving deeper into each block that matched something from the diff. Also if any blocks moved to a completely different location, try and figure that out by matching the MD5 of any blocks.

For example if you had:

int main(int argc, const char **argv) {
  if (argc > 2) {
  }
  switch (argc) {
  }
}

You would first make MD5s for the '(' and ')' in the "main" line and for the '{' at the end of the main line, and ending at the end of the code. Now the code looks like:

int main(int argc, const char **argv) {
  switch (argc) {
  }
  if (argc > 2) {
  }
}

The diff would show that the "if" is gone and a new "if" is found after the switch in the new version of the file. We would notice that the diff appears inside the block from the first:

{
  if (argc > 2) {
  }
  switch (argc) {
  }
}

And in the block from the second:

{
  switch (argc) {
  }
  if (argc > 2) {
  }
}

So we would then compute the MD5 for the blocks inside each of these blocks and try to match things up. The MD5 would of course remove spaces that aren't in strings and only compute the MD5 from the characters that make sense. This simple type of approach could almost work on any language without the need to be able to correctly compile each file with all the right options.

Greg

I had a student implement such a tool a couple of years ago as a final-year (undergrad) project. He used the Python bindings to libclang and made something that worked with both the Python AST and the C AST exposed by libclang.

The core idea for his work was the ability to recognise common refactoring patterns (for example inlining, outlining, variable renaming) and some fuzzy matching (so his tool could report things like ‘you renamed X to Y everywhere except here and here’).

His code was a nice proof-of-concept, but was never tidied up enough to be ready for widespread use. It would be great to see something a bit more robust.

David

I am currently considering the gumtree algorithm which is described in
[1]. It is able to detect code moves, updates. There is also a java
implementation of gumtree. It supports C++ via srcML, however this
seems to be quite incomplete and buggy. The algorithm consists of a top
down and a bottom up traversal of the AST that result in a matching of
corresponding nodes. It contains some heuristics similarity measurement.

Greg Clayton <clayborg@gmail.com> writes:

My original idea was to write a semantic diff tool that just does some
simple things up front:

create an MD5 from all top level blocks of the code. Start by just
finding matching blocks of code ('{' and '}', '(' and ')') and
remember the source locations for these and their MD5 values. Run a
normal diff on the code and see what blocks the diffs fall into. Then
try to figure out where things moved by possibly delving deeper into
each block that matched something from the diff. Also if any blocks
moved to a completely different location, try and figure that out by
matching the MD5 of any blocks.

Ok, I think I see how this works in principle. This is possibly more
efficient than gumtree, which has O(n^2) worst case complexity where n
is the number of nodes in the AST. But I am not sure..

For example if you had:

int main(int argc, const char **argv) {
  if (argc > 2) {
  }
  switch (argc) {
  }
}

You would first make MD5s for the '(' and ')' in the "main" line and
for the '{' at the end of the main line, and ending at the end of the
code. Now the code looks like:

int main(int argc, const char **argv) {
  switch (argc) {
  }
  if (argc > 2) {
  }
}

The diff would show that the "if" is gone and a new "if" is found
after the switch in the new version of the file. We would notice that
the diff appears inside the block from the first:

{
  if (argc > 2) {
  }
  switch (argc) {
  }
}

And in the block from the second:

{
  switch (argc) {
  }
  if (argc > 2) {
  }
}

So we would then compute the MD5 for the blocks inside each of these
blocks and try to match things up. The MD5 would of course remove
spaces that aren't in strings and only compute the MD5 from the
characters that make sense. This simple type of approach could almost
work on any language without the need to be able to correctly compile
each file with all the right options.

Yeah, this is another advantage, one does not need to basically have
a compiler in order to create the diff. So this would just consider
blocks enclosed by parentheses or braces. It is nice that it works for
any language that uses those for blocks.

So this would be a more lightweight and smart tool, while the gumtree is
more powerful and extensible because it operates only on the AST. If the
goal is to produce a diffing API that is able to exploit the semantics
of the language and enables tools for visualization and merging then I
think gumtree is the better choice. On the other hand, if we just want a
Unix tool that is a smart and lean improvement to diff then your idea
might be more appropriate in addition to being more flexible.

What do others think about this?

[1] https://hal.archives-ouvertes.fr/hal-01054552/document

That seems like a good summary to me.

What I find cool about an AST based tool, is the ability to extract the diff as semantic informations, for example “function (or variable) was renamed from A to B”.
I feel that you can get a totally different level of information with such a tool.

Hi,

   This seems a very exciting project.

   As part of GSoC16 Raphael developed a code clone detection tool (LLVM2016 - Clang clone detection presentation - Google Slides). We are working on turning the infrastructure into a reusable set of components (https://reviews.llvm.org/D23418).

   Raphael hacked together a few lines of code, addressing Greg's proposal based on D23418.

r1 r2
int main(int argc, const char **argv) {
   switch (argc) {
   }
   if (argc > 2) {
     return 1;
   }
   while (false);
   int funkyVariable = 1;
   funkyVariable++;
}
  int main(int argc, const char **argv) {
   if (argc > 2) {
     return 1;
   }
   switch (argc) {
   }
   while (false);
   int funkyVariable = 1;
   funkyVariable++;
}

   ./clangDiff
   Change: SwitchStmt moved from line 2 to line 5
   Change: IfStmt moved from line 4 to line 2

   Here is how it looks https://gist.github.com/Teemperor/b252bae4b2544f57d6bb9580a0e890e4

   Let us know if we can help you further with this!

-- Vassil and Raphael

Do I take that you’re volunteering to mentor it? :wink:

Neat! :slight_smile:

I’d be happy if you could take the lead. Johannes asked earlier how to start in clang and show his ability, any bug to fix or small improvement to implement you can suggest?
He also asked about libclang vs libtooling, not sure if anyone already answered.

Very cool! Now we just need a graphical tool/IDE to integrate this with so we can do these kinds of merges graphically.

Not really :wink: We would be happy to provide help reviewing patches if you choose to work with the clone detection infrastructure. I am afraid I cannot dedicate a lot of time for this. I believe these are not difficult to fix: Documentation: * * * * C++ * * (tricky) :frowning: – Vassil

So I have been playing around with this some more.

> Yes. But we developed software as libraries usually. Practically I
> expect the main part of the work to write some piece of API that
> generate an “in-memory” representation of the diff.

Yes, of course, I think the representation will be just a set of
changes, where a change is a insert/update/removal of a node.

> In the future I hope it’d enable other graphical diff client to
> plug-in, or git-merge resolution tools as well.
Yes, merge tools would be a nice application for this, as Hal also
pointed out. Of course it is easy to make the output machine readable or
just use the API in the first place.

Unfortunately it is not possible to (de-)serialize the AST from/to XML
(or some other structured data format). If it was, one could use
existing structural diff tools to make prototypes or help test the final
version.

I have another question as I have never written a clang tool before: One
could use LibClang or LibTooling, the former seems easier and probably
good enough (?)

Hi Johannes,
I think a tool like this should use LibTooling directly since libclang does
not expose all of the information that's present in the AST.
Alex

Hi,

Vassil Vassilev <v.g.vassilev@gmail.com> writes:

Hi,

  This seems a very exciting project.

Do I take that you’re volunteering to mentor it? :wink:

Not really :wink: We would be happy to provide help reviewing patches if you
choose to work with the clone detection infrastructure.

  As part of GSoC16 Raphael developed a code clone detection tool
(LLVM2016 - Clang clone detection presentation - Google Slides).
We are working on turning the infrastructure into a reusable set of
components (⚙ D23418 [analyzer] Added a reusable constraint system to the CloneDetector).

Nice, the clone detection looks useful, I was not aware that it existed
for clang. Would also be interesting to work on that..

I think a properly implemented AST diffing interface can be used as
additional pluggable constraint for the clone detection (because it is
able to do tree matching). This way it would be possible to detect type
3 clones (which include structural differences) without having to
manually create the right constraints.

  Raphael hacked together a few lines of code, addressing Greg's
proposal based on D23418.

r1 r2
int main(int argc, const char **argv) {
  switch (argc) {
  }
  if (argc > 2) {
    return 1;
  }
  while (false);
  int funkyVariable = 1;
funkyVariable++;
}
  int main(int argc, const char **argv) {
  if (argc > 2) {
    return 1;
  }
  switch (argc) {
  }
  while (false);
  int funkyVariable = 1;
funkyVariable++;
}

  ./clangDiff
  Change: SwitchStmt moved from line 2 to line 5
  Change: IfStmt moved from line 4 to line 2

Neat! :slight_smile:

  Here is how it looks
CloneDetectionExample.cpp · GitHub

  Let us know if we can help you further with this!

I’d be happy if you could take the lead. Johannes asked earlier how to
start in clang and show his ability, any bug to fix or small
improvement to implement you can suggest?

I am afraid I cannot dedicate a lot of time for this. I believe these
are not difficult to fix:
Documentation:
   * https://bugs.llvm.org/show_bug.cgi?id=16106
   * 19260 – Show include path of a header file in the doxygen online class documentation
   * 5935 – TargetOptions::Target is documented wrong.
   * 10257 – clang -arch FOO is ignored

C++
   * 24883 – clang++ uncorrectly reports "unused typedef" warning
   * 27532 – injected-class-name should not be modelled as a CXXRecordDecl (tricky)

Thanks, I did only one so far, I am working on first C++ one at the moment.

Johannes