[RFC] [GSoC] AST differencing

Hi,

I have been working on creating a syntax tree differencing library.
So far, my focus has been on getting the core algorithm right, some
parts are missing.

I have submitted a patch [1]. I already received some great feedback and
I am always open to suggestions for improvement.

Additionally, if you have ideas to discuss about this, please let me
know, then I can keep it in mind while designing.

I think it should be fairly easy to create a tool that uses the same
output format as diff while only showing semantic changes. More advanced
tools seem to be quite possible also.

One difficulty I am facing so far is that extracting values of
individual AST nodes requires me to manually combine all relevant
attributes. The value of a node is all things that make it semantically
different from other nodes of the same kind, except for its children.

After an edit script has been computed it would be possible to apply it,
but in order such an edit script it will be necessary to implement a way
to do the opposite of extracting node information: updating the source
code based on the previously extracted information (node value). It is
probably not particularly difficult but it will require qutie some
effort to do it in a way that it can work for any node.

Johannes

[1] ⚙ D34329 [clang-diff] Initial implementation.

Hi Johannes,

I think this is a great area of research. For Java, for example, refaster (http://errorprone.info/docs/refaster) exists which lets you express simple code changes with “before / after” samples. If we can get something like this based on Clang’s AST, that would be most awesome - is this a direction you’re working towards?

Cheers,
/Manuel

Manuel Klimek <klimek@google.com> writes:

Hi Johannes,

I think this is a great area of research. For Java, for example, refaster (
Refaster templates) exists which lets you express simple
code changes with "before / after" samples. If we can get something like
this based on Clang's AST, that would be most awesome - is this a direction
you're working towards?

Hi Manuel,

Yes, this is definitely something that can be done, I will design for
that also. I am in the process of modularizing the API, so that it will
allow this kind of pattern matching.

So the main advantage of such template based refactoring rules is that
they are easy to write (compared to an AST-matchers based solution). I
think it is a very nice way of writing refactoring, I am working on a
prototype.

For the AST matching / diffing API I am thinking of getting rid of the
custom Node class and instead just let the user pass Decl or Expr
nodes. Also, it should be possible to pass a custom predicate for
matching nodes, this seems to be the best way to enable exact pattern
based matches - as opposed to the gumtree algorithm, which is based on a
heuristic.

Johannes

Manuel Klimek <klimek@google.com> writes:

Hi Johannes,

I think this is a great area of research. For Java, for example, refaster (
http://errorprone.info/docs/refaster) exists which lets you express simple
code changes with “before / after” samples. If we can get something like
this based on Clang’s AST, that would be most awesome - is this a direction
you’re working towards?

Hi Manuel,

Yes, this is definitely something that can be done, I will design for
that also. I am in the process of modularizing the API, so that it will
allow this kind of pattern matching.

So the main advantage of such template based refactoring rules is that
they are easy to write (compared to an AST-matchers based solution). I
think it is a very nice way of writing refactoring, I am working on a
prototype.

For the AST matching / diffing API I am thinking of getting rid of the
custom Node class and instead just let the user pass Decl or Expr
nodes.

I’m not sure what you mean here - the custom Node class is mainly there to make it easier to write code that works on all entities in the AST. If you want Decl and Expr, you’ll probably at least also want Type. Usually you’ll also want some for of locs, because often you’ll change NestedNameSpecifiers which do not have location information per se.