GSoC proposal - Finding and analysing copy-pasted code with clang

Hi everybody,

just wanted to post my GSoC proposal for feedback:

https://docs.google.com/document/d/1hY_EUIqeQ6cAYaIqrePWvcy6XmkMv5BoxrfskEv5Tl0/edit?usp=sharing

Feel free to comment on the document if you think something should be improved.
I'll also try to idle in the IRC channel from now on, if someone prefers
that. My nickname there is also "teemperor".

Cheers,

Raphael Isemann

Hi all,
I just want to resend a few things from our internal discussion here for the record. I will summarize them:

  • Raphael contacted me early (1-2 months before GSoC started) to express his interest in the copy-paste project. He wanted to continue the development of our prototype (done by Kirill in GSoC 2015).
  • Raphael learned from Kirill and me the current state of the project and the current implementation deficiencies (this section of his proposal includes our feedback).
  • Raphael started playing with the implementation early and had a few good ideas on how to further improve the code-clone hashing.
  • Raphael also proposed to try to feed back as much as possible to clang's mainline (something that we didn't have time to do last year).
  • Raphael has good experience with relevant projects, such as his work on WebAssembly.

If you are interested in more technical details, please let us know.

I find this proposal very reasonable and the candidacy very strong. From the CV and the proposal I think the candidate can do what he suggests, and I'd be happy to mentor him. I'd be happy to hear Anna's comments on his proposal.
--Vassil

Hi Vassil and Raphael,

Sorry for the delay, I just got to reading your proposal. Below are some comments.

If I understand correctly, you are proposing to:

  1. Add another stand-alone tool + a library that performs cross-translation unit clone detection on AST-level.
  2. Add a checker to the Clang Static Analyzer that performs (the same?) clone detection but limited to a single translation unit.

How much code reuse will there be between the two? Will the stand-alone tool be built on top of the checker? I did not get that feeling from the proposal, especially, since the stand-alone tool will be completed first. It seems that all of the goals mentioned in the proposal except for the cross translation unit analysis could be done in the static analyzer. So why not start with that? I think it would be very beneficial for the project to have some clone detection committed in tree, immediately available to all of the existing users of the static analyzer!

One of the obstacles in contributing the existing checker to the analyzer is issue reporting. Have you considered reporting the subsequent clones as a note on the first clone? The clones are related and it looks like the current output does not highlight that. (The static analyzer does not support notes right now, so you’d need to extend that functionality.)

I am very apprehensive about adding yet another analysis tool to the clang ecosystem. Having clang-tidy, the Clang Static Analyzer + yet another tool would be quite confusing to the user. The most user friendly approach is to have a single tool that highlights all the problems the users have in their code. I do acknowledge that it would not be possible to make the clone checker scale to cross translation unit analysis since we do not currently have the infrastructure to support that. However, building the stand-alone tool on top of the checker would allow turning it into a cross-translation unit checker once the infrastructure is added to the static analyzer.

Have you looked into CodeChecker and the new scan-build.py projects? They do rely on using the compilation database, which is something you plan on doing as well. Can you reuse scan-build.py instead of writing your own build interposition? The goal of CodeChecker is to collect and display static analysis reports generated by all clang-based tools, specifically, both clang-tidy and the Clang Static Analyzer are already supported. It would be valuable if the new stand-alone tools would be incorporated into the same workflow. This way the users could have a single point of entry when they look for bugs.

CodeChecker incorporates a nice bug viewing UI. Integrating clone reporting into that UI would be great. However, you might need to extend/modify both the reporting and the UI to make it look great.

What do you think?
Anna.

Raphael,

I just noticed that you propose to store analysis results in a JSON database. Note that the static analyzer already has a rich format (plist) that is used to store reports for further processing. Also, that format is being serialized into a database by CodeChecker.

If your goal is to integrate your project into LLVM at some point, I suggest sending out incremental changes for review, following the incremental development policy. This has a lot of benefits; for example, it will allow the community to provide you with sufficient feedback along the way!

Cheers,
Anna.

Hi Anna,

thanks for the comments!

I wrote my answers inline (see below):

Hi Vassil and Raphael,

Sorry for the delay, I just got to reading your proposal. Below are some
comments.

If I understand correctly, you are proposing to:
1) Add another stand-alone tool + a library that performs cross-translation
unit clone detection on AST-level.
2) Add a checker to the Clang Static Analyzer that performs (the same?)
clone detection but limited to a single translation unit.

That's right. And it is the same clone detection mechanism.

How much code reuse will there be between the two? Will the stand-alone tool
be built on top of the checker? I did not get that feeling from the
proposal, especially, since the stand-alone tool will be completed first. It
seems that all of the goals mentioned in the proposal except for the cross
translation unit analysis could be done in the static analyzer. So why not
start with that? I think it would be very beneficial for the project to have
some clone detection committed in tree, immediately available to all of the
existing users of the static analyzer!

My current plan is to have some logic for finding clones in ASTs that
is used by the standalone tool and the checker. They are both just
wrappers around this clone detection code that provide input and report the
output in an appropriate way.
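
To make that split a bit more concrete, here is a rough C++ sketch of the structure I have in mind; all of the names are placeholders, not a final API:

    // Sketch only: class and function names are placeholders, not a final API.
    #include "clang/AST/ASTContext.h"
    #include "clang/AST/Stmt.h"
    #include <vector>

    namespace clone_detection {

    // A group of statement sequences that were found to be clones of each other.
    struct CloneGroup {
      std::vector<const clang::Stmt *> Sequences;
    };

    // Shared engine: hashes the statements of a translation unit and collects
    // groups of matching code. It knows nothing about how results are reported.
    class CloneDetector {
    public:
      void analyzeTranslationUnit(clang::ASTContext &Context);
      std::vector<CloneGroup> getCloneGroups() const;
    };

    } // namespace clone_detection

The checker would run the detector on its single TU and turn each CloneGroup into analyzer diagnostics, while the standalone tool would feed in the TUs of a whole project and print or serialize the groups.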

The reason why I put the standalone tool as the first point in the
schedule is just that it's faster to develop the early versions of the
clone detection code inside the smaller codebase.
Rebuilding clang takes a few minutes, rebuilding the standalone tool
takes a few seconds.

But I don't see any problem with finishing the checker first as soon
as the clone detection code is mature enough.

One of the obstacles in contributing the existing checker to the analyzer is
issue reporting. Have you considered reporting the subsequent clones as a
note on the first clone? The clones are related and it looks like the
current output does not highlight that. (The static analyzer does not
support notes right now, so you'd need to extend that functionality.)

No, I haven't considered that so far but it sounds like something we should do!

I am very apprehensive about adding yet another analysis tool to the clang
ecosystem. Having clang-tidy, the Clang Static Analyzer + yet another tool
would be quite confusing to the user. The most user friendly approach is to
have a single tool that highlights all the problems the users have in their
code. I do acknowledge that it would not be possible to make the clone
checker scale to cross translation unit analysis since we do not currently
have the infrastructure to support that. However, building the stand-alone
tool on top of the checker would allow turning it into a cross-translation
unit checker once the infrastructure is added to the static analyzer.

I see the problem for the user but I'm not sure if I understood the
last sentence of this paragraph correctly. Isn't the standalone tool
always obsolete as soon as there is cross-TU support in the static
analyzer?

Also, are there already any plans for what this cross-TU support will
look like? I'm interested in how the problem of memory consumption
would be handled, as keeping all TUs of a project in memory is
something I assume to be impossible for big projects. Someone actually
dropped me an email last week confirming this.

The standalone tool can do some trickery to get around this by
discarding TUs after hashing the nodes and then reloading them if
necessary, but that obviously only works in the special use case of
this project.
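
Just to illustrate that idea (the hashing itself is elided, and HashEntry/collectHashes are made-up names), the standalone tool could do something along these lines with clang's tooling library:

    // Illustration only: parse one TU at a time, keep just the hashes, and let
    // the AST go away again so memory usage stays bounded.
    #include "clang/Frontend/ASTUnit.h"
    #include "clang/Tooling/CompilationDatabase.h"
    #include "clang/Tooling/Tooling.h"
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    struct HashEntry {
      std::string File;   // TU the hashed code came from
      unsigned StmtIndex; // enough information to find the code again on reload
    };

    void collectHashes(const clang::tooling::CompilationDatabase &Compilations,
                       const std::vector<std::string> &Files,
                       std::map<size_t, std::vector<HashEntry>> &HashTable) {
      for (const std::string &File : Files) {
        clang::tooling::ClangTool Tool(Compilations, {File});
        std::vector<std::unique_ptr<clang::ASTUnit>> ASTs;
        Tool.buildASTs(ASTs);
        // ... walk the AST and insert (hash -> HashEntry) pairs into HashTable ...
        // ASTs goes out of scope here, so the full AST is not kept in memory.
      }
      // Afterwards, only the TUs whose hashes collide with hashes from other TUs
      // have to be re-parsed to verify and report the actual clones.
    }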

Have you looked into CodeChecker and the new scan-build.py projects? They do
rely on using the compilation database, which is something you plan on doing
as well. Can you reuse scan-build.py instead of writing your own build
interposition? The goal of CodeChecker is to collect and display static
analysis reports generated by all clang-based tools, specifically, both
clang-tidy and the Clang Static Analyzer are already supported. It would be
valuable if the new stand-alone tools would be incorporated into the same
workflow. This way the users could have a single point of entry when they
look for bugs.

CodeChecker incorporates a nice bug viewing UI. Integrating clone reporting
into that UI would be great. However, you might need to extend/modify both
the reporting and the UI to make it look great.

I just had time for a quick peek at CodeChecker, so I can only say it
sounds like a good idea, but I can't say whether I'll work on that
without knowing how much work it will be in the end.

Lazlo already pointed me to scan-build.py for generating the
compilation database while I was writing the proposal, and I already
use it for generating test databases, so using scan-build.py for
interposition is confirmed, I guess.
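
For reference, reading the compile_commands.json that scan-build.py produces is straightforward with clang's tooling library (the build directory path below is of course just a placeholder):

    #include "clang/Tooling/CompilationDatabase.h"
    #include <iostream>
    #include <string>

    int main() {
      std::string Error;
      // Looks for compile_commands.json in the given build directory.
      auto DB = clang::tooling::CompilationDatabase::autoDetectFromDirectory(
          "/path/to/build", Error);
      if (!DB) {
        std::cerr << Error << "\n";
        return 1;
      }
      // These are the files the standalone tool would hand to the clone
      // detection code, one TU at a time.
      for (const std::string &File : DB->getAllFiles())
        std::cout << File << "\n";
      return 0;
    }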

What do you think?
Anna.

I hope I didn't miss answering anything, but if I did, please point it out!

Cheers,

- Raphael

Hi,

thanks for the hints! I'll try to get early feedback on anything I work on :)

- Raphael

Hi Raphael,

This looks like a very interesting project.
From what I can tell, the prototype from last year's GSoC was developed out-of-tree and apparently without much community involvement.
I’d like to make sure we (as the clang/LLVM community) get as much as possible out of this project if it gets selected, so IMO it should be:

1) developed in-tree and not in a separate clone, from day 1.
2) developed with involvement of the community, and especially making sure the design can integrate nicely with the existing infrastructure (i.e., validated with the static analyzer code owner). Ideally Anna or someone involved with the static analysis infrastructure should sign up to co-mentor the project.

The availability of a standalone tool is a lot less important than a well-integrated infrastructure/library, IMO.

Can you elaborate on your position with respect to that?
I’m interested in everyone's opinion as well of course.

Best,

Hi Anna,

thanks for the comments!

I wrote my answers inline (see below):

Hi Vassil and Raphael,

Sorry for the delay, I just got to reading your proposal. Below are some
comments.

If I understand correctly, you are proposing to:
1) Add another stand-alone tool + a library that performs cross-translation
unit clone detection on AST-level.
2) Add a checker to the Clang Static Analyzer that performs (the same?)
clone detection but limited to a single translation unit.

That's right. And it is the same clone detection mechanism.

How much code reuse will there be between the two? Will the stand-alone tool
be built on top of the checker? I did not get that feeling from the
proposal, especially, since the stand-alone tool will be completed first. It
seems that all of the goals mentioned in the proposal except for the cross
translation unit analysis could be done in the static analyzer. So why not
start with that? I think it would be very beneficial for the project to have
some clone detection committed in tree, immediately available to all of the
existing users of the static analyzer!

My current plan is to have some logic for finding clones in ASTs that
is used by the standalone tool and the checker. They are both just
wrappers around this clone detection code that provide input and report the
output in an appropriate way.

The reason why I put the standalone tool as the first point in the
schedule is just that it's faster to develop the early versions of the
clone detection code inside the smaller codebase.
Rebuilding clang takes a few minutes, rebuilding the standalone tool
takes a few seconds.

LLVM follows an incremental development policy (http://llvm.org/docs/DeveloperPolicy.html#incremental-development). Developing on a branch or within a separate tool and landing all of the mature changes into LLVM later on would introduce other issues.

I would much rather start with committing a weaker clone checker into the repo, changing the clang/static analyzer so that it can report the clones, and incrementally enhancing the check. I think that alone will be a lot of work! But on the upside, you will have a lot of users.

If you have time left, you could explore building a stand-alone tool that calls the checker and extends the clone detection to work across TUs. (That should be doable.) Further, we would want to integrate this tool into an issue navigator such as CodeChecker.

But I don't see any problem with finishing the checker first as soon
as the clone detection code is mature enough.

One of the obstacles in contributing the existing checker to the analyzer is
issue reporting. Have you considered reporting the subsequent clones as a
note on the first clone? The clones are related and it looks like the
current output does not highlight that. (The static analyzer does not
support notes right now, so you'd need to extend that functionality.)

No, I haven't considered that so far but it sounds like something we should do!

I am very apprehensive about adding yet another analysis tool to the clang
ecosystem. Having clang-tidy, the Clang Static Analyzer + yet another tool
would be quite confusing to the user. The most user friendly approach is to
have a single tool that highlights all the problems the users have in their
code. I do acknowledge that it would not be possible to make the clone
checker scale to cross translation unit analysis since we do not currently
have the infrastructure to support that. However, building the stand-alone
tool on top of the checker would allow turning it into a cross-translation
unit checker once the infrastructure is added to the static analyzer.

I see the problem for the user but I'm not sure if I understood the
last sentence of this paragraph correctly. Isn't the standalone tool
always obsolete as soon as there is cross-TU support in the static
analyzer?

Not necessarily. Additional work will be required to make this particular checker work across translation units. If you design the tool so that it is built using the checker as a primitive, less work would be needed to add cross-translation-unit support in the analyzer.

Also, are there already any plans for what this cross-TU support will
look like? I'm interested in how the problem of memory consumption
would be handled, as keeping all TUs of a project in memory is
something I assume to be impossible for big projects.

Keeping all TUs in memory is not an option we are considering. Most likely, we would be building summaries for functions and using those for analysis. There have been various discussions on this topic on the list, but it is not yet clear what the best approach would be.

Someone actually dropped me an email last week confirming this.

The standalone tool can do some trickery to get around this by
discarding TUs after hashing the nodes and then reloading them if
necessary, but that obviously only works in the special use case of
this project.

Once you have more specifics, let's discuss them.

Hi Anna & Mehdi,

having more contact with the LLVM/clang community (which is the reason
I want to participate in GSoC anyway) and getting early feedback by
working in-tree sounds good to me!

So from my point of view I don't see any problem with following Anna's
plan and Mehdi's suggestions.

Cheers,

- Raphael