CopyPaste detection clang static analyzer

Hi,
   A few months ago I was looking for a copy-paste detector for a C++ project. I didn't find such a feature of clang's static analyzer. Is this the case?
Many thanks,
Vassil

copy-paste detector? As in plagarism detection? That is not a feature of
Clang's static analyzer, no. There may be third party packages or
extensions that use Clang for such a task.

In article <CAENS6EsgzhXWfANFze8VAp68qDGHnrHNZJaaLmi28YJtnQwOmw@mail.gmail.com>,
    David Blaikie <dblaikie@gmail.com> writes:

> A few months ago I was looking for a copy-paste detector for a C++
> project. I didn't find such a feature of clang's static analyzer. Is this
> the case?

copy-paste detector? As in plagarism detection?

I don't think plagiarism is the concern. The conern is that
copy/paste of blocks of code where the pasted block needs to be
updated in several places, but not all of the updates were performed.

Coverity can detect such instances, for instance.

Here is an article from 2006 describing such a tool:
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.123.113>

Wikipedia says PMD has a copy/paste detector that works with C++:
<http://en.wikipedia.org/wiki/PMD_(software)#Copy.2FPaste_Detector_.28CPD.29>

"Note that CPD works with Java, JSP, C, C++, C#, Fortran and PHP code.
Your own language is missing ? See how to add it here"
<http://pmd.sourceforge.net/snapshot/cpd-usage.html>

We don't have a fully general checker yet, but the static analyzer has alpha support for certain expressions that commonly show up in these cases, such as both branches of the ternary conditional operator. You can test this support by building from trunk and enabling the checker alpha.core.IdenticalExpr. The people working on this are Daniel Fahlgren, Anders Rönnholm, and Per Viberg.

And of course, patches welcome!

Jordan

Thanks for the pointers!
We have a lot of copy-pastes in our framework (with minor changes: the for loop became a while loop and so on). I was thinking to submit a GSoC proposal for copy-paste detection and maybe to extend it to semantic detection where possible. Would the static analyzer be a good platform for the students to implement such a feature? Would it be beneficial for the analyzer?
Vassil

In article <CAENS6EsgzhXWfANFze8VAp68qDGHnrHNZJaaLmi28YJtnQwOmw@mail.gmail.com>,
     David Blaikie <dblaikie@gmail.com> writes:

   A few months ago I was looking for a copy-paste detector for a C++
project. I didn't find such a feature of clang's static analyzer. Is this
the case?

copy-paste detector? As in plagarism detection?

I don't think plagiarism is the concern. The conern is that
copy/paste of blocks of code where the pasted block needs to be
updated in several places, but not all of the updates were performed.

Exactly.

Coverity can detect such instances, for instance.

Here is an article from 2006 describing such a tool:
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.123.113>

Wikipedia says PMD has a copy/paste detector that works with C++:
<http://en.wikipedia.org/wiki/PMD_(software)#Copy.2FPaste_Detector_.28CPD.29>

"Note that CPD works with Java, JSP, C, C++, C#, Fortran and PHP code.
Your own language is missing ? See how to add it here"
<http://pmd.sourceforge.net/snapshot/cpd-usage.html>

Thanks for the pointers. I think that would be good to have in the static analyzer wouldn't it?
Vassil

In article <
CAENS6EsgzhXWfANFze8VAp68qDGHnrHNZJaaLmi28YJtnQwOmw@mail.gmail.com>,
    David Blaikie <dblaikie@gmail.com> writes:

>
> > A few months ago I was looking for a copy-paste detector for a C++
> > project. I didn't find such a feature of clang's static analyzer. Is
this
> > the case?
>
> copy-paste detector? As in plagarism detection?

I don't think plagiarism is the concern. The conern is that
copy/paste of blocks of code where the pasted block needs to be
updated in several places, but not all of the updates were performed.

I've implemented this sort of thing, but it's only 80% finished and has
been kicking around on the low-priority end of my todo list for the past
couple of years. Patch attached. It'd be great if someone were interested
in finishing this off. I won't get to it soon.

Note that it's a warning instead of a static analysis check which means
that it must have an aggressively low number of false positives, and that
it must be run quickly. The implementation I have analyzes conditional
operators and if/elseif chains, but doesn't collect all the expressions
through something like a && b &&c && a. That would be the next thing to add.

It does have some really cool properties that we can only get because clang
integrates closely with its preprocessor. Consider this sample from the
testcase:

#define num_cpus() (1)
#define max_omp_threads() (1)
int test8(int expr) {
  if (expr) {
    return num_cpus();
  } else {
    return max_omp_threads();
  }
}

We know better than to warn on that, even though the AST looks the same. If
you instead write "return num_cpus();" twice, we warn on that (that's test9
in the testsuite).

Nick

Coverity can detect such instances, for instance.

clang-useless-condition-1.patch (50.4 KB)

Thanks this looks very interesting. This may be a good start for a student. IIUC a non-unique expr is the ones that have same source ranges and same FileIDs, right? Could this be upgraded to AST-node (structural) comparison? Vassil

I’d love to see a tool with this kind of functionality as part of llvm.

There is a commercial tool called “Pattern Insight” that does stuff like this. http://patterninsight.com/

Here’s another use case for you.

I’ve been in groups that have used it in the past (careful locution; I haven’t personally used it), and occasionally it finds some amazing things.
The best example (from our use):

Code block #1 is about 50 lines of code, with references to a global variable (global1, global1, global1, global1, global1).
Code block #2 is an obviously duplicated and edited block of code, with references to (global2, global2, global2, global1, global2).

Pattern Insight, while looking through this code base, emitted a message to the effect “Are you sure you don’t mean ‘global2’ here?”
(and was correct)

— Marshall

Another example: PVS Studio found several copy-paste bugs within the clang codebase:
<http://www.viva64.com/en/b/0108/>

Cheers,

In article <
CAENS6EsgzhXWfANFze8VAp68qDGHnrHNZJaaLmi28YJtnQwOmw@mail.gmail.com>,
    David Blaikie <dblaikie@gmail.com> writes:

>
> > A few months ago I was looking for a copy-paste detector for a C++
> > project. I didn't find such a feature of clang's static analyzer. Is
this
> > the case?
>
> copy-paste detector? As in plagarism detection?

I don't think plagiarism is the concern. The conern is that
copy/paste of blocks of code where the pasted block needs to be
updated in several places, but not all of the updates were performed.

I've implemented this sort of thing, but it's only 80% finished and has
been kicking around on the low-priority end of my todo list for the past
couple of years. Patch attached. It'd be great if someone were interested
in finishing this off. I won't get to it soon.

Note that it's a warning instead of a static analysis check which means
that it must have an aggressively low number of false positives, and that
it must be run quickly. The implementation I have analyzes conditional
operators and if/elseif chains, but doesn't collect all the expressions
through something like a && b &&c && a. That would be the next thing to add.

It does have some really cool properties that we can only get because
clang integrates closely with its preprocessor. Consider this sample from
the testcase:

#define num_cpus() (1)
#define max_omp_threads() (1)
int test8(int expr) {
  if (expr) {
    return num_cpus();
  } else {
    return max_omp_threads();
  }
}

We know better than to warn on that, even though the AST looks the same.
If you instead write "return num_cpus();" twice, we warn on that (that's
test9 in the testsuite).

Nick

Thanks this looks very interesting. This may be a good start for a
student. IIUC a non-unique expr is the ones that have same source ranges
and same FileIDs, right? Could this be upgraded to AST-node (structural)
comparison?

It is an AST-node comparison. In order to handle the case of different
macros, we ask the AST nodes what their SourceLocation was, and factor in
the macroid, if there was one. A large part of the patch is a change to the
Stmt::profile logic to look at all the sourcelocations in all the possible
AST nodes.

Hi all,
   I just wanted to bump this up (given GSoC is starting). I didn't manage to get a good student for this project (proposal is below) last year :(. I thought maybe if we went through the LLVM mentoring organization would be better. Do you think this would make a good GSoC project from Clang's perspective? I'd be happy to update the proposal to make it more attractive or general-purpose.
Vassil

      Code copy/paste detection

*Description*:The copy/paste is common programming practice. Most of the programmers start from a code snippet that already exists in the system and modify it to match their needs. Easily some of the code snippets end up being copied dozens of times, which leads to worse maintainability, understandability and logical design. Clang(link is external) <http://clang.llvm.org> and clang's static analyzer(link is external) <http://http://clang-analyzer.llvm.org/> provide all the building blocks to build a generic C/C++ copy/paste detector.
*Expected results*:Build a standalone tool or clang plugin being able to detect copy/pasted code. Lay the foundations of detection of slightly modified code (semantic analysis required). Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.
*Required knowledge*: Advanced C++, Basic knowledge of Clang/Clang Static Analyzer.

*Mentor*: Vassil Vassilev/ maybe somebody else as second mentor?
<mailto:sft-gsoc-AT-cern-dot-ch?subject=GSoC%202014%20Extending%20Cling>

Hello,

Hello,

Hi all,
   I just wanted to bump this up (given GSoC is starting). I didn't manage to get a good student for this project (proposal is below) last year :(. I thought maybe if we went through the LLVM mentoring organization would be better. Do you think this would make a good GSoC project from Clang's
perspective? I'd be happy to update the proposal to make it more attractive or general-purpose.
Vassil

       Code copy/paste detection

*Description*:The copy/paste is common programming practice. Most of the programmers start from a code snippet that already exists in the system and modify it to match their needs. Easily some of the code snippets end up being copied dozens of times, which leads to worse maintainability,
understandability and logical design. Clang(link is external) <http://clang.llvm.org> and clang's static analyzer(link is external) <http://http://clang-analyzer.llvm.org/> provide all the building blocks to build a generic C/C++ copy/paste detector.

A talk has been done on this topic at FOSDEM:
http://llvm.org/devmtg/2015-02/
Slides:
http://llvm.org/devmtg/2015-02/slides/sargsyan-code-clone.pdf

Thanks! IIUC this does clone detection on IR level, which is IMO sort-of different. Do you know where I can find the source of this project?
Vassil

This would be a very useful feature to have in the clang static analyzer and can be scoped for a GSoC project!

Anna.

That's great! What would be the next steps? Do you know who will be the GSoC org admin? Do you think we should improve the project description and nominate a backup mentor?
Vassil

There was an email sent about GCoC a couple of days ago to the LLVMDev list.

I think adding specific examples that we want to handle would be useful in scoping this down.

I think having this integrated into one of the existing clang tools should the be the goal. For example, the static analyzer is a good fit. The static analyzer does not have plugins.

Thanks for the information. I addressed all of your comments and sent a patch to OpenProjects.html, cc-ing also you, Anna, for a review. Many thanks, Vassil

Hi all,

I am a student in Peking university, China. And I am pursing my master's
degree. I have some experience with LLVM and static analysis such as wrote a
LLVM pass to detect use-after-free bugs. I'd love to work in the project
copy-paste detection. Please give me some advise about my initial proposal.
Thank you so much.

*Introduction*
We begin with a basic introduction to clone detection. There are three types
of clones.
Type 1 (Exact clones): Identical code fragments except small variations in
white spaces, layout and comments.
Type 2 (Renamed clones): Syntactically identical code fragments except for
variations in literals, identifiers, types, white spaces, layout and
comments .
Type 3 (Modified clones): These are copied fragments with additional
modifications such as changed, added or removed statements, in addition to
variations in literals, identifiers, types, white spaces, layout and
comments.

And techniques of code clone detection can be classified as follows.

These techniques have pros and cons. Comparison between them is described
in survey papers http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf
and http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf.

*Implementation Plan*
I plan to develop copy-paste detection in three steps.
1. Detection of Type 1
I’ll start off with Nick’s patch which have analyzed conditional operators
and if/elseif chains. I’ll improve Nick’s patch to collect all the
expressions through something like a && b &&c && a. This part could be
implemented as compiler diagnostics because it is efficient and with very
low false positive rate.

2. Detection of Type 2
I plan to adopt AST-based method described in
http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf . The abstract
syntax tree of program is first linearized, i.e., all sequences of
statements and definitions are presented in the abstract tree as sibling
subtrees. Then three phases of the algorithm are preformed. Three phases are
as follows.

(1) Partition similar statements into clusters using anti-unification. After
the first phase each statement is marked with its cluster ID – thus, two
statements with the same cluster ID are considered similar in this
preliminary view. This phase uses a two-pass clustering algorithm. During
the first pass all statements are gone over and each new statement is
compared with the anti-unifiers of all existing clusters. During the second
pass all the statements are traversed again and for each statement we search
for the most similar pattern from the set produced in the previous pass
(again using the anti-unification distance). And hashing (using the notion
of d-cup) is used to speed up the clustering process.
(2) Find identical sequences of cluster IDs. All pairs of sequences of
statements which are identically labeled (have the same labels on the same
position) are searched using a suffix tree approach.
(3) Refine by examining identified code sequences for structural similarity.
In this phase, every pair of candidate sequences is checked for overall
similarity at the statement level, again using anti-unification.

This part could implemented as bug checker.

3. Detection of Type 3
AST-based method can only detected Type 3 with low accuracy, because the
added or deleted instructions strictly change the structure of AST. So we
should introduce semantic analysis to achieve high accuracy. But algorithms
based on semantic analysis have high computational complexity, which makes
them unusable for large software systems analysis. Authors of the paper
http://mpcs.sci.am/filesimages/volumes/volume_42/05.pdf (Related talk
http://llvm.org/devmtg/2015-02/slides/sargsyan-code-clone.pdf) proposed a
method which computes metrics for nodes of PDG and compares them. For every
node of program dependence graph a characteristic vector is constructed,
which contains information about neighbors. These characteristic vectors are
represented as 64-bit integer numbers, which allows determining similarity
between two nodes in amortized constant time. The paper also describes new
approach for dependency graphs generation, which allows building them much
faster than any of the existing methods.
IMO we can imitate the method to implement a bug checker of type 3. But we
don’t focus on IR level.

In the mean time, I will develop different test cases of different clone
types to test precision and recall of detection. Then a thorough test suite
will be developed.
Finally, I will commit detailed documentations and prepare a final poster of
the work.

*Timeline*
TODO

Best regards,
Jiayi Ye

My comments went in google-melange. Could you please post the newest version here once again when you finish it?
Vassil