CopyPaste detection clang static analyzer

The modified version is as follows. Thanks.

Introduction

We begin with a basic introduction to clone detection. There are four types of clones.

Type 1 (Exact clones): Identical code fragments except small variations in white spaces, layout and comments.
Type 2 (Renamed clones): Syntactically identical code fragments except for variations in literals, identifiers, types, white spaces, layout and comments .
Type 3 (Modified clones): These are copied fragments with additional modifications such as changed, added or removed statements, in addition to variations in literals, identifiers, types, white spaces, layout and comments.
Type 4 (Semantic clones): Two or more code fragments that perform the same computation but are implemented by different syntactic variants.

Test Cases for Type 1, 2, 3, 4 are as follows.

1.Type 1

// expect warning because the code in “if” and the code in “else” is identical

int test1(int expr) {
if (expr) {
int a = 3, b = 4;
int c = a + b; // add a and b
return c;
}
else {
int a=3,b=4;
/* add ‘a’ and ‘b’ */
int c=a+b;
return c;
}
}

// expect warning because expressions are same on either side of condition
int test2(int a, int b, int expr){
return expr ? a + b : a+b;
}

// expect warning because “code1()” repeats in conditional expression
extern bool code1();
extern bool code2();
extern bool code3();
int test3() {
if ( code1() && code2() && code3() && code1() ) {
return 1;
}
return 0;
}

  1. Type 2
    struct S_STRUCT{
    int a;
    int b;
    }

int test4(S_STRUCT s1, S_STRUCT s2) {
int sa = (s1.a + s2.a) * (s1.a - s2.a) + s1.a * s2.a;
int sb = (s1.b + s2.b) * (s1.b - s2.b) + s1.b * s2.a; // except diagnostics “Do you mean s2.b here”
return sb - sa;
}

int func ( int a, int b, int c, int d){
return a + b + c + d;
}

int test5 (S_STRUCT s1, S_STRUCT s2) {
int a = func (s1.a, s1.b, s1.a * s1.b, s1.a / s1.b);
int b = func (s2.a, s2.b, s2.a * s2.b, s2.a / s1.b); // except diagnostics “Do you mean s2.b here”
return a + b;

}

  1. Type3
    #define THRESHOLD 100
    int test6 (int a, int b) {
    int sum = 0;
    if (a > THRESHOLD) {
    int x = 3, y = 4;
    sum += a * (x + y);
    }
    if (b >= THRESHOLD) { // except diagnostics “Do you mean > here”
    int x = 3, y = 4;
    sum += b * (x + y);
    }
    return sum;
    }

#define THRESHOLD 100
int test6 (int &a, int &b) {
int sum = 0;
if (a > THRESHOLD) {
int x = 3, y = 4;
sum += a * (x + y);
a = 0;
}
if (b > THRESHOLD) {
int x = 3, y = 4;
sum += b * (x + y);
// except diagnostics “Do you miss b = 0 here”
}
return sum;
}

  1. Type 4
    void test7(){
    func1();
    func2(); // func1() and fun2() are semantically same.
    }

int func1(){
int sum = 0;
for (int i = 0; i < 10; ++i) {
sum += i;
}
return sum;
}

int func2(){
int sum = 0;
int i = 0;
while(i < 10){
sum += i;
++i;
}
return sum;
}
And techniques of code clone detection can be classified as Text-based, Token-based, Metrics-based, AST-based and PDG-based.

These techniques have pros and cons. Comparison between them is described in survey papers http://research.cs.queensu.ca/TechReports/Reports/2007-541.pdf, http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf and http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf.

Implementation Plan

I plan to adopt AST-based method described in http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf to detect Type 1, Type 2 and Type 3. Firstly, the abstract syntax tree of program is first linearized, i.e., all sequences of statements and definitions are presented in the abstract tree as sibling subtrees. Then three phases of the algorithm are performed. Three phases are as follows.

(1) Partition similar statements into clusters using anti-unification. After the first phase each statement is marked with its cluster ID - thus, two statements with the same cluster ID are considered similar in this preliminary view. This phase uses a two-pass clustering algorithm. During the first pass all statements are gone over and each new statement is compared with the anti-unifiers of all existing clusters. During the second pass all the statements are traversed again and for each statement we search for the most similar pattern from the set produced in the previous pass (again using the anti-unification distance). And hashing (using the notion of d-cup) is used to speed up the clustering process.

(2) Find identical sequences of cluster IDs. All pairs of sequences of statements which are identically labeled (have the same labels on the same position) are searched using a suffix tree approach.

(3) Refine by examining identified code sequences for structural similarity. In this phase, every pair of candidate sequences is checked for overall similarity at the statement level, again using anti-unification.

As regards to Type 3, AST-based method may detected Type 3 with low accuracy, because the added or deleted instructions strictly change the structure of AST. So we should introduce semantic analysis to achieve high accuracy. But algorithms based on semantic analysis have high computational complexity, which makes them unusable for large software systems analysis.

About Type 3, I still have not figured out a better way to deal with. So at this time I keep to adopt the method described above. And later I’ll try to make some change to improve accuracy.

Type 4 is more complex and we don’t focus on it now.

All in all, I plan to develop copy-paste detection in three steps.

  1. Compiler Diagnostics Development
    I’ll start off with Nick’s patch which have analyzed conditional operators and if/elseif chains. I’ll improve Nick’s patch to collect all the expressions through something like a && b &&c && a. In addition, I’ll implement detection of clone code blocks (consist of sequence of same expressions).

The detection of Type 1 could be implemented as compiler diagnostics because it is efficient and with very low false positive rate.

  1. Bug Checker Development
    I plan to implement the detection of Type 2 and Type 3 as bug checker in the static analyzer because it introduces guess of programmers’ intention and may lead to some false positives.

  2. Test Suite Development
    In the mean time, I will develop different test cases of different clone types to test precision and recall of detection. Then a thorough test suite will be developed.

Finally, I will commit detailed documentations and prepare a final poster of the work.

Timeline

Community bonding period:
Thanks for initial work done by Nick which have analyzed conditional operators and if/elseif chains. I’ll start off with the patch and study documents about compiler diagnostics as well as Clang Static Analyzer development. In addition, since now the static analyzer has alpha support for detecting identical expressions using alpha.core.IdenticalExpr, I’ll check the difference between Nick’s patch and the one supporting alpha.core.IdenticalExpr. And I’ll try to figure out how to improve accuracy of detection of Type 3.

Week 1 - Compiler Diagnostics Development (1)

Improve Nick’s patch to collect all the expressions through something like a && b && c && a.

Week 2 - Compiler Diagnostics Development (2)
Implement detection of clone code blocks (consist of sequence of same expressions) using method proposed above.
Implement the first phase – partitioning similar statements into clusters.
Perform two-pass clustering algorithm to mark each statement with a cluster ID.

Week 3 - Compiler Diagnostics Development (3)
Implement the second phase – Find pairs of identical cluster sequences.
Use a suffix tree approach to search pairs.

Week 4 - Compiler Diagnostics Development (4)
Implement the third phase – Refine by examining identified code sequences for structural similarity.
Use anti-unification to check every pair of for overall similarity.
We select pairs that are exactly the same.

Week 5 - Test Suite Development (Type 1)
Develop test cases of Type 1
Fix bugs and polish code

Week 6 - Bug Checker Development (1)
Implement detection of Type 2, Type 3 as bug checker using the method proposed above.

Week 7 - Bug Checker Development (2)
Implement detection of Type 2, Type 3 as bug checker using the method proposed above.
Compare the selected clone pairs and make warning such as “Do you mean global2 here”

Week 8 - Test Suite Development (Type 2)
Develop test cases of Type 2
Fix bugs and polish code

Week 9 - Bug Checker Development (3)
Implement improvement of detection of Type 3

Week 10 - Test Suite Development (Type 3)
Develop test cases of Type 3
Fix bugs and polish code

Week 11 - Improvement
Make improvements.
Review all code.
Write documentations.

Week 12 - Documentations
Write documentations and prepare a final poster of the work.