C++ algorithm analysis tool

Hey everyone,

Just wondering if there’s a clang tool that can analyse an algorithm to suggest where standard algorithms can be replace handwritten algos?
E.g.

int i = 0;
for (; i < v.size(); ++i)
if (v[i] == expected)
break;
if (i != v.size())

some_function(v[i]);

Can be rewritten to

auto i = find(v.begin(), v.end(), expected);
if (i != v.end())
some_function(*i);

or in C++17:

if (auto i = find(v.begin(), v.end(), expected); i != v.end())
some_function(*i);

If not, how difficult a task is it to write such a tool? Is there anything that one should take into special consideration while writing this tool? Do you think it would have a lot of use-cases? (I do, based on my company’s code base, and code I have seen while marking assignments).

Cheers,

Chris

Hi Chris,

To my knowledge, there isn’t.

I don’t recall where I got the idea, but I gave it a try last summer trying to implement a clang-tidy check doing what you proposed. I didn’t have enough time to complete it, though, and I only managed to detect one or two very simple patterns.

After thinking about this idea for some time I found that clang-tidy might be a perfect place for that, not sure whether a separate tool would be beneficial. The task of detecting a specific pattern is very similar to what clang-tidy checks do in a wide range of tasks. Also, there’d be a separate heuristic set for each standard algorithm, which makes the partitioning into different checks (for each popular standard library algorithm) natural.

In my opinion, such checks would be useful, I’d be interested in seeing a proof-of-concept of some sort.

One more idea I have in mind: it might be interesting to try using CloneChecker (a check of Clang Static Analyzer) to detect similar patterns in a generic way, but I’m not sure how beneficial that would be in practice. Still, might worth a try.

+CC Alex, he might have some thoughts about this.

Kind regards,
Kirill

Hmm, missed this one. Adding to Kirill's answer.

Yep, clang-tidy seems to be the the right place for such check. They already have the `modernize' package which suggests using tricks from higher versions of the C++ standard. They support fix-it hints to suggest how to rewrite the highlighted code. They are also eagerly accepting checks that are not highlighting bugs, unlike Clang Static Analyzer, where, at least for now, the reputation requires to only include critical warnings that most likely require immediate attention. And your check doesn't seem to require symbolic execution, so relying on the Analyzer's path-sensitive engine is not necessary.

You should be able to write particular checks easily using ASTMatchers; with this awesome mechanism, the code for matching a particular pattern wouldn't be much bigger than the pattern itself, and you'd be able to re-use sub-patterns.

+CC CloneDetector guys.

Hmm, the idea of making body-farms and then using CloneChecker to find clones of synthesized bodies in the actual code looks curious and funny, though i'm not immediately seeing how is it superior to ASTMatchers.

I feel the implementation of an algorithm in the STL and the
equivalent implementation written by the user are usually so different
in terms of syntax that the current clone detection won't work that
well. If we had a constraint for estimating semantic equality, then it
would be a different situation, but I don't expect an implementation
of this anytime soon :).

However, if we are satisfied with syntactic equality, then this can be
done in a few minutes with the clone detector infrastructure with
something like this pseudocode:

   detector.findClones(RecursiveTypeIIConstraint(),
MinFunctionBodyCount(1), MinComplexity(20));

This would detect code like this:

void bubblesort(int *array, int length) {
     for (int i = 0; i < length - 1; ++i)
        for (int j = 0; j < length - i - 1; ++j)
            if (array[j] > array[j + 1]) {
                int tmp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = tmp;
            }
}

int main() {
     for (int i = 0; i < length - 1; ++i) // expect-warning{You could
call the function 'bubblesort' instead
        for (int j = 0; j < length - i - 1; ++j)
            if (array[j] > array[j + 1]) {
                int tmp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = tmp;
            }
}

- Raphael

I feel the implementation of an algorithm in the STL and the
equivalent implementation written by the user are usually so different
in terms of syntax that the current clone detection won't work that
well. If we had a constraint for estimating semantic equality, then it
would be a different situation, but I don't expect an implementation
of this anytime soon :).

However, if we are satisfied with syntactic equality, then this can be
done in a few minutes with the clone detector infrastructure with
something like this pseudocode:

    detector.findClones(RecursiveTypeIIConstraint(),
MinFunctionBodyCount(1), MinComplexity(20));

This would detect code like this:

void bubblesort(int *array, int length) {
      for (int i = 0; i < length - 1; ++i)
         for (int j = 0; j < length - i - 1; ++j)
             if (array[j] > array[j + 1]) {
                 int tmp = array[j];
                 array[j] = array[j + 1];
                 array[j + 1] = tmp;
             }
}

int main() {
      for (int i = 0; i < length - 1; ++i) // expect-warning{You could
call the function 'bubblesort' instead
         for (int j = 0; j < length - i - 1; ++j)
             if (array[j] > array[j + 1]) {
                 int tmp = array[j];
                 array[j] = array[j + 1];
                 array[j + 1] = tmp;
             }
}

- Raphael

   That being said, patches are welcome. The current infrastructure is friendly for implementing semantic clone detection constraints (clones type III). We'd be interested to help if necessary.

-- Vassil

As Artem noted, there are already a few clang-tidy checks that try to suggest using more compact coding patterns like range-for-loops, more suitable algorithms, shrink_to_fit(), etc. It should be equally possible to detect certain hand-rolled algorithms and replace them with STL. As for the amount of work it might take, you can take a look at the implementation of modernize-loop-convert, for example. If you would like to try this, you can start here.

On second thought, maybe it's quite neat.

I mean, like, a person who doesn't know anything about AST and about matchers and whatever, could probably do something like that

$ cat < pattern.c
void bubblesort(int *array, int length) {
       for (int i = 0; i < length - 1; ++i)
          for (int j = 0; j < length - i - 1; ++j)
              if (array[j] > array[j + 1]) {
                  int tmp = array[j];
                  array[j] = array[j + 1];
                  array[j + 1] = tmp;
              }
}
^D

and then

$ clang-finder --pattern pattern.c my_project.c

and the tool would find all clones of the pattern in the project, by constructing the AST for the pattern and hashing it and looking for clones. The pattern would be in a different AST context, but i guess for the clone checker it shouldn't be a problem, so we wouldn't need to deal with ASTImporter for that.

On second thought, maybe it's quite neat.

I mean, like, a person who doesn't know anything about AST and about
matchers and whatever, could probably do something like that

$ cat < pattern.c
void bubblesort(int *array, int length) {
      for (int i = 0; i < length - 1; ++i)
         for (int j = 0; j < length - i - 1; ++j)
             if (array[j] > array[j + 1]) {
                 int tmp = array[j];
                 array[j] = array[j + 1];
                 array[j + 1] = tmp;
             }
}
^D

and then

$ clang-finder --pattern pattern.c my_project.c

and the tool would find all clones of the pattern in the project, by
constructing the AST for the pattern and hashing it and looking for clones.
The pattern would be in a different AST context, but i guess for the clone
checker it shouldn't be a problem, so we wouldn't need to deal with
ASTImporter for that.

Pardon my ignorance, but will the clone checker match patterns that have
different variable names (e.g. values instead of array and size instead of
length)?

Also, is it possible to use patterns recursively? E.g. Say I want to detect
naive swap and std::swap in the bubble sort:

swap1.c:
                 int tmp = array[j];
                 array[j] = array[j + 1];
                 array[j + 1] = tmp;
swap2.c:
                 std::swap(array[j], array[j + 1]);

pattern.c:
void bubblesort(int *array, int length) {
      for (int i = 0; i < length - 1; ++i)
         for (int j = 0; j < length - i - 1; ++j)
             if (array[j] > array[j + 1]) {
                 SWAP(array, j)
             }
}

Clone checker currently does discard variable names, as long as they follow the same pattern, eg. `a = a + b` is a clone of `c = c + d` but not of `a = b + b`. It also has a mode in which it warns about patterns that are almost-clones (broken in one variable) as copy-paste errors.

Function names matter though, so if we expect to find both SWAP and std::swap, we'd need two runs for now.

As soon as https://reviews.llvm.org/D23418 lands, which is right about now, the clone checker would also be *highly* configurable and modular, and all sorts of per-use-case tweaks would be possible. Fuzzy-matching function names is probably not impossible.