Clang Static Analyzer without scan-build

I was looking at the same problem and planning to work on it.

What I'm planning to do is to add a compiler flag that lets a user perform compilation and static analysis at the same time,

and to make the relevant changes in the clang driver to build a set of 'Actions' in the pipeline such that static analysis and compilation take place simultaneously.

This adds overhead to the overall compilation time, which is often undesirable. But it has the advantage that the flag can be incorporated into the software's build system.

Since build systems are very good at tracking which files have changed and compiling only the minimal set of required files,

the overall turnaround time of the static analysis will be very small, and users can afford to run the static analyzer with every build.

I wanted some feedback on whether this is a good idea or not.

-Aditya


Have you looked at how scan-build currently works? It does compile and analyze the source files (clang is called twice). It is also driven by the build system, so we are not reanalyzing files that the build system would not recompile.

The main advantage of keeping the scan-build-like interface is that, in the future, we plan to extend the analyzer to perform cross-file (cross-translation-unit) analysis. This is why we encourage the use of a single entry point (scan-build) when analyzing a project.

That said, the current implementation of scan-build is hacky and could be improved (see http://clang-analyzer.llvm.org/open_projects.html).

Cheers,
Anna.


For what it's worth, I think the way to do large scale static analysis is
to run over each TU in isolation, and output all the information needed to
do the global analysis. Then, run the global analysis as a post-processing
step, after sharding the information from that first step into
parallelizable pieces.
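As a rough sketch of that two-phase shape (everything here is illustrative; the real analyzer emits nothing like this today), phase one runs per-TU and writes small summaries, and phase two merges the summaries and runs a global check over them:

```python
from collections import defaultdict

# Phase 1: analyze each TU in isolation and emit a small, shardable summary --
# here, just which functions the TU defines and which external ones it calls.
def analyze_tu(name, defined, called):
    """Stand-in for a per-TU analysis pass; returns a JSON-able summary."""
    return {"tu": name, "defines": defined, "calls": called}

# Phase 2: a global post-processing step over all summaries.
# This toy check reports calls to functions that no TU defines; a real
# cross-TU analysis would shard the summaries and run pieces in parallel.
def global_step(summaries):
    defined = defaultdict(set)
    for s in summaries:
        for f in s["defines"]:
            defined[f].add(s["tu"])
    missing = []
    for s in summaries:
        for f in s["calls"]:
            if f not in defined:
                missing.append((s["tu"], f))
    return sorted(missing)

summaries = [
    analyze_tu("a.cpp", ["helper"], ["main_impl"]),
    analyze_tu("b.cpp", ["main_impl"], ["helper", "mystery"]),
]
print(global_step(summaries))  # -> [('b.cpp', 'mystery')]
```

The point of the shape is that phase one parallelizes trivially with the build, and phase two only ever sees the compact summaries, not the TUs themselves.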

Note that I'm not trying to contradict what you said :) Just wanted to
throw in some experience. We are currently starting to run the analyzer on
our internal code base (see Pavel's work) based on the Tooling/ stuff
(clang-check has grown a --analyze flag) and would be very interested in
having a system that allows full-codebase analysis and still works on
~100MLOC codebases... ;)

Cheers,
/Manuel

This makes perfect sense. Our current cross-function analysis assumes availability of the function implementations and we know that this approach definitely will not scale to cross-translation unit analysis.

Very exciting. We’d be very interested to find out what you learn from this (good and bad)!

By the way, you can run the analyzer in “shallow” mode, which will turn off most of the interprocedural analysis and minimize the analysis time in other ways. This might be an option when the default analyzer mode does not scale.



> Very exciting. We'd be very interested to find out what you learn from
> this (good and bad)!

See the bugs Pavel is filing, and the work on making the C++ analysis
sane(r). We make pretty heavy use of C++ features. Where it finds bugs it
often seems like magic ("wait, so this is called here, and then ...
ooooooooh").

> By the way, you can run the analyzer in "shallow" mode, which will turn
> off most of the interprocedural analysis and minimize the analysis time in
> other ways. This might be an option when the default analyzer mode does not
> scale.

Oh, single-TU analysis is not a scalability problem for us (we have machines ;))

Thank you for your feedback.

I tried scan-build on our codebase. The problem with scan-build is that it has to be customized for every build system, and that becomes very tricky with, e.g., SCons.

Additionally, scan-build just overrides the CC/CXX variables, which is not a robust approach in my opinion.

With the compiler-flag-based approach, the overhead of tailoring the static analyzer's front end to every build system goes away.

The advantage of having one flag that does both tasks at once is that programmers could run it many times during the development process (which I think is the best time to fix a potential bug), possibly with a very limited set of checkers.

I'm not sure, though, how the problem of cross-TU analysis could be handled with the compiler-flag-based approach. I'll have to think about it.

-Aditya

I have a preliminary working version of my patch and I ran it through our test framework (>100 MLOC), and it has worked fine.

To generate the summarized report (index.html) I copied some portions of scan-build and generated the summary.

I’m planning to write some post-processing program that parses the report-*.html files and stores them in a database.

Will that be useful?


This depends on the workflow you have in mind. Which reports will the database contain: for example, the results of the latest build, or the results of every build?

I think the database might have limited usefulness if you are doing partial builds; at the least, you don't really know which set of bugs you are looking at.

A useful workflow includes running the analyzer as part of continuous integration and storing results of every build. In that scenario, it is useful to only show / highlight the diff of issues or only new issues that the analyzer produces on the latest build. We do not have a very good infrastructure for this, but the first step would be to look at utils/analyzer/CmpRuns.py script and see if it could be useful for you.
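Stripped to its core, the CmpRuns.py idea is a set difference over some stable issue key. The key fields below are illustrative, not the script's actual format:

```python
# Key an issue by fields that survive unrelated edits better than a raw
# line number would: file, checker name, and enclosing function.
def issue_key(issue):
    return (issue["file"], issue["checker"], issue["function"])

def diff_runs(old_run, new_run):
    """Return the issues present in new_run but not in old_run."""
    old_keys = {issue_key(i) for i in old_run}
    return [i for i in new_run if issue_key(i) not in old_keys]

old = [{"file": "a.cpp", "checker": "core.NullDereference", "function": "f"}]
new = old + [{"file": "b.cpp", "checker": "unix.Malloc", "function": "g"}]
print([i["file"] for i in diff_runs(old, new)])  # -> ['b.cpp']
```

In a CI setting you would store each run's issues, then show only `diff_runs(previous, latest)` so developers see new findings rather than the full backlog.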

Note, when running the analyzer as part of the compilation, one issue you'll have to worry about is the __clang_analyzer__ macro, which is only defined when you run the analyzer and not the compiler.

Anna.

In article <001101ce8d6d$1019efd0$304dcf70$@codeaurora.org>,
    "Aditya Kumar" <hiraditya@codeaurora.org> writes:

> This will have an overhead on the overall compilation time which is often
> not the desirable thing. [...]
>
> I wanted some feedback if this is a good idea or not.

On a code base where I can build the entire tree with gcc 4.6 in about
10 minutes, running the static analyzer took multiple hours. I didn't
time it exactly because I wasn't expecting THAT much of a slowdown.

So, based on that experience, anything that makes this lengthy process
take even longer would be a thumbs down from me.


Thanks for telling me about utils/analyzer/CmpRuns.py.

By the way, I am totally in favor of having a stand-alone tool like scan-build which makes it easy to run the static analyzer separately, as part of nightly/weekly builds or by a group of people specifically assigned to track down bugs in a software infrastructure. The idea of storing report statistics in a database could be a useful addition to standalone tools.

Currently, what I have implemented is the following:

1. Compile the programs with a flag, e.g., (clang++ --compile-and-analyze <path/to/report-dir> -c test.cpp). This stores all the report-*.html files in report-dir.

Also, I have created a post-processing program which does the following:

1. Parse the report-*.html files and generate index.html. Ensure uniqueness of each report by comparing the sha1 keys (I'm using a Linux system call to compute the sha1 keys for now).

2. Populate the table in the database (MySQL) with the same details. To ensure uniqueness of the details, I store the sha1 key of the report-*.html files along with the bug details corresponding to each report.
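The dedup step described here amounts to the following (the report contents are invented for illustration; using stdlib hashlib rather than shelling out to a system tool would also remove the Linux dependency):

```python
import hashlib

def report_sha1(html_text):
    """Fingerprint a report-*.html body. hashlib is portable, unlike
    invoking a system sha1 utility."""
    return hashlib.sha1(html_text.encode("utf-8")).hexdigest()

def dedupe(reports):
    """Keep one report per distinct sha1, in first-seen order."""
    seen, unique = set(), []
    for r in reports:
        h = report_sha1(r)
        if h not in seen:
            seen.add(h)
            unique.append(r)
    return unique

reports = ["<html>bug A</html>", "<html>bug A</html>", "<html>bug B</html>"]
print(len(dedupe(reports)))  # -> 2
```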

> Note, when running the analyzer as part of the compilation, one issue you'll have to worry about is the __clang_analyzer__ macro, which is only defined when you run the analyzer and not the compiler.

So the way I have defined the action pipeline is:

0: input, "test.cpp", c++
1: analyzer, {0}, plist
2: input, "test.cpp", c++
3: preprocessor, {2}, c++-cpp-output
4: compiler, {3}, assembler
5: assembler, {4}, object
6: linker, {5}, image

I think this way the frontend should define __clang_analyzer__ during the analysis and not during the compilation.

-Aditya


From: Anna Zaks [mailto:ganna@apple.com]
Sent: Wednesday, July 31, 2013 3:50 PM
To: Aditya Kumar
Cc: Manuel Klimek; clang-dev Developers; Michele Galante
Subject: Re: [cfe-dev] Clang Static Analyzer without scan-build


> Thanks for telling me about utils/analyzer/CmpRuns.py.
> By the way, I am totally in favor of having a stand-alone tool like scan-build which makes it easy to run the static analyzer separately, as part of nightly/weekly builds or by a group of people specially assigned to track down bugs in a software infrastructure. The idea of storing report statistics in a database could be a useful addition to standalone tools.

I thought the main issue for you was that scan-build does not support your build system, so you would not be able to use it as is regardless.
(I just want to reiterate that I think that improving scan-build/building a better version of it is the right approach as this is currently considered the gateway for all analyzer users.)

> Currently, what I have implemented is the following:
> 1. Compile the programs with a flag, e.g., (clang++ --compile-and-analyze <path/to/report-dir> -c test.cpp). This stores all the report-*.html files in report-dir.
> Also, I have created a post-processing program which does the following:
> 1. Parse the report-*.html files and generate index.html. Ensure uniqueness of each report by comparing the sha1 keys (I'm using a Linux system call to compute the sha1 keys for now).

Uniquing reports using the sha1 of the html file is not robust. Consider what happens when someone adds a line of code to the file containing the report somewhere before the report location.
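A concrete illustration of the fragility (field names invented): a hash over the whole report ties issue identity to the exact line number, while a key built from file, checker, and enclosing function survives the shifted line:

```python
import hashlib

def whole_report_hash(report):
    # Identity changes whenever *anything* in the report changes,
    # including the line number.
    blob = "{file}:{line}:{checker}:{function}".format(**report)
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()

def stable_key(report):
    # Ignores the line number, so unrelated edits above the bug
    # no longer create a "new" issue.
    return (report["file"], report["checker"], report["function"])

before = {"file": "a.cpp", "line": 10,
          "checker": "core.DivideZero", "function": "f"}
after = dict(before, line=11)  # someone added a line above the bug

print(whole_report_hash(before) == whole_report_hash(after))  # -> False
print(stable_key(before) == stable_key(after))                # -> True
```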

> I think this way the frontend should define __clang_analyzer__ during the analysis and not during the compilation.

You probably want to compile before the analysis. The analyzer generally assumes that it runs on code that compiles without errors. (This is also the workflow of scan-build.)

From: cfe-dev-bounces@cs.uiuc.edu [mailto:cfe-dev-bounces@cs.uiuc.edu]
On Behalf Of Richard
Sent: Wednesday, July 31, 2013 4:33 PM
To: cfe-dev@cs.uiuc.edu
Subject: Re: [cfe-dev] Clang Static Analyzer without scan-build

In article <001101ce8d6d$1019efd0$304dcf70$@codeaurora.org>,
    "Aditya Kumar" <hiraditya@codeaurora.org> writes:

> This will have an overhead on the overall compilation time which is
> often not the desirable thing. [...]
>
> I wanted some feedback if this is a good idea or not.

> On a code base where I can build the entire tree with gcc 4.6 in about
> 10 minutes, running the static analyzer took multiple hours. I didn't
> time it exactly because I wasn't expecting THAT much of a slowdown.

That is just a one-time overhead. The next time you run the static analyzer
it should take much less time, because the static analyzer will take
advantage of the incremental build system.

> So, based on that experience, anything that makes this lengthy process
> take even longer would be a thumbs down from me.

Having a compile-and-analyze flag takes less time than scan-build. The
overhead I was talking about is the little extra time every time the
program is built during the development process.
It will give some useful static-analysis information with every build, I
hope.


> I thought the main issue for you was that scan-build does not support your build system, so you would not be able to use it as is regardless.
> (I just want to reiterate that I think that improving scan-build/building a better version of it is the right approach, as this is currently considered the gateway for all analyzer users.)

So there are two problems.

1. One software infrastructure, which has a SCons build system, cannot be analyzed with scan-build for now. This is in fact a general problem with scan-build, or any other enterprise static analysis tool: build-system integration is non-trivial. For that I have tried to implement a --compile-and-analyze flag. Using this facility, I was able to run the clang static analyzer on all the programs/test infrastructure available to us without having to worry about the different kinds of build systems. What I'm trying to say is that we should also have a facility to compile-and-analyze within the compiler itself. This will help developers track down potential bugs as quickly as possible. I do not want to touch scan-build because it is written in Perl. Initially I copied some portions of it to generate the summarized report, but now I have a C++ implementation which parses all the report-*.html files and generates a summary. I can put my patch up for review if it would be helpful.

2. In general, I would like to improve/add support for a single-entry-point-based static analysis tool. I thought that a facility to store reports at regular intervals (using a database, etc.) would be a small (but useful) addition in this direction.


> Uniquing reports using the sha1 of the html file is not robust. Consider what happens when someone adds a line of code to the file containing the report somewhere before the report location.

Yes, in that case the sha1 will change, but even scan-build follows the same approach. I’ll try to find an alternative solution. Thanks for pointing this out.


> You probably want to compile before the analysis. The analyzer generally assumes that it runs on code that compiles without errors. (This is also the workflow of scan-build.)

That seems to be a better idea. Thanks for the suggestion.

-Aditya


From: Anna Zaks [mailto:ganna@apple.com]
Sent: Wednesday, July 31, 2013 5:05 PM
To: Aditya Kumar
Cc: Manuel Klimek; clang-dev Developers; Michele Galante
Subject: Re: [cfe-dev] Clang Static Analyzer without scan-build


> So there are two problems.
> 1. One software infrastructure which has a SCons build system cannot be analyzed with scan-build for now. [...] I do not want to touch scan-build because it is written in Perl. [...] I can put my patch up for review if it can be helpful.

It would be great if we could keep just one static analyzer tool/entry point. If that means that scan-build should be rewritten, that's fine (it's actually one of the items on the todo list I've mentioned earlier).

> 2. In general I would like to improve/add support for a single-entry-point-based static analysis tool. [...]

Yes, infrastructure for using the analyzer in a continuous integration setting would be a useful addition.

> Yes, in that case the sha1 will change, but even scan-build follows the same approach. I'll try to find an alternative solution.

scan-build does not do build-results comparison. This is why I've suggested the CmpRuns.py script, which does provide a better alternative to what you are doing.