Clang analyzer Google Summer of Code ideas/proposals

Hello,

I am interested in doing a project with Clang in the upcoming Google
Summer of Code. I am currently a sophomore at the South Dakota School
of Mines and Technology, and I have some C++, Perl, and Javascript
programming experience. I have been interested in Clang and LLVM for a
while, and I've looked through some of the code before. I am most
interested in the analyzer component though.

I have two possible project ideas I am interested in:

A) Bug database

Create a tool to store bugs and track changes over time.

This tool would use the XML analyzer output and the CIndex library to
correlate bugs over multiple runs. The tool would provide, at a
minimum, a diff-like output given a pair of runs. Ideally, this would
create and update a database with all the runs, and statuses for all
the bugs (uninspected, false positive, verified, fixed). The tool
would provide reports with chosen subsets of the bugs and annotations
such as first run present and current status. The reports could be
HTML output, reusing the existing infrastructure, or be viewable in a
GUI application.

The database could be XML, SQLite, or some plain-text format. I am
unsure whether this tool should be integrated into the clang binary,
be a separate executable, or even use a scripting language like
Python. However it is implemented, it would be integrated into
scan-build/scan-view.
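
To make the database idea a little more concrete, here is a minimal
sketch assuming the SQLite option and a Python implementation. The
table names, column names, and the helper functions (open_db,
record_run, report) are all hypothetical, and the "fingerprint" is
assumed to be whatever stable identifier the correlation step
produces. It only illustrates runs, per-bug statuses, and a "first run
present" annotation; it is not a proposed final design.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption.
SCHEMA = """
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    started TEXT NOT NULL
);
CREATE TABLE bugs (
    id INTEGER PRIMARY KEY,
    fingerprint TEXT NOT NULL UNIQUE,
    description TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'uninspected'
        CHECK (status IN ('uninspected', 'false positive', 'verified', 'fixed')),
    first_run INTEGER NOT NULL REFERENCES runs(id)
);
CREATE TABLE sightings (
    run_id INTEGER NOT NULL REFERENCES runs(id),
    bug_id INTEGER NOT NULL REFERENCES bugs(id),
    PRIMARY KEY (run_id, bug_id)
);
"""


def open_db(path=":memory:"):
    """Create a database with the sketch schema and return the connection."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def record_run(db, started, reports):
    """Store one analyzer run; reports is an iterable of (fingerprint, description)."""
    run_id = db.execute(
        "INSERT INTO runs (started) VALUES (?)", (started,)
    ).lastrowid
    for fingerprint, description in reports:
        # A bug is inserted only the first time it is seen, so its status
        # and first_run survive later runs.
        db.execute(
            "INSERT OR IGNORE INTO bugs (fingerprint, description, first_run) "
            "VALUES (?, ?, ?)",
            (fingerprint, description, run_id),
        )
        bug_id = db.execute(
            "SELECT id FROM bugs WHERE fingerprint = ?", (fingerprint,)
        ).fetchone()[0]
        db.execute(
            "INSERT OR IGNORE INTO sightings (run_id, bug_id) VALUES (?, ?)",
            (run_id, bug_id),
        )
    db.commit()
    return run_id


def report(db, status):
    """Return (fingerprint, first_run) for every bug with the given status."""
    return db.execute(
        "SELECT fingerprint, first_run FROM bugs WHERE status = ? ORDER BY id",
        (status,),
    ).fetchall()
```

A report for a chosen subset is then a single query, for example
report(db, "uninspected") for everything nobody has looked at yet.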

I am interested in this project because it would make using the
analyzer easier for larger projects. The diff output could be used as
a regression finder or fix checker. The database would allow users to
keep track of bugs better, and to provide statistics of bugs over
time.

B) User-made checkers

This would provide some sort of easy extension mechanism to the
analyzer to allow simple domain-specific checks. I have a couple of
ideas of how this would look.

1) The first would be to read and use mygcc [1] rules to detect bugs.
I believe this would only provide simple flow-sensitive analysis, but
it looks useful nonetheless. This would require making a pattern
matcher to match AST nodes based on a parsed text expression.
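
As a rough illustration of the matching step only, here is a toy
sketch in Python. It does not use real mygcc rule syntax or real Clang
AST classes: the AST is modeled as nested tuples, the pattern is
assumed to have already been parsed into the same shape, and the
"%X" / "%_" placeholder convention and the function names match and
find_all are assumptions made for this example.

```python
def match(pattern, node, bindings=None):
    """Return a dict of placeholder bindings if node matches pattern, else None.

    Strings starting with '%' are placeholders. '%_' matches anything
    without binding; any other placeholder must match the same subtree
    everywhere it appears. (Hypothetical convention for this sketch.)
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("%"):
        if pattern == "%_":
            return bindings
        if pattern in bindings:
            return bindings if bindings[pattern] == node else None
        bindings[pattern] = node
        return bindings
    if isinstance(pattern, tuple) and isinstance(node, tuple):
        if len(pattern) != len(node):
            return None
        for sub_pattern, sub_node in zip(pattern, node):
            bindings = match(sub_pattern, sub_node, bindings)
            if bindings is None:
                return None
        return bindings
    # Leaves (node kinds, names, literals) must be equal.
    return bindings if pattern == node else None


def find_all(pattern, node):
    """Yield bindings for every subtree of node that matches pattern."""
    found = match(pattern, node)
    if found is not None:
        yield found
    if isinstance(node, tuple):
        for child in node:
            yield from find_all(pattern, child)


# Toy tree for: p = malloc(16); free(p);
tree = (
    "block",
    ("assign", ("var", "p"), ("call", "malloc", ("int", 16))),
    ("call", "free", ("var", "p")),
)
allocations = list(find_all(("assign", "%X", ("call", "malloc", "%_")), tree))
```

Here allocations holds one binding, mapping "%X" to the variable node
for p. A real implementation would also need the flow-sensitive part
(following the bound variable along paths), which this sketch leaves
out.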

2) The second would be an interface to the analysis engines from a
scripting language, perhaps Python. This would be more complicated to
use than mygcc, but likely more useful. For example, a check to make
sure open has a third parameter if the O_CREAT flag is present is very
simple given a scripting language, but impossible using mygcc rules
[2].
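
To give a feel for what such a script might look like, here is a
minimal sketch of the open check in Python. The Call class is a
hypothetical stand-in for whatever a binding would hand to the script;
no such interface exists yet, and the field names are assumptions. The
only real API used is os.O_CREAT from the Python standard library.

```python
import os
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical view of a call site as a script might receive it."""
    callee: str
    args: list  # constant-evaluated argument values, or None when unknown
    line: int


def check_open(call):
    """Return a diagnostic string, or None if the call looks fine."""
    if call.callee != "open" or len(call.args) < 2:
        return None
    flags = call.args[1]
    if flags is None:
        # The flags could not be evaluated, so stay silent.
        return None
    if flags & os.O_CREAT and len(call.args) < 3:
        return (f"line {call.line}: open() with O_CREAT needs a third "
                "'mode' argument")
    return None
```

The check itself is only a few lines; most of the real work would be
in the binding that produces something like Call from the analyzer.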

If I were to do this project, I would likely try to do the second idea
first, and if time permits, write a mygcc matcher in the scripting
language. Implementing mygcc rules in the scripting language would
provide a good test of the interface's completeness.

I am interested in this because the Clang analyzer could be easily
extended with domain-specific checks. For example, specialized locking
rules could be checked using mygcc rules. A trickier example [3] would
be to make sure an llvm::StringRef is not assigned a std::string that
goes out of scope before it. This would be possible using a scripting
language binding, and easier than modifying the Clang source. These
types of checks are already being implemented in Clang, but it is
infeasible to embed specialized checks for every individual project
in Clang itself. This project would be a way around the problem.

3) The closest tool I have seen to #2 is Dehydra [4], which also has a
goal of allowing user-defined bug-finding scripts. A complicating
factor is that its scripting language is JavaScript, and it may be
infeasible to provide a compatible interface. Nevertheless, I am
including replicating the interface here as a third possibility.

Sorry for the incredibly long email. :)

Is either of these proposals interesting? Any criticisms or ideas? All
comments and questions would be appreciated.

Thanks,
- Sam

[1] http://mygcc.free.fr/
Note: I forget how I found this. I believe it was through an email on
this list, but I can't find it.

[2] example taken from Clang source
lib/Checker/UnixAPIChecker.cpp

[3] example again from an existing Clang check
lib/Checker/LLVMConventionsChecker.cpp line 133

[4] https://developer.mozilla.org/en/Dehydra

Hi Samuel,

I haven’t thought through what the bug database or the extension scripting would look like, so I can’t comment on those proposals.

Personally, I would prefer improvements to the current core analysis engine: for example, improving the inter-procedural analysis, adding an integer overflow detector, adding a more powerful constraint manager, or adding C++ support.

2010/3/26 Samuel Harrington <samuel.harrington@mines.sdsmt.edu>

Hi Sam,

I think these are all great ideas. Comments inline.

Hello,

I am interested in doing a project with Clang in the upcoming Google
Summer of Code. I am currently a sophomore at the South Dakota School
of Mines and Technology, and I have some C++, Perl, and Javascript
programming experience. I have been interested in Clang and LLVM for a
while, and I've looked through some of the code before. I am most
interested in the analyzer component though.

I have two possible project ideas I am interested in:

A) Bug database

Create a tool to store bugs and track changes over time.

This tool would use the XML analyzer output and the CIndex library to
correlate bugs over multiple runs. The tool would provide, at a
minimum, a diff-like output given a pair of runs. Ideally, this would
create and update a database with all the runs, and statuses for all
the bugs (uninspected, false positive, verified, fixed). The tool
would provide reports with chosen subsets of the bugs and annotations
such as first run present and current status. The reports could be
HTML output, reusing the existing infrastructure, or be viewable in a
GUI application.

The database could be XML, SQLite, or some plain-text format. I am
unsure whether this tool should be integrated into the clang binary,
be a separate executable, or even use a scripting language like
Python. However it is implemented, it would be integrated into
scan-build/scan-view.

I am interested in this project because it would make using the
analyzer easier for larger projects. The diff output could be used as
a regression finder or fix checker. The database would allow users to
keep track of bugs better, and to provide statistics of bugs over
time.

I think a bug database would be extremely useful, and the ability to correlate analysis results across runs would be really powerful.

Ideally the infrastructure for a bug database would be split into "backend" and "frontend" pieces, where the backend would be the core logic for processing results across runs and the frontend would be the integration into something like scan-view. This decoupling allows the database to potentially be reused in other contexts, e.g. a Trac plugin.

One tricky aspect is correlating analysis results across an evolving codebase. The code surrounding a bug may change while the bug stays the same. This is an arbitrarily complicated problem, but correlation across runs should at least not be overly sensitive to line number changes, etc.
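
As one possible starting point (a sketch under assumptions, not a worked-out design), each report could be identified by a fingerprint built from fields that tend to survive unrelated edits, leaving the line number out. The report keys used here ('file', 'function', 'checker', 'message', 'line') and the function names are hypothetical; the real analyzer output would have to be mapped onto them.

```python
import hashlib


def fingerprint(report):
    """Hash the fields that usually survive unrelated edits.

    report is a dict with hypothetical keys: 'file', 'function',
    'checker', 'message', and 'line'. The line number is deliberately
    not part of the fingerprint.
    """
    key = "\0".join(
        report[k] for k in ("file", "function", "checker", "message")
    )
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def diff_runs(old, new):
    """Return (fixed, introduced, persisting) report lists for two runs."""
    old_ids = {fingerprint(r): r for r in old}
    new_ids = {fingerprint(r): r for r in new}
    fixed = [old_ids[k] for k in old_ids if k not in new_ids]
    introduced = [new_ids[k] for k in new_ids if k not in old_ids]
    persisting = [new_ids[k] for k in new_ids if k in old_ids]
    return fixed, introduced, persisting
```

This simple key has clear limits: two identical reports in the same function collapse into one, and renaming the function or file makes an old bug look new. Handling those cases is where the problem becomes arbitrarily complicated.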

B) User-made checkers

This would provide some sort of easy extension mechanism to the
analyzer to allow simple domain-specific checks. I have a couple of
ideas of how this would look.

I think having more ways to specify domain-specific checkers would be fantastic.

1) The first would be to read and use mygcc [1] rules to detect bugs.
I believe this would only provide simple flow-sensitive analysis, but
it looks useful nonetheless. This would require making a pattern
matcher to match AST nodes based on a parsed text expression.

This would be extremely useful, and it has been requested a couple of times. It is also a well-scoped project, and I think it would make a great GSoC project. Part of the work would also involve relaying useful diagnostics to the user as well as achieving acceptable performance.

2) Second, would be an interface to the analysis engines from a
scripting language, perhaps python. This would be more complicated to
use than mygcc, but likely more useful. For example, a check to make
sure open has a third parameter if the CREATE flag is present is very
simple given a scripting language, but impossible using mygcc rules
[2].

If I was to do this project, I would likely try to do the second idea
first, and if time permits, write a mygcc matcher in the scripting
language. Implementing mygcc rules in the scripting language would
provide a good test of the interface completeness.

I am interested in this because the clang analyzer could be easily
extended with domain specific checks. For example, specialized locking
rules could be checked using mygcc rules. A trickier example [3] would
be to make sure a llvm::StringRef is not assigned a std::string that
goes out of scope before it. This would be possible using a scripting
language binding, and easier than modifying the Clang source. These
types of checks are already being implemented in Clang, but it is
infeasible for specialized checks for arbitrary given projects to be
embedded. This project would be a way around the problem.

This is a far more ambitious project than the mygcc support. As you say this has the potential to have a lot of impact, but there are a couple concerns that come to mind that might make this much bigger than a GSoC project:

1) The internal interface between the analyzer and the external plugin support would need to be well-defined.

2) What do you expose at the higher level? There is both syntactic information (the ASTs) and semantic information (analysis state) that can be exposed to a checker. Both sets of information are currently available to C++ checks that derive from the Checker class, and to build great checks both would need to be exposed at a higher level. There is a lot of information to expose just for the llvm::StringRef check.

3) Performance. The analyzer is very compute-intensive; will path-sensitive checks written in an external scripting language be too slow in practice when analyzing moderate to large codebases? (this isn't a conclusion, just an open question)

4) Lots of infrastructure details including data management, etc., between the analysis core and the external checker.

My feeling is that this is a big project. I think the work on the mygcc support would be a great starting point, as the bulk of the logic would be on the Clang side, and then as you get experience working with the analysis engine you can gradually "move out" of Clang's interior and have plugins that interface with the analysis core. The nice thing about tackling the smaller piece first is that (a) you would make steady progress instead of waiting for the "big feature" to get completed and (b) you will likely finish the GSoC project with a set of very usable pieces that can be used by users (even though a few big pieces might not be finished).

3) The closest tool I have seen to #2 is Dehydra [4], which also has a
goal of allowing user-defined bug finding scripts. A complicating
factor is that the scripting language is Javascript, and it may be
infeasible to provide a compatible interface. Nevertheless, I am
including replicating the interface here as a third possibility.

Replicating DeHydra's interface might be very useful for leveraging some of its checks. One big caveat I see is that this has the challenges of (2) plus the additional burden that you are taking both a language *and* a checker API that someone else has already defined and then trying to match them to Clang's way of doing things. I think this would be more feasible if the base infrastructure for (2) was already in place, but without it you are more at risk of not having time to finish the project.

For work done on the analyzer, I'd prefer GSoC projects that bring a new feature reasonably close to being usable by others. If your project contains a set of milestones that deliver pieces of great functionality (e.g., mygcc support) on the way towards implementing some bigger feature, then the project work is always a net win. I'd be happy mentoring GSoC work on any of these project ideas as long as it had this kind of trajectory.

Hi Sam,

I think these are all great ideas. Comments inline.

...

B) User-made checkers

This would provide some sort of easy extension mechanism to the
analyzer to allow simple domain-specific checks. I have a couple of
ideas of how this would look.

I think having more ways to specify domain-specific checkers would be fantastic.

1) The first would be to read and use mygcc [1] rules to detect bugs.
I believe this would only provide simple flow-sensitive analysis, but
it looks useful nonetheless. This would require making a pattern
matcher to match AST nodes based on a parsed text expression.

This would be extremely useful, and it has been requested a couple of times. It is also a well-scoped project, and I think it would make a great GSoC project. Part of the work would also involve relaying useful diagnostics to the user as well as achieving acceptable performance.

...

For work done on the analyzer, I'd prefer GSoC projects that bring a new feature reasonably close to being usable by others. If your project contains a set of milestones that deliver pieces of great functionality (e.g., mygcc support) on the way towards implementing some bigger feature, then the project work is always a net win. I'd be happy mentoring GSoC work on any of these project ideas as long as it had this kind of trajectory.

Thanks for your comments!

I've submitted a formal proposal to implement mygcc rules, visible at:
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/samlh/t127053978725

I have also attached it in text form.

Any further comments or questions would be appreciated. My primary
concern is whether the proposed matching method is acceptable. I hope
this proposal is clear, complete, and interesting!

Thanks,
Samuel Harrington

proposal.txt (4.96 KB)