Ben Laurie of Google has up to $50K to spend on a pilot project to
improve the state of static analysis of C code for open source
projects. Among other things, Ben is involved with the OpenSSL project
and has tried some static analysis tools such as Deputy and Cyclone
(the latter also a language extension of C), and has noted various
problems and limitations with these tools.
Hi Monty, Ben,
It’s great to hear about your interest! Since some of your questions are fairly broad, I have tried to respond to them with varying levels of detail. Please do not hesitate to follow up with specific questions.
The goal of this pilot project is to get a static analyzer tool
developed/modified so that it is truly useful to the open source
community and can become a standard part of the development process.
The ability to customize the analysis is strongly desired. For
instance, after a security exploit is reported people might want to
review the rest of the code for the same problem. An analyzer that
helped do that would be quite beneficial.
If the pilot project goes well then additional funding is possible.
(1) The analyzer core can be modified or extended with new program analysis algorithms to meet the needs of different clients (e.g., providing the flexibility to trade off between accuracy and computational complexity).
(2) The analyzer core can be reused in a variety of contexts, such as in a standalone command-line tool or an IDE.
(3) The analyzer can be extended with new sets of “checks” without invasively modifying the analyzer internals. Such extensibility can be provided with different layers of abstraction over the core analyzer API, with some checks possibly written in a very high-level (e.g., declarative) manner while others require more direct access to the core analyzer APIs. Both ends of the spectrum are important: some checks will require sophisticated algorithms with deep reasoning about the program, while others are much simpler and just key off of common interfaces and APIs.
The analyzer core in Clang is intended to (eventually) meet all three goals. So far efforts have focused on the first two, although there has been some work on the third with regards to checks that directly use the analyzer’s C++ APIs. Providing higher-level interfaces for declaring checks is a great long-term goal, but my feeling is that it makes more sense to mature the core technology first.
As a high-level design strategy, the analyzer core is written as a library (libAnalysis). This makes it inherently much easier to reuse in different settings, which directly addresses the reuse goal. The structure of libAnalysis is fairly modular, with a set of key abstract interfaces in place that allow us to swap different key components of the analyzer in and out. For example, the engine that reasons about feasible paths within a function uses a “ConstraintManager” interface that is responsible for reasoning about the possible values of symbols along a path. Different implementations of ConstraintManager can provide different levels of precision depending on the needs of clients.
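To make the swappable-interface idea concrete, here is a minimal sketch of the pattern. The names (SymbolID, assumeEq, EqualityOnlyConstraintManager) are invented for illustration and are not the actual Clang C++ API:

```cpp
#include <map>
#include <memory>

// Illustrative only -- not the real Clang API. A symbol is just an
// integer ID; a constraint manager answers questions about its value.
using SymbolID = int;

struct ConstraintManager {
  virtual ~ConstraintManager() = default;
  // Record the path assumption "sym == value".
  virtual void assumeEq(SymbolID sym, int value) = 0;
  // Return true if the path's constraints force "sym == value".
  virtual bool isEq(SymbolID sym, int value) const = 0;
};

// One interchangeable implementation: track exact equalities only,
// analogous in spirit to a "basic" constraint manager.
struct EqualityOnlyConstraintManager : ConstraintManager {
  std::map<SymbolID, int> known;
  void assumeEq(SymbolID sym, int value) override { known[sym] = value; }
  bool isEq(SymbolID sym, int value) const override {
    auto it = known.find(sym);
    return it != known.end() && it->second == value;
  }
};

bool demo() {
  std::unique_ptr<ConstraintManager> cm =
      std::make_unique<EqualityOnlyConstraintManager>();
  cm->assumeEq(/*sym=*/1, /*value=*/0);  // e.g., "p == NULL" on this path
  return cm->isEq(1, 0) && !cm->isEq(1, 5);
}
```

A more precise implementation (ranges, relational constraints) could be slotted in behind the same interface without touching the path-exploration engine.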
Another high-level goal of the analyzer is to support relaying rich diagnostics to end-users about how a bug manifests in their program. The diagnostic reporting mechanism in the analyzer also uses a set of abstract interfaces so that bug reports can be rendered in a variety of ways (e.g., to the console, to an HTML page, within an IDE, etc.). Providing rich diagnostics is an important goal because without them the results of a static analysis algorithm are only useful to graduate students studying program analysis techniques, rather than to programmers who want to fix bugs.
From the research I have done, clang seems to be the best front end
for this type of project. I have some questions:
- What is the state of the static analyzer? Where do I start
learning what needs to be done on it? Is there a long term plan for it?
The talk I gave last August provides a broad overview of the project and its state:
Finding Bugs with the Clang Static Analyzer
More information on using the analyzer can be found at:
There is currently no real “internals” documentation on the analyzer, which of course is something that needs to be written. Perhaps the easiest way to look at the state of the analyzer is to look at the implementation of specific checks in the lib/Analysis directory of the Clang sources. Another good starting point is the file Driver/AnalysisConsumer.cpp; this file corresponds to the code in the driver that invokes the static analysis algorithms on a specific source file.
At a high level the analyzer consists of two analysis “engines”:
[a] A flow-sensitive, intra-procedural dataflow engine.
[b] A path-sensitive, intra-procedural (and eventually inter-procedural) dataflow engine.
Engine [a] mirrors the logic that typically goes on for flow-sensitive compiler checks, e.g., live variables analysis, basic uninitialized variable checking, etc. The analyzer also has a “dead stores” checker that is based on [a]. The benefit of [a] is that the running time of an analysis is linear. The downside of [a] is that information can be lost where paths merge (e.g., at confluence points after branches).
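As a concrete illustration of what engine [a] catches, the dead-stores check flags patterns like the following (a made-up function, not taken from any real codebase):

```cpp
// The first assignment to `len` is never read before being overwritten,
// so it is a "dead store" -- exactly the pattern a flow-sensitive
// dead-stores check reports.
int parse_header(int n) {
  int len = n * 2;  // dead store: overwritten below before any use
  len = 0;          // the only store that actually matters
  return len;
}
```

Because this only requires knowing, at each program point, which variables are live, the check runs in a single linear pass over the function.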
Information on the theory and algorithms of [a] can be found in fairly standard compiler texts such as:
Advanced Compiler Design and Implementation
author: Muchnick
Compilers: Principles, Techniques and Tools Second Edition
authors: Aho, Lam, Sethi, Ullman
In libAnalysis, two analyses use [a]: LiveVariables.cpp and UninitializedValues.cpp.
Engine [b] is a “path-sensitive” dataflow analysis engine which essentially is a symbolic simulator. This engine reasons about path-specific bugs such as null-dereferences, memory leaks, etc. It’s basically the core technology of what we generally equate with an “advanced” static analysis tool.
At a high level, [b] reasons about the “reachable states” of a program by exploring an abstract state-space graph (represented as an “exploded graph”; ExplodedGraph.[h,cpp]). A “bug” is essentially a path that reaches a state considered to be “bad” (e.g., a null pointer was dereferenced). States are simply a collection of symbolic values representing an abstraction of a program’s state at a particular location in the program.
Operationally, the analyzer essentially interprets each expression in a program’s source code and reasons about the “state transitions” by considering the effect of an expression on a given state. From dataflow theory, the theoretical parlance for this operation is called a “transfer function”, but from an engineering standpoint it is essentially a visitor function that takes as input a state and an expression and returns a new state.
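A minimal sketch of the transfer-function idea follows. The State type and the toy statement form are invented for illustration; the real engine works over Clang ASTs, not this simplified representation:

```cpp
#include <map>
#include <optional>
#include <string>

// Illustrative only: a "state" maps variable names to known constant
// values (nullopt == unknown). A transfer function takes a state and
// one statement and returns the successor state, leaving the input
// untouched -- mirroring the functional style of dataflow transfer
// functions.
using State = std::map<std::string, std::optional<int>>;

// One toy statement form: "lhs = rhs + addend".
struct Assign { std::string lhs, rhs; int addend; };

State transfer(const State &in, const Assign &s) {
  State out = in;
  auto it = in.find(s.rhs);
  if (it != in.end() && it->second.has_value())
    out[s.lhs] = *it->second + s.addend;  // e.g., evaluate "i + 10"
  else
    out[s.lhs] = std::nullopt;            // unknown input -> unknown result
  return out;
}

bool demo() {
  State s;
  s["i"] = 5;
  State t = transfer(s, {"j", "i", 10});  // models "j = i + 10"
  return t.at("j").has_value() && *t.at("j") == 15;
}
```

In the real analyzer the returned states become nodes in the exploded graph, with one successor per feasible outcome of the expression.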
The analyzer design is inspired by, and similar to (though in many ways quite different from), the algorithms and ideas in several well-known papers:
Precise interprocedural dataflow analysis via graph reachability
Checking system rules using system-specific, programmer-written compiler extensions
Symbolic path simulation in path-sensitive dataflow analysis
(and several others)
From an engineering perspective, the trick is making the transfer function modular in construction, so that some aspects handle basic symbolic interpretation such as the manipulation of basic values (i.e., determining the result of “i + 10”) while other aspects are “checker specific” and reason about things such as the state of a lock, a file handle, whether or not a piece of data is considered “unsafe”, and so forth. The modular design of the transfer function interface is one important aspect of the analyzer that needs to be matured, although what’s there isn’t too bad of a starting point. This component is so important because it represents the core axis of flexibility for writing a diversity of checks, and thus maturing it is necessary before thinking about how to write checks using different high-level interfaces.
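As an illustration of the “checker specific” side, here is a toy lock checker that tracks one piece of state through the simulation and flags a double acquire. The names are invented; a real checker would hook into the transfer function machinery rather than be called directly:

```cpp
// Illustrative only: checker-specific state threaded through the
// symbolic simulation. The checker cares only about the lock's state
// machine, not about general value computation.
enum class LockState { Unlocked, Locked };

struct LockChecker {
  LockState st = LockState::Unlocked;
  bool doubleLock = false;
  bool doubleUnlock = false;
  void onAcquire() {
    if (st == LockState::Locked) doubleLock = true;  // bug: re-acquire
    st = LockState::Locked;
  }
  void onRelease() {
    if (st == LockState::Unlocked) doubleUnlock = true;
    st = LockState::Unlocked;
  }
};

bool demo() {
  LockChecker c;
  c.onAcquire();
  c.onAcquire();  // second acquire with no intervening release
  return c.doubleLock && !c.doubleUnlock;
}
```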
Other important components of [b] include:
(1) ConstraintManager - the module that reasons about constraints on symbolic values
(2) StoreManager - the module that reasons about the bindings of variables to values. This is essentially the symbolic representation of the heap/stack.
The current implementation of ConstraintManager is “BasicConstraintManager”, which tracks only basic equality and inequality relationships (e.g., “a != 10”, “b == 4”). This is suitable for most predicates in system code (and for catching things like null dereferences), but not all. One goal is to implement another ConstraintManager that does basic range analysis (e.g., tracking “a > b”). This would be especially useful for buffer overflow detection.
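A toy sketch of the kind of range tracking this would involve, assuming a simple [lo, hi] interval per symbolic value (this is not the BasicConstraintManager implementation; all names here are invented):

```cpp
#include <limits>

// Illustrative only: one interval per symbolic value.
struct Range {
  long lo = std::numeric_limits<long>::min();
  long hi = std::numeric_limits<long>::max();
};

// Narrow the interval after a branch such as "if (i < bound)" is taken.
Range assumeLessThan(Range r, long bound) {
  if (r.hi >= bound) r.hi = bound - 1;
  return r;
}
Range assumeNonNegative(Range r) {
  if (r.lo < 0) r.lo = 0;
  return r;
}

// A buffer access buf[idx] with buf of size bufSize may overflow if the
// interval is not provably contained in [0, bufSize).
bool mayOverflow(Range idx, long bufSize) {
  return idx.lo < 0 || idx.hi >= bufSize;
}

bool demo() {
  Range i;                                       // unconstrained index
  bool before = mayOverflow(i, 16);              // nothing rules it out
  i = assumeNonNegative(assumeLessThan(i, 16));  // path: 0 <= i < 16
  bool after = mayOverflow(i, 16);               // provably in bounds
  return before && !after;
}
```

An equality-only constraint manager cannot answer the `mayOverflow` question above, which is why range tracking matters for buffer overflow checks.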
For StoreManager, we are actively using BasicStoreManager, which just tracks bindings for variables on the stack. Zhongxing Xu has put a fair amount of work into developing a new StoreManager called “RegionStoreManager” that will extend the analyzer’s abilities to reason about the values of fields, arrays, etc. We certainly invite anyone who is interested in helping complete this feature to get involved.
Probably the other biggest point worth mentioning is that the path-sensitive engine is currently restricted to function-at-a-time analysis. Doing inter-procedural analysis is definitely planned, but requires different levels of infrastructure depending on the scope of the analysis. In the most general case we need to be able to perform whole-program analysis across source files. This requires being able to build an entire image of the program (including information about linkage, so we know the exact corresponding definitions for called functions). There are a variety of ways to tackle this task, and it is a major piece of infrastructure that would certainly benefit from anyone wishing to get involved.
- The ability to add custom logic to the analyzer is quite desirable.
Perhaps this could be made easier by integrating with a scripting
language like Python. To me, even the ability to write a script to
pattern match against the AST or other intermediate forms could be useful.
- Simply managing the volume of warnings can be difficult. I would
like to integrate some method of tracking warnings from build to build
to see what’s new and perhaps to be able to prioritize what should be
investigated first. This would probably be separate from the
analyzer, but a useful front end will help the tool get adopted more widely.
“Issue tracking” would be useful for a wide range of clients. To do this well, however, I don’t think it can be done completely separately from the analyzer. Issue tracking requires rich diagnostics that describe a bug, but it must also be insensitive to code churn that has no bearing on a particular bug or false positive. In some cases this may require issuing queries to the analyzer to extract more information. As a first cut, however, the output could be based entirely on the diagnostics.
As an interesting side note, the developers of Adium (http://www.adiumx.com) have actually been regularly scanning their sources using Clang’s static analyzer. As part of their process, they wrote some scripts that inspect the emitted diagnostics (in this case HTML files) and did some munging to determine whether a report emitted on a given scan was the same as one from a prior scan. Their goal is to avoid re-inspecting previously diagnosed bugs and false positives. Having a general mechanism for this in the analyzer library would be quite useful, and it would be a good first project for anyone who is interested in getting involved.
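A minimal sketch of the matching idea (the field names are invented, and this is not how the Adium scripts actually work): key each report on fields that are stable across unrelated code churn, deliberately ignoring the line number, and diff the key sets of two scans.

```cpp
#include <set>
#include <string>

// Illustrative only: a report with a churn-sensitive line number plus
// fields that tend to survive unrelated edits.
struct Report { std::string file, function, bugType; int line; };

// Key on the stable fields; the line number is deliberately excluded so
// that edits above a bug do not make it look "new".
std::string key(const Report &r) {
  return r.file + "|" + r.function + "|" + r.bugType;
}

bool isNew(const Report &r, const std::set<std::string> &prevScan) {
  return prevScan.count(key(r)) == 0;
}

bool demo() {
  std::set<std::string> prev = {
      key({"ssl.c", "handshake", "null-deref", 120})};
  Report moved{"ssl.c", "handshake", "null-deref", 134};  // shifted by edits
  Report fresh{"ssl.c", "cleanup", "leak", 40};
  return !isNew(moved, prev) && isNew(fresh, prev);
}
```

A real mechanism would need a richer key (e.g., incorporating path structure from the diagnostics) to disambiguate multiple reports of the same type in one function, which is where queries back to the analyzer could help.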
- Annotations can be helpful to guide an analyzer. How difficult
would it be to extend the parser to accept a simple annotation syntax?
C and its derivatives are complex languages, so it really depends on the annotation syntax. For example, it is really easy to add GCC-style attributes to Clang (whose syntax could then be wrapped by macros):
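A minimal illustration (the NONNULL macro here is hypothetical, but the underlying “nonnull” attribute is a real GCC/Clang attribute):

```cpp
#include <cstring>

// Hide a GCC-style attribute behind a macro so annotated headers still
// build with compilers that do not understand __attribute__.
#if defined(__GNUC__) || defined(__clang__)
#  define NONNULL(...) __attribute__((nonnull(__VA_ARGS__)))
#else
#  define NONNULL(...)
#endif

// Declares that arguments 1 and 2 must not be null; an analyzer can
// then flag call sites that may pass a null pointer.
void copy_name(char *dst, const char *src) NONNULL(1, 2);

void copy_name(char *dst, const char *src) { std::strcpy(dst, src); }

bool demo() {
  char buf[8];
  copy_name(buf, "ok");
  return buf[0] == 'o' && buf[1] == 'k';
}
```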
The analyzer already exploits a variety of attributes such as “nonnull”, “unused”, “IBOutlet” (the latter being added for an Objective-C check the analyzer performs), and new ones could readily be added as well. It all depends on what you want to express.
I am open to collaborating on this project if anyone is available.
That would be wonderful. My suggestion is to start with a simple set of annotations for which the corresponding checks are not too complicated, and iterate on the design from there. There has been a lot of work on annotation systems; probably the most traction will come from starting with annotations that have the widest practical benefit (i.e., low cost of adoption, easy implementation, etc.).
Also, if you would like to learn more about this project or submit
your own proposal, feel free to contact “Ben Laurie”
If you wish to discuss more aspects of the analyzer I highly recommend you subscribe (if you haven’t already) to cfe-dev. For those who wish to get directly involved in development, we discuss patches on cfe-commits.
I mentioned a fair number of points in this email regarding directions for the static analyzer. Two particularly important action points (which are fairly accessible to anyone wishing to get involved) are:
- improving the web app to manage false positives
- run-to-run tracking of analysis results
Both points go right to the issue of user workflow and the overall usability of the analyzer, and don’t require (at least initially) having an intimate knowledge of the analyzer’s internals or its algorithms.
If you have any specific questions, please do not hesitate to ask.