LLVM for static code analysis

Hi,
Apart from the Calysto project (http://www.cs.ubc.ca/~babic/index_calysto.htm), is there any other static code analysis tool based on the LLVM framework?
Calysto may be great but it seems that the source is not available (yet?).
I was quite excited by Oink/Elsa a few years ago, but the project is almost dead even though the C++ parser is far from complete.
It seems to me that everything is ready in LLVM to build industrial-strength static analysis tools. Clang is of course a big step towards real-time parsing and IDE integration but the quality of llvm-gcc should be enough for many practical applications.
I am interested in automated code review, coverage and metrics, as done by commercial products like Parasoft C++test. What I am not sure of yet is whether the LLVM IR is rich enough for the job or if I should wait for the dedicated C++ AST of clang.

Best regards,
Emmanuel Bastien
Amadeus IT Group SA

Hi Emmanuel,

We are currently building a static analysis framework as part of clang. The goal is to provide a framework for a variety of tools that could benefit from source-code level analysis, with a particular focus on bug-finding (and possibly verification) tools. This work is in the early stages, but we expect it to progress rapidly over the next 6 months. Naturally this work targets the languages that clang currently supports or partially supports (C and Objective-C), but the framework could progress to analyzing C++ as that language becomes supported by the frontend. We already have a library in clang for performing flow-sensitive, intra-procedural dataflow analyses, and plan on eventually providing a framework for inter-procedural, path-sensitive analysis over entire code bases. If you are interested in following the progress of this work, I encourage you to subscribe to the cfe-dev mailing list. You are also more than welcome to get involved in the actual development of this framework by submitting patches or providing feedback.
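
To make "flow-sensitive, intra-procedural dataflow analysis" concrete, here is a minimal sketch of one classic instance, live-variable analysis, run to a fixed point over a toy control-flow graph for a single function. The `Block`, `Liveness`, `computeLiveness` and `exampleCfg` names are hypothetical and exist only for this illustration; this is not clang's API and says nothing about how the clang library is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy CFG node for illustration only (not clang's CFG class).
struct Block {
    std::set<std::string> use;       // variables read before any write in this block
    std::set<std::string> def;       // variables written in this block
    std::vector<std::size_t> succs;  // indices of successor blocks
};

struct Liveness {
    std::vector<std::set<std::string>> in;   // live on entry to each block
    std::vector<std::set<std::string>> out;  // live on exit from each block
};

// Backward live-variable analysis over one function's CFG:
//   out[b] = union of in[s] over all successors s of b
//   in[b]  = use[b] united with (out[b] minus def[b])
// The equations are re-applied until nothing changes (a fixed point).
Liveness computeLiveness(const std::vector<Block>& cfg) {
    Liveness r;
    r.in.resize(cfg.size());
    r.out.resize(cfg.size());
    bool changed = true;
    while (changed) {
        changed = false;
        // Visit blocks in reverse index order; any order reaches the same result.
        for (std::size_t i = cfg.size(); i-- > 0;) {
            const Block& b = cfg[i];
            std::set<std::string> out;
            for (std::size_t s : b.succs) {
                assert(s < cfg.size());
                out.insert(r.in[s].begin(), r.in[s].end());
            }
            std::set<std::string> in = b.use;
            for (const std::string& v : out)
                if (b.def.count(v) == 0) in.insert(v);
            if (in != r.in[i] || out != r.out[i]) {
                r.in[i] = std::move(in);
                r.out[i] = std::move(out);
                changed = true;
            }
        }
    }
    return r;
}

// Toy CFG for:
//   B0: x = 1; y = 2;
//   B1: if (x) goto B2; else goto B3;
//   B2: y = x + 1; goto B3;
//   B3: return y;
std::vector<Block> exampleCfg() {
    std::vector<Block> cfg(4);
    cfg[0].def = {"x", "y"};
    cfg[0].succs = {1};
    cfg[1].use = {"x"};
    cfg[1].succs = {2, 3};
    cfg[2].use = {"x"};
    cfg[2].def = {"y"};
    cfg[2].succs = {3};
    cfg[3].use = {"y"};
    return cfg;
}
```

Calling `computeLiveness(exampleCfg())` reports `y` as live on entry to B1 (the B1 to B3 edge still reads the value assigned in B0) but not on entry to B2, where it is overwritten before being read. The result differs per program point, which is what "flow-sensitive" means here, and only one function's CFG is examined, which is what "intra-procedural" means.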

Aside from our plans, it is probably worth taking a moment to explain why we are implementing a source-level analysis framework at all, especially when LLVM already supports an IR for analysis and transformation. The motivation for performing static analysis at the source level comes down to tradeoffs. The LLVM IR has some truly beautiful properties, such as SSA form and a low-level representation that is essentially a typed assembly language. The IR can capture much of the type information of the original program while still providing a lowered program representation that simplifies many kinds of analyses and program optimizations. This lowering, however, is a double-edged sword. Much of the original (high-level) type information of the program is discarded in the LLVM IR, which becomes extremely important when we start talking about analyzing object-oriented languages or any language that has a rich type system. Such information can be extremely useful for improving the precision of an analysis, or simply for giving the user understandable diagnostics about possible bugs found by the tool. Moreover, a source-level analysis framework captures a wide variety of other information from the program, such as macros, templates, scope, loop constructs, and accurate variable and function names. All of these are lost in the lowering to LLVM IR. Full source-level information is also more than the line and column information that may be present in a .o file's debugging information or an LLVM bitcode file, which in many cases makes it much easier to report potential bugs to the user. Of course, analyzing the original source can be much messier; a language like C contains far more esoteric edge cases to reason about than the LLVM IR.
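
As a small illustration of the kinds of source-level information mentioned above, the following sketch annotates a few C++ constructs with what a source-level tool can see and what is typically left after lowering. All names (`ARRAY_LEN`, `user_id_t`, `Shape`, `Square`, `totalArea`, `demo`) are made up for this example, and the comments describe typical lowering in general terms rather than the exact output of any particular compiler version.

```cpp
#include <cassert>
#include <cstddef>

// Macro: a source-level tool can see the ARRAY_LEN(...) use itself.
// The preprocessor expands it before any IR is generated, so the IR
// only ever contains the resulting constant.
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

// Typedef: the name user_id_t exists only in the source. LLVM integer
// types carry no signedness or typedef name, so this is typically just
// a 32-bit integer in the IR.
typedef unsigned int user_id_t;

// Class hierarchy: the source states that Square is-a Shape and that
// area() is a virtual member function.
struct Shape {
    virtual ~Shape() {}
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

double totalArea(const Shape* const* shapes, std::size_t n) {
    assert(shapes != nullptr || n == 0);
    double sum = 0.0;
    // Loop construct: a `for` statement in the source; after lowering it
    // is a set of basic blocks connected by branches.
    for (std::size_t i = 0; i < n; ++i) {
        // Virtual call: at the source level this is Shape::area(); after
        // lowering it is typically an indirect call through a function
        // pointer loaded from the object's vtable.
        sum += shapes[i]->area();
    }
    return sum;
}

double demo() {
    Square a(2.0), b(3.0);
    const Shape* shapes[] = {&a, &b};
    user_id_t owner = 7;  // hypothetical value, unused by the computation
    (void)owner;
    return totalArea(shapes, ARRAY_LEN(shapes));
}
```

A checker that wants to warn about, say, misuse of `ARRAY_LEN` on a pointer, or to report "call to `Shape::area()`" in a diagnostic, needs the source-level view; the lowered form no longer contains those names.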

Most state-of-the-art (commercial) bug-finding tools based on static analysis operate on an IR that is close to the source level. Analysis tools that operate on Java code can often get away with analyzing only the bytecode, since the bytecode contains enough information to recreate much of the original Java program (the type system of Java is captured explicitly in the bytecode). Nevertheless, this isn't always enough information. Things get especially difficult in a language like C++, where macros, template instantiation, and operator overloading can significantly obfuscate the mapping between a lowered IR such as that used by LLVM and the original source code. There are many other tradeoffs between doing source-level and LLVM IR-level analysis. Which one you use at the end of the day depends on your application and your precise goals. Of course many bug-finding analyses could actually be done (well) at the LLVM IR level, while others could be done far more successfully at the source level.

Finally, an analysis framework that allows us to reason statically at the source-level about the properties of C/Objective-C/C++/whatever programs only provides another tool in the LLVM toolbox. When building a bug-finding tool, one can potentially use both the LLVM IR and the source-level analysis framework that will be built into clang, although we believe that in order to build a successful (static) bug-finding tool a good source-level analysis framework is a prerequisite piece.

Ted