Sorry for the late response to this email. As I promised in my personal communication, I wanted to take a look at what you did in some detail after the holidays so that I could share it with the list. I think it is exciting what you were able to do with clang in such a short time. Comments inline.
I used clang to make a static code analyzer tool, based on the ARCHER paper
from Stanford, albeit simpler. It is able to detect both static and dynamic
memory overflows. It only supports intra-procedural analysis. It also
provides analysis for the PHP interpreter API varargs functions (printf
For those of you who are not familiar, Archer essentially performs a DFS traversal of the possible "states" of a program/function, starting from the function entrance. Each state consists of a set of constraints on the values of one or more variables; e.g. that the variable 'x' is in the range '0' to '5'. These symbolic ranges can then be used to flag possible buffer overruns by determining if an index value could ever exceed the bounds of an array.
I like how your implementation resembles an interpreter. While the implementation style varies from using a dispatches based on switch statements or a chain of if..else's, overall it's the same mindset that I believe was used to construct the original Archer tool, and a similar design view will be incorporated into the path-sensitive solver in clang.
The checking of the parameters for PHP is also really nice. With not that much code you were able to write a custom check for a code base that in practice can be really useful.
Regarding your implementation of the buffer overrun checker, one thing that I wasn't certain about was whether or not your analysis did any backtracking when it encountered an infeasible state. For example:
if (x == 1) // do something
if (x == 1) // do something
Here the two branches are control-dependent, and there are only two paths here. Backtracking in the DFS once an infeasible path is encountered can greatly reduce your search space, and thus increase your scalability. From what I can tell the checking of constraints is lazy; I believe they are only checked at an array access, and this is to only determine if an out-of-bounds access can occur, not if the state is infeasible. For checking feasibility of paths, one should try and do this eagerly to prune off as much of the search space as possible. Explicit state space exploration with caching in the worst case is exponential, but never caching means that the algorithm is always exponential. It also means that loops will be explored indefinitely, and thus cause certain paths to never be explored because of the nature of DFS.
As a possible algorithmic optimization, while the Archer paper did not use state caching (arguing that it didn't help), I believe that one could use liveness information to prune variables from a state, and thus increase the possibility of a cache hit. Caching is one of the key ways that analyses that perform an explicit state space exploration actually achieve real scalability. Clang has a liveness analysis already available, so it could be employed for this purpose.
In case someone is interested, the full source-code is available at:
It also includes a presentation of the project in Portuguese, as well as
some examples of bugs that it is able to find.
FYI: Using Google's translation tools, I was able to read most of your presentation. It's a good short presentation. Thanks for providing it!
My code doesn't use the clang analysis framework, as the path-sensitive
analyzer wasn't ready by the time I started the project.
So, about clang.. It is a very nice tool with a low learning curve. really.
I once tried to look to the gcc code and I gave up (I admit I didn't try too
much, but..). From all the compiler tools I've worked so far, clang proved
to be the easiest one. This is due to the nice C++/OOP usage, as well as an
intuitive AST (if you know C, you know how the AST looks like).
A con of clang in the point of view of code analysis is that clang is
optimized for IDEs. That means that some AST nodes could be removed
altogether (e.g. ParenExpr). Also, similar expressions are represented
int x; x=2;
This makes sense in the IDE world, but only makes things more difficult in
the analysis world. But I'm not sure how clang could be improved any further
about this point.
This is an excellent point, and as I believe you are implying, such things are an inherit tradeoff of trying to capture more of the lexical information of the program in the ASTs. There are two things I would like to add about this point when it comes to thinking about writing source-level tools.
The first point I would like to mention is that in general different analyses will "care" about certain kinds of AST nodes and be completely uninterested in other nodes. Thus almost any analysis at the source-level has to deal with "cruft" that it has to ignore, even though the nodes it ignores are semantically meaningful. From this perspective, adding things like ParenExpr, while sometimes annoying to deal with, doesn't necessarily require much more effort to handle, especially if you add wrapper functions (as you did in your implementation) to essentially strip such nodes away.
The second point I would like to add concerns dealing with similar expressions such as:
int x; x=2;
In general, handling both cases should fall out automatically in most cases simply by making the analysis *semantically* driven rather than syntactically driven. If one reasons about the semantics of the expressions using first-principles, such "corner cases" often are handled automatically. Your implementation of the buffer overrun checker actually appears to be fairly semantically based, which means it doesn't have to resort so much to the hacks that appear in bug-finding tools like lint (and even gcc) when flagging certain kinds of errors. Tools that essentially try to perform a "grep" on the AST tend to be fairly fragile, and can miss bugs easily because the same thing is written in a slightly different way. Of course there will always be corner cases that tools must be engineered to handle, but not making unnecessary assumptions about the syntactic structure of code goes a long way.
Also using clang as a gcc replacement is very difficult, mainly where you
are using ./configure && make. I had to do a script to strip unknown
options, as well as run gcc in parallel to clang (as ./configure usually
checks if the compiler is able to create executable files).
This is another, extremely valid point. The current incarnation of the clang driver is not a plug-in replacement for gcc; it doesn't support many (most?) of its arguments. There are many parties interested in developing a driver or evolving the current one to provide (near) plug-in replacement for gcc. As usual, anyone interested in helping out with this effort is more than encouraged to submit patches!
Stepping back a bit, almost all source-code analysis tools like those vended by GrammaTech and Coverity use various techniques to intercept the compiler, parse the source file that the compiler would parse, store some representation of the parsed file on disk, and then invoke the regular compiler as usual. The actual source-code analysis is then done offline, apart from the actual build of the codebase. The reason for this is that static analysis tools that are interprocedural require a whole-program image, which really can only be done after "capturing the build" of the entire codebase.
We are currently building infrastructure that can be used to capture the build for performing interprocedural analysis. Essentially we will intercept the native compiler, serialize the ASTs to disk, and then later perform some indexing on the serialized ASTs. Capturing the build of course is not only useful for finding bugs, but for other kinds of analyses as well, and so it is something we hope to do right in clang to serve various clients.
If I would recomend clang? Yes, sure! Although the API is not stable, it's
still a nice framework.
Wonderful! Natural we hope to have the API stabilize through continued use and development of the front-end.
I hope this is not the end of my work in the compiler world
So do we.
Thanks for sharing with us your experiences with using clang!