I have been looking at the Clang Static Analyzer recently, and wondering whether I could make its abstract interpretation work across different source files, so that constraints from one source file, say a.c, could be applied to other source files like b.c and c.c.
Any directions/hints on this will be much appreciated.
Hi, Larry. This is something we've wanted to do for a long time, but it's not a project to undertake lightly. Currently, the static analyzer uses the same logic as the compiler to parse a source file (and its headers) into an AST, and then runs its analysis over that AST. When you start talking about multiple source files, what we have isn't immediately reusable—Clang's just not set up to handle multiple translation units that interact, with the exception of some of the indexing work. So the challenge is to come up with some way to share information across translation units (or, less plausibly, to rewire Clang to handle multiple translation units in a common context).
In some of our discussions on this, we've come up with the ideas of either "marshalling" data from one ASTContext to another (which turned out to be quite difficult to get right), or of recording "summaries" of each function that describe how to evaluate it from another context. I think the summary-based approach is a better avenue to go down, but then you have to decide how to get these summaries in and out of the analyzer and what information to include in them. And they have to be better than the default behavior we have now for opaque function calls...but then this has the possibility to be really, really useful.
(Who is "we"? Mostly Ted Kremenek, the code owner for the analyzer, along with Anna Zaks and myself.)
I hope that shines a light on some of the difficulties in implementing this well—it's a project of weeks, if not months. If you're still interested in looking at this, let's come up with a plan to tackle some of these problems. Alternately, I'd be happy to see you contributing to the analyzer, but starting out with something possibly less daunting.
I had the same question. But in each specific case I found that I could:
1. Divide the analysis into a data-gathering phase and an error-reporting phase (so two global make passes).
2. Write a single bi-modal AST visitor or path-sensitive checker that emits a flat text file abstracting the results of the first phase.
3. Write some Python or awk that munges the output of the first phase into a single file.
4. Have the checker read that file in the second phase and emit appropriate diagnostics.
So for example, say we would like to analyze NULL pointer dereferences. In phase 1, for each TU, for each function definition, emit (in structured flat text) records like

ReturnsNull FUNCTION
DependentReturn CALLER CALLEE

where FUNCTION, CALLER, and CALLEE uniquely identify a function (I use PATH:LINE:COL:NAME of the canonical declaration and emit errors when functions may not be canonically declared).
Step 3 above does the graph-theoretic reduction to determine which functions may return NULL and puts this in a single file.
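That reduction can be sketched in a few lines of Python. This is only an illustration, not the author's actual script: it assumes the `DependentReturn CALLER CALLEE` records described above plus a hypothetical `ReturnsNull FUNCTION` record marking functions whose bodies may directly return NULL, with functions keyed by their PATH:LINE:COL:NAME strings.

```python
# Sketch of the step-3 reduction (an illustration, not the actual script).
# Assumed phase-1 record types:
#   ReturnsNull FUNCTION           -- hypothetical: FUNCTION may directly
#                                     return NULL
#   DependentReturn CALLER CALLEE  -- CALLER's return value may come from
#                                     a call to CALLEE

def reduce_may_return_null(records):
    """Return the set of functions that may transitively return NULL."""
    may_null = set()
    edges = []  # (caller, callee) dependency pairs
    for line in records:
        parts = line.split()
        if parts[0] == "ReturnsNull":
            may_null.add(parts[1])
        elif parts[0] == "DependentReturn":
            edges.append((parts[1], parts[2]))

    # Simple fixpoint: keep propagating "may return NULL" from callees
    # to dependent callers until nothing new is learned.
    changed = True
    while changed:
        changed = False
        for caller, callee in edges:
            if callee in may_null and caller not in may_null:
                may_null.add(caller)
                changed = True
    return may_null
```

The resulting set would then be written out as the single file that phase 2 reads.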
Step 4 reruns make and passes that file as an argument or environment variable to the checker/AST visitor.
Usually I modify ccc-analyzer by adding some environment variables to decide which files to write in phase 1 and which to read in phase 2. Then I write a bash script that wraps make and sets everything up.
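A rough Python equivalent of such a wrapper might look like the following. The variable names `ANALYZE_PHASE` and `ANALYZE_DATA_FILE` are made up here; they stand in for whatever the modified ccc-analyzer actually checks.

```python
# Sketch of a two-pass make wrapper.  ANALYZE_PHASE and ANALYZE_DATA_FILE
# are hypothetical names for the environment variables a modified
# ccc-analyzer would consult.
import os
import subprocess

def phase_env(phase, data_file):
    """Build the environment for one make pass."""
    env = dict(os.environ)
    env["ANALYZE_PHASE"] = str(phase)     # 1 = gather, 2 = report
    env["ANALYZE_DATA_FILE"] = data_file  # written in phase 1, read in phase 2
    return env

def run_two_pass(data_file="may_return_null.txt"):
    subprocess.run(["make", "clean"], check=True)
    subprocess.run(["make"], env=phase_env(1, data_file), check=True)
    # ...munge the phase-1 output into data_file here...
    subprocess.run(["make", "clean"], check=True)
    subprocess.run(["make"], env=phase_env(2, data_file), check=True)
```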
It probably sounds worse than it is. All of that said, native support for global analysis would be awesome.
My GSoC proposal is slightly related. It does not involve automatic synthesis of models, and in the long run the textual representation of model files will not be appropriate, but it can be considered a first step towards global analysis.
The approach you describe certainly would work in many cases, but not all. What you describe is essentially a “summary-based analysis”, where the first pass gathers enough facts to summarize the effects of a function call, which can then be used to improve analysis results in the second pass. I think what you’ve described is very practical for many kinds of problems, but I suspect it will not yield the best results in general.
A few things to consider:
(1) If we ignore recursive functions for a moment and consider DAG-like call graphs, called functions can be interspersed through a codebase in different files. If the summary you need for the “second pass” does not depend on context, then one pass is sufficient to gather the information for the second pass. If the summaries are context-sensitive (i.e., they depend on how functions were called), then multiple passes may be needed to propagate the context/summary information accordingly. Traditionally, global analysis tools address this problem by not really doing multiple builds, but instead recording enough information on the side during the first pass to do post-analysis of the call graph itself.
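To make point (1) concrete, here is a toy Python sketch (function names and summary contents invented) of why one bottom-up pass suffices for context-insensitive summaries: visiting a DAG call graph in reverse topological order guarantees every callee's summary is computed before any of its callers is processed.

```python
# Toy illustration: context-insensitive summaries over a DAG call graph
# can be computed in a single pass by visiting callees before callers.
from graphlib import TopologicalSorter  # Python 3.9+

def compute_summaries(calls, base_facts, combine):
    """calls: {caller: [callees]}; base_facts: per-function local facts;
    combine(own_fact, callee_summaries) merges a function's own facts
    with its callees' already-computed summaries."""
    summaries = {}
    # TopologicalSorter reads the mapping as {node: predecessors}, so
    # static_order() yields callees before the callers that depend on them.
    for fn in TopologicalSorter(calls).static_order():
        callee_sums = [summaries[c] for c in calls.get(fn, [])]
        summaries[fn] = combine(base_facts.get(fn, False), callee_sums)
    return summaries
```

For the NULL-return example, the combining function is just "this function returns NULL directly, or some callee it forwards may": `lambda own, cs: own or any(cs)`.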
(2) Summaries can be very rich. It’s totally plausible (as you describe) to come up with some hand-tuned summaries for specific problems (e.g., NULL pointer analysis) that work well (although they possibly lose some precision if context-sensitivity is needed), but general summary-based analysis in the analyzer will likely need something a bit more automatic to contend with all the different kinds of checkers people write.
(3) Another goal of the analyzer is to integrate it naturally into development workflows. In a continuous integration system, a build is likely to run once. Some commercial analyzer tools try to capture enough of the build in a lightweight manner (so as not to interfere with the speed of the build) and then do the analysis on the side. Relying on repeating a build to do global analysis might not be feasible in practice.
That said, I think many of your observations are spot on, and capture some of the flavor of what would need to be done in general.
I think it comes down to the properties you want to analyze. We have an interest (as Jordan said) in making the analyzer effectively do global analysis for the different kinds of checkers it supports, but if you are interested in analyzing a specific property that doesn’t require a grand general solution, then the approach that John Hammond mentioned certainly could work in many cases.