Hi!
Due to a relatively involved bughunt over the course of the last week,
I've got the silly idea to look into static analysis to guard against
the mistake in the future 
Hi Kay,
It sounds like a great idea!
The problems I'm looking detect involve inter-procedural analysis
since that's what has been biting us. Specifically I need to look for
accesses to certain data structures and make sure those accesses are
properly bracketed by calls to functions (only a small set of known
functions, which makes it easier to detect - doesn't need to be fully
generic for this problem at hand). It gets worse by the fact that the
data structure is passed around quite a lot and that the name is
reused all over the place.
libAnalysis currently doesn't have any boilerplate infrastructure for doing inter-procedural analysis. We do provide two solvers for intra-procedural analysis (see below) on which you could build inter-procedural analysis, but there is some heavy lifting involved on your part. Building a generic inter-procedural analysis engine is something we would like to do, but it's a big project that will be tackled over a long period of time.
Effectively I guess I need to perform some kind of "alias" analysis.
I'd like to do it at the source level, though in theory I suppose I
could compile the code and perform those checks on LLVM bitcode/IR,
too. For the tool to be most helpful, it should be able to report
source locations, though.
Having good source information is one of the prime motivators that "clang static analyzer" was implemented on top of clang ASTs instead of the LLVM IR.
In order to get a better feeling for what's in clang today, I
performed some experiments. Here's a very simple example I've been
playing around with:
int first(int param1) {
int first1 = param1;
return param1 + 1;
}
int main(int argc, char *argv) {
int main1;
int main2 = main1;
main1 = first(main2);
return 0;
}
Running it with:
clang -warn-dead-stores -warn-uninit-values -cfg-dump main.c
emits the warnings (three in total) in two separate blocks, and does
not warn about an uninitialized value for param1 in first(). The order
and format of the output suggests that each function is looked at
separately right after it has been parsed. Is that a correct
observation?
That is correct. These particular analyses are local (intra-procedural). The uninitialized values analysis (which in this case is a very simple check) doesn't know anything about the actual value of param1 in the context of the call to 'first' in 'main'.
So far I seem to have not enough knowledge about the code
to verify it myself (note to self: spend more time in the debugger :))
Further, I've not been able to answer the question whether there is
existing code for transfer functions that performs any chasing of
CallExprs. LiveVariables does not seem to have special code to
"follow" calls, although I see a bunch of other Visitxxx methods.
You can define arbitrary transfer functions for arbitrary expressions (see below).
Would it be as "simple" as performing something like LiveVariables
does, but implementing VisitCallExpr and hooking that up? It seems
like it, but since I'm not familiar enough with it yet, I thought I'd
ask first before programming myself into a corner...
Sort of. The difficult is handling the inter-procedural information. Basically, you need to write the glue yourself to capture the information from analyzing one function and propagating it when analyzing another. There are many ways to do this. Here is one generic way:
- Build "summaries" of each analyzed function, which summarizes the effect of calling that function (e.g., the function "first" returns "uninitialized" if "param1" is uninitialized). The detail of the summary depends on your analysis. Summaries can be computed out-of-order, lazily as you analyze one function and it calls another, etc. When you encounter a CallExpr when analyzing a function, you then consult the called function's summary to evaluate its effects.
There are many variations on how this can be carried out.
Given the above observation about the time the analysis is performed,
it seems to me that I at least need to postpone my analysis after
everything is parsed. Does that make sense? Or is the order of the
output purely coincidental in that it simply visits the blocks in that
order and prints out diagnostics associated with the subgraph?
libAnalysis currently consists of two static analysis "engines":
(1) a flow-sensitive dataflow solver
(2) a path-sensitive dataflow engine
Both are intra-procedural analysis engines, but there is no reason why one could not implement inter-procedural analyses with them.
(1) is used by LiveVariables and UninitializedValues. It merges information at confluence points using a specified "merge" operation. The uninitialized values analysis built on the flow-sensitive solver basically implements the -Wuninitialized check in gcc (a simple quick check for the use of uninitialized values).
(2) is used by the retain/release memory checker, and also does a bunch of built-in checks such as check for null dereferences, uses of uninitialized values along a specific path, etc.
I'll first talk about the flow-sensitive solver (1).
For (1), you basically implement visitor methods in your transfer function object (as you have already noted). Essentially, the method BlockVisit is called for each block-level expression in a CFG basic block. To implement your transfer function object, typically you subclass CFGStmtVisitor<...> as defined in CFGStmtVisitor.h, which implements default visitor methods that essentially do nothing. You then override the methods that you care about.
For example, the dataflow solver calls BlockStmt_Visit on a CallExpr*. The default implementation of BlockStmt_Visit calls BlockStmt_VisitCallExpr. The default implementation of BlockStmt_VisitCallExpr calls BlockStmt_VisitExpr. The default implementation of BlockStmt_VisitExpr then calls BlockStmt_VisitStmt. As you see, you override the method with the level of granularity that you care about.
Note that the visitor methods will not (by default) do a recursive walk of an expression. You either have to implement the recursion yourself (as done by UninitializedValues.cpp) or use an enhanced visitor class that does the recursion for you (as done in LiveVariables.cpp).
To do inter-procedural analysis, essentially you need to implement your own transfer function logic for CallExpr* (which may involving spawning off a new analysis for the called function or consulting a precomputed summary) or you use the intra-procedural data flow analysis to create a data structure which you can then later use to do your IPA (i.e., you construct just the facts you need from the analysis of a function to do a combined IPA over multiple functins). You really have a lot of flexibility; it just depends on your analysis needs.
For (2), the transfer function mechanism right now is still in flux. I'm gradually building up the infrastructure to "layer" or "compose" various transfer functions. The path-sensitive solver operates on an explicit reachable state graph, where each node in the graph refers to a location in the CFG and a "state" representing the values of your analysis. The transfer functions simply generate new states/graph nodes. Some of the transfer function logic in place in the path-sensitive engine do some things you probably don't want to re-implement: constant propagation, local aliasing relationships, symbolic value tracking, and so on. The hope is by having a layered approach one will be able to create difference checkers that represent the composition of different sub-analyses, with the checker writer only caring about writing the code that checks the property they care about. This clean composition isn't there yet, but an example of how this can be done is with the retain/release checker (CFRetain.cpp).
In order to do IPA within a file, your best bet is probably to wait until the entire file (TranslationUnit) has been parsed. You can then visit the function declarations in any order you wish, build up the necessary mappings you need, etc. Note that this kind of IPA is restricted to analyzing the functions with a single file. If your function implementations are scattered across multiple source files, you'll need to do your IPA by analyzing your functions locally, building up the information you need to glue their analysis results later, and then persistently store those results on disk (for later composition). As I mentioned before, we're working on adding the necessary infrastructure to do full IPA across an entire codebase, but it's a big project and will take some time.