Interprocedural data flow analysis

Hi!

Due to a relatively involved bug hunt over the course of the last week, I got the silly idea to look into static analysis to guard against this kind of mistake in the future :wink:

The problems I'm looking to detect involve inter-procedural analysis, since that's what has been biting us. Specifically, I need to look for accesses to certain data structures and make sure those accesses are properly bracketed by calls to functions (only a small set of known functions, which makes them easier to detect; it doesn't need to be fully generic for the problem at hand). It's made worse by the fact that the data structure is passed around quite a lot and that the name is reused all over the place.
Effectively, I guess I need to perform some kind of "alias" analysis. I'd like to do it at the source level, though in theory I suppose I could compile the code and perform those checks on LLVM bitcode/IR instead. For the tool to be most helpful, though, it should be able to report source locations.
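
To make the pattern concrete, here's a toy version of the kind of bug I'm after. All the names here are made up (the real API looks different):

typedef struct table table_t;

void table_lock(table_t *t);
void table_unlock(table_t *t);
int  table_get(table_t *t, int key);

int ok(table_t *t) {
  table_lock(t);
  int v = table_get(t, 42);    /* access properly bracketed */
  table_unlock(t);
  return v;
}

int bad(table_t *t) {
  table_t *alias = t;          /* the aliasing that makes this hard to spot */
  return table_get(alias, 42); /* access without the lock/unlock bracket */
}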

In order to get a better feeling for what's in clang today, I performed some experiments. Here's a very simple example I've been playing around with:

int first(int param1) {
  int first1 = param1;
  return param1 + 1;
}

int main(int argc, char **argv) {
  int main1;
  int main2 = main1;
  main1 = first(main2);
  return 0;
}

Running it with:
  clang -warn-dead-stores -warn-uninit-values -cfg-dump main.c
emits the warnings (three in total) in two separate blocks, and does not warn about an uninitialized value for param1 in first(). The order and format of the output suggest that each function is looked at separately, right after it has been parsed. Is that a correct observation? So far I don't seem to have enough knowledge of the code to verify this myself (note to self: spend more time in the debugger :))

Further, I've not been able to answer the question of whether there is existing code for transfer functions that performs any chasing of CallExprs. LiveVariables does not seem to have special code to "follow" calls, although I see a bunch of other Visitxxx methods.

Would it be as "simple" as doing something like what LiveVariables does, but implementing VisitCallExpr and hooking that up? It seems like it, but since I'm not familiar enough with it yet, I thought I'd ask first before programming myself into a corner...

Given the above observation about the time at which the analysis is performed, it seems to me that I at least need to postpone my analysis until after everything is parsed. Does that make sense? Or is the order of the output purely coincidental, in that it simply visits the blocks in that order and prints out the diagnostics associated with the subgraph?

Thanks,
-k

P.S.: This is with the latest clang/llvm trunk versions.

Hi!

Due to a relatively involved bughunt over the course of the last week,
I've got the silly idea to look into static analysis to guard against
the mistake in the future :wink:

Hi Kay,

It sounds like a great idea!

The problems I'm looking to detect involve inter-procedural analysis,
since that's what has been biting us. Specifically, I need to look for
accesses to certain data structures and make sure those accesses are
properly bracketed by calls to functions (only a small set of known
functions, which makes them easier to detect; it doesn't need to be
fully generic for the problem at hand). It's made worse by the fact
that the data structure is passed around quite a lot and that the name
is reused all over the place.

libAnalysis currently doesn't have any boilerplate infrastructure for doing inter-procedural analysis. We do provide two solvers for intra-procedural analysis (see below) on which you could build inter-procedural analysis, but there is some heavy lifting involved on your part. Building a generic inter-procedural analysis engine is something we would like to do, but it's a big project that will be tackled over a long period of time.

Effectively, I guess I need to perform some kind of "alias" analysis.
I'd like to do it at the source level, though in theory I suppose I
could compile the code and perform those checks on LLVM bitcode/IR
instead. For the tool to be most helpful, though, it should be able
to report source locations.

Having good source information is one of the prime reasons the clang static analyzer was implemented on top of clang ASTs instead of the LLVM IR.

In order to get a better feeling for what's in clang today, I
performed some experiments. Here's a very simple example I've been
playing around with:

int first(int param1) {
  int first1 = param1;
  return param1 + 1;
}

int main(int argc, char **argv) {
  int main1;
  int main2 = main1;
  main1 = first(main2);
  return 0;
}

Running it with:
  clang -warn-dead-stores -warn-uninit-values -cfg-dump main.c
emits the warnings (three in total) in two separate blocks, and does
not warn about an uninitialized value for param1 in first(). The order
and format of the output suggest that each function is looked at
separately, right after it has been parsed. Is that a correct
observation?

That is correct. These particular analyses are local (intra-procedural). The uninitialized values analysis (which in this case is a very simple check) doesn't know anything about the actual value of param1 in the context of the call to 'first' in 'main'.

So far I don't seem to have enough knowledge of the code
to verify this myself (note to self: spend more time in the debugger :))

Further, I've not been able to answer the question of whether there is
existing code for transfer functions that performs any chasing of
CallExprs. LiveVariables does not seem to have special code to
"follow" calls, although I see a bunch of other Visitxxx methods.

You can define arbitrary transfer functions for arbitrary expressions (see below).

Would it be as "simple" as doing something like what LiveVariables
does, but implementing VisitCallExpr and hooking that up? It seems
like it, but since I'm not familiar enough with it yet, I thought I'd
ask first before programming myself into a corner...

Sort of. The difficulty is handling the inter-procedural information. Basically, you need to write the glue yourself to capture the information from analyzing one function and propagate it when analyzing another. There are many ways to do this. Here is one generic way:

- Build "summaries" of each analyzed function, which summarizes the effect of calling that function (e.g., the function "first" returns "uninitialized" if "param1" is uninitialized). The detail of the summary depends on your analysis. Summaries can be computed out-of-order, lazily as you analyze one function and it calls another, etc. When you encounter a CallExpr when analyzing a function, you then consult the called function's summary to evaluate its effects.

There are many variations on how this can be carried out.
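
To make that concrete, here is a minimal sketch of what a summary scheme could look like for the uninitialized-values example. This is plain C++ written purely for illustration; none of these types exist in libAnalysis:

#include <map>
#include <string>
#include <vector>

// Invented types for illustration; nothing here is libAnalysis code.
enum InitState { Uninitialized, Initialized };

struct FuncSummary {
  // retDependsOnParam[i]: the return value is uninitialized whenever
  // argument i is uninitialized (e.g., true for "first" and param1).
  std::vector<bool> retDependsOnParam;
};

// One entry per analyzed function; filled in lazily or out of order.
std::map<std::string, FuncSummary> Summaries;

// Transfer function for a call site: given the abstract states of the
// actual arguments, consult the callee's summary for the return value.
InitState evalCall(const std::string &Callee,
                   const std::vector<InitState> &Args) {
  std::map<std::string, FuncSummary>::iterator I = Summaries.find(Callee);
  if (I == Summaries.end())
    return Initialized; // no summary yet; pick an optimistic default
  const FuncSummary &S = I->second;
  for (unsigned i = 0; i < Args.size() && i < S.retDependsOnParam.size(); ++i)
    if (S.retDependsOnParam[i] && Args[i] == Uninitialized)
      return Uninitialized; // e.g., first(main2) with main2 uninitialized
  return Initialized;
}

With a summary for "first" of retDependsOnParam = {true}, the call first(main2) in your example would propagate main2's uninitialized state into main1.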

Given the above observation about the time at which the analysis is
performed, it seems to me that I at least need to postpone my analysis
until after everything is parsed. Does that make sense? Or is the
order of the output purely coincidental, in that it simply visits the
blocks in that order and prints out the diagnostics associated with
the subgraph?

libAnalysis currently consists of two static analysis "engines":

(1) a flow-sensitive dataflow solver
(2) a path-sensitive dataflow engine

Both are intra-procedural analysis engines, but there is no reason why one could not implement inter-procedural analyses with them.

(1) is used by LiveVariables and UninitializedValues. It merges information at confluence points using a specified "merge" operation. The uninitialized values analysis built on the flow-sensitive solver basically implements the -Wuninitialized check in gcc (a simple quick check for the use of uninitialized values).
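
To illustrate what such a "merge" operation is: for may-be-uninitialized facts, the natural merge at a confluence point is set union, since a variable may be uninitialized after the join if it may be uninitialized on either incoming edge. A minimal sketch (invented types, not the solver's actual interface):

#include <set>
#include <string>

// A dataflow fact set: the variables that may be uninitialized.
typedef std::set<std::string> MaybeUninit;

// Merge at a confluence point (e.g., after an if/else): union the
// facts flowing in from each predecessor edge.
MaybeUninit merge(const MaybeUninit &A, const MaybeUninit &B) {
  MaybeUninit Out = A;
  Out.insert(B.begin(), B.end());
  return Out;
}

The solver iterates the transfer functions and this merge until it reaches a fixed point.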

(2) is used by the retain/release memory checker, and it also does a bunch of built-in checks, such as checking for null dereferences, uses of uninitialized values along a specific path, etc.

I'll first talk about the flow-sensitive solver (1).

For (1), you basically implement visitor methods in your transfer function object (as you have already noted). Essentially, the method BlockStmt_Visit is called for each block-level expression in a CFG basic block. To implement your transfer function object, you typically subclass CFGStmtVisitor<...> as defined in CFGStmtVisitor.h, which implements default visitor methods that essentially do nothing. You then override the methods that you care about.

For example, the dataflow solver calls BlockStmt_Visit on a CallExpr*. The default implementation of BlockStmt_Visit calls BlockStmt_VisitCallExpr. The default implementation of BlockStmt_VisitCallExpr calls BlockStmt_VisitExpr. The default implementation of BlockStmt_VisitExpr then calls BlockStmt_VisitStmt. As you see, you override the method with the level of granularity that you care about.

Note that the visitor methods will not (by default) do a recursive walk of an expression. You either have to implement the recursion yourself (as done by UninitializedValues.cpp) or use an enhanced visitor class that does the recursion for you (as done in LiveVariables.cpp).
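
If it helps to picture the dispatch chain, here is a stand-alone model of it. This is NOT clang code (the real CFGStmtVisitor is a template over clang's Stmt hierarchy); it just mirrors the fall-through structure described above:

#include <cstdio>

// Model only: same fall-through structure as the dispatch chain above,
// so overriding one method intercepts everything at that granularity.
struct Stmt {};
struct Expr : Stmt {};
struct CallExpr : Expr {};

struct VisitorModel {
  virtual ~VisitorModel() {}
  virtual void BlockStmt_VisitStmt(Stmt &) {}  // most general handler
  virtual void BlockStmt_VisitExpr(Expr &E) { BlockStmt_VisitStmt(E); }
  virtual void BlockStmt_VisitCallExpr(CallExpr &CE) { BlockStmt_VisitExpr(CE); }
};

// A transfer function object overrides only the level it cares about;
// everything else keeps falling through to the more general handlers.
struct MyTransfer : VisitorModel {
  virtual void BlockStmt_VisitCallExpr(CallExpr &) {
    std::printf("call site: consult the callee's summary here\n");
  }
};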

To do inter-procedural analysis, essentially you either implement your own transfer function logic for CallExpr* (which may involve spawning off a new analysis for the called function or consulting a precomputed summary), or you use the intra-procedural dataflow analysis to create a data structure which you can then later use to do your IPA (i.e., you construct just the facts you need from the analysis of each function to do a combined IPA over multiple functions). You really have a lot of flexibility; it just depends on your analysis needs.

For (2), the transfer function mechanism right now is still in flux. I'm gradually building up the infrastructure to "layer" or "compose" various transfer functions. The path-sensitive solver operates on an explicit reachable state graph, where each node in the graph refers to a location in the CFG and a "state" representing the values of your analysis. The transfer functions simply generate new states/graph nodes. Some of the transfer function logic in place in the path-sensitive engine does some things you probably don't want to re-implement: constant propagation, local aliasing relationships, symbolic value tracking, and so on. The hope is that by having a layered approach one will be able to create different checkers that represent the composition of different sub-analyses, with the checker writer only caring about writing the code that checks the property they care about. This clean composition isn't there yet, but an example of how this can be done is the retain/release checker (CFRetain.cpp).
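
In miniature, the core data structure looks something like this (invented types, not the engine's real ones): each node pairs a CFG location with an analysis state, and transfer functions only ever create successor nodes, so every explored path remains intact in the graph:

#include <map>
#include <string>
#include <vector>

// Model only: a reachable-state graph node = (CFG location, state).
struct ProgramPoint { unsigned BlockID; unsigned StmtIndex; };
typedef std::map<std::string, int> State; // e.g., variable -> constant

struct ExplodedNode {
  ProgramPoint Loc;
  State St;
  std::vector<ExplodedNode*> Succs;
};

// A transfer function never mutates an existing node; it adds a
// successor carrying the post-state for the next program point.
ExplodedNode *transfer(ExplodedNode *Pred, ProgramPoint Next,
                       const State &Post) {
  ExplodedNode *N = new ExplodedNode();
  N->Loc = Next;
  N->St = Post;
  Pred->Succs.push_back(N);
  return N;
}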

In order to do IPA within a file, your best bet is probably to wait until the entire file (TranslationUnit) has been parsed. You can then visit the function declarations in any order you wish, build up the necessary mappings you need, etc. Note that this kind of IPA is restricted to analyzing the functions within a single file. If your function implementations are scattered across multiple source files, you'll need to do your IPA by analyzing your functions locally, building up the information you need to glue their analysis results together later, and then persistently storing those results on disk (for later composition). As I mentioned before, we're working on adding the necessary infrastructure to do full IPA across an entire codebase, but it's a big project and will take some time.

Ted,

many thanks for the exhaustive reply :slight_smile:

It sounds like a great idea!

I always get myself in this kind of trouble :wink:

The order and format of the output suggest that each function is
looked at separately, right after it has been parsed. Is that a
correct observation?

That is correct. These particular analyses are local (intra-procedural). The uninitialized values analysis (which in this case is a very simple check) doesn't know anything about the actual value of param1 in the context of the call to 'first' in 'main'.

Ok, I see.

So far I don't seem to have enough knowledge of the code
to verify this myself (note to self: spend more time in the debugger :))

Further, I've not been able to answer the question of whether there is
existing code for transfer functions that performs any chasing of
CallExprs. LiveVariables does not seem to have special code to
"follow" calls, although I see a bunch of other Visitxxx methods.

You can define arbitrary transfer functions for arbitrary expressions (see below).

Would it be as "simple" as doing something like what LiveVariables
does, but implementing VisitCallExpr and hooking that up? It seems
like it, but since I'm not familiar enough with it yet, I thought I'd
ask first before programming myself into a corner...

Sort of. The difficulty is handling the inter-procedural information. Basically, you need to write the glue yourself to capture the information from analyzing one function and propagate it when analyzing another. There are many ways to do this. Here is one generic way:

- Build "summaries" of each analyzed function, which summarizes the effect of calling that function (e.g., the function "first" returns "uninitialized" if "param1" is uninitialized). The detail of the summary depends on your analysis. Summaries can be computed out-of-order, lazily as you analyze one function and it calls another, etc. When you encounter a CallExpr when analyzing a function, you then consult the called function's summary to evaluate its effects.

There are many variations on how this can be carried out.

Yes, I was expecting that much of the glue code would have to be written from scratch, but it's good to know that I seem to have grasped the general idea of it.

Given the above observation about the time at which the analysis is
performed, it seems to me that I at least need to postpone my analysis
until after everything is parsed. Does that make sense? Or is the
order of the output purely coincidental, in that it simply visits the
blocks in that order and prints out the diagnostics associated with
the subgraph?

libAnalysis currently consists of two static analysis "engines":

(1) a flow-sensitive dataflow solver
(2) a path-sensitive dataflow engine

Both are intra-procedural analysis engines, but there is no reason why one could not implement inter-procedural analyses with them.

(1) is used by LiveVariables and UninitializedValues. It merges information at confluence points using a specified "merge" operation. The uninitialized values analysis built on the flow-sensitive solver basically implements the -Wuninitialized check in gcc (a simple quick check for the use of uninitialized values).

It seems that for answering "uninitialized vs. initialized" and "live vs. dead" questions, you really want to compute what I'll call "partial derivatives" of those functions. (My math prof would kill me if he read this...) This info can be trivially cached, and each function could be looked at individually, as long as you know what its actual parameters are. Trivially, that's true for its declared parameters, but globals would be "parameters" as well, complicating the issue. For those it would turn into a path-sensitive dataflow problem.
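
Spelled out, the "derivative" I have in mind would be something like this (a sketch with invented types):

#include <map>
#include <set>
#include <string>

// Per-function "partial derivative": for each output of a function
// (its return value, plus any global it writes), record which inputs
// (declared parameters, plus any globals it reads) it depends on.
struct Derivative {
  // "<ret>" or the name of a written global -> the set of parameter
  // and read-global names the output depends on.
  std::map<std::string, std::set<std::string> > DependsOn;
};

// For first() above: the return value depends only on param1, so the
// cached entry would be:
Derivative derivativeOfFirst() {
  Derivative D;
  D.DependsOn["<ret>"].insert("param1");
  return D;
}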

(2) is used by the retain/release memory checker, and it also does a bunch of built-in checks, such as checking for null dereferences, uses of uninitialized values along a specific path, etc.

I'll first talk about the flow-sensitive solver (1).

For (1), you basically implement visitor methods in your transfer function object (as you have already noted). Essentially, the method BlockStmt_Visit is called for each block-level expression in a CFG basic block. To implement your transfer function object, you typically subclass CFGStmtVisitor<...> as defined in CFGStmtVisitor.h, which implements default visitor methods that essentially do nothing. You then override the methods that you care about.

For example, the dataflow solver calls BlockStmt_Visit on a CallExpr*. The default implementation of BlockStmt_Visit calls BlockStmt_VisitCallExpr. The default implementation of BlockStmt_VisitCallExpr calls BlockStmt_VisitExpr. The default implementation of BlockStmt_VisitExpr then calls BlockStmt_VisitStmt. As you see, you override the method with the level of granularity that you care about.

That saves a lot of reading, thanks!

Note that the visitor methods will not (by default) do a recursive walk of an expression. You either have to implement the recursion yourself (as done by UninitializedValues.cpp) or use an enhanced visitor class that does the recursion for you (as done in LiveVariables.cpp).

Is there an inherent reason for having both approaches in the code, or is that purely historical? At first glance it sounds like you'd only want the enhanced visitor class, but I might be missing a key point here.

To do inter-procedural analysis, essentially you either implement your own transfer function logic for CallExpr* (which may involve spawning off a new analysis for the called function or consulting a precomputed summary), or you use the intra-procedural dataflow analysis to create a data structure which you can then later use to do your IPA (i.e., you construct just the facts you need from the analysis of each function to do a combined IPA over multiple functions). You really have a lot of flexibility; it just depends on your analysis needs.

A two-pass approach seems easier for getting a firm grip on the problem, but it probably requires more code. Having all the information available in the second pass might help in generating better messages, though. For example, I noticed that the warnings from my example are printed out of order compared to their appearance in the source code, which is understandable given your explanation above.
main.c:2:6: warning: Value stored to 'first1' during its initialization is never read
         int first1 = param1;
             ^ ~~~~~~
main.c:9:2: warning: Value stored to 'main1' is never read
         main1 = first(main2);
         ^ ~~~~~~~~~~~~
main.c:8:14: warning: use of uninitialized variable
         int main2 = main1;
                     ^
3 diagnostics generated.

For (2), the transfer function mechanism right now is still in flux. I'm gradually building up the infrastructure to "layer" or "compose" various transfer functions. The path-sensitive solver operates on an explicit reachable state graph, where each node in the graph refers to a location in the CFG and a "state" representing the values of your analysis. The transfer functions simply generate new states/graph nodes. Some of the transfer function logic in place in the path-sensitive engine does some things you probably don't want to re-implement: constant propagation, local aliasing relationships, symbolic value tracking, and so on. The hope is that by having a layered approach one will be able to create different checkers that represent the composition of different sub-analyses, with the checker writer only caring about writing the code that checks the property they care about. This clean composition isn't there yet, but an example of how this can be done is the retain/release checker (CFRetain.cpp).

I assume you mean CFRefCount.cpp? Looking at ArgEffect/RefEffect and RetainSummary, I think you implement a variant of the "derivative" idea from above. I guess for dealing with refcounting you don't need information on how each argument affects the return value (or values, for (pointer-to-)struct returns), though I can imagine situations where it would be helpful, for example inspecting the effect of a NULL parameter to a function that returns a newly malloc'ed and hopefully initialized struct, especially to check more reliably whether the client code implements all its null checks correctly. If you can prove that for one argument configuration the return value will always be NULL, that should greatly improve the analysis strength.
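
For instance (a made-up function), a summary fact like "returns NULL whenever size == 0" would let the analyzer flag exactly the callers that skip the null check on that path:

#include <stdlib.h>

/* Made-up example of a fact such a summary could encode:
   "make_buffer returns NULL whenever size == 0" (and, as with any
   malloc, possibly also on allocation failure). */
struct buffer { char *data; unsigned size; };

struct buffer *make_buffer(unsigned size) {
  if (size == 0)
    return NULL;                 /* provable: NULL whenever size == 0 */
  struct buffer *b = (struct buffer *)malloc(sizeof(struct buffer));
  if (b) {
    b->data = (char *)malloc(size);
    b->size = size;
  }
  return b;
}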

In order to do IPA within a file, your best bet is probably to wait until the entire file (TranslationUnit) has been parsed. You can then visit the function declarations in any order you wish, build up the necessary mappings you need, etc. Note that this kind of IPA is restricted to analyzing the functions within a single file. If your function implementations are scattered across multiple source files, you'll need to do your IPA by analyzing your functions locally, building up the information you need to glue their analysis results together later, and then persistently storing those results on disk (for later composition). As I mentioned before, we're working on adding the necessary infrastructure to do full IPA across an entire codebase, but it's a big project and will take some time.

Sure. And of course, yes, the code in question is scattered over a hundred or so translation units :\
Worse still, most of the code is only reachable through function pointers, because it is written in C and heavily plugin-based. However, I'm happy to find a specific solution for the set of possible targets of those pointers, because the set is small (5-10 implemented functions per plugin) and pretty stable.
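
Given such a small, stable set, I imagine a hand-maintained table of possible targets per callback slot would be all the "points-to analysis" I need (a sketch; all names invented):

#include <map>
#include <set>
#include <string>

// Each plugin implements only a handful of known entry points, so
// indirect calls can be resolved via a hand-maintained table instead
// of a general points-to analysis. For example, the entry for slot
// "plugin_ops.open" would contain "foo_open" and "bar_open".
typedef std::set<std::string> TargetSet;
typedef std::map<std::string, TargetSet> TargetTable;

// At an indirect call through a known slot, analyze (or consult the
// summaries of) every function in that slot's target set.
const TargetSet &resolveTargets(const TargetTable &Table,
                                const std::string &Slot) {
  static const TargetSet Empty;
  TargetTable::const_iterator I = Table.find(Slot);
  return I == Table.end() ? Empty : I->second;
}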

In light of persisting the analysis results, I think that would mean changes to clang.cpp, as it seems to operate on translation units as its coarsest granularity, correct?
The full IPA tool would need to pick up all the persisted files and make sense of them; whether that's clang or not, it would have to resolve dependencies almost like a linker (unless all analysis results go into one file that is read back in). Hmm, I'll have to think about that one.
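
Roughly what I picture (a sketch; the one-line-per-function "summary file" format is entirely invented): each translation unit's analysis dumps its per-function summaries to a file, and a separate linker-like pass loads and merges them:

#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Merge per-TU summary dumps (e.g., lines like "first ret:param1")
// into one whole-program table, against which cross-TU calls can
// then be resolved.
std::map<std::string, std::string>
loadSummaries(const std::vector<std::string> &Files) {
  std::map<std::string, std::string> Table;
  for (unsigned i = 0; i < Files.size(); ++i) {
    std::ifstream In(Files[i].c_str());
    std::string Line;
    while (std::getline(In, Line)) {
      std::istringstream LS(Line);
      std::string Func, Summary;
      LS >> Func;                 // first token: function name
      std::getline(LS, Summary);  // rest of the line: its summary
      Table[Func] = Summary;      // one entry per function
    }
  }
  return Table;
}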

cheers,
-k