Path explosion Problem

Hi clang,

While testing the UncheckedReturn Checker, i got a path explosion problem while clang static-analyzer analyze a giant function that has a huge mounts of paths.

For example the function BZ2_decompress in “bzip2.c”. The source code of “bzip2.c” can be found here http://pastebin.com/BzzPEWrs.

When i executed static-analyzer with the command line “-cc1 -analyze -analyzer-checker=core.
experimental.UncheckedReturn -analyzer-store region /home/polo/test/largetest/bzip2.c”, it worked OK. But the result was not exactly what i want, since the analyzer reached the maximum number of exploded nodes. So i added “-analyzer-max-nodes 0” to the command line, after a while the static-analyzer crashed because exhausted all my memory(about 3G).

I thought it was my fault in the UncheckedReturn checker, but after i tried some other checkers with the same command line i found it was a path explotion. And i found with “-analyze-function BZ2_decompress” the analyzer still crashed.

terminate called after throwing an instance of ‘std::bad_alloc’
what(): std::bad_alloc
0 clang 0x09fc26a7
1 clang 0x09fc2434
2 0x007c3400 __kernel_sigreturn + 0
3 libc.so.6 0x0021aa82 abort + 386
4 libstdc++.so.6 0x009e152f __gnu_cxx::__verbose_terminate_handler() + 335
5 libstdc++.so.6 0x009df465
6 libstdc++.so.6 0x009df4a2
7 libstdc++.so.6 0x009df5e1
8 libstdc++.so.6 0x009dfc5f operator new(unsigned int) + 127
9 clang 0x092a7da0
10 clang 0x092a7345
11 clang 0x092a648d
12 clang 0x092a53da
13 clang 0x092a4489
14 clang 0x092a39b0
15 clang 0x092a1ece
16 clang 0x092a2895
17 clang 0x0927546f clang::ento::GRStateManager::removeDeadBindings(clang::ento::GRState const*, clang::StackFrameContext const*, clang::ento::SymbolReaper&) + 239
18 clang 0x092577fb clang::ento::ExprEngine::ProcessStmt(clang::CFGStmt, clang::ento::StmtNodeBuilder&) + 371
19 clang 0x09257610 clang::ento::ExprEngine::processCFGElement(clang::CFGElement, clang::ento::StmtNodeBuilder&) + 150
20 clang 0x092474ae clang::ento::CoreEngine::HandlePostStmt(clang::CFGBlock const*, unsigned int, clang::ento::ExplodedNode*) + 266
21 clang 0x092468a0 clang::ento::CoreEngine::ExecuteWorkList(clang::LocationContext const*, unsigned int, clang::ento::GRState const*) + 1102
22 clang 0x0917e9e0 clang::ento::ExprEngine::ExecuteWorkList(clang::LocationContext const*, unsigned int) + 54
23 clang 0x0917d1c6
24 clang 0x0917d2bd
25 clang 0x0917d343
26 clang 0x0917d090
27 clang 0x0917ca97
28 clang 0x0917cd31
29 clang 0x08e6698f clang::ParseAST(clang::Sema&, bool) + 617
30 clang 0x08bcdf63 clang::ASTFrontendAction::ExecuteAction() + 253
31 clang 0x08bcdbbe clang::FrontendAction::Execute() + 328
32 clang 0x08bb5fe7 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) + 779
33 clang 0x08b5ccb7 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) + 835
34 clang 0x08b4f485 cc1_main(char const**, char const**, char const*, void*) + 1014
35 clang 0x08b588d7 main + 521
36 libc.so.6 0x00203bd6 __libc_start_main + 230
37 clang 0x08b4eb61
Stack dump:
0. Program arguments: clang -cc1 -analyze -analyzer-checker=unix.experimental.Chroot -analyzer-store region -analyze-function BZ2_decompress -analyzer-max-nodes 0 /home/polo/test/largetest/bzip2.c

  1. parser at end of file
  2. /home/polo/test/largetest/bzip2.c:3443:4: Error evaluating statement
    [1]- Killed emacs
    Aborted

So here’s my problem, if we want to gather path-sensitive statistical infomation, we probably need to analyze all the paths. But the upper problem didn’t allow us to do so.

IMO, there may be several ways overcome this:

  1. Increase my computer’s memory…but i think it may not solve the problem.
  2. Change the worklist Algorithm form BFS to DFS, and after a path was analyzed, release the memory generated in current path analyze. Is this feasible or useful?
  3. Or is there any other way to compromise?
    ps: We should not let clang crashed even if the memory exhausted, right?

Hi Lei,

Running out of memory is a real problem that needs a general solution. I have been mulling this over myself, and have a few ideas that match with #2 and #3.

Before I get into those, I think it is worth pondering the fact that unless we force the path exploration to combine paths, there will always be cases where we don’t exhaustively explore all paths. For example, consider the kind of dataflow analysis done in many compiler analyses: paths are always merged at confluence points, guaranteeing convergence and bounding memory usage. The path-sensitive analysis doesn’t force this kind of convergence (although paths converge when the states are the same), so if you are interested in forcing convergence I think we’ll need something else in the analyzer that isn’t there right now.

Let’s talk about #2. Essentially, the wordlist algorithm (which currently is either BFS or DFS) is all about prioritizing what part of the state space gets explored first. My goal is that we eventually have other intelligent algorithms besides BFS and DFS, that allow us to exploit even more information. Thus, switching between BFS and DFS to solve this problem isn’t the answer. Switching between those should help if, say, you want to bound the amount of work done by the analyzer (which we do) and want to prioritize that some types of paths are prioritized over others. This may help you find different kinds of bugs more effectively. For “all paths” checkers, like yours, BFS will tend to cover paths that end sooner, but it won’t explore paths that require going deeper into the state space (like DFS does). This will directly impact the results you get from your checker, but those are just tradeoffs in the results. It doesn’t actually solve the scalability problem. If the number of paths to explore are too costly to enumerate, we are going to run into memory problems regardless. Put another way, DFS and BFS are just enumerating paths in a different order. The cost to enumerate all of them is the same.

To tackle memory scalability, the direction I have been gradually pushing the analyzer is to have the ability to gradually reclaim ExplodedNodes (and states) as the graph grows larger. It use to be that we would just generate GRStates, many which would just get thrown away and not included in ExplodedNodes, and not reclaim that memory until we reclaimed all the memory for an analysis. Now GRStates are reference counted, so we can reclaim some of the memory more eagerly. To take this further, we can possibly prune out parts of the ExplodedGraph as it grows if we can deem those parts “uninteresting.” I think this is very doable, but it’s a lot trickier than it sounds, and it requires consulting which ExplodedNodes are currently referenced by BugReports. If we can identify paths that are essentially uninteresting, nor are deemed critical by a checker, then we can potentially throw away those nodes. Note that when we do this, we are throwing away analysis state. If we throw away a path, only to explore another path that would have merged with the discarded path, we are going to end up analyzing the same path again. That’s an inherit time-space tradeoff.

So, I think the solution to your problem requires a combination of several approaches:

a) Make path analysis more scalable in general by more eagerly reclaiming memory when the ExplodedGraph starts to get large. This is essentially a “garbage collection” of uninteresting paths.

b) Because (a) doesn’t completely “solve” the path explosion problem (i.e., it doesn’t guarantee that all paths are explored), if exploring all paths matters to you we will need to devise a strategy where we can (optionally) merge paths. This requires being able to define a merge operation for everything in GRState, including checker-specific state. It will then be up to some policy to decide when the merge takes place, as merging too frequently will cause us to discard too much information (i.e., it basically reduces to a flow-sensitive analysis).

Ted