Thank you for taking a crack at looking at how to speed up the analyzer. The last time this was done was years ago, and I’m glad to see some fresh eyes taking a look at it.
I think to answer your questions I think it is worth stepping back and understanding a bit better how the pieces of the analyzer work together, and how they accomplish things as a whole. While the profile data is illuminating, I don’t think it provides a total picture of what is going on.
To put things in perspective, the static analyzer employs a path-sensitive dataflow analysis, or analogously, a symbolic execution. In the worst case this is a super exponential algorithm. Analysis is essentially driven by a worklist algorithm with a finite number of items that can be removed from the worklist, which is essentially bounds the amount of work. That still means the analyzer can take a long time, but it is guaranteed to terminate.
Because the algorithm is super exponential, in practice the real performance wins are usually algorithmic. Essentially, they can fall into two categories:
1. Don’t repeat the same work. The main example of this in the analyzer right now is to perform memoization so that the same path is not analyzed more than once. The FoldingSet code that you see show up in the profile is crucial to this memoization. It’s not surprising that it shows up a lot, but keep in mind that it is being used to do intelligent caching so that we don’t get super exponential complexity in practice.
2. Do the same work more intelligently in a way that, algorithmically, takes less time. One example of this is that currently the analyzer unrolls loops 3 times, but there are potential ways to analyze some loops without unrolling. You thus can get the same analysis quality with less work, and increase scalability.
Speeding up the raw data structures in isolation can sometimes be a win, but usually this results in a constant factor win at best. That really doesn’t chip away at the scalability problem. From your email, it sounds like you’re initially interested in reducing the current analysis time with keeping the analyzer doing the same amount of work. That’s great, but also keep in mind there is a bunch of analysis that is not done right now because the analyzer doesn’t scale. Finding ways to truly make the analyzer scale to larger problems requires much deeper changes to the analyzer, but will likely result in far more profound performance wins in the end.
Another crucial thing to take into account when profiling the analyzer is memory usage. Some of the current tradeoffs were reached because keeping the memory usage down is a critical part of making the analyzer perform. There’s an immense amount of parallelism in the analyzer’s analysis; the most basic one that is used now is that multiple translation units can be analyzed in parallel, which means N number of clang processes doing static analysis (where N is the number of cores the machine can saturate). If the memory usage is too high, the machine will start to thrash. More intelligent scheduling of analyzer processes is also another way to increase the analyzer’s scalability in practice, and this is not a task anyone has taken on. There’s also an immense amount of parallelism to be exploited when analyzing a single translation unit. That can be done in two ways: parallelize the current worklist algorithm, which will be tricky (but not impossible) since many of the data structures used are not thread-safe, or move more towards a summary-based algorithm approach the a few of us having been chatting about for years (which would be the basis of doing whole program, interprocedural analysis). Both are non-trivial efforts, but are likely to achieve HUGE wins in performance and scalability because they algorithmically scale up the performance of the analyzer.
I’ll respond to your other points inline, since they have important context.
I am trying to speed up the static analyzer. I also haven’t really worked on LLVM or Clang much in the past, so some of this may be old news, or naïve, or just plain wrong. Regardless, here are some of the things I’ve discovered while profiling and experimenting in the code base:
1. Getting consistent static analysis times is difficult. This has been making it difficult for me to quantify the change in performance as I make changes. If anyone has suggestions, or even a standard benchmark to use, then it would help me out a lot.
2. For my toy test case, roughly 10% of the CPU’s time is spent in the checkers (inclusive time / time including children). This seems low.
3. Roughly 20% (inclusive) is spent in the Immutable family of data structures.
4. Roughly 25% (inclusive) is spent in the llvm::FoldingSet code. Much of this is memory / cache bound.
Neither 3 or 4 is surprising. This is basically the meat-and-potatoes of the analyzer engine. Making these faster would obviously speed up the analyzer.
5. In my toy test case, ExplodedGraph::getNode returns a new node ~99% of the time. The majority of the old nodes found are purge nodes, but there are a few non-purge nodes that get returned.
With all this in mind, it seems reasonable for me to focus on the data structures behind the static analyzer, rather than on the checkers.
To be honest, I’m not convinced this is the best investment of your time. I don’t want to discourage you, but aren’t going to get the huge wins you allude to in the following sentence:
I suspect that there are some algorithmic changes hiding that could improve the speed by orders of magnitude, if I only understood the underlying code and their assumptions better.
That said, speeding up the core data structures could be instructive… but you must do it with GREAT care. The data structures chosen have specific semantic properties which make them viable. I’ll comment on them more below. But I do think there are wins to be made here, but a key requirement is that the memory usage of the analyzer doesn’t greatly increase. In fact, I’d be willing to slow the analyzer down slightly if we could save more memory. Using less memory makes the analyzer more viable on more complicated projects where we tend to hit the maximum analysis time.
For example, with ExplodedGraph::Node, I think I am averaging 3 cache misses per ExplodedGraph::getNode call in order to support deduplication, even though less than 1% of lookups are going to be deduplicated. Does anyone on this list know if there are some reasonable criteria to know in advance if a node needs to be stored for later deduplication? If we know that a node won’t / doesn’t need to be reduplicated, then we could insert the node blindly into the hash map without looking at all the colliding nodes.
Yes, there are a bunch of heuristics you can employ, but I think you’ll need to do a lot more profiling to be sure.
Essentially the de-duplication happens when paths can possibly merge together. I can see this in three places:
(1) At confluence points in the CFG, where multiple branches merge together at one basic block. At that point, multiple paths may have the same ProgramState, and thus one of them can “cache out”. Profiling would show this in practice.
(2) After nodes generated by a checker. Checkers can generate new ProgramStates that after reaching some critical point in the analysis have information removed or otherwise reach an equilibrium, and thus ExplodedNodes from different paths may suddenly have the same ProgramStates (and thus cache out). This is only a theory, however, in how much of an impact this really has in practice. Profiling would show this in practice.
(3) After a ProgramState has been cleaned out using RemoveDeadBindings. The ProgramState returned by RemoveDeadBindings may have a bunch of data cleansed from it that made it more unique from other states. At that point, the path is more likely to cache out. Again a theory, and one that can be verified by more profiling.
When profiling, you’ll also want to analyze a variety of codebases, as the characteristics could be very different.
The other big area that I’m confused by is the use and purpose of the ImmutableMap typedef Environment::BindingsTy, which maps EnvironmentEntry keys to SVal values. 5% of my time is getting spent in this instantiations ImmutableMap::add, largely because of a call to getCanonicalTree. Why is ImmutableMap being used instead of alternative associative types? Of particular note is that even iteration is expensive because ImmutableMap nodes can’t know their parent. There are some other ImmutableMaps being used, but I’m hoping that explanations of the EnvironmentEntry->SVal instantiation will help me understand the other instantiations.
Stepping back, ProgramStates are immutable values. The symbolic simulation works by keeping all ProgramStates along a path, and generated a reachable state graph (the ExplodedGraph). Since ProgramStates are likely to differ only slightly from each other, we use a functional data structure, ImmutableMap, that represents a mapping using a functional AVL tree. When you add a value to an ImmutableMap, you really just create a new ImmutableMap and the original stays intact. The trick is that ImmutableMaps *share* large amounts of their map representation (subtrees of the AVL tree) between each other, which represents a significant memory savings. Alternatively, we could use a reference-counted std::map or something, and when we needed to add a value to the map to create a new ProgramState would could copy the entire map. That would likely be very wasteful in practice. Also, ImmutableMap uses a BumpAllocator, which is very fast for allocations. This is the reason we use ImmutableMap.
But… your profiling data hits on something I saw before. getCanonicalTree() gets called when we unique the maps. This uniquing is critical for de-duplication of ProgramStates, but we don’t always need to de-duplicate ProgramStates unless we think there is a strong possibility we will want to merge two paths together. That said, recall I mentioned that saving memory was also important. We also want to de-duplicate ImmutableMaps, since they are used everywhere, for memory savings. Again, I think profiling data could be the key here, and we can be in the position where we de-duplicate some ImmutableMaps we use less often than others based depending on what the profiling data says. But the profiling data must include memory usage as a factor; otherwise it would be easy to optimize these CPU numbers but tank memory usage.
Another thing to explore is that for smaller maps maybe we don’t want an ImmutableMap at all, but just something like an ImmutableList to record a bunch of key value pairs. Once we de-duplicate states, we could canonicalize around use ImmutableMaps when creating a canonicalized ProgramState, or we can switch to ImmutableMap once the list gets too big. For example, Environment uses an ImmutableMap, but there is an access pattern to values in an Environment. Typically we just want the subexpressions of the current expression. If we are working with a binary operator or anything with multiple subexpressions an ImmutableMap might be good, but if there is only one subexpression sometimes an ImmutableMap is overkill. Or perhaps we can create a data structure that is a mixture of both ImmutableLists and ImmutableMaps, chaining them together. Why would we do this? It’s less work to manage a small list and an immutable map, but it would be an increase in implementation complexity (that could possibly be hidden by the right abstractions).
1. Is there a good, existing benchmark for static analysis that I should be using?
In the past, I’ve often looked at:
These were good stable C baselines. I’ve also analyzed the Python codebase in the past, which also exhibited some of the more algorithmic issues in the analyzer (analyzing the interpreter code in Python has to deal with a bunch of branches and switch statements). Sqlite3 is nice since it is one file.
LLVM/Clang is also a good codebase for C++. I think we should also pick some other C++ codebases.
2. Is there some reasonable criteria to know in advance if an ExplodedNode needs to be stored for later reduplication?
See my comment above.
3. Why is ImmutableMap<EnvironmentEntry, SVal> being used instead of std::map, std::hash_map, or even llvmFoldingSet?
Hopefully I’ve answered this question.