[Analyzer] Attempting to speed up static analysis

I am trying to speed up the static analyzer. I also haven’t really worked on LLVM or Clang much in the past, so some of this may be old news, or naïve, or just plain wrong. Regardless, here are some of the things I’ve discovered while profiling and experimenting in the code base:

  1. Getting consistent static analysis times is difficult. This has been making it difficult for me to quantify the change in performance as I make changes. If anyone has suggestions, or even a standard benchmark to use, then it would help me out a lot.

  2. For my toy test case, roughly 10% of the CPU’s time is spent in the checkers (inclusive time / time including children). This seems low.

  3. Roughly 20% (inclusive) is spent in the Immutable family of data structures.

  4. Roughly 25% (inclusive) is spent in the llvm::FoldingSet code. Much of this is memory / cache bound.

  5. In my toy test case, ExplodedGraph::getNode returns a new node ~99% of the time. The majority of the old nodes found are purge nodes, but there are a few non-purge nodes that get returned.

With all this in mind, it seems reasonable for me to focus on the data structures behind the static analyzer, rather than on the checkers. I suspect that there are some algorithmic changes hiding that could improve the speed by orders of magnitude, if I only understood the underlying code and their assumptions better.

For example, with ExplodedGraph::Node, I think I am averaging 3 cache misses per ExplodedGraph::getNode call in order to support deduplication, even though less than 1% of lookups are going to be deduplicated. Does anyone on this list know if there are some reasonable criteria to know in advance if a node needs to be stored for later deduplication? If we know that a node won’t / doesn’t need to be deduplicated, then we could insert the node blindly into the hash map without looking at all the colliding nodes.

The other big area that I’m confused by is the use and purpose of the ImmutableMap typedef Environment::BindingsTy, which maps EnvironmentEntry keys to SVal values. 5% of my time is getting spent in this instantiations ImmutableMap::factory::add, largely because of a call to getCanonicalTree. Why is ImmutableMap being used instead of alternative associative types? Of particular note is that even iteration is expensive because ImmutableMap nodes can’t know their parent. There are some other ImmutableMaps being used, but I’m hoping that explanations of the EnvironmentEntry->SVal instantiation will help me understand the other instantiations.

Question summary:

  1. Is there a good, existing benchmark for static analysis that I should be using?

  2. Is there some reasonable criteria to know in advance if an ExplodedNode needs to be stored for later deduplication?

  3. Why is ImmutableMap<EnvironmentEntry, SVal> being used instead of std::map, std::hash_map, or even llvmFoldingSet?

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

Hi,

This profiling work is really interesting!

3. Why is ImmutableMap<EnvironmentEntry, SVal> being used instead of std::map, std::hash_map, or even llvmFoldingSet?

I think I can answer #3 in a very general way. IIRC ImmutableMap is cheaper to copy than std::*map, and frequent copies are required when forking ProgramState objects.

We need to fork ProgramState objects often (e.g When creating a sink node, or when analyzing a branch).

best
vedant

Hi Ben,

Thank you for taking a crack at looking at how to speed up the analyzer. The last time this was done was years ago, and I’m glad to see some fresh eyes taking a look at it.

I think to answer your questions I think it is worth stepping back and understanding a bit better how the pieces of the analyzer work together, and how they accomplish things as a whole. While the profile data is illuminating, I don’t think it provides a total picture of what is going on.

To put things in perspective, the static analyzer employs a path-sensitive dataflow analysis, or analogously, a symbolic execution. In the worst case this is a super exponential algorithm. Analysis is essentially driven by a worklist algorithm with a finite number of items that can be removed from the worklist, which is essentially bounds the amount of work. That still means the analyzer can take a long time, but it is guaranteed to terminate.

Because the algorithm is super exponential, in practice the real performance wins are usually algorithmic. Essentially, they can fall into two categories:

1. Don’t repeat the same work. The main example of this in the analyzer right now is to perform memoization so that the same path is not analyzed more than once. The FoldingSet code that you see show up in the profile is crucial to this memoization. It’s not surprising that it shows up a lot, but keep in mind that it is being used to do intelligent caching so that we don’t get super exponential complexity in practice.

2. Do the same work more intelligently in a way that, algorithmically, takes less time. One example of this is that currently the analyzer unrolls loops 3 times, but there are potential ways to analyze some loops without unrolling. You thus can get the same analysis quality with less work, and increase scalability.

Speeding up the raw data structures in isolation can sometimes be a win, but usually this results in a constant factor win at best. That really doesn’t chip away at the scalability problem. From your email, it sounds like you’re initially interested in reducing the current analysis time with keeping the analyzer doing the same amount of work. That’s great, but also keep in mind there is a bunch of analysis that is not done right now because the analyzer doesn’t scale. Finding ways to truly make the analyzer scale to larger problems requires much deeper changes to the analyzer, but will likely result in far more profound performance wins in the end.

Another crucial thing to take into account when profiling the analyzer is memory usage. Some of the current tradeoffs were reached because keeping the memory usage down is a critical part of making the analyzer perform. There’s an immense amount of parallelism in the analyzer’s analysis; the most basic one that is used now is that multiple translation units can be analyzed in parallel, which means N number of clang processes doing static analysis (where N is the number of cores the machine can saturate). If the memory usage is too high, the machine will start to thrash. More intelligent scheduling of analyzer processes is also another way to increase the analyzer’s scalability in practice, and this is not a task anyone has taken on. There’s also an immense amount of parallelism to be exploited when analyzing a single translation unit. That can be done in two ways: parallelize the current worklist algorithm, which will be tricky (but not impossible) since many of the data structures used are not thread-safe, or move more towards a summary-based algorithm approach the a few of us having been chatting about for years (which would be the basis of doing whole program, interprocedural analysis). Both are non-trivial efforts, but are likely to achieve HUGE wins in performance and scalability because they algorithmically scale up the performance of the analyzer.

I’ll respond to your other points inline, since they have important context.

I am trying to speed up the static analyzer. I also haven’t really worked on LLVM or Clang much in the past, so some of this may be old news, or naïve, or just plain wrong. Regardless, here are some of the things I’ve discovered while profiling and experimenting in the code base:

1. Getting consistent static analysis times is difficult. This has been making it difficult for me to quantify the change in performance as I make changes. If anyone has suggestions, or even a standard benchmark to use, then it would help me out a lot.
2. For my toy test case, roughly 10% of the CPU’s time is spent in the checkers (inclusive time / time including children). This seems low.
3. Roughly 20% (inclusive) is spent in the Immutable family of data structures.
4. Roughly 25% (inclusive) is spent in the llvm::FoldingSet code. Much of this is memory / cache bound.

Neither 3 or 4 is surprising. This is basically the meat-and-potatoes of the analyzer engine. Making these faster would obviously speed up the analyzer.

5. In my toy test case, ExplodedGraph::getNode returns a new node ~99% of the time. The majority of the old nodes found are purge nodes, but there are a few non-purge nodes that get returned.

With all this in mind, it seems reasonable for me to focus on the data structures behind the static analyzer, rather than on the checkers.

To be honest, I’m not convinced this is the best investment of your time. I don’t want to discourage you, but aren’t going to get the huge wins you allude to in the following sentence:

  I suspect that there are some algorithmic changes hiding that could improve the speed by orders of magnitude, if I only understood the underlying code and their assumptions better.

That said, speeding up the core data structures could be instructive… but you must do it with GREAT care. The data structures chosen have specific semantic properties which make them viable. I’ll comment on them more below. But I do think there are wins to be made here, but a key requirement is that the memory usage of the analyzer doesn’t greatly increase. In fact, I’d be willing to slow the analyzer down slightly if we could save more memory. Using less memory makes the analyzer more viable on more complicated projects where we tend to hit the maximum analysis time.

For example, with ExplodedGraph::Node, I think I am averaging 3 cache misses per ExplodedGraph::getNode call in order to support deduplication, even though less than 1% of lookups are going to be deduplicated. Does anyone on this list know if there are some reasonable criteria to know in advance if a node needs to be stored for later deduplication? If we know that a node won’t / doesn’t need to be reduplicated, then we could insert the node blindly into the hash map without looking at all the colliding nodes.

Yes, there are a bunch of heuristics you can employ, but I think you’ll need to do a lot more profiling to be sure.

Essentially the de-duplication happens when paths can possibly merge together. I can see this in three places:

(1) At confluence points in the CFG, where multiple branches merge together at one basic block. At that point, multiple paths may have the same ProgramState, and thus one of them can “cache out”. Profiling would show this in practice.

(2) After nodes generated by a checker. Checkers can generate new ProgramStates that after reaching some critical point in the analysis have information removed or otherwise reach an equilibrium, and thus ExplodedNodes from different paths may suddenly have the same ProgramStates (and thus cache out). This is only a theory, however, in how much of an impact this really has in practice. Profiling would show this in practice.

(3) After a ProgramState has been cleaned out using RemoveDeadBindings. The ProgramState returned by RemoveDeadBindings may have a bunch of data cleansed from it that made it more unique from other states. At that point, the path is more likely to cache out. Again a theory, and one that can be verified by more profiling.

When profiling, you’ll also want to analyze a variety of codebases, as the characteristics could be very different.

The other big area that I’m confused by is the use and purpose of the ImmutableMap typedef Environment::BindingsTy, which maps EnvironmentEntry keys to SVal values. 5% of my time is getting spent in this instantiations ImmutableMap::factory::add, largely because of a call to getCanonicalTree. Why is ImmutableMap being used instead of alternative associative types? Of particular note is that even iteration is expensive because ImmutableMap nodes can’t know their parent. There are some other ImmutableMaps being used, but I’m hoping that explanations of the EnvironmentEntry->SVal instantiation will help me understand the other instantiations.

Stepping back, ProgramStates are immutable values. The symbolic simulation works by keeping all ProgramStates along a path, and generated a reachable state graph (the ExplodedGraph). Since ProgramStates are likely to differ only slightly from each other, we use a functional data structure, ImmutableMap, that represents a mapping using a functional AVL tree. When you add a value to an ImmutableMap, you really just create a new ImmutableMap and the original stays intact. The trick is that ImmutableMaps *share* large amounts of their map representation (subtrees of the AVL tree) between each other, which represents a significant memory savings. Alternatively, we could use a reference-counted std::map or something, and when we needed to add a value to the map to create a new ProgramState would could copy the entire map. That would likely be very wasteful in practice. Also, ImmutableMap uses a BumpAllocator, which is very fast for allocations. This is the reason we use ImmutableMap.

But… your profiling data hits on something I saw before. getCanonicalTree() gets called when we unique the maps. This uniquing is critical for de-duplication of ProgramStates, but we don’t always need to de-duplicate ProgramStates unless we think there is a strong possibility we will want to merge two paths together. That said, recall I mentioned that saving memory was also important. We also want to de-duplicate ImmutableMaps, since they are used everywhere, for memory savings. Again, I think profiling data could be the key here, and we can be in the position where we de-duplicate some ImmutableMaps we use less often than others based depending on what the profiling data says. But the profiling data must include memory usage as a factor; otherwise it would be easy to optimize these CPU numbers but tank memory usage.

Another thing to explore is that for smaller maps maybe we don’t want an ImmutableMap at all, but just something like an ImmutableList to record a bunch of key value pairs. Once we de-duplicate states, we could canonicalize around use ImmutableMaps when creating a canonicalized ProgramState, or we can switch to ImmutableMap once the list gets too big. For example, Environment uses an ImmutableMap, but there is an access pattern to values in an Environment. Typically we just want the subexpressions of the current expression. If we are working with a binary operator or anything with multiple subexpressions an ImmutableMap might be good, but if there is only one subexpression sometimes an ImmutableMap is overkill. Or perhaps we can create a data structure that is a mixture of both ImmutableLists and ImmutableMaps, chaining them together. Why would we do this? It’s less work to manage a small list and an immutable map, but it would be an increase in implementation complexity (that could possibly be hidden by the right abstractions).

Question summary:
1. Is there a good, existing benchmark for static analysis that I should be using?

In the past, I’ve often looked at:

- sqlite3.c
- postgresql

These were good stable C baselines. I’ve also analyzed the Python codebase in the past, which also exhibited some of the more algorithmic issues in the analyzer (analyzing the interpreter code in Python has to deal with a bunch of branches and switch statements). Sqlite3 is nice since it is one file.

LLVM/Clang is also a good codebase for C++. I think we should also pick some other C++ codebases.

2. Is there some reasonable criteria to know in advance if an ExplodedNode needs to be stored for later reduplication?

See my comment above.

3. Why is ImmutableMap<EnvironmentEntry, SVal> being used instead of std::map, std::hash_map, or even llvmFoldingSet?

Hopefully I’ve answered this question.

Cheers,
Ted

Hi,

Thank you Ted for this great writeup on the analyzer internals! I have noticed one possible algorithmic performance improvement. See my comment inline.

That’s a great insight. That indeed could be a huge win. We would need to evaluate both the memory and CPU impact of the change, and big-O isn’t everything. Sometimes the constant factor is important, especially if the majority of maps have a small number of elements.

Thanks Ted. Great information. I was hoping you would chime in J

RE: FoldingSet

Some of the reasons that I’m asking about de-duplication is that I’ve already attempted to change things in some reckless experiments. My quick-and-dirty attempts at removing the ExplodedGraph cache “worked” (for some value of working) and was faster, but the analysis quality was lowered enough that check-clang failed. I made a few feeble attempts at being selective with de-duplication, but those haven’t worked out either.

I have had some success with forking + modifying llvm::FoldingSet. Instead of hashing in terms of a FoldingSetID, I just have a trait with a Hash and an Equals method. This gives me a lot more flexibility in calculating the hash more quickly. I even went as far as caching the hash on the ExplodedNode itself. This is looking like it will result in a ~9% speedup, at the cost of an extra 4 bytes per ExplodedNode, and the (possibly temporary) maintenance burden of an extra FoldingSet class.

I’ve considered trying to use a b-tree instead of a hash map. If (and that’s a big if) I can find / construct a small comparison key that is accurate and encourages good access locality, then I think it may end up being a more cache friendly structure than the hash map. I think that the insertions and lookups happen ‘near’ each other in terms of the AST, so I think the cache-friendliness can likely outweigh the downgrade from O(1) to O(lg n). But that does all hinge on making a small accurate comparison key.

RE: ImmutableMap

I’m going to rephrase what you said to check for my own understanding.

Instead of ImmutableMap, a naïve implementation could have stored a std::map (or some other associative container) instead. Doing so would get the same quality of analysis. However, it would use a lot more memory, and the allocations would be slower, since they would be using the system / mallocator instead of the bump allocator. The ImmutableMap is taking advantage of the fact that the program state changes in little bits and pieces, instead of being drastically different from one node to the next. The ImmutableMap allows multiple ExplodedNodes to share most of the common program state elements, without mutating the old elements.

If that is accurate, then I can definitely see why an ImmutableMap is being used here. I’ll have to think more about what can change. I’ll also keep finger search trees in mind.

RE: General scalability

For now, I am just trying to make the existing 150,000 iterations provide the same quality in less time. Longer term, it may make sense for me to look at things like the summary based algorithm. One other hand-wavey approach that I’ve considered is to save off a file with an intermediate analysis state, so that future analysis runs can reuse the parts of that analysis that haven’t changed. Both of those are going to need to wait until I’ve gotten more familiar with the code base.

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

Hi Ben,

Great points, and I’ve provided responses inline.

Thanks Ted. Great information. I was hoping you would chime in J

RE: FoldingSet
Some of the reasons that I’m asking about de-duplication is that I’ve already attempted to change things in some reckless experiments. My quick-and-dirty attempts at removing the ExplodedGraph cache “worked” (for some value of working) and was faster, but the analysis quality was lowered enough that check-clang failed. I made a few feeble attempts at being selective with de-duplication, but those haven’t worked out either.

I have had some success with forking + modifying llvm::FoldingSet. Instead of hashing in terms of a FoldingSetID, I just have a trait with a Hash and an Equals method. This gives me a lot more flexibility in calculating the hash more quickly. I even went as far as caching the hash on the ExplodedNode itself. This is looking like it will result in a ~9% speedup, at the cost of an extra 4 bytes per ExplodedNode, and the (possibly temporary) maintenance burden of an extra FoldingSet class.

I’ll be honest that an extra 4 bytes per ExplodedNode does make me cringe, but maybe that is not so bad. I’m totally fine with ditching FoldingSetNodeID; that was just a way to get some hashing based on a mechanism already in place in LLVM.

It may be more profitable to put the cached hash in ProgramState. Conceptually, ExplodedNode is just a <ProgramState, ProgramPoint> pair, and both ProgramPoint and ProgramState should be independently hashable. ProgramPoints should be quickly hashable on their own, but ProgramStates are very expensive to both hash and semantically compare. That’s why we do all sorts of uniquing with the ImmutableMaps so that we can often use referential identity in some places to provide semantic identity, even thought he referential identity of the underlying ImmutableMap objects doesn’t actually matter.

I’ve considered trying to use a b-tree instead of a hash map.

Just for clarification, which hash map are you talking about?

  If (and that’s a big if) I can find / construct a small comparison key that is accurate and encourages good access locality, then I think it may end up being a more cache friendly structure than the hash map. I think that the insertions and lookups happen ‘near’ each other in terms of the AST, so I think the cache-friendliness can likely outweigh the downgrade from O(1) to O(lg n). But that does all hinge on making a small accurate comparison key.

RE: ImmutableMap
I’m going to rephrase what you said to check for my own understanding.
Instead of ImmutableMap, a naïve implementation could have stored a std::map (or some other associative container) instead. Doing so would get the same quality of analysis. However, it would use a lot more memory, and the allocations would be slower, since they would be using the system / mallocator instead of the bump allocator. The ImmutableMap is taking advantage of the fact that the program state changes in little bits and pieces, instead of being drastically different from one node to the next. The ImmutableMap allows multiple ExplodedNodes to share most of the common program state elements, without mutating the old elements.

If that is accurate, then I can definitely see why an ImmutableMap is being used here. I’ll have to think more about what can change. I’ll also keep finger search trees in mind.

Just to be clear, it’s mostly about semantics. The allocations using a bump allocator is gravy, because we allocate so many ImmutableMap.

A ProgramState, which itself is an immutable value, contains a bunch of ImmutableMaps to represent things like bindings of variables to values, constraints on symbols, and so on. You don’t mutate an ImmutableMap at all; you create a new ImmutableMap from an existing one, and they share state for the purpose of efficiency. That memory sharing between ImmutableMap values is purely an (important) optimization; what is most important is the semantics.

When you add a value to an std::map, that map changes in place. If two ProgramStates had the same std::map, then adding a value to that std::map would change multiple ProgramStates, which is not what we want at all. Indeed, a common mistake that people make when making changes to the analyzer or writing checkers is using mutating data structures (e.g., DenseMap) when they are in a context where immutable values are needed.

We could conceptually use std::map in the same places we used ImmutableMap, and even have sharing of std::maps between ProgramState values, but we would need to do some kind of copy-on-write when changing the maps for a ProgramState so that it did not change the underlying values of another ProgramState. This is entirely doable, but essentially would re-implement much of the core semantics of ImmutableMap. ImmutableMap also can have maps of maps, all composing to preserve immutability, which is really important. If you try building that all on top of std::map directly you can have a big house of cards that is hard to get semantic integrity in the analyzer core.

RE: General scalability
For now, I am just trying to make the existing 150,000 iterations provide the same quality in less time. Longer term, it may make sense for me to look at things like the summary based algorithm.

That is totally fine, and a great place to start. Even a constant factor improvement will make a big impact in practice.

  One other hand-wavey approach that I’ve considered is to save off a file with an intermediate analysis state, so that future analysis runs can reuse the parts of that analysis that haven’t changed. Both of those are going to need to wait until I’ve gotten more familiar with the code base.

I think this is all great, but please when making changes keep into account the memory overhead. I put a fair amount of tweaking into the current uses of ImmutableMap, including how aggressively we reclaim ExplodedNodes, based on tuning for memory usage. Changes in core representation of these data structures, or even minor algorithmic tweaks, can have a huge impact on memory. I even changed some of the balancing heuristics for the AVL tree so that the trees for the ImmutableMaps would be slightly less balanced at the benefit of using up less memory in practice.

To put it into context, suppose we made a change that caused the analyzer to use 10MB more memory on a complicated translation unit (this isn’t a totally unrealistic number for very complicated files). If we had 8 cores each doing static analysis, that’s 80MB additional memory being used for static analysis on a machine. As 10MB goes to 20MB, 20MB goes to 30MB, you can quickly see how much that additional memory overhead really matters. And the problem tends to be more pronounced on machines with more cores, with don’t always have the same amount of memory per core as machines with fewer cores.

Cheers,
Ted

Progress so far…

With the FoldingSet changes that I have prototyped on my machine, I’ve reduced a 6m57.730s analysis to 6m20.126s. This is without caching a hash value, but it does do an up-front bucket reservation that I will discuss below.

My responses inline.

I’ll be honest that an extra 4 bytes per ExplodedNode does make me cringe, but maybe that is not so bad. I’m totally fine with ditching FoldingSetNodeID; that was just a way to get some hashing based on a mechanism already in place in LLVM.

I’ve got an alternative pitch. There are two times when the hash of a given node is needed:

  1. When it is added to the FoldingSet

  2. When the FoldingSet grows and needs to rebucket items.

Saving the hash made case 2 significantly faster. I could instead try to avoid growing the FoldingSet entirely by guessing that the eventual number of buckets will roughly the number of work items that will be run. This approach will make simple analyze runs take up more memory, but it will make large analyze runs take up less, since we won’t need to allocate, then free the too-small buckets.

It may be more profitable to put the cached hash in ProgramState. Conceptually, ExplodedNode is just a <ProgramState, ProgramPoint> pair, and both ProgramPoint and ProgramState should be independently hashable. ProgramPoints should be quickly hashable on their own, but ProgramStates are very expensive to both hash and semantically compare. That’s why we do all sorts of uniquing with the ImmutableMaps so that we can often use referential identity in some places to provide semantic identity, even thought he referential identity of the underlying ImmutableMap objects doesn’t actually matter.

Currently, the address of the ProgramState is going into the hash. The contents of the ProgramState aren’t inspected at all for hashing or equality purposes. I think this is a good way to have it too. The performance trouble I’m seeing mostly come from cache misses, and cache misses are normally a result of chasing pointers. I don’t want to put a hash inside of ProgramState, because that’s just one more pointer to chase.

I’ve considered trying to use a b-tree instead of a hash map.

Just for clarification, which hash map are you talking about?

I was referring to the Nodes hash map inside of ExplodedGraph. I have my doubts that I’ll be able to squeeze a comparison key and a pointer into 16-20 bytes though. I would need to get the key-pointer pair down to that size in order to get a 3 element b-tree node into a 64-byte cache line.

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project