As part of our work on containing the analysis time, I’ve been mulling over the following slow case:
// Analyze with
// clang++ -cc1 -analyze -analyzer-checker=core repro-slow-steps/ceph/reduced-small.cc
void opaque(void*);
struct Option {
  enum type_t { TYPE_INT, TYPE_SIZE };
  type_t type = TYPE_INT;
  Option() {
    opaque(this);
    if (type == TYPE_INT) 0;
  }
};
void use(...);
#define REPEAT10(X) X, X, X, X, X, X, X, X, X, X
#define REPEAT100(x) REPEAT10(REPEAT10(x))
void get_global_options() {
  use(REPEAT100(Option()), REPEAT100(Option()), REPEAT100(Option()));
}
It is a reduced example from a real (although generated) function in the ceph open-source project.
The code above takes 5 seconds on my machine to analyze. That converts to 10 seconds in our downstream CSA configuration, and to 90 seconds for the actual function, which is somewhat more complicated. Yet, the main challenges remain the same.
I’ve struggled to optimize the analysis of this entry point and tried several approaches. None of them worked to my satisfaction, so I am sharing this as an experience report.
Here is the -ftime-trace of this analysis:
This code has 3 challenges:
- It branches the exploded graph excessively in a single statement: each constructor call to Option() introduces a state split for a new conjured symbol, so the state never joins back.
- It grows the environment (the mapping from expressions to their assumed values) substantially. Given that all the arguments stay alive as long as we are still processing the use(...) call, by the time CSA evaluates the last arguments, the environment holds ~300 elements. In the actual case it was 1000+ elements. AFAIK, the environment wasn’t designed to accommodate this number of elements.
- CSA calls removeDead very often for no real benefit. removeDead is slow in this situation because it performs copying garbage collection (GC) on the entire environment, which is inefficient when almost all bindings are alive.
The first bullet point is not really a problem on its own: it does not block the discovery of issues after this statement thanks to the unexplored_first_queue exploration strategy, and the upper limit on the number of analysis steps is low enough for the analysis of most functions to finish much faster. For the rest of the post I will focus on the other two challenges.
Each removeDead can only remove the temporary bindings for the Option() call, which are rather few, and it keeps the temporary values of the many function arguments (or init-list elements, as in the original code). CSA invokes removeDead before calling the next constructor, and a couple of times before exiting from the constructor. As a result, each analysis step (each exploded node) takes 400 µs on average (versus the common average duration of 16 µs).
So here are the ideas I tried out to speed up the analysis of this entry point:
- “Optimistic removeDead”: replace copying GC with mark & sweep.
- “Generational removeDead”: focus removeDead only on the stack frames of the current function and its callees.
- Bound the environment size and halt the exploded path that leads to an overly large environment.
- Throttle removeDead calls for the same statement.
1. Optimistic removeDead
This is a rather simple idea: instead of recreating the environment from scratch and copying the live bindings into it, make a copy-on-write duplicate and remove the bindings that did not survive.
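To make the difference concrete, here is a minimal sketch of the two strategies, using llvm::ImmutableMap with plain integers standing in for the real (Expr, LocCtx) keys and SVals; the function names and the IsLive callback are hypothetical, not the actual Environment API.

```cpp
#include "llvm/ADT/ImmutableMap.h"
#include <functional>

// Stand-ins for the real types: the actual environment maps
// (Expr, LocCtx) pairs to SVals.
using Binding = int;
using Value = int;
using EnvMap = llvm::ImmutableMap<Binding, Value>;

// Baseline (copying GC): rebuild the map from scratch, re-inserting only the
// live bindings. The cost is proportional to the number of *live* bindings,
// which in the slow case is almost all of them.
EnvMap copyingGC(EnvMap Old, EnvMap::Factory &F,
                 const std::function<bool(Binding)> &IsLive) {
  EnvMap New = F.getEmptyMap();
  for (const auto &Entry : Old)
    if (IsLive(Entry.first))
      New = F.add(New, Entry.first, Entry.second);
  return New;
}

// "Optimistic" variant: start from the old map (shared structure, no bulk
// copy) and drop only the bindings that died. The cost is proportional to
// the number of *dead* bindings, which in the slow case is tiny.
EnvMap optimisticRemoveDead(EnvMap Old, EnvMap::Factory &F,
                            const std::function<bool(Binding)> &IsLive) {
  EnvMap New = Old;
  for (const auto &Entry : Old)
    if (!IsLive(Entry.first))
      New = F.remove(New, Entry.first);
  return New;
}
```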
This change improved the analysis time by about 20%. Yet, I recognize that this case is an outlier, and I would need some complex heuristic to make sure that in normal cases, i.e., 99.9% of the time, the original implementation kicks in.
The gain in this extreme case is not high enough to justify introducing such complexity in my eyes.
2. Generational removeDead
Given that the caller’s bindings are never dead in the middle of the callee’s evaluation (correct me if I’m wrong), why are we trying to expire them at all?
Let’s split the environment per stack frame and keep the caller bindings alive without needlessly copying them from the old environment into the new “cleaned” one every time.
I split the single Environment map (Expr, LocCtx) -> SVal into LocCtx -> (Expr -> SVal). Then I applied the copying GC only to the current frame and the child frames. Unfortunately, I still have to visit all bindings to properly mark the live regions and constraints. This could be eliminated if we split all of the State per stack frame and kept a pre-computed SymReaper ready for each frame. However, that is a pretty large refactoring that I am not ready to embark on.
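Roughly, the split looks like the sketch below. The types are simplified stand-ins (integers instead of Expr, LocCtx, and SVal), and removeDeadPerFrame plus its callbacks are hypothetical names rather than the actual patch.

```cpp
#include "llvm/ADT/ImmutableMap.h"
#include <functional>

// Hypothetical stand-ins for the real analyzer types.
using ExprID = unsigned;   // stands in for an Expr
using FrameID = unsigned;  // stands in for a LocCtx (stack frame)
using Value = int;         // stands in for an SVal

// The flat environment:   (Expr, LocCtx) -> SVal
// The per-frame split:    LocCtx -> (Expr -> SVal)
using FrameBindings = llvm::ImmutableMap<ExprID, Value>;
using SplitEnv = llvm::ImmutableMap<FrameID, FrameBindings>;

// Run the copying GC only on the frames selected by InCurrentOrCallee
// (the current function and its callees). Caller frames are reused as-is,
// so their bindings are never copied.
SplitEnv removeDeadPerFrame(SplitEnv Env, SplitEnv::Factory &OuterF,
                            FrameBindings::Factory &InnerF,
                            const std::function<bool(FrameID)> &InCurrentOrCallee,
                            const std::function<bool(ExprID)> &IsLive) {
  SplitEnv Result = Env;
  for (const auto &Frame : Env) {
    if (!InCurrentOrCallee(Frame.first))
      continue; // caller frame: keep its sub-map untouched (structural sharing)
    FrameBindings Cleaned = InnerF.getEmptyMap();
    for (const auto &Binding : Frame.second)
      if (IsLive(Binding.first))
        Cleaned = InnerF.add(Cleaned, Binding.first, Binding.second);
    Result = OuterF.add(Result, Frame.first, Cleaned); // replace this frame's sub-map
  }
  return Result;
}
```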
Splitting only the environment, on its own, degraded the analysis performance even for the code at hand: about a 27% slowdown.
Focusing removeDead only on the relevant frames, however, recovered that loss and brought some more: a 15% speedup compared to the baseline.
Given that this split introduces a noticeable overhead that is likely to affect most of the code, while the optimization it enables is likely to help only in a narrow set of cases, I believe it would degrade performance overall. However, a per-frame split of the entire State might be beneficial.
3. Bound Environment Size
The idea: abandon the exploded paths that grow the environment too much.
In practice, that means that a huge function call like the one in the example, or a large initializer list like in the original code, would stop the analysis. I consider this a better outcome than effectively “hanging” it. If this idea panned out, we could mitigate its effect similarly to how we ignore some inlined functions if they take too much budget.
To avoid extra overhead, I used the map height as a proxy for the environment size: if the environment reaches the configured height threshold, the analyzer generates a sink node.
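The check itself is cheap because the environment is backed by llvm::ImmutableMap, whose underlying AVL tree exposes its height in constant time. A minimal sketch of the gate, where MaxEnvHeight and the surrounding wiring are hypothetical:

```cpp
#include "llvm/ADT/ImmutableMap.h"

// Sketch only: integer keys/values stand in for the real environment entries.
using EnvMap = llvm::ImmutableMap<int, int>;

// The number of bindings grows exponentially with the AVL-tree height, so the
// height is a cheap lower-bound proxy for the environment size.
// MaxEnvHeight would come from an analyzer config option (hypothetical).
bool exceedsEnvironmentBudget(EnvMap Env, unsigned MaxEnvHeight) {
  return Env.getHeight() > MaxEnvHeight;
}

// If this returns true while evaluating a statement, the exploded path would
// be terminated with a sink node, similarly to other analyzer budgets.
```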
It turns out that some of the lit tests require a height of at least 7, i.e., environments with more than 16 bindings. The entry point I was working with was affected by threshold values between 5 and 9, but at values >= 7 the speedup did not exceed 40%. Intuitively, I felt that 7 is too low, because lit tests contain only simple expressions compared to what we might face in real code, so I did not pursue the idea further.
A slight twist on this idea is to apply such a threshold per stack frame. However, the overhead of maintaining an environment per stack frame makes it impractical (see idea #2).
Is it worth bounding the environment?
Here is a histogram of the maximum environment size reached while analyzing each of the 450k non-trivial entry points. As you can see, it can reach pretty scary sizes (even ignoring the outliers that had hundreds of thousands of bindings).
However, looking at the average time per step on the following log-log plot, you can notice that extreme environment sizes (i.e., above 10³ bindings) have no consistent impact. On this plot each point represents an entry point: the x-axis corresponds to its average step duration in microseconds, and the y-axis corresponds to its maximum reached environment size.
So bounding the environment size does not seem to be worth the added complexity.
4. Throttle removeDead calls
The idea: it is not worth calling removeDead frequently while we are still within the same statement: it is not going to clean up much, and the rest can be cleaned up at the end of the statement. Of course, this holds only within one stack frame.
ExprEngine invokes ExprEngine::removeDead in several situations. One of them is upon a function or constructor call, and in this example there are 301 calls in a single statement. Given that the exploded graph branches within every argument of the use(...) call, that makes for quite a lot of removeDead invocations.
So I throttled the invocations of removeDead on call expressions within a single statement. I added a map counting the removeDead triggers per stack frame, cleared each time we reach the end of a statement.
In practice, it turned out that it is only important to invoke removeDead once: the other invocations on calls within the same statement make no difference, yet eat into the analysis time. So I replaced the map with a set.
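A minimal sketch of that bookkeeping, with a made-up RemoveDeadThrottle class and StackFrameID standing in for the real stack frame context pointers; the actual change would live inside ExprEngine:

```cpp
#include <set>

// Hypothetical identifier for a stack frame (stands in for the analyzer's
// stack frame context pointer).
using StackFrameID = const void *;

class RemoveDeadThrottle {
  // Frames in which removeDead already ran for the currently evaluated
  // statement. Originally a map with a counter per frame; a set turned out
  // to be enough, since only the first run matters.
  std::set<StackFrameID> RanForCurrentStmt;

public:
  // Returns true the first time removeDead is requested for a call in the
  // given frame within the current statement; subsequent requests are skipped.
  bool shouldRunOnCall(StackFrameID Frame) {
    return RanForCurrentStmt.insert(Frame).second;
  }

  // Called when the analyzer finishes the statement; removeDead may run
  // again in these frames for the next statement.
  void onStatementEnd() { RanForCurrentStmt.clear(); }
};
```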
This optimization also produced roughly a 20% speedup. Seeing no apparent downsides, I proceeded with a full-scale evaluation.
Unfortunately, it did not look compelling on our evaluation code base overall:
- Because of the changed removeDead schedule, some issues, notably the memory leaks, are now reported later. Such a shift in an issue’s primary location is inconvenient.
- The performance impact is rather small, and slightly negative on average. Probably, maintaining the additional set costs more overall than the skipped removeDead calls save.
Conclusion
After trying several ideas, I found no easy optimization that would fix, or at least contain, the performance impact of code like this, i.e., code with a lot of calls in a list: an argument list or an init-list.