Unsuccessful attempts to fix a slow-analysis case related to `removeDead` and environment size

As part of our work on containing the analysis time, I’ve been investigating the following slow case:

// Analyze with
// clang++ -cc1 -analyze -analyzer-checker=core repro-slow-steps/ceph/reduced-small.cc
void opaque(void*);
struct Option {
  enum type_t { TYPE_INT, TYPE_SIZE };
  type_t type = TYPE_INT;
  Option() {
    opaque(this);
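    // Branching on the conjured value of 'type' splits the exploded graph.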
    if (type == TYPE_INT) 0;
  }
};
void use(...);

#define REPEAT10(X) X, X, X, X, X, X, X, X, X, X
#define REPEAT100(x) REPEAT10(REPEAT10(x))

void get_global_options() {
  use(REPEAT100(Option()), REPEAT100(Option()), REPEAT100(Option()));
}

It is a reduced example from a real (although generated) function from the ceph open-source project.

The code above takes 5 seconds to analyze on my machine. That converts to 10 seconds in our downstream CSA configuration, and to 90 seconds for the actual function, which is somewhat more complicated. Yet the main challenges remain the same.

I’ve struggled to optimize the analysis of this entry point and tried several approaches. None of them worked to my satisfaction, so I’m sharing this as an experience report.

Here is the -ftime-trace of this analysis:

[figure: -ftime-trace flame graph of this analysis]

This code poses three challenges:

  • It branches the exploded graph excessively within a single statement: each constructor call to Option() introduces a state split on a newly conjured symbol, so the states never join back.

  • It grows the environment (the mapping from expressions to their assumed values) substantially. Given that all the arguments stay alive as long as we are still processing the use(...) call, by the time CSA evaluates the last arguments, the environment holds ~300 elements. In the actual case it was 1000+ elements. AFAIK, the environment wasn’t designed to accommodate this number of elements.

  • CSA calls removeDead very often for no real benefit. removeDead is slow in this situation because it performs copying garbage collection (GC) on the entire environment, which is inefficient when almost all bindings are alive.

The first bullet point is not really a problem on its own: it does not block the discovery of issues after this statement thanks to the unexplored_first_queue exploration strategy, and the upper limit on the number of analysis steps is low enough for the analysis of most functions to finish much faster. For the rest of the post I will focus on the other two challenges.

Each removeDead can only remove the temporary bindings for the Option() call, which are rather few, and it keeps the temporary values of the many function arguments (or init-list elements as in the original code).

CSA invokes removeDead before calling the next constructor, and a couple of times before exiting the constructor.

As a result, each analysis step (each exploded node) takes 400 µs on average, versus the common average of 16 µs: a 25× slowdown.

So here are the ideas I tried out to speed up the analysis of this entry point:

  1. “Optimistic removeDead”: Replace copying GC with mark&sweep.

  2. “Generational removeDead”: Focus removeDead only on the stack frames of the current function and its callees.

  3. Bound the environment size and halt the exploded path that leads to an overly large environment.

  4. Throttle removeDead calls for the same statement.

1. Optimistic removeDead

This is a rather simple idea: instead of recreating the environment from scratch and copying the live bindings into it, make a copy-on-write duplicate and remove the bindings that didn’t survive.
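Here is a minimal sketch of the two strategies over llvm::ImmutableMap (the structure backing the Environment). The key/value types and the isLive predicate are simplified stand-ins for the real expression bindings and SymbolReaper-driven liveness, so treat this as an illustration rather than the actual patch:

#include "llvm/ADT/ImmutableMap.h"

using EnvMap = llvm::ImmutableMap<const void *, int>; // stand-in key/value types

// Copying GC (the current approach): rebuild the map from scratch by
// re-inserting every live binding; costs one insertion per live binding
// even when nothing is dead.
EnvMap copyingGC(EnvMap Old, EnvMap::Factory &F, bool (*isLive)(const void *)) {
  EnvMap New = F.getEmptyMap();
  for (auto I = Old.begin(), E = Old.end(); I != E; ++I)
    if (isLive(I.getKey()))
      New = F.add(New, I.getKey(), I.getData());
  return New;
}

// "Optimistic" mark&sweep: start from the old map (an O(1) copy, since the
// underlying AVL trees are shared) and remove only the dead bindings; costs
// one removal per dead binding, which wins when almost everything is alive.
EnvMap optimisticGC(EnvMap Old, EnvMap::Factory &F, bool (*isLive)(const void *)) {
  EnvMap New = Old;
  for (auto I = Old.begin(), E = Old.end(); I != E; ++I)
    if (!isLive(I.getKey()))
      New = F.remove(New, I.getKey());
  return New;
}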

This change improved the analysis time by about 20%. Yet I recognize that this case is an outlier, and I would need some complex heuristic to make sure that in normal cases, i.e., 99.9% of the time, the original implementation kicks in.

The gain in this extreme case is not high enough to justify introducing such complexity in my eyes.

2. Generational removeDead

Given that a caller’s bindings are never dead in the middle of evaluating a callee (correct me if I’m wrong), why are we trying to expire them at all?

Let’s split the environment per stack frame and keep the caller bindings alive without needlessly copying them from the old environment into the new “cleaned” one every time.

I split the single Environment map (Expr, LocCtx) -> SVal into LocCtx -> (Expr -> SVal). Then I applied the copying GC only to the current frame and its child frames. Unfortunately, I still have to visit all bindings to properly mark the live regions and constraints. This could be eliminated if we split the entire State per stack frame and kept a pre-computed SymReaper ready for each frame. However, that is a pretty large refactoring that I am not ready to embark on.
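A minimal sketch of the split and the per-frame GC, with stand-in types for the clang classes (this illustrates the idea, not the actual patch):

#include "llvm/ADT/ImmutableMap.h"

// Stand-ins for clang::Expr, clang::LocationContext, and clang::ento::SVal.
struct Expr; struct LocationContext; struct SVal;

using FrameBindings = llvm::ImmutableMap<const Expr *, const SVal *>;
using SplitEnv = llvm::ImmutableMap<const LocationContext *, FrameBindings>;

// Per-frame GC: rebuild only the binding maps of the current frame and its
// callees; caller frames' maps are reused wholesale, since their bindings
// cannot die while the callee is still being evaluated.
SplitEnv removeDeadPerFrame(SplitEnv Env, SplitEnv::Factory &EF,
                            FrameBindings::Factory &BF,
                            bool (*isCurrentOrChild)(const LocationContext *),
                            bool (*isLive)(const Expr *)) {
  SplitEnv New = Env;
  for (auto FI = Env.begin(), FE = Env.end(); FI != FE; ++FI) {
    if (!isCurrentOrChild(FI.getKey()))
      continue; // caller frame: keep its bindings untouched, no copying
    FrameBindings NewB = BF.getEmptyMap();
    for (auto BI = FI.getData().begin(), BE = FI.getData().end(); BI != BE; ++BI)
      if (isLive(BI.getKey()))
        NewB = BF.add(NewB, BI.getKey(), BI.getData());
    New = EF.add(New, FI.getKey(), NewB); // overwrite this frame's map
  }
  return New;
}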

Splitting the environment on its own degraded the analysis performance even for the code at hand: about a 27% slowdown.

Focusing removeDead only on the relevant frames, however, recovered that loss and brought some more: a 15% speedup compared to the baseline.

Given that this split introduces a noticeable overhead that is likely to affect most code, while the optimization it enables is likely to help only a narrow class of cases, I believe it would degrade performance overall. However, a per-frame split of the entire State might be beneficial.

3. Bound Environment Size

The idea: abandon the exploded paths that grow the environment too much.

In practice, that means that a huge function call like the one in the example, or a large initializer list like in the original code, would stop the analysis. I consider this a better outcome than effectively “hanging” it. If this idea played out, we could mitigate its effect similarly to how we ignore some inlined functions when they take too much budget.

To avoid extra overhead, I used the height of the underlying map as a proxy for the environment size: if the environment reaches the configured height threshold, the engine generates a sink node.
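To see why height is a usable proxy: llvm::ImmutableMap::getHeight() is O(1) because every AVL node caches its height, whereas counting the bindings would require a traversal. A minimal sketch, with a hypothetical threshold constant standing in for the analyzer-config option:

#include "llvm/ADT/ImmutableMap.h"

using EnvMap = llvm::ImmutableMap<const void *, int>; // stand-in types

// Hypothetical threshold; the actual value would come from analyzer-config.
constexpr unsigned MaxEnvHeight = 10;

// A balanced tree of height h holds at least on the order of 2^(h/2)
// entries, so a cheap height check bounds the environment size from below.
bool shouldSinkPath(const EnvMap &Bindings) {
  return Bindings.getHeight() >= MaxEnvHeight;
}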

It turns out that some of the lit tests require a height of at least 7, i.e., environments with more than 16 bindings. The entry point I was working on was affected by threshold values between 5 and 9, but at values >= 7 the speedup did not exceed 40%. Intuitively, 7 felt too low, because lit tests contain only simple expressions compared to what we might face in real code, so I did not pursue the idea further.

A slight twist on this idea is to apply such a threshold per stack frame. However, the overhead of maintaining an environment per stack frame makes it impractical (see idea #2).

Is it worth bounding the environment?

Here is a histogram of the maximum environment size reached while analyzing each of the 450k non-trivial entry points.

[figure: histogram of the maximum environment size per entry point]

As you can see, it can reach pretty scary sizes (even ignoring the outliers that had hundreds of thousands of bindings).

However, looking at the average time per step on the log-log plot below, you can notice that extreme environment sizes (i.e., above 10³ bindings) have no persistent impact:

[figure: scatter plot where each point represents an entry point; the x-axis is its average step duration in microseconds, the y-axis is its maximum reached environment size]

So bounding the environment size does not seem to be worth the added complexity.

4. Throttle removeDead calls

The idea: it is not worth calling removeDead frequently while we are still within the same statement, as it is not going to clean up much, and whatever remains can be cleaned up at the end of the statement. Of course, this holds only within one stack frame.

ExprEngine invokes ExprEngine::removeDead in several situations. One of them is upon a function or constructor call. In this example, there are 301 calls in a single statement (300 Option() constructors plus the use(...) call itself). Given that the exploded graph branches within every argument of the use(...) call, this makes for quite a lot of removeDead invocations.

So I throttled the invocations of removeDead on call expressions within a single statement. I added a map counting removeDead triggers per stack frame, cleared each time we reach the end of a statement.

In practice, it turned out that only the first invocation matters. The other invocations on calls within the same statement make no difference but eat into the analysis time. So I replaced the map with a set.
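A minimal sketch of the final shape (the names are hypothetical; the real change would live in ExprEngine):

#include "llvm/ADT/DenseSet.h"

struct StackFrameContext {}; // stand-in for clang::StackFrameContext

// Hypothetical throttle: remember the stack frames in which removeDead has
// already run for a call during the current statement.
class RemoveDeadThrottle {
  llvm::DenseSet<const StackFrameContext *> RanThisStatement;

public:
  // True only for the first call-triggered removeDead in this frame within
  // the current statement; subsequent ones are skipped.
  bool shouldRunRemoveDead(const StackFrameContext *Frame) {
    return RanThisStatement.insert(Frame).second;
  }

  // Reset when execution reaches the end of a statement.
  void onStatementEnd() { RanThisStatement.clear(); }
};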

This optimization also produced a roughly 20% speedup. Seeing no apparent downsides, I proceeded with a full-scale evaluation.

Unfortunately, it did not look compelling on our evaluation code base overall:

  • Because of the changed removeDead schedule, some issues, notably the memory leaks, are now reported later. Such a shift in an issue’s primary location is inconvenient.

  • The performance impact seems rather small, and slightly negative on average. Probably, maintaining the additional set costs more than the skipped removeDead calls save.

Conclusion

After trying several ideas, I found no easy optimization that would fix, or at least contain, the performance impact of code like this, i.e., code with a lot of calls in a list: an argument list or an init-list.


Thank you for documenting your journey. I wish we had more stories like this (along with successful attempts, of course 🙂).

I think we may come back to this thread once we have a better picture of overall running times, now that we have more tracing tools under our belt thanks to the -ftime-trace support you contributed.

Thanks for summarizing your findings, very interesting read!

I wanted to share some random brain dumps, although they are unlikely to make any difference, as most of these would be huge undertakings to implement with no guaranteed results.

  • I don’t remember us ever experimenting with replacing some data structures in the analyzer. There are other immutable data structures with slightly different performance characteristics that might be worth looking into. We could experiment with Patricia trees (an implementation is available here) or finger trees. We could also look for opportunities to optimize the immutable data structures that we have. I haven’t looked too deeply, but I think the iterators could be simplified a bit (I am not convinced we need the VisitFlag), and maybe some of the “mutating” operations could also be improved. I think the nodes are refcounted, but I don’t think we try to do any optimizations for uniquely owned nodes (with a refcount of one), where we could do certain operations in place.
  • Probably not the main source of the inefficiency, but my intuition is that LocCtx -> (Expr -> SVal) might be overkill for the generational removeDead. Given that the analyzer usually has a very small limit on the size of the call stack, I think we could have a stack of (Expr -> SVal, LocCtx)s and do a linear search on LocCtx to look things up (see the sketch after this list).
  • A crazy idea is, I am not actually sure we need the Environment to be a map. Just as during execution the “active” values are stored on a stack and popped off once we no longer need them, we could do something similar during symbolic execution. If we already had a stack-like representation, implementing generational removeDead would become easier. The caveat is that we would probably need to keep some dead values in this stack until the corresponding frame is actually popped.
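Here is a rough sketch of the stack-of-frames layout from the second bullet, with stand-in types for the clang classes (an illustration of the shape, not a worked-out design):

#include "llvm/ADT/ImmutableMap.h"
#include "llvm/ADT/SmallVector.h"

// Stand-ins for clang::Expr, clang::LocationContext, and clang::ento::SVal.
struct Expr; struct LocationContext; struct SVal;

struct FrameEnv {
  const LocationContext *Frame;
  llvm::ImmutableMap<const Expr *, const SVal *> Bindings;
};

// The inlining depth is tightly bounded, so a small vector with a linear
// search over frames is likely cheaper than a map keyed by LocationContext.
using EnvironmentStack = llvm::SmallVector<FrameEnv, 8>;

const SVal *lookup(const EnvironmentStack &Env, const LocationContext *LC,
                   const Expr *E) {
  for (const FrameEnv &F : Env)
    if (F.Frame == LC)
      if (const SVal *const *V = F.Bindings.lookup(E))
        return *V;
  return nullptr;
}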

Regarding option 4, I am not sure I understand why we need the set. Isn’t it possible to just invoke removeDead at the end of each statement, without any sets or maps?

I think in the past, Gábor Márton experimented with swapping the immutable map and set for Immer’s immutable counterparts. However, we didn’t see a major difference in running time at the time.
It could be that the integration wasn’t perfect, or that the lookups weren’t on the hot path in the test project. Now we would have much better tools for evaluating such a change.

Nice ideas!

I especially like your last crazy idea, and I wonder: why was the environment a map in the first place? In any case, I will not have time for such a large change to the analyzer architecture.

I didn’t want to completely disable calls to removeDead in the middle of interpreting a statement. And, in practice, completely disabling removeDead for call expressions in the middle of a statement led to a substantial slowdown. I didn’t investigate further, though.

So I wanted to run removeDead once per call expression in any statement, but then I had to differentiate between stack frames, and I used a map for that. Perhaps it could be done with a small vector, as you suggested in your other point.