[analyzer][RFC] Taming Z3 query times

Context

While upgrading from clang-17 to clang-18, we faced serious running-time regressions, and we had to revert some changes and postpone others.
In particular, the GenericTaintChecker, the BitwiseShiftChecker and the OOBv2 changes appeared to contribute to these slowdowns by applying more constraints, which causes Z3 to hit its poor performance characteristics a lot more frequently.

Now that isTainted() is fixed, we still have rarer regressions. Those can be grouped into two categories:

  • Z3 crosschecking hits the 15-second timeout many times in a row for some bug report equivalence classes, ballooning the time spent processing the affected equivalence classes, as mentioned here.
    Fixing this is the objective of this RFC.

  • Symbols got a lot more complicated in some cases. This affects projects that have many nested loops with array accesses inside, like FFmpeg. The worst slowdown was 28x (with this RFC applied), on FFmpeg's sheervideo.c, which I reported here. Fixing this is out of scope for this RFC.


Problem

Z3 crosschecking should not impose a noticeable runtime regression, but the current implementation offers no such guarantee.
Consequently, the time needed to process a report equivalence class is uneven and unpredictable, while the outcome is limited: 1 diagnostic to the end user.

Notice that a single badly behaving large report equivalence class can ruin the whole analysis - which makes this basically as severe as a crash.

Objective

Add safeguards for Z3 running times ensuring that Z3 crosschecking cannot dominate the analysis time.

Ideally, have some hard guarantees/limits bounding how long selecting a report from a report equivalence class can take. In our downstream prototype, we set this to 1 second.
This means that no matter how wild Z3 gets, we would still meet our deadline, even at the expense of dropping some diagnostics from the user’s perspective.

Sub-objectives

  • Should introduce as few flaky reports as possible, so it should ideally be deterministic.
  • Should not change the set of visible bug reports we have today, as changing it would be disruptive when rolling out this feature.

Implementation considerations

Z3 option: rlimit

I wanted to use Z3’s rlimit, which is supposed to be a deterministic counter correlating with running time.
But it turns out that using that alone is not enough, as there are Z3 queries that take 60+ minutes while barely consuming any rlimit - defeating this threshold. This suggests that we must use regular timeouts in conjunction with rlimit. A sketch of setting both limits follows.
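
A minimal z3py sketch (assuming the z3-solver Python package) of setting both limits on a solver. The constraint and the threshold values are illustrative only, and the exact reason_unknown() strings depend on the Z3 version:

```python
from z3 import Solver, BitVec, unknown

s = Solver()
s.set("timeout", 300)    # wall-clock cap in milliseconds (non-deterministic)
s.set("rlimit", 400000)  # deterministic resource-counter cap

x = BitVec("x", 32)
s.add(x * x == x + 1)    # stand-in for the refutation constraints

res = s.check()
if res == unknown:
    # reason_unknown() distinguishes e.g. a timeout from rlimit exhaustion
    print("cut:", s.reason_unknown())
else:
    print(res)
```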

Data

I’ve gathered data for 200+ projects: C and C++, small and huge.
For this data, I disabled the short-circuiting that stops at the first valid bug report of the report eqclass. I did this to get a better picture of the Z3 queries we could potentially make.
On this scatter plot, you can see how the running time (ms) relates to the rlimit consumed by the Z3 queries. Both scales are logarithmic.

We can see that there is no really useful correlation between running time and rlimit. Disregard the horizontal bands - they appear because the measurement is only millisecond-granular - but note the dots at the very top, where we hit the current hard-coded 15-second timeout. Those are the ones that hurt us.

If we make the dots smaller, we get a sneak peek into the density of the solid blue parts of the chart, to better understand how 90%+ of the queries behave.

We can see that there is a dense spot that we need to consider when choosing a suitable rlimit and timeout.

Using an rlimit of 400k would deterministically cut most of the worst cases. It would also sacrifice the 0.04% of potential reports that have small running times but huge rlimit consumption.

We can also see that queries only rarely take over 1 second; they usually take around 20, and at most 50, milliseconds.

Python prototyping, exploration

I’ve implemented a simulator in Python to experiment with different heuristics, along with a method for comparing the current eqclass selection outcome - which report we accept as the valid bug report (and display to the user) - against the one that a new heuristic would pick, while aggregating how much query time each decision would take.

After this, I gathered the rows for the mismatches - cases where the new heuristic would determine a different outcome - and looked for patterns I should watch out for.

By following these steps I ended up with the following heuristics (a sketch combining them follows the list):

  • Sometimes we have large eqclasses with many long-running Z3 queries (~200ms) that eventually return UNSAT. This balloons the total running time, so I track the total time spent on an eqclass and cut it if it exceeds 700ms.
  • I’ve considered a different metric: calculating the average query time in the eqclass and cutting the eqclass if the queries prove to take long, say over 100ms. For this we would also need to gain confidence first, by having say at least 5 queries before taking the average. But at that point this heuristic is not much different from the previous one.
  • I track how many queries we have made so far in the eqclass, and if it’s over 100, I cut the eqclass. This is quite similar to the first, aggregated-time heuristic, but at least it is deterministic.
  • Count how many times we get UNDEF because we hit the timeout, and if it happens a second time in the eqclass, drop the eqclass. I don’t have proof that we shouldn’t drop on the first occurrence, but my reasoning was that the system might have had a rare spike of CPU utilization or some similar anomaly, so it might be better not to cut the eqclass right after the first timeout.
  • Same as the previous one, but for rlimit. This at least has the advantage of being deterministic. However, as you have seen from the scatter plot, an occasional large rlimit does not always imply long running times, so I chose a more conservative 5 occurrences to trigger the cut. I have no data to back this arbitrary choice either.
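
A hypothetical Python sketch of how these cut rules could combine into a single "oracle"; the class name, field names, and thresholds are illustrative (they mirror the numbers above), not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class EqClassOracle:
    # Thresholds mirroring the heuristics above; all values are illustrative.
    max_aggr_time_ms: float = 700.0  # total Z3 time budget per eqclass
    max_queries: int = 100           # deterministic cap on query count
    max_timeouts: int = 2            # cut on the 2nd individual timeout
    max_rlimit_hits: int = 5         # cut on the 5th rlimit exhaustion
    # Running counters for the eqclass being processed.
    aggr_time_ms: float = 0.0
    queries: int = 0
    timeouts: int = 0
    rlimit_hits: int = 0

    def record(self, elapsed_ms: float, timed_out: bool, rlimit_exhausted: bool) -> bool:
        """Record one Z3 query; return True if the eqclass should be cut."""
        self.aggr_time_ms += elapsed_ms
        self.queries += 1
        self.timeouts += int(timed_out)
        self.rlimit_hits += int(rlimit_exhausted)
        return (self.aggr_time_ms > self.max_aggr_time_ms
                or self.queries > self.max_queries
                or self.timeouts >= self.max_timeouts
                or self.rlimit_hits >= self.max_rlimit_hits)
```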

Tried configurations in the simulator

Number of eqclasses processed: 471489.
A mismatch means that the outcome (the selected diagnostic we accept) of the given heuristic differs from the one that the baseline heuristic would have selected.

  1. Baseline (the current behavior): Accept the report on SAT or on timeout; reject the report (basically try the next one in the eqclass) on UNSAT.
    Total simulated Z3 query time: 4.21 hours (15151900 ms)
    Mismatches 0 % of the times (0 out of 471489)

  2. Baseline2 (rej UNSAT): Accept on SAT; Reject on UNSAT or timeout.
    Total simulated Z3 query time: 9.34 hours (33619000 ms)
    This suggests why we accept the report on hitting the 15-second timeout today; at least that is my guess as to why it’s done that way.
    Mismatches 0.034 % of the times (161 out of 471489)

  3. Proposed heuristic: 300ms timeout, 400k rlimit, 700ms max aggregated query time, 100 max queries, cut after 2 timeouts, cut after 5 rlimit hits.
    Total simulated Z3 query time: 1.64 hours (5891304 ms)
    Mismatches 0.059 % of the times (277 out of 471489)

I also tried some other configurations, which I don’t list here.

Implementation

I’ll prepare a PR implementing the proposed heuristic, but I’m open to suggestions for refining the idea.

EDIT: Here is the change in two commits:
1. Prepare with an NFC commit: [analyzer][NFC] Reorganize Z3 report refutation by steakhal · Pull Request #95128 · llvm/llvm-project · GitHub
2. Implement the “oracle” determining when to cut the eqclass: [analyzer] Harden safeguards for Z3 query times by steakhal · Pull Request #95129 · llvm/llvm-project · GitHub

Considered alternative approaches

  • Fuse the individual Z3 queries into a single optimization query, i.e. have a single Z3 invocation per eqclass to make it scale better that way (see the sketch after this list):
    Have an integer symbol i representing the query index.
    Assert (i = 0 ∧ eqclass0_constraints) ∨ (i = 1 ∧ eqclass1_constraints) ∨ …, and optimize for the lowest i.
    However, this would likely not improve the running time, as I don’t expect Z3 to reuse proofs across different disjuncts if those use completely different symbols. So I did not implement this.
  • Keep the trimmed graph for the eqclass’ reports like we have today, but use Z3 push/pop when a given node has multiple children. This way we could recursively descend into the paths, and by using push/pop the solver could reuse the learned proofs for the common path prefixes. This would likely improve performance somewhat. However, as a typical query seems to be quick, I don’t think this would really help in the outlier cases which initially motivated this RFC. Consequently, this was not explored, but it remains open for the future.
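
A hypothetical z3py sketch of the fused-query idea from the first bullet; pick_first_valid_report and constraints_per_report are made-up names, and the per-report constraint lists are assumed inputs:

```python
from z3 import Optimize, Int, And, Or, sat

def pick_first_valid_report(constraints_per_report):
    i = Int("i")  # the report index we minimize
    opt = Optimize()
    # (i=0 ∧ C0) ∨ (i=1 ∧ C1) ∨ ...
    opt.add(Or([And(i == k, *cs) for k, cs in enumerate(constraints_per_report)]))
    opt.minimize(i)  # prefer the earliest report whose constraints are satisfiable
    if opt.check() == sat:
        return opt.model()[i].as_long()
    return None  # every report in the eqclass is refuted

# Example: the first report is infeasible, the second is feasible.
x = Int("x")
print(pick_first_valid_report([[x > 0, x < 0], [x == 42]]))  # -> 1
```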

Feedback is welcome

Let me know what you think of this proposal. The patch is ready downstream, and I’ll soon post it.

@Xazax-hun @DonatNagyE @NoQ @mikhailrgadelha @Szelethus

I’ve also done a flakiness test for the proposed heuristic, where I perturbed every query time by a uniformly random ±10% and repeated the whole simulation multiple times (sketched below).
Each run showed only 3-7 report changes across the whole project set compared to the first run of the heuristic - which is really good.
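
A hypothetical sketch of the perturbation step, assuming the recorded per-query times are available as a list of milliseconds:

```python
import random

def perturb_query_times(query_times_ms, jitter=0.10, seed=None):
    """Jitter each recorded query time by a uniform random +/- 10%."""
    rng = random.Random(seed)
    return [t * rng.uniform(1.0 - jitter, 1.0 + jitter) for t in query_times_ms]

# One perturbed replay of some recorded timings:
print(perturb_query_times([12.0, 230.0, 15000.0], seed=1))
```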

Thanks for this thorough analysis of the Z3 query times!

Your reasoning is very convincing, I would be completely satisfied with your proposed heuristic. (You even picked my favorite colors for the bikeshed, so I don’t have anything to say here :smiling_face:)

<out of scope>
You write that “Symbols got a lot more complicated in some cases.”, but I’m not completely sure that this is a real effect. As far as I know, this observation comes from logging code that was added into ArrayBoundV2 – and your bisection revealed that the increase in symbol complexity appears at a commit where ArrayBoundV2 switches to a different callback.

Based on this, I’d say that it’s likely that the apparent increase in symbol complexity is an artifact introduced by the measurement method; however, I can easily be convinced of the opposite if there is more data. (And I can contribute to analyzing this situation if you need help.)
</out of scope>

I’ve decided to give an even more simplified variant of my proposed heuristic a shot: 300ms query timeout, 400k rlimit, 700ms aggregated query time, cut after the first timeout or rlimit exhaustion (see the snippet after the numbers below).
The simulation suggests this would be less “flaky”, but would also suppress a touch more reports.

Total simulated Z3 query time: 1.6 hours (5771256 ms)  # from originally proposed 1.64 hours
Number of eqclasses processed: 471489
Mismatches 0.066 % of the times (310 out of 471489) # from 277 (0.059%)
 - aggr-querytime-over-threshold      49
 - consumed-rlimit-too-many-times     51
 - timed-out                         210
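
For illustration, reusing the hypothetical EqClassOracle sketched earlier, the simplified variant would amount to roughly the following (whether the 100-query cap is kept is my assumption):

```python
oracle = EqClassOracle(max_aggr_time_ms=700.0,
                       max_timeouts=1,      # cut on the first timeout
                       max_rlimit_hits=1)   # cut on the first rlimit exhaustion
```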

For the perturbed simulations, it produced 1, 2, 1, 3, 2, 2, 3, 4, 2, 2 report differences compared to the non-perturbed run. These diffs came from only 2 projects - HHVM (3 times) and taskflow (22 times) - across only a handful of TUs. This suggests to me that those TUs are simply more prone to Z3 timeouts than others, hence more sensitive. It also implies that when doing differential testing, one could quickly learn that those are likely flukes and focus on the relevant differences.

Because of this, I’d personally prefer this configuration of the heuristic over the originally proposed one.

Thanks! It was fun to work on this.

Once I have time, I’ll come back to that regression too, as I believe that no matter the use case, one should never see a 28x slowdown after an upgrade. I’ll create a separate thread on that subject, where I’ll elaborate on the evidence I have so far backing my statements - but that has lower priority, because it’s not seen as a blocking issue.

The simplified proposal is also a nice option, feel free to use that.

:thinking: The results

 - aggr-querytime-over-threshold      49
 - consumed-rlimit-too-many-times     51
 - timed-out                         210

suggest that perhaps the rlimit could be a bit stricter, to ensure that the non-deterministic timeouts are less common than the deterministic rlimit cuts.

Also, perhaps it would be useful to add a flag to configure/increase the refutation timeout for users who would accept slower runtimes if they get more deterministic results. (Although this is low priority, I don’t know whether it would be used in practice… perhaps not.)

While it seems tempting, the density chart I attached highlights that a tighter rlimit might cause significantly more cuts. This limits how tight we can set it. I could tighten it, but then we risk over-fitting the heuristic to my dataset.

Yes, I have a flag for both the timeout and the rlimit, but I highly discourage users from tinkering with them. I’d even strongly suggest that vendors not tweak them, as doing so would likely result in suboptimal tradeoffs.

The proposed solution makes sense to me given the data, and I think tuning the cut heuristics makes a lot of sense. That being said, in the long term I hope we can also explore some alternative directions (most of which are already mentioned in the post), like:

  • Try using multiple solvers as they might have different outliers
  • Try constructing a single query for the whole eq. class
  • Try incremental solving for the common prefixes within the eq. class

Also (as this post mentioned), too many complex constraints/symbolic expressions might be problematic for other parts of the analyzer and can become bottlenecks. So it is possible that we might have to address the root cause of the problem and not generate certain constraints, regardless of Z3 performance. That being said, we still want better guarantees in place to avoid outliers regressing performance too badly, as hard-to-solve queries are generally unpredictable and hard to avoid.

While it seems tempting, the density chart I attached highlights that a tighter rlimit might cause significantly more cuts. This limits how tight we can set it.

I see and I agree after eyeballing the graphs again. Having 80%+ timeouts (instead of rlimit exclusions) felt like too much, but the graph shows that it’s unavoidable.

I could tighten it, but then we risk over-fitting the heuristic to my dataset.

Also a very important point.

Yes, I have a flag for both the timeout and the rlimit, but I highly discourage users from tinkering with them. I’d even strongly suggest that vendors not tweak them, as doing so would likely result in suboptimal tradeoffs.

Yup, you’re right that tweaking these won’t be too useful in practice.

Thank you all for the feedback.

I’m also curious; however, after looking at flamegraphs, Z3 is no longer hot, and it is usually well within 1% of the overall running time. I don’t think the return on investment would justify my curiosity.

So it is possible that we might have to address the root cause of the problem and not generate certain constraints, regardless of Z3 performance.

I had a ~120-line Z3 query that ran in a couple of milliseconds; then I had the same one with about 10 extra lines, and I killed Z3 after more than an hour.

we still want better guarantees in place to avoid outliers regressing performance too badly, as hard-to-solve queries are generally unpredictable and hard to avoid.

It’s hard to enforce such limits, but with the proposed heuristic, a mass extinction of reports would suggest a sudden Z3 regression, because on timeout/rlimit exhaustion we drop the eqclass - hence losing the report.
Beyond that sign, I instrumented my implementation with LLVM statistics, so vendors can interpret those numbers to prevent regressions.
At Sonar, we will definitely keep an eye on those, but we are not tracking clang trunk day-to-day, so, as in this case, we may report things with a considerable delay.

My plan is to merge the two patches by the end of Tuesday (tomorrow).
However, I don’t have the approvals, or any feedback (yet). So please review the second patch too.

FYI both commits were merged:

EDIT: Reverted, as this needs a higher minimal Z3 version than the one specified in CMake. I submitted a separate RFC for bumping the minimal Z3 version here, to satisfy this RFC.

Hey @steakhal, sorry for the delay!

That’s some great work analyzing Z3’s performance; indeed, the right configuration can improve its performance considerably.

I don’t remember the code exactly, but do we configure Z3 in any way here? Does your python tool also try different Z3 configurations?

If I could suggest only one change, it would be disabling relevancy [0] and maybe adding some simplify tactics when creating the solver [1]; these two changes improved performance considerably when I tried them a few years ago in a different tool that also verifies C programs. (A sketch of both tweaks follows.)
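
A hypothetical z3py sketch of those two tweaks; the smt.relevancy parameter and the simplify-then-smt tactic pipeline are my reading of the suggestion (the linked ESBMC code uses the C++ API):

```python
from z3 import set_param, Then, Int

set_param("smt.relevancy", 0)         # disable relevancy propagation globally
s = Then("simplify", "smt").solver()  # run a simplify pass before the SMT core

x = Int("x")
s.add(x > 1, x * x == 4)
print(s.check())  # sat (x = 2)
```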

Also, have you considered using other solvers? I know this may entail a lot of work but may give a greater return than trying to find the right configuration for Z3 [2].

[0] esbmc/src/solvers/z3/z3_conv.cpp at master · esbmc/esbmc · GitHub
[1] esbmc/src/solvers/z3/z3_conv.cpp at master · esbmc/esbmc · GitHub
[2] I added support for Boolector, Yices, CVC4, and MathSAT a few years ago, but for bug refutation there was no gain in switching, so I didn’t pursue it further.


Thanks for your reply!

We do. But those settings apply across all solver instantiations, regardless of their use case. This is a problem for timeouts, and this is why the hard-coded 15-second timeout doesn’t really work for us.

That depends on the configuration you have in mind. I explored different rlimit and timeout values - but no other configuration values or tactics. I was primarily looking for ways to deterministically cut Z3 queries - preferably superseding the existing timeout-based mechanism.

I guess that would make sense for exploring improvement opportunities for the Z3-based solver - but for refutation I don’t really see much value. We usually spend well under 1 percent of running time in Z3. The goal of this RFC was not to improve the running time of Z3, but to offer hard guarantees/safeguards for the existing runtime behavior. Of course, a better solver configuration could possibly reduce the number of timeouts or rlimit exhaustions, but the data suggests that happens really rarely - to a degree that I wouldn’t investigate.

I think everything I just shared also applies to exploring other solvers, justifying not investing in that work either.

These are great ideas for making the Z3-based solver actually useful, but that’s definitely out of scope for me.