[analyzer][RFC] Retry Z3 crosscheck queries on timeout

Brief

In this RFC, I present some statistics from an experiment that improved the run-to-run stability of analyses using Z3 refutation by roughly 80-90%, according to my data.

History and motivation

In a previous RFC, “Taming Z3 query times”, I shared that in some rare cases Z3 refutation can effectively halt an analysis, because it would run for close to 15 seconds for every report candidate in a report equivalence class - which can take surprisingly long for large eqclasses. Alternatively, we could have many small eqclasses, many of which exercise a similar “hang”.

I argued that a “hang” is as serious as a “crash”, which motivated me to put safeguards in place to ensure the hard guarantee that a report equivalence class is always processed in at most 1 second (but usually much, much faster than that).
The proposed solution traded these hangs for a limited but measurable number of nondeterministic (flaky) reports - a consequence of the tighter timeouts.

@szelethus reported in a reply that the increase in flaky reports is a no-go, so the change was essentially reverted last week.
I think this was a good call overall, but I wanted to come back and reduce the flaky behavior even with the values we had before the revert, so here are my results.

Idea: retry the failing queries

Let’s take a query that runs for about as long as our timeout threshold. Sometimes it would pass, other times it would time out - causing a flaky report. (Remember, a query that consistently times out for a given threshold is not a problem, because it behaves consistently, so the outcome is deterministic.)

So, if we have a flaky report, we should be able to improve the stability of the outcome by running the query multiple times if it times out.

This wouldn’t completely fix the flaky reports caused by nondeterministic Z3 refutation timeouts, but it may reduce their number to an acceptable level at the expense of spending a bit more time on some Z3 queries.
The main benefit of this is that we would keep hard guarantees similar to those of the safeguard we introduced in the “Taming Z3 query times” RFC.
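
Roughly, the retry loop I have in mind looks like the following sketch (written with the Z3 Python bindings just for brevity - the actual change lives in the analyzer’s C++ crosschecking code, and the helper name here is made up):

# Minimal sketch of the retry idea: reuse the same Solver object and simply
# call check() again when the previous attempt timed out.
import z3

def crosscheck_with_retries(solver: z3.Solver, timeout_ms: int, retries: int):
    solver.set("timeout", timeout_ms)  # per-attempt timeout in milliseconds
    for _ in range(1 + retries):       # 1 initial attempt plus the retries
        result = solver.check()
        if result != z3.unknown:
            return result              # SAT/UNSAT is definitive: stop early
        if solver.reason_unknown() not in ("timeout", "canceled"):
            return result              # give up on non-timeout unknowns
    return z3.unknown                  # every attempt timed out: treat as UNDEF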

Measurements

I used our pool of 200+ projects.

  • Setting a timeout of 15 seconds, 0 rlimit, 0 accumulated eqclass threshold and 20 retries should give us the most stable run we have ever had. Diffing two identical runs of this configuration revealed 7 flaky issues.
    I could have disabled Z3 crosschecking to confirm that these are flaky for reasons other than Z3. This is basically the “best” we can hope for, to set the scene.

  • A run with 300ms timeout, 400k rlimit, 700ms accumulated timeout resulted in ~500 flaky issues (the sum of appearing “+” and disappearing “-” issues). This was in line with our past experience with this default configuration - call this the “baseline”.

    • run 1 vs run 2: +259 -251
  • A run with 300ms timeout, 400k rlimit, 700ms accumulated timeout and 1 retry allowed resulted in about 50 to 100 flaky issues, roughly a 5-fold improvement over the “baseline”.

    • run 1 vs run 2: +27 -23
    • run 2 vs run 3: +10 -90
    • run 3 vs run 1: +89 -13
  • A run with 300ms timeout, 400k rlimit, 700ms accumulated timeout and 2 retries allowed resulted in about 60-110 flaky issues, which is basically identical to having 1 retry.

    • run 1 vs run 2: +27 -34
    • run 2 vs run 3: +58 -53
    • run 3 vs run 1: +56 -54

This data suggests that a single retry reduces the flakiness by a lot - by 80-90% in my measurements. More retries didn’t really help in our case, but this may depend on the compute cluster the analysis runs on, so setting the default retry count to more than 1 (such as 2) is likely a better default than just 1.

What’s next?

Unfortunately, as long as we have timeouts, we will always have some flaky reports. Caching Z3 queries would help to “pin” the outcome of the first run where we execute a Z3 query, as described here. However, all of these are just smart ways of fighting nondeterminism, not a silver bullet.

We plan to come back to implementing and measuring a prototype for the Z3 query cache along with improving the determinism of all the other parts of the analyzer if our bandwidth allows in the future.
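
To make the “pinning” idea a bit more concrete, here is a rough illustration of what a query cache could look like (sketched with the Z3 Python bindings; the cache key, its persistence across runs, and all names here are open questions rather than a settled design):

# Rough illustration of "pinning": remember the first outcome of a query so a
# later run reuses it instead of re-deciding a query near the timeout boundary.
# Keying by the SMT-LIB2 dump of the solver state is just one possible choice,
# and persisting the cache across analysis runs is left out of this sketch.
import z3

_query_cache = {}

def cached_check(solver: z3.Solver, timeout_ms: int):
    key = solver.to_smt2()             # serialize the current query
    if key not in _query_cache:
        solver.set("timeout", timeout_ms)
        _query_cache[key] = solver.check()
    return _query_cache[key]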

I’m currently crafting the test case for the PR before I submit it.
EDIT: Here is the PR.

I’ll migrate questions raised at the PR to this thread, as I believe the direction should be discussed here and the implementation details at the PR.

@DonatNagyE I’m a bit surprised by the idea of using multiple attempts instead of a single run with a larger timeout – intuitively we’re wasting the already performed calculations if we are impatient and abort+restart the calculations after each short timeout (instead of allocating all the time for a single run).

Excellent question. I should have explained this upfront.
In the “Z3 safeguards” RFC, I posted a chart about the Z3 queries we make.
The vast majority of them are quick, and the ones that run long usually do so consistently. Neither the quick nor the slow queries are a big problem in practice, because they behave consistently across runs.
Increasing the timeout would also help, and this is why we reverted to the 15-second timeout last week - until we reduce the flakiness.
However, a longer timeout means we would consistently exhaust that long timeout for badly behaving queries. With retries, we can set a smaller threshold and still achieve a consistent result run to run, while possibly beating the longer timeout.

We had exactly the same question, and we decided to measure how long retried Z3 queries run. I set a 15-second timeout and disabled the rest of the thresholds - just like you have upstream now. Then I measured how long check() runs, for up to 21 attempts (1 initial and 20 retries). This is the data I used for the evaluations and simulations later.
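
Roughly, the per-attempt measurement looked like the following (again sketched with the Z3 Python bindings for brevity; the real measurement was instrumented inside the analyzer, and the helper name is made up):

# Sketch of collecting per-attempt check() times for one query.
import time
import z3

def measure_attempts(solver: z3.Solver, timeout_ms: int, attempts: int = 21):
    solver.set("timeout", timeout_ms)  # 15000 ms (15 seconds) in the measurement described above
    times_ms = []
    for _ in range(attempts):          # 1 initial attempt + 20 retries
        start = time.perf_counter()
        solver.check()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms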

You can see that there are occasional (but rare) spikes in the measured query times. My theory is that those are likely caused by VM preemptions and similar anomalies. Most of the queries run for roughly the same length, but the retry attempts usually beat the initial attempt consistently. My theory is that this could be tied to warming up the caches and branch predictors for the retries. Finally, there are a couple of queries where the initial attempt runs very slowly, but the retries run a lot quicker. This suggests to me that Z3 must do something internally when we reuse the same Solver object and call check() repeatedly. One theory is that it reuses/caches something internally that makes the retries consistently faster. Otherwise I can’t explain this.

Disclaimer: I had so much data that I could only chart one fifth of the sampled data. I also dropped the slowest 1% of attempt-0 queries to get a better view of the vast majority of the dataset.
Surface height represents the query times (ms) of the individual measurements, which are also depicted on a color scale to make the peaks easier to spot.

After this, I looked into which is more beneficial for the stability of a query outcome: increasing the timeout or increasing the number of retries. The outcome of a query sequence (the initial query plus its retries) is SAT or UNSAT if any of the attempts resulted in SAT or UNSAT; it is UNDEF otherwise.

I captured the data a second time and joined the two measurements by file, eqclass ID and report ID - which together uniquely identify a row in my data.

After this, I built a simulation that, given a configuration, tells me the simulated outcome. Doing this for both run 1 and run 2 and then comparing the simulated outcomes of the two tells us whether the report would be flaky - according to the simulation. Then I could display the number of “mismatches” for each configuration and judge which is better: a longer timeout or more retry attempts.
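
The core of the simulation boils down to something like the following (the column and function names are illustrative, not the exact script):

# Simulated outcome of one query sequence for a (timeout, retries) configuration.
# attempt_times_ms[i] and attempt_results[i] hold the measured time and the
# SAT/UNSAT/UNDEF result of attempt i (0 = the initial attempt).
def simulate_outcome(attempt_times_ms, attempt_results, timeout_ms, retries):
    for time_ms, result in list(zip(attempt_times_ms, attempt_results))[:1 + retries]:
        if time_ms <= timeout_ms and result in ("SAT", "UNSAT"):
            return result              # first definitive answer within the budget
    return "UNDEF"                     # every simulated attempt timed out

# A report counts as a "mismatch" if the simulated outcomes of run 1 and run 2
# differ for the same (file, eqclass ID, report ID) row.
def is_mismatch(row, timeout_ms, retries):
    out1 = simulate_outcome(row["run1_times"], row["run1_results"], timeout_ms, retries)
    out2 = simulate_outcome(row["run2_times"], row["run2_results"], timeout_ms, retries)
    return out1 != out2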

On this chart, we have 4 dimensions. The simulated configuration is represented by the x, y and z axes, while the number of simulated mismatches is represented by the color of the dots.

On the bottom left, we can see our current upstream configuration. The bottom-right bubble refers to the configuration we have in production. We can see that in the simulation we would get far fewer mismatches even with a single retry attempt - closing the gap to the 15-second timeout config while keeping a much tighter worst case: running a 300ms query 2 times vs 15 seconds.

Timeout 0 actually means 15 seconds - as that was the threshold when I gathered the data. Rlimit 0 means unlimited rlimit.

Disclaimer: I had a similar plot for 20 retries, but it would be just noise, as the “mismatches” quickly converge - which means we should only consider a handful of retries. This plot delivers the message better.

Here is the CSV data and Python code for an interactive plot:

joined_run1_and_run2_mismatching_simulated_outcomes.csv.txt (3.7 KB)

from matplotlib import colors
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions
import matplotlib.pyplot as plt
import mplcursors
import numpy as np
import pandas as pd

df = pd.read_csv("joined_run1_and_run2_mismatching_simulated_outcomes.csv", on_bad_lines='warn')

# Color the dots by the mismatch count on a log scale.
norm = colors.LogNorm(vmin=df['mismatches'].min(), vmax=df['mismatches'].max())
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# rlimit 0 means "unlimited"; map it to 10 so it survives the log10 transform.
log_rlimits = np.log10(df['rlimit'].replace({0: 10}))
sc = ax.scatter(df['timeout'], log_rlimits, df['retries'], c=df['mismatches'], norm=norm, cmap='turbo')
cbar = plt.colorbar(sc, ax=ax, pad=0.1)
cbar.set_label('mismatches (log scale)')
ax.set_xlabel('timeout (ms)')
ax.set_ylabel('rlimit (log10)')
ax.set_zlabel('retries')
ax.set_zticks([0, 1, 2])

# Show the exact configuration and mismatch count when hovering over a dot.
cursor = mplcursors.cursor(sc, hover=True)
@cursor.connect("add")
def on_add(sel):
  rlimit = int(10**log_rlimits.iloc[sel.index])
  rlimit = 0 if rlimit == 10 else rlimit  # undo the 0 -> 10 mapping for display
  sel.annotation.set_text(
    f'timeout: {df["timeout"].iloc[sel.index]}\n'
    f'rlimit: {rlimit:_}\n'
    f'retries: {df["retries"].iloc[sel.index]}\n'
    f'mismatches: {df["mismatches"].iloc[sel.index]}'
  )

plt.show()

I could share all the data and the code that did the simulation and charting, if needed.

We settled on having 2 retries to gain some additional safety margin - although neither the simulation nor the measurement of the actual implementation (see the initial post of this RFC) really showed improvements beyond 1 retry. 3x 300ms per query is still completely fine for us in the worst case.


I can imagine a world where the probability distribution of Z3 query runtimes has a very “large tail”, and in that case the restarts would be useful. E.g. as a toy example, consider a probability distribution where rerunning one particular query takes <200 ms with 90% chance and 2000 ms in the remaining 10% of runs – in this world doing three runs with a 300 ms timeout is much better than doing one run with a 900 ms timeout.

Exactly. One-off swings in query time can happen.
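
Just to spell out the arithmetic of your toy distribution (assuming, optimistically, that the attempts are independent - which is exactly what the VM-preemption concern below puts into question):

# Toy distribution from the quote above: a query finishes in <200 ms with 90%
# probability and takes ~2000 ms otherwise.
p_slow = 0.10

# A single run with a 900 ms timeout ends up UNDEF whenever the slow case is hit.
p_undef_single_900ms = p_slow       # 0.1

# Three runs with a 300 ms timeout each end up UNDEF only if all attempts are slow.
p_undef_three_300ms = p_slow ** 3   # 0.001

print(p_undef_single_900ms, p_undef_three_300ms)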

[…] To check this question, it would be nice to have measurements that compare the total analysis time and flakiness of e.g. timeout = X, repeats = 2 and timeout = 3X, repeats = 0 for various values of X. In addition to the timeout X = 300 ms that you suggest as default (because it fits the capabilities of your system), it would be important to have a few other measurements that e.g. use X = 150 ms or X = 200 ms etc. to approximate the situation when somebody uses a timeout of 300 ms on a system that is weaker than yours.

To simplify the comparison, I’d suggest doing these comparisons without specifying an aggregate timeout – but if you can suggest a reasonable value for it, then I’m also fine with that.

Ah, I think I just explained this. Let me know if I left something out.

Thanks for this additional information – it convincingly suggests that retries are indeed useful in this situation.

I’m still a bit confused to see that restarting the Z3 query improves the runtime – if the outlier slow runs were caused by things like VM preemptions then I don’t see how the restarted run would escape the same slowdown (which, according to my rough understanding, should affect the whole VM for a certain amount of time).

This is why I was suggesting a direct experimental performance comparison between e.g. 300 ms timeout + 2 retries and 900 ms timeout – even if there are rare but significant spikes in the observed distribution of runtimes, we don’t know whether aborting those spikes with a sharper timeout would provide a “usual” quick run or just another slow execution (because the system is still overloaded at that point in time).

However, this commit is conservative and not too large, and your measurements already suggest that it helps, so if you feel that you wish to merge it without further measurements, then I’m not opposed to it.

I can easily test any configuration. Just suggest 2 configurations that I should evaluate. What should I set these values to?

  • Z3CrosscheckEQClassTimeoutThreshold
  • Z3CrosscheckTimeoutThreshold
  • Z3CrosscheckRLimitThreshold
  • Z3CrosscheckRetriesOnTimeout

If you do one pair of measurements, I’d suggest

Option name       Config1  Config2
EQClassTimeout    0        0
Timeout           300      900
RLimit            0        0
RetriesOnTimeout  2        0

(I’m suggesting disabled RLimit and EqClassTimeout to avoid “unfair” influence from them; in real-world use obviously they’d be set to reasonable values.)

If you have capacity for a second comparison, also compare

Option name       Config3  Config4
EQClassTimeout    0        0
Timeout           200      600
RLimit            0        0
RetriesOnTimeout  2        0

to model the use of the 300 ms timeout on a slower machine.

@DonatNagyE Here are the results.
I did 8 runs in total, and diffed runs with identical configs against each other to see how much flakiness we would observe if we used the same config for two subsequent runs.

config                vs config (same)        differences
200ms 0rlim 0acc 2re  200ms 0rlim 0acc 2re    1  (+0 -1)
600ms 0rlim 0acc 0re  600ms 0rlim 0acc 0re    56 (+26 -30)
300ms 0rlim 0acc 2re  300ms 0rlim 0acc 2re    64 (+21 -43)
900ms 0rlim 0acc 0re  900ms 0rlim 0acc 0re    69 (+37 -32)

The data is from 200+ projects.

Hmm. I am a bit concerned about how different the 200ms 2re and the 300ms 2re configs are in flakiness. The same difference we see between 200ms and 300ms is probably observable at the same timeout but on different hardware. That being said, maybe both of those are acceptable for 200+ projects.

We rarely, if ever, saw a diff of two identical-config runs with fewer than ~20 changes; those changes likely come from other sources of nondeterminism.
To me, all of these are basically equally good results, within noise.

Probably the 200ms case was really lucky, and we got similar outcomes for the two runs. I don’t really want to repeat the experiment with a 3rd run.

Thanks for doing the measurements!

Unfortunately I think the results are very noisy, because

  1. there is just one single difference between the two 200ms 2re runs, which must be an extreme outlier; and
  2. shorter timeouts produced fewer differences than longer ones (#diffs(200ms 2re) < #diffs(300ms 2re) and #diffs(600ms 0re) < #diffs(900ms 0re)), while common sense predicts that shorter timeouts produce more flakiness.

Sidenote: More precisely, very short timeouts would produce little flakiness (because the queries would consistently time out), but among “normal” timeouts (where most of the queries succeed) longer ones should produce less flakiness on average.

Due to this noise in the measurement results, I feel that these eight runs are not enough to definitively demonstrate the usefulness of the suggested patch (i.e. that multiple small queries are better than just one query with an appropriately longer timeout). (At least their trend points in the right direction, but this trend is weaker than the definitely-just-noise trend that lower timeouts reduce the flakiness…)

However, there is still no reason to fear that the patch is dangerous or causes any problems, so if you wish to merge it, then I’m not opposed.
