[analyzer][RFC] Retry Z3 crosscheck queries on timeout

I’ll migrate the questions raised at the PR here, as I believe the direction should be discussed in this RFC, and the implementation details at the PR.

@DonatNagyE I’m a bit surprised by the idea of using multiple attempts instead of a single run with a larger timeout – intuitively we’re wasting the already performed calculations if we are impatient and abort+restart the calculations after each short timeout (instead of allocating all the time for a single run).

Excellent question. I should have explained this upfront.
In the “Z3 safeguards” RFC, I posted a chart about the Z3 queries we make.
The vast majority of them are quick, and the ones that run long usually do so consistently. Neither the quick nor the slow queries are a big problem in practice, because they behave consistently across runs.
Increasing the timeout would also help, which is why we reverted to the 15-second timeout last week - until we reduce the flakiness.
However, a longer timeout means we would consistently exhaust that long timeout for badly behaving queries. With retries, we can set a smaller threshold and still achieve a consistent result from run to run, while possibly beating the longer timeout.
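The gist of the idea, as a minimal sketch in Python with z3py (purely illustrative; the actual analyzer change is on the C++ side and the names here are made up):

import z3

# Illustrative sketch of the retry idea: use a small per-attempt timeout and
# retry the same solver on UNKNOWN instead of one long-timeout attempt.
def check_with_retries(solver, timeout_ms=300, retries=2):
  solver.set("timeout", timeout_ms)  # per-attempt timeout in milliseconds
  result = z3.unknown
  for _ in range(1 + retries):
    result = solver.check()
    if result != z3.unknown:
      break  # SAT or UNSAT is a definitive outcome; stop retrying
  return result

# usage: result = check_with_retries(some_solver, timeout_ms=300, retries=2)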

We had exactly the same question, and we decided to measure how long retried Z3 queries run. I set a 15-second timeout and disabled the rest of the thresholds - just like you have upstream now. Then I measured how long each check() call runs, up to 21 attempts (1 initial and 20 retries). This is the data I used for the evaluations and simulations later.

You can see that there are occasional (but rare) spikes in the measured query times. My theory is that those are likely caused by VM preemptions and similar anomalies. Most of the queries run for roughly the same length, but the retry attempts usually beat the initial attempt consistently. My theory is that this could be tied to warming up the caches and branch predictors for the retries. Finally, there are a couple of queries where the initial attempt runs very slowly, but the retries run a lot quicker. This suggests to me that Z3 must do something internally if we reuse the same Solver object and call check() repeatedly. One theory is that it reuses/caches something internally that makes the retries consistently faster. Otherwise I can’t explain this.

Disclaimer: I had so much data that I could only chart one fifth of the sampled data. I also dropped the slowest 1% of attempt 0 queries to get a better view of the vast majority of the dataset.
Surface height represents the query times (ms) of the individual measurements, which are also depicted on a color scale to make it easier to spot the peaks.

After this, I looked into which is more beneficial for the stability of a query outcome: increasing the timeout or increasing the number of retries. The outcome of a query sequence (the initial query and its retries) is SAT or UNSAT if any of the attempts resulted in SAT or UNSAT; it’s UNDEF otherwise.
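Expressed as a tiny (hypothetical) helper, just to pin down that definition:

# Hypothetical helper mirroring the outcome definition above: a sequence of
# attempts is SAT/UNSAT if any attempt is, and UNDEF otherwise.
def sequence_outcome(attempt_results):
  for result in attempt_results:
    if result in ("SAT", "UNSAT"):
      return result
  return "UNDEF"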

I captured the data a second time and joined the two measurements by file, eqclass ID and report ID - the combination that uniquely identifies a row in my data.
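Roughly like this (a sketch; the file names and column names are illustrative stand-ins for my data):

import pandas as pd

# Join the two measurement runs on the key that uniquely identifies a report.
# ('file', 'eqclass_id', 'report_id' stand for the file, equivalence-class ID
# and report ID mentioned above; the actual column names may differ.)
run1 = pd.read_csv("run1_measurements.csv")
run2 = pd.read_csv("run2_measurements.csv")
joined = run1.merge(run2, on=["file", "eqclass_id", "report_id"],
                    suffixes=("_run1", "_run2"))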

After this, I built a simulation that, for a given configuration, tells me the simulated outcome. Doing this for both run1 and run2 and then comparing the simulated outcomes of the two tells us whether the report would be flaky or not - according to the simulation. Then I could display the number of “mismatches” for each configuration and judge what is better: a longer timeout or more retry attempts.
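A simplified sketch of what such a simulation can look like (illustrative names; the rlimit dimension and bookkeeping are omitted, and we pretend each joined row carries the list of per-attempt durations and outcomes for both runs):

# Simplified sketch of the per-configuration simulation; uses the
# sequence_outcome helper sketched above. 'times' and 'results' are the
# measured per-attempt durations (ms) and outcomes for one report in one run.
def simulate(times, results, timeout_ms, retries):
  attempts = [r if t <= timeout_ms else "UNDEF"
              for t, r in zip(times[:1 + retries], results[:1 + retries])]
  return sequence_outcome(attempts)

def count_mismatches(joined_rows, timeout_ms, retries):
  # A report is "flaky" under this configuration if the two runs disagree.
  return sum(simulate(row["times_run1"], row["results_run1"], timeout_ms, retries)
             != simulate(row["times_run2"], row["results_run2"], timeout_ms, retries)
             for row in joined_rows)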

On this chart, we have 4 dimensions. The simulated configuration is represented by the x, y, z axes, while the number of simulated mismatches is represented by the color of the dots.

On the bottom left, we can see our current upstream configuration. The bottom right bubble refers to the configuration we have in production. And we can see that in the simulation, we would already have far fewer mismatches with a single retry attempt - closing the gap to the 15-second timeout config, while having a much tighter worst-case timeout: running a 300 ms query 2 times vs. 15 seconds.

Timeout 0 actually means 15 seconds - as that was the threshold when I gathered the data. Rlimit 0 means unlimited rlimit.

Disclaimer: I had a similar plot for 20 retries, but it would be just noise, as the “mismatches” quickly converge, which means we should only consider a handful of retries. This plot delivers the message better.

Here is the CSV data and Python code for an interactive plot:

joined_run1_and_run2_mismatching_simulated_outcomes.csv.txt (3.7 KB)

from matplotlib import colors
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 - registers the '3d' projection on older matplotlib
import matplotlib.pyplot as plt
import mplcursors
import numpy as np
import pandas as pd

df = pd.read_csv("joined_run1_and_run2_mismatching_simulated_outcomes.csv", on_bad_lines='warn')

# Color the dots by the number of mismatches, on a logarithmic scale.
norm = colors.LogNorm(vmin=df['mismatches'].min(), vmax=df['mismatches'].max())
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Rlimit 0 means "unlimited"; map it to 10 so that log10 stays finite.
log_rlimits = np.log10(df['rlimit'].replace({0: 10}))
sc = ax.scatter(df['timeout'], log_rlimits, df['retries'], c=df['mismatches'], norm=norm, cmap='turbo')
cbar = plt.colorbar(sc, ax=ax, pad=0.1)
cbar.set_label('mismatches (log scale)')
ax.set_xlabel('timeout (ms)')
ax.set_ylabel('rlimit (log10)')
ax.set_zlabel('retries')
ax.set_zticks([0, 1, 2])

# Show the exact configuration of a dot when hovering over it.
cursor = mplcursors.cursor(sc, hover=True)
@cursor.connect("add")
def on_add(sel):
  # Undo the log10 transformation and the 0 -> 10 sentinel mapping.
  rlimit = int(10**log_rlimits.iloc[sel.index])
  rlimit = 0 if rlimit == 10 else rlimit
  sel.annotation.set_text(
    f'timeout: {df["timeout"].iloc[sel.index]}\n'
    f'rlimit: {rlimit:_}\n'
    f'retries: {df["retries"].iloc[sel.index]}\n'
    f'mismatches: {df["mismatches"].iloc[sel.index]}'
  )

plt.show()

I could share all the data, as well as the code that did the simulation and charting.

We settled on having 2 retries to have some additional safety margin - although neither the simulation nor the measurement of the actual implementation (see the initial post of this RFC) showed real improvements after 1 retry. 3x 300 ms per query in the worst case is still completely fine for us.


I can imagine a world where the probability distribution of the runtime of Z3 queries has a very “large tail”, and in that case the restarts would be useful. E.g. as a toy example consider a probability distribution where rerunning one particular query takes <200 ms with 90% chance and 2000 ms in the remaining 10% of runs – in this world doing three runs with 300 ms timeout is much better than doing one run with 900 ms timeout.

Exactly. One-off swings in query time can happen.
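For the toy distribution above the arithmetic is straightforward (assuming the reruns are independent): three 300 ms attempts all time out only with probability 0.1^3 = 0.1%, while the single 900 ms run still times out 10% of the time. A quick Monte Carlo sanity check of that toy distribution (not real measurement data):

import random

# Toy distribution from the example above: each attempt takes <200 ms with
# 90% probability and 2000 ms otherwise; attempts are assumed independent.
def sample_attempt_ms():
  return 150 if random.random() < 0.9 else 2000

def undef_rate(timeout_ms, attempts, trials=100_000):
  failures = sum(all(sample_attempt_ms() > timeout_ms for _ in range(attempts))
                 for _ in range(trials))
  return failures / trials

print(undef_rate(timeout_ms=300, attempts=3))  # ~0.001 (= 0.1^3)
print(undef_rate(timeout_ms=900, attempts=1))  # ~0.1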

[…] To check this question, it would be nice to have measurements that compare the total analysis time and flakiness of e.g. timeout = X, repeats = 2 and timeout = 3X, repeats = 0 for various values of X. In addition to the timeout X = 300 ms that you suggest as default (because it fits the capabilities of your system), it would be important to have a few other measurements that e.g. use X = 150 ms or X = 200 ms etc. to approximate the situation when somebody uses a timeout of 300 ms on a system that is weaker than yours.

To simplify the comparison, I’d suggest doing these comparisons without specifying an aggregate timeout – but if you can suggest a reasonable value for it, then I’m also fine with that.

Ah, I think I just explained this. Let me know if I left out something.