Brief
In this RFC, I’ll share some statistics from an experiment that improved the run-to-run stability of analyses using Z3 refutation by roughly 80-90% in my measurements.
History and motivation
In a previous RFC, “Taming Z3 query times”, I shared that in some rare cases Z3 refutation can effectively halt an analysis: it could run for close to 15 seconds for every report candidate in a report equivalence class, which adds up to a surprisingly long time for large eqclasses. Alternatively, we could have many small eqclasses, with many of them exercising a similar “hang”.
I argued that a “hang” is as serious as a “crash”, which motivated me to add safeguards providing a hard guarantee that a report equivalence class is always processed in at most 1 second (and usually much faster than that).
The proposed solution traded these hangs for a limited but measurable number of nondeterministic (flaky) reports - an inherent consequence of tighter timeouts.
@szelethus pointed out in a reply that the increase in flaky reports is a no-go, so the change was essentially reverted last week.
I think this was a good call overall, but I wanted to come back and reduce the flaky behavior even with the values used before the revert, so here are my results.
Idea: retry the failing queries
Let’s take a query whose runtime is about as long as our timeout threshold. Sometimes it would pass, other times it would time out - causing a flaky report. (Remember, a query that consistently times out for a given threshold is not a problem, because it behaves consistently, so the outcome is deterministic.)
So, if we have a flaky report, we should be able to improve the stability of the outcome by running the query multiple times if it times out.
This wouldn’t completely fix the flaky reports caused by nondeterministic timeouts in Z3 refutation, but it may reduce their number to an acceptable level at the expense of spending a bit more time on some Z3 queries.
The main benefit of this approach is that we would keep hard guarantees similar to those of the safeguard introduced in the RFC “Taming Z3 query times”.
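To make the idea concrete, here is a minimal, self-contained sketch of the retry loop (not the actual patch; `refuteWithRetries`, `Outcome`, and the toy `FlakyQuery` are made-up names for illustration, and `RunQuery` stands in for running one Z3 query with a per-query timeout):

```cpp
#include <functional>
#include <iostream>
#include <optional>

enum class Outcome { SAT, UNSAT }; // definite answers from the solver

// Re-run a single Z3 refutation query on timeout, up to MaxAttempts runs in
// total. RunQuery models "run one Z3 query with a per-query timeout" and
// returns std::nullopt when the solver times out.
std::optional<Outcome>
refuteWithRetries(const std::function<std::optional<Outcome>()> &RunQuery,
                  unsigned MaxAttempts) {
  std::optional<Outcome> Result;
  for (unsigned Attempt = 0; Attempt < MaxAttempts && !Result; ++Attempt)
    Result = RunQuery(); // only a timeout (nullopt) triggers a re-run
  return Result;         // still nullopt: treat as a (consistent) timeout
}

int main() {
  // Toy query that "times out" on the first run and answers on the second,
  // mimicking a borderline query whose runtime hovers around the threshold.
  int Run = 0;
  auto FlakyQuery = [&]() -> std::optional<Outcome> {
    if (++Run < 2)
      return std::nullopt; // first attempt: timeout
    return Outcome::UNSAT; // second attempt: definite answer
  };
  auto Result = refuteWithRetries(FlakyQuery, /*MaxAttempts=*/2);
  std::cout << (Result ? "definite answer\n" : "timed out\n");
}
```

Only the timeout outcome is worth retrying: for a fixed query, SAT/UNSAT is a semantic property, so a definite answer cannot flip between runs.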
Measurements
I used our pool of 200+ projects.
- With a timeout of 15 seconds, 0 rlimit, 0 accumulated eqclass threshold and 20 retries, we should get the most stable run we have ever had. Diffing two identical runs of this configuration revealed 7 flaky issues. I could have simply disabled Z3 crosschecking to confirm that these are flaky for reasons other than Z3. This is basically the “best” we can hope for, just to set the scene.
- A run with 300ms timeout, 400k rlimit and 700ms accumulated timeout resulted in ~500 flaky issues (the sum of appearing (+) and disappearing (-) issues). This was in line with our past experience with this default configuration - call this the “baseline”.
  - run 1 vs run 2: +259 -251
- A run with 300ms timeout, 400k rlimit, 700ms accumulated timeout and 1 retry allowed resulted in about 50 to 100 flaky issues, roughly 5 times better than the “baseline”.
  - run 1 vs run 2: +27 -23
  - run 2 vs run 3: +10 -90
  - run 3 vs run 1: +89 -13
- A run with 300ms timeout, 400k rlimit, 700ms accumulated timeout and 2 retries allowed resulted in about 60-110 flaky issues, which is basically identical to having 1 retry.
  - run 1 vs run 2: +27 -34
  - run 2 vs run 3: +58 -53
  - run 3 vs run 1: +56 -54
This data suggests that allowing 1 retry reduces the flakiness by a lot - in my measurements, by 80-90%. Having more retries didn’t really help in our case, but I suspect this may depend on the compute cluster the analysis runs on, so setting the default retry count to more than 1 (say, 2) is likely a better default than just 1.
What’s next?
Unfortunately, as long as we have timeouts, we will always have some flaky reports. Caching Z3 queries would help to “pin” the outcome of the first time we run a given Z3 query, as described here. However, all of these are just smart ways of fighting nondeterminism, not a silver bullet.
We plan to come back to implementing and measuring a prototype of the Z3 query cache, along with improving the determinism of the other parts of the analyzer, if our bandwidth allows in the future.
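To make the caching idea a bit more tangible, here is a rough, hypothetical sketch (the `Z3QueryCache` name, the string key, and the in-memory storage are all my assumptions, not a design decision; how to key the query and whether to persist the cache across runs are open questions for the prototype):

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Possible outcomes of one refutation attempt, including a timeout.
enum class RefutationResult { SAT, UNSAT, Timeout };

// Hypothetical cache keyed by the textual (e.g. SMT-LIB) form of the query.
// The first recorded outcome is "pinned": later lookups return it instead of
// re-running Z3, so the result of the same query cannot flip afterwards.
class Z3QueryCache {
  std::unordered_map<std::string, RefutationResult> Cache;

public:
  std::optional<RefutationResult> lookup(const std::string &QueryKey) const {
    auto It = Cache.find(QueryKey);
    return It == Cache.end() ? std::nullopt : std::optional(It->second);
  }

  void record(const std::string &QueryKey, RefutationResult Result) {
    Cache.try_emplace(QueryKey, Result); // keep only the first outcome
  }
};
```

The intent is simply that the first recorded outcome wins: once a borderline query has produced a result, later evaluations reuse it instead of rolling the timeout dice again.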
I’m currently crafting the test case for the PR before submitting it.
EDIT: Here is the PR.