Context
While upgrading from clang-17 to clang-18, we faced serious running time regressions, and we had to revert some changes and postpone others.
In particular, the GenericTaintChecker, the BitwiseShiftChecker and the OOBv2 changes appeared to contribute to these slowdowns by introducing more constraints, which causes Z3 to hit its poor performance characteristics much more frequently.
Now that isTainted() is fixed, we still have rarer regressions. Those can be grouped into 2 categories:
- Z3 crosschecking hits the 15-second timeout many times in a row for some bug report equivalence classes, ballooning the time spent processing the affected equivalence classes, as mentioned here. Fixing this is the objective of this RFC.
- Symbols got a lot more complicated in some cases. This affects projects with many nested loops containing array accesses, like FFmpeg. The worst slowdown was 28x (with this RFC applied), on FFmpeg sheervideo.c, which I reported here. Fixing this is out of scope for this RFC.
Problem
Z3 crosschecking should not impose a noticeable runtime regression, but the current implementation offers no such guarantee.
Consequently, the time needed to process a report equivalence class is uneven and unpredictable, while the payoff is limited: at most one diagnostic shown to the end user.
Note that a single badly behaving large report equivalence class can ruin the whole analysis because of this, which makes the issue essentially as severe as a crash.
Objective
Add safeguards for Z3 running times, ensuring that Z3 crosschecking cannot dominate the analysis time.
Ideally, have hard guarantees/limits bounding how long report selection from a report equivalence class may take. In our downstream prototype, we set this limit to 1 second.
This means that no matter how wild Z3 gets, from the user's perspective we would still meet our deadline, even at the expense of dropping some diagnostics.
Sub-objectives
- Should produce as few flaky reports as possible, so it should ideally be deterministic.
- Should not change the set of visible bug reports we have today, as changing it would be disruptive when rolling out this feature.
Implementation considerations
Z3 option: rlimit
I wanted to use Z3's rlimit, which is supposed to be a deterministic counter correlating with running time.
But it turns out that using it exclusively is not enough, as there are Z3 queries that take 60+ minutes but barely consume any rlimit while doing so, defeating this threshold. This suggests that we must use regular timeouts in conjunction with rlimit.
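For illustration, here is a minimal sketch of how both limits can be set on a solver, using the Python z3 bindings (the actual implementation targets Z3's C API inside the Clang Static Analyzer). The constraint is a placeholder, and the 400k/300ms values are the ones proposed later in this RFC, not current defaults:

```python
from z3 import Solver, Int, unknown

s = Solver()
s.set("rlimit", 400000)  # deterministic resource counter; 0 would mean "no limit"
s.set("timeout", 300)    # wall-clock timeout, in milliseconds

x = Int("x")
s.add(x * x == 2)  # placeholder standing in for a report's path constraints

if s.check() == unknown:
    # Hitting either limit yields "unknown"; the reason tells which one.
    print("query cut:", s.reason_unknown())
```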
Data
I've gathered data for 200+ projects: C and C++, small and huge.
For this data, I disabled the short-circuiting that stops at the first valid bug report of a report eqclass. I did this to get a better picture of the Z3 queries we could potentially make.
On this scatter plot, you can see how the running time (ms) relates to the rlimit consumed by the Z3 queries. Both scales are logarithmic.
We can see that there is no really useful correlation between running time and rlimit. Disregard the horizontal lines, which are an artifact of the millisecond-granular measurement, but note the dots at the very top, where we hit the current hard-coded 15-second timeout. Those are the ones that hurt us.
If we make the dots smaller, we can get a sneak peek into the density of the solid blue parts of the chart, to better understand how the 90%+ of the queries behave.
We can see that there is a dense spot that we need to consider when choosing a suitable rlimit and timeout.
Using an rlimit of 400k would deterministically cut most of the worst cases. It would also sacrifice the 0.04% of the potential reports where we have small running times with huge rlimit consumption.
We can also see that queries only rarely take over 1 second; they usually take around 20 and at most 50 milliseconds.
Python prototyping, exploration
I've implemented a simulator in Python to experiment with different heuristics, along with a method for comparing the current eqclass selection outcome (which report we accept as the valid bug report and display to the user) against the one that the new heuristic would pick, while aggregating how much query time the decision would take.
After this, I gathered the rows for the mismatches, i.e. when the new heuristic would determine a different outcome, and looked for patterns that I should watch out for.
By following these steps I ended up with the following heuristics (a sketch of the combined cut logic follows this list):
- Sometimes we have large eqclasses with many long-running Z3 queries (~200ms) which eventually return UNSAT. This balloons the total running time, so I track the total running time spent on an eqclass and cut it if it takes more than 700ms.
- I've considered a different metric: calculate the average query time in the eqclass and cut the eqclass if these queries prove to take long, say over 100ms. For this, we would also need to gain confidence first, e.g. have at least 5 queries before taking the average. But at that point this heuristic is not much different from the previous one.
- I track how many queries we have made so far in the eqclass, and if it's over 100, I cut the eqclass. This is also quite similar to the aggregated-timeout heuristic above, but at least this one is deterministic.
- Count how many times we get UNDEF because we hit the timeout limit, and if it happens for the second time in the eqclass, drop the eqclass. I have no proof that dropping on the very first timeout would be wrong, but my consideration was that the system might have had a rare spike of CPU utilization or some other anomaly like that, so it might be better not to cut the eqclass right after the first one. I don't have data to back this up.
- Same as the previous, but for rlimit. This at least has the advantage of being deterministic. However, as you have seen from the scatter plot, an occasional large rlimit does not always directly imply long running times, so I chose a more conservative 5 occurrences to trigger the cut. I also have no data to back this arbitrary decision.
Tried configurations in the simulator
Number of eqclasses processed: 471489
A mismatch means that the outcome (the selected diagnostic we accept) of the given heuristic differs from the one that the baseline heuristic would have selected.
- Baseline (the current behavior): accept the report on SAT or timeout; reject the report (basically, try the next one in the eqclass) on UNSAT. (Both baseline policies are sketched after this list.)
  Total simulated Z3 query time: 4.21 hours (15151900 ms)
  Mismatches: 0% of the time (0 out of 471489)
- Baseline2 (rej UNSAT): accept on SAT; reject on UNSAT or timeout.
  Total simulated Z3 query time: 9.34 hours (33619000 ms)
  This suggests why we accept the report on hitting the 15-second timeout today; at least that is my guess as to why it's done so.
  Mismatches: 0.034% of the time (161 out of 471489)
- Proposed heuristic: 300ms timeout, 400k rlimit, 700ms max aggregated query time, 100 max queries, cut after 2 timeouts, cut after 5 rlimit hits.
  Total simulated Z3 query time: 1.64 hours (5891304 ms)
  Mismatches: 0.059% of the time (277 out of 471489)
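For clarity, the two baseline acceptance policies can be summarized as follows; a sketch using z3's result constants, with illustrative function names:

```python
from z3 import sat, unsat

def baseline_accepts(result) -> bool:
    # Current behavior: accept on SAT or on timeout (unknown); reject only on UNSAT.
    return result != unsat

def baseline2_accepts(result) -> bool:
    # Alternative: accept only on a definite SAT; reject on UNSAT or timeout.
    return result == sat
```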
I also tried some other configurations, which I don't mention here.
Implementation
I'll prepare a PR implementing the proposed heuristic, but I'm open to suggestions for refining the idea.
EDIT: Here is the change in two commits:
1. Prepare with an NFC commit: [analyzer][NFC] Reorganize Z3 report refutation by steakhal · Pull Request #95128 · llvm/llvm-project · GitHub
2. Implement the "oracle" determining when to cut the eqclass: [analyzer] Harden safeguards for Z3 query times by steakhal · Pull Request #95129 · llvm/llvm-project · GitHub
Considered alternative approaches
- Fuse the individual Z3 queries into a single optimization query, using the following technique. Basically, have a single Z3 invocation per eqclass to make it scale better that way: have an integer symbol representing the query index, call it i, assert
  (i = 0 ∧ eqclass0_constraints) ∨ (i = 1 ∧ eqclass1_constraints) ∨ ...
  and optimize for the lowest i. However, this would likely not improve the running time, as I don't expect Z3 to reuse proofs across different terms if those terms use completely different symbols. So I did not implement this. (A sketch follows this list.)
- Have the trimmed graph for the eqclass' reports, like we have today, but then use Z3 push/pop when a given node has multiple children. This way we could recursively descend into the paths, and by using push/pop the solver could reuse the learned proofs for the common paths. This would likely improve performance somewhat. However, as a usual query seems to be quick, I don't think this would really help in the outlier cases, which initially motivated this RFC. Consequently, this was not explored, but it remains open for the future.
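Here is a minimal sketch of the fused optimization query (the idea I did not implement), using the Python z3 bindings; the constraints below are placeholders standing in for the reports' path constraints:

```python
from z3 import Optimize, Int, And, Or, sat

i = Int("i")
x = Int("x")
# Placeholder constraints, one per report in the eqclass.
report_constraints = [x > 10, x < 0, x == 5]

opt = Optimize()
opt.add(Or(*[And(i == idx, c) for idx, c in enumerate(report_constraints)]))
opt.minimize(i)  # prefer the first feasible report, matching today's ordering

if opt.check() == sat:
    print("first feasible report index:", opt.model()[i])  # prints 0 here
```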
Feedback is welcome
Let me know what you think of this proposal. The patch is ready downstream, and I'll post it upstream soon.