[analyzer][RFC] Taming Z3 query times

Context

While upgrading from clang-17 to clang-18, we faced serious running-time regressions, and we had to revert some changes and postpone others.
In particular, the GenericTaintChecker, the BitwiseShiftChecker and the OOBv2 changes appeared to contribute to these slowdowns by applying more constraints, which caused Z3 to hit its poor performance characteristics a lot more frequently.

Now that isTainted() is fixed, we still see regressions, although more rarely. They can be grouped into 2 categories:

  • Z3 crosschecking hits the 15-second timeout many times in a row for some bug report equivalence classes, ballooning the running time of processing those equivalence classes, as mentioned here.
    Fixing this is the objective of this RFC.

  • Symbols got a lot more complicated in some cases. This affects projects with many nested loops containing array accesses, like FFmpeg. The worst slowdown was 28x (with this RFC applied), on FFmpeg’s sheervideo.c, which I reported here. Fixing this is out of scope for this RFC.


Problem

Z3 crosschecking should not impose a noticeable runtime regression, but the current implementation provides no such guarantee.
Consequently, the time necessary for processing a report equivalence class is uneven and unpredictable, while the outcome is limited: at most 1 diagnostic for the end user.

Notice that a single badly behaving large report equivalence class can ruin the whole analysis because of this - which makes it basically as severe as a crash.

Objective

Add safeguards for Z3 running times ensuring that Z3 crosschecking cannot dominate the analysis time.

Ideally, have some hard guarantees/limits bounding how long selecting a report from a report equivalence class can take. In our downstream prototype, we set this to 1 second.
This means that no matter how wild Z3 gets, we would still meet our deadline - even at the expense of dropping some diagnostics from the user’s perspective.

Sub-objectives

  • Should introduce as few flaky reports as possible, so it should ideally be deterministic.
  • Should not change the set of visible bug reports we have today, as that would make rolling out this feature disruptive.

Implementation considerations

Z3 option: rlimit

I wanted to use Z3’s rlimit, which is supposed to be a deterministic counter correlating with running time.
But it turns out that using rlimit alone is not enough, as there are Z3 queries that take 60+ minutes but barely consume any rlimit while doing so - defeating this threshold. This suggests that we must use regular timeouts in conjunction with rlimit.
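To make the per-query budget concrete, here is a minimal z3py sketch (the clang implementation goes through the C API, so this is illustrative only) of setting both limits on a solver; the 300ms/400k values mirror the thresholds proposed later in this RFC, and the constraint is just a placeholder:

```python
from z3 import Solver, Int, unknown

s = Solver()
s.set("timeout", 300)      # wall-clock cutoff in milliseconds (non-deterministic)
s.set("rlimit", 400000)    # resource-counter cutoff (deterministic)

x = Int("x")
s.add(x * x == 2)          # placeholder; the real constraints come from the bug path

res = s.check()
if res == unknown:
    # reason_unknown() tells whether the timeout or the rlimit kicked in;
    # the exact strings vary across Z3 versions.
    print("cut:", s.reason_unknown())
else:
    print(res)             # sat or unsat
```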

Data

I’ve gathered data for 200+ projects - C and C++, small and huge.
For this data, I disabled the short-circuiting that stops at the first valid bug report of a report eqclass. I did this to get a better picture of the Z3 queries we could potentially make.
On this scatter plot, you can see how the running time (ms) relates to the rlimit consumed by the Z3 queries. Both scales are logarithmic.

We can see that there is no really useful correlation between running time and rlimit. Disregard the horizontal bands - they appear because the measurement is only millisecond-granular - but note the dots at the very top, where we hit the current hard-coded 15-second timeout. Those are the ones that hurt us.

If we make the dots smaller, we get a sneak peek into the density of the solid blue parts of the chart, to better understand how 90%+ of the queries behave.

We can see that there is a dense spot that we need to consider when choosing a suitable rlimit and timeout.

Using an rlimit of 400k would deterministically cut most of the worst cases. It would also sacrifice the 0.04% of potential reports where we have small running times but huge rlimit consumption.

We can also see that queries only rarely take over 1 second; they usually take around 20, and at most 50, milliseconds.

Python prototyping, exploration

I’ve implemented a simulator in Python to experiment with different heuristics, and a method for comparing the current eqclass selection outcome - which report we accept as the valid bug report (and display to the user) - against the one that the new heuristic would pick, while aggregating how much query time the decision would take.

After this, I gathered the rows for the mismatches - the cases where the new heuristic would determine a different outcome - and looked for patterns I should watch out for.

By following these steps I ended up with the following heuristics (a sketch combining them follows this list):

  • Sometimes we have large eqclasses with many long-running Z3 queries (~200ms each), which eventually return UNSAT. This balloons the total running time, so I track the total running time spent on an eqclass and cut it if it exceeds 700ms.
  • I’ve considered a different metric: compute the average query time within the eqclass and cut it if these queries prove to take long, say over 100ms. For this we would also need to gain confidence first, say by having at least 5 queries to average over. But at that point this heuristic is not much different from the previous one.
  • I track how many queries we made so far in the eqclass, and if it’s over 100, I cut the eqclass. This is also quite similar to the aggregated-time heuristic above, but at least this one is deterministic.
  • Count how many times we get UNDEF because we hit the timeout, and if it happens a second time within the eqclass, drop the eqclass. I can’t prove that we shouldn’t drop it after the first occurrence, but my thinking was that the system might just have had a rare spike of CPU utilization or some other anomaly like that, so it might be better not to cut the eqclass right after the first timeout. I don’t have proof for this.
  • Same as the previous one, but for rlimit. This at least has the advantage of being deterministic. However, as you have seen from the scatter plot, an occasional large rlimit does not always imply a long running time, so I chose a more conservative 5 occurrences to trigger the cut. I also have no data to back this arbitrary decision.
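To make these cut conditions concrete, here is a minimal Python sketch of the per-eqclass budget the simulator tracks (the names, like EqClassBudget, are made up for illustration; the thresholds are the ones discussed above):

```python
from dataclasses import dataclass

@dataclass
class EqClassBudget:
    max_total_ms: float = 700.0   # aggregated Z3 query time per eqclass
    max_queries: int = 100        # maximum number of queries per eqclass
    max_timeouts: int = 2         # cut after the 2nd query timeout
    max_rlimit_hits: int = 5      # cut after the 5th rlimit exhaustion

    total_ms: float = 0.0
    queries: int = 0
    timeouts: int = 0
    rlimit_hits: int = 0

    def note_query(self, elapsed_ms: float, timed_out: bool, rlimit_hit: bool) -> bool:
        """Record one refutation query; return True if the eqclass should be cut."""
        self.total_ms += elapsed_ms
        self.queries += 1
        self.timeouts += timed_out
        self.rlimit_hits += rlimit_hit
        return (self.total_ms > self.max_total_ms
                or self.queries > self.max_queries
                or self.timeouts >= self.max_timeouts
                or self.rlimit_hits >= self.max_rlimit_hits)
```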

Tried configurations in the simulator

Number of eqclasses processed: 471489
A mismatch means that the outcome (the selected diagnostic we accept) of the given heuristic differs from the one that the baseline heuristic would have selected. (A sketch contrasting these accept/reject policies follows after the list.)

  1. Baseline (the current behavior): accept the report on SAT or timeout; reject the report (basically try the next one in the eqclass) on UNSAT.
    Total simulated Z3 query time: 4.21 hours (15151900 ms)
    Mismatches 0 % of the times (0 out of 471489)

  2. Baseline2 (reject UNSAT): accept on SAT; reject on UNSAT or timeout.
    Total simulated Z3 query time: 9.34 hours (33619000 ms)
    This suggests why we accept the report on hitting the 15-second timeout today. At least that is my guess for why it’s done this way.
    Mismatches 0.034 % of the times (161 out of 471489)

  3. Proposed heuristic: 300ms timeout, 400k rlimit, 700ms max aggregated query time, 100 max queries, cut after 2 timeouts, cut after 5 rlimit hits.
    Total simulated Z3 query time: 1.64 hours (5891304 ms)
    Mismatches 0.059 % of the times (277 out of 471489)

I also tried some other configurations, which I don’t mention here.
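To highlight how the configurations above differ, here is a hedged sketch of the accept/reject policies (the names and the budget_exhausted flag are illustrative; "unknown" stands for a timed-out or rlimit-cut query):

```python
def baseline(result):
    # current behavior: a timed-out query counts as accepted
    return "accept" if result in ("sat", "unknown") else "reject"

def baseline2(result):
    # rejecting on timeout forces us to keep querying the rest of the eqclass,
    # which is presumably why this variant costs 9.34 hours instead of 4.21
    return "accept" if result == "sat" else "reject"

def proposed(result, budget_exhausted):
    # reject on UNSAT like baseline2, but once the per-eqclass budget above is
    # exhausted, drop the whole eqclass instead of querying further
    if result == "sat":
        return "accept"
    return "drop-eqclass" if budget_exhausted else "reject"
```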

Implementation

I’ll prepare a PR implementing the proposed heuristic, but I’m open to suggestions for refining the idea.

EDIT: Here is the change in two commits:
1. Prepare with an NFC commit: [analyzer][NFC] Reorganize Z3 report refutation by steakhal · Pull Request #95128 · llvm/llvm-project · GitHub
2. Implement the "oracle" determining when to cut the eqclass: [analyzer] Harden safeguards for Z3 query times by steakhal · Pull Request #95129 · llvm/llvm-project · GitHub

Considered alternative approaches

  • Fuse the individual Z3 queries into a single optimization query using the following technique. Basically, have a single Z3 invocation per eqclass to make it scale better that way (see the sketch after this list):
    Have an integer symbol representing the query index; call it i.
    Assert (i = 0 ∧ eqclass0_constraints) ∨ (i = 1 ∧ eqclass1_constraints) ∨ …, and optimize for the lowest i.
    However, this would likely not improve the running time, as I don’t expect Z3 to reuse proofs across different terms if those terms use completely different symbols. So I did not implement this.
  • Have the trimmed graph for the eqclass’ reports - like we have today - but then use Z3 push/pop when a given node has multiple children. This way we could recursively descend into the paths, and by using push/pop the solver could reuse the learned proofs for the common path prefixes. This would likely improve performance somewhat. However, as a usual query seems to be quick, I don’t think this would really help in the outlier cases that initially motivated this RFC. Consequently, this was not explored, but remains open for the future.
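For the first alternative, a rough z3py sketch of the fused query (the constraint lists here are placeholders; this is not implemented in clang):

```python
from z3 import Optimize, Int, And, Or, sat

i = Int("i")                     # the query/report index symbol
x, y = Int("x"), Int("y")
constraints = [                  # one conjunction per report in the eqclass
    [x > 10, x < 5],             # infeasible: this report would be refuted
    [y > 0, y < 100],            # feasible: this report would survive
]

opt = Optimize()
opt.add(Or(*[And(i == idx, *cs) for idx, cs in enumerate(constraints)]))
opt.minimize(i)                  # prefer the earliest feasible report

if opt.check() == sat:
    print("first surviving report:", opt.model()[i])   # prints 1 here
```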

Feedback is welcome

Let me know what you think of this proposal. The patch is ready downstream, and I’ll soon post it.

@Xazax-hun @DonatNagyE @NoQ @mikhailrgadelha @Szelethus

I’ve also done a flakiness test for the proposed heuristic, where I added a uniform random ±10% perturbation to every query time and repeated the whole simulation multiple times.
It showed only 3-7 report changes per run compared to the first run of the heuristic, across the whole project set - which is really good.
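For reference, the perturbation in this flakiness test was just a uniform random scaling of each simulated query time (a sketch, assuming the simulator stores times in milliseconds):

```python
import random

def perturb(query_time_ms: float) -> float:
    # scale every simulated query time by a random factor in [0.9, 1.1]
    return query_time_ms * random.uniform(0.9, 1.1)
```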

Thanks for this thorough analysis of the Z3 query times!

Your reasoning is very convincing, I would be completely satisfied with your proposed heuristic. (You even picked my favorite colors for the bikeshed, so I don’t have anything to say here :smiling_face:)

<out of scope>
You write that "Symbols got a lot more complicated in some cases." but I’m not completely sure that this is a real effect. As far as I know this observation comes from logging code that was added into ArrayBoundV2 – and your bisection revealed that the increase of symbol complexity appears at a commit where ArrayBoundV2 switches to a different callback.

Based on this I’d say that it’s likely that the apparent increase of symbol complexity is an artifact introduced by the measurement method; however, I can be easily convinced about the opposite if there is more data. (And I can contribute to analyzing this situation if you need help.)
</out of scope>

I’ve decided to give a shot to an even more simplified variant of my proposed heuristic: 300ms query timeout, 400k rlimit, 700ms aggregated query time, cut after the first timeout or rlimit exhaustion.
The simulation suggests this would be less "flaky", but would also suppress a touch more reports.

Total simulated Z3 query time: 1.6 hours (5771256 ms)  # from originally proposed 1.64 hours
Number of eqclasses processed: 471489
Mismatches 0.066 % of the times (310 out of 471489) # from 277 (0.059%)
 - aggr-querytime-over-threshold      49
 - consumed-rlimit-too-many-times     51
 - timed-out                         210

For the perturbed simulations, it produced 1, 2, 1, 3, 2, 2, 3, 4, 2, 2 report differences compared to the non-perturbed run. These diffs came from only 2 projects - HHVM (3 times) and taskflow (22 times) - across only a handful of TUs. This suggests to me that those TUs are simply more prone to Z3 timeouts than others, hence more sensitive. This would imply that when doing differential testing, one could quickly learn that those are likely the flukes and focus on the relevant differences.

Because of this, I’d personally prefer this configuration of the heuristic over the originally proposed one.

Thanks! It was fun to work on this.

Once I have time, I’ll come back to that regression too, as I believe that no matter the use case, one should never see a 28x slowdown after an upgrade. I’ll create a separate thread on that subject, where I’ll elaborate on the evidence I have so far backing my statements - but that has lower priority, because it’s not seen as a blocking issue.

The simplified proposal is also a nice option, feel free to use that.

:thinking: The results

 - aggr-querytime-over-threshold      49
 - consumed-rlimit-too-many-times     51
 - timed-out                         210

suggest that perhaps the rlimit could be a bit stricter, to ensure that the non-deterministic timeouts are less common than the deterministic rlimit blocks.

Also, perhaps it would be useful to add a flag to configure/increase the refutation timeout for users who would accept slower runtimes if they get more deterministic results. (Although this is low priority, I don’t know whether it would be used in practice… perhaps not.)

While it seems tempting, if you look at the density chart I attached, it highlights that a tighter rlimit might cause significantly more cuts. This limits how tight we can set this. I could tighten this, but then we risk over-fitting the heuristic to my dataset.

Yes, I have a flag for both the timeout and the rlimit, but I highly discourage users from tinkering with them. I’d even strongly suggest that vendors not tweak these, as it would likely result in suboptimal tradeoffs.

The proposed solution makes sense to me given the data. I think tuning the cut heuristics makes a lot of sense. That being said, in the long term I hope we can also explore some alternative directions (most of which are already mentioned in the post), like:

  • Try using multiple solvers as they might have different outliers
  • Try constructing a single query for the whole eq. class
  • Try incremental solving for the common prefixes within the eq. class

Also, (as this post mentioned) too many complex constraints/symbolic expressions might be problematic for other parts of the analyzer, and can become bottlenecks. So it is possible that we might have to address the root cause of the problem and not generate certain constraints, regardless of Z3 performance. That being said, we still want to have better guarantees in place to avoid outliers regressing the performance too badly, as hard-to-solve queries are generally unpredictable and hard to avoid.

While it seems tempting, if you look at the density chart I attached, it highlights that a tighter rlimit might cause significantly more cuts. This limits how tight we can set this.

I see and I agree after eyeballing the graphs again. Having 80%+ timeouts (instead of rlimit exclusions) felt like too much, but the graph shows that it’s unavoidable.

I could tighten this, but then we risk over-fitting the heuristic to my dataset.

Also a very important point.

Yes, I have a flag for both the timeout and the rlimit, but I highly discourage users from tinkering with them. I’d even strongly suggest that vendors not tweak these, as it would likely result in suboptimal tradeoffs.

Yup, you’re right that tweaking these won’t be too useful in practice.

Thank you all for the feedback.

I’m also curious; however, after looking at flamegraphs, Z3 is no longer hot, and is usually well within 1% of the overall running time. I don’t think the return on investment would justify my curiosity.

So it is possible that we might have to address the root cause of the problem and not generate certain constraints, regardless of Z3 performance.

I had a ~120-line Z3 query which ran in a couple of milliseconds, and then I had the same one with about 10 extra lines. I killed Z3 after more than an hour.

we still want to have better guarantees in place to avoid outliers regressing the performance too badly, as hard-to-solve queries are generally unpredictable and hard to avoid.

It’s hard to enforce such limits, but with the proposed heuristic, a mass extinction of reports would suggest a sudden Z3 regression, because on timeout/rlimit exhaustion we drop the eqclass - hence losing the report.
Beyond that signal, I instrumented my implementation with LLVM statistics, so vendors can interpret those numbers to catch regressions.
At Sonar, we will definitely keep an eye on those, but we are not tracking clang trunk day-to-day, so, as in this case, we may report things with a considerable delay.

My plan is to merge the two patches by the end of Tuesday (tomorrow).
Although I don’t have the approvals, or any feedback (yet). So please review the second patch too.

FYI both commits were merged:

EDIT: Reverted, as this needs a higher minimal Z3 version than the one specified in CMake. I submitted a separate RFC for bumping the minimal Z3 version here, to unblock this RFC.

Hey @steakhal, sorry for the delay!

That’s some great work analyzing Z3’s performance, and indeed the right configuration can improve its performance considerably.

I don’t remember the code exactly, but do we configure Z3 in any way here? Does your Python tool also try different Z3 configurations?

If I could suggest only one change, it would be disabling relevancy [0] and maybe adding some simplify tactics when creating the solver [1]; these two changes improved the performance considerably when I tried them a few years ago in a different tool that also verifies C programs.
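For reference, a rough z3py sketch of those two tweaks (clang talks to Z3 through the C API, so this is illustrative only): disabling relevancy propagation globally and building the solver from a simplify+smt tactic:

```python
from z3 import set_param, Then, Int

set_param("smt.relevancy", 0)              # [0] disable relevancy propagation
solver = Then("simplify", "smt").solver()  # [1] pre-simplify before the SMT core

x = Int("x")
solver.add(x > 0, x < 10)
print(solver.check())                      # sat
```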

Also, have you considered using other solvers? I know this may entail a lot of work but may give a greater return than trying to find the right configuration for Z3 [2].

[0] esbmc/src/solvers/z3/z3_conv.cpp at master · esbmc/esbmc · GitHub
[1] esbmc/src/solvers/z3/z3_conv.cpp at master · esbmc/esbmc · GitHub
[2] I added support for Boolector, Yices, CVC4, and MathSAT a few years ago but for bug refutation there was no gain in switching so I didn’t pursue it further.


Thanks for your reply!

We do. But these apply across all solver instantiations regardless of their use case. This is a problem for timeouts, and this is why the hard-coded 15-second timeout doesn’t really work for us.

Depends on the configuration you have in mind. I explored different rlimit and timeout values - but no other configuration values or tactics. I was primarily looking for ways to deterministically cut Z3 queries - preferably superseding the existing timeout-based mechanism.

I guess that would make sense for exploring improvement opportunities for the Z3-based constraint solver - but for refutation I don’t really see much value. We usually spend well under 1 percent of the running time in Z3. The goal of this RFC was not to improve the running time of Z3, but to offer hard guarantees/safeguards for the existing runtime behavior. Of course, a better solver configuration could possibly reduce the number of timeouts or rlimit exhaustions, but the data suggests that happens really rarely - to a degree that I wouldn’t investigate it.

I think everything I just shared also applies to exploring other solvers, justifying not investing the work there either.

These are great ideas for making the Z3-based solver actually useful, but that’s definitely out of scope for me.

Hi!

Sorry that I joined the convo so late! These measurements and the preparation are world class, especially the annotated charts! Well done! I’m raising some technical issues here, but that shouldn’t take away from the work or the actual achievements.

We noticed lately that our jobs became severely non-deterministic. My measurements show that even on very small projects, clang reliably produces different result sets with the same configuration.

This is a diff between two identical runs, all configs set to their default values, yet there are differences in the results:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=qtbase_v6.2.0_determinism_check9A_1&run=libwebm_libwebm-1.0.0.27_determinism_check9A_1&run=vim_v8.2.1920_determinism_check9A_1&newcheck=qtbase_v6.2.0_determinism_check9B_1&newcheck=libwebm_libwebm-1.0.0.27_determinism_check9B_1&newcheck=vim_v8.2.1920_determinism_check9B_1

This is a diff between two identical runs with the two timeout Z3 configs set to 0 and the rlimit config left unchanged; still, different result sets:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=qtbase_v6.2.0_determinism_check12A_1&run=libwebm_libwebm-1.0.0.27_determinism_check12A_1&run=vim_v8.2.1920_determinism_check12A_1&newcheck=qtbase_v6.2.0_determinism_check12B_1&newcheck=libwebm_libwebm-1.0.0.27_determinism_check12B_1&newcheck=vim_v8.2.1920_determinism_check12B_1

This is a diff between two identical runs with only "crosscheck-with-z3-rlimit-threshold" set to 0 and the other configs unchanged; the result sets aren’t identical, even though this is supposed to be a deterministic cutoff:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=qtbase_v6.2.0_determinism_check13A_3&run=libwebm_libwebm-1.0.0.27_determinism_check13A_3&run=vim_v8.2.1920_determinism_check13A_3&newcheck=qtbase_v6.2.0_determinism_check13B_3&newcheck=libwebm_libwebm-1.0.0.27_determinism_check13B_3&newcheck=vim_v8.2.1920_determinism_check13B_3

This is a diff between two identical runs with ALL new Z3 configs set to 0; this finally seems deterministic, but of course, catching non-determinism is easy as long as you find a mismatch - proving its absence, not so much:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=qtbase_v6.2.0_determinism_check13A_2&run=libwebm_libwebm-1.0.0.27_determinism_check13A_2&run=vim_v8.2.1920_determinism_check13A_2&newcheck=qtbase_v6.2.0_determinism_check13B_2&newcheck=libwebm_libwebm-1.0.0.27_determinism_check13B_2&newcheck=vim_v8.2.1920_determinism_check13B_2

To us, nondeterminism is a major dealbreaker, and we issued an emergency fix for our downstream branch. I’m writing this comment to discuss whether we should set these configs to 0 by default, because my gut feeling is that we shouldn’t release an analyzer version with so much (any, really) non-determinism, certainly not by default.

Can I suggest that we offer a frontend, or even driver-level, flag like '-analyzer-quick' that sets these to their appropriate values, but is off by default?

Again, big respect for the real advantages introduced by the patch and the thorough work to back it up!

Hello everyone!

Just noticed this discussion and it seems like I’ve faced something similar (see issue clangsa+CTU analisys produces different resuls on big code base · Issue #4368 · Ericsson/codechecker · GitHub). I filed a bug against CodeChecker, since I thought it might be related to how CodeChecker manages CTU information (I was completely wrong, I guess).

This issue is a big blocker for us, since we cannot enable CSA on CI (but we really want to) =(

Long time no see Husi!

This means a lot, thanks!

I’m surprised to hear this.
In my experience, before and after these PRs, nondeterministic issues remained on the same order of magnitude. Remember, here I wanted to tackle the excessive running time spent during Z3 refutation to produce a bug report.
This will by nature also cause some flaky behavior in the rare cases where even setting a low rlimit wouldn’t cut Z3’s query time. (This is the top-left corner of the chart, about ~0.01% of the queries.)

Nondeterministic reports/paths in the past

We also had nondeterminism in the past, when each Z3 query had a 15-second timeout. If it timed out, we accepted the report - even though at other times Z3 finished within the timebox and possibly returned UNSAT, in which case we picked the next bug report candidate from the eqclass - introducing flakiness in the path, because now a different path is selected. Or Z3 may have finished in time for all the queries in the eqclass, in which case the bug report wouldn’t appear at all.
So, even in the past we had nondeterminism.

Large bug report eqclasses with long Z3 query times:

We also had cases where a bug report eqclass had ~500 reports and each report took ages to refute (probably because they shared a common path prefix that encoded constraints difficult for Z3 to solve). This meant that to produce a single bug report in the output - or rather to refute and reject it - it could have taken 500x10 seconds, which is more than an hour, btw. We had user complaints that suddenly they could not analyze files because the analysis halted due to this bug. When this happened, they had to disable Z3 refutation, which was an unpleasant mitigation.

After these PRs, we at least have a hard guarantee that it shouldn’t take longer than 1 second to produce a bug report. This comes at the expense of a tighter timeout, along with hitting that timeout more frequently. rlimit should make it more or less deterministic - I can’t guarantee it, because I don’t know how it is used inside Z3, or how Z3 itself is implemented - but that was the best I had in my toolbox at the time. As you have seen on the charts, even rlimit doesn’t limit Z3 query times. Consequently, we still had to set timeouts - and then the question was what threshold to pick.

If you know of a better heuristic that would also let us have hard guarantees about how long it takes to produce (or not produce) a bug report, that would be great. I tried several, but none were unimpeachable.

Other sources of nondeterminism

I’ve spent more days than I wanted tracking down other sources of nondeterministic behavior. CSA currently has rare, but funny, cases. They fall into 3 categories that I’m currently aware of:

  1. Some ordering depends on pointer addresses (maybe open-addressing hash tables, like Dense{Set,Map} and friends - remember, their hash uses the address of some function as the seed). This means that each time clang runs, the virtual pages are different, thus the ordering becomes different. One can disable ASLR (address space layout randomization) on Linux, which helps achieve consistent behavior at the expense of security.
    See here how to do it system-wide, or here to disable it for a single process.

  2. If a bug report eqclass has a report that is reachable via 2 equally long paths - think of a diamond from the root node to the error node - then we will pick only one path, the lucky path, and test it using the usual path-sensitive bug reporter visitors. If that path happens to be marked invalid, we move on and select the next report in the eqclass. However, if we had picked the other side of the diamond-shaped path of the same length from the root to the error node, the visitors might have accepted the report. This is not fixed, and is not necessarily related to the Z3 visitor.

  3. Finally, the reason why you are here. The Z3 visitor can cause nondeterministic behavior due to the nature of timeouts. It did in the past with 15 seconds, and it does now with the much tighter 300ms; the threshold doesn’t change the fact that it is flaky. The tighter limit made it worse for sure, but it depends on how much is acceptable for the gain. It’s a tradeoff. I wish there was a truly deterministic way of cutting Z3 query times.

The way forward

I think I could come back to this story. First, I’d propose ensuring we have consistent data, and that it’s really this Z3 cut heuristic that causes the flaky behavior. Then we should understand exactly how the Z3 queries of that bug report eqclass behave.

BTW, the version of Z3 you use may be different from what I used during the evaluation.
I used the latest at the time, z3-4.13.0. Different versions can produce vastly different running times.

I think we all share this sentiment. Determinism is really important, just like having no crashes or hangs. From what I’ve seen, it’s a tradeoff between hangs and nondeterminism, but it’s a delicate dance.

To me, there is not enough evidence that setting a different threshold (say, disabling both thresholds) would be a better tradeoff.

If we think other thresholds would be better, we should just change the defaults.
I’m against adding a new option for this. I would find that confusing.


Thanks for the xref. This is really valuable feedback. I think you can try to override the configs; they’re there exactly for this reason. You can also try setting both to 0 to see if that works better for you. Let me know if it helps!

Sure! We confirmed (again) that the 15-second upper bound, or something like it, has to exist. With that said, we rarely noticed any nondeterminism with that value in practice.

Damn, that’s frustrating. Especially since our results are pretty conclusive that rlimit, as it is now, is not only an insufficient cutoff, but also fails in the determinism department.

We took a long hard look at the random seed used in Z3 (among other things), but the seed itself seems to be deterministically initialized to 0. I can only suspect what you are suspecting – Z3 is buggy at this time. If we use it with any timeouts or limits (which we also confirmed is a must), we just have to accept some nondeterminism. If we push the limit hard, in some scenarios (like ours) it could further degrade the determinism of the results.

So I’ve heard, we should definitely hunt those down eventually.

My understanding is that we deterministically choose the shortest path. I’m not immediately aware of nondeterminism arising from this area.

We recently upgraded our Z3 version according to your suggestion and got back the same results, unfortunately. It might just be that our machines are slower, so more than 0.01% of the reports were affected, but that still is a sign that the current defaults need adjusting.

What exactly are you looking for in addition to the data I shared? To me, it’s striking that even on relatively small projects (the entire job took 1.5 hours to run on vim, libwebm, and qtbase twice) the randomness of the results is huge. It’s almost deterministically nondeterministic. With zero intent to sound like an alarmist, this sounds pretty dire, backed by plenty of proof.

Don’t get me wrong, I am more than happy to produce some more benchmarks if we genuinely missed something!

We found that reverting to the original 15-second timeout (by effectively reverting this change on our internal branch) was a better solution; otherwise we faced intolerable analysis times and crashes.

I have no evidence for nondeterministic behavior (result or query time) of Z3 if only an rlimit is set (without a timeout). I can only assume that it may have some nondeterminism, just like any sufficiently complex piece of software. That said, those cases are likely considered "bugs", but that is just the way it is. Software has bugs. So this means we can strive for determinism, but we should arm ourselves for nondeterministic behavior. This is especially true at software component boundaries, like 3rd parties - like Z3 in our case. We as the community likely won’t go and fix Z3 nondeterminism if it has some (again: no evidence that it has). This leaves us exposed to 3rd-party maintainers.

This is another nail in the coffin of the timeout-based approach. Time is relative to hardware performance and load. Good point. BTW the 0.01% appears to be in line with my measurements if you look at the top-left corner of my chart.
If we don’t change the underlying heuristic (the decision-making model) but only the thresholds, what thresholds would you suggest? If you would change the heuristic, in what way would you?
BTW I can privately share the data and Python scripts that I used for the charts and all the evaluation, to experiment with other heuristics and thresholds without running a full-blown measurement across 200+ projects.

What I wanted to refer to was that the biggest difference out of those 4 links (measurements) you shared was 6 reports. Given that it was diffing 3 projects, including qtbase, this is in line with my expectation (unfortunately). There were 2 reports that were included in the diffs of two separate measurements (qtbase/src/3rdparty/sqlite/sqlite3.c@23530 and vim/src/syntax.c@2716).

Like I said, I don’t challenge that flickering issues are a problem. I also think that having a predictable budget for how much time we spend producing an issue is important. It’s fine if we spend 1 second, or 10 seconds at most, producing a bug report, as long as we have some limit that prevents us from spending 90+% of the total running time in Z3 refutation. (This was sometimes the case in the past.) I’m not sure how to bring such guarantees without worsening observable nondeterminism.
I’d argue that having an analysis spend all of its time in Z3 is basically a hang, which is on the same order of magnitude in severity as a crash; having timeouts, and thus rare flaky issues, is a better tradeoff. Picking a threshold is arbitrary, and all I could do was be transparent and document my choices.

I’m still very open to a different solution, assuming we can also prevent Z3 hangs - which was the motivation of this effort.
This may very well mean admitting that rlimit should just work, and that we should consistently report any sufficiently diverging cases where the running time isn’t bounded by rlimit (basically the cases that force us to also set the timeout parameter). If we didn’t have this problem, we wouldn’t need to set an aggressive timeout, and we wouldn’t have flaky issues because of the tight timeout. Currently, this is the only path forward that I see.

This is really interesting to hear. This would mean that the escape hatch doesn’t work as I intended. Setting a 15-second timeout together with an rlimit of 0 is not equivalent to the past behavior. The difference is that we would not "accept" the report unless we get SAT for the Z3 query (unlike what we did in the past). We can consider changing this. We should definitely not have crashes when overriding those thresholds to any non-negative value. Feel free to report crashes of this nature, and I’ll have a look.

I suggest that we revert to the values used before the patch (no need to remove the configs, just restore the previous behaviour in terms of speed). Do I understand correctly that we would still expect fewer reports, because "we would not 'accept' the report unless we get SAT for the Z3 query"?

The best way to avoid picking an arbitrary value was just what you did – make measurements and pick a number that makes sense based on them. This is why I appreciated the hustle so much! It just so happened that those measurements were not representative of our use case, and seemingly not of others’, unfortunately.

In the end, the question is what is more important: not crippling analysis time, or having (much more) reliable analysis results. Considering that we use the CSA in pull request gating, almost no price is too high for us to achieve greater reliability. I venture to say this is likely more in line with the general expectations of a compiler (and sure, I respect that we are not really compiling).

Knowing the specific characteristics of your machines, you could still set these values downstream, so you could enjoy the fruits of all these amazing measurements! I’m not sure how widespread the "slow is a hang, a hang is a crash" sentiment is in the wider world. That is certainly not our experience, though our whole-program analyses often run overnight, where lost speed doesn’t sting quite as badly.

Considering that only those values need changing, it shouldn’t be a big maintenance nightmare downstream. The main point I’m trying to make is that the old values make more sense for the general user base of the analyzer upstream.

When I said we technically reverted the patch downstream, I actually meant that we literally reverted it :slight_smile: I made the nondeterminism measurements with your patch applied and the values set to 0 (as part of a broader effort to bisect this release and find which patch was problematic), but our internal release will go out with it fully reverted.