[RFC] Make `isTainted()` and complex symbols friends

As promised in my previous RFC, I'm coming back to the last remaining running-time regression and opening the discussion here.


After the clang-18 release, our benchmarks revealed some serious slowdowns. Most of them are fixed now, or will be fixed once the patches proposed in my previous RFC land.

Assuming that, there is only one critical slowdown that I know of. It is reproducible on FFmpeg's sheervideo.c. In this particular case, I can trace the problem back to having much more complicated symbols - of complexity 30 or 31 at times. This interacts poorly with the current isTainted() APIs, which recursively traverse a lot of nodes of such complicated SymExprs.
This is likely due to BinarySymExpr::computeComplexity, but I’ll leave this intentionally vague as I have not gathered indisputable evidence.
This was reported here.
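To give a feeling for what "complexity 30, 31" means, here is a toy sketch (my own illustration, not analyzer code; the exact formula in computeComplexity may differ slightly, but the growth pattern is the same). Hash-style loops keep wrapping the previous value in another binary expression, so the tree that isTainted() has to walk grows linearly with the loop, and the query is repeated at every array access:

```cpp
#include <cstdio>

// Toy stand-in for SymExpr: complexity is roughly 1 plus the complexity of
// the sub-expressions, so nesting drives it up quickly.
struct Expr {
  const Expr *LHS = nullptr;
  const Expr *RHS = nullptr;
  unsigned complexity() const {
    unsigned C = 1;
    if (LHS) C += LHS->complexity();
    if (RHS) C += RHS->complexity();
    return C;
  }
};

int main() {
  const Expr Leaf{};            // stand-in for a conjured leaf symbol
  Expr Nodes[16];
  const Expr *Hash = &Leaf;
  for (int I = 0; I < 15; ++I) { // simulate: hash = hash * 33 + input[i]
    Nodes[I] = Expr{Hash, &Leaf};
    Hash = &Nodes[I];
  }
  std::printf("complexity after 15 iterations: %u\n", Hash->complexity()); // 31
}
```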

Here are some numbers for the time spent inside the static analyzer:

  • clang-17 variant of OOBv2: 33 seconds
  • clang-18 variant of OOBv2 & MaxSymbolComplexity=10: 33 seconds (on par with baseline)
  • clang-18 variant of OOBv2: 927 seconds (28x baseline)

Here are the corresponding flame graphs:

clang-17 variant of OOBv2: [flame graph image]

clang-18 variant of OOBv2 & MaxSymbolComplexity=10: [flame graph image]

clang-18 variant of OOBv2: [flame graph image]

These numbers suggest to me that the ballooned symbol complexity, as it is currently present on trunk, poses real challenges, and that we can hit poor performance characteristics under certain workloads.


I can see a few ways of mitigating this:

  • Limit MaxSymbolComplexity. (Easy, quick and dirty. Doesn't really solve the issue.)
  • Apply caching or other performance optimizations to isTainted(). (One needs to be really careful not to use local statics, but rather properly lifetime-bound caches.)
  • Find and roll back the hunk that causes the appearance of such complicated symbols. (This doesn't really solve the issue either.)

To close, I'd like to invite you to discuss this.
@DonatNagyE @Xazax-hun @NoQ @Szelethus

Personally, I'd probably prefer option no. 2, but I'm also open to option no. 1, assuming it does not degrade the quality of the analysis.

I think the moral of this story is that we do not have the necessary metrics in place, or at least we do not have a good way to track them over time. It would be nice to have some aggregate statistics, like the distribution of the complexity of symbolic expressions, that we could track over time on some projects.

Limiting the MaxSymbolComplexity might make sense, but I think it would be great to get a better understanding of the consequences. Are there parts of the analyzer that could realistically reason about huge symbolic expressions? When would those break down? So I think it might make more sense to go backwards and see what sizes other parts of the analyzer (not just taint) can realistically handle.

I think my point is, if taint is the only problem, we should fix taint. If other parts of the analyzer also suffer from perf blowups, that is a stronger argument for reducing the MaxSymbolComplexity.

Well, I caught it :smiley:

I could gather statistics, but that needs to be really well specified. Which API should we hook and inspect? Symbol creation is not really representative, as symbols are handed out from a cache, so even the most frequently seen symbol used by the checkers (from any callback) will only be constructed once - distorting the distribution.
Let me know if you have a specific use-case in mind that I could measure.
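For concreteness, one measurement point I could imagine, purely as an illustration (the histogram variable, the helper name and the hook placement are my own assumptions, not a settled design), is to record the complexity of every symbol that isTainted() is queried with, so symbols are weighted by how often checkers actually look at them rather than by how often they are constructed:

```cpp
// Illustration only: tally symbol complexity at the isTainted() query site.
// QueriedComplexity and noteTaintQuery are hypothetical names; the call
// would live at the top of the isTainted() overloads.
#include "clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h"
#include "llvm/ADT/DenseMap.h"
#include <cstdint>

namespace {
llvm::DenseMap<unsigned, uint64_t> QueriedComplexity; // complexity -> #queries

void noteTaintQuery(clang::ento::SymbolRef Sym) {
  if (Sym)
    ++QueriedComplexity[Sym->computeComplexity()];
  // The histogram could be dumped per translation unit (e.g. from a
  // check::EndAnalysis callback) and aggregated across projects offline.
}
} // namespace
```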

I understand the sentiment, but I don’t know a reliable way to measure how well something can reason about something.

I can totally agree with this section.
But one thing should be clear: we are exposed, and there is a risk that something else breaks only on certain workloads - such a workload might slip through all of our testing, and we would regress without being aware of it.
Limiting the MaxSymbolComplexity would reduce that risk.

This time it was isTainted(), but it could be something else the next day.
That being said, I'm also on the side of just fixing isTainted() and waiting for another regression of a similar kind to justify the next step on MaxSymbolComplexity.

For the sake of clarity, I note that this issue is different from the one that was resolved by the PR "[analyzer] Fix performance of getTaintedSymbolsImpl()" (llvm/llvm-project #89606, by NagyDonat).

(At first I added a comment claiming that this issue was resolved by that PR, then I realized that this is a somewhat similar, but separate problem.)

I revisited the slowdown - the excessive recursion in isTainted() - and here are my thoughts.

Why urgent?
We are pressured by the LLVM release schedule if we don't want to massively regress on analyzing hash-heavy projects (many nested loops with array accesses that individually call isTainted(), etc.).

Why does isTainted() do a symbol traversal in the first place, causing massive recursion?
It does so because we can't maintain (or propagate) taint eagerly: checkers and the engine may construct new symbols wrapping a tainted symbol, and we would also need to recognize those as tainted even if the GenericTaintChecker doesn't know about their existence. The only way to get this information is to walk the whole tree of the symbol and see whether any of the nodes inside is considered tainted.
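Roughly, the shape of that walk is the following (a simplified sketch, not the actual getTaintedSymbolsImpl() code; isDirectlyTainted() is a hypothetical stand-in for the lookup into the taint map):

```cpp
// Sketch: taint is recorded per sub-symbol, so a query has to look at every
// node of the SymExpr tree. symbol_begin()/symbol_end() iterate over the
// symbol and all of its nested sub-symbols, so the cost of a single query
// grows with the complexity of the symbol.
#include "clang/StaticAnalyzer/Core/PathSensitive/ProgramState.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h"
#include "llvm/ADT/iterator_range.h"

using namespace clang::ento;

// Hypothetical helper: checks the taint map entry for exactly this symbol.
bool isDirectlyTainted(ProgramStateRef State, SymbolRef Sym);

bool isTaintedSketch(ProgramStateRef State, SymbolRef Sym) {
  for (SymbolRef Sub :
       llvm::make_range(Sym->symbol_begin(), Sym->symbol_end()))
    if (isDirectlyTainted(State, Sub))
      return true;
  return false;
}
```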

Let's apply caching to isTainted():

First, I considered piggybacking on a private ProgramState GDM to hold the cache of isTainted() results, but that would mean that isTainted() and friends would need to change the State to commit the cache, which is ugly, surprising and inconvenient. So I dropped this idea.
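To show what I mean by ugly: committing the cache into the State would force something like the following signature onto what is conceptually a pure query (a hypothetical sketch, not an actual proposal):

```cpp
#include "clang/StaticAnalyzer/Core/PathSensitive/ProgramState.h"
#include <utility>

using namespace clang::ento;

// Hypothetical GDM-backed variant: the query has to return a fresh
// ProgramState just to persist the memoized answer, and every caller would
// have to remember to addTransition() with it - for a simple boolean question.
std::pair<ProgramStateRef, bool> isTaintedCaching(ProgramStateRef State,
                                                  SymbolRef Sym);
// Usage at a call site (sketch):
//   auto [NewState, Tainted] = isTaintedCaching(State, Sym);
//   C.addTransition(NewState);
```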

Next, I considered a private static global cache, while being careful to clear it after finishing each entry point by subscribing to the check::EndAnalysis callback. In my prototype, I had to be really careful to erase the cache entries for all symbols and all nested sub-symbols when we add or remove taint, as that can and will alter what isTainted() would report.
As taint::isTainted, taint::addTaint and taint::removeTaint have 3-4 overloads each, and they all recurse among themselves, there is no centralized place where I could do the cache invalidation/population. This makes the approach unmaintainable long term and would lead to hard-to-debug, rare bugs w.r.t. taint propagation. Reviewing such a change would also be pretty painful, as one would need to look for cases where I'm missing a cache invalidation - which won't appear in the diffs.

This also means that we do redundant cache invalidation (basically a hashtable lookup that eventually concludes there is nothing to erase because we already erased it), which is unfortunate.

When adding taint to a symbol, we unwrap its SymbolCasts to apply the taint; but this also means that we need to invalidate the cache for all unwrapped layers of symbols to make sure it won't hold any stale entries, as previous isTainted() calls might have been invoked on the exact SymbolCast we just unwrapped. The same methodology applies to every other case where we unwrap some region or symbol, so it gets fairly complicated and scattered quickly.
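For illustration, the conservative invalidation in my prototype looked roughly like this (a reconstructed sketch, not the actual diff; IsTaintedCache and the helper name are placeholders):

```cpp
// Sketch: whenever taint is added to or removed from Sym, every cached answer
// whose symbol tree mentions Sym (including wrappers such as SymbolCasts) may
// be stale and has to be dropped.
#include "clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/iterator_range.h"

using namespace clang::ento;

namespace {
llvm::DenseMap<const SymExpr *, bool> IsTaintedCache; // cleared in check::EndAnalysis

void invalidateEntriesMentioning(SymbolRef Sym) {
  llvm::SmallVector<const SymExpr *, 8> Stale;
  for (const auto &Entry : IsTaintedCache) {
    const SymExpr *Cached = Entry.first;
    if (llvm::is_contained(
            llvm::make_range(Cached->symbol_begin(), Cached->symbol_end()),
            Sym))
      Stale.push_back(Cached);
  }
  for (const SymExpr *S : Stale)
    IsTaintedCache.erase(S);
}
} // namespace
```

Every addTaint/removeTaint overload, and every place that unwraps a SymbolCast or a region, would need to remember to call something like this - which is exactly the scattering I'm worried about.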

The problem is that the API surface of taint is massive, and IMO we should reduce the number of APIs. That would allow us to regain better control over the recursion entry points and apply caching cleanly.
We should also avoid ad-hoc control flow and dispatching in the implementation, and prefer describing the traversal using an SValVisitor. This would lead to a more declarative description of the recursion.
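As a sketch of where I'd like to end up (my illustration; the real patch would express the dispatch with the SymExprVisitor CRTP from SValVisitor.h, and isDirectlyTainted() is again a hypothetical stand-in), a single centralized traversal makes it obvious where the cache lookup and population belong:

```cpp
// Sketch: one entry point for the recursion, memoized per symbol. The plain
// dyn_cast dispatch below stands in for a visitor; UnarySymExpr, derived
// symbols and region-based taint are omitted for brevity.
#include "clang/StaticAnalyzer/Core/PathSensitive/ProgramState.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h"
#include "llvm/ADT/DenseMap.h"

using namespace clang::ento;
using llvm::dyn_cast;

bool isDirectlyTainted(ProgramStateRef State, SymbolRef Sym); // hypothetical

static bool isTaintedRec(ProgramStateRef State, SymbolRef Sym,
                         llvm::DenseMap<SymbolRef, bool> &Cache) {
  if (auto It = Cache.find(Sym); It != Cache.end())
    return It->second;

  bool Result = isDirectlyTainted(State, Sym);
  if (!Result) {
    if (const auto *SC = dyn_cast<SymbolCast>(Sym))
      Result = isTaintedRec(State, SC->getOperand(), Cache);
    else if (const auto *SIE = dyn_cast<SymIntExpr>(Sym))
      Result = isTaintedRec(State, SIE->getLHS(), Cache);
    else if (const auto *ISE = dyn_cast<IntSymExpr>(Sym))
      Result = isTaintedRec(State, ISE->getRHS(), Cache);
    else if (const auto *SSE = dyn_cast<SymSymExpr>(Sym))
      Result = isTaintedRec(State, SSE->getLHS(), Cache) ||
               isTaintedRec(State, SSE->getRHS(), Cache);
  }
  Cache[Sym] = Result;
  return Result;
}
```

The Cache object could live for the duration of an entry point (invalidated as discussed above) or just for a single top-level query; either way there is exactly one place to get the invalidation right.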

We could also consider restricting caching to only "complex" symbols, let's say of complexity above 10. This could reduce the memory footprint of caching if that turns out to be a problem. (TBD)
(Just think for a minute about the funny bugs that could pop up if they only happened across entry points, for complex symbols. It would be "fun" to reduce and investigate why the taint cache got stale; so we need to get this right if we do this.)

This would be a complicated change, and would probably not justify the risk of backporting it to clang-19. So I don't think we have any option other than temporarily reducing the max symbol complexity from 35 to some smaller threshold. So far I was considering 10, but I'd need to check a large corpus of projects to see what the impact would be.
My assumption is that most projects shouldn't be significantly impacted by this reduction - except maybe the projects where we would have really complicated symbols. But luckily, those are the ones where we would regress anyway w.r.t. running time. So the tradeoff is clear to me, and we need to bite the bullet and accept report changes there.

My plans

  • Test what symbol complexity would work.
  • Reduce the max complexity threshold upstream.
  • Backport the commit to clang-19, with a note in the ReleaseNotes.
  • Reduce the API surface of taint::*
  • Rework the recursion, preferably describing it using visitors.
  • Apply caching to the isTainted API call.
  • Celebrate :tada: :tada: :tada:

WDYT @DonatNagyE @Xazax-hun @NoQ?

I think I was not completely accurate here. Clang-18 already regressed; we just had downstream workarounds. So in that sense, clang-19 would not regress further. I'd say it's probably still good to target a fix for clang-19 as a backport.

Sounds like a good plan :smile:

I don’t have any concrete feedback right now, but I broadly agree with all of your ideas. (And thanks for troubleshooting this aspect of the analyzer!)

I did measurements on 200+ projects, including C and C++ projects of small, medium and large size.

I found that lowering the MaxSymbolComplexity threshold from 35 to 20, 10, or 5 had an increasingly negative effect on the quality of the results. Even though the number of FPs grows as MaxSymbolComplexity is lowered, the number of report diffs is not too bad. Still, I'm now considering a different approach; skip to the end to see why.

In total we are talking about roughly 145’000 translation units, and 59’000 static analyzer diagnostics.

To give an idea of the scale:

  • 100 reports correspond to 0.17%
  • 200 reports correspond to 0.34%

I'll describe the comparisons as triplets (-M +N ~C), meaning:

  • M disappearing issues
  • N appearing issues
  • C updated issues; basically issues that lead to the same place via slightly different notes.

MaxSymbolComplexity=35: baseline

MaxSymbolComplexity=20: against baseline (-72 +95 ~97)
91 projects had absolutely no difference.
For the most part, issues that disappear reappear at a slightly different place, so the real number of new issues is closer to 95-72=23 than to 95.

The issue diff is a mixed bag. Some issues disappear while the same issue reappears at a slightly different (but nearby) place; I'd say we should ignore those.
We have a couple of genuinely new appearing issues, and also some that disappear. All of these look like FPs to me.

The updated issues are frequently caused by non-deterministic behavior, for example when the engine selects which bug path to report, so it is not a surprise that we have some of those. In one case we take the false branch, while in the other we take the true branch, but it doesn't actually matter which one we pick.


MaxSymbolComplexity=10: against baseline (-119 +177 ~247), against prev conf (-124 +159 ~221)
92 projects had absolutely no difference. (1 more than in the previous configuration; probably some flakiness)


MaxSymbolComplexity=5: against baseline (-155 +732 ~650), against prev conf (-113 +632 ~618)
66 projects had absolutely no difference.


I also looked at analysis times, and in general, setting a lower threshold does not lower analysis times. This makes sense because the issue appeared only on a narrow set of projects, mostly ones doing some sort of hashing: xxhash, md5, sha1, sha256, and some string-manipulation code.

That said, in those cases, setting MaxSymbolComplexity=10 could improve analysis times by 20-30%, and in one case (the one motivating this thread) by around 20x. But note that we are talking about a tiny fraction of translation units.

What’s next

In the end, I'm not convinced that lowering the threshold would be a good mitigation, as it would also imply report changes of about half a percent.

What I'm still considering is to patch isTainted() to first check the symbol complexity and, if it is over some threshold (say 10), bail out and report "not tainted".
This would limit the blast radius, while also bounding the recursion of isTainted.
This should have a more favorable report diff footprint, while mitigating the issue.
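A sketch of what I mean (illustrative only; the constant name, its value and the helper are placeholders, not a reviewed patch):

```cpp
// Sketch: a guard the isTainted() overloads could evaluate before doing the
// recursive traversal. Overly complex symbols are reported as "not tainted":
// a bounded source of false negatives in exchange for bounded running time.
#include "clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h"

constexpr unsigned MaxTaintedSymbolComplexity = 10; // placeholder threshold

static bool isTooComplexToCheck(clang::ento::SymbolRef Sym) {
  return Sym && Sym->computeComplexity() > MaxTaintedSymbolComplexity;
}
// Each SymbolRef-taking isTainted() overload would early-return false when
// isTooComplexToCheck(Sym) is true.
```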

I plan to do a measurement to back this assumption up, and if barely any reports change and the running times are back under control, I'll propose a patch.

What I'm still considering is to patch isTainted() to first check the symbol complexity and, if it is over some threshold (say 10), bail out and report "not tainted".

I really like this approach :smile:

I suspect that the effects of this limitation will be negligible because I don’t think that we’ll see many taint sources near the complex hashing logic that produces the complex symbols. Moreover, even if the limitation triggers, it can only cause a false negative – in an area (taint checking) that is already fuzzy and heuristic-based.

I'm looking forward to seeing your measurements – but I think this particular plan is innocent enough to be acceptable even without measurements.

Alright. After restricting isTainted() to ignore "complex" symbols (complexity above 10), the report diff looks a lot better: (-61 +63 ~81). About half of the projects were completely unmodified.
None of the affected reports were taint-related. They were mostly OOBv2 (-15 +25), NullDeref (-20 +16) and garbage reads (-11 +12), and no project was affected noticeably more than the rest.

Looking at the running times, the translation unit for FFmpeg's sheervideo.c now takes only 5.6% of the time it took without this fix.

I think this approach is the way.
Now, I’ll have a bit of a vacation, but after that I’ll come back to finish this.

How can changing the result of isTainted() from true to false influence a non-taint-related report?

Is this caused by (not) hitting some performance protection cutoffs? Is this pure random noise?

(By the way have a nice vacation :smile: ! My questions are not urgent, feel free to answer them when you return.)

This is basically noise; it's hard to tell more accurately. We also plan to lower the run-to-run variance, but the non-deterministic behavior is really tricky. So far I've spent 5 days on a single case. It's mad. What I can already tell is that disabling ASLR would help; we don't disable it for measurements as of now.
