Loop handling improvement plans

DonatNagyE · July 28, 2024, 9:01pm

I spent lots of time thinking about loop modeling within Clang Static Analyzer, and here I’m trying to summarize my current plans and observations.

Note that my primary goal is reducing the amount of false positives caused by unfounded assumptions, but some plans could also improve the coverage (by continuing the analysis after the loop) and obviously I need to consider the performance impact of any plans that I hope to implement.

My observations are not revolutionary, I’d guess that most of them are already familiar for the “veterans” of this project. However, I think it’s still valuable to describe my exact position to ensure that we’re on a common page and highlight my goals.

Also, I’m sure that there are some mistakes and inaccuracies – please correct these if you spot them!

General background

The current situation is very easy to summarize: by default, the analyzer doesn’t use any loop-specific logic during the analysis. Loops are simply backwards edges in the graph of the basic blocks and there is a threshold called maxBlockVisitOnPath which kills execution paths that encounter the same block too many times.

The well-known (?) “we model the first three iterations of loops” rule is just one special case of this limitation: the default value of maxBlockVisitOnPath is 4, so the modeling is stopped after the third iteration, when BlockCount >= maxBlockVisitOnPath becomes true.

The command-line option -analyzer-max-loop is not loop-specific: it just specifies the value of maxBlockVisitOnPath, so it also affects repeated function calls and other situations where the same block can be reached multiple times. (Also note that the name -analyzer-max-loop is misleading: if you want to run e.g. 5 loop iterations, you need to specify -analyzer-max-loop 6.)

Unfortunately the maxBlockVisitOnPath threshold can cause a loss of coverage (when some code cannot be reached without visiting the same block many times), but the analyzer tries to restart the analysis without inlining the currently analyzed function, so if we have code like

void long_loop() {
  for (int i=0; i<100; i++)
     do_something();
  buggy_code(1);
}
void uses_long_loop() {
  long_loop();
  buggy_code(2);
}

then the analyzer cannot report buggy_code(1) (because during the analysis of long_loop it cannot find a code path that can reach it without running into maxBlockVisitOnPath), but it can report buggy_code(2) (on a code path where long_loop() is not inlined).

Off-by-default features

We also have two off-by-default loop-specific features which were developed by Péter Szécsi in 2017: loop-unrolling (implemented in LoopUnrolling.cpp) and loop-widening (implemented in LoopWidening.cpp).

The loop-unrolling feature detects certain “simple”, bounded loops with AST matchers, and simply disables checking maxBlockVisitOnPath during their execution. According to the old reviews, this feature is (or at least was?) good enough to be enabled by default – but it was eventually left in an off-by-default state.

The loop-widening feature is activated at the beginning of the last modeled iteration within a loop (where BlockCount == maxBlockVisitOnPath - 1) and generalizes the State by invalidating lots of information. This tweak can “help” the execution path leave the loop (e.g. if i becomes unknown, the analyzer will be able to assume that i < 100 is false), and therefore it increases the coverage.

If I understand it correctly, this feature is also in a good shape, but it isn’t as complete as loop-unrolling.

Known problems

Currently the analyzer has several serious limitations that are directly connected to the (lack of) loop handling logic:

Unfounded assumptions after loop: Each check of the loop condition can introduce a state split, so after the loop we will have separate execution paths where the loop was executed exactly 0/1/2/3 times. These paths place very narrow constraints onto the loop variable and/or the variables that are compared to it; which produces lots of false positives when one of these execution paths is semantically infeasible (e.g. the programmer knows that the list is not empty) but the analyzer doesn’t realize this.
Unfounded assumption of a third iteration: If the code contains a loop, then it’s valid to assume that the developer intended at least two iterations (because otherwise they would’ve written an if instead) – but the analyzer shouldn’t assume that a third iteration is possible, unless that is implied by the loop conditions. In ffmpeg this produces dozens of ArrayBoundV2 false positives where the analyzer doesn’t understand the loop condition and assumes that it can perform three iterations over a two-element array.
Lack of coverage after loop: If the analyzer cannot find a path through the loop, then the code after it won’t be analyzed.

Among these I’m primarily focusing on (1.), with (2.) as a secondary goal; while Péter Szécsi was mostly working on (3.).

Note that these are generic issues that affect many different checkers, so I think it’s not effective to introduce hacks that fix them for one particular checker.

Plans for unfounded assumptions after loops

The root of the issue is that if the analyzer cannot deduce the (true/false) value of the loop condition, then we get several separate code paths (“didn’t enter the loop”, “did 1 iteration”, “did 2 iterations” etc.) instead of continuing on a single combined “did some number of iterations” code path.

Here we do want to continue the analysis after the loop (loops are too common to ignore everything after them), but it’s logically invalid to single out any one iteration count and report the issues deduced from it.

There are several potential approaches to resolve this issue:

(A) Force the merge of the code paths: Collect all the code paths that are exiting from the loop, take the union of their state and continue the analysis from this “artificial” exploded graph node (instead of the original “natural” paths).

The union of several State objects can be defined as:

for each symbol, take the union of the RangeSets corresponding to it;
for each equality/disequality relationship between symbols, keep it only if it’s present in all of the states;
for each stored binding, if the different States contain different SVals, then introduce a new symbolic value to represent them and introduce a proper range for it;
for checker-specific state entries, do something checker-specific if they’re different.

Although theoretically this solution is the most accurate one (in some sense), it would require introducing a novel operation on the complex State data structures, and that is probably prohibitively difficult.

(B) Pick one code path and invalidate information where it differs from the others: This is a simplified variant of (A) where the symmetric (and unprecedented) merge operation is replaced by comparison of states (as far as I know, this is implementable) and invalidations (which is a standard operation).

In this case we may keep all the paths after the loop, but as they would become very similar with the invalidations, it’s probably better to keep just one and discard the rest (for performance reasons).

However, this is still a complex plan that requires a “collect all outgoing transitions and replace them with a single one” step. (I’d guess that implementing this is difficult, but manageable, but I’m not familiar with the graph traversal implementation.)

(C) After the loop invalidate the variables participating in the loop condition: This is a simple and brutal hack that would address the majority of the issues in practice.

The main limitation is that this doesn’t cover the cases when the unreasonably assumed concrete iteration count manifests through an effect that’s implemented in the loop body, e.g. if we have code that looks like

int array[10]; // uninitialized
for (int i = 0; i < get_count(); i++) {
  array[i] = i * i;
}

then the analyzer would create a branch where array[0] is known to be uninitialized (because it assumed 0 iterations).

(D) After the loop invalidate all variables/locations that are potentially written in the loop: This is a theoretically sound widening step that can be applied even in situations when the analyzer doesn’t find a path that manages to get out of the loop within a few iterations (so this is also helpful for the “lack of coverage after loop” issue).

This plan could be refined by running one final iteration of the loop after this invalidation step to ensure that (1) the loop condition is evaluated (with symbolic operands) and we record the assumption that the right relationship holds between the (symbolic, after-invalidation) values (2) invariants that are true after every loop iteration (e.g. a certain variable holds a positive value etc.) are true/visible after the loop.

Obviously this widening is not needed if the analyzer can determine the exact number of steps taken in the loop and it’s small enough to directly unroll the loop.

Here we could use Péter Szécsi’s loop unrolling feature (which uses AST pattern matching to detect the “nice” loops), but I could also imagine another solution that tries to unroll any loop for a medium number of iterations (e.g. <128 as in Péter’s solution) as long as the analyzer can exactly determine the branches that are taken within the loop. (If we encounter ambiguity/branching, we just stop the analysis of the loop and continue the analysis after the loop with the widening step.)

~~(E) Post-process the bug reports and discard the ones that depend on a certain concrete iteration count~~ I’m mentioning this for the sake of completeness and to note that this is impractical because the analysis only samples an unpredictable subset of the paths and we cannot distinguish

“we only found this bug on the 0 iteration discard reports that only appear n branch because that’s an unjustified assumption”

and

“we only found this bug on the 0 iteration branch because that’s the only path where we explored something unrelated that’s needed for the bug”

~~(F) Create a bug reporter visitor that discards reports where the bug depends on the loop variable (or a variable compared with it)~~ This could have been a hacky, but practical solution for the issue, but unfortunately concrete values don’t have identities, so we cannot reliably determine that “the value of this loop variable is interesting/important for that bug”.

I don’t think that it’s useful to aim for a hacky solution of this kind, because with the work that’s needed to handle the technical difficulties we could also implement a non-hacky solution (e.g. (B) or (D)).

Plans for unfounded assumption of a third iteration

Earlier I was planning to reduce the default maxBlockVisitOnPath value to 3, which would’ve implied that the analyzer only models two iterations within a loop, however now I think that the raw maxBlockVisitOnPath threshold should be a relatively rare “fallback” limitation and we should implement more concrete unrolling / widening / iteration-count-limiting for the specific cases where they’re warranted.

For this reason I’m putting this goal on hold for now; I’m planning to return to it after I select and implement a plan for the “unfounded assumption after loop” issue.

Plans for lack of coverage after loop

This is not a primary goal for me, but e.g. plan (D) for the “unfounded assumption after loop” issue would also be very helpful in this area.

Moreover, enabling Péter Szécsi’s features by default (after ensuring that they’re stable enough) could be a significant improvement in this area.

DonatNagyE · July 30, 2024, 9:33am

After a discussion with my team, I decided to start implementing plan (D), which would both fix the “unfoundend assumption after loop” issues and improve the checker coverage after loops. (Obviously this is a gradual process, and I’ll report my progress here and/or in PRs that I will create eventually.)

@NoQ @Xazax-hun @steakhal @ anybody else interested:
What do you think about this plan (or the other alternatives)?

Xazax-hun · July 30, 2024, 12:05pm

I think both the loop unrolling and loop widening feature was developed with the goal in mind that they would be on by default eventually. Unfortunately, they are hard to get right (especially the loop widening), and even harder to measure that they actually improve the quality of the warnings. Any improvements in this area is welcome.

That being said, one of the main advantages of the Clang Static Analyzer is the relatively simple model to develop checks. Any solution that complicates the life of the check authors needs a really strong justification. So, I think one of the most important question is how would your proposed changes affect the individual checks. Do we need to update them? How hard is that process?

Something to keep in mind, both too little and too much invalidation can lead to false positives. I like that in your proposal you are considering to add back some of the invariants after a widened loop (like the loop condition is potentially false after the loop unless there are multiple exits). Loop widening is a really complex and open ended topic, it is possible that we need to introduce multiple dataflow analyses to do it right (e.g., to reason about the value of the induction variable, or to over approximate points-to information to make the invalidation more precise).

DonatNagyE · July 30, 2024, 1:03pm

So, I think one of the most important question is how would your proposed changes affect the individual checks. Do we need to update them? How hard is that process?

I completely agree that this is important, and I don’t expect systemic impact that could degrade the checker result quality.

The current plan would introduce some additional invalidations, but we already get lots of invalidations from e.g. non-inlined function calls, so they shouldn’t cause problems for the (stable) checkers. Obviously I’ll examine and validate the actual results before merging anything significant.

Something to keep in mind, both too little and too much invalidation can lead to false positives.

I know, although I’d argue that a false positive “caused by” too much invalidation usually (but perhaps not always) means that the checker is buggy. Invalidation places “this may be anything” markers, and if the checker concludes that a “this may be anything” wildcard cannot be correct, then something is fishy

Loop widening is a really complex and open ended topic, it is possible that we need to introduce multiple dataflow analyses to do it right (e.g., to reason about the value of the induction variable, or to over approximate points-to information to make the invalidation more precise).

Yes, we’ll see what will be needed.

If needed, we can fall back to some limited variant like (C) that tries to generalize/sanitize the state after a concrete “did N iterations” path (instead of doing a “full” widening that jumps from before the loop to after the loop). Of course that would mean that we don’t get coverage after loops where the analyzer “cannot pass through”, but that’s not my main goal right now.

NoQ · July 31, 2024, 3:46am

Actually this is exactly how we’ve approached these issues so far. The function trackExpressionValue() does exactly what you think it doesn’t do: track concrete values back to their origin. And the entire reason why it’s so complicated is because, yeah, it needed to work with values such as 0 S32b, 0 (Loc), and Undefined in order to produce better diagnostics and suppress weird-path-related false positives in warnings about division by zero, null pointer dereference, and use of uninitialized variables, respectively. In other words, it gives concrete values “an identity” on as-needed basis.

So for example when we’re trying to suppress the false positive

void foo(int *x) {
   if (!x) return;
   *x = 1;
}

void bar(int *x) {
    foo(x);
    int *y = x;
    *y = 2; // FP: "null dereference"
}

(which is false because the branch in foo() doesn’t allow us to conclude that x can be null in bar() so it’s an unjustified state split), the way we suppress it is exactly this: we use trackExpressionValue() to confirm that the concrete value 0 (Loc) that we’re seeing at the dereference operator was constant-folded from the symbol reg_$0<int *x> (at the moment of assignment to y) and only then we’ll be able to notice that this symbol has collapsed into a constant inside function foo() at a program point whose location context isn’t a parent of the buggy location context.

Our job would have been much simpler if we knew about reg_$0<int *x> at the null dereference location. But we don’t, so we needed trackExpressionValue() to figure it out for us.

So, yes, we have solutions for this. And it could be a good solution for loops too, because fundamentally that’s the same problem: an unjustified state split. (I.e. various versions of #61669.)

That said, of course, the loop problem is much more complex and I’m really not confident whether a similar approach would yield good results. It’s not like it’s working very well in the basic case either; even if we assume that trackExpressionValue() is perfect, the heuristic of looking at the location context is still extremely crude.

A better heuristic could involve looking at the space of all bug paths on which a duplicate of this bug was discovered, which roughly corresponds to your approach (E). Currently a bug visitor looks only at the shortest valid bug path. Then the visitor is allowed to mark that bug path as invalid so that it’s re-invoked on a longer bug path. This allows us to continue reporting a bug even if some instances of that bug were found insufficiently trustworthy. But we never consider the entire set of bug paths simultaneously, as a thing of its own. And it may be a good idea to start doing that.

Say, in the case of loops, a bug that only manifests when a loop is executed 0 times is probably untustworthy: what if the loop always runs for at least one iteration? Same with exactly 1 time, exactly 2 times, exactly 3 times - neither of these are really guaranteed to happen. But what if we noticed that no matter how many iterations we’ve tried to model (be it 0, 1, 2, 3), we’ve hit the same bug every time? Would that make it significantly more trustworthy?

This could be the better heuristic in the basic case too. Eg., in the example from that umbrella bug:

  void foo(int x, int y) {
    if (x) { ... }
    if (!y) {
      return 123 / x; // FP: "division by zero"
    }
  }

we could make an argument that the split into the x == 0 && y != 0 branch isn’t justified because the code would still make perfect sense even if that branch is never hit in practice. (Just like with loops, we currently don’t make that argument, but we could.) However, in slightly different code:

  void foo(int x, int y) {
    if (x) { ... }
    if (!y) {
      return 123 / x; // TP: a real division by zero!
    } else {
      return 456 / x; // TP: a real division by zero!
    }
  }

we could notice that division by zero happens regardless of the value of y, the only difference is where exactly does it happen. So even though both bugs are untrustworthy on their own, and even though we still need to make an unjustified state split in order to discover either of them, the “space” of all bugs becomes “complete”: it has so many bug reports that the code makes zero sense no matter how you look at it. You have to inevitably conclude that the only way to make this code work would be to treat if (x) as dead code because x always takes the same value. We could use this to justify emitting both bug reports even when they aren’t trustworthy individually.

Of course we have to somehow acknowledge that we don’t always explore all possible paths. But I think this only skews us in the right direction: if we didn’t explore enough paths then such heuristic will give us false negatives, not false positives. Which is exactly what we want.

In conclusion, I’m not sure this is the right solution. But I definitely wouldn’t discard it right away. Definitely not just because trackExpressionValue() is an impossible thing to implement. It may be one of the more convoluted parts of our codebase but the fundamental problem it’s trying to solve is fairly simple, even when the tracked value is concrete.

DonatNagyE · August 1, 2024, 9:34am

The function trackExpressionValue() does exactly what you think it doesn’t do: track concrete values back to their origin.

I do understand the behavior of trackExpressionValue() because a few months ago I spent lots of time on reading its spaghetti code and understanding the purpose of each hacky corner case. Based on that I’m convinced that it’s extremely fragile and I wouldn’t add it to any new checker unless it was completely rewritten.

Its behavior is experimentally fine-tuned for the ~30 checkers that are currently using it, so it works there because issues where patched by hacky “solutions” – but if it’s added to a new checker, it will probably produce obviously wrong output.

<offtopic><rant>
As a concrete example, I tried to apply trackExpressionValue() in ArrayBoundV2 to track the array subscript expression and:

When it was called on an expression like array[idx + 1] it wasn’t able to conclude that idx is interesting and should be tracked. (It does propagate from array[expr] to expr, but it doesn’t jump from a+b to a or b.)
When array was uninitialized and the checker called trackExpressionValue() on the array[1] subexpression of an assignment that looked like array[1] = 42;, it produced an “array is uninitialized” note which was completely irrelevant in that situation. (It does not distinguish lvalues and rvalues properly – if I recall correctly, it usually tracks the rvalue of an expression, but e.g. trackExpressionValue(ptr) can “jump to” tracking the rvalue stored in *ptr if ptr is a pointer.)

Obviously, these individual issues could be fixed, but as I started to plan improvements/cleanup for trackExpressionValue() I had to conclude that the whole codebase is in terrible shape and if I touched it, then would need to reorganize everything and fix many unrelated deficiencies (while maintaining compatibility with the 30 checkers that are using it!).

I hope that eventually someone will fix trackExpressionValue(), but until then I’m avoiding it as if it was on fire.
</rant></offtopic>

Moreover, I’m convinced that it’s much better to model the right paths instead of propagating unreasonable assumptions and then trying to compensate our mistakes by postprocessing the results found on the wrong paths.

Postprocessing can be an effective (although hacky) solution for silencing false positives, but its accuracy is limited (so it shouldn’t be used for producing results) and if we rely on it, then the code will be very fragile and hard to reason about.

In loop handling this means that after the loop we should run the modeling from a sufficiently widened state instead of filtering or combining the results found on several overly specific code paths. (This also provides performance benefits: we don’t need to reanalyze the same code with several different concrete iteration counts.)

<offtopic>
This principle also applies to managing the identity of concrete values, where I strongly feel that it would be better to propagate this information (or at least produce a clear “breadcrumb path”) when those values are created, instead of using a hack like trackExpressionValue() which duplicates the “how do we calculate this value” logic and can produce hard-to-debug discrepancies.

[…] I definitely wouldn’t discard it [plan (E/F)] right away. Definitely not just because trackExpressionValue() is an impossible thing to implement.

I don’t claim that trackExpressionValue() is impossible to implement – in fact, I’m fairly certain that it can be implemented to produce almost totally correct results – but I claim that:

its current implementation is extremely fragile (difficult to adapt for new applications, unless the developer is willing to put three new ugly special cases onto the heap),
fixing its current implementation would be a big project,
it is bad architecturally because duplicate the “how was this value calculated” logic.

</offtopic>

A better heuristic could involve looking at the space of all bug paths on which a duplicate of this bug was discovered, which roughly corresponds to your approach (E). Currently a bug visitor looks only at the shortest valid bug path. Then the visitor is allowed to mark that bug path as invalid so that it’s re-invoked on a longer bug path. This allows us to continue reporting a bug even if some instances of that bug were found insufficiently trustworthy. But we never consider the entire set of bug paths simultaneously, as a thing of its own. And it may be a good idea to start doing that.

I suspect that this “consider the entire set of bug paths simultaneously” approach will be useful for something (e.g. by de-emphasizing the “optional” messages that are not present on all paths that lead to a given bug; or perhaps your examples with the two independent "if"s), but I still think that it isn’t the right solution for loop handling.

The elephant in the room is obviously

Of course we have to somehow acknowledge that we don’t always explore all possible paths.

which implies that the “examine the different bug paths” logic is inherently fragile and unreliable.

For example, if this logic influences whether a given bug report is emitted, and this leads to false positives/negatives, then the user who wants to report this issue probably can’t create a small reproducer (because in the small reproducer the analyzer has time to traverse more alternative paths than in one small corner of a large codebase).

NoQ · August 2, 2024, 2:06am

I think you’re absolutely correct in your assessment that it’s buggy, convoluted, unreliable and in a huge need of getting fixed.

I think it should still be used as-is in the vast majority of checkers regardless, because it has relatively little power over the final report to do anything really bad, and almost everything it actually ends up doing is better than nothing.

Though, of course, this doesn’t make it suitable for the array bounds checker specifically, or for loops specifically. As is, we should at most trust it to do what it says: explain where the value came from, and/or how exactly did it end up being stored in a given location. (The last part is non-trivial for symbolic values too.) If this information could be useful, it’s probably a good idea to call trackExpressionValue() to discover it, even in its today’s shape and form. But, yeah, no, we shouldn’t necessarily use it for the more complicated logic such as dealing with loops or explaining the complicated logic behind array index/bound constraints. And we definitely shouldn’t put more stuff into it until we figure out how to make the complexity manageable.

DonatNagyE:

Moreover, I’m convinced that it’s much better to model the right paths instead of propagating unreasonable assumptions and then trying to compensate our mistakes by postprocessing the results found on the wrong paths.

Postprocessing can be an effective (although hacky) solution for silencing false positives, but its accuracy is limited (so it shouldn’t be used for producing results) and if we rely on it, then the code will be very fragile and hard to reason about.

In loop handling this means that after the loop we should run the modeling from a sufficiently widened state instead of filtering or combining the results found on several overly specific code paths. (This also provides performance benefits: we don’t need to reanalyze the same code with several different concrete iteration counts.)

Yes, this would indeed be great. But I also think that figuring out a correct way to “widen” the state, and then implementing the checker-side support for the widening operation over all state traits, is significantly more work than fixing trackExpressionValue() and figuring out a post-processing-based solution.

Additionally, widening is only a part of the problem: it (hopefully) fixes the behavior after the loop, but not during the loop. We simply can’t step into the loop at all without immediately introducing unjustified state splits. And we cannot “join back” to eliminate the unjustified state split until the end of the loop.

Similarly, with correlated if-statements, we can always join back after the if-statement, but not in the middle of the if-statement.

And in the case of the if-statement, even if we successfully conclude the join, it wouldn’t really produce the desired result; in this case the logic based on “the space of explored path” actually gives a more precise result! Indeed, in the 456/x example, a widened state after if (x) does not allow you to diagnose any of the divisions by zero, but we could correctly identify the bug by analyzing the space of explored paths. So I suspect that the widening approach would ultimately throw away many more correct reports than post-processing approach. And that’s on top of already worrying that the post-processing approach would throw away too many correct reports.

So with the post-processing approach we’re actually approaching the mathematically perfect solution, whereas with widening our “ceiling” is significantly more imperfect.

Because for most checkers the current system works, like, fairly well. It’s just a few experimental checkers sensitive to loops that really suffer from this.

In particular, it’d be a shame if we had to regress the usefulness of every other checker dramatically just because we needed a specific form of widening in order to make the array bounds checker work.

Such incomplete information only ever leads to false negatives, never to false positives. Because the fewer execution paths we explore, the less likely we are to conclude that we’ve explored sufficiently many of them. We never need to figure out how many paths we didn’t explore; we just see if we’ve gained sufficient “coverage” of the CFG. (Where the concept of “coverage” is significantly more complex than in the phrase “test coverage” given how exploring two consecutive if-statement with two bug reports that go through the opposite branches of each of them isn’t good enough for us.)

So not only the whole-space-of-paths analysis never introduces new false positives (because it doesn’t emit new bug reports to begin with), but also it’s actually capable of reaching zero false positives (because all insufficiently reliable warnings are refuted) while still emitting a few warnings that the widening approach will incorrectly suppress.

Compared to that, the incomplete information caused by the widening operation may easily introduce false positives, given that it doesn’t prevent us from exploring infeasible paths in the source ranges between each split and its respective merge; moreover, it causes us to explore more infeasible paths due to loss of information (on each of those ranges except the very first one). So it’s not actually sufficient for reaching zero false positives. It’s not even obvious that it’s going to be a quantitatively better tradeoff between false positives and false negatives than whatever we’re currently doing!

So the “real” problem, in my opinion (let’s call it Problem A or something), may be that the whole-space-of-paths analysis is completely worthless specifically for loops. Like, do we ever really explore enough paths? If not, do we end up with a static analyzer that simply gives up and throws away all bug reports every time it sees a loop? That wouldn’t contradict my previous praise of that approach but such a trade-off is probably going to be quantitatively worse than the widening approach, even if it technically allows us to reach zero false positives. We’re just loosing too many warnings to remain useful.

If Problem A is confirmed to be a problem, then I’m probably advocating for a “hybrid” approach that combines:

loop widening with…
a whole-space-of-paths analysis (or a simpler visitor-based analysis) to mitigate false positives produced by the loss of information during widening.

This way we can probably reach a good tradeoff between false positives and false negatives in each case.

However, in order to get there, we’ll probably need to figure out the trackExpressionValue() situation first. And not just that - but gain a bit of practice automatically figuring out what constitutes an infeasible path.

It could also be true that the Problem A is a problem but not a blocker; a nicer heuristic can be implemented that doesn’t make the analysis completely worthless but still suppresses a lot of false positives.

But, yeah, I’m somewhat suspicious of a pure widening-based solution that targets loops specifically. We already respond to incomplete information in the ProgramState very poorly: it causes us to explore infeasible paths. The widening operation produces even more incomplete information, and only then it attempts to mitigate some - but not all - consequences of it. As in, the correlated loops may get better than before, but the correlated if-statements will get way worse. And I’m really not sure it’s easy to reach a point where it’s actually better than whatever we’re currently doing.

DonatNagyE · August 2, 2024, 7:12pm

@NoQ I tried to answer all parts of your comment and the result is… long, probably too long.

However, the most important part is that my goal is not just “the correlated loops may get better than before” – with the current loop evaluation even a single loop can confuse the engine into producing FPs (and this repeatedly occurs on real-world code). For an example and more details, search for [JUMPHERE] in my comment.

Re: trackExpressionValue()

As is, we should at most trust it to do what it says: explain where the value came from, and/or how exactly did it end up being stored in a given location. (The last part is non-trivial for symbolic values too.)

I agree that a “fixed” version of trackExpressionValue() is probably the good tool for explaining the origin of values in bug reports. (The current version may also work well in this role, but it may also emit unwanted diagnostics or cause other issues.)

I feel that its use in bug invalidation (i.e. “if this value came from that kind of location, then ignore the report”) is never the ideal solution; however I admit that it can be a useful hack if it’s used sparingly.

But I also think that figuring out a correct way to “widen” the state, and then implementing the checker-side support for the widening operation over all state traits, is significantly more work than fixing trackExpressionValue() and figuring out a post-processing-based solution.

I didn’t start to write the concrete implementation of the loop widening, but I’m optimistic about it because I think that it’s closely analogous with the existing invalidation procedures.

The raw skeleton of the widening algorithm that I propose is just:

start the invalidation from the state before the loop;
let L be the list of variables that are declared outside the loop and lvalue-referenced within the loop (this can be found with a straightforward AST matcher);
analogously let R be the set of rvalue-referenced variables;
act as if CallEvent::invalidateRegions() was called on a function call whose argument list consists of the variables within L (taken by reference) and the variables within R (taken by value);
don’t invalidate all the globals unless the body of the loop contains an opaque function that would invalidate them (but note that individual globals may be invalidated if they are referenced in the loop);
assume that the loop condition is false (unless it’s possible to leave the loop via e.g. a break);
continue the analysis after the loop.

Note that this “act as if the loop was an opaque function” plan is not my idea, I heard it from @steakhal.

For context: In addition to running this loop widening algorithm, I want to run the first two iterations of the loop (unless the analyzer can prove that we’ll exit earlier) to find bugs within the loop body. Moreover as a form of loop unrolling I would like to allow running a moderate amount (e.g. 128) of iterations as long as the loop doesn’t cause any state splits (or other complex stuff); and if the path unambiguosly exits the loop during this, then it can continue after the loop without any widening.

Obviously I don’t claim that this algorithm is entirely accurate, but I think that

on average, a loop handled by this algorithm doesn’t cause more problems than a function call that is not inlined;
loops are not more common than non-inlined function calls;

and these put an acceptable upper bound on the amount of potential problems.

Also note that by default this does not invalidate the globals, while AFAIK invalidation-related bugs are often related to global variables. There will be many loops where the analyzer will only invalidate one or two variables.

As this algorithm reuses an existing procedure, I hope that I won’t need to develop any special case logic that tweak checker-specific state information. (However, it’s possible that note tag generation / bug reporters need to be adapted to produce more user-friendly messages around these “loop skipped” locations.)

Additionally, widening is only a part of the problem: it (hopefully) fixes the behavior after the loop, but not during the loop. We simply can’t step into the loop at all without immediately introducing unjustified state splits. And we cannot “join back” to eliminate the unjustified state split until the end of the loop.

I’m focusing on handling the behavior after the loop (with a widening step), because that’s “more unjustified” than the behavior during the loop.

[JUMPHERE] The problem that I’m trying to eliminate is not just a “correlated loops” issue: with the current execution model, a single loop can produce FPs without any other assumptions:

int example(unsigned n) {
  int sum = 0;
  for (int i=0; i < n; i++) {
    sum += some_function(i);
  }
  return sum / n;
}

This is a clear and natural way to implement this particular logic and code like this exists in many real-world codebases – but the analyzer will produce a division by zero report because it separates the “loop is executed zero times” case from the other iteration counts.

I’m focusing on the “continue with concrete assumed iteration count after loop” and “assume that there is a third iteration” patterns because these are not justified in isolation and can cause FPs without the “help” of other assumptions.

Usually the analyzer assumes things that are justified in isolation: for example if the analyzer encounters the block

  if (check_foo()) {
    run_foo();
  } else {
    return -404;
  }

then it essentially says that “both branches should be possible, because otherwise one of them would be dead code”.

The “correlated ifs” problem is caused by the fact that the combination of two justified in isolation assumption is not always justified: a code fragment like

  if (check_foo())
    run_foo();
  if (check_bar())
    run_bar();

can be the shortest idiomatic way to handle a situation where only three of the four paths are semantically valid.

In a loop it’s justified in isolation to enter the loop and assume that at least two iterations will be executed (if the loop condition is ambiguous in both cases), because

if the loop condition is always false before the first iteration, then the loop is dead code;
if the loop condition is always false between the first and second iterations, then it would’ve been much more idiomatic to write a simple if instead of the loop.

On the other hand, for any concrete value N, it’s not justified in isolation to assume that we can leave the loop after exactly N iterations, because there is no shorter or more idiomatic alternative that the programmers can use when they know that the “exactly N iterations” case is impossible.

(By the way, it’s also not justified in isolation to assume that we will enter the third iteration of the loop, because there is no simpler special syntax for loops that have at most 2 iterations.)

It may be difficult to return to a state that is justified (at least in isolation) by widening the loop, but these are just technical difficulties and at least it’s possible to determine that the analysis is in an unjustified state.

On the other hand, I’m convinced that the general “correlated if-statements” issue (i.e. which set of “justified in isolation” assumptions produce an unjustified combination) is difficult or even unsolvable on a theoretical level. More precisely (let’s call this Problem B), I think it is usually impossible for the analyzer to distinguish

(1) “the programmer assumed that all four paths through the two if statements are valid, but made a bug that breaks one of them” and
(2) “the programmer knew that only three of the four paths can occur and created an implementation that doesn’t handle the impossible variant”

because it’s common that the difference between these two is only present in the mind of the programmer (or it’s encoded in opaque functions that we can’t investigate).

If we ignore all performance constraints (and analyzed all the exponentially many code paths), the whole-space-of-paths post-processing approach could find all the bug reports that are justified in the sense that they “only depend on a single justified in isolation assumption (or no assumptions at all)” – but I fear that in real-world code this would be a very small fraction of the currently reported “true positive” reports of the analyzer.

Because for most checkers the current system works, like, fairly well. It’s just a few experimental checkers sensitive to loops that really suffer from this.

ArrayBoundV2 is not “sensitive to loops”, it “handles a language feature that’s actually frequently used”! I’m fairly sure that the (count of ArrayBoundV2 false positives) / (number of array subscript expressions) ratio is not worse than the analogous ratios for various stable checkers – but array indexing is a very common operation, so the raw number of false positive becomes too large.

It is not surprising that “for most checkers the current system works, like, fairly well” – because most of the checkers handle niche language features which are together much less common than array indexing alone.

However, I agree that we can highlight that ArrayBoundV2 is “sensitive” to numeric knowledge (as opposed to resource allocation-release state machines like MallocChecker or StreamChecker).

In particular, it’d be a shame if we had to regress the usefulness of every other checker dramatically just because we needed a specific form of widening in order to make the array bounds checker work.

Obviously I’ll pay attention to the results of all the checkers and ensure that my changes don’t cause regressions in other checkers.

However, I’m fairly sure that the suggested loop widening wouldn’t cause “dramatic” regressions – even Péter Szécsi’s loop widening (LoopWidening.cpp) was close to being accepted, despite the fact that he used a significantly more brutal widening (which nukes all locals) than the plans that I’m suggesting.

[…] the user who wants to report this issue probably can’t create a small reproducer […]

Such incomplete information only ever leads to false negatives, never to false positives. Because the fewer execution paths we explore, the less likely we are to conclude that we’ve explored sufficiently many of them.

Fair point. The non-exhaustive search already implies that some results will disappear when the code complexity increases and we have less time to analyze the same code.

whole-space-of-paths analysis never introduces new false positives (because it doesn’t emit new bug reports to begin with), but also it’s actually capable of reaching zero false positives (because all insufficiently reliable warnings are refuted) while still emitting a few warnings that the widening approach will incorrectly suppress.

I’m used to seeing true positive bug reports with dozens of assumptions, where many of those assumptions are necessary to dodge early exits and ensure that the execution path reaches the area where the bug happens. In fact, I’d guess that in an average project the majority of the source code is “hard to access” in the sense that the analyzer cannot reach it without making many independent assumptions.

I cannot imagine a coherent whole-space-of-paths analysis algorithm which accepts these results as “true positives” and still filters out a significant amount of FPs.

Compared to that, the incomplete information caused by the widening operation may easily introduce false positives, given that it doesn’t prevent us from exploring infeasible paths in the source ranges between each split and its respective merge […]

This isn’t an issue in the loop widening that I’m proposing. Before jumping to the “we exited the loop somehow” state, I’m only analyzing the first two iterations, which should be analyzed IMO.

So it’s not actually sufficient for reaching zero false positives. It’s not even obvious that it’s going to be a quantitatively better tradeoff between false positives and false negatives than whatever we’re currently doing!

Yes, loop widening is not sufficient for reaching zero FPs, but that’s not a practical goal.

My suggestions are not obviously better, but I think they are definitely promising enough to create a prototype and check the results experimentally.

What I can clearly see is that “whatever we’re currently doing” is not good enough to report out-of-bounds errors (the engine feeds ArrayBoundV2 with enough garbage that even a perfect checker code wouldn’t be enough); and that’s not something that we can just accept.

(“Hi, we have a nice and mature analyzer for detecting software bugs. Oh, and it does not even try to detect a very common and serious bug… but look at all the shiny checkers that prevent misuse of the FooBar::Biz API in these specific circumstances!”)

EDIT: I’m sorry for the salty tone, but I’m a bit frustrated by the fact that I spent a year working on ArrayBoundV2 and it still produces hundreds of FPs on our usual test projects…

So the “real” problem, in my opinion (let’s call it Problem A or something), may be that the whole-space-of-paths analysis is completely worthless specifically for loops. Like, do we ever really explore enough paths? […] We’re just loosing too many warnings to remain useful.

I would guess that this is a severe issue, not just for loops, but for whole-space-of-paths of other reports as well.

If you really want to focus on whole-space-of-paths analysis, you can probably tweak the graph traversal algorithm to ensure that we also visit paths that are “similar” to the bug report path – but this is not easy to implement and doesn’t prevent Problem B.

If Problem A is confirmed to be a problem, then I’m probably advocating for a “hybrid” approach that combines:

loop widening with…

a whole-space-of-paths analysis (or a simpler visitor-based analysis) to mitigate false positives produced by the loss of information during widening.

It is likely that it’ll be possible to refine the loop widening approach with some post-processing tricks to filter out certain false positives – but I think that we should see the “raw” loop widening results before we start to plan these.

However, in order to get there, we’ll probably need to figure out the trackExpressionValue() situation first.

For me trackExpressionValue() is low priority, because ~~it’s just cursed~~ I think that the importance / (implementation difficulty) ratio is really bad.

Refactoring trackExpressionValue() would disturb the results of thirty checkers and the end result is just slightly better note tags in half of the checkers (compared to e.g. markInteresting()) and perhaps a few new whack-a-mole heuristics that kill some false positives.

And not just that - but gain a bit of practice automatically figuring out what constitutes an infeasible path.

This is the Problem B that I highlighted and I’d say that automatic recognition of “infeasible” paths is actually impossible – we just don’t have enough data.

The widening operation produces even more incomplete information […] the correlated if-statements will get way worse

I agree that this is a potential issue but I think that a pessimistic guess for “way worse” might be “+50% correlated if-statement FPs” and not something brutal like “+1000%”, because loops are not 1000% more common than opaque function calls (and my suggested invalidations will frequently preserve the globals).

I also have very vague plans that could mitigate the side effects of invalidation, e.g. by saving the pre-invalidation knowledge somewhere, and either

silencing bug reports where the pre-invalidation and post-invalidation assumptions don’t match,
or keeping these bug reports and annotating them with notes like “Assuming that opaque_function() changes the value of global_pointer from non-null to null”.

(I’m pretty sure that this is not my original idea, I must’ve read somewhere in a github issue or something similar. On a longer term I’d be happy to work on this kind of improvement, and I’d prioritize this if this is what’s needed for my loop widening plans.)

steakhal · August 6, 2024, 12:29pm

Thanks for the topic and the interesting discussion.
I’d like to thank you @DonatNagyE for your time working on this, and I’m looking forward a prototype to test at scale.

Unfortunately, I don’t have much to add to the discussion - maybe only that so far I only considered a relaxed loop-widening invalidation as an option for resolving this.
Lately, I’m also considering some pre-analysis pass on the loop to infer properties of the loops and their variables, such as how a container is being used in a range-based loop (which is more and more common). We could apply the deduced invariants before even entering the loop making it more accurate.

The other side of the coin is that it’s a pity we don’t really find off-by-one errors with the bounds checker, and loop widening or report post processing wouldn’t really change that. I know that’s on the FNs side, and this post is about FPs - yet I still wanted to mention it.

Maybe this one.

DonatNagyE · August 7, 2024, 10:44am

Lately, I’m also considering some pre-analysis pass on the loop to infer properties of the loops and their variables, such as how a container is being used in a range-based loop (which is more and more common). We could apply the deduced invariants before even entering the loop making it more accurate.

I didn’t think too much about the modeling of range-based loops because the analyzer is already bad at modeling containers (e.g. the iterator checkers are cursed). However, you’re right that it’s important because it’s becoming more common; so it would be good to see this kind of improvement.

The other side of the coin is that it’s a pity we don’t really find off-by-one errors with the bounds checker, and loop widening or report post processing wouldn’t really change that. I know that’s on the FNs side, and this post is about FPs - yet I still wanted to mention it.

We could detect off-by-one errors in loops where the loop advancement is legible (e.g. i++, i -= 3 or something similar with no in-loop modifications of i) by an algorithm that:
(1) performs the widening invalidation (which models “we ran this loop several times”);
(2) assumes that the loop condition is true (= there will be an additional iteration);
(3) assumes that the loop condition would be false after one additional advancement step (= this will be the last iteration);
(4) runs the loop body from this state.

By the way, running one last iteration after the widening step is probably useful for other things as well: it can ensure that things that are done in every iteration will be true after the loop. If the loop condition is not legible, we can still do this with a logic that
(1) performs the widening invalidation (which models “we ran this loop several times”);
(2) assumes that the loop condition is true (= there will be an additional iteration);
(3) runs the loop body from this state;
(4) assumes that the loop condition is false (= this was the last iteration).
This variant is not good enough for catching off-by-one errors within the loop (because in the last iteration we don’t know that it’s the last iteration), but it’d still be an useful fallback for ensuring a better state after the loop.

Maybe this one.

This writeup wasn’t familiar for me, so I probably didn’t read it earlier, but I really like it and I hope that something like this can be implemented.

DonatNagyE · August 7, 2024, 11:39am

As a rough approximation of my planned changes, I did some experimentation with Péter Szécsi’s loop widening feature (which invalidates more areas than what I would like to do).

Note: during these tests I used the packages core, security, unix and cplusplus and the checker alpha.security.ArrayBoundV2.

Eyeballing the results, it seems that activating loop-widening=true (with no custom changes) increases the number of bug reports by 2-3%. In theory some of these should be coming from the increased coverage, but as I looked at some randomly picked ones, it seems that they are mostly false positives caused by the too broad invalidation. (However, I think the more selective invalidation will handle them correctly.)

I also created an experimental branch where loop-widening=true is hardcoded, and the analyzer only exits from modelling a loop

when the loop condition is unambiguously evaluates to false,
when the loop widening is triggered,
with break and similar effects.

This should (at least in theory) prevent the “unfounded assumption of a concrete iteration count after the loop” issues, but the actual observed impact was smaller than what I expected: e.g. on ffmpeg there are 5000+ reports, enabling loop-widening introduced 150+ new ones and disabling the non-widened ambiguous loop exits discarded ~40 results.

I’m not entirely sure that my quick hack works exactly as I intended, I’ll probably do some more testing in the future.

Based on these observations, I think that I can implement my plan in several steps that are also useful individually:

Improve the existing loop-widening feature by making the invalidation more selective. A TODO by Péter Szécsi already requests that we should only invalidate the values that can be changed by the loop. This seems to be a straightforward, medium complexity task, I think I can definitely do this.
Measure the effects of this improved loop-widening and enable it by default if it’s good enough. If it’s not good enough, I’ll try to implement
- further reductions of the set of variables/regions that will be invalidated;
- better handling of invalidated values e.g. bug visitors that discard reports with a contradiction before and after invalidation.
Discard the execution paths where the analyzer exits the loop without widening (except for the case when the loop conditions are determined and the analyzer can prove that there is only one path). This should eliminate the “unfounded assumptions after loop” issues.
Increase the raw maxBlockVisitOnPath limit from 4 to a significantly larger value, e.g. 128 – but discard the execution path (and switch to widening or non-inlined call) if a state split occurs at block count > 2. This would be a nice and general variant of loop unrolling and if the details are chosen correctly, it would also prevent the “unfounded assumption of third iteration” problems.
After these we can implement “nice to have” features like additional logic to detect off-by-one errors (by modeling the last iteration).

NoQ · August 14, 2024, 2:40am

Yeah I think that’s a very good way to look at this. Our current loop unrolling is like function inlining, whereas a naive widening machine (“let’s drop all state that could have been touched, considering all variables accessed by the loop”) is very similar to conservative evaluation of a function (“let’s drop all state that could have been touched, considering all variables passed as parameters”), and a clever widening machine would be akin to collecting a “summary” of a function (probably through data flow analysis), to carefully evaluate the exact effect of the function call on the state without making unnecessary assumptions about which path is taken. In case of loops such analysis could figure out the loop invariant and such.

Right now we believe that inlining is typically better than conservative evaluation. Even though we suppress inlining in many cases (eg. the C++ standard library where we really don’t want to deal with red-black tree paths in std::map or even the realloc paths in std::vector), even though we have to implement post-processing passes such as the “inlined defensive checks” suppression… We still believe that inlining typically yields better results, simply because the state [that it doesn’t discard] is typically much better than nothing.

The main historical “motivation” for the static analyzer back in the day was the RetainCountChecker, which is a checker that doesn’t really benefit from interprocedural analysis directly; instead, it relies on ownership annotations to tell it what the function does. It was able to catch use-after-free or a memory leak when the function that allocates memory has nothing to do with the function that deletes it, without ever observing these functions together “in the same room”. It didn’t need inlining to catch those bugs or eliminate those false positives. And even then, inlining was a huge benefit for that checker because it caused the static analyzer to explore fewer infeasible paths due to perfect information.

However, both with functions and with loops, I can absolutely see how a proper flow-sensitive summary/widening machine would do a better job.

Yeah that’s a good way to think about it too!

Which of those is more popular, probably depends on the codebase. Eg., in codebases with “unity builds” such as sqlite, most functions will be inlined (or at least attempted). If you make all loops evaluated conservatively, it could be a big regression.

Another thing that makes the loop situation a bit unusual is the user’s “expectation” that the code is evaluated literally because the code is “right there”. With a function, the user often doesn’t even know where the function is defined, so the user is a bit more likely to “give us a pass” if we don’t know what it does, even when it’s available for inlining. They’re a bit more likely to say “Well, the function may not do that right now, but that’s not technically a part of it’s contract, so I understand the concern and agree that it’s better to add an assert regardless”. But when it comes to a loop, it’s typically spelled out very clearly in the code we’re already looking at, and it’s not usually considered to have a separate “contract” in isolation from the rest of the function. So I’m somewhat worried that every time we fail to account for an effect of the loop (or lack of such effect) we’d be rightfully called out and the false positive will be less “defensible” - depending on how obvious was the invariant we’ve missed.

DonatNagyE:

On the other hand, I’m convinced that the general “correlated if-statements” issue (i.e. which set of “justified in isolation” assumptions produce an unjustified combination) is difficult or even unsolvable on a theoretical level. More precisely (let’s call this Problem B), I think it is usually impossible for the analyzer to distinguish

(1) “the programmer assumed that all four paths through the two if statements are valid, but made a bug that breaks one of them” and

(2) “the programmer knew that only three of the four paths can occur and created an implementation that doesn’t handle the impossible variant”

because it’s common that the difference between these two is only present in the mind of the programmer (or it’s encoded in opaque functions that we can’t investigate).

I think Problem B applies equally to loops. Eg., in your example(), it can be argued that the warning is actually a true positive: mathematically speaking, a function that computes an “average” of n elements is probably ill-defined for n == 0 so it needs to handle it as a special case. In this specific example, if the function was named “compute_average” or something like that, a lot of people would be very happy to see such warning and address it with eg. if (n == 0) return 0; at the beginning of the function.

In other cases it’d be completely unacceptable and result in an ugly false positive. It entirely depends on the programmer’s perspective and it’s not visible at all in the code. So we can’t reasonably justify considering the N = 0 case, but we can’t reasonably discard it either.

And the correlated-if-statements variant of the problem also supports a “widening”-based solution. Such solution is significantly simpler to implement for if-statements than for loops: you simply need to stitch the two states back together. (You don’t even need to look at the AST for that. You can implement it purely as a binary operation on the ProgramState class.)

But we aren’t going for it because, well, it’ll probably do more harm than good in the case of if-statements. In case of loops it’s definitely a different story; I think widening is actually not bad. And I’m worried about the post-processing approach significantly more when applied to loops, than when applied to if-statements.

Yeah but, we’re still allowed to make unjustified assumptions. It’s just that we need to demonstrate that the unjustified ones didn’t matter in hindsight. This doesn’t mean that we give up as soon as the program has more than one if-statement.

A widening pass eagerly destroys previous assumptions, no matter how justified, because it absolutely needs to reach a unified state at the end in order to continue path exploration.

My approach doesn’t require that. It’s more relaxed. It’s more like, maybe let’s see if we can justify some of them later, and maybe a lot of them never needed to be justified? If we can’t, we can always give up on them anyway.

So I think what my approach accomplishes is (very roughly) equivalent to what widening would accomplish if it was applied:

selectively - only to the “unimportant” branches,
where “unimportant” is decided on per-report basis,
with the benefit of hindsight - again, specific to each report.

Which could be way better than applying it “unconditionally”.

Are you sure this comparison remains favorable when you compare it to the number of core.Dereference false positive divided by the number of all pointer dereference operations? (Including, well, a solid portion of array subscript operations that aren’t performed on immediate arrays?) What about the number of core.uninitialized.* warnings divided by the total number of reads from arbitrary variables?

That said, that’s a very interesting argument regardless. We don’t gather such data very frequently and we probably should do this way more often.

The reason why the default metric, “false positive rate” (false warnings divided by all warnings), is so useful, is because it corresponds nicely to a user’s first impression: they run a tool for the first time, they see all those warnings, they decide if they want to receive these warnings in the future or the entire tool is completely busted.

So we often give “quiet” checkers a pass even when they behave very speculatively (eg. the move checker’s local variable heuristic) - simply because even if their warnings are false, they never become a fundamental blocker for people using the tool. If the majority of the warnings reported by the static analyzer by default are false, that’s a much bigger problem that immediately disqualifies the static analyzer in the eyes of a potential new user.

But I still think that the argument “improved loop handling is particularly important for the array bounds checker” holds true. I’m making it entirely based on my old attempts to debug a few warnings of those checkers and pinning down the root cause to be loop modeling. This happens significantly more rarely with other checkers.

(Many false positives have multiple root causes, i.e. multiple independent problems need to be addressed before the false positive goes away naturally. So what I’m saying is, I was seeing that array bound checker false positives typically won’t go away until we have better loop modeling, even if other bugs unrelated to loops were contributing to them too.)

And that’s why you want to improve loop modeling too right?

Yes, I completely agree that this is not a good mindset at all. I agree that the buffer overflow problem is important enough to figure something out.

Even if we have to instantiate an entirely separate analysis that cannot co-exist with other checkers in the same ExplodedGraph, we should seriously consider automatically invoking the engine twice, with different loop unrolling rules, when the array bounds checker is enabled.

I think it’s important to keep the existing checkers running, especially the general-purpose ones such as use-after-free/leak checkers, uninitialized variable checkers, null dereference checkers. We’re developing a general-purpose tool and it’s often not up to us to decide which checker is more important and which one is ok to sacrifice. But we can absolutely consider sacrificing other things (eg. performance - run the analysis twice) if it means finally producing a good checker for buffer overflows specifically, considering how incredibly important of a problem this is.

On that note, have you explored my old ideas about “emitting the warning before entering the loop”?

The idea is: don’t try to fix loops; the examples that we can catch by unrolling loops aren’t the interesting ones anyway, they’re mostly just toy examples with constant array size and constant loop bound. No matter how embarrassing it is that we can’t catch them, real vulnerabilities rarely look like that. Focus on examples that can’t be caught this way, because they’re the really interesting ones in practice - and, ironically, we may be able to do a much better job there too, if we try a little harder.

Effectively, imagine if instead of every loop you had a library function such as memcpy(). Your explicit model for a standard function of that nature doesn’t typically try to simulating every assignment. Instead, you try to figure out ahead of time what is the range it access the buffer in, and work with that as a precondition under which the function’s behavior isn’t outright undefined. You can easily formulate the rule: both the source buffer and the destination buffer need to be at least as wide as the size parameter says. You can also think of it as a taint problem: if either the size parameter, or either of the buffer sizes, are attacker-controlled then you’re in big trouble.

So when you’re seeing a loop, try to see it as a memcpy() in disguise. You don’t know what it’ll do but you aren’t forced to discover it by direct simulation either. Try to figure out in advance what the loop’s preconditions are. Instead, conduct an auxiliary path-insensitive analysis to discover the array access patters of the loop.

Eg., in this example (it can be caught by direct simulation but it’s an ok example nonetheless):

if (n < 20)
    return;

int arr[10];
for (size_t i = 0; i < n; ++i) {
    arr[i] = i + 1;
}

we don’t have to simulate the loop path-sensitively 11 times in order to discover an OOB access. Instead, before we even start simulating the loop, we can conduct an auxiliary path-insensitive analysis (similar to Peter Szecsi’s loop unrolling analysis, possibly much smarter, but never reliant on direct simulation) that would tell us that the loop accesses the entire range i ∈ [0, n). Combining this information with the path-sensitively-discovered value of n on the current path (i.e. n ≥ 20), we can instantly see that the loop is going to overrun the buffer, even if we never simulate it that far.

The finished warning may look like this:

if (n < 20) // path note: Assuming n ≥ 20
    return;

int arr[10];
for (size_t i = 0; i < n; ++i) {  // warning: We're about to overrun 'arr' by 20 elements.
    arr[i] = i + 1;  // non-path note: Overrun is going to happen here.
}

With a smart enough constraint solver, we could do the same even when loop bounds and array sizes are both pure symbols, and even consider taint analysis as an extra source of inspiration.

Then, we still have to simulate the loop for the rest of the analysis. But don’t emit array bounds warnings during such simulation; we already did our best.

We could still reclaim some of the “toy” examples by finding them with direct simulation when we know the exact loop bound. I.e. when we’re literally inside Peter Szecsi’s unrolling situation. But if we don’t know the bound we probably should just use the other method.

DonatNagyE · August 16, 2024, 12:35pm

@NoQ Thanks for the feedback, I think I mostly understand your positions. However, I still think that the widening/invalidation approach is at least worth a try, because:

implementing a prototype for my plans is significantly easier than the work needed for a bug report set post-processing framework (even cleaning up trackExpressionValue is more work than my plan);
the results of Péter Szécsi’s loop widening feature are promising (even if that simple implementation is not sufficient by itself, it’s not terribly bad).

Based on this I’ll start to work on creating a prototype for my plans (by refining the code of Péter Szécsi) and try to get some experimental results as soon as possible.

I think we will be able to continue this discussion more productively once I’ll have some measurements (or perhaps, I find something that blocks my plans).

Yeah I think that’s a very good way to look at this. Our current loop unrolling is like function inlining, whereas a naive widening machine (“let’s drop all state that could have been touched, considering all variables accessed by the loop”) is very similar to conservative evaluation of a function (“let’s drop all state that could have been touched, considering all variables passed as parameters”), and a clever widening machine would be akin to collecting a “summary” of a function […]

There is one fundamental difference between inlining a function and executing some iterations of a loop: if we see a function call foo() we know that it is equivalent to one execution of its body – while a loop like while (condition()) { /*body*/ } may mean zero or one or several executions of the loop body.

We still believe that inlining typically yields better results, […]

If the iteration count can be determined (not necessarily before the loop – it’s enough if the condition is not ambiguous whenever we reach it), I obviously support unrolling the loop (= “inlining its body”). (With the obvious exception that we don’t want to unroll 1000 iterations.)

However, if the analyzer cannot determine the iteration count, then we don’t know what to “inline”, and we get false positives if we inline the wrong amount of iterations.

[…] simply because the state [that it doesn’t discard] is typically much better than nothing.

Discarding real knowledge is bad, but discarding (or preferably not even making) unfounded assumptions is often good.

Which of those is more popular, probably depends on the codebase. Eg., in codebases with “unity builds” such as sqlite, most functions will be inlined (or at least attempted). If you make all loops evaluated conservatively, it could be a big regression.

Fair point, but we could only regress to the level of other projects with a different build system (which is still usable).

Also note that my experiments with Péter Szécsi’s loop widening (which is more rough than what I propose) only added +2-3% results and it is a feature with the explicit goal of increasing coverage, so we expect that it increases the number of results with findings from previously un-analyzed code.

Based on this I think it’s safe to say that we won’t get a big regression.

Another thing that makes the loop situation a bit unusual is the user’s “expectation” that the code is evaluated literally because the code is “right there”.

That’s a good observation, I’ll pay attention to it!

However note that even according to my plans we do evaluate a few iterations (to catch errors within the loop) and we do unroll and evaluate the loop if we can resolve the loop condition.

The loop widening is only applied when the loop condition is opaque, and in that case the user usually won’t expect a certain concrete iteration count – and they may be confused/annoyed if the analyzer assumes a certain concrete iteration count without a reason to do so.

So I’m somewhat worried that every time we fail to account for an effect of the loop (or lack of such effect) we’d be rightfully called out and the false positive will be less “defensible” - depending on how obvious was the invariant we’ve missed.

The invalidation that I’m planning will be conservative enough to invalidate everything that is obviously visible for the user.

More precisely (let’s call this Problem B), I think it is usually impossible for the analyzer to distinguish

(1) “the programmer assumed that all four paths through the two if statements are valid, but made a bug that breaks one of them” and
(2) “the programmer knew that only three of the four paths can occur and created an implementation that doesn’t handle the impossible variant”

because it’s common that the difference between these two is only present in the mind of the programmer (or it’s encoded in opaque functions that we can’t investigate).

I think Problem B applies equally to loops.

I had highlighted Problem B as a point against general thoughts like “it [postprocessing the set of results] is actually capable of reaching zero false positives (because all insufficiently reliable warnings are refuted)”.

We can and should take the conservative (a.k.a. “justified” a.k.a. “cowardly”) approach in some specific situations to limit the amount of false positives (it’s worth to lose some TPs to get rid of a comparable amount of FPs).

However, I claim that we cannot always take the conservative approach, because that would amount to losing practically all TPs (to get rid of practically all FPs) – and that scaled-up trade is a bad one.

I’m saying that we shouldn’t aim for a general conservative solution for the “correlated ifs” problem because it would discard a significant majority of TPs (I feel that the majority of bug paths contain several if assumptions). We may introduce heuristics for certain suspicious if-pairs (e.g. if a variable was invalidated between them), but in general we should act as if unrelated ifs were uncorrelated (and we can take any combination of feasible assumptions).

On the other hand, I’m suggesting that we should not consider the N=0 assumption after the loop because that’s one specific situation and it can discard only a small percentage of bug reports (<1% according to my rough measurements, even if we count the reports that are replaced by a logically equivalent, but slightly different report).

However, I would support an off-by-default analyzer flag (~equivalent to an optin checker) which enables the separate modeling of the N=0 branch (unless the analyzer rules it out). That would be helpful for users who want to specifically look for this corner case and can stomach some additional FPs.

Yeah but, we’re still allowed to make unjustified assumptions. It’s just that we need to demonstrate that the unjustified ones didn’t matter in hindsight. This doesn’t mean that we give up as soon as the program has more than one if-statement.

A widening pass eagerly destroys previous assumptions, no matter how justified, because it absolutely needs to reach a unified state at the end in order to continue path exploration.

My approach doesn’t require that. It’s more relaxed. It’s more like, maybe let’s see if we can justify some of them later, and maybe a lot of them never needed to be justified? If we can’t, we can always give up on them anyway.

I see your point that hindsight has some advantages, but I fear that approximating the real behavior by post-processing a combination of several slightly incorrect execution paths is too complicated to be practical.

I suspect that that a solution that tries to filter and tweak the “raw” results will be a bigger, more complex version of trackExpressionValue() which will end up as an unmaintainable nightmare full of whack-a-mole patches that address certain corner cases.

I’m not claiming that there is nobody who could implement this postprocessing in a sustainable manner, but personally I wouldn’t dare to work on it.

Are you sure this comparison remains favorable when you compare it to the number of core.Dereference false positive divided by the number of all pointer dereference operations? (Including, well, a solid portion of array subscript operations that aren’t performed on immediate arrays?) What about the number of core.uninitialized.* warnings divided by the total number of reads from arbitrary variables?

Fair point, you’re right that those checkers produce much less FPs than ArrayBoundV2.

But I still think that the argument “improved loop handling is particularly important for the array bounds checker” holds true.

You’re probably right, but I’d rather say “improved loop handling is particularly important for having good ranges for numerical symbols”. IIRC I have also seen some analogous issues when I was working with the BitwiseShift checker.

And that’s why you want to improve loop modeling too right?

Yes, I’m focusing on loop modeling because I see that a significant part of the ArrayBoundV2 false positives are caused by errors (unjustified information in the State) introduced by bad loop modeling.

I also considered introducing ArrayBoundV2-specific workarounds, but there I ran into technical difficulties (concrete values don’t have an identity, and trackExpressionValue had several issues, so I don’t want to use it), so I decided to aim for a more general approach.

Even if we have to instantiate an entirely separate analysis that cannot co-exist with other checkers in the same ExplodedGraph, we should seriously consider automatically invoking the engine twice, with different loop unrolling rules, when the array bounds checker is enabled.

I don’t think that we’ll need an entirely separate analysis. IIRC Péter Szécsi’s loop widening was not too far from being on-by-default, and my plans are more refined than that.

On that note, have you explored my old ideas about “emitting the warning before entering the loop”?

I conceptually dislike this non-path-sensitive approach because it “reinvents the wheel” and and I fear that it needs to duplicate lots of logic to cover more than a different set of “toy examples” (especially if we want to be conservative and discard all the loops where we don’t understand something).

However, my long-term plans do include a step that runs the last iteration of the loop after the loop widening, which would be suitable for finding those problems that you’re speaking about.

Right now I’m not thinking too much about these questions, I just wish to finally reach a state where ArrayBoundV2 can be brought out of alpha.

steakhal · September 13, 2024, 1:38pm

FYI I’ve just bumped into this issue again, explained here.

DonatNagyE · September 24, 2024, 2:57pm

I’m putting these plans on hold, because I got bogged down with the technical details of implementing the widening/invalidation.

On the other hand, I implemented a suppression hack for ArrayBoundV2 which eliminates lots of loop-related false positives: [analyzer] Suppress out of bounds reports after weak loop assumptions by NagyDonat · Pull Request #109804 · llvm/llvm-project · GitHub

There is a chance that I’ll return to loop handling improvements somewhen in the future, but before that I want to finish the stabilization of ArrayBoundV2.

By the way, this suppression hack does not suppress the false positive mentioned by steakhal in his previous comment

DonatNagyE · October 25, 2024, 2:58pm

As a quick experiment, I created a modified variant of my “suppress out of bounds report after weak loop assumptions” commit (that I mentioned in my previous comment) to completely discard the execution paths with weak loop assumptions (instead of just suppressing the ArrayBoundV2 reports on them) and this prototype produced promising results on open source projects.

The following analysis results were produced by “all the normal checkers”, i.e. the groups core, cplusplus, nullability, security, unix and valist; the Baseline is a recent main (622ae7ff), while Experimental is the version that immediately discards execution paths with weak loop assumptions:

Project	Experimental	Baseline
memcached	0 new reports	1 resolved reports
tmux	1 new reports	5 resolved reports
curl	3 new reports	0 resolved reports
twin	0 new reports	2 resolved reports
vim	9 new reports	13 resolved reports
openssl	1 new reports	3 resolved reports
sqlite	15 new reports	18 resolved reports
ffmpeg	25 new reports	163 resolved reports
postgres	10 new reports	47 resolved reports
tinyxml2	0 new reports	0 resolved reports
libwebm	2 new reports	0 resolved reports
xerces	7 new reports	10 resolved reports
bitcoin	1 new reports	8 resolved reports
protobuf	0 new reports	8 resolved reports
qtbase	24 new reports	41 resolved reports
total	98 new reports	319 resolved reports

As these are stable versions of open-source projects, I presume that most of these results are false positives or otherwise unwanted results.

The “new” reports are very random: they appear because pruning lots of execution paths lets the analyzer explore other unrelated directions. Although this generates mostly “bad” results on these stable projects, overall it’s a good thing if we explore more paths.

(By the way, a patch that (wastefully) traverses the “weak assumption” paths and suppresses all results found on them would produce no new reports and “resolve” (i.e. discard) exactly the same reports that are “resolved” by this patch.)

In a large majority of the “resolved” reports the analyzer arbitrarily assumes that a data structure is empty, e.g. [1], [2], [3], [4], [5].

Of course there might be situations where this kind of aggressive assumptions highlight a real bug, but there are many situations where the programmer implicitly knows that e.g. a certain argument of a function is always nonempty.

We cannot demand that programmers should mark this knowledge with assertions, because there are many situations where that assertion spam would be very awkward and ugly. (And we simply cannot except large-scale refactoring from users who just start to use the analyzer.)

Moreover it’s easy to imagine situations where the data structure cannot be empty, but the analyzer is unable to realize this, so it produces outright false positives. (E.g. the only constructor of an immutable type always ensures that a certain data member is non-empty, but when that type is taken as the in-parameter of the top-level analyzed function we don’t run this constructor and don’t realize this invariant.)

This change is heuristic-based, so it won’t be a final complete solution for all loop-based trouble, but

its simplicity is promising (its only effect is that it discards execution paths, so it won’t introduce any invalidation artifacts etc.),
it doesn’t hurt performance (in fact arguably it’s a performance improvement)
and according to these experimental results, it discards a significant amount of unwanted findings.

Of course we’d need to create a refined commit and perform a more through analysis of the results if we want to merge this change, but I’m confident that this can be done in the foreseeable future. (Unless somebody surprises us with a completely different and even better solution.)

Note that I won’t work on this general loop handling change right now, because my todo list roughly looks like:

Get [analyzer] Suppress out of bounds reports after weak loop assumptions by NagyDonat · Pull Request #109804 · llvm/llvm-project · GitHub merged.
Refine the logic of this heuristic to e.g. use accurate loop iteration counts etc.
Bring ArrayBoundV2 out of alpha state (and fix any other issues that remain with it).
(Perhaps bring some other bounds checking checkers out of alpha with the same improvements that I added in ArrayBoundV2?)
Return to this general loop handling change and publish it (if the results are good enough).

DonatNagyE · December 10, 2024, 3:43pm

After spending some time on unrelated tasks, I’m returning to this loop handling project with some new plans. Based on the very promising test results (that I evaluate in the comment above this one), I decided to return to building general solutions.

I had hoped that the “create report suppression limited to ArrayBoundV2” approach would be a good intermediate step, but it introduced too much additional complexity and bikeshedding that won’t be relevant for my final goals.

To limit the scope of my changes, I decided to split the intended commits into two PRs based on the “direction” of the unjustified assumption:

I already created [analyzer] Don't assume third iteration in loops by NagyDonat · Pull Request #119388 · llvm/llvm-project · GitHub to discard branches where the analyzer assumes too many iterations.
I’ll create another PR to discard branches where the analyzer assumes zero iterations.

Xazax-hun · December 10, 2024, 3:57pm

To be honest, I have mixed feelings about this. On one hand, this is one of the biggest source of false positives. On the other hand, it is also where many of the real bugs are, this is a corner case people often forget about and not handle it correctly.

DonatNagyE · December 10, 2024, 4:31pm

I agree that assuming zero iterations is a difficult question (that’s why I started with creating a PR for the “other side”).

Currently I think the best (least wrong) solution would be that by default the analyzer does not assume the “zero iterations” branch (i.e. it only skips a loop if the condition is constrained to false), but there is an analyzer option that can enable checking this branch (e.g. the user can change assume-min-iterations from the default 1 to 0). This way the “out of the box” behavior would be conservative, but interested users would be able to enable a more through analysis (if they are prepared to scatter some assertions to locations where they assume a nonempty data structure).

Xazax-hun · December 10, 2024, 4:38pm

I think a huge majority of our users never really change the defaults. And even if they did, they might not have the knowledge what should they change. I am wondering if it is time to introduce something like confidence into the diagnostics at some point and emit warnings after assuming zero iterations with lower confidence.

Topic		Replies	Views
RFC clang analyzer false positives (for loop) Clang Frontend	15	163	August 27, 2016
[StaticAnalyzer] Changing the loop widening functionality & measurements Clang Frontend	2	83	August 28, 2017
[GSoC 2017] Improve loop execution model of the Clang Static Analyzer Clang Frontend	5	89	March 27, 2017
Handling of loops in the Clang Static Analyzer Clang Frontend	5	276	February 28, 2017
uncovered path (wrong constraint?) Clang Frontend	5	79	April 21, 2013