I spent lots of time thinking about loop modeling within Clang Static Analyzer, and here I’m trying to summarize my current plans and observations.
Note that my primary goal is reducing the number of false positives caused by unfounded assumptions, but some plans could also improve the coverage (by continuing the analysis after the loop), and obviously I need to consider the performance impact of any plans that I hope to implement.
My observations are not revolutionary; I’d guess that most of them are already familiar to the “veterans” of this project. However, I think it’s still valuable to describe my exact position to ensure that we’re on the same page and to highlight my goals.
Also, I’m sure that there are some mistakes and inaccuracies – please correct these if you spot them!
General background
The current situation is very easy to summarize: by default, the analyzer doesn’t use any loop-specific logic during the analysis. Loops are simply backwards edges in the graph of the basic blocks and there is a threshold called maxBlockVisitOnPath which kills execution paths that encounter the same block too many times.
The well-known (?) “we model the first three iterations of loops” rule is just one special case of this limitation: the default value of maxBlockVisitOnPath is 4, so the modeling is stopped after the third iteration, when BlockCount >= maxBlockVisitOnPath becomes true.
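As a minimal illustration, here is a hypothetical function (not taken from the analyzer’s test suite); the code itself is ordinary runnable C, and the comments describe the analyzer’s behavior under the default settings:

```c
/* Under the default maxBlockVisitOnPath == 4, the analyzer models the
 * iterations with i == 0, 1 and 2, then kills this execution path when
 * it would visit the loop's basic block for the fourth time. */
int run_loop(void) {
  int i;
  for (i = 0; i < 10; ++i) {
    /* modeled for i == 0, 1, 2; the path dies afterwards */
  }
  return i; /* at runtime this is always 10 */
}
```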
The command-line option -analyzer-max-loop is not loop-specific: it just specifies the value of maxBlockVisitOnPath, so it also affects repeated function calls and other situations where the same block can be reached multiple times. (Also note that the name -analyzer-max-loop is misleading: if you want to run e.g. 5 loop iterations, you need to specify -analyzer-max-loop 6.)
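For reference, this is a cc1 frontend option (if I remember correctly), so with the plain clang driver it has to be forwarded via -Xclang; a typical invocation (the file name is a placeholder) would look like:

```shell
# Model up to 5 loop iterations (note the off-by-one: N = iterations + 1).
clang --analyze -Xclang -analyzer-max-loop -Xclang 6 test.c
```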
Unfortunately the maxBlockVisitOnPath threshold can cause a loss of coverage (when some code cannot be reached without visiting the same block many times), but the analyzer tries to restart the analysis without inlining the currently analyzed function, so if we have code like
void long_loop() {
  for (int i = 0; i < 100; i++)
    do_something();
  buggy_code(1);
}

void uses_long_loop() {
  long_loop();
  buggy_code(2);
}
then the analyzer cannot report buggy_code(1) (because during the analysis of long_loop it cannot find a code path that can reach it without running into maxBlockVisitOnPath), but it can report buggy_code(2) (on a code path where long_loop() is not inlined).
Off-by-default features
We also have two off-by-default loop-specific features which were developed by Péter Szécsi in 2017: loop-unrolling (implemented in LoopUnrolling.cpp) and loop-widening (implemented in LoopWidening.cpp).
The loop-unrolling feature detects certain “simple”, bounded loops with AST matchers, and simply disables checking maxBlockVisitOnPath during their execution. According to the old reviews, this feature is (or at least was?) good enough to be enabled by default – but it was eventually left in an off-by-default state.
The loop-widening feature is activated at the beginning of the last modeled iteration within a loop (where BlockCount == maxBlockVisitOnPath - 1) and generalizes the State by invalidating lots of information. This tweak can “help” the execution path leave the loop (e.g. if i becomes unknown, the analyzer will be able to assume that i < 100 is false), and therefore it increases the coverage.
If I understand it correctly, this feature is also in good shape, but it isn’t as complete as loop-unrolling.
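For example, consider the following hypothetical function (plain runnable C; the comments describe the effect of widening, as I understand it):

```c
/* Without widening, every inlined path dies inside this loop (leaving
 * it would need ~100 visits of the loop's block); with widening, the
 * variables are invalidated in the last modeled iteration, so i becomes
 * unknown, the analyzer can assume i >= 100, and the code after the
 * loop becomes reachable on the inlined path. */
int after_long_loop(void) {
  int sum = 0;
  for (int i = 0; i < 100; i++)
    sum += i;
  return sum; /* during analysis, reachable (inlined) only via widening */
}
```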
Known problems
Currently the analyzer has several serious limitations that are directly connected to the (lack of) loop handling logic:
1. Unfounded assumptions after loop: Each check of the loop condition can introduce a state split, so after the loop we will have separate execution paths where the loop was executed exactly 0/1/2/3 times. These paths place very narrow constraints on the loop variable and/or the variables that are compared to it, which produces lots of false positives when one of these execution paths is semantically infeasible (e.g. the programmer knows that the list is not empty) but the analyzer doesn’t realize this.
2. Unfounded assumption of a third iteration: If the code contains a loop, then it’s valid to assume that the developer intended at least two iterations (because otherwise they would’ve written an if instead) – but the analyzer shouldn’t assume that a third iteration is possible, unless that is implied by the loop conditions. In ffmpeg this produces dozens of ArrayBoundV2 false positives where the analyzer doesn’t understand the loop condition and assumes that it can perform three iterations over a two-element array.
3. Lack of coverage after loop: If the analyzer cannot find a path through the loop, then the code after it won’t be analyzed.
Among these, I’m primarily focusing on (1.), with (2.) as a secondary goal, while Péter Szécsi was mostly working on (3.).
Note that these are generic issues that affect many different checkers, so I think it’s not effective to introduce hacks that fix them for one particular checker.
Plans for unfounded assumptions after loops
The root of the issue is that if the analyzer cannot deduce the (true/false) value of the loop condition, then we get several separate code paths (“didn’t enter the loop”, “did 1 iteration”, “did 2 iterations” etc.) instead of continuing on a single combined “did some number of iterations” code path.
Here we do want to continue the analysis after the loop (loops are too common to ignore everything after them), but it’s logically invalid to single out any one iteration count and report the issues deduced from it.
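To make this concrete, here is a hypothetical function (invented for illustration, not taken from a real report) where the state split leads to such a false positive:

```c
/* After the loop, the analyzer keeps separate paths where the loop ran
 * exactly 0, 1, 2 or 3 times, each with a narrow constraint on n. On
 * the "0 iterations" path it has assumed n == 0, so a division-by-zero
 * checker would warn about sum / n even if every real caller of this
 * function guarantees n > 0. */
int average(const int *buf, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += buf[i];
  return sum / n; /* flagged on the (possibly infeasible) n == 0 path */
}
```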
There are several potential approaches to resolve this issue:
(A) Force the merge of the code paths: Collect all the code paths that are exiting from the loop, take the union of their state and continue the analysis from this “artificial” exploded graph node (instead of the original “natural” paths).
The union of several State objects can be defined as:
- for each symbol, take the union of the RangeSets corresponding to it;
- for each equality/disequality relationship between symbols, keep it only if it’s present in all of the states;
- for each stored binding, if the different States contain different SVals, then introduce a new symbolic value to represent them and introduce a proper range for it;
- for checker-specific state entries, do something checker-specific if they’re different.
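As a toy sketch of the first bullet (my own simplification, nothing like the real RangeSet code: it models a constraint as a single closed interval, so it over-approximates the union of disjoint ranges to their convex hull):

```c
/* Toy model of plan (A)'s per-symbol step: each "state" constrains one
 * symbol to a closed interval, and merging two states takes a range
 * that covers both. Note that covering [0,0] and [2,3] with [0,3]
 * already loses precision; a real RangeSet union would keep the hole. */
typedef struct { int lo, hi; } Range;

static Range range_union(Range a, Range b) {
  Range r;
  r.lo = a.lo < b.lo ? a.lo : b.lo;
  r.hi = a.hi > b.hi ? a.hi : b.hi;
  return r;
}
```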
Although theoretically this solution is the most accurate one (in some sense), it would require introducing a novel operation on the complex State data structures, and that is probably prohibitively difficult.
(B) Pick one code path and invalidate information where it differs from the others: This is a simplified variant of (A) where the symmetric (and unprecedented) merge operation is replaced by comparison of states (as far as I know, this is implementable) and invalidations (which is a standard operation).
In this case we may keep all the paths after the loop, but as they would become very similar with the invalidations, it’s probably better to keep just one and discard the rest (for performance reasons).
However, this is still a complex plan that requires a “collect all outgoing transitions and replace them with a single one” step. (I’d guess that implementing this is difficult but manageable, though I’m not familiar with the graph traversal implementation.)
(C) After the loop invalidate the variables participating in the loop condition: This is a simple and brutal hack that would address the majority of the issues in practice.
The main limitation is that this doesn’t cover the cases when the unreasonably assumed concrete iteration count manifests through an effect that’s implemented in the loop body, e.g. if we have code that looks like
int array[10]; // uninitialized
for (int i = 0; i < get_count(); i++) {
  array[i] = i * i;
}
then the analyzer would create a branch where array[0] is known to be uninitialized (because it assumed 0 iterations).
(D) After the loop invalidate all variables/locations that are potentially written in the loop: This is a theoretically sound widening step that can be applied even in situations when the analyzer doesn’t find a path that manages to get out of the loop within a few iterations (so this is also helpful for the “lack of coverage after loop” issue).
This plan could be refined by running one final iteration of the loop after this invalidation step to ensure that (1) the loop condition is evaluated (with symbolic operands) and we record the assumption that the right relationship holds between the (symbolic, after-invalidation) values, and (2) invariants that are true after every loop iteration (e.g. a certain variable holds a positive value) are true/visible after the loop.
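A hypothetical example of what this refinement would buy us (the function is plain runnable C; the comments describe the intended analyzer behavior, not the current one):

```c
/* Plan (D) on this loop: invalidate everything written in the loop
 * (here, i becomes a fresh symbol), then evaluate the exit condition
 * i < n once more with the fresh symbol. Recording its false branch
 * yields the post-loop fact i >= n, instead of the separate
 * i == 0 / 1 / 2 / 3 paths that we get today. */
int count_up(int n) {
  int i = 0;
  while (i < n)
    i++;
  return i; /* at runtime: n if n > 0, otherwise 0 */
}
```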
Obviously this widening is not needed if the analyzer can determine the exact number of steps taken in the loop and it’s small enough to directly unroll the loop.
Here we could use Péter Szécsi’s loop unrolling feature (which uses AST pattern matching to detect the “nice” loops), but I could also imagine another solution that tries to unroll any loop for a medium number of iterations (e.g. <128 as in Péter’s solution) as long as the analyzer can exactly determine the branches that are taken within the loop. (If we encounter ambiguity/branching, we just stop the analysis of the loop and continue the analysis after the loop with the widening step.)
(E) Post-process the bug reports and discard the ones that depend on a certain concrete iteration count: I’m mentioning this for the sake of completeness and to note that it is impractical, because the analysis only samples an unpredictable subset of the paths and we cannot distinguish
“we only found this bug on the 0 iteration branch because that’s an unjustified assumption (so the report should be discarded)”
and
“we only found this bug on the 0 iteration branch because that’s the only path where we explored something unrelated that’s needed for the bug”
(F) Create a bug reporter visitor that discards reports where the bug depends on the loop variable (or a variable compared with it): This could have been a hacky but practical solution for the issue, but unfortunately concrete values don’t have identities, so we cannot reliably determine that “the value of this loop variable is interesting/important for that bug”.
I don’t think that it’s useful to aim for a hacky solution of this kind, because with the work that’s needed to handle the technical difficulties we could also implement a non-hacky solution (e.g. (B) or (D)).
Plans for unfounded assumption of a third iteration
Earlier I was planning to reduce the default maxBlockVisitOnPath value to 3, which would’ve implied that the analyzer only models two iterations within a loop. However, now I think that the raw maxBlockVisitOnPath threshold should be a relatively rare “fallback” limitation, and we should implement more concrete unrolling / widening / iteration-count-limiting logic for the specific cases where they’re warranted.
For this reason I’m putting this goal on hold for now; I’m planning to return to it after I select and implement a plan for the “unfounded assumption after loop” issue.
Plans for lack of coverage after loop
This is not a primary goal for me, but e.g. plan (D) for the “unfounded assumption after loop” issue would also be very helpful in this area.
Moreover, enabling Péter Szécsi’s features by default (after ensuring that they’re stable enough) could be a significant improvement in this area.