Questions about symbolic execution in the static analyzer.

Hello,

I am researching various static analyzers and want to get a better understanding of how the symbolic execution in the clang static analyzer works.

Every implementation of symbolic execution has to make certain tradeoffs to be feasible.

I'm interested in the following issues:

How does the static analyzer approach the path explosion problem?
In this [1] presentation it is mentioned that there are budgets which limit the exploration after visiting the same node multiple times.
Are there any additional techniques to alleviate path explosion?

Secondly, I'm am wondering how it deals with symbolic values as indices for arrays (or symbolic pointer arithmetic).
I have read the paper about its memory model [2], but it doesn't seem to address this question.
All the approaches to the problem I know rely either on address concretization, creating a new path for every possible index or using a more powerful constraint solver (for example using theory of arrays).

I hope you can answer my questions.

Sincerely,
Achim Kriso

[1] https://www.youtube.com/watch?v=4n3l-ZcDJNY
[2] http://lcs.ios.ac.cn/~xzx/memmodel.pdf

Hello,

I am researching various static analyzers and want to get a better understanding of how the symbolic execution in the clang static analyzer works.

Every implementation of symbolic execution has to make certain tradeoffs to be feasible.

I'm interested in the following issues:

How does the static analyzer approach the path explosion problem?
In this [1] presentation it is mentioned that there are budgets which limit the exploration after visiting the same node multiple times.
Are there any additional techniques to alleviate path explosion?

Yes, basically a number of different deterministic budgets and limits. A budget for the total number of nodes (the closest thing we have to a hard time limit), a budget for visiting a CFG block too many times (which sounds like what you're describing), a limit for callee function size and a recursion budget in interprocedural analysis, etc. One of my "favorite" limits is the rule to not consider a function for interprocedural analysis if it hit a budget last time we tried it; this one even works across otherwise unrelated symbolic execution entry points (within the same Clang instance, of course) (in hindsight, this was probably a bad idea; we're considering to replace it with something more predictable).

Secondly, I'm am wondering how it deals with symbolic values as indices for arrays (or symbolic pointer arithmetic).
I have read the paper about its memory model [2], but it doesn't seem to address this question.
All the approaches to the problem I know rely either on address concretization, creating a new path for every possible index or using a more powerful constraint solver (for example using theory of arrays).

An access to array through symbolic index produces a symbol that remembers, as part of its identity, that it was obtained by loading from *that* array by *that* symbolic index.

This is where support basically ends :slight_smile: This guarantees that loading twice with the same symbolic index will produce the same value but we don't yet properly write down constraints over the new symbol in terms of constraints over array elements and constraints over the index value.

This probably leads to bugs / false positives but I haven't seen much of those in practice (even though I can easily craft one). If we were to improve upon this, we'd probably go ahead with writing down a constraint like

a[i] == a[1] || a[i] == a[2] || a[i] == a[3]

without splitting the execution path. And then hope that the solver will handle that.

Generally, our experience with powerful constraint solvers is relatively interesting. We have our custom constraint solver that's extremely primitive and barely correct but extremely fast. Replacing it with a full-featured SMT solver causes an unacceptable slowdown (like 20x or so). Our overall future direction is to go for cross-checking instead: use our fast solver for discovering bugs, and then invoke a stronger solver only when a potential bug is discovered. We have this implemented (with Z3 as a motivating example) and it was shown to noticeably improve precision without causing any noticeable slowdowns. One reason this isn't enabled for all users is because nobody is ready to contribute enough manpower to take responsibility of keeping Z3 running in practice - say, if there's a bug in Z3 that causes the static analyzer to crash, it may harm our users way more than the false positives it eliminates, and we'll most likely be unable to address it quickly enough. Also licensing concerns and usual LLVM external dependencies concerns.

If we actually had a chance to ship a strong SMT solver, we could totally take advantage of its array theory support as part of the cross-check pass.

I've got one more thing to add to the first question. One way to combat path explosion would be to aggressively merge execution paths. For example, we could be widening loops instead of unrolling them. Or generally when we have a lot of sequential if-statements we could merge some paths to avoid exponential explosion.

One of the main reasons we aren't doing this would be because it directly impacts the complexity of developing checkers. Low entry bar for developing new checks was something we were always proud of - the original video about "developing your checker in 24 hours" was basically about this, and these days we have over 100 checkers implemented. One way we achieved this was by limiting the number of different operations over the program state, as each such operation potentially requires checker-side support. This is why we opted into a model of interprocedural analysis that is almost entirely transparent to the checkers, and for the same reason we prefer loop unrolling to loop widening. The merge operation over our program state structures is simply not implemented and was never intended to be.

I want to add a few remarks here.

First about the limit on the number of times we visit the same basic block. It’s not entirely true that there is a hardcoded number for that. The limit sounds more like “the number of times we visit the same basic block ON THE SAME EXECUTION PATH”. It actually make a huge difference. If we have a bunch of if statements in the function and the produce let’s say 100 paths, we still get to the exit of this function 100 times even when the limit is 4. This budget is really for loops only.

Merging states DOES happen actually, but it is very conservative. When we are sure that we’ve seen EXACTLY that state, we merge them (or more precisely, we return the existing state instead of constructing one from scratch). It helps in some cases, but not super efficient.

Another way we are keeping path explosion under control is through cleanups. If we encounter situation when we produced too many nodes, we try to remove the nodes that we believe are redundant. This is not helping with performance in terms of time spent, but helps with not exploding memory-wise.

- Valeriy

Thank You!

I really appreciate both your thorough answers, they were exactly what I was looking for.

Cheers,

Achim