[GSoC 2018] Using the Z3 SMT Solver to Validate Bugs Reported by the Clang Static Analyzer

Hello all,

My proposal for GSoC 2018 [0] about Bug Validation in the Clang Static Analyzer using the Z3 SMT Solver was accepted.

I’ll work with George Karpenkov to improve the bug reports that the static analyzer produces by reducing the number of false bugs.

Thank you,

[0] https://docs.google.com/document/d/1-zNSv0l4WyoxYpJUAw8LFnQq_TY4AGjIpPu1VPkmO-g

Hi all,

This is my first report to the community, comparing the results with and without the Z3 refutation when analyzing a number of projects.

raw1.txt (23.4 KB)

report1.ods (18.8 KB)


| Total Time | time.analyzer.time.sys (s) (mean) | time.analyzer.time.user (s) (mean) | time.analyzer.time.wall (s) (mean) |
|
| Reported bugs |

Hi all,

Hi guys,

I’m investigating the constraints being dropped, here’s what I got so far.

First of all, I’m using the following program:

void foo(unsigned width)
{
int base;
int i = 0;

if (!(i % width))
base = 1;

if(base == 1)
abort();
}

I started by looking at ExprEngine::processBranch, where

SVal X = PrevState->getSVal(Condition, PredI->getLocationContext());

returns Unknown for the remainder condition. The getSVal ends up looking in a map for the value result, so I found that bindExpr fills that map.

Going back a bit, into ExprEngine::Visit, when evaluating a Stmt::BinaryOperatorClass (regardless of the eagerly assume flag) ExprEngine::VisitBinaryOperator is called, which eventually calls evalBinOp and, since it doesn’t understand remainder, it returns unknown and BindExpr is never called.

Back to ExprEngine::processBranch, when the symbol is unknown the following piece of code is called:

// If the condition is still unknown, give up.
if (X.isUnknownOrUndef()) {
builder.generateNode(PrevState, true, PredI);
builder.generateNode(PrevState, false, PredI);
continue;
}

and the condition is simply ignored. When the result is defined, it creates two nodes assuming the constraints.

Another solution, since you're using Z3 already, is to implement runtime
support for querying the underlying ConstraintManager's about the types
of constraints that it supports (e.g. canReasonAboutFoo() ). Then, you
can use this to generate the appropriate SVal's at runtime, which could
include support for symbolic extension/truncation, remainder, shifts, etc.

I have two stale patches that implement this, D28955 and D35450. The
first might solve your problem with remainders; see the changes to
SimpleSValBuilder.

Dominic

Hi,

Another solution, since you’re using Z3 already, is to implement runtime
support for querying the underlying ConstraintManager’s about the types
of constraints that it supports (e.g. canReasonAboutFoo() ). Then, you
can use this to generate the appropriate SVal’s at runtime, which could
include support for symbolic extension/truncation, remainder, shifts, etc.

I see, but for the project I proposed to check why the constraints were being dropped, regardless of having the SMT solver enabled or not.

I believe that even if the constraint manager can't "handle" constraints on a symbol, symbols should still be emitted. Because "can handle" here means "can simplify the symbol properly", not "can track constraints over the symbol". The constraint manager would still at least be able to handle the symbol as an opaque symbol and track constraints over it.

In other words, it's still beneficial to know that "(a % b) == 0" even if we've no idea what "%" means, because when we see "(a % b)" next time, even if we've still no idea what it means, we'll know that it's zero anyway.

Investigated the constraint being dropped and found that it’s because of the pre-conditions enforced by evalBinOp.

If any of the arguments is unknown or undefined, the constraint is dropped as the code only handles Loc/NonLoc types. I had two ideas about that:

  • Change evalBinOpLL, evalBinOpNN and evalBinOpLN to take SVals are arguments (and obviously change the names): The good thing is that the solution would be handled in the same place BUT these methods are already quite big, and handling unknown/undef would just make them bigger.

  • Create new evalBinOpLU, evalBinOpNU, evalBinOpUL, evalBinOpUN in the SValBuilder and implement them in SimpleSValBuilder: it would make the code more modular and easier to understand, but will require some refactoring or evalBinOpLL, evalBinOpNN and evalBinOpLN will end up with duplicated code.

Any other suggestion?

Also, a couple of questions:

  1. What’s the difference between a unknown and a undefined symbol when analyzing a program?

From what I understood, an unknown symbol is the same as “I don’t understand this expression” or arguments that we don’t its value and undefined is, well, undefined values (e.g., uninitialized variables). But when building constraints, does it make any difference? I mean, from an SMT point of view, they would be the same, free variables.

Thank you,

There’s no reason to “handle” unknown/undefined values in evalBinOp() family of methods. The only thing you need to do is to avoid producing unknown values in the first place, so that you didn’t need to perform calculations over them in evalBinOp(). The earlier you eliminate the Unknown, the more accurate information you’ll be able to provide to the solver.

Undefined values are values that result from undefined behavior according to the language standard. You don’t need to handle Undefined values because there’s a checker that’ll interrupt the analysis whenever the code tries to perform arithmetic on undefined values, so your new handling code will never be executed in practice.

Unknown values represent values of expressions that the analyzer doesn’t know how to compute. You don’t need to handle Unknown values because instead you can always prevent them from appearing by fixing the part of the analyzer that produces them - just teach the analyzer how to compute them instead. Just get rid of them and you won’t need to handle them. If getting rid of a particular source of Unknowns is suddenly hard (i.e. because certain semantics of the language are hard to model), as a fallback you can still replace the Unknown with a conjured symbol at that source of Unknowns.

Additionally, there are already multiple unsystematic places in the analyzer where we “symbolicate” Unknowns, such as ExprEngine::VisitBinaryOperator:

67 if (RightV.isUnknown()) {
68 unsigned Count = currBldrCtx->blockCount();
69 RightV = svalBuilder.conjureSymbolVal(nullptr, B->getRHS(), LCtx,
70 Count);
71 }

Which is, again, merely a workaround for producing unknowns in the first place, and the proper fix is to avoid producing them.

Just to re-iterate: the analyzer’s code

… evalBinOp(…) {
if (RHS.isUnknown())
return UnknownVal();

}

does produce an UnknownVal, but instead of conjuring a symbol here i suggest to try to find out why RHS is unknown, which

  1. brings us closer to the root cause of the problem,

  2. gives more accurate information to the solver.

The reason symbols are getting dropped, is because, currently, CSA doesn’t generate symbols for things that RangedConstraintManager doesn’t support, unless taint tracking is enabled. The querying that I mentioned is just a performance optimization, to check the capabilities of the ConstraintManager at runtime and avoid generating unsupported symbols.

If, as Artem suggests, the new policy is to generate accurate symbols, even if the underlying ConstraintManager doesn’t support reasoning over them, then the solution is to identify all the places where unknown values (over-approximations) are being generated, and replace them with the correct SVal.

In your patch below, removing the first part is correct. In fact, I have the same change in D35450. But, the second isn’t, because at that point you already have an unknown value, and must propagate it to maintain correctness of the analysis.

The reason I mention D28955 and D35450, is because I’ve already started on some of these changes, so you don’t have to do everything. I’ll try to rebase and update them this weekend.

Dominic

Hi,

Indeed, just removing the taint check in evalBinOp fixes the constraints being dropped and a number of test cases now print the expected output. There are some crashes as well, as mentioned in D28953.