Symbolic Extents

The start of work on symbolic extents. I wanted to ask what the best way
to handle this was. It seems we should have a general EvaluateSymExpr
method somewhere that simplifies a SymExpr as much as possible, which is
what the getKnownExtent() method (see patch) would eventually be doing. Or
perhaps EvaluateSVal, to handle ConcreteInts as well.

It's also going to soon become helpful to ask for the minimum or maximum
value of a SymExpr -- knowing the upper bound on an array size would
certainly be better than no information at all. What should the API for
that be?

(One goal of mine is to be able to model that the length of argv is argc.)

Jordy

symbolic-extents.patch (4.03 KB)

The start of work on symbolic extents. I wanted to ask what the best way
to handle this was. It seems we should have a general EvaluateSymExpr
method somewhere that simplifies a SymExpr as much as possible, which is
what the getKnownExtent() method (see patch) would eventually be doing. Or
perhaps EvaluateSVal, to handle ConcreteInts as well.

I think this patch is a good small improvement over the current implementation.

It's also going to soon become helpful to ask for the minimum or maximum
value of a SymExpr -- knowing the upper bound on an array size would
certainly be better than no information at all. What should the API for
that be?

Symbolic extents could go really wild. It's hard to design the
interface without concrete test cases that we want to solve.

Hi Jordy,

I've made comments in some other emails, but here are some specific comments on your patch, inline:

Index: include/clang/Checker/PathSensitive/Store.h

Hi Ted,

I really like the design. It's clean and general enough to leverage
the smart classes SValuator and ConstraintManager.

One premise I want to point out is that this requires the constraint
manager to reason about constraints involving two symbols. It's not
enough to keep relations between symbols and integer ranges; we also
need the relations between symbols, like $x == $y. So the current
constraint manager is not smart enough for this.

Hi Ted,

I really like the design. It's clean and general enough to leverage
the smart classes SValuator and ConstraintManager.

One premise I want to point out is that this requires the constraint
manager to reason about constraints involving two symbols. It's not
enough to keep relations between symbols and integer ranges,

It should be able to handle tracking whether an extent is within some constant range, which is no less powerful than what we have now (actually it's more powerful).

but also
the relations between symbols, like $x == $y. So the current
constraint manager is not smart enough for this.

Exactly, but it's no less powerful than what we already had for extents, and opens the door to do it correctly once the ConstraintManager can handle constraints between symbols.

All right, time to go back to the drawing board with this one. So the
direction we're going is this (pseudocode):

(MallocChecker::MallocMemAux)
SVal Result = ValMgr.getConjuredSymbolVal(...)
SVal Extent = ValMgr.getExtent(Result) // or SVator.getExtent(Result)?
state->AssumeEQ(Extent, SizeArg)

The reason I wasn't doing this before is because the constraint manager
wouldn't (currently) handle this case:

void symbolic1(size_t a) {
  char *buf = (char*) malloc(a);
  if (a == 1) {
    buf[1] = 'b'; // expected-warning {{out-of-bound}}
  } else {
    buf[1] = 'b'; // no-warning
  }
  free(buf);
}

Here, the value of the symbolic extent is not known until /after/ it is
set. Our current constraint manager doesn't allow us to "alias" symbols in
this way, though of course it should soon. But all right, let's not worry
about that for now.

As for the extent symbols themselves, I'm not quite sure what to do.
Variable-length arrays aren't symbolic regions, so there's no "parent
symbol" for a SymbolDerived to attach to, but we do need to bound them
symbolically. So we'd at least need a new kind of symbol that attached
itself to regions, rather than other symbols.

Perhaps a MetadataRegion might be more appropriate? The advantage of a new
region over a new symbol is that it could be used for metadata that changes
as well. (I'm looking ahead to taking a stab at modeling C string length,
at least in simple cases.) Of course, this could wreak havoc with the
mythical FlatRegionStore, which would have no place to put this new
subregion.

As for tying SymExprs closer with SVals, it doesn't really make sense,
since SymExprs have more than one pointer's worth of data and thus don't
fit in SVals. Unless we wanted to make a new folding set manager, it's
probably not worth it to extract them from SymbolManager. (Unless we do end
up with a new base in klee.)

Thoughts on MetadataRegion?

Essentially, except that I would merge the first two lines into an API wrapped by SValuator. e.g.:

  const MemRegion *R = ...
  SVal Extent = SVator.getExtent(R);
  state = state->AssumeEQ(Extent, SizeArg);

The client should never care whether or not the extent is backed by a symbol. If the Extent is a constant (e.g., nonloc::ConcreteInt), then the AssumeEQ should be trivially decidable if SizeArg is a constant. If it isn't, it will likely be a linear combination of a symbol and a constant, and the current constraint solver logic (which you added) should be able to handle this.

The extent of a variable length array is just the value of the size expression when it was declared. e.g.:

  int buf[10 * n];

Here the extent is '10 * n' at the point of the DeclStmt for 'buf'. We already evaluate '10 * n', and should just capture it as part of the region definition for the VLA. This would require creating a specialized VarRegion that had an SVal entry for the extent. SValuator::getExtent() would just need to know that for that type of region it can obtain the extent by querying the region.

What precisely would a MetadataRegion represent? Regions are supposed to represent chunks of memory, or subsets of chunks of memory, with annotations on those chunks. Aside from VLAs (where we need to capture more information anyway), I don't know what a MetadataRegion would represent.

Tying SymExprs with SVals is actually an intrinsic part of their design. Hopefully I can make that a little clearer.

SVals and SymExprs are really part of a single feature. They are used to represent the "expression semantics" of evaluating expressions. When we see '(exp1) + (exp2)', we compute an SVal that represents the semantics of evaluating that expression. In the beginning we only represented a very simple set of semantics, which you can think of as a lattice/set of values: { Unknown, Undefined, Constant, Locations }. SVals are just *handles* to values in this lattice. This is really important. We pass SVals around everywhere, but anytime they have real data associated with them we have a FoldingSet to unique that data. For example, in the case of nonloc::ConcreteInt the SVal indicates that we have a Constant and the associated data is represented by an APSInt.

SymExprs are meant to just be values in this set of values. They allow us to represent more complicated values, and they allow us to reason about computation more lazily by making the semantic expressions more symbolic. SymExprs are intrinsically part of how SVals are intended to be used.

Another way to look at it is that we are defining a language to represent "semantic expressions". That language consists of a grammar, with SVal just being a handle into an expression in that grammar. More complicated expressions need to reference other expressions, but since SVals are just a single pointer they can't represent that directly. Instead SVals are used as a generic handle to symbolic expressions, and SymExprs can be built (when needed) to represent the recursive nature of more complicated expressions. Actually tying SVals and SymExprs closer together makes a lot of sense, because they are just meant to represent that grammar of expressions.

My suggestion for using parts from Klee was that Klee has its own expression language, similar to the notion of SymExprs. Klee's expression language, however, is far more developed, although it has a different memory model. We could bring in parts of Klee's expression language by potentially wrapping them in SVals. The end goal, however, is that we define an expression language for defining the semantics of expressions. That design is really clear to me, but it isn't represented very clearly in the code.

Yes, I completely understand how this limitation guided you to the other approach, but this is a fundamental problem that we need to solve anyway. This is a limitation that I've really wanted to address for a long time, but I also think it wouldn't be that hard to support symbol aliases (at least for the simple cases). I know it's tempting not to think about this problem now (as we've deferred solving it for a long time), but unless we take the general approach of handling extents using SVals (including symbol aliasing), we are just going to hit the fundamental limit of what extents can represent (e.g., simple constants, basic symbols) by the end of the summer. At that point we'll need to solve this problem anyway.

Symbol aliasing is basically alpha-renaming. We need to store on the side (within GRState) the set of alpha-renamed symbols, and when we assume that two symbols are aliased we need a callback into the Checkers so that they can decide if their current set of meta-data associated with a symbol is compatible with assuming that two symbols are equal. The SValuator can be clever to always use an alpha-renamed symbol when building new expressions, thus folding the renaming into newly constructed SVals.

I don't know what function a MetadataRegion would serve or how it would fit into the design. What specific problem would it solve, and how would it be used?

Perhaps a MetadataRegion might be more appropriate? The advantage of a new
region over a new symbol is that it could be used for metadata that changes
as well. (I'm looking ahead to taking a stab at modeling C string length,
at least in simple cases.) Of course, this could wreak havoc with the
mythical FlatRegionStore, which would have no place to put this new
subregion.

What precisely would a MetadataRegion represent? Regions are supposed to
represent chunks of memory, or subsets of chunks of memory, with
annotations on those chunks. Aside from VLAs (where we need to capture
more information anyway), I don't know what a MetadataRegion would
represent.

I was trying to think of a way to leverage the existing symbol/region
liveness analysis -- subregions remain alive with the superregion, and
region values remain alive with their regions. This shouldn't be a
constraint on us, though -- a symbolic extent is just a (possibly conjured)
value that remains live with the region.

I'll stop trying to plan ahead for C string length -- maybe that's
solvable with the GDM anyway.

So, at this point I would picture extents for SymbolicRegions (and
AllocaRegions, and maybe VLAs) as a cross between derived symbols and
conjured symbols with tags, except for one possible issue -- they might be
different on different paths through the CFG. With conjured symbols now we
have a Count parameter to avoid this. Does this mean we can only
getExtent() (or getExtentVal()) from a function with access to
GRStmtNodeBuilder?

Also, it might be nice if getExtent() lived on MemRegion, instead of
ValueManager or SValuator. Maybe?

Symbol aliasing is basically alpha-renaming. We need to store on the side
(within GRState) the set of alpha-renamed symbols, and when we assume that
two symbols are aliased we need a callback into the Checkers so that they
can decide if their current set of meta-data associated with a symbol is
compatible with assuming that two symbols are equal. The SValuator can be
clever to always use an alpha-renamed symbol when building new
expressions, thus folding the renaming into newly constructed SVals.

Aliasing isn't so bad; it's the other constraints that would be harder to
model with sym-sym relations. But maybe we just support aliasing for a
while, or aliasing+linear adjustment, and ignore constraints like "a < b"
or "a != b".

Okay. I think my misunderstanding came from wondering when to pass around
a SymExpr vs. an SVal, since not all SVals map into SymExprs. (In
particular, you can't pass a SymExpr and an APSInt in the same variable.)
The answer seems to be "never" -- you always want an SVal for API that
faces the checkers -- because SymExpr is almost an implementation detail.
Does that sound right?

(usual qualification about exceptions to every rule)

I was also confused by the fact that SymExprs don't deal in SVals, but I
guess they don't need to. The nonloc types are either symbolic, constant,
or compound, and you can't do a symbolic operation to a compound value.

I was trying to think of a way to leverage the existing symbol/region
liveness analysis -- subregions remain alive with the superregion, and
region values remain alive with their regions. This shouldn't be a
constraint on us, though -- a symbolic extent is just a (possibly conjured)
value that remains live with the region.

Yes, symbols are also pruned from the state when they are no longer live. This includes symbols derived from regions and symbols derived from other symbols.

I'll stop trying to plan ahead for C string length -- maybe that's
solvable with the GDM anyway.

So, at this point I would picture extents for SymbolicRegions (and
AllocaRegions, and maybe VLAs) as a cross between derived symbols and
conjured symbols with tags, except for one possible issue -- they might be
different on different paths through the CFG. With conjured symbols now we
have a Count parameter to avoid this. Does this mean we can only
getExtent() (or getExtentVal()) from a function with access to
GRStmtNodeBuilder?

No, I think having access to SymbolManager would be enough, which is accessible from ValueManager.

Also, it might be nice if getExtent() lived on MemRegion, instead of
ValueManager or SValuator. Maybe?

I think having it in MemRegion would be fine, except you would need to pass a SymbolManager& argument for the case of generating derived symbols on demand. More complicated extents would involve expressions that were already constructed by the SValuator and then equated (via ConstraintManager) to a symbol.

For "a < b" and such, isn't it a matter of unifying the constraints that already exist on 'a' and 'b'?

Yes, that's basically it! That said, we could probably benefit from making the recursive nature of the expression language grammar clearer.

For "a < b" and such, isn't it a matter of unifying the constraints that
already exist on 'a' and 'b'?

Something like that, but fully modeling it can get complicated very
quickly:

a < b
a < c
// a != MAX, but other than that we know nothing
b == 11
// at this point a < 11
c == 10
// now a < 10

In the long-term I think we're going to want to arbitrarily impose an
order on symbols, so that higher symbols depend on lower ones and not the
other way around. Alternately (or additionally?), we could have a
notification system of some kind: when a min or max value changes, ping any
dependent symbols.

Constraints can form arbitrarily complex graphs, in theory, and we do have
to limit that. Still, unifying the existing constraints would certainly be
an improvement over what we have now.

Something like that, but fully modeling it can get complicated very
quickly:

a < b
a < c
// a != MAX, but other than that we know nothing
b == 11
// at this point a < 11
c == 10
// now a < 10

In the long-term I think we're going to want to arbitrarily impose an
order on symbols, so that higher symbols depend on lower ones and not the
other way around.

An ordering like this is essentially what is done in the Archer paper:

http://portal.acm.org/citation.cfm?id=940115

(PDF: http://theory.stanford.edu/~yxie/archer.pdf)

Alternately (or additionally?), we could have a
notification system of some kind: when a min or max value changes, ping any
dependent symbols.

Yes, this is basically a gradual relaxation of the constraints.

Constraints can form arbitrarily complex graphs, in theory, and we do have
to limit that.

Absolutely. What constraints can be handled should be dictated by the ConstraintManager.