[analyzer] Random SVal hierarchy questions

I've got a few quick random questions about SVal/MemRegion/SymExpr heirarchy in the Static Analyzer. In most cases, i'm looking for design ideas behind particular values, so it's sort of theoretical.

== SymbolConjured of a record type ==

Usually symbols have integral types. However, if a function returns, say, "struct S" by value, and its definition is not available, then its return value is described as conj_$N{S}. I'm surprised to see a symbol of such type, because it doesn't behave like other symbols - eg. you cannot compose a SymIntExpr from it, or assume range constraints on it. Is this behavior intentional, or it's rather worth it try replacing it with a more structure-like value? For example, a LazyCompoundValue with a NULL Store could describe an unknown structure pretty well. Is it worth it to *assert* that all symbols have numeric or pointer types only?

== SymbolicRegion in different memory spaces ==

If two SubRegions have different super regions, in particular if they have different memory spaces, they are considered to be different (i.e., have different hash). This makes sense for, say, discriminating between VarRegions of ParmVarDecls in different stack frame contexts (different calls to the same function). However, for SymbolicRegion based on a pointer symbol, it does not make sense to discriminate between the symbolic region in Unknown memory space and the symbolic region in Heap memory space (eg. constructed by MallocChecker). The analyzer currently considers these regions to be two different regions, even if they are based on the same pointer. Is this problem worth fixing by removing the memory space from the hash function of SymbolicRegion?

== CXXTempObjectRegion and block counts ==

CXXTempObjectRegion is constructed with expression that caused it to appear, a memory space, and that's it. I wonder if, during lifetime of a CXXTempObjectRegion we can accidentally reach the same expression again and produce the same region twice and, say, overwrite its bindings. In other words, does CXXTempObjectRegion require some sort of block-count, like SymbolConjured? This time, i'd rather believe that it doesn't, and that the current behavior is correct, so i wonder if there's an easy way to explain why is it correct.

== SymbolRegionValue after a deleted binding ==

As far as i understand, SymbolRegionValue can be described as a symbol that denotes the unknown value of a region *in the beginning* of the analysis. It is produced by the RegionStore whenever there is no other binding that has overwritten this value *during* the analysis. However, when we do invalidation, we can sometimes remove bindings from regions. Is it intended that the same SymbolRegionValue may possibly re-appear after invalidation, simply because the binding was removed, even if the original SymbolRegionValue is considered dead, and thus all constraints and checker data was erased, and hence there's no harm in reusing the symbol? Or we'd better try our best to ensure that some sort of SymbolDerived would always be created in this scenario, and the original SymbolRegionValue never gets resurrected?

Hi Artem,

Sorry it took me so long to get back to you.

I've got a few quick random questions about SVal/MemRegion/SymExpr heirarchy in the Static Analyzer. In most cases, i'm looking for design ideas behind particular values, so it's sort of theoretical.

== SymbolConjured of a record type ==

Usually symbols have integral types. However, if a function returns, say, "struct S" by value, and its definition is not available, then its return value is described as conj_$N{S}. I'm surprised to see a symbol of such type, because it doesn't behave like other symbols - eg. you cannot compose a SymIntExpr from it, or assume range constraints on it. Is this behavior intentional, or it's rather worth it try replacing it with a more structure-like value? For example, a LazyCompoundValue with a NULL Store could describe an unknown structure pretty well. Is it worth it to *assert* that all symbols have numeric or pointer types only?

One thing conjured symbols of record type let you do is reason about equality of MemRegions and of values in the fields of those regions independently of range constraints. So, for example, conjuring a symbol for the result of getS() below let’s the analyzer establish that x1 == x2:

struct S {
  int f;
};

struct S getS();

void foo() {
  struct S s = getS();

  int x1 = s.f;
  int x2 = s.f;

  // Store Bindings at this program point:
  // (s,0,default) : conj_$2{struct S}
  // (x1,0,direct) : derived_$3{conj_$2{struct S},s->f}
  // (x2,0,direct) : derived_$3{conj_$2{struct S},s->f}

  clang_analyzer_eval(x1 == x2); // Yields TRUE
}

== SymbolicRegion in different memory spaces ==

If two SubRegions have different super regions, in particular if they have different memory spaces, they are considered to be different (i.e., have different hash). This makes sense for, say, discriminating between VarRegions of ParmVarDecls in different stack frame contexts (different calls to the same function). However, for SymbolicRegion based on a pointer symbol, it does not make sense to discriminate between the symbolic region in Unknown memory space and the symbolic region in Heap memory space (eg. constructed by MallocChecker). The analyzer currently considers these regions to be two different regions, even if they are based on the same pointer. Is this problem worth fixing by removing the memory space from the hash function of SymbolicRegion?

Regions in the unknown space can alias with any space, not just the heap space. In general, the analyzer makes fairly optimistic assumptions about dis-aliasing. Consider the following code:

int *bar(int *p);

void foo(int param) {
  int *ptr = bar(&param);
  *ptr = 8;
  param = 9;

  // Store Bindings at this program point:
  // (param,0,direct) : 9 S32b
  // (ptr,0,direct) : &SymRegion{conj_$3{int *}}
  // (SymRegion{conj_$3{int *}},0,direct) : 8 S32b

  clang_analyzer_eval(*ptr == 9); // Yields FALSE
}

Here, the analyzer has SymRegion{conj_$3{int *}} is in the unknown memory space but it is entirely possible that bar() returns its parameter. In this case, updating *ptr would change the value stored for param. The analyzer (unsoundly) assumes this can’t happen. It could be more conservative and case split on all possible aliasing relationships, but this would be very expensive! Alternatively, it could invalidate all other storage on writes to the unknown space — but this would be quite imprecise, invalidating storage for locations that probably aren’t aliased. The assumption that symbolic regions in the unknown space don’t alias parameters isn’t sound, but it does cover the common case.

Similarly, the analyzer makes optimistic dis-aliasing assumptions about regions with two different conjured symbols:

int *bar();

void foo() {
  int *ptrUnknown1 = bar();
  int *ptrUnknown2 = bar();
  *ptrUnknown1 = 8;
  *ptrUnknown2 = 9;

// Store Bindings at this program point:
// (ptrUnknown1,0,direct) : &SymRegion{conj_$2{int *}}
// (SymRegion{conj_$2{int *}},0,direct) : 8 S32b
// (ptrUnknown2,0,direct) : &SymRegion{conj_$5{int *}}
// (SymRegion{conj_$5{int *}},0,direct) : 9 S32b
}

(This optimistic assumption of one of many things that make the “general” (precondition of true) approach to summaries quite challenging. The analyzer could generate a summary making the optimistic assumption that two parameter pointers are dis-aliased only to discover when trying to apply the summary that the two pointers point to the same memory!)

One thing to note is that the analyzer’s assumption of dis-aliasing is sound for regions within the heap space. This is because it only produces symbolic regions in that space when it actually observes the allocation site (e.g., malloc) of the region. For example, if the two calls to bar() above were malloc() instead, then the analyzer would produce regions in the heap space and could safely assume that the two pointers point to distinct memory.

In contrast with the optimistic approach the analyzer takes for aliasing between symbolic regions, the analyzer is more pessimistic about symbolic *offsets* into regions. As Jordan describes in clang/docs/analyzer/RegionStore.txt (Binding Invalidation), binding to a region with a symbolic offset invalidates the entire region. For example:

void foo(int i, int j) {
  int *ptr = (int *)malloc(2*sizeof(int));

  ptr[i] = 8;
  ptr[j] = 9;

// Store Bindings at this program point:
// (ptr,0,direct) : &element{SymRegion{conj_$2{void *}},0 S32b,int}
// (SymRegion{conj_$2{void *}},0,default) : Unknown
// (element{SymRegion{conj_$2{void *}},reg_$4<j>,int},direct) : 9 S32b

For symbolic offsets, the analyzer assumes that p[i] and p[j] may alias and so invalidates p[i] when p[j] is bound. This is potentially imprecise (it may be the case that i != j), but as a practical heuristic it is much more likely that two symbolic offsets within the same region will alias than that two symbolic regions themselves will alias.

== CXXTempObjectRegion and block counts ==

CXXTempObjectRegion is constructed with expression that caused it to appear, a memory space, and that's it. I wonder if, during lifetime of a CXXTempObjectRegion we can accidentally reach the same expression again and produce the same region twice and, say, overwrite its bindings. In other words, does CXXTempObjectRegion require some sort of block-count, like SymbolConjured? This time, i'd rather believe that it doesn't, and that the current behavior is correct, so i wonder if there's an easy way to explain why is it correct.

I discussed this with Jordan and we’re pretty sure this isn’t an issue (although it is a good catch!). Our reasoning is that the only legal way to execute the same expression twice while the result of the previous one is still around is recursion — and you get different CXXTempObjectRegions for different stack frames (see MemRegionManager::getCXXTempObjectRegion()). In all other cases, any const reference that extends the lifetime of the temporary would have to have ended its lifetime as well before the expression could be executed again. That said, there are also sorts of gross things the programmer could do with undefined behavior to get at the same TempObjectRegion (e.g., life-time extending a temporary in a loop and storing the the address of the const reference to it in a pointer). We could potentially add a block count here to provide less unexpected analyzer behavior in these cases.

== SymbolRegionValue after a deleted binding ==

As far as i understand, SymbolRegionValue can be described as a symbol that denotes the unknown value of a region *in the beginning* of the analysis. It is produced by the RegionStore whenever there is no other binding that has overwritten this value *during* the analysis. However, when we do invalidation, we can sometimes remove bindings from regions. Is it intended that the same SymbolRegionValue may possibly re-appear after invalidation, simply because the binding was removed, even if the original SymbolRegionValue is considered dead, and thus all constraints and checker data was erased, and hence there's no harm in reusing the symbol? Or we'd better try our best to ensure that some sort of SymbolDerived would always be created in this scenario, and the original SymbolRegionValue never gets resurrected?

When invalidating regions with symbolic base regions, the analyzer tries to adds a default binding to a fresh conjured symbols (see void invalidateRegionsWorker::VisitCluster(const MemRegion *, const ClusterBindings *)).
For example:

struct S {
  int f;
};

void invalidate(struct S *s);

void foo(struct S *s) {
  s->f = 10;
// Store Bindings at this program point:
// (SymRegion{reg_$0<s>},0,direct) : 10 S32b

  invalidate(s);

// Store Bindings at this program point:
// (SymRegion{reg_$0<s>},0,default) : conj_$1{int}
}

When invalidating ’s', the analyzer removes the direct binding storing the concrete value ’10’ at off set 0 and adds a default binding storing the fresh conjured value conj_$1{int}.

Are you seeing cases where the analyzer is removing bindings but not adding a default binding to a conjured symbol?

Devin

Hi Devin!

This is very useful information. For example knowing when it is safe to rely on aliasing information provided by the analyzer can be useful for the checker writers as well. Do you plan to extend the analyzer documentation in the repositry?

Thanks,

Gabor

Hello Devin,

Thanks a lot for the reply! Questions about CXXTempObjectRegion block counts and SymbolRegionValue lifetime are now perfectly clear. I have a few more confusions to sort out with other questions.

> the only legal way to execute the same expression twice while the result of the previous one is still around is recursion

This argument makes sense to me :slight_smile: Probably some day i'd have a look if MemRegionManager::getCXXStaticTempObjectRegion() can have similar issues.

> Are you seeing cases where the analyzer is removing bindings but not adding a default binding to a conjured symbol?

Not really, i just wanted to ensure that it's intended to stay that way :slight_smile: I was a bit confused with the phabricator comment ⚙ D5104 [analyzer] [proposal] Fix for PR20653 .

> > == SymbolConjured of a record type ==

> One thing conjured symbols of record type let you do is reason about equality of MemRegions and of values in the fields of those regions independently of range constraints.

This confirms my understanding, and, additionally, i realized that the idea of a LazyCompoundValue with NULL store is completely incorrect :slight_smile: So i can correct this question a bit:

Suppose that instead of the SymbolConjured of type "struct S", we return a LazyCompoundValue with a store that contains only one binding - "(R, 0, default): conj_$N<int>", where R is... dunno, some temporary region that represents this structure, or probably the whole memory space, doesn't really matter. Then we obtain a similar SymbolDerived from this "conjured store", which allows reasoning about equality of its field values equally well, however the behavior of the engine confirms to the following additional invariants, which i find convenient:

(1) All SymExpr's have numeric or pointer types; any SymExpr can serve as a constraint manager key.
(2) Functions that return structs by values always return compound values (or unknown/undefined), regardless of how conservatively they were evaluated.

I initially run into this problem when coding alpha-renaming, which caused additional special case of renaming a compound value into a symbolic expression and vice versa. So i'm wondering about the design idea behind using symbolic expressions instead of compound values in this case. Even though in the current code, it causes no problems. So i'm simply wondering if such change would be accepted as a part of some bigger patch (or i work around), or if it accidentally rings any bells that invariants (1) and (2) were actually always desired and missed only accidentally. I think i can prepare a fix if it is desired, which asserts invariants (1) and (2) and adjusts the code accordingly.

> > == SymbolicRegion in different memory spaces ==

Thanks, your explanation had quite some revalations for me regarding the meaning of SymbolicRegion memory spaces for invalidation! What makes me still confused a bit is whether it is intended to support situations like:

   Store:
     (SymRegion{s},0,direct) : 1 S32b
     (SymRegion{s},0,direct) : 2 S32b

where the first SymRegion is in Heap space, and the second SymRegion is in Unknown space? I guess it shouldn't happen, because these two bindings refer to the exact same memory address (s + 0). But i can easily imagine an incorrectly coded checker (when the author wasn't aware of this problem) accidentally cause such store to appear.

So it makes sense to me to try to keep SymbolicRegions consistent: one (symbolic) memory address refers to only one region.

I think i can come up with a different solution from what i mentioned originally: create a new kind of MemRegion for regions returned by malloc()/new, some sort of HeapSegmentRegion. I'm not really seeing any benefit of keeping the address symbol at all in such region (similarly to how we do not keep address symbols for VarRegion - it's a "concrete" region, with all possible information clearly available). Instead, we can keep SVal for extent, and override getExtent() accordingly - this approach is more convenient for the users than constraints on SymbolExtent, because checker authors do not need to perform extra actions to see if SymbolExtent folds into a constant; and also symbolic malloc arguments would be handled better. Again, i can prepare a patch if such fix is desired - i think i even know how to write a test :slight_smile:

Hmm. We can also use alpha-renaming on SymbolExtent, to replace it with the concrete SVal in MallocChecker, when we have alpha-renaming, that it.

Hmm, i think i can answer myself regarding conjured structures, after a bit of thinking. There's really nothing wrong with symbols of structural type - in fact, it has been some kind of a dream for me to consider complicated constraints on structures, and especially on containers - consider a solver that can handle constraints like "this symbolic set is a subset of that symbolic set", or SymSymExpr's like "this set is an intersection of these two sets", etc., which can be achieved with structure-type symbols. Probably even strings would be modeled better with string-symbols and constraints than with stores. So i think it's all right to keep the current situation, even though we're not quite using these capabilities yet, and i withdraw this idea.

Yes, any operation on structs you could view as an uninterpreted function could be represented in this way quite simply, although more complicated theories would be challenging to reason about without deferring to a dedicated solver such as Z3 or CVC4.

Devin