[RFC] RegionStore++

Summary

This post introduces the current implementation of RegionStore and its limitations around storing the extents of bindings. It then describes a possible solution for tracking the extent, shows how it would be implemented, and presents the results of an experimental implementation.

RegionStore

The current implementation of RegionStore stores the SVals based on their base MemRegion and their offset inside that region. This means that these two values uniquely identify the corresponding SVal.

An llvm::ImmutableMap<const MemRegion *, ClusterBindings> is used to store the bindings, where the const MemRegion * is the base region of the variable and ClusterBindings is an llvm::ImmutableMap<BindingKey, SVal>. BindingKey consists of the base region, the offset and the type of the binding, which can either be Direct or Default.
Regions and clusters
Every MemRegion is assigned a cluster, which will store the SVals inside that region based on their offset from the region base address. If a region doesn’t have an existing cluster, a new one will be created for it, moreover one region can only have one cluster assigned to it at a time.

RegionBindingsRef RegionBindingsRef::addBinding(BindingKey K, SVal V) const {
  const MemRegion *Base = K.getBaseRegion();

  const ClusterBindings *ExistingCluster = lookup(Base);
  ClusterBindings Cluster =
      (ExistingCluster ? *ExistingCluster : CBFactory->getEmptyMap());

  ClusterBindings NewCluster = CBFactory->add(Cluster, K, V);
  return add(Base, NewCluster);
}

Bindings

int x = 1;

struct S {
  int a, b, c;
};

S s;
s.a = 1;
s.c = 3;

As can be seen, a cluster contains mappings of values to specific points of memory. In the examples above all of the values are directly assigned to a location in memory, so the mappings can be considered Direct bindings.
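Roughly, the resulting clusters contain bindings like these (offsets are in bits, assuming a 32 bit int; the dump format is only illustrative):

Direct: x + 0 -> 1

Direct: s + 0  -> 1
Direct: s + 64 -> 3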

There are however scenarios when, instead of directly assigning values to specific locations, a larger portion of memory is set (e.g. by calling memset). In such cases a Default binding is created. Since a Default binding affects the whole cluster, one cluster can only contain one Default binding at a time (in the current implementation a Default binding always has the form BaseReg + 0 -> Val, so whenever a new one is created, the previous one is overwritten). Also, there are some values, such as LazyCompoundVal, that can only appear as a Default binding, which has been ensured by D132143.

S s;
memset(&s, 0, sizeof(s));

S s;
memset(&s, 0, sizeof(s));
s.a = 1;
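Roughly, the clusters for the two snippets above look like this (again, only illustrative):

Default: s + 0 -> 0

Default: s + 0 -> 0
Direct:  s + 0 -> 1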


Note that Direct and Default bindings can appear together in the same cluster as seen in the example above, where the cluster contains both of them at the same location.

Lookup

Looking up a binding inside a cluster happens based on the BindingKey, which stores the region, the offset and the type of the binding, thus ensuring a Default binding is not going to be found if the caller is looking for a Direct binding.

const SVal *RegionBindingsRef::lookup(BindingKey K) const {
  const ClusterBindings *Cluster = lookup(K.getBaseRegion());
  if (!Cluster)
    return nullptr;
  return Cluster->lookup(K);
}

Whenever the analyzer is interested in a specific value, it will try to look for a Direct binding first and if it’s not found, a Default binding will be searched for the region instead.

SVal RegionStoreManager::getBindingForVar(RegionBindingsConstRef B,
                                          const VarRegion *R) {
  if (std::optional<SVal> V = B.getDirectBinding(R))
    return *V;

  if (std::optional<SVal> V = B.getDefaultBinding(R))
    return *V;
  
  // Handle if none of them is found.
  ...
}
S s;
memset(&s, 0, sizeof(s));
s.a = 1;

int x = s.a;
int y = s.b;
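Illustratively, the two reads resolve like this (offsets in bits, assuming a 32 bit int):

x = s.a: the Direct binding s + 0 -> 1 is found and returned
y = s.b: there is no Direct binding at s + 32, so the Default binding s + 0 -> 0 from the memset is returned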

Limitations

While the current implementation is suitable for modelling most of the variables throughout the analysis, there are some edge cases, like this infamous one below, which caused the analyzer to hide some true positive bug reports.

struct Point {
  int x;
  int y;
  int z;
};

struct Circle {
  Point origin;
  int size;
};

Point makePoint(int x, int y) {
  Point result;
  result.x = x;
  result.y = y;
  result.z = 3;
  return result;
}

void issue() {
  Circle testObj;
  memset(&testObj, 0, sizeof(testObj));
  testObj.origin = makePoint(1, 2);        <-- problem caused by this line
  int x = testObj.size;
}

This is what the bindings look like at the moment of the assignment in the marked line:
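Roughly, the cluster of testObj only contains the zero binding created by memset:

Default: testObj + 0 -> 0 S8b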


At testObj.origin = makePoint(1, 2); the temporary object is bound to testObj + 0; however, since the temporary object is a LazyCompoundVal, which can only be a default binding, Default: testObj + 0 -> 0 is replaced by Default: testObj + 0 -> lazyCompoundVal{result, ...}.

Later, when testObj.size is read, instead of 0 some other, incorrect value is returned. In the past it used to be Undefined, as can be seen in the commit that addressed it. The same commit shows that the solution was to prevent the analyzer from reading Undefined out of a LazyCompoundVal and to pretend that the read value is Unknown, which hides every other case where the value really is Undefined and a warning should be reported.

There are also scenarios where Direct bindings trick the analyzer into reading a wrong value.

void foo() {
  char x = 0;
  int y = *((int *)&x);
}


Here the analyzer will assume that y is 0, while in reality it is an Undefined value, because three of the four bytes being read are uninitialized.

Solution

A FIXME inside RegionStoreManager::getLazyBinding() points out (and @NoQ also told me before) that the proper solution is to change RegionStore so that it tracks the extent of a binding in addition to just the region and the offset. Throughout the following sections I would like to describe the way I found to achieve this.

BindingKey

BindingKey would also store the extent and use it for comparison. Two BindingKeys would be equal if they had the same region, kind, offset and extent.

class BindingKey {
  ...
  uint64_t Extent;
  ...
  explicit BindingKey(..., uint64_t extent);
  ...
  bool operator==(const BindingKey &X) const {
    return <old operator== implementation> && Extent == X.Extent;
  }
};

MemRegion

MemRegion would be modified so that it also knows its extent. The extent would be calculated lazily based on the type of the region, except for some special cases that need additional handling, like bitfields.
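A minimal sketch of the lazy calculation, assuming a hypothetical getExtentInBits() helper on top of the existing TypedValueRegion and ASTContext APIs (bitfields and the special cases below are left out):

std::optional<uint64_t> getExtentInBits(const MemRegion *R, ASTContext &Ctx) {
  if (const auto *TVR = dyn_cast<TypedValueRegion>(R)) {
    QualType Ty = TVR->getValueType();
    // Incomplete types (e.g. void) have no meaningful size.
    if (Ty->isIncompleteType())
      return std::nullopt;
    return Ctx.getTypeSize(Ty); // size in bits
  }
  return std::nullopt;
}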

If we fail to find the extent for some reason and cannot reason about it, we pretend that it is infinite, just like the current implementation does. The same pretense is made when the calculated extent is 0, which can happen when something is assigned to a placeholder variable.

struct hh {
  char s1[0];
  char *s2;
};

struct hh h0;
memcpy(h0.s1, input, 4);

h0.s1 doesn’t exist in memory and its extent is 0. In this particular case the extent could be determined from the parameters of memcpy, but that change is not in the scope of this proposal. From the MemRegion’s point of view, it cannot be told how wide the region is.

ClusterBindings

ClusterBindings is currently an llvm::ImmutableMap<BindingKey, SVal>, but this would change so that it is derived from this map and provides two additional lookup methods, for partial and overlapping value lookup.

class ClusterBindings : public llvm::ImmutableMap<BindingKey, SVal> {
public:
  const SVal *lookupPartial(const BindingKey &K) const;

  std::vector<SVal> getValuesCoveringBinding(const BindingKey &K) const;
};

The inherited default lookup() provides a quick O(log(n)) way to look up the exact binding if it exists, while lookupPartial() is a slower O(n) method that can find overlapping bindings, and getValuesCoveringBinding() is the slowest O(n * log(n)) method, which can be used to tell whether the binding is covered by initialized values or not.
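For illustration, the two new methods could be shaped roughly like this, assuming concrete offsets and hypothetical contains()/overlaps()/getExtent() helpers on BindingKey (the actual patch may differ):

const SVal *ClusterBindings::lookupPartial(const BindingKey &K) const {
  const SVal *Best = nullptr;
  uint64_t BestExtent = 0;
  // O(n): pick the smallest stored binding that fully covers K.
  for (const auto &Entry : *this) {
    const BindingKey &Stored = Entry.first;
    if (!Stored.contains(K))
      continue;
    if (!Best || Stored.getExtent() < BestExtent) {
      Best = &Entry.second;
      BestExtent = Stored.getExtent();
    }
  }
  return Best;
}

std::vector<SVal>
ClusterBindings::getValuesCoveringBinding(const BindingKey &K) const {
  // O(n * log(n)): collect every binding overlapping K, then check that
  // together they cover K without gaps; otherwise return an empty vector.
  std::vector<std::pair<std::pair<uint64_t, uint64_t>, SVal>> Overlapping;
  for (const auto &Entry : *this)
    if (Entry.first.overlaps(K))
      Overlapping.push_back(
          {{Entry.first.getOffset(),
            Entry.first.getOffset() + Entry.first.getExtent()},
           Entry.second});
  llvm::sort(Overlapping,
             [](const auto &A, const auto &B) { return A.first < B.first; });

  uint64_t Covered = K.getOffset();
  for (const auto &E : Overlapping) {
    if (E.first.first > Covered)
      return {}; // gap: K is not fully covered by initialized values
    Covered = std::max(Covered, E.first.second);
  }
  if (Covered < K.getOffset() + K.getExtent())
    return {};

  std::vector<SVal> Values;
  for (const auto &E : Overlapping)
    Values.push_back(E.second);
  return Values;
}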

Lookup

There can be multiple scenarios that could occur during a lookup. The first is when the exact same region is looked up.

int x = 10;
int y = x; <--


Since the binding being looked up and the binding in the store overlap completely, the default lookup method automatically finds the corresponding value; there is no need for manual work.

There is a difference however in this next scenario, when a small part of an existing binding is looked up:

int x = 1024;
char y = *((char *)&x); <--

looked up binding is inside the stored one
The current implementation of RegionStore would return 1024 in this case because it ignores the extents, which is incorrect: on a little endian platform the byte assigned to y is the least significant byte of x, which is 0. With the proposed changes RegionStore would not find a direct binding by calling lookup, however calling partialLookup would find the stored binding that covers the looked up one and would return 1024.

The third case that could happen is when the looked up binding overlaps every stored binding.

char x = 10;
int y = *((int *)&x);


looked up binding is outside the stored one
Here the current implementation finds the binding and returns 10, which is also incorrect because the other 3 bytes past x have not been initialized (technically those bytes would be part of some other variable on the stack, or the stored base pointer, etc., however in the static analysis context they are still uninitialized values). In this case partialLookup wouldn’t find the binding either.

The partialLookup method

This method finds the smallest stored binding that fully covers the looked up binding. In the example below the value of Binding 2 (the 16 bit wide binding created by the short store) will be returned.

int x = 1024;
*(short *)((char *)&x + 1) = 0;
int y = *((char *)&x + 2); <--

If the looked up binding is not covered by any stored binding, this method will not return a value.

struct S {
  char a;
  char b;
  char c;
  char d;
};

void foo() {
  S s{1, 2, 3, 4};
  int x = *(int *)&s; <--
}


In this case none of the stored bindings covers the looked up binding, but the looked up binding overlaps multiple stored bindings. The current implementation would return Binding 1, which is incorrect, and as seen, this case cannot be solved with partial lookup either. If only the default and partial lookups existed, in this example we would conclude that x has been assigned a garbage value, which is also incorrect. This is why getValuesCoveringBinding() has also been added.

The getValuesCoveringBinding method

This method returns the SVals that overlap the binding being looked up. If the binding is not covered fully by initialized values, an empty std::vector<SVal> is returned. In the example at the end of the previous section, the vector would contain Binding 1, Binding 2, Binding 3 and Binding 4. This will be useful if in the future we decide to concatenate these values.

In this next example however an empty vector is returned:

Strategy

Looking up a value would change the following way:

// Try getting the binding by comparing everything.
if getDirectBinding(region):
  return binding;

if getPartialBinding(region, direct):
  // We found the part of a symbolic region.
  if binding.isSymbolic():
    return getDerivedRegionValueSymbolVal(bindingSymbol, region);
  // We found the part of a constant value, but SVal slicing is 
  // not supported.
  if binding.isConstant():
    return UnknownVal();

// The binding is covered by initialized values, 
// but concatenating them is not supported.
if !getValuesCoveringRegion(region, direct).empty():
  return UnknownVal();

if getDefaultBinding(region):
  return binding;

// The idea of a default binding is that every
// value that doesn't have a direct binding inside
// the cluster is covered by it.
if getPartialBinding(region, default):
  return binding;

Results

The proposed changes have been implemented in D132752 and multiple false positives in our regression tests have been corrected so far. The only known true positive that disappeared is coming from the UninitializedObjectChecker, however after analyzing it, I think this comes from the checker and not the RegionStore model.

Thoughts and feedbacks are appreciated.


Introduced issues

Apparently this change also introduces an issue in the case that can be seen in the snippet below.

void issue() {
  int x = 0x01020304;
  *(char *)&x = 0;

  int y = x;
}
warning: Assigned value is garbage or undefined [core.uninitialized.Assign]
  int y = x;
  ^~~~~   ~

When *(char *)&x = 0 is evaluated, the 32 bit wide binding of x is replaced by the 8 bit wide 0. When y is later read, this makes RegionStore think that it reads 3 uninitialized bytes, and the analyzer reports a false positive.

RegionBindingsRef
RegionStoreManager::bind(RegionBindingsConstRef B, Loc L, SVal V) {
  ...

  // Clear out bindings that may overlap with this binding.
  RegionBindingsRef NewB = removeSubRegionBindings(B, cast<SubRegion>(R));

  ...
  return NewB.addBinding(BindingKey::Make(R, KeyKind), V);
}

This is what the store looks like during evaluation, per source line:

int x = 0x01020304:
  {"kind": "Direct", "offset": 0, "extent": 32, "value": 16909060 S32b}

*(char *)&x = 0:
  {"kind": "Direct", "offset": 0, "extent": 8, "value": 0 S8b}

int y = x:
  {"kind": "Direct", "offset": 0, "extent": 8, "value": 0 S8b}

The other issue is that only the binding with the same offset is overwritten, other bindings are not, so the following snippet also causes RegionStore to read wrong values:

void issue() {
  int x = 0x01020304;
  *((char *)&x + 1) = 1;

  int y = x;
}
int x = 0x01020304:
  {"kind": "Direct", "offset": 0, "extent": 32, "value": 16909060 S32b}

*((char *)&x + 1) = 1:
  {"kind": "Direct", "offset": 0, "extent": 32, "value": 16909060 S32b}
  {"kind": "Direct", "offset": 8, "extent": 8,  "value": 1 S8b}

int y = x:
  {"kind": "Direct", "offset": 0, "extent": 32, "value": 16909060 S32b}
  {"kind": "Direct", "offset": 8, "extent": 8,  "value": 1 S8b}

When x is looked up it has a direct binding and 16909060 (0x01020304) will be returned as the value of x, which is incorrect as the real value is 16908548 (0x01020104).

This latter issue has been addressed by modifying lookup the following way:

if getDirectBinding(region):
  if isRegionCoveredByMultipleBindings(region, direct):
    return UnknownVal();

  return binding;

Whenever a direct binding with the exact offset and extent is found, it is checked whether its region is covered by any other bindings as well, and the binding is only returned if it is not. This leads to the following outcome:

void issue_eval() {
  int x = 0x01020304;
  *((char *)&x + 1) = 1;

  clang_analyzer_eval(x); 
  clang_analyzer_eval(*((char *)&x + 1));
}
warning: UNKNOWN [debug.ExprInspection]
  clang_analyzer_eval(x);
  ^~~~~~~~~~~~~~~~~~~~~~
warning: UNKNOWN [debug.ExprInspection]
  clang_analyzer_eval(*((char *)&x + 1));
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To properly reason about this case, either the values can be concatenated right when a binding is created in RegionStoreManager::bind(), or a timestamp can be associated with each binding so that the analyzer knows that {"kind": "Direct", "offset": 8, "extent": 8, "value": 1 S8b} was bound later than {"kind": "Direct", "offset": 0, "extent": 32, "value": 16909060 S32b}; that way at least *((char *)&x + 1) can be reasoned about properly.
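For illustration only (not something the patch does), folding at bind time would amount to this for the concrete case above:

// Hypothetical fold of the two bindings (little endian, concrete values only):
uint32_t Old  = 0x01020304;                 // existing 32 bit wide direct binding
uint32_t Mask = 0xFFu << 8;                 // bits covered by the new 8 bit store at offset 8
uint32_t New  = (Old & ~Mask) | (1u << 8);  // == 0x01020104, i.e. 16908548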

Edit: a possible way to mitigate the first issue when the binding is replaced by a new one can be seen in D132752. Basically bindings will only be removed if either the binding about to be added is symbolic, or has a parent symbolic region. Bindings inside the store with symbolic offsets will also be removed.

In the example, when the Store answers an undefined, the uninitialized checker will report a bug.
However, under some (maybe most) circumstances, developers usually know whether little endian or big endian is used.
Reporting this kind of bug with the uninitialized checker will make users confused.
Maybe, a separate checker and bug report message for this kind of bug is preferred.

Besides, if users can explicitly (e.g. with an option) determine the endian type, do we need to assign the overridden byte to the correct location?
E.g. for the supplemented example, after the partial assignment of x, x can be determinately assigned to 0x01020300 or 0x00020304.
Or even more aggressively, 0x01020300 by default if the endian checker (I suggested above) is deactivated.

I really like this. The proposal makes sense to me to extend the store with region extents.

I already picked D132752 diff 14, and observed some test cases to be fixed by it. However, I also saw some regressions as well. I did the investigation and it turned out that checker callbacks for accepting some pointers could be affected by this proposal.
In the past, if a pointer was cast to char*, checkers such as CStringChecker would mimic a load through that pointer and possibly bind checker-specific metadata to that symbol. In our case the resolution was to load from the base region instead, so we are not loading a single char but the whole entity, and we have our canonical symbol to which we can bind our metadata. So it turned out to be a “bug” in our checker, but I suspect that other checkers could suffer from the same problem, so I leave this comment here. After we fixed this, we had no regressions in the test suite.

On the actual projects, we also saw a considerable increase of uninit reads, and this was expected. However, most of these new reports were FPs, which is not good, and suggests problems with the change (D132752).


Let me move away from this proposal and introduce some terminology we came up with during internal discussions to aid in explaining how we see the store should be implemented.

Terms and definitions

First of all, it would be nice to have some helper methods for the bindings:

  • B contains B' (reflexive, transitive, antisymmetric)
  • B overlaps with B' (reflexive)

This helps to explain situations in how B relates to B'.
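In code, these predicates over concrete bit ranges could be as simple as the following (illustrative only; the names are not from any patch, and nonzero extents are assumed):

struct Binding {
  uint64_t Start;  // offset in bits
  uint64_t Extent; // width in bits
  uint64_t end() const { return Start + Extent; }
  bool contains(const Binding &O) const {
    return Start <= O.Start && O.end() <= end();
  }
  bool overlapsWith(const Binding &O) const {
    return Start < O.end() && O.Start < end();
  }
};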

In the discussions, it helped a lot to categorize bindings into these 4 sets:
We took each binding S in the cluster such that B.overlapsWith(S):

  • Exact: S == B
  • Parents: S.contains(B)
  • Children: B.contains(S)
  • Uglies: !S.contains(B) && !B.contains(S)

Note that these sets are important only for discussing the scenarios, we probably don’t need to materialize any of these as part of an implementation.

To keep the implementation simple, it makes sense to have a memory model where bindings “nest”. In formal terms, for every pair of bindings (B1, B2) in a cluster, if B1.overlapsWith(B2) then B1.contains(B2) || B2.contains(B1). Or in other terms, the cluster never contains ugly bindings.

Example for good stores, aka. where bindings nest:

[.............]    [..]
  [..]  [.]

Example for a bad store:

[.............]    [..]
  [..]  [.]  [......] // <-- bad: overlaps but not contains

Note that the store we currently have nests, because default bindings don’t have an extent, so they are in effect from their start. We should preserve this invariant.
User code might want to add bindings violating this because they disrespect the strict-aliasing rule or do an unaligned store; but in practice, we should get away with this, and a potential checker catching those UB cases would justify this invariant anyway.

Now, let’s discuss loading and storing some values.

Store to binding B ([start, end]):

let UnknownStart := B.start
let UnknownEnd := B.end

for each binding S such that B.overlapsWith(S):
  if (B.contains(S)): // This includes if S == B, aka. overwriting the same binding
    remove(S)
  if (S.contains(B))
    continue;

  // We should not introduce "ugly" bindings, so we need to do something about them.
  // We could "split" existing bindings in a way to ensure the new binding "B" would still "nest", but in general it's probably not possible and would complicate things a lot for little value.
  // Alternatively, let record what range we need to "invalidate", and let's just bind "Unknown" to that range before storing "B".
  UnknownStart := min(UnknownStart, S.start)
  UnknownEnd  := max(UnknownEnd,  S.end)

// If we have to invalidate something, bind "unknown" there.
if (!B.contains([UnknownStart, UnknownEnd])):
  addDefaultBinding([UnknownStart, UnknownEnd], UnknownVal)

// Finally, just add binding "B".
addBinding(B)

Note that in the real implementation, we might want to “collapse” the introduced “unknown” ranges, but we don’t necessarily need to. To illustrate this, consider adding B3, then B4 in this example:

Let's add B3:
[     B1     ][   B2   ]
         [     B3      ] <-- adding this
=> result:
[       unknown        ]
         [     B3      ]

Let's add B4:
[       unknown        ]
         [     B3      ]
   [   B4   ] <-- adding this
=> result:
[       unknown        ]  <-- we should collapse these
   [      unknown      ]  <--
   [   B4   ]

Loading from a binding

Loading is slightly more tricky because in some cases we have LazyCompoundVals.

Loading from binding B where the type of B is non-primitive (array, struct, union, etc):

let NumOverlaps := the number of bindings S of the cluster such that B.overlapsWith(S)
if (NumOverlaps == 1):
  let C := the single binding for which B.overlapsWith(S)
  if (C == B && the value of C is a LazyCompoundVal):
    return that LazyCompoundVal

return a new LazyCompoundVal for the corresponding region matching the offset of B. If there is no clear match, just return Unknown.
// By returning LCVs, we basically "defer" the load until we either want to load a direct binding of a field (subregion) or we want a default binding of a single value.

Loading from binding B where the type of B is a primitive type (e.g. int).

Let’s start with an evil example:

[    default   ]                [    default   ]
        [  D1  ]
        [    unaligned load of 'int'   ]

Let’s categorize our bindings into 4 categories that were introduced in the first section and use our terms.
So, take bindings S in the cluster such that B.overlapsWith(S), and form our “Exact”, “Parents”, “Children” and “Uglies” sets.
Note that if we have “Exact” or “Parents” then we cannot have any “Uglies” due to our invariant.

Let’s focus on the common cases first.

if (not Children and not Uglies):
  if (Exact): return the corresponding value
  if (Parents):
    let P := the smallest Parent binding
    if (P is a default binding):
      return the default value or a LazyCompoundValue for the subregion.
    else: // if P is a direct binding
      return a derived symbol for that subsymbol or just Unknown for the
      cases when we would need to split concrete values.

// Otherwise, we have some "Child" or "Ugly" bindings partially covering
// the region we want to load from, so at best we can return "Undefined" or "Unknown".
// We should return "Undefined" if any of the bytes referred by binding
// "B" is "Undefined", otherwise return "Unknown.
// This can be determined in a single pass, sweeping through the store,
// determining what bindings cover what parts of "B".
// We should also respect the memory space of the cluster, as in globals,
// if we don't have a binding to a region, we can assume it's initialized, and
// return "Unknown" in that case. This should be different for automatic storage
// duration objects, where it should resolve to "Undefined" instead.
// Let's call this the "sweep".

This seems really fuzzy, so we decided to sketch up an example implementation of the “sweep” algorithm mentioned previously. The algorithm works in a single pass, so it’s O(N) w.r.t the number of bindings in the cluster, and gives you the value for each subsection of the binding B. Example:

Cluster cluster{{{0, 4}, "T1"},                     {{5, 14}, "T2"},
                {{1, 4}, "C11"},                    {{6, 8}, "C21"},  {{8, 10}, "C22"},
                {{1, 3}, "C111"}, {{3, 4}, "C112"}, {{6, 7}, "C211"}};
0   1     3    4  5    6    7    8        10      14
[      T1      ]  [             T2                 ]
    [    C11   ]       [   C21  ][   C22   ]
    [C111][C112]       [C211]

sweep([1,10]) should return a list like this, where it says, starting with a given offset, what the value is until the next mentioned offset. By using this list, one can reconstruct the exact values that cover the binding B, the [1,10] range in our case:

1 -> [1-3] C111
3 -> [3-4] C112
4 -> MemSpace
5 -> [5-14] T2
6 -> [6-7] C211
7 -> [6-8] C21
8 -> [8-10] C22

One could also use the result of sweep to answer the common part of the load, as the result already encodes them.

Many thanks to @Tomasz for the hard work in discussing the RFC, coming up with the mental model, and the concept implementation of the sweep() demonstrated here.

Note that the store we currently have nests, because default bindings don’t have an extent, so they are in effect from their start. We should preserve this invariant.

Well, this invariant is the reason for introducing the extents, so we definitely shouldn’t preserve it in my opinion. This causes false positives, which we responded to with the snippet below ages ago.

SVal RegionStoreManager::getLazyBinding(const SubRegion *LazyBindingRegion,
                                        RegionBindingsRef LazyBinding) {
  SVal Result;
  if (const ElementRegion *ER = dyn_cast<ElementRegion>(LazyBindingRegion))
    Result = getBindingForElement(LazyBinding, ER);
  else
    Result = getBindingForField(LazyBinding,
                                cast<FieldRegion>(LazyBindingRegion));

  // FIXME: This is a hack to deal with RegionStore's inability to distinguish a
  // default value for /part/ of an aggregate from a default value for the
  // /entire/ aggregate. The most common case of this is when struct Outer
  // has as its first member a struct Inner, which is copied in from a stack
  // variable. In this case, even if the Outer's default value is symbolic, 0,
  // or unknown, it gets overridden by the Inner's default value of undefined.
  //
  // This is a general problem -- if the Inner is zero-initialized, the Outer
  // will now look zero-initialized. The proper way to solve this is with a
  // new version of RegionStore that tracks the extent of a binding as well
  // as the offset.
  //
  // This hack only takes care of the undefined case because that can very
  // quickly result in a warning.
  if (Result.isUndef())
    Result = UnknownVal();

  return Result;
}

On the actual projects, we also saw a considerable increase of uninit reads, and this was expected. However, most of these new reports were FPs, which is not good, and suggests problems with the change.

Could you provide some examples please, so that we can investigate what causes these false positives? I see you already provided one example in the revision of the patch; based on that it seems we still have trouble with reading overlapping bindings.

for each binding S such that B.overlapsWith(S):
  if (B.contains(S)): // This includes if S == B, aka. overwriting the same binding
    remove(S)
  if (S.contains(B))
    continue;

I like the idea of removing overlaps upon creating a new binding, even though this can be O(n * log(n)) in the worst case scenario, where n is the number of bindings in the store. (O(n) for the iteration and O(log(n)) for each remove() call).

Edit: I just realized the complexity wouldn’t increase, since we’ve already done something similar before, except that it deleted every binding. Also, after investigating the false positive you attached to the revision, not removing the overlapping bindings is what causes it.

I don’t understand how “having extents in the store” would imply “we cannot have bindings in the store that nicely nest”, because I just shared a way of achieving this. Could you think of an example to support your statement?

Well, this invariant is the reason for introducing the extents, so we definitely shouldn’t preserve it in my opinion.

Ok, I’ll try to reduce one or two.

I thought that by “we should preserve this invariant” you meant you want to preserve that default bindings are infinite. We can have nested bindings; in fact, we can’t even fully avoid them.

I started merging the Store to binding B ([start, end]) algorithm with the current one, however I encountered some cases where it needed a little modification.

for each binding S such that B.overlapsWith(S):
  if (B.contains(S)): // This includes if S == B, aka. overwriting the same binding
    remove(S)
  if (S.contains(B))
    continue;
[    B1    ][    B2    ]  <-- store
------------------------
[      B3      ]          <-- we bind this

The above loop removes B1 but leaves B2 in the store, since B3 doesn’t contain it.

// If we have to invalidate something, bind "unknown" there.
if (!B.contains([UnknownStart, UnknownEnd])):
  addDefaultBinding([UnknownStart, UnknownEnd], UnknownVal)

As far as I understand, this snippet would evaluate the previous example further like this:

            [    B2    ]  <-- store
[       Unknown        ]
------------------------
[      B3      ]          <-- we bind this

We want to avoid this scenario and remove B2 too somehow. Also we only want this to happen to Direct bindings, as Default ones behave differently. Consider the following example:

struct Circle
{
    Point origin;
    int size;
};

Circle testObj;
memset(&testObj, 0, sizeof(testObj));
testObj.origin = makePoint(1, 2);     <-- bind this

At the moment of the binding the store will look like this:

[              0              ]  <-- store
-------------------------------
[  makePoint(1, 2); ]            <-- we bind this

We want to preserve 0 here, as replacing it with Unknown would lead to invalid results.

So the current implementation of the proposed algorithm can be seen below. Note that I haven’t yet gotten to the point of digging deeper into the lookup and modifying it too:

for each binding S such that B.overlapsWith(S):
  // This includes if S == B, aka. overwriting the same binding
  if (B.contains(S)): 
    remove(S)
  else if (B.overlapsWith(S) && !B.isDefault()):
    // This produces "ugly" bindings, but this can easily be adjusted if really 
    // needed. If we don't replace the value, we need to remove it and then add
    // a new one. Let's just be lazy until we finalize this algorithm.
    S = UnknownVal{}; 

// Finally, just add binding "B".
addBinding(B)

As a result, one FIXME in our regression tests has been partially eliminated (it says the value should be false or Unknown, and it’s Unknown, so we are probably good), one new nullptr dereference has been detected, and the false result from the snippet in the revision has also been eliminated.

Complexity of the proposed lookup

I started looking into the proposed lookup. You mentioned it’s O(n), however it requires the cluster to be sorted based on the ranges, so it must be at least O(n * log(n)).

I don’t think however that an algorithm here could actually be better, because we need to sort the bindings in order to reason about the ranges. Changing the cluster map so that the bindings are stored sorted by their ranges is not possible because of symbolic offsets. Sadly, we have no way to compare a symbolic offset to a non-symbolic one.

void foo(int x) {
  int y;
  *((char *)&y + x) = 0; // call this 'a'
  *((char *)&y + 1) = 1; // call this 'b'
}

How does a relate to b? Is a before b, after it or are they the same? It all depends on the value of x.

Mixing Direct and Default bindings

There is also a problem with mixing Default and Direct bindings, as there are mechanisms in the analyzer that first look for a Direct binding and, if they don’t find one, start to look for information in the parent region of the searched binding. Look at RegionStoreManager::getBindingForElement() or RegionStoreManager::getBindingForField() for example. In those cases we want to make sure that we don’t return a Default binding if we are looking for a Direct one.

There are scenarios however, where we do want to mix Default and Direct bindings and return Unknown, like the one below:

int *arr[4] = {0};
arr[3] = calloc(1, sizeof(char));
int x = **(int **)((int *)&arr[2] + 1);

clang_analyzer_eval(x); // expected-warning{{UNKNOWN}}
Store:
{"kind": "Default", "offset": 0, "extent": 256 value: 0 (Loc)}
{"kind": "Direct", "offset": 0, "extent": 64 value: 0 (Loc)}
{"kind": "Direct", "offset": 192, "extent": 64 value: &Element{HeapSymRegion{conj_$2{void *, LC1, S1160, #1}},0 S64b,int}}

Lookup:
{"kind": "Direct", "offset": 160, "extent": 64}
0          64        128        192        256
*          *          *          *          *

[-----------------Default 0-----------------]
[-----0----]
                                 [-&Element-]
                           [--Lookup--]

As a result I think that we should call sweep() whenever a Direct binding is needed, and ignore the result if either a Default binding is returned, or an Undefined value is returned.

Looking up Default bindings

The proposed method on its own cannot be used for looking up Default bindings, because overlapping Default bindings cannot be replaced with Unknown values, for the reason given in my previous comment. As a result, as opposed to Direct bindings, we don’t necessarily want to return the narrowest one.

Take a look at the following snippet and its model in the store:

Circle testObj;
memset(&testObj, 0, sizeof(testObj));
testObj.origin = makePoint(1, 2);
{"kind": "Default", "offset": 0, "extent": 96 value: lazyCompoundVal{...,result}}
{"kind": "Default", "offset": 0, "extent": 128 value: 0 S8b}
[      0      ]
[   LCV   ]

Let’s say now, that we are looking for the Default binding of testObj, which is {"kind": "Default", "offset": 0, "extent": 128}. The value we’d like to see here is 0, but sweep() would return [LCV, 0], which we’d concatenate into Unknown.

Here I think we should preserve the current logic, where first getDefaultBinding(), then getPartialBinding() is called, as we don’t want to cover a Default binding with other bindings. The next snippet is a better example for this.

Circle *testObj = calloc(sizeof(Circle), 1);

testObj->origin = makePoint(0.0, 0.0);
if (testObj->size > 0) {
  ;
}

clang_analyzer_eval(testObj->size == 0);
{"kind": "Default", "offset": 0, "extent": 96 value: lazyCompoundVal{0x55ebfa151160,result}}
{"kind": "Default", "offset": 0, "extent": 9223372036854775807 value: 0 S8b}

The binding from calloc() is basically infinite (which is actually incorrect), so when we look for the Default binding of testObj, which is {"kind": "Default", "offset": 0, "extent": 128}, we want to return 0 again. While calling getPartialBinding() can lead to false positives in scenarios like the one below, I don’t see a better option at the moment.

B1:  |------------------------|
B2:  |-------------|
Key:            |-----|

Note: getPartialBinding() only returns a false result if sweep() wasn’t called before, as that would return Unknown in this case. Direct bindings are looked up before Default ones and sweep() is called every time a Direct binding is looked up, so in most cases this shouldn’t be an issue.

Implementation changes

  • sweep() has been integrated into getValuesCoveringRegion(), which is now called by getDirectBinding().
  • Whenever a new binding is added, instead of simply overwriting the overlapped ones with Unknown, they are removed and replaced by an extended Unknown binding according to the specification. I thought we can spare the two operations, but as it turned out, sweep() requires the overlapped bindings to be aligned according to the specification.

Edit:

  • The type of the offsets has also been changed from uint64_t to int64_t, as they can be negative and that would confuse sweep().

Results

Two FIXMEs have been removed from the LLVM regression tests.

If you could turn the analyzer into a big endian mode, you would also need to pay attention to where a specific byte is assigned. Take a look at the next snippet as an example.

int x = 1;
bool littleEndian = *((char *)&x);
if (littleEndian) {
  int *ptr = nullptr;
  *ptr = 2;
}

On a little endian system the value 1 would sit in the first byte of x, while on a big endian system it would sit in the fourth one. *((char *)&x) however always returns the first byte of x, which is 0 on BE and 1 on LE. This means that the analyzer must be aware of where the specific byte is, so the store should use the correct offsets too.
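One hypothetical way to express this (illustrative only): a helper that maps the i-th addressed byte of an N byte wide scalar to the bit offset of the value it holds, depending on the target’s endianness.

uint64_t byteToBitOffset(uint64_t ByteIndex, uint64_t NumBytes, bool IsBigEndian) {
  return (IsBigEndian ? NumBytes - 1 - ByteIndex : ByteIndex) * 8;
}
// For int x = 1: byte 0 maps to bit offset 0 on little endian (the byte holding 1),
// but to bit offset 24 on big endian (the byte holding 0).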

Yeah, symbolic data is a hard nut to crack. We can have both symbolic offsets and symbolic extents. However, I wonder to what extent do we want to mix these with concrete ranges. I suspect that most code dealing with structs/classes uses concrete ranges, and most code dealing with arrays loads/stores one element at a time.

One potential way to side-step this problem is to be really conservative about symbolic offsets and extents. Whenever we have symbolic extent, assume the range extends to infinity, whenever we have symbolic offset, assume the range can start anywhere. As a result, we might end up replacing more bindings with unknown values, but hopefully, these cases are relatively rare.

What do you think?

P.S.: I really love where this thread is going, thanks a lot for all of you tackling this problem.


We can have both symbolic offsets and symbolic extents. However, I wonder to what extent do we want to mix these with concrete ranges.

I think we can’t mix them even if we wanted to, due to symbolic offsets and extents not being comparable to concrete numbers.

Whenever we have symbolic extent, assume the range extends to infinity, whenever we have symbolic offset, assume the range can start anywhere. […]

What do you think?

I agree with this. In fact, the current implementation does something very similar. It doesn’t allow symbolic extents, the extent will always be an int64_t number. If we see a SymbolicRegion, we try to figure out the extent based on the type of the symbol.

if (const auto *SR = dyn_cast<SymbolicRegion>(R))
    const auto Ty = SR->getSymbol()->getType().getDesugaredType(getContext());

If we find the type of the region, we will know the extent too. If we fail to find the type, or the type is something we cannot reason about (e.g.: void), we say the extent is infinite.

If the offset is symbolic, then upon adding the binding, every other binding from the cluster is deleted as if the binding with symbolic offset would overlap them all.

I think there are some limited scenarios where symbolic and non-symbolic extents can mix quite well, like flexible array members:

struct A {
  int member1;
  int member2;
  int flex[];
};

When we use instances of A, the (offset, extent) are concrete for the first two members, and the offset is concrete for the flexible array member, but the extent there might be symbolic.

In most cases, storing something with symbolic offset will end up removing all other existing bindings, just as you mentioned.

@NoQ might have some opinions here. He wants to strip type information from the memory regions.

@NoQ might have some opinions here. He wants to strip type information from the memory regions.

In this case I have no idea how to calculate the region extents. The DynamicExtent API also relies on this information and our array destruction mechanism relies on the DynamicExtent API, so that change will have a huge impact.

Dude, first of all, thank you for pushing this project ahead. I’m sorry I’m not paying nearly enough attention to it. This is a deep and difficult problem, and these days I barely have enough time to address “low-hanging fruit” topics :frowning:

My point is that most of the time, the information isn’t the property of the region, or at least, it’s definitely not part of the region’s identity. Even if x is a simple local variable of type T, this doesn’t mean that x currently contains an object of type T. It could contain no object whatsoever - say, at the moment in time between the allocation of the variable and the first instruction of the constructor. It could contain an object of a different type (say, as a result of type punning or placement new).

For most operations in the language, it’s immaterial what is the declared type of the region; what matters is how the current operation treats the region. In order to model all aspects of the four assignments y[m] = n in the code below:

int x;
char *y = reinterpret_cast<char *>(&x);
y[0] = 1;
y[1] = 2;
y[2] = 3;
y[3] = 4;

…you don’t really care what is x and what type does it have; all you care about is that the current operation treats the same pointer as a pointer to array of characters right now, in the moment.

This is currently encoded in the region’s identity: assignment is performed over Element{x, m, char}. However, an identity of a region is a permanent property of that region. It’s not suitable for reasoning about things that only make sense in the moment.

Of course, the fact that exactly one byte inside variable x has changed after each assignment, isn’t a momentary thing. That’s a permanent change to the program state and we need to be able to record it and extract it later. And we already aren’t recording it with the help of the region’s identity; instead, we’re decomposing it into what really matters: base region and offset. And, well, now also extent.

So a better way to capture things in the moment is by passing such information down from the AST, without recording it in the program state. Or, even if we have to record it in the state, record it in the part of the state that is specifically designed to hold “in-the-moment” things, such as the Environment which is reserved entirely for “spatial” storage, not for “temporal” storage.

Basically in my dream world, assuming that Loc is my vegan memregion substitute (aka base region + offset), I’m proposing is:

-ProgramState::bindLoc(Loc, Val);
+ProgramState::bindLoc(Loc, Val, Type);
-ProgramState::getSVal(Loc);
+ProgramState::getSVal(Loc, Type);

Just pass type information down from the AST! It’s always there! You already know that y[1] = 2 treats x like a char[]! Just say that out loud! The static analyzer is an AST interpreter! All the answers are in the AST! We shouldn’t be guessing anything!

The second part is actually already implemented, for the exact same reason: even when we know a lot about the location R, the code that gets modeled as getSVal(R) may choose to treat R differently in the moment. The region’s identity is already insufficient to capture that and we know it.

Of course it’s probably sufficient to pass the extent directly, as opposed to the entire type. But types are more commonly available, and converting type to size is somewhat tedious, so it’s probably easier to pass the type and implement conversion from types to extents once inside RegionStore.

Actually, I have an even stronger dream that doesn’t even require passing extents. Namely:

This was the most realistic solution, but that’s definitely not the ideal solution I’m dreaming about. There’s actually another source of type/extent information that’s also readily available to us, and it doesn’t suffer from the problems that the region’s identity suffers from: namely, we have the type of the bound value.

The code bindLoc(Loc, Val); currently extracts type information from Loc, which is absurd for the aforementioned reasons. Above I proposed to pass type information as a third argument. But the most natural and straightforward solution is to simply extract type information from Val!

Like, we’re supposed to know that in expression y[1] = 2 we’re binding a value of type char. If the identity of the RHS value isn’t sufficient for us to recognize that it’s a value of type char, we’re in trouble. Our concrete and symbolic values are ultimately fed into the constraint solver, which obviously cares about types of things, and it cares about them in a persistent, “temporal” manner, regardless of what’s what “in the moment”.

A direct binding of a value of type char cannot have any other extent. It’s always 8 bits wide.

Similarly, a default binding of type char[4] is always 32 bits wide.

You don’t have to record the extent separately. It’s already there, in the value type. Values naturally have types, and these types naturally have extents, and these extents are supposed to be exactly correct for all our purposes.
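For illustration, that would boil down to something like the following sketch, where Val and Ctx are assumed names and SVal::getType() is assumed to report the intended type (which, as explained below, is not always true today):

QualType ValTy = Val.getType(Ctx);              // char for y[1] = 2, char[4] for a default binding of the whole array
uint64_t ExtentInBits = Ctx.getTypeSize(ValTy); // 8 and 32 respectively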

So I think this is exactly how things should ideally work. There’s a big problem with that though, which is arguably bigger than your extent problem:

Types of our values are incorrect :person_facepalming:

We have correct value types for a lot of things, from simple integer constants to complex things like LazyCompoundValues for structs and arrays. We don’t have types for things like UnknownVal and UndefinedVal, but that’s easy to fix: just don’t use these values, they weren’t supposed to be used in the first place. However, what we’re lacking is the most basic thing: the types of unknown integers in the program. They carry some type, which is often the correct type, but integral casts on symbolic integers are currently modeled as no-op, which is incredibly incorrect. This obviously leads to false positives, because our constraint solver is fed incorrect data.

Unfortunately, feeding the constraint solver the correct data would most likely only make things worse. It’ll make equations more difficult to solve, so it’s more likely that the solver will fail to notice even some very basic contradictions (that it currently notices, even if for a wrong reason). So this is a much harder problem; it’s easy to point out the ideal solution, it’s hard to demonstrate a practical solution.

So ultimately I think we should focus on few things at a time, and I think storing extents like you initially proposed, is probably much more likely to succeed than pursuing the ideal solution.

I also don’t think you should implement my solution for abolishing sub-regions. It’s definitely much more intrusive than a mostly-transparent black-box change in RegionStore that you propose. But I do want you to understand this idea and keep it in mind when you’re working on your problem, so that you could shape your interfaces and contracts between different parts of the static analyzer with this understanding in mind :sweat_smile: Or, you know, prove me wrong. I always love when people do that :sweat_smile:

Side note, @Xazax-hun was talking about the discussion in https://reviews.llvm.org/D150446#4374822 which is my latest rant about how region types were a mistake. There’s a follow-up discussion in https://github.com/llvm/llvm-project/issues/42709#issuecomment-1568479915 where I explain how a few other problems are supposed to be solved within this “new paradigm”.

Passing down the type or the extent from above makes sense. Actually I was also thinking about something similar but only for some checkers. When you call calloc for example, the type of the region will be void * and from that there is no way to figure out the real extent.

The only way to figure out the extent in this case is to calculate it from the function parameters to which we have no easy access from MemRegion. We can take the conjured symbol and extract the statement it contains, which is calloc and then calculate the extent based on the parameters. However I don’t think such code should exist inside MemRegion, instead MallocChecker should be modified so that it also passes down the extent somehow.

You are right. I accidentally assumed that addDefaultBinding([UnknownStart, UnknownEnd], UnknownVal) will kill the previous bindings in the range [UnknownStart, UnknownEnd] there.
This means that any default or direct bindings in that range will need to be removed and only then we can add the single unknown default binding.

IMO in general, it shouldn’t matter whether a binding is direct or default. To me, a default binding can be an LCV, which basically means we “directly” bind some “compound” value somewhere, and in the presence of binding extents, they should genuinely act like a direct binding. But let’s stash this subject.

Coming back to your example, our proposed algorithm for storing the (LCV default) binding would nicely nest, so the default zero binding would remain in the store, hence we wouldn’t replace it with “Unknown”.

I think this example will illustrate our algorithm better for storing ugly default bindings:

Circle testObj = {}; // zero default
testObj.origin = makePoint(1, 2); // LCV default binding
testObj.origin.x = 4;
memset(&testObj.origin.y, 1, 2 * sizeof(int)); // default bind 1 covering 'testObj.origin.y' and 'testObj.size'
x         y         size
[              0...          ]
[        LCV        ]
[    4   ]
------------------------------
          [ memset default 1 ]      <-- we bind this

We will have this store:

x         y         size
[        unknown             ]
[    4   ][      1...        ]

I hope this clarifies our vision.

Consequently, I’m against checking B.overlapsWith(S) && !B.isDefault() to replace S with Unknown. IMO the behavior should be the same for direct bindings as well.


I believe we are using llvm::ImmutableSet/Maps, and I can also see that BindingKey implements the operator<, hence I assumed we can change the iteration order by increasing by start and then decreasing by end. This would naturally give us the right iteration order to make our algorithm efficient.
So, no explicit sorting should be required.

We can have at most one symbolic binding in a cluster, so the bindings in the cluster are already sorted and we don’t need to compare symbolic and non-symbolic bindings.
Oh, it seems like we can have at most one default binding even in the presence of a lonely symbolic binding.
It shouldn’t make the sorting more difficult though, as we are talking about a single symbolic binding here. We can always special case that.

I think currently, if we have a cluster that has a symbolic binding, it can only have a single “Unknown” default binding in addition - encoding if the region was previously at least partially initialized before the symbolic binding appeared. And I think this (limitation on symbolic bindings) would still make sense for the re-designed store implementation. Consequently, I’d advise not touching the semantics (storing and reading) of symbolic bindings.

Let’s come back to check how the store looks in your example:

void foo(int x) {
  int y; // empty store
  *((char *)&y + x) = 0;
  // default at 0        -> Unknown
  // direct  at symbolic -> 0
  *((char *)&y + 1) = 1;
  // default at 0 -> Unknown
  // direct  at 8 -> 1
  (void)(y + x);
}

And here is a slightly more complicated example to demonstrate how default and symbolic bindings interact to underpin my reasoning.

Why shouldn’t we say that the result of the lookup in this case is Unknown? To me, this is the handling of unaligned stores/loads, which falls into the ugly bindings territory. Assuming that strict aliasing is a thing, we shouldn’t dereference the result of the lookup, because that is an unaligned load. Consequently, we don’t need to model such loads, rather simply flag them (by some imaginary checker). To conclude, I think the result of the lookup here should be Unknown, because doing anything smarter than that is not worth the complexity and shouldn’t happen in the first place.

To me, lookups should not specify whether they are interested only in default or direct bindings. So, I don’t get this section. It’s not even clear to me what bytes (range) we are loading from.
IMO we should return Unknown only if we try to load a long long (primitive type) from offset 4, forming an unaligned load reaching across the LCV and the default 0 bindings. And in this case, as I have advocated already a few times, returning Unknown is fine.

Just to make sure we are aligned, I’ll plot a few scenarios, to depict how I imagine loading (non-ugly) bindings:

  • store.load(testObj) -> LCV{store, testObj}
  • store.load(testObj.origin) -> LCV{store', result}, where store' refers to the store where result still exists.

Let’s look at how a sequence of lookups (testObj.origin.y) would/should be resolved:

  1. store.load(testObj.origin) -> LCV{store', result}
  2. store'.load(result.y) -> ConcreteInt{2}

Our proposed lookup algorithm always returns LCVs, for loading non-primitive types. For primitive types, the sweep() will/should return a single value, which we simply return in that case.
And in case of loading a primitive type from an ugly binding, we would confirm that the whole region is covered by [LCV, 0], hence we should return Unknown (and not Undefined).


I agree that we should fix this at some point and implement an endian-aware RegionStore, but this counts as an “extra mile” to me. And we are not regressing in this regard, so I’m fine keeping the implementation simple for now. The ASTContext should have everything needed to implement this correctly one day.
So for now, let’s assume we only deal with little-endian (like x86) systems.
FYI: There are even middle-endian systems, that could complicate things a bit more.


Makes sense.

+1

I’d vote for separately introducing extents and getting rid of (most) non-base memregions.
These are both great subjects but let’s focus on extents first.


I couldn’t grasp the full extent (pun intended) of the problem.
Could you give some example code?

Yes, that example wasn’t the best, I’ll try to create a better one.

    [        Default        ]  <-- store
-----------------------------
[  Binding  ]                  <-- we bind this
    [        Default        ]  <-- store
[  Binding  ]                
-----------------------------
              [  Binding  ]    <-- we read this

In this example, when we read the binding, we want to return the value of the Default binding, as that doesn’t change even if we overwrite a part of it. In case of a Direct binding however, the value would change and we want to return Unknown. The proposed algorithm treats both cases identically, and as far as I understand it would do the following:

    [        Binding        ]  <-- store
-----------------------------
[  Binding  ]                  <-- we bind this
[          Unknown          ]  <-- store
[  Binding  ]                
-----------------------------
              [  Binding  ]    <-- we read this

This is correct for Direct bindings, but incorrect for Default bindings.

This example is actually tricky. Let’s rewrite it this way:

x         y         size
[              0...          ]
------------------------------
          [ memset default 1 ]      <-- we bind this

We will have this store:

x         y         size
[        unknown             ]
          [      1...        ]

If 0 was a Default binding, the resulting store is incorrect. We know that x is still 0 in this case. If 0 was a Direct binding, the resulting store is correct as we overwrote some parts of the value.

For this we need to create a comparator that doesn’t confuse the map lookup, so that the map doesn’t accidentally return the symbolic value when we look for the other binding and vice versa. This seems like a hack around a specific property, which I don’t like.

Note that even though we need to sort the bindings, usually the number of bindings in the cluster is very small. I don’t expect a significant performance loss because of the sorting.

Because that would tell the analyzer that we have found a Direct binding. In certain places, like RegionStoreManager::getBindingForField(), we don’t want to do this: if we don’t find a Direct binding, the analyzer performs some additional logic. If we returned Unknown, we would prevent the analyzer from performing this additional logic and we’d get false results. We could merge this logic into the lookup, however that would be incorrect in other situations, where we might not want to do it.

I agree with this, but so far the analyzer has been built around the philosophy of Direct and Default bindings being separate, so if we change this, a lot more parts will need to be rewritten. We can add a FIXME though that says that Direct and Default bindings shouldn’t be looked up separately and implement that later.

Consider the following snippet.

int *arr = calloc(sizeof(int), 10);
int x = arr[0];

The binding that arr contains is a pointer to the beginning of the heap region, calloc() allocated.

arr -> &Element{HeapSymRegion{conj_$2{void *, LC1, S1370, #1}},0 S64b,int}}

We want to calculate the extent of HeapSymRegion{conj_$2{void *, LC1, S1370, #1}}. For other regions we do this based on the type of the region but in this case the type will be deduced from the symbol, so we will get void *.

const auto *SR = dyn_cast<SymbolicRegion>(R);
const auto Ty = SR->getSymbol()->getType().getDesugaredType(getContext());
Ty.dump();
PointerType ... 'void *'
`-BuiltinType ... 'void'

Based on this we would assume the region is 64 bit wide, but that is incorrect. The extent of this region is the amount of memory calloc() allocated, which is sizeof(int) * 10.

We could extract this information from the symbol: since it’s a conjured symbol and we know that calloc() conjured it, we have access to its AST node, which contains the parameters it has been called with. I just don’t want to put this piece of code into MemRegion::getExtent(), as we would just end up duplicating MallocChecker. Instead we assume the region is infinite in this case, for now.
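For reference, if we ever wanted to consult the size recorded through the DynamicExtent API instead of giving up, it could look roughly like this (illustrative only; State, SVB and ExtentInBits are assumed names and this is not part of the current patch):

// Query the dynamic extent recorded for the heap region, in bytes.
DefinedOrUnknownSVal Size = getDynamicExtent(State, SR, SVB);
if (const llvm::APSInt *KnownSize = SVB.getKnownValue(State, Size))
  ExtentInBits = KnownSize->getZExtValue() * getContext().getCharWidth();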

The store looks correct. Returning Unknown for the load is correct.
I still think it’s fine for both default and direct bindings. I assume, we rarely encounter unaligned stores, so it shouldn’t spoil the store too much. The important invariant is to keep the store sound; and at least it remained sound in this case.
Anyway, we could improve the situation on unaligned stores by “splitting” the default binding.
For example, we could “split” a zero-default binding into two smaller zero-default bindings (or think of it as if it was split), to ensure that the new (unaligned) binding doesn’t spoil the whole default binding there.

    [        Default        ]  <-- store
-----------------------------
[  Binding  ]                  <-- we bind this
    [------->[   Default    ]  <-- store
[  Binding  ]                
-----------------------------
             [  Binding   ]    <-- we read this

Actually, we considered this in our original proposal, but I decided to drop it to keep things simple. I thought it was already quite complicated.
And I’m not sure if it’s always possible to do.
For default bindings binding a primitive value such as zero, unknown or undefined, it’s trivial. However, splitting LCVs seems a bit more complicated. For those, we would need a MemRegion to describe what we want to keep (mentally “copy” from the previous store), and I’m afraid it’s not always possible to form one.

But again, I’m not even sure if we need to care about this in practice, and it would for sure raise complexity.

No. The resulting store is incorrect. The memset default 1 binding would nicely nest when added, so the algorithm would just add it. We only bind the default unknown for the invalidation if we would have ugly bindings with the new binding. Refer to the condition guarding that statement.

I don’t really understand the comment after your example (even in the context of that store). In particular, I would highlight that binding unknown is always sound/correct, but frequently we can do better. So, in the extreme, a store that always returns unknown for any lookup is correct.

I’ll think about how can we fit symbolic bindings into the model. Initially, that was out of scope for our proposal. Let me come back to this later.

This reply seems to be tied to the current implementation. I must admit that the current implementation is so convoluted that I didn’t spend much time untangling it, because I assumed that within all that complexity we would also find bugs or unsound assumptions that could drive me sideways. So I tried to stick to how the store should work in theory, not how to retrofit the proposal onto the current codebase. TBH, I would probably just scrap the current one and start over. I was even considering having an alternative store implementation, to make this incremental, and letting users play with the new store until it strictly supersedes the one we currently have. This should be possible.

Anyway, I don’t envy you for implementing this, for sure. You are now probably among the really few who understand the store in CSA.

Now, replying to your comment that we cannot return Unknown somewhere: well, that’s probably bad, and likely one of the bugs I suspected.

Ah, yet another nail in the coffin. I’m afraid that deviating from the proposed algorithm would reduce the transparency of what the store is actually doing. Transparency means to me that it’s less likely to have bugs, cause confusion for future fixes, and turn into spaghetti like it is now.
What I think we can sacrifice is a layer between the true store (implemented cleanly, according to the spec) and the current getBinding() and friends, which would “fake” any semantics that would make the transition/rewrite easier.
This way we would have a place for workarounds and a correct store under the hood.
I know it’s handwavy, and it is, but this is the best I could come up with for now.
[I could still very well be detached from reality though.]

Ah, I thought it was replying to Artem’s comment about passing types down from the AST as a single source of truth.
The types for SymbolicRegions et al. are doomed. We should never rely on them.