Are `SymbolicRegions` really untyped?

I’m about to catch a bug regarding the modeling of copies of structs, which involves a bunch of SymbolicRegions and LazyCompoundVals along with understanding the RegionStore.
During my investigation, I found some interesting behavior relating the handling of SymbolicRegions.

One hand, they are used as if they represent some untyped memory, but in other cases we dig up the wrapped symbol and use that’s type as the type of the SymbolicRegion.
That being said, it feels like by reading the code SymbolicRegions are somewhere between being typed and untyped.

My questions are, why is the SymbolicRegion not a TypedValueRegion?
Why do we pretend that it’s a typed region in some situations?

Is the following statement true? For all Sym of type T*: SymRegion{Sym} is equivalent with Element{SymRegion{Sym}, 0, T}
If so, shouldn’t we canonicalize them?


Let’s gather some comments and code snippets regarding the typed-ness of SymbolicRegions:

  1. The last sentence of the doc comment of class SymbolicRegion:

/// SymbolicRegion - A special, “non-concrete” region. Unlike other region
/// classes, SymbolicRegion represents a region that serves as an alias for
/// either a real region, a NULL pointer, etc. It essentially is used to
/// map the concept of symbolic values into the domain of regions. Symbolic
/// regions do not need to be typed.

  1. This class inherits from SubRegion, so it is not an instance of TypedValueRegion - matching the comment above.

  2. The class always wraps a symbol (the address of the memory within the abstract machine’s memory), which by nature always typed.
    The presence of this symbol is enforced by an assertion in the constructor: assert(s && isa<SymbolData>(s)) where s is a SymbolRef.

  3. There are a few cases already, where we dig up the symbol wrapped by the SymbolicRegion and use the type of the symbol as an approximation for the type of the SymbolicRegion.

ExprInspectionChecker::analyzerDumpElementCount():

  QualType ElementTy;
  if (const auto *TVR = MR->getAs<TypedValueRegion>()) {
    ElementTy = TVR->getValueType();
  } else {
    ElementTy =
        MR->castAs<SymbolicRegion>()->getSymbol()->getType()->getPointeeType();
  }

At MemRegion.cpp::calculateOffset() there is this snippet of code & comment about handling SymbolicRegions:

if (const auto *TVR = dyn_cast<TypedValueRegion>(R)) {
  Ty = TVR->getDesugaredValueType(R->getContext());
} else if (const auto *SR = dyn_cast<SymbolicRegion>(R)) {
  // If our base region is symbolic, we don't know what type it really is.
  // Pretend the type of the symbol is the true dynamic type.
  // (This will at least be self-consistent for the life of the symbol.)
  Ty = SR->getSymbol()->getType()->getPointeeType();
  RootIsSymbolic = true;
}

RegionStoreManager::getBinding():

  if (!isa<TypedValueRegion>(MR)) {
    if (T.isNull()) {
      if (const TypedRegion *TR = dyn_cast<TypedRegion>(MR))
        T = TR->getLocationType()->getPointeeType();
      else if (const SymbolicRegion *SR = dyn_cast<SymbolicRegion>(MR))
        T = SR->getSymbol()->getType()->getPointeeType();
    }
    assert(!T.isNull() && "Unable to auto-detect binding type!");
    assert(!T->isVoidType() && "Attempting to dereference a void pointer!");
    MR = GetElementZeroRegion(cast<SubRegion>(MR), T);
  } else {
    T = cast<TypedValueRegion>(MR)->getValueType();
  }

In this case, it seems like we are wrapping the SymbolicRegion into an ElementRegion just to ‘make it typed’.
It doesn’t feel like a good thing to do when we should seek to have a canonical representation (keys) in the RegionStore to make lookups consistent with the binds.

I think the concept of something being “typed” is ambiguous. Would you consider a SymbolicRegion made for a void* pointer typed? Moreover, even when we have a type for a Symbol, that type could be imprecise. E.g., the actual dynamic type of the pointee of a symbolic pointer might be a derived class. I wonder whether the original intention was to use TypedValueRegion when we are certain that we have precise (dynamic) type information and SymbolicRegion when we are not. For example, we always know the most precise type when we saw how the object was constructed. I like the idea of having this distinction, but not entirely sure whether this actually holds true for every piece of code that constructs memory regions.

I think they might not be entirely equivalent.

I think the purpose of wrapping into an ElementRegion might have more to do with pointer arithmetic rather than giving a SymbolicRegion a type. Basically, whenever we have a pointer or a reference, we cannot know whether it points to a single object or an element of an array. If we dereference a symbolic pointer and assume that it points to a single object only to later discover that the user is doing pointer arithmetic to the pointer, it would be really hard to fix up the analysis state to undo the initial assumption of dealing with a single object. I think the analyzer tries to get around this by (almost?) always assuming that we will access objects at other indices later on.

So I think SymRegion{Sym} represents something that could be an array or a single object, we don’t know yet, while Element{SymRegion{Sym}, 0, T} represents the memory for a single object. Could you elaborate on the problem you are trying to solve? In what scenario would it make your life easier to canonicalize these? Is it possible that in some cases we fail to wrap the SymRegion when we should?

I believe that there are no correct answers to these questions, because the entire abstraction of “typed region” is flawed and should probably be nuked, together with things like “sub-region”.

As Gabor already pointed out, “type” of the region is the type of the object actually stored in it, and SymbolicRegion->getSymbol()->getType() is simply an educated guess for that. A char * doesn’t necessarily point to a null-terminated string, it could be buffer of any type. A void * doesn’t necessarily point to a void (because literally nothing ever points to a void because void d̵̝̈o̶̦͂e̴͉̽s̶͕͠ ̶͕̾n̷̡̂ó̴̜t̶̘̋ ̷̯̀é̵̱x̵̰̾ḭ̶̓s̸̝̔t̷͈̿).

So the “typed region” abstraction was put in place to discriminate between the situation where we see the actual object the pointer points to, eg.

void foo() {
  char x[5];
  char *p = &x; // 'p' is a typed VarRegion of type char[5]
}

and where the type isn’t known, eg.

void foo(char *p) {
  // 'p' is an untyped SymbolicRegion, probably some char or char[]
}

There are, however, other cases where precise type can be obtained in the middle of the analysis. One case, as Gabor mentioned, is dynamic_cast to a final class. Another example is aliasing:

int glob;
void foo(char *p) {
  if (p == &glob) {
    // `p` is an untyped region that definitely points to an int
  }
}

We already have a separate machine for dynamic types that tries to live together with intrinsic typedness of some but not all regions. We also don’t really care about aliasing because that’s probably the smallest problem with aliasing so for now we just pretend aliasing doesn’t exist.

But then, separately, there’s ElementRegion that screws things up even further as it can be slapped onto any region to turn it into technically “typed” even though this doesn’t mean we’ve obtained any actual information about the actual type of the actual object stored in that region. So now instead of typed regions we have to talk about typed base regions (with sub-regions slapped on top being completely irrelevant). And I actually struggle to come up with any upside of representing pointer casts with ElementRegions slapped on top because it also makes casting implementation insanely complicated (whereas it could have been a simple no-op).

So I think the right solution is to nuke typed regions, nuke element regions, field regions, base/derived class regions, and keep working in terms of base regions and offsets, just like RegionStore already does internally.

Sorry for my late response. Thanks for the replies.

Well, I’m not sure how I could solve this. Let me clarify and give more context about where this question was raised.

It’s a bit complicated example, but let’s see if I can shed some light on the issue where this question was raised. I’ll show 4 similar examples, getting more and more complicated with each.

void no_structs(int** n1) {
  int* c = *n1; // copy
  int** n2 = &c;
  clang_analyzer_dump(*n1); // rval(*n1): reg_$1<int * Element{SymRegion{reg_$0<int ** n1>},0 S64b,int *}>
  clang_analyzer_dump(*n2); // rval(*n2): reg_$1<int * Element{SymRegion{reg_$0<int ** n1>},0 S64b,int *}>
  clang_analyzer_eval(*n1 != *n2); // FALSE, good!
  (void)(*n1);
  (void)(*n2);
}

The behavior remains the same if the copy is done at the heap.

void no_structs_heap(int** n1) {
  int** n2 = new int*(*n1); // copy on heap
  clang_analyzer_dump(*n1); // rval(*n1): reg_$3<int * Element{SymRegion{reg_$2<int ** n1>},0 S64b,int *}>
  clang_analyzer_dump(*n2); // rval(*n2): reg_$3<int * Element{SymRegion{reg_$2<int ** n1>},0 S64b,int *}>
  clang_analyzer_eval(*n1 != *n2); // FALSE, good!
  (void)(*n1);
  (void)(*n2);
  delete n2;
}

Now let’s look at the case when the int * pointer is wrapped by a struct. The situation should remain exactly the same, but now we’ll have different answers, and the equality can no longer be proven.

struct Node { int* ptr; };
void with_structs(Node* n1) {
  Node c = *n1; // copy
  Node* n2 = &c;
  clang_analyzer_dump(*n1); // lazy...
  clang_analyzer_dump(*n2); // lazy...
  clang_analyzer_dump(n1->ptr); // rval(n1->ptr): reg_$2<int *         SymRegion{reg_$0<struct Node * n1>}.ptr>
  clang_analyzer_dump(n2->ptr); // rval(n2->ptr): reg_$1<int * Element{SymRegion{reg_$0<struct Node * n1>},0 S64b,struct Node}.ptr>
  clang_analyzer_eval(n1->ptr != n2->ptr); // UNKNOWN, bad!
  (void)(*n1);
  (void)(*n2);
}

We can see that the values are identical, except for the ElementRegion; making the equality check fail.

In case you wonder what could be different for doing the same on heap; well there must be something xD

void with_structs_heap(Node* n1) {
  Node* n2 = new Node(*n1); // copy on heap
  clang_analyzer_dump(*n1); // lazy...
  clang_analyzer_dump(*n2); // lazy...
  clang_analyzer_dump(n1->ptr); // rval(n1->ptr): reg_$4<int * SymRegion{reg_$2<struct Node * n1>}.ptr>
  clang_analyzer_dump(n2->ptr); // rval(n2->ptr): unknown
  clang_analyzer_eval(n1->ptr != n2->ptr); // UNKNOWN, bad!
  (void)(*n1);
  (void)(*n2);
  delete n2;
}

In this case, even the n2->ptr results in UnknownVal because internally while evaluating the load it gets the expected LazyCompoundVal, which is then cast to the required type (int *), but the evalCast() returns Unknown for that case.

I propose a fix for the mentioned issues in the following patch-stack:
https://reviews.llvm.org/D132142