Memory region invalidation tracking improvements

tl;dr

I'm proposing the introduction of a dedicated Symbol (SymExpr) for representing values resulting from invalidations. This would enable us to differentiate between regular SymbolConjured values and symbolic values resulting from some invalidation, and thus to build heuristic bug-report suppression for reports involving seemingly contradictory assumptions, as described in #53338 and #53970.


Foundation of invalidation

First, let me briefly introduce what it means for a memory region to escape from the analysis, and what invalidation and inlining mean in this context.
These are not precise definitions but should give a solid foundation for understanding this proposal.
For more details about the memory model, refer to [1][2].

A memory region escapes if the analyzer can no longer track what is written to that region, e.g. when a mutable reference/pointer is passed to a function that the analyzer cannot or does not want to inline.
Note that in this context by inlining I mean following the instructions within the callee’s body.
A memory region escape will trigger an invalidation to prevent false assumptions in the subsequent steps.
This is achieved in the analyzer by calling ProgramState::invalidateRegions().

Consider the following case:

struct Data {
  int value;
  int *q;
};

void escape(Data *p); // It might write to any global variable or any `lvalue` reachable through `p`.
int my_errno;

void top() {
  int x = 4;
  Data d = {5, &x};
  my_errno = 0;
  escape(&d);
  clang_analyzer_eval(d.value == 5);  // We cannot know: `d` directly escaped.
  clang_analyzer_eval(x == 4);        // We cannot know: indirectly escaped.
  clang_analyzer_eval(my_errno == 0); // We cannot know: indirectly escaped.
  // d.value:  SymbolDerived{SymbolConjured{$1, int}, d.value}
  // x:        SymbolConjured{$2, int}
  // my_errno: SymbolDerived{SymbolConjured{$3, int}, my_errno}
}

Invalidation will basically remove all bindings from the memory region, then bind a fresh SymbolConjured symbol representing the value of the whole memory region after the invalidation.
SymbolConjured, by design, carries type information. That information doesn't make much sense in this situation, since we just use int even for representing the whole d object, which is of type struct Data. For this reason, I recommend ignoring the types in this example.
To be clear, subobjects of d, such as d.value, will indeed have no bindings after the invalidation.

If one forms a reference to the d.value member after the invalidation, the analyzer will create a SymbolDerived symbolic value, representing the value of a smaller slice/chunk of a memory region whose value is represented by a larger symbolic value. In our case, since the value of d is represented by a SymbolConjured, SymbolDerived will just wrap it to represent the value of the member/subobject d.value.
For trivial types, such as the int variable x, the variable will be bound to a SymbolConjured symbolic value, which this time represents the whole object; this is why we don't see SymbolDerived in that case.

One might ask why we had SymbolDerived for invalidated global variables.
It is because of the hierarchical memory model [1]. The user's global variables reside in the GlobalInternalSpaceRegion. Any subregion within that memory space should be invalidated, since the escape() function might write to any of those regions (such as to global variables); hence the invalidation involves not only dropping bindings from the Store, but also binding a fresh conjured symbol to the entire memory space in which these globals reside. After this point, any global variable should slice out its part of the conjured symbol representing the value of the whole memory space. This is why we had different behavior for my_errno and the stack variable x.
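
To illustrate that last point: if there were a second user global next to my_errno, both would, per the description above, slice their own piece out of the same conjured symbol that stands for the whole global memory space. A small sketch; escape_all() is a hypothetical declaration and the exact dump rendering is approximate:

int global_a, global_b;
void escape_all(); // hypothetical: conservatively evaluated, may write to any global

void top_globals() {
  global_a = 0;
  global_b = 1;
  escape_all();
  // Expected, per the description above (exact rendering may differ):
  // global_a: SymbolDerived{SymbolConjured{$N, ...}, global_a}
  // global_b: SymbolDerived{SymbolConjured{$N, ...}, global_b}
  // i.e. both globals slice their value out of the same conjured symbol
  // ($N) bound to the global memory space by the invalidation.
}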

Note: Escaping the address of a subobject (e.g. a pointer to a member of Data) would cause the entire object to be invalidated, since one can apply pointer arithmetic to access any other non-const part of the enclosing object. This also applies to aggregates, such as arrays.
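
As a small illustration of this note, reusing the Data struct and clang_analyzer_eval from the example above (escape_member() is a hypothetical declaration, and the exact symbols the analyzer prints may differ):

void escape_member(int *p); // may write through 'p' and any reachable lvalue

void top_member() {
  Data d = {5, 0};
  escape_member(&d.value);            // only a member's address escapes
  clang_analyzer_eval(d.value == 5);  // We cannot know: directly escaped.
  clang_analyzer_eval(d.q == 0);      // We cannot know either: the whole 'd'
                                      // object is invalidated, since pointer
                                      // arithmetic on '&d.value' could reach
                                      // the sibling member 'd.q'.
}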


Motivation

SymbolDerived is frequently an artifact of some invalidation, but there might be other reasons why a SymbolDerived appears. [example needed]
In the example in the previous section, a single SymbolConjured was bound to the memory region of x, without any SymbolDerived wrapping it. As a consequence, the presence of SymbolDerived does not perfectly correlate with whether the memory region in question was invalidated.

Why do we need better tracking of invalidated memory regions?

There are multiple areas where this could potentially help.
One of them is bug report suppression. It is a frequent problem that, due to invalidation, we assign a different symbol to an entity and later add a constraint to it that seemingly contradicts the state prior to the invalidation point. This is especially confusing to deal with if the memory region was invalidated indirectly, as happens with global variables or with objects transitively reachable from an escaped object.
We should be able to detect bug reports that have a set of constraints associated with an interesting symbol which is the result of some invalidation, prior to which we had a different, contradicting set of constraints. Such bug reports should be marked invalid, since at best they would be difficult to explain to the user, if not outright false positives.
More details here: #53338, #53970
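
As a purely hypothetical illustration of the kind of report described above (this is not taken from the linked issues; may_fail() is a made-up declaration):

int g_errno;
void may_fail(); // conservatively evaluated: invalidates globals

int report_example(void) {
  g_errno = 0;                  // constraint before the call: g_errno == 0
  may_fail();                   // 'g_errno' is invalidated -> fresh symbol
  if (g_errno == 5)             // new constraint on the fresh symbol
    return 1 / (g_errno - 5);   // a division-by-zero report on this path
                                // reads as if 'g_errno == 5' contradicted
                                // the 'g_errno = 0' a few lines earlier
  return 0;
}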

Currently, we don't distinguish the reasons why an invalidation happened.
However, in different cases we might want to act differently, depending on e.g. whether the invalidation was triggered by LoopWidening or by a ConservativeEvalCall.
Digging such information out of a SymbolConjured or other artifacts is a nightmare, if not impossible.

Proposal

We could introduce a new kind of SymbolData to the SymExpr hierarchy. We already have a couple of others, like SymbolExtent or SymbolMetadata, denoting special symbols, so this wouldn't be an outlier. I'm not sure what to call it, so let's refer to it as SymbolInvalidationArtifact for the time being.

This SymExpr would hold an InvalidationCause object, which would hold the metadata of the invalidation event, such as the Stmt of the loop, which was widened [3]; or a CallExpr for a conservative eval call, etc.

InvalidationCause
|- UnmodeledCall
|  |- ConservativeEvalCall {CallExpr}
|  `- PartiallyModeledCall {CallExpr} // e.g. memset is sometimes not modeled perfectly, so we fall back to invalidation
`- UnmodeledStmt
   |- UnmodeledExpr {Expr} // e.g. atomics
   `- LoopWidening {loop Stmt}

All non-leaf classes in this inheritance diagram are abstract, similarly to how the MemRegion and SymExpr hierarchies are organized.
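
To make the shape of the proposal a bit more concrete, here is a rough, self-contained C++ sketch. None of this is existing Clang code: only the class names mirror the diagram above, while the simplified AST stand-ins, the SymbolID member and the ownership model are made up for illustration; the real thing would live in the SymExpr/SymbolData hierarchy and use the analyzer's allocators and folding sets.

#include <memory>

struct Stmt {};                     // stand-ins for the Clang AST nodes
struct Expr : Stmt {};
struct CallExpr : Expr {};

// Metadata describing why an invalidation happened.
class InvalidationCause {
public:
  enum class Kind { ConservativeEvalCall, PartiallyModeledCall,
                    UnmodeledExpr, LoopWidening };
  virtual ~InvalidationCause() = default;
  Kind getKind() const { return K; }
protected:
  explicit InvalidationCause(Kind K) : K(K) {}
private:
  Kind K;
};

class UnmodeledCall : public InvalidationCause {        // abstract
protected:
  UnmodeledCall(Kind K, const CallExpr *CE) : InvalidationCause(K), CE(CE) {}
  const CallExpr *CE;
};
class ConservativeEvalCall final : public UnmodeledCall {
public:
  explicit ConservativeEvalCall(const CallExpr *CE)
      : UnmodeledCall(Kind::ConservativeEvalCall, CE) {}
};
class PartiallyModeledCall final : public UnmodeledCall {
public:
  explicit PartiallyModeledCall(const CallExpr *CE)
      : UnmodeledCall(Kind::PartiallyModeledCall, CE) {}
};

class UnmodeledStmt : public InvalidationCause {        // abstract
protected:
  UnmodeledStmt(Kind K, const Stmt *S) : InvalidationCause(K), S(S) {}
  const Stmt *S;
};
class UnmodeledExpr final : public UnmodeledStmt {
public:
  explicit UnmodeledExpr(const Expr *E) : UnmodeledStmt(Kind::UnmodeledExpr, E) {}
};
class LoopWidening final : public UnmodeledStmt {
public:
  explicit LoopWidening(const Stmt *Loop) : UnmodeledStmt(Kind::LoopWidening, Loop) {}
};

// The proposed SymbolData kind: a conjured-like symbol that remembers the
// invalidation event which produced it.
class SymbolInvalidationArtifact {
public:
  SymbolInvalidationArtifact(unsigned SymbolID,
                             std::unique_ptr<const InvalidationCause> Cause)
      : SymbolID(SymbolID), Cause(std::move(Cause)) {}
  const InvalidationCause &getCause() const { return *Cause; }
  unsigned getSymbolID() const { return SymbolID; }
private:
  unsigned SymbolID;
  std::unique_ptr<const InvalidationCause> Cause;
};

int main() {
  CallExpr CE;
  // A conservative eval call invalidates some region; the fresh symbol keeps
  // a handle to the cause, so e.g. a bug report visitor could later ask
  // "was this value produced by an invalidation, and if so, why?".
  SymbolInvalidationArtifact Sym(1, std::make_unique<ConservativeEvalCall>(&CE));
  return Sym.getCause().getKind() == InvalidationCause::Kind::ConservativeEvalCall
             ? 0 : 1;
}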

It probably wouldn't be hard to introduce this, since we only have a couple of uses of the existing invalidation API.
There are only 8 references to ProgramState::invalidateRegions():

  • CStringChecker: We try hard to perfectly model the memcpy operations, but if we fail to do so, we instead invalidate some regions (the source and destination buffers).
  • SmartPtrModeling/handleOstreamOperator(): We evaluate (evalCall) the operator<<(std::ostream&, const std::unique_ptr<T>&) call to preserve the state associated with the smart pointer, and instead manually invalidate the memory of the ostream object.
  • ExprEngine/createTemporaryRegionIfNeeded()/SubobjectAdjustment::MemberPointerAdjustment: Unimplemented.
  • ExprEngine/VisitAtomicExpr(): Handlers of __c11_atomic_load() are unimplemented as well, thus we invalidate the regions passed to them as arguments.
  • ExprEngineCallAndReturn/bindReturnValue(): I'm not exactly sure, but it's related to the widening of constructor calls, so that a skipped/not-evaluated constructor won't cause false uninitialized-read reports. But this is a wild guess. Anyway, it shouldn't be too important.
  • ExprEngineCallAndReturn/conservativeEvalCall(): invalidates the escaped memory regions.
  • ExprEngineCXX/VisitCXXNewExpr(): Comment says that “Invalidate placement args”. So that’s probably what it does :smiley:
  • LoopWidening/getWidenedLoopState(): Takes the current stack and the globals and invalidates them.

By using such dedicated symbols to represent the value of a memory region after an invalidation event, we can distinguish those values from e.g. SymbolConjured values produced by some Eval::Call event.
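
A small illustration of the distinction (unknown_value() and opaque_write() are hypothetical declarations; the proposed behavior follows my intent described above and in the replies below):

int unknown_value();      // opaque call: today its result is a plain SymbolConjured
void opaque_write(int *p); // escape: today '*p' also ends up as a conjured symbol

void distinguish_example() {
  int a = unknown_value(); // today: SymbolConjured
  int b = 0;
  opaque_write(&b);        // today: also SymbolConjured (indistinguishable from 'a')
  // With the proposal, 'b' would instead be bound to a SymbolInvalidationArtifact
  // carrying a ConservativeEvalCall cause, while 'a' would stay a SymbolConjured.
}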

@NoQ @Xazax-hun @martong @Szelethus @ASDenysPetrov


[1]: A Memory Model for Static Analysis of C Programs - foundational paper
[2]: Region Store - dev docs
[3]: Widening - wiki

I very much support this initiative. Thanks a lot for working on this.

SymbolConjured is basically always bad: it always means that we've not only lost information, so we have to conjure a symbol out of thin air, but we've also forgotten why we lost that information. Any effort to replace them with something more expressive is very welcome.

SymbolDerived is frequently an artifact of some invalidation, but there might be other reasons why a SymbolDerived appears. [example needed]

This may happen when there are actual symbols of structure type, e.g. when a structure is returned from an unknown function by value in C (not C++), and then a field of that structure is accessed. But it doesn't really matter; it's not like we consciously preferred that to producing a compound value over a temporary region.

  • ExprEngineCallAndReturn/bindReturnValue(): I'm not exactly sure, but it's related to the widening of constructor calls, so that a skipped/not-evaluated constructor won't cause false uninitialized-read reports. But this is a wild guess. Anyway, it shouldn't be too important.

That's what happens when an unknown function returns a struct by value in C++. In this case the target region is well-defined, so all we can do is make sure its contents are no longer known. It's basically equivalent to returning into a pointer passed into the function as an argument (and that's exactly how it's typically implemented in the ABI).

It is a good initiative. However, alternatively (or in parallel), we might want to fight the problem by other means.

With invalidation we lose information and use an estimation that might be wrong. We invalidate global variables because we simply don't know whether the callee is going to change that memory region. In your example, my_errno is being set to UNKNOWN even though escape might not do anything with that variable. So, invalidation of a global variable can lead to false positives as well! In an extreme case, it might turn out that we have more false-positive reports with invalidation than without it. That would be an inherent failure, because invalidation was designed to avoid false positives. Invalidation is a subjective heuristic. We could have guessed the other way, i.e. that the value remained intact. Why is it better to say that we don't know anything about that variable than to say that it was set to 0, so it might still be 0? I haven't seen any empirical measurements or publications on this.

My point is that we should try to minimize the number of invalidations during the analysis. And we can achieve that if we have more information about the called functions.

I imagine that, before commencing the path-sensitive analysis, we could analyse all functions of a project and discover which memory regions (arguments, globals) they are going to write/read, or whether a function is even pure. This can be done via known data-flow algorithms, and there has already been an attempt to reuse the existing LLVM pass to get this information (⚙ D85319 [analyzer][RFC] Get info from the LLVM IR for precision). Back then we had concerns about reusing an LLVM pass in CSA, infrastructural concerns mainly. However, we might be able to re-implement the required data-flow analysis on top of the Clang CFG, especially since we lately have the shiny new data-flow framework (⚙ D114234 [clang][dataflow] Add base types for building dataflow analyses). @Xazax-hun, what do you think, is the framework mature enough to conduct such an analysis?
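
A tiny illustration of what such information could buy us (all names here are hypothetical, and no such summary mechanism exists in the analyzer today):

int some_global;

// Defined in another translation unit, so the path-sensitive engine cannot
// inline it and today conservatively invalidates all globals at the call
// site. A whole-project pre-analysis could record that it only writes
// through its argument.
void fill(int *out);

void caller() {
  int local;
  some_global = 0;
  fill(&local);
  // With such a write-set summary, the binding 'some_global == 0' could be
  // kept here instead of being invalidated.
}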

Finally, invalidation is here forever, because there are many library and system functions that we cannot model precisely. So, I support this idea, but perhaps these man-hours could rather be spent on improving precision.

What do we actually consider invalidation? The question might seem silly at first glance, but consider the following two examples:

int conservative();           // variant 1: the result is returned by value
void conservative(int *out);  // variant 2: the result is written through an output argument
int x = conservative();
int y;
conservative(&y);

Would your proposal assign a conjured symbol to x and an invalidation symbol to y? It makes me a bit uneasy that the two variants of the code above might have the exact same purpose, yet we would do very different modeling for them. I wonder whether initializing variables using output arguments should have special handling.

Do you plan to implement this as a global visitor that affects all the checkers?

I do not have any empirical data, but I think this specifically helps in scenarios where a variable is initialized by a function call and we don't want to get spurious errors from using an Undef value, getting a division by zero, or something similar.

I think the framework is promising, but it currently has some problems. With its current memory model the fixed-point iteration did not converge, so it currently does not attempt to reach a fixed point. I am not sure what the current status is, but I think the authors are working on a solution.

The answer is twofold. Yes, I would bind different kinds of SymbolData, namely SymbolConjured for x and SymbolInvalidationArtifact for y; but the engine/checkers should not rely on what kind of SymbolData it is, which is something that should already be done only with care. So they are different, and sort of the same at the same time.

Let me give another example of when checkers should not assume the kind of a symbol.
The SymbolExtent SymbolData was designed to represent extent sizes, so one would expect getDynamicExtent() to return a symbol of that kind, but that is not the case in general.

Consider this seemingly unrelated code:

#include <stdlib.h>

unsigned get_number_of_items(); // opaque call: returns an unknown count

int *top() {
  unsigned items = get_number_of_items(); // 'items' is 'SymbolConjured{$1}'
  int *p = (int*)malloc(items * sizeof(int));
  return p; // extentOf(p): (SymbolConjured{$1} * 4)
}

The extent of the memory pointed to by p is represented by SymIntExpr{SymbolConjured{$1}, 4}. As a consequence, one should rarely depend on the kind of a symbol.
That being said, we rarely if ever check for only SymbolConjured kinds, and it immediately raises a red flag for me when I see something like that.

Yes, I've already experimented with some ideas, but it was difficult to handle all the edge cases; hence I wrote this proposal. It's going to be a slow process, though. I might not even finish this alone, but the steps should be outlined even if not completed.

I’m more in the camp of having invalidation. Consider this example:

int myerrno;
void invalidate(); // conservatively evaluated: invalidates the globals

void top() {
  myerrno = 0;
  invalidate();
  // If no error happened, no error recovery needed.
  if (myerrno == 0)
    return; // Bail out on success.
  // Now do some error recovery...
}

If we don't invalidate the globals, then the engine will only discover the successful early-return path. Any bugs left on the error recovery path would go undiscovered.
So, invalidation is not about reducing FPs, but rather about preserving correctness, even at the expense of a lot of FPs.

Yes, I agree. And this statement underlines that we should strive to keep the number of invalidations to a minimum.

Me too. I am not against invalidation; it is still better than keeping the stale values of globals. I am just questioning whether it is really the best we can do in all cases.