tl;dr
I’m proposing the introduction of a dedicated Symbol (SymExpr) for representing values resulting from invalidations. This would enable us to differentiate regular SymbolConjured
values and symbolic values resulting from some invalidation, thus creating heuristical bug report suppression involving seemingly contradicting assumptions like described at #53338, #53970.
Foundation of invalidation
First, let me briefly introduce what it means if a memory region escapes from the analysis, and what invalidation and inlining means in this context.
These are not precise definitions but should give a solid foundation for understanding this proposal.
For more details about the memory model refer to [1][2].
A memory region escapes if the analyzer can no longer track what is written to that region. E.g. if a mutable reference/pointer is passed to a function, the analyzer cannot/doesn’t want to inline.
Note that in this context by inlining I mean following the instructions within the callee’s body.
A memory region escape will trigger an invalidation to prevent false assumptions in the subsequent steps.
This is achieved in the analyzer by calling the ProgramState::invalidateRegions()
.
Consider the following case:
struct Data {
int value;
int *q;
};
void escape(Data *p); // It might write to any global variable or any `lvalue` reachable through `p`.
int my_errno;
void top() {
int x = 4;
Data d = {5, &x};
my_errno = 0;
escape(&d);
clang_analyzer_eval(d.value == 5); // We cannot know: `d` directly escaped.
clang_analyzer_eval(x == 4); // We cannot know: indirectly escaped.
clang_analyzer_eval(my_errno == 0); // We cannot know: indirectly escaped.
// d.value: SymbolDerived{SymbolConjured{$1, int}, d.value}
// x: SymbolConjured{$2, int}
// my_errno: SymbolDerived{SymbolConjured{$3, int}, my_errno}
}
Invalidation will basically remove all bindings from the memory region, then bind a fresh SymbolConjured
symbol representing the value of the whole memory region after the invalidation.
SymbolConjured
by design, carries type information. It doesn’t make much sense in this situation since we just use int
even for representing the whole d
object, which is of type struct Data
. For this reason, I recommend ignoring them in this example.
To get it straight, subobjects of d
, such as d.value
, will indeed have no bindings.
If one forms a reference to the d.value
member after the invalidation, the analyzer will create a SymbolDerived
symbolic value, representing the value of the smaller slice/chunk of the memory region of a larger symbolic one. In our case, since the value of d
is represented by SymbolConjured
, SymbolDerived
will just wrap it to represent the value of the member/subobject d.value
.
For trivial types, such as the int
variable x
, x
will be bound to a SymbolConjured
symbolic value, which will represent the whole object this time, and this is why we don’t see SymbolDerived
in that case.
One might ask why did we have SymbolDerived
for invalidated global variables?
It is because of the hierarchical memory model [1]. The user global variables reside in the GlobalInternalSpaceRegion
. Any subregions within that memory space should be invalidated since the escape()
function might write to any of those regions (such as to global variables), hence the invalidation will involve not only dropping bindings from the Store
, but also binding fresh conjured symbols to the entire memory space in which these globals reside. After this point, any global variables should slice its part of the conjured symbol representing the value of the whole memory space. This is why we had different behavior for my_errno
and the stack variable x
.
Note: Escaping the address of a subobject (e.g. a pointer to a member of Data
) would cause the entire object to be invalidated since one can apply pointer arithmetic to access any other non-const part of the wrapping object. This also applies to aggregates, such as arrays as well.
Motivation
SymbolDerived
is frequently an artifact of some invalidation, but there could/might be other reasons why a SymbolDerived
appears. [example needed]
In the example at the previous section, a single SymbolConjured
was bound to the memory region of x
, without any SymbolDerived
wrapping it. As a consequence, the presence of SymbolDerived
does not perfectly correlate to the fact that invalidation happened with a given memory region in question.
Why do we need better tracking of invalidated memory regions?
There are multiple areas that could potentially help.
One of such is bug report suppression. It’s a frequent problem that due to invalidation we assign a different symbol for an entity to which later we add a constraint seemingly contradicting with the state previous to the invalidation point. This is especially confusing to deal with if the memory region was indirectly invalidated, such as global variables or transitively reachable objects by following an escaped object.
We should be able to detect bug reports that have a set of constraints associated with an interesting symbol, which was the result of some invalidation prior to which we had a different, contradicting set of constraints to the one we have at the bug report. Such bug reports should be marked invalid, since at best these would be difficult to explain to the user if not outright a false-positive.
More details here: #53338, #53970
Currently, we don’t distinguish the reason why an invalidation happened.
However, in different cases, we might want to act differently, depending on e.g. if the invalidation was triggered by the LoopWidening
or a ConvervativeEvalCall
.
Digging such information from a SymbolConjured
or other artifacts is a nightmare if not impossible.
Proposal
We could introduce a new kind of SymbolData
to SymExpr
s hierarchy. We already have a couple of others like SymbolExtent
or SymbolMetadata
denoting special symbols, so this wouldn’t be an outlier. I’m not sure how to call it, so let’s refer to it as SymbolInvalidationArtifact
for the time being.
This SymExpr
would hold an InvalidationCause
object, which would hold the metadata of the invalidation event, such as the Stmt
of the loop, which was widened [3]; or a CallExpr
for a conservative eval call, etc.
InvalidationCause
|- UnmodeledCall
| |-ConservativeEvalCall {CallExpr}
| `-PartiallyModeledCall {CallExpr} // e.g. memset sometimes not modeled perfectly, thus we call invalidation
`- UnmodeledStmt
|- UnmodeledExpr: {Expr} // e.g. atomics
`- LoopWidening: {loop Stmt}
All non-leaf classes in this inheritance diagram are abstract, similarly to how MemRegion
s or SymExpr
s.
Probably wouldn’t be hard to introduce this, since we only have a couple of uses of the currently existing invalidation API.
There are only 8 references to the ProgramState::invalidateRegions()
:
-
CStringChecker
: We try hard to perfectly model thememcpy
operations but if we fail to do so, we rather invalidate some regions (source and destination buffers). -
SmartPtrModeling/handleOstreamOperator()
: We eval call theoperator<<(std::ostream&, const std::unique_ptr<T>&)
to preserve the state associated with the smart pointer, and we rather manually invalidate the memory of theostream
object. -
ExprEngine/createTemporaryRegionIfNeeded()/SubobjectAdjustment::MemberPointerAdjustment
: Unimplemented. -
ExprEngine/VisitAtomicExpr()
: Handlers of__c11_atomic_load()
are unimplemented as well, thus we invalidate the regions passed to them as arguments. -
ExprEngineCallAndReturn/bindReturnValue()
: I’m not exactly sure, but it’s related to the widening of constructor calls so that the not evaluated/skipped constructor won’t cause false uninitialized read bugs due to the skipped ctor calls. But this is a wide guess. Anyway, it shouldn’t be too important. -
ExprEngineCallAndReturn/conservativeEvalCall()
: invalidates the escaped memory regions. -
ExprEngineCXX/VisitCXXNewExpr()
: Comment says that “Invalidate placement args”. So that’s probably what it does -
LoopWidening/getWidenedLoopState()
: Takes the current stack and the globals and invalidates them.
By using such dedicated symbols for representing the value of some memory region posterior to some invalidation event, we can distinguish those values from e.g. SymbolConjured
values produced by some Eval::Call
event.
@NoQ @Xazax-hun @martong @Szelethus @ASDenysPetrov
[1]: A Memory Model for Static Analysis of C Programs - foundational paper
[2]: Region Store - dev docs
[3]: Widening - wiki