[analyzer] Zombie symbols

A certain peculiar property of the removeDeadBindings() mechanism seems to have been overlooked since quite a long time.

My colleague Ilya Palachev has recently shown me a specific leak false-negative. It could be simplified down to the following fairly short SimpleStreamChecker test that currently fails:

   void test() {
     FILE *fp = fopen("myfile.txt", "w");
     fp = 0; // Hmm, we should have overwritten the symbol.
   } // expected-warning{{Opened file is never closed; potential resource leak}}

So what's going on? In fact, it is not quite correct to say that "checkDeadSymbols() is called for symbols that are no longer available in the program state". It is just the opposite: it is only called for symbols that *are* available in the program state, otherwise how does it know that they even exist!

In order to die, the symbol needs to be marked as dead on one of the removeDeadBindings() runs, and also never marked as live during such run (otherwise the "live" mark takes a higher priority). Such marks can be added by Environment (which marks values of active expressions as live and values of inactive expressions as dead), Store (which runs a worklist to find live keys and mark their values live, and then find more live keys, etc.), Constraint Manager (which simply marks all of its symbols as dead, as he is not the one to be interested in keeping them alive), or checkers (through checkLiveSymbols() callback).

So in the program state before the assignment, the file descriptor is live, because the store marks it as live, because it is a value stored in a region of a live variable.

And in the state after the assignment, the file descriptor is not present at all, because the checker fails to inform SymbolReaper of this symbol's existence in its checkLiveSymbols() callback, and there are no other sources of such information for the SymbolReaper.

And the assignment itself doesn't instantly trigger the dead symbol removal. It is only triggered once in a while automatically "between" the statements, and the kind of the statement is of little concern.

SimpleStreamChecker could be fixed with the following callback:

   void SimpleStreamChecker::checkLiveSymbols(ProgramStateRef State,
                                              SymbolReaper &SymReaper) const {
     StreamMapTy TrackedStreams = State->get<StreamMap>();
     for (StreamMapTy::iterator I = TrackedStreams.begin(),
                                E = TrackedStreams.end(); I != E; ++I) {
       SymbolRef Sym = I->first;
       SymReaper.maybeDead(Sym);
     }
   }

However, even if i propose this fix, somebody has to, kind of, update the presentation of writing the checker in 24 hours, and explain all the stuff from this mail in the process, and i guess the checker would no more be as simple as it used to be.

A few other checkers are not suffering from this issue because another entity performs the maybeDead() call for them: for example, MallocChecker adds default bindings on their symbol's symregions, and leaves it to the Store to mark the base symbols as dead. However, i guess they shouldn't rely on the Store for doing this job.

So i've got a few ideas:

0. Fix all checkers as discussed above.
1. Scan all symbols in the SymbolManager to find zombie symbols, deprecate the maybeDead() mechanism.
2. Wait until some kind of smart data map takes care of this extra work.
3. Call checkDeadSymbols on exceptional occasions, like bindings and invalidations, explicitly marking the relevant symbols as maybe dead.

1. and 2. sound most reasonable to me (3. is hacky and we cannot expect to cover all the cases and probably code duplication). Not sure about performance of 1., maybe i'd have a look, but it'd most likely simplify some code. Otherwise i guess we'd have to just 2., unless anybody has anything else in mind or i'm missing something(?)

Best regards,
Artem.

Anna, Devin,

Could you comment on this? It sounds like an important design issue.

Anna, Devin,

Could you comment on this? It sounds like an important design issue.

This is an important issue to address; unfortunately, I do not have a solution at hand.

A certain peculiar property of the removeDeadBindings() mechanism seems
to have been overlooked since quite a long time.

My colleague Ilya Palachev has recently shown me a specific leak
false-negative. It could be simplified down to the following fairly
short SimpleStreamChecker test that currently fails:

  void test() {
    FILE *fp = fopen("myfile.txt", "w");
    fp = 0; // Hmm, we should have overwritten the symbol.
  } // expected-warning{{Opened file is never closed; potential
resource leak}}

So what's going on? In fact, it is not quite correct to say that
"checkDeadSymbols() is called for symbols that are no longer available
in the program state". It is just the opposite: it is only called for
symbols that *are* available in the program state, otherwise how does it
know that they even exist!

In order to die, the symbol needs to be marked as dead on one of the
removeDeadBindings() runs, and also never marked as live during such run
(otherwise the "live" mark takes a higher priority). Such marks can be
added by Environment (which marks values of active expressions as live
and values of inactive expressions as dead), Store (which runs a
worklist to find live keys and mark their values live, and then find
more live keys, etc.), Constraint Manager (which simply marks all of its
symbols as dead, as he is not the one to be interested in keeping them
alive), or checkers (through checkLiveSymbols() callback).

So in the program state before the assignment, the file descriptor is
live, because the store marks it as live, because it is a value stored
in a region of a live variable.

And in the state after the assignment, the file descriptor is not
present at all, because the checker fails to inform SymbolReaper of this
symbol's existence in its checkLiveSymbols() callback, and there are no
other sources of such information for the SymbolReaper.

If I understand correctly, the problem is that by the time we run removeDeadBindings, the symbol is not in the state; therefore, we do not qualify it as dead or alive. We just do not know it ever existed. At which point is the symbol removed from the state? It that a byproduct of the store?

And the assignment itself doesn't instantly trigger the dead symbol
removal. It is only triggered once in a while automatically "between"
the statements, and the kind of the statement is of little concern.

SimpleStreamChecker could be fixed with the following callback:

  void SimpleStreamChecker::checkLiveSymbols(ProgramStateRef State,
                                             SymbolReaper &SymReaper)
const {
    StreamMapTy TrackedStreams = State->get<StreamMap>();
    for (StreamMapTy::iterator I = TrackedStreams.begin(),
                               E = TrackedStreams.end(); I != E; ++I) {
      SymbolRef Sym = I->first;
      SymReaper.maybeDead(Sym);
    }
  }

However, even if i propose this fix, somebody has to, kind of, update
the presentation of writing the checker in 24 hours, and explain all the
stuff from this mail in the process, and i guess the checker would no
more be as simple as it used to be.

A few other checkers are not suffering from this issue because another
entity performs the maybeDead() call for them: for example,
MallocChecker adds default bindings on their symbol's symregions, and
leaves it to the Store to mark the base symbols as dead. However, i
guess they shouldn't rely on the Store for doing this job.

So i've got a few ideas:

0. Fix all checkers as discussed above.

I do agree with Artem that changing all of the checkers is not a good option. We should try to move away from the model that asks checker writers to write more boilerplate.

1. Scan all symbols in the SymbolManager to find zombie symbols,
deprecate the maybeDead() mechanism.
2. Wait until some kind of smart data map takes care of this extra work.

Are you talking about a data structure that keeps track of the data stored by the checkers. Presumably, that could be scanned to find out which symbols are tracked by the checkers? This direction looks the most appealing at this point.

At which point is the symbol removed from the state? It that a

> byproduct of the store?

Well, yeah, i guess it's probably even "the primary product" - to erase the original value while overwriting. And once the original value is erased, it's no longer referenced.

> > 1. Scan all symbols in the SymbolManager to find zombie symbols,
> > deprecate the maybeDead() mechanism.
> > 2. Wait until some kind of smart data map takes care of this extra work.

> Are you talking about a data structure that keeps track of the
> data stored by the checkers. Presumably, that could be scanned
> to find out which symbols are tracked by the checkers? This
> direction looks the most appealing at this point.

Seems so; why i'm also thinking of "1." is because i feel it'd make removeDeadBindings sooo much easier to understand and sooo much more intuitive - which it seems to need, as even an issue as simple as this one had remained unnoticed. It seems that everybody was thinking that "1." is already there. "The garbage collector was supposed to see which symbols are referenced and remove those that aren't, how else could it work?" In this case, we probably wouldn't even need a worklist [which currently iterates only through the store, but also needs to iterate through the GDM at the same time, as there may be mutual dependencies between symbols here and there] during garbage collection, just rely on later iterations to clean up sub-symbols step by step. So i'd probably give it a try some day and try to explain it in more details if something useful shows up.

Though i've no idea about performance; the worklist would most likely still have better performance.

And also, if we make a new API for the data map, the old API might still want to be fixed somehow.

> At which point is the symbol removed from the state? It that a
> byproduct of the store?

Well, yeah, i guess it's probably even "the primary product" - to erase the original value while overwriting. And once the original value is erased, it's no longer referenced.

> > 1. Scan all symbols in the SymbolManager to find zombie symbols,
> > deprecate the maybeDead() mechanism.
> > 2. Wait until some kind of smart data map takes care of this extra work.

> Are you talking about a data structure that keeps track of the
> data stored by the checkers. Presumably, that could be scanned
> to find out which symbols are tracked by the checkers? This
> direction looks the most appealing at this point.

Seems so; why i'm also thinking of "1." is because i feel it'd make removeDeadBindings sooo much easier to understand and sooo much more intuitive - which it seems to need, as even an issue as simple as this one had remained unnoticed. It seems that everybody was thinking that "1." is already there. "The garbage collector was supposed to see which symbols are referenced and remove those that aren't, how else could it work?" In this case, we probably wouldn't even need a worklist [which currently iterates only through the store, but also needs to iterate through the GDM at the same time, as there may be mutual dependencies between symbols here and there] during garbage collection, just rely on later iterations to clean up sub-symbols step by step. So i'd probably give it a try some day and try to explain it in more details if something useful shows up.

Though i've no idea about performance; the worklist would most likely still have better performance.

The removeDeadBindings code path is currently one of the larger bottlenecks in the code. Something like 20-25% of the time is spent there. This is largely because there is mark-and-sweep algorithm performed here, and mark-and-sweep is great at causing lots of cache misses in the short term and invalidating the processor cache for later code. I really want to spend some time here, but that code is pretty difficult to understand (as you noted), and other tasks have gotten in the way for now.

I really want to spend some time here, but that code is pretty

> difficult to understand (as you noted)

Umm, before i forget, i guess i'd dump some quick explanation of what i figured out about it.

First, we have symbols-or-regions, hereinafter "values" unless it causes confusion. Liveness of some values may be determined regardless of any other values: for example, a VarRegion of a local variable can be proven live via live variables analysis, which is an auxiliary data flow analysis that doesn't depend on our symbolic execution engine. However, values may also be related to other values via a "keeps-alive relation". There are two kind of keep-alive relations:

I. "first-class" keep-alive relations arise from and rely purely on the SVal hierarchy (for example, SymbolRegionValue is kept alive by its parent region, SymbolicRegion keeps its base symbol alive).

II. "second-class" keep-alive relations arise when symbols and regions appear in certain parts of the program state (for example, any region would keep alive its contents - all symbols stored inside it).

The purpose of removeDead() is to understand which values present in the current program state are dead (and, as a useful by-product, which values are live) by exploring the keep-alive relations between values; second-class relations are explored with respect to the current program state. Whenever a certain value X is determined to be live, any values kept alive by X are also known to be live. The liveness information is collected from scratch on every pass and does not depend on previous removeDead() runs. Once the liveness information is collected, a special action is taken on all dead values, which may involve removing them from the state, but does not necessarily boil down to just that.

SymbolReaper is an object that contains the incomplete understanding of live values on the mark phase, which eventually becomes complete and as such can be used on the sweep phase. SymbolReaper's data consists of two sets of values - the "live set" and the "dead set". The live set contains values already known to be live. The dead set contains the values found in the state, liveness of which is explicitly questioned and was not yet determined; values may move from the dead set to the live set (if they turn out to be live due to a previously unconsidered second-class relation to a live value) but not in the opposite direction. First-class relations are mostly hardcoded directly into the SymbolReaper - see isLive(), isLiveRegion(). Originally, SymbolReaper was simply a pair of sets, but later it was extended to take care of first-class keep-alive relations.

The removeDead() procedure begins by creating a temporary SymbolReaper object that lives for as long as the procedure runs. Then it proceeds with querying different manager objects responsible for different sections of the program state to explore their section and fill-in the liveness information in the SymbolReaper. Each section needs to classify each value it is responsible for as live or dead.

Program state sections are queried in the following order:

(1) GDM traits of particular checkers (through CheckerManager and checkLiveSymbols() callback).
(2) Environment (through EnvironmentManager).
(3) Store (through StoreManager).
(4) Range constraints (through ConstraintManager).

Section (1) is implemented in each checker in a different manner. The checker needs to be aware of the fact that if it doesn't markLive() a value here, then all information related to it may be removed from other sections of the state (eg. range constraints for symbols or bindings for regions). And if the checker doesn't mark any value maybeDead(), it may never receive the callback when this value actually dies.

/*
[Q1]: In http://reviews.llvm.org/D12726 you mentioned that checkers mark liveness after the store, but now you say it's just the opposite, how so?
[A1]: I was wrong and stay corrected. It seems that checkers do not have access to any liveness information obtained from the rest of the state for during checkLiveSymbols().
*/

Sections (2), (3), (4) combine(!!) their sweep phase with the mark phase. However, for checker traits, the sweep phase is delayed until the checkDeadSymbols() checker callback, which goes last at the end of removeDead() after (4).

/*
[Q2]: Does taint information for dead symbols ever get removed? Cause it's in GDM, but it's neither checkers nor range constraints.
[A2]: Afaik, no; we don't even have State::removeTaint() yet. This is a bug, which may probably cause performance problems with taint analysis, though maybe not too much, because tainted symbols are rare.
*/

Section (2) scans active expressions, marks values of active expressions as live. It also marks values of inactive expressions that are still found in the state as dead and removes these expressions and their values from the environment. It does not need to rely on any liveness information for values to make such decision, which is why this step is completely independent and isolated. Note, hovewer, that first-class relations play an important role here - any values found inside values that were marked dead or live are also explicitly marked dead or live, which contributes to complexity of the code. The visitor used here to conduct marking of sub-values is not re-used.

Section (3) is the most complex here. It takes care of the following second-class keep-alive relation: for every binding in the store, the binding key keeps alive the binding value. Both binding keys and binding values are (at least, partially/sometimes) "values" (in the symbols-or-regions sense), hence it may happen that a certain binding value X is also a binding key for another binding value Y, and once liveness of X is established, liveness of Y needs to be re-calculated. Liveness of binding pairs cannot thus be computed in a single pass through the store, and therefore a worklist-like algorithm is implemented, that tries to reach a fixed point in which the liveness information no longer gets updated. Once the worklist algorithm finishes, the following happens: all bindings with dead keys are removed from the store, all sub-values of binding keys and all binding values and their sub-values are marked as maybe-dead. The visiting algorithm for sub-values is not re-used from the Environment.

The method assumes that after the end of (3) no new live marks are added. Which is why Store is capable of removing dead bindings without requiring a separate pass.

Section (4) is trivial - the constraint manager does not mark any symbols live. It marks all constraint keys that were not previously proven live as maybe-dead and removes them from the constraint map.

/*
[Q3]: So what if we try to fix the issue mentioned in [Q1] by putting (1) after (4)?
[A3]: It's more complicated than that. Checkers may introduce their own second-class keep-alive relations between values. For example, CStringChecker would keep string length alive for as long as the string region is alive. There may be more such relations introduced by other checkers. It may happen that a certain checker-specific second-class keep-alive relation would cause a store binding key to become live. Hence, there needs to be a single worklist algorithm run for both store-related second-class keep-alive relations and checker-specific second-class keep-alive relations, and (1) needs to be essentially merged with (4). And that would make things yet even more complex.
*/

Artem,

You might want to store this in the repo by adding a document or extending this one:

./docs/analyzer/RegionStore.txt

Anna.

I really want to spend some time here, but that code is pretty
difficult to understand (as you noted)

Umm, before i forget, i guess i’d dump some quick explanation of what i figured out about it.

First, we have symbols-or-regions, hereinafter “values” unless it causes confusion. Liveness of some values may be determined regardless of any other values: for example, a VarRegion of a local variable can be proven live via live variables analysis, which is an auxiliary data flow analysis that doesn’t depend on our symbolic execution engine. However, values may also be related to other values via a “keeps-alive relation”. There are two kind of keep-alive relations:

I. “first-class” keep-alive relations arise from and rely purely on the SVal hierarchy (for example, SymbolRegionValue is kept alive by its parent region, SymbolicRegion keeps its base symbol alive).

II. “second-class” keep-alive relations arise when symbols and regions appear in certain parts of the program state (for example, any region would keep alive its contents - all symbols stored inside it).

The purpose of removeDead() is to understand which values present in the current program state are dead (and, as a useful by-product, which values are live) by exploring the keep-alive relations between values; second-class relations are explored with respect to the current program state. Whenever a certain value X is determined to be live, any values kept alive by X are also known to be live. The liveness information is collected from scratch on every pass and does not depend on previous removeDead() runs. Once the liveness information is collected, a special action is taken on all dead values, which may involve removing them from the state, but does not necessarily boil down to just that.

SymbolReaper is an object that contains the incomplete understanding of live values on the mark phase, which eventually becomes complete and as such can be used on the sweep phase. SymbolReaper’s data consists of two sets of values - the “live set” and the “dead set”. The live set contains values already known to be live. The dead set contains the values found in the state, liveness of which is explicitly questioned and was not yet determined; values may move from the dead set to the live set (if they turn out to be live due to a previously unconsidered second-class relation to a live value) but not in the opposite direction. First-class relations are mostly hardcoded directly into the SymbolReaper - see isLive(), isLiveRegion(). Originally, SymbolReaper was simply a pair of sets, but later it was extended to take care of first-class keep-alive relations.

The removeDead() procedure begins by creating a temporary SymbolReaper object that lives for as long as the procedure runs. Then it proceeds with querying different manager objects responsible for different sections of the program state to explore their section and fill-in the liveness information in the SymbolReaper. Each section needs to classify each value it is responsible for as live or dead.

Program state sections are queried in the following order:

(1) GDM traits of particular checkers (through CheckerManager and checkLiveSymbols() callback).
(2) Environment (through EnvironmentManager).
(3) Store (through StoreManager).
(4) Range constraints (through ConstraintManager).

Section (1) is implemented in each checker in a different manner. The checker needs to be aware of the fact that if it doesn’t markLive() a value here, then all information related to it may be removed from other sections of the state (eg. range constraints for symbols or bindings for regions). And if the checker doesn’t mark any value maybeDead(), it may never receive the callback when this value actually dies.

/*
[Q1]: In http://reviews.llvm.org/D12726 you mentioned that checkers mark liveness after the store, but now you say it’s just the opposite, how so?
[A1]: I was wrong and stay corrected. It seems that checkers do not have access to any liveness information obtained from the rest of the state for during checkLiveSymbols().
*/

Sections (2), (3), (4) combine(!!) their sweep phase with the mark phase. However, for checker traits, the sweep phase is delayed until the checkDeadSymbols() checker callback, which goes last at the end of removeDead() after (4).

/*
[Q2]: Does taint information for dead symbols ever get removed? Cause it’s in GDM, but it’s neither checkers nor range constraints.
[A2]: Afaik, no; we don’t even have State::removeTaint() yet. This is a bug, which may probably cause performance problems with taint analysis, though maybe not too much, because tainted symbols are rare.

Taint has never been fully implemented. I also want to investigate using flow-sensitive analysis for taint tracking instead of the approach the current half-implementation uses.

> At which point is the symbol removed from the state? It that a
> byproduct of the store?

Well, yeah, i guess it's probably even "the primary product" - to erase the original value while overwriting. And once the original value is erased, it's no longer referenced.

> > 1. Scan all symbols in the SymbolManager to find zombie symbols,
> > deprecate the maybeDead() mechanism.
> > 2. Wait until some kind of smart data map takes care of this extra work.

> Are you talking about a data structure that keeps track of the
> data stored by the checkers. Presumably, that could be scanned
> to find out which symbols are tracked by the checkers? This
> direction looks the most appealing at this point.

Seems so; why i'm also thinking of "1." is because i feel it'd make removeDeadBindings sooo much easier to understand and sooo much more intuitive - which it seems to need, as even an issue as simple as this one had remained unnoticed. It seems that everybody was thinking that "1." is already there. "The garbage collector was supposed to see which symbols are referenced and remove those that aren't, how else could it work?" In this case, we probably wouldn't even need a worklist [which currently iterates only through the store, but also needs to iterate through the GDM at the same time, as there may be mutual dependencies between symbols here and there] during garbage collection, just rely on later iterations to clean up sub-symbols step by step. So i'd probably give it a try some day and try to explain it in more details if something useful shows up.

Though i've no idea about performance; the wordlist would most likely still have better performance.

Craig is right, performance of the analyzer is very sensitive to removeDeadBindings.

And also, if we make a new API for the data map, the old API might still want to be fixed somehow.

We could rewrite all of the checkers to use the new model if it was flexible enough. (Also, the old checkers won't be broken, they will juts not benefit from the improvement; will have false negatives.)

You might want to store this in the repo by

> adding a document or extending this one:
> ./docs/analyzer/RegionStore.txt

Hmm, i wish i could fix the bugs rather than document them (= Will have a look!

> Taint has never been fully implemented. I also
> want to investigate using flow-sensitive analysis
> for taint tracking instead of the approach the
> current half-implementation uses.

The current approach seems very sensible to me, i like it, and i'm not quite aware of any significant theoretical problems.

The implementation gore of sharing the state trait across checkers is annoying, but since we want to eventually share all traits in a generic manner anyway, this problem would eventually be gone. It is also a bit hard to work with tainted strings, but that seems to be an SVal hierarchy thing rather than a taint implementation problem.

Taint analysis becomes the most effective with heavy IPA, probably better even with inter-unit analysis and good call graph sorting to ensure top-bottom analysis, implying huge node limits, because input-output and usage are often spread out very far away from each other, through many function calls. It wouldn't find many things if it only considers small scopes.

Ted pointed out that isLive covers the symbols that do not exist in the state. Have you tried addressing the issue with something like this:

diff --git a/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h b/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
index 30481ea…1c5a984 100644
— a/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
+++ b/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
@@ -550,8 +550,8 @@ public:
///
/// This should only be called once all marking of dead symbols has completed.
/// (For checkers, this means only in the evalDeadSymbols callback.)

  • bool isDead(SymbolRef sym) const {
  • return TheDead.count(sym);
  • bool isDead(SymbolRef sym) {
  • return TheDead.count(sym) || !isLive(sym);
    }

Almost all tests pass with this change and the one that does not looks like a legitimate new leak reported.

There are a couple of issues to further investigate here:

  • What are the performance implications? I’d expect some, but they are probably not very bad. Still we should measure.
  • We should probably eliminate the use of dead_begin dead_end, which some checkers reply on.
  • removeDeadBindings is not run right after the last reference to the object is lost, which leads to imprecise error reports and failure to report the leak in some cases. It’s because of how hasDeadSymbols() is implemented. That problem is solved if we disable the optimization, but I do not know how that effects performance. Maybe we can come up with something more clever.

diff --git a/lib/StaticAnalyzer/Core/ExprEngine.cpp b/lib/StaticAnalyzer/Core/ExprEngine.cpp
index 0b66933…e43c5b7 100644
— a/lib/StaticAnalyzer/Core/ExprEngine.cpp
+++ b/lib/StaticAnalyzer/Core/ExprEngine.cpp
@@ -380,14 +380,7 @@ void ExprEngine::removeDead(ExplodedNode *Pred, ExplodedNodeSet &Out,
// Process any special transfer function for dead symbols.
// A tag to track convenience transitions, which can be removed at cleanup.
static SimpleProgramPointTag cleanupTag(TagProviderName, “Clean Node”);

  • if (!SymReaper.hasDeadSymbols()) {
  • // Generate a CleanedNode that has the environment and store cleaned
  • // up. Since no symbols are dead, we can optimize and not clean out
  • // the constraint manager.
  • StmtNodeBuilder Bldr(Pred, Out, *currBldrCtx);
  • Bldr.generateNode(DiagnosticStmt, Pred, CleanedState, &cleanupTag, K);

Hmm, this sounds like a very reasonable approach. Will see what i can do, shouldn’t be very hard. In fact, !isLive() should be enough once all the liveness info is collected. Due to first-class relations, amounts of live and dead symbols are typically infinite, and this approach takes care of that. Still, by automating the old approach by exploring the checker GDM we could achieve better performance due to the short-path optimization below (but not due to lookup direction in checkers, which is seemingly optimized through dead_begin()…dead_end(), because we’d have to iterate through the state in checkLiveSymbols phase anyway).