[analyzer] Summary IPA thoughts

Hello,

Yeah, there's one particular known-issue with aliasing, which i mentioned
earlier ([cfe-dev] [analyzer] Summary IPA thoughts
):

Oh, I see. I just started to catch up with the conversation and did not
reach that point yet.

   int a;
   void foo(int *x, int *y) {
     *x = 0;
     *y = 1;
   }
   void bar() {
     foo(&a, &a);
     clang_analyzer_eval(a); // expected-warning{{TRUE}}
   }

Here we may randomly get TRUE or FALSE, kind of race condition.

We didn't fix it yet, and it can be fixed by adding timestamps to stores,
i.e. which bind was earlier.

With timestamps, i think the store would be fine.

I think that timestamps might not be a general solution to this problem,
see the next inline response. But as a side not, do you plan to use
timestamps to use this problem:

  void bar(int *b) {
    foo(b); // [1] calling foo()
  }
  void foo(int *a) {
    *a = 0;
    if (a == 0) {
      // [2] warn: null-dereference
    }
  }

Or is this solved already? What did you do about this in that case?

On the other hand, i don't think that there are other places that
explicitly rely on assuming different symbols to represent *different*
values. It's only true for the store, which expects different base-regions
to non-overlap, even when the regions are symbolic. Say, the constraint
manager, upon encountering "if (a == b)", should split the state (we did
enable SymSymExpr for this to work), rather than produce always-false.

How did you achieve this? For example:

   int a;
   void foo(int *x, int *y) {
      int a;
      int *b = &a;
      if (x == y) { } // I expect this to split the state.
      if (b == &a) {} // I expect this not to split the state
   }
   void bar() {
     foo(&a, &a);
   }

Does this work as expected?

And then actualization can take care of the rest. Say the checker, upon
encountering, say, "fclose(f1); fclose(f2);" and then later, when applying
the summary, finding out that f1 and f2 is the same descriptor, would
detect the problem upon actualization of the respective symbols to the same
symbol. Or the store would correctly merge bindings to different
sub-regions of aliased regions even if they were not aliased outside the
context.

I think I would prefer a way to do this automatically instead of making
checker writers consider aliasing explicitly. But I may misunderstood your
point.

Oh, the check-after-use for delayed positives. We did not fix this yet because we didn’t agree on how much this is really unwanted. In fact, for most kinds of bugs we’d like to have some kind of check-after-use checks, but we should also be able to turn it off. The solution we’re having in mind is to consider the program state in the delayed-check-node to see if the bug gets reproduced inside it (we remember the whole node anyway for bug reporter purposes), though we didn’t think really deeply about it yet, so it may fail to work. The current code produces such positives.

Yeah, i guess it works pretty much as expected. In the out-of-context analysis, we have two branches for foo(), depending on range constraints for symbol ‘reg_$0 == reg_$0’ - in one branch it’s [0, 0], in the other branch it’s [1, something like BOOL_MAX]. When calling foo(&a, &a) from bar(), reg_$0 gets renamed to &a (as integer), reg_$0 gets renamed to &a (as integer) (i don’t remember if we still keep the SymbolRegionAddress thing around, because i wanted to replace it with the correct nonloc::LocAsInteger eventually, as it seems more correct; it is more difficult to work with, but imho it is “the” correct actualization), and finally the whole symbol gets renamed to ‘&a (as integer) != &a (as integer)’ and collapses to 0 during evalBinOp, but because range [0, 0] of 0 does not intersect with [1, something like BOOL_MAX], this branch is discarded as unreachable. So when applying summary in-context we have only one branch.

Yeah, me too. But as long as the GDM is opaque for the analyzer core, we cannot avoid relying on checkers for managing their info in the state. Hence the recent buzzword in our discussions - the “smart data map”, which i’m dreaming of, but could not work on it yet.

I think that timestamps might not be a general solution to this problem,
see the next inline response. But as a side not, do you plan to use
timestamps to use this problem:

  void bar(int *b) {
    foo(b); // [1] calling foo()
  }
  void foo(int *a) {
    *a = 0;
    if (a == 0) {
      // [2] warn: null-dereference
    }
  }

Oh, the check-after-use for delayed positives. We did not fix this yet
because we didn't agree on how much this is really unwanted. In fact, for
most kinds of bugs we'd like to have some kind of check-after-use checks,
but we should also be able to turn it off. The solution we're having in
mind is to consider the program state in the delayed-check-node to see if
the bug gets reproduced inside it (we remember the whole node anyway for
bug reporter purposes), though we didn't think really deeply about it yet,
so it may fail to work. The current code produces such positives.

On the other hand, i don't think that there are other places that

explicitly rely on assuming different symbols to represent *different*
values. It's only true for the store, which expects different base-regions
to non-overlap, even when the regions are symbolic. Say, the constraint
manager, upon encountering "if (a == b)", should split the state (we did
enable SymSymExpr for this to work), rather than produce always-false.

How did you achieve this? For example:

   int a;
   void foo(int *x, int *y) {
      int a;
      int *b = &a;
      if (x == y) { } // I expect this to split the state.
      if (b == &a) {} // I expect this not to split the state
   }
   void bar() {
     foo(&a, &a);
   }

Does this work as expected?

Yeah, i guess it works pretty much as expected. In the out-of-context
analysis, we have two branches for foo(), depending on range constraints
for symbol 'reg_$0<x> == reg_$0<y>' - in one branch it's [0, 0], in the
other branch it's [1, something like BOOL_MAX]. When calling foo(&a, &a)
from bar(), reg_$0<x> gets renamed to &a (as integer), reg_$0<y> gets
renamed to &a (as integer) (i don't remember if we still keep the
SymbolRegionAddress thing around, because i wanted to replace it with the
correct nonloc::LocAsInteger eventually, as it seems more correct; it is
more difficult to work with, but imho it is "the" correct actualization),
and finally the whole symbol gets renamed to '&a (as integer) != &a (as
integer)' and collapses to 0 during evalBinOp, but because range [0, 0] of
0 does not intersect with [1, something like BOOL_MAX], this branch is
discarded as unreachable. So when applying summary in-context we have only
one branch.

That is great!

What I wanted to emphasize here, do we still get only one branch for `if (b
== &a) {}`? Because that condition can be decided regardless of the
context.

Yeah, sure, the out-of-context analysis for summary construction is, in most senses, just your normal analysis; we may just record stuff along the way, but not make significantly different decisions. If the condition is resolvable without context, it will be resolved without context, just like during normal analysis. So the function in your example has exactly two branches.

Hi!

Hello,

Just before i forget - timestamps-like problems arise not only in summary, but also in generic alpha-renaming. Consider:

int x;
void f1(int *y) {
   x = 1;
   *y = 2;
   if (y == &x) {
     // We need to merge the two store bindings:
     // (x, 0, default): 1 S32b
     // (SymRegion{reg_$0<y>}, 0, default): 2 S32b;
     // into one, and it'd better be this one:
     // (x, 0, default): 2 S32b
   }
}

struct S { int x, y; }
void f2(struct S *s, struct S *t) {
   s->x = 1;
   t->y = 2;
   if (s == t) {
     // But here we need to merge the base regions
     // keeping both bindings alive:
     // (SymRegion{reg_$0<s>}, 0, direct): 1;
     // (SymRegion{reg_$0<s>}, 32, direct): 2;
     // I guess i'd leave LazyCompoundVal cases
     // as a simple excercise for the reader.
   }
}

void f3(FILE *fp1, FILE *fp2) {
   fclose(fp1);
   fclose(fp2);
   if (fp1 == fp2) {
     // Optionally throw check-after-use double-close?
     // Ignoring wouldn't hurt though.
   }
}

void f4(pthread_mutex_t *m, pthread_mutex_t *n) {
   pthread_mutex_lock(m);
   pthread_mutex_unlock(n);
   pthread_mutex_lock(m);
   pthread_mutex_trylock(n);
   // etc.
   if (m == n) {
     // Should we reconstruct the whole sequence
     // of locks and unlocks by timestamps here?
     // In order to see if there's double-anything
     // or the branch is unreachable or whatever.
     // Data map should be responsible this time.
   }
}

void f5(int *x, int *y) {
   if (*x == 1)
     if (*y == 2)
       if (x == y) {
         // And this time we surely need to sink.
         // Constraint manager should detect it.
       }
}

void f6(char *x, char *y) {
   if (strlen(x) == 3)
     if (strlen(y) == 4)
       if (x == y) {
         // Sink here as well.
         // Data map should detect it.
       }
}

    Then, yeah, we try to figure out how to handle the differences in
    the store, eg. between
        (x,0,direct): 0 S32b
        ...
        (x,0,direct): 2015 S32b
        ...
    without creating separate summary branches for each case. Even if
    the execution path seems to be determined only by the fact that,
    say, reg_$0<x> is in range [0, 10], we cannot be sure that it is
    absolutely irrelevant for the checker whether reg_$0<x> is 2 S32b
    or 3 S32b, until we ask the checker (for instance, reg_$0<x> may
    be some sort of enum or bitfield passed into an API call modelled
    by this checker inside the summary graph, and the checker leaves
    its internal mark in the GDM regarding its value, so that to take
    a completely different path later, but the engine may accidentally
    consider the difference between 2 S32b and 3 S32b irrelevant,
    because all the if's took the same branch so far, and hence ignore
    important information and cause false positives.

I think there are ways to mitigate or solve this problem. Some of the checkers can add assumptions. E.g.: when you divide by a value which is unknown an assumption such that this value is not zero is added to the state. Each time a checker does something like this, there is a potential error case there, so that decision might need to be revised in each new context. It is also possible to tag states by the checkers. Maybe those tags could be used to help to decide what to do in these scenarios?

Yeah, there are a lot of different APIs we can provide, which do not explicitly ask the checker to support summary, but instead ask it to provide more info to the core on what he's doing. I'd delay this decision until we settle with how the data map should look like. Though - unlike the examples above - i've still not seeing how exactly it'd work and what kind of API we'd need, keeping in mind that it shouldn't restrict the fantasy of checker authors too much.

    Instead, we can also think of alpha-renaming as a one-time process
    of completely eliminating the symbol that was renamed into another
    symbol from all parts of the program state, including GDM, and
    making sure it doesn't ever reincarnate. Eg., ensure that
    SValBuilder would not construct the old symbolic value again, even
    if asked directly to do so, but return the new symbolic value
    instead. This approach seems preferable to us, because it doesn't
    require O(N) lookups described above, and all the necessary code
    is put in the single callback.

    Note that while an "evalRenaming()" callback is necessary to
    support alpha-renaming, it is not really scary if this callback is
    not instantly implemented in the checker; the checker would just
    fail in these situations as it always did. For summary callbacks,
    this is not the case.

I think this might start to get very hard to track. I would prefer a new kind of symbol that is going to be the subject of actualization. So we could assert when a symbol that is not supposed to be actualized is being renamed, or a such symbol is used after summary application without actualization.

Hmm. You mean like if the checker forgets to rename his symbol, it'd be hard to notice and debug? Then i'm not seeing how a new symbol class would help: we either need to mark all symbols in the caller state as renamed, or all symbols in the callee state as not yet renamed, but both approaches require iterating through the state, including the checker data, and as such are not much easier than actually conducting the renaming properly.

But the way i see how our summary currently works, it wasn't hard to keep an eye on all symbols.

    Similarly, SymbolDerived is tied to the value of the region at the
    moment of time in which invalidation occured, and this moment also
    needs to be shifted to the relevant moment in the invalidation
    cascade.

    We cannot yet predict how resource-heavy the cascading
    invalidation approach would be, and it more or less works with our
    current dummy approach, so we can implement the cascading
    invalidation, or delay it for later patches.

Basically the invalidation depends on ordering of events, so it requires timestamps I guess. Also, when a summary is applied, basically the checker get a list of exploded nodes and it can replay some of the state changes to based on the context and the summary. But what happens when the checker needs to do some invalidation or add some assumption on aliasing (that affects invalidation)? I have the feeling that each time we want to solve some of these problems we are getting close and closer to a full replay.

With such "cascading invalidation", we replay only conservatively evaluated calls. That's extra work, but hopefully not too much. On the other hand, now that i think of it, this time i'd really have a look at our hierarchy - probably i already forgot what i was thinking, but i feel that by making SymbolConjured or SymbolDerived carry extra data with them, we could probably avoid this problem without cascading replays. Needs more thought.

    Regarding interaction between inter-dependent checkers. While we
    didn't code this yet, it might be partially naively solved by
    introducing dependency graphs for checkers. Eg. a checker that
    relies on taint propagation may pull the GenericTaintChecker as
    its dependency, and then our checker gets summary-applied after
    GenericTaintChecker, when the taint information is already
    available. Another minor use case for this feature is that if you
    explicitly run only one path-sensitive checker that only registers
    itself on C++ files, then you can skip pulling core checkers on C
    and ObjC files and run only AST checkers for them, saving analysis
    time.

Dependency graph for checkers would be useful regardless of summary based IPA. I had to do workarounds several times because the lack of this. But there are more to these dependencies than checkers depending on each other. Checkers might also add constraints that the core symbolic execution depends on.

Yep.

    This also may be useful if the checker wants to refer to a value
    in the warning message in a user-friendly manner. If anybody likes
    something like this as a separate commit, i can open a review.

Just an update, as far as I remember this is already committed.

Yep.

    =2= Then we start with alpha-renaming, and we've got the following
    problems to deal with:
        - Decide if we want to keep renaming inside the constrant
    manager or let it work through destroying symbols, as described in
    this letter;
        - Decide how to work with renaming of concrete offsets in
    store keys into symbolic+concrete offsets, as described in the
    previous letter
    [cfe-dev] [analyzer] Summary IPA thoughts
    (the last part of the miscellaneous section). Possible solutions
    include making store keys heavier or reconstructing original
    regions by offsets (tried this, very ugly and unreliable, imho;
    the former solution would also solve the issue pointed out in
    [cfe-dev] [analyzer] Summary IPA thoughts).
    Solving this by recording-and-replaying all stores is something
    we'd try to avoid as best as we can, i guess.

What is the status of this one? Do you have a patch under review?

Nope, not yet, got distracted... Also, we really need the smart data map before this, it makes things a lot easier and without it a lot of progress is stalled, so i'd certainly work on that first. A few thoughts on the approach to alpha-renaming i'm having in mind:

1. For every renaming situation, create a specific partial renamer object inherited from FullSValVisitor<RetTy = Optional<SVal>> that contains information about what exactly is being renamed. For example, if we rename SymExpr 'x' to SymExpr 'y', this visitor would return nonloc::SymbolVal 'y' for SymExpr 'x' and a missing optional value for everything else. For summary, it would conduct the stack frame context shift on stack memory spaces and do whatever needs to be done with SymbolConjured, SymbolDerived, etc.

2. Create a generic renamer inherited from FullSValVisitor<RetTy = SVal> (not optional this time!) that consumes an instance of the partial renamer. The generic renamer would traverse the SVal recursively, and trust his partial renamer instance above anything else on every item that the partial renamer successfully renames. Then it tries to re-construct bigger SVal's after smaller SVal's were renamed by the partial renamer. If the partial renamer changes nothing (even if it always returns a valid SVal), then the original SVal should remain unchanged, i guess.

3. Then i'd probably go ahead and traverse the state to do replacements. This section depends on how the smart data map gets done.

    =3= Decide if we want context-sensitive or context-insensitive
    summaries, as discussed above in this letter.

Is there a consensus on this one?

Nope, not yet. But all the earlier machinery seems to be pretty much the same for both approaches, as far as i can see. There's one thing i'm pretty sure about though (disclaimer - when i'm sure, it usually means i'm going to say something very stupid): if we want to eventually implement inter-unit with serialized summaries, then these summaries would better be context-insensitive, otherwise how do we know what contexts would we need. Well, we could make a complicated IPC system in which different instances of clang constantly ask each other for context-sensitive summaries of functions they're looking for in particular contexts, but, hmm, huh, dunno really...

Lets try to summarize what are the tasks:
- Summary apply callback for checkers
- Solve the interdependency problem
- Update all of the checkers?
- Timestamps to solve some aliasing related problems
- Filter confusing assuming.. stuff from path reconstruction
- Figure out how to do leak related checks properly
- Figure out how to do invalidation properly
- Anything I missed?

There are also some FIXME tests in test/Analysis/summary-ipa-xfail.cpp , i don't really remember what's going on here. I guess it has something to do with complicated passing-by-reference, and temporary objects, and compound values. Otherwise, as far as i remember, that's it in case you'd like to take our experimental code and fix it.

For leaks, if i understand correctly, there isn't much action required. If a symbol leaks inside the callee, it couldn't have existed in the caller. If a symbol appears inside the callee but doesn't leak there, then it'd be collected by the caller anyway. Different statuses like "pointer-escaped" should be applied by the checker during application of the summary. So i've a feeling that no special action is required to handle that. At most, we'd need to handle liveness of sutff from different stack frame contexts, as right now we collect dead symbols properly on the end of function (while still inside the function), but with summary we cannot do that. So it's worth having a look, but i'm not sure it's actually broken.

Best regards,
Artem.