Hello,
Just before i forget - timestamps-like problems arise not only in summary, but also in generic alpha-renaming. Consider:
int x;
void f1(int *y) {
x = 1;
*y = 2;
if (y == &x) {
// We need to merge the two store bindings:
// (x, 0, default): 1 S32b
// (SymRegion{reg_$0<y>}, 0, default): 2 S32b;
// into one, and it'd better be this one:
// (x, 0, default): 2 S32b
}
}
struct S { int x, y; }
void f2(struct S *s, struct S *t) {
s->x = 1;
t->y = 2;
if (s == t) {
// But here we need to merge the base regions
// keeping both bindings alive:
// (SymRegion{reg_$0<s>}, 0, direct): 1;
// (SymRegion{reg_$0<s>}, 32, direct): 2;
// I guess i'd leave LazyCompoundVal cases
// as a simple excercise for the reader.
}
}
void f3(FILE *fp1, FILE *fp2) {
fclose(fp1);
fclose(fp2);
if (fp1 == fp2) {
// Optionally throw check-after-use double-close?
// Ignoring wouldn't hurt though.
}
}
void f4(pthread_mutex_t *m, pthread_mutex_t *n) {
pthread_mutex_lock(m);
pthread_mutex_unlock(n);
pthread_mutex_lock(m);
pthread_mutex_trylock(n);
// etc.
if (m == n) {
// Should we reconstruct the whole sequence
// of locks and unlocks by timestamps here?
// In order to see if there's double-anything
// or the branch is unreachable or whatever.
// Data map should be responsible this time.
}
}
void f5(int *x, int *y) {
if (*x == 1)
if (*y == 2)
if (x == y) {
// And this time we surely need to sink.
// Constraint manager should detect it.
}
}
void f6(char *x, char *y) {
if (strlen(x) == 3)
if (strlen(y) == 4)
if (x == y) {
// Sink here as well.
// Data map should detect it.
}
}
Then, yeah, we try to figure out how to handle the differences in
the store, eg. between
(x,0,direct): 0 S32b
...
(x,0,direct): 2015 S32b
...
without creating separate summary branches for each case. Even if
the execution path seems to be determined only by the fact that,
say, reg_$0<x> is in range [0, 10], we cannot be sure that it is
absolutely irrelevant for the checker whether reg_$0<x> is 2 S32b
or 3 S32b, until we ask the checker (for instance, reg_$0<x> may
be some sort of enum or bitfield passed into an API call modelled
by this checker inside the summary graph, and the checker leaves
its internal mark in the GDM regarding its value, so that to take
a completely different path later, but the engine may accidentally
consider the difference between 2 S32b and 3 S32b irrelevant,
because all the if's took the same branch so far, and hence ignore
important information and cause false positives.
I think there are ways to mitigate or solve this problem. Some of the checkers can add assumptions. E.g.: when you divide by a value which is unknown an assumption such that this value is not zero is added to the state. Each time a checker does something like this, there is a potential error case there, so that decision might need to be revised in each new context. It is also possible to tag states by the checkers. Maybe those tags could be used to help to decide what to do in these scenarios?
Yeah, there are a lot of different APIs we can provide, which do not explicitly ask the checker to support summary, but instead ask it to provide more info to the core on what he's doing. I'd delay this decision until we settle with how the data map should look like. Though - unlike the examples above - i've still not seeing how exactly it'd work and what kind of API we'd need, keeping in mind that it shouldn't restrict the fantasy of checker authors too much.
Instead, we can also think of alpha-renaming as a one-time process
of completely eliminating the symbol that was renamed into another
symbol from all parts of the program state, including GDM, and
making sure it doesn't ever reincarnate. Eg., ensure that
SValBuilder would not construct the old symbolic value again, even
if asked directly to do so, but return the new symbolic value
instead. This approach seems preferable to us, because it doesn't
require O(N) lookups described above, and all the necessary code
is put in the single callback.
Note that while an "evalRenaming()" callback is necessary to
support alpha-renaming, it is not really scary if this callback is
not instantly implemented in the checker; the checker would just
fail in these situations as it always did. For summary callbacks,
this is not the case.
I think this might start to get very hard to track. I would prefer a new kind of symbol that is going to be the subject of actualization. So we could assert when a symbol that is not supposed to be actualized is being renamed, or a such symbol is used after summary application without actualization.
Hmm. You mean like if the checker forgets to rename his symbol, it'd be hard to notice and debug? Then i'm not seeing how a new symbol class would help: we either need to mark all symbols in the caller state as renamed, or all symbols in the callee state as not yet renamed, but both approaches require iterating through the state, including the checker data, and as such are not much easier than actually conducting the renaming properly.
But the way i see how our summary currently works, it wasn't hard to keep an eye on all symbols.
Similarly, SymbolDerived is tied to the value of the region at the
moment of time in which invalidation occured, and this moment also
needs to be shifted to the relevant moment in the invalidation
cascade.
We cannot yet predict how resource-heavy the cascading
invalidation approach would be, and it more or less works with our
current dummy approach, so we can implement the cascading
invalidation, or delay it for later patches.
Basically the invalidation depends on ordering of events, so it requires timestamps I guess. Also, when a summary is applied, basically the checker get a list of exploded nodes and it can replay some of the state changes to based on the context and the summary. But what happens when the checker needs to do some invalidation or add some assumption on aliasing (that affects invalidation)? I have the feeling that each time we want to solve some of these problems we are getting close and closer to a full replay.
With such "cascading invalidation", we replay only conservatively evaluated calls. That's extra work, but hopefully not too much. On the other hand, now that i think of it, this time i'd really have a look at our hierarchy - probably i already forgot what i was thinking, but i feel that by making SymbolConjured or SymbolDerived carry extra data with them, we could probably avoid this problem without cascading replays. Needs more thought.
Regarding interaction between inter-dependent checkers. While we
didn't code this yet, it might be partially naively solved by
introducing dependency graphs for checkers. Eg. a checker that
relies on taint propagation may pull the GenericTaintChecker as
its dependency, and then our checker gets summary-applied after
GenericTaintChecker, when the taint information is already
available. Another minor use case for this feature is that if you
explicitly run only one path-sensitive checker that only registers
itself on C++ files, then you can skip pulling core checkers on C
and ObjC files and run only AST checkers for them, saving analysis
time.
Dependency graph for checkers would be useful regardless of summary based IPA. I had to do workarounds several times because the lack of this. But there are more to these dependencies than checkers depending on each other. Checkers might also add constraints that the core symbolic execution depends on.
Yep.
This also may be useful if the checker wants to refer to a value
in the warning message in a user-friendly manner. If anybody likes
something like this as a separate commit, i can open a review.
Just an update, as far as I remember this is already committed.
Yep.
=2= Then we start with alpha-renaming, and we've got the following
problems to deal with:
- Decide if we want to keep renaming inside the constrant
manager or let it work through destroying symbols, as described in
this letter;
- Decide how to work with renaming of concrete offsets in
store keys into symbolic+concrete offsets, as described in the
previous letter
[cfe-dev] [analyzer] Summary IPA thoughts
(the last part of the miscellaneous section). Possible solutions
include making store keys heavier or reconstructing original
regions by offsets (tried this, very ugly and unreliable, imho;
the former solution would also solve the issue pointed out in
[cfe-dev] [analyzer] Summary IPA thoughts).
Solving this by recording-and-replaying all stores is something
we'd try to avoid as best as we can, i guess.
What is the status of this one? Do you have a patch under review?
Nope, not yet, got distracted... Also, we really need the smart data map before this, it makes things a lot easier and without it a lot of progress is stalled, so i'd certainly work on that first. A few thoughts on the approach to alpha-renaming i'm having in mind:
1. For every renaming situation, create a specific partial renamer object inherited from FullSValVisitor<RetTy = Optional<SVal>> that contains information about what exactly is being renamed. For example, if we rename SymExpr 'x' to SymExpr 'y', this visitor would return nonloc::SymbolVal 'y' for SymExpr 'x' and a missing optional value for everything else. For summary, it would conduct the stack frame context shift on stack memory spaces and do whatever needs to be done with SymbolConjured, SymbolDerived, etc.
2. Create a generic renamer inherited from FullSValVisitor<RetTy = SVal> (not optional this time!) that consumes an instance of the partial renamer. The generic renamer would traverse the SVal recursively, and trust his partial renamer instance above anything else on every item that the partial renamer successfully renames. Then it tries to re-construct bigger SVal's after smaller SVal's were renamed by the partial renamer. If the partial renamer changes nothing (even if it always returns a valid SVal), then the original SVal should remain unchanged, i guess.
3. Then i'd probably go ahead and traverse the state to do replacements. This section depends on how the smart data map gets done.
=3= Decide if we want context-sensitive or context-insensitive
summaries, as discussed above in this letter.
Is there a consensus on this one?
Nope, not yet. But all the earlier machinery seems to be pretty much the same for both approaches, as far as i can see. There's one thing i'm pretty sure about though (disclaimer - when i'm sure, it usually means i'm going to say something very stupid): if we want to eventually implement inter-unit with serialized summaries, then these summaries would better be context-insensitive, otherwise how do we know what contexts would we need. Well, we could make a complicated IPC system in which different instances of clang constantly ask each other for context-sensitive summaries of functions they're looking for in particular contexts, but, hmm, huh, dunno really...
Lets try to summarize what are the tasks:
- Summary apply callback for checkers
- Solve the interdependency problem
- Update all of the checkers?
- Timestamps to solve some aliasing related problems
- Filter confusing assuming.. stuff from path reconstruction
- Figure out how to do leak related checks properly
- Figure out how to do invalidation properly
- Anything I missed?
There are also some FIXME tests in test/Analysis/summary-ipa-xfail.cpp , i don't really remember what's going on here. I guess it has something to do with complicated passing-by-reference, and temporary objects, and compound values. Otherwise, as far as i remember, that's it in case you'd like to take our experimental code and fix it.
For leaks, if i understand correctly, there isn't much action required. If a symbol leaks inside the callee, it couldn't have existed in the caller. If a symbol appears inside the callee but doesn't leak there, then it'd be collected by the caller anyway. Different statuses like "pointer-escaped" should be applied by the checker during application of the summary. So i've a feeling that no special action is required to handle that. At most, we'd need to handle liveness of sutff from different stack frame contexts, as right now we collect dead symbols properly on the end of function (while still inside the function), but with summary we cannot do that. So it's worth having a look, but i'm not sure it's actually broken.
Best regards,
Artem.