Several questions about UncheckedReturn statistic Checker

Hi clang,

I’m trying to find out how to implement this statistic checker. IMO, there could be several steps.

  1. Add a ConjuredSymbol to all the CallExpr in PostVisitCallExpr, set the state to Unchecked.
  2. In VisitBranchCondition, if a callexpr is checked, set the state to checked.
  3. After all the decl was analyzed, do the first part(in a translationunit) of statistic count (and/or check) in AnalysisConsumer::HandleTranslationUnit.
  4. Implement a script to do the whole project’s statistic check.

This patch try to do the first two steps.

I register the checker in RegisterExperimentalChecks, but a new action seems more appropriate. Statistic checkers are some different from the checkers in RegisterExperimentalChecks.

In order to generate ExplodedNode but not a blockedge, i add a new “generateNode” function in GRBranchNodeBuilder. I think this function is not quite right. I use poststmt as the ProgramPoint, maybe we need a new ProgramPoint here?

Beacause this is a really experimental checker and not all the steps are finished, i can’t give a testcase to show what i did.

More comments inline.

UncheckedReturn.patch (9.83 KB)

Hi Lei,

I think this is a good first step. Comments inline.

Hi clang,

Sorry for this late reply.

I reimplement UncheckedReturnChecker, here are some key points and some problems confusing me.

  1. Instead of the original syntactic approach, i track the symbolic values in BinaryOperator & UnaryOperator. It means every BinaryOperator & UnaryOperator will have a symbol to record the callexprs in subExpr. (So i create many Conjured Symbol to do this, is this all alright?)

  2. I create a global DenseMap<const CallExpr*, CheckedReturnState*> to store whether a callexpr is checked or not.

  3. This patch didn’t consider the deal with the escaped symbol.

  4. The bugreport is very simple now…

  5. This checker seems much slower than other…

  6. I didn’t add any new action and programpoint yet…
    After all, there are many work should be done. I write this mail to ask whether this direction is right or not.

More comments inline.

I’ll appreciate it if there are any advice about this patch.

2010/12/7 Ted Kremenek <kremenek@apple.com>

UncheckedReturnChecker.patch (18.7 KB)

Hi clang,

Sorry for this late reply.

I reimplement UncheckedReturnChecker, here are some key points and some problems confusing me.

Hi Lei,

Thanks for continuing to push forward on this. Comments inline.

  1. Instead of the original syntactic approach, i track the symbolic values in BinaryOperator & UnaryOperator. It means every BinaryOperator & UnaryOperator will have a symbol to record the callexprs in subExpr. (So i create many Conjured Symbol to do this, is this all alright?)

I don’t think this is quite the right approach, although your patch raises some interesting questions. In principal, you shouldn’t need to create any new conjured symbols at all. Conjured symbols represent values to be tracked by the analyzer. Conceptually, your checker only tracks meta-data associated with symbols. However, what you are trying to do is a form of “taint” analysis, which may require introducing more symbols.

A few specific comments:

  • // Once we encounter a call expr, set the initial state to Unchecked.
  • if (SymbolRef Sym = V.getAsSymbol()) {
  • llvm::DenseSet *CRStates = new llvm::DenseSet();
  • CheckedReturnSet CRSet = CheckedReturnSet(CE, *CRStates);
  • CRSet.addState(CheckedReturnState::getUnchecked(CE));
  • state = state->set(Sym, CRSet);
  • C.addTransition(state);
  • }

DenseMaps should never be used as data stored by a GRState. DenseMaps are not immutable, and your checker will now allocate a ton of memory and leak like crazy. For data to be tracked for a given GRState, you must use immutable data structures such as ImmutableMap. The reason is two fold:

  1. ImmutableMaps can be reused in multiple instances of GRState, and we can compare (using pointer equality) if two ImmutableMaps are equivalent. This allows us to do state caching, and also merge paths when two ExplodedNodes have the same ProgramPoint and GRState.

  2. All memory associated with the ImmutableMaps (particularly those managed by the GDM) are region allocated and destroyed. Such memory is allocated efficiently. Using DenseMaps for per-GRState involves malloc()'ing a bunch of memory that just gets leaked. That’s really inefficient, and algorithmically not correct.

Instead, I’d expect to see something like:

if (SymbolRef sym = V.getAsSymbol()) {
state = state->add(sym, CheckedReturnState::getUnchecked(CE))
C.addTransition(state);
}

That’s a lot less code and doesn’t allocate a ton of (mutable) memory. Id then replace CheckedReturnState with some like the following:

enum CheckReturnStateKind { Unchecked, Checked };

typedef llvm::ImmutableMap<SymbolRef, CheckReturnStateKind > CheckedReturnState;
namespace clang { namespace ento {
template<>
struct GRStateTrait : public GRStatePartialTrait< CheckedReturnState > {
static void* GDMIndex() { static int index = 0; return &index; }
};
}}

That reduces about 60 lines of code to 8, and then allows you to use CheckedReturnState as I illustrated above. After reading through more of your code, I realize that this doesn’t quite match up with your algorithm, so let’s continue to discuss the algorithmic design and see how it matches up with the static analyzer APIs we have.

Now let’s look at checkPostStmt(BinaryOperator*):

+void UncheckedReturnChecker::checkPostStmt(const BinaryOperator *B,

  • CheckerContext &C) const {
  • const GRState *state = C.getState();
  • SVal V = state->getSVal(B);

Hi Ted,

Thank you very much for your reply. It solved many problems that confusing me.

Sorry for the misuse of the APIs and data structures. My intention of last mail is to show what i am trying to do by code without changing the Engine and the checker visit mechanism.

If you don’t mention “taint analysis” in your email, i didn’t realize that yet. Yes, i think it’s some sort of taint analysis.
IMO, some key points in this ‘taint’ analysis are:

source: return values from all the CallExpr.
sanitizer: the condition in a BranchCondition.
sink: when we handle RemoveDeadSymbols?(according to your mail)

And as you point, the ‘taint’ propagation ruls is the most important thing. So i think this analysis is more or less different from the traditional taint analysis.

More comments inline.

2011/3/15 Ted Kremenek <kremenek@apple.com>

Hi clang,

Sorry for this late reply.

I reimplement UncheckedReturnChecker, here are some key points and some problems confusing me.

Hi Lei,

Thanks for continuing to push forward on this. Comments inline.

  1. Instead of the original syntactic approach, i track the symbolic values in BinaryOperator & UnaryOperator. It means every BinaryOperator & UnaryOperator will have a symbol to record the callexprs in subExpr. (So i create many Conjured Symbol to do this, is this all alright?)

I don’t think this is quite the right approach, although your patch raises some interesting questions. In principal, you shouldn’t need to create any new conjured symbols at all. Conjured symbols represent values to be tracked by the analyzer. Conceptually, your checker only tracks meta-data associated with symbols. However, what you are trying to do is a form of “taint” analysis, which may require introducing more symbols.

If i should not create conjured symbols, which symbol should i use.

A few specific comments:

  • // Once we encounter a call expr, set the initial state to Unchecked.

  • if (SymbolRef Sym = V.getAsSymbol()) {

  • llvm::DenseSet *CRStates = new llvm::DenseSet();

  • CheckedReturnSet CRSet = CheckedReturnSet(CE, *CRStates);

  • CRSet.addState(CheckedReturnState::getUnchecked(CE));

  • state = state->set(Sym, CRSet);

  • C.addTransition(state);

  • }

DenseMaps should never be used as data stored by a GRState. DenseMaps are not immutable, and your checker will now allocate a ton of memory and leak like crazy. For data to be tracked for a given GRState, you must use immutable data structures such as ImmutableMap. The reason is two fold:

  1. ImmutableMaps can be reused in multiple instances of GRState, and we can compare (using pointer equality) if two ImmutableMaps are equivalent. This allows us to do state caching, and also merge paths when two ExplodedNodes have the same ProgramPoint and GRState.

  2. All memory associated with the ImmutableMaps (particularly those managed by the GDM) are region allocated and destroyed. Such memory is allocated efficiently. Using DenseMaps for per-GRState involves malloc()'ing a bunch of memory that just gets leaked. That’s really inefficient, and algorithmically not correct.

Errr, sorry about that.

Instead, I’d expect to see something like:

if (SymbolRef sym = V.getAsSymbol()) {
state = state->add(sym, CheckedReturnState::getUnchecked(CE))
C.addTransition(state);
}

That’s a lot less code and doesn’t allocate a ton of (mutable) memory. Id then replace CheckedReturnState with some like the following:

enum CheckReturnStateKind { Unchecked, Checked };

typedef llvm::ImmutableMap<SymbolRef, CheckReturnStateKind > CheckedReturnState;
namespace clang { namespace ento {
template<>
struct GRStateTrait : public GRStatePartialTrait< CheckedReturnState > {
static void* GDMIndex() { static int index = 0; return &index; }
};
}}

That reduces about 60 lines of code to 8, and then allows you to use CheckedReturnState as I illustrated above. After reading through more of your code, I realize that this doesn’t quite match up with your algorithm, so let’s continue to discuss the algorithmic design and see how it matches up with the static analyzer APIs we have.

Now let’s look at checkPostStmt(BinaryOperator*):

I think I see what you are trying to do. Essentially:

a) the checker tracks symbols returned from function calls

b) binary/unary operations can create new values that are derived from symbols we are tracking. When those new values are checked, we want to treat the original symbols as tracked, so we need a way to track that a value is derived from another.

c) What are the “checked” propagation rules? For example:

x = foo(); // new symbol
y = bar(); // new symbol
z = x + y; // new symbol, based on the symbolic values in ‘x’ and ‘y’
if (z) …

Algorithmically, at ‘if (z)’, should the values in ‘x’ and ‘y’ be considered checked? I’m not certain. I think it really depends on your analysis and the “taint” propagation rules.

IMO, ‘x’ and ‘y’ should be considered checked. But there is no evidence, so i think the ‘taint’ propagation rules needs to much more study and test.

My gut feeling is that (b) should be managed by the analyzer engine, not a specific checker. The reason is that many checkers may wish to do taint analysis (a general term for these kinds of problems), and track how a symbolic value “morphs” into other values. (c) seems to be the purview of a specific checker.

So here is how I’d expect the checker to be written:

I) The core analysis engine provides a few key APIs for doing taint analysis that can be used by specific checkers. Specifically:

a) CheckerContext provides an API for a checker to indicate that it wants to start tracking the “taint” state of a given function return value. Operationally, this would do the following:

  • If the return value is a symbol, the analysis engine adds that symbol to a list of symbols to be tracked for taint analysis.
  • If the return value is not a symbol, the analysis engine conjures a new symbol, constraints that symbol to have the value returned from the function call, and then adds that symbol to a list of symbols to be tracked for taint analysis

b) When the analysis engine evaluates unary or binary operations (that can create new values derived from symbols), it checks if any of the operands are tainted symbols. If so, it does one of the following:

  • If the new value for that expression is a symbol, the analysis engine records that that new symbol was “derived” via taint analysis from the tracked tainted symbol.
  • If the new value for that expression is not a symbol, the analysis engine creates a new symbol, constrains that symbol to be equal to the value computed for that expression, and then uses that new symbol as the value of the expression. It then records that the new symbol was “derived” via taint analysis from the tracked tainted symbol."

We can have some variations here, but with this first cut essentially the checker would need no additional code to do the taint propagation. It would be done entirely by the analysis engine. This causes your checkPostStmt(BinaryOperator*) and checkPostStmt(UnaryOperator*) to completely vanish.

c) The analysis engine provides APIs to query if a symbol is tainted, and walk the tree of symbols from which it inherits its “tainted” classification.

How the Engine works depends on how we define the propagation rules. If we wanna make checkPostStmt(BinaryOperator *) and checkPostStmt(UnaryOperator *) to completely vanish, we must implement those rules in the engine. But we may have different propagation rules, like traditional one and what we have in UncheckedReturnChecker. Should all this rules work in the engine? Or maybe we need have some callback for propagation?

II) Using the APIs above, the checker would consist of two parts:

  • Use API (a) in checkPostStmt(CallExpr). The checker would get the symbol returned by the API, and set it to the “Unchecked” state (per my example code above).
  • Use API (c) in checkBranchCondition(). If the condition evaluates to a tainted symbol, we can explore the tree of symbols from which it derives it taint status and mark all symbols in that tree as being “checked”.

The nice thing about this approach is that your checker is just in the business of tracking basic typestate, and (ideally) does minimal work to propagate the taint propagation (which can really only be done well in the analysis engine itself).

Yes, i think this is the right apprach.
Thank you very much for pointing this. I knew something wrong with my patch, but i couldn’t tell it. So i didn’t go any further, and wrote last mail for help.

  1. I create a global DenseMap<const CallExpr*, CheckedReturnState*> to store whether a callexpr is checked or not.

This probably isn’t the right approach, because we really want to shy away from global state. It also makes a bunch of assumptions that might not be true. For example, it assumes that functions are analyzed independently and that we do no inter-procedural analysis. While that’s what we mostly do right now, that isn’t an invariant.

For example, consider the case where we do basic inter-procedural analysis where we “inline” function calls. That is, when we see a function call, we continuing creating the ExplodedGraph by tracing into the called function, and then when we return from that function we continue analyzing the original function. We then have one analysis pass, and the “global” state is associated with analyzing all of those functions. What happens if the called function is the same function we are already analyzing? E.g., we are analyzing foo(), and foo() calls itself. This will cause the counts to be all messed up, as they will conflate two calls. Is this what we want? Maybe, but it should be a deliberate decision, not an accident. Further, Checker objects now exist for the entirety of

Yes, you are right. I forget about this.

analyzing ALL functions in a translation unit. This means that they are reused by multiple GRExprEngine instances. This means that such global state really is a problem, as it is shared between multiple analysis runs.

Errr, I didn’t make this clear. IMO, this global DenseMap<const CallExpr*, CheckedReturnState*> is used by the whole translation unit. All the CheckedReturnStates in different GRExprEngine instances are record here, and do the statistica count after all the function handled.
However, using a global state like this is indeed wierd. And i will move on as you suggested.

Instead, I suggest we decompose things as follows:

  • Provide a way to associate checker-specific state with a LocationContext. A LocationContext can encapsulate the current “stack frame”, and thus provides the “local” but “global” analysis state you are looking for.
  • Using the “local” state I suggested, Incrementally build up the statistical counting as needed. We only need to update the counts in two places:

a) in checkBranchCondition(), when we transition a tracked symbol to the “checked” state, we can increment the checked count for that symbol.
b) when we handle RemoveDeadSymbols and a symbol is in the “unchecked” state, we increment the unchecked count for that symbol.

We keep statistical counts for each LocationContext (or rather, StackFrameContext), and then do the post-analysis based on that.

OK, this souds much better than what i did, and i need to read more code about LocationContext.

  1. This patch didn’t consider the deal with the escaped symbol.

We can handle this in a variety of ways. We can add a new checker callback when symbols escape, and then have the Checker respond accordingly.

  1. The bugreport is very simple now…

That’s fine. That’s polish.

  1. This checker seems much slower than other…

There are a couple reasons for this. First, you are malloc’ing a ton of objects. malloc() is really slow, and your checker now leaks like crazy. All checker state should be allocated using the BumpPtrAllocator associated with GRStateManager. Second, a semantic approach is inherently slower than a syntactic one, but far more general and thorough. Third, because of the way you are tracking checker state, you have forced the static analyzer to be worst-case exponential. There will be no state caching as there is no way to merge isomorphic states because of the freshly created DenseMaps.

Thank you for pointing these out.

After all, there are many work should be done. I write this mail to ask whether this direction is right or not.

After writing my comments above, I realized this really is a big project, not just a new checker. To do it semantically, we need a bunch of new infrastructure that currently doesn’t exist. I see two options:

a) We forge ahead on such infrastructure (for taint propagation in the analysis engine). I can work with you to gradually make this happen, but this will take time. Once that’s in place, we use that new infrastructure to implement your checker using that infrastructure (or we do this in tandem).
b) We go (for now) with your original “syntactic” version of the checker, and fix the issues that would be common to both the semantic and syntactic versions.

I think we should do both. The good thing about (b) is that it establishes a baseline. That baseline allows us to write tests, etc. It also allows you to not commit staying involved beyond where your interest or time peaks out. With (a), we can keep (b) around as an entirely separate implementation, and gradually make progress on (a) until we feel it is good enough to replace (b).

I know this is a lot of data, and I feel I probably explained a few things very poorly. At a high level, I think to do your checker “right” we need to add a bunch of infrastructure to the core analysis engine that doesn’t exist. Once that infrastructure exists, however, I think the checker itself should be fairly simple.

Thoughts? Comments? How would you like to proceed?

I think option(a) is what i want. But to implement it, i need to study the the taint analysis theory and find out the propagation rules we should establish for this checker. And i can’t guarantee the time spent on it.
So as you said, we go with (b) first, and then gradually make (a) happen.

I will go back the with the original “syntactic” version of the checker, and see what i can do about (a) meanwhile.

What you say?