Some problems with function summaries

Hi, Ted,

I am trying to extend the Clang Static Analyzer to support
inter-procedural, summary-based analysis. I have implemented a simple
method for detecting inter-procedural memory leaks, but it still has
some problems that prevent a precise, or even correct, analysis. The
main idea is to record the side effects of each function (SE for
short). The structure of an SE is a map from statements to SVals.
For example,
void f(int *x) {
  x[0] = 1;
}

void g() {
  int a = 0;
  f(&a);
}
The SE of function f is stmt(x[0]) -> ConcreteInt(1). (Here, x[0] must
satisfy the condition that the base region of the statement is a
SymbolicRegion reached through a formal argument.)

When analyzing function g, we bind the actual argument &a to the
formal argument x (by calling the EnterStackFrame() method on GRState)
and then iterate over all of f's SEs to evaluate the side effects
(using ExprEngine::Visit(x[0]) to get the actual SVal or memory region
of x[0] and assigning it the value ConcreteInt(1)).

There are several problems with this method.
1. If the index into x is a local variable, the analysis is incorrect.
For instance, in void f(int *x) { int i = 0; x[i] = 1; }, the actual
memory region of x[i] is unknown, because the value of i in function f
is missing from the current summary.
2. If there are assignments between arguments, the analysis is also
incorrect. For instance: void f(int *x, int *y) { x = y; x[0] = 1; }.
...
...

Obviously, the summary information is insufficient, so I am trying to
capture more of it by collecting each function's access paths (APs).
Access paths are symbolic names for the memory locations accessed in a
procedure. The structure of the access paths is a map from statements
to memory regions. We use access paths to translate statements over
the called function's formal arguments into the caller's actual
statements. For example,
void f(int *x, int *y) {
  x = y;
  x[0] = 1;
}

void g(int *a, int *b) {
  f(a, b);
}
The access paths of function f are stmt(x) -> SymbolicRegion($x) and
stmt(y) -> SymbolicRegion($y). In order to use access paths, the
structure of the SE is changed to a map from memory regions to SVals.
In this example, f's SE is ElemRegion{SymbolicRegion($y), 0} ->
ConcreteInt(1).

When applying the summary information at a call site, we must
translate the called function's memory regions in the SE into actual
memory regions: ElemRegion{SymbolicRegion($y), 0} is translated into
ElemRegion{SymbolicRegion($b), 0}, and then ConcreteInt(1) is assigned
to ElemRegion{SymbolicRegion($b), 0}.

This method is difficult or even infeasible to implement, because the
translation is not always sound, and the state data (such as memory
regions) from the intra-procedural analysis is never released, which
leads to large memory consumption.

So we should define a data structure for summaries that is independent
of GRState (and of memory regions). The most difficult part is how to
obtain the actual values from the called function's analysis data. Can
you give me some suggestions about the design of function summaries? I
hope for your reply!

Hi Zhenbo,

I was looking through my old emails, and stumbled upon this one. I'm so sorry for not replying to it sooner.

Function summaries are a tough nut to crack. There are a variety of issues to tackle. Beyond memory updates, I think there are a bunch of other issues that need to be captured in a summary. Here is a (partial) list:

1) Memory updates (changes to the symbolic store)
2) Return values
3) Return values via out parameters
4) Constraints added to symbols, which can add "path constraints" on memory updates, return values, etc.
5) Unifying symbols (alpha-renaming) in the symbolic store and path constraints
6) Modeling checker state updates
7) and possibly many others...

Let's break these down.

== 1. Memory updates: ==

Memory updates represent updates in the symbolic store. While ExprEngine and checkers reason about MemRegions, there is no requirement that the underlying StoreManager actually model memory bindings using MemRegions. Indeed, RegionStoreManager lowers complex regions into a base region plus a byte offset. Since one can represent the address of a byte using different MemRegions (e.g., a FieldRegion or an ElementRegion representing a byte offset), this lowered representation has many advantages, and can possibly do a much better job of capturing the raw underlying semantics while leaving MemRegions for the higher-level clients to "describe" how memory is being used. This is a really powerful decoupling that gives us the flexibility of swapping in a new StoreManager that uses an entirely different memory-binding representation.

So what does this mean for summaries? For summaries, I think this means that memory updates should be described in the underlying representation of the StoreManager. When one creates a summary, ideally one just looks at the GRStates at the "return sites" for the function, and looks at the differences between the initial GRState and the end GRStates. The StoreManager can then articulate a "diff", which represents the effects on the memory model between the initial state and the end states. That diff, however, is completely up to the StoreManager, which preserves the decoupling between the MemRegion abstraction and the underlying memory representation in the StoreManager. Further, the StoreManager knows what it can and cannot represent; there is no reason to capture more information than that in the summary.

Now I want to get back to something you wrote:

> When analyzing function g, we bind the actual argument &a to the
> formal argument x (by calling the EnterStackFrame() method on
> GRState) and then iterate over all of f's SEs to evaluate the side
> effects (using ExprEngine::Visit(x[0]) to get the actual SVal or
> memory region of x[0] and assigning it the value ConcreteInt(1)).

I may be misinterpreting you, but technically this isn't summary-based analysis, but inlining. Summary-based analysis still involves analyzing functions in (partial) isolation, and building up summaries that allow a summary to be applied to *any* call of a given function. In a summary-based analysis, when we encounter a call to a function for which we have not previously generated a summary, we lazily generate one (by analyzing the called function), and then apply the summary. Whether or not we include specific context from the caller when generating the summary is a design decision, and may provide more precision depending on the analysis, but it isn't a prerequisite for summary-based analysis.

If we take the approach that summaries are generated in a context-free manner (where the caller is not taken into account), then the summary will articulate diffs in terms of the symbolic inputs and outputs representing the values of global variables, input parameters, and any other value that is "visible" to the function. For example, when analyzing "g", we would have symbolic values for "a" and "b", and the summary would describe its effects on "a" and "b" in terms of those symbolic values. Then, when *applying* the summary, we need to resolve the actual arguments (which can be actual concrete values) to the symbolic values used in the summary. This is a form of alpha-renaming (taking the idea from lambda calculus). When applying the summary, we may have many possible "diffs" to choose from (one per GRState at the end sites in the summarized function). Not all of the diffs will generate GRStates in the caller that are consistent. For example, if the caller has the constraint that an actual argument is non-null, and one of the diffs assumes that the formal parameter is null, then applying that diff will yield an invalid GRState because the values aren't consistent. If none of the diffs can apply cleanly, then there is no feasible path.

For memory updates, "applying the diff" is probably up to the StoreManager (since diffs aren't necessarily described in terms of MemRegions), but it has to be done in concert with the other diffs (from 2-6).

== 2. Return values ==

Like memory updates, there may be different possible return values, depending on constraints on the inputs. This also needs to be captured in the summary.

== 3. Return values via output parameters ==

These are just captured via memory updates, so if we solve (1), we solve (3).

== 4. Constraints added to symbols ==

The diffs for (1-3) and (6) all may be *conditional* on specific values of the inputs to a function (e.g., the constraints on symbolic regions or specific symbolic values). These constraints are modeled by the ConstraintManager. Like StoreManager, ExprEngine and checkers talk to the ConstraintManager via an abstract interface. The ConstraintManager, however, is allowed to represent constraints in GRState in any way that it wishes. The "diff" for constraints that is captured in the summary should also be left up to the ConstraintManager.

What are things that would be in the diff? Let's say we generate N different GRStates at end sites when analyzing a function for which we are creating a summary, and each of those N states has a different constraint on an input parameter. There may also be other constraints modeled in the GRState, e.g., constraints on local variables, but those are irrelevant for the summary because only the input values dictate the effects as seen by the caller. Thus, when generating the diff, we do the following:

* For each of the N end states, take the input state and the end state, and summarize the constraints added to each of the input values. We discard any other constraints. Likely we will do the exact same thing with the StoreManager: just discard memory bindings that are not visible to the caller, and summarize the differences between the initial and final memory bindings.

One thing that is cool about this approach is that we can basically just generate a new GRState that represents the "diffs", and then just alpha-rename the symbolic values to the values in the caller when applying the summary. When alpha-renaming, we resolve both incompatible bindings and incompatible constraints between the symbolic and caller's values, and then discard any GRStates that are invalid.

== 5. Unifying symbols ==

I think I covered this in (4), but the idea is that summaries just represent constraints on input symbols from the perspective of the called function, and when we apply the summary in the caller's context, we are just "unifying" the symbols. This can be a complex operation, and the logic can be distributed amongst the various subcomponents (e.g., StoreManager, ConstraintManager, etc.).

== 6. Modeling checker state updates ==

Like the StoreManager and ConstraintManager, checkers can also have "updates" to the state that need to be reflected in a summary. Because the logic of individual checkers is clearly unknown to ExprEngine, checkers alone show that the construction of summaries needs to be both (a) generic and (b) allow individual subcomponents that impact what goes into a GRState to be responsible both for generating the summary diff *and* for helping the alpha-renaming process when applying a summary. This will require extending the checker interface, and possibly providing better primitives for checker writers to make this process as painless as possible.

== 7. and possibly many others... ==

There could be other things we may need to capture in a summary. For example, do certain path constraints imply that a bug gets reported? Or that a path terminates? etc. These need to be reflected in a summary as well.

All of this of course is a big project, which is why the *inlining* approach to inter-procedural analysis is more immediately appealing for doing simple inter-procedural analysis. With the inlining approach, we analyze the caller, and when we see a function call, we just inline the semantics of the callee in situ, literally continuing the analysis of the caller by binding actual parameters to formal parameters; the callee's analysis is done purely in the caller's context. This is easy and highly precise, but won't scale beyond simple cases.

Hope this helps,
Ted

Hi, Ted and Zhenbo. All of this seems very interesting to me too; currently, a function call throws an awful lot of the analyzer's carefully constructed constraints out the window. I just wanted to add that it's probably possible to make a pessimistic function summary mechanism. That is, rather than assuming a function summary models /all/ salient behavior of a function, it can simply make guarantees about certain variables, or add certain constraints, without promising anything else.

Say, perhaps, that the only "summary" information we have is that if a length parameter is zero, the buffer parameter doesn't need to be invalidated. (I have no idea how we'd represent that.) That means we can go ahead and invalidate any other regions, globals, etc. as usual, but we can still take advantage of the information we /do/ have.

I'm sure there are plenty of caveats to be overcome here, but it seems like any of these six/seven parts of a summary would be an improvement over none. Especially if you/we figure out a way to provide this information across translation units.

Just my two cents!
Jordy

> Hi Zhenbo,
>
> I was looking through my old emails, and stumbled upon this one. I'm so sorry for not replying to it sooner.
>
> Function summaries are a tough nut to crack. There are a variety of issues to tackle. Beyond memory updates, I think there are a bunch of other issues that need to be captured in a summary. Here is a (partial) list:
>
> 1) Memory updates (changes to the symbolic store)
> 2) Return values
> 3) Return values via out parameters
> 4) Constraints added to symbols, which can add "path constraints" on memory updates, return values, etc.
> 5) Unifying symbols (alpha-renaming) in the symbolic store and path constraints
> 6) Modeling checker state updates
> 7) and possibly many others...
>
> [snip]

Hi Jordy,

In my own words, I’ll generalize a bit what you are suggesting.

One aspect of summaries is that they should model what pieces of memory are modified. A perfect summary would require no region invalidations, as it would just model the modifications directly. A less perfect summary would say “I know this part of memory could be modified, but I don’t know what the value is.” The least perfect summary assumes that all memory can be touched.

Conceptually what we have now is a summary-based analysis where we always use the most conservative summary. So we already have a summary-based analysis, but it isn’t general, and it assumes the most pessimistic results possible.

So what you are suggesting are summaries that don’t explicitly model all possible memory modifications that a function could make, but a conservative estimate (but more precise than what we currently have) of how memory could be modified. To do this, we need to make region invalidation a declarative effect in the summary (just like any other changes the summary models), which ExprEngine then explicitly performs when applying the summary. As part of this, we would just need to make “region invalidation” in the summary dependent on value constraints, just like any other aspect of a summary. One way to do this would be to store region invalidation information literally in a GRState (that would be used to construct a summary). We could (for example) put this in the GRState’s GDM, and provide an API for clients to store region invalidation constraints in the GRState. This would allow checkers with domain-specific knowledge of a function to model these region invalidation constraints (in the absence of more perfect summary information).

Ted