Some problems about function summary

Hi, Ted,

I am trying to extend clang static analyzer to support inter-procedural
summary-based analysis. I have implemented a simple method for detecting
inter-procedural memory leaks which still has some problems preventing
doing a precise or even correct analysis.The main idea is recording the
side effects of each function, called SE for short. The structure of SE
is maps from statements to SVals.
For example,
void f(int *x) {
  x[0] = 1;
}
void g() {
  int a = 0;
  f(&a);
}
The SE of function f is stmt(x[0])->ConcreteInt(1) (Here, x[0] must meet
a condition that the base region of stmt must be a symbolic Region
accessed from formal argument.).

When analyzing function g, we binding the actual argument &a to the
formal argument x(call the method EnterStackFrame() in GRState class.)
and then iterator all of SEs in function f to evaluate the side
effects(Use method ExprEngine::Visit(x[0]) to get the actual SVal or
memory region of x[0] and assign it the Value Concrete(1)).

There are many problems in this method.
1. If the index of x is a local variable ,the analysis is incorrect.For
instance, void f(int *x) { int i = 0; x[i] = 1; }.The actual memory
region of x[i] is unknown.Because the value i in function f is missed in
current summary.
2. If there exists assignments between arguments, the analysis is also
incorrect. For instance, void f(int *x, int *y){ x = y; x[0] = 1; }.
...
...

Obviously, the summary info is insufficient.So I try to explore more
summary info by collecting each function's access paths(AP).
Access paths are symbolic names for memory locations accessed in a
procedural.The structure of access paths is maps from statements to
memory region. We use access paths to handle the transfer from the
called function's statements with formal arguments to the caller's
actual statements. For example,
void f(int *x, int *y) {
  x = y;
  x[0] = 1;
}
void g(int *a, int *b) {
  f(a, b);
}
The access paths of function f are stmt(x) -> symbolicRegion($x),
stmt(y)->symbolicRegion($y).In order to use access paths, the structure
of SE is adopted as maps from memory region to SVal.In this figure , f's
SE is ElemRegion{SymbolicRegion($y), 0} -> ConcreteInt(1).

While applying the summary info at the function's call site, we should
translate from the called function's memory region in SE to actual
memory region.ElemRegion{SymbolicRegion($y), 0} is translated as
ElemRegion{SymbolicRegion($a), 0},and then assign ConcreteInt(1) to
ElemRegion{SymbolicRegion($a), 0}.

This method is difficult or even infeasible to implement.Because the
translation is not always reasonable and the state data such as memory
region during intra-procedural analysis is not released which causes a
large memory consuming.

So we should define a data structure for summary which is independent of
GRState(or memory region).The most difficult thing is how to obtain the
actual value from the called function's analysis data.Can you give me
some suggestions about the design of function summary?Hope for your
reply!

Hi,

Let me just suggest this paper to you:
Precise Interprocedural Analysis in the Presence of Pointers to the Stack
Pascal Sotin and Bertrand Jeannet (INRIA)
http://pop-art.inrialpes.fr/people/sotin/papers/SotinJeannet2011-ExactPointerAnalysis.pdf

The classic Reps (POPL'01?) and SLAM (polymorphic predicates) papers may also give you some ideas.

Nuno

Hi Zhenbo,

IMO, function summary may includes many informations, like constraints to the parameters, security rules the parameters should obey… Do you have a over all design of these? Or could your implementation meets all these demands?

2011/4/27 Zhenbo Xu <zhenbo1987@gmail.com>

Hi, Ted,

I am trying to extend clang static analyzer to support inter-procedural
summary-based analysis. I have implemented a simple method for detecting
inter-procedural memory leaks which still has some problems preventing
doing a precise or even correct analysis.The main idea is recording the
side effects of each function, called SE for short. The structure of SE
is maps from statements to SVals.
For example,
void f(int *x) {
x[0] = 1;
}
void g() {
int a = 0;
f(&a);
}
The SE of function f is stmt(x[0])->ConcreteInt(1) (Here, x[0] must meet
a condition that the base region of stmt must be a symbolic Region
accessed from formal argument.).

When analyzing function g, we binding the actual argument &a to the
formal argument x(call the method EnterStackFrame() in GRState class.)
and then iterator all of SEs in function f to evaluate the side
effects(Use method ExprEngine::Visit(x[0]) to get the actual SVal or
memory region of x[0] and assign it the Value Concrete(1)).

IMO, since you are using function summary, you could not simply call the enterStackFrame and iterator all SEs. You need a new mechanism to do inter-procedural analysis using function summary.

There are many problems in this method.

  1. If the index of x is a local variable ,the analysis is incorrect.For
    instance, void f(int *x) { int i = 0; x[i] = 1; }.The actual memory
    region of x[i] is unknown.Because the value i in function f is missed in
    current summary.
  2. If there exists assignments between arguments, the analysis is also
    incorrect. For instance, void f(int *x, int *y){ x = y; x[0] = 1; }.

Obviously, the summary info is insufficient.So I try to explore more
summary info by collecting each function’s access paths(AP).
Access paths are symbolic names for memory locations accessed in a
procedural.The structure of access paths is maps from statements to
memory region. We use access paths to handle the transfer from the
called function’s statements with formal arguments to the caller’s
actual statements. For example,
void f(int *x, int *y) {
x = y;
x[0] = 1;
}
void g(int *a, int *b) {
f(a, b);
}
The access paths of function f are stmt(x) → symbolicRegion($x),
stmt(y)->symbolicRegion($y).In order to use access paths, the structure
of SE is adopted as maps from memory region to SVal.In this figure , f’s
SE is ElemRegion{SymbolicRegion($y), 0} → ConcreteInt(1).

While applying the summary info at the function’s call site, we should
translate from the called function’s memory region in SE to actual
memory region.ElemRegion{SymbolicRegion($y), 0} is translated as
ElemRegion{SymbolicRegion($a), 0},and then assign ConcreteInt(1) to
ElemRegion{SymbolicRegion($a), 0}.

This method is difficult or even infeasible to implement.Because the
translation is not always reasonable and the state data such as memory
region during intra-procedural analysis is not released which causes a
large memory consuming.

Can you show the cases when the translation is not reasonable?

I think when generate a function summary(after the intra-procedural analysis is done), you should remove the state data that have no effects to outside.