Inter procedural analysis across translation unit in clang static analyzer

Hi All,

I was going through clang project and found static analyzer to be a quite useful tool. I would like to work and contribute on the same. I went through the code and developed few basic checkers(Socket stream checker etc) to start with.

I had a doubt which i wanted to clarify from the community.

If i’m not wrong Clang static tool currently supports only one translation unit at a time and so inter procedural analysis across translation unit is not supported.
Is there any plan to support the same in clang static analyzer?
What kind of infrastructure would be required in static analyzer core to support this feature?
Will it require detailed understanding of clang front end(AST etc)?

Thanks
Karthik

Hi All,

I was going through clang project and found static analyzer to be a quite useful tool. I would like to work and contribute on the same. I went through the code and developed few basic checkers(Socket stream checker etc) to start with.

I had a doubt which i wanted to clarify from the community.

If i'm not wrong Clang static tool currently supports only one translation unit at a time and so inter procedural analysis across translation unit is not supported.

That is correct.

Is there any plan to support the same in clang static analyzer?

This is something we would definitely like to address as it is one of the main missing pieces. I am not sure when we are going to address it.

What kind of infrastructure would be required in static analyzer core to support this feature?

We have not designed this in detail yet. However, this is going to be a lot of work. We would probably go with summary based approach, where one constructs summaries for the analyzed functions; the summaries are then used when modeling the calls.

Will it require detailed understanding of clang front end(AST etc)?

This project would require understanding the analyzer very well.

Thanks Anna… Can you suggest me some place were i can find document for static analyzer core module. I found many documents for checkers and how to write checkers, but couldn’t find document explaining functioning of static analyzer core.

Any help appreciated.

Thanks

Thanks Anna… Can you suggest me some place were i can find document for static analyzer core module. I found many documents for checkers and how to write checkers, but couldn’t find document explaining functioning of static analyzer core.

Unfortunately, there is not much documentation about the analyzer core yet (code is our documentation). The only additional documentation I know of is in clang/docs/analyzer; specifically, we have the IPA.txt there, which describes how we approach cross function analyzes. You could also search the archives from this list (in particular, emails from Ted Kremenek).

The analyzer utilizes the idea of path-sensitive dataflow analysis, which can be tackled with different specific techniques, but they all boil down to trying to compute a set of reachable program states. Our LLVM Dev meeting talk Building a Checker in 24 Hours gives a very high level overview of how it works (http://llvm.org/devmtg/2012-11/). Here are some relevant academic papers, but there are many more papers in the area, and the analyzer is inspired by many of them:
A System and Language for Building System-Specific, Static Analyses (Hallem et al)
Precise interprocedural dataflow analysis via graph reachability (Reps et al)

As I had mentioned, cross translation unit analyzes is a huge project; for example, it would most likely take more than a year to complete. However, it can be split into subtasks. There are also many other directions for improving the analyzer core.

Please, feel free to ask questions.
Anna.

Any help appreciated.

Thanks

Hi All,

I was going through clang project and found static analyzer to be a quite useful tool. I would like to work and contribute on the same. I went through the code and developed few basic checkers(Socket stream checker etc) to start with.

It would be great if you plan to contribute those checkers back!

Thanks Anna for the prompt replies. Will go through those documents.

Sure will contribute the checkers after refining it a bit. It is a simple SocketStream checker (checks socket open/close issues etc) inspired by SimpleStream checker from Building a Checker in 24 Hours video

I had one more query currently we terminate analysis of a path as soon as we detect a critical bug. (for e.g.) in the code -

int one =1;
int zero = 0;

int k = one/zero;

// Some other bugs

In the above code we are generating a sink node as soon as we see a divbyzero and do not analyse other bugs along the path.
Since static analysis of large code usually takes a lot of time is it not a good idea to report as much bugs as possible on a particular path?

For e.g in the above code instead of creating a sink node for divide by zero can we assign value for k as unknown and continue simulation?

I tried the same by modifying evalBinOp function and creating a transition node instead of sink node -

if ((op == BO_Div || op == BO_Rem ||op == BO_DivAssign ||op == BO_RemAssign) && rhs.isZeroConstant()) // in case of div by zero assign result as unknown and continue execution
return UnknownVal();

so that if we have a divide operation with rhs as 0 assign the result as unknown and continue simulation to detect other bugs on the path.

Thanks

Thanks Anna for the prompt replies. Will go through those documents.

Sure will contribute the checkers after refining it a bit. It is a simple SocketStream checker (checks socket open/close issues etc) inspired by SimpleStream checker from Building a Checker in 24 Hours video

I had one more query currently we terminate analysis of a path as soon as we detect a critical bug. (for e.g.) in the code -

int one =1;
int zero = 0;

int k = one/zero;

// Some other bugs

In the above code we are generating a sink node as soon as we see a divbyzero and do not analyse other bugs along the path.
Since static analysis of large code usually takes a lot of time is it not a good idea to report as much bugs as possible on a particular path?

Good point.

For e.g in the above code instead of creating a sink node for divide by zero can we assign value for k as unknown and continue simulation?

I tried the same by modifying evalBinOp function and creating a transition node instead of sink node -

if ((op == BO_Div || op == BO_Rem ||op == BO_DivAssign ||op == BO_RemAssign) && rhs.isZeroConstant()) // in case of div by zero assign result as unknown and continue execution
return UnknownVal();

so that if we have a divide operation with rhs as 0 assign the result as unknown and continue simulation to detect other bugs on the path.

There are several reasons why this approach might not work well in general.

For example, if you encounter a null pointer dereference(or div by zero), report it, and continue the execution, chances are that the same pointer would get used again and again. So you would end up reporting multiple instances of the same issue. Even if you came up with ways to unique those reports, analyzing after a “severe error” could lead to other issues being reported. How would one model the state after the error? Recovering from errors without loosing state/causing side effects might not always be as easy as in this case. (And even here we introduce some inconsistency into the analyzer.)

Another argument is that what is optimal for performance depends on the workflow in which one is using the analyzer. For example, it might be preferable to have a faster turnaround when issues are found rather than running the tool until as much as possible is covered. For example, if you are analyzing a small project regularly and expect a very fast turn around.

Anna.

Hi Anna,

Thanks for the clarification.

For IPA across translation unit, to start off for my project i do not require full fledged functionality but few basic ones such as divByZero Checker, memory leak check, null deference.

Regarding summary based approach i was planning to take the following approach -

  1. Run LibTooling and collect information for all functions across translation unit. The Summary can have things like -

Struct Summary
{

char* MangledFnName;
bool doesReturnZeroConst; //Summary to check null deref and div by zero
bool doesReturnMallocedMemory; //Check for potential memory leak
bool doesFreeParam(int i); //Check for free of param i
bool doesDivideParam(int i); // To check if div by zero



}

Maintain a Hash for MangledFnName and it’s Summary and dump to a file after each translation unit analysis. ( Summary )

  1. Run Static Analyzer normally. If at any point we are not able to inline a call load Summary file and get corresponding function summary using mangledName as key.
    Do the required state changes as per the summary of the function and report issues if found.

For example if we have -
k = fun() and summary of fun say doesReturnZeroConst then assign k ZeroConst and continue analysis to detect divide by zero.

I would like to get few clarifications from experts before diving into the coding part -

  1. Does the above approach (i.e. running 1st pass to collect summary for all translation units and dumping to a file and later do our normal static analysis and use summary were ever we are not able to inline.) seem feasible or are there any design issues i’m overlooking?

  2. Can i extract the required information(i.e. the rough summary mentioned above) from current LibTooling infrastructure? Or should these be collected these in some other manner?

  3. The above example(div by zero) is a bit simple but is this kind of summary enough to support memory leak detection, resource leak and null deference?

Thanks
Karthik

Karthik,

The design we were thinking about was somewhat different. A summary would be a general concept, which would include information about the function and all the checker specific info. For example, it could be a set of node pairs, where each pair would contain a state before the function execution and the state after the function execution. (You’d get a set because there could be more than one function entry state depending on the context in which the function is called.) This would imply that all the checker specific information would be stored in the summary for free, so the summaries would work for all the checkers(without any checker specific knowledge in the analyzer core).

Another advantage of such an approach is that it will store the information required for path sensitivity. Consider this contrived example, where it is important that the summary stores the knowledge about the path inside the called function.

int* cond_malloc(unsigned c) {
if (c)
return malloc(sizeof(int));
return 0;
}

foo(unsigned cond) {

int *p = cond_malloc(cond);

if (c)
free(p);

}

The roadmap we were thinking of is the following:

  1. Teach the analyzer about alpha renaming. So that it could compare two states when symbols are renamed. This would involve quite a bit of infrastructure work and address a bunch of false positives as well as give us a building block for cross-translation unit analyzes.
  2. Implement the summary based approach for functions inside one translation unit. This could potentially speed up the analyzer.
  3. Take it to the next step - cross translation unit.

With this approach you would not see the results immediately, but it is a better generic solution.

Cheers,
Anna.