usage of clang in an university project

Hi,

I was assigned a project for the security class in the university that consists in doing some analisys of varargs function calls (C only). Basically I need some form of an AST of a C file and a way to transverse it.
Do you think that clang is the best tool for the job? (btw, it has to be delivered in mid-December). I already took a look to other projects, like ANTLR, GCCXML, a flex+bison grammar I found on the web, and even the dump of gcc -fdump-translation-unit, but noone seems to be appropriate (they have a big learning curve or they don't work well enough).

So if you think that clang might do the job, can you please give me some pointers to how to get started?

Thanks in advance,
Nuno Lopes

I was assigned a project for the security class in the university that
consists in doing some analisys of varargs function calls (C only).
Basically I need some form of an AST of a C file and a way to transverse it.

Sure, clang can do that very easily.

Do you think that clang is the best tool for the job? (btw, it has to be
delivered in mid-December). I already took a look to other projects, like
ANTLR, GCCXML, a flex+bison grammar I found on the web, and even the dump of
gcc -fdump-translation-unit, but noone seems to be appropriate (they have a
big learning curve or they don't work well enough).

So if you think that clang might do the job, can you please give me some
pointers to how to get started?

The easiest thing to do is to get clang and build it. Then run the -parse-ast-print option. The code for this is implemented in the clang/AST/StmtPrinter.cpp file. That will give you an overview of using the AST for something simple.

For you, you don't want to process all nodes, just calls. Given a Stmt for a function body, to walk the AST, you should be able to do something like this:

void WalkAST(Stmt *S) {
   if (S == 0) return;

   if (CallExpr *Call = dyn_cast<CallExpr>(S)) {
     // Look at this call.
   }

   for (Stmt::child_iterator I = S->child_begin(), E = S->child_end(); I != E; ++I)
     WalkAST(*I);
}

-Chris

Thank you for your fast answer.
Today I was reading the clang code and I'm now trying to implement an ASTConsumer, as I'll have to do a little more analysis that just checking function calls. I also need to know when a particular variable used in the function call is initialized or not as well as if it is not initialized by the function call (through a pointer) whether it is used afterwards or not. I still didn't know if this will be simple enough to do, but I'll try..

For now, I have just more two questions:
- is it possible to disable clang warnings and output just the warnings I'll be generating? (I added a -parse-ast-sirs opt for me)
- can clang ignore parsing errors? e.g. if it can't find an header.h file can it continue even if a function is called without being declared. This is especially useful in my case because I'll be parsing the PHP sources (in C) and I don't have all the external libraries that some PHP extensions rely on, but I still would like to be able to analyse those extensions, though.

BTW, in attach you'll find a backtrace I got when doing executing 'clang -parse-ast-*'. It happens when I parse a PHP source file and when not all include (-I) paths are specified. I took a quick look at those functions, but I couldn't fix the bug myself.

Thanks,
Nuno

clang-bt.txt (1.16 KB)

Thank you for your fast answer.
Today I was reading the clang code and I'm now trying to implement an ASTConsumer, as I'll have to do a little more analysis that just checking function calls. I also need to know when a particular variable used in the function call is initialized or not as well as if it is not initialized by the function call (through a pointer) whether it is used afterwards or not. I still didn't know if this will be simple enough to do, but I'll try..

Ok, the dataflow framework Ted is working on will be useful to you. We don't really support interprocedural analysis yet though.

For now, I have just more two questions:
- is it possible to disable clang warnings and output just the warnings I'll be generating? (I added a -parse-ast-sirs opt for me)

Yes. The Diagnostic class handles mapping of warnings and other diagnostics to error levels.

Just call Diagnostics->setDiagnosticMapping(somediag, diag::MAP_IGNORE); to ignore somediag.

- can clang ignore parsing errors? e.g. if it can't find an header.h file can it continue even if a function is called without being declared.

No, not in full generality. In C you need information about types to be able to parse, for example.

This is especially useful in my case because I'll be parsing the PHP sources (in C) and I don't have all the external libraries that some PHP extensions rely on, but I still would like to be able to analyse those extensions, though.

You may have some luck, but in general it won't work well.

BTW, in attach you'll find a backtrace I got when doing executing 'clang -parse-ast-*'. It happens when I parse a PHP source file and when not all include (-I) paths are specified. I took a quick look at those functions, but I couldn't fix the bug myself.

Please create a .i file with clang and send that in: clang whatever.c -E > output.i

-Chris

Thank you for your fast answer.
Today I was reading the clang code and I'm now trying to implement an ASTConsumer, as I'll have to do a little more analysis that just checking function calls. I also need to know when a particular variable used in the function call is initialized or not as well as if it is not initialized by the function call (through a pointer) whether it is used afterwards or not. I still didn't know if this will be simple enough to do, but I'll try..

Ok, the dataflow framework Ted is working on will be useful to you. We don't really support interprocedural analysis yet though.

No, I don't need interprocedural analysis right now. For now, I'll just code what the functions do. e.g.:
zend_parse_parameters(ZEND_NUM_ARGS(), "s|d", &str, &str_len, &integer);

I know that str and str_len will get initialized (unless the function returns FAILURE) and that integer may or may not be initialized. So for now, I'll have all the information hard-coded. Maybe next year I can do my master thesis with clang/llvm and use interprocedural analysis.

For now, I have just more two questions:
- is it possible to disable clang warnings and output just the warnings I'll be generating? (I added a -parse-ast-sirs opt for me)

Yes. The Diagnostic class handles mapping of warnings and other diagnostics to error levels.

Just call Diagnostics->setDiagnosticMapping(somediag, diag::MAP_IGNORE); to ignore somediag.

OK, thanks, I'll check it out.

- can clang ignore parsing errors? e.g. if it can't find an header.h file can it continue even if a function is called without being declared.

No, not in full generality. In C you need information about types to be able to parse, for example.

It could just ignore type checking and such kind of checks, as I don't need that. It is enough for me to have the function/type names without further information.
Do you think I can workaround this? It's very very important for this particular project.

This is especially useful in my case because I'll be parsing the PHP sources (in C) and I don't have all the external libraries that some PHP extensions rely on, but I still would like to be able to analyse those extensions, though.

You may have some luck, but in general it won't work well.

damn..

BTW, in attach you'll find a backtrace I got when doing executing 'clang -parse-ast-*'. It happens when I parse a PHP source file and when not all include (-I) paths are specified. I took a quick look at those functions, but I couldn't fix the bug myself.

Please create a .i file with clang and send that in: clang whatever.c -E > output.i

Attached.

Thanks,
Nuno

php_pcre.c.clang.i (27.9 KB)

Nuno,

Currently clang has support for building CFGs from ASTs, and there is a fairly generic dataflow solver in place for doing flow-sensitive dataflow analyses (both forward and backward). Currently there is an implementations of both live variable analysis and uninitialized values analysis built on this solver. I won't attest that the framework (or the analyses built on it) is bug-free, nor that it is in the final form it eventually will be in, but it certainly implements most of the boilerplate for iterating over statements, merging dataflow values, etc. We of course would welcome feedback if you decided to use this part of clang, as the goal is to make that part of clang very powerful but also easy to use.

I'm more than happy to provide support if you are interested in using this part of clang. I'm afraid that documentation is limited for this part of clang, although I do plan on trying to remedy this problem (at least partially) in the short term.

Ted

Nuno,

Currently clang has support for building CFGs from ASTs, and there is a fairly generic dataflow solver in place for doing flow-sensitive dataflow analyses (both forward and backward). Currently there is an implementations of both live variable analysis and uninitialized values analysis built on this solver. I won't attest that the framework (or the analyses built on it) is bug-free, nor that it is in the final form it eventually will be in, but it certainly implements most of the boilerplate for iterating over statements, merging dataflow values, etc. We of course would welcome feedback if you decided to use this part of clang, as the goal is to make that part of clang very powerful but also easy to use.

I'm more than happy to provide support if you are interested in using this part of clang. I'm afraid that documentation is limited for this part of clang, although I do plan on trying to remedy this problem (at least partially) in the short term.

Ted

Thank you. I'll certainly try to use the CFG framework.
I'll carefully read the Analysis directory files first and I'll get back to you if (well, when) I have some question.

Thanks,
Nuno

BTW, in attach you'll find a backtrace I got when doing executing 'clang -parse-ast-*'. It happens when I parse a PHP source file and when not all include (-I) paths are specified. I took a quick look at those functions, but I couldn't fix the bug myself.

Please create a .i file with clang and send that in: clang whatever.c -E > output.i

Attached.

Fixed, thanks!
http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20071008/002282.html

-Chris

BTW, in attach you'll find a backtrace I got when doing executing 'clang -parse-ast-*'. It happens when I parse a PHP source file and when not all include (-I) paths are specified. I took a quick look at those functions, but I couldn't fix the bug myself.

Please create a .i file with clang and send that in: clang whatever.c -E > output.i

Attached.

Fixed, thanks!
http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of- Mon-20071008/002282.html

Thank you. I run a few more tests and I couldn't trigger the problem anymore.

Nuno

Hi again,

Sorry for my late response, but I've been busy with other stuff.
I'm writing firstly to make sure I want to do sane things and that they are doable with current clang's CFG infrastructure.

For example, one of the errors I would like to detect is the following (from a patch that fixed a crash in PHP):

- if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(4, &domain, &msgid1, &msgid2, &count, &category) == FAILURE) {
+ if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(5, &domain, &msgid1, &msgid2, &count, &category) == FAILURE) {
  WRONG_PARAM_COUNT;
  }

(ZEND_NUM_ARGS() is just an int variable).

My question is how can I track those values? I think that tracking them in a more general way (e.g. var > 5; var2 < var1+3) needs a full SAT solver. But simplifying things, is this doable with the PersistentMap, for example?

Also I would like to find memory overflow bugs, like:
char dest[10];
if (size <= 11)
    memcpy(dest, input, size);

At the CallExpr, how do I know that 'size' is not sanitized correctly (i.e. size <= sizeof(dest))?

Also, from which example of the Analysis dir should I base my code?

Thanks in advance,
Nuno

Hi again,

Sorry for my late response, but I've been busy with other stuff.
I'm writing firstly to make sure I want to do sane things and that they are doable with current clang's CFG infrastructure.

For example, one of the errors I would like to detect is the following (from a patch that fixed a crash in PHP):

- if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(4, &domain, &msgid1, &msgid2, &count, &category) == FAILURE) {
+ if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(5, &domain, &msgid1, &msgid2, &count, &category) == FAILURE) {
WRONG_PARAM_COUNT;
}

(ZEND_NUM_ARGS() is just an int variable).

My question is how can I track those values? I think that tracking them in a more general way (e.g. var > 5; var2 < var1+3) needs a full SAT solver. But simplifying things, is this doable with the PersistentMap, for example?

Hi Nuno,

Static analysis technology can be based on a whole variety of "solvers" that perform symbolic logic. Bit-blasting to propositional logic and solving with SAT is only one approach. Bit-blasting, while extremely powerful, is a very heavy hammer, and there are hundreds of tradeoffs when choosing the right solver and analysis method for a particular problem. Sometimes a "simpler" approach, such as those based on traditional dataflow analysis, can accomplish the task just as well, but in the process be more scalable. It really depends on your problem, but you certainly don't *need* a SAT solver for tracking and solving the constraints you described, nor would it always be the best approach.

For example, there are many techniques out there use specialized solvers within an inter-procedural, path-sensitive, dataflow framework. For example, I highly suggest checking out the work by Dawson Engler on his group's work on Metacompilation (and a related project at MSR was ESP):

Checking System Rules Using System-Specific, Programmer-Written Compiler Extension
Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem.
Appeared in: Proceedings of the 4th Symposium on Operating System Design and Implementation.

Manuvir Das, Sorin Lerner, and Mark Seigle
ESP: Path-Sensitive Program Verification in Polynomial Time
PLDI '02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, Berlin, Germany, June 2002.

If you want an example of a specialized solver built on top of the Metacompilation work, check out Archer, which found buffer overruns in systems code such as the Linux kernel:

ARCHER -- An Automated Tool for Detecting Buffer Access Errors
Yichen Xie, Andy Chou, and Dawson Engler
Proceedings of ESEC/FSE 2003, Helsinki, Finland

I have also seen some recent work on doing path-sensitive analysis using type systems, or using denotational semantics. None of these are based on SAT solvers. This isn't an argument against using SAT solvers; I'm just saying there are a lot of options, and it really depends on your problem domain. My main argument is that often taking a more "light-weight" approach can get you more traction both in the short term, but also in terms of the overall effectiveness in the tool (in terms of scalability, etc.).

In terms of things that could be used to help implement such techniques, we now have the ImmutableMap and ImmutableSet data types in the LLVM libraries (see later in my email), which supersede the PerstistentMap implementation. These could be used to track mappings from variable names to constraints (the ARCHER paper provides some excellent details on how such constraints can be tracked). There of course needs to be some other supporting infrastructure for doing path-sensitive analysis, which we are working on (see below), or you would have to develop yourself.

Also I would like to find memory overflow bugs, like:
char dest[10];
if (size <= 11)
  memcpy(dest, input, size);

At the CallExpr, how do I know that 'size' is not sanitized correctly (i.e. size <= sizeof(dest))?

An excellent paper to read would be the ARCHER paper (mentioned above), which describes a specialized solver for finding these kinds of errors. It's a fairly scalable system that had a specialized constraint solver that kept track of linear constraints on buffers and extents. In such a system, flagging an error like this is trivial.

Also, from which example of the Analysis dir should I base my code?

The Analysis dir contains very nascent code. In the last couple of months I put in some code for a flow-sensitive dataflow analysis framework, and implemented "Live Variables" and "Uninitialized Values" on top of that framework. As you have seen, these analyses themselves have bugs, some of which have been introduced as the rest of clang has evolved. That said, I think it is a good baseline to build flow-sensitive analyses on top of, and in the process the API will be refined as users decide what they want (and don't want) from the framework.

I have also added both an ImmutableSet and ImmutableMap data structures to the LLVM core libraries, which implement a functional set and map similar to those in the Objective Caml libraries. There is still some tweaking to do on their interfaces, but these ADTs were implemented with the intention of being used for program analysis in clang. These can be used to efficiently implement dataflow "stores" that map from symbols to values, and their functional nature allows them to efficiently represent the dataflow values at different program points while exploiting the fact that most dataflow values across program points are very similar (and thus much of their representation can be shared).

Clang currently lacks, however, a system for doing path-sensitive analysis, as well as the infrastructure to support inter-procedural analysis. Work on this is currently underway. Thus there is currently no example code, however, in the Analysis directory (other than the clients of the flow-sensitive solver) on which to base your own code. At this point you would have to build the necessary infrastructure yourself, or depending on your timeline, work with the libraries that we provide as we gradually bring them online. We are currently building the logic into clang to serialize out its ASTs, and eventually build an infrastructure for reasoning about a collection of serialized ASTs (which could constitute an "image" of a code base, for example) to enable inter-procedural analysis (among other things).

Our initial goals are to provide a dataflow-based path-sensitive solver and supporting libraries, e.g., similar to those in ESP or Metacompilation, all of which are based on the standard Reps-Horwitz-Sagiv (RHS) inter-procedural, path-sensitive dataflow algorithm:

Precise interprocedural dataflow analysis via graph reachability

This approach, as illustrated by the ESP and Metacompilation systems, permits a wide range of checking tools, and it was also the basis of the model-checker SLAM that was developed at MSR (although SLAM operated on boolean programs and solved constraints using a theorem prover, it was still based on RHS).

Our current goals for clang are to provide the infrastructure for building such tools, and we plan on focusing on building such libraries starting in the next month or so. I should caution that developing these libraries will be a process, and will not happen overnight.

Note that we are not engineering clang to support only one kind of analysis framework. For example, a program analysis system such as those based on SAT (such as the Saturn project at Stanford) or any other technology are all reasonable things that could be built on clang.

My advice is that you that you approach the problem that you wish to tackle by first simplifying it. Don't go for the most general solution, and think about how you would go about solving a special subset of the problem. From that you can build simple analyses, which can gradually be refined on a need by need basis. I would do this regardless of whether or not you are using clang or some other framework. If your project will be over a long period of time, you may potentially be able to use some of the infrastructure we are bringing online (and perhaps, if you are interested, you can also be active in its development). Otherwise, I would stick with a project that is manageable and only requires that you develop the infrastructure you need to get some initial results. This will allow you to quickly experiment with new things, which will better help you understand your problem domain as well as the right program analysis approach to solving it.

Ted