# Buffer bounds checking for C in static analysis

Hi all,

I'm a student (part of a group of 3) looking for a project involving
software engineering tools, preferably static analysis. I do a lot of work
in C and have long felt that a static analysis tool for bounds checking on
memory buffers (arrays/strings) would be very helpful. I saw that this was a
requested feature for the Clang analysis tool, but there's very little info
and it's apparently not fully developed yet. Therefore I have two questions:

What is the state of bounds checking for C in the Clang analyzer, in terms
of how far it has gotten and how much work is progressing on it?

Is there another static analysis area that the Clang static analyzer needs
implemented that would be a reasonable project for a few CS grad students?

Thanks,
Chris Hacking

There's no place I can be,
Since I found Serenity.
But you can't take the sky from me.

Hi Chris,

The current clang static analyzer only has a basic array bounds checker (ArrayBoundChecker.cpp). Its function is very simple: whenever a location is visited, if it’s an array element, compare its index with the size of the array. Some errors it can detect:

int a[4];
a[4] = 3;

and
int *p = malloc(12);
p[3] = 3;

That is, the current analyzer has the basic infrastructure for detecting out-of-bound array access.

The complexity of bounds checking arises from two facts: (may be there are others)

a. Many out-of-bound accesses occur in loops. Depending on the length of the buffer, it may require looping many times before triggering the out-of-bound access. But currently we only unwrap the loop for 2 or 3 times.

b. There are several string manipulation functions, such as strcpy(), strcat(). Currently we don’t handle them. One simple way is to provide a fake implementation of them and inline them into the call site, and treat them as normal loops. Another way is to create some linear constraints over the arguments of them, and solve it. But this has complexity that we need to model the actual length of the string and the size of the array.

Some other projects includes integer overflow checking. This has not been implemented at all in clang.

Also inter-procedural analysis is required to enhance all other checks.

2010/1/19 Chris Hacking <chacking@cs.washington.edu>

Hi Zhongxing,

Thanks for the quick response. If I’m understanding you correctly, here are couple of cases that the bounds checker would miss:
void setfive (int* x) { x[5]=5; }
int a[4]; setfive(a);
(due to lack of inter-procedural analysis), and
int b[5]; int x;
for (x=0; x<=5; x++) { b=x; }
(due to lack of loop analysis).

If I’ve misunderstood, please let me know. Considering the problem briefly, a few other areas that are potentially problematic are alias analysis:
int c[5]; int* y = c; y[5]=50;
pointer arithmetic:
char s[10]; s++; s[9] = ‘\0’;
dynamically sized arrays:
int len = atoi(“5”); char* str = malloc(len); str[len] = ‘\0’;
although all could be interesting extensions to current checks.

While the exact implementation for handling the string manipulation functions would require some thought (and examination of the current analysis tool), they do seem important to test. The most dangerous buffer overflows typically seem to involve them in some way. To extend that idea somewhat, the string.h functions aren’t the only ones that can trigger buffer overflows; in particular many system calls write into buffers:
char e[255];
int s = socket(<blah>);
connect(s, <blah>);
recv(s, e, 256, 0);

Anyhow, it sounds like there’s just enough groundwork here that we’d be able to make very meaningful contributions within the time frame available. I’m curious whether your comment on inter-procedural analysis means it’s not currently implemented in the Clang analyzer at all, or that it’s limited to specific checks right now. It seems like it would be near-critical to some checks.

In any case, thanks again. We need to create a specific project proposal (in the next few days), which is why I’m asking for more specifics on what is or is not checked currently. If there’s any documentation you could point me at that would help too. I looked at the source file but it doesn’t describe any limitations of the environment in which it operates.

Chris Hacking

Another big missing feature is reasoning about (linear) constraints between symbolic values. For example:

p = malloc(len);
if (i > len)
p[i] = …

Right now we track range constraints on individual symbolic values (e.g., the values of ‘i’ and ‘len’ respectively), but don’t track linear constraints between symbolic values. Doing so really is a prerequisite for doing interesting buffer overflow checking on arrays of dynamic size.

Tracking linear constraints between symbolic values would likely be a big win in the analyzer’s overall precision, and not just useful for buffer overflow checking. Doing this would reduce down to implementing a new ConstraintManager (that possibly was built on the currently existing ones).

Hi Zhongxing,

Thanks for the quick response. If I’m understanding you correctly, here are couple of cases that the bounds checker would miss:
void setfive (int* x) { x[5]=5; }
int a[4]; setfive(a);
(due to lack of inter-procedural analysis), and
int b[5]; int x;
for (x=0; x<=5; x++) { b=x; }
(due to lack of loop analysis).

Yes, clang would miss these cases. The reason is exactly what you pointed out.

If I’ve misunderstood, please let me know. Considering the problem briefly, a few other areas that are potentially problematic are alias analysis:
int c[5]; int* y = c; y[5]=50;
pointer arithmetic:
char s[10]; s++; s[9] = ‘\0’;
dynamically sized arrays:
int len = atoi(“5”); char* str = malloc(len); str[len] = ‘\0’;
although all could be interesting extensions to current checks.

Clang can process and track the information for these cases well. The
first two is pointer assignment and arithmetic. For the third one we
need to model atoi() (we haven't for now). If 'len is replaced by 5,
clang can detect it.

While the exact implementation for handling the string manipulation functions would require some thought (and examination of the current analysis tool), they do seem important to test. The most dangerous buffer overflows typically seem to involve them in some way. To extend that idea somewhat, the string.h functions aren’t the only ones that can trigger buffer overflows; in particular many system calls write into buffers:
char e[255];
int s = socket(<blah>);
connect(s, <blah>);
recv(s, e, 256, 0);

Anyhow, it sounds like there’s just enough groundwork here that we’d be able to make very meaningful contributions within the time frame available. I’m curious whether your comment on inter-procedural analysis means it’s not currently implemented in the Clang analyzer at all, or that it’s limited to specific checks right now. It seems like it would be near-critical to some checks.

Inter-procedural analysis is not used in any checks. There is an
experimental CallInliner.cpp, which implements inlining the callee
into the caller. But it's very preliminary. It's implementation is
hampered by some causal changes to the core analysis engine. For
example, we need to modify the RemoveDeadBindings algorithm to
accommodate the multiple call frame situation.

In any case, thanks again. We need to create a specific project proposal (in the next few days), which is why I’m asking for more specifics on what is or is not checked currently. If there’s any documentation you could point me at that would help too. I looked at the source file but it doesn’t describe any limitations of the environment in which it operates.

I'm sorry that there is no document on the current status of checkers.
They are in the development stage.