GSoC Proposal: Buffer

Hi, all. I am a third-year CS undergrad at UC Berkeley, and I'm interested
in doing a Google Summer of Code project for Clang. My primary programming
experience is in Mac OS X programming, but I've also done work in several
other languages. During the last two summers I interned at Apple, first in
the AppKit frameworks group and then in the Xcode group. As part of the
first internship, I worked on modifying the Regex component of the ICU
libraries to use the ICU abstract type UText, rather than requiring the use
of UTF-16 arrays. (The final version of this is planned for ICU 4.6.) In my
"spare time", I write little utilities for OS X, which are available at
http://belkadan.com.

Clang and its static analyzer have been great to have for OS X
programming, but it wasn't until last week that I actually got involved. I
submitted a patch to fix one XFAILing test (Analysis/no-outofbounds.c),
then spent the weekend working out an understanding of another
(PCH/source-manager-stack.c), unfortunately without coming up with a good
answer. Even so, my main focus of interest is on programming languages and
on /how/ people program, which makes Clang a very promising project!

My project proposal is to make buffer handling much smarter, particularly
for char* strings. Currently a snippet like this won't even warn under
checker analysis (unless I missed some flags):

char buf[5], shortBuf[2];
memcpy(buf, shortBuf, 2); // copy of uninitialized values
strcpy(buf, "12345"); // overflow
gets(buf); // overflow (?) / Bad Idea
strncpy(buf, "12345", 6); // overflow
strcpy("12345", buf); // copy into a const buffer

The fix is twofold. The first part is to add attributes that require buffer
contents to be defined, or that specify whether a given procedure will (or
won't) modify a buffer (similar to C++ const). This is relatively simple and
similar to other sorts of analysis warnings currently provided by the analyzer.
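
As a rough illustration of how such attributes might read at the source
level, here is a minimal sketch. The names BUF_REQUIRES_INIT,
BUF_INITIALIZES and BUF_READONLY are hypothetical, made up for this example.
They are not existing Clang attributes, and they expand to nothing so that
the code compiles as-is. The helper functions are likewise only for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical annotations (NOT real Clang attributes). They expand to
   nothing, so this compiles with any C compiler. */
#define BUF_REQUIRES_INIT  /* callee reads the buffer: contents must be defined */
#define BUF_INITIALIZES    /* callee writes the buffer: defined on return */
#define BUF_READONLY       /* callee promises not to modify the buffer */

/* Copies n bytes from src to dst and returns dst. */
void *copy_bytes(BUF_INITIALIZES void *dst,
                 BUF_REQUIRES_INIT BUF_READONLY const void *src,
                 size_t n) {
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, n);
}

/* Returns the length of s, which must already hold a defined string. */
size_t string_length(BUF_REQUIRES_INIT BUF_READONLY const char *s) {
    assert(s != NULL);
    return strlen(s);
}
```

With annotations along these lines, a call such as copy_bytes(buf, shortBuf,
2) with an uninitialized shortBuf is the kind of thing the analyzer could
flag, since the source argument is marked as requiring defined contents.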

The second part of the fix is to add bounds-tracking to buffers. That is,
it should be possible to catch cases like this:

char *buf = malloc(i);
buf[i] = '\0';

Adding this to string handling procedures (such as strncpy()) could catch
plenty of errors. If this functionality is exposed as a source-level
annotation, this could catch all sorts of potential buffer overflows.
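
To make the idea a bit more concrete: the check amounts to comparing an
access against the known extent of the allocation. The sketch below models
that comparison at run time. The Extent type and the helper names are
invented for illustration and are not part of the analyzer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what would be tracked for a buffer:
   the number of bytes known to be allocated. */
typedef struct {
    size_t size;
} Extent;

/* True if accessing a single byte at idx stays inside the buffer. */
bool index_in_bounds(Extent buf, size_t idx) {
    return idx < buf.size;
}

/* True if writing count bytes starting at offset stays inside the
   buffer. Written so that offset + count cannot wrap around. */
bool write_in_bounds(Extent buf, size_t offset, size_t count) {
    return offset <= buf.size && count <= buf.size - offset;
}

/* Stores c at buf[idx], aborting via assert on an out-of-bounds index. */
void checked_store(char *buf, Extent extent, size_t idx, char c) {
    assert(index_in_bounds(extent, idx));
    buf[idx] = c;
}
```

For the malloc(i) example, the extent is i and the index is also i, so
index_in_bounds() is false no matter what i is. Likewise, strncpy(buf,
"12345", 6) into a char buf[5] corresponds to write_in_bounds() with size 5,
offset 0 and count 6, which is also false.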

The project would proceed in three stages:
1. Add initialization state checking for functions like memcpy, strcpy,
fgets, etc. which either initialize the buffer or require an initialized
buffer.
2. Allow buffer sizes (and possibly any integer expressions) to be bounded
above and below, instead of just fixed, not-equal, or unknown. This would
even include related parameters (e.g. the length of argv is argc+1).
3. Time permitting, allow bounds to be path-based. "Time" here refers both
to "time during the summer" and "analysis time for a potentially
exponential operation".
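
For stages 2 and 3, one way to picture "bounded above and below" is as a
closed interval per integer value, narrowed on branches and widened where
paths merge. This is only a sketch under that assumption. The Range type and
all function names are hypothetical, and it says nothing about how the
analyzer represents constraints today.

```c
#include <assert.h>

/* Hypothetical closed interval [lo, hi] of values an integer may take. */
typedef struct {
    long lo;
    long hi;
} Range;

/* Narrow r on the branch where "value < limit" is known to be true. */
Range assume_less_than(Range r, long limit) {
    assert(r.lo < limit);   /* otherwise the branch is infeasible */
    if (r.hi > limit - 1)
        r.hi = limit - 1;
    return r;
}

/* Where two paths merge: the smallest range covering both inputs. */
Range range_join(Range a, Range b) {
    Range out;
    out.lo = a.lo < b.lo ? a.lo : b.lo;
    out.hi = a.hi > b.hi ? a.hi : b.hi;
    return out;
}

typedef enum { ACCESS_SAFE, ACCESS_MAYBE_BAD, ACCESS_ALWAYS_BAD } Verdict;

/* Classify buf[idx] given ranges for the index and the buffer size. */
Verdict classify_access(Range idx, Range size) {
    if (idx.lo >= 0 && idx.hi < size.lo)
        return ACCESS_SAFE;          /* in bounds for every possible value */
    if (idx.hi < 0 || idx.lo >= size.hi)
        return ACCESS_ALWAYS_BAD;    /* out of bounds for every possible value */
    return ACCESS_MAYBE_BAD;
}
```

With a buffer of size 5 and an index known only to be in [0, 9], the access
is "maybe bad". After a check like "if (i < 5)", the index narrows to [0, 4]
and the access becomes safe. Plain intervals don't capture relations between
two unknowns (such as an index that equals the allocation size), which is
part of why the path-based stage is listed separately.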

This seems like a useful feature both for new programmers and for
hardening existing codebases. What makes it interesting, though, is the
possibility of source-level annotations, e.g. declaring that a function
requires a buffer to be initialized.

Any comments, suggestions, etc.?

If no one finds this interesting, I am also interested in the
documentation project listed on the "Open Projects" page. (I wrote a toy
documentation generator several years ago.) While I'm not as intrigued by
this project, it would be useful to have documentation be able to take
advantage of scoping rules. Of course, it would also be a great help to not
have the documentation tool lagging behind the compiler concerning language
changes.

Apologies for the very long e-mail, and thanks!
Jordy Rose