Possible clang features

Here's a couple of possible ideas. I'm sending them directly to you so you can either put them on the TODO list or toss them based on their merits.

Hi John,

Sorry for the late response. I've CC'ed the cfe-dev list so that others can chime in as well.

1) Stores that cause GC qualifications to be dropped.

This has bitten me so many times. The gist is to catch things that are roughly like:

NSString *contextString = @"someContext";
void *context = contextString;

Ditto for returning values that drop a GC qualification, like:

char *someFunction(void) {
__strong char *buffer = NSAllocateCollectable(4096, 0); // options 0 = unscanned memory
...
return(buffer);
}

Adding checks for various GC-related properties and invariants is something that we think static analysis could excel at. Several people have voiced interest in such checks. Since there are a variety of GC-related checks, I believe the best way to start implementing these is to come up with a list of specific checks and then go from there.

To build such a task list of GC checks, probably the best thing to do is to start filing Bugzilla reports (feature requests) against the static analyzer. That way I or anyone else interested in implementing a check can go to the list, see a complete specification for the check, and go and implement it.

2) Peephole optimization recommendations.

Since you've got a fully parsed program, it should be possible to match some generic patterns and offer up some optimization advice. I'm thinking of simple things like:

for(NSUInteger x = 0; x < [anArray count]; x++) {

}

The '[anArray count]' can 'obviously' be hoisted out of the loop, but the standard compiler can't make that optimization due to the dynamic nature of method dispatch. Your tool, however, can make the recommendation that the programmer take a look at it and offer something like the following advice:

Possibly rewrite this for() loop as:

NSUInteger anArrayCount = [anArray count];
for(NSUInteger x = 0; x < anArrayCount; x++) {

}

You could even have different levels of 'optimization aggressiveness'. Another common trick I use is to replace '[anArray objectAtIndex:X]' with 'CFArrayGetValueAtIndex(anArray, X)'. Inside loops, this can often be a pretty big performance win.

Heh, actually, some of these things would be ideal inside Shark. It would have easy access to the hotspots, and it could offer this kind of 'code cruftifying' optimization advice only in hotspot areas.

I think you get the idea though. There's definitely a couple of 'low hanging fruit' items that should be trivial to implement. clang is the perfect place to put them in where they can be consistently applied and identified.

This is a great example of automatic code refactoring combined with static analysis and profiling information. Performing "unsound optimizations" by suggesting changes to the source code is something I'd like to see more of. It's a little risky, but if done right it could yield some huge performance improvements for some programs, especially those written in Objective-C (where, as you said, the dynamic nature of the language makes it difficult for the compiler to do the optimizations you mentioned).

Although we don't have an extensive refactoring library in clang right now, we do have some pieces to support refactoring, including a textual code rewriter that can rewrite fragments of the code in place (preserving comments, macros, etc). Adding more high-level interfaces to support refactoring applications like these would be a great contribution to clang, and my design for the static analysis library in Clang is to use it for a variety of applications (not just finding bugs).

This would actually be a really interesting project to work on if you (or others) were interested.

3) Parsed code export.

This one is sort of a 'pie in the sky' idea based on a need of mine. I mention it because I'm pretty sure you've got the bulk of the machinery in place to do all your other checks.

For one of my projects, RegexKit (the non-Lite version), I ended up writing a documentation system to go along with it (RegexKit Documentation). HeaderDoc just didn't cut it for what I needed. It still follows the HeaderDoc /*! @TAG */ style to a large degree, but I ended up tweaking a few things here and there as the need came up. I think you can feed RegexKit's headers in to HeaderDoc unmodified and still get something reasonable out the other side. It wouldn't take much to write some kind of 'scrubber' script so it was 100% HeaderDoc compatible.

This wasn't something that was planned out, it just sort of grew organically as needed. One of the first needs was to be able to generate a Table Of Contents from all the source headerdoc commented files. Other stuff got bolted on from there, the end result being about what you'd expect from one hack slapped on top of another hack.

The basic idea is to stuff everything in to an SQLite database. There's a (very messy) perl script that 'parses' header files. Parses is used only in the loosest possible terms because what it really is is a bunch of regex pattern matches that match things that are 'close enough'. The headerdoc comments are easy to find (/\/\*!(.*?)\*\//), and some of the other bits and pieces are fairly easy to find, such as method and function prototypes. Since it's really just a bunch of regex patterns, it can be confused easily and is sort of fragile in the face of big changes.
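A minimal sketch of that "regex plus SQLite" approach, written in Python rather than perl purely for illustration. The table layout, the function names, and the sample header are assumptions made up for this example and are not taken from the actual RegexKit scripts. The pattern is the one quoted above, with "dot matches newline" turned on so that multi-line comments are caught.

```python
import re
import sqlite3

# Same idea as /\/\*!(.*?)\*\//, with DOTALL so '.' also matches newlines.
HEADERDOC_COMMENT = re.compile(r"/\*!(.*?)\*/", re.DOTALL)


def extract_headerdoc_comments(header_text):
    """Return the body of every /*! ... */ comment, whitespace-trimmed."""
    return [m.group(1).strip() for m in HEADERDOC_COMMENT.finditer(header_text)]


def store_comments(db, filename, comments):
    """Insert one row per comment into a hypothetical 'comments' table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(file TEXT, position INTEGER, body TEXT)"
    )
    db.executemany(
        "INSERT INTO comments (file, position, body) VALUES (?, ?, ?)",
        [(filename, i, body) for i, body in enumerate(comments)],
    )
    db.commit()


# Hypothetical header contents, only here to exercise the functions.
SAMPLE_HEADER = """
/*! @function doSomething
    @abstract Does something. */
void doSomething(void);

/* an ordinary comment, ignored */

/*! @typedef Thing */
typedef int Thing;
"""

db = sqlite3.connect(":memory:")
comments = extract_headerdoc_comments(SAMPLE_HEADER)
store_comments(db, "Sample.h", comments)
rows = db.execute(
    "SELECT position, body FROM comments ORDER BY position"
).fetchall()
```

Like any pattern-matching approach, this is easy to fool: a "/*!" inside a string literal would be picked up as a comment, which is exactly the fragility described above.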

What would be nice is for a real grammar to parse the header and spit out the parsed structure in some kind of 'easy to use' format. You could then write a perl script that would scoop up the easy to use output and do whatever it needed to.

For example, a method prototype would get decoded in to its basic parts (I'm completely making this up as I type this, so don't expect anything reasonable).

- (NSArray *)componentsSeparatedByString:(NSString *)separator;

This is an NSString instance method. To a full blown parser, it's trivial to separate out the different parts. It returns a type of 'NSArray *', and the parser knows that NSArray is a class. 'separator' is an argument that is a 'NSString *' type, and NSString is a class.

Basically, some kind of output where all that heavy lifting is done for you. It's also handy to have back references to where something was defined, such as a class or typedef.
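To make the "decoded in to its basic parts" idea concrete, here is a rough Python sketch of what such output might look like for the prototype above. The dictionary layout and the function name are invented for illustration, and the decoding is done with regex heuristics, not a real grammar. It cannot tell whether NSArray is a class or a typedef, which is precisely the information a real parser would add.

```python
import re

# "- (ReturnType)rest;" or "+ (ReturnType)rest;"
PROTOTYPE = re.compile(r"^\s*([-+])\s*\(([^)]*)\)\s*(.*?)\s*;?\s*$")
# One "selectorPart:(Type)argumentName" chunk.
PART = re.compile(r"(\w+)\s*:\s*\(([^)]*)\)\s*(\w+)")


def decode_prototype(line):
    """Split an Objective-C method prototype into a hypothetical dict layout."""
    m = PROTOTYPE.match(line)
    if m is None:
        raise ValueError("not a method prototype: %r" % line)
    kind, return_type, rest = m.groups()
    parts = PART.findall(rest)
    if parts:
        selector = "".join(name + ":" for name, _, _ in parts)
        arguments = [{"name": arg, "type": typ.strip()} for _, typ, arg in parts]
    else:
        selector = rest  # no arguments, e.g. "- (NSUInteger)length;"
        arguments = []
    return {
        "kind": "instance" if kind == "-" else "class",
        "return_type": return_type.strip(),
        "selector": selector,
        "arguments": arguments,
    }


decoded = decode_prototype(
    "- (NSArray *)componentsSeparatedByString:(NSString *)separator;"
)
```

Types that contain parentheses themselves (function pointers, for instance) will break these patterns, so this is only a stand-in for the structured output a real front end could provide.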

In my particular system, this is 'approximately' what happens, modulo the fact that it uses a couple of heuristics instead of the actual syntax structure to derive some of its information. The 'parse' script extracts the information and stuffs it in to an SQLite database. HTML generation happens only after all the .h files have been read in.

HTML generation uses the 'parsed' structure to assist in formatting things inside the HTML. Because everything is inside a database, when it comes time to output a 'pretty' method prototype in the HTML, it can scan the types for types that exist in the database and automagically place a link around that type that points to the documentation for that type. Other fancy bits are also possible, such as printing the arguments in italic style. Here's an example of the HTML:

<div class="block method">
<div class="section name"><a name="NSMutableString_RegexKitLiteAdditions__-replaceOccurrencesOfRegex:withString:options:range:error:">replaceOccurrencesOfRegex:withString:options:range:error:</a></div>
<div class="section summary">Replaces all occurrences of the regular expression <span class="argument">regex</span> using <span class="argument">options</span> within <span class="argument">range</span> with the contents of <span class="argument">replacement</span> string after performing capture group substitutions, returning the number of replacements made.</div>
<div class="signature"><span class="hardNobr">- (NSUInteger)<span class="selector">replaceOccurrencesOfRegex:</span>(NSString *)<span class="argument">regex</span></span> <span class="hardNobr"><span class="selector">options:</span>(<a href="#RKLRegexOptions" class="code">RKLRegexOptions</a>)<span class="argument">options</span></span> <span class="hardNobr"><span class="selector">withString:</span>(NSString *)<span class="argument">replacement</span></span> <span class="hardNobr"><span class="selector">range:</span>(NSRange)<span class="argument">range</span></span> <span class="hardNobr"><span class="selector">error:</span>(NSError **)<span class="argument">error</span>;</span></div>
</div>

It's a bit messy raw, but the basics are there. CSS is used extensively to control the visual aspect. The 'hardNobr' class is '.hardNobr { white-space: nowrap; }', which forces the rendered output to not be broken up. In this case, it's used to keep 'logically similar' elements together during word breaking. It looks kinda ugly to break on the space in '(NSString *)'. :)

Anyways, I think you get the idea. By understanding the underlying syntactical structure, it's much easier to automatically reformat things in a pleasing documentation friendly way. It also means someone can declare the method any way they want, with whatever whitespace/newline formatting they want, and I can still squeeze things back in to a documentation consistent form.
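The "scan the types and place a link around the ones in the database" step could look roughly like the Python sketch below. The 'types' table, its single column, and the function name are assumptions for this example, and the anchor format simply mirrors the '#RKLRegexOptions' link in the HTML above.

```python
import html
import re
import sqlite3


def link_known_types(prototype, db):
    """Return HTML for a prototype fragment, linking any documented type."""
    known = {row[0] for row in db.execute("SELECT name FROM types")}
    out = []
    # Splitting on a capturing group keeps both the identifiers and the
    # punctuation between them, in their original order.
    for piece in re.split(r"(\w+)", prototype):
        if piece in known:
            out.append('<a href="#%s" class="code">%s</a>' % (piece, piece))
        else:
            out.append(html.escape(piece, quote=False))
    return "".join(out)


# Hypothetical database holding the names of documented types.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE types (name TEXT PRIMARY KEY)")
db.execute("INSERT INTO types (name) VALUES ('RKLRegexOptions')")

linked = link_known_types("options:(RKLRegexOptions)options", db)
unlinked = link_known_types("(NSString *)regex", db)
```

Matching on bare identifiers is a heuristic: an argument that happened to share its name with a documented type would get linked too, which a tool working from real syntax information would avoid.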

I found this to be of particular use for 'enum' and 'struct' like definitions. As an example, for enums, it becomes trivial to use a table for the formatting. The identifier goes in to one table column, and the identifier constant goes in to another. This lays out things in a neatly aligned and visually consistent fashion, instead of dumping everything in a big '<pre></pre>' block and hoping for the best.
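As a rough illustration of the enum-to-table idea, here is a Python sketch. The enum, the 'enum' CSS class, and the function names are all hypothetical, and the "parsing" is just splitting on commas, so it stands in for whatever a real parser would hand over.

```python
import html
import re

ENUM_BODY = re.compile(r"enum\s*\w*\s*\{(.*?)\}", re.DOTALL)


def parse_enum(source):
    """Return (identifier, constant) pairs from the first enum in source."""
    m = ENUM_BODY.search(source)
    if m is None:
        return []
    rows = []
    for entry in m.group(1).split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        name, _, value = entry.partition("=")
        rows.append((name.strip(), value.strip()))
    return rows


def enum_table(rows):
    """Render one table row per enumerator: identifier, then constant."""
    lines = ['<table class="enum">']
    for name, value in rows:
        lines.append(
            "<tr><td>%s</td><td>%s</td></tr>"
            % (html.escape(name), html.escape(value))
        )
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical enum, only here to exercise the functions.
SAMPLE = """
typedef enum {
    ExampleNoOptions = 0,
    ExampleCaseless  = 1 << 0,
    ExampleMultiline = 1 << 1
} ExampleOptions;
"""

rows = parse_enum(SAMPLE)
table = enum_table(rows)
```

However the original declaration is spaced or wrapped, the table comes out the same. Comments or macros containing commas inside the enum body would confuse the comma split.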

An unexpected benefit to doing my documentation this way was that when 10.5 came out with DocSet integrated documentation, it was just a matter of writing up a perl script that extracted the information from the database and output it in to a form docset understands. I literally had working, Xcode integrated docset documentation in under a day.

Again, this is 'pie in the sky' type of stuff. It's one of those things where either you think 'Oh, that's trivially easy!' and you've got some rough code in about 20 minutes, or it's a big project. There's lots of neat things (like this documentation example) that you can do if you have access to an easy to use 'parsed structure' output where all the heavy lifting of parsing the file has been done for you.

Your point about 'parsed code export' I believe touches on a larger issue: people want to build a variety of tools that reason about or manipulate source code. This has been hard in the C/C++/Objective-C space because either the frontend technology is intertwined into the implementation of a larger component (e.g., a compiler) or the ASTs (or whatever other structured code representation) cannot be persistently stored and used by other clients.

Clang is designed to obviate both issues. First, Clang is built as a set of libraries. The lexer, preprocessor, parser, type checking, and ASTs are all represented as libraries, modules that can be easily linked into any tool that wants to use them, be it a command line compiler driver, an IDE, or some other tool. This design is intentional, and it follows the same guiding principles as the rest of LLVM. By having a library-based design, people can use the pieces they want to build their own tools.

We are also working on making Clang's internal representation of its ASTs persistent. We have support for pretty-printing and textual dumping (some of it is not completely implemented, but it's getting there) so that clients can view a dump of some of Clang's internal data structures right from the command line. Other tools that wish to use information from Clang but not use its libraries directly can potentially use such output. You can also define that output in any way you wish by adding an appropriate ASTConsumer to the clang driver.

We've also built serialization support into Clang to serialize its ASTs out to disk. This isn't 100% there, but it will provide the basis for PCH support in Clang, and the static analyzer will also use serialized ASTs to perform inter-procedural analysis across files. Such persistent ASTs could be inserted into a database, sent across a network connection, etc.

Your example of building an automatic documentation generating system (e.g., HeaderDoc or doxygen) is a great one: it is exactly the kind of tool that reuses the Clang libraries for a purpose other than compilation. The current ASTConsumer interface used by the Clang driver would actually be a great place to start if you wanted to build such a tool, as clients are essentially streamed the AST of a source file and they can do what they like with it. As we bring up more support for persistent ASTs on disk (especially for use with inter-procedural analysis), a documentation generating tool could use those serialized ASTs to perform accurate cross-referencing of functions, methods, etc. across files. This would allow the documentation system to generate really accurate information about type hierarchies, implementations of classes, macro information, you name it.

In a similar way, I believe a whole cadre of other tools could be built. We are really excited about Clang because we see it as an enabling technology to build great source-level tools for C/C++/Objective-C.

BTW, I like the idea of a Clang-based automatic documentation system so much that I added it to our list of possible projects for people to do on the Clang website.

- Ted