Static analysis output format

As we've been working through the list of results from static analysis for Adium, it's become increasingly clear that the output format is introducing some complications. Specifically, each time we rerun (whether to use an updated version of the checker, or to check against the latest source) it eliminates any metadata that we've built up around the results, such as which ones were false positives.
  Unfortunately, fixing this seems somewhat tricky. The main thing that would be necessary is a way of identifying results across runs. That way we can plug this into our automated testing system so each time we commit it can rerun and say "ok, these ones are known, these ones are known false positives, and these ones are new" rather than just "here's a list to re-evaluate". I'm not sure how to come up with some sort of identifier for issues though. Line numbers probably change too frequently to be reliable. I suppose a heuristic based on function name, issue type, file name, and approximate line number might be fairly accurate.

David

>   As we've been working through the list of results from static
> analysis for Adium, it's become increasingly clear that the output
> format is introducing some complications. Specifically, each time we
> rerun (whether to use an updated version of the checker, or to check
> against the latest source) it eliminates any metadata that we've built
> up around the results, such as which ones were false positives.
>   Unfortunately, fixing this seems somewhat tricky. The main thing that
> would be necessary is a way of identifying results across runs. That
> way we can plug this into our automated testing system so each time we
> commit it can rerun and say "ok, these ones are known, these ones are
> known false positives, and these ones are new" rather than just
> "here's a list to re-evaluate".

I believe this is a necessary feature, and one that will take several iterations to get right.

> I'm not sure how to come up with some
> sort of identifier for issues though. Line numbers probably change too
> frequently to be reliable. I suppose a heuristic based on function
> name, issue type, file name, and approximate line number might be
> fairly accurate.

This seems like a very reasonable heuristic. Even eliding the line number might be fine for now.
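Just to make the heuristic concrete, here is an untested sketch in Perl (the field names are invented, and the line number is deliberately left out, per the above):

use Digest::MD5 qw(md5_hex);
use File::Basename qw(basename);

# Stable ID for an issue: hash the file's basename (so moving the
# checkout doesn't change IDs), the enclosing function name, and the
# issue type. The line number is deliberately omitted so edits
# elsewhere in the file don't shift IDs between runs.
sub issue_id {
    my ($file, $funcname, $bugdesc) = @_;
    return md5_hex(join "\0", basename($file), $funcname, $bugdesc);
}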

BTW, some of this metadata can easily be grepped right out of the HTML file. This is exactly what scan-build does to build the index.html file. For example:

$ grep BUG report-wEXcKk.html
<!-- BUGPATHLENGTH 2 -->
<!-- BUGLINE 15 -->
<!-- BUGFILE /Volumes/Data/Users/kremenek/Desktop/MyClass.m -->
<!-- BUGDESC Memory Leak -->
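
Pulling those comments back out is only a few lines of Perl; e.g., something like this (untested) collects them into a hash:

# Turns e.g. "<!-- BUGDESC Memory Leak -->" into $meta{DESC}.
my %meta;
open my $fh, '<', 'report-wEXcKk.html' or die $!;
while (<$fh>) {
    $meta{$1} = $2 if /<!-- BUG([A-Z]+) (.*?) -->/;
}
close $fh;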

We can easily include other metadata, such as the function/method name where the bug occurs, a cryptographic hash of the source file (or function) that contained the bug, etc.
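(For the hash, the core Digest::SHA module would do the job; an untested example, reusing the BUGFILE path from above:)

use Digest::SHA;

# Path as it appears in the BUGFILE comment above.
my $bugfile = '/Volumes/Data/Users/kremenek/Desktop/MyClass.m';
my $digest  = Digest::SHA->new('sha1')->addfile($bugfile)->hexdigest;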

Aside from your own automated testing tools, ideally we want the HTML output that the tool (scan-build) produces to allow users to triage and navigate bugs across runs. This is an important feature, but not immediately high on the priority list. Much of the heavy lifting would probably be done in scan-build (which is currently written in Perl), where the summary HTML pages are generated.

Anyone with Perl and HTML knowledge is welcome to provide patches to improve this aspect of the system without needing any knowledge of how the analyzer works (metadata embedded in the report-XXXXX.html files that would be useful for building such features into scan-build can be added on demand).

Moreover, scan-build can be completely rewritten to provide a more advanced system for triaging bugs if anyone is interested in undertaking such a project.
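
(To sketch the cross-run triage David describes, assuming each run dumps a list of issue IDs and the user keeps a list of IDs they've marked as false positives; all of the names and IDs below are invented:)

# IDs would come from something like issue_id() above.
my @previous_ids = qw(3f2a 9c1d 77e0);
my @false_pos    = qw(9c1d);
my @current_ids  = qw(3f2a 9c1d b4a9);

my %old; $old{$_} = 1 for @previous_ids;
my %fp;  $fp{$_}  = 1 for @false_pos;

# Classify each result from the current run.
for my $id (@current_ids) {
    if    ($fp{$id})  { print "known false positive: $id\n"; }
    elsif ($old{$id}) { print "known issue: $id\n"; }
    else              { print "NEW issue: $id\n"; }
}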

Ted

BTW, when I say it's not "immediately high on the priority list", I'm referring to my own queue. I'm more than happy to work with others to move this feature along.