Clang Static Analyzer Round Table

Hi CSA developers,

I’ve submitted a round table request for the upcoming Dev Meeting (Nov 17-19). Would be great to have a discussion.
Please also invite colleagues who you think might be interested.

Thanks,
Gabor

+1

+Deep because he expressed interest.

Yay! Yes, absolutely, let's have that.

+Andrew and Bruno who attended our tiny cozy static analyzer round table at the bay area meetup!

Andrew, you have some notes from that round table, do you think it makes sense to share them in this mailing list thread?

  • Randell

Hi Randell,

Thank you for your email, I am forwarding it to the list.

Is the list set up to block new subscribers from posting until moderators review?
I really don’t know, maybe someone in the list will know.

Are we having this? If so, when?

It’s 9:30 - 10:30 Wednesday (PST / UTC-8), already up on

Hi!

There are some notes from the CSA roundtable. The meeting was pretty long and we had a chance to get a glimpse into GCC’s analyzer as well, thanks a lot to David for sharing all that info! The notes might be a bit spotty, feel free to add more info in replies or correct anything that I got wrong.

People introducing themselves
Question from the chat: Is analyzer for LLVM IR? No, for source code.
Discussing best way to invoke the analyzer using command line, scan-build, codechecker.
Should we adopt -fanalyze from gcc? It is good for compatibility but we might have different output so we are conflicted. Want to keep this option open.
David, GCC’s analyzer’s author is interested in implementing SARIF so we have a common output format across GCC and Clang.
Discussing what are the best way to get started with CSA development
Short status report on Deep’s GSoC project
Can we use the static analyzer to transform source code? Technically, it can emit fixits, but we would not recommend doing it from the analyzer. For fixits you need to understand all execution paths, analyzer does not do that. And it is also often ambiguous on a single path that which step introduced the problem.
Max mentioned using CSA in education. One of the problems he runs into is regarding the dependencies between checks and he would like to have a way to disable some of the sinks.There were some discussions whether it is possible to declare dependencies among checks when using plugins.
Mohannad shared more context why he is interested in source-to-source transformation.
Tim asked how CSA compares to other analyzers like CppCheck and the participants shared some experiences but it is really hard to compare analyzers objectively.
Anton asked if clang-tidy would be a good place to start working on a tool that can rewrite #pragma directives into builtin calls. We believe it would be a better place than CSA.
We discussed why CSA is not the right tool to reason about reachability. We can use ExprInspectionChecker to see what parts of the code are explored by the analyzer.
Max mentions students sometimes hitting some false positives when they implement non-trivial structures like doubly linked lists. The CSA team suggests reporting all of the false positives.
Max also was interested to learn more about the “mark interesting” features. Artem explained how it is used to construct additional notes when building bug reports.
David shared that GCC’s analyzer work on the gimple layer can piggy-back on LTO, but the representation is already somewhat lowered, which can also introduce problems obscuring what the user wrote.
David color codes the exploded graph nodes in the dumps to make it easier to identify points of interests.
We also discussed how loops are modeled (heuristics to completely unroll certain loops, and do something akin to widening in abstract interpretation). After that we briefly discussed supporting inline assembly and doing taint analysis.
Are there benchmarks for the static analyzer? We have some docker based solutions to easily run analyzer over some project, it is in the LLVM repo, utils/analyzer/SATest.py docker.
One can also use csa-testbench from github.
Max brought up the conservative evaluation that invalidates a large part of the execution state. He advocated for emulating certain system calls more precisely to avoid excessive invalidations. Artem mentioned there are already facilities for doing that but there is always some room for improvements.
David described how GCC’s analyzer works on the high level. The transfer functions can generate both mergeable and unmergeable symbolic values to control what states can be merged and what state cannot.
Exceptions are a pain point for both analyzers at the moment. Both analyzers have context sensitive program points for doing inter-procedural analysis.
We discussed the analyzer’s inline defensive check heuristics to suppress false positives.
David explained how he got into creating a static analyzer in GCC (after being the CPython maintainer and being fed up with certain reference counting bugs in badly written CPython extensions and wanted to have an automated solution to find those errors).
In GCC, the states are mutated in place as opposed to in Clang where most data structures are immutable (that are used by CSA). Bifurcating state was not trivial with in-place mutated state.
Using gimple David feels like he side-stepped many problems, e,g, how constructors/destructors are represented is more uniform with regular functions calls. But many aggressive optimizations happen early, like some early inlining. Unfortunately, there is no easy way to get a nice, completely unoptimized IR in GCC at the moment. There is a mark-and-sweep GC in GCC that is used in various places. But the analyzer is rarely using the GC for cleaning up memory, it is mostly relying on RAII.
In CSA many of the objects are reference counted. The symbolic values in CSA are pass by value.
Symbolic values are internalized in a manager for GCC, so there is only one instance of every symbolic expression, they are deduplicated.
GCC’s analyzer looks into the memory space and if they are different, it assumes they will not alias.
We discussed how CSA’s model is bad at recovering from wrong assumptions about aliasing.
GCC’s analyzer has a way to trigger a break when reaching a certain program point marked with analyzer_break().
Ideas to deal with aliasing:

  • prepass to check for equality checks
  • or add new nodes with the equality assumption to the worklist (a new entry to the function), and do not analyze the branch with the wrong assumption
    GCC’s static analyzer also has plugins, and there is an example about holding the global lock for CPython modules.
    There are many attributes that can describe the behavior of certain functions. It would be nice to coordinate on them so both GCC and Clang can understand the same set of attributes.
    We discussed Microsoft SAL, attributes, contracts to describe function behavior.
    Clang analyzer supports noreturn, nullability, and some other check-specific attributes.
    GCC has some access attributes to describe buffer accesses. Microsoft is using annotations internally extensively to annotate buffer operations.
    Clang analyzer does not really ask for buffer sizes since it is not doing really deep analysis of loops. It will not attempt to do buffer overflow detection (by default).
    An important attribute to support is the cleanup attribute to avoid false memory leak warnings.
    We described the facilities we have to explain warnings to users, how we construct bug reports. David suggests looking into Explaining static analysis with rule graphs paper.
    Both GCC and Clang deduplicate similar warnings and try to present the one with the shortest feasible path. GCC also tracks the object of interest during path diagnostic creation while walking backwards.
    It would be great to have a dynamic UI for warnings where the user can explore details on demand rather than overwhelming users with information from the ground up. Artem thinks a static analyzer should have an interface which is similar to a time travel debugger. David mentioned it would be fun to have a GDB target so we can use GDB to step through a trace.
    David worked alone on GCC’s analyzer, and only worked for 2,5 years on it. But he had a GSoC student this summer to help him out.
    Artem mentioned that we do many tuning to eliminate false positives and that tuning is depending on what projects the devs used to test the analyzer.
    In GCC, originally every entity like regions or symbolic values had ids, and used those ids to make comparisons easier. Later the ids were replaced with pointers and the code got simpler and a category of bugs disappeared. The meaning of regions also changed over time. In GCC 10 things were stored into the region. In GCC 11 the region is just a description of how to access memory. Two regions can be different descriptions of the same thing and there is a separate mapping from regions to values. That made regions immutable.
    We were covering what kinds of symbolic values are there in GCC and Clang.
    GCC has different regions for label and function pointers. It is fairly similar to Clang’s, maybe GCC is a bit more fine grained.
    David was also maintaining the diagnostic subsystem in GCC and implemented the caret style diagnostics.