I've got an idea. The idea comes as a replacement for fixing current issues with the GDM (sharing traits across checkers, automating the routine work on checker side). The idea is: GDM is not the right place to model most of the checker-controlled program state traits, instead we should encourage checkers to use the Store, which already has most of the necessary features.
Specifically, whenever the checker tries to model a certain data structure, which exists (or could have existed) somewhere in the program (be it user or kernel memory, or CPU registers, or shadow memory of some sanitizer-like tool which could have been run alongside the program), then the checker should model this data structure directly in the Store(?)
This approach may be thought of as bringing the idea of "ghost code" from model checking into our analyzer. We're supposed construct ghost structures during analysis, which mimic real or imaginary things that happen during program execution.
With the current region hierarchy, it's a bit hard to follow this approach, because memory regions are hard to construct. So my proposal is to add a few more region classes, which would allow checkers to cheaply, probably even transparently, construct the necessary structures in the Store.
A few reasons why i ended up liking this idea:
1. It's natural. After all, the program doesn't have anything like GDM when it executes dynamically - it only has memory, which means that the Store, being our memory model, should easily take care of modeling anything that is going on in the program, and useful features of the Store are still useful for such modeling, so why don't we use them. In other words, the most natural way to explain how exactly should the smart data map automate routine tasks would be to say that it should behave exactly like the Store. So why don't we just use the Store.
2. There's no problem with sharing the trait. Any checker can see what's going on with a plain getSVal(). Of course, we can hide the key for getSVal() in anonymous namespace and probably provide getters/setters instead, which wouldn't stop the Store from iterating over bindings.
3. There's no problem with routine work and code duplication in checkers, which makes writing checkers a lot easier:
a. Store transparently takes care of cleaning up live/dead symbols, there's no more problems with different worklists for store and checkers.
b. The store takes care of invalidation, eg. checkRegionChanges could probably be automated with this approach, by simply choosing a good parent region for our checker data structures.
c. Also, Store provides us with SymbolRegionValue-class and SymbolDerived-class symbols for our structures when values in them are unknown; this way we can reason if it's the same unknown or a probably-different unknown.
d. The store can be easily dump()'ed, we no longer need hooks for printing the program state on checker side.
e. In the future, we can probably add alpha-renaming, and then problems and solutions would most likely be exactly the same for the Store itself and for checker data structures.
4. This approach is very easy to implement and to explain. Most of the time, we just use the standard getSVal()/bindLoc() API for everything, probably some overrides.
A few arguments against this approach:
1. Some program state traits do not try to model data structures that already exist in the program. The best example is probably range constraints. Constraints are immaterial for dynamic execution - they're a meta-information about chosen execution path. For such traits, i do not yet present a solution, though i have a couple of thoughts - if anybody's curious, i can dump. In particular, i feel as if constraint-like traits can be easily supported with a separate generic approach, and it'd cover all of our problems.
2. GDM is done in terms of sets and maps, however Store doesn't have sets or maps, it only works with C-like memory with at best 1-dimensional arrays. Of course we can model maps of ints to ints as ghost arrays with symbolic bindings to elements, and it's a safe and sane implementation. For a map from int to set of ints, it gets harder. But i think it's a non-issue: it's still more important to figure out how do we model more complicated containers (i still dream of container symbols and container constraints, which would reduce this issue to issue '1.'), and then we could re-visit this issue.
I've made proof-of-concept patchs to SimpleStreamChecker (two different approaches) and PthreadLockChecker. The patch introduces four kinds of ghost regions - each "smart trait" owns its own memory space which doesn't get invalidated (for things that don't need to be invalidated, or for manual control), and allows creating one variable inside that (or any other) memory space, one field inside any region, or one symbol-based region for any symbol (unlike SymbolicRegion, the ghost symbol-based region doesn't need to be based on a pointer-type symbol, any symbol will do fine).
The patches are on my github because i'd re-arrange them if i was to put them on phabricator, and not sure what to do with multiple checker variants. So i'm asking to have a look at the checkers and see if you want to write checkers *that* way, instead of the traditional GDM, then i'd focus on fixing smaller issues, or i'd have to come up with something different.
SimpleStreamCheckerV2 defines symbol-based ghost regions for file descriptor symbols and stores constant values (0 or 1) inside them, indicating opened and closed streams. It doesn't need to store the "unknown" state - it is represented by non-constant values, eg. SymbolRegionValue of the ghost region. Link: [analyzer] GhostSymbolicRegion: Initial implementation · haoNoQ/clang@578e790 · GitHub
SimpleStreamCheckerV3 defines a ghost field inside each symbolic region corresponding to the each stream's pointer symbol, and stores the same constants inside these fields. It has troubles suppressing invalidation of these fields, which, in my opinion, should ideally be solved through evalcall/bodyfarm of library functions. Both versions do not suffer from the "zombie symbols" issue. Link: [analyzer] GhostFieldRegion: Initial implementation · haoNoQ/clang@bb63471 · GitHub
PthreadLockCheckerV2 defines a ghost field inside each mutex, indicating one of the four states (Initialized, Locked, Unlocked, Destroyed), which allows it to find double locks/unlocks etc. Additionally, it models the stack of locks in order to find lock order reversals. This is modeled through a single ghost variable of type `void**', which represents the stack pointer - the checker dereferences this pointer to access or modify the stack and increments/decrements it. Link: [analyzer] GhostVarRegion: Initial implementation · haoNoQ/clang@ab581e4 · GitHub
None of these checker's code actually deals with ghost regions through MemRegionManager - they always use wrappers in the ProgramState class, which in my opinion are easy and intuitive, though probably more such wrappers need to be introduced.
All the newly introduced terminology is worth discussing, i don't really like it.