[analyzer] An idea regarding our data map

I've got an idea. The idea comes as a replacement for fixing current issues with the GDM (sharing traits across checkers, automating the routine work on checker side). The idea is: GDM is not the right place to model most of the checker-controlled program state traits, instead we should encourage checkers to use the Store, which already has most of the necessary features.

Specifically, whenever the checker tries to model a certain data structure, which exists (or could have existed) somewhere in the program (be it user or kernel memory, or CPU registers, or shadow memory of some sanitizer-like tool which could have been run alongside the program), then the checker should model this data structure directly in the Store(?)

This approach may be thought of as bringing the idea of "ghost code" from model checking into our analyzer. We're supposed construct ghost structures during analysis, which mimic real or imaginary things that happen during program execution.

With the current region hierarchy, it's a bit hard to follow this approach, because memory regions are hard to construct. So my proposal is to add a few more region classes, which would allow checkers to cheaply, probably even transparently, construct the necessary structures in the Store.

A few reasons why i ended up liking this idea:

1. It's natural. After all, the program doesn't have anything like GDM when it executes dynamically - it only has memory, which means that the Store, being our memory model, should easily take care of modeling anything that is going on in the program, and useful features of the Store are still useful for such modeling, so why don't we use them. In other words, the most natural way to explain how exactly should the smart data map automate routine tasks would be to say that it should behave exactly like the Store. So why don't we just use the Store.

2. There's no problem with sharing the trait. Any checker can see what's going on with a plain getSVal(). Of course, we can hide the key for getSVal() in anonymous namespace and probably provide getters/setters instead, which wouldn't stop the Store from iterating over bindings.

3. There's no problem with routine work and code duplication in checkers, which makes writing checkers a lot easier:
   a. Store transparently takes care of cleaning up live/dead symbols, there's no more problems with different worklists for store and checkers.
   b. The store takes care of invalidation, eg. checkRegionChanges could probably be automated with this approach, by simply choosing a good parent region for our checker data structures.
   c. Also, Store provides us with SymbolRegionValue-class and SymbolDerived-class symbols for our structures when values in them are unknown; this way we can reason if it's the same unknown or a probably-different unknown.
   d. The store can be easily dump()'ed, we no longer need hooks for printing the program state on checker side.
   e. In the future, we can probably add alpha-renaming, and then problems and solutions would most likely be exactly the same for the Store itself and for checker data structures.

4. This approach is very easy to implement and to explain. Most of the time, we just use the standard getSVal()/bindLoc() API for everything, probably some overrides.

A few arguments against this approach:

1. Some program state traits do not try to model data structures that already exist in the program. The best example is probably range constraints. Constraints are immaterial for dynamic execution - they're a meta-information about chosen execution path. For such traits, i do not yet present a solution, though i have a couple of thoughts - if anybody's curious, i can dump. In particular, i feel as if constraint-like traits can be easily supported with a separate generic approach, and it'd cover all of our problems.

2. GDM is done in terms of sets and maps, however Store doesn't have sets or maps, it only works with C-like memory with at best 1-dimensional arrays. Of course we can model maps of ints to ints as ghost arrays with symbolic bindings to elements, and it's a safe and sane implementation. For a map from int to set of ints, it gets harder. But i think it's a non-issue: it's still more important to figure out how do we model more complicated containers (i still dream of container symbols and container constraints, which would reduce this issue to issue '1.'), and then we could re-visit this issue.

I've made proof-of-concept patchs to SimpleStreamChecker (two different approaches) and PthreadLockChecker. The patch introduces four kinds of ghost regions - each "smart trait" owns its own memory space which doesn't get invalidated (for things that don't need to be invalidated, or for manual control), and allows creating one variable inside that (or any other) memory space, one field inside any region, or one symbol-based region for any symbol (unlike SymbolicRegion, the ghost symbol-based region doesn't need to be based on a pointer-type symbol, any symbol will do fine).

The patches are on my github because i'd re-arrange them if i was to put them on phabricator, and not sure what to do with multiple checker variants. So i'm asking to have a look at the checkers and see if you want to write checkers *that* way, instead of the traditional GDM, then i'd focus on fixing smaller issues, or i'd have to come up with something different.

SimpleStreamCheckerV2 defines symbol-based ghost regions for file descriptor symbols and stores constant values (0 or 1) inside them, indicating opened and closed streams. It doesn't need to store the "unknown" state - it is represented by non-constant values, eg. SymbolRegionValue of the ghost region. Link: [analyzer] GhostSymbolicRegion: Initial implementation · haoNoQ/clang@578e790 · GitHub

SimpleStreamCheckerV3 defines a ghost field inside each symbolic region corresponding to the each stream's pointer symbol, and stores the same constants inside these fields. It has troubles suppressing invalidation of these fields, which, in my opinion, should ideally be solved through evalcall/bodyfarm of library functions. Both versions do not suffer from the "zombie symbols" issue. Link: [analyzer] GhostFieldRegion: Initial implementation · haoNoQ/clang@bb63471 · GitHub

PthreadLockCheckerV2 defines a ghost field inside each mutex, indicating one of the four states (Initialized, Locked, Unlocked, Destroyed), which allows it to find double locks/unlocks etc. Additionally, it models the stack of locks in order to find lock order reversals. This is modeled through a single ghost variable of type `void**', which represents the stack pointer - the checker dereferences this pointer to access or modify the stack and increments/decrements it. Link: [analyzer] GhostVarRegion: Initial implementation · haoNoQ/clang@ab581e4 · GitHub

None of these checker's code actually deals with ghost regions through MemRegionManager - they always use wrappers in the ProgramState class, which in my opinion are easy and intuitive, though probably more such wrappers need to be introduced.

All the newly introduced terminology is worth discussing, i don't really like it.

Hi!

Hello,

Hi!

Yeah, i guess it’s inevitably necessary. For this proof of concept, when i was playing with SimpleStreamChecker, i put checking into PreCall (“first check our invariant then close the file”) and modeling in post-call (“before the call it’s not closed yet”). But i’ve little hope that such tricks would work in the generic case.

By the way, forgot to mention after a few rewrites: in all examples i provide the checker is split into the “modeling” checker (which throws no reports) and the “checking” checker (which doesn’t change the program state), which interact through the smart traits. PthreadLockChecker example is split into three parts. So this facility seems to allow splitting large checkers, like MallocChecker (4000+ lines of code) into smaller checkers.

I just hope that we’re logarithmic enough to avoid performance drops due to that issue. If we’re not, then i’d suggest putting most of the “ghost” objects into some separate bucket, because very often we known where to find what we’re looking for - if we’re looking for a ghost object or for a real object. Finally, performance of removeDeadBindings() would drop because of the single worklist, but that’s intentional - we need the single worklist. We shouldn’t be suffering from significant memory overhead though, because the amount of stuff to store is roughly the same anyway.

For the lock deepness use case, i’d suggest a single ghost variable that gets incremented by the checker when something is locked and decremented when something is unlocked. This variable is essentially a simple data structure that does not instantly exist in the program, but can probably be implemented by some dynamic instrumentation technique, so i’m still fine with imagining such data structure for the purpose of static analysis. My last experiment with PthreadLockChecker provides an even more complicated example - it models the stack of mutexes for catching lock order reversal.

So far the only thing i’m expecting troubles with is constraint-like information (eg. range constraints on symbols) - which is truly a kind of meta-information that truly cannot be represented as a data structure. For such constraints, i’ve no plan so far but to leave them in the GDM; see below though for criteria.

Hmm. That’s a problem we seem to automagically solve - behavior of GDM is very store-like in that case. We used to have SymbolMetadata for string length, for which liveness is checker-managed. Now, with modeling string length as ghost object, strlen() returns a SymbolRegionValue of the ghost object, which lives exactly as long as we need, even if the ghost object dies.

Yep.

I’d agree that invalidation of ghost-fields is so far the most unclear part of the proposal. Even for SimpleStreamChecker “version 3” it didn’t work very well - we’re losing positives on test due to lack of ability to manage invalidation manually. But that’s about pointer-escape, not region-changes. I’ve not tested much of the region-changes stuff yet.

I’m thinking more on that, with two ideas - either add some helpers into the trait objects (“this trait doesn’t want to be auto-invalidated by functions matching certain criteria”), or disable auto-invalidation completely for ghost-fields (they’d no longer have a symbolic offset, but some new kind of offset that’s neither concrete nor symbolic).

Most importantly, binding to ghost fields shouldn’t invalidate regular fields.

Maybe i’d find out that ghost fields won’t work as nicely as i expected, so i’d have to think of them as “shadow memory” instead - pretend that ghost values mapped to memregions are actually “outside” memregions. But i still hope that it’s not the case and managing superregions would be helpful; the ‘errno’ ghost variable is my primary motivation for this (we don’t always have it in the AST, as it could be macroed into some `*__errno()', but we still know in which memspace it should reside).

The Store just auto-creates them for regions initialized in a certain manner (eg. not initialized at all → create SymbolRegionValue), and i guess it just shouldn’t make an exception for ghost regions.

When this feature is not used, as in my examples with stream checker, we simply see if the value is constant or not, and we don’t care if it’s constant.

There’s slight problem with marking the stream state as unknown - right now i bind an UnknownVal into it. Once we do want to use such feature, we’d have to call a standard invalidation method on that region. With GDM, we could simply remove it from GDM - “stop tracking”, then “start tracking” again. But we cannot remove bindings from the Store, because we want to avoid reincarnation of SymbolRegionValue. This UnknownVal binding would be automatically removed once the stream dies, but i agree that it’s still some overhead to measure.

Yeah, it’d be nice to have constants mapped to some string values, eg. right now the dump looks like:

(StreamState{conj_$0<void *>}, 0, Direct): 1 S32b

But it’d be better as:

(StreamState{conj_$0<void *>}, 0, Direct): Open

We could do that by adding a helper method to the state trait, maybe merge the whole enum into the trait.

A few more info on this matter, as it seems important. Yeah, the primary example is ranges for symbols. One more example would be dynamic type map, but it’s a bit more complicated.

So i propose the following reasoning as an approximation for a criteria of using GDM vs. using Store: it’s somehow similar to difference between symbols and regions. I’m not possessing a perfectly clear understanding right now, but i’d try my best to explain my thoughts below:

Region contents are essentially mutable. Today they contain one thing, tomorrow they contain another thing, but it’s still the same region. You can overwrite it, you can move it to another region, you can invalidate it, you can placement-new on top of it.

Symbols, on the other hand, are essentially immutable. Any information about the symbol which we managed to gather about the symbol was true since forever and shall remain true forever. A symbol of range [1, 2] would never take value ‘3’, and has never taken before, during execution of the program over that particular branch.

These two concepts - of regions and symbols - work together to make the analyzer model simple operations.

GDM was designed to let the analyzer model more complex operations. However, my point is that all GDM traits are either “Store-like”, which means, their keys behave like regions into which we bind values, then re-bind different values, or they are “Constraint-like”, which means their keys behave like symbols, and information bound to them can only be expanded, but not mutated.

For instance, despite SimpleStreamChecker’s key is a symbol, the trait is essentially “Store-like”: the same file descriptor is first open, then closed. Which is why i propose first fixing the situation by providing a region tied to that symbol, and then bind values to represent file descriptor operations.

Now we’re ready to consider the DynamicTypePropagation example. In fact, dynamic type information sounds like “Constraint-like” trait, because we can only know more about the dynamic type of every object. However, currently dynamic type is bound to a region. There’s a problem with that - we can placement-new into the region, and the dynamic type shall mutate. Because region contents are mutable. So, in my opinion, dynamic type map in its current implementation is not a well-defined trait - it should be refactored into two traits, one of which is “Store-like” (bind an object ID symbol to every region; it can be copied, moved, invalidated, or affected by placement-new as much as we want), and another is “Constraint-like” (map object ID symbol to its dynamic type) - which is truly immutable. And, again, these two concepts - “Store-like” things and “Constraint-like things” - work together.

So my criterion so far is to move “Store-like” objects to the Store, where they seem to belong, and keep “Constraint-like” objects in the GDM.

Additionally, I believe that “Constraint-like” objects are always mapped to symbols (i.e. a map from symbols to something - because you can’t tie them to regions, because region contents are mutable). So i believe that stuff like liveness or alpha-renaming should be very simple for “Constraint-like” traits - i.e. never keep alive, just clean up symbols, merge info on alpha-renaming - and we can still automate everything.

In fact, i’m thinking^W dreaming about the same thing here - container constraints should be “Constraint-like”, tied to container content symbols.

One sec. Here. Hope these links don’t die too soon. Just wasn’t sure how to organize that on github.

Original SimpleStreamChecker vs. SimpleStreamCheckerV2:
https://www.diffchecker.com/a9fqicng

Original SimpleStreamChecker vs. SimpleStreamCheckerV3:
https://www.diffchecker.com/3ketimia

SimpleStreamCheckerV2 vs. SimpleStreamCheckerV3:
https://www.diffchecker.com/ebac2icl

PthreadLockChecker vs. PthreadLockCheckerV2:
https://www.diffchecker.com/mixtsmra