Proposed: change tracking for RegionStore

In working with C string length tracking, I've found it's hard to track
all the cases when a region might change. One possibility would be to
allow
registering callbacks when /any/ store is made (possibly erasures as
well).
I wasn't so fond of that idea.

Instead, I went with an alternate idea: record a "change marker" in the
store for each region. Whenever a region changes, the marker is cleared, as
well as markers for all its super-regions.

The markers are stored under a new type of binding (BindingKey::Change),
(ab)using UndefinedVal's data slot to contain the marker. The actual value
of the marker is simply a pointer to the store just before the marker was
added.

The downside, of course, is that the marker-clearing happens with every
Add/Remove in the store, most of the time doing nothing. The super-region
hierarchy isn't /that/ deep, but it's still something to consider.

Comments? What needs to be done to make this committable?
RequestChangeMarker is particularly bad, but I'm not sure what the best way
around it is.

change-markers.patch (6.2 KB)

Hi Jordy,

I'm really concerned about the potential scalability cost this will impugn on RegionStore. Adding change markers will cause the store to record information that is highly specific to a given path. In this case, stores would retain memory of the previous store, which transforms them from being memoryless (aside from LazyCompoundVals) to retaining a bunch of extra state that is specific to a given path. This can cause states not to merge when they are otherwise identical, which could possibly have huge algorithmic implications with how well the analyzer scales, as far fewer paths will get merged.

I think there is a higher-order API problem here as well. The internal model that a StoreManager uses for managing memory bindings might not be the same as a Checker's view of memory. Checkers reason about memory using MemoryRegions, but a store could model them using MemRegions+offsets, etc. Checkers wishing to know about changes to memory would want to know about those changes in terms of how they reason about memory, not how the StoreManager reasons about them.

In your opinion, why is the callback idea not so hot?

Hi Jordy,

I'm really concerned about the potential scalability cost this will

impugn

on RegionStore. Adding change markers will cause the store to record
information that is highly specific to a given path. In this case,

stores

would retain memory of the previous store, which transforms them from

being

memoryless (aside from LazyCompoundVals) to retaining a bunch of extra
state that is specific to a given path. This can cause states not to

merge

when they are otherwise identical, which could possibly have huge
algorithmic implications with how well the analyzer scales, as far fewer
paths will get merged.

This is true; I hadn't thought of that. Of course, it's only a problem
when the checkers request that information, but it could be a big deal for
a large function (or function net with IPA) dealing with lots of C strings.

I think there is a higher-order API problem here as well. The internal
model that a StoreManager uses for managing memory bindings might not be
the same as a Checker's view of memory. Checkers reason about memory

using

MemoryRegions, but a store could model them using MemRegions+offsets,

etc.

Checkers wishing to know about changes to memory would want to know

about

those changes in terms of how they reason about memory, not how the
StoreManager reasons about them.

I don't think that's true. If I'm watching a[0][4] of an int, and it
get casted to an int and the fourth element changes, I still want to
know.

In your opinion, why is the callback idea not so hot?

It's mainly an API problem as well. A StoreManager may do multiple Add()
or Remove() calls as part of a single action, and between those actions the
store isn't exactly in a reasonable state. For example, the old code for
initializing an array with a string would copy each element of the string
into the array. All that happens here is that change markers get
redundantly cleared, but with callbacks the observer might be able to see
the store with undefined values, or whatever.

(I haven't looked through RegionStore to see if that's actually true.)

Moving the callback up to Bind or Invalidate might work, but then the
stores would have to keep track of every region touched during the call --
annoying, but not impossible. This could still show up during a larger
operation, since the StoreManager doesn't know who's modifying the store.
(It could be part of modeling a function's behavior, for example.)

If we can work out the right API for that, I'd be happy to switch over.
First pass: any call that returns a new Store will call

void RegionWatcher::RegionChanged(Store NewStore, const MemRegion *MR)

for each region, and it's up to the RegionWatcher to traverse
superclasses. This would probably be using the logical view of regions
rather than the internal view, which I'm still not sure is better.

It's mainly an API problem as well. A StoreManager may do multiple Add()
or Remove() calls as part of a single action, and between those actions the
store isn't exactly in a reasonable state. For example, the old code for
initializing an array with a string would copy each element of the string
into the array. All that happens here is that change markers get
redundantly cleared, but with callbacks the observer might be able to see
the store with undefined values, or whatever.

(I haven't looked through RegionStore to see if that's actually true.)

Moving the callback up to Bind or Invalidate might work,

This is the level I was thinking of.

but then the
stores would have to keep track of every region touched during the call --
annoying, but not impossible.

Not necessarily. Instead of tracking all regions that modified, we can have a registered list of subscribers (provides as input to Bind/Invalidate) that want to know when *specific* regions (and/or their subregions) have been touched. If there are no subscribers, then we record nothing. If there are subscribers, we only record information for regions that they have said they care about. Most subscribers will only care about regions for which they have checker-specific state, and that set should be small.

This could still show up during a larger
operation, since the StoreManager doesn't know who's modifying the store.
(It could be part of modeling a function's behavior, for example.)

That is true, and we want individual subscribers (which are probably Checkers) to have the ability to interject new analysis state (either via new GRState objects or ExplodedNodes). Not every call to Bind/Invalidate is structured to work that way, but it could probably be done.

One thing I like about this approach is that it could be a natural extension of the Checker callback interface.

If we can work out the right API for that, I'd be happy to switch over.
First pass: any call that returns a new Store will call

void RegionWatcher::RegionChanged(Store NewStore, const MemRegion *MR)

for each region, and it's up to the RegionWatcher to traverse
superclasses.

Superclasses?

This would probably be using the logical view of regions
rather than the internal view, which I'm still not sure is better.

Subscribers/checkers need to be StoreManager agnostic, and in the common cases they will be reasoning (with checker-specific state) about SymbolicRegions and VarRegions (or other "base" regions). We really don't want checkers to get into the business of directly reasoning about individual array elements, as that could implicitly require discretizing the representation of the contents of the array instead of reasoning about it symbolically.

It's mainly an API problem as well. A StoreManager may do multiple

Add()

or Remove() calls as part of a single action, and between those actions
the
store isn't exactly in a reasonable state. For example, the old code

for

initializing an array with a string would copy each element of the

string

into the array. All that happens here is that change markers get
redundantly cleared, but with callbacks the observer might be able to

see

the store with undefined values, or whatever.

(I haven't looked through RegionStore to see if that's actually true.)

Moving the callback up to Bind or Invalidate might work,

This is the level I was thinking of.

but then the
stores would have to keep track of every region touched during the call
--
annoying, but not impossible.

Not necessarily. Instead of tracking all regions that modified, we can
have a registered list of subscribers (provides as input to
Bind/Invalidate) that want to know when *specific* regions (and/or their
subregions) have been touched. If there are no subscribers, then we

record

nothing. If there are subscribers, we only record information for

regions

that they have said they care about. Most subscribers will only care

about

regions for which they have checker-specific state, and that set should

be

small.

That's true, but I meant we have to pass a list down from method to method
for any possible modifications. The simplest way would be to add all
regions to the list, but we could also pass a set of interesting regions
and have the methods mark the ones that are touched.

Would it be better to store a region<->listeners map or a listener set
that we then query for a list of regions? I'm leaning towards the former.

This could still show up during a larger
operation, since the StoreManager doesn't know who's modifying the

store.

(It could be part of modeling a function's behavior, for example.)

That is true, and we want individual subscribers (which are probably
Checkers) to have the ability to interject new analysis state (either

via

new GRState objects or ExplodedNodes). Not every call to

Bind/Invalidate

is structured to work that way, but it could probably be done.

One thing I like about this approach is that it could be a natural
extension of the Checker callback interface.

If that's the case, what's the best way for checkers to register their
interest in a region? Do they register with the GRExprEngine, the
GRStateManager, or directly with the StoreManager?

And hm. How can we inject new analysis state at the level of Bind or
Invalidate, where only the Store is being changed? Not all store-modifying
code goes through GRState.

If we can work out the right API for that, I'd be happy to switch over.
First pass: any call that returns a new Store will call

void RegionWatcher::RegionChanged(Store NewStore, const MemRegion *MR)

for each region, and it's up to the RegionWatcher to traverse
superclasses.

Superclasses?

Super-regions, sorry. If we're figuring out which observers to notify
within the store, however, it's going to have to be the store's
responsibility to check super-regions.

This would probably be using the logical view of regions
rather than the internal view, which I'm still not sure is better.

Subscribers/checkers need to be StoreManager agnostic, and in the common
cases they will be reasoning (with checker-specific state) about
SymbolicRegions and VarRegions (or other "base" regions). We really

don't

want checkers to get into the business of directly reasoning about
individual array elements, as that could implicitly require discretizing
the representation of the contents of the array instead of reasoning

about

it symbolically.

Good point, never mind.

One thing I like about this approach is that it could be a natural
extension of the Checker callback interface.

If that's the case, what's the best way for checkers to register their
interest in a region? Do they register with the GRExprEngine, the
GRStateManager, or directly with the StoreManager?

Interest in a region is probably something that would go into GRState, but that "registration" of interest can be implicit. For example, GRExprEngine could query the checkers for the set of "interesting" regions given a GRState*. After all, a given checker probably knows what regions it cares about because it is already tracking checker-specific state for them.

And hm. How can we inject new analysis state at the level of Bind or
Invalidate, where only the Store is being changed? Not all store-modifying
code goes through GRState.

I think Bind and Invalidate would still work as expected, but just return the list of regions that "subscribers" had said they cared about.

Bind and Invalidate, however, are only called at specific places (e.g., in GRExprEngine), and usually when one is reasoning about a GRState or ExplodedNode. Does it seem plausible to try and do the callbacks after the Bind/Invalidate, possibly generating new ExplodedNodes? For example, I can see this happening when evaluating a store within GRExprEngine.

And hm. How can we inject new analysis state at the level of Bind or
Invalidate, where only the Store is being changed? Not all
store-modifying
code goes through GRState.

I think Bind and Invalidate would still work as expected, but just

return

the list of regions that "subscribers" had said they cared about.

Bind and Invalidate, however, are only called at specific places (e.g.,

in

GRExprEngine), and usually when one is reasoning about a GRState or
ExplodedNode. Does it seem plausible to try and do the callbacks after

the

Bind/Invalidate, possibly generating new ExplodedNodes? For example, I

can

see this happening when evaluating a store within GRExprEngine.

I'm thinking of modeling functions, though. The only place where this is
used now is MallocChecker's fill value (admittedly, a patch from me, weeks
ago). But memset() is basically Invalidate-then-BindDefault.
OSAtomicChecker is using EvalStore, but it's not as clean as it should be.
There's no EvalFill or EvalInvalidate. Should there be?

I'm inclined to not allow new ExplodedNodes here, only a one-to-one filter
of GRStates. That is, you can change states when getting a region update,
but not bifurcate the state. This limits us to "places that call
GRState::makeWithStore", which are a manageable set. But still not at the
GRExprEngine level. I'm not thrilled with turning every bind* call into
this:

RegionSet RS = C.getEngine().GetInterestingRegions();
tie(state, RS) = state->bindLoc(L, V, RS);
C.getEngine().NoteRegionsChanged(state, RS);

Or this:

state = state->bindLoc(L, V, C.getEngine());

They both seem like they're mixing levels. But having to move these cover
methods for Bind* and friends (eight methods) up to the engine seems a
little silly. Maybe it shouldn't.

Are we trying to solve the problem that, some checkers want to be
informed when some region bindings are changed?

Can we do this:

GRExprEngine maintains a RegionWatcher, which is basically a map:
      MemRegion * => list of Checkers who are interested in the MemRegion

Checkers first register themselves with the region they are interested
in to the RegionWatcher.

In StoreManager::bind and other necessary places, call
RegionWatcher::RegionChanged() with the new store and a list of
changed regions. RegionWatcher then inform each Checker that
registered before.

Are we trying to solve the problem that, some checkers want to be
informed when some region bindings are changed?

Can we do this:

GRExprEngine maintains a RegionWatcher, which is basically a map:
      MemRegion * => list of Checkers who are interested in the

MemRegion

Checkers first register themselves with the region they are interested
in to the RegionWatcher.

In StoreManager::bind and other necessary places, call
RegionWatcher::RegionChanged() with the new store and a list of
changed regions. RegionWatcher then inform each Checker that
registered before.

That's a good idea. I'd be a little worried transferring from one
GRExprEngine to the next during far inline calls, but I guess that can wait
until there's more support for that. (Plenty of checkers, for example,
assume the ASTContext doesn't change between invocations.) And I assume
these callbacks would happen at the end of the StoreManager public
interface methods.

I'd actually still like to push this up to GRStateManager, since that
would allow checkers to mess with their GDM store as a side effect of a
region change (in the case of CStringChecker, to invalidate any recorded
strlen).

Are we trying to solve the problem that, some checkers want to be
informed when some region bindings are changed?

Can we do this:

GRExprEngine maintains a RegionWatcher, which is basically a map:
MemRegion * => list of Checkers who are interested in the

MemRegion

Checkers first register themselves with the region they are interested
in to the RegionWatcher.

In StoreManager::bind and other necessary places, call
RegionWatcher::RegionChanged() with the new store and a list of
changed regions. RegionWatcher then inform each Checker that
registered before.

That's a good idea. I'd be a little worried transferring from one
GRExprEngine to the next during far inline calls, but I guess that can wait
until there's more support for that. (Plenty of checkers, for example,
assume the ASTContext doesn't change between invocations.) And I assume
these callbacks would happen at the end of the StoreManager public
interface methods.

We should try to make more components ASTContext agnostic.

I'd actually still like to push this up to GRStateManager, since that
would allow checkers to mess with their GDM store as a side effect of a
region change (in the case of CStringChecker, to invalidate any recorded
strlen).

This makes sense.

(Just catching up)

This is essentially what we are trying to do. The problem is whether that map is done as a per-path property or something "global" to the instance of a Checker or GRExprEngine. If it is per-path, the map has to go in GRState or something similar if that map isn't short-lived. My suggestion before was that this map is essentially implicitly defined by the Checkers, since they are already tracking checker-specific state on the side. In this case, representing this map separately in GRState seems a little silly to me, and potentially a performance concern when it comes to state caching.

That's a good idea. I'd be a little worried transferring from one
GRExprEngine to the next during far inline calls, but I guess that can wait
until there's more support for that. (Plenty of checkers, for example,
assume the ASTContext doesn't change between invocations.) And I assume
these callbacks would happen at the end of the StoreManager public
interface methods.

Besides the ASTContext changing, we possibly will have different MemRegions as well. We haven't worked those details out yet.

I'd actually still like to push this up to GRStateManager, since that
would allow checkers to mess with their GDM store as a side effect of a
region change (in the case of CStringChecker, to invalidate any recorded
strlen).

Yes, this makes sense. Checkers don't get to manipulate stores directly, and GRState is the only place they can put new state.

The problem with just generating GRStates, as opposed to ExplodedNodes, is that it prevents Checkers from registering an error immediately when the memory "notification" takes place.

Agreed.

This is essentially what we are trying to do. The problem is whether

that

map is done as a per-path property or something "global" to the instance

of

a Checker or GRExprEngine. If it is per-path, the map has to go in

GRState

or something similar if that map isn't short-lived. My suggestion

before

was that this map is essentially implicitly defined by the Checkers,

since

they are already tracking checker-specific state on the side. In this
case, representing this map separately in GRState seems a little silly

to

me, and potentially a performance concern when it comes to state

caching.

Tracking the map in GRState is probably worse than change markers when it
comes to merging state, so I agree that a path-sensitive map wouldn't buy
us much. The difference between having an explicit map and an implicit one
is really just a time/memory/simplicity tradeoff.

These are the options I see:

Explicit map :
- One place to deal with super-regions.
- No virtual calls except for the checkers who actually said they were
interested.
- Backing (proposed): DenseMap<const MemRegion *, vector<Checker *>>

"Interest poll":
- Before each modification, ask checkers if they're interested in the
change.
- After the modification, notify all checkers that said they were
interested.
- If we wanted, this would let us ask checkers about certain /kinds/ of
changes.
- Downside: (up to) twice as many calls.

ProcessRegionChange:
- Like ProcessAssume, does no filtering of checkers. Takes a GRState and
the location that changed (SVal or MemRegion?), returns a new GRState.
- Not too hard to add the same sort of respondsToCallback checking as for
Visit*Stmt, meaning we'd at least only get checkers that /can/ track region
changes.
- Downside: not context-sensitive at all.

ProcessRegionChange:
- Like ProcessAssume, does no filtering of checkers. Takes a GRState and
the location that changed (SVal or MemRegion?), returns a new GRState.
- Not too hard to add the same sort of respondsToCallback checking as for
Visit*Stmt, meaning we'd at least only get checkers that /can/ track region
changes.

Yup. It's all very simple too.

- Downside: not context-sensitive at all.

It's only not context-sensitive from the perspective of when to do the callbacks to the checkers, but ultimately the checkers know what data they care about when they operate on a given GRState. Since the amount of "interested" regions will likely be small, this is probably not an issue.

I think this is the simplest and cleanest approach we have on the table.

OK, here's a first pass at ProcessRegionChange.

The first note is that I didn't choose to call it for every change to the
store.

Included: bindLoc, bindDefault, InvalidateRegions
Excluded: bindCompoundLiteral, bindDecl, bindDeclWithNoInit, unbindLoc,
EnterStackFrame

The reason for this is that the excluded calls only /add/ regions to the
store; they do not change existing bindings. It follows that a checker
cannot already be tracking a region that has not existed until this point.
(Arguably, bindDefault should be here too, since it's currently only used
for new regions.) unbindLoc is a special case since it's not used for
regions at all.

The second note is that it requires a ridiculous cross-indexing sort of
search to be useful: if I'm tracking the strlen of region X, that length is
invalid if any of X's subregions /or/ super-regions change. This results in
the code seen in the attached txt snippet. Is there a better way to set
this up that would make this neater?

Finally, the bind* covers in GRState are getting a little bulky for inline
functions. At what point would we give up on the efficiency gain and move
them to GRState.cpp?

ProcessRegionChanges.patch (9.17 KB)

EvalRegionChanges.txt (1.39 KB)

OK, here's a first pass at ProcessRegionChange.

The first note is that I didn't choose to call it for every change to the
store.

Included: bindLoc, bindDefault, InvalidateRegions

Note that InvalidateRegions touches a lot more than the regions specified. It computes a transitive closure, including looking at regions whose addresses are bound to other regions. Should those be included in the callback?

Excluded: bindCompoundLiteral, bindDecl, bindDeclWithNoInit, unbindLoc,
EnterStackFrame

The reason for this is that the excluded calls only /add/ regions to the
store; they do not change existing bindings. It follows that a checker
cannot already be tracking a region that has not existed until this point.
(Arguably, bindDefault should be here too, since it's currently only used
for new regions.) unbindLoc is a special case since it's not used for
regions at all.

Makes lots of sense.

The second note is that it requires a ridiculous cross-indexing sort of
search to be useful: if I'm tracking the strlen of region X, that length is
invalid if any of X's subregions /or/ super-regions change. This results in
the code seen in the attached txt snippet. Is there a better way to set
this up that would make this neater?

This looks very similar to the region "cluster" analysis done in RegionStore that is used for InvalidateRegions and RemoveDeadBindings. When I changed RegionStore to use the cluster analysis it wiped out a ton of old code. We could expose that algorithm/API at a higher level, and possibly eliminate most of the boilerplate in your code. Exposing APIs to do this kind of cross-indexing gives us the opportunity to do clever optimizations later that benefits all clients.

One nit: In your code, it's probably good to check that EntryMap is non-empty before building this index (which could be expensive).

Finally, the bind* covers in GRState are getting a little bulky for inline
functions. At what point would we give up on the efficiency gain and move
them to GRState.cpp?

Agreed. I think all of these should be moved out-of-line at this point. We should keep GRState.h defining a clean, readable interface. These methods aren't hotspots anyway, so keeping them inline buys us nothing.

Second pass. This mainly addresses the fact that InvalidateRegions doesn't
just invalidate the regions it's given -- thanks, Ted -- by recording all
the touched regions in a SmallVector.

I looked at what it would take to re-use the cluster analysis code. Right
now ClusterAnalysis is pretty tightly tied to RegionStore (it maps regions
to BindingKeys). Even if the region-walking part could be extricated,
though, it's really not the same as what the checker needs (which is
basically IsSubOrSuperRegion(MemRegion*, MemRegion*)). I guess it's okay to
leave for now.

ProcessRegionChanges.patch (20.1 KB)

Thanks Jordy. I'm still really leery about the cost of this, but I'm willing to see how CStringChecker makes use of the information so we can see how we can possibly do this better. Recording the entire set of invalidated regions seems really expensive. Comments inline:

Index: include/clang/Checker/PathSensitive/Store.h

Second pass. This mainly addresses the fact that InvalidateRegions
doesn't
just invalidate the regions it's given -- thanks, Ted -- by recording

all

the touched regions in a SmallVector.

Thanks Jordy. I'm still really leery about the cost of this, but I'm
willing to see how CStringChecker makes use of the information so we can
see how we can possibly do this better. Recording the entire set of
invalidated regions seems really expensive.

I know. Probably in most cases, the set won't be very big, but any work to
pare out regions (such as only storing base regions) seems like it'd be
more expensive, and also prone to miss cases where a region without a store
binding is invalidated. (This should still make it back to checkers.)

I've attached my local version of CStringChecker, so you can see how I'm
using all of this. Much of it is very simple usage and can probably be
cleaned up to be more efficient.

As for this:

   virtual Store InvalidateRegions(Store store,
                                   const MemRegion * const *Begin,
                                   const MemRegion * const *End,
                                   const Expr *E, unsigned Count,
                                   InvalidatedSymbols *IS,
- bool invalidateGlobals) = 0;
+ bool invalidateGlobals,
+ InvalidatedRegions &Regions) = 0;

Does it make sense to pass in a pointer instead of a reference? The
problem with passing in a reference is that it means we always have to

do

the work of recording the invalidated MemRegions.

Right. I was uneasy about this, but I think it actually ought to stay.
Why? Any region invalidation should be picked up by the checkers
eventually, or we have a correctness problem. (It's /because/ invalidating
a set of regions might touch other regions that this is necessary -- even
an internal use has no guarantee of a controlled invalidation.)

If you think it's necessary, I'll change it, but I think it may actually
be needed now.

CStringChecker.cpp (29.6 KB)