Iterator Checkers: Understanding Bindings

Hello Adam.

The main problem with iterators is that they may be a result of complex inline-analyzed functions. Relying on the internal structure of returned SVal doesn't seem to be a good idea. According to my experience, the best approach here is just to consider it as something that has the required type.

Note that working with iterators in CSA is complicated enough and I'm not even pretend for the good knowledge. Just sharing some experience.

I think I am on the good path to create iterator checkers for the Clang Static Analyzer. I already executed some experiments successfully, however to do that I needed to "hack" the infrastructure. Now the next step is to create a proper solution instead of the hacking which is unacceptable and would probably brake the system.

My first iterator checker is a very simple checker that tries to recognize cases where an iterator is dereferenced or incremented past its end. To do that I first catch the symbolic value of calls to end() functions that return an iterator type. At this point (checkPostCall) the return value is symbolic which is good. The next step is to post-check comparisons but here I get a temporary instead. The good news is that this temporary is default bound to the same symbol I got from the return value of end(). The bad news is that this default binding is buried and irretrievable in the current Static Analyzer core.

You may need something like checkTemporaryValue() callback which is implemented but not currently contributed to upstream (sorry for that). It may be used to transmit information related to temporary values while doing analysis in your checker before the related information is removed from the state. You can check our github project and BasicStringBoundChecker to see the implementation of this callback and its usage. If it is what you need, we'll hurry with upstreaming.
https://github.com/haoNoQ/clang/blob/summary-ipa-draft/lib/StaticAnalyzer/Checkers/BasicStringBoundChecker.cpp

The key function to retrieve binding is RegionStoreManager::getBinding() in RegionStore.cpp. However, this function only retrieves direct bindings. I could not find too much documentation about the two types of bindings, I could only find out the default binding is used for structure members where the members are not direct bound but the structure itself has a default binding. I found nothing about what should happen when retrieving the binding for the structure itself where it only has a default binding.

I'm not sure that getting a binding is something that you really need. I have tried to operate with regions directly instead. Could you share your work and give some test case? I think I don't fully understand what is your question about.

The situation is even worse here. The name of function getBinding() is a bit misleading because it does not even retrieve a direct binding even if it exists.

May be the binding _is_ LazyCompoundVal?

RegionStore, as one of the possible implementations of the Store, does not explicitly store every binding it can provide. For example, at the beginning of the analysis, every variable's value is SymbolRegionValue of this variable's region, however this value is not stored as a direct binding - it is simply assumed to be there since there's no other binding to override this assumption.

Same thing with LazyCompoundVal's - once you ask for an SVal to represent a value of a structure-type region R, you get a LazyCompoundVal, even if it's not bound to R directly. The original symbol (returned by eg. in your case begin() or end()) represents the original contents of the structure. However, later the structure may (or may not) have its fields changed, that is, other values may (or may not) be bound to sub-regions of R. Which means that the original symbol does not necessarily describe the value bound to the structure-type region R. LazyCompoundVal, on the other hand, always (regardless of other bindings) describes the structural value correctly. Moreover, it can be constructed without figuring out if there were sub-region bindings - it just copies the whole store (which is immutable, and therefore extremely cheap to copy). So the analyzer just goes ahead and always creates a LazyCompoundVal, without looking if there are any new bindings.

I've seen cases when retrieving the original default symbol made sense to me, so i'd agree that an API for that would be useful. However, from above it should be obvious that it's not how the store should normally operate; you start digging into implementation details here. This symbol can be used for identifying the instance of the object (and all copies of it), but the object could have completely changed since this symbol was born. Because iterators don't seem to change through bindings, this might be a way to go.

Hope this helps.

Hello,

Thank you for your quick reply (both Artem and Aleksei). It was very useful. Now I understand that there are two very different needs. The first need is what most checkers require and the Static Analyzer currently satisfies is that structures are analyzed deeply. However, what we need here is to handle some structures in our checker as opaque objects. I think iterator checkers are not the only one with such need, some checkers, e.g. STL, Boost or some project-specific checkers to be developed in the future may have similar needs.

If I understand it correctly, we need to implement a new function in RegionStoreManager that retrieves the raw binding for an SVal. This function must also be declared in StoreManager class as a pure virtual function. This function must be able to retrieve default bindings as well. Should it try to retrieve a direct binding first? And how to call that function? getRawBinding()? Or maybe getBindingForOpaqVal()?

Regards,

Ádám

If I understand it correctly, we need to implement a new function in

> RegionStoreManager that retrieves the raw binding for an SVal.
> This function must also be declared in StoreManager class as a pure
> virtual function. This function must be able to retrieve default bindings
> as well. Should it try to retrieve a direct binding first? And how to call
> that function? getRawBinding()? Or maybe getBindingForOpaqVal()?

Hmm. As long as the contents of the structure remain unchanged, LazyCompoundVal retains its original region (it is simply copied around as an value). Have a look at this quick debug.ExprInspection -based test:

   struct S {
     int z;
   };

   S conjure_S();

   void test_6() {
     S s1 = conjure_S();
     S s2 = s1;
     // lazily frozen compound value of temporary object constructed at statement 'conjure_S()'
     clang_analyzer_explain(s2);
     s2.z = 3;
     // lazily frozen compound value of local variable 's2'
     clang_analyzer_explain(s2);
   }

Maybe it'd be a good idea to try to identify iterators by that region. You'd probably need to able to evalCall() their methods (and probably a few other functions) in order to avoid corrupting their contents through invalidation. On methods like operator++, which are bound to invalidate contents, you can still transfer your id to the new region.

Hello,

Contents of an iterator (a class or structure) do change. However, we are not interested in the internal changes, for us an iterator is a black box. We track some of their operations and record their state (which is checker dependent) in the execution state. This is almost exactly the same as SimpleStreamChecker does with streams, the only difference is that streams are pointers and iterators are structures. I am sure that iterator checkers are not the only ones that require an API for tracking structures/classes as black boxes. There could be some other STL, Boost or project specific library checkers as well.

evalCall() is not an option since it forces all iterator checking into one single checker which is inconvenient. Furthermore, instead of unreliable hacks with regions we need a proper solution. I wonder what the core guys would say about such a new function what I proposed.

Regards,

Ádám

I agree with pretty much all of your points.

I think a proper solution would also include cases when the object was not conjured by a function. Eg., how do we identify `c' in the following code?:

   class C {...};
   void foo() {
     C c;
   }

In this case `c' doesn't have a default-bound symbol. The region approach works though; but it has other problems.

I wish we had a way to assign identifier-like symbols to object instances (perhaps in an on-demand manner), and probably construct symbolic-ID expressions for identifiers of objects produced from other objects (eg. SymbolID, SymbolCopiedID<SymbolID>, etc.).

I'm still not instantly sure how to implement that, despite thinking about it quite a bit. I'd also ask more clever guys to share ideas (:

P.S. Imho, evalCall problem is actually about sharing checker state. There's no problem with having evalCall() staying in only one checker if we share its results with other checkers.