Lazy bindings

In current static analysis implementation, big array initializations takes a very long time, because we create regions for every elements no matter if it is used at all. Lazy binding is desirable for such cases. That is, when the decl stmt is visited, only a VarRegion is created for the array. ElementRegions will be created when it is first time read or written.

We have to distinguish the following cases if a region is not in regionbindings:

  • it is assigned unknown value (a.k.a. killed, we remove its binding in such case)
  • it is not initialized and it is local, we should return undefined value for it
  • it is not initialized and it is global, we should return symbolic value for it

A tentative logic is proposed as follows:

Retrieve(Loc L)
if L is in bindings
return bind(L)
else
if L is not field or element
return UnknownVal // we initialize scalar variable, so it must be killed.
else
if L is in killset
return UnknownVal
else
if L is on the stack or heap
return UndefinedVal
else

Sorry, this email is unfinished. I accidentally clicked ‘send’ button.

This is the complete email:

In current static analysis implementation, big array initializations takes a very long time, because we create regions for every elements no matter if it is used at all. Lazy binding is desirable for such cases. That is, when the decl stmt is visited, only a VarRegion is created for the array. ElementRegions will be created when it is first time read or written.

We have to distinguish the following cases if a region is not in regionbindings:

  • it is assigned unknown value (a.k.a. killed, we remove its binding in such case)
  • it is not initialized and it is local, we should return undefined value for it
  • it is not initialized and it is global, we should return symbolic value for it

A tentative logic is proposed as follows:

Retrieve(Loc L) {
if L is in bindings
return bind(L)
else
if L is not field or element
return UnknownVal // we initialize scalar variables, so it must be killed.
else
if L is in killset
return UnknownVal
else
if L is on the stack or heap
return UndefinedVal
else
return SymbolicVal
}

Remove(Loc L) {
remove as usual.
if (L is element or field)
add L to killset
}

Thoughts?

Another much simplified scheme:

  • leave all locations uninitialized
  • if a location is assigned Unknown, bind it to Unknown explicitly in the bindings.

Then Retrieve becomes:

Retrieve (L)
{
if L has binding
return L’s binding
else
if L is on stack or heap
return UndefinedVal
else
return SymbolicVal
}

The locations that are assigned unknown should not be too much, so this would not incur too much overhead. The only bindings that would be removed explicitly are those that are dead.

This is a great start. We could treat recording "UnknownVal" as the same thing as putting something in the killset.

By "heap", I assume you are referring to blocks returned by calls to malloc()?

One thing that would be nice to eventually model is default initialization for array elements and struct fields. For example:

   int large_buffer[2000] = { 1 }; // the rest of the elements are bound to 0

Not having to bind the value '0' to elements 1...1999 would be nice, as we know that their default value is 0. We could put this in a side map in the GDM (i.e. default_element_value: MemRegion* -> SVal). I actually don't think this would be hard to do at all. It also means that we scale with the number of elements referenced, not with the number declared.

Another much simplified scheme:

  • leave all locations uninitialized
  • if a location is assigned Unknown, bind it to Unknown explicitly in the bindings.

Then Retrieve becomes:

Retrieve (L)
{
if L has binding
return L’s binding
else
if L is on stack or heap
return UndefinedVal
else
return SymbolicVal
}

The locations that are assigned unknown should not be too much, so this would not incur too much overhead. The only bindings that would be removed explicitly are those that are dead.

This is a great start. We could treat recording “UnknownVal” as the same thing as putting something in the killset.

Yeah, then Retrieve becomes:

Retrieve (L)
{
if (L has binding)
return L’s binding
else if (L is in killset)
return unknown
else
if (L is on stack or heap)
return undefined
else
return symbolic
}

By “heap”, I assume you are referring to blocks returned by calls to malloc()?

Yes.

One thing that would be nice to eventually model is default initialization for array elements and struct fields. For example:

int large_buffer[2000] = { 1 }; // the rest of the elements are bound to 0

Not having to bind the value ‘0’ to elements 1…1999 would be nice, as we know that their default value is 0. We could put this in a side map in the GDM (i.e. default_element_value: MemRegion* → SVal). I actually don’t think this would be hard to do at all. It also means that we scale with the number of elements referenced, not with the number declared.

This is cool.

Hi Ted,

This is the patch for lazy binding.

  • failed C test cases are due to bugs in RemoveDeadBindings(),
    which removes constraints that is still alive. I plan to fix this
    in a later patch. This patch is already messy enough.

  • I wish you could help fix failed ObjC test cases, since I have
    no access to the Mac development platform. It’ll take some time for me
    to learn basic things.

  • default value of array and struct regions are not implemented
    yet, because the GDM of the same type of ImmutableMap<const
    Region*, SVal> is already used for region extents.

  • Now Bind() methods take and return GRState* because binding could
    also alter GDM.

  • No variables are initialized except those declared with initial
    values.

lazybinding.patch (49.6 KB)

Hi Ted,

This is the patch for lazy binding.

Hi Zhongxing,

This is a huge patch, so I'm going to make a few passes over it. Overall it looks very good. I'm wondering if we can break out the changes to SymbolManager.[h,cpp] into a separate patch without compromising the current functionality of BasicStore. It would also make your changes to RegionStore more isolated.

* failed C test cases are due to bugs in RemoveDeadBindings(),
which removes constraints that is still alive. I plan to fix this
in a later patch. This patch is already messy enough.

Okay.

* I wish you could help fix failed ObjC test cases, since I have
no access to the Mac development platform. It'll take some time for me
to learn basic things.

Sure thing.

* default value of array and struct regions are not implemented
yet, because the GDM of the same type of ImmutableMap<const
Region*, SVal> is already used for region extents.

You can have multiple GDM entries of the same type. Just use a different GDM index and create a different "tag" type. For example, I *believe* that the following should do the trick (this shows two GDM entries with the same map type):

typedef llvm::ImmutableMap<SymbolRef, unsigned> MapTy;

// TypeA.
class TypeA {};
static int TypeAIndex = 0;

namespace clang {
   template<>
   struct GRStateTrait<TypeA> : public GRStatePartialTrait< MapTy > {
     static inline void* GDMIndex() { return &TypeAIndex; }
   };
}

// TypeB.
static int TypeBIndex = 0;

namespace clang {
   template<>
   struct GRStateTrait<TypeB> : public GRStatePartialTrait< MapTy > {
     static inline void* GDMIndex() { return &TypeBIndex; }
   };
}

I think this should work. Basically the type used for GRStateTrait used for indexing into the GDM table, and we can subclass GRStatePartialTrait to create the rest of the methods.

All of the other uses of the GDM should probably be modified to follow this style to avoid any accidental re-uses of the same GDM entries.

* Now Bind() methods take and return GRState* because binding could
  also alter GDM.

Makes sense.

* No variables are initialized except those declared with initial
  values.

Great.

Hi Ted,

This is the patch for lazy binding.

Hi Zhongxing,

This is a huge patch, so I'm going to make a few passes over it. Overall it looks very good. I'm wondering if we can break out the changes to SymbolManager.[h,cpp] into a separate patch without compromising the current functionality of BasicStore. It would also make your changes to RegionStore more isolated.

* failed C test cases are due to bugs in RemoveDeadBindings(),
which removes constraints that is still alive. I plan to fix this
in a later patch. This patch is already messy enough.

Okay.

* I wish you could help fix failed ObjC test cases, since I have
no access to the Mac development platform. It'll take some time for me
to learn basic things.

Sure thing.

* default value of array and struct regions are not implemented
yet, because the GDM of the same type of ImmutableMap<const
Region*, SVal> is already used for region extents.

You can have multiple GDM entries of the same type. Just use a different GDM index and create a different "tag" type. For example, I *believe* that the following should do the trick (this shows two GDM entries with the same map type):

typedef llvm::ImmutableMap<SymbolRef, unsigned> MapTy;

// TypeA.
class TypeA {};
static int TypeAIndex = 0;

namespace clang {
  template<>
  struct GRStateTrait<TypeA> : public GRStatePartialTrait< MapTy > {
    static inline void* GDMIndex() { return &TypeAIndex; }
  };
}

// TypeB.
static int TypeBIndex = 0;

namespace clang {
  template<>
  struct GRStateTrait<TypeB> : public GRStatePartialTrait< MapTy > {
    static inline void* GDMIndex() { return &TypeBIndex; }
  };
}

I think this should work. Basically the type used for GRStateTrait used for indexing into the GDM table, and we can subclass GRStatePartialTrait to create the rest of the methods.

All of the other uses of the GDM should probably be modified to follow this style to avoid any accidental re-uses of the same GDM entries.

* Now Bind() methods take and return GRState* because binding could
also alter GDM.

Makes sense.

* No variables are initialized except those declared with initial
values.

Great.

Hi Ted,

This is the patch for lazy binding.

Hi Zhongxing,

This is a huge patch, so I’m going to make a few passes over it. Overall it looks very good. I’m wondering if we can break out the changes to SymbolManager.[h,cpp] into a separate patch without compromising the current functionality of BasicStore. It would also make your changes to RegionStore more isolated.

OK. I’ll try to break it into some separate patches.

  • failed C test cases are due to bugs in RemoveDeadBindings(),
    which removes constraints that is still alive. I plan to fix this
    in a later patch. This patch is already messy enough.

Okay.

  • I wish you could help fix failed ObjC test cases, since I have
    no access to the Mac development platform. It’ll take some time for me
    to learn basic things.

Sure thing.

  • default value of array and struct regions are not implemented
    yet, because the GDM of the same type of ImmutableMap<const
    Region*, SVal> is already used for region extents.

You can have multiple GDM entries of the same type. Just use a different GDM index and create a different “tag” type. For example, I believe that the following should do the trick (this shows two GDM entries with the same map type):

typedef llvm::ImmutableMap<SymbolRef, unsigned> MapTy;

// TypeA.
class TypeA {};
static int TypeAIndex = 0;

namespace clang {
template<>
struct GRStateTrait : public GRStatePartialTrait< MapTy > {
static inline void* GDMIndex() { return &TypeAIndex; }
};
}

// TypeB.
static int TypeBIndex = 0;

namespace clang {
template<>
struct GRStateTrait : public GRStatePartialTrait< MapTy > {
static inline void* GDMIndex() { return &TypeBIndex; }
};
}

I think this should work. Basically the type used for GRStateTrait used for indexing into the GDM table, and we can subclass GRStatePartialTrait to create the rest of the methods.

All of the other uses of the GDM should probably be modified to follow this style to avoid any accidental re-uses of the same GDM entries.

Thanks! This should work. But I prefer to do it in a later patch.

The SymbolManager part of the patch added const’ness to VarDecl* and ParmVarDecl*. It will break the compilation if committed separately. Essentially, it only adds one function:
SymbolRef SymbolManager::getSymbol(const MemRegion* R);

Hi Ted,

This is the patch for lazy binding.

I have some specific comments on the patch below. Two main comments:

1) Aside from Bind() returning a GRState* instead of Store, let's keep BasicStoreManager the same. It works just fine for what it was intended to do. The reason I get scared about large patches is the huge functionality changes. BasicStoreManager just keeps binding for the VarDecls that are referenced within a function. I haven't encountered any scalability issues with it. We don't need to use a killset. By keeping it the same we preserve its stability. New functionality should go into RegionStoreManager. BasicStoreManager doesn't need to be lazy in the same way.

2) Rename GRStateManager::getState() to MakeStateWithStore() (or something similar). It makes it clear what it is doing. 'get'State()' is so short and ambiguous that it has no real meaning, and such an important function shouldn't be ambiguous.

* failed C test cases are due to bugs in RemoveDeadBindings(),
which removes constraints that is still alive. I plan to fix this
in a later patch. This patch is already messy enough.

Let's temporarily disable those tests (if necessary) for region store and then fix them in later patches. Since there won't be any fundamental changes to BasicStoreManager all tests should pass for it.

* I wish you could help fix failed ObjC test cases, since I have
no access to the Mac development platform. It'll take some time for me
to learn basic things.

Sure thing. Same as the above comment (disable tests for region store and fix in later patch).

* default value of array and struct regions are not implemented
yet, because the GDM of the same type of ImmutableMap<const
Region*, SVal> is already used for region extents.

In a later patch we'll use the "tag class" approach I mentioned in my other email.

* Now Bind() methods take and return GRState* because binding could
  also alter GDM.

Looks great. As I said before, have BasicStoreManager use the new interface (return a state) but don't change its functionality.

* No variables are initialized except those declared with initial
  values.

Okay!

Here are specific comments:

Index: include/clang/Analysis/PathSensitive/Store.h

Thanks for the detailed review! It is very hard.

All comments were adopted. Now all tests passed except two region-store specific ones failed due to RemoveDeadBindings().

New patch attached.

new.patch (46.9 KB)

Looks good! Please apply.