MemRegions - how to (re)use them right?

Hi @clang,

I need to analyze some (rather simple) data flow at AST level. While I could do it by my own I have the strange feeling that the MemRegionManager already provides a lot of the means I need. However I couldn't figure out how to use it. At the moment it seems to be used for diagnostics only.
Here is a code example:

struct A { int a, b };

void foo(int* sink)
{
  A temp;
  A* ptr = &temp;

  temp.a = /*expr*/;
  temp.b = /* another_expr */;
  *sink = temp.a; // (1)
  *sink = ptr->b; // (2)
}

I want to know that at (1) actually the result of expr is written to sink and that at (2) the result of another_expr is written to sink. Can I somehow compute this using clang's static analyzer?

Best regards
Olaf Krzikalla

Hi Olaf,

The short answer is "yes this can be done," and others have voiced interest in doing taint analysis.

At a high-level, do you want to do basic flow-sensitive dataflow analysis or path-sensitive analysis?

MemRegions are used by the path-sensitive dataflow engine (GRExprEngine) to do a lazy abstraction of memory. They are used in combination with SVals (see SVals.h) to reason about the values of expressions along a given path. They aren't used just for diagnostics, but are used to precisely reason about memory and memory bindings (e.g., what value binds to a variable).

If you wanted to do this using path-sensitive analysis, one way to do this is to walk the ExplodedGraph produced by GRExprEngine. The ExplodedGraph represents the possible paths traced within a function, and each node consists of a program location (e.g., an expression) and a program state. Specific ExplodedNodes have locations that represent loads/stores. You could "simply" walk the graph (using DFS) looking for loads, and then trace where individual expression values are passed to stores. Each store consists of a location the value will be stored and the value to be stored. After a store, you keep traversing the ExplodedGraph until you see a load from the same location, then trace that value, etc. There are many details to get this right, but it could be done, and you wouldn't need to modify the analyzer at all.

This option could be generally useful for several clients interested in "taint" analysis. Down the line we could build a simpler interface to this information that layers on top of the ExplodedGraph, and have the underlying machinery do all the hard work. For example, the interface would support the query that for a given Expr evaluated at an ExplodedNode, what is the immediate set of preceding ExplodedNodes (and corresponding Exprs) from which the value of that Expr is derived. This relation in itself is a graph that could be walked, but it is at a little higher level than just walking the ExplodedGraph directly.

Another option is for us to build support for general "taint" tracking into GRExprEngine and friends, with core transfer function logic doing some of the taint propagation. This might be useful anyway, but would add more complexity not required by all clients. It also might reduce the amount of path caching done by the analyzer.

Hi Ted,

it's been a while since I asked about the topic hence I appended (aka top-post) our conversation so far.
First, I think I need a basic flow-sensitive dataflow analysis only. That is, at least the code block of interest can be considered to be a basic block. However it might be of course embedded in for-stmts or the like. I looked at the graphs produced by GRExprEngine (cmd-line was clang-cc -analyzer-viz-egraph-graphviz -checker-cfref -analyze -analyzer-store=basic -analyzer-constraints=basic) but I found the output not very useful in the presence of loops - it seems that the analyzer performs some sort of simulation of the program. The graph contains the loop body several times (every time the index was counted up). In general I think GRExprEngine does too much for my purpose (at least at the moment).
Meanwhile I 'm able to further refine my requirements. Assume the following code:

void foo()
{
  for (/*...*/) // a loop is always involved
  {
    lhs_1 = rhs_1;
    lhs_2 = rhs_2;
    //aso.
  }
}

At block level the for-body contains binary assignment expressions only. Now I just want to know, if and where lhs_x appears anywhere in rhs_y (with x < y, let's ignore x > y for the moment). Then I could expand rhs_y by replacing lhs_x with rhs_x, perhaps recursively. This of course isn't that easy as it sounds: you have to resolve things like *&a <-> a or a->b <-> (*a).b.
Thus all I need is a unique identifier for lhs_x and a function which for a given arbitrary rvalue-expression finds the referred lhs_x identifier (if one exists). Given these my initial example could be solved quite elegant:

A* ptr = &temp; // var ptr gets ID1
temp.a = /*expr*/; // field var temp.a gets ID2
temp.b = /* another_expr */; // field var temp.b gets ID3
*sink = temp.a; // function detects ID2 for temp.a: expand to *sink = /*expr*/: done
*sink = ptr->b; // function detects ID1 for ptr: expand to *sink = (&temp)->b:
                  // function detects ID3 for (&temp)->b: expand to *sink = /* another_expr */: done

I think, that one feasible approach to do this is working on the AST level by unifiying expressions somehow (e.g. something like a unified string) and then just search for them in other unified expressions. However I'm afraid that this approach isn't very extensible. At least it means some work which possibly has been done.
OTOH I've seen the MemRegions which apparently are already designed so that they can act easily as these unique identifiers I mentioned above. Also the PostLoad node seems to designate sub-expressions for which a MemRegion exists. But the only clang part dealing with MemRegions so far is the GRExprEngine and as I said I'm note sure if it is what I need. Or it's me who is still not able to use GRExprEngine properly for my purpose.

What do you think? Are this or similar problems already solved, maybe by others?
Thanks for any help.

Best Regards
Olaf Krzikalla

Ted Kremenek schrieb:

At block level the for-body contains binary assignment expressions only. Now I just want to know, if and where lhs_x appears anywhere in rhs_y (with x < y, let's ignore x > y for the moment). Then I could expand rhs_y by replacing lhs_x with rhs_x, perhaps recursively. This of course isn't that easy as it sounds: you have to resolve things like *&a <-> a or a->b <-> (*a).b.
Thus all I need is a unique identifier for lhs_x and a function which for a given arbitrary rvalue-expression finds the referred lhs_x identifier (if one exists). Given these my initial example could be solved quite elegant:

A* ptr = &temp; // var ptr gets ID1
temp.a = /*expr*/; // field var temp.a gets ID2
temp.b = /* another_expr */; // field var temp.b gets ID3
*sink = temp.a; // function detects ID2 for temp.a: expand to *sink = /*expr*/: done
*sink = ptr->b; // function detects ID1 for ptr: expand to *sink = (&temp)->b:
                // function detects ID3 for (&temp)->b: expand to *sink = /* another_expr */: done

I think, that one feasible approach to do this is working on the AST level by unifiying expressions somehow (e.g. something like a unified string) and then just search for them in other unified expressions. However I'm afraid that this approach isn't very extensible. At least it means some work which possibly has been done.
OTOH I've seen the MemRegions which apparently are already designed so that they can act easily as these unique identifiers I mentioned above. Also the PostLoad node seems to designate sub-expressions for which a MemRegion exists. But the only clang part dealing with MemRegions so far is the GRExprEngine and as I said I'm note sure if it is what I need. Or it's me who is still not able to use GRExprEngine properly for my purpose.

Hi Olaf,

Sorry for the delay in my response.

MemRegions can be viewed as the "name" or "location" of a piece of memory. This is how GRExprEngine uses them, and it uses MemRegions to talk to StoreManager to reason about values bound to MemRegions.

MemRegion certainly can be repurposed by other clients, as MemRegionManager (which constructs MemRegion objects) doesn't rely on the rest of the path-sensitive engine. MemRegions conceptually represent abstract chunks of memory. For example, a VarRegion represents the memory associated with a given variable. Regions can be layered on top of each other to represent field and array accesses, casts, and so forth. MemRegions are constructing on demand by GRExprEngine when it evaluates casts, pointer decays, and so forth. MemRegions are uniqued by MemRegionManager. For example, when you request the MemRegion for a given VarDecl, you will always get the same MemRegion*.

In your example above, I'm not certain if ID* symbols are the values bound to locations or the locations themselves. If they are the locations, then you'd have:

   ID1 => VarRegion(ptr)
   ID2 => FieldRegion(VarRegion(temp), a)
   ID3 => FieldRegion(VarRegion(temp), b)

Now if ID* represent values, you're going to need something else besides MemRegions to keep track of location -> value bindings. For example:

*sink = ptr->b; // function detects ID1 for ptr: expand to *sink = (&temp)->b:

To get this reasoning, you need to keep track of the binding:

    VarRegion(ptr) => VarRegion(temp)

which would cause the l-value of 'ptr->b' to evaluate to:

    FieldRegion(VarRegion(temp), b)

Then the r-value of 'ptr->b' expands to whatever value you track for the field 'b' of 'temp'.

In general, I'm not certain how much value flow logic you plan on implementing. MemRegions themselves don't represent the values of expressions, but rather memory locations (which some expressions may evaluate to). The path-sensitive engine uses SVals (short for "symbolic values") to represent what the symbolic result of evaluating an expression. As you'll see, SVals can represent "locations" and "non-locations", and "locations" include MemRegions.

I guess it's not clear to me how much symbolic evaluation you plan on doing. If your analysis isn't path-sensitive, how do plan on handling confluence points in the CFG (and loops)?

Ted

Hi Ted,

Ted Kremenek schrieb:

Sorry for the delay in my response.

It was just in time to put me (hopefully) on the right track.

In your example above, I'm not certain if ID* symbols are the values bound to locations or the locations themselves.

They were meant as the locations themselves. These locations then get bound to values similiar to what GRExprEngine does (as I have seen, RegionBindings stores these bindings there).

If they are the locations, then you'd have:

  ID1 => VarRegion(ptr)
  ID2 => FieldRegion(VarRegion(temp), a)
  ID3 => FieldRegion(VarRegion(temp), b)

Now if ID* represent values, you're going to need something else besides MemRegions to keep track of location -> value bindings. For example:

*sink = ptr->b; // function detects ID1 for ptr: expand to *sink = (&temp)->b:

To get this reasoning, you need to keep track of the binding:

   VarRegion(ptr) => VarRegion(temp)

That is, the value of a pointer itself represents another memory region. However I have the strong feeling that MemRegionVal actually represents a different concept and value flow logic isn't implemented at all (thats how I interpret the FIXME comment in GRExprEngine::EvalLoad). OTOH at least the "pointer!=0"-constraint is stored somewhere.

which would cause the l-value of 'ptr->b' to evaluate to:

   FieldRegion(VarRegion(temp), b)

Then the r-value of 'ptr->b' expands to whatever value you track for the field 'b' of 'temp'.

In general, I'm not certain how much value flow logic you plan on implementing.

Just some basic alias analysis. That is, the value of a pointer is either a simple adress-of-operator result (in that case the MemRegion of the &-op-subexpression is bound to the pointer) or another simple pointer. Of course I try to make this as extensible as possible. For the value flow of non-pointer types the means provided by the MemRegion concept are sufficient.

MemRegions themselves don't represent the values of expressions, but rather memory locations (which some expressions may evaluate to). The path-sensitive engine uses SVals (short for "symbolic values") to represent what the symbolic result of evaluating an expression. As you'll see, SVals can represent "locations" and "non-locations", and "locations" include MemRegions.

But as I said above I don't think they are used in GRExprEngine in the way I want to use them.

If your analysis isn't path-sensitive, how do plan on handling confluence points in the CFG (and loops)?

Rather pragmatic. For the moment I will analyse basic blocks only. Later on this can be refined. E.g. at confluence points you may check if a pointer value remains or becomes the same on both paths. Then you can use this value further. Otherwise the pointer gets marked as unknown.

Overall I think that I can't use the xxxEngine framework but I have to write new components. However I think its possible to factor out the part creating MemRegions in a common Stmt visitor (apparently RegionStore and BasicStore don't differ at this point and maybe even the NOTE comment and the following if clause in RegionStoreManager::getLValueFieldOrIvar belongs to BasicStoreManager::getLValueField too). With this tool I should be able to resolve the first write to sink in my example. Then, in a second step I reason about pointer values. However for this second step IMHO a class like UnknownTypedRegion is still missing. Example:

void foo(int* x)
{
  int* y = x;
  //...
}

Even if I don't know the particular MemRegion x points to, I know that y points to the same region.
I hope to get some first results by the end of week. Meanwhile I'm still eager to get your comments.

Ciao Olaf

Hi Ted,

Ted Kremenek schrieb:

Sorry for the delay in my response.

It was just in time to put me (hopefully) on the right track.

In your example above, I'm not certain if ID* symbols are the values bound to locations or the locations themselves.

They were meant as the locations themselves. These locations then get bound to values similiar to what GRExprEngine does (as I have seen, RegionBindings stores these bindings there).

That's correct. Anything that subclasses StoreManager manages the "symbolic store" component of GRState, where the symbolic state manages bindings from locations to values.

If they are the locations, then you'd have:

ID1 => VarRegion(ptr)
ID2 => FieldRegion(VarRegion(temp), a)
ID3 => FieldRegion(VarRegion(temp), b)

Now if ID* represent values, you're going to need something else besides MemRegions to keep track of location -> value bindings. For example:

*sink = ptr->b; // function detects ID1 for ptr: expand to *sink = (&temp)->b:

To get this reasoning, you need to keep track of the binding:

  VarRegion(ptr) => VarRegion(temp)

That is, the value of a pointer itself represents another memory region. However I have the strong feeling that MemRegionVal actually represents a different concept and value flow logic isn't implemented at all (thats how I interpret the FIXME comment in GRExprEngine::EvalLoad). OTOH at least the "pointer!=0"-constraint is stored somewhere.

The FIXME in GRExprEngine::EvalLoad was stale and needed to be removed.

loc::MemRegionVal is simply allows regions (which are memory locations) to be used as values during symbolic execution. GRExprEngine reasons mostly about SVals, and when necessary specially handles the cases where those values are memory regions. Most of the logic to handle MemRegions, however, are intentionally left as a detail in the StoreManagers.

Constraints on pointers, such as whether or not a pointer is non-null, is handled by the ConstraintManager. The ConstraintManager only reasons about symbols (provided by SymbolManager). SymbolicRegions represent abstract regions that can either be null or actual locations in memory. The ConstraintManager keeps track of whether or not symbol associated with a SymbolicRegions is null. ConstraintManagers also reason about constraints on other symbols, such as integers.

which would cause the l-value of 'ptr->b' to evaluate to:

  FieldRegion(VarRegion(temp), b)

Then the r-value of 'ptr->b' expands to whatever value you track for the field 'b' of 'temp'.

In general, I'm not certain how much value flow logic you plan on implementing.

Just some basic alias analysis. That is, the value of a pointer is either a simple adress-of-operator result (in that case the MemRegion of the &-op-subexpression is bound to the pointer) or another simple pointer. Of course I try to make this as extensible as possible. For the value flow of non-pointer types the means provided by the MemRegion concept are sufficient.

I think MemRegions should give you almost all the flexibility you need to reason about l-values. The result of an address-of expression is always a region wrapped in a loc::MemRegionVal or UnknownVal.

If your analysis isn't path-sensitive, how do plan on handling confluence points in the CFG (and loops)?

Rather pragmatic. For the moment I will analyse basic blocks only. Later on this can be refined. E.g. at confluence points you may check if a pointer value remains or becomes the same on both paths. Then you can use this value further. Otherwise the pointer gets marked as unknown.

Makes sense.

Overall I think that I can't use the xxxEngine framework but I have to write new components. However I think its possible to factor out the part creating MemRegions in a common Stmt visitor (apparently RegionStore and BasicStore don't differ at this point and maybe even the NOTE comment and the following if clause in RegionStoreManager::getLValueFieldOrIvar belongs to BasicStoreManager::getLValueField too). With this tool I should be able to resolve the first write to sink in my example. Then, in a second step I reason about pointer values. However for this second step IMHO a class like UnknownTypedRegion is still missing. Example:

void foo(int* x)
{
int* y = x;
//...
}

Even if I don't know the particular MemRegion x points to, I know that y points to the same region.
I hope to get some first results by the end of week. Meanwhile I'm still eager to get your comments.

The way this is handle in GRExprEngine is that the VarRegion for 'x' binds to a symbolic region. After the assignment to y, both the VarRegion for y and x bind to the same region. RegionStoreManager handles some bindings like this lazily, but it it essentially reasons about them in this way.

Just to be clear, we have four components at play in GRExprEngine:

- MemRegions, which represent locations in memory
- SVals, which represents values that expressions can evaluate to (including in the case of lvalues, regions)
- StoreManager: which manages bindings from MemRegions to SVals
- ConstraintManager: which manages constraints on symbols (where SymbolicRegions have an associated symbol, and integers and other values can be symbolic as well)

Each of these components represents a separate set of functionality that when composed together provide the value tracking reasoning used by the path-sensitive engine.