fix for Clang PR 8419

Hi Ted,

I'm working on fixing PR8419, and would like to check with you if I'm
on the right track.

As I found out, the analyzer crashes on

  ++s[0];

as it expects s[0] to be a Loc (which it should be), but instead sees a NonLoc.

Upon reading the analyzer code, I see two things that don't seem right:

1. EnvironmentManager::bindExprAndLocation() throws away the
'location' argument.

2. GRExprEngine.cpp calls state->getSVal(Ex) to get the location of
Ex, but the implementation of getSVal(Ex) doesn't use the right key
(should be something like MakeLocation(Ex) to look up the expression.

Please see http://codereview.appspot.com/2920041/ for a very early
draft. It's not yet cleaned up and some changes aren't strictly
necessary -- I'll clean up after the discussion.

I'm sure it's not quite right, as I'm still figuring out how the
analyzer works. (I wish there are more comments in the code. :wink:

Could you let me know if the direction of the patch is correct? What
problems do you see in it?

Also, what do you think is the best way to ramp up on the analyzer code base?

Thanks,

Hi Zhangyong,

First, I appreciate the added comments in the patch. Please make those doxygen comments. For comments that don’t involve added functionality, I’m fine with committing those as is.

Second, I don’t think this is the droid you are looking for. The method bindExprAndLocation() was added so that the diagnostics engine could recover some of the original information from GRState before getSVal() folds symbols. This is a very recent addition, but the extra data isn’t needed for evaluation (so the rest of the changes you suggest I don’t think are correct). This line, however, is wrong:

return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S), V), S, V));

As you point out, it should be:

return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S), location), S, V));

In general, the way EvalLoad works is that some expressions are known to evaluate to lvalues and other to rvalues. The Environment doesn’t care whether an Expr* binds to an lvalue or rvalue; all of those are kept in the same mapping. However, EvalLoad simulates an lvalue-to-rvalue conversion, and replaces the previous binding of the Expr* with the new value. The purpose of bindExprAndLocation() is to preserve the original location value by use for the construction of PathDiagnostics (when we walk the ExplodedGraph). You can see this because bindExprAndLocation() is called in only one place:

void GRExprEngine::EvalLoadCommon(…) {

}
else {
if (LoadTy.isNull())
LoadTy = Ex->getType();
SVal V = state->getSVal(cast(location), LoadTy);
MakeNode(Dst, Ex, *NI, state->bindExprAndLocation(Ex, location, V),
ProgramPoint::PostLoadKind, tag);
}
}
}

You’re absolutely right that you found a bug in bindExprAndLocation(). When I added it I could quite figure out why I couldn’t get the changes I wanted to work in BugReporter, and I ran out of time and decided to revisit it later. This may indeed be the problem I was encountering.

Concerning ++s[0], the code that simulates pre/post-increment in GRExprEngine needs to be modified to understand reference types. It assumes now that the result of ++s[0] is an r-value, which is the result of doing a load from s[0] after it’s value has been incremented.

I wish there was more documentation as well. My apologies for that. It's something we really need to find time to do.

I think the best way to ramp up the codebase is to pick an example problem (like the one you are exploring and dig into it). If you are interested in bringing up support for C++ expressions, the best place to look is the visitation logic in GRExprEngine, which handles the "simulation" of individual expressions. There's plenty of examples there of how other expressions are handled.

If you are interested in writing checkers, then I'd suggest looking at the Checker and CheckerVisitor interfaces (Checker.h and CheckerVisitor.h). Almost all the files in libChecker that are named *Checker.cpp implement this interface. The Checker interface is designed to be minimal and simple for Checker writers, and attempts to isolate them from much of the gore of the internal analysis engine.

There are some useful command line options for debugging. For example:

$ clang -cc1 -help | grep analyze | <subset that I'm pointing out>
  -analyze-function <value>
  -analyzer-display-progress
  -analyzer-viz-egraph-graphviz

The first allows you to specify only analyzing a specific function. The second prints to the console what function is being analyzed. The third generates a graphviz dot file of the ExplodedGraph. This is extremely useful when debugging the analyzer and viewing the simulation results.

Of course, viewing the CFG is also useful:

$ clang -cc1 -help | grep cfg
  -cfg-add-implicit-dtors Add C++ implicit destructors to CFGs for all analyses
  -cfg-add-initializers Add C++ initializers to CFGs for all analyses
  -cfg-dump Display Control-Flow Graphs
  -cfg-view View Control-Flow Graphs using GraphViz
  -unoptimized-cfg Generate unoptimized CFGs for all analyses

-cfg-dump dumps a textual representation of the CFG to the console, and -cfg-view creates a GraphViz representation.

And of course, feel free to ask lots of questions! Myself and many others are more than welcome to help. Over time, I'd like to add some real developer documents on clang-analyzer.llvm.org.

Thanks for the comments, Ted!

Hi Zhangyong,
First, I appreciate the added comments in the patch. Please make those
doxygen comments. For comments that don't involve added functionality, I'm
fine with committing those as is.

Will do. The comments reflect *my* understanding of the code base,
and thus may not be correct. Have you reviewed them for accuracy?

Second, I don't think this is the droid you are looking for. The method

I'm not surprised at all. :wink: Due to a lack of understanding on how
the analyzer works at a high level, I'm still trying to put the pieces
together in the dark.

bindExprAndLocation() was added so that the diagnostics engine could recover
some of the original information from GRState before getSVal() folds
symbols. This is a very recent addition, but the extra data isn't needed
for evaluation (so the rest of the changes you suggest I don't think are
correct).

Who's supposed to consume the extra data? Is the consumer still to be added?

This line, however, is wrong:
return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S), V), S,
V));
As you point out, it should be:
return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S),
location), S, V));
In general, the way EvalLoad works is that some expressions are known to
evaluate to lvalues and other to rvalues. The Environment doesn't care
whether an Expr* binds to an lvalue or rvalue; all of those are kept in the
same mapping. However, EvalLoad simulates an lvalue-to-rvalue conversion,
and replaces the previous binding of the Expr* with the new value. The
purpose of bindExprAndLocation() is to preserve the original location value

Is there a difference between a location value and an L-value? If
yes, it's not clear to me what the difference is. If no, what do we
introduce the term "location value" when L-value is already in wide
use?

In

  return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S),
    location), S, V));

what's the relation between 'location' and 'V'? Is the latter the
result of the lvalue-to-rvalue conversion from the former?

by use for the construction of PathDiagnostics (when we walk the
ExplodedGraph). You can see this because bindExprAndLocation() is called in
only one place:
void GRExprEngine::EvalLoadCommon(...) {
...
}
else {
if (LoadTy.isNull())
LoadTy = Ex->getType();
SVal V = state->getSVal(cast<Loc>(location), LoadTy);
MakeNode(Dst, Ex, *NI, state->bindExprAndLocation(Ex, location, V),
ProgramPoint::PostLoadKind, tag);
}
}
}
You're absolutely right that you found a bug in bindExprAndLocation(). When
I added it I could quite figure out why I couldn't get the changes I wanted
to work in BugReporter, and I ran out of time and decided to revisit it
later. This may indeed be the problem I was encountering.
Concerning ++s[0], the code that simulates pre/post-increment in
GRExprEngine needs to be modified to understand reference types. It assumes
now that the result of ++s[0] is an r-value, which is the result of doing a
load from s[0] after it's value has been incremented.

OK, I'll take a look there. Thanks!

Also, what do you think is the best way to ramp up on the analyzer code base?

I wish there was more documentation as well. My apologies for that. It's something we really need to find time to do.

I'll try to help by adding comments where I need to touch the code.
In general, I think it works best if we require every patch to have
decent comments to cover the changes it introduces, as it can be a
luxury to have large chunks of time dedicated to improve the
documentation later. Choosing descriptive identifier names (as
opposed to abbreviations like F for Factory, for example) also helps
with readability a lot.

Do you think it would be possible to enforce such a policy or
something alike? By getting people into the habit of always
documenting the changes, the documentation will have less chance to
rot.

I think the best way to ramp up the codebase is to pick an example problem (like the one you are exploring and dig into it). If you are interested in bringing up support for C++ expressions, the best place to look is the visitation logic in GRExprEngine, which handles the "simulation" of individual expressions. There's plenty of examples there of how other expressions are handled.

If you are interested in writing checkers, then I'd suggest looking at the Checker and CheckerVisitor interfaces (Checker.h and CheckerVisitor.h). Almost all the files in libChecker that are named *Checker.cpp implement this interface. The Checker interface is designed to be minimal and simple for Checker writers, and attempts to isolate them from much of the gore of the internal analysis engine.

OK, I'll try that. Is there a paper or design doc I can read the get
the high level idea though?

There are some useful command line options for debugging. For example:

$ clang -cc1 -help | grep analyze | <subset that I'm pointing out>
-analyze-function <value>
-analyzer-display-progress
-analyzer-viz-egraph-graphviz

The first allows you to specify only analyzing a specific function. The second prints to the console what function is being analyzed. The third generates a graphviz dot file of the ExplodedGraph. This is extremely useful when debugging the analyzer and viewing the simulation results.

Of course, viewing the CFG is also useful:

$ clang -cc1 -help | grep cfg
-cfg-add-implicit-dtors Add C++ implicit destructors to CFGs for all analyses
-cfg-add-initializers Add C++ initializers to CFGs for all analyses
-cfg-dump Display Control-Flow Graphs
-cfg-view View Control-Flow Graphs using GraphViz
-unoptimized-cfg Generate unoptimized CFGs for all analyses

-cfg-dump dumps a textual representation of the CFG to the console, and -cfg-view creates a GraphViz representation.

I think these will be very useful. I'm willing to format and add them
to an HTML page if you can tell me where the page should be. Thanks,

Thanks for the comments, Ted!

Hi Zhangyong,
First, I appreciate the added comments in the patch. Please make those
doxygen comments. For comments that don't involve added functionality, I'm
fine with committing those as is.

Will do. The comments reflect *my* understanding of the code base,
and thus may not be correct. Have you reviewed them for accuracy?

I quickly scanned them, but I will re-review them. From what I can tell they looked fine.

Second, I don't think this is the droid you are looking for. The method

I'm not surprised at all. :wink: Due to a lack of understanding on how
the analyzer works at a high level, I'm still trying to put the pieces
together in the dark.

Thanks for digging in.

bindExprAndLocation() was added so that the diagnostics engine could recover
some of the original information from GRState before getSVal() folds
symbols. This is a very recent addition, but the extra data isn't needed
for evaluation (so the rest of the changes you suggest I don't think are
correct).

Who's supposed to consume the extra data? Is the consumer still to be added?

This data is meant to be used by the BugReporterVisitors, which are used by BugReporter to add annotations to the path diagnostics. This includes add messages such as "null pointer assigned here" or "assuming pointer is null." The extra data is used by the BugReporterVisitors to garner additional information about what happened during a transition.

This line, however, is wrong:
  return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S), V), S,
V));
As you point out, it should be:
  return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S),
location), S, V));
In general, the way EvalLoad works is that some expressions are known to
evaluate to lvalues and other to rvalues. The Environment doesn't care
whether an Expr* binds to an lvalue or rvalue; all of those are kept in the
same mapping. However, EvalLoad simulates an lvalue-to-rvalue conversion,
and replaces the previous binding of the Expr* with the new value. The
purpose of bindExprAndLocation() is to preserve the original location value

Is there a difference between a location value and an L-value? If
yes, it's not clear to me what the difference is.
If no, what do we
introduce the term "location value" when L-value is already in wide
use?

There is a difference; it's about how the values are used.

In the static analyzer, a "location" (SVals in namespace clang::loc) represents a pointer value. Pointer values, however, can be manipulated, just like any other scalar value. For example:

   int **p;
   ...
  *p = q;

In this example, the expression 'q' evaluates to a pointer value. That value is an rvalue. During analysis, this value can either be UnknownVal, UndefinedVal, or some subclass of Loc (i.e., a location). In comparison, the expression on the left-side, '*p', evaluates to an lvalue (in this case, the location of memory pointed-to by 'p'). The result of the assignment is that the rvalue on the right is bound to the location of the lvalue on the left. By definition, and lvalue is a location, but an rvalue can also be a location since pointers can be used as first-class values.

Digging deeper, the expression 'q' is evaluated in two stages. First, the lvalue of the variable 'q' is computed. Then, because the entire expression is meant to be an rvalue, there is an lvalue-to-rvalue conversion, which results in a load. This is what EvalLoad does.

In

return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S),
   location), S, V));

what's the relation between 'location' and 'V'? Is the latter the
result of the lvalue-to-rvalue conversion from the former?

Going back to the example for the expression 'q', the 'location' is the lvalue and 'V' is the rvalue after the lvalue-to-rvalue conversion. In this case, the lvalue is the location of 'q' and the rvalue is the loaded contents of 'q'.

OK, I’ll try that. Is there a paper or design doc I can read the get
the high level idea though?

Here is a paper on the design of the region memory model, although it is a bit outdated.

http://lcs.ios.ac.cn/~xzx/memmodel.pdf

Thanks a lot for the introduction on how the analyzer works! It's invaluable.

Sorry for the delay in getting back to you -- I had a vacation and was
catching up.

I've organized the notes you wrote into
tools/clang/lib/Checker/README.txt, which I committed in r119288.
(http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Checker/README.txt?view=markup&pathrev=119288)
Please let me know if you want me to move it to somewhere else or
change the format/content. Thanks,

Hi Ted,

Concerning ++s[0], the code that simulates pre/post-increment in
GRExprEngine needs to be modified to understand reference types. It assumes
now that the result of ++s[0] is an r-value, which is the result of doing a
load from s[0] after it's value has been incremented.

I'm not sure this is the case. Given code:

class Foo {
public:
  char& get() const;
};

char& get();

void Test() {
  Foo foo;
  foo.get()++;
  get()++; // Crashes.
}

'clang --analyze' has no trouble with "foo.get()++" but crashes on
"get()++", so the culprit seems to be in how CallExpr (as opposed to
CXXMethodCallExpr) is handled.

While debugging this, I saw one thing that I don't understand:

GRExprEngine::ProcessStmt() calls Visit() as opposed to VisitLValue()
when processing the "foo.get()" subexpression of "foo.get()++".
  Is this right or a bug? My understanding is that "foo.get()" is an
L-value and thus should be handled by VisitLValue() -- what am I
missing? Thanks,

Now I see that ProcessStmt() actually processes "foo.get()" twice.
The first time it's treated as a top-level expression, and Visit() is
used. (That's what I talked about in my previous message.) The
second time it's treated as a sub-expression of "foo.get()++", and
VisitLValue() is used (more precisely, Visit("foo.get()++) is called,
which then calls VisitLValue("foo.get()")). This makes more sense
now. Though it's still unclear to me why the first time is needed.

Here's my theory of why "foo.get()" is processed before "foo.get()++":

The static analyzer executes the code symbolically by following the
control flow. When

  foo.get()++;

is executed, first "foo.get()" needs to be evaluated, as dictated by
C++'s semantics. Hence ProcessStmt() processes "foo.get()" first.
Next, "++" is done on the result of "foo.get()". This causes
ProcessStmt() to process "foo.get()++" as a whole. Even though the
call Visit("foo.get()++") looks to me like it's simulating the
execution of the entire "foo.get()++" expression, it actually only
simulates the "top-level" part of it -- that is, "x++" where "x" is
the result of "foo.get()".

Is my understanding correct? Thanks,

Hi, Ted,

I made a new patch based on our discussion last night. I've uploaded
it to http://codereview.appspot.com/2920041/ (the raw patch can be
found at http://codereview.appspot.com/download/issue2920041_2001.diff).

Things to note:

1. In addition to trying to fix PR8419, I also added comments that
helped me understand how the code I'm touching works.

2. The meat of the fix is in CFG.cpp. I'm still not sure if this is a
principled fix, but all tests pass, so I hope it's at least an
improvement.

Would you please take a look? Thanks!

Hi, Ted,

I made a new patch based on our discussion last night. I’ve uploaded
it to http://codereview.appspot.com/2920041/ (the raw patch can be
found at http://codereview.appspot.com/download/issue2920041_2001.diff).

Things to note:

  1. In addition to trying to fix PR8419, I also added comments that
    helped me understand how the code I’m touching works.

  2. The meat of the fix is in CFG.cpp. I’m still not sure if this is a
    principled fix, but all tests pass, so I hope it’s at least an
    improvement.

Would you please take a look? Thanks!

Hi Zhanyong,

I think there is goodness here. I’d prefer this be separated up into separate commits. For example, purely documentation comments that are unrelated to the functionality changes should go into a separate commit.

Comments inline.

Index: test/Analysis/misc-ps-region-store.cpp

— test/Analysis/misc-ps-region-store.cpp (revision 119815)
+++ test/Analysis/misc-ps-region-store.cpp (working copy)
@@ -159,6 +159,25 @@
for (; ; x++) { }
}

+// PR8419 – this used to crash.
+
+class String8419 {

  • public:
  • char& get(int n);
  • char& operator[](int n);
    +};

+char& get8419();
+
+void Test8419() {

  • String8419 s;
  • ++(s.get(0));
  • get8419()++; // used to crash
  • ++s[0]; // used to crash
  • s[0] &= 1; // used to crash
  • s[0]++; // used to crash
    +}

// PR8426 – this used to crash.

Test case is reasonable. This this also pass with the basic store? If so, this should just go in misc-ps.m.

void Use(void* to);
Index: include/clang/Checker/PathSensitive/Store.h

— include/clang/Checker/PathSensitive/Store.h (revision 119815)
+++ include/clang/Checker/PathSensitive/Store.h (working copy)
@@ -22,6 +22,9 @@

namespace clang {

+/// Store - This opaque type encapsulates an immutable map. Different
+/// subclasses of StoreManager may choose to put different types of
+/// keys and values in a Store.
typedef const void* Store;

Instead of “immutable map”, I would say “an immutable mapping from locations to values.” The second sentence is incorrect. The entire purpose of the store is to map from locations to values; not arbitrary keys and values. Different StoreManagers may represent this mapping as they choose. At a high-level, the store represents the symbolic memory model.

class GRState;
Index: include/clang/Checker/PathSensitive/GRExprEngine.h

— include/clang/Checker/PathSensitive/GRExprEngine.h (revision 119815)
+++ include/clang/Checker/PathSensitive/GRExprEngine.h (working copy)
@@ -513,6 +513,10 @@
public:
// FIXME: ‘tag’ should be removed, and a LocationContext should be used
// instead.

  • // FIXME: Comment on the meaning of the arguments, when ‘St’ may not
  • // be the same as Pred->state, and when ‘location’ may not be the
  • // same as state->getLValue(Ex).
  • /// Simulate a read of the result of Ex.
    void EvalLoad(ExplodedNodeSet& Dst, const Expr* Ex, ExplodedNode* Pred,
    const GRState* St, SVal location, const void tag = 0,
    QualType LoadTy = QualType());
    @@ -522,7 +526,7 @@
    void EvalStore(ExplodedNodeSet& Dst, const Expr
    AssignE, const Expr* StoreE,
    ExplodedNode* Pred, const GRState* St, SVal TargetLV, SVal Val,
    const void tag = 0);
    -private:
    +private:
    void EvalLoadCommon(ExplodedNodeSet& Dst, const Expr
    Ex, ExplodedNode* Pred,
    const GRState* St, SVal location, const void *tag,
    QualType LoadTy);

Comment seems reasonable.

Index: include/clang/Checker/PathSensitive/Environment.h

— include/clang/Checker/PathSensitive/Environment.h (revision 119815)
+++ include/clang/Checker/PathSensitive/Environment.h (working copy)
@@ -27,7 +27,12 @@
class ValueManager;
class LiveVariables;

+// FIXME: The name clang::Environment is too generic. Find a more
+// descriptive name (e.g. GREnvironment?).

Instead of a FIXME, why not just do this change?

+/// Environment - An immutable map from Stmts to their current
+/// symbolic values (SVals).
+///

Comment seems reasonable. This is basically the dual of ‘store.’ Instead of mapping from locations to values, it maps from expressions to values. This essentially captures the current semantics of an expression as it was just evaluated by the analyzer engine.

class Environment {
private:
friend class EnvironmentManager;
@@ -51,6 +56,9 @@
return X ? *X : UnknownVal();
}

  • /// Like LookupExpr, but doesn’t require Ex to be in the map (in
  • /// which case the return value will either be UnknownVal() or
  • /// created by the ValueManager).
    SVal GetSVal(const Stmt* Ex, ValueManager& ValMgr) const;

This is a high-level, public API; this level of detail in the comment seems a little weird. The fact that values are lazily generated is an implementation detail. I’d rather the comment just say that this fetches the current binding of the expression in the Environment. The fact that the binding is implicitly or explicitly represented in the Environment is an implementation detail.

Also, LookupExpr is a private detail of the class. Comparing GetSVal to LookupExpr is a little weird when it comes to documenting a method used by clients.

/// Profile - Profile the contents of an Environment object for use
Index: include/clang/Checker/PathSensitive/GRState.h

— include/clang/Checker/PathSensitive/GRState.h (revision 119815)
+++ include/clang/Checker/PathSensitive/GRState.h (working copy)
@@ -1,4 +1,4 @@
-//== GRStateh - Path-Sens. “State” for tracking valuues ------ C++ -–==//
+//== GRState.h - Path-sensitive “State” for tracking values -----
- C++ -*–==//

Good cleanup.

//
// The LLVM Compiler Infrastructure
//
@@ -7,7 +7,7 @@
//
//===----------------------------------------------------------------------===//
//
-// This file defines SymbolRef, ExprBindKey, and GRState*
+// This file defines SymbolRef, ExprBindKey, and GRState*.
//
//===----------------------------------------------------------------------===//

@@ -53,15 +53,17 @@
};

//===----------------------------------------------------------------------===//
-// GRState- An ImmutableMap type Stmt*/Decl*/Symbols to SVals.
+// GRState- An ImmutableMap from Stmt*/Decl*/Symbols to SVals.

This comment is actually out-of-date, since it isn’t an ImmutableMap object anymore.

//===----------------------------------------------------------------------===//

class GRStateManager;

-/// GRState - This class encapsulates the actual data values for
-/// for a “state” in our symbolic value tracking. It is intended to be
-/// used as a functional object; that is once it is created and made
-/// “persistent” in a FoldingSet its values will never change.
+/// GRState - This class encapsulates the symbolic values of all Stmts
+/// in the program and the constraints on the values at a particular
+/// moment.

I would probably bullet it out. The GRState represents:

(1) Mapping from locations to values (Store)
(2) Mapping from expressions to values (Environment)
(3) Constraints on symbolic values

Together these represent the “abstract state” of a program.

A client of this class may also embed arbitrary data in
+/// it via a GenericDataMap. GRState is intended to be used as a
+/// functional object; that is, once it is created and made
+/// “persistent” in a FoldingSet, its values will never change.
class GRState : public llvm::FoldingSetNode {
public:
typedef llvm::ImmutableSetllvm::APSInt* IntSetTy;
@@ -73,9 +75,9 @@
friend class GRStateManager;

GRStateManager *StateMgr;

  • Environment Env;
  • Environment Env; // Maps a Stmt to its current SVal.
    Store St;
  • GenericDataMap GDM;
  • GenericDataMap GDM; // custom data stored by a client of this class

Please keep comments consistent. Start with capital letters; end with with periods.

/// makeWithStore - Return a GRState with the same values as the current
/// state with the exception of using the specified Store.
@@ -120,8 +122,9 @@

void setGDM(GenericDataMap gdm) { GDM = gdm; }

  • /// Profile - Profile the contents of a GRState object for use
  • /// in a FoldingSet.
  • /// Profile - Profile the contents of a GRState object for use in a
  • /// FoldingSet. Two GRState objects are considered equal if they
  • /// have the same Environment, Store, and GenericDataMap.

Looks great.

static void Profile(llvm::FoldingSetNodeID& ID, const GRState* V) {
V->Env.Profile(ID);
ID.AddPointer(V->St);
@@ -163,19 +166,14 @@
// (3) A binary value “Assumption” that indicates whether the constraint is
// assumed to be true or false.
//

  • // The output of “Assume” are two values:
  • // The output of “Assume*” is a new GRState object with the added constraints.
  • // If no new state is feasible, NULL is returned.

Looks great.

//

  • // (a) “isFeasible” is set to true or false to indicate whether or not
  • // the assumption is feasible.
  • //
  • // (b) A new GRState object with the added constraints.
  • //
  • // FIXME: (a) should probably disappear since it is redundant with (b).
  • // (i.e., (b) could just be set to NULL).
  • //

const GRState *Assume(DefinedOrUnknownSVal cond, bool assumption) const;

  • // FIXME: The type of this method doesn’t match the comment above.
  • // Comment on what this really does.

Here’s what this method does. The above version of Assume takes an ‘assumption’ parameter. This one doesn’t. This method assumes both “true” and “false” for ‘cond’, and returns both corresponding states. It’s shorthand for doing ‘Assume’ twice.

std::pair<const GRState*, const GRState*>
Assume(DefinedOrUnknownSVal cond) const;

@@ -194,9 +192,7 @@
//==---------------------------------------------------------------------==//

/// BindCompoundLiteral - Return the state that has the bindings currently

  • /// in ‘state’ plus the bindings for the CompoundLiteral. ‘R’ is the region
  • /// for the compound literal and ‘BegInit’ and ‘EndInit’ represent an
  • /// array of initializer values.
  • /// in this state plus the bindings for the CompoundLiteral.

Looks good.

const GRState bindCompoundLiteral(const CompoundLiteralExpr CL,
const LocationContext *LC,
SVal V) const;
Index: include/clang/Checker/PathSensitive/SVals.h

— include/clang/Checker/PathSensitive/SVals.h (revision 119815)
+++ include/clang/Checker/PathSensitive/SVals.h (working copy)
@@ -39,13 +39,26 @@
class GRStateManager;
class ValueManager;

+/// SVal - This represents a symbolic expression, which can be either
+/// an L-value or an R-value.
+///

Looks good.

class SVal {
public:

  • enum BaseKind { UndefinedKind, UnknownKind, LocKind, NonLocKind };
  • enum BaseKind {
  • // The enumerators must be representable using 2 bits.
  • UndefinedKind = 0, // for subclass UndefinedVal (an uninitialized value)
  • UnknownKind = 1, // for subclass UnknownVal (a void value)
  • LocKind = 2, // for subclass Loc (an L-value)
  • NonLocKind = 3 // for subclass NonLoc (an R-value that’s not
  • // an L-value)
  • };
    enum { BaseBits = 2, BaseMask = 0x3 };

protected:
const void* Data;
+

  • /// The lowest 2 bits are a BaseKind (0 – 3).
  • /// The higher bits are an unsigned “kind” value.
    unsigned Kind;

protected:
@@ -200,7 +213,7 @@
return V->getBaseKind() == UnknownKind;
}
};

class DefinedSVal : public DefinedOrUnknownSVal {
private:
// Do not implement. We want calling these methods to be a compiler
Index: lib/Analysis/CFG.cpp

— lib/Analysis/CFG.cpp (revision 119815)
+++ lib/Analysis/CFG.cpp (working copy)
@@ -900,12 +900,29 @@
return VisitChildren(S);
}

+// Returns true if a CFGBuilder should treat ‘S’ as an L-value when
+// visiting it.
+static bool ShouldTreatAsLValue(Stmt* S) {

  • // A call to a reference-returning function is an L-value.
  • if (CallExpr* CallEx = dyn_cast(S))
  • return CallEx->getCallReturnType()->isReferenceType();
  • // Otherwise, we assume it’s not an L-value.
  • // FIXME: document why this assumption is OK, or fix this in a
  • // principled way if it’s not.
  • return false;
    +}

/// VisitChildren - Visit the children of a Stmt.
CFGBlock CFGBuilder::VisitChildren(Stmt Terminator) {
CFGBlock *B = Block;
for (Stmt::child_iterator I = Terminator->child_begin(),
E = Terminator->child_end(); I != E; ++I) {

  • if (*I) B = Visit(*I);
  • if (*I) {
  • B = Visit(*I, ShouldTreatAsLValue(*I) ?
  • AddStmtChoice::AsLValueNotAlwaysAdd :
  • AddStmtChoice::NotAlwaysAdd);
  • }
    }
    return B;
    }

I don’t this is right. I’ll comment in a separate email.

Index: lib/Checker/Environment.cpp

— lib/Checker/Environment.cpp (revision 119815)
+++ lib/Checker/Environment.cpp (working copy)
@@ -101,7 +101,8 @@
Environment EnvironmentManager::bindExprAndLocation(Environment Env,
const Stmt *S,
SVal location, SVal V) {

  • return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S), V), S, V));
  • return Environment(F.Add(F.Add(Env.ExprBindings, MakeLocation(S), location),
  • S, V));
    }

This is a good bug fix, and should go into a patch of its own.

namespace {
Index: lib/Checker/GRExprEngine.cpp

— lib/Checker/GRExprEngine.cpp (revision 119815)
+++ lib/Checker/GRExprEngine.cpp (working copy)
@@ -909,7 +909,7 @@
break;
}

  • case Stmt::CXXMemberCallExprClass: {
  • case Stmt::CXXMemberCallExprClass: {
    const CXXMemberCallExpr *MCE = cast(S);
    VisitCXXMemberCallExpr(MCE, Pred, Dst);
    break;
    @@ -1944,10 +1944,11 @@
    EvalBind(Dst, StoreE, *NI, GetState(*NI), location, Val);
    }

What changed here?

-void GRExprEngine::EvalLoad(ExplodedNodeSet& Dst, const Expr Ex,
+void GRExprEngine::EvalLoad(ExplodedNodeSet& Dst, const Expr Ex,
ExplodedNode
Pred,
const GRState
state, SVal location,
const void *tag, QualType LoadTy) {

  • assert(!isa(location) && “location cannot be a NonLoc.”);

Is it possible to change ‘location’ to a Loc, instead of having this assertion?

// Are we loading from a region? This actually results in two loads; one
// to fetch the address of the referenced value and one to fetch the
@@ -3154,7 +3155,7 @@
assert (U->isIncrementDecrementOp());
ExplodedNodeSet Tmp;
const Expr* Ex = U->getSubExpr()->IgnoreParens();

  • VisitLValue(Ex, Pred, Tmp);
  • VisitLValue(Ex, Pred, Tmp); // This modifies Tmp.

Do you think this comment is necessary? This happens all over the place within the engine.

for (ExplodedNodeSet::iterator I = Tmp.begin(), E = Tmp.end(); I!=E; ++I) {

@@ -3163,7 +3164,7 @@

// Perform a load.
ExplodedNodeSet Tmp2;

  • EvalLoad(Tmp2, Ex, *I, state, V1);
  • EvalLoad(Tmp2, Ex, *I, state, V1); // This modifies Tmp2.

Do you think this comment is necessary? This happens all over the place within the engine.

  1. The meat of the fix is in CFG.cpp. I’m still not sure if this is a
    principled fix, but all tests pass, so I hope it’s at least an
    improvement.

Hi Zhanyong,

More comments on this part of your patch:

+// Returns true if a CFGBuilder should treat ‘S’ as an L-value when
+// visiting it.
+static bool ShouldTreatAsLValue(Stmt* S) {

  • // A call to a reference-returning function is an L-value.
  • if (CallExpr* CallEx = dyn_cast(S))
  • return CallEx->getCallReturnType()->isReferenceType();
  • // Otherwise, we assume it’s not an L-value.
  • // FIXME: document why this assumption is OK, or fix this in a
  • // principled way if it’s not.
  • return false;
    +}

/// VisitChildren - Visit the children of a Stmt.
CFGBlock CFGBuilder::VisitChildren(Stmt Terminator) {
CFGBlock *B = Block;
for (Stmt::child_iterator I = Terminator->child_begin(),
E = Terminator->child_end(); I != E; ++I) {

  • if (*I) B = Visit(*I);
  • if (*I) {
  • B = Visit(*I, ShouldTreatAsLValue(*I) ?
  • AddStmtChoice::AsLValueNotAlwaysAdd :
  • AddStmtChoice::NotAlwaysAdd);
  • }
    }
    return B;
    }

I think the motivation behind this change is right, but not the change itself.

The problem with VisitChildren() is that it has no context of whether ‘Terminator’ (which is misnamed) expects the subexpression to be evaluated as an lvalue. Since your logic is specific to CallExpr, it really could just be sunk into processing of CallExpr itself. When the caller passses down ‘AsLValue’, it is indicating that the parent expects that the child is evaluated as an lvalue, but that’s based on the parent’s context. There is no such information here.

Further, I think it’s no technically correct. Consider:

const int &foo();
int i = foo();

This is what the AST looks like for the second line:

(DeclStmt 0x101842c18 <line:3:3, col:16>
0x101818510 “int i =
(CallExpr 0x1018185d8 <col:11, col:15> ‘const int’ lvalue
(ImplicitCastExpr 0x1018185b8 col:11 ‘const int &(*)(void)’
(DeclRefExpr 0x101818560 col:11 ‘const int &(void)’ lvalue FunctionDecl=‘foo’ 0x101818450)))”))

In this example, CallExpr does indeed return an lvalue. The problem is that there is an implicit lvalue-to-rvalue conversion. So the semantics really are:

(DeclStmt 0x101842c18 <line:3:3, col:16>
0x101818510 “int i =
(RValueConversion <— I made this up
(CallExpr 0x1018185d8 <col:11, col:15> ‘const int’ lvalue
(ImplicitCastExpr 0x1018185b8 col:11 ‘const int &(*)(void)’
(DeclRefExpr 0x101818560 col:11 ‘const int &(void)’ lvalue FunctionDecl=‘foo’ 0x101818450)))”))

Currently we don’t represent lvalue-to-rvalue conversions explicitly in the AST, although in my conversations with John McCall there is interest in doing this. The problem with your patch is that the parent expression may actually expect that there is an rvalue, even though the subexpression returns an lvalue. Your patch doesn’t actually address this problem at all, but actually papers over it.

Codegen goes through some histrionics to infer these semantics from the AST, but they analyzer mostly goes off the AST itself for semantics. As you can see this is a problem, since we really have no “sequence point” for the analyzer to do the lvalue-to-rvalue conversion. We get around this limitation in various places when it comes to C-style constructs, but there are many corner cases for C++.

In this particular case, I think we need to add special support in the CFG builder for UnaryOperator. Specifically, we need to add a ‘VisitUnaryOperator’ to the CFGBuilder class, and have it determine, base on the opcode, whether or not the operation requires the subexpression to be an lvalue or rvalue. It then passes the appropriate value for AddStmtChoice.

Similarly, for CallExprs, any CallExpr that returns a reference needs to be treated as returning an lvalue. The problem here is that we don’t have a sequence point for doing arbitrary lvalue-to-rvalue conversion in the AST, but this is something we can possibly model in GRExprEngine given enough breadcrumbs in the CFG.

Hi Ted,

Thanks for the quick feedback! I'll respond to your comments about
the code comments later, as they are less important.

I think the motivation behind this change is right, but not the change
itself.
The problem with VisitChildren() is that it has no context of whether
'Terminator' (which is misnamed) expects the subexpression to be evaluated
as an lvalue. Since your logic is specific to CallExpr, it really could
just be sunk into processing of CallExpr itself. When the caller passses
down 'AsLValue', it is indicating that the parent expects that the child is
evaluated as an lvalue, but that's based on the parent's context. There is
no such information here.

Thanks for explaining. It's a revelation to me to learn that:

1. The "AsLValue" bit encodes whether the *consumer* of an expression
   uses it as an LValue, not whether the expression itself is an LValue.

2. The caller of Visit*() must provide this bit accurately. It's not
   OK for the caller to be "conservative" (i.e. say "AsLValue" when
   the consumer of the expression only needs an RValue).

I suspected #1 was the case, but wasn't certain. I wasn't sure about
#2 at all (hence my patch is the way it is). The fact that all tests
passed with the patch led me to think that it's fine for the caller of
Visit*() to say AsLValue when only an RValue is needed. Now I see
that it's just inadequate test coverage.

I'll document this in my new patch. Also, how would you suggest to
add a test to catch violation of #2?

Further, I think it's no technically correct. Consider:
const int &foo();
int i = foo();
This is what the AST looks like for the second line:
(DeclStmt 0x101842c18 <line:3:3, col:16>
0x101818510 "int i =
(CallExpr 0x1018185d8 <col:11, col:15> 'const int' lvalue
(ImplicitCastExpr 0x1018185b8 <col:11> 'const int &(*)(void)'
<FunctionToPointerDecay>
(DeclRefExpr 0x101818560 <col:11> 'const int &(void)' lvalue
FunctionDecl='foo' 0x101818450)))"))
In this example, CallExpr does indeed return an lvalue. The problem is that
there is an implicit lvalue-to-rvalue conversion. So the semantics really
are:
(DeclStmt 0x101842c18 <line:3:3, col:16>
0x101818510 "int i =
(RValueConversion <--- I made this up
(CallExpr 0x1018185d8 <col:11, col:15> 'const int' lvalue
(ImplicitCastExpr 0x1018185b8 <col:11> 'const int &(*)(void)'
<FunctionToPointerDecay>
(DeclRefExpr 0x101818560 <col:11> 'const int &(void)' lvalue
FunctionDecl='foo' 0x101818450)))"))
Currently we don't represent lvalue-to-rvalue conversions explicitly in the
AST, although in my conversations with John McCall there is interest in
doing this. The problem with your patch is that the parent expression may
actually expect that there is an rvalue, even though the subexpression
returns an lvalue. Your patch doesn't actually address this problem at all,
but actually papers over it.
Codegen goes through some histrionics to infer these semantics from the AST,
but they analyzer mostly goes off the AST itself for semantics. As you can
see this is a problem, since we really have no "sequence point" for the
analyzer to do the lvalue-to-rvalue conversion. We get around this
limitation in various places when it comes to C-style constructs, but there
are many corner cases for C++.

This is really unfortunate. :frowning:

In this particular case, I think we need to add special support in the CFG
builder for UnaryOperator. Specifically, we need to add a
'VisitUnaryOperator' to the CFGBuilder class, and have it determine, base on
the opcode, whether or not the operation requires the subexpression to be an
lvalue or rvalue. It then passes the appropriate value for AddStmtChoice.

I'll try that.

One question: for the purpose of the CFG builder, does an LValue have
to be mutable? For example, given

  void Foo(const Bar& x) { ... }
  Bar& GetBar() { ... }

when the analyzer sees "Foo(GetBar());", should it treat "GetBar()" as
an LValue or RValue? What if the implementation of Foo() actually
cares about the identity of the input, e.g.

  void Foo(const Bar& x) {
    if (&x == &someGlobalBarVairalbe)
      ...;
  }

? Thanks.

2010/11/20 Zhanyong Wan (λx.x x) <wan@google.com>

Hi Ted,

Thanks for the quick feedback! I’ll respond to your comments about
the code comments later, as they are less important.

I think the motivation behind this change is right, but not the change
itself.
The problem with VisitChildren() is that it has no context of whether
‘Terminator’ (which is misnamed) expects the subexpression to be evaluated
as an lvalue. Since your logic is specific to CallExpr, it really could
just be sunk into processing of CallExpr itself. When the caller passses
down ‘AsLValue’, it is indicating that the parent expects that the child is
evaluated as an lvalue, but that’s based on the parent’s context. There is
no such information here.

Thanks for explaining. It’s a revelation to me to learn that:

  1. The “AsLValue” bit encodes whether the consumer of an expression
    uses it as an LValue, not whether the expression itself is an LValue.

This is right.

  1. The caller of Visit*() must provide this bit accurately. It’s not
    OK for the caller to be “conservative” (i.e. say “AsLValue” when
    the consumer of the expression only needs an RValue).

I don’t quite understand this point.

Just like Ted said, the first step of the fix is:

when the ‘foo()’ is added as the block-level expr in CFG construction, its kind should be StatementAsLValue, and this should be confined in the unary operator case.

I suspected #1 was the case, but wasn’t certain. I wasn’t sure about
#2 at all (hence my patch is the way it is). The fact that all tests
passed with the patch led me to think that it’s fine for the caller of
Visit*() to say AsLValue when only an RValue is needed. Now I see
that it’s just inadequate test coverage.

I’ll document this in my new patch. Also, how would you suggest to
add a test to catch violation of #2?

Further, I think it’s no technically correct. Consider:
const int &foo();
int i = foo();
This is what the AST looks like for the second line:
(DeclStmt 0x101842c18 <line:3:3, col:16>
0x101818510 “int i =
(CallExpr 0x1018185d8 <col:11, col:15> ‘const int’ lvalue
(ImplicitCastExpr 0x1018185b8 col:11 ‘const int &()(void)’

(DeclRefExpr 0x101818560 col:11 ‘const int &(void)’ lvalue
FunctionDecl=‘foo’ 0x101818450)))"))
In this example, CallExpr does indeed return an lvalue. The problem is that
there is an implicit lvalue-to-rvalue conversion. So the semantics really
are:
(DeclStmt 0x101842c18 <line:3:3, col:16>
0x101818510 "int i =
(RValueConversion <— I made this up
(CallExpr 0x1018185d8 <col:11, col:15> ‘const int’ lvalue
(ImplicitCastExpr 0x1018185b8 col:11 'const int &(
)(void)’

(DeclRefExpr 0x101818560 col:11 ‘const int &(void)’ lvalue
FunctionDecl=‘foo’ 0x101818450)))”))
Currently we don’t represent lvalue-to-rvalue conversions explicitly in the
AST, although in my conversations with John McCall there is interest in
doing this. The problem with your patch is that the parent expression may
actually expect that there is an rvalue, even though the subexpression
returns an lvalue. Your patch doesn’t actually address this problem at all,
but actually papers over it.
Codegen goes through some histrionics to infer these semantics from the AST,
but they analyzer mostly goes off the AST itself for semantics. As you can
see this is a problem, since we really have no “sequence point” for the
analyzer to do the lvalue-to-rvalue conversion. We get around this
limitation in various places when it comes to C-style constructs, but there
are many corner cases for C++.

This is really unfortunate. :frowning:

In this particular case, I think we need to add special support in the CFG
builder for UnaryOperator. Specifically, we need to add a
‘VisitUnaryOperator’ to the CFGBuilder class, and have it determine, base on
the opcode, whether or not the operation requires the subexpression to be an
lvalue or rvalue. It then passes the appropriate value for AddStmtChoice.

I’ll try that.

One question: for the purpose of the CFG builder, does an LValue have
to be mutable? For example, given

void Foo(const Bar& x) { … }
Bar& GetBar() { … }

when the analyzer sees “Foo(GetBar());”, should it treat “GetBar()” as
an LValue or RValue? What if the implementation of Foo() actually
cares about the identity of the input, e.g.

void Foo(const Bar& x) {
if (&x == &someGlobalBarVairalbe)
…;
}

? Thanks.

In this context, we evaluate GetBar() to its l-value, i.e., a MemRegion.

Hi Zhongxing,

I'll try that.

One question: for the purpose of the CFG builder, does an LValue have
to be mutable? For example, given

void Foo(const Bar& x) { ... }
Bar& GetBar() { ... }

when the analyzer sees "Foo(GetBar());", should it treat "GetBar()" as
an LValue or RValue?

In all cases, the result of GetBar() is an lvalue (because it returns a reference type). However, the question is whether or not the parent expression expects an lvalue or an rvalue. In this case it expects an lvalue, so no lvalue-to-rvalue conversion is necessary.

What if the implementation of Foo() actually
cares about the identity of the input, e.g.

void Foo(const Bar& x) {
   if (&x == &someGlobalBarVairalbe)
     ...;
}

This raises other topics. Assuming we can lvalue and rvalue semantics correct, this is more about how to we reason about the addresses of memory as well as how do we reason about function calls. For the former, memory locations are represented using "memory regions" (MemRegion.h), which can represent abstract chunks of memory. For the latter, we have a variety of choices. Currently we analyze functions independently (intra-procedurally), but we also have the option to analyze functions inter-procedurally either using inlining (which we have prototyped) or function summaries (something still to do).

Stepping back, I think we may need some extra design here, either in the ASTs or in the analyzer, to model lvalue-to-rvalue conversions correctly for C++.

With C, lvalue-to-rvalue conversions happen in specific cases, e.g.:

- Loading from a variable

- Dereferencing a pointer

Semantically such conversions result in a "load" of a value from a memory address. This is what 'EvalLoad()' does in GRExprEngine. Because the places where such conversions occur in C are fairly limited, up to now we have been able to model lvalue-to-rvalue conversions without much effort.

With C++ (especially with the use of references) there is a lot more cases to reason about. It raises the question of whether we should try to recreate the logic in codegen in GRExprEngine to infer all of these cases, or to try and represent them explicitly in the AST.

There are advantages to representing such conversions explicitly in the AST. For one, the distinction of an lvalue versus an rvalue in the CFG probably evaporates. Any place where we see an lvalue-to-rvalue conversion simply means the subexpression was a pointer (or reference), and the conversion represents a load. Moreover, different analyses (not just GRExprEngine) will be able to more adeptly reason about the semantics of analyzed code without having the burden to re-infer where all the lvalue-to-rvalue conversions take place.

The other advantage is that it takes care of discrepanancies between C and C++ like the following:

$ cat t.c
void foo() {
  int a = 0;
  ++(++a);
}

$ clang -fsyntax-only t.c
t.c:3:3: error: expression is not assignable
  ++(++a);
  ^ ~~~~~
1 error generated.

$ clang -fsyntax-only -x c++ t.c
$

Here C and C++ differ on the semantics of '++'. When the file is compiled as C code, we get an error because the expression '++a' evaluates to an rvalue. When we compile the code as C++, however, the expression evaluate to an lvalue. This difference isn't represented in the AST at all.

NOTE: This is also another reason why GRExprEngine might crash in some cases with '++' on C++ code, because it assumes the C semantics.

To explain the difference some more, consider:

  int a = 0;
  int b = ++a;

With C, the '++a' evaluates to an rvalue immediately, which is assigned to 'b'. In C++, the expression '++a' first evaluates to an lvalue, and then there is an implicit lvalue-to-rvalue conversion that loads the new value. Of course the compiler doesn't need to generate more IR to represent this at the execution level, but this is what is conceptually happening in the language.

So to summarize, these discrepancies of how lvalues and rvalues are reasoned about in C versus C++ is really something we need to solve systematically in the analyzer (and/or ASTs) with a cohesive solution. That's why I think this individual case that you reported is only symptomatic of a bigger issue that needs a well-designed solution.

Cheers,
Ted

Thanks for thinking through this and sharing your analysis, Ted! I'm
all for fixing this in a principled way. However, the crash I
reported is blocking some experiments I'm doing. Therefore I'd really
like to put in a fix for the symptom first to unblock me, and then we
can take the time necessary to fix the whole thing the right way.
Would you oppose to that? Thanks,