CFG blocks and variable scope

Hi,
I'm trying to write a static analysis for checking if pointers to local
variables leave the scope where the variable exists (and therefore
become invalid). I've figured how to use DataflowSolver for this but I
can't figure out how to find out in which CFG blocks the local variable
in question still exists. Is this info available somewhere or do I have
to hook into CFG builder and generate it myself?

Regards,
Martin Doucha

Hi Martin,

There is currently no scope information in the CFG (or the AST for that matter). Adding this information would be extremely useful, and would probably tie in for eventual support for encoding calls to C++ destructors in the CFG as well.

Ted

Ted Kremenek wrote:

Hi Martin,

There is currently no scope information in the CFG (or the AST for
that matter). Adding this information would be extremely useful, and
would probably tie in for eventual support for encoding calls to C++
destructors in the CFG as well.

Ted

Great, so what's the preferable way of doing this? My idea is to have a
tree of scopes (corresponding to CompoundStmt), each scope containing a
complete list of variables declared inside it (not including
declarations in nested scopes) regardless of control flow. Then each CFG
block would have a single parent scope (the one directly above it) and a
list of scopes inside it with a statement iterator pair designating the
start and end of the scope in the block. Now the question is, can
different edges leaving the block leave different sets of scopes?

Regards,
Martin

There is currently no scope information in the CFG (or the AST for

that matter). Adding this information would be extremely useful, and

would probably tie in for eventual support for encoding calls to C++

destructors in the CFG as well.

Great, so what’s the preferable way of doing this? My idea is to have a
tree of scopes (corresponding to CompoundStmt), each scope containing a
complete list of variables declared inside it (not including
declarations in nested scopes) regardless of control flow.

Hi Martin,

I haven’t given a lot of thought to this yet, but I will comment on this point. Scopes can be introduced in many places, especially in C++. While I’m not certain if you suggested this, we wouldn’t want to reconstruct the work done by Sema in generating scope information; ideally this information would still be accessible (when desired) when one has the ASTs.

In the CFG, my thought was that potentially destructor calls could be explicitly modeled. The lifetimes of regular stack variables could also be modeled using the same mechanism. Since we haven’t resolved how we want to represent destructors in the AST or CFG, I think that should probably be addressed first.

Then each CFG
block would have a single parent scope (the one directly above it) and a
list of scopes inside it with a statement iterator pair designating the
start and end of the scope in the block. Now the question is, can
different edges leaving the block leave different sets of scopes?

Within a single basic block multiple scopes may be “pushed” and “popped”. The CFG only corresponds to control-flow, and thus nested compound statements are flattened. Note that C++ also introduces scopes in many places that C does not. e.g.,

int y = 0;
if (int x = y + 1) { … }

There are three scopes here. The scope containing the ‘if’ statement and ‘int y = 0’, the scope containing ‘int x = 1’, and the scope within the { … }. The statements ‘int y = 0’ and ‘int x = y + 1’ occur within the same basic block. The successors of that basic block will have entirely different scopes.

At a high level, I don’t think there is much value in modeling the notion of “scope” at all within a CFG, and the complexity cost would be high. Scope is a concept of the language and its syntax, and thus it relates much more directly to the AST than the CFG. The CFG encodes control-flow between expressions. I really think that all that you are interested in here is the effects of scope on object lifetime rather than scope itself. Since an object getting destroyed (and here an object can be anything that is stack allocated, not just a C++ object) is an actual event with potential side-effects, modeling that in the CFG makes sense. To me it muddles up the conceptual clarity of CFGs by trying to have them model scoping (which would make CFGs a mongrel of two orthogonal concepts).

Don’t get me wrong: there is still value in having a way to query the scope of a variable, but I don’t think that belongs in the CFG. Modeling scope information (which is done in Sema but not in the ASTs) means having some object or handle that represents a particular scope, being able to query what objects are in a scope and where a scope begins and ends, etc. Ultimately analyses based on CFGs probably don’t care about that information at all but rather about the ramifications of scope in terms of object lifetime. This information could be captured during CFG construction (which could inspect the scope information) but the notion of scope shouldn’t be in the CFGs themselves.

Ted

Ted Kremenek wrote:

In the CFG, my thought was that *potentially* destructor calls could
be explicitly modeled. The lifetimes of regular stack variables could
also be modeled using the same mechanism. Since we haven't resolved
how we want to represent destructors in the AST or CFG, I think that
should probably be addressed first.

Maybe "scope" is not the right word. Think of what I called "scope" in
my last email as just a list of variables with common destruction
points. The tree structure is there just to reduce memory usage by
keeping each variable in exactly one list. Modeling the life of each
variable explicitly (if you mean explicitly marking each point where the
variable gets destroyed) could eat a lot of memory. Imagine a loop with
a lot of variables declared on top and a lot of breaks and continues
below. Explicit model would require marking the destruction for each
break/continue and each variable (or a special destruction block). The
list model will keep one list for all variables and one pointer to the
list per basic block with break/continue statement.

Within a single basic block multiple scopes may be "pushed" and
"popped". The CFG only corresponds to control-flow, and thus nested
compound statements are flattened. Note that C++ also introduces
scopes in many places that C does not.

I am aware of that. That's why I've proposed the list of "scopes" that
live inside a basic block. And the "scopes" don't have to follow the
language concept of scope to the letter. "Scopes" with exactly the same
exit points can be merged into one and empty "scopes" can be removed
completely. I am interested in the effects, I just don't see any other
efficient way of storing this information.

Regards,
Martin

Would it make sense for the CFG to contain a "virtual statement" saying "variable x destroyed here"? This would be a natural way to handle C++ dtors and would also be useful in C, because you'd know the end of the variable's life. CFG construction would handle this as it is walking the scopes.

-Chris

Yes, this is the approach I was envisioning. We still need the scope information to be available during CFG construction.

For declstmt's? I think the AST has enough information to know scopes. What is missing?

-Chris

I would like a solution that will work well with C++ (e.g., the construction/destruction of temporaries, etc.), so having an ad hoc solution to reconstruct scope and other places where objects get destroyed seems suboptimal. I suppose the CFG could construct that, but it might be nice if that information was affixed to the AST in some way. Other clients (e.g., CodeGen) might want to use that as well.

Chris Lattner wrote:

Would it make sense for the CFG to contain a "virtual statement"
saying "variable x destroyed here"? This would be a natural way to
handle C++ dtors and would also be useful in C, because you'd know the
end of the variable's life. CFG construction would handle this as it
is walking the scopes.

It would if you didn't have to destroy variables in transition between
CFG blocks. Look at C99/C++ for loop:
for (int i = 0; i < 10; i++) { int x =i; ... }
Here, 'x' is destroyed on each pass while 'i' is destroyed in transition
to the block directly after the loop. And destroying it in the following
block doesn't work because other blocks which have no variable
corresponding to 'i' may point at this block as well. So you either have
to start making virtual basic blocks for variable destruction or you can
list variables to be destroyed at the edge between basic blocks.

Yes, virtual statements could be used for destroying variables in the
middle of CFG block. But that's not enough. And each of those statements
should point to a group of variables rather than list each of them
separately. Listing each of them separately will take a lot of memory in
more complex code structures.

For declstmt's? I think the AST has enough information to know
scopes. What is missing?

Something like mapping between CFG blocks and complete AST?

Regards,
Martin

Chris Lattner wrote:

Would it make sense for the CFG to contain a "virtual statement"
saying "variable x destroyed here"? This would be a natural way to
handle C++ dtors and would also be useful in C, because you'd know the
end of the variable's life. CFG construction would handle this as it
is walking the scopes.

It would if you didn't have to destroy variables in transition between
CFG blocks. Look at C99/C++ for loop:
for (int i = 0; i < 10; i++) { int x =i; ... }
Here, 'x' is destroyed on each pass while 'i' is destroyed in transition
to the block directly after the loop. And destroying it in the following
block doesn't work because other blocks which have no variable
corresponding to 'i' may point at this block as well. So you either have
to start making virtual basic blocks for variable destruction or you can
list variables to be destroyed at the edge between basic blocks.

Hi Martin,

I think having "virtual" blocks is a much better solution (although all basic blocks are "virtual" because they are implied by the control-flow of the code). Unlike the other approach, it doesn't introduce a new concept that clients of CFGs would have to understand. Besides, CFG blocks are just ordered lists; there is no real advantage of having lists attached to edges versus having a basic block representing a list of destroyed variables. Moreover, there is control-flow between variable destruction; the ordering between object destruction is particularly important in C++, and that control-flow should be captured explicitly and clearly in the CFG itself. By having virtual basic blocks, clients are isolated from having to reason about the high-level control-flow rules of the language and instead just think about evaluating individual expressions.

Yes, virtual statements could be used for destroying variables in the
middle of CFG block. But that's not enough. And each of those statements
should point to a group of variables rather than list each of them
separately. Listing each of them separately will take a lot of memory in
more complex code structures.

I don't follow this argument. First, I don't understand the memory concerns. Having statements point to group of N variables still requires N+1 pointers versus having N pointers in a basic block to virtual statements. A basic block that contains virtual statements (which would be a format for this information that existing clients would already understand) would require space for the basic block (which is insignificant) and N pointers. To me the solution of having virtual statements in basic blocks is strictly better because it is much simpler from the perspective of clients and has almost exactly the same memory requirements.

Note that virtual statements don't necessarily require allocating new objects; we could use bit-swizzinling in the pointer values to distinguish between regular Stmt* and virtual statements (which could be Decl* mangled with a specific bit).

Second, even if there were memory concerns, CFGs are transient data structures whose primary role is to greatly simplify the reasoning about control-flow for clients. Typically we only construct a CFG when analyzing a specific function or method. Once we are done analyzing a function we release the memory associated with the CFG. Even for functions with a ton of control-flow (switches with nested loops, etc), I have yet to seen where the size of the CFG causes any real memory concerns.

For declstmt's? I think the AST has enough information to know
scopes. What is missing?

Something like mapping between CFG blocks and complete AST?

I think this is already present. CFG blocks contain Stmt* which directly reference the AST. This seems more than adequate to reference the complete AST (although I'm not certain if "complete" implies something more specific).

Ted Kremenek wrote:

I don't follow this argument. First, I don't understand the memory
concerns. Having statements point to group of N variables still
requires N+1 pointers versus having N pointers in a basic block to
virtual statements. A basic block that contains virtual statements
(which would be a format for this information that existing clients
would already understand) would require space for the basic block
(which is insignificant) and N pointers.

The problem is there may be multiple blocks with virtual statements. Now
if you have M blocks and N variables to destroy (the same variables in
each block), the group approach requires O(M+N) memory while the direct
approach requires O(M*N) memory. Though these numbers are unlikely to be
significantly different for hand-written code, don't underestimate the
stupidity of automatic code generators.

Something like mapping between CFG blocks and complete AST?

I think this is already present. CFG blocks contain Stmt* which
directly reference the AST. This seems more than adequate to
reference the complete AST (although I'm not certain if "complete"
implies something more specific).

Complete implies some sense of start and end of CFG block with respect
to scopes (CoumpoundStmts) in the AST. Though I'm sure it can be mapped
by hand, no such mapping is available without extra effort.

Regards,
Martin

This is an interesting point. I wonder how much, however, this is a real concern in practice. Consider the LLVM IR for a complex C++ program. The LLVM IR is SSA-based CFG that needs to model C++ destructor calls at a much lower level. Any "continue", "break", or "goto" that crosses scope boundaries requires logic in the IR for the necessary destructor calls. It seems like the same problems you describe would be present there.

This aside, we certainly can handle the case you described with the right abstractions and still provide the same CFG interface that clients expect. Since we are free to implement basic blocks in any matter that we wish (and have different kinds of basic block encodings), we can use virtual basic blocks to encode what variables are destroyed but employ data sharing between these virtual basic blocks to keep the memory costs down. Through a proper iterator interface, clients would iterate over the "virtual" statements in the virtual basic blocks, even if we don't explicitly create those statements. This approach can be viewed of the combination of the two approaches discussed, except that this has a simpler interface for clients.

Ted Kremenek wrote:

This is an interesting point. I wonder how much, however, this is a
real concern in practice. Consider the LLVM IR for a complex C++
program. The LLVM IR is SSA-based CFG that needs to model C++
destructor calls at a much lower level. Any "continue", "break", or
"goto" that crosses scope boundaries requires logic in the IR for the
necessary destructor calls. It seems like the same problems you
describe would be present there.

This aside, we certainly can handle the case you described with the
right abstractions and still provide the same CFG interface that
clients expect. Since we are free to implement basic blocks in any
matter that we wish (and have different kinds of basic block
encodings), we can use virtual basic blocks to encode what variables
are destroyed but employ data sharing between these virtual basic
blocks to keep the memory costs down. Through a proper iterator
interface, clients would iterate over the "virtual" statements in the
virtual basic blocks, even if we don't explicitly create those
statements. This approach can be viewed of the combination of the two
approaches discussed, except that this has a simpler interface for
clients.

Excellent. Let's get coding. Where do I start implementing this design?

Regards,
Martin

Hi Martin,

I think for starters there are two things:

(1) Where do we get the scope information that we need?

(2) What should be the interface for clients?

Ideally (1) comes from Sema. I don't want us to reconstruct scope. We should solicit comments from others on cfe-dev on the best way to query scope information (those working on the C++ side of things might have some specific thoughts).

For (2), we need to think of how to deliver this information to existing clients. Do we introduce a new subclass of Stmt that is publicly visible in the ASTs (I don't think this is a great idea) or do we have CFGBlocks contain smart pointers that reference either Stmt* (the case right now) or Decl* (which would represent a variable going out of scope). If we choose the latter, we need to update the client interface for CFGs (and the clients themselves) before we start making changes to CFG construction. That way once we have CFGs that contain the "object is now dead" information our clients will be able to pick it up.

Also, I think the compact representation in the CFG that we discussed should come last. I see it as an optimization. Once all the interfaces are in place we should be able to switch over to it. Doing it first would introduce one more implementation detail that would get in the way of having an end-to-end implementation.

What do you think?

Ted Kremenek wrote:

Ideally (1) comes from Sema. I don't want us to reconstruct scope.
We should solicit comments from others on cfe-dev on the best way to
query scope information (those working on the C++ side of things might
have some specific thoughts).

Agreed.

For (2), we need to think of how to deliver this information to
existing clients. Do we introduce a new subclass of Stmt that is
publicly visible in the ASTs (I don't think this is a great idea) or
do we have CFGBlocks contain smart pointers that reference either
Stmt* (the case right now) or Decl* (which would represent a variable
going out of scope). If we choose the latter, we need to update the
client interface for CFGs (and the clients themselves) before we start
making changes to CFG construction. That way once we have CFGs that
contain the "object is now dead" information our clients will be able
to pick it up.

This decision is yours to make. But I'd prefer to keep CFG interface as
is. Adding virtual statements during flattening of AST into basic blocks
sounds like a good idea to me.

Also, I think the compact representation in the CFG that we discussed
should come last. I see it as an optimization. Once all the
interfaces are in place we should be able to switch over to it. Doing
it first would introduce one more implementation detail that would get
in the way of having an end-to-end implementation.

Agreed.

Regards,
Martin

A virtual statement would indeed be simpler from the client's perspective (and much simpler to implement). I only suggested the second approach because it is less invasive to the AST data structures. Such a statement class would have to be declared in Stmt.h, and we would need to include comments that instances of that class will not usually be created in the AST itself. This is probably the better approach unless others have strong objections.