Checking Model Used by 'ArrayBoundV2'

Hi all,

I was wondering if anyone could tell me what model of array bounds
checking the static analyzer uses in
'-analyzer-checker=experimental.security.ArrayBoundV2'? Is the
implementation based on a checking algorithm or a type system that's
been documented (formally or informally) elsewhere?

In particular, I'm trying to figure out which kinds of false negatives
(e.g. http://llvm.org/bugs/show_bug.cgi?id=11114) should be viewed as
bugs in the implementation, and which are necessary consequences of the
checking procedure that's applied.

Thanks,

--Benjamin Schulz

Benjamin,

ArrayBoundV2, like most of the other analyzer checkers, is based on symbolic execution. There is no formal document that would help you easily identify false negatives/positives (the source code and the analyzer site are the best sources of documentation). Currently, the main limitations are the lack of interprocedural analysis and a rather weak solver (e.g., it doesn't correctly model truncation). Feel free to file bugs for issues if you are unsure. Even if the analyzer doesn't reason about them now, they'd be valuable test cases to target as we make the analyses more powerful.

Cheers,
Anna.

Firstly, thanks for the clarification.

(source code and the analyzer site are the best sources of documentation).

I've taken a look at these. As a lead-in to my next question, I wanted
to make sure that I so far understand the overall structure of the analyzer.

The SA developer manual states that: "Each time the analyzer engine
explores a new statement, it notifies each checker registered to listen
for that statement, giving it an opportunity to either report a bug or
modify the state. (As a rule of thumb, the checker itself should be
stateless.)"

Am I correct to read this as indicating that each active checker is
invoked each time a statement is symbolically executed, with each
checker reading the same symbolic store and context through accessors at
each invocation?

Currently, the main limitations are the lack of interprocedural analysis
and a rather weak solver (e.g., it doesn't correctly model truncation).
Feel free to file bugs for issues if you are unsure.

Given that all of the above is true, I have a hypothesis that Jones and
Kelly's runtime checking scheme [1] could be adapted or weakened to
produce a symbolically executed check that could overcome some of the
present checker's limitations. The main difficulty is that it would not
be sufficient to just write a new checker; the symbolic execution engine
would also need a modest extension.

My questions are: (1) Has this possibility already been considered? (2)
What's the likelihood that such an addition would represent a change
that's too big or ambitious for Clang/LLVM's accepted practice of small,
incremental changes?

Thanks,

--Benjamin Schulz

* * *

[1] Richard W. M. Jones and Paul H. J. Kelly. "Backwards-Compatible
Bounds Checking for Arrays and Pointers in C Programs." AADEBUG'97.
Available here:
http://www.doc.ic.ac.uk/~phjk/phjk-Publications.html#Jones:Kelly:BC:97

The SA developer manual states that: “Each time the analyzer engine
explores a new statement, it notifies each checker registered to listen
for that statement, giving it an opportunity to either report a bug or
modify the state. (As a rule of thumb, the checker itself should be
stateless.)”

Am I correct to read this as indicating that each active checker is
invoked each time a statement is symbolically executed, with each
checker reading the same symbolic store and context through accessors at
each invocation?

The SA developer manual is still rather weak (thus the red “under construction” sign:)).

Each checker can choose to register for the events it is interested in by implementing callbacks (see http://clang.llvm.org/doxygen/classento_1_1CheckerDocumentation.html). This allows the checkers to report bugs and/or modify the program state (for example, a checker can add its own information to the state for bookkeeping). Suppose several checkers (C1, C2, C3) register for the same event, say checkPostStmt for a function call. In order to allow each checker to examine/modify the state, we chain them together. Specifically, we might add a chain of nodes (State_after_call, Loc_X) → (State_C1, Loc_X) → (State_C2, Loc_X) → (State_C3, Loc_X) to the symbolic execution graph, where State_Ci is the state produced by checker Ci.

Currently, the main limitations are the lack of interprocedural analysis
and a rather weak solver (e.g., it doesn't correctly model truncation).
Feel free to file bugs for issues if you are unsure.

Given that all of the above is true, I have a hypothesis that Jones and
Kelly’s runtime checking scheme [1] could be adapted or weakened to
produce a symbolically executed check that could overcome some of the
present checker’s limitations. The main difficulty is that it would not
be sufficient to just write a new checker; the symbolic execution engine
would also need a modest extension.

All of the mentioned weaknesses lie within the analyzer core engine. The general design idea is that the core engine does most of the work and only the check-specific rules are implemented within each checker.

My questions are: (1) Has this possibility already been considered?

The paper discusses a runtime checking technique (similar to AddressSanitizer or SAFECode), so only some of the presented ideas would be applicable.

For example, as I mentioned earlier, the analyzer does not have interprocedural analysis, which means that when we symbolically execute a function call, we abstract it away completely: we assume we do not know what happened to any variable that could have been modified by the call. Also, we assume that we do not know anything about the input parameters to functions. There are ways to solve these issues by implementing interprocedural analysis (inlining or summary based) and/or adding annotations on function arguments. However, none have been implemented yet.

On the other hand, I suspect there could be checks mentioned in the paper that the ArrayBoundV2 checker does not have.

I only briefly looked at the paper. Do you have other suggestions?

(2)
What’s the likelihood that such an addition would represent a change
that’s too big or ambitious for Clang/LLVM’s accepted practice of small,
incremental changes?

Even big additions can be executed as a sequence of small steps, as long as we have a clear picture of what we are trying to achieve and how it fits into the bigger setting.

Cheers,
Anna.

On the other hand, I suspect there could be checks mentioned in the
paper that the ArrayBoundV2 checker does not have.

I only briefly looked at the paper. Do you have other suggestions?

Momentarily putting aside the issue of interprocedural analysis, I might
suggest that something similar to Jones and Kelly's method could be used
to warn of out-of-bounds pointer dereferences whenever the originating
pointer is declared within the scope of the function and the array in
question is declared as a predefined constant or literal.

E.g. I would expect that even in the purely intraprocedural case it
could catch something like this:

int *foo, *bar;
int baz = something;

foo = (int*) malloc(CONSTANT * sizeof(int));
bar = (foo + CONSTANT + 1);
*bar = baz;

, which the present implementation of 'ArrayBoundV2' does not catch.

There are ways to solve these issues by implementing interprocedural
analysis (inlining or summary based) or/and adding annotations on
function arguments. However, none have been implemented yet.

On the other hand, I suspect there could be checks mentioned in the
paper that the ArrayBoundV2 checker does not have.

Is anyone actively working on this problem, or would I otherwise be
stepping on anyone's toes if I tinkered with 'ArrayBoundV2' to explore
the possibility of improving its false negative rate along these lines?
I'd like to see if it's possible to do so using the current
implementation of the symbolic execution engine. If not, I could come
back with a concrete proposal on how the engine might be extended.
Otherwise, I can continue to try to dream up test cases.

On the other hand, I suspect there could be checks mentioned in the
paper that the ArrayBoundV2 checker does not have.

I only briefly looked at the paper. Do you have other suggestions?

Momentarily putting aside the issue of interprocedural analysis, I might
suggest that something similar to Jones and Kelly's method could be used
to warn of out-of-bounds pointer dereferences whenever the originating
pointer is declared within the scope of the function and the array in
question is declared as a predefined constant or literal.

E.g. I would expect that even in the purely intraprocedural case it
could catch something like this:

int *foo, *bar;
int baz = something;

foo = (int*) malloc(CONSTANT * sizeof(int));
bar = (foo + CONSTANT + 1);
*bar = baz;

, which the present implementation of 'ArrayBoundV2' does not catch.

This example is not handled because the checker does not reason about malloc. Note that the analyzer CAN find the bug in the following example:

void m() {
    int foo[3];
    int *bar;
    bar = (foo + 3 + 1);
    *bar = 3;
}

There are ways to solve these issues by implementing interprocedural
analysis (inlining or summary based) or/and adding annotations on
function arguments. However, none have been implemented yet.

On the other hand, I suspect there could be checks mentioned in the
paper that the ArrayBoundV2 checker does not have.

Is anyone actively working on this problem, or would I otherwise be
stepping on anyone's toes if I tinkered with 'ArrayBoundV2' to explore
the possibility of improving its false negative rate along these lines?

If you see opportunities to improve the checker you are more than welcome to submit patches. We are definitely interested in having a better buffer overflow solution (which requires improvements in different parts of the analyzer.)

The way I’d like to see this generally approached is that ArrayBoundV2 could consult some API to check the “known bound” for a region. We have a mechanism, known as “extents”. Specific checkers could register their knowledge of extents (e.g., the Malloc checker could register an extent for malloc’ed memory) and ArrayBoundCheckerV2 just cues off that extent information.

I’m interested in a general, scalable solution for adding such information. In the short term, that could be added to ArrayBoundCheckerV2 as a post-visit on CallExprs. This would leave the current bounds checking logic to just consult the generic extent information in the ProgramState.

More generally, I’d like any checker to be able to register information about extents, and then have ArrayBoundCheckerV2 automatically get smarter by leveraging such domain-specific information. Further, once IPA becomes available, we may be able to infer this extent information by other means.

One thing that springs immediately to my mind is that symbolic pointers
could be more closely associated to instances of MemRegion, e.g. added
as a class member, and that this would make a number of things easier.
I've looked through the definitions in
'~/lib/StaticAnalyzer/Core/MemRegion.cpp' and
'~/include/clang/StaticAnalyzer/Core/MemRegion.h', and it wasn't
immediately clear to me how the base region of a MemRegion relates back
to a symbolic value (i.e. SVal). This appears to create some
difficulties for the current implementation, e.g. the stubbed check of
'computeExtentBegin()' in 'ArrayBoundV2'. At first glance, it looks
like the checker has to ask for a lot more information from the
execution context than just the symbolic start of the region and its
expected extent.

Maybe I've misunderstood the code, but it seems like a symbolic pointer
could be associated to a memory region whenever it is declared or
malloc'ed, and propagated through the execution as just another
reaching definition, with the extent of the associated region included
as part of that definition. In particular, any pointer 'q' into the
region originally associated to 'p' would be expected to satisfy
something like '(q in bound) => (exists i such that (p + i in bound and
q = p + i))'. Constraints along these lines could then be applied as
needed.

Is this something like what you're thinking of? Have I missed something
in the code or misunderstood what I've seen so far?

I think you are missing a few critical pieces of the overall design.

SVals are meant to represent the semantic evaluation of expressions. They can represent things like concrete integers, symbolic integers, or memory locations (which are memory regions). They are a discriminated union of “values”, symbolic and otherwise. There is no meaning to associate a memory region back to its SVal, because that relationship doesn’t exist. An SVal is just an utterance of a value; it’s not the representative of the value itself.

In contrast, symbols are meant to represent abstract, but named, symbolic values. Symbolic values can have constraints associated with them. Symbols represent an actual (immutable) value. We might not know what its specific value is, but we can associate constraints with that value as we analyze a path. To be more concrete, suppose we have:

int foo(int x) {
  int y = x * 2;
  int z = x;
}

Let’s look at how x * 2 gets evaluated. When ‘x’ is evaluated, we first construct an SVal that represents the lvalue of ‘x’, in this case it is an SVal that references the MemoryRegion for ‘x’ (more on that later). Afterwards, when we do the lvalue-to-rvalue conversion, we get a new SVal, which references the value currently bound to ‘x’. That value is symbolic; it’s whatever ‘x’ was bound to at the start of the function. Let’s call that symbol $0. Similarly, we evaluate the expression for ‘2’, and get an SVal that references the concrete number ‘2’. When we evaluate x * 2, we take the two SVals of the subexpressions, and create a new SVal that represents their multiplication (which in this case is a new symbolic value, which we might call $1). When we evaluate the assignment to ‘y’, we again compute its lvalue (a MemRegion), and then bind the SVal for the RHS (which references the symbolic value $1) to the MemRegion in the symbolic store.

The second line is similar. When we evaluate ‘x’ again, we do the same dance, and have an SVal that references $0. That SVal isn’t malloc’ed (with a persistent address), it is just a C++ value object. Think of it as just a smart pointer, referencing the real values. Two SVals might reference the same underlying values.

To clarify, a MemRegion is similar to a symbol. It is used to provide a lexicon of how to describe abstract memory. Regions can layer on top of other regions, providing a layered description of how to describe memory. For example, a struct object on the stack might be represented by a VarRegion, but a FieldRegion which is a subregion of the VarRegion could be used to represent the memory associated with a specific field of that object.

Stepping back, from this perspective there is no inherent association between a MemRegion and an SVal. An SVal might reference a MemRegion, and an SVal might represent the value bound to a MemRegion in the symbolic store, but there is a many-to-one relationship from SVals (which are just value references) to any given MemRegion.

So how do we represent symbolic memory regions? That’s what SymbolicRegion is for. It is a MemRegion that has an associated symbol. Since the symbol is unique and has a unique name, that symbol names the region.

To summarize, MemRegions are unique names for blocks of memory. Symbols are unique names for abstract symbolic values. Some MemRegions represent abstract symbolic chunks of memory, and thus are also based on symbols. SVals are just references to values, and can reference MemRegions, Symbols, or concrete values (e.g., the number 1).

So what about extents? Extents are just symbolic values as well. We also have various mechanisms to relate symbols to other symbols. Indeed, we already have SymbolExtent, which can be used to represent the associated length of a SymbolicRegion. By using symbols to represent extents of symbolic regions, we employ the same constraint management logic for extents that we do for any other symbolic value.

We really do have all the pieces in place. We already have symbolic extents. The idea might need to be refined in its implementation, but everything we need is already basically there.

Much thanks, Ted. The explanation below actually clears up a great deal
of confusion about how the analyzer represents memory regions. I saw a
message on the commits list suggesting the explanation had been put into
the inline code comments, but for what it's worth, something similar
would also look pretty good in the online documentation, if there's a
place to put it.

At any rate, your patient explanations are greatly appreciated.

Much thanks, Ted. The explanation below actually clears up a great deal
of confusion about how the analyzer represents memory regions. I saw a
message on the commits list suggesting the explanation had been put into
the inline code comments, but for what it's worth, something similar
would also look pretty good in the online documentation, if there's a
place to put it.

I did add the explanation to the online documentation in r146048 (www is in the same repository as the code itself).

Anna.

I'm glad this explanation helped. Over time this will become better documented.