[analyzer] Using reaching definitions analysis for bug report construction

Hi!

In this mail, I’ll briefly summarize what I already achieved during my GSoC and discuss what I plan to do moving forward. Feel free to skip ahead if you feel up to date!

What happened so far?

I proposed to enhance bug reports in the Static Analyzer [1]. I specifically targeted reports that were incomplete in a sense that the report didn’t contain information to explain why we concluded that the bug report is a true positive. I’ve shared the following code snippet numerous times, as it captures several different problems the analyzer can’t deal with yet:

01 int flag;

02 bool coin();

03

04 void foo() {

05 flag = coin(); // no note, but there should be one

06 }

07

08 int main() {

09 int *x = 0; // x initialized to 0

10 flag = 1;

11 foo(); // no note, but there should be one

12 if (flag) // assumed false

13 x = new int;

14 foo(); // no note, but there should be one

15

16 if (flag) // assumed true

17 *x = 5; // warn

18 }

In my first update mail [2] I explained that the assumption on line 12 is very important, because we wouldn’t find a nullpointer dereference had flag been true. Similarly, the assumption of line 16 is important. You can see that following the notes in the bug path, despite setting flag to 1 on line 10, we assume it to to be zero, and to be non-zero again.

The goal is to teach the analyzer that the flag’s value on both line 16 and 12 is important, and not to prune the calls to foo, where we should note that flag’s value has been reset. The standard way to achieve this is to track flag in these location (similarly to how x is already tracked). When a variable is tracked, the analyzer constructs notes that explain why believe that it’s value is what believe it to be at the tracking location.

This implies that analyzer can argue about data dependencies quite well (it can use the information it gathered during symbolic analysis by inspecting the exploded graph (all the information gathered during symbolic analysis), specifically the bugpath (the linear path from the error node [where the error was emitted] to the root [where analysis begun]).

The technique I came up with to help on this was inspired by static backward program slicing [3]. Program slicing is essentially creating a subset of the program (program slice) relevant to a (statement, {set of variables}) pair (slicing criterion).

void f() {

int n; read(n);

int sum = 0; int product = 1; for (int i = 0; i < n; ++i) { sum += i;

product *= i; } write(sum);

write(product);

}

For the slicing criterion (write(product), {product}) the following program slice would be created:

void f() {

int n; read(n);

int product = 1; for (int i = 0; i < n; ++i) {

product *= i; }

write(product);

}

Program slicing, roughly speaking, works by collecting a set of relevant variables and a set of relevant statements by tracing data and control dependencies [3].

Since the analyzer is also in need of this functionality (realizing which parts of the program is relevant to the bug’s occurrence), and since data dependencies are already understood, I modified LLVM’s Iterated Dominance Frontier calculator to be used with clang’s CFG, which can be used to calculate control dependencies, assisting bug report generation.

So far, I used this to realize that the condition on line 16 is important, and start tracking it. The patch that implemented it works by tracking the conditions of control dependency blocks of already tracked variables: since x is tracked on line 17, and the block on line 16 is a control dependency of it, we track the condition expression of that block (flag).

It was quickly discovered that it’s a bad approach to use the same tracking for both error-causing variables and conditions, as it lead to a bunch of uninteresting notes being inserted to bug reports – we mainly want the bug reports to be untouched, and only expand it with information previously not present that made it potentially impossible to understand how the bug report came about (like flag’s value changing). Ideas for this shared in a mail [4], where the consensus was to basically prune every note that didn’t happen in a nested stackframe (the stackframe where flag’s value changed (foo()) is nested in the frame where we started tracking it (main())).

I gathered data on several large, open source C++ projects, and I feel confident that there are only so many loose ends left to be tied: don’t track asserts conditions, calls to operator bool, and make the extra notes state clearly that we’re talking about “just a condition”, rather then a bug causing variable.

In my first mail [2] I also detailed that realizing that line 16 is important is a vastly different case then the one on line 12: the bugpath doesn’t contain line 13 (since we assumed flag to be false on line 12), so using BugReporterVisitors wouldn’t be sufficient: we called there cases must-have-happened, where the information we need to realize that a control dependency is important is contained in the bugpath, and should-not-have-happened when it isn’t.

In followup mail to that thread [5] explained that the should-not-have-happened case can be further divided into inlined category (initially referred to as category 1), where the information we need isn’t in the bugpath, but is in function that the analyzer visited, allowing us to resolve parameters and have access to a variety of tools, and the not inlined category (initially referred to as category 2) where the infromation lays in a non-inlined function.

Inlined category (bar is inlined, but we don’t explore line 10):

01 int flag;

02 bool coin();

03

04 void foo() {

05 flag = coin(); // no note, but there should be one

06 }

07

08 void bar(int &i) {

09 if (flag) // assumed false

10 i = new int;

11 }

12

13 int main() {

14 int *x = 0; // x initialized to 0

15 flag = 1;

16 foo(); // no note, but there should be one

17 bar(x);

18 foo(); // no note, but there should be one

19

20 if (flag) // assumed true

21 *x = 5; // warn

22 }

Not inlined category (bar isn’t inlined at all):

01 int flag;

02 bool coin();

03

04 void foo() {

05 flag = coin(); // no note, but there should be one

06 }

07

08 void bar(int &i) {

09 i = new int;

10 }

11

12 int main() {

13 int *x = 0; // x initialized to 0

14 flag = 1;

15 foo(); // no note, but there should be one

16 if (flag) // assumed false

17 bar(x);

18 foo(); // no note, but there should be one

19

20 if (flag) // assumed true

21 *x = 5; // warn

22 }

The not inlined category seems to be far more difficult, like, not even solvable probably with the amount of time I have left, so I only plan to aim for the inlined category cases within the should-not-have-happened problem. I shared a sketch algorithm in a followup mail [6], but I it was brought to my attention that using a reaching definition algorithm might be a superior approach.

There were a variety of other things I did under the name of the project, but they aren’t as exciting to be put in a summary :wink:

Using reaching definitions analysis to solve the should-not-have-happened inline category cases

I am no expert on this topic, please feel free to correct me if I write something dumb.

The reaching definitions analysis was originally made for instruction, not C++ code: “a reaching definition for a given instruction is an earlier instruction whose target variable can reach (be assigned to) the given one without an intervening assignment.” [7].

When we say reaching definitions we refer to the analysis: the REACHin[S] set for basic block S is the set of definitions reaching S, and REACHout[S] are all reaching definitions of its predecessors minus those reaching definitions whose variable is killed by S (contains an intervening assignment) plus any new definitions generated within S.

Now, when we’re talking about variables, the shortcomings of this algorithm when it comes to C++ are obvious: aliasing. I guess we can always just try to be conservative in one way or another: no longer look for reaching definitions after a call to a function where the variable was passed as a non-const reference, or a write to reference of the same type.

There are plenty of big unknown here – I’m inexperienced with implementing such algorithms, and I fear some esoteric AST elements in Clang may need to be a source of a lot of headache.

Here’s an idea to how we could use reaching definitions analysis to discover important statements (like the one on line 13 in the first example of this mail), and track it’s control dependencies:

// Traverses the bugpath to discover all reaching definitions, even

// in nested stackframes. Returns true if any reaching definition is

// found.

function conditionTracker(N : ExplodedNode, X : variable,

Report : BugReport, OriginB : OriginB)

SetOfReachingDefinitions = reaching definitions to X at N

while N is not null do

// Look for the function call to F

N := N->getFirstPred()

if not N->getLocationAs() then

continue

fi

B := CFGBlock associated with N

if there is no path from B to OriginB then

continue

fi

CallEnterN := N’s matching CallEnter

if CallEnterN is null then

continue // and assert that we’re in a top frame

fi

// CallEnterN’s pred should be in the same CFG as OriginB.

CalleeE := CFGElement associated with CallEnterN->getFirstPred()

ParamSet := set of F’s parameters at CallEnterN to which X

is passed to as a non-const reference

for all NewX variables in ParamSet do

if conditionTracker(N, NewX, Report, F’s exit block) then

SetOfReachingDefinitions += CalleeE

fi

od

// X is global/static, or a field and F is a method, etc…

if F would have access to X regardless then

if conditionTracker(N, X, Report, F’s exit block) then

SetOfReachingDefinitions += CalleeE

fi

fi

od

for all RD CFGElementRefs in SetOfReachingDefinitions do

track control dependencies of RD

od

return not whether SetOfReachingDefinitions is empty

Should we just traverse the ExplodedGraph?

In a mail sent by Artem [8], another possible application of the reaching definition algorithm was discussed. However, during our meeting, it was mentioned that maybe for that specific use case, traversing the ExplodedGraph might be a better approach. Should I approach the should-not-have-happened case with this technique as well, maybe it could be reused.

I don’t fancy this. First off, generalizing in the future to also cover the not inlined category is I think more difficult with this approach. Second, algorithms like reaching definitions are set in stone, and coming up with a reliable, correct way to handle the ExplodedGraph may be too difficult and time consuming for me at this point.

Any feedback is appreciated!

Cheers,

Kristóf

[1] https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit

[2] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html

[3] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062613.html

[4] Tip, Frank. A survey of program slicing techniques. Centrum voor Wiskunde en Informatica, 1994.

[5] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062540.html

[6] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062566.html

[7] https://en.wikipedia.org/wiki/Reaching_definition

[8] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062727.html

After talking to Gábor and some other folks I’ve come to the conclusion that indeed, a reaching definitions algorithm is both more reasonable finish in the timeframe I have left, and more reliable.

I chose the implement the following algorithm I found on wikipedia [1]. I also read the related part of “Compilers: Principles, techniques, and tools second edition” [2], which gave me great insights on special cases related to C/C++. Here are a couple things that I tweaked on, and things I have left to do:

  • Clang’s CFG doesn’t contain the parameter declarations, so I manually inspected the FunctionDecl, and added each parameter to the entry blocks GEN set. Now, traditionally, the entry block is skipped altogether, but I doubt this matters much.

  • In general, I wouldn’t like to reason about reaching definitions to pointees.

  • Function calls are hard to reason about other than inspecting the CallExpr’s arguments, so let’s use invalidation (or, in other words, regarded them as definitions):

  • Global and static variables are invalidated on each function call. (I’ve manage to implement this one)

  • If a variable is passed as a non-const reference/pointer to a function, invalidate it. If that variable is of a record type, invalidate it if it’s passed as value, and the copy constructor is user provided.

  • Presume the function doesn’t use offsetof to modify other parts of the a record object, when only a field is passed to it as a parameter.

  • If a record object is invalidated, invalidate all of its fields.

  • Invalidate the the record object on a non-static, non-const method call.- I haven’t touched the aliasing problem yet either (in all cases, I mean that a T * typed object isn’t CXXThis):

  • Presume that the code follows strict aliasing rules.

  • If the pointee of a pointer/reference of type T * is written or invalidated, invalidate all T typed (regardless of signedness) variables.

  • If a record object that contains a T * or void * field is invalidated, all T typed variables should be invalidated.

  • When it comes to polymorphism, a write to a base pointer typed variable should invalidate all derived typed variables.

  • Something I’m quite unsure about: char * can alias everything, should it invalidate everything? Is this too conservative? Is it possible that we don’t even write char * that often so such an invalidation wouldn’t be a big deal anyways?
    I’m not yet sure how I’ll end up handling record objects in general, but I’m reasonably comfortable saying that I’ll be able to finish at least a good subset of this before wrapping the project up. I got a WIP patch going on phabricator [3]. After that, I don’t see big unknowns using the gathered data in a visitor.

[1] https://en.wikipedia.org/wiki/Reaching_definition
[2] Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers: Principles, techniques, and tools second edition.

[3] https://reviews.llvm.org/D64991

We’ve had a lovely meeting about this, during which I’ve been made aware of type based alias analysis, something I’ll need to research a little more. We also talked about implementing the rules with an invalidation propagating algorithm.

I’ll lay this part of the project aside until we can turn the condition tracking feature of the analyzer on by default, just wanted to have this out in the open :).