[RFC] A dataflow analysis framework for Clang AST

ymand · October 12, 2021, 3:32pm

Summary

We propose a new static analysis framework for implementing Clang bug-finding and refactoring tools (including ClangTidy checks) based on a dataflow algorithm.

Some ClangTidy checks and core compiler warnings are already dataflow based. However, due to a lack of a reusable framework, all reimplement common dataflow-analysis and CFG-traversal algorithms. In addition to supporting new checks, the framework should allow us to clean up those parts of Clang and ClangTidy.

We intend that the new framework should have a learning curve similar to AST matchers and not require prior expertise in static analysis. Our goal is a simple API that isolates the user from the dataflow algorithm itself. The initial version of the framework combines support for classic dataflow analyses over the Clang CFG with symbolic evaluation of the code and a simple model of local variables and the heap. A number of useful checks can be expressed solely in terms of this model, saving their implementers from having to reimplement modeling of the semantics of C++ AST nodes. Ultimately, we want to extend this convenience to a wide variety of checks, so that they can be written in terms of the abstract semantics of the program (as modeled by the framework), while only matching and modeling AST nodes for key parts of their domain (e.g., custom data types or functions).

Motivation

Clang has a few subsystems that greatly help with implementing bug-finding and refactoring tools:

AST matchers allow one to easily find program fragments whose AST matches a given pattern. This works well for finding program fragments whose desired semantics translates to a small number of possible syntactic expressions (e.g., “find a call to strlen() where the argument is a null pointer constant”). Note that patterns expressed by AST matchers often check program properties that hold for all execution paths.
Clang Static Analyzer can find program paths that end in a state that satisfies a specific user-defined property, regardless of how the code was expressed syntactically (e.g., “find a call to strlen() where the argument is dynamically equal to null”). CSA abstracts from the syntax of the code, letting analyses focus on its semantics. While not completely eliding syntax, symbolic evaluation abstracts over many syntactic details. Semantically equivalent program fragments result in equivalent program states computed by the CSA.

When implementing certain bug-finding and refactoring tools, it is often necessary to find program fragments where a given property holds on all execution paths, while placing as few demands as possible on how the user writes their code. The proposed dataflow analysis framework aims to satisfy both of these needs simultaneously.

Example: Ensuring that std::optional is used safely

Consider implementing a static analysis tool that proves that every time a std::optional is unwrapped, the program has ensured that the precondition of the unwrap operation (the optional has a value) is satisfied.

void TestOK(std::optional x) {

if (x.has_value()) {

use(x.value());

}

}

void TestWithWarning(std::optional x) {

use(x.value()); // warning: unchecked access to optional value

}

|

|

If we implemented this check in CSA, it would likely find the bug in TestWithWarning, and would not report anything in TestOK. However, CSA did not prove the desired property (all optional unwraps are checked on all paths) for TestOK; CSA just could not find a path through TestOK where this property is violated.

A simple function like TestOK only has two paths and would be completely explored by the CSA. However, when loops are involved, CSA uses heuristics to decide when to stop exploring paths in a function. As a result, CSA would miss this bug:

void MissedBugByCSA(std::optional x) {

for (int i = 0; i < 10; i++) {

if (i < 5) {

use(i);

} else {

use(x.value()); // CSA does not report a warning

}

}

}

|

|

If we implemented this check with AST matchers (as typically done in ClangTidy), we would have to enumerate way too many syntactic patterns. Consider just a couple coding variations that users expect this analysis to understand to be safe:

void TestOK(std::optional x) {

if (x) {

use(x.value());

}

}

void TestOK(std::optional x) {

if (x.has_value() == true) {

use(x.value());

}

}

void TestOK(std::optional x) {

if (!x.has_value()) {

use(0);

} else {

use(x.value());

}

}

void TestOK(std::optional x) {

if (!x.has_value())

return;

use(x.value());

}

|

|

To summarize, CSA can find some optional-unwrapping bugs, but it does not prove that the code performs a check on all paths. AST matchers can prove it for a finite set of hard-coded syntactic cases. But, when the set of safe patterns is infinite, as is often the case, AST matchers can only capture a subset. Capturing the full set requires building a framework like the one presented in this document.

Comparison

|

AST Matchers Library

|

Clang Static Analyzer

|

Dataflow Analysis Framework

|

| - | - | - |

Proves that a property holds on all execution paths

|

Possible, depends on the matcher

|

No

|

Yes

|

Checks can be written in terms of program semantics

|

No

|

Yes

|

Yes

|

Further Examples

While we have started with a std::optional checker, we have a number of further use cases where this framework can help. Note that some of these are already implemented in complex, specialized checks, either directly in the compiler or as clang-tidy checks. A key goal of our framework is to simplify the implementation of such checks in the future.

“the pointer may be null when dereferenced”, “the pointer is always null when dereferenced”,
“unnecessary null check on a pointer passed to delete/free”,
“this raw pointer variable could be refactored into a std::unique_ptr”,
“expression always evaluates to a constant”, “if condition always evaluates to true/false”,
“basic block is never executed”, “loop is executed at most once”,
“assigned value is always overwritten”,
“value was used after having been moved”,
“value guarded by a lock was used without taking the lock”,
“unnecessary existence check before inserting a value into a std::map”.

Some of these use cases are covered by existing ClangTidy checks that are implemented with pattern matching. We believe there is an opportunity to simplify their implementation by refactoring them to use the dataflow framework, and simultaneously improve the coverage and precision of how they model C++ semantics.

Overview

The initial version of our framework provides the following key features:

a forward dataflow algorithm, parameterized by a user-defined value domain and a transfer function, which describes how program statements modify domain values.
a built-in model of local variables and the heap. This model provides two benefits to users: it tracks the flow of values, accounting for a wide variety of C++ language constructs, in a way that is reusable across many different analyses; it discriminates value approximations based on path conditions, allowing for higher precision than dataflow alone.
an interface to an SMT solver to verify safety of operations like dereferencing a pointer,
test infrastructure for validating results of the dataflow algorithm.

Dataflow Algorithm

Our framework solves forward dataflow equations over a product of a user-defined value domain and a built-in model of local variables and the heap, called the environment. The user implements the transfer function for their value domain, while the framework implements the transfer function that operates on the environment, only requiring a user-defined widening operator to avoid infinite growth of the environment model. We intend to expand the framework to also support backwards dataflow equations at some point in the future. The limitation to the forward direction is not essential.

In more detail, the algorithm is parameterized by:

a user-defined, finite-height partial order, with a join function and maximal element,
an initial element from the user-defined lattice,
a (monotonic) transfer function for statements, which maps between lattice elements,
comparison and merge functions for environment values.

We expand on the first three elements here and return to the comparison and merge functions in the next section.

The Lattice and Analysis Abstractions

We start with a representation of a lattice:

class BoundedJoinSemiLattice {

public:

friend bool operator==(const BoundedJoinSemiLattice& lhs,

const BoundedJoinSemiLattice& rhs);

friend bool operator<=(const BoundedJoinSemiLattice& lhs,

const BoundedJoinSemiLattice& rhs);

static BoundedJoinSemiLattice Top();

enum class JoinEffect {

Changed,

Unchanged,

};

JoinEffect Join(const BoundedJoinSemiLattice& element);

};

|

|

Not all of these declarations are necessary. Specifically, operator== and operator<= can be implemented in terms of Join and Top isn’t used by the dataflow algorithm. We include the operators because direct implementations are likely more efficient, and Top because it provides evidence of a well-founded bounded join semi-lattice.

The Join operation is unusual in its choice of mutating the object instance and returning an indication of whether the join resulted in any changes. A key step of the fixpoint algorithm is code like j = Join(x, y); if (j == x) …. This formulation of Join allows us to express this code instead as if (x.Join(y) == Unchanged)… For lattices whose instances are expensive to create or compare, we expect this formulation to improve the performance of the dataflow analysis.

Based on the lattice abstraction, our analysis is similarly traditional:

class DataflowAnalysis {

public:

using BoundedJoinSemiLattice = …;

BoundedJoinSemiLattice InitialElement();

BoundedJoinSemiLattice TransferStmt(

const clang::Stmt* stmt, const BoundedJoinSemiLattice& element,

clang::dataflow::Environment& environment);

};

|

|

Environment: A Built-in Lattice

All user-defined operations have access to an environment, which encapsulates the program context of whatever program element is being considered. It contains:

a path condition,
a storage model (which models both local variables and the heap).

The built-in transfer functions model the effects of C++ language constructs (for example, variable declarations, assignments, pointer dereferences, arithmetic etc.) by manipulating the environment, saving the user from doing so in custom transfer functions.

The path condition accumulates the set of boolean conditions that are known to be true on every path from function entry to the current program point.

The environment maintains the maps from program declarations and pointer values to storage locations. It also maintains the map of storage locations to abstract values. Storage locations can be atomic (“scalar”) or aggregate multiple sublocations:

ScalarStorageLocation is a storage location that is not subdivided further for the purposes of abstract interpretation, for example bool, int*.
AggregateStorageLocation is a storage location which can be subdivided into smaller storage locations that can be independently tracked by abstract interpretation. For example, a struct with public members.

In addition to this storage structure, our model tracks three kinds of values: basic values (like integers and booleans), pointers (which reify storage location into the value space) and records with named fields, where each field is itself a value. For aggregate values, we ensure coherence between the structure represented in storage locations and that represented in the values: for a given aggregate storage location Lagg with child f:

valueAt(Lagg) is a record, and
valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),

where valueAt maps storage locations to values.

In our first iteration, the only basic values that we will track are boolean values. Additional values and variables may be relevant, but only in so far as they constitute part of a boolean expression that we care about. For example, if (x > 5) { … } will result in our tracking x > 5, but not any particular knowledge about the integer-valued variable x.

The path conditions and the model refer to the same sets of values and locations.

Accounting for infinite height

Despite the simplicity of our storage model, its elements can grow infinitely large: for example, path conditions on loop back edges, records when modeling linked lists, and even just the number of locations for a simple for-loop. Therefore, we provide the user with two opportunities to interpret the abstract values so as to bound it: comparison of abstract values and a merge operation at control-flow joins.

// Compares values from two different environments for semantic equivalence.

bool Compare(

clang::dataflow::Value* value1, clang::dataflow::Environment& env1,

clang::dataflow::Value* value2, clang::dataflow::Environment& env2);

// Given value1 (w.r.t. env1) and value2 (w.r.t. env2), returns a

// Value (w.r.t. merged_env) that approximates value1 and value2. This

// could be a strict lattice join, or a more general widening operation.

clang::dataflow::Value* Merge(

clang::dataflow::Value* value1, clang::dataflow::Environment& env1,

clang::dataflow::Value* value2, clang::dataflow::Environment& env2,

clang::dataflow::Environment& merged_env);

|

|

Verification by SAT Solver

The path conditions are critical in refining our approximation of the local variables and heap. At an abstract level, they don’t need to directly influence the execution of the dataflow analysis. Instead, we can imagine that, once concluded, our results are annotated with/incorporate path conditions, which in turn allows us to make more precise conclusions about our program from the analysis results.

For example, if the analysis concludes that a variable conditionally holds a non-null pointer, and, for a given program point, the path condition implies said condition, we can conclude that a dereference of the pointer at that point is safe. In this sense, the verification of the safety of a deference is separate from the analysis, which models memory states with formulae. That said, invocation of the solver need not wait until after the analysis has concluded. It can also be used during the analysis, for example, in defining the widening operation in Merge.

With that, we can see the role of an SAT solver: to solve questions of the form “Does the path condition satisfy the pre-condition computed for this pointer variable at this program point”? To that end, we include an API for asking such questions of a SAT solver. The API is compatible with Z3 and we also include our own (simple) SAT solver, for users who may not want to integrate with Z3.

Testing dataflow analysis

We provide test infrastructure that allows users to easily write tests that make assertions about a program state computed by dataflow analysis at any code point. Code is annotated with labels on points of interest and then test expectations are written with respect to the labeled points.

For example, consider an analysis that computes the state of optional values.

ExpectDataflowAnalysis(R"(

void f(std::optional opt) {

if (opt.has_value()) {

// [[true-branch]]

} else {

// [[false-branch]]

}

})", (const auto& results) {

EXPECT_LE(results.ValueAtCodePoint(“true-branch”), OptionalLattice::Engaged());

EXPECT_LE(results.ValueAtCodePoint(“false-branch”), OptionalLattice::NullOpt());

});

|

|

Timeline

We have a fully working implementation of the framework as proposed here. We also are interested in contributing a clang-tidy checker for std::optional and related types as soon as possible, given its demonstrated efficacy and interest from third parties. To that end, as soon as we have consensus for starting this work in the clang repository, we plan to start sending patches for review. Given that the framework is relatively small, we expect it will fit within 5 (largish) patches, meaning we can hope to have it available by December 2021.

Future Work

We consider our first version of the framework as experimental. It suits the needs of a number of particular checks that we’ve been writing, but it has some obvious limitations with respect to general use. After we commit our first version, here are some planned areas of improvement and investigation.

Backwards analyses. The framework only solves forward dataflow equations. This clearly prevents implementing some very familiar and useful analyses, like variable liveness. We intend to expand the framework to also support backwards dataflow equations. There’s nothing inherent to our framework that limits it to forwards.

Reasoning about non-boolean values. Currently, we can symbolically model equality of booleans, but nothing else. Yet, for locations on the store that haven’t changed, we can understand which values are identical (because the input value object is passed by the transfer function to the output). For example,

void f(bool *a, bool *b) {

std::optional opt;

if (a == b) { opt = 42; }

if (a == b) { opt.value(); }

}

|

|

Since neither a nor b have changed, we can reason that the two occurrences of the expression a == b evaluate to the same value (even though we don’t know the exact value). We intend to improve the framework to support this increased precision, given that it will not cost us any additional complexity.

Improved model of program state. A major obstacle to writing high-quality checks for C++ code is the complexity of C++’s semantics. In writing our framework, we’ve come to see the value of shielding the user from the need to handle these directly by modeling program state and letting the user express their check in terms of that model, after it has been solved for each program point. As is, our model currently supports a number of checks, including the std::optional check and another that infers which pointer parameters are out parameters. We intend to expand the set of such checks, with improvements to the model and the API for expressing domain-specific semantics relevant to a check. Additionally, we are considering explicitly tracking other abstract program properties, like reads and writes to local variables and memory, so that users need not account for all the syntactic variants of these operations.

Refactoring existing checks. Once the framework is more mature and stable, we would like to refactor existing checks that use custom, ad-hoc implementations of dataflow to use this framework.

Conclusion

We have proposed an addition to the Clang repository of a new static analysis framework targeting all-paths analyses. The intended clients of such analyses are code-checking and transformation tools like clang-tidy, where sound analyses can support sound transformations of code. We discussed the key features and APIs of our framework and how they fit together into a new analysis framework. Moreover, our proposal is based on experience with a working implementation that supports a number of practical, high-precision analyses, including one for safe use of types like std::optional and another that finds missed opportunities for use of std::move.

Xazax-hun · October 12, 2021, 7:01pm

Hi Yitzhak,

Yay, finally happening! I 100% support these efforts and feel free to add me as a reviewer to the patches, design documents etc.
I have some questions/comments inline.

Summary

We propose a new static analysis framework for implementing Clang bug-finding and refactoring tools (including ClangTidy checks) based on a dataflow algorithm.

Some ClangTidy checks and core compiler warnings are already dataflow based. However, due to a lack of a reusable framework, all reimplement common dataflow-analysis and CFG-traversal algorithms. In addition to supporting new checks, the framework should allow us to clean up those parts of Clang and ClangTidy.

We do have some helpers to make writing ad-hoc dataflow analyses somewhat easier, e.g., see llvm-project/DataflowWorklist.h at main · llvm/llvm-project (github.com)
But I do agree there is a lot of room for improvement and having a full-fledged framework could be a great way forward.

There is a nice summary of some of the ad-hoc solutions we have in Clang in this survey: [cfe-dev] A survey of dataflow analyses in Clang (llvm.org)
Do you plan to replace the ad-hoc solutions in the long run?

We intend that the new framework should have a learning curve similar to AST matchers and not require prior expertise in static analysis. Our goal is a simple API that isolates the user from the dataflow algorithm itself. The initial version of the framework combines support for classic dataflow analyses over the Clang CFG with symbolic evaluation of the code and a simple model of local variables and the heap. A number of useful checks can be expressed solely in terms of this model, saving their implementers from having to reimplement modeling of the semantics of C++ AST nodes. Ultimately, we want to extend this convenience to a wide variety of checks, so that they can be written in terms of the abstract semantics of the program (as modeled by the framework), while only matching and modeling AST nodes for key parts of their domain (e.g., custom data types or functions).

Just to reiterate, you want to abstract away both from the dataflow algorithm and the semantics of C++. E.g., the user might not need to handle the myriad of ways someone can assign a value to a variable (overloaded operator=, output argument, primitive assignment, etc), only a single abstract concept of assignment.

Motivation

Clang has a few subsystems that greatly help with implementing bug-finding and refactoring tools:

AST matchers allow one to easily find program fragments whose AST matches a given pattern. This works well for finding program fragments whose desired semantics translates to a small number of possible syntactic expressions (e.g., “find a call to strlen() where the argument is a null pointer constant”). Note that patterns expressed by AST matchers often check program properties that hold for all execution paths.

Clang Static Analyzer can find program paths that end in a state that satisfies a specific user-defined property, regardless of how the code was expressed syntactically (e.g., “find a call to strlen() where the argument is dynamically equal to null”). CSA abstracts from the syntax of the code, letting analyses focus on its semantics. While not completely eliding syntax, symbolic evaluation abstracts over many syntactic details. Semantically equivalent program fragments result in equivalent program states computed by the CSA.

When implementing certain bug-finding and refactoring tools, it is often necessary to find program fragments where a given property holds on all execution paths, while placing as few demands as possible on how the user writes their code. The proposed dataflow analysis framework aims to satisfy both of these needs simultaneously.

Example: Ensuring that std::optional is used safely

Consider implementing a static analysis tool that proves that every time a std::optional is unwrapped, the program has ensured that the precondition of the unwrap operation (the optional has a value) is satisfied.

void TestOK(std::optional x) {

if (x.has_value()) {

use(x.value());

}

}

void TestWithWarning(std::optional x) {

use(x.value()); // warning: unchecked access to optional value

}

|

|

If we implemented this check in CSA, it would likely find the bug in TestWithWarning, and would not report anything in TestOK. However, CSA did not prove the desired property (all optional unwraps are checked on all paths) for TestOK; CSA just could not find a path through TestOK where this property is violated.

A simple function like TestOK only has two paths and would be completely explored by the CSA. However, when loops are involved, CSA uses heuristics to decide when to stop exploring paths in a function. As a result, CSA would miss this bug:

void MissedBugByCSA(std::optional x) {

for (int i = 0; i < 10; i++) {

if (i < 5) {

use(i);

} else {

use(x.value()); // CSA does not report a warning

}

}

}

|

|

If we implemented this check with AST matchers (as typically done in ClangTidy), we would have to enumerate way too many syntactic patterns. Consider just a couple coding variations that users expect this analysis to understand to be safe:

void TestOK(std::optional x) {

if (x) {

use(x.value());

}

}

void TestOK(std::optional x) {

if (x.has_value() == true) {

use(x.value());

}

}

void TestOK(std::optional x) {

if (!x.has_value()) {

use(0);

} else {

use(x.value());

}

}

void TestOK(std::optional x) {

if (!x.has_value())

return;

use(x.value());

}

|

|

To summarize, CSA can find some optional-unwrapping bugs, but it does not prove that the code performs a check on all paths. AST matchers can prove it for a finite set of hard-coded syntactic cases. But, when the set of safe patterns is infinite, as is often the case, AST matchers can only capture a subset. Capturing the full set requires building a framework like the one presented in this document.

While most of the checks in CSA are using symbolic execution, there are also some syntactic rules. One dilemma of implementing a new check is where to put it (CSA? Tidy? Warnings?). While it would be great to get rid of this dilemma, I think we want to keep all of the options open. I just wanted to make it explicit what kinds of layering constraints we have. It should be possible to add CSA checks/features using the new dataflow framework.

Comparison

|

AST Matchers Library

|

Clang Static Analyzer

|

Dataflow Analysis Framework

|

| - | - | - |

Proves that a property holds on all execution paths

|

Possible, depends on the matcher

|

No

|

Yes

|

Checks can be written in terms of program semantics

|

No

|

Yes

|

Yes

|

Another example where the forward symbolic execution of CSA falls short is backwards reasoning, e.g.:

void f(int *p)
{
*p = 5;
if (p) // Should warn: redundant null check or null dereference above.
{
…
}
}

CSA will not warn on the code snippet above because it is assumed p is never null at the first dereference (to avoid noise) and it is really hard to revise old assumptions given new evidence. Such problems can be found using dataflow analysis.

Further Examples

While we have started with a std::optional checker, we have a number of further use cases where this framework can help. Note that some of these are already implemented in complex, specialized checks, either directly in the compiler or as clang-tidy checks. A key goal of our framework is to simplify the implementation of such checks in the future.

“the pointer may be null when dereferenced”, “the pointer is always null when dereferenced”,

“unnecessary null check on a pointer passed to delete/free”,

“this raw pointer variable could be refactored into a std::unique_ptr”,

“expression always evaluates to a constant”, “if condition always evaluates to true/false”,

“basic block is never executed”, “loop is executed at most once”,

“assigned value is always overwritten”,

“value was used after having been moved”,

“value guarded by a lock was used without taking the lock”,

“unnecessary existence check before inserting a value into a std::map”.

Some of these use cases are covered by existing ClangTidy checks that are implemented with pattern matching. We believe there is an opportunity to simplify their implementation by refactoring them to use the dataflow framework, and simultaneously improve the coverage and precision of how they model C++ semantics.

Overview

The initial version of our framework provides the following key features:

a forward dataflow algorithm, parameterized by a user-defined value domain and a transfer function, which describes how program statements modify domain values.

a built-in model of local variables and the heap. This model provides two benefits to users: it tracks the flow of values, accounting for a wide variety of C++ language constructs, in a way that is reusable across many different analyses; it discriminates value approximations based on path conditions, allowing for higher precision than dataflow alone.

an interface to an SMT solver to verify safety of operations like dereferencing a pointer,

test infrastructure for validating results of the dataflow algorithm.

Dataflow Algorithm

Our framework solves forward dataflow equations over a product of a user-defined value domain and a built-in model of local variables and the heap, called the environment. The user implements the transfer function for their value domain, while the framework implements the transfer function that operates on the environment, only requiring a user-defined widening operator to avoid infinite growth of the environment model. We intend to expand the framework to also support backwards dataflow equations at some point in the future. The limitation to the forward direction is not essential.

In more detail, the algorithm is parameterized by:

a user-defined, finite-height partial order, with a join function and maximal element,

an initial element from the user-defined lattice,

a (monotonic) transfer function for statements, which maps between lattice elements,

comparison and merge functions for environment values.

What kind of control will the user have over the environment? E.g. could we add checks/extensions that model the effects of certain functions (e.g. std::swap) in the environment?

We expand on the first three elements here and return to the comparison and merge functions in the next section.

The Lattice and Analysis Abstractions

We start with a representation of a lattice:

class BoundedJoinSemiLattice {

public:

friend bool operator==(const BoundedJoinSemiLattice& lhs,

const BoundedJoinSemiLattice& rhs);

friend bool operator<=(const BoundedJoinSemiLattice& lhs,

const BoundedJoinSemiLattice& rhs);

static BoundedJoinSemiLattice Top();

enum class JoinEffect {

Changed,

Unchanged,

};

JoinEffect Join(const BoundedJoinSemiLattice& element);

};

|

|

Not all of these declarations are necessary. Specifically, operator== and operator<= can be implemented in terms of Join and Top isn’t used by the dataflow algorithm. We include the operators because direct implementations are likely more efficient, and Top because it provides evidence of a well-founded bounded join semi-lattice.

The Join operation is unusual in its choice of mutating the object instance and returning an indication of whether the join resulted in any changes. A key step of the fixpoint algorithm is code like j = Join(x, y); if (j == x) …. This formulation of Join allows us to express this code instead as if (x.Join(y) == Unchanged)… For lattices whose instances are expensive to create or compare, we expect this formulation to improve the performance of the dataflow analysis.

Based on the lattice abstraction, our analysis is similarly traditional:

class DataflowAnalysis {

public:

using BoundedJoinSemiLattice = …;

BoundedJoinSemiLattice InitialElement();

BoundedJoinSemiLattice TransferStmt(

const clang::Stmt* stmt, const BoundedJoinSemiLattice& element,

clang::dataflow::Environment& environment);

};

|

|

Do you plan to provide building blocks to help users build lattices? E.g., I really like SPARTA. They added some helpers to encode finite abstract domains, see SPARTA/FiniteAbstractDomain.h at main · facebook/SPARTA (github.com) for details.

Environment: A Built-in Lattice

All user-defined operations have access to an environment, which encapsulates the program context of whatever program element is being considered. It contains:

a path condition,

a storage model (which models both local variables and the heap).

The built-in transfer functions model the effects of C++ language constructs (for example, variable declarations, assignments, pointer dereferences, arithmetic etc.) by manipulating the environment, saving the user from doing so in custom transfer functions.

The path condition accumulates the set of boolean conditions that are known to be true on every path from function entry to the current program point.

I am a bit opposed to using the term path-condition, as users might confuse this with similar terms used in path-sensitive analyses. What about flow-condition?

The environment maintains the maps from program declarations and pointer values to storage locations. It also maintains the map of storage locations to abstract values. Storage locations can be atomic (“scalar”) or aggregate multiple sublocations:

ScalarStorageLocation is a storage location that is not subdivided further for the purposes of abstract interpretation, for example bool, int*.

AggregateStorageLocation is a storage location which can be subdivided into smaller storage locations that can be independently tracked by abstract interpretation. For example, a struct with public members.

Could you elaborate on the storage model? How do you plan to deal with aliasing? How do you represent arrays (with potentially unknown lengths)?

In addition to this storage structure, our model tracks three kinds of values: basic values (like integers and booleans), pointers (which reify storage location into the value space) and records with named fields, where each field is itself a value. For aggregate values, we ensure coherence between the structure represented in storage locations and that represented in the values: for a given aggregate storage location Lagg with child f:

valueAt(Lagg) is a record, and

valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),

where valueAt maps storage locations to values.

In our first iteration, the only basic values that we will track are boolean values. Additional values and variables may be relevant, but only in so far as they constitute part of a boolean expression that we care about. For example, if (x > 5) { … } will result in our tracking x > 5, but not any particular knowledge about the integer-valued variable x.

The path conditions and the model refer to the same sets of values and locations.

Accounting for infinite height

Despite the simplicity of our storage model, its elements can grow infinitely large: for example, path conditions on loop back edges, records when modeling linked lists, and even just the number of locations for a simple for-loop. Therefore, we provide the user with two opportunities to interpret the abstract values so as to bound it: comparison of abstract values and a merge operation at control-flow joins.

// Compares values from two different environments for semantic equivalence.

bool Compare(

clang::dataflow::Value* value1, clang::dataflow::Environment& env1,

clang::dataflow::Value* value2, clang::dataflow::Environment& env2);

// Given value1 (w.r.t. env1) and value2 (w.r.t. env2), returns a

// Value (w.r.t. merged_env) that approximates value1 and value2. This

// could be a strict lattice join, or a more general widening operation.

clang::dataflow::Value* Merge(

clang::dataflow::Value* value1, clang::dataflow::Environment& env1,

clang::dataflow::Value* value2, clang::dataflow::Environment& env2,

clang::dataflow::Environment& merged_env);

|

|

Verification by SAT Solver

The path conditions are critical in refining our approximation of the local variables and heap. At an abstract level, they don’t need to directly influence the execution of the dataflow analysis. Instead, we can imagine that, once concluded, our results are annotated with/incorporate path conditions, which in turn allows us to make more precise conclusions about our program from the analysis results.

For example, if the analysis concludes that a variable conditionally holds a non-null pointer, and, for a given program point, the path condition implies said condition, we can conclude that a dereference of the pointer at that point is safe. In this sense, the verification of the safety of a deference is separate from the analysis, which models memory states with formulae. That said, invocation of the solver need not wait until after the analysis has concluded. It can also be used during the analysis, for example, in defining the widening operation in Merge.

With that, we can see the role of an SAT solver: to solve questions of the form “Does the path condition satisfy the pre-condition computed for this pointer variable at this program point”? To that end, we include an API for asking such questions of a SAT solver. The API is compatible with Z3 and we also include our own (simple) SAT solver, for users who may not want to integrate with Z3.

What representation do you use for the path condition? Using the syntactic forms, like x > 5 might not be sufficient as x might have different values at different program points. E.g.,

void f(int x)
{
assert(x > 0);
…
x = -1;
funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query here.
}

Testing dataflow analysis

We provide test infrastructure that allows users to easily write tests that make assertions about a program state computed by dataflow analysis at any code point. Code is annotated with labels on points of interest and then test expectations are written with respect to the labeled points.

For example, consider an analysis that computes the state of optional values.

ExpectDataflowAnalysis(R"(

void f(std::optional opt) {

if (opt.has_value()) {

// [[true-branch]]

} else {

// [[false-branch]]

}

})", (const auto& results) {

EXPECT_LE(results.ValueAtCodePoint(“true-branch”), OptionalLattice::Engaged());

EXPECT_LE(results.ValueAtCodePoint(“false-branch”), OptionalLattice::NullOpt());

});

|

|

Timeline

We have a fully working implementation of the framework as proposed here. We also are interested in contributing a clang-tidy checker for std::optional and related types as soon as possible, given its demonstrated efficacy and interest from third parties. To that end, as soon as we have consensus for starting this work in the clang repository, we plan to start sending patches for review. Given that the framework is relatively small, we expect it will fit within 5 (largish) patches, meaning we can hope to have it available by December 2021.

Future Work

We consider our first version of the framework as experimental. It suits the needs of a number of particular checks that we’ve been writing, but it has some obvious limitations with respect to general use. After we commit our first version, here are some planned areas of improvement and investigation.

Backwards analyses. The framework only solves forward dataflow equations. This clearly prevents implementing some very familiar and useful analyses, like variable liveness. We intend to expand the framework to also support backwards dataflow equations. There’s nothing inherent to our framework that limits it to forwards.

Reasoning about non-boolean values. Currently, we can symbolically model equality of booleans, but nothing else. Yet, for locations on the store that haven’t changed, we can understand which values are identical (because the input value object is passed by the transfer function to the output). For example,

void f(bool *a, bool *b) {

std::optional opt;

if (a == b) { opt = 42; }

if (a == b) { opt.value(); }

}

|

|

Since neither a nor b have changed, we can reason that the two occurrences of the expression a == b evaluate to the same value (even though we don’t know the exact value). We intend to improve the framework to support this increased precision, given that it will not cost us any additional complexity.

Improved model of program state. A major obstacle to writing high-quality checks for C++ code is the complexity of C++’s semantics. In writing our framework, we’ve come to see the value of shielding the user from the need to handle these directly by modeling program state and letting the user express their check in terms of that model, after it has been solved for each program point. As is, our model currently supports a number of checks, including the std::optional check and another that infers which pointer parameters are out parameters. We intend to expand the set of such checks, with improvements to the model and the API for expressing domain-specific semantics relevant to a check. Additionally, we are considering explicitly tracking other abstract program properties, like reads and writes to local variables and memory, so that users need not account for all the syntactic variants of these operations.

Refactoring existing checks. Once the framework is more mature and stable, we would like to refactor existing checks that use custom, ad-hoc implementations of dataflow to use this framework.

Is inter-procedural analysis out of scope (e.g., using summaries)?

NoQ · October 12, 2021, 9:34pm

Sounds like the mail didn't hit the mailing list due to size restriction. I'm quoting in plain text now to make sure it definitely hits the list in some form.

Like Gabor already said, It's amazing that this is happening. I completely agree with your goals and I think we've discussed a lot of this offline earlier. The amount of existing flow-sensitive warnings may be low but it's steadily growing, Valeriy's -Wcompletion-handler being one of the notable latest additions - a backwards analysis over finite state lattice with a sophisticated system of notes attached to the warning to explain what's happening (generalizing over the latter may possibly be a goal for your framework as well).

Aside from the ability to cover *the other 50% of warnings* compared to what the static analyzer already covers (may-problems vs. must-problems aka all-paths problems), another strong benefit of flow sensitive warnings is their performance which allows us to use them as compiler warnings (as opposed to separate tools like clang-tidy or static analyzer). This in turn delivers the warning to every user of the compiler which increases the user coverage dramatically. This consideration may even apply to may-problems in which the static analyzer already excels technically.

I'm also very curious about your SAT solver implementation; I have a recent hobby of plugging various solvers into the static analyzer (cf. ⚙ D110125 [analyzer] Add a Fourier-Motzkin elimination solver backend.) because it obviously needs some - arguably as much as your endeavor.

ymand · October 13, 2021, 12:52am

Gabor, Artem,

Thanks for your support, and, Artem, thank you for quoting as plain text – you’re right, the system shunted it to the moderator because of its size.

To answer some of the questions/comments raised:

There is a nice summary of some of the ad-hoc solutions we have in Clang in this survey: [cfe-dev] A survey of dataflow analyses in Clang (llvm.org)
Do you plan to replace the ad-hoc solutions in the long run?

We mention this briefly in the RFC as motivation, but to reiterate here: we’d like the framework to at least be suitable to replace these ad-hoc solutions. Whether we would get to this ourselves or encourage others, review patches, etc. I can’t say at this point. I suspect we’ll refactor at least a few ourselves just to test out the robustness of the framework.

Just to reiterate, you want to abstract away both from the dataflow algorithm and the semantics of C++. E.g., the user might not need to handle the myriad of ways someone can assign a value to a variable (overloaded operator=, output argument, primitive assignment, etc), only a single abstract concept of assignment.

Exactly! Our first version already does this to some extent, we just feel it’s somewhat tailored to our current needs and could use some generalization. For example, it would be great to know when an lvalue is read for the last time and written to another lvalue, because that would suggest a potential std::move, among other things. Yet, that balance of detail and abstraction is not yet available. We only model the effects of reads and writes without explicitly abstracting them (say, tagging each CFG node w/ the set of reads and writes). So, we think there’s room to explore the design space here, but also think what we have will be valuable to a number of analyses right away.

While most of the checks in CSA are using symbolic execution, there are also some syntactic rules. One dilemma of implementing a new check is where to put it (CSA? Tidy? Warnings?). While it would be great to get rid of this dilemma, I think we want to keep all of the options open. I just wanted to make it explicit what kinds of layering constraints we have. It should be possible to add CSA checks/features using the new dataflow framework.

Great point! We hadn’t thought about this possibility, but there’s no good reason to rule out use in CSA. Our key layering constraint is that it be available in clang-tidy. So, if we put it somewhere central, like clang/include/clang/Analysis, would that be suitable for use in CSA? We’re certainly open to other suggestions as well.

Valeriy’s -Wcompletion-handler being
one of the notable latest additions - a backwards analysis over finite
state lattice with a sophisticated system of notes attached to the
warning to explain what’s happening (generalizing over the latter may
possibly be a goal for your framework as well).

Yes, we’re definitely interested in being able to produce warnings that give the user a clear idea of what is wrong and (if possible) how to fix it. This is complicated once we involve the SAT solver, since it tends to give us simple yes/no answers. But, actionable warnings seem crucial for successful user adoption.

I’m also very curious about your SAT solver implementation; I have a
recent hobby of plugging various solvers into the static analyzer (cf.
https://reviews.llvm.org/D110125) because it obviously needs some -
arguably as much as your endeavor.

Yes, we’d love to tell you more, but I’ll have to defer to sgatev@, since he’s the author and most familiar with the details.

Cheers,
Yitzhak

ymand · October 13, 2021, 1:39am

-vsavchenko@apple.com (which keeps bouncing)

ymand · October 13, 2021, 1:40am

(for real this time)

Stanislav_Gatev · October 13, 2021, 12:09pm

I’m also very curious about your SAT solver implementation; I have a

recent hobby of plugging various solvers into the static analyzer (cf.

https://reviews.llvm.org/D110125) because it obviously needs some -

arguably as much as your endeavor.

Yes, we’d love to tell you more, but I’ll have to defer to sgatev@, since he’s the author and most familiar with the details.

Currently, we convert a path condition boolean formula to CNF and use a basic DPLL solver to check validity. The solver is not really optimized to handle large formulae as it performs some unnecessary copies in the recursive step, but this hasn’t been a problem so far as the analysis for safe use of std::optional doesn’t produce large formulae. Still, I believe we can improve the performance of the solver while keeping the implementation lightweight. We’re also working on a more general interface that would allow plugging different solver implementations in the framework.

Cheers,
Stanislav

Stanislav_Gatev · October 14, 2021, 7:44am

(replying again because my first reply didn’t end up in the mailing list)

I’m also very curious about your SAT solver implementation; I have a

recent hobby of plugging various solvers into the static analyzer (cf.

https://reviews.llvm.org/D110125) because it obviously needs some -

arguably as much as your endeavor.

Yes, we’d love to tell you more, but I’ll have to defer to sgatev@, since he’s the author and most familiar with the details.

Currently, we convert a path condition boolean formula to CNF and use a basic DPLL solver to check validity. The solver is not really optimized to handle large formulae as it performs some unnecessary copies in the recursive step, but this hasn’t been a problem so far as the analysis for safe use of std::optional doesn’t produce large formulae. Still, I believe we can improve the performance of the solver while keeping the implementation lightweight. We’re also working on a more general interface that would allow plugging different solver implementations in the framework.

Cheers,
Stanislav

gribozavr · October 18, 2021, 5:35pm

Hi Gábor,

Hi Yitzhak,

Yay, finally happening! I 100% support these efforts and feel free to add me as a reviewer to the patches, design documents etc.

Thank you so much! We will!

Another example where the forward symbolic execution of CSA falls short is backwards reasoning, e.g.:

void f(int *p)
{
*p = 5;
if (p) // Should warn: redundant null check or null dereference above.
{
…
}
}

CSA will not warn on the code snippet above because it is assumed p is never null at the first dereference (to avoid noise) and it is really hard to revise old assumptions given new evidence. Such problems can be found using dataflow analysis.

Agreed, that is a good example!

What kind of control will the user have over the environment? E.g. could we add checks/extensions that model the effects of certain functions (e.g. std::swap) in the environment?

This is an open question. Already, users can model the effects of particular function calls. They would add a case for such calls in their transfer function, where they would update the environment accordingly. The full API for manipulating the environment is available to the user. But, we expect there’s more room for supporting this kind of domain-specific semantics, either a simpler API or, likely, a richer one.

Do you plan to provide building blocks to help users build lattices? E.g., I really like SPARTA. They added some helpers to encode finite abstract domains, see SPARTA/FiniteAbstractDomain.h at main · facebook/SPARTA (github.com) for details.

We have a goal of making dataflow analysis to become as accessible as AST matchers are, so we do support adding whatever building blocks necessary to free the checks of boilerplate and make them easy to write. Whether it would be similar to SPARTA or not, I think is yet to be seen.

I am a bit opposed to using the term path-condition, as users might confuse this with similar terms used in path-sensitive analyses. What about flow-condition?

Renaming sounds good, let’s paint this bikeshed in code review.

Could you elaborate on the storage model? How do you plan to deal with aliasing? How do you represent arrays (with potentially unknown lengths)?

A limiting factor for the storage model in particular is that we try to keep the framework accessible to non-experts. It is a limiting factor for all design choices, but it hurts the storage model more than anything else.

At the moment we represent the stack and heap memory at each program point as a couple of maps:

llvm::DenseMap<const ValueDecl*, StorageLocation*> decl_to_loc_;
llvm::DenseMap<const Expr*, StorageLocation*> expr_to_loc_;
llvm::DenseMap<const StorageLocation*, Value*> loc_to_value_;

ValueDecl pointers are the Clang AST nodes for global variables, function parameter variables, and local variables.

Expr pointers are the Clang AST nodes for the expressions whose values have been already computed and are accessible at this program point.

StorageLocation represents a memory location. A memory location stores a Value. A Value knows its type, and one can retrieve its sub-objects as Values. Nevertheless, the Value is merely a name of a symbolic variable and does not store the approximation of the runtime value computed by the analysis so far. Value objects participate in the path condition, and that’s where they get constrained.

We create the first map when we initialize the analysis (before running the dataflow fixpoint algorithm). We are making an unsound assumption that each Value that has a pointer or reference type points to a distinct object. We find this assumption to be useful in practice, however it is not essential for the framework. We would like all unsound behavior, no matter how useful, to be eventually guarded by flags that each checker can set in a way that works best for it. Furthermore, we think that the framework could support different levels of detail in representing storage, and a checker could similarly request the necessary abstraction level.

A limiting factor for representing aliasing is that we would probably have to rely on an SMT (not just a SAT) solver to make queries to the path condition if we track that two pointers may (but not necessarily must) alias.

We compute the other two maps for every program point. Assignments, no-op casts, temporary materialization etc. can forward the input Value objects to their destination StorageLocations, allowing us to symbolically track the data flow.

The environment is of course subject to the join operation. To join distinct Value objects, we have a built-in rule for booleans (we build a new Value that is constrained to be a disjunction of the values being joined), and we allow the user-defined checker to handle joining values of domain-specific types. Expanding the built-in rules to cover more types is desirable, but would likely create a hard dependency on an SMT (not just a SAT) solver.

As for arrays, we don’t model them right now and we don’t immediately have a good idea of how we could do it in a simple way.

What representation do you use for the path condition? Using the syntactic forms, like x > 5 might not be sufficient as x might have different values at different program points. E.g.,

void f(int x)
{
assert(x > 0);
…
x = -1;
funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query here.
}

We incorporate the symbolic value of the expression into the path condition. Since we don’t support integer comparisons, let’s use a boolean:

void f(std::optional opt) {
bool is_valid = opt.has_value();
if (is_valid) {
is_valid = false;
use(opt.value());
}
}

When we evaluate the conditional branch, we add to the path condition the symbolic value stored in the storage location of is_valid at the program point of the branch. Therefore the subsequent overwrite of is_valid does not interfere with our ability to check the path condition when we evaluate opt.value().

For those boolean conditions that we don’t understand, the expression is associated with a fresh boolean value, and that value is incorporated into the path condition. Practically, this means a very conservative interpretation of the boolean “x > 5”: every appearance in the code is handled uniquely, even if x has not changed. So, in the following code:

{ …
if (x > 5) p = new …
if (x > 5) f(*p);
}

we can’t at the moment deduce that in *p the pointer is never null. But we can deduce that here:

{ …
if (x > 5) {
b = true;
p = new …;
}
if (b) f(*p);
}

It is certainly desirable to improve this.

Is inter-procedural analysis out of scope (e.g., using summaries)?

Not at all! It’s just well ahead of where we are now. That said, we think it’s important to consider sooner rather than later, since design decisions now could block good interprocedural design later. We are happy to discuss proposals, we’re just prioritizing getting the initial framework upstreamed.

Dmitri

Xazax-hun · October 18, 2021, 7:03pm

Hi Dmitri,

Thanks for the answers, I have some further questions inline.

Hi Gábor,

Hi Yitzhak,

Yay, finally happening! I 100% support these efforts and feel free to add me as a reviewer to the patches, design documents etc.

Thank you so much! We will!

Another example where the forward symbolic execution of CSA falls short is backwards reasoning, e.g.:

void f(int *p)
{
*p = 5;
if (p) // Should warn: redundant null check or null dereference above.
{
…
}
}

CSA will not warn on the code snippet above because it is assumed p is never null at the first dereference (to avoid noise) and it is really hard to revise old assumptions given new evidence. Such problems can be found using dataflow analysis.

Agreed, that is a good example!

What kind of control will the user have over the environment? E.g. could we add checks/extensions that model the effects of certain functions (e.g. std::swap) in the environment?

This is an open question. Already, users can model the effects of particular function calls. They would add a case for such calls in their transfer function, where they would update the environment accordingly. The full API for manipulating the environment is available to the user. But, we expect there’s more room for supporting this kind of domain-specific semantics, either a simpler API or, likely, a richer one.

One of the key learnings for the static analyzer was that we want a strict separation between checks that issue warnings and “checks” that model behavior. E.g., when a user sees false positives from a null dereference checker, it is very likely they still want to have the modeling from the same check that can exclude many infeasible paths. So for the best possible user experience it should be possible for the users to only disable the relevant warnings while still getting the benefits of modeling.Also, it can be useful to disable some modelling without affecting the emitted warnings. On the flip side, the users can shoot themselves in the foot. E.g., even though running the Static Analyzer with the core modelling checks off is not supported, we still get crash reports from such scenarios (that we will not fix).

Do you plan to provide building blocks to help users build lattices? E.g., I really like SPARTA. They added some helpers to encode finite abstract domains, see SPARTA/FiniteAbstractDomain.h at main · facebook/SPARTA (github.com) for details.

We have a goal of making dataflow analysis to become as accessible as AST matchers are, so we do support adding whatever building blocks necessary to free the checks of boilerplate and make them easy to write. Whether it would be similar to SPARTA or not, I think is yet to be seen.

I am a bit opposed to using the term path-condition, as users might confuse this with similar terms used in path-sensitive analyses. What about flow-condition?

Renaming sounds good, let’s paint this bikeshed in code review.

Could you elaborate on the storage model? How do you plan to deal with aliasing? How do you represent arrays (with potentially unknown lengths)?

A limiting factor for the storage model in particular is that we try to keep the framework accessible to non-experts. It is a limiting factor for all design choices, but it hurts the storage model more than anything else.

At the moment we represent the stack and heap memory at each program point as a couple of maps:

llvm::DenseMap<const ValueDecl*, StorageLocation*> decl_to_loc_;
llvm::DenseMap<const Expr*, StorageLocation*> expr_to_loc_;
llvm::DenseMap<const StorageLocation*, Value*> loc_to_value_;

ValueDecl pointers are the Clang AST nodes for global variables, function parameter variables, and local variables.

The store is likely to change very little in each of the transitions. The static analyzer tries to leverage this fact by using immutable data structures, so we can share the common parts between states. This can help reduce memory consumption. And the analyzer favors significantly smaller memory use to slightly better performance because this lets users leverage parallelism across TUs better with the same amount of memory. I was wondering whether you considered using immutable data structures (like llvm/ADT/ImmutableMap.h) or is this just a first version?

Also, how will this model handle expressions that evaluate to rvalues? Do you have locations for rvalues?

Expr pointers are the Clang AST nodes for the expressions whose values have been already computed and are accessible at this program point.

StorageLocation represents a memory location. A memory location stores a Value. A Value knows its type, and one can retrieve its sub-objects as Values. Nevertheless, the Value is merely a name of a symbolic variable and does not store the approximation of the runtime value computed by the analysis so far. Value objects participate in the path condition, and that’s where they get constrained.

We create the first map when we initialize the analysis (before running the dataflow fixpoint algorithm). We are making an unsound assumption that each Value that has a pointer or reference type points to a distinct object.

The analyzer is making a very similar assumption and it seems to work reasonably well. There are certain exceptions. E.g., for overloaded operator= implementations, it will simulate the method both with this != &other and this == &other assumptions. This will double the analysis time for these methods, but it can help finding errors where users forgot to handle the latter case. The APIs of the analyzer is flexible enough that this split can be done in a checker.

We find this assumption to be useful in practice, however it is not essential for the framework. We would like all unsound behavior, no matter how useful, to be eventually guarded by flags that each checker can set in a way that works best for it.

I am perfectly comfortable with unsound behavior by default, there is a whole movement behind this approach: Soundiness Home Page

Furthermore, we think that the framework could support different levels of detail in representing storage, and a checker could similarly request the necessary abstraction level.

A limiting factor for representing aliasing is that we would probably have to rely on an SMT (not just a SAT) solver to make queries to the path condition if we track that two pointers may (but not necessarily must) alias.

We compute the other two maps for every program point. Assignments, no-op casts, temporary materialization etc. can forward the input Value objects to their destination StorageLocations, allowing us to symbolically track the data flow.

The environment is of course subject to the join operation. To join distinct Value objects, we have a built-in rule for booleans (we build a new Value that is constrained to be a disjunction of the values being joined), and we allow the user-defined checker to handle joining values of domain-specific types. Expanding the built-in rules to cover more types is desirable, but would likely create a hard dependency on an SMT (not just a SAT) solver.

When a pointer point to two different locations in two different branches do you join them to a new location and add constraints just like you do for booleans?

As for arrays, we don’t model them right now and we don’t immediately have a good idea of how we could do it in a simple way.

What representation do you use for the path condition? Using the syntactic forms, like x > 5 might not be sufficient as x might have different values at different program points. E.g.,

void f(int x)
{
assert(x > 0);
…
x = -1;
funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query here.
}

We incorporate the symbolic value of the expression into the path condition. Since we don’t support integer comparisons, let’s use a boolean:

void f(std::optional opt) {
bool is_valid = opt.has_value();
if (is_valid) {
is_valid = false;
use(opt.value());
}
}

When we evaluate the conditional branch, we add to the path condition the symbolic value stored in the storage location of is_valid at the program point of the branch. Therefore the subsequent overwrite of is_valid does not interfere with our ability to check the path condition when we evaluate opt.value().

This is awesome, the static analyzer works exactly the same way!

For those boolean conditions that we don’t understand, the expression is associated with a fresh boolean value, and that value is incorporated into the path condition. Practically, this means a very conservative interpretation of the boolean “x > 5”: every appearance in the code is handled uniquely, even if x has not changed. So, in the following code:

{ …
if (x > 5) p = new …
if (x > 5) f(*p);
}

we can’t at the moment deduce that in *p the pointer is never null. But we can deduce that here:

{ …
if (x > 5) {
b = true;
p = new …;
}
if (b) f(*p);
}

It is certainly desirable to improve this.

Agreed. I suspect the proper solution would involve many non-trivial steps including escape analysis.

gribozavr · October 19, 2021, 5:47pm

Hi Gábor,

This is an open question. Already, users can model the effects of particular function calls. They would add a case for such calls in their transfer function, where they would update the environment accordingly. The full API for manipulating the environment is available to the user. But, we expect there's more room for supporting this kind of domain-specific semantics, either a simpler API or, likely, a richer one.

One of the key learnings for the static analyzer was that we want a strict separation between checks that issue warnings and "checks" that model behavior.

Yes, I'm aware I agree that the right direction here would be to
separate the warning and modelling plugins, but probably we should
take it one step further. Running multiple warnings in a shared at the
same time seems to be the root of the problem. So I think that maybe
each warning should configure the modeling (and unsoundness
assumptions in the core) according to its own needs.

The store is likely to change very little in each of the transitions. The static analyzer tries to leverage this fact by using immutable data structures, so we can share the common parts between states. This can help reduce memory consumption. And the analyzer favors significantly smaller memory use to slightly better performance because this lets users leverage parallelism across TUs better with the same amount of memory. I was wondering whether you considered using immutable data structures (like llvm/ADT/ImmutableMap.h) or is this just a first version?

It is just the first version, we haven't looked at optimization
opportunities deeply. However, when we evaluate the transfer functions
within a basic block, we keep mutating the same environment object
while applying transfer functions. We only keep one copy of the
environment per basic block, as we must do that to check fixpoint
convergence.

Also, how will this model handle expressions that evaluate to rvalues? Do you have locations for rvalues?

We have storage locations for rvalues. It is correct for xvalues, but
it is wrong for prvalues.

I am perfectly comfortable with unsound behavior by default, there is a whole movement behind this approach: Soundiness Home Page

I'm also OK with it, but I think that different analyses should feel
free to choose the unsoundness they want, and avoid unsoundness that
hurts them.

The environment is of course subject to the join operation. To join distinct Value objects, we have a built-in rule for booleans (we build a new Value that is constrained to be a disjunction of the values being joined), and we allow the user-defined checker to handle joining values of domain-specific types. Expanding the built-in rules to cover more types is desirable, but would likely create a hard dependency on an SMT (not just a SAT) solver.

When a pointer point to two different locations in two different branches do you join them to a new location and add constraints just like you do for booleans?

That's what we would do if we would use an SMT solver.

Currently we drop such a pointer from the environment in the join
operation, which in dataflow terms means that we assign the bottom
element, which is not correct, we should be using top, but we don't
have that implemented in our model for pointer values.

Agreed. I suspect the proper solution would involve many non-trivial steps including escape analysis.

Yes. And I would very much want to see what ideas from separation
logic could be transferred into a dataflow setting.

Dmitri

NoQ · October 27, 2021, 10:30pm

One more question. You say that you have a generic machinery for things like heap and constraint solving. You also say that you plan your machinery to be fast enough to be used in compiler warnings. Do you think you can achieve both *simultaneously*? Or is your plan to make it highly modular so that it was possible to opt out of expensive modeling when a specific warning doesn't need it?

gribozavr · October 27, 2021, 10:38pm

The latter. It should also help with predictability which users expect
from certain warnings and ClangTidy checkers.

Dmitri

dblaikie · November 7, 2021, 4:02pm

I guess this would subsume the Clang consumption attributes: https://clang.llvm.org/docs/AttributeReference.html#consumed-annotation-checking - could these older attributes then be reimplemented in terms of this new infrastructure? (or broadened in other ways to better expose this functionality)

gribozavr · November 8, 2021, 10:00am

Yes, it should be possible to reimplement llvm-project/clang/lib/Analysis/Consumed.cpp in terms of the proposed dataflow analysis framework.

Dmitri

dblaikie · November 8, 2021, 9:00pm

I guess this would subsume the Clang consumption attributes: https://clang.llvm.org/docs/AttributeReference.html#consumed-annotation-checking - could these older attributes then be reimplemented in terms of this new infrastructure? (or broadened in other ways to better expose this functionality)

Yes, it should be possible to reimplement llvm-project/clang/lib/Analysis/Consumed.cpp in terms of the proposed dataflow analysis framework.

Would it be reasonable to have that on your plan/timeline? Would be nice to avoid the consumed analysis languishing as a dead end if it could be migrated to the new stuff (& be a nice proof-of-concept/place to make this new functionality live in an existing feature perhaps relatively early on (not that I think the consumed warnings/analysis are widely used - though I’m not sure)))

ymand · November 19, 2021, 1:03pm

We’ve released our first two patches:

First technical patch: https://reviews.llvm.org/D114234

An introduction to the dataflow analysis framework: https://reviews.llvm.org/D114231

Topic		Replies	Views
[analyzer][GSoC] Implementing a dataflow framework for the Clang Static Analyzer Static Analyzer	4	150	March 14, 2019
Building upon the clang static analyzer Clang Frontend	7	132	April 18, 2013
Declarative ExplodedGraph matching Static Analyzer	2	292	February 11, 2022
A survey of dataflow analyses in Clang Clang Frontend	1	195	October 6, 2020
Yet another static analysis tool Clang Frontend	9	662	April 20, 2022

[RFC] A dataflow analysis framework for Clang AST

Motivation

Example: Ensuring that std::optional is used safely

Comparison

Further Examples

Overview

Dataflow Algorithm

The Lattice and Analysis Abstractions

Environment: A Built-in Lattice

Accounting for infinite height

Verification by SAT Solver

Testing dataflow analysis

Timeline

Future Work

Conclusion

Motivation

Example: Ensuring that std::optional is used safely

Comparison

Further Examples

Overview

Dataflow Algorithm

The Lattice and Analysis Abstractions

Environment: A Built-in Lattice

Accounting for infinite height

Verification by SAT Solver

Testing dataflow analysis

Timeline

Future Work

Related Topics