Summary
We propose a new static analysis framework for implementing Clang bug-finding and refactoring tools (including ClangTidy checks) based on a dataflow algorithm.
Some ClangTidy checks and core compiler warnings are already dataflow based. However, due to a lack of a reusable framework, all reimplement common dataflow-analysis and CFG-traversal algorithms. In addition to supporting new checks, the framework should allow us to clean up those parts of Clang and ClangTidy.
We intend that the new framework should have a learning curve similar to AST matchers and not require prior expertise in static analysis. Our goal is a simple API that isolates the user from the dataflow algorithm itself. The initial version of the framework combines support for classic dataflow analyses over the Clang CFG with symbolic evaluation of the code and a simple model of local variables and the heap. A number of useful checks can be expressed solely in terms of this model, saving their implementers from having to reimplement modeling of the semantics of C++ AST nodes. Ultimately, we want to extend this convenience to a wide variety of checks, so that they can be written in terms of the abstract semantics of the program (as modeled by the framework), while only matching and modeling AST nodes for key parts of their domain (e.g., custom data types or functions).
Motivation
Clang has a few subsystems that greatly help with implementing bug-finding and refactoring tools:
-
AST matchers allow one to easily find program fragments whose AST matches a given pattern. This works well for finding program fragments whose desired semantics translates to a small number of possible syntactic expressions (e.g., “find a call to strlen() where the argument is a null pointer constant”). Note that patterns expressed by AST matchers often check program properties that hold for all execution paths.
-
Clang Static Analyzer can find program paths that end in a state that satisfies a specific user-defined property, regardless of how the code was expressed syntactically (e.g., “find a call to strlen() where the argument is dynamically equal to null”). CSA abstracts from the syntax of the code, letting analyses focus on its semantics. While not completely eliding syntax, symbolic evaluation abstracts over many syntactic details. Semantically equivalent program fragments result in equivalent program states computed by the CSA.
When implementing certain bug-finding and refactoring tools, it is often necessary to find program fragments where a given property holds on all execution paths, while placing as few demands as possible on how the user writes their code. The proposed dataflow analysis framework aims to satisfy both of these needs simultaneously.
Example: Ensuring that std::optional is used safely
Consider implementing a static analysis tool that proves that every time a std::optional is unwrapped, the program has ensured that the precondition of the unwrap operation (the optional has a value) is satisfied.
void TestOK(std::optional x) {
if (x.has_value()) {
use(x.value());
}
}
void TestWithWarning(std::optional x) {
use(x.value()); // warning: unchecked access to optional value
}
|
- |
If we implemented this check in CSA, it would likely find the bug in TestWithWarning, and would not report anything in TestOK. However, CSA did not prove the desired property (all optional unwraps are checked on all paths) for TestOK; CSA just could not find a path through TestOK where this property is violated.
A simple function like TestOK only has two paths and would be completely explored by the CSA. However, when loops are involved, CSA uses heuristics to decide when to stop exploring paths in a function. As a result, CSA would miss this bug:
void MissedBugByCSA(std::optional x) {
for (int i = 0; i < 10; i++) {
if (i < 5) {
use(i);
} else {
use(x.value()); // CSA does not report a warning
}
}
}
|
- |
If we implemented this check with AST matchers (as typically done in ClangTidy), we would have to enumerate way too many syntactic patterns. Consider just a couple coding variations that users expect this analysis to understand to be safe:
void TestOK(std::optional x) {
if (x) {
use(x.value());
}
}
void TestOK(std::optional x) {
if (x.has_value() == true) {
use(x.value());
}
}
void TestOK(std::optional x) {
if (!x.has_value()) {
use(0);
} else {
use(x.value());
}
}
void TestOK(std::optional x) {
if (!x.has_value())
return;
use(x.value());
}
|
- |
To summarize, CSA can find some optional-unwrapping bugs, but it does not prove that the code performs a check on all paths. AST matchers can prove it for a finite set of hard-coded syntactic cases. But, when the set of safe patterns is infinite, as is often the case, AST matchers can only capture a subset. Capturing the full set requires building a framework like the one presented in this document.
Comparison
|
AST Matchers Library
|
Clang Static Analyzer
|
Dataflow Analysis Framework
|
- | - | - | - |
Proves that a property holds on all execution paths
|
Possible, depends on the matcher
|
No
|
Yes
|
Checks can be written in terms of program semantics
|
No
|
Yes
|
Yes
|
Further Examples
While we have started with a std::optional checker, we have a number of further use cases where this framework can help. Note that some of these are already implemented in complex, specialized checks, either directly in the compiler or as clang-tidy checks. A key goal of our framework is to simplify the implementation of such checks in the future.
-
“the pointer may be null when dereferenced”, “the pointer is always null when dereferenced”,
-
“unnecessary null check on a pointer passed to delete/free”,
-
“this raw pointer variable could be refactored into a std::unique_ptr”,
-
“expression always evaluates to a constant”, “if condition always evaluates to true/false”,
-
“basic block is never executed”, “loop is executed at most once”,
-
“assigned value is always overwritten”,
-
“value was used after having been moved”,
-
“value guarded by a lock was used without taking the lock”,
-
“unnecessary existence check before inserting a value into a std::map”.
Some of these use cases are covered by existing ClangTidy checks that are implemented with pattern matching. We believe there is an opportunity to simplify their implementation by refactoring them to use the dataflow framework, and simultaneously improve the coverage and precision of how they model C++ semantics.
Overview
The initial version of our framework provides the following key features:
-
a forward dataflow algorithm, parameterized by a user-defined value domain and a transfer function, which describes how program statements modify domain values.
-
a built-in model of local variables and the heap. This model provides two benefits to users: it tracks the flow of values, accounting for a wide variety of C++ language constructs, in a way that is reusable across many different analyses; it discriminates value approximations based on path conditions, allowing for higher precision than dataflow alone.
-
an interface to an SMT solver to verify safety of operations like dereferencing a pointer,
-
test infrastructure for validating results of the dataflow algorithm.
Dataflow Algorithm
Our framework solves forward dataflow equations over a product of a user-defined value domain and a built-in model of local variables and the heap, called the environment. The user implements the transfer function for their value domain, while the framework implements the transfer function that operates on the environment, only requiring a user-defined widening operator to avoid infinite growth of the environment model. We intend to expand the framework to also support backwards dataflow equations at some point in the future. The limitation to the forward direction is not essential.
In more detail, the algorithm is parameterized by:
-
a user-defined, finite-height partial order, with a join function and maximal element,
-
an initial element from the user-defined lattice,
-
a (monotonic) transfer function for statements, which maps between lattice elements,
-
comparison and merge functions for environment values.
We expand on the first three elements here and return to the comparison and merge functions in the next section.
The Lattice and Analysis Abstractions
We start with a representation of a lattice:
class BoundedJoinSemiLattice {
public:
friend bool operator==(const BoundedJoinSemiLattice& lhs,
const BoundedJoinSemiLattice& rhs);
friend bool operator<=(const BoundedJoinSemiLattice& lhs,
const BoundedJoinSemiLattice& rhs);
static BoundedJoinSemiLattice Top();
enum class JoinEffect {
Changed,
Unchanged,
};
JoinEffect Join(const BoundedJoinSemiLattice& element);
};
|
- |
Not all of these declarations are necessary. Specifically, operator== and operator<= can be implemented in terms of Join and Top isn’t used by the dataflow algorithm. We include the operators because direct implementations are likely more efficient, and Top because it provides evidence of a well-founded bounded join semi-lattice.
The Join operation is unusual in its choice of mutating the object instance and returning an indication of whether the join resulted in any changes. A key step of the fixpoint algorithm is code like j = Join(x, y); if (j == x) …. This formulation of Join allows us to express this code instead as if (x.Join(y) == Unchanged)… For lattices whose instances are expensive to create or compare, we expect this formulation to improve the performance of the dataflow analysis.
Based on the lattice abstraction, our analysis is similarly traditional:
class DataflowAnalysis {
public:
using BoundedJoinSemiLattice = …;
BoundedJoinSemiLattice InitialElement();
BoundedJoinSemiLattice TransferStmt(
const clang::Stmt* stmt, const BoundedJoinSemiLattice& element,
clang::dataflow::Environment& environment);
};
|
- |
Environment: A Built-in Lattice
All user-defined operations have access to an environment, which encapsulates the program context of whatever program element is being considered. It contains:
-
a path condition,
-
a storage model (which models both local variables and the heap).
The built-in transfer functions model the effects of C++ language constructs (for example, variable declarations, assignments, pointer dereferences, arithmetic etc.) by manipulating the environment, saving the user from doing so in custom transfer functions.
The path condition accumulates the set of boolean conditions that are known to be true on every path from function entry to the current program point.
The environment maintains the maps from program declarations and pointer values to storage locations. It also maintains the map of storage locations to abstract values. Storage locations can be atomic (“scalar”) or aggregate multiple sublocations:
-
ScalarStorageLocation is a storage location that is not subdivided further for the purposes of abstract interpretation, for example bool, int*.
-
AggregateStorageLocation is a storage location which can be subdivided into smaller storage locations that can be independently tracked by abstract interpretation. For example, a struct with public members.
In addition to this storage structure, our model tracks three kinds of values: basic values (like integers and booleans), pointers (which reify storage location into the value space) and records with named fields, where each field is itself a value. For aggregate values, we ensure coherence between the structure represented in storage locations and that represented in the values: for a given aggregate storage location Lagg with child f:
-
valueAt(Lagg) is a record, and
-
valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),
where valueAt maps storage locations to values.
In our first iteration, the only basic values that we will track are boolean values. Additional values and variables may be relevant, but only in so far as they constitute part of a boolean expression that we care about. For example, if (x > 5) { … } will result in our tracking x > 5, but not any particular knowledge about the integer-valued variable x.
The path conditions and the model refer to the same sets of values and locations.
Accounting for infinite height
Despite the simplicity of our storage model, its elements can grow infinitely large: for example, path conditions on loop back edges, records when modeling linked lists, and even just the number of locations for a simple for-loop. Therefore, we provide the user with two opportunities to interpret the abstract values so as to bound it: comparison of abstract values and a merge operation at control-flow joins.
// Compares values from two different environments for semantic equivalence.
bool Compare(
clang::dataflow::Value* value1, clang::dataflow::Environment& env1,
clang::dataflow::Value* value2, clang::dataflow::Environment& env2);
// Givenvalue1
(w.r.t.env1
) andvalue2
(w.r.t.env2
), returns a
//Value
(w.r.t.merged_env
) that approximatesvalue1
andvalue2
. This
// could be a strict lattice join, or a more general widening operation.
clang::dataflow::Value* Merge(
clang::dataflow::Value* value1, clang::dataflow::Environment& env1,
clang::dataflow::Value* value2, clang::dataflow::Environment& env2,
clang::dataflow::Environment& merged_env);
|
- |
Verification by SAT Solver
The path conditions are critical in refining our approximation of the local variables and heap. At an abstract level, they don’t need to directly influence the execution of the dataflow analysis. Instead, we can imagine that, once concluded, our results are annotated with/incorporate path conditions, which in turn allows us to make more precise conclusions about our program from the analysis results.
For example, if the analysis concludes that a variable conditionally holds a non-null pointer, and, for a given program point, the path condition implies said condition, we can conclude that a dereference of the pointer at that point is safe. In this sense, the verification of the safety of a deference is separate from the analysis, which models memory states with formulae. That said, invocation of the solver need not wait until after the analysis has concluded. It can also be used during the analysis, for example, in defining the widening operation in Merge.
With that, we can see the role of an SAT solver: to solve questions of the form “Does the path condition satisfy the pre-condition computed for this pointer variable at this program point”? To that end, we include an API for asking such questions of a SAT solver. The API is compatible with Z3 and we also include our own (simple) SAT solver, for users who may not want to integrate with Z3.
Testing dataflow analysis
We provide test infrastructure that allows users to easily write tests that make assertions about a program state computed by dataflow analysis at any code point. Code is annotated with labels on points of interest and then test expectations are written with respect to the labeled points.
For example, consider an analysis that computes the state of optional values.
ExpectDataflowAnalysis(R"(
void f(std::optional opt) {
if (opt.has_value()) {
// [[true-branch]]
} else {
// [[false-branch]]
}
})", (const auto& results) {
EXPECT_LE(results.ValueAtCodePoint(“true-branch”), OptionalLattice::Engaged());
EXPECT_LE(results.ValueAtCodePoint(“false-branch”), OptionalLattice::NullOpt());
});
|
- |
Timeline
We have a fully working implementation of the framework as proposed here. We also are interested in contributing a clang-tidy checker for std::optional and related types as soon as possible, given its demonstrated efficacy and interest from third parties. To that end, as soon as we have consensus for starting this work in the clang repository, we plan to start sending patches for review. Given that the framework is relatively small, we expect it will fit within 5 (largish) patches, meaning we can hope to have it available by December 2021.
Future Work
We consider our first version of the framework as experimental. It suits the needs of a number of particular checks that we’ve been writing, but it has some obvious limitations with respect to general use. After we commit our first version, here are some planned areas of improvement and investigation.
Backwards analyses. The framework only solves forward dataflow equations. This clearly prevents implementing some very familiar and useful analyses, like variable liveness. We intend to expand the framework to also support backwards dataflow equations. There’s nothing inherent to our framework that limits it to forwards.
Reasoning about non-boolean values. Currently, we can symbolically model equality of booleans, but nothing else. Yet, for locations on the store that haven’t changed, we can understand which values are identical (because the input value object is passed by the transfer function to the output). For example,
void f(bool *a, bool *b) {
std::optional opt;
if (a == b) { opt = 42; }
if (a == b) { opt.value(); }
}
|
- |
Since neither a nor b have changed, we can reason that the two occurrences of the expression a == b evaluate to the same value (even though we don’t know the exact value). We intend to improve the framework to support this increased precision, given that it will not cost us any additional complexity.
Improved model of program state. A major obstacle to writing high-quality checks for C++ code is the complexity of C++’s semantics. In writing our framework, we’ve come to see the value of shielding the user from the need to handle these directly by modeling program state and letting the user express their check in terms of that model, after it has been solved for each program point. As is, our model currently supports a number of checks, including the std::optional check and another that infers which pointer parameters are out parameters. We intend to expand the set of such checks, with improvements to the model and the API for expressing domain-specific semantics relevant to a check. Additionally, we are considering explicitly tracking other abstract program properties, like reads and writes to local variables and memory, so that users need not account for all the syntactic variants of these operations.
Refactoring existing checks. Once the framework is more mature and stable, we would like to refactor existing checks that use custom, ad-hoc implementations of dataflow to use this framework.
Conclusion
We have proposed an addition to the Clang repository of a new static analysis framework targeting all-paths analyses. The intended clients of such analyses are code-checking and transformation tools like clang-tidy, where sound analyses can support sound transformations of code. We discussed the key features and APIs of our framework and how they fit together into a new analysis framework. Moreover, our proposal is based on experience with a working implementation that supports a number of practical, high-precision analyses, including one for safe use of types like std::optional and another that finds missed opportunities for use of std::move.