[analyzer] GSoC 2018 Final Report - cplusplus.InnerPointer


This summer I've been working on a Clang Static Analyzer checker for dangling
string pointers in C++ as a Google Summer of Code project. You can find a
summary of the accomplishments below. A more detailed and prettier form of the
same report can be found here <https://rnkovacs.github.io/gsoc2018/&gt;\.

If you wouldn't like to read it, but happen to use the cplusplus.InnerPointer
checker on your codebase, I'd appreciate any kind of feedback!

=== Motivation ===

The C++ std::string class provides methods that return a raw pointer to a
string’s inner character buffer. When a std::string is destroyed, the character
buffer is deallocated. A common bug is to use such a raw pointer after string
deallocation, which may lead to crashes or other unexpected behavior.
A real-world example:

return std::to_string(size).c_str();

std::to_string() returns a string by value, a temporary object that will be
destroyed at the end of the full-expression, before the return statement is
executed. The caller will therefore receive a pointer to an already deallocated
character buffer.

Apart from being destroyed, many other operations on the string can cause the
inner buffer to be reallocated (most non-const member functions, and STL
functions that take a string argument by a non-const reference), leading to
similar problems. The goal of this project was to add a new checker to the
Clang Static Analyzer to find when a dangling inner string pointer is used.

Even though the bug shown above might have been found by path-insensitive
analysis (“c_str() is called on a temporary object expression”), the checker
itself is path-sensitive for a number of reasons. One reason is that avoiding
false positive warnings with path-insensitive analysis rapidly becomes more and
more complicated.

For example,

strcpy(dest, string(...).c_str());

is valid because the destructor of string(...) is invoked after strcpy(), but
avoiding a report here would require additional reasoning like “… and the
result isn’t consumed before the end of the full-expression”. Another argument
is that path-sensitive analysis enables us to catch errors like:

const char *foo(bool cond) {
     std::string s;
     if (cond)
         return s.c_str();

Here, the local variable s is destroyed by an automatic destructor at the end
of the function, but we are still able to find the bug.

There is another family of use-after-free errors involving strings that the
summer project unfortunately haven’t covered, the incorrect usage of
std::string_views. A string_view is an object that refers to the character
array inside of an existing string but does not own it - it is the programmer’s
responsibility to ensure that it does not outlive the character sequence it
points to. More information about how the string_view-checking functionality
could be implemented in the analyzer can be found in the "Future work" section.

=== Work outline ===

The basic idea of the checker is to keep record of raw pointers referring to
the inner buffer of a string container (obtained by a c_str() or data() call)
in the program state. The task is to recognize if any of the tracked pointers
is used after a potentially invalidating operation.

Because this is a kind of a use-after-free problem, much of the functionality
we wish to have has already been implemented in MallocChecker. MallocChecker
finds general cases of use-after-free, double-free, and similar issues by
holding information about symbols referring to memory returned by allocation
functions (such as malloc(), alloca(), the C++ new expression, etc.) in its own
data structure in the program state. Under symbols we mean special, fixed
“values” in the analyzer that are unknown at compile time, but can be tracked
to see how they are used, no matter where they are stored in the program. E.g.
the value of a variable that would be read in from the standard input during
normal execution is represented by a symbol during the analysis. (A great
source of further information on how the analyzer represents values is the
Checker Developer Manual

But the internals of containers like std::string are hard for MallocChecker to
understand. An API-specific checker can however figure out exactly what to
track in case of a particular container object, and it can then make good use
of MallocChecker’s information about whether the memory associated with a
symbol has been released. MallocChecker also provides a bug reporter visitor
that attaches useful notes to the bug path leading to the use-after-free

Based on this, the task can be broken down into two components:

1) Implement a new checker (eventually named InnerPointerChecker) that tracks
raw inner pointers of strings, and recognizes if any operation is made on the
string that may re- or deallocate the inner buffer, invalidating the pointer.
In such cases, “hand over” the corrupted pointer symbol to MallocChecker in a
“released” state.

2) Extend MallocChecker to recognize these special, inherently “released”
symbols originating from InnerPointerChecker, so that it can issue an
appropriate warning message and appropriate diagnostic notes along the bug

A more detailed progress description can be found on the page mentioned in the
introduction <https://rnkovacs.github.io/gsoc2018/&gt;, with links pointing to
differential revisions on Phabricator. For those who prefer to browse the
commits themselves, here
<Commits · llvm-mirror/clang · GitHub;
is a list of commits on GitHub.

In summary, the checker is functional, enabled by default, and ready to be used
on real-world projects. Any feedback is greatly appreciated!

=== Future work ===

There are a handful of possible improvements that could enhance the checker:

- Adding support for std::string_view. The analyzer does not understand the
internals of std::string_views, so adding support would mean either modeling
all interesting operations concerning them in the checker, or somehow
convincing the analyzer the inline its methods.

- Adding support for other STL containers apart from strings that are
vulnerable to inner pointer issues, e.g. std::vector.

- Adding support for non-STL containers, such as custom strings, string views,
vectors present in projects like LLVM, Boost, WebKit.

- Pointer escape into global variables is currently not recognized.

=== Evaluation ===

The checker has been evaluated on a number of open-source software projects
(together with the libraries they depend on), most of which appear to use
strings extensively: Bitcoin, Ceph, Harfbuzz, ICU, LibreOffice, LLVM,

In its present form, the checker produces a very low number of reports. Among
the analysis results of the above listed projects, no false positive findings
have been observed, meaning that all reports seemed to be real bugs and have
been reported to the respective developer communities:

- 1 report in Ceph <'Re: Use-after-free problem in RocksDBStore.cc' - MARC;
- 1 report in Facebook’s RocksDB
<Issues · facebook/rocksdb · GitHub;
- 1 report in GPGME <Login;

Project maintainers responded extremely swiftly, and all of these bugs were
fixed within a day!

=== How to use ===

Because all the development was made on the master branch, all you need to use
the checker is a fresh copy of clang. The command to run it on a single file
(together with the default set of checks):

$ clang --analyze file.cpp

Alternatively, it can be enabled explicitly:

$ clang -cc1 -analyze -analyzer-checker=cplusplus.InnerPointer file.cpp

It can also be used with a fresh copy of clang-tidy:

$ clang-tidy -checks=clang-analyzer-cplusplus.InnerPointer file.cpp --

When analyzing a project with scan-build, no further effort is needed, as the
checker is turned on by default.

=== Acknowledgement ===

I would like to express my special thanks to Artem Dergachev and Gábor Horváth
for being the awesome mentors they were. With their guidance, patient
explanations and quick responses even during the weekends, I feel like I’ve
learned an immense amount during the summer. I am also super grateful to George
Karpenkov and Devin Coughlin for their tips and comments on the patches.