StaticAnalyzer: Implementing checks for std::string

I was looking at trying to implement a checker and thought a fun one would be to try to implement checkers around std::string. For example, I figured I could start with trying to detect out-of-bound access when the length of the string can be determined.

One clang doc(*) suggested that there should be BodyFarm implementations of std::string functions. Would that be the best way to try to solve the OOB check problem? Or is it better to try to “emulate” interactions with std::string objects and try to track the size of the string?

It seems that providing an actual implementation of the functions will be more accurate than an emulation, but then I’m not sure why that’s not already visible from ? Is there some limitation in the static analyzer that keeps it from already having the source from ? Or will an emulation provide something richer than the raw source could provide?

I’ll probably have follow up questions, but my questions branch out from those basic approaches, so I figured I’d start there.

Jared

(*) http://clang-analyzer.llvm.org/open_projects.html

I was looking at trying to implement a checker and thought a fun one would be to try to implement checkers around std::string. For example, I figured I could start with trying to detect out-of-bound access when the length of the string can be determined.

The existing (experimental) CString checker already performs a similar check. You can take a look at how it associates the length of a string with the region that represents the string. The main two reasons it’s experimental is that it has not been sufficiently tested on real code and we felt that the diagnostics for out-of-bound are usually hard to understand. For example, often out-of-bound is due to an overflow or underflow; and it would be beneficial to have some explanation of these events to the user.

The CSting checker does not use BodyFarm (mainly, because it was written before it). The idea behind using the BodyFarm, would be to model the functions using the farm but you would still write the checks inside a checker.

One clang doc(*) suggested that there should be BodyFarm implementations of std::string functions.

This would allow the analyzer to have precise reasoning about these known functions, which would, for example, lead to less false positives.

Would that be the best way to try to solve the OOB check problem? Or is it better to try to “emulate” interactions with std::string objects and try to track the size of the string?

It seems that providing an actual implementation of the functions will be more accurate than an emulation, but then I’m not sure why that’s not already visible from ? Is there some limitation in the static analyzer that keeps it from already having the source from ? Or will an emulation provide something richer than the raw source could provide?

The analyzer only sees what is implemented in the header. Also, we do have limits on cross-function-analyzes. Specifically, we do not “inline” function calls that are too deep or too large. Another reason to model functions with body farm is that the analyzer does not always “understand” the invariants of the code as it is written in the library.

I was looking at trying to implement a checker and thought a fun one would be to try to implement checkers around std::string. For example, I figured I could start with trying to detect out-of-bound access when the length of the string can be determined.

The existing (experimental) CString checker already performs a similar check. You can take a look at how it associates the length of a string with the region that represents the string. The main two reasons it’s experimental is that it has not been sufficiently tested on real code and we felt that the diagnostics for out-of-bound are usually hard to understand. For example, often out-of-bound is due to an overflow or underflow; and it would be beneficial to have some explanation of these events to the user.

The CSting checker does not use BodyFarm (mainly, because it was written before it). The idea behind using the BodyFarm, would be to model the functions using the farm but you would still write the checks inside a checker.

One clang doc(*) suggested that there should be BodyFarm implementations of std::string functions.

This would allow the analyzer to have precise reasoning about these known functions, which would, for example, lead to less false positives.

Thanks, Anna.

That makes sense; also, I could see that you might be able to get much deeper analysis than simple emulation.

Would that be the best way to try to solve the OOB check problem? Or is it better to try to “emulate” interactions with std::string objects and try to track the size of the string?

It seems that providing an actual implementation of the functions will be more accurate than an emulation, but then I’m not sure why that’s not already visible from ? Is there some limitation in the static analyzer that keeps it from already having the source from ? Or will an emulation provide something richer than the raw source could provide?

The analyzer only sees what is implemented in the header. Also, we do have limits on cross-function-analyzes. Specifically, we do not “inline” function calls that are too deep or too large. Another reason to model functions with body farm is that the analyzer does not always “understand” the invariants of the code as it is written in the library.

Does BodyFarm get any freedom over the data layout of std::string? STL doesnt specify how the objects are implemented. So, would a particular implementation of BodyFarm be restricted to work only against libc++? Or do you ignore the actual data model and just have it conjure the values itself via analyzer?

I was looking at trying to implement a checker and thought a fun one would be to try to implement checkers around std::string. For example, I figured I could start with trying to detect out-of-bound access when the length of the string can be determined.

The existing (experimental) CString checker already performs a similar check. You can take a look at how it associates the length of a string with the region that represents the string. The main two reasons it’s experimental is that it has not been sufficiently tested on real code and we felt that the diagnostics for out-of-bound are usually hard to understand. For example, often out-of-bound is due to an overflow or underflow; and it would be beneficial to have some explanation of these events to the user.

The CSting checker does not use BodyFarm (mainly, because it was written before it). The idea behind using the BodyFarm, would be to model the functions using the farm but you would still write the checks inside a checker.

One clang doc(*) suggested that there should be BodyFarm implementations of std::string functions.

This would allow the analyzer to have precise reasoning about these known functions, which would, for example, lead to less false positives.

Thanks, Anna.

That makes sense; also, I could see that you might be able to get much deeper analysis than simple emulation.

Would that be the best way to try to solve the OOB check problem? Or is it better to try to “emulate” interactions with std::string objects and try to track the size of the string?

It seems that providing an actual implementation of the functions will be more accurate than an emulation, but then I’m not sure why that’s not already visible from ? Is there some limitation in the static analyzer that keeps it from already having the source from ? Or will an emulation provide something richer than the raw source could provide?

The analyzer only sees what is implemented in the header. Also, we do have limits on cross-function-analyzes. Specifically, we do not “inline” function calls that are too deep or too large. Another reason to model functions with body farm is that the analyzer does not always “understand” the invariants of the code as it is written in the library.

Does BodyFarm get any freedom over the data layout of std::string? STL doesnt specify how the objects are implemented. So, would a particular implementation of BodyFarm be restricted to work only against libc++? Or do you ignore the actual data model and just have it conjure the values itself via analyzer?

The BodyFarm emulation should be generic and as simple as possible. It would probably be an abstraction of what the APIs actually do. We do not have to follow the existing data layout.

Right. BodyFarm should just be viewed as a factory for faux implementations that model the implementations of functions/methods in service of the analyzer. Like graphics in games (where reality is “faked” to look good), we can get away with something that has completely no bearing to the original implementation as long as it is also self-consistent within the analyzer and makes the analyzer smarter. For example, suppose parts of the implementation of std::string are visible to the analyzer, say in methods in the headers. If BodyFarm conjures up methods that don’t interact well with those implementations (which may be visible to the analyzer) than that could be a problem.

I was looking at trying to implement a checker and thought a fun one would be to try to implement checkers around std::string. For example, I figured I could start with trying to detect out-of-bound access when the length of the string can be determined.

The existing (experimental) CString checker already performs a similar check. You can take a look at how it associates the length of a string with the region that represents the string. The main two reasons it’s experimental is that it has not been sufficiently tested on real code and we felt that the diagnostics for out-of-bound are usually hard to understand. For example, often out-of-bound is due to an overflow or underflow; and it would be beneficial to have some explanation of these events to the user.

The CSting checker does not use BodyFarm (mainly, because it was written before it). The idea behind using the BodyFarm, would be to model the functions using the farm but you would still write the checks inside a checker.

One clang doc(*) suggested that there should be BodyFarm implementations of std::string functions.

This would allow the analyzer to have precise reasoning about these known functions, which would, for example, lead to less false positives.

Thanks, Anna.

That makes sense; also, I could see that you might be able to get much deeper analysis than simple emulation.

This is more about being able to explain the analyzer about the methods and data structures, so that it can reason about them well to both find more bugs and not report false positives. For example, we currently disable some reports because the analyzer does not understand the data structure invariants. (See LikelyFalsePositiveSuppressionBRVisitor::getEndPath in BugReporterVisitors.cpp.) A downside of such blank suppressions is that some real bugs will not be found by the analyzer.

Ted & Anna,

Thanks for your help. Over the weekend, I dug in and studied more about how clang and the analyzer work, and how the BodyFarm fits into it all. I was able to adjust BodyFarm to let me generate bodies for std::basic_string methods, and I have generated a few bodies just to prove I made it work. I know this doesnt sound like much, but it took me some study and trial-error, as I’m still learning how all this works. :slight_smile:

Now I’m ready to start implementing the “real” fake bodies. Here’s a rough idea of what I think this could look like:

void basic_string::clear() {

hook_invalidate_iterators();
hook_set_content(0, “”); // size, dummy pointer
}

basic_string& operator=(basic_string&& str)
{
hook_invalidate_iterators();
hook_set_content(str.size(), str.data());
str.clear();
return *this;
}

size_type basic_string::size() const
{
return hook_get_size();
}

const_pointer basic_string::data() const
{
return hook_get_data();
}

Where each of the “hook_” functions are basic primitives that would get trapped by the respective Checkers (eg, iterator validation checker, out-of-bounds checker).

Does that sound like a reasonable approach? It would mean creating fake CXXMethodDecl (or maybe FunctionDecl that explicitly take ‘this’) that dont really anchor into the actual source, but as far as I can tell, the analyzer core wont care.

What do you think?

Jared

Hi Jared,

I had some offline design discussions with Jordan and Anna, as this proposal raises some interesting design points. I always saw synthesized function bodies as being a very flexible tool we could use for extending the analyzer’s functionality, including writing new checkers, but we hadn’t really thought that much about the design space. I think what you propose seems like a very promising direction.

One thing that is really nice about using the synthesized bodies is that they are easier to get the path-sensitive reasoning in the checker correct. By creating fake bodies, the analyzer engine is responsible for reasoning about the branches, etc., and thus simulating the full path-sensitivity that way. Complicated checkers, on the other hand, need to deal with a bunch of assumption logic, and consider fully-constrained versus under constrained conditions for bugs. That’s likely not to go fully away with checkers written using synthesized bodies, but it may get much simpler. Also it would be interesting to see if we could write full checkers using synthesized bodies, e.g. report the actual bug by calling a hook, but that’s something we can explore over time.

While we are excited about using fake bodies for this approach, we made a few important observations in our offline design discussion.

Most importantly, the BugReporter logic (the code which generates error reports) isn’t that hardened to deal with methods/functions that have invalid source locations, which is the case for all synthesized bodies. That’s somewhat expected, and essentially exposes further work to be done in BugReporter. Essentially, when reporting a bug, BugReporter needs to the diagnostics accordingly when we don’t have a proper source location.

For example, suppose we had a fake implementation of the function foo(), and we knew that foo() dereferenced its pointer argument, for example:

int *p = 0;
foo(p);

In this case, we would want to report a null dereference warning. The problem is that the warning would trigger inside the body of foo(), which is synthesized, and has no valid source locations for the statements in that function body. Until recently the analyzer would just crash when this happened, but Jordan put in a stop gap fix for the BugReporter to discard such bug reports. That’s better than crashing, but obviously not what we want in the long term. Instead, we’d prefer to see something like this:

Dereference of null pull pointer (within call to function foo())

and have that diagnostic reported at the call to foo(). That’s completely doable, and necessary, but it hasn’t been implemented yet. The upshot is that if you want to use BodyFarm to provide these fake implementations of std::string methods that will only help as much as simulating behavior but you won’t be able to report bugs within those functions.

More comments inline.

Ted & Anna,

Thanks for your help. Over the weekend, I dug in and studied more about how clang and the analyzer work, and how the BodyFarm fits into it all. I was able to adjust BodyFarm to let me generate bodies for std::basic_string methods, and I have generated a few bodies just to prove I made it work. I know this doesnt sound like much, but it took me some study and trial-error, as I’m still learning how all this works. :slight_smile:

Now I’m ready to start implementing the “real” fake bodies. Here’s a rough idea of what I think this could look like:

void basic_string::clear() {

hook_invalidate_iterators();
hook_set_content(0, “”); // size, dummy pointer
}

basic_string& operator=(basic_string&& str)
{
hook_invalidate_iterators();
hook_set_content(str.size(), str.data());
str.clear();
return *this;
}

size_type basic_string::size() const
{
return hook_get_size();
}

const_pointer basic_string::data() const
{
return hook_get_data();
}

Where each of the “hook_” functions are basic primitives that would get trapped by the respective Checkers (eg, iterator validation checker, out-of-bounds checker).

This is an interesting design. I really like using the hooks to extend functionality. It’s an unexplored space right now, but I think it is worth a shot.

A few caveats:

(1) This is all subject to further design refinement as we explore this design.
(2) There is a potential layering problem here between BodyFarm, which doesn’t know about your checker, and the checker itself. Ideally the fake function bodies should probably come from code that also provides the hooks, so that it is all self-consistent. Moreover, we’d like a design where the analyzer truly can be made extensible via this approach without the fake function bodies coming from one place.

That said, #2 is a layering problem that I don’t want to block your initial prototyping work. We can figure that out as we go. What I do ask is that you guard the synthesized string bodies under an analyzer configuration variable so that they aren’t synthesized unless the configuration option is passed to the analyzer. This allows you to have a playground to try out this new approach without disrupting the current analyzer functionality.

Does that sound like a reasonable approach? It would mean creating fake CXXMethodDecl (or maybe FunctionDecl that explicitly take ‘this’) that dont really anchor into the actual source, but as far as I can tell, the analyzer core wont care.

It sounds like a great approach, but there are many details to figure out. As I pointed out, the analyzer does care about the fact that the source locations are bogus, but we can also fix that as well over time.