[StaticAnalyzer] C++ related checkers

Hi!

I’m new to this list and to Clang development. Nevertheless I’ve been interested in Clang Static Analyzer for a while. I’ve been using it on a large code base with a lot of success. So let me start by saying: thanks for this amazing piece of code!

But… Some time ago I realized there are hardly any strictly C++ related checkers in CSA. I was wondering if there’s any movement in this area. I was thinking about some checkers for use-after-free for STL containers like std::string, for example:

const char* x = NULL;
{
    std::string foo("foo");
    x = foo.c_str();
}
printf("%s", x); // boom

There are also some other common types of errors in C++, like using an iterator after it has been invalidated. FYI, this one in particular is detected by cppcheck.
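
For readers unfamiliar with the iterator-invalidation pattern mentioned above, here is a hedged illustration. It is not from the original message, and the function name `erase_evens` is made up. The classic mistake appears only as a comment; the function itself avoids it by continuing from the iterator that `std::vector::erase` returns.

```cpp
#include <vector>

// Removes all even values from v.
//
// The buggy variant, shown only as a comment, keeps using `it` after
// erase() has invalidated it:
//
//     for (auto it = v.begin(); it != v.end(); ++it)
//         if (*it % 2 == 0)
//             v.erase(it);   // `it` is invalid from here on; ++it is undefined behavior
//
// The variant below continues from the iterator that erase() returns.
void erase_evens(std::vector<int>& v) {
    for (auto it = v.begin(); it != v.end(); /* advanced in the body */) {
        if (*it % 2 == 0)
            it = v.erase(it);  // iterator to the element after the removed one
        else
            ++it;
    }
}
```

A checker for this class of bug would have to notice that `it` is used after a call that invalidates it, which is the kind of tracking discussed in this thread.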

So I decided to dig a bit to find out whether it is hard to write a checker for use-after-free like in the example with std::string. It looks like MallocChecker deals with a similar class of issues.

I was wondering whether it would be the right approach to try to “bend” MallocChecker to my needs (but it’s already 2.5k lines of code) or to start something new on my own.

Honestly it took me some time even to detect a simple std::string constructor call so the road looks rather long and bumpy…

Any hints, pointers? Any related work?

Thanks in advance.

Best regards,
Adam Romanek

There are still a few core issues to resolve in the analyzer before it becomes really useful for large C++ codebases (in my opinion :) ), and I think that’s why we’re not seeing that many C++ related checks.

Specifically, the handling of constructors and destructors for temporaries still needs work, especially when temporaries are passed as by-value function parameters.
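
To make the pattern concrete, here is a small sketch of a temporary passed as a by-value parameter. The `Tracked` type, its `live` counter, and the functions `consume` and `live_during_call` are hypothetical names introduced for illustration only. The point is that every constructor call has to be paired with a destructor call, and that pairing is the bookkeeping the analyzer has to reproduce.

```cpp
// Counts live objects so constructor/destructor pairing can be observed.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    Tracked(const Tracked&) { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Takes its argument by value and returns how many Tracked objects
// exist while the parameter is alive.
int consume(Tracked t) {
    (void)t;
    return Tracked::live;
}

// Passes a temporary by value. Whether a copy is made or elided depends
// on the language version and the compiler, so the exact count during
// the call is not fixed; the count after the call must be zero again.
int live_during_call() {
    return consume(Tracked());
}
```

After `live_during_call()` returns, `Tracked::live` is back to zero regardless of how many objects existed during the call.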

Could you be more specific about these limitations of the engine? Are they documented somewhere? Are there any plans or ongoing work on getting rid of them?

Best regards,
Adam Romanek

Could you be more specific about these limitations of the engine? Are
they documented somewhere?

Not that I’m aware of. The analyzer works well for a large set of code, and people like us for whom it doesn’t work well enough yet don’t use it.

Are there any plans or ongoing work on
getting rid of them?

It’s an open source project, so help is always welcome :)

I looked at this in the past, but that was about 18 months ago, so take my thoughts with a grain of salt. Also note that I’m not a regular or major contributor here. I’ve done very minor patches, but I’m always hoping to do more :) So here are my thoughts; take them as you will.

The MallocChecker is fine, but the problem is that libc++ is really hard to analyze. It is an efficient implementation, but that cleverness really stresses the analyzer. For example, std::string’s memory layout is a union of three different types (“long”, “short”, “raw” buffers). I think the static analyzer gives up on unions immediately.

The best way around this is to simplify what the analyzer sees. Here are two approaches.

One idea is to use “BodyFarm”, whose role is to synthesize alternate implementations for functions that should be simple to model. If you look here, you’ll see a bit about that: http://clang-analyzer.llvm.org/open_projects.html

Another idea is to actually implement a “simple libc++” and interpose that for analysis. For example, std::basic_string class would just be a pointer and two size_t’s, along with simple implementations of all the member functions and simple iterators. In the future, you could add other analysis hooks (for example, check for iterator invalidation).
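
A minimal sketch of what such a simplified string could look like, assuming the "pointer and two size_t's" layout described above. The name `simple_string` and the choice of members are hypothetical; this is not libc++ code, and a real stand-in would need much more of the `std::basic_string` interface.

```cpp
#include <cstddef>
#include <cstring>

// Deliberately naive string model: one heap pointer plus size and capacity.
// No small-string optimization and no unions.
class simple_string {
    char* data_;
    std::size_t size_;
    std::size_t capacity_;

public:
    explicit simple_string(const char* s)
        : data_(nullptr), size_(std::strlen(s)), capacity_(size_) {
        data_ = new char[capacity_ + 1];
        std::memcpy(data_, s, size_ + 1);  // copies the terminating '\0' too
    }

    simple_string(const simple_string& other)
        : data_(new char[other.size_ + 1]),
          size_(other.size_),
          capacity_(other.size_) {
        std::memcpy(data_, other.data_, size_ + 1);
    }

    // Assignment is left out to keep the sketch short.
    simple_string& operator=(const simple_string&) = delete;

    // Any pointer obtained from c_str() dangles once this has run.
    ~simple_string() { delete[] data_; }

    const char* c_str() const { return data_; }
    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }
    bool empty() const { return size_ == 0; }
};
```

With a layout this plain, the relationship between the object's lifetime and the buffer returned by `c_str()` is a single `new[]`/`delete[]` pair, which should be far easier for an analysis to follow than a union-based layout.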

I did play around a bit with this for Body Farm, and I can forward you the code I wrote. I got a couple of constructors implemented, as well as “empty()” and “size()” for some very basic cases (strings initialized from string literals). However, it got a bit tedious and I’m not sure it would scale. I think the second approach is far more interesting and maintainable. But a “simple libc++” could be hard for its own reasons.

Anyway I’m happy to give you my sketches. I’ll email them off-list. Take them or ignore them however you like.

Hi Jared!

You might be interested in this GSoC project from last year: http://www.google-melange.com/gsoc/project/details/google/gsoc2014/xazax/5717271485874176

It makes it possible to write C++ code for the bodyfarm instead of assembling the AST manually. It works for simple cases and is already available in trunk. Unfortunately there is a lot of work left, which I plan to do, but I lack the time for it at the moment.

Cheers,

Gábor

Hi Gábor!

Thanks for the information. The work you've done might in fact help me to push the C++ related checkers further. I'd like to investigate it a bit.

Is there any summary of what has been done and what is still missing? Are there any code examples?

BTW, how would I know which parts of the C++ standard library might need to be synthesized through BodyFarm?

I'm just trying to understand the amount of work required to use this approach for a basic checker I mentioned at the beginning. Any further hints would be useful.

Thanks!
Adam Romanek

    Could you be more specific about these limitations of the engine? Are
    they documented somewhere?

Not that I'm aware of. The analyzer works well for a large set of code,
and people like us for whom it doesn't work well enough yet don't use it.

Well, as I indicated at the beginning, I use it all the time. I'm just trying to collect the knowledge required to push it towards C++ a bit further.

    Are there any plans or ongoing work on
    getting rid of them?

  It's an open source project, so help is always welcome :)

This is exactly why I started this conversation. I plan to contribute, I just don't want to waste my time and end up in a dead end.

Thanks,
Adam Romanek

Hi Adam!

   Could you be more specific about these limitations of the engine? Are
   they documented somewhere?

Not that I'm aware of. The analyzer works well for a large set of code,
and people like us for whom it doesn't work well enough yet don't use it.

Well, as I indicated at the beginning, I use it all the time. I'm just trying to collect the knowledge required to push it towards C++ a bit further.

The impact of the limitations in the core analysis depends on the codebase, so you might see fewer of them than Manuel does. However, the main reason we do not have a lot of C++ specific checks is that C++ support was a relatively recent addition and is still lagging behind.

For example, we still get a few false positives on the LLVM codebase.

Pushing forward the work on body farms and using that to model string APIs sounds like a good direction. As others mentioned, you'd only need to model as much as your checker needs, for example enough for it to understand the string APIs that "use and free". You might also want to add models that tell the checker that certain APIs do not change the state. I would start with a fresh new checker rather than extending malloc. You might need to work on improving Body Farm support as you go, since we have not used it much before.
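
The bookkeeping such a checker needs can be sketched independently of the analyzer. The toy below is plain C++, not Clang Static Analyzer code, and every name in it (`LifetimeTracker`, `on_c_str`, `on_destroy`, `is_use_after_free`) is hypothetical. Strings and pointers are reduced to integer IDs; a real checker would keep the equivalent state in the analyzer's per-path program state instead.

```cpp
#include <map>
#include <set>

// Toy model of the state a "use after the owning string is destroyed"
// checker would track. Objects and pointers are abstract integer IDs.
class LifetimeTracker {
    std::map<int, int> owner_of_;  // pointer ID -> ID of the string owning the buffer
    std::set<int> destroyed_;      // IDs of strings whose destructor has run

public:
    // Models `ptr = str.c_str()`.
    void on_c_str(int str_id, int ptr_id) { owner_of_[ptr_id] = str_id; }

    // Models the destructor of `str` (the "free" event).
    void on_destroy(int str_id) { destroyed_.insert(str_id); }

    // Models a read through `ptr` (the "use" event).
    // Returns true if the use should be reported.
    bool is_use_after_free(int ptr_id) const {
        auto it = owner_of_.find(ptr_id);
        return it != owner_of_.end() && destroyed_.count(it->second) != 0;
    }
};
```

Replaying the example from the start of the thread against this model: `on_c_str(foo, x)` at the `c_str()` call, `on_destroy(foo)` at the closing brace, and `is_use_after_free(x)` at the `printf` would then return true. Pointers the tracker has never seen are not reported.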

Iterator invalidation checker is also interesting. (I recall starting the work on it and running into some issues with tracking temporaries.)

Neither of these is a trivial project, but both are valuable.
Anna.