[analyzer] Expressing dependencies between arguments and return values

Hi!

I just got started with writing a Clang Static Analyzer checker,
primarily to detect these classes of issues in C:

- allocator/deallocator mismatch: e.g. "g_malloc" must be accompanied by
  "g_free" and not "free".
- memory leak due to allocator mismatch: e.g. "g_list_new" must be
  accompanied by "g_list_free" and not "g_free".
- memory leaks and use-after-free issues in general.

I have an initial version working based on Anna's and Jordan's talk, but
am looking at methods to reduce false positives and false negatives.
Consider this example:

    void checkIdentityFunctionLeak() {
      char *p = g_strdup("");
      char *q = g_strdelimit(p, "_", '-');
    } // expected-warning {{Memory leak}}

To avoid false positives with memleak reporting, I added a pointerEscape
callback that removes symbols. This however has as side-effect that "p"
in the above example is not reported as leaked. "g_strdelimit" can be
modelled as an identity function (p==q), but my attempt to express this
constraint in the PostCall callback failed:

    SValBuilder &svalBuilder = C.getSValBuilder();
    ProgramStateRef State = C.getState();
    SVal Arg0 = Call.getArgSVal(0);
    SVal RetVal = Call.getReturnValue();
    DefinedOrUnknownSVal arg0 = Arg0.castAs<DefinedOrUnknownSVal>();
    DefinedOrUnknownSVal retVal = RetVal.castAs<DefinedOrUnknownSVal>();
    DefinedOrUnknownSVal PtrEQ = svalBuilder.evalEQ(State, arg0, retVal);
    State = State->assume(PtrEQ, true);
    C.addTransition(State);

While the debug.ViewExplodedGraph option shows that the symbols are
assumed to be the same in the CallExpr node:

    conj_$1{gpointer} : { [1, 18446744073709551615] }
    (conj_$4{gchar *}) - (conj_$1{gpointer}) : { [0, 0] }

this range information is somehow lost in the next nodes.

Another example which is currently not caught either (strstr returns an
address from the region of "p"):

    int main() {
        char *p = strdup("xyz");
        if (!p) return 1;
        char *p2 = strstr(p, "z");
        free(p);
        puts(p2); // expected-warning {{Use-after-free}}
    }

Have I misunderstood something in modelling this? Is it possible to
express dependencies between the return value and the arguments of a
library function?

Hello Peter,

07.05.2018 01:07, Peter Wu via cfe-dev пишет:

Hi!

I just got started with writing a Clang Static Analyzer checker,
primarily to detect these classes of issues in C:

- allocator/deallocator mismatch: e.g. "g_malloc" must be accompanied by
   "g_free" and not "free".
- memory leak due to allocator mismatch: e.g. "g_list_new" must be
   accompanied by "g_list_free" and not "g_free".
- memory leaks and use-after-free issues in general.

I have an initial version working based on Anna's and Jordan's talk, but
am looking at methods to reduce false positives and false negatives.
Consider this example:

     void checkIdentityFunctionLeak() {
       char *p = g_strdup("");
       char *q = g_strdelimit(p, "_", '-');
     } // expected-warning {{Memory leak}}

To avoid false positives with memleak reporting, I added a pointerEscape
callback that removes symbols. This however has as side-effect that "p"
in the above example is not reported as leaked. "g_strdelimit" can be
modelled as an identity function (p==q), but my attempt to express this
constraint in the PostCall callback failed:

     SValBuilder &svalBuilder = C.getSValBuilder();
     ProgramStateRef State = C.getState();
     SVal Arg0 = Call.getArgSVal(0);
     SVal RetVal = Call.getReturnValue();
     DefinedOrUnknownSVal arg0 = Arg0.castAs<DefinedOrUnknownSVal>();
     DefinedOrUnknownSVal retVal = RetVal.castAs<DefinedOrUnknownSVal>();
     DefinedOrUnknownSVal PtrEQ = svalBuilder.evalEQ(State, arg0, retVal);
     State = State->assume(PtrEQ, true);
     C.addTransition(State);

The best way to tell the analyzer that some expression returns _exactly same_ value as the argument is to bind the value of the argument as the return result of this expression.

     ProgramStateRef State = C.getState();
     SVal Arg0Val = Call.getArgSVal(0);
     State = State->BindExpr(Call.getOriginExpr(), C.getLocationContext(), Arg0Val);
     C.addTransition(State);

Our primary solver (RangeConstraintManager) has very limited possibilities of handling complex symbolic expressions so it is better to simplify them as much as possible.

While the debug.ViewExplodedGraph option shows that the symbols are
assumed to be the same in the CallExpr node:

     conj_$1{gpointer} : { [1, 18446744073709551615] }
     (conj_$4{gchar *}) - (conj_$1{gpointer}) : { [0, 0] }

this range information is somehow lost in the next nodes.

Another example which is currently not caught either (strstr returns an
address from the region of "p"):

     int main() {
         char *p = strdup("xyz");
         if (!p) return 1;
         char *p2 = strstr(p, "z");
         free(p);
         puts(p2); // expected-warning {{Use-after-free}}
     }

Have I misunderstood something in modelling this? Is it possible to
express dependencies between the return value and the arguments of a
library function?

strstr() lacks modeling in analyzer so it just returns a conjured value. A correct return value should be an ElementRegion whose base region is the region pointed by 'p'. You can try construct this return value using MemRegionManager methods.

Hope this helps.

Hello Peter,

07.05.2018 01:07, Peter Wu via cfe-dev пишет:

Hi!

I just got started with writing a Clang Static Analyzer checker,
primarily to detect these classes of issues in C:

- allocator/deallocator mismatch: e.g. "g_malloc" must be accompanied by
"g_free" and not "free".
- memory leak due to allocator mismatch: e.g. "g_list_new" must be
accompanied by "g_list_free" and not "g_free".
- memory leaks and use-after-free issues in general.

The easiest way to accomplish these would be to extend the existing functionality of MallocChecker. It already warns about mismatching malloc()/operator delete etc. in c++ by tracking allocation families, so it most likely has all the functionality you need. It even already has support for g_malloc and g_free, but not for many other glib functions.

I have an initial version working based on Anna's and Jordan's talk

You might want to look into my workbook as well:
https://github.com/haoNoQ/clang-analyzer-guide/releases/download/v0.1/clang-analyzer-guide-v0.1.pdf

, but
am looking at methods to reduce false positives and false negatives.
Consider this example:

 void checkIdentityFunctionLeak\(\) \{
   char \*p = g\_strdup\(&quot;&quot;\);
   char \*q = g\_strdelimit\(p, &quot;\_&quot;, &#39;\-&#39;\);
 \} // expected\-warning \{\{Memory leak\}\}

To avoid false positives with memleak reporting, I added a pointerEscape
callback that removes symbols. This however has as side-effect that "p"
in the above example is not reported as leaked.

MallocChecker has a relatively sophisticated mechanism to figure out if a certain function deserves a pointer escape or not. You might want to tweak it. The idea to provide a correct return value is a good call but it isn't necessary.

"g_strdelimit" can be
modelled as an identity function (p==q), but my attempt to express this
constraint in the PostCall callback failed:

 SValBuilder &amp;svalBuilder = C\.getSValBuilder\(\);
 ProgramStateRef State = C\.getState\(\);
 SVal Arg0 = Call\.getArgSVal\(0\);
 SVal RetVal = Call\.getReturnValue\(\);
 DefinedOrUnknownSVal arg0 = Arg0\.castAs&lt;DefinedOrUnknownSVal&gt;\(\);
 DefinedOrUnknownSVal retVal = RetVal\.castAs&lt;DefinedOrUnknownSVal&gt;\(\);
 DefinedOrUnknownSVal PtrEQ = svalBuilder\.evalEQ\(State, arg0, retVal\);
 State = State\-&gt;assume\(PtrEQ, true\);
 C\.addTransition\(State\);

The best way to tell the analyzer that some expression returns _exactly same_ value as the argument is to bind the value of the argument as the return result of this expression.

ProgramStateRef State = C\.getState\(\);
SVal Arg0Val = Call\.getArgSVal\(0\);
State = State\-&gt;BindExpr\(Call\.getOriginExpr\(\), C\.getLocationContext\(\), Arg0Val\);
C\.addTransition\(State\);

Our primary solver (RangeConstraintManager) has very limited possibilities of handling complex symbolic expressions so it is better to simplify them as much as possible.

I'd like to add to this that you shouldn't do BindExpr in checkPostCall or checkPostStmt, only in evalCall. Because otherwise it'd conflict with other checkers that may try to model that function.

For modeling string operations you should have a look at CStringChecker and extend it.

While the debug.ViewExplodedGraph option shows that the symbols are
assumed to be the same in the CallExpr node:

 conj\_$1\{gpointer\} : \{ \[1, 18446744073709551615\] \}
 \(conj\_$4\{gchar \*\}\) \- \(conj\_$1\{gpointer\}\) : \{ \[0, 0\] \}

this range information is somehow lost in the next nodes.

Another example which is currently not caught either (strstr returns an
address from the region of "p"):

 int main\(\) \{
     char \*p = strdup\(&quot;xyz&quot;\);
     if \(\!p\) return 1;
     char \*p2 = strstr\(p, &quot;z&quot;\);
     free\(p\);
     puts\(p2\); // expected\-warning \{\{Use\-after\-free\}\}
 \}

Have I misunderstood something in modelling this?Is it possible to
express dependencies between the return value and the arguments of a
library function?

strstr() lacks modeling in analyzer so it just returns a conjured value. A correct return value should be an ElementRegion whose base region is the region pointed by 'p'. You can try construct this return value using MemRegionManager methods.

Hope this helps.

The high-level API for this stuff is SValBuilder::evalBinOp(), just add an integer to a pointer. See how CStringChecker handles functions that return the end of the string (look out for the returnEnd flag).

> Hello Peter,
>
> 07.05.2018 01:07, Peter Wu via cfe-dev пишет:
> > Hi!
> >
> > I just got started with writing a Clang Static Analyzer checker,
> > primarily to detect these classes of issues in C:
> >
> > - allocator/deallocator mismatch: e.g. "g_malloc" must be accompanied by
> > "g_free" and not "free".
> > - memory leak due to allocator mismatch: e.g. "g_list_new" must be
> > accompanied by "g_list_free" and not "g_free".
> > - memory leaks and use-after-free issues in general.
> >

The easiest way to accomplish these would be to extend the existing
functionality of MallocChecker. It already warns about mismatching
malloc()/operator delete etc. in c++ by tracking allocation families, so it
most likely has all the functionality you need. It even already has support
for g_malloc and g_free, but not for many other glib functions.

I did use MallocChecker as source of inspiration, but it has only one
level of differentiation. It can detect new vs delete, but I need to
handle mismatches in project-specific allocators as well:

    void *wmem_alloc(wmem_allocator_t *allocator, const size_t size);
    void wmem_free(wmem_allocator_t *allocator, void *ptr);

where "allocator" is NULL, "wmem_epan_scope()", "wmem_file_scope()",
"wmem_packet_scope()" or something else. So this would be wrong:

    void *p = wmem_alloc(wmem_file_scope(), 1);
    wmem_free(NULL, p);

If other projects have similar allocation functions, I could look into
generalizing this sufficiently into function attributes, but at the
moment I am still exploring the best way to model this.

> > I have an initial version working based on Anna's and Jordan's talk

You might want to look into my workbook as well:
https://github.com/haoNoQ/clang-analyzer-guide/releases/download/v0.1/clang-analyzer-guide-v0.1.pdf

I found that reference in an earlier message, thanks for writing this :slight_smile:
Here is my current list of references:
https://github.com/Lekensteyn/clang-alloc-free-checker#developer-resources

> > Consider this example:
> >
> > void checkIdentityFunctionLeak() {
> > char *p = g_strdup("");
> > char *q = g_strdelimit(p, "_", '-');
> > } // expected-warning {{Memory leak}}
> >
> > To avoid false positives with memleak reporting, I added a pointerEscape
> > callback that removes symbols. This however has as side-effect that "p"
> > in the above example is not reported as leaked.

MallocChecker has a relatively sophisticated mechanism to figure out if a
certain function deserves a pointer escape or not. You might want to tweak
it. The idea to provide a correct return value is a good call but it isn't
necessary.

Looks it uses a lot of heuristics, and one of the assumptions is that
all functions in non-system-headers free memory. Modelling a function
more precisely (i.e. argument == return value) would make detecting
double-free/use-after-free possible.

> The best way to tell the analyzer that some expression returns _exactly
> same_ value as the argument is to bind the value of the argument as the
> return result of this expression.
>
> ProgramStateRef State = C.getState();
> SVal Arg0Val = Call.getArgSVal(0);
> State = State->BindExpr(Call.getOriginExpr(),
> C.getLocationContext(), Arg0Val);
> C.addTransition(State);

Thanks, this did pass my test cases.

> Our primary solver (RangeConstraintManager) has very limited
> possibilities of handling complex symbolic expressions so it is better
> to simplify them as much as possible.

I'd like to add to this that you shouldn't do BindExpr in checkPostCall or
checkPostStmt, only in evalCall. Because otherwise it'd conflict with other
checkers that may try to model that function.

For modeling string operations you should have a look at CStringChecker and
extend it.

I'll look into it. If binding in PostCall gives bad results, would it be
an idea to add debug assertions that catches this?

> > While the debug.ViewExplodedGraph option shows that the symbols are
> > assumed to be the same in the CallExpr node:
> >
> > conj_$1{gpointer} : { [1, 18446744073709551615] }
> > (conj_$4{gchar *}) - (conj_$1{gpointer}) : { [0, 0] }
> >
> > this range information is somehow lost in the next nodes.
> >
> > Another example which is currently not caught either (strstr returns an
> > address from the region of "p"):
> >
> > int main() {
> > char *p = strdup("xyz");
> > if (!p) return 1;
> > char *p2 = strstr(p, "z");
> > free(p);
> > puts(p2); // expected-warning {{Use-after-free}}
> > }
> >
> > Have I misunderstood something in modelling this?Is it possible to
> > express dependencies between the return value and the arguments of a
> > library function?
>
> strstr() lacks modeling in analyzer so it just returns a conjured value.
> A correct return value should be an ElementRegion whose base region is
> the region pointed by 'p'. You can try construct this return value using
> MemRegionManager methods.
>
> Hope this helps.
>
The high-level API for this stuff is SValBuilder::evalBinOp(), just add an
integer to a pointer. See how CStringChecker handles functions that return
the end of the string (look out for the returnEnd flag).

Thanks, I was already puzzled on how to use MemRegionManager, knowing
that evalBinOp/evalEQ can be used directly is a relief. The BindExpr
hint is spot-on, these are also used in cases such as
CStringChecker::evalCopyCommon and CStringChecker::evalStrsep.

Perhaps it is time for a new CSA presentation that incorporates this?
(hint, hint :))