RFC: static analysis, malloc annontation for return via pointer argument

Hi All,

This patch allows for APIs were memory is allocated and placed in a
pointer given to them. (like asprintf, but without the realloc feature)

Currently, when using a pointer to a stack variable as the input to the
function, it declares the memory leak on the next source line after the
stack variable is used. I think it should be declaring the leak on the
last line of the current scope. Which is correct ?

Regards,
Scott

0001-Make-clang-static-analysis-support-allocation-into-a.patch (10.8 KB)

This patch allows for APIs were memory is allocated and placed in a
pointer given to them. (like asprintf, but without the realloc feature)

Thanks Scott. I’m not such a fan of the name of the attribute. Since this is returning an object by reference, how about ownership_returns_byref? I’m sure others will have an opinion, but ownership_returns_pointer really doesn’t tell the user what this attribute does.

As for the implementation itself, it looks okay. The annotation support eventually needs to be migrated from the alpha.unix.MallocWithAnnotations checker to the unix.Malloc checker (same file, logic controlled with a flag), but that’s a separate issue.

Currently, when using a pointer to a stack variable as the input to the
function, it declares the memory leak on the next source line after the
stack variable is used. I think it should be declaring the leak on the
last line of the current scope. Which is correct ?

The current behavior is correct. The current scope could be very big, and may end long after the leak occurs. We have found that reporting leaks as close as possible to where the leak occurred is a much better experience for users.

Hi Scott,

I've thought about this patch some more, and I have some questions. What are the intended semantics of this attribute? With the other attributes its clearer. If a function is annotated as returning an owned pointer, it's clear that if it is non-NULL then it is an owned pointer. But what's the case here? Can a function decide not to return a value at all (e.g., on failure)? How would that be captured by the attribute?

Thanks,
Ted

Hi Ted,

I believe it should take 3 parameters.
- which argument is being used to return
- whether the ret value and pointer are associated
(and how, invalid if < 0, != 0 , 0)
- what the size is (this is complicated)

lets take the case of asprintf()
return argument is 1, ret == -1 is invalid,
size is (ret+1) * sizeof(pointeeetype)

now the api from our codebase we are trying to analyze
int avl_create (struct avl_tree **avl_tree, int max_nodes, int (*compare_function) (void *data1, void *data2))
return argument is 1, ret != 0 is invalid,
size is fixed at sizeof(struct avl_tree)
also, it will leak if not free'd with the correct function, however we
can deal with that issue separately.

I think if I looked I would find more examples.

I guess the annotation should take 3 parameters, how would I represent
the ret value to validness binding ? and similarly for the size ?

Can an annotation have an expression in it ?
If so, something like (for the asprintf case)
ownership_returns_byref(malloc, 1, ==-1, (ret+1) * sizeof)
and (for the avl_create case)
ownership_returns_byref(malloc, 1, !=0, sizeof)

Regards,
Scott

It feels like attributes are not strong enough to represent the complex relationships in the general case. Even if we managed to make it work, the annotations would look non-intuitive. It would be great to have a more general system to summarize the effects of the functions/represent pre and post conditions.

Hi Ted,

I believe it should take 3 parameters.
- which argument is being used to return
- whether the ret value and pointer are associated
(and how, invalid if < 0, != 0 , 0)
- what the size is (this is complicated)

lets take the case of asprintf()
return argument is 1, ret == -1 is invalid,
size is (ret+1) * sizeof(pointeeetype)

now the api from our codebase we are trying to analyze
int avl_create (struct avl_tree **avl_tree, int max_nodes, int (*compare_function) (void *data1, void *data2))
return argument is 1, ret != 0 is invalid,
size is fixed at sizeof(struct avl_tree)
also, it will leak if not free'd with the correct function, however we
can deal with that issue separately.

I think specifying a custom free function is very important (even more than the size) since this is what the malloc checker is supposed to check for. This could be solved by specifying the corresponding free function's name in the attribute.

I think if I looked I would find more examples.

I guess the annotation should take 3 parameters, how would I represent
the ret value to validness binding ?

You could probably use a separate annotation to represent the fact that the function "succeeded" or "failed". This is not as generic since it would not be related to the specific malloc attribute, but could work well in practice.

and similarly for the size ?

Can an annotation have an expression in it ?

If so, something like (for the asprintf case)
ownership_returns_byref(malloc, 1, ==-1, (ret+1) * sizeof)
and (for the avl_create case)
ownership_returns_byref(malloc, 1, !=0, sizeof)

As for the name, maybe it could be possible to reuse "ownership_returns" attribute and rely on the argument to differentiate between returns as the return value from the function vs returns by ptr/reference through one of the arguments?

This is what the family is for. Or rather, what it was intended for, since no one added code to make anything other than "malloc" work.

Using the existing family, how the relation between the family name and the free functions is specified?

Specifying a free function directly does have a negative, where it's not as clean to support multiple free functions, but I am not convinced that it's worse than the family.

Well, the family currently has no use at all, but my original impression was that its primary use was to specify the free function (or set of free functions). The secondary use could be to add reasoning about pointer aliasing, but we don’t have that hooked up either.

The advantage I see of directly specifying the free function is that it makes things instantly extensible, but then why do we have the family at all? (Maybe we should take the family out instead.)

I see how it’s used, you basically pass the family name as the first parameter to all the attributes, which belong to one family. By passing it as the first parameter to the free function, you associate the allocation and deallocation functions. Makes sense to me.

Anna.

Hi,

Just FYI, gcc has an attribute (alloc_size) that allows one to specify that a function returns a buffer of a size that is the multiplication of a given list of parameters. e.g.:

void* my_calloc(size_t, size_t) __attribute__((alloc_size(1,2)))

We already have Sema support for this attribute.

However, this attribute is not expressive enough for other functions (e.g., strdup), so I've proposed a new attribute back in June:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2012-June/022272.html

There wasn't much feedback at the time, and therefore I didn't implemented it. But I think such an attribute would be helpfull for several things, including the clang static analyzer, and the run-time instrumentation features we have.

Nuno

Hi,

I have attached a new patch with the attribute name changed.

I think that extensions for size, family != malloc, binding return value
to pointer can be done separately.

I need to implement the family support one as part of the work I'm
currently doing. I intend to change it to name the free function to use,
instead of the allocator, I think this will be simpler.

Cheers,
Scott

0001-Make-clang-static-analysis-support-allocation-into-a.patch (10.8 KB)

I would favor using the family name for identifying the free function to
use.

How and to which class were you going to attach the information about
which allocator family the data is from ?

Scott,

I do not see the new attribute and it's arguments documented anywhere. (In one of the pervious emails, Ted asked for the semantics of the attribute.) What happens when the allocation function fails to allocate? This is important since the user code often does not free the pointer on the failure path. Not handling this case would lead to false positives.

Also, have you tried to reuse the existing "ownership_returns" attribute? I think that would be cleaner. However, it would be good to see if it can be elegantly extended or not.

Other minor comments:
1) A lot of MallocMemRetPtrAux is a copy and paste from MallocMemAux. Is it possible to factor so that there is more code reuse?
2) 80 column rule.
3) We probably need more than one test.

Cheers,
Anna.

This would need to be stored in RefState. For example:
Kind: Allocated
Module/Family: "malloc"

Though, for efficiency reasons, we might not want to store the string in the ProgramState but rather some ID that maps back to the family string.

Anna.

Anna,

Where is the normal place to document such things ?
Currently it takes 2 arguments, the family name and the argument number
that is the reference to return into.

The binding thing is interesting.
I could add an argument that is true/false indicating whether binding
should occur, and always bind for < 0 is invalid (I haven't found any
APIs where this isn't the case, yet)

The ownership_returns attribute already has an option argument (which
argument is the size) that would conflict with trying to overload it.
I would have to add an optional argument that triggers byref, indicating
which arguments is the reference, of course there is no way to tell
which of these two optional cases is intended, without changing the api
for the ownership_returns attribute.

1) Yes, just working on this now.
2) oops, enabled the boundary marker in my editor now.
3) Yes.

I ran into a minor problem, if you call the returns_byref function
twice, it doesn't notice that the memory leaks. Am I meant to test for
this in the MallocMemAux function ? Or is some other function meant to
notice ?

Cheers,
Scott

Anna,

Where is the normal place to document such things ?

The checker itself would be a good place.

Currently it takes 2 arguments, the family name and the argument number
that is the reference to return into.

The binding thing is interesting.
I could add an argument that is true/false indicating whether binding
should occur, and always bind for < 0 is invalid (I haven't found any
APIs where this isn't the case, yet)

I am not clear what you are talking about here. The return value on failure?

The ownership_returns attribute already has an option argument (which
argument is the size) that would conflict with trying to overload it.
I would have to add an optional argument that triggers byref, indicating
which arguments is the reference, of course there is no way to tell
which of these two optional cases is intended, without changing the api
for the ownership_returns attribute.

OK.

1) Yes, just working on this now.
2) oops, enabled the boundary marker in my editor now.
3) Yes.

I ran into a minor problem, if you call the returns_byref function
twice, it doesn't notice that the memory leaks. Am I meant to test for
this in the MallocMemAux function ? Or is some other function meant to
notice ?

Each allocation should return a different symbol. Each symbol should be tracked separately, so you should see 2 calls to MallocMemAux.

Maybe when you pass the pointer to the second call to the custom malloc function, the checker stops tracking the symbol as a result of pointerEscapes callback...

> Anna,
>
> Where is the normal place to document such things ?

The checker itself would be a good place.

> Currently it takes 2 arguments, the family name and the argument number
> that is the reference to return into.
>
> The binding thing is interesting.
> I could add an argument that is true/false indicating whether binding
> should occur, and always bind for < 0 is invalid (I haven't found any
> APIs where this isn't the case, yet)

I am not clear what you are talking about here. The return value on failure?

yes, the return value of the function on failure.

>
> The ownership_returns attribute already has an option argument (which
> argument is the size) that would conflict with trying to overload it.
> I would have to add an optional argument that triggers byref, indicating
> which arguments is the reference, of course there is no way to tell
> which of these two optional cases is intended, without changing the api
> for the ownership_returns attribute.
>
OK.

> 1) Yes, just working on this now.
> 2) oops, enabled the boundary marker in my editor now.
> 3) Yes.
>
> I ran into a minor problem, if you call the returns_byref function
> twice, it doesn't notice that the memory leaks. Am I meant to test for
> this in the MallocMemAux function ? Or is some other function meant to
> notice ?
>

Each allocation should return a different symbol. Each symbol should be tracked separately, so you should see 2 calls to MallocMemAux.

I think this is happening

Maybe when you pass the pointer to the second call to the custom malloc function, the checker stops tracking the symbol as a result of pointerEscapes callback...

There is an assumption in doesNotFreeMemory that assumes all
user-defined functions can free memory. I would think only correctly
annotated functions can free memory ? (ownership_takes or
ownership_holds)

I notice that this test doesn't work as expected:
void test() {
  int *p;
  int **q = &p;
  q[0] = malloc(4);
  q[0] = malloc(12);
  free(p);
}

(expected that the second malloc will cause a "memory never released"
error, because the memory has most certainly leaked at this point)

which is probably why my annotation also doesn't work correctly.
Ideas ?

Anna,

Where is the normal place to document such things ?

The checker itself would be a good place.

Currently it takes 2 arguments, the family name and the argument number
that is the reference to return into.

The binding thing is interesting.
I could add an argument that is true/false indicating whether binding
should occur, and always bind for < 0 is invalid (I haven’t found any
APIs where this isn’t the case, yet)

I am not clear what you are talking about here. The return value on failure?

yes, the return value of the function on failure.

What about capturing the failure with an annotation? You are proposing to hardcode it in the checker, right? (I don’t fully understand the above.)

The ownership_returns attribute already has an option argument (which
argument is the size) that would conflict with trying to overload it.
I would have to add an optional argument that triggers byref, indicating
which arguments is the reference, of course there is no way to tell
which of these two optional cases is intended, without changing the api
for the ownership_returns attribute.

OK.

  1. Yes, just working on this now.
  2. oops, enabled the boundary marker in my editor now.
  3. Yes.

I ran into a minor problem, if you call the returns_byref function
twice, it doesn’t notice that the memory leaks. Am I meant to test for
this in the MallocMemAux function ? Or is some other function meant to
notice ?

Each allocation should return a different symbol. Each symbol should be tracked separately, so you should see 2 calls to MallocMemAux.

I think this is happening

Maybe when you pass the pointer to the second call to the custom malloc function, the checker stops tracking the symbol as a result of pointerEscapes callback…

There is an assumption in doesNotFreeMemory that assumes all
user-defined functions can free memory. I would think only correctly

annotated functions can free memory ? (ownership_takes or
ownership_holds)

The optimistic and pessimistic checkers behave differently here. For optimistic checker (checker with annotations), the PointerEscape callback should not do anything (as you point out above) - it should just return:

if (Filter.CMallocOptimistic)
return;

My bad, since the optimistic checker is alpha we have not been properly maintaining it. There are probably more subtle issues like this one.

I notice that this test doesn’t work as expected:
void test() {
int *p;
int **q = &p;
q[0] = malloc(4);
q[0] = malloc(12);
free(p);
}

(expected that the second malloc will cause a “memory never released”
error, because the memory has most certainly leaked at this point)

What is the output you are getting here? I am getting a leak warning. It is the memory returned by the first call to malloc that is being leaked.
$ /Volumes/Data/ws/llvmgit/build/Debug+Asserts/bin/clang --analyze ~/tmp/ex.c -Xanalyzer -analyzer-output=text

/Users/zaks/tmp/ex.c:7:2: warning: Memory is never released; potential leak
free(p);
^~~~~~~
/Users/zaks/tmp/ex.c:5:9: note: Memory is allocated
q[0] = malloc(4);
^
/Users/zaks/tmp/ex.c:7:2: note: Memory is never released; potential leak
free(p);
^
1 warning generated.

Anna,

To deal with binding return value, and byref value, I was suggesting to
add another argument to the byref annotation indicating that they are
bound (and always bind ret < 0, byref is invalid)

Can we merge (malloc and mallocwithannotations) such that we always do
the optimistic case ?
Also there is some heavily related code in SecKeychainAPI (some parts of
that API appear to be a specific case of my more generic byref checker)

Oh, in my test case I was assuming it would complain a line earlier than
it does (where the actual leak occurs). Try with free(p) twice however.
(I get the doublely free'd error, but not the leaked memory error)

Regards,
Scott

Anna,

To deal with binding return value, and byref value, I was suggesting to
add another argument to the byref annotation indicating that they are
bound (and always bind ret < 0, byref is invalid)

I don't think that assuming that all functions that want to use this annotation would return a negative number on failure is generic enough. The annotation mechanism does not currently support inclusion of expressions. Do you have examples of functions that would get the new annotation and how the failure is communicated back. For example, if the pointer is set to NULL on failure to allocate in all of them, it might be acceptable to hardcode that in the checker (this would be analogous with malloc). Another possibility would be to add a separate annotation to represent the failure mode of a function (not sure if others would like this idea).

It is extremely important to get the annotation design right and ensure that it will be generic since others might start relying on it and this is not something we can easily change in the future.

Can we merge (malloc and mallocwithannotations) such that we always do
the optimistic case ?

It is very important to have the pessimistic checker by default - it is pessimistic in that it does not assume that user had annotated their code with ownership attributes. Unless you annotate all the functions which might free memory in your codebase, the optimistic checker will give false positives. It is very common to allocate a pointer in one function and pass it to another one which would, for example, use it and deallocate when done. Similarly, one might store a pointer in a global variable, and it will get freed by another function reading the same global. (Not sure what the optimistic checker should do in the second case, but most likely it should stop tracking the pointer. This case would get detected through the pointerEscape callback as well.) The analyzer does not reason cross-translation unit and cannot track the pointer perfectly. This is why the pessimistic checker assumes that the pointer will get properly handled when it "escapes".

It is possible that what we need is a pessimistic checker (does not assume that all functions are annotated) with annotations for extra checking. For example, if you have a custom function that allocates memory, we just want to track it with the pessimistic checker. This might be more useful in general.

Also there is some heavily related code in SecKeychainAPI (some parts of
that API appear to be a specific case of my more generic byref checker)

Malloc and KeyChainAPI checkers are very similar and could be combined. We've kind of bypassed the annotation design issue by writing a new checker for that API, which is not a good precedent. (Also, the SecKeychainAPI checker was written before the pessimistic Malloc, and the plan was to merge them back together later if it made sense.)

Oh, in my test case I was assuming it would complain a line earlier than
it does (where the actual leak occurs).

This is the byproduct of how the analyzer detects leaks. Falls out of performance/precision tradeoff.

Try with free(p) twice however.
(I get the doublely free'd error, but not the leaked memory error)

This is because leaks are considered to be "less severe" (non-sink) issues than the use-afer-frees(sinks). As soon as we discover the use-after-free, we stop exploring the path, so the analyzer never gets to discover the leak.