[Analyzer][RFC] Function Summaries

Hi,

This year we plan to experiment with function summaries in CSA. We have a budget for roughly 2 people to work on this.

Short-term plans:
^^^^^^^^^^^^^^^^^
We’d like to be able to read and use hand-written function summaries. A good example of such hand-written summaries can be found in Cppcheck. The existing StdLibraryFunctionsChecker is the entry point for our upcoming changes. I have created a patch that eliminates the macros in that checker, so that the classes it uses, like ValueRange, become extensible [https://reviews.llvm.org/D73897]. E.g. we can add an argument constraint that must hold on all branches; if the constraint fails, we warn [https://reviews.llvm.org/D73898]. (Note that these patches are still WIP.)

What could be in a summary?

  • Branches
      • E.g. for f(x), if x is in [‘0’, ‘9’] then ret is not 0.
      • This is already handled in StdLibraryFunctionsChecker.
  • Argument constraints (a constraint that must be true on all branches)
      • Range constraint: E.g. x must be in [0, 255] on all branches. This is
        the case for instance with isalpha(x), where if x is out of that range
        then the program is ill-formed (undefined behaviour) according to the
        C standard.
      • Size requirements
        E.g.: asctime_s(char *buf, rsize_t bufsz, const struct tm *time_ptr);
        buf size must be at least bufsz.
      • Not-null
      • Not-uninitialized
      • Not-tainted (?)
  • Return value properties (?)
      • E.g.: the return value must always be checked.

        void foo() {
          void *P = aligned_alloc(4, 8);
          if (P == NULL) { // OK
          }
        }
        void bar() {
          void *P = aligned_alloc(4, 8);
          void *X = P + 1; // expected-warning{{Use of return value that was not checked}}
        }

        Storing this information in summaries could provide configuration for
        checkers like https://reviews.llvm.org/D72705. Statistics-based
        checkers could also benefit from this information.

Perhaps we should come up with a summary description file format (e.g. json or
whatever) that could be parsed by the future variant of the existing
StdLibraryFunctionsChecker.
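To make this concrete, here is a hypothetical sketch of what such a file could look like; every key name below is invented for illustration, nothing here is a committed format:

```yaml
# Hypothetical summary entry for isalpha(); all keys are invented
# for illustration, not a committed format.
- name: isalpha
  signature: "int (int)"
  arg_constraints:
    - arg: 0
      range: [0, 255]        # out-of-range argument is UB per the C standard
  branches:
    - if:   { arg: 0, within: [["'A'", "'Z'"], ["'a'", "'z'"]] }
      then: { return: { not: 0 } }
    - else: { return: 0 }
```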

Once we have a strong foundation for handling existing summaries, we can move
on to generating summaries automatically. We do believe that summary-based
analysis can add strong value only if we are capable of automatically
synthesizing the summaries.

Mid-term plans:
^^^^^^^^^^^^^^^
Automatically create summaries based on the methods described in SMART and Microsoft Pex [Godefroid2007, Godefroid2008].
The basic idea is that we create summaries on demand during the analysis and later try to reuse them if the calling context allows it.
This is something like an advanced caching of the effects of function calls.
A summary has the form [pre]f(x)[post], where pre is a precondition over the function inputs and post is a postcondition over the function outputs.
There are many open questions and concerns here, e.g. about performance: we would probably need an SMT solver to decide whether a summary can be applied.
Also, the mentioned publications use concrete (runtime) execution to drive the symbolic execution towards certain paths.
Naturally we cannot do that, but we hope that worklist-based driving can play a similar role.

Long-term plans:
^^^^^^^^^^^^^^^^
Create summaries based on the ideas in Facebook Infer’s publications [FBInfer2005, FBInfer2020].
The basic idea here is to create summaries that are free from the calling context.
This can be done only with a special data-flow computation and a proof system.
This solution would turn the analysis into a two-phase analysis.

References:
^^^^^^^^^^^
[Godefroid2007] Godefroid, Patrice. “Compositional dynamic test generation.” Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 2007.
[Godefroid2008] Anand, Saswat, Patrice Godefroid, and Nikolai Tillmann. “Demand-driven compositional symbolic execution.” International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, Berlin, Heidelberg, 2008.
[FBInfer2005] Berdine, Josh, Cristiano Calcagno, and Peter W. O’Hearn. “Symbolic execution with separation logic.” Asian Symposium on Programming Languages and Systems. Springer, Berlin, Heidelberg, 2005.
[FBInfer2020] O’Hearn, Peter W. “Incorrectness logic.” Proceedings of the ACM on Programming Languages 4.POPL (2019): 1-32.

Thanks,
Gabor

Hi!

I think being able to write such summaries is valuable for the analyzer!

Please see some comments/questions inline.

Hi,

This year we plan to experiment with function summaries in CSA. We have a budget for roughly 2 people to work on this.

Short-term plans:
^^^^^^^^^^^^^^^^^
We’d like to be able to read and use hand-written function summaries. A good example of such hand-written summaries can be found in Cppcheck.

How does the content of the summaries you proposed compare to CppCheck’s summaries? Is it the same or did you add/remove something?

What could be in a summary?

Some of these items, like nullability, have dedicated annotations. I think the design should address the following points:

  • Should we have this duplicate information or only rely on annotations? Should we use/upstream APINotes?
  • What would happen if the annotation and the summary contradict each other?

Perhaps we should come up with a summary description file format (e.g. json or
whatever) that could be parsed by the future variant of the existing
StdLibraryFunctionsChecker.

Being able to write summaries in a file that is not hardcoded into the compiler is nice. Would APINotes work for this purpose? What would be the distribution model for those summaries? Would clang ship them? Would the library authors ship them along with their libraries? If the latter how would clang pick it up? What if multiple files have summaries for the same function? (In C it is not that hard to have a situation like this due to the lack of namespaces.) How widespread is CppCheck’s format? Should we support that?

Mid-term plans:
^^^^^^^^^^^^^^^
Automatically create summaries based on the methods described in SMART and Microsoft Pex [Godefroid2007, Godefroid2008].

A summary has the form [pre]f(x)[post], where pre is a precondition over the function inputs and post is a postcondition over the function outputs.

While I’d really like to have summaries I think we should lower our expectations a bit here. The current constraint solver of the analyzer might not be powerful enough to handle non-trivial summaries of this form. I do believe it can provide some value but often those papers describing these summaries are using a much more powerful SMT solver as a backend. The other concern is how to model checker specific pre- and post-conditions. Our checker API is not written in an SMT solver friendly way, so it would be fairly hard to synthesize those summaries for our checkers.

Also, I do believe the analyzer could profit a lot from some classical dataflow analyses like mod/ref analysis. If we annotated the arguments with the results of such an analysis, it could help the analyzer invalidate the right amount of state when evaluating a function call without a body.

Long-term plans:
^^^^^^^^^^^^^^^^
Create summaries based on the ideas in Facebook Infer’s publications [FBInfer2005, FBInfer2020].
The basic idea here is to create summaries that are free from the calling context.
This can be done only with a special data-flow computation and a proof system.
This solution would turn the analysis into a two-phase analysis.

I think the previous plans with automatically generated summaries would already turn the analyzer into a two-phase analysis. In case we are interested in Infer-style summaries, I think we should first base our memory modeling on separation logic. I have no idea how far we are from that at this point or how much work it would be. Also, I believe reasoning about such summaries requires specialized solvers that are much more powerful than the current range-based one.

Cheers,
Gabor

Ahaa, great, i'd love to see where it goes!!

When doing a file format for summaries, please try to somehow coordinate with the current work on taint checker yaml files. I don't want multiple incompatible formats to be circulating around :slight_smile:

One of the fresher thoughts that i've had recently on automatically generated summaries was re-using The Attributor (https://llvm.org/devmtg/2019-10/talk-abstracts.html#tech24) to produce function summaries that the static analyzer can import and rely upon.

If you go for automatically generated summaries, please keep in mind the following potential problems:

(a) We do not want checker API to become significantly more complicated than it currently is. It is very likely that generating checker-specific summaries of a function would require very non-trivial checker-side support that'll be too hard for a typical checker developer to handle. Please talk to me once you get there; i'll try to throw in a few examples to see if your approach is lightweight and general enough.

(b) Please understand the problem of "inlined defensive checks" that we have with inlining-based IPA. Basically, when modeling a function call, you need to have a good intuition about whether multiple branches within the function imply multiple branches in the caller context. Like, it's fine to warn that passing a null pointer into foo() causes a null dereference, but it's not fine to emit null dereference warnings upon dereferencing a pointer simply because that pointer was previously passed into function bar() that checked it for null but has long exited since then. Most of the time the ideal solution is to have exactly one branch in the summary. Functions like `isalpha` are a rare exception. Most of the time the state split is not justified because there's no indication on the caller side that a certain callee branch is necessarily taken, so we should not keep these assumptions in the state after the function has exited, so they shouldn't be part of the summary.

In other news - did you ever consider using ASTImporter for generating function body ASTs from hand-written code? Like BodyFarms but without constructing the AST by hand. I imagine that the advantage here is that the solution is more reliable than ASTImporter-based CTU because you can easily make sure that a certain hand-written code under your control is always imported correctly (as opposed to arbitrary code written by the users). This is probably not that hard to implement and that'd speed up our progress on supporting functions that generally benefit from body farms, such as atomic intrinsics. I.e.:

const Stmt *CompareAndSwapBody = ASTImporter.import("""
  bool compare_and_swap(int *p, int old, int new) {
    if (*p != old)
      return false;
    *p = new;
    return true;
  }
""");

?

Hi Artem!

Hi Gábor,

Thanks for your feedback.

How does the content of the summaries you proposed compare to CppCheck’s summaries? Is it the same or did you add/remove something?

I consider the annotations in CppCheck valuable, so most of the proposed argument constraints come from there. The exception is the not-tainted argument constraint, which is new compared to CppCheck. Also, return value properties are not present in CppCheck summaries.

Being able to write summaries in a file that is not hardcoded into the compiler is nice. Would APINotes work for this purpose?

My biggest concern with APINotes is that it modifies the AST with Sema and then the whole compilation chain - including CodeGen - works on the modified AST. With summaries we’d like to describe complex information (e.g. relations between parameters and the return value) that is important only for static analysis and that may be totally irrelevant for CodeGen or other parts of the compiler. Another concern is that APINotes does not support C++.

What would be the distribution model for those summaries? Would clang ship them? Would the library authors ship them along with their libraries? If the latter how would clang pick it up?

In the short term, summaries for stable libs like libc, libc++ and POSIX can be hard-coded in the source of StdLibraryFunctionsChecker. In the long term, I think we should have a specified file format (e.g. YAML) in which we can describe these summaries. Summaries for stable libs (libc, libc++, POSIX) should be shipped with Clang. For extensibility (for library authors), we will probably need an option that specifies a directory from which we could read additional summary files, similarly to what we already have with ModelInjector’s -model-path config option.

What if multiple files have summaries for the same function? (In C it is not that hard to have a situation like this due to the lack of namespaces.)

Option 1: Give a run-time warning; when we discover the situation we skip the summary and the analysis continues.
Option 2: Give a run-time error and stop the analysis; the user must provide unambiguous summaries.
I prefer Option 1.

How widespread is CppCheck’s format? Should we support that?

I think it is a CppCheck internal format. So, I’d rather not support a file format that is not controlled in LLVM/Clang. I think we should support only one file format which is easily extensible.

While I’d really like to have summaries I think we should lower our expectations a bit here. The current constraint solver of the analyzer might not be powerful enough to handle non-trivial summaries of this form. I do believe it can provide some value but often those papers describing these summaries are using a much more powerful SMT solver as a backend.

I agree that the range-based solver is not powerful enough. We do have the more powerful Z3 there, but unfortunately it is slow. If we have a summary with preconditions, then we must check in the caller whether the pre is satisfied before applying the summary, and this would require the use of Z3. There are techniques to cache and/or shrink the problem space for the solver, but I agree this might not scale. Another option could be to apply the summary without checking the pre, and only once we find a bug check the path feasibility with Z3 (similarly to what we do today with inlining).

The other concern is how to model checker specific pre- and post-conditions. Our checker API is not written in an SMT solver friendly way, so it would be fairly hard to synthesize those summaries for our checkers.

I’d like to avoid checker specific pre/post conditions. I hope that the summaries could consist of local path constraints of the function and once the summary is applied then the checker could populate its own GDM based on that.

Also, I do believe the analyzer could profit a lot from some classical dataflow analysis like mod/ref analysis. If we would annotate the arguments with the results of such an analysis it could help the analyzer to invalidate the right amount of state when evaluating a function call without a body.

Yes, as Artem mentioned, we could get some of these annotations from the IR, e.g. by utilizing the Attributor framework.

I think the previous plans with automatically generated summaries would already render the analyzer into a two-phase analysis.

Not necessarily (I hope). The methods described in Godefroid’s work generate the summaries on demand, when the symbolic execution first sees a function.

In case we are interested in infer style summaries I think we should first make our memory modeling based on separation logic. I have no idea how far we are from that at this point and how much work would it be. Also, I believe reasoning about such summaries requires specialized solvers that are much more powerful than the current range based one.

Yes, indeed. That would mean developing a whole new/different kind of reasoning for summary generation and that might be too much to deal with. On the other hand, I’ve seen some very valuable thoughts in the paper about incorrectness logic, perhaps we could reuse some of those even in the current form of the CSA symbolic execution.

Gabor

Hi Artem,

Thanks for your feedback.

When doing a file format for summaries, please try to somehow coordinate
with the current work on taint checker yaml files. I don’t want multiple
incompatible formats to be circulating around :slight_smile:

Sure, I am in sync with Balazs :slight_smile:

One of the fresher thoughts that i’ve had recently on automatically
generated summaries was re-using The Attributor
(https://llvm.org/devmtg/2019-10/talk-abstracts.html#tech24) to produce
function summaries that the static analyzer can import and rely upon.

Yes, absolutely. I consider this approach very valuable. I think we should continue the work towards this direction after we are done with the extension of StdLibraryFunctionsChecker and before starting to experiment with Godefroid’s methods.

If you go for automatically generated summaries, please keep in mind the
following potential problems:

(a) We do not want checker API to become significantly more complicated
than it currently is. It is very likely that generating checker-specific
summaries of a function would require very non-trivial checker-side
support that’ll be too hard for a typical checker developer to handle.
Please talk to me once you get there; i’ll try to throw in a few
examples to see if your approach is lightweight and general enough.

I’d like to avoid checker specific pre/post conditions. I hope that the summaries could consist of local path constraints of the function and once the summary is applied then the checker could populate its own GDM based on that.

(b) Please understand the problem of “inlined defensive checks” that we
have with inlining-based IPA. Basically, when modeling a function call,
you need to have a good intuition about whether multiple branches within
the function imply multiple branches in the caller context. Like, it’s
fine to warn that passing a null pointer into foo() causes a null
dereference, but it’s not fine to emit null dereference warnings upon
dereferencing a pointer simply because that pointer was previously
passed into function bar() that checked it for null but has long exited
since then. Most of the time the ideal solution is to have exactly one
branch in the summary. Functions like isalpha are a rare exception.
Most of the time the state split is not justified because there’s no
indication on the caller side that a certain callee branch is
necessarily taken, so we should not keep these assumptions in the state
after the function has exited, so they shouldn’t be part of the summary.

Ok, this seems to be an interesting problem. I think if we have a summary for bar(), then once we apply it, it becomes part of the path constraint, and that constraint should be checked before we apply the summary for foo(). This implies that we should have a proper precondition for foo() as well, and it is not clear to me how we could deduce that precondition automatically. Also, perhaps we could apply the summary without checking the pre, and only once we find a bug check the path feasibility with Z3 (similarly to what we do today with inlining).

In other news - did you ever consider using ASTImporter for generating
function body ASTs from hand-written code? Like BodyFarms but without
constructing the AST by hand. I imagine that the advantage here is that
the solution is more reliable than ASTImporter-based CTU because you can
easily make sure that a certain hand-written code under your control is
always imported correctly (as opposed to arbitrary code written by the
users). This is probably not that hard to implement and that’d speed up
our progress on supporting functions that generally benefit from body
farms, such as atomic intrinsics. I.e.:

const Stmt *CompareAndSwapBody = ASTImporter.import("""
  bool compare_and_swap(int *p, int old, int new) {
    if (*p != old)
      return false;
    *p = new;
    return true;
  }
""");

?

Ok, this could be another approach, but it is very similar to the work that has been done by Gabor (xazax) in ModelInjector and by this patch which builds on that. However, I see a few concerns: (1) This requires that summaries are syntactically valid C/C++ code, thus we must use attributes or [[attributes]]. That means we would have to extend the generic Attr.td very often with analyzer-exclusive data, and I think we’d have a hard time getting those changes accepted there. Also, I am not sure we could easily express everything we want with these attributes, e.g. what about relations between two arguments? (2) In the past some developers were strictly against this approach. (3) A DeclContext can be set up to have an external source, and later this source can be queried to extend the list of the contained Decls. This mechanism is used in LLDB when we parse an expression; there we import everything into the context of the expression. But in the approach you mention, the direction of the import would be reversed, and I see some technical difficulties in that. I am not sure ASTImporter and DeclContext are prepared at the moment to realize this unless the summary code is self-contained (meaning it can be parsed on its own).
So, I’d rather strive for the Attributor/IR-based approach after we have pimped up StdLibraryFunctionsChecker (and after we have come up with a nice YAML format or something).

Thanks,
Gabor

Hi!

My biggest concern with APINotes is that it modifies the AST with Sema and then the whole compilation chain - including CodeGen - works on the modified AST. With summaries we’d like to describe complex information (e.g. relations between parameters and the return value) that is important only for static analysis and that may be totally irrelevant for CodeGen or other parts of the compiler. Another concern is that APINotes does not support C++.

During analysis we don’t do CodeGen so I do not really see the first part as a concern. We do not load those notes during compilation and we do not generate code when we load the notes. The lack of C++ support is definitely a problem. The question is, is it better to improve APINotes to fit our needs or to do something similar from scratch?

In the short term, summaries for stable libs like libc, libc++ and POSIX can be hard-coded in the source of StdLibraryFunctionsChecker. In the long term, I think we should have a specified file format (e.g. YAML) in which we can describe these summaries. Summaries for stable libs (libc, libc++, POSIX) should be shipped with Clang. For extensibility (for library authors), we will probably need an option that specifies a directory from which we could read additional summary files, similarly to what we already have with ModelInjector’s -model-path config option.

Note that the current -model-path option is very limited. But it is a good question what the best semantics for such an option would be. My initial thought was to have something that behaves very similarly to ‘-I’ flags for headers.

I agree that the range-based solver is not powerful enough. We do have the more powerful Z3 there, but unfortunately it is slow. If we have a summary with preconditions, then we must check in the caller whether the pre is satisfied before applying the summary, and this would require the use of Z3. There are techniques to cache and/or shrink the problem space for the solver, but I agree this might not scale. Another option could be to apply the summary without checking the pre, and only once we find a bug check the path feasibility with Z3 (similarly to what we do today with inlining).

If we want to pursue Z3 for summaries, we will first need to fix some of the bugs we currently have. Unfortunately, those are not trivial. So I think if this is our goal, the first step is to make the Z3 support solid.

I’d like to avoid checker specific pre/post conditions. I hope that the summaries could consist of local path constraints of the function and once the summary is applied then the checker could populate its own GDM based on that.

I do see some problems with this. For example, we do have modeling checks that can guide the analyzer by adding/removing constraints. Leaving those checks out of summary generation can have its costs in terms of precision. Also, if we only apply constraints, checkers will miss crucial events for populating the GDM, such as the events they subscribed to, like checkPostStatement, checkRegionChanges, and so on. If I understand your proposal correctly, from the checker’s point of view we would evaluate the call conservatively, and we would only use the pre/post states to help prune more infeasible paths. In case we can avoid checkers being overly confident (see the inlined defensive checks problem) this might work, but we will see.

Cheers,
Gabor

During analysis we don’t do CodeGen so I do not really see the first part as a concern. We do not load those notes during compilation and we do not generate code when we load the notes. The lack of C++ support is definitely a problem. The question is, is it better to improve APINotes to fit our needs or to do something similar from scratch?

APINotes is a generic framework for attributing any functions. The metadata read from the .apinotes files is stored in the AST and can be queried later as Decl->getAttr<...>(). This mechanism could be very useful for the whole community, e.g. for optimization and for other tools. However, with analyzer-specific function summaries we don’t want to take on the burden of creating such a generic framework that is suitable for the whole community. The summary-based experiments are way too immature now, and we don’t want to shift our focus to creating such a generic framework. Integrating APINotes into mainstream Clang would require coordination and consensus within the whole Clang community. That would require huge efforts from all of us, and we would quickly lose focus on the original goal, which is to experiment with summaries within the analyzer. Thus, we’d like to keep the summary description format within the confines of the analyzer.
Instead of APINotes, we propose to factor out the YAML parsing part from GenericTaintChecker into a modeling checker that would populate the GDM with the summary metadata. That data could then be used in any checker, e.g. by the Taint or the StdLibraryFunctions checker.

APINotes is a generic framework for attributing any functions. The metadata read from the .apinotes files is stored in the AST and can be queried later as Decl->getAttr<...>(). This mechanism could be very useful for the whole community, e.g. for optimization and for other tools. However, with analyzer-specific function summaries we don’t want to take on the burden of creating such a generic framework that is suitable for the whole community. The summary-based experiments are way too immature now, and we don’t want to shift our focus to creating such a generic framework.

For prototyping purposes anything that works is OK. The question is what format we should choose once we know the direction we want to take. I think we should revisit the question of using APINotes once we see what info is stored in the summaries, after the prototype is considered successful.

Integrating APINotes into mainstream Clang would require coordination and consensus within the whole Clang community.

There were earlier discussions about upstreaming APINotes. As far as I remember, the consensus was that the community would love to see it upstream if C++ support is implemented. Whatever format we choose, we do need to implement C++ support (e.g. supporting overloaded functions). I would expect the community to be on board if we choose to go with APINotes. So I think the question is not whether we would have consensus; it is more like whether it is less or more code to write/maintain.

That would require huge efforts from all of us, and we would quickly lose focus on the original goal, which is to experiment with summaries within the analyzer.

To experiment, any format is OK. The reason why I am pushing for APINotes is that it has been in production for years, and it might be less work to adapt it to our needs than to do everything from scratch. For a prototype we do not need to solve certain questions like what the distribution model is, what the most reasonable way is for the driver to handle the related flags, or how to have easy-to-understand diagnostics for all the cases like missing files, etc. Moreover, since Objective-C is one of the languages supported by the analyzer, an upstream solution should support it as well. APINotes already has this support; a custom solution would need it to be designed from scratch. So I think at this point we cannot know whether using APINotes would be a huge effort or would actually save some effort.

Thus, we’d like to keep the summary description format within the confines of the analyzer.

Instead of APINotes, we propose to factor out the YAML parsing part from GenericTaintChecker into a modeling checker that would populate the GDM with the summary metadata. That data could then be used in any checker, e.g. by the Taint or the StdLibraryFunctions checker.

There is some documentation about the APINotes format here: https://github.com/apple/llvm-project/blob/apple/master/clang/docs/APINotes.rst
I see no reason why it could not be extended for the needs of the analyzer. It actually already contains some info useful for analysis; for example, you can use it to annotate the nullability of each parameter. It is a trivial step to add range information, etc.

Again, once we have a summary format that covers our needs we will know more, but for now I do not see why this format (with potential extensions) would not fit our needs.

Hi,

If we want to pursue Z3 for summaries, we will first need to fix some of the bugs we currently have. Unfortunately, those are not trivial. So I think if this is our goal, the first step is to make the Z3 support solid.

The biggest issue when encoding the program in SMT is the lack of extension/truncation in the constraints coming from the analyzer. The analyzer drops all casts, and the backend has to apply them before encoding, including things like integer promotion; there is even some code duplicated from Sema in smt_conv.h.

If you are really interested in using the Z3 backend, my suggestion is to first support extension/truncation in the analyzer. I might be able to provide some help there as well.