[analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Hello everyone,

I'd like to share some results of my investigation targeted on improvement of Clang Static Analyzer checker API. You can find some previous conversation on this topic here: http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. In my investigation, I tried to solve a particular problem of building a checker without generating new nodes.

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName("::chdir")))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName("::chroot")))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName("::chdir"))),
hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind("bug_node")),
&Callback);
Finder.match(G);

and I have managed to make some simple working examples.

The entire diff can be found here: https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
The code itself: https://github.com/a-sid/clang/tree/a4

There are several reasons why I have tried this approach.

1. AST Matchers are already extensively used API for AST checking. They are available both in Clang-Tidy and CSA. And I wanted to use existing functionality as much as possible. So, I decided to extend an existing API to make its usage seamless across different kinds of checks: path-sensitive, AST-based and CFG-based.

2. AST matchers effectively help clients to avoid a lot of checking of dyn_cast results. This feature not only makes them more convenient but also more safe because, in my experience, forgetting a nullptr/None check is a pretty common mistake for checker writers.

3. Testing of AST matchers don't require writing C++ code - it can be done interactively with clang-query tool. And I believe that we need similar functionality for CSA as well.

I didn't want to replace existing checker API. Instead, I tried to make new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

1. Allow matching nodes of path-sensitive graph like usual AST nodes. To make this possible, three actions were needed:
1.1. ASTTypeTraits and DynTypedNode were extended to support path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. The implementation with graph node support is moved into a separate class (ASTGraphTypeTraits) in ento namespace to resolve cyclic dependencies (they are still exist, unfortunately, but are header-only, so we can build the PoC).
1.2. Some additions to AST matchers were made to add support for new kinds of nodes.
1.3. To make MatchFinder able to query specific options not supported by pure AST, it was augmented with "Contexts". A matcher that needs to query the path-sensitive engine asks the Finder for the required Context which provides specific helper functions.

As the result of this step, we are able to write matchers like expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).

2. Create an engine for graph exploration and matching.
Unlike normal AST matchers, hasSequence is a path-sensitive matcher. It is launched against ExplodedGraph. These matchers are handled by GraphMatchFinder object. While searching a graph, it collects matches. Each match contains a pointer to the corresponding matcher and State ID of this match. The way how matches are translated from one state ID to another is determined by matcher operators.

For example, SequenceVariadicOperator, which is the base of hasSequence() matcher, has "positive" and "negative" sub-matches. Each positive matcher has its corresponding State ID. In order to advance to the next State ID, a node being matched should match all "negative" matchers before the next "positive" matchers and the next "positive" matcher itself. Failure in matching "negative" matcher discards the match.

The role of GraphMatchFinder is similar to MatchFinder: it is only responsible for graph exploration and keeping bound nodes and matchers.

3. Manage bound nodes.
While matching a single graph node, BoundNodes from AST MatchFinder are used. AST matchers for path-sensitive nodes support bindings out-of-the-box. However, the situation with graph matching is a bit different. For graph matching, we have another system of bound nodes. Each graph node has a related map of bounds aka GDMTy (yes, the name is not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap which, in part, is effectively a map from std::string to DynTypedNodes. This look pretty much like how GDM is organized in ExplodedGraph, just without one level of indirection (it can be added, though).

MatchFinder contexts should allow support of their own bindings. This is how equalsBoundNode() matcher works for path-sensitive nodes: it just queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way: ChrootChecker (which is present in the introduction) and TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal checkers: produce warnings and use visitors. The main difference is that they cannot generate sinks and use checkEndAnalysis callback. The code of these checkers can be found here:

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the dynamic introspection of GDM which was pretty hard before. (Hello alpha renaming?)
2. Nothing prevents matchers to be used with existing checker API for simplifying conditional checking inside callbacks. The matchers are not planned as the replacement for the current API, but look like a nice complementary part.
3. Implicit conversion of Matcher<ProgramPoint> to Matcher<ExplodedNode> and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some features are not implemented yet, but I believe they can be added.
1. Usage of matchers inside checker callbacks
This is not exactly unimplemented, but is still untested.
2. Dynamic matchers (clang-query interface)
In addition to work for supporting dynamic nodes (I don't know how many efforts it can take), we need to enable matching against AST nodes, not graph. But it doesn't look as a problem, we can just make matchers polymorphic.
3. Binding to successfully matched paths is not implemented yet - only bindings to separate nodes. I wonder if we need path bindings at all.
4. Some corner cases are still FIXMEs: single-node sequences, for example.
5. GDM is not based on immutable data structures like it is done in CSA - it is just an STL map. Its performance can be poor (full copy on every new node), but I don't think that changing it to use immutable structures is hard.
6. Matching on-the-fly
GraphMatchFinder should support on-the-fly matching during ExplodedGraph building. For this support, we should just call its advance() method each time a new node is created. However, this possibility wasn't checked yet.
7. Matching CFG and CallGraph isn't implemented because it was considered to be far out of simple PoC.
8. Only sequential matching is available now, and I didn't try to implement other operators yet. So, implementing a checker similar to PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
From matchers' side:
1. Some tests don't pass because they rely on the checker ability to generate sink nodes. Our matchers cannot do it by design so tests don't pass.
2. Representing checker bindings as a map can be less effective than storing data in structures. I didn't measure the overhead, so I cannot give any numbers.
3. Matchers are called on every node independently of its type. This is not what CheckerManager does. I wonder if this detail can affect performance as well.
4. Problems with cyclic dependencies. clangStaticAnalyzer has a dependency on clangASTMatchers, therefore, clangASTMatchers cannot depend on clangStaticAnalyzer. However, if we want ASTMatchers to support static analyzer data structures, there should be a backward dependency. Now this dependency is header-only.
5. Code duplication. This is mostly fine for a sandbox but needs to be reduced.

From analyzer's side:
1. Many events are not reflected in the final ExplodedGraph. For example, checkers can receive PointerEscape event, but the event itself will not be recorded in the ExplodedGraph - only changes made by checkers will. That's also true for Stmt nodes: I noticed same issues with PostCondition. This makes matching a bit harder. Should we add some information into ExplodedGraph?
2. (Minor) Some static analyzer data structures don't support traditional LLVM casting methods (dyn_cast, isa) because they lack classof() method. I have fixed this internally and will put a patch on review soon.

-------- Conclusion --------
Finally, there is a graph matching engine supporting basic functionality and two checkers using it. I apologize for not starting the discussion earlier, but I just wasn't sure that the approach will work. Is anybody interested to have this stuff in upstream (maybe, changed or improved)? If yes, I'll be happy to contribute this work patch-by-patch and continue its improvement. If no - well, I had enough fun playing with it.

Hi Alexey,

In general, I like the idea of having a more declarative way to define new
checks.

Hello everyone,

I’d like to share some results of my investigation targeted on
improvement of Clang Static Analyzer checker API. You can find some
previous conversation on this topic here:
http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html.
In my investigation, I tried to solve a particular problem of building a
checker without generating new nodes.

Why do you focus on such checks that does not have traits, does not generate new nodes.

Do you find this to be the majority of the checks you need to implement?

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look
like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName(“::chdir”)))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName(“::chroot”)))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName(“::chdir”))),
hasArgument(0, hasValue(stringRegion(refersString(“/”)))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind(“bug_node”)),
&Callback);
Finder.match(G);

I think, a common (design) pitfall of writing checks is to try to match against
the AST when a symbolic execution callback should be used instead.
I am a bit afraid having an API like this would make this pitfall more common.
Maybe a separation between the path sensitive, flow sensitive and
AST based checks is good for the checker writers and new users and
I am not sure that the same abstraction fits all of these cases.

In case of path sensitive checks I wonder if a state machine like abstraction would
fit the use cases better (and also cover checks that are using traits)
where the guarding conditions of the state transitions can include AST matcher like checks.

and I have managed to make some simple working examples.

The entire diff can be found here:
https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
The code itself: https://github.com/a-sid/clang/tree/a4

There are several reasons why I have tried this approach.

  1. AST Matchers are already extensively used API for AST checking. They
    are available both in Clang-Tidy and CSA. And I wanted to use existing
    functionality as much as possible. So, I decided to extend an existing
    API to make its usage seamless across different kinds of checks:
    path-sensitive, AST-based and CFG-based.

I am not sure if this is a good idea. I think the checker author supposed to
be aware if the check is AST based, flow sensitive, or path sensitive.

And this should also be easy to tell by reading the code. I am also not
sure whether there is an abstraction that fits all.

I think the reason why this idea works well for checks that only inspect the
exploded graph, because in this case we are still doing pattern matching on
graphs in a similar manner as doing pattern matching on the AST.

But does this generalize later on to stateful checks?

  1. AST matchers effectively help clients to avoid a lot of checking of
    dyn_cast results. This feature not only makes them more convenient but
    also more safe because, in my experience, forgetting a nullptr/None
    check is a pretty common mistake for checker writers.

I think if something cannot be null we should return references, otherwise
the client need to check for the pointer beeing null (unless there are some
dynamic invariant that would make the check redundant). If an API does
not match this philosophy, maybe we should fix the API first.

Regardless of fixing the API, I agree that it would be great to have higher
level APIs for checker authors.

  1. Testing of AST matchers don’t require writing C++ code - it can be
    done interactively with clang-query tool. And I believe that we need
    similar functionality for CSA as well.

Having a query tool for exploded graphs could be very helpful, I agree.
These graphs tend to be very large and sometimes it is not trivial to
find the relevant parts during debugging.

I didn’t want to replace existing checker API. Instead, I tried to make
new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

  1. Allow matching nodes of path-sensitive graph like usual AST nodes. To
    make this possible, three actions were needed:
    1.1. ASTTypeTraits and DynTypedNode were extended to support
    path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc.

I wonder, if in general it is a good idea to pattern match in checks on SymExpr.
A more general approach here would be to use the constraint solver for queries
in a similar manner how ProgramState::assume works.

The implementation with graph node support is moved into a separate
class (ASTGraphTypeTraits) in ento namespace to resolve cyclic
dependencies (they are still exist, unfortunately, but are header-only,
so we can build the PoC).
1.2. Some additions to AST matchers were made to add support for new
kinds of nodes.
1.3. To make MatchFinder able to query specific options not supported
by pure AST, it was augmented with “Contexts”. A matcher that needs to
query the path-sensitive engine asks the Finder for the required Context
which provides specific helper functions.

As the result of this step, we are able to write matchers like
expr(hasArgument(0, hasValue(stringRegion(refersString(“/”))))).

I think this is the part of the proposal that I like the most, it would be
a very concise way to write down guarding conditions.

  1. Create an engine for graph exploration and matching.
    Unlike normal AST matchers, hasSequence is a path-sensitive matcher.
    It is launched against ExplodedGraph. These matchers are handled by
    GraphMatchFinder object. While searching a graph, it collects matches.
    Each match contains a pointer to the corresponding matcher and State ID
    of this match. The way how matches are translated from one state ID to
    another is determined by matcher operators.

Is this matching done after the exploded graph already built? If so,
these checks will be unable to generate sink nodes. I think having sink
nodes sometimes desirable even if the check itself does not have a trait.

For example, SequenceVariadicOperator, which is the base of
hasSequence() matcher, has “positive” and “negative” sub-matches. Each
positive matcher has its corresponding State ID. In order to advance to
the next State ID, a node being matched should match all “negative”
matchers before the next “positive” matchers and the next “positive”
matcher itself. Failure in matching “negative” matcher discards the match.

The role of GraphMatchFinder is similar to MatchFinder: it is only
responsible for graph exploration and keeping bound nodes and matchers.

  1. Manage bound nodes.
    While matching a single graph node, BoundNodes from AST MatchFinder
    are used. AST matchers for path-sensitive nodes support bindings
    out-of-the-box. However, the situation with graph matching is a bit
    different. For graph matching, we have another system of bound nodes.
    Each graph node has a related map of bounds aka GDMTy (yes, the name is
    not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap
    which, in part, is effectively a map from std::string to DynTypedNodes.
    This look pretty much like how GDM is organized in ExplodedGraph, just
    without one level of indirection (it can be added, though).

MatchFinder contexts should allow support of their own bindings. This
is how equalsBoundNode() matcher works for path-sensitive nodes: it just
queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way:
ChrootChecker (which is present in the introduction) and
TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp
and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal
checkers: produce warnings and use visitors. The main difference is that
they cannot generate sinks and use checkEndAnalysis callback. The code
of these checkers can be found here:

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the
dynamic introspection of GDM which was pretty hard before. (Hello alpha
renaming?)
2. Nothing prevents matchers to be used with existing checker API for
simplifying conditional checking inside callbacks. The matchers are not
planned as the replacement for the current API, but look like a nice
complementary part.
3. Implicit conversion of Matcher to Matcher
and Matcher<SymExpr || MemRegion> to Matcher for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some
features are not implemented yet, but I believe they can be added.

  1. Usage of matchers inside checker callbacks
    This is not exactly unimplemented, but is still untested.
  2. Dynamic matchers (clang-query interface)
    In addition to work for supporting dynamic nodes (I don’t know how
    many efforts it can take), we need to enable matching against AST nodes,
    not graph. But it doesn’t look as a problem, we can just make matchers
    polymorphic.
  3. Binding to successfully matched paths is not implemented yet - only
    bindings to separate nodes. I wonder if we need path bindings at all.
  4. Some corner cases are still FIXMEs: single-node sequences, for example.
  5. GDM is not based on immutable data structures like it is done in CSA
  • it is just an STL map. Its performance can be poor (full copy on every
    new node), but I don’t think that changing it to use immutable
    structures is hard.
  1. Matching on-the-fly
    GraphMatchFinder should support on-the-fly matching during
    ExplodedGraph building. For this support, we should just call its
    advance() method each time a new node is created. However, this
    possibility wasn’t checked yet.
  2. Matching CFG and CallGraph isn’t implemented because it was
    considered to be far out of simple PoC.
  3. Only sequential matching is available now, and I didn’t try to
    implement other operators yet. So, implementing a checker similar to
    PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
From matchers’ side:

  1. Some tests don’t pass because they rely on the checker ability to
    generate sink nodes. Our matchers cannot do it by design so tests don’t
    pass.
  2. Representing checker bindings as a map can be less effective than
    storing data in structures. I didn’t measure the overhead, so I cannot
    give any numbers.
  3. Matchers are called on every node independently of its type. This is
    not what CheckerManager does. I wonder if this detail can affect
    performance as well.
  4. Problems with cyclic dependencies. clangStaticAnalyzer has a
    dependency on clangASTMatchers, therefore, clangASTMatchers cannot
    depend on clangStaticAnalyzer. However, if we want ASTMatchers to
    support static analyzer data structures, there should be a backward
    dependency. Now this dependency is header-only.
  5. Code duplication. This is mostly fine for a sandbox but needs to be
    reduced.

From analyzer’s side:

  1. Many events are not reflected in the final ExplodedGraph. For
    example, checkers can receive PointerEscape event, but the event itself
    will not be recorded in the ExplodedGraph - only changes made by
    checkers will. That’s also true for Stmt nodes: I noticed same issues
    with PostCondition. This makes matching a bit harder. Should we add some
    information into ExplodedGraph?
  2. (Minor) Some static analyzer data structures don’t support
    traditional LLVM casting methods (dyn_cast, isa) because they lack
    classof() method. I have fixed this internally and will put a patch on
    review soon.

-------- Conclusion --------
Finally, there is a graph matching engine supporting basic functionality
and two checkers using it. I apologize for not starting the discussion
earlier, but I just wasn’t sure that the approach will work. Is anybody
interested to have this stuff in upstream (maybe, changed or improved)?
If yes, I’ll be happy to contribute this work patch-by-patch and
continue its improvement. If no - well, I had enough fun playing with it.

Thanks for sharing this results. Regardless of being upstreamed as is or
in a separate form or not at all, this is an interesting experiment for the
community.

Allowing checkers to generate new nodes and sinks leads to coverage issues: the coverage changes if we disable or enable checkers. The ability to modify state (Environment and Store) also makes us think about how checkers will interact with each other. Regarding traits - path matchers are not completely trait-less. They just use other GDM and can only store DynTypedNodes (extended by path-sensitive nodes). The bindings that matchers have added while matching separate graph nodes are propagated across ExplodedGraph. Actually, path bindings act very similar to checker traits, and they offer a subset of GDM abilities. In my experience, nearly 2/3 path-sensitive checkers we wrote follow state machine abstraction. But it is possible to implement other matcher operator that will use another matching strategy (grammar-like, for example), and declare a path matcher based on it. Yes, there is never ideal solution. But we can just use descriptive matcher names to notify users that they are going to use path-sensitive analysis. Even in this small PoC, path matchers are placed into a separate header and into a separate namespace (and are even part of clangStaticAnalyzer, not clangASTMatchers), so one should really understand what he does to make path matchers work. Also, we can often meet some hand-written AST matching inside checker callbacks. We can just make it more consise. I’m not sure I understood you correctly. If you mean that stateful checkers are checkers that need to modify state, then I think we need to put state modification out of checkers if possible to get reproducible coverage independently on checkers enabled or disabled. It is not only possible, but is already implemented - partially, of course. You can take a look at canBeZero() matcher in TestAfterDivZeroV2. In the implemented checkers - yes, the matching is done on the ExplodedGraph already built. But nothing prevents GraphMatchFinder to be used as a part of CheckerManager or ExprEngine. It has a method called advance() that performs a single transition to the new node not processed before, so it can be fed with single nodes that engine creates during graph building. The matching on-the-fly is possible, just was not implemented.

Hi Alexey,

In general, I like the idea of having a more declarative way to define new
checks.

Hello everyone,

I’d like to share some results of my investigation targeted on
improvement of Clang Static Analyzer checker API. You can find some
previous conversation on this topic here:
http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html.
In my investigation, I tried to solve a particular problem of building a
checker without generating new nodes.

Why do you focus on such checks that does not have traits, does not generate new nodes.

Do you find this to be the majority of the checks you need to implement?

Ok, so as far as i understand, these new checkers don’t support traits because they don’t need them. Instead, they plan to use “backreferences” in matchers, i.e. double file descriptor close can be described as:

hasSequence(
call(hasName(“fclose”), hasArg(0, sym().bind(“fd”))),
call(hasName(“fclose”), hasArg(0, sym(equalsBoundNode(“fd”))))
);

Right?

So we can’t eliminate the path on which the pointer is null after we dereference it, like we normally do, we can’t introduce state splits, we can’t produce fatal error nodes, but we can still use finite state machines.

The real question is, how do we implement RetainCountChecker that needs to count the potentially infinite number of retains and releases before emitting the warning, and is therefore not a finite state machine?^^

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look
like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName(“::chdir”)))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName(“::chroot”)))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName(“::chdir”))),
hasArgument(0, hasValue(stringRegion(refersString(“/”)))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind(“bug_node”)),
&Callback);
Finder.match(G);

Aha, so am i understanding correctly that unless() here is not your normal unless()? The normal unless() clause would have meant

“somewhere between chroot() and the offending call there’s a node that’s not chdir()”

but what we’re trying to say is

“nowhere between chroot() and the offending call there’s a node that’s chdir()”

?

I think, a common (design) pitfall of writing checks is to try to match against
the AST when a symbolic execution callback should be used instead.
I am a bit afraid having an API like this would make this pitfall more common.
Maybe a separation between the path sensitive, flow sensitive and
AST based checks is good for the checker writers and new users and
I am not sure that the same abstraction fits all of these cases.

In case of path sensitive checks I wonder if a state machine like abstraction would
fit the use cases better (and also cover checks that are using traits)
where the guarding conditions of the state transitions can include AST matcher like checks.

and I have managed to make some simple working examples.

The entire diff can be found here:
https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
The code itself: https://github.com/a-sid/clang/tree/a4

There are several reasons why I have tried this approach.

  1. AST Matchers are already extensively used API for AST checking. They
    are available both in Clang-Tidy and CSA. And I wanted to use existing
    functionality as much as possible. So, I decided to extend an existing
    API to make its usage seamless across different kinds of checks:
    path-sensitive, AST-based and CFG-based.

When we say CFG-based checks, we usually mean “all paths” checks that find bugs indicated by a certain invariant holding on all paths through a certain program point(s). Such as TestAfterDivideZero or DeadStore.

The problem with CFG-based “all paths” checks that i don’t really know how to solve and that’s probably not going to be solved by giving them a good API is error reporting. Imagine a conversation with a user:

A n a l y z e r : Hey this assignment is not used.

U s e r : But hey, it’s used here points to the code.

A n a l y z e r : But this code is dead.

U s e r : Why so? Suppose ‘x’ is equal to true, we jump straight into this code.

A n a l y z e r : That’s because the initializer for ‘x’ is in fact always evaluated to false.

U s e r : WHY t___t

Etc. The first piece of information is produced by a DeadStore checker. The second piece of information is produced by some sort of DeadCode checker (currently alpha). The third piece of information is produced by some sort of SameResLogicalExpr checker (currently in the list of potential checkers). All three checkers are all-paths checkers. And unless all three checker’s diagnostics are presented to the user at the same time, the user will be unable to understand the bug report.

I believe that we should really solve this problem somehow (or explain to me why it isn’t a problem) before we try to introduce a massive increase in the number of CFG-based checkers we have (currently 1).

I am not sure if this is a good idea. I think the checker author supposed to
be aware if the check is AST based, flow sensitive, or path sensitive.

And this should also be easy to tell by reading the code. I am also not
sure whether there is an abstraction that fits all.

I think the reason why this idea works well for checks that only inspect the
exploded graph, because in this case we are still doing pattern matching on
graphs in a similar manner as doing pattern matching on the AST.

But does this generalize later on to stateful checks?

  1. AST matchers effectively help clients to avoid a lot of checking of
    dyn_cast results. This feature not only makes them more convenient but
    also more safe because, in my experience, forgetting a nullptr/None
    check is a pretty common mistake for checker writers.

I think if something cannot be null we should return references, otherwise
the client need to check for the pointer beeing null (unless there are some
dynamic invariant that would make the check redundant). If an API does
not match this philosophy, maybe we should fix the API first.

Regardless of fixing the API, I agree that it would be great to have higher
level APIs for checker authors.

  1. Testing of AST matchers don’t require writing C++ code - it can be
    done interactively with clang-query tool. And I believe that we need
    similar functionality for CSA as well.

Having a query tool for exploded graphs could be very helpful, I agree.
These graphs tend to be very large and sometimes it is not trivial to
find the relevant parts during debugging.

I didn’t want to replace existing checker API. Instead, I tried to make
new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

  1. Allow matching nodes of path-sensitive graph like usual AST nodes. To
    make this possible, three actions were needed:
    1.1. ASTTypeTraits and DynTypedNode were extended to support
    path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc.

I wonder, if in general it is a good idea to pattern match in checks on SymExpr.
A more general approach here would be to use the constraint solver for queries
in a similar manner how ProgramState::assume works.

The implementation with graph node support is moved into a separate
class (ASTGraphTypeTraits) in ento namespace to resolve cyclic
dependencies (they are still exist, unfortunately, but are header-only,
so we can build the PoC).
1.2. Some additions to AST matchers were made to add support for new
kinds of nodes.
1.3. To make MatchFinder able to query specific options not supported
by pure AST, it was augmented with “Contexts”. A matcher that needs to
query the path-sensitive engine asks the Finder for the required Context
which provides specific helper functions.

As the result of this step, we are able to write matchers like
expr(hasArgument(0, hasValue(stringRegion(refersString(“/”))))).

Yeah, we’ll also eventually need operators over symbols, eg.:

arraySubscriptExpr(hasSubscript(expr(hasSVal(sym(isLessThan(
sym(equalsBoundNode(“array_size”)))).bind(“array_subscript”)))))

Still, SVal matchers are something i always wanted. I even made SVal visitors. The latter turned out to be pretty useless though :confused:

I think this is the part of the proposal that I like the most, it would be
a very concise way to write down guarding conditions.

  1. Create an engine for graph exploration and matching.
    Unlike normal AST matchers, hasSequence is a path-sensitive matcher.
    It is launched against ExplodedGraph. These matchers are handled by
    GraphMatchFinder object. While searching a graph, it collects matches.
    Each match contains a pointer to the corresponding matcher and State ID
    of this match. The way how matches are translated from one state ID to
    another is determined by matcher operators.

Is this matching done after the exploded graph already built? If so,
these checks will be unable to generate sink nodes. I think having sink
nodes sometimes desirable even if the check itself does not have a trait.

I’m really curious about how the performance scales with the number of checkers. Is it going to be significantly more expensive to run 100 DSL-based checkers on a fully built graph than to run 100 normal checkers as the graph is being built?

This boils down to: by looking at the sub-matcher, will you be able to filter out checkers that don’t need to act upon the node as quickly as you do that when calling checker callbacks?

And this immediately leads to: how difficult would it be to simply auto-generate an old-style checker (with state traits and callbacks) from the new-style matcher? Maybe that’s something we should focus on instead of trying to match a fully constructed graph? Maybe it’d also be more powerful, i.e. allow certain imperative side effects whenever partial match is encountered?

For example, SequenceVariadicOperator, which is the base of
hasSequence() matcher, has “positive” and “negative” sub-matches. Each
positive matcher has its corresponding State ID. In order to advance to
the next State ID, a node being matched should match all “negative”
matchers before the next “positive” matchers and the next “positive”
matcher itself. Failure in matching “negative” matcher discards the match.

The role of GraphMatchFinder is similar to MatchFinder: it is only
responsible for graph exploration and keeping bound nodes and matchers.

  1. Manage bound nodes.
    While matching a single graph node, BoundNodes from AST MatchFinder
    are used. AST matchers for path-sensitive nodes support bindings
    out-of-the-box. However, the situation with graph matching is a bit
    different. For graph matching, we have another system of bound nodes.
    Each graph node has a related map of bounds aka GDMTy (yes, the name is
    not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap
    which, in part, is effectively a map from std::string to DynTypedNodes.
    This look pretty much like how GDM is organized in ExplodedGraph, just
    without one level of indirection (it can be added, though).

MatchFinder contexts should allow support of their own bindings. This
is how equalsBoundNode() matcher works for path-sensitive nodes: it just
queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way:
ChrootChecker (which is present in the introduction) and
TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp
and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal
checkers: produce warnings and use visitors.

I wonder if our false positive suppression visitors (or maybe even regular checker bugvisitors) can be rewritten with these matchers. trackNullOrUndefValue() and its system of visitors doesn’t look like a matcher at all to me, unfortunately - it can potentially track an infinite amount of events. But every individual visitor within that system probably is.

The main difference is that
they cannot generate sinks and use checkEndAnalysis callback. The code
of these checkers can be found here:

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the
dynamic introspection of GDM which was pretty hard before. (Hello alpha
renaming?)
2. Nothing prevents matchers to be used with existing checker API for
simplifying conditional checking inside callbacks. The matchers are not
planned as the replacement for the current API, but look like a nice
complementary part.
3. Implicit conversion of Matcher to Matcher
and Matcher<SymExpr || MemRegion> to Matcher for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some
features are not implemented yet, but I believe they can be added.

  1. Usage of matchers inside checker callbacks
    This is not exactly unimplemented, but is still untested.
  2. Dynamic matchers (clang-query interface)
    In addition to work for supporting dynamic nodes (I don’t know how
    many efforts it can take), we need to enable matching against AST nodes,
    not graph. But it doesn’t look as a problem, we can just make matchers
    polymorphic.
  3. Binding to successfully matched paths is not implemented yet - only
    bindings to separate nodes. I wonder if we need path bindings at all.
  4. Some corner cases are still FIXMEs: single-node sequences, for example.
  5. GDM is not based on immutable data structures like it is done in CSA
  • it is just an STL map. Its performance can be poor (full copy on every
    new node), but I don’t think that changing it to use immutable
    structures is hard.
  1. Matching on-the-fly
    GraphMatchFinder should support on-the-fly matching during
    ExplodedGraph building. For this support, we should just call its
    advance() method each time a new node is created. However, this
    possibility wasn’t checked yet.
  2. Matching CFG and CallGraph isn’t implemented because it was
    considered to be far out of simple PoC.
  3. Only sequential matching is available now, and I didn’t try to
    implement other operators yet. So, implementing a checker similar to
    PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
From matchers’ side:

  1. Some tests don’t pass because they rely on the checker ability to
    generate sink nodes. Our matchers cannot do it by design so tests don’t
    pass.
  2. Representing checker bindings as a map can be less effective than
    storing data in structures. I didn’t measure the overhead, so I cannot
    give any numbers.
  3. Matchers are called on every node independently of its type. This is
    not what CheckerManager does. I wonder if this detail can affect
    performance as well.
  4. Problems with cyclic dependencies. clangStaticAnalyzer has a
    dependency on clangASTMatchers, therefore, clangASTMatchers cannot
    depend on clangStaticAnalyzer. However, if we want ASTMatchers to
    support static analyzer data structures, there should be a backward
    dependency. Now this dependency is header-only.
  5. Code duplication. This is mostly fine for a sandbox but needs to be
    reduced.

From analyzer’s side:

  1. Many events are not reflected in the final ExplodedGraph. For
    example, checkers can receive PointerEscape event, but the event itself
    will not be recorded in the ExplodedGraph - only changes made by
    checkers will.

Weeeell maaaaaybe. This would essentially mean passing a CheckerContext into checkPointerEscape and checkRegionChanges. With a predecessor node having a new sort of ProgramPoint such as “PointerEscapePoint”. And we can put the list of escaped stuff directly into the program point instead of passing it separately.

Which means allowing state splits in these callbacks. Which of course can be disallowed later (to keep the code that calls the callbacks simple), and graph matchers won’t need them any time soon anyway.

But it’d still be a lot of refactoring because currently these callbacks are called from code that doesn’t even know anything about nodes, eg. from within State->invalidateRegions().

Hello everyone,

I'd like to share some results of my investigation targeted on improvement of Clang Static Analyzer checker API. You can find some previous conversation on this topic here: http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. In my investigation, I tried to solve a particular problem of building a checker without generating new nodes.

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName("::chdir")))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName("::chroot")))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName("::chdir"))),
hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind("bug_node")),
&Callback);
Finder.match(G);

and I have managed to make some simple working examples.

The entire diff can be found here: Path-sensitive matchers: proof-of-concept (MWE) · a-sid/clang@9a0b1d1 · GitHub
The code itself: GitHub - a-sid/clang at a4

There are several reasons why I have tried this approach.

1. AST Matchers are already extensively used API for AST checking. They are available both in Clang-Tidy and CSA. And I wanted to use existing functionality as much as possible. So, I decided to extend an existing API to make its usage seamless across different kinds of checks: path-sensitive, AST-based and CFG-based.

2. AST matchers effectively help clients to avoid a lot of checking of dyn_cast results. This feature not only makes them more convenient but also more safe because, in my experience, forgetting a nullptr/None check is a pretty common mistake for checker writers.

3. Testing of AST matchers don't require writing C++ code - it can be done interactively with clang-query tool. And I believe that we need similar functionality for CSA as well.

I didn't want to replace existing checker API. Instead, I tried to make new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

1. Allow matching nodes of path-sensitive graph like usual AST nodes. To make this possible, three actions were needed:
1.1. ASTTypeTraits and DynTypedNode were extended to support path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. The implementation with graph node support is moved into a separate class (ASTGraphTypeTraits) in ento namespace to resolve cyclic dependencies (they are still exist, unfortunately, but are header-only, so we can build the PoC).
1.2. Some additions to AST matchers were made to add support for new kinds of nodes.
1.3. To make MatchFinder able to query specific options not supported by pure AST, it was augmented with "Contexts". A matcher that needs to query the path-sensitive engine asks the Finder for the required Context which provides specific helper functions.

As the result of this step, we are able to write matchers like expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).

2. Create an engine for graph exploration and matching.
Unlike normal AST matchers, hasSequence is a path-sensitive matcher. It is launched against ExplodedGraph. These matchers are handled by GraphMatchFinder object. While searching a graph, it collects matches. Each match contains a pointer to the corresponding matcher and State ID of this match. The way how matches are translated from one state ID to another is determined by matcher operators.

For example, SequenceVariadicOperator, which is the base of hasSequence() matcher, has "positive" and "negative" sub-matches. Each positive matcher has its corresponding State ID. In order to advance to the next State ID, a node being matched should match all "negative" matchers before the next "positive" matchers and the next "positive" matcher itself. Failure in matching "negative" matcher discards the match.

The role of GraphMatchFinder is similar to MatchFinder: it is only responsible for graph exploration and keeping bound nodes and matchers.

3. Manage bound nodes.
While matching a single graph node, BoundNodes from AST MatchFinder are used. AST matchers for path-sensitive nodes support bindings out-of-the-box. However, the situation with graph matching is a bit different. For graph matching, we have another system of bound nodes. Each graph node has a related map of bounds aka GDMTy (yes, the name is not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap which, in part, is effectively a map from std::string to DynTypedNodes. This look pretty much like how GDM is organized in ExplodedGraph, just without one level of indirection (it can be added, though).

MatchFinder contexts should allow support of their own bindings. This is how equalsBoundNode() matcher works for path-sensitive nodes: it just queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way: ChrootChecker (which is present in the introduction) and TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal checkers: produce warnings and use visitors. The main difference is that they cannot generate sinks and use checkEndAnalysis callback. The code of these checkers can be found here:

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the dynamic introspection of GDM which was pretty hard before. (Hello alpha renaming?)
2. Nothing prevents matchers to be used with existing checker API for simplifying conditional checking inside callbacks. The matchers are not planned as the replacement for the current API, but look like a nice complementary part.
3. Implicit conversion of Matcher<ProgramPoint> to Matcher<ExplodedNode> and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some features are not implemented yet, but I believe they can be added.
1. Usage of matchers inside checker callbacks
This is not exactly unimplemented, but is still untested.
2. Dynamic matchers (clang-query interface)
In addition to work for supporting dynamic nodes (I don't know how many efforts it can take), we need to enable matching against AST nodes, not graph. But it doesn't look as a problem, we can just make matchers polymorphic.
3. Binding to successfully matched paths is not implemented yet - only bindings to separate nodes. I wonder if we need path bindings at all.
4. Some corner cases are still FIXMEs: single-node sequences, for example.
5. GDM is not based on immutable data structures like it is done in CSA - it is just an STL map. Its performance can be poor (full copy on every new node), but I don't think that changing it to use immutable structures is hard.
6. Matching on-the-fly
GraphMatchFinder should support on-the-fly matching during ExplodedGraph building. For this support, we should just call its advance() method each time a new node is created. However, this possibility wasn't checked yet.
7. Matching CFG and CallGraph isn't implemented because it was considered to be far out of simple PoC.
8. Only sequential matching is available now, and I didn't try to implement other operators yet. So, implementing a checker similar to PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
From matchers' side:
1. Some tests don't pass because they rely on the checker ability to generate sink nodes. Our matchers cannot do it by design so tests don't pass.
2. Representing checker bindings as a map can be less effective than storing data in structures. I didn't measure the overhead, so I cannot give any numbers.
3. Matchers are called on every node independently of its type. This is not what CheckerManager does. I wonder if this detail can affect performance as well.
4. Problems with cyclic dependencies. clangStaticAnalyzer has a dependency on clangASTMatchers, therefore, clangASTMatchers cannot depend on clangStaticAnalyzer. However, if we want ASTMatchers to support static analyzer data structures, there should be a backward dependency. Now this dependency is header-only.
5. Code duplication. This is mostly fine for a sandbox but needs to be reduced.

From analyzer's side:
1. Many events are not reflected in the final ExplodedGraph. For example, checkers can receive PointerEscape event

Hmm, what if we "invert" this event, i.e. make it opt-out rather than opt-in, i.e., pointers are escaped by the matcher engine automatically, but if the checker wants a certain event to not escape the pointer, it adds extra matching code, such as "the sequence may include 0 or more passes of the symbol to a certain set of functions in a certain manner".

Hi Alexey,

As classics say, "Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.”

Jokes aside, I think the concept and what you are doing is great,
and we could certainly benefit from declarative matchers.
However, I think the actual implementation and the set of matchers and predicates would require quite a bit of bikeshedding.

Are you familiar with similar works in this area?
E.g. Oracle has a Soufflé project doing a similar task: http://souffle-lang.org.
They have found achieving high performance very challenging, and IIRC they resort to generating C++ code from the declaratively described matcher.
Maybe we would have to do the same in tablegen spirit.

I’m sure there are more such works in the literature.

Hello everyone,

I’d like to share some results of my investigation targeted on improvement of Clang Static Analyzer checker API. You can find some previous conversation on this topic here: http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. In my investigation, I tried to solve a particular problem of building a checker without generating new nodes.

I agree that this is a worthy goal.

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName(“::chdir”)))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName(“::chroot”)))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName(“::chdir”))),
hasArgument(0, hasValue(stringRegion(refersString(“/”)))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind(“bug_node”)),
&Callback);
Finder.match(G);

Bikeshedding: I think it’s much better for the sequence matcher to run from start to end,
since that’s how we think about the execution of a program.

and I have managed to make some simple working examples.

The entire diff can be found here: https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
The code itself: https://github.com/a-sid/clang/tree/a4

There are several reasons why I have tried this approach.

  1. AST Matchers are already extensively used API for AST checking. They are available both in Clang-Tidy and CSA. And I wanted to use existing functionality as much as possible. So, I decided to extend an existing API to make its usage seamless across different kinds of checks: path-sensitive, AST-based and CFG-based.

  2. AST matchers effectively help clients to avoid a lot of checking of dyn_cast results. This feature not only makes them more convenient but also more safe because, in my experience, forgetting a nullptr/None check is a pretty common mistake for checker writers.

  3. Testing of AST matchers don’t require writing C++ code - it can be done interactively with clang-query tool. And I believe that we need similar functionality for CSA as well.

Moreover, new matchers could be written without modifying clang. I wonder if we should support some kind of a plugin infrastructure which support matchers
defined in a text file, e.g. something along the lines of:

clang -cc1 —analyze -analyzer-checker=core,mymatcher -analyzer-load-declarative-checker=mymatcher.txt

I didn’t want to replace existing checker API. Instead, I tried to make new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

  1. Allow matching nodes of path-sensitive graph like usual AST nodes. To make this possible, three actions were needed:
    1.1. ASTTypeTraits and DynTypedNode were extended to support path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. The implementation with graph node support is moved into a separate class (ASTGraphTypeTraits) in ento namespace to resolve cyclic dependencies (they are still exist, unfortunately, but are header-only, so we can build the PoC).
    1.2. Some additions to AST matchers were made to add support for new kinds of nodes.
    1.3. To make MatchFinder able to query specific options not supported by pure AST, it was augmented with “Contexts”. A matcher that needs to query the path-sensitive engine asks the Finder for the required Context which provides specific helper functions.

As the result of this step, we are able to write matchers like expr(hasArgument(0, hasValue(stringRegion(refersString(“/”))))).

  1. Create an engine for graph exploration and matching.
    Unlike normal AST matchers, hasSequence is a path-sensitive matcher. It is launched against ExplodedGraph. These matchers are handled by GraphMatchFinder object. While searching a graph, it collects matches. Each match contains a pointer to the corresponding matcher and State ID of this match. The way how matches are translated from one state ID to another is determined by matcher operators.

For example, SequenceVariadicOperator, which is the base of hasSequence() matcher, has “positive” and “negative” sub-matches. Each positive matcher has its corresponding State ID. In order to advance to the next State ID, a node being matched should match all “negative” matchers before the next “positive” matchers and the next “positive” matcher itself. Failure in matching “negative” matcher discards the match.

The role of GraphMatchFinder is similar to MatchFinder: it is only responsible for graph exploration and keeping bound nodes and matchers.

  1. Manage bound nodes.
    While matching a single graph node, BoundNodes from AST MatchFinder are used. AST matchers for path-sensitive nodes support bindings out-of-the-box. However, the situation with graph matching is a bit different. For graph matching, we have another system of bound nodes. Each graph node has a related map of bounds aka GDMTy (yes, the name is not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap which, in part, is effectively a map from std::string to DynTypedNodes. This look pretty much like how GDM is organized in ExplodedGraph, just without one level of indirection (it can be added, though).

MatchFinder contexts should allow support of their own bindings. This is how equalsBoundNode() matcher works for path-sensitive nodes: it just queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way: ChrootChecker (which is present in the introduction) and TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal checkers: produce warnings and use visitors. The main difference is that they cannot generate sinks and use checkEndAnalysis callback. The code of these checkers can be found here:

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the dynamic introspection of GDM which was pretty hard before. (Hello alpha renaming?)
2. Nothing prevents matchers to be used with existing checker API for simplifying conditional checking inside callbacks. The matchers are not planned as the replacement for the current API, but look like a nice complementary part.
3. Implicit conversion of Matcher to Matcher and Matcher<SymExpr || MemRegion> to Matcher for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some features are not implemented yet, but I believe they can be added.

  1. Usage of matchers inside checker callbacks
    This is not exactly unimplemented, but is still untested.
  2. Dynamic matchers (clang-query interface)
    In addition to work for supporting dynamic nodes (I don’t know how many efforts it can take), we need to enable matching against AST nodes, not graph. But it doesn’t look as a problem, we can just make matchers polymorphic.
  3. Binding to successfully matched paths is not implemented yet - only bindings to separate nodes. I wonder if we need path bindings at all.
  4. Some corner cases are still FIXMEs: single-node sequences, for example.
  5. GDM is not based on immutable data structures like it is done in CSA - it is just an STL map. Its performance can be poor (full copy on every new node), but I don’t think that changing it to use immutable structures is hard.
  6. Matching on-the-fly
    GraphMatchFinder should support on-the-fly matching during ExplodedGraph building. For this support, we should just call its advance() method each time a new node is created. However, this possibility wasn’t checked yet.
  7. Matching CFG and CallGraph isn’t implemented because it was considered to be far out of simple PoC.
  8. Only sequential matching is available now, and I didn’t try to implement other operators yet. So, implementing a checker similar to PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
From matchers’ side:

  1. Some tests don’t pass because they rely on the checker ability to generate sink nodes. Our matchers cannot do it by design so tests don’t pass.

Actually I would disagree with this one: I think it would be better to support that, since many errors are considered fatal.
(but that of course would require running the checkers on a stream of events, rather than on the already constructed graph)

  1. Representing checker bindings as a map can be less effective than storing data in structures. I didn’t measure the overhead, so I cannot give any numbers.
  2. Matchers are called on every node independently of its type. This is not what CheckerManager does. I wonder if this detail can affect performance as well.

I would guess it would affect it hugely, and getting the performance right would be the biggest challenge for declarative matchers.

Hi Alexey,

In general, I like the idea of having a more declarative way to define new
checks.

Hello everyone,

I’d like to share some results of my investigation targeted on
improvement of Clang Static Analyzer checker API. You can find some
previous conversation on this topic here:
http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html.
In my investigation, I tried to solve a particular problem of building a
checker without generating new nodes.

Why do you focus on such checks that does not have traits, does not generate new nodes.

Do you find this to be the majority of the checks you need to implement?

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look
like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName(“::chdir”)))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName(“::chroot”)))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName(“::chdir”))),
hasArgument(0, hasValue(stringRegion(refersString(“/”)))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind(“bug_node”)),
&Callback);
Finder.match(G);

I think, a common (design) pitfall of writing checks is to try to match against
the AST when a symbolic execution callback should be used instead.
I am a bit afraid having an API like this would make this pitfall more common.
Maybe a separation between the path sensitive, flow sensitive and
AST based checks is good for the checker writers and new users and
I am not sure that the same abstraction fits all of these cases.

In case of path sensitive checks I wonder if a state machine like abstraction would
fit the use cases better (and also cover checks that are using traits)
where the guarding conditions of the state transitions can include AST matcher like checks.

and I have managed to make some simple working examples.

The entire diff can be found here:
https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
The code itself: https://github.com/a-sid/clang/tree/a4

There are several reasons why I have tried this approach.

  1. AST Matchers are already extensively used API for AST checking. They
    are available both in Clang-Tidy and CSA. And I wanted to use existing
    functionality as much as possible. So, I decided to extend an existing
    API to make its usage seamless across different kinds of checks:
    path-sensitive, AST-based and CFG-based.

I am not sure if this is a good idea. I think the checker author supposed to
be aware if the check is AST based, flow sensitive, or path sensitive.

And this should also be easy to tell by reading the code. I am also not
sure whether there is an abstraction that fits all.

I think the reason why this idea works well for checks that only inspect the
exploded graph, because in this case we are still doing pattern matching on
graphs in a similar manner as doing pattern matching on the AST.

But does this generalize later on to stateful checks?

so e.g. no retain/release matchers.

Yes, exactly. As I told before, I don’t like state splitting and generating sinks in checker. However, it is not exactly true that we cannot do it with matchers - because there are not only matchers, but are also callbacks. If we perform a match on-the-fly (I didn’t implement this, unfortunately, but it doesn’t look hard), we can do almost anything in the callback after a match was found. I’m not sure that I understood you correctly, but if we want to use something different from sequential model, we can implement our custom path matchers - like it is done for AST matchers. It took some time to implement a working (but still poorly designed) example: . The idea was to generalize some most common search patterns into matchers. Counting can be implemented pretty easily because counter states are still integers. However, if our state cannot be represented as an integer, there can be some troubles. What I have also realized is that some way(s) of using different matchers together is needed. Reference counting is pretty useless on its own, it needs some actions to match if we are searching for a bug. Yes. I should probably create a normal matcher and don’t reuse unless() for this to make it less confusive. Looks like I have used wrong words to describe what I wanted to do with CFG. Initially, I wanted to support CFG search as well but quicky realized nobody just needs it; however, some rudiments are still remaining in the code. Instead, I meant exposing CFG analyses in a user-friendly way, like it is done in TestAfterDivZeroV2. I think it makes much more sense because CFG analyses like reachability or postdomination can be used for FP suppression in the checkers and in the analyzer engine. There is nothing impossible here - such expressions should be easy to write with augmented AST matchers used in this PoC. No benchmarks => no answer, sorry. My wild guess won’t work here :frowning: I think it’s a hard goal to achieve. 1. To separate matcher callbacks, we need for matchers to expose meta-information. Some single matchers also can participate in multiple callbacks (anyOf()). 2. To make matchers callable with a matcher which is a part of it, we need to re-design the path matchers. They should have a mapping between sub-matchers and actions required if a sub-matcher matches. We can use matchers inside checker callbacks if we need extended functionality. The auto-generation for matchers with side effects is non-trivial and needs supporting partial match actions. However, since we have BindableMatchers, we can also have ActingMatchers (unsupported dynamically) that have .call() method. Hm, I didn’t even think about it. Do you think it can give us some advantages for BugVisitors? Internally, I created a dummy node in CheckerManager methods and then launched checker callbacks with it. But this doesn’t look like a right way.

21.05.2018 21:53, Artem Dergachev пишет:

Hello everyone,

I'd like to share some results of my investigation targeted on improvement of Clang Static Analyzer checker API. You can find some previous conversation on this topic here: http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. In my investigation, I tried to solve a particular problem of building a checker without generating new nodes.

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look like this:

StatementMatcher NotChdir =
callExpr(unless(callee(functionDecl(hasName("::chdir")))));
Finder.addMatcher(
hasSequence(
postStmt(hasStatement(
callExpr(callee(functionDecl(hasName("::chroot")))))),
unless(stmtPoint(hasStatement(callExpr(
callee(functionDecl(hasName("::chdir"))),
hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir))))
.bind("bug_node")),
&Callback);
Finder.match(G);

and I have managed to make some simple working examples.

The entire diff can be found here: Path-sensitive matchers: proof-of-concept (MWE) · a-sid/clang@9a0b1d1 · GitHub
The code itself: GitHub - a-sid/clang at a4

There are several reasons why I have tried this approach.

1. AST Matchers are already extensively used API for AST checking. They are available both in Clang-Tidy and CSA. And I wanted to use existing functionality as much as possible. So, I decided to extend an existing API to make its usage seamless across different kinds of checks: path-sensitive, AST-based and CFG-based.

2. AST matchers effectively help clients to avoid a lot of checking of dyn_cast results. This feature not only makes them more convenient but also more safe because, in my experience, forgetting a nullptr/None check is a pretty common mistake for checker writers.

3. Testing of AST matchers don't require writing C++ code - it can be done interactively with clang-query tool. And I believe that we need similar functionality for CSA as well.

I didn't want to replace existing checker API. Instead, I tried to make new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

1. Allow matching nodes of path-sensitive graph like usual AST nodes. To make this possible, three actions were needed:
1.1. ASTTypeTraits and DynTypedNode were extended to support path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. The implementation with graph node support is moved into a separate class (ASTGraphTypeTraits) in ento namespace to resolve cyclic dependencies (they are still exist, unfortunately, but are header-only, so we can build the PoC).
1.2. Some additions to AST matchers were made to add support for new kinds of nodes.
1.3. To make MatchFinder able to query specific options not supported by pure AST, it was augmented with "Contexts". A matcher that needs to query the path-sensitive engine asks the Finder for the required Context which provides specific helper functions.

As the result of this step, we are able to write matchers like expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).

2. Create an engine for graph exploration and matching.
Unlike normal AST matchers, hasSequence is a path-sensitive matcher. It is launched against ExplodedGraph. These matchers are handled by GraphMatchFinder object. While searching a graph, it collects matches. Each match contains a pointer to the corresponding matcher and State ID of this match. The way how matches are translated from one state ID to another is determined by matcher operators.

For example, SequenceVariadicOperator, which is the base of hasSequence() matcher, has "positive" and "negative" sub-matches. Each positive matcher has its corresponding State ID. In order to advance to the next State ID, a node being matched should match all "negative" matchers before the next "positive" matchers and the next "positive" matcher itself. Failure in matching "negative" matcher discards the match.

The role of GraphMatchFinder is similar to MatchFinder: it is only responsible for graph exploration and keeping bound nodes and matchers.

3. Manage bound nodes.
While matching a single graph node, BoundNodes from AST MatchFinder are used. AST matchers for path-sensitive nodes support bindings out-of-the-box. However, the situation with graph matching is a bit different. For graph matching, we have another system of bound nodes. Each graph node has a related map of bounds aka GDMTy (yes, the name is not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap which, in part, is effectively a map from std::string to DynTypedNodes. This look pretty much like how GDM is organized in ExplodedGraph, just without one level of indirection (it can be added, though).

MatchFinder contexts should allow support of their own bindings. This is how equalsBoundNode() matcher works for path-sensitive nodes: it just queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way: ChrootChecker (which is present in the introduction) and TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal checkers: produce warnings and use visitors. The main difference is that they cannot generate sinks and use checkEndAnalysis callback. The code of these checkers can be found here:

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the dynamic introspection of GDM which was pretty hard before. (Hello alpha renaming?)
2. Nothing prevents matchers to be used with existing checker API for simplifying conditional checking inside callbacks. The matchers are not planned as the replacement for the current API, but look like a nice complementary part.
3. Implicit conversion of Matcher<ProgramPoint> to Matcher<ExplodedNode> and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some features are not implemented yet, but I believe they can be added.
1. Usage of matchers inside checker callbacks
This is not exactly unimplemented, but is still untested.
2. Dynamic matchers (clang-query interface)
In addition to work for supporting dynamic nodes (I don't know how many efforts it can take), we need to enable matching against AST nodes, not graph. But it doesn't look as a problem, we can just make matchers polymorphic.
3. Binding to successfully matched paths is not implemented yet - only bindings to separate nodes. I wonder if we need path bindings at all.
4. Some corner cases are still FIXMEs: single-node sequences, for example.
5. GDM is not based on immutable data structures like it is done in CSA - it is just an STL map. Its performance can be poor (full copy on every new node), but I don't think that changing it to use immutable structures is hard.
6. Matching on-the-fly
GraphMatchFinder should support on-the-fly matching during ExplodedGraph building. For this support, we should just call its advance() method each time a new node is created. However, this possibility wasn't checked yet.
7. Matching CFG and CallGraph isn't implemented because it was considered to be far out of simple PoC.
8. Only sequential matching is available now, and I didn't try to implement other operators yet. So, implementing a checker similar to PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
From matchers' side:
1. Some tests don't pass because they rely on the checker ability to generate sink nodes. Our matchers cannot do it by design so tests don't pass.
2. Representing checker bindings as a map can be less effective than storing data in structures. I didn't measure the overhead, so I cannot give any numbers.
3. Matchers are called on every node independently of its type. This is not what CheckerManager does. I wonder if this detail can affect performance as well.
4. Problems with cyclic dependencies. clangStaticAnalyzer has a dependency on clangASTMatchers, therefore, clangASTMatchers cannot depend on clangStaticAnalyzer. However, if we want ASTMatchers to support static analyzer data structures, there should be a backward dependency. Now this dependency is header-only.
5. Code duplication. This is mostly fine for a sandbox but needs to be reduced.

From analyzer's side:
1. Many events are not reflected in the final ExplodedGraph. For example, checkers can receive PointerEscape event

Hmm, what if we "invert" this event, i.e. make it opt-out rather than opt-in, i.e., pointers are escaped by the matcher engine automatically, but if the checker wants a certain event to not escape the pointer, it adds extra matching code, such as "the sequence may include 0 or more passes of the symbol to a certain set of functions in a certain manner".

i'm not sure this will work for other kinds of "flaky" nodes - for PostCondition, for example. PostCondition nodes can only be generated by checkers and cannot be matched without dirty hacks (like force generation of such nodes).

Yes, is what I was always thinking while writing new AST matchers. Time-honored tradition :slight_smile: Sure. If we need this stuff - the design details are subject for discussion and change. My goal was not to achieve extreme performance but not to hurt analyzer’s performance too much. Don’t know how far I am from this target, unfortunately. Yes, I know about some researches in this area, but thank you anyway! Not sure that I understand completely what do you mean. The nodes are written in the same sequence as they should be on the path in ExplodedGraph. I planned to reuse dynamic matchers infrastructure and clang-query for this. clang-query is a command-line tool but there shouldn’t be too much difference where the matcher string we parse comes from. The problem here is that, AFAIR, using callbacks with dynamic matchers is impossible. Still it is a very cool tool for quick matcher prototyping. As I answered to Artem, this is technically possible (but still undesirable, I think) because matchers fire their given callbacks on match, and callbacks are much less limited in their possibilities. That’s sad. I’ll try to measure the overhead but I’m not sure I’ll be able to do it soon.

Yeah, but that doesn’t matter, right? We need to get a good API for checker writers, and the events are generally described in their progression as going forward.
Also “explodedNode” is probably an unfortunate name, and “node” would be better.

I am also confused by the semantics of your list, since it mixes AST and “node” statements.
Why is only the last matcher wrapped in “node”?

Sorry, I still didn’t get you. The events in the matcher are already described in the order they should happen to report a bug: 1. Call to chroot. 2. Do not call chdir("/"). 3. Call anything but chdir("/"). Or do you mean something else? Naming is always a subject to change. In this example, statement matchers are implicitly converted to exploded node matchers and are applied only to ExplodedNodes that have StmtPoint as their ProgramPoint. Initially, I was thinking it is a good idea but soon discovered that I was wrong because it causes a mess between all kinds of StmtPoint. This implicit conversions is likely to be removed. There is also another implicit conversion from ProgramPoint matcher to ExplodedNode matcher and I found it to be much more useful. All sub-matchers of hasSequence()` are matchers on ExplodedNode, but we need to emit a warning on the last node, so we need something to bind against. The bound node is used to emit a warning on it.

The description is helpful.
I’ve meant that when designing a DSL for describing checkers we should think about the most convenient way to describe checkers:
and checks are normally described going forward (as you just did), not going backward.

Overall, sounds very interesting, do you plan about posting a patch on Phabricator?

What makes me really excited about this is that with such a DSL we could then load checkers from text files at runtime,
which would give our users a much easier way to write checkers.

Hi George,

Unfortunately, I had to suspend the work on this. I hope I return to it later, but for now it is not completed yet. Do you want to see a work-in-progress patch on Phabricator?
If you just want to take a quick look, you can see the progress on Github: https://github.com/a-sid/clang/commits/a4-counting-matches

02.07.2018 21:34, George Karpenkov пишет:

Necro-updating the status since some demand for this or similar CSA feature was discovered on EuroLLVM BoF. Unfortunately, I had few time to develop path-sensitive matchers last year so the change list is not very large.

1. On-the-fly matching while building ExplodedGraph. Both offline and online matching is available. The way how a checker handles graphs depends on how it was registered in CheckerManager: there is a separate method registerPathMatcher() that adds matcher to the set of online matchers.

2. Path-sensitive matchers gained the ability to generate new nodes and sinks in the callbacks. There are two limitations, however.
a) AST Matchers don't allow calling callbacks until matching is complete, so I haven't implemented this ability for path-sensitive matchers too.
b) Only on-the-fly matchers are allowed to add nodes (because the ability to add nodes to a complete graph seems useless).

3. The primary advancement for this PoC is that path-sensitive matchers now can work as dynamic matchers so they can be used with clang-query tool. Here is an example clang-query script:

let NotChdir callExpr(unless(callee(functionDecl(hasName("::chdir"))))).bind("bad_action")

match
functionDecl(
hasPath(hasSequence(
postStmt(hasStatement(callExpr(callee(functionDecl(hasName("::chroot")))).bind("chroot"))),
unless(
stmtPoint(hasStatement(callExpr(callee(functionDecl(hasName("::chdir"))),
hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
callEnter(hasCallExpr(NotChdir)))).bind("bug_node"))))

whose output on chroot.c is:

$ bin/clang-query llvm/tools/clang/test/Analysis/chroot.c -f=debug-cmd.query
Match #1:
llvm/tools/clang/test/Analysis/chroot.c:12:3: note: "bad_action" binds here
foo(); // expected-warning {{No call of chdir("/") immediately after chroot}}
^~~~~
llvm/tools/clang/test/Analysis/chroot.c:11:3: note: "chroot" binds here
chroot("/usr/local"); // root changed.
^~~~~~~~~~~~~~~~~~~~
llvm/tools/clang/test/Analysis/chroot.c:10:1: note: "root" binds here
void f1(void) {
^~~~~~~~~~~~~~~
Match #2:
llvm/tools/clang/test/Analysis/chroot.c:24:3: note: "bad_action" binds here
foo(); // expected-warning {{No call of chdir("/") immediately after chroot}}
^~~~~
llvm/tools/clang/test/Analysis/chroot.c:22:3: note: "chroot" binds here
chroot("/usr/local"); // root changed.
^~~~~~~~~~~~~~~~~~~~
llvm/tools/clang/test/Analysis/chroot.c:21:1: note: "root" binds here
void f3(void) {
^~~~~~~~~~~~~~~
2 matches.

This can allow us to prototype checkers without re-compiling them after every change.

The code can be found on my Github:
clang: https://github.com/a-sid/clang/tree/a4-dynamic-path-matchers
clang-tools-extra: https://github.com/a-sid/clang-tools-extra/tree/a4-support

If there is still a need this feature, I'll start the discussion on its design because it requires some intrusive changes into AST Matchers code.

17.05.2018 2:37, Alexey Sidorin via cfe-dev пишет:

  • a couple folks I remember expressing interest about this subject :slight_smile: