facebook clang plugins

Hi,

I am looking for feedback on the possibility of contributing some of the clang plugins used at Facebook back to clang.

We just made available a first subset of plugins here: https://github.com/facebook/facebook-clang-plugins

The plugins fall into two groups:
1) Clang analyzer checkers for iOS;
2) A clang frontend plugin to export the internal AST of clang in an Ocaml-friendly Json. This plugin comes with Ocaml libraries for testing, parsing, and visiting the AST.

Except for the naming conventions, which are not uniform yet, and the need to update the referenced version of clang, the code should be in a relatively good state. In particular, everything has been tested quite at scale.

Thanks!

Hi,

cool!

The analyzer checkers look like they’re generally useful from a quick glance. http://llvm.org/docs/DeveloperPolicy.html describes how to contribute patches. Look at the existing checker tests in the clang repo on how to write tests in clang’s test suite.

There currently aren’t any frontend plugins checked in. Ocaml tools for parsing json seem out of scope for the clang repo. Clang used to have a “dump xml” flag which was removed here: http://llvm.org/viewvc/llvm-project?view=revision&revision=127141 Adding json output to clang itself seems reasonable if it fulfills the requirements in that commit log, but it should probably be done without a plugin. So I think the frontend plugin should probably continue living out of tree.

Nico

Awesome! Sending patches to contribute the new analyzer checkers sounds good to me.

Thanks for the insights on the Xml/Json dumpers. I am totally fine with letting this plugin out for now.

Out of curiosity, though, I have a couple of questions on some of the requirements you mention:

  • Documented, with appropriate Schema against which we can validate the output
    What platform is available to run the validation?
    Here we are using the “atd” language for the schema and an ocaml tool “atdgen”: http://mjambon.com/atdgen/atdgen-manual.html

  • Designed for C/C++/Objective-C, not Clang’s specific ASTs
    I’m not sure to understand this part. Can you elaborate?

  • Stable across Clang versions
    I guess it would need to be maintained the same way ASTDumper.cpp is. Also, it does not seem impossible to merge the two files.

Mathieu

I’d just like to say that even if OCaml tools parsing JSON is out of scope as Nico suggests, the work you have done to “schematize” the Clang AST could be the start of something really useful for upstream. Currently, I can think of at least two places that would benefit greatly from having such a schema as a “single point of truth”: Serialization and ASTMatchers.

Being able to auto-generate those two from a schema (maybe in the form of annotations in the header files) plus a relatively small amount of generator code could eliminate thousands of lines of code.

If RecursiveASTVisitor (~2500 lines) and TreeTransform (~10k lines) could also be generated from the “single point of truth” with relatively little code, then that would be a tremendous savings.

I think your OCaml tool would be quite easy to write with Clang annotated as such, but you could let the Clang developers maintain the annotations for you :wink:

This might also pave the way for a more “data-driven” approach to the DynamicASTMatchers, which could significantly reduce the binary size (which is enormous, and IIRC is mostly due to the fact that is just pre-instantiates all the static templates). The same approach might work for the “static” ASTMatchers too, letting the compiler essentially constant-fold all the indirections (which will largely be member pointers I imagine). This might also improve compile time (which is an issue; see http://llvm.org/bugs/show_bug.cgi?id=20061).

– Sean Silva

I'd just like to say that even if OCaml tools parsing JSON is out of scope
as Nico suggests, the work you have done to "schematize" the Clang AST
could be the start of something really useful for upstream. Currently, I
can think of at least two places that would benefit greatly from having
such a schema as a "single point of truth": Serialization and ASTMatchers.

Being able to auto-generate those two from a schema (maybe in the form of
annotations in the header files) plus a relatively small amount of
generator code could eliminate thousands of lines of code.

If RecursiveASTVisitor (~2500 lines) and TreeTransform (~10k lines) could
also be generated from the "single point of truth" with relatively little
code, then that would be a tremendous savings.

I think your OCaml tool would be quite easy to write with Clang annotated
as such, but you could let the Clang developers maintain the annotations
for you :wink:

This might also pave the way for a more "data-driven" approach to the
DynamicASTMatchers, which could significantly reduce the binary size (which
is enormous, and IIRC is mostly due to the fact that is just
pre-instantiates all the static templates). The same approach might work
for the "static" ASTMatchers too, letting the compiler essentially
constant-fold all the indirections (which will largely be member pointers I
imagine). This might also improve compile time (which is an issue; see
http://llvm.org/bugs/show_bug.cgi?id=20061).

Generating code for the AST matchers is something we would like, but I
think it's orthogonal to the general design of the matchers (we still want
the functional composition), so I'm not sure how it would help with the
things you mention (apart from getting rid of the manually written
matchers, which would still be a big win).

One of the big problems would be how we auto-generate the documentation for
the AST matchers, though. I agree that it would be better to have the
documentation (and the examples) on the nodes, but that'd be a lot of work.

I'd just like to say that even if OCaml tools parsing JSON is out of
scope as Nico suggests, the work you have done to "schematize" the Clang
AST could be the start of something really useful for upstream. Currently,
I can think of at least two places that would benefit greatly from having
such a schema as a "single point of truth": Serialization and ASTMatchers.

Being able to auto-generate those two from a schema (maybe in the form of
annotations in the header files) plus a relatively small amount of
generator code could eliminate thousands of lines of code.

If RecursiveASTVisitor (~2500 lines) and TreeTransform (~10k lines) could
also be generated from the "single point of truth" with relatively little
code, then that would be a tremendous savings.

I think your OCaml tool would be quite easy to write with Clang annotated
as such, but you could let the Clang developers maintain the annotations
for you :wink:

This might also pave the way for a more "data-driven" approach to the
DynamicASTMatchers, which could significantly reduce the binary size (which
is enormous, and IIRC is mostly due to the fact that is just
pre-instantiates all the static templates). The same approach might work
for the "static" ASTMatchers too, letting the compiler essentially
constant-fold all the indirections (which will largely be member pointers I
imagine). This might also improve compile time (which is an issue; see
http://llvm.org/bugs/show_bug.cgi?id=20061).

Generating code for the AST matchers is something we would like, but I
think it's orthogonal to the general design of the matchers (we still want
the functional composition), so I'm not sure how it would help with the
things you mention (apart from getting rid of the manually written
matchers, which would still be a big win).

I was talking about a possible simplification of the implementation, not
the API it exposes to users.

One of the big problems would be how we auto-generate the documentation
for the AST matchers, though. I agree that it would be better to have the
documentation (and the examples) on the nodes, but that'd be a lot of work.

The documentation that you guys have produced for the matchers is quite
good and could largely be reused/shared/migrated/adapted to the relevant
part of the AST itself. (As you said, it would be a lot of work though).

-- Sean Silva

Hi,

I am looking for feedback on the possibility of contributing some of the clang plugins used at Facebook back to clang.

We just made available a first subset of plugins here: https://github.com/facebook/facebook-clang-plugins

The plugins fall into two groups:
1) Clang analyzer checkers for iOS;

Mathieu,

One of the analyzer checkers looks similar/like a subset of this one:
   cfe/trunk/lib/StaticAnalyzer/Checkers/IdenticalExprChecker.cpp
   cfe/trunk/test/Analysis/identical-expressions.cpp

Cheers,
Anna.

We'd also prefer you to prepare separate patches for each checker. It would be a big plus if you added empirical data as well: how much testing has the checker been though, what is the false positives ratio, how effective is it (bugs found).

It looks like some of the checkers come from Facebook coding policies and might not be applicable for "a turned on by default checker". It would be good to note that as well.

Thanks!
Anna.

Thanks Anna. Indeed, it looks like the checker IdenticalExprChecker.cpp is strictly more general than the one in the plugin (since https://llvm.org/svn/llvm-project/cfe/trunk@201702).
I will remove the duplicate.

You could double check if the existing checker is on by default.

Thanks for the feedback, Sean, Manuel. I had not thought about the ASTMatchers before, but this sounds interesting (see comments below, though).

On my side, here are a few thoughts on extending the “ASTExporter" plugin to make it useful upstream. Let me stress that this is all highly speculative, and that I am not promising anything :slight_smile:

  1. With some tweaks, it seems feasible to code an ASCII-art tree-like output for the ASTExporter, that is close enough to the current ASTDumper.

  2. Similarly, it should be easy to make the ASTExporter emit binary outputs instead of Json (e.g. in this format: http://mjambon.com/biniou-format.txt which ocaml understands)

  3. Then, one could write a dual “ASTImporter”, together with a binary parsing library, so as to provide a full alternative solution for AST (de)serialization.

At this stage, the resulting code would be about the same size, but arguably more modular than the current binary (de)serialization (since it would support Json and possibly text outputs).
The binary and Json (de)serialization could be tested alone (write then read) and also by interoperating with Ocaml (if we choose the format above).

To be fair, the whole thing could also be a little less efficient (in size and time) because of the use of a uniform format.

  1. With more work, the schema (currently “atd” annotations) could be used to generate a in-memory representation of the AST in terms of tree-like plain data (“PODS"). The ASTExporter and ASTImporter classes, plus appropriate generated modules, would take care of the translation to and from this “protocol" representation.

At this stage, it should be rather easy to meta-generate visitors and perhaps matchers on these PODS. However, be aware that one would not visit/match the original AST, but a copy of it, with a different style of data structures. To me, this observation will always hold if we try to generate visitors or matchers directly (i.e. without generating first an in-memory copy of the AST) from a language-agnostic schema.
Lastly, I wouldn’t expect a meta-generated API to match an existing handwritten API word for word, even if the general style can be maintained.

— Mathieu

Thanks for the feedback, Sean, Manuel. I had not thought about the
ASTMatchers before, but this sounds interesting (see comments below,
though).

On my side, here are a few thoughts on extending the “ASTExporter"
plugin to make it useful upstream. Let me stress that this is all highly
speculative, and that I am not promising anything :slight_smile:

0) With some tweaks, it seems feasible to code an ASCII-art tree-like
output for the ASTExporter, that is close enough to the current ASTDumper.

1) Similarly, it should be easy to make the ASTExporter emit binary
outputs instead of Json (e.g. in this format:
http://mjambon.com/biniou-format.txt which ocaml understands)

If we have to choose a binary format, I think we should stick with the LLVM
bitcode binary format <http://llvm.org/docs/BitCodeFormat.html&gt;\.

2) Then, one could write a dual “ASTImporter”, together with a binary
parsing library, so as to provide a full alternative solution for AST
(de)serialization.

At this stage, the resulting code would be about the same size, but
arguably more modular than the current binary (de)serialization (since it
would support Json and possibly text outputs).
The binary and Json (de)serialization could be tested alone (write then
read) and also by interoperating with Ocaml (if we choose the format above).
To be fair, the whole thing could also be a little less efficient (in
size and time) because of the use of a uniform format.

3) With more work, the schema (currently “atd” annotations) could be
used to generate a in-memory representation of the AST in terms of
tree-like plain data (“PODS"). The ASTExporter and ASTImporter classes,
plus appropriate generated modules, would take care of the translation to
and from this “protocol" representation.

At this stage, it should be rather easy to meta-generate visitors and
perhaps matchers on these PODS. However, be aware that one would not
visit/match the original AST, but a copy of it, with a different style of
data structures. To me, this observation will always hold if we try to
generate visitors or matchers directly (i.e. without generating first an
in-memory copy of the AST) from a language-agnostic schema.

While a "protocol" representation would be revolutionary for language
bindings, I think an important first step to bringing this in-tree is to
using the schema to directly generate C++ code interoperating with the
current C++ data structure.

Realistically, the schema must be something that in-tree developers will
*want* to keep up to date because it saves them a lot of time and effort.
Otherwise, it will be considered as some sort of parasitic decoration on
the AST and will not be kept up to date (not through malice, but simply
through the "this doesn't affect me so I sometimes forget its there"
effect). And a schema that one can't be confident is kept up to date is
almost worse than no schema at all.

Lastly, I wouldn’t expect a meta-generated API to match an existing
handwritten API word for word, even if the general style can be maintained.

As I mentioned to Manuel, largely it would be about simplifying the
implementation. A lot of the matchers just involve stamping out some
repetitive piece of code (e.g. for each AST node type, or for particular
methods on nodes, etc.).

However, an alternative matching paradigm would also be great. In
particular, I think that a "protocol" representation would be really
beneficial for pushing interaction with the AST to the next level.
Currently, the matchers are basically "grep-like", in that they go through
the entire AST looking for matching nodes. A "protocol" representation with
a proper schema would significantly simplify e.g. indexing the AST and
other common techniques for speeding up searches.

-- Sean Silva