Getting involved with Clang refactoring

Greetings,

I just attended Chandler Carruth’s talk at C++ Now about using clang to develop refactoring tools. As they say on the internets “Your ideas intrigue me and I would like to subscribe to your newsletter.”

I believe that refactoring tools can be groundbreaking for C++ and would like to contribute. I’m particularly interested in getting the tools hooked into IDEs (many years ago I worked on a refactoring plugin for JBuilder), but it sounds like there’s lots of work to be done before that. How can I get started, and what most needs to be worked on?

Thanks
–David

Hi David,

Quoting Manuel Klimek from another thread on this mailing list:

See:
http://clang.llvm.org/docs/Tooling.html
http://clang.llvm.org/docs/RAVFrontendAction.html

and in case you decide to give LibTooling a try:
http://clang.llvm.org/docs/LibTooling.html

Cheers,

Welcome!

Arnaud's links point at the tooling infrastructure, which is intended for building standalone refactoring and source-based tools.

For IDE-centric tools, libclang

  http://clang.llvm.org/doxygen/group__CINDEX.html

is a C interface to Clang that focuses on the things that IDEs like to do: syntax highlighting, indexing, code completion, finding all of the references to a given declaration within a file, finding out what your cursor is pointing to, etc. I gave an introductory talk on libclang at the 2010 LLVM Developer Meeting which might be helpful:

  http://llvm.org/devmtg/2010-11/

If your primary interest is in IDEs, I suggest wiring up libclang to your favorite open-source IDE. There are various Vim and Emacs libclang bindings running around, for example, although I don't know which are best. Personally, I think it would be awesome if we could have easy-to-install, libclang-based packages for vim and emacs that were maintained along with Clang and got all of the new libclang goodness. Once you've played with libclang's integration, I'm sure you'll find something that needs improvement, and we can point you at that code if you don't find it yourself.

If your primary interest is in refactoring, Arnaud's links to tooling are a good place to start, because it's starting to collect the pieces needed to make refactoring easier.

  - Doug

clang_complete (https://github.com/Rip-Rip/clang_complete) is a nice Vim
plugin and is easy to install, although it is not shipped with clang :-)

Cheers,
Arnaud

Thanks all for the pointers.

--David

Btw, the first link (http://clang.llvm.org/docs/Tooling.html) contains
a discussion about the different frameworks. If you think I haven't
mentioned some stuff about libclang on there (or got some wrong), let
me know and I'll add to it.

And while I'm at it: one thing that might be interesting for a
contribution is getting compilation database support into
libclang (not sure what Doug thinks about this, but I'm sure he'll jump
in ;-)). One of the problems I have with the libclang plugins I know is
that I need to specify "one common command line".

Thoughts?
/Manuel

That's a good idea. libclang assumes that you have your own compilation database (or similar) and that you use it to feed command lines to libclang, but it would be easier to use libclang if we could just point it at a build or source tree and then feed source files to libclang.

The problem with make is that it executes the compiler based on
changes in the dependency graph.

Imagine you want to run a tool on all files that use a specific
symbol. To do this you need to:
1. figure out the compile command line for the file you are pointing
at, so that you can determine the symbol under the cursor
2. run the tool over all files that reference that symbol

Both steps are completely independent of any dependency changes, so
running the tool as part of make / as a clang plugin is not the best
option here. You want a compilation database out of which you can
quickly (for interactive use cases) get the compile command line for
any file in a project.

LibTooling has this idea already at its core:
http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Tooling/CompilationDatabase.h?view=markup

You first generate a compilation database, and then feed that into
your tool run.
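
The lookup described above can be sketched in a few lines of Python. This is only an illustration, not LibTooling's implementation: it assumes a JSON database whose entries carry "directory", "command" and "file" keys (the layout CMake writes to compile_commands.json), and the names load_index, command_for and SAMPLE_DB are made up for the example.

```python
import json
import os
import shlex

def load_index(db_text):
    """Build a dict mapping normalized source path -> compile argument list."""
    index = {}
    for entry in json.loads(db_text):
        # "file" may be relative to "directory", so resolve it first.
        path = os.path.normpath(os.path.join(entry["directory"], entry["file"]))
        index[path] = shlex.split(entry["command"])
    return index

def command_for(index, source_path):
    """Return the recorded compile arguments for source_path, or None."""
    return index.get(os.path.normpath(source_path))

# Hypothetical sample data (POSIX-style paths).
SAMPLE_DB = """
[
  {"directory": "/proj/build",
   "command": "g++ -I../include -c ../src/foo.cpp",
   "file": "../src/foo.cpp"},
  {"directory": "/proj/build",
   "command": "g++ -DNDEBUG -c ../src/bar.cpp",
   "file": "../src/bar.cpp"}
]
"""

if __name__ == "__main__":
    index = load_index(SAMPLE_DB)
    print(command_for(index, "/proj/src/foo.cpp"))
```

Because the index is built once and then queried by path, the per-file lookup is a dictionary access, which is the property that matters for interactive use.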

The idea would be to wrap up the CompilationDatabase in a stable C
interface in libclang, and provide parsing functions in libclang that
take the C-equivalent of a CompilationDatabase instead of the command
line itself.

Thoughts?
/Manuel

Bringing it back to 'make' a little bit... we could, conceivably, have a compilation database implicitly generated from the makefiles. If one asked it how to build 'foo.cpp', it would find the appropriate make rule and form the command-line arguments. We don't have such a 'live' compilation database right now, but it fits into the model and would be really, really cool because it would allow us to 'just work' on a makefile-based project. Unfortunately, it amounts to re-implementing 'make' :-(

There are other ways we could build compilation databases. There's CMake support for dumping out a compilation database; we could also add a -fcompilation-database=<blah> flag that creates a compilation database as the result of a build, which would work with any build system. That would also be a nice little project that would help the tooling effort.

  - Doug

Gentlemen, could you tell me please what you mean by a "compilation
database"? It sounds like a list of files to compile, something I'd
normally look to make(1) to provide. Apparently that's not what you
mean, or there's some reason make won't do the job.

If I understood the problem better perhaps there's something I could do
about it.

Bringing it back to 'make' a little bit... we could, conceivably, have a compilation database implicitly generated from the makefiles. If one asked it how to build 'foo.cpp', it would find the appropriate make rule and form the command-line arguments. We don't have such a 'live' compilation database right now, but it fits into the model and would be really, really cool because it would allow us to 'just work' on a makefile-based project. Unfortunately, it amounts to re-implementing 'make' :-(

/me pulls out the fry meme. Are you actually serious?

There are other ways we could build compilation databases. There's CMake support for dumping out a compilation database; we could also add a -fcompilation-database=<blah> flag that creates a compilation database as the result of a build, which would work with any build system. That would also be a nice little project that would help the tooling effort.

+1 on that one; I think if we really want generic support, this is the
way to go. I assume the idea is to have
-fcompilation-database=/path/to/db on every TU and then add the TU's
compilation information to that db (including locking etc)?

Cheers,
/Manuel

About re-implementing 'make'? No, we don't want that burden.

But viewing an existing build system, as implemented in a library that we can link against, through the lens of a compilation database would be useful.

  - Doug

Yep. Unfortunately I think that ship sailed for make a long time
ago... This might be an option for ninja integration though...

Cheers,
/Manuel

For the sake of readers who, like me, don't know all the background
information, here's what I've unearthed over the last hour or two:

1. If you define CMAKE_EXPORT_COMPILE_COMMANDS, cmake will create the file
   compile_commands.json.

   See http://cmake.org/gitweb?p=cmake.git;a=commitdiff;h=fe07b055
   and http://cmake.org/gitweb?p=cmake.git;a=commitdiff;h=5674844d

   I don't know if the format of this json file is documented anywhere, but
   from the above commits it seems to be an array of dicts like this:

      { "directory": "abc", "command": "g++ -xyz ...", "file": "source.cxx" }

2. Clang has a tool called scan-build that wraps an invocation of make.
   You call it like this:

      scan-build make

   Scan-build intercepts the compiler by setting CXX to some script that
   forwards on to the real compiler, and then (while it still knows all
   the compiler flags necessary to compile this file) it invokes the
   clang static analyzer.

   See http://clang-analyzer.llvm.org/scan-build.html
   and http://llvm.org/svn/llvm-project/cfe/trunk/tools/scan-build/scan-build

   It's 1400 lines of perl, but most of that seems to be command-line options,
   usage help, and generating html reports. The compiler-interception part
   doesn't seem too difficult.

   Scan-build is relevant to this discussion because one could generate a
   compilation database using a similar interposing technique.

3. Something completely different: Maybe we could figure out the compilation
   command-lines for all of a project's files at once by looking at the output
   of "make --always-make --dry-run".

   One difference from the lets-interpose-CXX approach is that this will give
   us some command-lines that are not C++ compilations, and we'd have to filter
   those out.

   Once we do know that it's a C++ compilation command-line, we still have to
   parse that command-line to figure out the name of the sourcefile (just like
   the interposed CXX script has to).

4. Doug's suggestion: Call clang with "-fcompilation-database=foo" during the
   course of a normal build. This will simultaneously compile the file and
   add/update an entry in the compilation database. (Or maybe only do the
   compilation database entry, requiring a separate invocation to do the
   actual compilation?)
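
As a rough illustration of approach 3, the sketch below filters echoed commands (such as the output of "make --always-make --dry-run") down to C/C++ compilations and extracts the source file from each. The compiler names, the source extensions and the helper names are assumptions made for the example. It does not handle compound shell lines such as "cd dir && g++ ...", and it records a single directory for every entry.

```python
import os
import shlex

# Assumed sets; a real project may use other compiler names or extensions.
COMPILERS = {"gcc", "g++", "cc", "c++", "clang", "clang++"}
SOURCE_EXTS = (".c", ".cc", ".cpp", ".cxx")

def entry_from_line(line, directory):
    """Turn one echoed command into a database entry, or return None if it
    does not look like a single-file C/C++ compilation."""
    try:
        args = shlex.split(line)
    except ValueError:
        return None  # unbalanced quotes, e.g. in make's own messages
    if not args or os.path.basename(args[0]) not in COMPILERS:
        return None  # not a compiler invocation
    if "-c" not in args:
        return None  # probably a link step
    sources = [a for a in args[1:]
               if a.endswith(SOURCE_EXTS) and not a.startswith("-")]
    if len(sources) != 1:
        return None
    return {"directory": directory, "command": line.strip(), "file": sources[0]}

def entries_from_dry_run(output, directory):
    """Collect entries for every compilation found in the dry-run output."""
    entries = []
    for line in output.splitlines():
        entry = entry_from_line(line, directory)
        if entry is not None:
            entries.append(entry)
    return entries

# Hypothetical dry-run output.
SAMPLE_OUTPUT = """\
mkdir -p build
g++ -Iinclude -O2 -c src/foo.cpp -o build/foo.o
cp README build/
gcc -c src/bar.c -o build/bar.o
g++ build/foo.o build/bar.o -o build/app
"""

if __name__ == "__main__":
    for e in entries_from_dry_run(SAMPLE_OUTPUT, "/proj"):
        print(e["file"], "<-", e["command"])
```

The same entry_from_line logic is what an interposed CXX script would need, since it also has to find the source file among the arguments.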

Pros and cons of the various approaches:

Cmake + The compilation database is generated at "cmake" time -- we don't need
         to do a full build.

Cmake + Works on Windows.

Cmake - (Obviously) doesn't work with non-cmake build systems.

CXX interposing + Probably the easiest to implement if you have a project that
                   needs this *now* and you don't want to wait for a better
                   solution to make its way into clang.

CXX interposing + Works with any build system as long as it is compliant with
                   the CXX / CC environment variable convention.

CXX interposing - The interposed script has to parse the compilation command-
                   line to extract the source filename. This is duplication of
                   effort because clang already has to parse the command-line.

CXX interposing - Each entry to the compilation database is added as the
                   corresponding target is being built, so in
                   parallel/distributed builds it will have to lock the
                   compilation database.

make --dry-run + Works with any make-based system (I'm not very familiar with
                  non-GNU versions of make, but presumably they have similar
                  flags), except for recursive-make systems as mentioned below.

make --dry-run + Far easier than re-implementing make.

make --dry-run + No need to actually build the targets.

make --dry-run - Like the CXX interposing technique, has to parse the
                  compilation command-line.

make --dry-run - Gives you *all* the compilation commands, not just C or C++
                  compilations; you'll have to filter the output for what
                  you're interested in. Smells a bit hacky and brittle but
                  maybe that's just my prejudices speaking.

make --dry-run - Doesn't work with some complex recursive-make build systems.
                  For example if part of your makefile creates another makefile
                  and then uses that, clearly your dry-run won't work unless it
                  actually does create that second makefile. In theory make has
                  ways to make this work -- see
                  http://www.gnu.org/software/make/manual/html_node/MAKE-Variable.html
                  -- but in practice I've never seen a large build system where
                  dry-run works.

clang -fcompilation-database + Easier for the *user* than the two previous
                                shell-script-based solutions. No mucking about
                                with shell scripts: just set CXXFLAGS, run
                                make, and you're done.

clang -fcompilation-database + Will work on Windows.

clang -fcompilation-database - Like the CXX interposing technique, has to lock
                                the compilation database for parallel/
                                distributed builds.

clang -fcompilation-database - Can't generate the compilation database without
                                building your whole project with clang.

That last point is more important (to me) than you might think. Say I have a
large codebase and not all of it builds with clang; but for the source files
that *can* be parsed by clang, I want to run some clang-based tool. Still,
having "-fcompilation-database" in clang doesn't stop me from writing my own
CXX-interposing scripts if I should need them.
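
The locking concern raised for the interposing and "-fcompilation-database" approaches could look roughly like this on a POSIX system. It is a sketch under assumptions, not a proposed implementation: the database is a single JSON file, writers coordinate through an advisory fcntl.flock lock, and add_entry is a hypothetical helper name. Advisory locks may not work on network file systems, so distributed builds would need something else.

```python
import fcntl
import json
import os
import tempfile

def add_entry(db_path, entry):
    """Insert or replace the entry for (directory, file) while holding an
    exclusive advisory lock on the database file."""
    fd = os.open(db_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as db:
        fcntl.flock(db, fcntl.LOCK_EX)  # blocks until other writers finish
        text = db.read()
        entries = json.loads(text) if text.strip() else []
        key = (entry["directory"], entry["file"])
        # Drop any stale entry for the same translation unit.
        entries = [e for e in entries if (e["directory"], e["file"]) != key]
        entries.append(entry)
        db.seek(0)
        db.truncate()
        json.dump(entries, db, indent=2)
        db.flush()
        # The lock is released when the file is closed.

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "compile_commands.json")
        add_entry(path, {"directory": "/proj", "command": "g++ -c a.cpp", "file": "a.cpp"})
        add_entry(path, {"directory": "/proj", "command": "g++ -O2 -c a.cpp", "file": "a.cpp"})
        with open(path) as f:
            print(f.read())
```

Rewriting the whole file on every compilation is simple but scales poorly for large projects; an append-only log merged at the end of the build would be one alternative.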

Well, that's all. I hope someone finds it useful -- I can't be the only one to
have wondered how to actually get the full command-line through to clang-based
tools. :-) Once we decide on an official solution, let's make sure we document
it well.

--Dave.

Hi Dave,

thanks for writing all the stuff down!

I don’t think that an “official” solution for how to generate the compile database is important, as long as

  1. the format is clear
  2. we support a wide range of use cases

This is open source :-) People can generally implement all of the above solutions. Some of them might not need to live inside clang’s repository, but it would be good to have at least one solution, as generic as possible, living inside clang without the need for third-party things (like cmake or ninja). I think for that solution the switch is the best one, as it’s the only one that does not increase the dependency needs of clang users at build time.

Thoughts?
/Manuel

Hi Manuel & Dave,

Although the switch makes for an easy, self-contained solution, it is not generic enough to cover an important use case: people may not be using clang to compile their code, but still want all the clang goodies (code completion, …) through an external tool. This is the case, for example, when using clang_complete with vim: you are not forced to compile your project with clang.

Cheers,

Yep, that is true.

On the other hand, the more tools we have, the more other open-source projects (cmake / ninja / etc) will support creating compile command lines. So we need to find the right trade-off for what to include in the clang codebase. As I said, I think we don’t need to support all use cases from what’s available inside the clang tree.

To some degree we’ll always require a compiler that is “compatible enough” with clang, because we probably won’t want to implement every other compiler’s command-line argument parsing inside clang.

In the end it depends on who’s willing to write which solution and propose a patch to clang ;-)

Cheers,
/Manuel

Do not tempt me ;-)

The use case I have here is that several people are using clang_complete
+ vim, but have to use gcc because their target requires it. We have
added a cmake target to generate the ".clang_complete" configuration
file, which provides the options necessary for compiling the source
files: in essence, that's a compilation database. For its part,
clang_complete also provides a script to extract the compilation options
from the build system.

I am interested in working on this compilation database, as it will clearly
benefit clang and all derivative projects.

Cheers,

So, there are multiple things to the CompilationDatabase where work is needed:

  1. Add ways to create the JSON compilation database. We already have a way to create it from CMake; there were discussions on getting it out of ninja, and even creating one yourself with a python script is not too hard.

  2. Integrating the compilation database into libclang: tools like clang_complete are based on libclang, which currently doesn’t expose the CompilationDatabase interfaces; adding a C interface around CompilationDatabase would help there.

  3. Making CompilationDatabase work in more use cases:

  • add code to deal with gcc specific flags, where people have gcc as their main tool; don’t know whether it makes sense to add code to deal with other compilers
  • add other formats for the CompilationDatabase, for example for Ninja it might be viable to basically link in a Ninja runtime (if available at clang compile time) to directly read the compile commands from the ninja files
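
To illustrate the "python script" option from item 1, here is a minimal sketch that writes a JSON compilation database for a flat list of sources sharing one set of flags. The function name, the compiler and the flags are placeholders chosen for the example; the entry layout ("directory", "command", "file") follows what CMake emits.

```python
import json
import shlex

def make_database(directory, compiler, flags, sources):
    """Return one entry per source file, all built with the same flags."""
    entries = []
    for src in sources:
        args = [compiler] + list(flags) + ["-c", src]
        # Quote each argument so the command survives shell-style splitting.
        command = " ".join(shlex.quote(a) for a in args)
        entries.append({"directory": directory, "command": command, "file": src})
    return entries

if __name__ == "__main__":
    db = make_database(
        directory="/proj",                      # hypothetical project root
        compiler="g++",
        flags=["-Iinclude", "-DNDEBUG"],
        sources=["src/foo.cpp", "src/bar.cpp"],
    )
    with open("compile_commands.json", "w") as out:
        json.dump(db, out, indent=2)
```

A project whose files need different flags would have to generate them per file, which is exactly the information a build system already has.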

Cheers,
/Manuel

I'm following this discussion closely, as I'm working on a very similar
constellation of problems. I'm interested in clang_complete and some form of
clang indexing, but for emacs, not for vim. I'm thinking that the indexing
tool should be something like ctags or gid, but smarter. I'm working on a
python tool to extract build information from the output of gmake. (FWIW, I
think I have management interest in contributing my code to clang, or emacs,
or whatever my code ends up being useful for.)

Does anybody know if the users of the compilation database care about the
first parameter in the command? What I'm asking is: if I interpose my command,
say mk_compile_db, for a call to g++, and then do gmake --makeall
CC=mk_compile_db..., the compilation command will say that mk_compile_db is
the command, not g++. It turns out that the path to g++ is explicit in our
build system, so I can't just change the first word to "g++" and give an
accurate answer. On the other hand, maybe it doesn't matter what the command
is. Maybe the libtooling facilities ignore the first word, or maybe they
expect the first word to be elided altogether.

I could investigate the code, but I think I'd rather know what the
intended behavior is.

A few years ago I read a very interesting article about the Coverity tool,
which is not that far from what the Clang tooling infrastructure is about.
http://cacm.acm.org/magazines/2010/2/69354-a-few-billion-lines-of-code-later/fulltext

They talk about this very issue too: how to correctly analyze the build
process?

If I remember correctly, after using a man-in-the-middle attack against the
compiler invocations, they moved to running the build process in
tracing mode to capture all the Windows subtleties, for example. :-)

This is a very interesting article to read. It also seems to be
available, along with some related presentations, at
http://www.stanford.edu/~engler/

Parsing the output of "strace -f" may be your friend to start with on
Unix. :-)
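
A rough sketch of the strace idea in Python follows. It assumes a trace produced by something like "strace -f -e trace=execve -s 4096 -o trace.log make", where each successful call appears on one line in the form `PID execve("/path", ["arg0", "arg1", ...], ...) = 0`. The exact layout can vary between strace versions, so the regular expressions are assumptions to check against real output. Escape sequences inside strings are not decoded, and the working directory of each process is not recovered.

```python
import re

# Matches the path and the bracketed argv of an execve line.
EXECVE_RE = re.compile(r'execve\("(?P<path>(?:[^"\\]|\\.)*)", \[(?P<argv>.*?)\], ')
# Matches one double-quoted string, allowing backslash escapes inside it.
STRING_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

DEFAULT_COMPILERS = ("gcc", "g++", "cc", "c++", "clang", "clang++")

def compiler_invocations(trace_text, compilers=DEFAULT_COMPILERS):
    """Return the argv list of every successful execve of a known compiler."""
    result = []
    for line in trace_text.splitlines():
        if not line.rstrip().endswith("= 0"):
            continue  # failed exec attempts, e.g. during PATH search
        match = EXECVE_RE.search(line)
        if match is None:
            continue
        argv = STRING_RE.findall(match.group("argv"))
        if argv and argv[0].rsplit("/", 1)[-1] in compilers:
            result.append(argv)
    return result

# Hypothetical trace excerpt.
SAMPLE_TRACE = r'''
1001 execve("/usr/bin/make", ["make"], 0x7ffd1 /* 30 vars */) = 0
1002 execve("/usr/local/bin/g++", ["g++", "-c", "foo.cpp"], 0x55d0 /* 31 vars */) = -1 ENOENT (No such file or directory)
1002 execve("/usr/bin/g++", ["g++", "-Iinclude", "-c", "foo.cpp", "-o", "foo.o"], 0x55d0 /* 31 vars */) = 0
1003 execve("/usr/lib/gcc/cc1plus", ["cc1plus", "foo.cpp"], 0x1 /* 31 vars */) = 0
'''

if __name__ == "__main__":
    for argv in compiler_invocations(SAMPLE_TRACE):
        print(" ".join(argv))
```

Unlike the interposing approach, this does not depend on the build honoring CC or CXX, but it does require actually running the build.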