Announcing "clang-ctags"

Announcing "clang-ctags", a libclang-based ctags implementation written
in python.

Source code: https://github.com/drothlis/clang-ctags

I took care to structure the commits in a tutorial-like fashion, so you
could start from the oldest commit:
https://github.com/drothlis/clang-ctags/commits/master/clang-ctags
As time permits I'll write up a tutorial presenting the material in a
more structured way.

Currently clang-ctags only supports the Emacs ("etags") format
(mainly because I haven't figured out how to write integration tests
for vim).

WHY:

This seemed like the simplest tool I could write to get acquainted with
libclang, and still be useful.

https://github.com/drothlis/clang-ctags/blob/master/test/why.sh tests
some specific cases that the traditional etags doesn't handle well.

LESSONS LEARNED:

Using this tool is far more complicated than existing ctags/etags
implementations. To process a source file you need its compilation
command line (there are several ways to obtain this:
https://github.com/drothlis/clang-ctags#compilation-command-line).
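
As an illustration of where that command line can come from, here is a minimal sketch that reads a JSON compilation database and looks up the arguments for one file. It is not taken from clang-ctags; the function name `load_compile_commands` and the sample entry are assumptions made for this example. It relies only on the documented database layout (a list of entries with "directory", "file", and either "command" or "arguments").

```python
import json
import shlex


def load_compile_commands(text):
    """Map each "file" in a compilation database to its argument list."""
    commands = {}
    for entry in json.loads(text):
        # An entry carries either an "arguments" list or a "command" string.
        if "arguments" in entry:
            args = list(entry["arguments"])
        else:
            args = shlex.split(entry["command"])
        commands[entry["file"]] = args
    return commands


# Hypothetical database contents, for illustration only.
EXAMPLE = """
[
  {"directory": "/build",
   "command": "clang++ -Iinclude -DNDEBUG -c src/a.cpp -o a.o",
   "file": "src/a.cpp"}
]
"""

if __name__ == "__main__":
    print(load_compile_commands(EXAMPLE)["src/a.cpp"])
```

A real tool would also have to resolve relative "file" paths against each entry's "directory"; the sketch skips that step.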

There are other complications. How do you process header files? You
don't have a compilation command line for headers. The approach I've
taken is to generate tags for a header file encountered during
processing a source file, but only if that header file was also
specified on the clang-ctags command line. This matches the way you
invoke traditional ctags tools, but instead of:
    find . -name '*.[ch]pp' | xargs ctags
you say:
    find . -name '*.[ch]pp' |
    xargs clang-ctags --compile-commands=compile_commands.json

(I also added a "--non-system-headers" flag to generate tags for all
header files encountered that are under the directory where clang-ctags
is invoked.)
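
The header rule described above can be sketched as a small predicate. This is an illustration under assumptions, not the code in clang-ctags: the name `should_tag` and its parameters are hypothetical, and `invocation_dir` is assumed to be an absolute path.

```python
import os


def should_tag(path, requested_files, invocation_dir, non_system_headers=False):
    """Decide whether declarations found in `path` should get tags."""
    # Resolve everything against the directory the tool was started in.
    path = os.path.normpath(os.path.join(invocation_dir, path))
    requested = {os.path.normpath(os.path.join(invocation_dir, f))
                 for f in requested_files}
    # Default rule: only files named on the command line are tagged.
    if path in requested:
        return True
    # Optional rule: also tag any header under the invocation directory.
    if non_system_headers:
        root = os.path.normpath(invocation_dir)
        return os.path.commonpath([root, path]) == root
    return False
```

With this shape, a header reached through an #include is skipped unless it was listed explicitly or the optional flag is set and the header lives under the invocation directory.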

In general the development was very straightforward, and clearly a tool
to index C++ with this accuracy wouldn't be feasible without clang. But
it still took far longer than I expected, and I'm beginning to
understand why we haven't seen more clang-based tools springing up.
(This is a fault of C++, not of clang! And I expect things will get
easier as more tooling is added, like the new support for compile
command databases.)

ipython (a python shell with tab-completion) is great for discovering
the libclang api.

Deployment is going to be difficult -- until libclang and its python
bindings are included in your system's clang packages, you'll have to
build clang from source. (A separate project, the clang_complete plugin
for vim, works around this by shipping a copy of cindex.py, but then you
have to make sure that your system's version of libclang matches
clang_complete's cindex.py.)

PERFORMANCE:

Running clang-ctags over the `lib` directory of the `clang`
source code (480 files totalling 470k lines of code) takes 37 minutes on
my 1.8GHz Intel Core i7. 98% of this time is the parsing done by
libclang itself. By comparison, GNU etags takes 0.5 *seconds* on the
same input.

CONCLUSION:

As a replacement for traditional ctags/etags, the disadvantages of
clang-ctags may outweigh the advantages. But it could be useful as
a base to build a more advanced indexing tool. :-)

Cheers
David Rothlisberger.

PERFORMANCE:

Running clang-ctags over the `lib` directory of the `clang`
source code (480 files totalling 470k lines of code) takes 37 minutes on
my 1.8GHz Intel Core i7. 98% of this time is the parsing done by
libclang itself. By comparison, GNU etags takes 0.5 *seconds* on the
same input.

Ouch. This is longer than compiling it!

My guess is that a couple of factors combine to cause the performance problem.

1. This is running completely single-threaded, which makes it a factor
of 2, 4, 8, etc. slower, depending on how many cores your CPU has.
2. You are actually processing things like includes and such, rather
than simply doing a linear grep-like scan of all the files as I
believe GNU etags does.

The second one is the real problem, and unfortunately, there doesn't
seem to be a way to circumvent it with your current architecture.

This seems like a job much better suited to a plugin
<http://clang.llvm.org/docs/ClangPlugins.html>. If you look at the
criteria which <http://clang.llvm.org/docs/Tooling.html> suggests for
how to decide how to use Clang to build tools, this seems the most
natural fit: one of the "canonical examples" it gives is "creating
additional build artifacts from a single compile step". It also says
to use plugins when you "need your tool to rerun if any of the
dependencies change". This seems like a natural fit for what you're
doing.

--Sean Silva

David Röthlisberger <david@rothlis.net> writes:

Announcing "clang-ctags", a libclang-based ctags implementation written in
python.

Source code: https://github.com/drothlis/clang-ctags

I also have a similar project, called clang-tags:

    https://github.com/drothlis/clang-tags

Mine is in C++, and records tagging information into a relational database, to
allow for compact, expressive queries about the entities you are looking for
(and their context, or relation to other entities).

Development is blocked on clang#13224 right now, but when it's unblocked I
plan on adding a very sexy Emacs interface for browsing your code based on that
information.

Like your tool, I rely on the compilation database, so building the index is
not a trivial (or a quick) thing. But once gathered, the information
available is very rich.

John Wiegley <johnw@boostpro.com> writes:

I also have a similar project, called clang-tags:

    https://github.com/drothlis/clang-tags

The gods of cut&paste have smited me. The real link is:

    https://github.com/jwiegley/clang-tags

It's not at 1.0 yet, I just wanted to let you know it existed and work has
been ongoing since the C++Now! conference.

Development is blocked on clang#13224 right now [...]

Could you please check that this still happens on the latest trunk? If
it still does, could you try to produce a reduced testcase?

--Sean Silva

Sean Silva <silvas@purdue.edu> writes:

Could you please check that this still happens on the latest trunk? If it
still does, could you try to produce a reduced testcase?

I would love to reduce it, but I'm not sure how. When it happens depends on
which machine I run it on, and it requires parsing several rather large C++
files before it happens at all. So, I'm not sure the case to reproduce will
ever be small, but it is consistent. That much is in the bug report.

If it consistently triggers inside that function, then maybe you could
single step it through that function? It's a bit tedious to do 5-11
times, but it will probably find the offending stray memory reference.

--Sean Silva

Sean Silva <silvas@purdue.edu> writes:

If it consistently triggers inside that function, then maybe you could
single step it through that function? It's a bit tedious to do 5-11 times,
but it will probably find the offending stray memory reference.

I built Clang with ASan, so I know where the reference happens (see bug
report), I just don't know why it's using a freed memory block.

I saw the bug report, but that gives only memory addresses and
instruction offsets for what the offending code is (`0x1048b50b1 in
(anonymous namespace)::ASTStatCache::getStat(char const*,
stat&, int*) (in clang-tags) + 977`); given the size of `getStat` and
a brief perusal of the source, it seems like at least one level of
inlining is involved. It would be immensely helpful if you could tie
that address back to the statement/expression which causes the
reference.

--Sean Silva

I saw the bug report, but that gives only memory addresses and
instruction offsets for what the offending code is (`0x1048b50b1 in
(anonymous namespace)::ASTStatCache::getStat(char const*,
stat&, int*) (in clang-tags) + 977`); given the size of `getStat` and
a brief perusal of the source, it seems like at least one level of
inlining is involved. It would be immensely helpful if you could tie
that address back to the statement/expression which causes the
reference.

Also, as noted on the bug, the Tooling should never pull in
ASTStatCache. Any pointers to how that might be instantiated would be
of great help.

Cheers,
/Manuel

I suspect this is what is happening:

  - Clang is loading a precompiled header, which wires a stat cache into the FileManager. That stat cache points into the mmap'd memory for the precompiled header.
  - That instance of the compiler completes, and everything goes away *except* the stat cache, since the FileManager is reused. We now have a stat cache in the FileManager that points at the location of previously-mmap'd memory for the precompiled header.
  - Later instances of the compiler wire more stat caches into the FileManager, and most lookups hit those earlier caches, so the problem doesn't reproduce easily.
  - Eventually, we have a cache miss in a later instance of the compiler, and the dangling pointer into the previously-mmap'd precompiled header ends up getting used after those addresses have been reused, and BOOM!
  
Solution: clear out the stat caches attached to the FileManager when re-using that file manager.
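
The sequence above can be mimicked with a toy model. This is only a sketch and not Clang's code: the class names are stand-ins, the lookup order (newest cache first) is an assumption, and a Python exception takes the place of what would be a read through a dangling pointer in C++.

```python
class PCHBuffer:
    """Stands in for the mmap'd memory of a precompiled header."""

    def __init__(self, stats):
        self.stats = stats
        self.mapped = True

    def unmap(self):
        self.mapped = False


class StatCache:
    """A cache whose entries live inside a PCHBuffer."""

    def __init__(self, buffer):
        self.buffer = buffer

    def lookup(self, path):
        if not self.buffer.mapped:
            # In C++ this would be a use of freed memory, not an exception.
            raise RuntimeError("stat cache read from unmapped PCH memory")
        return self.buffer.stats.get(path)


class FileManager:
    """Outlives individual compiler instances and keeps their caches."""

    def __init__(self):
        self.stat_caches = []

    def clear_stat_caches(self):
        self.stat_caches = []

    def stat(self, path):
        # Assumed order: newest cache first, older caches only on a miss.
        for cache in reversed(self.stat_caches):
            result = cache.lookup(path)
            if result is not None:
                return result
        return None


def run_compiler(fm, pch_stats, paths, clear_first=False):
    """One compiler instance: load a PCH, stat some files, tear down."""
    if clear_first:
        fm.clear_stat_caches()  # the proposed fix
    buffer = PCHBuffer(pch_stats)
    fm.stat_caches.append(StatCache(buffer))
    results = [fm.stat(p) for p in paths]
    buffer.unmap()  # everything goes away except the cache in fm
    return results
```

In this model a second run that only asks for files found in its own cache succeeds, which matches the observation that the problem is hard to reproduce. A run that misses falls through to the stale cache and fails, and passing `clear_first=True` avoids that.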

  - Doug

Thx for solving the mystery :-) Do you have a pointer (example test or
something) on what's the best way to create a precompiled header for a
small test?

Cheers,
/Manuel

There are a bunch of tests in test/PCH that do this, but they're based on clang -cc1, which tooling is not. Instead, just use the driver-level options:

  clang -x c++-header foo.h -o foo.h.pch

to create the PCH and

  clang -include foo.h foo.cpp

to use that PCH file.

  - Doug

Perfect. Thanks!

Here's something I found quite interesting:

When I run clang-ctags over clang/lib/Analysis/AnalysisDeclContext.cpp
it reports† that the call to clang_parseTranslationUnit takes 6.4s.

Giving the same compilation command directly to clang++ takes 1.4s.

This is a huge discrepancy, given that clang_parseTranslationUnit is
only doing parsing, whereas clang++ is parsing *and* compiling! Even
with -O3 clang++ only takes 1.6s. What can explain this?

† Running clang-ctags with "--verbose" prints the compilation command as
well as how long the call to clang_parseTranslationUnit took.

Perhaps it's building a precompiled preamble, which is only worthwhile if you're going to reparse the same file many times. Try setting the environment variable LIBCLANG_TIMING=1 to see if anything odd shows up in the internal timing log; if not, please grab a profile!

  - Doug

Oh dear... this is terribly embarrassing, but I was comparing a debug
build of libclang against a release build of clang++.

It's good news for clang-ctags, though. With a release build,
clang-ctags indexes clang/lib/**/*.[ch]* in 4.3 minutes (instead of the
37 minutes I had previously reported).

clang's parsing accounts for 72% of this time; of the remaining 28%,
maybe the overhead of python is more significant than I expected (but
I'm not going to re-implement in C++ to find out). :-)
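
For reference, the figures quoted in this thread work out as follows. The sketch only does arithmetic on numbers already given (37 minutes, 4.3 minutes, 72%, and 0.5 seconds for GNU etags); the function name `summarize` is made up for the example.

```python
def summarize(debug_min=37.0, release_min=4.3, parse_fraction=0.72,
              etags_seconds=0.5):
    """Derive the ratios implied by the timings reported in the thread."""
    parse_min = release_min * parse_fraction
    return {
        # Release build of libclang versus the earlier debug build.
        "speedup_over_debug": round(debug_min / release_min, 1),
        # Minutes spent inside clang's parser.
        "parse_minutes": round(parse_min, 1),
        # Minutes spent everywhere else (Python overhead included).
        "other_minutes": round(release_min - parse_min, 1),
        # Release-build run time relative to GNU etags.
        "times_slower_than_etags": round(release_min * 60 / etags_seconds),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(name, value)
```

So the release build is roughly 8.6 times faster than the debug build, about 3.1 of the 4.3 minutes go to parsing, and the run is still around 516 times slower than GNU etags on the same input.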

As Sean Silva pointed out, run time could be further improved with
parallelization: If clang-ctags supported multiple processes writing to
the same tag file, one could use GNU parallel instead of xargs:

    find lib -name '*.[ch]*' | parallel clang-ctags --append ...
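
An alternative that avoids several processes writing to one tag file is to fan the files out to workers and let a single parent collect the results. The following is only a sketch of that idea: `index_file` is a hypothetical stand-in for "parse one file and return its tag text", not a function in clang-ctags.

```python
from concurrent.futures import ThreadPoolExecutor


def index_file(path):
    """Hypothetical worker: would parse `path` and return its tag text."""
    return "tags for %s\n" % path


def index_in_parallel(paths, worker=index_file, max_workers=4,
                      executor_cls=ThreadPoolExecutor):
    """Run `worker` over `paths` concurrently and join the results."""
    with executor_cls(max_workers=max_workers) as pool:
        # map() yields results in input order, so the output is stable.
        return "".join(pool.map(worker, paths))


if __name__ == "__main__":
    print(index_in_parallel(["a.cpp", "b.cpp"]), end="")
```

For CPU-bound parsing, passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` is the more likely choice, since only the parent process then writes the tag file.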

Cheers,
Dave.

I was aiming for a tool that was as similar as possible to existing
ctags implementations. I also wanted clang-ctags to be as easy as
possible for the user, and having to integrate it into your build system
sounded complicated. (As it turns out, *not* integrating with the build
system might be the more complicated way!)

Thanks for your feedback!

--Dave

This sounds like an excellent start on a very useful tool.

Applying this technology to cscope, which has a _lot_ of trouble
with complicated C and C++ source code, would be a definite step up.

Same goes for 'indent' and the various source code colorizers.

- Gary