Clang-based indexer and code navigator

Hi,

I've been working on a source code navigation tool for C and C++ that uses Clang to index code. I thought people on this list might be interested in it.

I posted a demo of the tool at http://rprichard.github.com/sourceweb, and the source code is available at the GitHub page,
https://github.com/rprichard/sourceweb.

-Ryan

Looks great from the demo images. Haven't tried it though.
Do I need to reindex everything if I make a minor change in a file, or does it track dependencies and patch the index appropriately?

There is an --incremental option to sw-clang-indexer. It doesn't patch the index, but it avoids reindexing every translation unit when the project is reindexed. With that option, the indexer saves an index file for each translation unit and avoids reindexing a translation unit if none of its source files have changed. The indexer still merges all of the translation units' indices into a single index. On my machine, for LLVM+Clang, --incremental reduces the reindexing time from about 5.5 minutes to 16 seconds. It has a few limitations:

  - It doesn't keep track of compiler options, so if the options change, it can incorrectly reuse a translation unit's index file.

  - Keeping around index files for each translation unit uses much more disk space. While the index for LLVM+Clang is about 160MB, the intermediate index files total 2.8GB. I believe the difference in sizes is due to header files. (The per-translation-unit indices are already partitioned by source file to reduce merge time; placing each partition into a separate file could theoretically reduce the disk usage.)

  - If there are somehow multiple translation units in compile_commands.json with the same object file output, then the indexer may behave unpredictably, because it may spawn a subprocess that overwrites an index file while simultaneously reading the index file. This should be easy to fix.
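A minimal sketch of the reuse check described above, not the actual sw-clang-indexer implementation: a translation unit is reindexed only when a saved fingerprint no longer matches. The function names and the stamp-file idea are hypothetical. Unlike the behaviour described in the first limitation, this sketch also folds the compiler arguments into the fingerprint.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(source_files, compiler_args):
    """Hash the compiler arguments plus the path and contents of each file."""
    digest = hashlib.sha256()
    digest.update(json.dumps(list(compiler_args)).encode("utf-8"))
    for path in sorted(str(p) for p in source_files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def needs_reindex(stamp_path, source_files, compiler_args):
    """Return True if the saved fingerprint is missing or out of date."""
    stamp = Path(stamp_path)
    current = fingerprint(source_files, compiler_args)
    return not stamp.exists() or stamp.read_text() != current


def mark_indexed(stamp_path, source_files, compiler_args):
    """Record the fingerprint after a translation unit has been indexed."""
    Path(stamp_path).write_text(fingerprint(source_files, compiler_args))
```

Here `source_files` is assumed to be the full list of files (main file plus headers) that the translation unit read the last time it was indexed.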

-Ryan

Hi

I recognize the Qt Creator style.

This looks a bit like what I did. But for the web:
demo:
http://code.woboq.org/userspace/llvm/tools/clang/tools/driver/driver.cpp.html?style=qtcreator#main
source on github: https://github.com/woboq/woboq_codebrowser

Did you know that QtCreator also tried to use clang as a C++ model:
https://blog.qt.digia.com/blog/2011/10/19/qt-creator-and-clang/
But AFAIK, they found it was too slow, so they kept maintaining their own C++
parser.

Indeed!

In r176848 I added a link to SourceWeb to
<http://clang.llvm.org/docs/ExternalClangExamples.html>.

We seem to have a proliferation of Clang-based source navigators (DXR,
woboq, and now SourceWeb). It would be really great if you could get
in contact with the authors of these other tools (I see Oliver Goffart
is already in this thread) and discuss what kinds of issues you ran
into when developing the respective tools (you probably want to
contact jcranmer for DXR); if there appear to be some common issues,
it would make sense to get that information upstream so that we can
fix/improve the situation. It would probably be sufficient to send a
mail to cfe-dev CC'ing them.

Also, if there is enough common ground between your objectives, it
would be really cool if we could pool effort and develop a solution on
trunk in clang-tools-extra: and then dogfood it! I've been wanting a
better source navigator than Doxygen's source listings for a while
now, and I think it would be appreciated.

-- Sean Silva

That's a great demo, a great teaser to draw people's attention -- people
want to know if it's going to be worth their time to install your tool,
and I think the demo helps a lot there. The installation & usage
instructions in the readme are great. Your "sw-btrace" tool looks
generally useful to other clang-based projects; we need to advertise
"this is how you generate a compile_commands.json without CMake".

What follows are a few random thoughts in no particular order:

Over the years I've tried a variety of code browsing tools (for C++ and
other languages too) but I never stuck with any of them for one simple
reason: They weren't integrated into my IDE. If I'm looking at a file I
don't want to switch to a different tool, find that file again, and
browse from there; then when I find the target, switch back to my IDE
and find *that* file. I use the term "IDE" loosely here; in my case it's
Emacs.

What we really need is a daemon that maintains an index database and
services requests from the IDE. A while ago Chandler Carruth posted a
design proposal for a "clang service daemon":
https://github.com/chandlerc/llvm-designs/blob/master/ClangService.rst

Anders Bakken wrote "rtags": https://github.com/Andersbakken/rtags/
which uses such an architecture, with (so far) a client for Emacs. It
doesn't use the compile_commands.json at all, but provides a wrapper
that you place on the PATH before cc, g++, etc. Once rtags knows about
a file, it monitors the file for changes using inotify/kqueue.
Rtags will group source files into "projects" based on some heuristics.

Another option is to use clang to generate a database in a format
understood by existing tools. I'm thinking particularly of "cscope" here
(but I don't know the cscope database's capabilities in any detail).
That way you only need to write the indexer, and you get front-end
interfaces for several major editors for free.

As you know, the clang C++ API is unstable. What is your plan for
maintaining sourceweb? Have you evaluated the stable C api ("libclang")?
Is there anything missing from libclang that prevented you from using
it? If the missing areas aren't huge, perhaps effort would be better
spent improving the libclang API than keeping an out-of-tree tool in
sync with the C++ API.

If libclang is an option, then you can use its python bindings. I can't
help feeling that C++ is the wrong language for this kind of thing. 10k
lines is quite a lot of code. (But maybe I'm entirely mistaken here
about the capabilities of libclang.)

I realise this is all sounding rather negative and for that I apologise.
Sourceweb is certainly far more polished and feature-full than my own
attempt at solving C++ indexing (
https://github.com/drothlis/clang-ctags ). I'm just trying to share what
I, personally, look for in such a tool, for whatever that's worth. :-)

Dave.

Hi,

Over the years I've tried a variety of code browsing tools (for C++ and
other languages too) but I never stuck with any of them for one simple
reason: They weren't integrated into my IDE. If I'm looking at a file I
don't want to switch to a different tool, find that file again, and
browse from there; then when I find the target, switch back to my IDE
and find *that* file. I use the term "IDE" loosely here; in my case it's
Emacs.

I fully agree with this point: the biggest problem with a separate code
browser is that I often want to "compare" where I'm working with some
browsed place, which means they both have to be either in the browser tool
or in my editor to minimise the disruption.

What we really need is a daemon that maintains an index database and
services requests from the IDE. A while ago Chandler Carruth posted a
design proposal for a "clang service daemon":
https://github.com/chandlerc/llvm-designs/blob/master/ClangService.rst

Note that _in purely practical terms_ what's most valuable is having a
"persistent, resumable, updatable" store of info that can be interrogated; I
don't know whether that's significantly less work than having an active
daemon. This comes from the following usage:

1. I rarely look up stuff in code I've written (with the one exception being
the possibility for a "semantically correct" search and replace). It's
mostly for code in the project that I didn't write.

2. Consequently the vast majority of times I want to update the store of
info are after a VC operation (pull, change branch, etc). These are the kind
of operations where it's possible to script things to kick off a rescan of
changed files (rather than the daemon having to detect changes).

3. Of course you might want to keep the process that answers questions alive
for a whole edit session for performance reasons.

This isn't to say a daemon wouldn't be good, just that you could (for my
usage) get 95% of the benefit without it.
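As a rough illustration of point 2, a post-pull script could ask the version control system what changed and select only the matching entries from the compilation database. This is a sketch under assumptions: it uses git, the function names are hypothetical, and it only notices changes to a translation unit's main file (header changes would need dependency information). `ORIG_HEAD` is set by operations such as pull and merge; after a plain branch switch a different old revision would have to be supplied.

```python
import os
import subprocess


def changed_files(repo_dir, old_rev="ORIG_HEAD", new_rev="HEAD"):
    """List files that differ between two revisions, as absolute paths.

    repo_dir is assumed to be the top-level directory of the repository,
    because git prints these names relative to it.
    """
    out = subprocess.run(
        ["git", "diff", "--name-only", old_rev, new_rev],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    root = os.path.abspath(repo_dir)
    return {os.path.normpath(os.path.join(root, line))
            for line in out.splitlines() if line}


def units_to_rescan(compile_db, changed):
    """Select compilation database entries whose main file changed."""
    selected = []
    for entry in compile_db:
        # "file" may be relative to "directory"; joining handles both cases.
        path = os.path.normpath(os.path.join(entry["directory"], entry["file"]))
        if path in changed:
            selected.append(entry)
    return selected
```

The selected entries would then be handed to whatever indexer is in use; the process answering queries can stay alive independently of this step.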

Cheers,
Dave

Sean Silva <silvas-olO2ZdjDehc3uPMLIKxrzw@public.gmane.org> writes:

[...]

We seem to have a proliferation of Clang-based source navigators (DXR,
woboq, and now SourceWeb). It would be really great if you could get
in contact with the authors of these other tools (I see Oliver Goffart
is already in this thread) and discuss what kinds of issues you ran
into when developing the respective tools (you probably want to
contact jcranmer for DXR); if there appear to be some common issues,
it would make sense to get that information upstream so that we can
fix/improve the situation. It would probably be sufficient to send a
mail to cfe-dev CC'ing them.

Agreed. I'm not sure that there will in fact be much overlap, but if
there is, it would be good to move it upstream. (I'm basing that on
playing with DXR's plugin, which seems fairly lightweight, as these
things go.)

One possibility (which doesn't apply to DXR, which is a compiler plugin)
is producing a compilation database. Using LD_PRELOAD and stuff feels
yucky, and cmake is (to me) useless (the projects I care about mostly
don't use cmake).

Can clang dump the relevant information in the right form? Feels like it
ought to be easy for it to do so, and that would surely be a clean way
to do it, presuming a project can be made to build with clang/clang++?
(Maybe it could be folded somehow with scan-build?)

Also, if there is enough common ground between your objectives, it
would be really cool if we could pool effort and develop a solution on
trunk in clang-tools-extra: and then dogfood it! I've been wanting a
better source navigator than Doxygen's source listings for a while
now, and I think it would be appreciated.

It would be neat to have that available alongside the doxygen API docs.

Sean Silva <silvas-olO2ZdjDehc3uPMLIKxrzw@public.gmane.org> writes:

[...]

> We seem to have a proliferation of Clang-based source navigators (DXR,
> woboq, and now SourceWeb). It would be really great if you could get
> in contact with the authors of these other tools (I see Oliver Goffart
> is already in this thread) and discuss what kinds of issues you ran
> into when developing the respective tools (you probably want to
> contact jcranmer for DXR); if there appear to be some common issues,
> it would make sense to get that information upstream so that we can
> fix/improve the situation. It would probably be sufficient to send a
> mail to cfe-dev CC'ing them.

Agreed. I'm not sure that there will in fact be much overlap, but if
there is, it would be good to move it upstream. (I'm basing that on
playing with DXR's plugin, which seems fairly lightweight, as these
things go.)

One possibility (which doesn't apply to DXR, which is a compiler plugin)
is producing a compilation database. Using LD_PRELOAD and stuff feels
yucky, and cmake is (to me) useless (the projects I care about mostly
don't use cmake).

Can you use ninja, which has recently grown the capability to throw out a
compilation database?

Can clang dump the relevant information in the right form? Feels like it
ought to be easy for it to do so, and that would surely be a clean way
to do it, presuming a project can be made to build with clang/clang++?
(Maybe it could be folded somehow with scan-build?)

Yes, it could be implemented in clang, but it's harder than doing it from a
build-aware tool, as you'll need some cross-process synchronization,
optimally in an OS-independent way...

Cheers,
/Manuel

Manuel Klimek <klimek-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:

[...]

One possibility (which doesn't apply to DXR, which is a compiler plugin)
is producing a compilation database. Using LD_PRELOAD and stuff feels
yucky, and cmake is (to me) useless (the projects I care about mostly
don't use cmake).

Can you use ninja, which has recently grown the capability to throw out a
compilation database?

I don't think so. We mostly use GNU Make, but also jam, scons (for some
bits of Swift), and bjam (for some boost libraries). Obviously
potentially we could use some unified (new) build system, but it doesn't
seem likely in the immediate future. What I have managed to do is get
DXR working (not trivial, since not all the builds allow overriding
CC/CXX).

Can clang dump the relevant information in the right form? Feels like it
ought to be easy for it to do so, and that would surely be a clean way
to do it, presuming a project can be made to build with clang/clang++?
(Maybe it could be folded somehow with scan-build?)

Yes, it could be implemented in clang, but it's harder than doing it from a
build-aware tool, as you'll need some cross-process synchronization,
optimally in an OS-independent way...

I assume because clang might be being run several times in parallel (by
the build tool)?

That feels fixable, or at least a partial result feels useful enough to
have even if it would fail if run in parallel: have it write the
relevant rule for building foo.o to foo.o.json or something. Wouldn't
that kind of approach work, at least to a first approximation (maybe
needing paths mangling before combining all the rules or something)?
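To make the foo.o.json idea concrete, here is a sketch of the combining step only. It assumes a hypothetical scheme in which each compile writes one JSON object next to its object file, using the same "directory", "command" and "file" keys as compile_commands.json. Clang did not write such fragments at the time of this thread, and the function name is invented for illustration.

```python
import json
from pathlib import Path


def merge_fragments(build_dir, output="compile_commands.json"):
    """Combine every *.o.json fragment under build_dir into one database."""
    build_dir = Path(build_dir)
    entries = []
    for fragment in sorted(build_dir.rglob("*.o.json")):
        entry = json.loads(fragment.read_text())
        # The "path mangling" step: make the source path absolute so that
        # consumers do not have to guess the working directory.
        entry["file"] = str(Path(entry["directory"]) / entry["file"])
        entries.append(entry)
    (build_dir / output).write_text(json.dumps(entries, indent=2))
    return entries
```

Because each compiler process writes only its own fragment, no cross-process synchronization is needed during the build; the merge runs once afterwards.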

[...]

Hi,

I've been working on a source code navigation tool for C and C++ that uses
Clang to index code. I thought people on this list might be interested in
it.

Indeed!

In r176848 I added a link to SourceWeb to
<http://clang.llvm.org/docs/ExternalClangExamples.html>.

We seem to have a proliferation of Clang-based source navigators (DXR,
woboq, and now SourceWeb). It would be really great if you could get
in contact with the authors of these other tools (I see Oliver Goffart
is already in this thread) and discuss what kinds of issues you ran
into when developing the respective tools (you probably want to
contact jcranmer for DXR); if there appear to be some common issues,
it would make sense to get that information upstream so that we can
fix/improve the situation. It would probably be sufficient to send a
mail to cfe-dev CC'ing them.

I already watch cfe-dev and generally perk up anytime someone mentions indexing, navigation, or documentation. :-)

Also, if there is enough common ground between your objectives, it
would be really cool if we could pool effort and develop a solution on
trunk in clang-tools-extra: and then dogfood it! I've been wanting a
better source navigator than Doxygen's source listings for a while
now, and I think it would be appreciated.

I've been planning to see if I can get an up-to-date copy of clang/llvm source code hosted at dxr.mozilla.org, but that is blocked on deployment problems.

I've also been playing with doxygen-like documentation output, and was planning on sending feedback with all the issues I have that clang could easily fix once I had a production-ready version of it.

Cool, thanks for the update. That should be neat.

-- Sean Silva

Manuel Klimek <klimek-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:

[...]

>> One possibility (which doesn't apply to DXR, which is a compiler plugin)
>> is producing a compilation database. Using LD_PRELOAD and stuff feels
>> yucky, and cmake is (to me) useless (the projects I care about mostly
>> don't use cmake).
>>
>
> Can you use ninja, which has recently grown the capability to throw out a
> compilation database?

I don't think so. We mostly use GNU Make, but also jam, scons (for some
bits of Swift), and bjam (for some boost libraries). Obviously
potentially we could use some unified (new) build system, but it doesn't
seem likely in the immediate future. What I have managed to do is get
DXR working (not trivial, since not all the builds allow overriding
CC/CXX).

>> Can clang dump the relevant information in the right form? Feels like it
>> ought to be easy for it to do so, and that would surely be a clean way
>> to do it, presuming a project can be made to build with clang/clang++?
>> (Maybe it could be folded somehow with scan-build?)
>>
>
> Yes, it could be implemented in clang, but it's harder than doing it from a
> build-aware tool, as you'll need some cross-process synchronization,
> optimally in an OS-independent way...

I assume because clang might be being run several times in parallel (by
the build tool)?

That feels fixable, or at least a partial result feels useful enough to
have even if it would fail if run in parallel: have it write the
relevant rule for building foo.o to foo.o.json or something. Wouldn't
that kind of approach work, at least to a first approximation (maybe
needing paths mangling before combining all the rules or something)?

It would most certainly work. Somebody needs to put together a patch, and
send it out :-)

Cheers,
/Manuel

Hi,

I've been working on a source code navigation tool for C and C++ that uses
Clang to index code. I thought people on this list might be interested in
it.

That's a great demo, a great teaser to draw people's attention -- people
want to know if it's going to be worth their time to install your tool,
and I think the demo helps a lot there. The installation & usage
instructions in the readme are great. Your "sw-btrace" tool looks
generally useful to other clang-based projects; we need to advertise
"this is how you generate a compile_commands.json without CMake".

The demo is particularly useful because the tool doesn't have a web interface.

sw-btrace is similar to the bear tool, which was also announced on this list.

What follows are a few random thoughts in no particular order:

Over the years I've tried a variety of code browsing tools (for C++ and
other languages too) but I never stuck with any of them for one simple
reason: They weren't integrated into my IDE. If I'm looking at a file I
don't want to switch to a different tool, find that file again, and
browse from there; then when I find the target, switch back to my IDE
and find *that* file. I use the term "IDE" loosely here; in my case it's
Emacs.

Makes sense. I think the idea is that the index and the IDE/GUI should be independent of one another?

What we really need is a daemon that maintains an index database and
services requests from the IDE. A while ago Chandler Carruth posted a
design proposal for a "clang service daemon":
https://github.com/chandlerc/llvm-designs/blob/master/ClangService.rst

Anders Bakken wrote "rtags": https://github.com/Andersbakken/rtags/
which uses such an architecture, with (so far) a client for Emacs. It
doesn't use the compile_commands.json at all, but provides a wrapper
that you place on the PATH before cc, g++, etc. Once rtags knows about
a file, it monitors the file for changes using inotify/kqueue.
Rtags will group source files into "projects" based on some heuristics.

I hadn't seen rtags before. I might take a look at it sometime.

Another option is to use clang to generate a database in a format
understood by existing tools. I'm thinking particularly of "cscope" here
(but I don't know the cscope database's capabilities in any detail).
That way you only need to write the indexer, and you get front-end
interfaces for several major editors for free.

As you know, the clang C++ API is unstable. What is your plan for
maintaining sourceweb? Have you evaluated the stable C api ("libclang")?
Is there anything missing from libclang that prevented you from using
it? If the missing areas aren't huge, perhaps effort would be better
spent improving the libclang API than keeping an out-of-tree tool in
sync with the C++ API.

I don't really have a plan, other than hope that the Clang API doesn't change too quickly. So far, the code has gone through one API transition, from 3.1 to 3.2, and the changes were minimal:

https://github.com/rprichard/sourceweb/commit/4183ca758d602e775af0a679d15bc84a3f37f287

I was using libclang at one point. IIRC, I switched away from it because I was having trouble getting some information I wanted out of it. In particular, I wanted to classify different kinds of references, e.g.:
  - For a function: is this a declaration, a definition, a call, or an address-of operation?
  - For a variable: is this a declaration, a definition, a read, a write, or an address-of operation?

I'm open to switching back to libclang, though, but that would depend on having enough time to work on the project.

If libclang is an option, then you can use its python bindings. I can't
help feeling that C++ is the wrong language for this kind of thing. 10k
lines is quite a lot of code. (But maybe I'm entirely mistaken here
about the capabilities of libclang.)

I realise this is all sounding rather negative and for that I apologise.
Sourceweb is certainly far more polished and feature-full than my own
attempt at solving C++ indexing (
https://github.com/drothlis/clang-ctags ). I'm just trying to share what
I, personally, look for in such a tool, for whatever that's worth. :-)

Thanks for your feedback.

-Ryan

hi Bruce,

You might have seen my tool, which tries to address the compilation database problem (in case you missed it: <https://github.com/rizsotto/Bear>). It uses LD_PRELOAD to catch the compiler calls. Now I am wondering what 'feels yucky' means. What other, more technical, points do you have against it? ;-) I tested it against scons, GNU make, qmake, cmake, and bash, and it works reliably in most cases. On Solaris/BSD systems you could use DTrace, which also captures all exec calls, more easily. But that's another platform-specific solution.

My conclusion at the time was that I could either write an OS-specific solution that works with any build system, or a build-tool-specific solution that works on every OS. Since I'm interested in sources that are compiled on Linux, I went for the LD_PRELOAD trick.

I have the feeling that putting this kind of code into Clang would not solve the problem at all, but would make the Clang driver itself more complex. You still need to teach your build system to use Clang. And since you were able to do that, you can write a fake compiler that only emits a message about its command-line arguments and generates a fake object file. (Of course, you need to write fake ar/ld commands as well.) More importantly, you need a process that collects these messages and formats them into a JSON file. (By the way, this is exactly what the LD_PRELOAD solution does, except that there is no need for a fake compiler/linker, and no need to put code into Clang.)
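The fake-compiler approach described here can be sketched as follows. This is illustrative only and is not how Bear works internally; the function names, the log-file layout, and the suffix list are assumptions. A wrapper placed in front of the real compiler would call `record_invocation` with its own argv, and a separate step would call `collect` after the build. Creating the fake object file and the fake ar/ld commands are left out.

```python
import json
import shlex

# Hypothetical list of suffixes used to recognise source files in argv.
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx")


def record_invocation(argv, cwd, log_path):
    """Append one JSON line per source file named on the command line."""
    with open(log_path, "a", encoding="utf-8") as log:
        for arg in argv[1:]:
            if arg.endswith(SOURCE_SUFFIXES):
                entry = {
                    "directory": cwd,
                    "command": shlex.join(argv),
                    "file": arg,
                }
                log.write(json.dumps(entry) + "\n")


def collect(log_path):
    """Turn the line-oriented log into a compilation database list."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

A real tool would also have to cope with several compiler processes writing to the log at the same time during a parallel build, which this sketch does not address.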

Regards,
Laszlo

You might have seen my tool, which tries to address the compilation database problem (in case you missed it: <https://github.com/rizsotto/Bear>). It uses LD_PRELOAD to catch the compiler calls. Now I am wondering what 'feels yucky' means. What other, more technical, points do you have against it? ;-) I tested it against scons, GNU make, qmake, cmake, and bash, and it works reliably in most cases. On Solaris/BSD systems you could use DTrace, which also captures all exec calls, more easily. But that's another platform-specific solution.

My conclusion at the time was that I could either write an OS-specific solution that works with any build system, or a build-tool-specific solution that works on every OS. Since I'm interested in sources that are compiled on Linux, I went for the LD_PRELOAD trick.

Ryan Prichard's "sw-btrace" is similar to "bear" but supports OS X &
FreeBSD as well as Linux. "bear" is already mentioned in
http://clang.llvm.org/docs/JSONCompilationDatabase.html -- we could also
add a mention of "sw-btrace".

I have the feeling that putting this kind of code into Clang would not solve the problem at all, but would make the Clang driver itself more complex. You still need to teach your build system to use Clang. And since you were able to do that, you can write a fake compiler that only emits a message about its command-line arguments and generates a fake object file. (Of course, you need to write fake ar/ld commands as well.) More importantly, you need a process that collects these messages and formats them into a JSON file. (By the way, this is exactly what the LD_PRELOAD solution does, except that there is no need for a fake compiler/linker, and no need to put code into Clang.)

Maybe not add this to the Clang binary itself, but add a "bear" /
"sw-btrace" tool to the clang repository? I think it would be nice to
have such tools available directly from the clang project, instead of
having each clang-based tool invent its own or depend on yet another
project.

One benefit of this approach is that it gives the clang project the
flexibility to change the compilation database format without the fear
of breaking all these other tools. (There's still CMake, though.)

Laszlo Nagy
<rizsotto.mailinglist-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

hi Bruce,

You might have seen my tool, which tries to address the compilation
database problem (in case you missed it: <https://github.com/rizsotto/Bear>).
It uses LD_PRELOAD to catch the compiler calls. Now I am wondering what
'feels yucky' means. What other, more technical, points do you have
against it? ;-)

I guess I was assuming there would have to be some system-specific
nastiness in order to get the appropriate information from the compiler
processes, but actually (now I look at it) the code looks really clean
(possibly because it's system-specific, but I'm using GNU/Linux so I
don't care about that). It happens to break for one case which I care
about (I've filed an issue).

[...]

I have the feeling that putting this kind of code into Clang would not solve
the problem at all, but would make the Clang driver itself more complex...

It feels like it would add a teeny bit of complexity, yes.

You still need to teach your build system to use Clang. And since you
were able to do that, you can write a fake compiler that only emits
a message about its command-line arguments and generates a fake
object file. (Of course, you need to write fake ar/ld commands as well.)
More importantly, you need a process that collects these messages and
formats them into a JSON file.

Yes, maybe that's a better approach for what I was thinking about. That
would be even closer to the scan-build type approach. scan-build runs
the clang analyser and a real compiler, but it could just as well run
such a trivial tool and the real compiler.

In practical terms I feel having clang do it would be easier for me
because making the project compilable by clang (or other things) is
already a valuable goal (I want to be able to run the clang analyser,
apart from anything else).

[...]

Doesn’t the analyzer already collect all the information needed for building the compilation database? Can we just reuse that infrastructure?

– Sean Silva

The problem is that there are no mature and established tools for
generating the database. As the tools improve, the problem will solve
itself; there's no reason to bring them into clang really since the major
problem is just developing mature tools in the first place. Having a
"standardized" format for the compilation database decouples this
development from clang itself.

-- Sean Silva

I really don't understand the hesitation to build compilation database
generation into Clang.

Pretty much any Clang tool you want to run on your codebase is going
to require your codebase is Clang-clean to begin with & likely the
only way you're going to get there is by integrating Clang with your
build (OK, so you could do syntax-only stuff with Clang tools in which
case you'd never have to build your project with Clang - just parse it
- but that seems like the far less common use case for Clang & its
tools). With that in mind, why wouldn't we generate a compilation
database from Clang itself? Why would we ask users to use a specific
build system to generate a file so they can use Clang tools? That
seems like a bizarre user experience.

Obviously for distributed builds that gets a bit tricky/unrealistic -
and the ability for complex build systems to generate this file seems
not inappropriate, but I'm not sure why it's being suggested that it's
the necessary/common/expected scenario.

(even if the first pass of such support would be non-parallelizable
most build systems can be forcibly run in series to avoid any
filesystem race issues - then if/when someone has the time they can
build the necessary OS locking, etc, to do it safely concurrently)