Announcing Crange

Announcing Crange: https://github.com/crange/crange

Summary

The source metadata collected by Crange can help with building tools
to provide cross-referencing, syntax highlighting, code folding and
deep source code search.

Hi Anurag,

This sounds like a very useful tool. We did something similar (Clang,
SQLite, cross-references) but never got around to releasing anything.

Running crtags on Linux kernel v3.13.5 sources (containing 45K files,
size 614MB) took a little less than 7 hours (415m10.974s) on 32 CPU
Xeon server with 16GB of memory and 32 jobs. The generated tags.db
file was 22GB in size and contained 60,461,329 unique identifiers.

Did you compare against CTags/ETags? IDE indexing services (Eclipse,
CodeBlocks, etc.)?

I know the depth you get with Clang is orders of magnitude more than
ctags, but ctags finishes all my projects in a couple of minutes. I'd
only accept 28 hours (I only have 8 CPUs) if it did my morning coffee
and toast, too. :)

I'm also guessing that this is a full run, and that when files change,
a partial run (only regarding the changed files) would run a lot
faster.

Also, for comparison, I used to have a ctags re-run once files were
saved on the disk when my editor was spawned (part of the start
script) and it would always take seconds. I think it's only when it
gets to that level that people will actually look more seriously
towards your tool...

cheers,
--renato

Yes, this looks interesting.

As Renato said, the run-time is critical. Some people may suggest implementing this in C/C++, but starting with Python is probably a good choice to try this out and also to understand the performance implications.

Some ideas:

  - You may want to look at the compilation database support in
    libclang to retrieve the set of files to process as well as
    the corresponding command lines.

  - I wonder how quick querying the database is. In fact,
    if those queries are quick (less than 50ms) even for big
    databases, this would be extremely interesting: in an editor,
    for example, you could add missing includes as you type.

Ah, and it actually only works if I don't use this worker stuff, but
apply this patch:

- pool.map(worker, worker_params)
+
+ for p in worker_params:
+     worker(p)

Hi Anurag,
Last time I looked, the Linux kernel was written in C. And I could use
cscope to create a cross-reference database of the entire Linux kernel
in about 10 minutes, using far fewer resources than you used. Admittedly,
cscope gets confused on a few of the complex MACRO usages. But cscope is
a very popular tool for linux kernel development.

I suggest you consider the current state of your work a proof of concept
and think about how to improve your performance by more than an order of
magnitude.

enjoy,
Karen

DXR produces the database for its indexing on mozilla-central in about 3-4 hours on a 4-core machine, requiring about 6GB of memory. Mozilla-central is perhaps 94K files and around 1GB in size, although there is a great deal of non-C/C++ code in that repository. (And one of the slowest steps is that we pre-generate the entire HTML, which easily adds 20-30 minutes to the total time.) There definitely seems to be a lot of scope to improve the performance of your tool.

I don't expect the db to be the bottleneck here. If you plan your
table reasonably well (and this usage is *very* simple for a
relational database), queries should return results in milliseconds
for almost all queries, including inserts and updates.

However, during the parse, because you'll be listing all
relationships, if you insert one at a time, preparing the queries will
amount to a significant portion of the run-time. There are ways of
inserting blocks of data on the same query that you might consider, or
even creating a CSV and importing (MySQL could do that, not sure about
SQLite).

But after the database is set up, updating and selecting will be trivial.

cheers,
--renato

In https://github.com/nico/complete , I think I went with a plugin that does tags in parallel with regular compilation, and the plugin added about 10% to build time. Things like "pragma synchronous = off" and "pragma journal_mode = memory" helped quite a bit, IIRC. The plugin code is somewhere in the server/ folder. I haven't touched that project in a long time, but maybe you can crib some stuff from it.
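For reference, those pragmas can be applied from Python's stdlib sqlite3 module before bulk inserts begin. This is a minimal sketch; the table layout is illustrative, not Crange's actual schema:

```python
import sqlite3

def open_tags_db(path):
    """Open a tags database with durability traded for write speed.

    PRAGMA synchronous = OFF skips fsync() on writes, and
    PRAGMA journal_mode = MEMORY keeps the rollback journal in RAM.
    Both speed up heavy bulk inserts, at the cost of possible
    corruption if the machine crashes mid-index (acceptable for a
    regenerable tags database).
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA journal_mode = MEMORY")
    return conn

conn = open_tags_db(":memory:")
conn.execute("CREATE TABLE tags (name TEXT, file TEXT, line INTEGER)")
conn.execute("INSERT INTO tags VALUES ('main', 'main.c', 1)")
```

Since a crash only costs a re-index, trading durability for insert throughput is usually the right call for this kind of tool.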

Oh, yes, now I remember. If you do it that way, you can accurately map
which macros and paths the compiler is using, given the
options you used.

In our case, because we also wanted to use our Clang tool on code
compiled by other compilers, we added a wrapper to collect the command
line options for each individual file and passed those options to
Clang when parsing the files again. Of course, it worked better with
Clang/GCC compilations.

cheers,
--renato

Hello Renato,

Did you compare against CTags/ETags? IDE indexing services (Eclipse,
CodeBlocks, etc.)?

I know the depth you get with Clang is orders of magnitude more than
ctags, but ctags finishes all my projects in a couple of minutes. I'd
only accept 28 hours (I only have 8 CPUs) if it did my morning coffee
and toast, too. :)

I did explore Ctags, GNU Global and Cscope before going on to
reinvent. While ctags is the gold standard when it comes to the speed
of generating a tags database, Global and Cscope are also quite good at
generating xrefs. However, given the level of detail exposed by clang
nodes, I wanted to go a step further and index included files, source
ranges, statements, operators, types, etc. too.

Perhaps, I should update the messaging a bit and position this tool
differently. I don't believe Crange is a replacement for
Global/Cscope.

Python doesn't really seem to be a good choice of language to
implement this in. I suspect Python's multiprocessing.Queue is
spending too much time in mutex locks. 28 hours for indexing is
definitely not encouraging.

I'm also guessing that this is a full run, and that when files change,
a partial run (only regarding the changed files) would run a lot
faster.

Also, for comparison, I used to have a ctags re-run once files were
saved on the disk when my editor was spawned (part of the start
script) and it would always take seconds. I think it's only when it
gets to that level that people will actually look more seriously
towards your tool...

Running partial database updates is on my 'must have' list. I'm
planning to index each file's last-modified time as well, and skip the
ones that didn't change.
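The skip-unchanged idea can be sketched with stdlib mtime checks. The function name and the recorded_mtimes mapping here are hypothetical; a real version would read the stored mtimes out of tags.db:

```python
import os

def files_to_reindex(paths, recorded_mtimes):
    """Return only the files whose on-disk mtime differs from the one
    recorded at the last indexing run.

    recorded_mtimes maps path -> float (as stored in the tags
    database); files missing from it are treated as new and always
    re-indexed.
    """
    changed = []
    for path in paths:
        mtime = os.path.getmtime(path)
        if recorded_mtimes.get(path) != mtime:
            changed.append(path)
    return changed
```

Note that mtime comparison can miss edits that preserve the timestamp; hashing file contents is the safer (but slower) variant.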

Thanks for your feedback, Renato.

Hello Tobias,

As Renato said, the run-time is critical. Some people may suggest
implementing this in C/C++, but starting with Python is probably a good
choice to try this out and also to understand the performance implications.

Implementing this in python did turn out to be a good learning
experience. I wouldn't have come this far, had I decided to stick to
C++.

Some ideas:

        - You may want to look at the compilation database support in
          libclang to retrieve the set of files to process as well as
          the corresponding command lines

Compilation database support is on the top of my 'must have' list. To
keep things simple for now, I merely search for all C/C++ source and
header files and parse them, thereby losing command line includes and
flags.
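For what it's worth, a compilation database is just a compile_commands.json file, so even before wiring up libclang's CompilationDatabase API, its entries can be read with the stdlib. A sketch; the field names ('file', 'directory', 'command'/'arguments') follow Clang's JSON Compilation Database format:

```python
import json
import shlex

def load_compile_commands(path):
    """Parse compile_commands.json into (file, argv) pairs.

    Each entry carries 'file', 'directory', and either 'command'
    (a single shell string) or 'arguments' (an argv list); older
    generators emit 'command', so split it with shlex.
    """
    with open(path) as fp:
        entries = json.load(fp)
    result = []
    for entry in entries:
        argv = entry.get("arguments") or shlex.split(entry["command"])
        result.append((entry["file"], argv))
    return result
```

Feeding those argv lists back into clang when parsing would recover the -I and -D flags that a plain file-system walk loses.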

        - I wonder how quick querying the database is. In fact,
          if those queries are quick (less than 50ms) even for big
          databases, this would be extremely interesting: in an editor,
          for example, you could add missing includes as you type.

Identifier lookups are quite fast. For the 22GB tags database I tested
earlier, I usually get ~60ms times. For reference lookups I think I've
got bad SQL, since it's taking a few seconds to respond.
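A full-table scan is the usual cause of multi-second lookups; an index on the queried column typically brings them back to milliseconds. A sketch with a hypothetical refs table (not Crange's actual schema), using EXPLAIN QUERY PLAN to show the difference:

```python
import sqlite3

def reference_lookup_plan(conn, name):
    """Return SQLite's query-plan detail string for a by-name lookup."""
    row = conn.execute(
        "EXPLAIN QUERY PLAN SELECT file, line FROM refs WHERE name = ?",
        (name,),
    ).fetchone()
    return row[-1]  # last column is the human-readable plan detail

conn = sqlite3.connect(":memory:")
# Hypothetical xref table -- the real tags.db layout may differ.
conn.execute("CREATE TABLE refs (name TEXT, file TEXT, line INTEGER)")
conn.executemany(
    "INSERT INTO refs VALUES (?, ?, ?)",
    (("sym%d" % i, "a.c", i) for i in range(10000)),
)

# Without an index, every lookup scans the whole table...
before = reference_lookup_plan(conn, "sym42")
# ...and with one, it becomes a B-tree search.
conn.execute("CREATE INDEX refs_by_name ON refs (name)")
after = reference_lookup_plan(conn, "sym42")
```

Running EXPLAIN QUERY PLAN on the actual slow reference query should make the offending scan obvious.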

Ah, and it actually only works if I don't use this worker stuff, but
apply this patch:

- pool.map(worker, worker_params)
+
+ for p in worker_params:
+     worker(p)

Ah, it seems like there's a bug in my code. It doesn't spawn the
multiprocessing pool in some cases. I'll have a closer look at this.
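One robust pattern is to fall back to a serial loop when the pool can't be used; note that the worker must be a module-level function so it can be pickled by multiprocessing. A hypothetical sketch (worker here is a stand-in, not crtags' real per-file job):

```python
import multiprocessing

def worker(path):
    """Stand-in for the per-file indexing job (hypothetical)."""
    return (path, len(path))

def index_files(paths, jobs=4):
    """Map worker over paths, serially when jobs <= 1 or when a pool
    cannot be created. Keeping worker at module level (not nested or
    a lambda) avoids the silent pickling failures that make pool.map
    appear to do nothing.
    """
    if jobs > 1:
        try:
            pool = multiprocessing.Pool(jobs)
        except OSError:
            return [worker(p) for p in paths]
        try:
            return pool.map(worker, paths)
        finally:
            pool.close()
            pool.join()
    return [worker(p) for p in paths]
```

Passing plain tuples/strings through pool.map (rather than clang cursor objects, which don't pickle) also sidesteps a whole class of multiprocessing bugs.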

Thanks for your feedback, Tobias.

Anurag

Hello Karen,

Hi Anurag,
Last time I looked, the Linux kernel was written in C. And I could use
cscope to create a cross-reference database of the entire Linux kernel
in about 10 minutes, using far fewer resources than you used. Admittedly,
cscope gets confused on a few of the complex MACRO usages. But cscope is
a very popular tool for linux kernel development.

I received the same feedback from my peers - it is way slower than
cscope. I think I must reposition this tool and call it a source code
indexer rather than an xref generator.

I suggest you consider the current state of your work a proof of concept
and think about how to improve your performance by more than an order of
magnitude.

Sure. The indexing time in its current state is not very encouraging.
As someone said at a performance-related talk: the best way to
improve performance is to start with something worse :)

Thanks for your feedback,

Anurag

This is how I'm inserting nodes into the SQLite db too. I collate all
the node metadata into a python list and run cursor.executemany().
Bulk inserts executed inside a transaction are very fast in SQLite.
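That approach can be sketched as follows; the column names are illustrative, not Crange's actual schema:

```python
import sqlite3

def bulk_insert(conn, rows):
    """Insert all collected node rows in one transaction.

    executemany() prepares the INSERT statement once and reuses it
    for every row, and the 'with conn' block wraps everything in a
    single transaction (commit on success, rollback on error), so
    SQLite pays for at most one journal sync instead of one per row.
    """
    with conn:
        conn.executemany(
            "INSERT INTO nodes (usr, kind, file, line) VALUES (?, ?, ?, ?)",
            rows,
        )
```

Batching rows per translation unit (rather than per node) keeps the Python list from growing unboundedly while still getting the transaction win.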

Anurag

I think it could do even better than that:

https://www.youtube.com/watch?v=7uBNCN6v_gk
https://www.youtube.com/watch?v=-NYduubaf8Y

Csaba