Adding indexing support to Clangd

It looks like Marc-Andre is following our work on the Eclipse CDT indexer. We found databases to be just too slow. Also, forcing the C++ language model to fit into a bunch of tables or key/value pairs just didn't make that much sense in the end. The data is very complex.

Instead it's better to think of this as a big graph data structure that just happens to be backed by memory mapped file store. Clients generally deal directly with pointers into that data structure and follow pointers in the graph. The BTrees help speed up that navigation at various nodes in the graph. It's super fast.

Doug.

Apologies for the delay, Nathan (CC’ed) will be preparing a document describing how it works in detail, and will bring it soon to cfe-dev for discussion.

From: Dr D. Chisnall [mailto:dc552@hermes.cam.ac.uk] On Behalf Of David
Chisnall
Sent: Thursday, August 10, 2017 6:10 AM
To: Marc-André Laperle <marc-andre.laperle@ericsson.com>
Cc: via cfe-dev <cfe-dev@lists.llvm.org>; zeratul976@hotmail.com; Doug
Schaefer <dschaefer@blackberry.com>
Subject: Re: [cfe-dev] Adding indexing support to Clangd

–ClangdIndexStorage–
malloc-like interface that allocates/frees data blocks of variable sizes on
disk. The blocks can contain ints, shorts, strings, pointers (i.e. file offsets),
etc. The data is cached in 4K pieces so that local and repeated accesses are all
done quickly, in memory.
Clangd mallocs and writes its index model objects using this.

–BTree–
A pretty classic BTree implementation. Used to speed up lookup (symbol
names, file names). It allocates its nodes using ClangdIndexStorage therefore
it is stored on disk. Keys are actually records in ClangdIndexStorage so you
can really think of the BTree as a collection of sorted pointers (sorted
according to a provided comparator).

This sounds very like bdb. Is there a reason that we’re reimplementing a
large chunk of bdb, rather than just using it (or using something like sqlite for
the index storage)?

It looks like Marc-Andre is following our work on the Eclipse CDT indexer. We found databases to be just too slow.

That is somewhat surprising. Which db’s did you benchmark against?

From: Dr D. Chisnall [mailto:dc552@hermes.cam.ac.uk] On Behalf Of David
Chisnall
Sent: Thursday, August 10, 2017 6:10 AM
To: Marc-André Laperle <marc-andre.laperle@ericsson.com<mailto:marc-andre.laperle@ericsson.com>>
Cc: via cfe-dev <cfe-dev@lists.llvm.org<mailto:cfe-dev@lists.llvm.org>>; zeratul976@hotmail.com<mailto:zeratul976@hotmail.com>; Doug
Schaefer <dschaefer@blackberry.com<mailto:dschaefer@blackberry.com>>
Subject: Re: [cfe-dev] Adding indexing support to Clangd

>
> --ClangdIndexStorage--
> malloc-like interface that allocates/frees data blocks of variable sizes on
disk. The blocks can contain ints, shorts, strings, pointers (i.e. file offsets),
etc. The data is cached in 4K pieces so that local and repeated accesses are all
done quickly, in memory.
> Clangd mallocs and writes its index model objects using this.
>
> --BTree--
> A pretty classic BTree implementation. Used to speed up lookup (symbol
names, file names). It allocates its nodes using ClangdIndexStorage therefore
it is stored on disk. Keys are actually records in ClangdIndexStorage so you
can really think of the BTree as a collection of sorted pointers (sorted
according to a provided comparator).

This sounds very like bdb. Is there a reason that we’re reimplementing a
large chunk of bdb, rather than just using it (or using something like sqlite for
the index storage)?

It looks like Marc-Andre is following our work on the Eclipse CDT indexer. We found databases to be just too slow.

That is somewhat surprising. Which db's did you benchmark against?

This was many years ago and we were trying Apache Derby (since Eclipse is written in Java). And at the time the biggest issue was write performance totally destroying any gains we made with header caching. But that may have just been Derby.

Though I think the bigger issue is that the language model is a graph which doesn’t fit naturally in tables. You’ll end up spending more time fighting that than building out your own data structures. But that’s just my experience.

Also, forcing the C++ language model to fit into a bunch of tables or key/value pairs just didn't make that much sense in the end. The data is very complex.

Instead it's better to think of this as a big graph data structure that just happens to be backed by memory mapped file store. Clients generally deal directly with pointers into that data structure and follow pointers in the graph. The BTrees help speed up that navigation at various nodes in the graph. It's super fast.

Doug.

From: Manuel Klimek [mailto:klimek@google.com]
Sent: Friday, August 11, 2017 4:54 AM
To: Doug Schaefer <dschaefer@blackberry.com>; David Chisnall <David.Chisnall@cl.cam.ac.uk>; Marc-André Laperle <marc-andre.laperle@ericsson.com>
Cc: via cfe-dev <cfe-dev@lists.llvm.org>

Subject: Re: [cfe-dev] Adding indexing support to Clangd

From: Dr D. Chisnall [mailto:dc552@hermes.cam.ac.uk] On Behalf Of David
Chisnall
Sent: Thursday, August 10, 2017 6:10 AM
To: Marc-André Laperle <marc-andre.laperle@ericsson.com>
Cc: via cfe-dev <cfe-dev@lists.llvm.org>; zeratul976@hotmail.com; Doug
Schaefer <dschaefer@blackberry.com>
Subject: Re: [cfe-dev] Adding indexing support to Clangd

–ClangdIndexStorage–
malloc-like interface that allocates/frees data blocks of variable sizes on
disk. The blocks can contain ints, shorts, strings, pointers (i.e. file offsets),
etc. The data is cached in 4K pieces so that local and repeated accesses are all
done quickly, in memory.
Clangd mallocs and writes its index model objects using this.

–BTree–
A pretty classic BTree implementation. Used to speed up lookup (symbol
names, file names). It allocates its nodes using ClangdIndexStorage therefore
it is stored on disk. Keys are actually records in ClangdIndexStorage so you
can really think of the BTree as a collection of sorted pointers (sorted
according to a provided comparator).

This sounds very like bdb. Is there a reason that we’re reimplementing a
large chunk of bdb, rather than just using it (or using something like sqlite for
the index storage)?

It looks like Marc-Andre is following our work on the Eclipse CDT indexer. We found databases to be just too slow.

That is somewhat surprising. Which db’s did you benchmark against?

This was many years ago and we were trying Apache Derby (since Eclipse is written in Java). And at the time the biggest issue was write performance totally destroying any gains we made with header caching. But that may have just been Derby.

Though I think the bigger issue is that the language model is a graph which doesn’t fit naturally in tables. You’ll end up spending more time fighting that than building out your own data structures. But that’s just my experience.

Yea, the language model not fitting into tables well does match my intuition, but for simple queries, I’d have expected a db to be fast enough with room to spare. Would be curious about more details / numbers.

Sorry gang, Outlook isn’t treating this thread very well. I’ll have to top post my response.

The CDT indexer and database were written in 2005 so the data is long gone and we’re talking about different database technologies anyway. All I remember was it was slower by multiple orders of magnitude on writes. And that stopped my investigation right there. At the time I was trying to get a full index of the Firefox source down from an hour to a few minutes and incremental indexing times down to less than a second (thanks to the header caching). The DB was taking me in the other direction.

I’m just not sure architecting for simple lookups is the right approach. Just consider probably the most common use case for clangd, content assist. Given an offset into the file, determine the context/scope of the content and a prefix. You then need to find all matches for that prefix. That’s by no means a simple lookup given all the declarations, type hierarchies, scope hierarchies in play in that context. Finding references is another good one that also requires scope analysis to make sure you’re actually returning references to that exact definition and not just something that happens to have the same name. The CDT index has links back and forth between these things.

Again, I’m just relaying my experience haven built one of these things before. And it was a long time ago. All avenues are worth exploring.

Doug.

Sorry gang, Outlook isn’t treating this thread very well. I’ll have to top
post my response.

The CDT indexer and database were written in 2005 so the data is long gone
and we’re talking about different database technologies anyway. All I
remember was it was slower by multiple orders of magnitude on writes. And
that stopped my investigation right there. At the time I was trying to get
a full index of the Firefox source down from an hour to a few minutes and
incremental indexing times down to less than a second (thanks to the header
caching). The DB was taking me in the other direction.

I’m just not sure architecting for simple lookups is the right approach.
Just consider probably the most common use case for clangd, content assist.
Given an offset into the file, determine the context/scope of the content
and a prefix. You then need to find all matches for that prefix. That’s by
no means a simple lookup given all the declarations, type hierarchies,
scope hierarchies in play in that context. Finding references is another
good one that also requires scope analysis to make sure you’re actually
returning references to that exact definition and not just something that
happens to have the same name. The CDT index has links back and forth
between these things.

Clang shouldn't query by name when looking up references, instead USRs
should be used. The USRs deal with scopes correctly (Hence the U in USR).

Could you provide more info, please. Are USRs different for same named file-local entries (i.e. static functions)?

Could you provide more info, please.
Are USRs different for same named file-local entries (i.e. static
functions)?

Looking at USRGenerator::VisitVarDecl, it seems they will be the same
across different translation units.
For local entities, "location" is part of the USR. ("Location" is a
filename + an offset from the start of that file).

That was my point, U in USRs are not so Unified, so can not be used as universal keys for queries in DB. Adding location to USR makes them less usable in index if fast incremental re-index is going to be implemented vs. full reparse on each change. Vladimir.

Could you provide more info, please.

Are USRs different for same named file-local entries (i.e. static
functions)?

Looking at USRGenerator::VisitVarDecl, it seems they will be the same
across different translation units.
For local entities, "location" is part of the USR. ("Location" is a
filename + an offset from the start of that file).

That was my point, U in USRs are not so Unified, so can not be used as
universal keys for queries in DB.
Adding location to USR makes them less usable in index if fast incremental
re-index is going to be implemented vs. full reparse on each change.

Vladimir.

U is meant to be unique. USRs file-local declarations like static functions
will include the filename only. USRs for function-local declarations will
also include the offset in the file, but they're not actually indexed.
There are a couple of global declarations that include file + offset in the
USR as well - ObjC class extensions, some embedded tag decls (not that
common), template parameter names (not actually indexed) & macros (not
indexed).

What exactly prevents unique strings from being used as keys for queries?

Are filenames are a problem for the incremental re-indexing? If they
aren't, then there are probably way to work around the file offsets in USRs
for the incremental re-indexing, especially given how uncommon they
actually are.

Alex

Apologies, U does stand for Unified :). But I believe that my points still stand.

Keys have to be Unique. And previously (in 3.6?) we observed that different entries had same USRs. Good to known it is fixed now. I just wanted to share our experience, that offsets added to unique strings are not good. They add extra complexity to invalidation phase during incremental re-indexing and not stable, i.e. during simple re-format of file or adding comments. May be integral index per-file would be enough vs offset. Vladimir.

Hello.

Just 2 cents from my USR experience:

  1. USRs in some cases different for entities that are the same in C++. Namely it takes into account constness of by-value fuction arguments where C++ Front-End ignores them.
    “int foo(const int);” and “int foo(int);” are declarations for the same function in AST, but will have different USRs.

  2. Also be careful with USRs and preprocessor macros: in my experience if macro is #undef’ed it looses its USR because USR generation relies on PreprocessingRecord which clears info upon #undef.

Note: My experience is with Clang 4.0 and earlier, so maybe this is fixed already.

Regards,
Serge Preis

It’s good to know that you are interested in replacing the implementation. The higher level classes are not very polished right now so as I improve things, I will keep that in mind. For example, right now, the data storage component is not fully hidden from the indexing interface. BTW, I started putting the indexing prototype on Github [1]. The lower level “storage” parts (ClangdIndexDataStorage, BTree) I think could be ready for review soon but with the discussions about Sqlite and LMDB, I’m not sure if that’s the way to go yet. I’ll have to research more LMDB especially and also in the light of the index-while-building proposal. The file event handling, indexing operations and model are all prototype level so not ready for review yet (lots of TODOs and long methods) but they help to know how the pieces can fit together. The index-while-building proposal contains some good ideas on how to model things too and I think it makes sense to align on that. There are also some protocol parts in the prototype that could also be reviewed soon, if it’s OK for some handlers to not do anything for a while. I’ll determine what the next steps should be and keep you posted. [1] Regards, Marc-André

This sounds very like bdb.  Is there a reason that we’re reimplementing a large chunk of bdb, rather than just using it (or using something like sqlite for the index storage)?

I’m not very familiar with bdb, but perhaps it could work. I’ve seen LMDB mentioned a few times and the license seems compatible so I’ll research this a bit. My initial reasons for not going with SQL databases/other libraries:

  • From my CDT experience with the C/C++ index model, tables don’t seem like a good match
  • I was under the impression that adding a third party library in llvm repositories was not straightforward or desired. But perhaps a small one is OK (i.e. not MySQL, PostgreSQL, etc).
  • With several options available, I thought doing something similar to what I knew worked well in CDT would be a safe way to go. With a similar data storage strategy, the index model could be also inspired from CDT and give similar features.
  • At the LLVM developer meeting in March, there seemed to be an understanding among Clangd contributors that a “tailored” format was the way to go

I’m happy to consider other existing solutions if they make more sense. It’s always good not to have to maintain additional code. I’ll look a bit more into LMDB in the next days to have a better picture. One thing that would be good is if we can have “pointer values” that can point to other entries in the database (in case someone knows already).

Clang shouldn’t query by name when looking up references, instead USRs should be used. The USRs deal with scopes correctly (Hence the U in USR).

Yes, querying by USR seem to work well so far!

Regards,
Marc-André

Hi!

Just a for a quick update: I’m in the process of making small patches for review, as you might have seen already. I also work in parallel on fixing bugs and polishing the different component that are not as ready, on the Github branch. I though maybe it would be useful to have more information of what’s currently there so I added a README to the branch and will keep it up to date as I change things.

https://raw.githubusercontent.com/MarkZ3/clang-tools-extra/IndexFunctionsPrototype/clangd/index/README

I have also done some experimentation with LMDB and its been pretty positive, although I am not sure about the possibility of adding it as a dependency as of now.

Regards,

Marc-André