Building and sharing a clangd global index

William_Wagner1 · April 1, 2019, 3:14pm

Hello!

I work on a fairly large C++ project and wanted to figure out a way to regularly build (e.g. nightly via Jenkins) a global project index that can be shared with all the members of my team. I want to share it because it takes a fairly long time to build the index after starting up, and it seems pretty redundant to have each team member doing so, seeing as most of the code is not changing on a day-to-day basis. I’ve tried peeking around the mailing lists and commit history of clangd, but I’m not sure whether this is possible yet, and if it is, which flags and which indexer to use.

I see there’s background-indexer WIP (https://reviews.llvm.org/D59605) and an existing clangd-indexer https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/indexer/IndexerMain.cpp
What is the difference between these?

Additionally, if anyone could provide some clarification on the different types of indexes clangd currently has (dex, background, static, etc.) that would be great

Thanks!

ilya · April 3, 2019, 8:27am

Hi William,

The difference between background-indexer and clangd-indexer is the layout of the output:

background-indexer would put the resulting index into the folder /.clangd/index.
The index is split per-file, i.e. it’s incremental and clangd would be able to update the files that changed after the index was built.
You would need to run clangd with ‘-background-index’ to load the index; it will also automatically update the index for files that changed on load.

clangd-indexer would produce a merged index. It can’t be incrementally updated, but you have more control over the location of the output:
./bin/clangd-indexer -executor=all-TUs path/to/compile_commands.json > path/to/output.riff

You would need to run clangd with ‘-index-file=path/to/output.riff’ to load the index.
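
As a rough sketch of how a nightly job might assemble these two command lines, the helpers below only build argument lists and do not run anything. The flags are the ones quoted above; the function names, the assumption that `clangd-indexer` and `clangd` are on `PATH`, and the example paths are illustrative only.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def indexer_command(build_dir: PurePosixPath) -> List[str]:
    """Arguments for the merged-index build.

    clangd-indexer writes the index to stdout, so the caller has to
    redirect stdout into the .riff file.
    """
    return [
        "clangd-indexer",  # assumed to be on PATH
        "-executor=all-TUs",
        str(build_dir / "compile_commands.json"),
    ]


def clangd_command(index_file: Optional[PurePosixPath] = None) -> List[str]:
    """Pick the clangd flag that matches how the index was produced."""
    if index_file is not None:
        # Merged index from clangd-indexer: loaded as-is, never updated.
        return ["clangd", f"-index-file={index_file}"]
    # Per-file index: loaded and refreshed by clangd itself.
    return ["clangd", "-background-index"]
```

A CI step could pass the first list to `subprocess.run` with `stdout` pointed at the output file, then publish that file for developers to load with the second command.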

Note that both indexes store absolute paths, so sharing the produced index across multiple machines would only be possible if the directory structure is kept the same.
If having the same directory structure is plausible, please try it out and let us know if it works, we haven’t tried sharing the same index across multiple machines.

Which option to prefer? Depending on your situation, either of the two might be better:

If you always want an up-to-date index and storing the shared snapshot is just a performance optimization, use background-indexer.
If not wasting resources to rebuild the index for changed files is more important than the fact that some results are stale (e.g. it’s too expensive, you want to save laptop battery, etc.), clangd-indexer might be a better choice.

Here’s a short summary on what each index means:

Static index is an index that is persisted across multiple runs of clangd. There are two flavours of it:

Background index. Incremental (split per-file) index living in ‘/.clangd/index’. Built automatically by clangd when -background-index is specified. Long-term, we want this to be enabled by default (and possibly be the only option).
Old-style “merged” index produced by clangd-indexer. The results will not get updated by clangd automatically; you can ask clangd to load it with ‘-index-file=path/to/index.riff’.

Dynamic index is an overlay for a small number of updated files (currently the open files for which we built the AST). Kept in memory, not persisted across multiple runs. We use it to adjust for the fact that the static index might be stale. We want correct results for the open files in all cases.
Dex is an efficient implementation of running search queries (e.g. it models the fuzzy-matching algorithm, etc.). It’s an “index” in an information retrieval sense; it is not actually specific to C++ or clangd.
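
To make the static/dynamic split concrete, here is a toy model of an overlay lookup. It is not clangd’s implementation: the class name, the symbol-to-file dictionaries, and the rule that a static hit inside an open file is treated as stale are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set


class OverlayIndex:
    """Toy model: in-memory entries shadow a possibly stale static snapshot."""

    def __init__(self, static: Dict[str, str], dynamic: Dict[str, str],
                 open_files: Set[str]):
        self.static = static          # symbol -> defining file, persisted index
        self.dynamic = dynamic        # symbol -> defining file, from open files
        self.open_files = open_files  # files the dynamic index is authoritative for

    def lookup(self, symbol: str) -> Optional[str]:
        # Fresh results for open files always win.
        if symbol in self.dynamic:
            return self.dynamic[symbol]
        location = self.static.get(symbol)
        # Assumption: a static hit located in an open file is stale, because
        # the freshly built AST for that file no longer contains the symbol.
        if location in self.open_files:
            return None
        return location
```

With `a.cc` open, a symbol deleted from `a.cc` disappears from results even though the static snapshot still lists it, while symbols in unopened files are answered from the snapshot.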

Eric_Liu · April 3, 2019, 8:38am

Just to add on what Ilya said.

> Note that both indexes store absolute paths, so sharing the produced index across multiple machines would only be possible if the directory structure is kept the same.
> If having the same directory structure is plausible, please try it out and let us know if it works, we haven’t tried sharing the same index across multiple machines.

Paths are stored as URIs in the index. By default, the “file” scheme is used, so a URI would simply be an absolute path (e.g. file:///user/home/llvm/x/y.h). But you could also define your own URI schemes. For example, you can choose to store relative paths in the URI (e.g. llvm:///x/y.h) in a custom scheme, and they can be resolved with potentially different project roots on users’ machines to get correct full paths. For more information, please take a look at the clangd/URI.h library. You could also find some sample URIScheme implementations in unit tests.

Cheers,
Eric

Sam_McCall · April 3, 2019, 12:36pm

What you want to do is possible (we do something very similar), though it isn’t quite working out-of-the-box yet.

There are two main parts:

Building and distributing an index is pretty easy: run clangd-indexer and copy the file to each machine.[1]
Translating filenames in the index to match those on the machine is what the URIs Eric mentioned are for, and isn’t polished.
The idea is clangd-indexer will see a file in /path/a/project/Foo.cc, and clangd (on another machine) will see it in a different /path/b/project/Foo.cc.
So it’s the indexer’s job to translate the path into a machine-agnostic URI like myproject:///Foo.cc, and then clangd’s job is to work out which concrete file that refers to in the current context. The clangd::URIScheme implementations handle this at both ends.
However, open-source clangd only has the file scheme today, so people need to patch it to handle these cases[2].
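
The two directions of that translation can be sketched as follows. This is only a Python illustration of the idea, not the clangd::URIScheme interface: the function names are made up, the `myproject` scheme comes from the example above, and percent-encoding of unusual characters is ignored.

```python
from pathlib import PurePosixPath

SCHEME = "myproject"  # scheme name taken from the example in the post


def path_to_uri(path: str, project_root: str) -> str:
    """Indexer side: strip the machine-specific project root."""
    # relative_to raises ValueError if the file is outside the project root.
    relative = PurePosixPath(path).relative_to(project_root)
    return f"{SCHEME}:///{relative}"


def uri_to_path(uri: str, project_root: str) -> str:
    """clangd side: re-anchor the URI under the local project root."""
    prefix = f"{SCHEME}:///"
    if not uri.startswith(prefix):
        raise ValueError(f"unsupported URI: {uri}")
    return str(PurePosixPath(project_root) / uri[len(prefix):])
```

Here the indexer would call `path_to_uri` with `/path/a/project` as the root and clangd would call `uri_to_path` with `/path/b/project`, so the same index entry resolves to the right file on each machine.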

– design speculation follows –
I think we should ship a generic “project-relative” URI scheme with clangd so this can work.

One idea I have is a scheme like project://somebasedir/path/file.cc
Here the assumption is that the project is rooted under a directory with a fixed name “somebasedir” recorded in the URI authority.

URI → path is easy: find the concrete somebasedir based on the currently edited file, and concatenate.
path → URI is tricky: we need to determine which (if any) parent directory is the relevant base.
A flag makes sense for clangd-indexer, but clangd also needs to do this conversion sometimes and a flag is a burden there.
Maybe we can get away with just keeping track of the authorities we’ve seen the external index return? But this doesn’t really help for background index, and mixed internal/external index cases could get messy.
Looking for compilation databases is tempting too, but complicated (it requires IO in the URI scheme, we have ways to use clangd with an external CDB, and the CDB interfaces aren’t quite right for this today).
So I don’t see a way to do this that’s super-clean (cheap, zero-config, correct), but I’m interested in ideas others have.

Obviously this has the weakness that indexes only transfer between projects where the root has the same name, not sure how big a problem this would be in practice.
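
The “easy” URI → path direction of this proposal might look roughly like the sketch below. The `project` scheme is only the idea floated above, not something clangd ships; the function name and the choice to take the nearest matching parent directory are assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit


def resolve_project_uri(uri: str, current_file: str) -> Optional[str]:
    """Map project://<basedir>/<relative path> to a local path.

    The concrete base directory is found by walking up from the file that
    is currently being edited.
    """
    parts = urlsplit(uri)
    if parts.scheme != "project" or not parts.netloc:
        return None
    # Assumption: the nearest parent named like the URI authority is the root.
    for parent in PurePosixPath(current_file).parents:
        if parent.name == parts.netloc:
            return str(parent / parts.path.lstrip("/"))
    return None
```

If no parent of the edited file carries the expected name, the function returns `None`, which corresponds to the weakness noted above: the index only transfers when the root directory has the same name.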

[1] There are certainly fancier variations: for Google’s index we distribute the index building by running Index/IndexAction in a mapreduce, and also run the index as an RPC server and use a custom implementation of SymbolIndex that queries it. The latter means our developers have to use a patched clangd. Building the index file and copying it is a good place to start; you’ll see where the scaling limits are.

[2] Ours is pretty simple, as the project is always rooted at a directory with a fixed name.

William_Wagner · April 5, 2019, 3:17am

Hey Sam,

I do like the idea of a project-relative URI scheme. You mentioned the tricky part was the path → URI conversion. If I understand correctly, part of why it’s tricky is that, say, you had:

URI: project://foo as your “root”
Path: /home/foo/foo/foo.cc
It’d be hard to know whether the URI for this path would be project://foo/foo/foo.cc or project://foo/foo.cc. I suppose you could recurse upwards until you hit some kind of boundary (e.g. a git folder?)
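
The ambiguity, and the “stop at a boundary” idea, can be sketched like this. The function names and the `is_boundary` callback are hypothetical; a real check might test for a `.git` directory on disk, which is replaced by a caller-supplied predicate here so the sketch needs no filesystem access.

```python
from pathlib import PurePosixPath
from typing import Callable, List, Optional


def candidate_uris(path: str, base_name: str) -> List[str]:
    """Every parent directory named base_name yields one plausible URI."""
    file_path = PurePosixPath(path)
    return [
        f"project://{base_name}/{file_path.relative_to(parent)}"
        for parent in file_path.parents
        if parent.name == base_name
    ]


def pick_uri(path: str, base_name: str,
             is_boundary: Callable[[PurePosixPath], bool]) -> Optional[str]:
    """Walk upwards and keep the first matching parent that is a boundary."""
    file_path = PurePosixPath(path)
    for parent in file_path.parents:
        if parent.name == base_name and is_boundary(parent):
            return f"project://{base_name}/{file_path.relative_to(parent)}"
    return None
```

For `/home/foo/foo/foo.cc` with base name `foo`, `candidate_uris` returns both URIs from the example, and `pick_uri` settles on one only once the predicate identifies which `foo` directory is the real root.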

> Obviously this has the weakness that indexes only transfer between projects where the root has the same name, not sure how big a problem this would be in practice.

At least for me and most of the projects I see at work, I don’t think this would be a show stopper.

> … and also run the index as an RPC server and use a custom implementation of SymbolIndex that queries it.

Trying to wrap my head around this, as I’m very intrigued. Do people run clangd servers on their local machines and only defer to the RPC server for LSP queries that have to consult the static index? Also, is an index shared by multiple people? If so, then how does the static index get updated if multiple people have different versions of the code?

Thanks,
William

ilya · April 5, 2019, 9:37am

A few quick clarifications; Sam would probably have more to add.

> Is the background-indexer smart enough to do a rescan of the code base, and only update the files that changed? My assumption is yes, because the paths are the same and the digests(?) will be the same for the unchanged files, but confirmation here would be great.

Yes, that’s correct. It stores the digests of the files and tries to update only a minimal subset of the codebase that actually changed.
Of course, there might be rough edges and bugs, please let us know if you encounter them.
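
The digest comparison described here can be illustrated with a small sketch. It is not the background indexer’s code: the helper names are invented, and SHA-1 is only an assumed stand-in, since the thread does not say which hash is used.

```python
import hashlib
from typing import Dict, List


def digest(contents: bytes) -> str:
    # Assumption: any stable content hash works for this illustration.
    return hashlib.sha1(contents).hexdigest()


def files_to_reindex(stored: Dict[str, str],
                     current: Dict[str, bytes]) -> List[str]:
    """Return files whose contents no longer match the recorded digest.

    `stored` maps path -> digest saved with the index;
    `current` maps path -> file contents as they are now.
    New files have no stored digest, so they are reported as well.
    """
    return sorted(
        path
        for path, contents in current.items()
        if stored.get(path) != digest(contents)
    )
```

Unchanged files hash to the same digest and are skipped, so only the modified or new part of the codebase is re-indexed.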

> Ah okay, so if I wanted to use the background-index and dex, are the only arguments I have to pass (in clang-8) -background-index and -use-dex-index?

Correct. I thought we made -use-dex-index the default for clangd-8, but I double-checked now and it’s actually not the case.

> Do people run clangd servers on their local machines and only defer to the RPC server for LSP queries that have to consult the static index?

Yes, clangd is run locally on each machine and it queries the Index service via RPC to avoid keeping the whole index locally. Note that the RPC server does not serve LSP requests; instead it serves the clangd-specific operations (as previously mentioned, the full list of functions we require from the index is in clangd::SymbolIndex).

> If so, then how does the static index get updated if multiple people have different versions of the code?

We assume everyone is on the mainline branch (we have a monorepo internally). The index is rebuilt about once a day, hence the results from the RPC server are sometimes outdated.
To compensate for the staleness in the common case (local modifications, a slightly outdated index, etc.), we have an overlay for the files open in the editor (and the files included by them) in the form of a “dynamic index” internally.
