Sharing indexes for multiple users

sam-mccall · November 29, 2019, 9:48am

TL;DR: it might be nice to have an index server for clangd, so developers working on the same project can share an index built centrally. I think this would require adding an (optional) dependency on an RPC system. Worthwhile?

The background index works reasonably well under certain constraints:

- you have a fairly powerful machine, with lots of cores for indexing and RAM for serving
- you have an accurate compile_commands.json
- your project isn’t too big (LLVM is OK, Chrome not really)

For large codebases, a shared index server may be a better tradeoff.
We have experience with this at Google - we index our monorepo daily and overlay the clangd in-memory index over a SymbolIndex implementation that sends RPCs to a shared service.
This is built on internal Google infrastructure and isn’t publicly available.

I think this would be useful to other projects too, and we should consider building an open-source version. Here’s a design sketch:

a process (“indexer”) runs the following loop:
- update sources from source control
- generate compile_commands.json using the build system
- run clangd’s indexing logic, and push index shards to a git repo[1]
another process (“server”):
- continuously pulls from the git repo, and loads the latest index data whenever it changes.
- It exposes an RPC server with approximately the SymbolIndex interface
clangd can be configured (e.g. by a config file in the source tree) to add a static index that sends RPCs to the server
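
The server half of the sketch above (pull, notice a change, reload) could look roughly like this. It is only an illustration under assumptions: `IndexServer`, `get_head` and `load_index` are hypothetical names, the index is a plain dictionary, and the git interaction is replaced by callables so the example runs standalone.

```python
from typing import Callable, Dict, List, Optional


class IndexServer:
    """Hypothetical serving process: reloads index data when the repo head changes."""

    def __init__(self, get_head: Callable[[], str],
                 load_index: Callable[[str], Dict[str, List[str]]]) -> None:
        self._get_head = get_head      # e.g. could wrap `git rev-parse HEAD` after a pull
        self._load_index = load_index  # reads the index data for a given revision
        self._revision: Optional[str] = None
        self._symbols: Dict[str, List[str]] = {}

    def poll(self) -> bool:
        """Check for a new revision; reload and return True if the index changed."""
        head = self._get_head()
        if head == self._revision:
            return False
        # The new index is fully loaded before it replaces the old one.
        self._symbols = self._load_index(head)
        self._revision = head
        return True

    def lookup(self, name: str) -> List[str]:
        """Return the locations recorded for a symbol name (empty if unknown)."""
        return self._symbols.get(name, [])


# Simulated index repo with two revisions (made-up data).
revisions = {
    "rev1": {"Foo": ["proj://foo.h:10"]},
    "rev2": {"Foo": ["proj://foo.h:12"], "Bar": ["proj://bar.h:3"]},
}
head = ["rev1"]
server = IndexServer(lambda: head[0], lambda rev: revisions[rev])
server.poll()  # first poll loads rev1
```

A real server would call `poll()` on a timer and answer RPCs from `lookup()`-style methods in between.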

The main catch is we need a way to run RPC servers and clients. We’d need to take some dependency for this - it could be an optional build-time feature, but there’s still substantial cost/complexity here, especially if we want this feature tested on buildbots.

Latency does matter here, and it needs to be a system that can run over HTTP (to traverse the web reliably), and I think encryption is probably important. I know that gRPC would work and can be built with CMake, but it has nontrivial dependencies. I’m less familiar with other systems.

Interested in what others think about this design, what alternatives are worth considering, and what to do about the RPC question.

[1] git repo would be suitable if we’re using the sharded index form, not so much if it’s the monolithic index form. Delegating storage, transfer, access control etc to git is really tempting…

hyp · December 2, 2019, 11:40pm

Can you please elaborate on what kind of index shards are being pushed to the git repo (binary/text, full/partial index shards)? Would it be a repo that’s set up specifically for storing indexing data, or would you be pushing index-specific refs to the project’s own repo (the project that’s being indexed)?

I think it would be great to have a way to work with an index over RPC. We definitely will need something like that in the future to incorporate cross-TU refactorings with sourcekit-lsp from the Swift project.

What kind of complexities do you see when it comes to testing the RPC layer? I think that testing it locally by using in-process test RPC should be fairly reliable as it should avoid the IPC/network issues. But of course it’s necessary to have actual IPC/network integration tests as well.

tauchris · December 3, 2019, 6:48pm

This seems like a promising approach. I’m looking for a solution for indexing a large codebase that has several challenges (a lot of generated source, and a lot of build flavors, and many individual translation units that are compiled multiple times with different flags and different headers, within the same project).
For RPC, I’ve worked on a project that used XMLRPC in the past, with good results and encryption support. Don’t know much about other tools, but there are several that are open-source. A JSON-based RPC implementation might be a good choice, for consistency with other clang-related services…

jkorous · December 3, 2019, 7:29pm

I am just wondering - are you guys also thinking about support for global index for multiple projects? For example features like “give me all references in Google monorepo to this symbol”.

It seems that with the kind of setup you are aiming for it might be possible to get a lot of that done mostly by just adding some extra information to USR / SymbolId (maybe just a path?) and possibly some kind of FS overlay to translate between paths on the end-user’s machine and “indexer/server” paths.

Full fidelity should be possible if the “indexer” were able to get and process information from the linker - like “ld -M --cref” output.

MarkZ3 · December 4, 2019, 3:10am

I’m very much interested in this although I’m a bit too busy right now to contribute actual code. I’m thinking about a scenario where a global index could cover multiple projects but also multiple active branches all within the same index. So for example you could find references of a given USR across even slightly diverging code branches to have a better picture of how a symbol is used, etc. Probably the fact that it is built from multiple branches can be handled just by configuring the “indexer” appropriately with the correct CDBs and the branches checked out in the indexer’s local file system.

One point that is not clear to me is how to represent file paths for the different branches that might also not be on the client’s file system - possibly some unique URI can be built. Then navigating to those URIs would need to fetch the content transparently by the LSP client. I guess this URI handling is out of the scope of this discussion and probably easier to figure out than the rest of this shared index proposal.

About access control, it might be useful to have a customizable layer for this to accommodate various possible corporate authentication mechanisms. Although that layer might be easier to develop once there is already a reference server in open-source that each can fork and try to adapt to their needed authentication.

sam-mccall · December 4, 2019, 9:56am

Glad this is interesting to others too!

@hyp

> Can you please elaborate on what kind of index shards are being pushed to the git repo (binary/text, full/partial index shards)? Would it be a repo that’s set up specifically for storing indexing data, or would you be pushing index-specific refs to the project’s own repo (the project that’s being indexed)?

The idea (and it’s not essential to the concept) is to have a dedicated repo for the index, just used to distribute the data to the serving processes. Some distribution is necessary because the indexer is an intensive batch job that we should run on one machine, and the serving is a latency-sensitive job you probably want to replicate, and certainly don’t want fighting with the indexer for resources.

Because adding deps to LLVM is hard, and networking and security and such is hard, and git is ubiquitous and well-understood and handles all these problems, it’s tempting to just shell out to it (consider what the configuration space for network shares, cloud storage, etc looks like). But git may not make sense (e.g. if all index files mostly change every run).

> What kind of complexities do you see when it comes to testing the RPC layer? I think that testing it locally by using in-process test RPC should be fairly reliable as it should avoid the IPC/network issues. But of course it’s necessary to have actual IPC/network integration tests as well.

In my experience using real RPCs across components in one process is fairly fine/easy (mostly you’re worrying about picking unused ports, the right security settings etc). Faking out the transport is definitely possible but I’m not sure it’s actually worth it (extra code and you test less of the real code). Actual multi-process tests are definitely more work (coordination, debugging failures etc) and I’d think we could limit this to lit integration tests.
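
To make the "real RPCs in one process" idea concrete, here is a rough sketch using only Python's standard library. It uses JSON over plain HTTP as a stand-in transport (not gRPC, and without encryption), and `SYMBOLS`, `LookupHandler`, `remote_lookup` and the `/lookup` path are all made up for the example. The point is the port handling: binding to port 0 lets the OS pick an unused port.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy stand-in for index data; not a real clangd index.
SYMBOLS = {"Foo": ["foo.h:10"]}


class LookupHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", "0"))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({"locations": SYMBOLS.get(request["name"], [])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # keep test output quiet


def remote_lookup(port: int, name: str) -> list:
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/lookup",
        data=json.dumps({"name": name}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["locations"]


def run_in_process_test() -> list:
    # Port 0 asks the OS for any unused port, avoiding clashes between tests.
    server = HTTPServer(("127.0.0.1", 0), LookupHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return remote_lookup(port, "Foo")
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

Client and server share a process here, but the request still goes through the real socket and serialization code, which is the part a faked transport would skip.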

@tauchris

> I’m looking for a solution for indexing a large codebase that has several challenges (a lot of generated source, and a lot of build flavors, and many individual translation units that are compiled multiple times with different flags and different headers, within the same project)

An index server will definitely help with generated source, since you can just generate everything before indexing (vs background-indexing which can’t really do this). Build flavors is less clear - of course you can run one index server for each and let clients choose. But supporting multiple “colors” of symbols within one server would require further design. TUs compiled multiple times… hard to say! We might need to iterate on some of these.

> I’ve worked on a project that used XMLRPC in the past, with good results and encryption support. Don’t know much about other tools, but there are several that are open-source. A JSON-based RPC implementation might be a good choice, for consistency with other clang-related services

I think JSON would be preferable to XML as a format for consistency (particularly within clangd to reuse ways of marshalling data) and for simplicity.

I’m wary of falling into wiring together an RPC system ourselves out of a JSON encoder, an HTTP client etc - doing that for LSP was fairly expensive (despite only stdin/stdout) and will be a maintenance burden when used over real networks.

Another issue is that we care a lot about latency, and servicing a typical code completion request from the index requires fetching quite a lot of data. A binary format and/or compression, and an RPC system optimized for latency will likely make a measurable difference to user experience. JSON over HTTP is commonly used, but various blogs that benchmarked the two claim that e.g. gRPC is 5-10x faster. Warrants further investigation.
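
As a rough illustration of why the wire format matters, the sketch below encodes the same made-up result set as JSON, as a hand-packed binary layout, and as that binary layout compressed with zlib. The record fields and the layout are assumptions for the example, and it only compares payload size, which is one input to latency rather than a latency measurement.

```python
import json
import struct
import zlib

# Toy completion results: (symbol id, name, line). Field choice is illustrative.
results = [(i, f"symbol_{i}", 100 + i) for i in range(1000)]

HEADER = struct.Struct("<QIH")  # 8-byte id, 4-byte line, 2-byte name length


def encode_json(rows) -> bytes:
    return json.dumps(
        [{"id": sym_id, "name": name, "line": line} for sym_id, name, line in rows]
    ).encode("utf-8")


def encode_binary(rows) -> bytes:
    out = bytearray()
    for sym_id, name, line in rows:
        raw = name.encode("utf-8")
        # Fixed-size header followed by the variable-length name bytes.
        out += HEADER.pack(sym_id, line, len(raw)) + raw
    return bytes(out)


def decode_binary(data: bytes):
    rows, offset = [], 0
    while offset < len(data):
        sym_id, line, name_len = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        name = data[offset:offset + name_len].decode("utf-8")
        offset += name_len
        rows.append((sym_id, name, line))
    return rows


json_bytes = encode_json(results)
binary_bytes = encode_binary(results)
compressed = zlib.compress(binary_bytes)
print(len(json_bytes), len(binary_bytes), len(compressed))
```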

@jkorous

> I am just wondering - are you guys also thinking about support for global index for multiple projects? For example features like “give me all references in Google monorepo to this symbol”.

I’m not quite sure what you’re asking here - surfacing choice of what index to use to the user? Currently the index, once configured, mostly hides silently behind various features.
I think we probably want the ability to specify the index server to use on a per-codebase basis, maybe with a file similar to .clang-format or compile_commands.json.
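
A per-codebase lookup of this kind could work like the directory walk tools use for .clang-format. The sketch below is only a guess at the shape: the file name `.index-server`, its one-line format, and the address `index.example.com:50051` are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".index-server"  # hypothetical file name, not a real clangd file


def find_index_server(source_file: Path) -> Optional[str]:
    """Walk up from source_file and return the address in the nearest config."""
    for directory in source_file.resolve().parents:
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8").strip()
    return None


# Demo in a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    (root_path / CONFIG_NAME).write_text("index.example.com:50051\n", encoding="utf-8")
    source = root_path / "lib" / "Support" / "foo.cpp"
    source.parent.mkdir(parents=True)
    source.write_text("", encoding="utf-8")
    found = find_index_server(source)
```

Because the nearest file wins, a subdirectory could point at a different server than the rest of the tree.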

> It seems that with the kind of setup you are aiming for it might be possible to get a lot of that done mostly by just adding some extra information to USR / SymbolId (maybe just a path?) and possibly some kind of FS overlay to translate between paths on the end-user’s machine and “indexer/server” paths.

Certainly such path translation is needed. Clangd uses URIs in the index interface rather than absolute paths to support such translation. (e.g. Symbol.Definition.FileURI is “google3://relative/path.h” in our internal index). I’m not sure what we need to add to SymbolID, though - it would be useful to keep this integer-sized.
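
A minimal sketch of that translation, assuming a made-up `proj://` scheme and POSIX-style paths. The function names are hypothetical and this is not how clangd implements its URI schemes.

```python
from pathlib import PurePosixPath


def uri_to_local(uri: str, scheme: str, checkout_root: str) -> str:
    """Map a scheme-relative index URI onto a path in the local checkout."""
    prefix = scheme + "://"
    if not uri.startswith(prefix):
        raise ValueError(f"unexpected URI scheme: {uri}")
    return str(PurePosixPath(checkout_root) / uri[len(prefix):])


def local_to_uri(path: str, scheme: str, checkout_root: str) -> str:
    """Inverse mapping, as an indexer might apply before publishing."""
    relative = PurePosixPath(path).relative_to(checkout_root)
    return f"{scheme}://{relative}"


# The indexer and the developer have the code checked out in different places.
published = local_to_uri("/build/checkout/relative/path.h", "proj", "/build/checkout")
local = uri_to_local(published, "proj", "/home/alice/src")
```

The index only ever stores the scheme-relative form, so neither machine's absolute paths leak into it.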

@MarkZ3

> I’m thinking about a scenario where a global index could cover multiple projects but also multiple active branches all within the same index.

Yeah, modelling source control is complicated. Even in the absence of branches, developers have source code checked out at different revisions, so one global index won’t reflect what’s actually available.

We haven’t found this to be a big problem in practice, but then again most developers at Google don’t use long-lived branches and this also encourages frequent commits.

Taking the union of multiple indexes from different configurations/branches might work well. Clangd’s index infrastructure has support for overlays too (this is how we combine the static/background index with the dynamic index of opened files). That’s particularly useful when team X owns a branch that modifies only a certain subdirectory: you can build a small index for the branch and overlay it on the main one.
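
The overlay idea, very roughly: results from the small branch index win for the files it covers, and everything else falls through to the main index. The sketch below is a toy model with hypothetical names and data, not clangd's actual merging logic.

```python
from typing import Dict, List, Set


class OverlayIndex:
    """Toy merged index: overlay entries win for the files the overlay covers."""

    def __init__(self, base: Dict[str, List[str]],
                 overlay: Dict[str, List[str]],
                 overlay_files: Set[str]) -> None:
        self._base = base                    # symbol -> ["file:line", ...]
        self._overlay = overlay
        self._overlay_files = overlay_files  # files re-indexed by the overlay

    def refs(self, symbol: str) -> List[str]:
        fresh = list(self._overlay.get(symbol, []))
        # Drop base results from files the overlay re-indexed: they may be stale.
        kept = [ref for ref in self._base.get(symbol, [])
                if ref.rsplit(":", 1)[0] not in self._overlay_files]
        return fresh + kept


# Main index plus a small index for team X's branch (made-up data).
main_index = {"Foo": ["lib/a.cpp:10", "teamx/b.cpp:5"]}
branch_index = {"Foo": ["teamx/b.cpp:7"], "Bar": ["teamx/b.cpp:20"]}
merged = OverlayIndex(main_index, branch_index, {"teamx/b.cpp"})
```

Here `merged.refs("Foo")` reports the branch's line for `teamx/b.cpp` and keeps the main index's entry for `lib/a.cpp`.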

> One point that is not clear to me is how to represent file paths for the different branches that might also not be on the client’s file system - possibly some unique URI can be built. Then navigating to those URIs would need to fetch the content transparently by the LSP client

Yeah, this is an interesting question - LSP seems designed around the idea that URIs will be file:/// for the local system, though it’s not explicit, and the use of URIs is an obvious extension point. There are other uses beyond indexes too: it’d be nice to get rid of the requirement to ship built-in-headers around (just link their content into clangd), but you still want go-to-definition to work. I think this is best thought of as a separate extension, as you say.

> About access control, it might be useful to have a customizable layer for this to accommodate various possible corporate authentication mechanisms.

Yes. Though the right tradeoff here might be to write a plugin and re-link clangd - dynamic plugins (processes or shared objects) may be too much complexity for the value they bring. Most options are probably amenable to this: with HTTP-based options we can let plugins mangle the headers, gRPC has a pluggable credential framework, etc.

rezamahdi · March 25, 2020, 9:52am

A good idea!
But in my opinion, this is a NICE DevOps tool, so it should be complete from the beginning.

I pursued such an idea some time ago. The features I was considering were these:

- Use RocksDB with a dedicated object storage model to fetch data quickly and efficiently.
- Use MsgPack or gRPC to build a small microservice architecture that is extensible and scalable.
- Parse header files separately, to speed up parsing of source files (each header is parsed just once).
- A connection to the revision control system to manage changes.
- Provide a link to the compiler, so the compiler can use this info as a cache to compile faster.

And so on…
But I was too busy to read the clang docs, so…

sam-mccall · April 21, 2020, 11:11pm

Sorry about the radio silence here - but should mention that Kirill Bobyrev is working on this.

On parsing header files separately: in general this doesn’t yield correct results, and it’s not trivial to coax clangd to share the partial ASTs and preprocessor state. This is what the modules infrastructure does though, and @adamcz is poking at it to see what might be usable.

rezamahdi · April 29, 2020, 1:24pm

Of course it is possible, but it is not trivial.
The content of any source or header file is heavily affected by preprocessor definitions, so we can parse headers in several versions, where each version has an attached definition context. For example, in the database we could see something like this:

AST of boost/asio.hpp : (definition context: none, ID: a45d6c40)
…
AST of boost/asio.hpp: (definition context: BOOST_ASIO_STANDALONE, PATCH TO: a45d6c40, ID: b6e7ff51)
…
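
One way the bookkeeping for such entries could be sketched, leaving the ASTs themselves out. This is an assumption-heavy illustration: `HeaderCache`, `Entry` and `variant_id` are hypothetical, and deriving IDs from a SHA-1 of the header path and its definitions is just one possible choice.

```python
import hashlib
from typing import Dict, FrozenSet, Iterable, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    entry_id: str            # short key for this (header, definitions) variant
    patch_to: Optional[str]  # ID of the base variant, None for the base itself


def variant_id(header: str, defines: FrozenSet[str]) -> str:
    """Derive a stable 8-hex-digit ID from the header path and its defines."""
    key = header + "\0" + "\0".join(sorted(defines))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]


class HeaderCache:
    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, FrozenSet[str]], Entry] = {}

    def get_or_add(self, header: str, defines: Iterable[str] = ()) -> Entry:
        context = frozenset(defines)
        key = (header, context)
        if key not in self._entries:
            # Variants with defines are recorded as patches against the base.
            base = None if not context else self.get_or_add(header).entry_id
            self._entries[key] = Entry(variant_id(header, context), base)
        return self._entries[key]


cache = HeaderCache()
base = cache.get_or_add("boost/asio.hpp")
standalone = cache.get_or_add("boost/asio.hpp", ["BOOST_ASIO_STANDALONE"])
```

Each header is then looked up once per definition context, and a variant only needs to store its difference from the base entry.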
