RFC: Symbol index for Clangd design proposal

Dear LLVM Community,

over the past few weeks, we (Google C++ Language Tools Team) have been working on the efficient symbol index proposal for Clangd. The goal is to improve overall Clangd performance by reducing the latency of different kinds of symbol search queries, such as the ones used for code completion. The plan is to follow the proposed design and replace existing implementation by the end of September.

We are happy to get feedback and comments on the proposal: suggestions are welcome!

The link to design document: https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit?usp=sharing

Kind regards,
Kirill Bobyrev

Thanks Kirill, this looks very promising.

My first question is a high level one. Clangd still lacks a referencesProvider. I don’t see mention of it in your document, but it seems a natural outcome of having a high performance index as long as it has reference location information.

Is that something you’re considering?

Thanks!

Doug.

+Haojian, who is working on references :slight_smile:

There’s some overlap for sure, though the majority is independent. I think Haojian’s plan includes these (in some overlapping order)

  1. add an interface for the index to provide xrefs to clangd

  2. implement LSP references features in terms of that interface + the AST for local references

  3. implement collection of xrefs

  4. implement xref storage/serving in the memory index ← this part overlaps with the index proposal

  5. (implement xref storage/serving in google’s internal index)

I’d guess the most likely thing is that the initial xref serving will be a fairly naive implementation that will fit in well with the current naive index :slight_smile: and then it can converge with the new index in a few months when both features are a bit more fleshed out.

Hi Kirill,

Thanks for posting this!

I wondered while reading the proposal whether you've considered the
model used by Bigtable [0] where you can have "compactions" or
re-indexing in an incremental manner. You can then do the trigraph
indexing be done quickly, and maintain different versions of the
trigraph indexes that can be concurrently searched while compactions
are happening in the background as new changes are committed through
the LSP API. This will always be a trade-off, but merging the trigraph
indices can happen independently without resorting to mutating the
actual index.

Cheers

[0] https://ai.google/research/pubs/pub27898

Hi Kirill,

Thanks a lot for posting this proposal! I have a few questions that are maybe a bit more high-level. About the “static index” mentioned, when would it be updated? If I remember correctly, I think for Google the static index might be a remote server and I assume it would be updated periodically when new commits are applied on the repo? Just making sure I understand where your use case. When you mentioned the proposal would be implemented by the end of September, would that still use the YAML as the static index storage?

If not, what did you have in mind for a more typical usage of Clangd on a single machine? Would the static index be the the unit and record files (index-while-building)?
We were thinking along those lines: when a file is changed and saved, Clangd starts a background indexing task and updates the corresponding unit/record files. Unsaved files would be the dynamic index.
Now, the index of “USR to record-files” (and other global-level info) could be generated when Clangd is started by reading all unit/record files and then kept in memory. I haven’t done measurements on how fast that would be, but judging from the presentation last fall [1], it was taking a few seconds on LLVM/Clang. I could imagine this taking a few minutes every time Clangd is started on a bigger code base. So next step would be to persist that index to disk for a greater speed-up, using LMDB or similar.

By defining well the interface for the static index, I think it should be possible to support both scenarios (local/index-while-building vs remote).

For the local scenario with background indexing, I had made a rough prototype many months ago in order to just do basic testing of the “index while building” patches. We would like to join the effort in providing that kind of functionality to Clangd but it is not clear how to proceed. I am thinking, in the short term, we could help getting the “index-while-building” patches reviewed and accepted. But it would be good to make sure we are heading in the same direction and coordinate on what needs to be done.

Regards,
Marc-André

[1] https://youtu.be/jGJhnIT-D2M?t=940

Hi Kirill,

Thanks a lot for posting this proposal! I have a few questions that are maybe a bit more high-level. About the “static index” mentioned, when would it be updated? If I remember correctly, I think for Google the static index might be a remote server and I assume it would be updated periodically when new commits are applied on the repo? Just making sure I understand where your use case. When you mentioned the proposal would be implemented by the end of September, would that still use the YAML as the static index storage?

Note that Kirill’s design is trying to address the problem of serving collected symbols efficiently (i.e. implementating SymbolIndex) instead of “indexing”/collecting symbols for static index.

One of the main goals here is to make it work well for both small dynamic index and large static index. And there should be no restriction on how symbols are collected. For clangd, static index is just a “SymbolIndex” that does not change within the span of a clangd instance. The symbols can come from the YAML global-symbol-builder or from index-while-build, and we don’t expect the symbol index implementation to change dramatically across different scenarios. So to answer your question, yes, the short term plan is to continue using the offline-built YAML symbol table (global-symbol-buider) as the symbol source for the static index :slight_smile: The yaml stuff is experimental, and we would also like to move to index-while-build in the future.

If not, what did you have in mind for a more typical usage of Clangd on a single machine? Would the static index be the the unit and record files (index-while-building)?

We were thinking along those lines: when a file is changed and saved, Clangd starts a background indexing task and updates the corresponding unit/record files. Unsaved files would be the dynamic index.

This sounds like a good optimization that can be useful in index-while-build integration.

Now, the index of “USR to record-files” (and other global-level info) could be generated when Clangd is started by reading all unit/record files and then kept in memory. I haven’t done measurements on how fast that would be, but judging from the presentation last fall [1], it was taking a few seconds on LLVM/Clang. I could imagine this taking a few minutes every time Clangd is started on a bigger code base. So next step would be to persist that index to disk for a greater speed-up, using LMDB or similar.

Another interesting problem with index-while-build is how to build a full symbol index (with both fuzzy find and USR->record lookup support) quickly for symbols from all TUs, when a new clangd instance is started. From our experience with global-symbol-builder, merging symbols across all TUs can be very expensive (>20 mins for LLVM/Clang!). YAML may contribute to some slowness here, and I would expect bit-format files used in index-while-build to speed up serialization/deserialization. But merging symbols from TUs can still be slow, so we might end up needing persistent storage for merged symbols and/or the symbol index. Obviously, this has to be measured when we have index-while-build.

By defining well the interface for the static index, I think it should be possible to support both scenarios (local/index-while-building vs remote).

For the local scenario with background indexing, I had made a rough prototype many months ago in order to just do basic testing of the “index while building” patches. We would like to join the effort in providing that kind of functionality to Clangd but it is not clear how to proceed. I am thinking, in the short term, we could help getting the “index-while-building” patches reviewed and accepted. But it would be good to make sure we are heading in the same direction and coordinate on what needs to be done.

Using index-while-building as the source of static index is also our long term goal, so we would definitely like to get aligned and also help in getting the index-while-build patches landed. I think there will be a large design space to integrate index-while-build into clangd, and a dedicated design doc would probably be a good starting point. But I think the index-while-building effort is not in the scope of Kirill’s design, which should focus on designing a performant symbol index for all scenarios.

Cheers,
Eric

Hi Dean,

Thank you very much for the suggestion! Having an index which can be searched and compacted at the same time would undoubtedly be amazing. We would certainly consider it and think about how this could fit into the design.

Right now, however, I think we should focus on immutable index model. The most building blocks (searching/query routine) are unlikely to change in the future, so we would like to start with these. The plan is to implement an immutable index and rebuild it once in a while and to get an idea of whether it is a performance bottleneck or not. Once everything is in place and we could get an idea of what parts of interaction with symbol index cause performance overhead, we can start optimizing and changing these parts.

As Eric and Marc-Andre mentioned before, we are looking into efficient index building, too, and I think both problems are connected and I think we should investigate possible solutions.

Kind regards,
Kirill Bobyrev

Hi Eric,

Thanks a lot for the clarifications. From the sound of it, we are heading in a similar direction.

Cheers,

Marc-André