Thanks a lot for posting this proposal! I have a few questions that are maybe a bit more high-level. About the “static index” mentioned, when would it be updated? If I remember correctly, I think for Google the static index might be a remote server and I assume it would be updated periodically when new commits are applied on the repo? Just making sure I understand where your use case. When you mentioned the proposal would be implemented by the end of September, would that still use the YAML as the static index storage?
Note that Kirill’s design is trying to address the problem of serving collected symbols efficiently (i.e. implementating SymbolIndex) instead of “indexing”/collecting symbols for static index.
One of the main goals here is to make it work well for both small dynamic index and large static index. And there should be no restriction on how symbols are collected. For clangd, static index is just a “SymbolIndex” that does not change within the span of a clangd instance. The symbols can come from the YAML global-symbol-builder or from index-while-build, and we don’t expect the symbol index implementation to change dramatically across different scenarios. So to answer your question, yes, the short term plan is to continue using the offline-built YAML symbol table (global-symbol-buider) as the symbol source for the static index The yaml stuff is experimental, and we would also like to move to index-while-build in the future.
If not, what did you have in mind for a more typical usage of Clangd on a single machine? Would the static index be the the unit and record files (index-while-building)?
We were thinking along those lines: when a file is changed and saved, Clangd starts a background indexing task and updates the corresponding unit/record files. Unsaved files would be the dynamic index.
This sounds like a good optimization that can be useful in index-while-build integration.
Now, the index of “USR to record-files” (and other global-level info) could be generated when Clangd is started by reading all unit/record files and then kept in memory. I haven’t done measurements on how fast that would be, but judging from the presentation last fall , it was taking a few seconds on LLVM/Clang. I could imagine this taking a few minutes every time Clangd is started on a bigger code base. So next step would be to persist that index to disk for a greater speed-up, using LMDB or similar.
Another interesting problem with index-while-build is how to build a full symbol index (with both fuzzy find and USR->record lookup support) quickly for symbols from all TUs, when a new clangd instance is started. From our experience with global-symbol-builder, merging symbols across all TUs can be very expensive (>20 mins for LLVM/Clang!). YAML may contribute to some slowness here, and I would expect bit-format files used in index-while-build to speed up serialization/deserialization. But merging symbols from TUs can still be slow, so we might end up needing persistent storage for merged symbols and/or the symbol index. Obviously, this has to be measured when we have index-while-build.
By defining well the interface for the static index, I think it should be possible to support both scenarios (local/index-while-building vs remote).
For the local scenario with background indexing, I had made a rough prototype many months ago in order to just do basic testing of the “index while building” patches. We would like to join the effort in providing that kind of functionality to Clangd but it is not clear how to proceed. I am thinking, in the short term, we could help getting the “index-while-building” patches reviewed and accepted. But it would be good to make sure we are heading in the same direction and coordinate on what needs to be done.
Using index-while-building as the source of static index is also our long term goal, so we would definitely like to get aligned and also help in getting the index-while-build patches landed. I think there will be a large design space to integrate index-while-build into clangd, and a dedicated design doc would probably be a good starting point. But I think the index-while-building effort is not in the scope of Kirill’s design, which should focus on designing a performant symbol index for all scenarios.