RFC: Upstreaming index-while-building

Hi everyone,

I’m picking up the upstreaming of index-while-building functionality for clang. For more context (the previous RFC and the patch reviews) please see the end of this message.

The code on github.com/apple/swift-clang that I am going to upstream has evolved a little in the meantime. I’m in the process of extracting incremental patches, roughly broken up by functionality, applying review feedback from the previous round as I go.

The rough plan is to have these patches:

  • DirectoryWatcher - https://reviews.llvm.org/D58418
  • model of source file - https://reviews.llvm.org/D58478
  • Decl-s representation - depends on model of source file
  • index file format
  • tools/c-index-test/JSONAggregation.* - depends on index file format
  • C API - depends on index file format
  • Actions - depends on all of the above
  • index while building feature in clang - depends on all of the above

This partitioning is optimized for better incremental code review, but a large portion of the tests will land in the later patches once everything is in place. In places where we really ought to have dedicated unit tests I’ll add them.

Big thanks in advance to anyone willing to take a look!

Thanks

Jan

Hi Jan,

I'm very happy that you're picking up this work again!

Since the clangd team and the Apple source tools team last talked,
clangd has become a lot more full-featured, and I think there's a lot of
overlap between index-while-building and the indexing that is already
built into clangd in the open source repository.

I would like to suggest that we figure out a way to unify these
indexing implementations. The value proposition for the community is
that there is no feature duplication. The value proposition for you
is that index-while-building would be able to reuse the infrastructure
that clangd has already built. For example, global code completion,
or the fast index that supports complex and fuzzy queries (Dex,
http://lists.llvm.org/pipermail/cfe-dev/2018-July/058487.html).

My strawman proposal is that index-while-building should use the same
data structures as clangd for representing symbols -- please take a
look at clang-tools-extra/clangd/index/Index.h. It would also be
great if index-while-building could reuse the on-disk serialization
format for the symbol information --
clang-tools-extra/clangd/index/Serialization.h.

These are central to the indexing system, and I think we should reuse
them. Doing so would allow you to reuse other indexing
infrastructure, and infrastructure built on top of indexing, like Dex.
I'm afraid if index-while-building does not speak the same data
structures for symbols, it is unlikely that the two implementations
will ever converge.

What do you think?

Dmitri

Hi Dmitri,

Could you clarify something? It is my impression that clangd uses the same index symbol generation mechanism that IWB (index-while-building) uses as its source (the AST visitation of lib/Index and the related index consumer). I assume clangd takes that as its source of index symbols, processes them, and then generates its higher-level data structures. Is this correct?

IWB aims to be essentially just an efficient serialization mechanism for that same data, to generate the same raw data during a build with minimal overhead. It purposefully doesn’t do any higher-level processing of the symbols, e.g. anything that would involve merging index data across files; that would be a non-starter during a build. The design is that IWB serializes, during a build, the same data that lib/Index generates for a file, and then a higher-level indexing mechanism can use that raw data as a source for more sophisticated processing (e.g. clangd’s data structures or a database for cross-file queries).

What seems to me a great thing to explore is having clangd use the raw data that IWB generates as a source of index symbols, so that it can take advantage of the data generated during a build and does not have to separately create and process all the translation unit ASTs from the user’s project to build its data structures.
What do you think, does this make sense?

Hi Argyrios,

> Hi Dmitri,
>
> Could you clarify something? It is my impression that clangd uses the same index symbol generation mechanism that IWB (index-while-building) uses as its source (the AST visitation of lib/Index and the related index consumer). I assume clangd takes that as its source of index symbols, processes them, and then generates its higher-level data structures. Is this correct?

Yes, I think it uses the same index consumer.

clangd stores a "static index" on disk. Static index can be generated
either by a standalone indexing tool
(clang-tools-extra/clangd/indexer/IndexerMain.cpp), or by clangd
itself, when it is started with the `-background-index` command line
option. The static index is stored on disk per source file. For
example, if we have lib.h, and foo.cpp and bar.cpp both include lib.h,
the static index will have three files,
`.clangd/index/{lib.h.$HASH,foo.cpp.$HASH,bar.cpp.$HASH}`. $HASH is
the hash of the file contents. Indexing information about lib.h is
not emitted into index files for foo.cpp and bar.cpp. The static
index is generated from each TU in parallel, using all available cores
-- just like IWB. The first indexing action that indexes a TU using
a certain header writes the indexing information for that header.
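
The per-file layout described above can be illustrated with a small sketch. This is not clangd's code: the function names, the use of truncated SHA-1 as $HASH, and the sequential "first action wins" bookkeeping are all assumptions made for illustration. A real implementation runs indexing actions in parallel and would need synchronization around the set of written shards.

```python
import hashlib
from pathlib import Path


def shard_name(path, contents):
    """Return '<basename>.<HASH>', where HASH is derived from the file contents.

    Assumption: truncated SHA-1 stands in for whatever hash clangd really uses.
    """
    digest = hashlib.sha1(contents).hexdigest()[:16].upper()
    return f"{Path(path).name}.{digest}"


def index_translation_unit(main_file, includes, sources, written):
    """Simulate one indexing action for a TU (hypothetical helper).

    Emits a shard for the main file and for each included header, unless an
    earlier action has already emitted that shard. Returns the shard names
    written by this action.
    """
    emitted = []
    for path in [main_file, *includes]:
        name = shard_name(path, sources[path])
        if name not in written:
            written.add(name)
            emitted.append(name)
    return emitted


# Toy project: two .cpp files that both include lib.h.
sources = {
    "lib.h": b"int f();\n",
    "foo.cpp": b'#include "lib.h"\nint main() { return f(); }\n',
    "bar.cpp": b'#include "lib.h"\nint g() { return f(); }\n',
}
written = set()
foo_shards = index_translation_unit("foo.cpp", ["lib.h"], sources, written)
bar_shards = index_translation_unit("bar.cpp", ["lib.h"], sources, written)
```

In this toy run the action for foo.cpp writes two shards (foo.cpp and lib.h), the action for bar.cpp writes only its own shard, and the index ends up with three files in total.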

Each file with indexing information is more or less raw indexing data
scraped from the file. See
clang-tools-extra/clangd/index/Serialization.h, struct IndexFileIn,
struct IndexFileOut.

clangd builds a merged index over the whole project only in memory.
Therefore, the data that clangd writes to disk is *semantically*
equivalent to what IWB can write.

> IWB aims to be essentially just an efficient serialization mechanism for that same data, to generate the same raw data during a build with minimal overhead. It purposefully doesn’t do any higher-level processing of the symbols, e.g. anything that would involve merging index data across files; that would be a non-starter during a build.

To be clear, I'm not proposing that IWB builds a merged index across
the whole project. clangd does not even write a merged index to disk.
clangd's indexing information from LLVM+Clang+clang-tools-extra is
less than 100 MB on disk, and it can be quickly loaded during clangd
startup; after that, clangd builds a merged index in memory.
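
As a rough illustration of that data flow (per-file data on disk, merged index only in memory), here is a toy sketch. The shard contents are hypothetical lists of symbol names, and `merge_shards` is an invented helper; clangd's real shards and merge logic are considerably richer than this.

```python
from collections import defaultdict


def merge_shards(shards):
    """Build an in-memory lookup table from per-file shards.

    `shards` maps a file name to the symbol names scraped from that file
    (a stand-in for real per-file index data). The result maps each symbol
    to the sorted list of files that mention it. Nothing is written to disk.
    """
    merged = defaultdict(set)
    for file_name, symbols in shards.items():
        for symbol in symbols:
            merged[symbol].add(file_name)
    return {symbol: sorted(files) for symbol, files in merged.items()}


# Hypothetical per-file data, as it might look after loading shards at startup.
shards = {
    "lib.h": ["f"],
    "foo.cpp": ["f", "main"],
    "bar.cpp": ["f", "g"],
}
merged = merge_shards(shards)
```

Here `merged["f"]` lists all three files, while `merged["g"]` lists only bar.cpp. The per-file inputs stay unmerged, which is the property a build-time producer such as IWB needs.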

> The design is that IWB serializes, during a build, the same data that lib/Index generates for a file, and then a higher-level indexing mechanism can use that raw data as a source for more sophisticated processing (e.g. clangd’s data structures or a database for cross-file queries).

> What seems to me a great thing to explore is having clangd use the raw data that IWB generates as a source of index symbols, so that it can take advantage of the data generated during a build and does not have to separately create and process all the translation unit ASTs from the user’s project to build its data structures.
> What do you think, does this make sense?

I think clangd and IWB are already very aligned on the high-level data
flow. IWB is an optimization over the standalone indexing tool
(clang-tools-extra/clangd/indexer/IndexerMain.cpp) that allows
indexing information to be written out during the build instead of
having to run an extra tool.

What I'm asking is that IWB use the same data format for the
per-file, non-merged, indexing information that clangd already uses in
the standalone indexing tool and in background indexing.

Dmitri