Querying the index for derived classes

Hi!

I'm looking at implementing support for a proposed LSP protocol extension "textDocument/subTypes", which, given the location of a symbol naming a class type, returns information about its derived classes.

(For context, this is part of the same Type Hierarchy proposal [1] which also includes "textDocument/superTypes", which I've already implemented and posted about separately [2].)

Unlike "superTypes", for which all the necessary information is in the AST, "subTypes" requires an index query to locate derived types.

Based on my current understanding of clangd's index, the information currently stored in the index is not sufficient to implement such a query efficiently. It seems to me that the index would need a mechanism to record relationships between symbols; for example, for each symbol, to store a list of pointers to other symbols which are the derived classes.

Is there interest in extending the type of information stored in clangd's index along these lines?

I also recall a presentation by Apple [3] about storing relationships like this in the index. Does anyone know what the current status of that effort is?

Thanks,
Nate

[1] https://github.com/Microsoft/vscode-languageserver-node/pull/426
[2] http://lists.llvm.org/pipermail/clangd-dev/2019-January/000238.html
[3] https://www.youtube.com/watch?v=jGJhnIT-D2M

Hi Nathan,

This sounds like a nice feature, though there’s some complexity.

One problem is that as you say we’ll need to add index data for this, but the feature is neither standardized nor widely adopted. So it’s a tough tradeoff. (I’m not sure how much the increase to disk and serving structures is likely to be).

We do have plans to add some type information to the index for the purposes of code completion ranking (boost results that have the right type).
This will need to include the base types (e.g. so that when a Shape& is expected, a Circle& will match it). So maybe we can use the same data and query for (kind=type && type=SomeBase) to find derived types.
Ilya is working on this, this might be a nice solution without expanding the index, if expected-types ranking and subtypes don’t turn out to be too divergent.

Otherwise an alternative is storing related symbols as you say (either bases in the Symbol object like you say, or separately like Refs).
One limitation here is that subtypes are not strictly a symbol->symbol relationship when templates are considered. e.g. you can imagine class String : vector. Using type tokens (like the expected-types work does) will give the right answer here. Technically you can also have a template specialization inheriting from a base class, but that seems likely to be rarer.

Regarding Apple’s index-while-build: we’re mostly not using that in clangd’s own indexing for several reasons:

  • the big advantage of reusing the actual build work and UI relies on users building with a new clang, which we can’t assume
  • much of it hasn’t been (still isn’t?) available in upstream LLVM
  • it doesn’t provide data structures suitable for serving, it’s just data extraction. We do use a bit of the infrastructure to walk ASTs.

We do have plans to add some type information to the index for the
purposes of code completion ranking (boost results that have the
right type).
This will need to include the base types (e.g. so that when a Shape& is
expected, a Circle& will match it). So maybe we can use the same data
and query for (kind=type && type=SomeBase) to find derived types.
Ilya is working on this, this might be a nice solution without
expanding the index, if expected-types ranking and subtypes don't
turn out to be too divergent.

Thanks, this is an interesting direction to explore.

I looked briefly at the expected-types patches that have landed so far. Do I understand correctly that there are plans to expand the field Symbol::Type to also encode the types of the base classes?

One limitation here is that subtypes are not strictly a
symbol->symbol relationship when templates are considered.

Good point. Indeed, in the other C++ indexer that I'm familiar with (Eclipse CDT's), types have their own representation in the index (which includes a way to represent e.g. template specializations), and bases use that representation.

Regarding Apple's index-while-build:
[...]
We do use a bit of the infrastructure to walk ASTs.

Would that be the facilities in the "clang/Index/*" headers that e.g. SymbolCollector uses?

Thanks,
Nate

We do have plans to add some type information to the index for the
purposes of code completion ranking (boost results that have the
right type).
This will need to include the base types (e.g. so that when a Shape& is
expected, a Circle& will match it). So maybe we can use the same data
and query for (kind=type && type=SomeBase) to find derived types.
Ilya is working on this, this might be a nice solution without
expanding the index, if expected-types ranking and subtypes don’t
turn out to be too divergent.

Thanks, this is an interesting direction to explore.

I looked briefly at the expected-types patches that have landed so far. Do I understand correctly that there are plans to expand the field Symbol::Type to also encode the types of the base classes?

We plan to do this for expected types at some point, however my natural instinct would be to avoid using this for anything other than completion ranking.
Not saying this wouldn’t work, but my assumption is that storing hierarchies separately will be cheap both in storage and performance, make the intention of the code querying them clearer and would allow to generalize easier for the template hierarchy (for go to template specializations/explicit instantiations).

We do have plans to add some type information to the index for the
purposes of code completion ranking (boost results that have the
right type).
This will need to include the base types (e.g. so that when a Shape& is
expected, a Circle& will match it). So maybe we can use the same data
and query for (kind=type && type=SomeBase) to find derived types.
Ilya is working on this, this might be a nice solution without
expanding the index, if expected-types ranking and subtypes don’t
turn out to be too divergent.

Thanks, this is an interesting direction to explore.

I looked briefly at the expected-types patches that have landed so far. Do I understand correctly that there are plans to expand the field Symbol::Type to also encode the types of the base classes?

We plan to do this for expected types at some point, however my natural instinct would be to avoid using this for anything other than completion ranking.
Not saying this wouldn’t work, but my assumption is that storing hierarchies separately will be cheap both in storage and performance, make the intention of the code querying them clearer and would allow to generalize easier for the template hierarchy (for go to template specializations/explicit instantiations).

Fair enough, we should measure. I’m slightly nervous about thousands of tiny posting lists in dex, but maybe my intuition is failing me.
At least we should reuse (some of) the code that encodes types opaquely for the index, though.

One limitation here is that subtypes are not strictly a
symbol->symbol relationship when templates are considered.

Good point. Indeed, in the other C++ indexer that I’m familiar with (Eclipse CDT’s), types have their own representation in the index (which includes a way to represent e.g. template specializations), and bases use that representation.

So we have a (trivial for now) encoding that allows types to be attributes, query for them etc: https://reviews.llvm.org/source/clang-tools-extra/browse/clang-tools-extra/trunk/clangd/ExpectedTypes.h$42
But we don’t have them as first-class entities in the index like Symbols and References are. My gut feeling is we might be able to get away without this, by attaching them to symbols in the right way.

Regarding Apple’s index-while-build:
[…]
We do use a bit of the infrastructure to walk ASTs.

Would that be the facilities in the “clang/Index/*” headers that e.g. SymbolCollector uses?

Yes, that’s right.

I posted a WIP implementation of type hierarchy subtypes here:

https://reviews.llvm.org/D58880

It's incomplete, but there should be enough there to critique the proposed index model and index API changes on a high level. I would welcome feedback on whether my approach, which is adding a "Relations" entry to the index in addition to Symbols and Refs, is reasonable.

Thanks,
Nate