Multi-project workflow via multiple CDBs

Hi clangd-dev,

Suppose you are working on a codebase that involves multiple projects: for example, an application and one or more libraries it depends on. The projects may live in distinct repositories, have distinct build systems, and therefore produce distinct compilation database files; yet, it may be useful to treat the set of projects as a single codebase for editing purposes.

Specifically, it would be useful to have a way for clangd to index a set of such projects together and cross-reference them as a single codebase.

To give a concrete example, consider the following toy repository:

https://github.com/simark/clangd-multi-project-test

It consists of a library called "libfoo" and an application "bar" which uses the library. For ease of demonstration they are together in the same repository, but they are built separately and produce separate compilation database files. There is a "setup.sh" script to easily configure and build them both.

To be specific about what I mean by cross-referencing them as a single codebase: we would like to be able to set up clangd in such a way that:

* "Go to definition" on the use of Multiply() in bar.cpp takes you to the definition in foo.cpp (not the declaration in foo.h)
* Searching for references of Multiply() in foo.cpp finds the usage in bar.cpp

We also have a suggestion for how clangd could support such a setup:

* Clangd could be modified to accept multiple compilation database files, either at startup time via command line flags, or in the "initialize" request, or both.
* When given multiple compilation database files, clangd would merge them into a single compilation database, and then proceed as if it had been given the merged database.
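
To make the merge step concrete, here is a minimal Python sketch that combines the entries of several compile_commands.json files into one list. It is an illustration only: the function name `merge_cdbs` and the "first database wins" policy for files that appear in more than one database are assumptions, not anything clangd implements.

```python
import json
import os
import tempfile


def merge_cdbs(paths):
    """Merge several compile_commands.json files into one list of entries.

    Assumption (not clangd behavior): if the same file appears in more than
    one database, the entry from the earliest database in `paths` wins.
    """
    merged = []
    seen = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            entries = json.load(f)
        for entry in entries:
            # "file" may be relative to the entry's "directory".
            key = os.path.normpath(os.path.join(entry["directory"], entry["file"]))
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged
```

A real implementation would also have to decide what to do when two projects compile the same file with different flags, which this sketch sidesteps.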

Any feedback on either the described use case, or the suggested solution, would be appreciated!

Thanks,
Nate

Hi Nathan,

clangd has an experimental option to automatically build an index in the background for the projects (i.e. the CDBs) of the files you’re editing, and to load that index on subsequent runs.
We hope this will become the default in the long term. The idea is that the index will be split into per-file chunks, and those chunks will be loaded when the files are accessed.
I believe it currently won’t provide the definition locations, because they would be in the chunks for the ‘.cpp’ files and ‘.cpp’ files are not included by the users of the library.

However, if we figure out how to load this extra information for the symbols from the headers chunks, we should be covered.
+Sam McCall, +Kadir Çetinkaya, who are actually working on this, they might have ideas on how to do it best.

PS: to try out the background index, build clangd from the latest head (it’s important to sync before building, since a change that loads the index landed today) and run clangd with the ‘-background-index’ flag.

Thanks for the info! I was aware of the auto-index functionality, but didn't know that it already supported multiple CDBs.

Then if you only have files open from the source/bar directory in your clangd instance,
it won't become aware of the other compilation database that lies in libfoo and
will only give you the declaration for Multiply. But if you open any source file under
source/libfoo (not necessarily libfoo/foo.cpp itself), clangd will be able to index
that directory as well and start giving you the definition location for Multiply.
(You also need to #include "foo.h" in foo.cpp, but I assume that omission was unintentional.)

Does that sound good?

It sounds very good!

Out of curiosity, though - if a client wanted to send all the CDBs upfront, to avoid the requirement that you have to open a file from each project, would that be a reasonable capability to add?

Thanks,
Nate

Caveats here: how the auto-index behaves with multiple CDBs hasn’t been heavily tested yet, and I think there are likely bugs.
What Ilya describes is the behavior that naturally fell out of the design (given that a clangd instance supports multiple CDBs, which aren’t known up-front).
The index works by building a whole TU at once (using a command from the main-file’s CDB), and then partitioning the symbols according to the files they appear in. We probably accidentally assumed that each partition is then associated with the TU’s CDB, which may cause some problems (if leaf/leaf.cc and lib/lib.cc both include lib/lib.h, then we’ll end up indexing lib.h in both the leaf/ and lib/ projects).
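
The lib.h problem can be illustrated with a toy model in Python. This is a sketch under assumptions, not clangd's indexer: the `shard_owners` function and the tuple layout are hypothetical, and the model simply attributes every file seen while indexing a TU to that TU's project.

```python
from collections import defaultdict


def shard_owners(translation_units):
    """Map each file to the set of projects whose TUs produced a shard for it.

    `translation_units` is a list of (project, main_file, included_files).
    Assumption: every file seen while indexing a TU is attributed to the
    project (CDB) of that TU's main file.
    """
    owners = defaultdict(set)
    for project, main_file, includes in translation_units:
        for path in [main_file, *includes]:
            owners[path].add(project)
    return dict(owners)


tus = [
    ("leaf", "leaf/leaf.cc", ["lib/lib.h"]),
    ("lib", "lib/lib.cc", ["lib/lib.h"]),
]
owners = shard_owners(tus)
# The shared header is attributed to both projects.
print(sorted(owners["lib/lib.h"]))  # ['leaf', 'lib']
```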

Out of curiosity, though - if a client wanted to send all the CDBs upfront, to avoid the requirement that you have to open a file from each project, would that be a reasonable capability to add?

That seems doable. The way this currently works is:

  • A CDB is “discovered” the first time we try to get the command for a file under its directory
  • Once a CDB is “discovered”, the auto-index loads any index shards that exist for the CDB, and starts indexing files that don’t have an up-to-date shard

So we could add a custom command or a startup option to make clangd discover particular CDBs (e.g. by looking for a compile command for dir/dummy.txt, and then throwing it away).
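
The discovery mechanism described above can be modelled roughly as follows. This Python sketch is a toy, not clangd's code: the class `CdbDiscovery`, its methods, and the choice to walk up parent directories looking for a compile_commands.json are assumptions made for illustration. The `prime` method mirrors the dir/dummy.txt trick: it asks about a file that need not exist, purely for the side effect of discovering the CDB.

```python
import os
import tempfile


class CdbDiscovery:
    """Toy model of lazy CDB discovery; not clangd's actual implementation."""

    def __init__(self):
        self.discovered = []  # CDB directories, in discovery order

    def find_cdb_dir(self, file_path):
        """Walk up from the file's directory until a compile_commands.json is found."""
        directory = os.path.dirname(os.path.abspath(file_path))
        while True:
            if os.path.isfile(os.path.join(directory, "compile_commands.json")):
                if directory not in self.discovered:
                    # First sighting: a real server would load index shards
                    # and start background indexing here.
                    self.discovered.append(directory)
                return directory
            parent = os.path.dirname(directory)
            if parent == directory:  # reached the filesystem root
                return None
            directory = parent

    def prime(self, project_dir):
        """Force discovery by asking about a file that need not exist."""
        return self.find_cdb_dir(os.path.join(project_dir, "dummy.txt"))
```

With something like this, a startup option listing project directories would only need to call `prime` once per directory.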

It has since come to my attention that the LSP has recently gained a new workspace/workspaceFolders request [1], which is a mechanism for the server to learn about multiple root folders in the workspace, for editors that support such things. This request is not currently implemented in clangd.

Do you think this would be a good fit for telling the server about multiple CDBs as well? We could implement this request, and as an extension, support an additional field in the WorkspaceFolder struct to specify the CDB path.
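
To show what such an extension might look like on the wire, here is a small Python sketch that builds a client's response to a workspace/workspaceFolders request. The `uri` and `name` fields are the standard WorkspaceFolder fields; the `compilationDatabasePath` field and the helper name are hypothetical, made up only to illustrate the idea.

```python
import json


def workspace_folders_response(request_id, folders):
    """Build a JSON-RPC response to a workspace/workspaceFolders request.

    `folders` is a list of (uri, name, cdb_path) tuples. The
    `compilationDatabasePath` field is a hypothetical extension, not part
    of LSP.
    """
    result = []
    for uri, name, cdb_path in folders:
        folder = {"uri": uri, "name": name}
        if cdb_path is not None:
            folder["compilationDatabasePath"] = cdb_path  # hypothetical field
        result.append(folder)
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


response = workspace_folders_response(
    1,
    [
        ("file:///src/libfoo", "libfoo", "/build/libfoo"),
        ("file:///src/bar", "bar", None),
    ],
)
print(json.dumps(response, indent=2))
```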

Thanks,
Nate

[1] Redirecting…


Hey Nate, just wondering how this would work in VS Code. I assume it already would send this message. I’m not sure we can override it to send the additional field. Would adding a separate request be more flexible?

Hi Nathan,

Unless there are issues that would prevent auto-discovery from working, I think we should prefer it instead.

One strong reason is that auto-discovery would work regardless of the client support.

Even if we take workspace folders into account and load CDBs from those, I think it’s not enough.
A workspace folder represents a path to a project open in the editor. However, one does not typically open
all of their dependencies in the same editor but would want to discover CDBs for those dependencies.

+1 to Doug’s suggestion: if there’s a client that has a legitimate need to send all CDBs upfront, let’s extend LSP
to allow this instead of assigning our own semantics to the existing methods.

Unless there are issues that would prevent auto-discovery from working,
I think we should prefer it instead.
One strong reason is that auto-discovery would work regardless of the
client support.

Of course, auto-discovery should be the default. This is about providing another option.

My understanding is that the motivation for another option is building in multiple configurations. With multiple configurations, the CDBs need to go somewhere other than the source folder, since there is only one source folder.

+1 to Doug's suggestion: if there's a client that has a legitimate need
to send all CDBs upfront, let's extend LSP
to allow this instead of assigning our own semantics to the existing methods.

Agreed, that sounds reasonable to me.

Thanks,
Nate

Ah, sorry for missing the context. That makes sense, and for the multi-configuration use case we might need some extra work too: e.g. in addition
to loading the index for the new configuration, I’d expect clangd to unload the index for the old configuration, etc.

Hey Nate, just wondering how this would work in VS Code. I assume it already
would send this message. I'm not sure we can override it to send the additional
field. Would adding a separate request be more flexible?

Just wanted to follow up on this and say that after looking at VS Code's LSP client implementation a bit, it looks like they have a "middleware" feature that allows plugins to hook most (all?) LSP messages, including "workspace/workspaceFolders".

Meanwhile I'm also having an ongoing discussion about this in an LSP issue [1], and it was suggested there that the "workspace/configuration" server->client request, with the workspace folder as the "scope", would be a better fit here. (Note, "workspace/configuration" can also be hooked by plugins via the "middleware" mechanism to add to the response.)
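
As a rough sketch of that alternative, the server could send one "workspace/configuration" request with one item per workspace folder and pair the results back up with the folders. The `scopeUri` and `section` item fields come from LSP; the section name `clangd.compilationDatabasePath` and both helper names are hypothetical.

```python
def configuration_request(request_id, folder_uris,
                          section="clangd.compilationDatabasePath"):
    """Build a workspace/configuration request, one item per workspace folder.

    The default section name is a hypothetical setting, not an existing one.
    """
    items = [{"scopeUri": uri, "section": section} for uri in folder_uris]
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "workspace/configuration",
        "params": {"items": items},
    }


def cdb_paths_from_response(folder_uris, response):
    """Pair each folder with its returned setting (results follow request order)."""
    return dict(zip(folder_uris, response["result"]))


uris = ["file:///src/libfoo", "file:///src/bar"]
request = configuration_request(7, uris)
# A client (or a plugin hooking the request via middleware) might answer:
reply = {"jsonrpc": "2.0", "id": 7, "result": ["/build/libfoo", "/build/bar"]}
paths = cdb_paths_from_response(uris, reply)
```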

Regards,
Nate

[1] Allow WorkspaceFolder to carry some optional metadata · Issue #703 · microsoft/language-server-protocol · GitHub