Getting source files into the CompilationDatabase interfaces

Hi,

as we’ve seen more use of the compilation database (for example through the YCM vim plugin), we noticed that for networked build systems and server applications the information we currently expose in the interfaces (path, file-name, command line arguments) is not enough - we also need to be able to get all source code, which we then can put into clang’s VFS to get a fully build-system independent clang run over a translation unit.

We think that getting the source information (perhaps optionally) as part of a getCompileCommands run is a good fit - a build system always must know how to provide the required sources, and as such it seems to be a natural fit.

I’d propose to change the CompilationDatabase interface (tools/clang/include/clang/Tooling/CompilationDatabase.h) to that end, and see two possible solutions, which both have different pros and cons:

  1. Add a map from std::string (file-name) → std::string (source content) to the CompileCommand class (in tools/clang/include/clang/Tooling/CompilationDatabase.h); let specific CompilationDatabases optionally fill in that information
  2. Do not modify CompileCommand - instead, add a getCompileCommandsAndSources(StringRef FilePath) method that returns a vector<pair<CompileCommand, map<string, string>>> which also includes the sources; I’m reluctant to split the call into two, as the compile command and the sources are tightly coupled (if a user syncs in the background, both tend to change at the same time)
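
To make the two shapes concrete, here is a rough sketch using simplified stand-in types rather than the real clang classes. The ToyDatabase types, the Sources member name, the std::string parameters (instead of StringRef) and the hard-coded file contents are all assumptions made for illustration only:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// File name -> source content, as described in the proposal.
using SourceMap = std::map<std::string, std::string>;

namespace option1 {
// Stand-in for CompileCommand with the proposed optional member added.
struct CompileCommand {
  std::string Directory;
  std::vector<std::string> CommandLine;
  SourceMap Sources; // left empty by databases that cannot provide sources
};

// Hypothetical database that fills in the sources on the existing call.
struct ToyDatabase {
  std::vector<CompileCommand>
  getCompileCommands(const std::string &FilePath) const {
    assert(!FilePath.empty() && "expects the path of a source file");
    CompileCommand Cmd;
    Cmd.Directory = "/build";
    Cmd.CommandLine = {"clang++", "-c", FilePath};
    Cmd.Sources[FilePath] = "int main() { return 0; }\n";
    std::vector<CompileCommand> Result;
    Result.push_back(Cmd);
    return Result;
  }
};
} // namespace option1

namespace option2 {
// CompileCommand stays as it is today.
struct CompileCommand {
  std::string Directory;
  std::vector<std::string> CommandLine;
};

// Hypothetical database exposing the sources through a separate call.
struct ToyDatabase {
  std::vector<std::pair<CompileCommand, SourceMap>>
  getCompileCommandsAndSources(const std::string &FilePath) const {
    assert(!FilePath.empty() && "expects the path of a source file");
    CompileCommand Cmd;
    Cmd.Directory = "/build";
    Cmd.CommandLine = {"clang++", "-c", FilePath};
    SourceMap Sources;
    Sources[FilePath] = "int main() { return 0; }\n";
    std::vector<std::pair<CompileCommand, SourceMap>> Result;
    Result.emplace_back(Cmd, Sources);
    return Result;
  }
};
} // namespace option2
```

In both variants the command and its sources come back from a single call, so they cannot drift apart between two requests.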

I lean towards solution (1), mainly because it seems less invasive. On the other hand, it might make it less obvious to users of libclang what’s going on (for both solutions I’d propose extending the libclang interface basically in a very similar way to how the C++ code changes).

Opinions?

Thanks,
/Manuel

> Hi,
>
> as we've seen more use of the compilation database (for example through the
> YCM vim plugin), we noticed that for networked build systems and server
> applications the information we currently expose in the interfaces (path,
> file-name, command line arguments) is not enough - we also need to be able
> to get all source code, which we then can put into clang's VFS to get a
> fully build-system independent clang run over a translation unit.
>
> We think that getting the source information (perhaps optionally) as part of
> a getCompileCommands run is a good fit - a build system always must know how
> to provide the required sources, and as such it seems to be a natural fit.

I think the premise is sound; it's information that's useful, and
something the compilation database can definitely provide.

> I'd propose to change the CompilationDatabase interface
> (tools/clang/include/clang/Tooling/CompilationDatabase.h) to that end, and
> see two possible solutions, which both have different pros and cons:
> 1. Add a map from std::string (file-name) -> std::string (source content) to
> the CompileCommand class (in
> tools/clang/include/clang/Tooling/CompilationDatabase.h); let specific
> CompilationDatabases optionally fill in that information
> 2. Do not modify CompileCommand - instead, add a
> getCompileCommandsAndSources(StringRef FilePath) method that returns a
> vector<pair<CompileCommand, map<string, string>>> which also includes the
> sources; I'm reluctant to split the call into two, as the compile command
> and the sources are tightly coupled (if a user syncs in the background, both
> tend to change at the same time)

Can we have a bit more information about these two options? I think I
may be a bit confused. Option #1 sounds like it maps a file name to
the file contents. But I can't seem to make heads or tails of what
the input and outputs are for Option #2.

Thanks!

~Aaron

> Hi,
>
> as we've seen more use of the compilation database (for example through
> the YCM vim plugin), we noticed that for networked build systems and server
> applications the information we currently expose in the interfaces (path,
> file-name, command line arguments) is not enough - we also need to be able
> to get all source code, which we then can put into clang's VFS to get a
> fully build-system independent clang run over a translation unit.

This is a good idea, but do keep in mind (you probably do already, but I'll
mention it just in case) that for autocompletion and similar
code-comprehension features in a source code editor the canonical state of
a source file is not what the filesystem sees, it's what the user has in
his editor buffer. The state in the buffer may be unsaved but the user
still wants it used for code-completion.

So it's perfectly fine if libclang can say "here's what I think is the
state of the relevant source files" as long as the caller can then override
this with their own data.

> We think that getting the source information (perhaps optionally) as part
> of a getCompileCommands run is a good fit - a build system always must know
> how to provide the required sources, and as such it seems to be a natural
> fit.
>
> I'd propose to change the CompilationDatabase interface
> (tools/clang/include/clang/Tooling/CompilationDatabase.h) to that end, and
> see two possible solutions, which both have different pros and cons:
> 1. Add a map from std::string (file-name) -> std::string (source content)
> to the CompileCommand class (in
> tools/clang/include/clang/Tooling/CompilationDatabase.h); let specific
> CompilationDatabases optionally fill in that information
> 2. Do not modify CompileCommand - instead, add a
> getCompileCommandsAndSources(StringRef FilePath) method that returns a
> vector<pair<CompileCommand, map<string, string>>> which also includes the
> sources; I'm reluctant to split the call into two, as the compile command
> and the sources are tightly coupled (if a user syncs in the background,
> both tend to change at the same time)

I'm curious how either of the two APIs would be exposed through the
libclang C API.

> > Hi,
> >
> > as we've seen more use of the compilation database (for example through
> > the YCM vim plugin), we noticed that for networked build systems and
> > server applications the information we currently expose in the
> > interfaces (path, file-name, command line arguments) is not enough - we
> > also need to be able to get all source code, which we then can put into
> > clang's VFS to get a fully build-system independent clang run over a
> > translation unit.
> >
> > We think that getting the source information (perhaps optionally) as
> > part of a getCompileCommands run is a good fit - a build system always
> > must know how to provide the required sources, and as such it seems to
> > be a natural fit.
>
> I think the premise is sound; it's information that's useful, and
> something the compilation database can definitely provide.
>
> > I'd propose to change the CompilationDatabase interface
> > (tools/clang/include/clang/Tooling/CompilationDatabase.h) to that end,
> > and see two possible solutions, which both have different pros and cons:
> > 1. Add a map from std::string (file-name) -> std::string (source
> > content) to the CompileCommand class (in
> > tools/clang/include/clang/Tooling/CompilationDatabase.h); let specific
> > CompilationDatabases optionally fill in that information
> > 2. Do not modify CompileCommand - instead, add a
> > getCompileCommandsAndSources(StringRef FilePath) method that returns a
> > vector<pair<CompileCommand, map<string, string>>> which also includes
> > the sources; I'm reluctant to split the call into two, as the compile
> > command and the sources are tightly coupled (if a user syncs in the
> > background, both tend to change at the same time)
>
> Can we have a bit more information about these two options? I think I
> may be a bit confused. Option #1 sounds like it maps a file name to
> the file contents. But I can't seem to make heads or tails of what
> the input and outputs are for Option #2.

Ah, both basically would additionally provide a map<string, string> that
maps from filenames to file contents, so a client can overlay those file
contents to get a fully hermetic "replay" of the compilation.

The difference is that option #1 would basically keep the old interface,
and optionally the compilation database could provide the file contents (if
it wants), while option #2 would introduce a new interface for clients to
ask for file contents explicitly if they need it.

> > Hi,
> >
> > as we've seen more use of the compilation database (for example through
> > the YCM vim plugin), we noticed that for networked build systems and
> > server applications the information we currently expose in the
> > interfaces (path, file-name, command line arguments) is not enough - we
> > also need to be able to get all source code, which we then can put into
> > clang's VFS to get a fully build-system independent clang run over a
> > translation unit.
>
> This is a good idea, but do keep in mind (you probably do already, but
> I'll mention it just in case) that for autocompletion and similar
> code-comprehension features in a source code editor the canonical state of
> a source file is not what the filesystem sees, it's what the user has in
> his editor buffer. The state in the buffer may be unsaved but the user
> still wants it used for code-completion.
>
> So it's perfectly fine if libclang can say "here's what I think is the
> state of the relevant source files" as long as the caller can then override
> this with their own data.

Yes, of course - the libclang change would be that you have some additional
functions (much like the ones for iterating over command line arguments)
that would allow you to fetch all file names and the contents of those
files - in the end, the client can of course exchange any file with its own
content before handing it off to the parser.
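
As a rough illustration of that last step, a client could overlay its unsaved editor buffers onto whatever the database returned. This is only a sketch: overlayEditorBuffers and the SourceMap alias are made-up names, not part of libclang or the Tooling library.

```cpp
#include <cassert>
#include <map>
#include <string>

// File name -> source content.
using SourceMap = std::map<std::string, std::string>;

// Starts from the contents the compilation database provided and lets the
// client's unsaved buffers win for every file present in both maps. Buffers
// for files the database did not mention are added as well.
SourceMap overlayEditorBuffers(SourceMap FromDatabase,
                               const SourceMap &UnsavedBuffers) {
  for (const auto &Entry : UnsavedBuffers)
    FromDatabase[Entry.first] = Entry.second;
  // Every unsaved buffer is now part of the result.
  assert(FromDatabase.size() >= UnsavedBuffers.size());
  return FromDatabase;
}
```

The merged map is what the client would then hand to the parser, for example as remapped or in-memory files.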

> > We think that getting the source information (perhaps optionally) as
> > part of a getCompileCommands run is a good fit - a build system always
> > must know how to provide the required sources, and as such it seems to
> > be a natural fit.
> >
> > I'd propose to change the CompilationDatabase interface
> > (tools/clang/include/clang/Tooling/CompilationDatabase.h) to that end,
> > and see two possible solutions, which both have different pros and cons:
> > 1. Add a map from std::string (file-name) -> std::string (source
> > content) to the CompileCommand class (in
> > tools/clang/include/clang/Tooling/CompilationDatabase.h); let specific
> > CompilationDatabases optionally fill in that information
> > 2. Do not modify CompileCommand - instead, add a
> > getCompileCommandsAndSources(StringRef FilePath) method that returns a
> > vector<pair<CompileCommand, map<string, string>>> which also includes
> > the sources; I'm reluctant to split the call into two, as the compile
> > command and the sources are tightly coupled (if a user syncs in the
> > background, both tend to change at the same time)
>
> I'm curious how either of the two APIs would be exposed through the
> libclang C API.

See above...

> > I lean towards solution (1), mainly because it seems less invasive. On
> > the other hand, it might make it less obvious to users of libclang what's
> > going on (for both solutions I'd propose extending the libclang interface
> > basically in a very similar way to how the C++ code changes).

One open question with #1 is: what do we do for the JSONCompilationDatabase
/ FixedCompilationDatabase? Do we always read all files from disk? On the
one hand, it seems like often unnecessary overhead, on the other hand, it
might take away problematic race conditions with files changing underneath
the client...

Option #2 was accepting a file path input, but returning a vector of
information -- was the file path the path to the database itself, or
to a specific source file?

~Aaron


> Option #2 was accepting a file path input, but returning a vector of
> information -- was the file path the path to the database itself, or
> to a specific source file?

The source file - we return a vector because we can have multiple ways to
compile the same file (for example in host and target configuration in
cross-compilations)
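
A small sketch of why the lookup is keyed by source file but returns a vector. The types below are simplified stand-ins, not the real clang classes, and ToyDatabase and its add method are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a compile command.
struct CompileCommand {
  std::string Directory;
  std::vector<std::string> CommandLine;
};

// Hypothetical in-memory database keyed by source file path.
class ToyDatabase {
public:
  // Registers one more way to compile FilePath.
  void add(const std::string &FilePath, CompileCommand Cmd) {
    assert(!FilePath.empty() && "commands are keyed by source file");
    Commands[FilePath].push_back(std::move(Cmd));
  }

  // Returns every known command for FilePath, or an empty vector if the
  // file is not in the database.
  std::vector<CompileCommand>
  getCompileCommands(const std::string &FilePath) const {
    auto It = Commands.find(FilePath);
    if (It == Commands.end())
      return {};
    return It->second;
  }

private:
  std::map<std::string, std::vector<CompileCommand>> Commands;
};
```

Registering the same file once for a host build and once for a target build makes getCompileCommands return two entries for it, and the client picks the configuration it cares about.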

For anybody interested, I sent out:
http://llvm-reviews.chandlerc.com/D2121

I decided to go for the "optionally provided mapping" solution. Thanks for all the feedback, questions, and ideas. Feel free to chime in on the review thread.