CompilationDatabase support added to libclang and python binding

Hi Manuel,

I am done adding the CompilationDatabase functionality to libclang
(first patch)
and the python binding (second patch).

With respect to our last email exchange :
- comments have been modified to be futureproof with respect to
database actual format.
- the python binding is using exceptions for operations which have failed
- tests have been added for the libclang part as well as the python binding

The tests are making an assumption I am not totally happy with : in case
a file has several CompileCommands, the tests expect to get the
CompileCommands in the same order they appear in the database.

In c-index-test, I have added a 'dirname' function for use on Windows
platforms, assuming that if 'basename' is not available, than 'dirname'
would not. I do not know if this is necessary, and I have not tested it
as I have no windows platform available.

Cheers,

0001-cindex-cdb.patch (12.3 KB)

0002-cindex-cdb-python.patch (7.88 KB)

Hi Manuel,

I am done adding the CompilationDatabase functionality to libclang
(first patch)
and the python binding (second patch).

With respect to our last email exchange :
- comments have been modified to be futureproof with respect to
database actual format.
- the python binding is using exceptions for operations which have failed
- tests have been added for the libclang part as well as the python binding

The tests are making an assumption I am not totally happy with : in case
a file has several CompileCommands, the tests expect to get the
CompileCommands in the same order they appear in the database.

In c-index-test, I have added a 'dirname' function for use on Windows
platforms, assuming that if 'basename' is not available, than 'dirname'
would not. I do not know if this is necessary, and I have not tested it
as I have no windows platform available.

Yay - new functionality! Overall a great patch. I can't wait to see
this integrated.

I do have a number of comments though.

I am lacking clout on the C front, so take that feedback with a grain
of salt. I'm a maintainer of the Python bindings though, so that must
mean something.

I've tried to fight this battle before and I'll say it again: I don't
like the existing naming convention for the libclang APIs.
Specifically, there isn't a consistent one. And, this patch continues
that trend (not your fault - you were basing it on existing code,
which is the right thing to do).

Let's look at the new API (in source order):

CXString clang_tooling_CompileCommand_getDirectory(CXCompileCommand);
unsigned clang_tooling_CompileCommand_getNumArgs(CXCompileCommand);
CXString clang_tooling_CompileCommand_getArg(CXCompileCommand, unsigned i);
void clang_tooling_disposeCompilationDb(CXCompilationDb);
CXCompilationDb clang_tooling_getCompilationDb_fromDirectory(const
char *buildDir);
void clang_tooling_disposeCompileCommands(CXCompileCommands);
unsigned clang_tooling_CompileCommands_getSize(CXCompileCommands);
CXCompileCommand
clang_tooling_CompileCommands_getCommand(CXCompileCommands, unsigned
i);
CXCompileCommands clang_tooling_getCompileCommands(CXCompilationDb
CDb, const char *completeFileName);

First, some nits. I would change the source order of the declarations
so related APIs are grouped together then sorted from highest to
lowest level. Start with all the functions operating on
CXCompilationDb. Then define everything operating on
CXCompileCommands. Finally CXCompileCommand. This will increase
readability, IMO. Also, always place the "constructors" and
"destructors" next to each other and reference the other in the
doxygen blocks. This increases the chances that the destructors are
called appropriately.

The use of "Db" rather than "DB" looks unpleasant to me. I think you
should capitalize both letters.

Everything is prefixed with "clang_tooling_." The "tooling" bit is new
to libclang. I personally like prefixing this way as it draws a clear
line between underlying modules. But, I know others feel differently
and may argue for its elimination. If we decide to keep it, perhaps a
new header, Tooling.h, would be the appropriate place to declare these
APIs since they are clearly separate from "index." This would extend
to the Python API as well.

I really like your general naming convention of
clang_tooling_<thing>_<action>. IMO, this is a proper naming
convention. Related functions are naturally grouped together and the
name communicates what they operate on and the action being performed.
Unfortunately, the rest of the libclang API isn't like this.
Furthermore, your patch isn't consistent in this style. e.g.
"clang_tooling_disposeCompilationDb" should be
"clang_tooling_CompilationDb_dispose" and
"clang_tooling_getCompilationDb_fromDirectory" should be
"clang_tooling_CompilationDb_getFromDirectory" or even
"clang_tooling_CompilationDb_fromDirectory," etc. Since the existing
API doesn't use my preferred naming convention, I guess this means you
don't have to adopt it. Whatever you end up doing, please make it
consistent because an inconsistent convention is worse than a
consistently "bad" one.

I wonder if a clang_tooling_CompileCommand_getArgs(CXCompileCommand,
CXString **, size_t *size) (or similar) would be helpful. This API
would allocate a new array of CXString or CXString* in one go instead
of requiring a function call for each argument. Or, you could have the
caller pre-allocate the array and the function would fill. Semantics.
I think the majority of users are interested in most or all arguments
and this API would save some function call overhead (which could be
noticeable in places like the Python bindings). I can make the same
argument for having a
clang_tooling_CompileCommands_getCommands(CXCompileCommands,
CXCompileCommand **, size_t *).

I really don't like the magic of compilation databases being
constructed from a directory with the auto-discovery of
"compile_commands.json." I know this is how the underlying API works,
so this isn't the fault of this patch. I think the underlying API and
libclang should be supplemented with the ability to construct the
database from an arbitrary file or possibly even a buffer with an
associated compilation database type (CXCompilationDBType_JSON).

While we're on the subject of limitations in the underlying API, the
lack of the ability to extract *all* compilation commands is a serious
downer. I can think of a number of use cases where one would crawl all
compilation databases and operate on all commands. Are those users
expected to implement their own parsers for the compilation database
and not go through the tooling/libclang API?

Whew, that was a lot. Again, I don't have clout in C land, so don't go
integrating my feedback before others have time to rebuke.

On to the Python!

Nit: Include space after comma between function arguments. This is
missing throughout.

Nit: Again, I like "CompilationDB" not "CompilationDb." I may like
"CompilationDatabase" even more.

Nit: I like wrapping the method summary on the same line as """. i.e.
don't waste lines.

@property
def getDirectory(self):

This produces code that looks like:

dir = command.getDirectory

This looks weird. To most Python programmers, it looks like you are
retrieving the underlying method. Rename your properties so they are
nouns, not verbs. e.g. "directory," "commandLine."

For the CompileCommand arguments, I would make it a "arguments"
property or a "getArguments" method. To me, "command line" implies the
full string, not a list of strings.

On a related note, there is no need to create inner iterator classes.
Instead, just use yield, which does all the magic for you. e.g.

def getArguments(self):
    length = CompileCommand_getNumArgs(self.cl)
    for i in xrange(length):
        yield CompileCommand_getArg(self.cl, i)

Of course, you lose the ability to do efficient len() and subscript
access. Callers can gain this back by doing list(foo.getArguments()),
etc. But, that buffers everything, which could have performance
side-effects. That could be a problem for cursor iteration. I don't
think it is a problem for compilation databases. Although, it /might/
be if we ever expose an API to retrieve all CompileCommands. The good
news is you can swap in a custom iterator later and the API is
backwards-compatible (yield returns an iterator magically under the
hood). My vote is to optimize for fewer lines of code now. Complexity
can always be added later if needed for performance.

The current Python API is currently inconsistent with its use of
iterators in terms of using properties or methods. I /think/ I like
methods more. But, I'm fine either way as long as things are
documented.

In CompilationDb.fromDirectory, the ptr validation can be done in a
from_result function registered on the ctypes library. I would prefer
this, as that ensures all call sites have the same behavior (nevermind
there will only likely be 1 call site).

+ try:
+ cmds = cdb.getCompileCommands('file_do_not_exist.cpp')
+ except KeyError:
+ assert True
+ else:
+ assert False

You can do this?! I guess you do learn something new every day!

+ cmdline = ' '.join(cmds[0].getCommandLine)
+ assert cmdline == 'clang++ -o project.o -c
/home/john.doe/MyProject/project.cpp'

Please compare lists instead. Every time you compare derived entities,
you introduce a point for subtle failures.

OK. That's a lot of feedback. I think I'm done :slight_smile:

Again, great patch and thank you for putting the time into it. Despite
the length of this reply, things are looking good. Most of my comments
are cosmetic in nature, so they should be trivial to address. Again,
expect conflicting feedback on my C comments.

Gregory

Hi Manuel,

I am done adding the CompilationDatabase functionality to libclang
(first patch)
and the python binding (second patch).

With respect to our last email exchange :

  • comments have been modified to be futureproof with respect to
    database actual format.
  • the python binding is using exceptions for operations which have failed
  • tests have been added for the libclang part as well as the python binding

The tests are making an assumption I am not totally happy with : in case
a file has several CompileCommands, the tests expect to get the
CompileCommands in the same order they appear in the database.

In c-index-test, I have added a ‘dirname’ function for use on Windows
platforms, assuming that if ‘basename’ is not available, than ‘dirname’
would not. I do not know if this is necessary, and I have not tested it
as I have no windows platform available.

Yay - new functionality! Overall a great patch. I can’t wait to see
this integrated.

I do have a number of comments though.

I am lacking clout on the C front, so take that feedback with a grain
of salt. I’m a maintainer of the Python bindings though, so that must
mean something.

I’ve tried to fight this battle before and I’ll say it again: I don’t
like the existing naming convention for the libclang APIs.
Specifically, there isn’t a consistent one. And, this patch continues
that trend (not your fault - you were basing it on existing code,
which is the right thing to do).

Let’s look at the new API (in source order):

CXString clang_tooling_CompileCommand_getDirectory(CXCompileCommand);
unsigned clang_tooling_CompileCommand_getNumArgs(CXCompileCommand);
CXString clang_tooling_CompileCommand_getArg(CXCompileCommand, unsigned i);
void clang_tooling_disposeCompilationDb(CXCompilationDb);
CXCompilationDb clang_tooling_getCompilationDb_fromDirectory(const
char *buildDir);
void clang_tooling_disposeCompileCommands(CXCompileCommands);
unsigned clang_tooling_CompileCommands_getSize(CXCompileCommands);
CXCompileCommand
clang_tooling_CompileCommands_getCommand(CXCompileCommands, unsigned
i);
CXCompileCommands clang_tooling_getCompileCommands(CXCompilationDb
CDb, const char *completeFileName);

First, some nits. I would change the source order of the declarations
so related APIs are grouped together then sorted from highest to
lowest level. Start with all the functions operating on
CXCompilationDb. Then define everything operating on
CXCompileCommands. Finally CXCompileCommand. This will increase
readability, IMO. Also, always place the “constructors” and
“destructors” next to each other and reference the other in the
doxygen blocks. This increases the chances that the destructors are
called appropriately.

The use of “Db” rather than “DB” looks unpleasant to me. I think you
should capitalize both letters.

Everything is prefixed with “clang_tooling_.” The “tooling” bit is new
to libclang. I personally like prefixing this way as it draws a clear
line between underlying modules. But, I know others feel differently
and may argue for its elimination. If we decide to keep it, perhaps a
new header, Tooling.h, would be the appropriate place to declare these
APIs since they are clearly separate from “index.” This would extend
to the Python API as well.

I really like your general naming convention of
clang_tooling__. IMO, this is a proper naming
convention. Related functions are naturally grouped together and the
name communicates what they operate on and the action being performed.
Unfortunately, the rest of the libclang API isn’t like this.
Furthermore, your patch isn’t consistent in this style. e.g.
“clang_tooling_disposeCompilationDb” should be
“clang_tooling_CompilationDb_dispose” and
“clang_tooling_getCompilationDb_fromDirectory” should be
“clang_tooling_CompilationDb_getFromDirectory” or even
“clang_tooling_CompilationDb_fromDirectory,” etc. Since the existing
API doesn’t use my preferred naming convention, I guess this means you
don’t have to adopt it. Whatever you end up doing, please make it
consistent because an inconsistent convention is worse than a
consistently “bad” one.

I wonder if a clang_tooling_CompileCommand_getArgs(CXCompileCommand,
CXString **, size_t size) (or similar) would be helpful. This API
would allocate a new array of CXString or CXString
in one go instead
of requiring a function call for each argument. Or, you could have the
caller pre-allocate the array and the function would fill. Semantics.
I think the majority of users are interested in most or all arguments
and this API would save some function call overhead (which could be
noticeable in places like the Python bindings). I can make the same
argument for having a
clang_tooling_CompileCommands_getCommands(CXCompileCommands,
CXCompileCommand **, size_t *).

I really don’t like the magic of compilation databases being
constructed from a directory with the auto-discovery of
“compile_commands.json.” I know this is how the underlying API works,
so this isn’t the fault of this patch. I think the underlying API and

What don’t you like about this? As a tool author, I don’t want to care about what underlying system somebody uses, I want my tool to just work. I wonder what’s different in your use case.

libclang should be supplemented with the ability to construct the
database from an arbitrary file or possibly even a buffer with an
associated compilation database type (CXCompilationDBType_JSON).

This is already possible with the underlying API:
JSONCompilationDatabase::loadFromFile
JSONCompilationDatabase::loadFromBuffer
FixedCompilationDatabase::loadFromCommandLine

While we’re on the subject of limitations in the underlying API, the
lack of the ability to extract all compilation commands is a serious
downer. I can think of a number of use cases where one would crawl all
compilation databases and operate on all commands. Are those users
expected to implement their own parsers for the compilation database
and not go through the tooling/libclang API?

Great idea. Even better, combine that with a ClangTool::runOnAll(ASTFrontendAction*) which we can parallelize once clang is sufficiently thread safe for this. Patch welcome :slight_smile: (or, open a bug)

Thanks for taking time for this lengthy review :slight_smile:

See my comments inline.

Hi Manuel,

I am done adding the CompilationDatabase functionality to libclang
(first patch)
and the python binding (second patch).

With respect to our last email exchange :
- comments have been modified to be futureproof with respect to
database actual format.
- the python binding is using exceptions for operations which have failed
- tests have been added for the libclang part as well as the python binding

The tests are making an assumption I am not totally happy with : in case
a file has several CompileCommands, the tests expect to get the
CompileCommands in the same order they appear in the database.

In c-index-test, I have added a 'dirname' function for use on Windows
platforms, assuming that if 'basename' is not available, than 'dirname'
would not. I do not know if this is necessary, and I have not tested it
as I have no windows platform available.

Yay - new functionality! Overall a great patch. I can't wait to see
this integrated.

I do have a number of comments though.

I am lacking clout on the C front, so take that feedback with a grain
of salt. I'm a maintainer of the Python bindings though, so that must
mean something.

I've tried to fight this battle before and I'll say it again: I don't
like the existing naming convention for the libclang APIs.
Specifically, there isn't a consistent one. And, this patch continues
that trend (not your fault - you were basing it on existing code,
which is the right thing to do).

Let's look at the new API (in source order):

CXString clang_tooling_CompileCommand_getDirectory(CXCompileCommand);
unsigned clang_tooling_CompileCommand_getNumArgs(CXCompileCommand);
CXString clang_tooling_CompileCommand_getArg(CXCompileCommand, unsigned i);
void clang_tooling_disposeCompilationDb(CXCompilationDb);
CXCompilationDb clang_tooling_getCompilationDb_fromDirectory(const
char *buildDir);
void clang_tooling_disposeCompileCommands(CXCompileCommands);
unsigned clang_tooling_CompileCommands_getSize(CXCompileCommands);
CXCompileCommand
clang_tooling_CompileCommands_getCommand(CXCompileCommands, unsigned
i);
CXCompileCommands clang_tooling_getCompileCommands(CXCompilationDb
CDb, const char *completeFileName);

First, some nits. I would change the source order of the declarations
so related APIs are grouped together then sorted from highest to
lowest level. Start with all the functions operating on
CXCompilationDb. Then define everything operating on
CXCompileCommands. Finally CXCompileCommand. This will increase
readability, IMO. Also, always place the "constructors" and
"destructors" next to each other and reference the other in the
doxygen blocks. This increases the chances that the destructors are
called appropriately.

Yes. Things are in a bottom up order because I wanted each 'type' and
its related functions to be grouped together. Doing things top down will
force to split the type definition (usually void * for now) from its
related functions, which I, as a developer, do not really like.

The use of "Db" rather than "DB" looks unpleasant to me. I think you
should capitalize both letters.

I did hesitate between both. Let's see how others react.

Everything is prefixed with "clang_tooling_." The "tooling" bit is new
to libclang. I personally like prefixing this way as it draws a clear
line between underlying modules. But, I know others feel differently
and may argue for its elimination. If we decide to keep it, perhaps a
new header, Tooling.h, would be the appropriate place to declare these
APIs since they are clearly separate from "index." This would extend
to the Python API as well.

Also though about that. Although the tooling part is clearly a separate
module, it will most probably be used together with the "index". So
having a separate CTooling.h makes sense to me, but it should be made
available to the users thru the Index.h. The same holds for the python
binding. The Index acts more as an umbrella from which you can draw
different submodules. Beside, some functionality from Index is reused :
CXString for example.

I really like your general naming convention of
clang_tooling_<thing>_<action>. IMO, this is a proper naming
convention. Related functions are naturally grouped together and the
name communicates what they operate on and the action being performed.
Unfortunately, the rest of the libclang API isn't like this.
Furthermore, your patch isn't consistent in this style. e.g.
"clang_tooling_disposeCompilationDb" should be
"clang_tooling_CompilationDb_dispose" and
"clang_tooling_getCompilationDb_fromDirectory" should be
"clang_tooling_CompilationDb_getFromDirectory" or even
"clang_tooling_CompilationDb_fromDirectory," etc. Since the existing
API doesn't use my preferred naming convention, I guess this means you
don't have to adopt it. Whatever you end up doing, please make it
consistent because an inconsistent convention is worse than a
consistently "bad" one.

I agree. I did not adhere to the clang_tooling_<thing>_<action>
convention for the dispose functions to stick with the actual convention
used in the rest of index.h for those functions.

I wonder if a clang_tooling_CompileCommand_getArgs(CXCompileCommand,
CXString **, size_t *size) (or similar) would be helpful. This API
would allocate a new array of CXString or CXString* in one go instead
of requiring a function call for each argument. Or, you could have the
caller pre-allocate the array and the function would fill. Semantics.
I think the majority of users are interested in most or all arguments
and this API would save some function call overhead (which could be
noticeable in places like the Python bindings). I can make the same
argument for having a
clang_tooling_CompileCommands_getCommands(CXCompileCommands,
CXCompileCommand **, size_t *).

I also though about this. My belief is that we should provide a "low
level access" for the few people needing this power --- this is the
current function, plus a "high level" function which will be used in 99%
of the case and has no performance penalty.

Taking a CompileCommand as an example, I am thinking of :
- CXString clang_tooling_CompileCommand_getArgs(CXCompileCommand);
- CXString clang_tooling_CompileCommand_getExe(CXCompileCommand);
to get all arguments concatenated (without the executable part) and the
executable separately. This beside expresses an invariant from
CompileCommand.

Or even :
- CXString clang_tooling_CompileCommand_getCmdLine(CXCompileCommand);
to get the full command line, ready for execution.

I really don't like the magic of compilation databases being
constructed from a directory with the auto-discovery of
"compile_commands.json." I know this is how the underlying API works,
so this isn't the fault of this patch. I think the underlying API and
libclang should be supplemented with the ability to construct the
database from an arbitrary file or possibly even a buffer with an
associated compilation database type (CXCompilationDBType_JSON).

Manuel already answered this one.
We can make the other possibilities available in Index.h and the python
biding later on.

While we're on the subject of limitations in the underlying API, the
lack of the ability to extract *all* compilation commands is a serious
downer. I can think of a number of use cases where one would crawl all
compilation databases and operate on all commands. Are those users
expected to implement their own parsers for the compilation database
and not go through the tooling/libclang API?

Whew, that was a lot. Again, I don't have clout in C land, so don't go
integrating my feedback before others have time to rebuke.

On to the Python!

Nit: Include space after comma between function arguments. This is
missing throughout.

I will fix that.

Nit: Again, I like "CompilationDB" not "CompilationDb." I may like
"CompilationDatabase" even more.

Again, I have no preference there. I like the principle that the same
concept spells the same way thru Index.h, the python binding and the C++
original library. However, CompilationDatabase created really long names
(e.g. clang_tooling_getCompilationDatabase_fromDirectory).

Nit: I like wrapping the method summary on the same line as """. i.e.
don't waste lines.

I will fix that.

@property
def getDirectory(self):

This produces code that looks like:

dir = command.getDirectory

This looks weird. To most Python programmers, it looks like you are
retrieving the underlying method. Rename your properties so they are
nouns, not verbs. e.g. "directory," "commandLine."

I will fix that.

For the CompileCommand arguments, I would make it a "arguments"
property or a "getArguments" method. To me, "command line" implies the
full string, not a list of strings.

OK

On a related note, there is no need to create inner iterator classes.
Instead, just use yield, which does all the magic for you. e.g.

def getArguments(self):
    length = CompileCommand_getNumArgs(self.cl)
    for i in xrange(length):
        yield CompileCommand_getArg(self.cl, i)

Will fix that too

Of course, you lose the ability to do efficient len() and subscript
access. Callers can gain this back by doing list(foo.getArguments()),
etc. But, that buffers everything, which could have performance
side-effects. That could be a problem for cursor iteration. I don't
think it is a problem for compilation databases. Although, it /might/
be if we ever expose an API to retrieve all CompileCommands. The good
news is you can swap in a custom iterator later and the API is
backwards-compatible (yield returns an iterator magically under the
hood). My vote is to optimize for fewer lines of code now. Complexity
can always be added later if needed for performance.

The current Python API is currently inconsistent with its use of
iterators in terms of using properties or methods. I /think/ I like
methods more. But, I'm fine either way as long as things are
documented.

In CompilationDb.fromDirectory, the ptr validation can be done in a
from_result function registered on the ctypes library. I would prefer
this, as that ensures all call sites have the same behavior (nevermind
there will only likely be 1 call site).

I will add the from_result check.

+ try:
+ cmds = cdb.getCompileCommands('file_do_not_exist.cpp')
+ except KeyError:
+ assert True
+ else:
+ assert False

You can do this?! I guess you do learn something new every day!

What is surprising you there ? I am definitely not a python expert :slight_smile:

+ cmdline = ' '.join(cmds[0].getCommandLine)
+ assert cmdline == 'clang++ -o project.o -c
/home/john.doe/MyProject/project.cpp'

Please compare lists instead. Every time you compare derived entities,
you introduce a point for subtle failures.

Will fix that too.

OK. That's a lot of feedback. I think I'm done :slight_smile:

Again, great patch and thank you for putting the time into it. Despite
the length of this reply, things are looking good. Most of my comments
are cosmetic in nature, so they should be trivial to address. Again,
expect conflicting feedback on my C comments.

Gregory

Thanks again for your review !

Bikeshedding! /me jumps right in

Thanks for taking time for this lengthy review :slight_smile:

See my comments inline.

Hi Manuel,

I am done adding the CompilationDatabase functionality to libclang
(first patch)
and the python binding (second patch).

With respect to our last email exchange :

  • comments have been modified to be futureproof with respect to
    database actual format.
  • the python binding is using exceptions for operations which have failed
  • tests have been added for the libclang part as well as the python binding

The tests are making an assumption I am not totally happy with : in case
a file has several CompileCommands, the tests expect to get the
CompileCommands in the same order they appear in the database.

In c-index-test, I have added a ‘dirname’ function for use on Windows
platforms, assuming that if ‘basename’ is not available, than ‘dirname’
would not. I do not know if this is necessary, and I have not tested it
as I have no windows platform available.
Yay - new functionality! Overall a great patch. I can’t wait to see
this integrated.

I do have a number of comments though.

I am lacking clout on the C front, so take that feedback with a grain
of salt. I’m a maintainer of the Python bindings though, so that must
mean something.

I’ve tried to fight this battle before and I’ll say it again: I don’t
like the existing naming convention for the libclang APIs.
Specifically, there isn’t a consistent one. And, this patch continues
that trend (not your fault - you were basing it on existing code,
which is the right thing to do).

Let’s look at the new API (in source order):

CXString clang_tooling_CompileCommand_getDirectory(CXCompileCommand);
unsigned clang_tooling_CompileCommand_getNumArgs(CXCompileCommand);
CXString clang_tooling_CompileCommand_getArg(CXCompileCommand, unsigned i);
void clang_tooling_disposeCompilationDb(CXCompilationDb);
CXCompilationDb clang_tooling_getCompilationDb_fromDirectory(const
char *buildDir);
void clang_tooling_disposeCompileCommands(CXCompileCommands);
unsigned clang_tooling_CompileCommands_getSize(CXCompileCommands);
CXCompileCommand
clang_tooling_CompileCommands_getCommand(CXCompileCommands, unsigned
i);
CXCompileCommands clang_tooling_getCompileCommands(CXCompilationDb
CDb, const char *completeFileName);

First, some nits. I would change the source order of the declarations
so related APIs are grouped together then sorted from highest to
lowest level. Start with all the functions operating on
CXCompilationDb. Then define everything operating on
CXCompileCommands. Finally CXCompileCommand. This will increase
readability, IMO. Also, always place the “constructors” and
“destructors” next to each other and reference the other in the
doxygen blocks. This increases the chances that the destructors are
called appropriately.

Yes. Things are in a bottom up order because I wanted each ‘type’ and
its related functions to be grouped together. Doing things top down will
force to split the type definition (usually void * for now) from its
related functions, which I, as a developer, do not really like.

The use of “Db” rather than “DB” looks unpleasant to me. I think you
should capitalize both letters.

I did hesitate between both. Let’s see how others react.

Can we spell out Database? /me ducks. Otherwise I’d vote for DB, too. Db looks strange.

First, some nits. I would change the source order of the declarations
so related APIs are grouped together then sorted from highest to
lowest level. Start with all the functions operating on
CXCompilationDb. Then define everything operating on
CXCompileCommands. Finally CXCompileCommand. This will increase
readability, IMO. Also, always place the "constructors" and
"destructors" next to each other and reference the other in the
doxygen blocks. This increases the chances that the destructors are
called appropriately.

Yes. Things are in a bottom up order because I wanted each 'type' and
its related functions to be grouped together. Doing things top down will
force to split the type definition (usually void * for now) from its
related functions, which I, as a developer, do not really like.

If you put all the typedefs before/away from the functions that
operate on them, I don't think that decreases readability. In fact,
I'd argue it increases it because you are describing all the "things"
(the data model) before you describe the actions that can be performed
on them. That's like reading a glossary before you dive into a
technical manual.

Everything is prefixed with "clang_tooling_." The "tooling" bit is new
to libclang. I personally like prefixing this way as it draws a clear
line between underlying modules. But, I know others feel differently
and may argue for its elimination. If we decide to keep it, perhaps a
new header, Tooling.h, would be the appropriate place to declare these
APIs since they are clearly separate from "index." This would extend
to the Python API as well.

Also though about that. Although the tooling part is clearly a separate
module, it will most probably be used together with the "index". So
having a separate CTooling.h makes sense to me, but it should be made
available to the users thru the Index.h. The same holds for the python
binding. The Index acts more as an umbrella from which you can draw
different submodules. Beside, some functionality from Index is reused :
CXString for example.

I'm not sure it needs to be made available through Index.h. Sure, that
can be done easily enough. We could just as easily have Tooling.h
include Index.h (to get CXString, etc). Oh, bikeshedding.

+ try:
+ cmds = cdb.getCompileCommands('file_do_not_exist.cpp')
+ except KeyError:
+ assert True
+ else:
+ assert False

You can do this?! I guess you do learn something new every day!

What is surprising you there ? I am definitely not a python expert :slight_smile:

I didn't know you could use "else" along with "except!" Sure enough,
it is documented at
http://docs.python.org/reference/compound_stmts.html#the-try-statement.
I'll have to start using that!

Hi Manuel and Gregory,

Here are the patches for a second round of review. I took into account
most of
your suggestions, if not all.

Here are the changes with respect to the previous set of patches.

Regarding cindex:
- Index.h : declaration order changed to be more reader friendly
- CompilationDb moved to CompilationDatabase
- the clang_tooling_ prefix is kept for now, expressing this
functionality comes
   from a different module. We may want to move it into another header
then Index.h
   to even better express this. Which name should it have ? Tooling.h is
already
   in use. Beside the logical isolation, do we want a physical one, i.e.
have a
   separate library for tooling C interface ?
- the naming convention is now using consistently
clang_tooling_<thing>_<action>
- i have not yet added the other available ways of loading a database.
This can
   probably be done later on, on demand
- i have not yet added higher level helpers like
   clang_tooling_CompileCommand_getCmdLine. This should come latter,
depending
   on actual needs and uses.

Regarding the python binding:
- space added after comma for the arguments of function calls
- method summary now on one line, when it is small enough to fit in 80
columns.
- properties name have been changed to noun
- yield used instead of iterator for CompileCommand
- CompileCommands is no longer a hidden object.
- CompilationDb moved to CompilationDatabase
- from_result method used for checking CompilationDatabase &
CompileCommands
- test_cdb.py : comparisons are now done using lists

Cheers,

0001-cindex-cdb.patch (12.7 KB)

0002-cindex-cdb-python.patch (8.27 KB)

First of all, thanks Arnaud for this work, I'm going to be making use of it soon!

Second, and this isn't directly related to your patch, but: Could we please have some documentation on the format of the compile_commands.json? (Manuel?). I couldn't find anything in the cmake docs. In particular, what is the "directory" supposed to be -- the root of your source directory? If you have a "recursive make" build system, would "directory" be the directory of the innermost invocation of make? Or simply the $PWD for the invocation of "command"? (this can differ from make's $PWD if the recipe explicitly contained a "cd"). Can "file" be a relative path from that directory?

Finally, what is the meaning of having multiple compilation commands for a single file?

Cheers
Dave.

Hi David,

From what I can see in the cmake sources, directory is the working

directory, i.e. the place where the compiler must be invoked from. For
cmake, directory and file are absolute pathes. But I do not know how
much of a contract this is.

Multiple compilation commands can happen in some cases : for example
varying some defines to enable different part of the file (think
tablegen'erated files), or different other compile options in case
multiple builds are done at the same time (debug, release, ...).

Cheers,

Just wanted to second this. I actually wanted to cite just such documentation in a new design proposal I’m about to mail out… =D

Could we please have some documentation on the format of the compile_commands.json? (Manuel?).

Just wanted to second this. I actually wanted to cite just such documentation in a new design proposal I’m about to mail out… =D

Committed as r158365.
http://clang.llvm.org/docs/JSONCompilationDatabase.html

Feeback or patches welcome :slight_smile:

Cheers,
/Manuel

Thanks Manuel -- that was quick!

SUPPORTED SYSTEMS

* Since which version of CMake?

* Give an example of how to add the option to a CMakeLists.txt, and how
  to specify the option on the command-line when not present in the
  CMakeLists.txt.

* Out of curiosity, how hard would it be to implement for other CMake
  targets (like Visual Studio projects) -- presumably the compilation
  command would be different, but everything else can be re-used?

* Perhaps give a link to the specific CMake implementation for the sake
  of those considering adding support for other CMake targets:
  http://cmake.org/gitweb?p=cmake.git;a=commitdiff;h=fe07b055
  http://cmake.org/gitweb?p=cmake.git;a=commitdiff;h=5674844d

FORMAT

Pick a name: "compile object" or "command object", and use it consistently:

--- JSON Compilation Database Format Specification.html.orig
+++ JSON Compilation Database Format Specification.html
@@ -111 +111 @@
-objects, where each object specifies one way a translation unit
+"command objects", where each command object specifies one way a translation unit
@@ -113 +113 @@
-<p>Each object contains the translation unit's main file, the working
+<p>Each command object contains the translation unit's main file, the working
@@ -131 +131 @@
-compile objects for the same file, for example if the same translation unit is
+command objects for the same file, for example if the same translation unit is

If you have 2 different compilation commands for the same source file, do you
have:
a. 2 separate "command objects", or
b. a single command object with two "command" entries?

With my above wording changes, the answer is unambiguously "b"; but I
want to confirm that's actually what you intended. Actually with json
it's probably impossible to have the same key twice so I think I've just
answered my own question.

MINOR FIXUPS

--- JSON Compilation Database Format Specification.html.orig
+++ JSON Compilation Database Format Specification.html
@@ -78 +78 @@
-<p>This document describes a format to specify how to replay
+<p>This document describes a format for specifying how to replay
@@ -82 +82 @@
-<p>Tools based on the C++ AST need full information how to
+<p>Tools based on the C++ Abstract Syntax Tree need full information how to
@@ -102 +102 @@
-<p>Currently <a href="http://cmake.org">CMake</a> support generation of compilation
+<p>Currently <a href="http://cmake.org">CMake</a> supports generation of compilation
@@ -131 +131 @@
-compile objects for the same file, for example if the same translation unit is
+compile objects for the same file, for example if the same source file is

Justification for the 2nd change: Documentation aimed at users of clang-based
tools shouldn't contain acronyms relating to compiler internals.

Justification for the 4th change: If the build configuration has changed
(different -D or -I command-line options) then it isn't really the same
translation unit.

Cheers
Dave.

David Röthlisberger wrote:

http://clang.llvm.org/docs/JSONCompilationDatabase.html

Feeback or patches welcome :slight_smile:

Thanks Manuel -- that was quick!

SUPPORTED SYSTEMS

* Since which version of CMake?

The CMake makefile generators since 2.8.5 I believe. For the Ninja
generator, it has just been merged, and will be part of CMAke 2.8.9 released
in a few weeks.

* Give an example of how to add the option to a CMakeLists.txt, and how
  to specify the option on the command-line when not present in the
  CMakeLists.txt.

Typically it won't be in the CMakeLists.txt, but set on the command line:

cmake $srcdir -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

or

cmake $srcdir -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -G Ninja

to use the Ninja generator

* Out of curiosity, how hard would it be to implement for other CMake
  targets (like Visual Studio projects) -- presumably the compilation
  command would be different, but everything else can be re-used?

I don't know this one, but indeed it probably just depends on someone caring
enough to implement and debug it and make sure the unit test passes (the
unit test for the generation of the file does not currently work on Windows
with the Ninja generator - that will also take someone debugging it on a
Windows machine to find out why)

Thanks,

Steve.

Answering what Stephen didn’t reply to yet :slight_smile:

http://clang.llvm.org/docs/JSONCompilationDatabase.html

Feeback or patches welcome :slight_smile:

Thanks Manuel – that was quick!

SUPPORTED SYSTEMS

  • Since which version of CMake?

Added to the doc.

  • Give an example of how to add the option to a CMakeLists.txt, and how
    to specify the option on the command-line when not present in the
    CMakeLists.txt.

As Stephen said, this is a core cmake option if the generator supports it, so I’m not sure putting more cmake docs in here would help.

  • Out of curiosity, how hard would it be to implement for other CMake
    targets (like Visual Studio projects) – presumably the compilation
    command would be different, but everything else can be re-used?

I don’t know how much sense this makes before we have more windows stuff working in clang.
It most certainly is possible, and I’m happy to advise should somebody be interested in tackling it…

I don’t think that would help that much. The CMake folks usually cc’ me into discussions, which I think is much more helpful. In general, the concepts are really simple here.

FORMAT

Pick a name: “compile object” or “command object”, and use it consistently:

— JSON Compilation Database Format Specification.html.orig
+++ JSON Compilation Database Format Specification.html
@@ -111 +111 @@
-objects, where each object specifies one way a translation unit
+“command objects”, where each command object specifies one way a translation unit
@@ -113 +113 @@
-

Each object contains the translation unit’s main file, the working
+

Each command object contains the translation unit’s main file, the working
@@ -131 +131 @@
-compile objects for the same file, for example if the same translation unit is
+command objects for the same file, for example if the same translation unit is

If you have 2 different compilation commands for the same source file, do you
have:
a. 2 separate “command objects”, or
b. a single command object with two “command” entries?

With my above wording changes, the answer is unambiguously “b”; but I
want to confirm that’s actually what you intended. Actually with json
it’s probably impossible to have the same key twice so I think I’ve just
answered my own question.

MINOR FIXUPS

— JSON Compilation Database Format Specification.html.orig
+++ JSON Compilation Database Format Specification.html
@@ -78 +78 @@
-

This document describes a format to specify how to replay
+

This document describes a format for specifying how to replay
@@ -82 +82 @@
-

Tools based on the C++ AST need full information how to
+

Tools based on the C++ Abstract Syntax Tree need full information how to
@@ -102 +102 @@
-

Currently CMake support generation of compilation
+

Currently CMake supports generation of compilation
@@ -131 +131 @@
-compile objects for the same file, for example if the same translation unit is
+compile objects for the same file, for example if the same source file is

Justification for the 2nd change: Documentation aimed at users of clang-based
tools shouldn’t contain acronyms relating to compiler internals.

Justification for the 4th change: If the build configuration has changed
(different -D or -I command-line options) then it isn’t really the same
translation unit.

Applied all your patches and submitted.

Thanks a lot for your feedback :slight_smile:

Cheers,
/Manuel

what is the meaning of having multiple compilation commands for a single file?

Multiple compilation commands can happen in some cases : for example
varying some defines to enable different part of the file (think
tablegen'erated files),

Do you have any insights as to what user-level tools (such as
clang_complete) can do in the presence of multiple such commands?
Perhaps process (gather completions from) *all* of the file's different
compilation commands?

or different other compile options in case
multiple builds are done at the same time (debug, release, ...).

I don't think the tools need to worry about this, as different builds
will be done in different build directories, each with its own
compilation database.

Thanks!
Dave.

what is the meaning of having multiple compilation commands for a single file?

Multiple compilation commands can happen in some cases : for example
varying some defines to enable different part of the file (think
tablegen’erated files),

Do you have any insights as to what user-level tools (such as
clang_complete) can do in the presence of multiple such commands?
Perhaps process (gather completions from) all of the file’s different
compilation commands?

For example, yep. I think for code_complete that’s probably not what you want, but for other tools (rename) that’s a completely different story.

or different other compile options in case
multiple builds are done at the same time (debug, release, …).

I don’t think the tools need to worry about this, as different builds
will be done in different build directories, each with its own
compilation database.

This is a use case we have, so I’ll just go ahead and respectfully disagree…

Cheers,
/Manuel

Hi Manuel & Gregory,

If there are no more feedback, can someone commit those patches for me ?
I do not have commit access.

Cheers,

Well, this is true for CMake but maybe different for other build system

Sorry about the delay.

I'm not qualified to review the C++. But, the style changes look very good! I'm more than satisfied with its state.

The Python is *extremely* close. The only major item left (and I didn't mention this in the initial feedback, sorry) is to store a reference to the associated CompilationDatabase in CompileCommands instances. Now, this may not be required (I haven't looked at the C++). If CompileCommands don't internally use a reference to their associated DB, we're fine. But if they do, then it is possible that a garbage collection could collect the DB instance and CompileCommands could blow up on next access. By stuffing a reference to the parent in the child, you prevent GC from prematurely collecting the parent. This manual reference tracking can be done in the from_result(). Just iterate over args until you find an instance of the appropriate type.

Even if there is no reference now, it may not always work. You should add a test that forces a GC to ensure things don't blow up in the future. Search for "gc" in the test code and you'll find an example.

Why does CompilationCommands.from_result raise a KeyError? KeyError would be appropriate if you were trying to access an attribute, etc. Its use on a function is IMO inappropriate. If you must raise, I think it should be a regular Exception. But, I think in this case the "Pythonic" thing to do is return None.

Other nits:

* I like CompilateDatabase.fromDirectory, not createFromDirectory.
* Don't import * in the test. Instead, import specific symbols.

It's also possible to have one file built several times with different options in one build,
e.g. if you build static and shared versions of library.