JSONCompilationDatabase additional keys

Hello cfe-dev,

At present unknown keys in the compile_commands.json file will cause the JSONCompilationDatabase parser to reject the file.

  1. Would a patch to relax this constraint be accepted?
  2. Would other consumers of the compile_commands.json be negatively impacted by unknown keys?

The motivating example (somewhat contrived) involves a single TU built with a different environment in two different targets

int main()
{
TYPE i = -1;
return i < 0;
}

cc -DTYPE=int source.c -o bin1
cc -DTYPE=uint16_t source.c -o bin2

So for my use case I’d like to add a (purely optional) target name to the compilation database (generated via cmake) - in order to be able to provide more accurate diagnostics to the user.

Cheers,

-nick

Hello cfe-dev,

At present unknown keys in the compile_commands.json file will cause the
JSONCompilationDatabase parser to reject the file.

1. Would a patch to relax this constraint be accepted?
2. Would other consumers of the compile_commands.json be negatively
impacted by unknown keys?

Yes. This can be rather big, and being able to quickly parse it is
important for interactive use cases.

The motivating example (somewhat contrived) involves a single TU built
with a different environment in two different targets

int main()
{
    TYPE i = -1;
    return i < 0;
}

cc -DTYPE=int source.c -o bin1
cc -DTYPE=uint16_t source.c -o bin2

So for my use case I'd like to add a (purely optional) target name to the
compilation database (generated via cmake) - in order to be able to provide
more accurate diagnostics to the user.

I'd say that if we really want something more elaborate, we should try to
design it first instead of randomly allowing to add fields in ways that
might later break.

I'd be curious about other opinions though. (+Edwin, Daniel)

I am personally very much in favor of removing this restriction and in fact I proposed it before. This does not mean I propose any extensions today. I agree with Manuel that we should carefully design what we add. Nevertheless, the more released clang versions we have that just abort
on additional keywords, the harder it will be to add backward compatible extensions later on.

Cheers,
Tobias

Yes. This can be rather big, and being able to quickly parse it is

important for interactive use cases.

That is fair enough. And I agree that it would be wise to limit the
proliferation of fields added to this structure to maintain the ability to
parse it quickly - I do think however that this very strict parsing could
cause problems in the future wrt. extensions both for clang and periphery
projects. Perhaps a version field in the next revision?

I'd say that if we really want something more elaborate, we should try to

design it first instead of randomly allowing to add fields in ways that
might later break.

I'd be curious about other opinions though. (+Edwin, Daniel)

I am personally very much in favor of removing this restriction and in
fact I proposed it before. This does not mean I propose any extensions
today. I agree with Manuel that we should carefully design what we add.
Nevertheless, the more released clang versions we have that just abort
on additional keywords, the harder it will be to add backward compatible
extensions later on.

Ah, slight misunderstanding, I wasn't actually proposing to add the
'target' parameter, rather to simply relax the restriction that clang would
reject the file if it encountered unknown fields.

However on reflection given the current behaviour (and as Tobias indicates)
it would be very difficult to introduce any new behaviour here without
breakage. I doubt the target of my second patch (cmake) would be willing to
add breakage here either.

My thoughts here are that a compilation database, essential for static
analysis purposes, could be very useful in other use cases as well - albeit
with some minor additional information.

The build tool that encodes much wisdom in this regard is in a prime
position to generate this extra information.

While it is trivial to then generate this extra information, clang will
currently reject the file if I attempt to reuse the JSONCompilationDatabase
class to parse the file - and if I parse the file manually, I still lose
the ability to reuse other static analysers based on clang.

A future llvm/clang specific use case I can imagine is some application
doing whole program analysis. It requires information not only on the
individual TUs but also which of those are combined to be analysed as a
unit. This sort of information is not currently encoded in the compilation
database, and with the current behaviour updates to the build tool
generating the file would need to be in lock-step with updates to
llvm/clang based analysers consuming it.

Cheers,

-nick

I don't think that this is a very good reason. Parsing the compilation
database in the JSON format is going to take O(project size) work, and
anything O(project size) is not going to be adequate for interactive use
cases anyway.

-- Sean Silva

Hello cfe-dev,

At present unknown keys in the compile_commands.json file will cause the
JSONCompilationDatabase parser to reject the file.

1. Would a patch to relax this constraint be accepted?
2. Would other consumers of the compile_commands.json be negatively
impacted by unknown keys?

Yes. This can be rather big, and being able to quickly parse it is
important for interactive use cases.

I don't think that this is a very good reason. Parsing the compilation
database in the JSON format is going to take O(project size) work, and
anything O(project size) is not going to be adequate for interactive use
cases anyway.

How do you come to that conclusion? I've run benchmark on chromium-sized
projects, and the interactive use-case worked just fine.

Hello cfe-dev,

At present unknown keys in the compile_commands.json file will cause
the JSONCompilationDatabase parser to reject the file.

1. Would a patch to relax this constraint be accepted?
2. Would other consumers of the compile_commands.json be negatively
impacted by unknown keys?

Yes. This can be rather big, and being able to quickly parse it is
important for interactive use cases.

I don't think that this is a very good reason. Parsing the compilation
database in the JSON format is going to take O(project size) work, and
anything O(project size) is not going to be adequate for interactive use
cases anyway.

How do you come to that conclusion? I've run benchmark on chromium-sized
projects, and the interactive use-case worked just fine.

On my machine it takes 30ms to parse the compile_commands.json for
clang/llvm (measured with perf(1)). Typically interactive response time
expectation is 100ms, so 30ms is about 1/3 of the total time available,
which IMO is unacceptable. Conversely, a project >3x larger would be
noninteractive.

-- Sean Silva

Have you made sure it scales linearly?
If your analysis is true, it has regressed, and we should fix the YAML
parser.

(resending without some of the attachments, since the list doesn’t like big messages; ask if you want me to send the full Mathematica notebook that I used for the analysis)

CompilationDatabaseParsePerformanceAnalysis.png

CompilationDatabaseParsePerformance.json (2.31 KB)

(resending without some of the attachments, since the list doesn't like
big messages; ask if you want me to send the full Mathematica notebook that
I used for the analysis)

Hello cfe-dev,

At present unknown keys in the compile_commands.json file will cause
the JSONCompilationDatabase parser to reject the file.

1. Would a patch to relax this constraint be accepted?
2. Would other consumers of the compile_commands.json be negatively
impacted by unknown keys?

Yes. This can be rather big, and being able to quickly parse it is
important for interactive use cases.

I don't think that this is a very good reason. Parsing the compilation
database in the JSON format is going to take O(project size) work, and
anything O(project size) is not going to be adequate for interactive use
cases anyway.

How do you come to that conclusion? I've run benchmark on
chromium-sized projects, and the interactive use-case worked just fine.

On my machine it takes 30ms to parse the compile_commands.json for
clang/llvm (measured with perf(1)). Typically interactive response time
expectation is 100ms, so 30ms is about 1/3 of the total time available,
which IMO is unacceptable. Conversely, a project >3x larger would be
noninteractive.

Have you made sure it scales linearly?
If your analysis is true, it has regressed, and we should fix the YAML
parser.

Yes, I'm almost 100% sure that it scales linearly;

Wow, you did that thoroughly - 4 data points of different enough order of
magnitude would have fully convinced me :wink:

+Michael: has that regressed? Can we make it faster?

see the attached plot and raw data. The numbers are given in seconds. The