[RFC] Embedding compilation database info in object files.

tl;dr: compiler embeds compilation db info in object files; you can then collect it afterwards with simple tools. Good idea?

It seems like for a while now, we have been looking for a way that clang can assist users in creating JSON compilation databases, but solutions seem limited to specific build systems or platforms. I came up with a neat little hack that may be a viable way for clang to help create compilation databases “everywhere clang runs”, with fairly good user experience.

I believe the following user experience is achievable “everywhere clang runs”:

  1. Add some option to the compiler command line.
  2. Rebuild.
  3. Feed all of your built object files/executables to a small tool we ship and out comes a compilation database.

The basic idea is that instead of generating the compilation database “before compilation, by something that knows about the build a priori”, the compilation database info comes out “the back of the compiler a posteriori”, follows the natural flow of information through the build pipeline, and is eventually recovered from build products.

From an operational standpoint, it just involves adding a small amount of extra logic to the compiler and doing some simple postprocessing on build products. Hence I think this may be a good fit for our situation as clang developers: we want to provide a feature to users, but we 1) don’t control users’ build systems/platforms, yet 2) do control the compiler and can ship small utilities alongside it.

I hacked up a minimal demo at <https://github.com/chisophugis/clang-compdb-in-object-file> which currently uses a compiler wrapper to add the extra logic to the compiler. At a high level, for each TU it embeds a string literal containing {"directory":...,"command":...,"file":...} in a special section, .clang.compdb; these sections are aggregated by the linker, and afterwards the compilation database entries can be extracted and assembled into the final JSON compilation database. For full details consult the README (and/or the source); it’s pretty hackish.
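The wrapper's embedding step can be sketched like this (my own Python reconstruction, not the demo's actual code; the function names and the ELF/GCC `__attribute__((section(...)))` idiom are assumptions — other object formats place sections differently):

```python
import json

def compdb_entry(directory, command, filename):
    """Build one JSON compilation-database entry as a string."""
    return json.dumps({"directory": directory,
                       "command": command,
                       "file": filename})

def make_compdb_stub(directory, command, filename):
    """Emit a C stub that embeds the entry in a .clang.compdb section.

    Compiling this stub alongside the TU makes the entry travel inside
    the object file and survive linking.  The section-placement syntax
    here is the ELF/GCC-style attribute; this is a sketch, not what the
    demo repository literally does.
    """
    entry = compdb_entry(directory, command, filename)
    escaped = entry.replace("\\", "\\\\").replace('"', '\\"')
    return ('__attribute__((section(".clang.compdb"), used))\n'
            'static const char __clang_compdb_entry[] = "{}";\n'.format(escaped))

stub = make_compdb_stub("/src/proj", "clang++ -c foo.cpp -o foo.o", "foo.cpp")
```

The aggregation step then falls out of normal linking: the linker concatenates every TU's .clang.compdb section into the final binary, where a post-build tool can collect the entries.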

As a test of the approach, I used this same essential technique to successfully produce a compilation database from a large game (>1M lines) without having to worry about the build system in any way (it’s some sort of Visual Studio project with a custom toolchain; I don’t really understand it much beyond the GUI options it presents and which compiler binary it invokes).

What I have now is mostly just a couple of hackish scripts; I have no idea what final form would be most appropriate as a user-facing feature inside of clang. Does this seem like a good direction for helping users create compilation databases?

– Sean Silva

We have done similar things before internally, but considered it more to be
a hack :wink:

I think the direction we want to go in is to have an option in clang
to append to a compilation database while running - that way, no
post-processing step is required (a step that would otherwise need to be
put into the build flow somehow). The only part missing is somebody with
enough time on their hands for whom this is high enough priority.

Cheers,
/Manuel

2013/7/18 Manuel Klimek <klimek@google.com>

In this context Compilation Database is (roughly) a description of the
compilation commands that are used to build a project.

One supported format for this is JSON, see
http://clang.llvm.org/docs/JSONCompilationDatabase.html.

-- James

Ah, then I was completely misunderstanding :wink:

Wouldn't this approach (appending to a compilation database) have issues
with filesystem contention and/or write atomicity in multicore/distributed
builds (without involving a "real database" for the database storage)?
Also, wouldn't a post-processing step be needed in order to remove outdated
entries appended from a previous incremental build (consider: `make;
<rename some file in the project>; make`)?

The approach I proposed has two extremely desirable properties that I think
would be hard to achieve with an approach that carries the information in
an external "side channel", as in the approach you suggested:
1. The compilation database info is always up to date as long as the build
products are up to date, since the information follows the "causal chain"
leading to the final programs/libraries.
2. It works "everywhere clang does" since it makes no assumptions about
build systems, filesystems, or anything else; the data is carried along a
datapath that already works (namely, that information emitted by the
compiler will end up in build products).

Also, the format of the embedded entry could be streamlined to make it
utterly trivial to extract, e.g. a simple string
"@ClangCompilationDatabaseEntryMD5JSON<hex md5sum of $JSON>$JSON", and then
you could reliably extract the compdb entries with a single linear scan of
arbitrary binary files; with that it seems like it would be feasible for
most use cases (possibly adding an optional caching step) to have clang
tools directly accept binaries containing such data as the compilation
database itself!
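A sketch of that single linear scan (my own illustration, assuming the format in the string above: the marker, then a 32-character hex MD5, then the JSON payload it checksums):

```python
import hashlib
import json

MARKER = b"@ClangCompilationDatabaseEntryMD5JSON"

def embed(entry):
    """Produce marker + hex MD5 + JSON, as it would sit inside a binary."""
    payload = json.dumps(entry).encode("utf-8")
    return MARKER + hashlib.md5(payload).hexdigest().encode("ascii") + payload

def extract_entries(blob):
    """Recover all checksummed compdb entries from an arbitrary byte blob."""
    entries = []
    pos = 0
    while True:
        pos = blob.find(MARKER, pos)
        if pos < 0:
            return entries
        start = pos + len(MARKER)
        digest = blob[start:start + 32].decode("ascii", "replace")
        body = blob[start + 32:].decode("utf-8", "replace")
        try:
            # raw_decode stops at the end of the JSON object, ignoring
            # whatever binary garbage follows it in the file.
            obj, end = json.JSONDecoder().raw_decode(body)
        except ValueError:
            pos = start
            continue
        payload = body[:end].encode("utf-8")
        # The MD5 guards against scanning garbage that merely looks like JSON.
        if hashlib.md5(payload).hexdigest() == digest:
            entries.append(obj)
        pos = start
```

Because the marker and checksum make false positives essentially impossible, this works on any binary (object file, archive, executable) with no knowledge of the object format at all.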

-- Sean Silva

Wouldn't this approach (appending to a compilation database) have issues
with filesystem contention and/or write atomicity in multicore/distributed
builds (without involving a "real database" for the database storage)?

On Unix systems we can handle that via file locks. On Windows we'd need a
Windows expert :stuck_out_tongue:
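For the Unix side, a minimal sketch of such a locked append (my own illustration using POSIX advisory locking via `flock`; it writes one JSON object per line, so producing the final JSON-array compile_commands.json would still need a trivial wrapping step):

```python
import fcntl
import json

def append_entry_locked(db_path, entry):
    """Append one compdb entry under an exclusive advisory lock.

    Concurrent compiler invocations serialise on the lock, so partial
    writes never interleave.  This is the Unix half only; Windows would
    need a different mechanism (e.g. LockFileEx).
    """
    line = json.dumps(entry) + "\n"
    with open(db_path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(line)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Note that advisory locks only serialise writers on one filesystem; distributed builds writing to a shared network filesystem are exactly the case where this gets shaky.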

Also, wouldn't a post-processing step be needed in order to remove
outdated entries appended from a previous incremental build (consider:
`make; <rename some file in the project>; make`)?

Well, we could require a rebuild to update the database (basically rm the
compilation database, make clean && rebuild).

The approach I proposed has two extremely desirable properties that I
think would be hard to achieve with an approach that carries the
information in an external "side channel", as in the approach you suggested:
1. The compilation database info is always up to date as long as the build
products are up to date, since the information follows the "causal chain"
leading to the final programs/libraries.

Wouldn't it have exactly the same "delete" problem? When I rename a .cc
file, won't most build systems leave the .o just lying around?

2. It works "everywhere clang does" since it makes no assumptions about
build systems, filesystems, or anything else; the data is carried along a
datapath that already works (namely, that information emitted by the
compiler will end up in build products).

Putting it into a special section in the object file is definitely better
than what we did (just appending it to the object file, since no tool we
know of fails on trailing bytes in a .o file).

So I'm not completely opposed to the idea. I'd be curious what Chandler
thinks, he usually happens to have strong opinions about things like this :slight_smile:

Cheers,
/Manuel

3. Feed all of your built object files/executables to a small tool we ship
and out comes a compilation database.

I like this. We could even get lld to produce the final compilation
database for each executable and library :slight_smile:

Cheers,
Rafael

The approach I proposed has two extremely desirable properties that I
think would be hard to achieve with an approach that carries the
information in an external "side channel", as in the approach you suggested:
1. The compilation database info is always up to date as long as the
build products are up to date, since the information follows the "causal
chain" leading to the final programs/libraries.

Wouldn't it have exactly the same "delete" problem? When I rename a .cc
file, won't most build systems leave the .o just lying around?

The use case I primarily envision is sourcing the compdb info in the usual
case from "final" build products, like executables and libraries. In that
case, the old .o would not be linked into the final build product and hence
its compilation database info would not be included; there would be issues
if one of the final build products is renamed though, but I think that is
relatively rare, and we can document this particular caveat. In other cases
(even when sourcing .o's), I think a useful, actionable diagnostic can be
emitted ("compilation database entry found in file foo.o doesn't seem to
correspond to any source file; skip it? delete it?").
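That diagnostic is straightforward to sketch (my own illustration; it assumes entries carry the usual `directory`-relative `file` field):

```python
import os

def check_entries(entries):
    """Partition extracted compdb entries into live and stale.

    A stale entry references a source file that no longer exists --
    exactly what `make; <rename some file>; make` leaves behind -- and
    can then be reported ("skip it? delete it?") instead of silently
    poisoning the database.
    """
    live, stale = [], []
    for entry in entries:
        # os.path.join handles both relative and absolute "file" fields.
        path = os.path.join(entry["directory"], entry["file"])
        (live if os.path.exists(path) else stale).append(entry)
    return live, stale
```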

2. It works "everywhere clang does" since it makes no assumptions about

build systems, filesystems, or anything else; the data is carried along a
datapath that already works (namely, that information emitted by the
compiler will end up in build products).

Putting it into a special section in the object file is definitely better
than what we did (just appending it to the object file, as no tool we know
fails with trailing bytes on a .o file).

So I'm not completely opposed to the idea. I'd be curious what Chandler
thinks, he usually happens to have strong opinions about things like this :slight_smile:

Yeah, I'd love to hear any ideas he has about this.

-- Sean Silva

The use case I primarily envision is sourcing the compdb info in the usual
case from "final" build products, like executables and libraries. In that
case, the old .o would not be linked into the final build product and hence
its compilation database info would not be included; there would be issues
if one of the final build products is renamed though, but I think that is
relatively rare, and we can document this particular caveat. In other cases
(even when sourcing .o's), I think a useful, actionable diagnostic can be
emitted ("compilation database entry found in file foo.o doesn't seem to
correspond to any source file; skip it? delete it?").

Normally a project has multiple "final build products". The reason we have
the compilation database is that given a source file, you want to be able
to parse it. If I give you a source file, how do you know which of the
final build products you look into to get the information? All of them?
Have yet another database?
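The objection can be made concrete with a sketch (the names here are mine): once entries are scattered across several final products, answering "which commands build this source file?" requires building exactly such an aggregate per-file index first, which is itself "yet another database":

```python
def build_index(products):
    """products maps a product name to the compdb entries found in it.

    The per-source-file index this builds is precisely the extra
    database a tool would need before it can know which final build
    product to consult for a given source file.
    """
    index = {}
    for product, entries in products.items():
        for entry in entries:
            index.setdefault(entry["file"], []).append((product, entry))
    return index
```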

I'm summoned. =D

So, I'm moderately opposed to the idea. The reason is that we've tried this
(as Manuel mentions) and it creates a really huge new problem: where do you
look for the object file? Worse, how do you find the *right* object file?

The primary benefit of writing out to a single compilation database is
*precisely* that: it is a *single* compilation database. You can place it
in a common, predictable location and have clang-based tools look there. We
had huge, never-ending problems with this in practice. We would spend more
time looking for the .o file than we would running the clang tool, or we
would find the wrong .o file and end up not reproducing the compile
developers actually cared about.

Even if you build aggregate databases as you say for "final build
products", I agree with Manuel: this just moves the problem. Now you need
to know which build product to look into.

I genuinely think that having a common database is far and away the best
strategy for integrating tools into a development process. We should focus
on updating build systems to directly write out these databases. That tends
to be the most effective way to get the compilation database.

Adding a mutex does not solve the problem of contention; it just eliminates races. It still adds a serialising step to an inherently parallel process, and you should ask Amdahl why this is a bad idea.

David

Well, for example, we can spawn a process that writes to the compilation db
in the background from the driver, and if we assume that C++ compilations
are long compared to writing a line of text to a file, we get that:
a) finishing off the build will be independent of finishing writing to the
compilation db, thus, the impact on the build is negligible
b) the compilation db will probably be finished writing before the build
finishes (as each step probably finishes writing to the compilation db
before the step finishes)
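A toy version of that driver behaviour (my own sketch; `run_cc1` is a stand-in for the real `clang -cc1` invocation, and the fork-based process mirrors how a Unix driver would spawn the helper):

```python
import json
import multiprocessing

def write_entry(db_path, entry):
    """The background task: append one entry to the database file."""
    with open(db_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def driver_compile(db_path, entry, run_cc1):
    """Fork the db writer, run the (long) compilation concurrently, and
    join the writer before the driver exits.  Since a C++ compile vastly
    outlasts one small write, the final join is almost always instant.
    """
    # "fork" is the Unix-style start method; unavailable on Windows.
    ctx = multiprocessing.get_context("fork")
    writer = ctx.Process(target=write_entry, args=(db_path, entry))
    writer.start()
    run_cc1()          # stand-in for the actual clang -cc1 subprocess
    writer.join()
```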

Well, for example, we can spawn a process that writes to the compilation db in the background from the driver, and if we assume that C++ compilations are long compared to writing a line of text to a file, we get that:

So now we get the extra overhead of another fork / exec. Lots more TLB churn, which is often a limiting factor on scalability on modern CPUs.

a) finishing off the build will be independent of finishing writing to the compilation db, thus, the impact on the build is negligible

Other than having twice as many processes running...

b) the compilation db will probably be finished writing before the build finishes (as each step probably finishes writing to the compilation db before the step finishes)

Yes, the 'probably' bit is fun here. If your build system is doing the 'run the tool' step as a dependency of the build step, you end up with this sequence (for -j n):

n compilation tasks finish.
Build system starts tool.
Tool acquires lock on compilation database and runs.
n child processes of the compile sit waiting for the lock.
Tool finishes.
n child processes sequentially update the compilation database.

Even better, some of them will probably succeed, so not only do you have the wrong data in the compilation database, you have inconsistent and wrong data in the compilation database.

David

> Well, for example, we can spawn a process that writes to the compilation
db in the background from the driver, and if we assume that C++
compilations are long compared to writing a line of text to a file, we get
that:

So now we get the extra overhead of another fork / exec. Lots more TLB
churn, which is often a limiting factor on scalability on modern CPUs.

I think when we're talking about C++ compilations, everything else is
dwarfed by the CPU cost of re-parsing transitive include closures.

> a) finishing off the build will be independent of finishing writing to
the compilation db, thus, the impact on the build is negligible

Other than having twice as many processes running...

clang already spawns a subprocess per clang invocation by the driver. If
that significantly impacted build performance, I'm sure that somebody would
have changed that.

> b) the compilation db will probably be finished writing before the build
finishes (as each step probably finishes writing to the compilation db
before the step finishes)

Yes, the 'probably' bit is fun here. If your build system is doing the
'run the tool' step as a dependency of the build step, you end up with this
sequence (for -j n):

n compilation tasks finish.
Build system starts tool.
Tool acquires lock on compilation database and runs.
n child processes of the compile sit waiting for the lock.
Tool finishes.
n child processes sequentially update compilation database

Even better, some of them will probably succeed, so not only do you have
the wrong data in the compilation database, you have inconsistent and wrong
data in the compilation database.

I'm not sure we're talking about the same implementation idea. The whole
idea would be to not have anything special in the build system, but to
specify an --update-compilation-db=/my/compilation/db/path flag. Then the
driver would take that flag, and launch a background task to update that
file.

Thus, it would be:
build system launches clang
clang launches:
-> launches the compilation db update process
-> launches the clang -cc1 process
with high probability the tool finishes before the clang -cc1 process
finishes, as C++ compilation takes a lot longer than writing to a single
file (for example, build disk I/O is also competing for a common resource,
parsing CPU is a common resource, and there are dependencies inherent in
the build which make it not fully parallelizable)

Cheers,
/Manuel

There is no need to have it be a separate process at all. Just start a
background thread and join it before terminating.
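With a thread instead of a process, the same sketch shrinks to (again my own illustration, not clang's actual driver):

```python
import json
import threading

def driver_compile_threaded(db_path, entry, run_cc1):
    """Write the compdb entry from a background thread while the
    compilation proceeds, joining the thread before the driver
    terminates so the entry is guaranteed to be on disk."""
    def write():
        with open(db_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
    t = threading.Thread(target=write)
    t.start()
    run_cc1()   # stand-in for the actual compilation work
    t.join()    # barrier: entry written before the driver exits
```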

Have you benchmarked the difference between clang with and without pthreads linked in? A lot of libc and STL things become more expensive (including malloc(), although not by much) when pthreads are linked, even if they're not used. These may be dwarfed by the CPU time of the compilation, but it isn't free.

David

> There is no need to have it be a separate process at all. Just start a
background thread and join it before terminating.

Have you benchmarked the difference between clang with and without
pthreads linked in? A lot of libc and STL things become more expensive
(including malloc(), although not by much) when pthreads are linked, even
if they're not used.

Yes, I have, and on my systems none of these things are true. I don't know
whether or why they are true for you, but I don't think it should guide the
decision of how to architect the Clang driver.

In the not too distant future we will almost certainly want to build Clang
as a multithreaded binary (if possible to do so) so that we can take
advantage of threads internally.

These may be dwarfed by the CPU time of the compilation, but it isn't free.

Sure, it isn't free. I'm not suggesting that it is. But I am suggesting
that it is a totally viable fallback strategy for this use case, and that
the cost is negligible while still being non-zero.

+1. I hope I didn't sound like I was suggesting it was free :slight_smile: Adding code
is never free.
The important question is whether it's actually affecting the bottom line
in a measurable way (which I think it won't).

Also, I think it's important to point out that this will be an optional
feature.

I genuinely think that having a common database is far and away the best
strategy for integrating tools into a development process. We should focus
on updating build systems to directly write out these databases. That tends
to be the most effective way to get the compilation database.

Changing the build system used by projects is hard, and changing some
build systems is not possible.

Would you be OK with a -fcompilation-db option that makes clang record
the necessary information to that db? By the way, any ideas on how to
handle files that are compiled more than once? For example, when
cross-compiling, every file in lib/Support is compiled for both the build
and host targets.
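Since the JSON compilation database is just an array, nothing forbids several entries for one file; one possible answer (my own sketch, not an existing clang behaviour) is to drop only exact duplicates and keep the distinct per-target commands, letting consumers pick:

```python
def merge_entries(entries):
    """Collapse exact duplicates but keep distinct commands per file,
    e.g. the build-target and host-target compiles of the same
    lib/Support file when cross-compiling."""
    seen = set()
    merged = []
    for entry in entries:
        key = (entry["directory"], entry["command"], entry["file"])
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    return merged
```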

Cheers,
Rafael