Getting involved with Clang refactoring

I’m following this discussion closely, as I’m working on a very similar constellation of problems. I’m interested in clang_complete and some form of clang indexing, but for emacs, not for vim. I’m thinking that the indexing tool should be something like ctags or gid, but smarter. I’m working on a Python tool to extract build information from the output of gmake. (FWIW, I think I have management interest in contributing my code to clang, or emacs, or whatever my code ends up being useful for.)

Does anybody know whether consumers of the compilation database care about the first parameter in the command? What I’m asking is: if I interpose my command, say mk_compile_db, for a call to g++, and then do gmake --makeall CC=mk_compile_db… the compilation command will say that mk_compile_db is the command, not g++. It turns out that the path to g++ is explicit in our build system, so I can’t just change the first word to “g++” and give an accurate answer. On the other hand, maybe it doesn’t matter what the command is. Maybe the libtooling facilities ignore the first word, or maybe they expect the first word to be elided altogether.
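For concreteness, here is the rough shape of the wrapper I have in mind. This is only a sketch: the real-compiler path and the log location are placeholders for whatever the build actually uses, and a parallel build would need locking around the append.

    #!/usr/bin/env python
    # mk_compile_db (sketch): record the compile invocation, then exec
    # the real compiler so the build proceeds as usual.
    import json, os, sys

    REAL_COMPILER = "/usr/bin/g++"   # placeholder: our build hard-codes its own path
    LOG = os.environ.get("COMPILE_DB_LOG", "/tmp/compile_db.jsonl")

    source = next((a for a in sys.argv[1:]
                   if a.endswith((".c", ".cc", ".cpp", ".cxx"))), "")
    entry = {
        "directory": os.getcwd(),
        "command": " ".join([REAL_COMPILER] + sys.argv[1:]),
        "file": source,
    }
    # NOTE: a -j build would need file locking here; one JSON object per
    # line, to be merged into compile_commands.json by a separate step.
    with open(LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

    os.execv(REAL_COMPILER, [REAL_COMPILER] + sys.argv[1:])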

Yes, they ignore argv[0].

Cheers,
/Manuel

I would personally favor the use of the Clang flag or an interposition script to write into the compilation database.

I don’t buy the “building with another compiler” issue: even for such codebases, one can easily run clang with -fsyntax-only and discard the output, or a specific -fdo-nothing flag could be added and ignored by the tool.
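Concretely, a per-file check needs nothing more than this (a sketch; clang++ on PATH and the file name are assumptions):

    import subprocess
    # Parse-check only: -fsyntax-only stops after semantic analysis, so
    # nothing is written even if the production compiler is not clang.
    rc = subprocess.call(["clang++", "-fsyntax-only", "foo.cpp"])
    print("parses" if rc == 0 else "does not parse")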

I see only two issues:

  • locking, which would reduce performance
  • distributed builds: when the compiler is invoked on various machines, how do you centralize the generated information?

And I don’t think that it is worth worrying too much about them. The creation of the compilation database is a one-off operation so performance is not too much of an issue, and people having the resources to set up such distributed builds should have the resources to configure their build to run on a single machine and/or write scripts to create the database by themselves.

– Matthieu

As time goes on, I find myself caring less and less about code that isn't compiling with Clang. Abstractly, I care about people being able to use Clang-based tools even if they don't compile with Clang. However, I would much prefer us to spend our time making the Clang ecosystem *awesome*, so that once your code parses/builds with Clang, you get access to a variety of useful tools. The compilation-flag approach to building a compilation database is by far the easiest solution for people to use if they're already building with Clang, and therefore that's the one I'm most interested in.

  - Doug

3. In an order determined by the dependency graph. :slight_smile:

Thank you for the explanation. At least now I partly understand what
you're talking about.

But I'm still puzzled. Why would one need to recompile all files that
use a specific symbol? Changes to a symbol normally occur in a header
file, affecting the dependency graph in just the way that makes make
useful. To solve the problem you describe -- to have Clang recompile
"all files that use a specific symbol" -- one could touch(1) all such
files and then run make.

The "figure out the command line" part is also intrinsic to make, as
you know, based on the filename suffix. That's 30 years of prior art
that the compilation database seems to be reinventing.

I understand the project has an existing build infrastructure, and that
the compilation database is already embedded in it. It's not yet clear
to me why it was invented, and whether it's a solution or a problem.

--jkl

Bill White <bill.white@griggsinst.com> writes:

I'm interested in clang_complete and some form of clang indexing, but for
emacs, not for vim. I'm thinking that the indexing tool should be something
like ctags or gid, but smarter.

Bill,

This is exactly the tool I'm working on right now, using Clang's ASTVisitor.
I started the project a few days ago, born from a need for a better tagging
system usable from Emacs (GNU global is my 2nd best so far, but it just isn't
good enough):

    https://github.com/jwiegley/clang-tags

README soon to come.

Gentlemen, could you tell me please what you mean by a “compilation
database”? It sounds like a list of files to compile, something I’d
normally look to make(1) to provide. Apparently that’s not what you
mean, or there’s some reason make won’t do the job.

If I understood the problem better perhaps there’s something I
could do about it.

The problem with make is that it executes the compiler based on
changes in the dependency graph.

Imagine you want to run a tool on all files that use a specific
symbol. To do this you need to:

  1. figure out the compile command line for the file you’re pointing into, so you can determine the symbol under the cursor (see the sketch below)
  2. run the tool over all files that reference that symbol
  3. In an order determined by the dependency graph. :slight_smile:

No, in one process at the same time.
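As for step 1, the database is exactly that lookup table; a sketch, assuming the usual compile_commands.json layout with “directory”, “command” and “file” entries (the file name here is made up):

    import json

    with open("compile_commands.json") as f:
        db = json.load(f)

    def command_for(path):
        # Return the recorded compile command and working directory
        # for one translation unit.
        for entry in db:
            if entry["file"].endswith(path):
                return entry["command"], entry["directory"]
        return None, None

    cmd, cwd = command_for("widget.cpp")   # hypothetical file
    print(cwd, cmd)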

Thank you for the explanation. At least now I partly understand what
you’re talking about.

But I’m still puzzled. Why would one need to recompile all files that
use a specific symbol? Changes to a symbol normally occur in a header
file, affecting the dependency graph in just the way that makes make
useful. To solve the problem you describe – to have Clang recompile
“all files that use a specific symbol” – one could touch(1) all such
files and then run make.

Not all build systems work on system time. Some work on content digests. Even if this worked, the amount of stats that build systems do in ReallyLargeCodeBases to figure out what to build can be a huge cost, and can actually dwarf the time it takes to reparse 3 or 4 files you care about.

And there’s one more reason: for example, I want to rename some symbol. I need all files to parse before the refactoring and after the refactoring, as broken files lead to problems in the AST.

The “figure out the command line” part is also intrinsic to make, as
you know, based on the filename suffix. That’s 30 years of prior art
that the compilation database seems to be reinventing.

I think this argument is way too generic. It seems to me like most innovations look like this at first. I might be wrong, but at least let me try :wink:

I understand the project has an existing build infrastructure, and that
the compilation database is already embedded in it. It’s not yet clear
to me why it was invented, and whether it’s a solution or a problem.

It was definitely invented by multiple people independently in different shades:
For example, look at how clang_complete works - it requires you to have a simplified compilation database (one compile command that works for all files you work with) in your project. This does not work for a large set of projects, though.

Cheers,
/Manuel

Gentlemen, could you tell me please what you mean by a “compilation
database”? It sounds like a list of files to compile, something I’d
normally look to make(1) to provide. Apparently that’s not what you
mean, or there’s some reason make won’t do the job.

If I understood the problem better perhaps there’s something I
could do about it.

The problem with make is that it executes the compiler based on
changes in the dependency graph.

Imagine you want to run a tool on all files that use a specific
symbol. To do this you need to:

  1. figure out the compile command line for the file you’re pointing into, so you can determine the symbol under the cursor
  2. run the tool over all files that reference that symbol
  3. In an order determined by the dependency graph. :slight_smile:

One more reason: dependency order introduces artificial serialization. When you want to analyze code (e.g. run a tool over all code in a project to collect data), you can do so fully in parallel with the compilation database and a simple call to xargs. This might not be a huge issue for small code bases (for llvm I get ~2x speedup vs. the cmake-generated Makefiles, which are again ~2x vs. the autoconf Makefiles), but for larger projects serialization becomes a major issue.
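As an illustration (a sketch: “my-analysis-tool” stands in for whatever tool you want to run over each translation unit):

    import json, shlex, subprocess
    from multiprocessing import Pool

    with open("compile_commands.json") as f:
        entries = json.load(f)

    def analyze(entry):
        # Re-use the recorded flags; argv[0] is dropped since tools ignore it.
        flags = shlex.split(entry["command"])[1:]
        return subprocess.call(["my-analysis-tool", entry["file"], "--"] + flags,
                               cwd=entry["directory"])

    if __name__ == "__main__":
        # Every translation unit is independent, so there is no
        # dependency-order serialization: map over the whole database at once.
        with Pool() as pool:
            results = pool.map(analyze, entries)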

Cheers,
/Manuel

Changes to a symbol normally occur in a header
> file, affecting the dependency graph in just the way that makes make
> useful. To solve the problem you describe -- to have Clang
> recompile "all files that use a specific symbol" -- one could touch
> (1) all such files and then run make.

Not all build systems work on system time. Some work on content
digests.

I don't know what you mean by "content digests", unless you're talking
about some kind of fingerprint (e.g. MD5) to determine if the content
has changed. That's clearly insufficient if a header file changes.

Even if this worked, the amount of stats that build systems
do in ReallyLargeCodeBases to figure out what to build can be a huge
cost, and can actually dwarf the time it takes to reparse 3 or 4
files you care about.

I doubt that. Could you point me to research supporting that claim?
Experience in pkgsrc shows that recursive make is very slow over
large trees, and that the problem is the process overhead, not solving
the graph.

If you haven't read it, I recommend Peter Miller's "Recursive Make
Considered Harmful" (http://aegis.sourceforge.net/auug97.pdf). It's a
trove of information about misinformation around make.

One more reason: dependency order introduces artificial
serialization.

Are you saying the dependencies of static analysis are lesser than for
compilation? I would think that's ultimately untrue, given that
eventually static analysis would have to be applied to the whole
program, even if that's not the case today.

If you mean make doesn't compile in parallel when it could, I would be
interested to see an example, because afaik that's a solved problem
theoretically.

It seems to me like most innovations look like this at first. I might
be wrong, but at least let me try :wink:

Nothing against you trying; I'm hoping at least one of us will learn
something. :slight_smile:

You may have heard, "Scientists stand on one another's shoulders;
computer scientists stand on one another's toes." A lot of innovations
aren't, and it's not hard to find projects (and products) re-solving
things using new terminology, seemingly unaware how old and well trod
is the ground they're covering. It pays to be skeptical.

Regards,

--jkl

Changes to a symbol normally occur in a header

file, affecting the dependency graph in just the way that makes make
useful. To solve the problem you describe – to have Clang
recompile “all files that use a specific symbol” – one could touch
(1) all such files and then run make.

Not all build systems work on system time. Some work on content
digests.

I don’t know what you mean by “content digests”, unless you’re talking
about some kind of fingerprint (e.g. MD5) to determine if the content
has changed. That’s clearly insufficient if a header file changes.

Yes, I mean a fingerprint like md5.
I don’t understand how this is different to timestamps when it comes to header files - the headers need to be in the dependency graph anyways.

Even if this worked, the amount of stats that build systems
do in ReallyLargeCodeBases to figure out what to build can be a huge
cost, and can actually dwarf the time it takes to reparse 3 or 4
files you care about.

I doubt that. Could you point me to research supporting that claim?

No. I can point you to my anecdotal experience with networked file systems and code bases that are so large that you don’t want to check out everything at once.

Experience in pkgsrc shows that recursive make is very slow over
large trees, and that the problem is the process overhead, not solving
the graph.

If you haven’t read it, I recommend Peter Miller’s “Recursive Make
Considered Harmful” (http://aegis.sourceforge.net/auug97.pdf). It’s a
trove of information about misinformation around make.

One more reason: dependency order introduces artificial
serialization.

Are you saying the dependencies of static analysis are lesser than for
compilation? I would think that’s ultimately untrue, given that
eventually static analysis would have to be applied to the whole
program, even if that’s not the case today.

But that doesn’t mean you can’t apply it in parallel. We’ve had great success using a MapReduce over ~100MLOC code to do very fast parallel static analysis.

If you mean make doesn’t compile in parallel when it could, I would be
interested to see an example, because afaik that’s a solved problem
theoretically.

It seems to me like most innovations look like this at first. I might
be wrong, but at least let me try :wink:

Nothing against you trying; I’m hoping at least one of us will learn
something. :slight_smile:

You may have heard, “Scientists stand on one another’s shoulders;
computer scientists stand on one another’s toes.” A lot of innovations
aren’t, and it’s not hard to find projects (and products) re-solving
things using new terminology, seemingly unaware how old and well trod
is the ground they’re covering. It pays to be skeptical.

Well, obviously I think I am skeptical myself, but I’m aware that’s confirmation bias :wink:

Cheers,
/Manuel

Hi Manuel,

For your information, I started the work on point #2: integrating the compilation database in libclang + the corresponding python binding.

Cheers,
Arnaud de Grandmaison

Hi Manuel,

For your information, I started the work on point #2: integrating the compilation database in libclang + the corresponding python binding.

Awesome - feel free to cc me directly to any patches you send out for review regarding this.

Cheers,
/Manuel

>
> > > To solve the problem you describe -- to have Clang
> > > recompile "all files that use a specific symbol" -- one could
> > > touch(1) all such files and then run make.
> >
> > Not all build systems work on system time. Some work on content
> > digests.
>
> That's clearly insufficient if a header file changes.

I don't understand how this is different to timestamps when it comes
to header files - the headers need to be in the dependency graph
anyways.

I see. There are two orthogonal issues: the dependency graph, and
determining which files in that graph have changed. No matter how
"change" is measured/defined, anything affected by it -- anything
further down the graph -- needs recompilation or relinking.

With make, we express the dependency graph directly in make's
target:source syntax, and "change" is defined in terms of the files'
mtimes. make is quite good at converting its source language into a
DAG, but to determine what's changed it must rely on the underlying
system.

> > Even if this worked, the amount of stats that build systems
> > do in ReallyLargeCodeBases to figure out what to build can be a
> > huge cost, and can actually dwarf the time it takes to reparse 3
> > or 4 files you care about.
>
> I doubt that. Could you point me to research supporting that claim?

No. I can point you to my anecdotal experience with networked file
systems and code bases that are so large that you don't want to check
out everything at once.

I think I see where you're going.

If the source control system (or, say, compilation database) maintains
fingerprints, it's possible to determine what's changed a priori,
without checking out the files to a local working directory. If it
furthermore represents the dependency graph, it's also possible to
check out only "what's changed" prior to rebuilding, and to federate
that rebuilding. In effect, the fingerprints represent change directly,
whereas inode mtime is only a proxy, and not one that scales well along
a network.

Each time make runs, it must stat the whole tree, meaning that its
evaluation time is O(N) in the number of files. A database must also
make as many comparisons, but those comparisons are orders of magnitude
faster than a system call. If N is 100 or even 1000, none of that
matters very much, but if N is 1,000,000 it will be measurable.

It's interesting to think of make divorced from the filesystem, to
imagine a make implementation that used a database of fingerprints to
answer the "is changed" question. I'm guessing that is, in a way, what
you've done or are doing, except that it isn't as general as make.
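Something like this, I imagine, with md5 standing in for whatever fingerprint is used (a sketch; the side-file name is made up):

    import hashlib, json, os

    DB = "fingerprints.json"   # hypothetical fingerprint database

    def digest(path):
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    def changed(paths):
        # Answer "is changed" by content, not mtime: compare each file's
        # fingerprint against the stored one, then update the store.
        old = json.load(open(DB)) if os.path.exists(DB) else {}
        new = {p: digest(p) for p in paths}
        with open(DB, "w") as f:
            json.dump(new, f)
        return [p for p in paths if old.get(p) != new[p]]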

> > One more reason: dependency order introduces artificial
> > serialization.
>
> Are you saying the dependencies of static analysis are lesser than
> for compilation? I would think that's ultimately untrue, given that
> eventually static analysis would have to be applied to the whole
> program, even if that's not the case today.

But that doesn't mean you can't apply it in parallel. We've had great
success using a MapReduce over ~100MLOC code to do very fast parallel
static analysis.

Yes, but part of that "win" is due to the fact that you're not
linking. Static analysis today is comparable to compilation, the
evaluation of a single translation unit.

> Nothing against you trying; I'm hoping at least one of us will learn
> something. :slight_smile:

If I'm on target above, then our experiment yielded unexpected but
useful results!

--jkl

To solve the problem you describe – to have Clang
recompile “all files that use a specific symbol” – one could

touch(1) all such files and then run make.

Not all build systems work on system time. Some work on content
digests.

That’s clearly insufficient if a header file changes.

I don’t understand how this is different to timestamps when it comes
to header files - the headers need to be in the dependency graph
anyways.

I see. There are two orthogonal issues: the dependency graph, and
determining which files in that graph have changed. No matter how
“change” is measured/defined, anything affected by it – anything
further down the graph – needs recompilation or relinking.

With make, we express the dependency graph directly in make’s
target:source syntax, and “change” is defined in terms of the files’
mtimes. make is quite good at converting its source language into a
DAG, but to determine what’s changed it must rely on the underlying
system.

Even if this worked, the amount of stats that build systems
do in ReallyLargeCodeBases to figure out what to build can be a
huge cost, and can actually dwarf the time it takes to reparse 3
or 4 files you care about.

I doubt that. Could you point me to research supporting that claim?

No. I can point you to my anecdotal experience with networked file
systems and code bases that are so large that you don’t want to check
out everything at once.

I think I see where you’re going.

If the source control system (or, say, compilation database) maintains
fingerprints, it’s possible to determine what’s changed a priori,
without checking out the files to a local working directory. If it
furthermore represents the dependency graph, it’s also possible to
check out only “what’s changed” prior to rebuilding, and to federate
that rebuilding. In effect, the fingerprints represent change directly,
whereas inode mtime is only a proxy, and not one that scales well along
a network.

Each time make runs, it must stat the whole tree, meaning that its
evaluation time is O(N) in the number of files. A database must also
make as many comparisons, but those comparisons are orders of magnitude
faster than a system call. If N is 100 or even 1000, none of that
matters very much, but if N is 1,000,000 it will be measurable.

It’s interesting to think of make divorced from the filesystem, to
imagine a make implementation that used a database of fingerprints to
answer the “is changed” question. I’m guessing that is, in a way, what
you’ve done or are doing, except that it isn’t as general as make.

One more reason: dependency order introduces artificial
serialization.

Are you saying the dependencies of static analysis are lesser than
for compilation? I would think that’s ultimately untrue, given that
eventually static analysis would have to be applied to the whole
program, even if that’s not the case today.

But that doesn’t mean you can’t apply it in parallel. We’ve had great
success using a MapReduce over ~100MLOC code to do very fast parallel
static analysis.

Yes, but part of that “win” is due to the fact that you’re not
linking. Static analysis today is comparable to compilation, the
evaluation of a single translation unit.

Yes. But if you want to statically analyze a whole code base for global information, you need to analyze every TU (just as you would need to compile every TU). Imposing a dependency-graph ordering here would introduce sequential bottlenecks (as make -j shows). On the other hand, global analysis of course still relies on the build, as generated sources need to be available. But given that we often want to run multiple iterations of static analysis at a single point of the code, this pays off big time.

Nothing against you trying; I’m hoping at least one of us will learn
something. :slight_smile:

If I’m on target above, then our experiment yielded unexpected but
useful results!

More details about what I’m talking about can be found here:
http://google-engtools.blogspot.de/
(use the left-side navigation bar to find everything that has “Build” in the title)

Cheers,
/Manuel

This part I don't understand. What kind of dependency graph restrictions
are you thinking about? If you have functional dependencies of one TU on
another TU, you can't just drop those. Consider build tools like yacc in
an integrated OS build as an example.

Joerg

This is exactly what I meant when I wrote: “On the other hand, having global analysis of course still relies on the build, as generated sources need to be available. But given that we often want to run multiple iterations of static analysis at a single point of the code, this pays off big time.”

Note that this is not about running something when you change code (that’s what build tools are for), but about running a tool when you want to change something, or running multiple static-analysis passes over the same version of the code.

Cheers,
/Manuel

Manuel,

Here is a preliminary patch to let you know where I stand right now, as I will be away for a few days. It has the libclang C functions + python bindings to export the CompilationDatabase functionality. It is already usable with the python script attached.

Feel free to make any comments before I go further down this road.

A few remarks / questions :

  • no tests yet. For libclang, I intend to do it through c-index-test and provide a ‘compile_commands.json’ in test/Index. The JSON syntax does not allow comments to be embedded inside, so I will probably have to add another file to manage the c-index-test call with lit. For the python binding, I will enhance the attached python script.
  • What is the invariant for class CompileCommand? I understood from the mailing list that CommandLine[0] is the compiler executable. Are there any others?
  • Error management: simplistic for now. It either aborts with a hopefully useful message, or returns a value which the caller is expected to interpret correctly. Any thoughts on this?
  • (not related to this patch) Would it make sense for CompileCommand to provide a higher-level interface for the CommandLine (at the C++ level)? As a user of the CommandLine for a CompileCommand, I would have expected an easy way to access (separately) its important parts: the executable name (OK, that’s the first argument), the input file and the output file. This captures the fact that a CommandLine is $(compiler) $(bunch_of_args) -o $(output) $(input) and enables users to adapt to $(compiler) or mangle $(output) / $(input) depending on the application’s needs.

Cheers,

cindex-cdb.patch (11.8 KB)

test-cdb.py (568 Bytes)
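For reference, the attached test-cdb.py exercises roughly this shape of API (a sketch; the exact names may still change during review, and the paths are examples):

    from clang.cindex import CompilationDatabase

    # Load the database from the directory containing compile_commands.json
    # and look up the commands recorded for one file.
    cdb = CompilationDatabase.fromDirectory("/path/to/build")
    for cmd in cdb.getCompileCommands("/path/to/project/foo.cpp"):
        print(cmd.directory)
        print(list(cmd.arguments))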

Hi Manuel,

For your information, I started the work on point #2 : integrating compilation database in libclang + the corresponding python binding.

Awesome - feel free to cc me directly to any patches you send out for review regarding this.

Cheers,
/Manuel

Manuel,

Here is a preliminary patch to let you know where I stand right now, as I will be away for a few days. It has the libclang C functions + python bindings to export the CompilationDatabase functionality. It is already usable with the python script attached.

Cool!

Feel free to make any comments before I go further down this road.

A few remarks / questions :

  • no tests yet. For libclang, I intend to do it through c-index-test and provide a ‘compile_commands.json’ in test/Index. The JSON syntax does not allow comments to be embedded inside, so I will probably have to add another file to manage the c-index-test call with lit. For the python binding, I will enhance the attached python script.

We actually throw the JSON file at a YAML parser. I don’t think we should put that into the contract for the format, but for testing our implementation we can make use of that fact.

  • What is the invariant for class CompileCommand? I understood from the mailing list that CommandLine[0] is the compiler executable. Are there any others?

There are none. The contract is that it contains the argv of whatever tool was executed to run the compilation. We just added support for conversion functions. For example, we’ll at least need to convert gcc command lines to clang command lines.
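In the simplest case that conversion is little more than renaming argv[0] and dropping flags clang doesn’t understand; a sketch (the flag list is illustrative only, a real converter needs to handle far more):

    # Sketch: naive gcc -> clang command-line conversion.
    GCC_ONLY_FLAGS = {"-fno-var-tracking", "-mno-fused-madd"}  # illustrative

    def gcc_to_clang(argv):
        driver = "clang++" if argv[0].endswith("++") else "clang"
        return [driver] + [a for a in argv[1:] if a not in GCC_ONLY_FLAGS]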

  • Error management: simplistic for now. It either aborts with a hopefully useful message, or returns a value which the caller is expected to interpret correctly. Any thoughts on this?

I think an exception is more Pythonic.

  • (not related to this patch) Would it make sense for CompileCommand to provide a higher-level interface for the CommandLine (at the C++ level)? As a user of the CommandLine for a CompileCommand, I would have expected an easy way to access (separately) its important parts: the executable name (OK, that’s the first argument), the input file and the output file. This captures the fact that a CommandLine is $(compiler) $(bunch_of_args) -o $(output) $(input) and enables users to adapt to $(compiler) or mangle $(output) / $(input) depending on the application’s needs.

I don’t think it makes sense for CompileCommand. Getting out information means command line parsing. Now, if you ask whether we should have a standalone command line parsing class in clang, I’d say “yay!”, but currently the only entity that can parse clang command lines is the driver when building up a compiler object.

Apart from that:

  • CompilationDatabase::loadFromDirectory(…) will support other ways than using compile_commands.json in the future - it is the major integration point with build systems; I’d rather have the comments say “for example” so the comments don’t get out of date.

In general, the direction looks good.

Thanks!
/Manuel

Having gone down that road myself, you may be interested in PyMake: http://benjamin.smedbergs.us/pymake/

It is make implemented in Python. It isn't feature complete, but most of it is there. Mozilla uses it to build Firefox, specifically on Windows, where the process-creation overhead of regular make adds significant time to builds.

The built-in make functions are implemented as Python methods, and there is even some logic to turn basic rules into in-line Python. Also, if your rule begins with a '%' it is converted into a Python function call instead of a shell invocation. If you change e.g. CC to '% mybuildsystem.compile' you can simply implement the function 'compile' in a 'mybuildsystem' module (on sys.path) which can capture the arguments, record execution times, etc. (see the sketch below). It's a pretty nifty backdoor.
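A capture-and-delegate module might look roughly like this; the exact signature pymake hands a native command may differ, so treat it as an illustration of the idea rather than tested pymake glue:

    # mybuildsystem.py (sketch): invoked via a rule command of
    # '% mybuildsystem.compile', assuming pymake passes the argument list.
    import json, subprocess, time

    def compile(args):
        start = time.time()
        rc = subprocess.call(["g++"] + list(args))   # delegate to the real compiler
        record = {"args": list(args), "seconds": time.time() - start, "rc": rc}
        with open("compile_log.jsonl", "a") as f:    # hypothetical capture file
            f.write(json.dumps(record) + "\n")
        return rc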

If you have any questions, it is arguably off-topic for this list, so just ping me on IRC - I'm IndyGreg. Or, join #pymake on irc.mozilla.org (I'm gps there).

Gregory