[GSoC] Remote-compiling programs with Clang

Hi,

For many years, Google has organized its famous Google Summer of Code program, sponsoring students to work on their favorite Free Software project during the summer. This year is no exception: Google announced its GSoC program a few weeks ago.

In 2011, I had the chance to work on an OpenCL implementation used in Mesa3D and now known as Clover (my work concentrated on a software implementation and has since been largely rewritten to use hardware acceleration when possible). This experience allowed me to use Clang as a library, and it did exactly what I wanted. Even though Clang was not the main part of my work, I read some documentation and learned how it is architected at a high level.

Since then, my Computer Science studies have taken all my time and I haven't been able to work on any public Free Software; my spare time was dedicated to pet projects and experimentation. C++ being my favorite programming language, I have used it extensively for the past two years. My only computer has a very slow 2x1.6 GHz E-350 processor, so I have seen first-hand how slow C++ can be to compile.

This made me dream of a fast way to compile C++. I did a bit of research and found interesting things, mainly the well-known distcc and an LLVM presentation about using "modules" instead of includes to speed up C++ compilation. This suggests two ways to speed up the compilation of large C++ projects:

  • Delegating the compilation to a more powerful machine (the delegation must be fast to be worthwhile)
  • Avoiding re-parsing and re-compiling files that don't change

The first point is already well covered by distcc, but distcc has limitations (for instance, it preprocesses the code on the local computer and then sends a huge amount of data over the network, which makes using more than 2 or 3 remote computers pointless). You can ask distcc not to preprocess the code on the local machine, but then the local and remote machines need to run the exact same operating system and have all the include files at the same locations.

The second point was partly addressed by the presentation about modules: store information about the public interface of a module in a compiler-friendly form and avoid parsing thousands of header files.

I don't know whether Clang will apply for GSoC this year, or whether its application is separate from LLVM's, but if it is possible, I would like to work on implementing a client/server architecture for Clang, like the one offered by distcc (but focused on Clang this time).

The first point, delegation, will be handled by a simple protocol (maybe based on something that already exists, like HTTP): the local machine sends source files and compilation options to the remote ones, and the remote machines then start parsing and compiling them. A remote machine never uses its own headers or touches its local file system: every header referenced directly or indirectly by the files it compiles is either read from an in-memory cache or requested from the machine it is compiling for.

If I want to compile "foo.c", which includes "foo.h" and "bar.h", and "foo.h" also includes "bar.h", then the local machine L sends foo.c to the remote machine R. R starts analyzing the file and sees that it doesn't know about foo.h, so it asks L for foo.h and receives it. Then foo.h includes bar.h, which is fetched the same way. Once foo.h is parsed, we come back to foo.c, which also includes bar.h. This time, bar.h is already in R's cache, and nothing is sent over the network.
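
To make the example concrete, here is a rough simulation of R's include resolution (all names such as fetchFromL are made up for illustration; nothing here is an existing Clang API). It prints which files actually cross the network:

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Mirrors the example: foo.c includes foo.h and bar.h; foo.h includes bar.h.
    static const std::map<std::string, std::vector<std::string>> kIncludes = {
        {"foo.c", {"foo.h", "bar.h"}},
        {"foo.h", {"bar.h"}},
        {"bar.h", {}},
    };

    // In-memory cache on the remote machine R.
    static std::map<std::string, std::string> cache;

    // Stand-in for a network request from R back to the local machine L.
    std::string fetchFromL(const std::string &path) {
      std::cout << "network: R asks L for " << path << "\n";
      return "<contents of " + path + ">";
    }

    // R resolves a file: a cache hit costs nothing, a cache miss asks L.
    const std::string &getFile(const std::string &path) {
      auto it = cache.find(path);
      if (it == cache.end()) {
        it = cache.emplace(path, fetchFromL(path)).first;
      } else {
        std::cout << "cache:   " << path << " already on R, nothing sent\n";
      }
      return it->second;
    }

    // Recursively "parse" a file, resolving its includes as they appear.
    void parse(const std::string &path) {
      getFile(path);
      for (const std::string &inc : kIncludes.at(path))
        parse(inc);
    }

    int main() {
      cache["foo.c"] = "<contents of foo.c>"; // L pushed the source file itself.
      parse("foo.c"); // foo.h and bar.h cross the network once each; the
                      // second include of bar.h is served from R's cache.
    }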

Every file stored in the cache is accompanied by an environment description. Since LLVM can easily generate code for plenty of processor architectures, the R machine can be anything Clang and LLVM can run on: the client machines give it everything it needs, such as architecture-specific include files, compilation options, and the target triple.
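
For illustration, the environment description attached to each cache entry might look something like this (a hypothetical sketch, not an existing format):

    #include <string>
    #include <vector>

    // Hypothetical description of the client environment that travels with
    // every cached file, so one remote machine can serve several clients
    // and targets without ever touching its own headers.
    struct EnvironmentDescription {
      std::string targetTriple;                // e.g. "x86_64-unknown-linux-gnu"
      std::vector<std::string> includeDirs;    // client-side include search path
      std::vector<std::string> compileOptions; // e.g. "-O2", "-std=c++11"
    };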

For this cache mechanism to be efficient, the cache needs to be kept between compilation units (the client machines could inform the server machines that a file may change and must be re-checked once per compilation unit, to avoid using out-of-date source files, headers or generated files).
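
One possible shape for that check, again with made-up names (the point is only that volatile files are revalidated at most once per compilation unit):

    #include <set>
    #include <string>

    class CacheValidator {
    public:
      // Returns true if the cached copy of `path` may be used.
      bool isUpToDate(const std::string &path, const std::string &cachedHash) {
        if (!volatileFiles.count(path))
          return true;                    // Never flagged by the client: trust it.
        if (checkedThisUnit.count(path))
          return true;                    // Already revalidated for this unit.
        checkedThisUnit.insert(path);
        return askClientForHash(path) == cachedHash;
      }

      // Called between compilation units so volatile files are checked again.
      void startNewCompilationUnit() { checkedThisUnit.clear(); }

      std::set<std::string> volatileFiles; // Filled from client notifications.

    private:
      std::set<std::string> checkedThisUnit;
      // Stub for the network round-trip fetching the client's current hash.
      std::string askClientForHash(const std::string &) { return ""; }
    };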

This remark brings me to the second point: avoiding re-parsing files. I don't know Clang's internals well enough, but is it possible for Clang to parse a header and keep a pre-parsed AST for it in memory? If this AST is a complete one (with every included file expanded inline), updating an included file will force Clang to rebuild the AST, but that is already the case for precompiled headers. Keeping these ASTs in memory and using them every time a file is included could give a fair speedup to the compilation process.

So, what do you think about all this? Is this an idea that could be implemented during the summer, and are you interested in it?

Best regards,
Denis Steckelmacher.

Note: I'm not subscribed to cfe-dev, so I added myself to the CC list of this mail. I hope this will let me see any responses.

Hi,

Just responding to a small part of your mail. It is easily possible for Clang to keep a pre-parsed AST for a header in memory, but that’s not quite enough. There is some internal state in other parts (particularly Sema) that also needs to reflect the code parsed so far.
Our precompiled header machinery can actually fully restore the compiler state, of course.
The other issue is the same one that affects PCH: header files are order-dependent. If the header state isn't an exact prefix of the compilation unit, it's incorrect to use it. Some coding conventions thus make PCH very hard to use: LLVM's own convention of putting the primary header as the first thing in the file, for example, makes PCH pretty much useless.
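
A contrived illustration (not taken from LLVM) of that order dependence:

    // a.h
    #define USE_DOUBLE 1

    // b.h: parses differently depending on whether a.h came first.
    #ifdef USE_DOUBLE
    typedef double real;
    #else
    typedef float real;
    #endif

    // foo.cpp includes a.h then b.h, so `real` is double here...
    #include "a.h"
    #include "b.h"

    // bar.cpp includes b.h alone, so `real` is float. A PCH built from
    // the "a.h + b.h" prefix would therefore be wrong for bar.cpp.
    #include "b.h"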

Sebastian

I just copied a part of your message to reply to.

Are you aware of Icecream? It is inspired by distcc and is often better than distcc for large build farms. Clang is supported in the latest versions. It doesn't solve the network bandwidth issue, but I find it better for large compile farms. (It even supports cross-compiling on machines whose architecture differs from your own!)

That isn't to say your idea is bad; there are things that a Clang-based server could do well. However, there are also limits to such a system. (What if machines have different versions of Clang?) I encourage someone to try it out and see if the advantages are worth the limits (or perhaps the limits I can think of are solvable). For the near future, current tools are good enough.

Hi Denis,

What you are trying to achieve is a good idea. There's another,
similar idea floating in the air for the ninja build system: a
distributed compilation mode for ninja would be useful. It would not
send files explicitly over the network; instead, it would rely on a
shared filesystem. Of course, task scheduling would still be done
using sockets as usual. The advantage of this approach is that it is
not limited to Clang as the compiler, and since it relies on a shared
filesystem, implementing it in ninja could be easier than in Clang.

If I want to compile "foo.c", which includes "foo.h" and "bar.h", and
"foo.h" also includes "bar.h", then the local machine L sends foo.c to
the remote machine R. R starts analyzing the file and sees that it
doesn't know about foo.h, so it asks L for foo.h and receives it. Then
foo.h includes bar.h, which is fetched the same way. Once foo.h is
parsed, we come back to foo.c, which also includes bar.h. This time,
bar.h is already in R's cache, and nothing is sent over the network.

This is the most important difference between what you propose and
what we already have in distcc (and Icecream?). In distcc, the
preprocessing is done on the machine that initiates the build. This
is a disadvantage, because it limits the parallelism to that machine's
computational power. In your proposal, the preprocessor runs on the
remote machine, which requests files from the initiating machine on
the fly. Am I correct in my understanding of your proposal?

If so, this requires abstracting away all the file I/O so that it can
be intercepted. That might be hard, depending on how the code is
currently layered. We do have libSupport, and all file I/O should be
done through it, so intercepting I/O could actually be easy; the hard
part is making it run fast over the network. As a remedy, we could
essentially invent a filesystem caching layer, but I doubt we want
that.
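
As a rough sketch of the seam I mean (a hypothetical interface, not
libSupport's actual API):

    #include <memory>
    #include <string>

    // Hypothetical choke point for all file I/O: the compiler asks this
    // interface for file contents instead of opening files directly.
    struct FileSystem {
      virtual ~FileSystem() = default;
      // Returns nullptr when the file cannot be provided.
      virtual std::unique_ptr<std::string> getBuffer(const std::string &path) = 0;
    };

    // A remote build node would install an implementation of this that
    // forwards cache misses over the network and answers repeated requests
    // from memory (essentially the caching layer mentioned above).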

Dmitri

The second point was partly addressed by the presentation about modules: store information about the public interface of a module in a compiler-friendly form and avoid parsing thousands of header files.

There is more to modules than keeping the public interface in a compiler-friendly form. The thing is, a module is (normally) completely independent, unlike a header. If you include two headers A and B, then depending on which one goes first, the second is processed differently (because you already got some of the common dependencies, because A defines a macro that influences the parsing of B, because A defines a template class that may influence the parsing of B, ...). This does not happen with modules (at least, not in other languages), and therefore, since each module is "stand-alone", instead of compiling a module each time it is included somewhere, you only recompile it if the module itself or one of its dependencies changed.
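
For illustration, Clang's in-progress implementation describes such a stand-alone module with a module map; a minimal one looks roughly like this (the syntax is still evolving, so treat it as a sketch):

    // module.map: declares MyLib as a self-contained module. It is parsed
    // once, independently of whatever the including file defined before it,
    // and only recompiled when MyLib.h or its dependencies change.
    module MyLib {
      header "MyLib.h"
      export *
    }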

The implementation of modules is already ongoing (I've seen Douglas Gregor and Argyrios K[...] working on this), but it's definitely not ready for prime time. I don't know of any deadline, either.

Since modules are the way forward, and since they will radically change how compilation behaves, do you really want to invest in a remote-compilation scheme now?

-- Matthieu

Hi,

First of all, I would like to thank everyone who responded to my message for their good remarks, and particularly for the pointers to distcc-like tools. I will take a look at them.

The difference between modules and simple order-dependent include files interests me greatly. Where can I find more information about the development status of modules? Is there already a repository where I can find some code to play with?

Denis Steckelmacher.

It is in trunk. Look for commits from Douglas Gregor that have
"modules" in the commit log. He also gave an awesome talk during the
dev meeting [1].

[1] http://llvm.org/devmtg/2012-11/