For many years, Google has organized its famous Google Summer of Code program, sponsoring students to work on their favorite Free Software project during the summer. This year is no exception: Google announced its GSoC program a few weeks ago.
In 2011, I had the chance to work on an OpenCL implementation used in Mesa3D and now known as Clover (my work concentrated on a software implementation and has since been largely rewritten to use hardware acceleration when possible). This experience allowed me to use Clang as a library, and it did exactly what I wanted. Even though Clang was not the main part of my work, I read some documentation and learned how it is architected at a high level.
Since then, my Computer Science studies have taken all my time and I haven’t been able to work on any public Free Software; my spare time was dedicated to pet projects and experimentation. C++ being my favorite programming language, I have used it extensively for the past two years. My only computer has a very slow dual-core 1.6 GHz E-350 processor, so I have seen first-hand how slow C++ can be to compile.
This made me dream of a faster way to compile C++. I did a bit of research and found interesting things, mainly the well-known distcc and an LLVM presentation about using “modules” instead of includes to speed up C++ compilation. These suggest two ways to speed up the compilation of large C++ projects:
- Delegating the compilation to a more powerful machine (the delegation itself must be fast to be worth using)
- Avoiding re-parsing and re-compiling files that haven’t changed
The first point is already well covered by distcc, except that distcc has some limitations. For instance, it preprocesses the code on the local computer and then sends a large amount of data over the network, which makes it hard to benefit from more than 2 or 3 remote computers. You can ask distcc not to preprocess the code on the local machine, but the local machine and the remote one then need to run the exact same operating system and have all the include files at the same locations.
The second point was partly addressed by the presentation about modules: store information about the public interface of modules in a compiler-friendly form and avoid parsing thousands of header files over and over.
I don’t know if Clang will apply for the GSoC this year, or whether its application is separate from LLVM’s, but if it is possible, I would like to work on implementing a client/server architecture for Clang, similar to the one provided by distcc (but focused on Clang this time).
The first point, delegation, will be handled by a simple protocol (maybe based on something already existing, like HTTP): the local machine sends source files and compilation options to the remote ones, and the remote machines then parse and compile them. A remote machine never uses its own headers or touches its local file system: every header referenced directly or indirectly by the files it compiles is either read from an in-memory cache or requested from the machine for which it is compiling.
If I want to compile “foo.c”, which includes “foo.h” and “bar.h”, and “foo.h” also includes “bar.h”, then the local machine L sends foo.c to the remote machine R. R starts analyzing the file and sees that it doesn’t know about foo.h, so it asks L for foo.h and receives it. Then foo.h includes bar.h, which is requested the same way. When foo.h has been parsed, we come back to foo.c, which also includes bar.h. This time, bar.h is already in R’s cache, and nothing is sent over the network.
Every file stored in the cache is accompanied by an environment description. Since LLVM can easily generate code for plenty of processor architectures, the R machine can be anything Clang and LLVM run on: the client machines give it everything it needs, such as architecture-specific include files, compilation options, the target triple, etc.
For this cache mechanism to be efficient, the cache needs to be kept between compilation units. The client machines could inform the server machines that a file may change and must be re-checked once per compilation unit, to avoid using out-of-date source files, headers or generated files.
This remark brings me to the second point: avoiding re-parsing files. I don’t know Clang’s internals well enough, but is it possible for Clang to parse a header and keep a pre-parsed AST for it in memory? If this AST is a complete one (with every included file expanded inline), updating an included file would force Clang to rebuild the AST, but that is already the case for precompiled headers. Keeping these ASTs in memory and reusing them every time a file is included could give a fair speedup to the compilation process.
So, what do you think about all this? Is this an idea that could be implemented during the summer, and are you interested in it?
Note: I haven’t subscribed to cfe-dev, so I added myself to the CC list of this mail. I hope this will let me see any responses.