Now, I'm investigating what it would take to build such an implementation. I'm curious about the following:
1. How hard will it be to navigate the LLVM/clang codebase having very little compiler domain knowledge?
Clang is a very easy codebase to get to grips with. I think it took me about a week from starting to read the code to getting my first patch accepted, and that's not particularly unusual.
2. What stages of the compilation are worth parallelizing (at least for a first step)?
As I recall, distcc runs the preprocessor on one machine and then ships the preprocessed code to the others. Preprocessing is therefore a serial step, which means that Amdahl's law starts to bite at around 8 nodes (I think; see Chris's slides, where he talked about preprocessing time, for the real numbers).
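To put rough numbers on that, here is a minimal sketch of Amdahl's law. The 10% preprocessing share is a made-up figure for illustration only (it is not the number from the slides), and amdahl_speedup is just a name I picked.

```python
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work stays on one machine."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


# Assumption: preprocessing is 10% of the single-machine build time.
ASSUMED_PREPROCESS_FRACTION = 0.10

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} nodes: at most {amdahl_speedup(ASSUMED_PREPROCESS_FRACTION, n):.2f}x")
```

Under that assumption the speedup can never exceed 10x however many nodes you add, and 8 nodes already give you less than 5x. Plug in a measured preprocessing fraction to see where the curve flattens for a real build.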
There was some talk about parallelising the preprocessing too. This would require farming out all of the header files to each of the nodes. This isn't quite as bad as it sounds. You can cache the files at the distribution end and send a set of timestamps for the system headers with the new source files, so that headers that have not been modified don't need to be re-requested.
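The timestamp check could look something like the following on a remote node. This is only a sketch: the function name, the dictionary layout, and the example paths and timestamps are all invented for illustration, and a real implementation would also have to store the fetched files.

```python
from typing import Dict, List


def headers_to_request(manifest: Dict[str, float], cache: Dict[str, float]) -> List[str]:
    """Return the headers this node must fetch from the controlling machine.

    `manifest` maps header path -> modification time as seen on the controlling
    machine (sent along with the source file). `cache` maps header path ->
    modification time of the copy this node already holds.
    """
    # Any mismatch counts as stale, not just an older timestamp, so a header
    # that was replaced by an older version is refetched too.
    return sorted(path for path, mtime in manifest.items() if cache.get(path) != mtime)


# Hypothetical data: one header unchanged, one modified, one never seen before.
manifest = {"/usr/include/stdio.h": 100.0, "/usr/include/stdlib.h": 250.0, "config.h": 300.0}
cache = {"/usr/include/stdio.h": 100.0, "/usr/include/stdlib.h": 200.0}

print(headers_to_request(manifest, cache))
```

Only the modified and the missing header are requested; the unchanged one is served from the node's cache.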
After preprocessing, the building of the AST, IR generation, and optimisation are all trivial to parallelise, as they are largely independent. Just ship the preprocessed code to another node and have it run all of these steps. Until LLVM does native machine code emission, you probably don't want to be generating the binary on the remote node even without LTO, so just ship the (optimised) IR back for linking.
Link-time optimisations probably can't easily be parallelised, because they need to run once all of the other compilation steps have run, and the same is true of linking. Of course, if you're using a parallel make, you can maybe ship some of these off to different machines (e.g., when building clang, link each of the modules on a separate machine and then only do the final link of the clang tool on one machine). This is a bit beyond the scope of distcc, however.
On some systems, particularly when compiling Objective-C on OS X, I've noticed that the process creation and tear-down time for the compiler is actually the performance bottleneck in a lot of cases. With clang's architecture, you could probably get some performance improvements by keeping the dist-clang processes around and just passing them new data to compile, rather than spawning a new process for every file.
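As a rough illustration of the long-lived worker idea, here is a sketch of a loop that keeps reading jobs from a stream instead of being started once per file. Everything in it is an assumption of mine: the length-prefixed framing, the function names, and fake_compile, which merely stands in for the real compiler. In practice the streams would be a socket or pipe rather than in-memory buffers.

```python
import io
import struct
from typing import BinaryIO, Callable, Optional


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write a 4-byte big-endian length followed by the payload."""
    stream.write(struct.pack("!I", len(payload)))
    stream.write(payload)


def read_frame(stream: BinaryIO) -> Optional[bytes]:
    """Read one frame, or return None at a clean end of stream."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("!I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated frame")
    return payload


def serve(jobs_in: BinaryIO, results_out: BinaryIO,
          compile_unit: Callable[[bytes], bytes]) -> int:
    """Handle jobs until the input ends; return how many were handled."""
    handled = 0
    while True:
        source = read_frame(jobs_in)
        if source is None:
            break
        write_frame(results_out, compile_unit(source))
        handled += 1
    return handled


def fake_compile(source: bytes) -> bytes:
    """Placeholder for the real compiler: just tags the input."""
    return b"IR:" + source


# Demo with in-memory streams: two jobs handled by a single worker loop.
jobs = io.BytesIO()
for unit in (b"a.m", b"b.m"):
    write_frame(jobs, unit)
jobs.seek(0)

results = io.BytesIO()
handled = serve(jobs, results, fake_compile)

results.seek(0)
outputs = []
while (frame := read_frame(results)) is not None:
    outputs.append(frame)
print(handled, outputs)
```

The point is only that the process start-up cost is paid once per worker, not once per translation unit.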
3. Will it be feasible to implement a basic distcc implementation in 1-2 months? There should be 4 or so people working on the project, but none of us have significant compiler domain knowledge. If not, is there a subset of the problem that's worth working on?
It's a pretty open-ended problem. I think it's probably possible to achieve something useful in 1-2 months, with scope for future improvements.
4. Are there any examples of code (preferably in real-world projects) which would lend themselves to parallel compilation which come to mind? At the end of the project, we'll need to document the performance of our work, so I'd like to be thinking about how we'd create (good) presentable results along the way.
Clang and LLVM come to mind. Big codebase, lots of separate files.
5. Where should I start? :). Obviously this is a pretty large undertaking, but is there any documentation that I should look at? Any particular source files that would be relevant?
I'd start by taking a look in the clang driver. For this project, you can treat most of the compiler as a black box. The important thing is getting the source code in and the compiled code out. You might consider extending the SourceManager stuff in Basic to allow fetching headers over the network (and implement a cache coherency protocol to let these be invalidated if they are modified on the controlling node) if you want to distribute the preprocessing.
Beyond that, you want to make sure that you are constructing an instance of the compiler classes on the remote machine that has the same options (LangOpts mainly) set as the local one would. If you are doing something more clever than just invoking clang over ssh, you might also want to provide a new DiagnosticClient that reports errors and warnings on the control node.
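One cheap way to check that both ends agree on the options is to send a fingerprint of them with each job and have the remote node compare it against the options it is about to use. The sketch below assumes the options can be flattened into a plain dictionary; the option names are invented and do not correspond to actual LangOpts fields.

```python
import hashlib
import json
from typing import Any, Dict


def options_fingerprint(options: Dict[str, Any]) -> str:
    """Hash a canonical encoding of the options so key order does not matter."""
    canonical = json.dumps(options, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical option sets on the control node and on a remote node.
local_options = {"objc": True, "exceptions": False, "std": "c99"}
remote_options = {"std": "c99", "exceptions": False, "objc": True}
stale_options = {"objc": True, "exceptions": True, "std": "c99"}

print(options_fingerprint(local_options) == options_fingerprint(remote_options))
print(options_fingerprint(local_options) == options_fingerprint(stale_options))
```

A mismatch tells the remote node to refuse the job (or rebuild its compiler instance) rather than silently compile with different settings.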
-- Sent from my Cray X1