my gsoc proposal: llvm/clang distcc

Hi,
here is my proposal:

Synopsis
The main purpose of this project is to implement a clang/llvm-based distcc.
Distcc means distributed compiler; it can be used as a replacement for gcc.
Clang distcc will support distributed (over the network) compilation for any architecture
supported by llvm. The main advantages of llvm/clang distcc over the original gcc-based distcc are
performance (less memory use, shorter compile times) and customizability.
All of these benefits come from llvm and clang.
The new distcc will be a compiler driver (or frontend) built from clang and llvm libraries.
The driver will have two usage modes: gcc option mode and clang option mode.
So it can be used as a drop-in replacement for gcc, and it will handle all distribution and caching tasks.
There will also be an admin daemon, which will handle configuration and distribute incoming
requests to the nodes. A distcc daemon will run on each node and handle incoming tasks;
this is what does the compilation work.
Clang distcc will support languages via clang, so currently C and Objective-C will be supported.

Functionality details
Usage:

  • set up the network:
      - start the admin daemon on one node
      - start a distcc daemon on each node
      - register the nodes with the admin daemon
  • set up the local machine:
      - configure distcc (set the admin node address); this will generate a config file
  • use it, e.g. make CC=distcc (see the example session below)
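For illustration, a typical session might look like this (all command names and flags below are hypothetical placeholders for this sketch, not a final interface):

    # on the admin node (hypothetical commands/flags)
    $ distcc-admin --listen 0.0.0.0:3632

    # on each compile node, pointing at the admin daemon
    $ distccd --admin adminhost:3632

    # on the local machine: configure once, generating the config file
    $ distcc --setup-admin adminhost:3632

    # then build as usual
    $ make CC=distcc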

Implementation details
The development will be done incrementally, from the simplest solution to more complex ones.
The simplest solution: all source parsing is done locally, and the built AST is distributed to a Node
for optimization and code generation; the Node sends the result back when it is done.
An advanced solution: a file-sharing protocol is used to share the local source files (for inclusion);
parsing is then done on the Node side, and file inclusion happens via the file-sharing protocol.
A more advanced solution: built ASTs are cached in a central database to avoid parsing and
building them on every compile. This is especially useful for header files.
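To make the simplest solution concrete, the client-side flow could look roughly like the interface sketch below. Everything here (parseAndSerializeAST, pickLeastLoadedNode, sendJob, the wire format) is a hypothetical placeholder to be built on the clang/llvm libraries and the network layer, not an existing clang API:

    // Interface sketch of the simplest distribution flow (all helpers hypothetical).
    #include <string>
    #include <vector>

    struct Node { std::string host; int port; };

    // To be implemented on top of clang/llvm and the thin network layer:
    std::vector<char> parseAndSerializeAST(const std::string &sourceFile,
                                           const std::vector<std::string> &args);
    Node pickLeastLoadedNode();  // asks the admin daemon for the best Node
    std::vector<char> sendJob(const Node &n, const std::vector<char> &ast,
                              const std::vector<std::string> &args);
    void writeFile(const std::string &path, const std::vector<char> &bytes);

    void compileRemotely(const std::string &src, const std::string &out,
                         const std::vector<std::string> &args) {
      // 1. Parse locally: all header/macro dependencies are resolved here.
      std::vector<char> ast = parseAndSerializeAST(src, args);
      // 2. Ship the serialized AST to a Node for optimization and codegen.
      std::vector<char> object = sendJob(pickLeastLoadedNode(), ast, args);
      // 3. Store the returned object file as if it had been compiled locally.
      writeFile(out, object);
    }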
So there will be these standalone programs:

  • distcc, supporting gcc options
  • distcc, supporting clang options
  • a distcc daemon for the Nodes (the network is composed of Nodes, which do the compilation work)
  • a distcc admin daemon (stores information about the distcc Node network)

All the necessary software components are available in the llvm/clang sources, except for network handling.
So a thin network layer will be implemented for unix and windows platforms.
The new distcc driver will be placed in the clang/Driver directory.
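As a rough sketch, the portable core of that layer might be little more than standard BSD sockets with a winsock fallback (error handling and cleanup are trimmed for brevity; on windows this also needs linking against ws2_32):

    // Minimal portable TCP connect for the thin network layer.
    #ifdef _WIN32
    #include <winsock2.h>
    #include <ws2tcpip.h>
    typedef SOCKET socket_t;
    #else
    #include <sys/socket.h>
    #include <netdb.h>
    typedef int socket_t;
    #endif
    #include <cstring>

    socket_t connectTo(const char *host, const char *port) {
    #ifdef _WIN32
      WSADATA wsa;                       // winsock needs explicit startup
      WSAStartup(MAKEWORD(2, 2), &wsa);
    #endif
      addrinfo hints, *res;
      std::memset(&hints, 0, sizeof hints);
      hints.ai_family = AF_UNSPEC;       // IPv4 or IPv6
      hints.ai_socktype = SOCK_STREAM;   // TCP
      if (getaddrinfo(host, port, &hints, &res) != 0)
        return (socket_t)-1;
      socket_t s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
      if (connect(s, res->ai_addr, (int)res->ai_addrlen) != 0)
        s = (socket_t)-1;                // error path kept minimal for the sketch
      freeaddrinfo(res);
      return s;
    }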

For caching, a cached AST can be identified by the MD5 sum of the source file, the MD5 sums of the
files it includes, and the options used during parsing (defines).
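A sketch of how such a cache key could be derived (md5Hex here is a hypothetical helper standing in for whatever MD5 implementation ends up being used; it is not something llvm/clang is assumed to provide):

    // Derive a cache key for a parsed AST (md5Hex is a hypothetical helper).
    #include <string>
    #include <vector>

    std::string md5Hex(const std::string &data);  // assumed MD5 implementation

    std::string astCacheKey(const std::string &sourceText,
                            const std::vector<std::string> &includedFileHashes,
                            const std::vector<std::string> &parseOptions) {
      std::string blob = md5Hex(sourceText);
      // A change in any included header invalidates the key...
      for (size_t i = 0; i < includedFileHashes.size(); ++i)
        blob += includedFileHashes[i];
      // ...as does a change in the defines or other parse options.
      for (size_t i = 0; i < parseOptions.size(); ++i)
        blob += parseOptions[i];
      return md5Hex(blob);
    }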

Development methodology
The work will be done via svn. I'll need a clang branch for my work, but that is not required; a standalone svn
repository will work too.
I use Ubuntu Linux (Gutsy Gibbon). I'll send a weekly report on the project.
I'll write user and developer documentation (HTML or PDF).

Project Schedule
The simplest method will be implemented before the GSoC midterm evaluation. The file-sharing protocol and
the caching will be done in the second half of SoC, but they can be worked out in detail during the first half,
once the simplest solution is ready.

Bio
I'm a 23-year-old student at the Budapest University of Technology and Economics. I started programming 7 years
ago; I've been using C for 6 years and C++ for 5 years, and I've been using open-source software for 7 years.
Compilers are one of my passions. I like efficient and clean solutions, and nice, well-documented APIs
like Qt, Ogre3D, llvm, and clang. I have solid knowledge of OOP and software engineering,
and I value reusable, clean, easy-to-understand code.
I'm familiar with the following programming languages:

  • C (6 years)
  • C++ (5 years)
  • python (3 years)
  • java (4 years)
  • SML (1 year)
  • Prolog (1 year)
  • lua, squirrel
  • haskell (current passion)

I've been tracking llvm and clang development since the last GSoC, because I noticed llvm in the GSoC projects list.
I've had an svn checkout of llvm and clang since October 2007, and I compile it regularly. I've read all the docs
available for llvm and clang, and I know the source code structure and its functionality.

Hi Peter,

I've taken a look at this proposal. Can you say how this is different than either a) modifying dcc_find_compiler to use clang as an option or b) saying distcc clang for compilation?

Otherwise it looks like a simple change to distcc and not clang at all. Unless you were planning on reimplementing distcc for clang?

-eric

Hi,
As far as I know, distcc doesn’t support binary AST caching and distribution.
And the proposal is about a complete reimplementation in C++ with statically linked llvm/clang libs.
The main benefit is in handling (storing, caching) ASTs in binary form, and also in having a central process
that tracks the node network load and coordinates incoming tasks accordingly.

Hi,
As far as I know, distcc doesn't support binary AST caching and distribution.

No, it works off of preprocessed sources. But you could easily adapt it to ASTs.

And the proposal is about a complete reimplementation in C++ with statically linked llvm/clang libs.
The main benefit is in handling (storing, caching) ASTs in binary form, and also in having a central process
that tracks the node network load and coordinates incoming tasks accordingly.

I see. Personally I think that's a rather large project for SoC and am not sure why a reimplementation specific to llvm would be useful as opposed to modifying the existing distcc - other than that would be a distcc SoC project and not an llvm one :)

-eric

Peter,

Developing an AST-centric caching mechanism is very interesting.

Last time I looked at distcc, it basically preprocesses the file locally and sends the result to the distributed nodes. The benefit of this approach is that it's very simple (no dependencies on headers, macros, etc.).

That said, caching ASTs will require more “smarts” than distcc (in many ways). Knowing whether an AST can actually be reused across module boundaries is a little tricky (given the flexibility of the preprocessor and mutable headers). I co-authored NeXT's first pre-compiled headers scheme, which actually implemented sharing/checking.

Another point… I think the notion of caching ASTs locally (on a single machine) is also interesting (and possibly more useful than traditional pre-compiled header schemes, where the developer needs to fiddle with defining a huge header to make the compiler fast).

snaroff

Hi,
here is my proposal:

Have you submitted an application to Google? The deadline is today, so
I'd suggest doing it sooner rather than later. (AFAIK, you can tweak
it later.)

Implementation details
    The development will be done incrementally, from the simplest solution to more complex ones.
    The simplest solution: all source parsing is done locally, and the built AST is distributed to a Node
    for optimization and code generation; the Node sends the result back when it is done.

So you're planning to use the regular clang codepath through Sema,
then use the AST serialization to send it across the network to
another host? That's not a bad approach, since by that point all the
dependencies on installed headers are gone, and you don't have to
worry about errors (besides bugs in clang/LLVM).

    An advanced solution: a file-sharing protocol is used to share the local source files (for inclusion);
    parsing is then done on the Node side, and file inclusion happens via the file-sharing protocol.

Right... that reduces the load on the host, but it might increase the
compile time due to the round-trip time for the requests. It also
significantly complicates the protocol. It'll be interesting to see
which approach performs better.

    A more advanced solution: built ASTs are cached in a central database to avoid parsing and
    building them on every compile. This is especially useful for header files.

I'm not sure it's practical to cache headers in that way. The exact
way a header parses depends on the code before it, and solving
dependencies on other headers seems like more trouble than it's worth.
If you can come up with something here, that would be cool, though.

Caching whole files would be possible, but not too important, since
someone could just run "ccache distcc".

    So there will be these standalone programs:
        + distcc, supporting gcc options
        + distcc, supporting clang options
        + a distcc daemon for the Nodes (the network is composed of Nodes, which do the compilation work)
        + a distcc admin daemon (stores information about the distcc Node network)

So the way that the client discovers distcc nodes is through the admin
daemon? I'm not too familiar with distcc's architecture.

Peter,

    The development will be done incrementally, from the simplest solution to more complex ones.
    The simplest solution: all source parsing is done locally, and the built AST is distributed to a Node
    for optimization and code generation; the Node sends the result back when it is done.

This is indeed a good first step.

However, here are a couple of points to note when designing a new distributed build system from scratch.

In general, preprocessing a source file consumes a significant portion of the total compile time. If the local host is tasked with preprocessing all source files, it becomes a bottleneck.

GCC uses a PCH mechanism to reduce compile time locally. If the distributed build system distributes the GCC PCHs, it is likely to flood the network (because a GCC PCH is significantly larger than the source files), which may have an impact on scalability.

The ideal solution 1) does not impose significant compilation-related duties on the local host, 2) does not incur huge network traffic during job distribution, and 3) lets the local host focus on efficient distribution of tasks and collection of results.

    An advanced solution: a file-sharing protocol is used to share the local source files (for inclusion);
    parsing is then done on the Node side, and file inclusion happens via the file-sharing protocol.

IMO, such a setup would work well in an environment where the availability of Nodes is stable.

One variation of this advanced solution would be to distribute just the build instructions (command line flags etc.), the source file names, and the project source repository revision number (e.g. svn revision number) to the Nodes, and let the Nodes get the project source files from the repository directly. This would free the local host from the duty of distributing files and take advantage of the existing bandwidth provided by the source code repository server.
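For concreteness, the job message in this variation could be as small as something like the following (the struct and field names are illustrative only):

    // Illustrative job description for the repository-based variation: the
    // Node checks the sources out itself, so only metadata travels the wire.
    #include <string>
    #include <vector>

    struct BuildJob {
      std::string repositoryUrl;             // e.g. the project's svn URL
      long revision;                         // svn revision to check out
      std::vector<std::string> sourceFiles;  // paths relative to the repo root
      std::vector<std::string> flags;        // command line flags etc.
    };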

Yet another advanced step would be to build a pyramid scheme to distribute incremental link time optimization work.

    A more advanced solution: built ASTs are cached in a central database to avoid parsing and
    building them on every compile. This is especially useful for header files.

The trick here is to validate already-cached centralized ASTs very cheaply. This is not cheap in the current GCC implementation. Steve Naroff did such an implementation in a standalone preprocessor to boost local compilation time in the early 1990s.

I won't be surprised if distributed AST caches do well compared to a centralized AST cache.

Hi,
here is my proposal:

Hi Peter,

Sorry for the delay, I know time is running short.

Synopsis
    The main purpose of this project is to implement a clang/llvm-based distcc.
    Distcc means distributed compiler; it can be used as a replacement for gcc.

It would be useful to say that 'distcc' here means a general distributed compiler tool, not an extension to the existing distcc tool.

    Clang distcc will support distributed (over the network) compilation for any architecture
    supported by llvm. The main advantages of llvm/clang distcc over the original gcc-based distcc are
    performance (less memory use, shorter compile times) and customizability.
    All of these benefits come from llvm and clang.

    The new distcc will be a compiler driver (or frontend) built from clang and llvm libraries.
    The driver will have two usage modes: gcc option mode and clang option mode.
    So it can be used as a drop-in replacement for gcc, and it will handle all distribution and caching tasks.
    There will also be an admin daemon, which will handle configuration and distribute incoming
    requests to the nodes. A distcc daemon will run on each node and handle incoming tasks;
    this is what does the compilation work.
    Clang distcc will support languages via clang, so currently C and Objective-C will be supported.

One nice thing about the existing distcc is that it can work with arbitrary other compilers: it can work with GCC as well as (say) ICC or llvm-gcc. It would be nice to have the option to support these, by using preprocessed .i files as the common medium.

Functionality details
    Usage:
        + set up the network:
            - start the admin daemon on one node
            - start a distcc daemon on each node
            - register the nodes with the admin daemon

        + set up the local machine:
            - configure distcc (set the admin node address)
                This will generate a config file.

        + use it, e.g. make CC=distcc

Ok

Implementation details
    The development will be done incrementally, from the simplest solution to more complex ones.

yay! :)

    The simplest solution: all source parsing is done locally, and the built AST is distributed to a Node
    for optimization and code generation; the Node sends the result back when it is done.
    An advanced solution: a file-sharing protocol is used to share the local source files (for inclusion);
    parsing is then done on the Node side, and file inclusion happens via the file-sharing protocol.
    A more advanced solution: built ASTs are cached in a central database to avoid parsing and
    building them on every compile. This is especially useful for header files.
    So there will be these standalone programs:
        + distcc, supporting gcc options
        + distcc, supporting clang options
        + a distcc daemon for the Nodes (the network is composed of Nodes, which do the compilation work)
        + a distcc admin daemon (stores information about the distcc Node network)

I think that this is too much to be realistically accomplished in a summer. I think it would be reasonable to incrementally develop this with the following milestones:

1. The first major useful milestone is a "new distcc". Implement exactly what distcc does, but better. Building this involves the main driver, and the 'node daemon'. The intermediate files passed over the network would be .i files.
2. Once #1 is basically working, add an 'admin daemon' that is a centralized process on the machine running 'make' which handles communication with the remote nodes. This allows intelligent load balancing, and allows preprocessor caching as well.
3. Once #2 is working well, there are a variety of things in clang that could be done to make the preprocessing faster and more efficient. Everything from using PCH effectively, to dynamically detecting PCH, to other intelligent caching of token strings, to implementing -fdirectives-only [ala gcc] can be considered.

I think that the first two and part of #3 are a full summer's worth of work. Maybe next summer (when clang is farther along) we can talk about using serialized ASTs as the distribution medium, and/or using a network file system to distribute files, etc. It isn't clear whether these are a significant win though.

    All the necessary software components are available in the llvm/clang sources, except for network handling.
    So a thin network layer will be implemented for unix and windows platforms.
    The new distcc driver will be placed in the clang/Driver directory.

Sounds good.

    For caching, a cached AST can be identified by the MD5 sum of the source file, the MD5 sums of the
    files it includes, and the options used during parsing (defines).

Ok, this provides something like 'ccache'?

Development methodology
    The work will be done via svn. I'll need a clang branch for my work, but that is not required; a standalone svn
    repository will work too.
    I use Ubuntu Linux (Gutsy Gibbon). I'll send a weekly report on the project.
    I'll write user and developer documentation (HTML or PDF).

Ok

This is very exciting: there is a huge community of people who could benefit from a better 'distcc' tool. I'm looking forward to seeing this make progress!

-Chris

Peter,

     The development will be done incrementally, from the simplest solution to more complex ones.
     The simplest solution: all source parsing is done locally, and the built AST is distributed to a Node
     for optimization and code generation; the Node sends the result back when it is done.

This is indeed a good first step.

However, here are a couple of points to note when designing a new distributed build system from scratch.

In general, preprocessing a source file consumes a significant portion of the total compile time. If the local host is tasked with preprocessing all source files, it becomes a bottleneck.

In practice, that appears not to be the case - at least not with fewer than 10 hosts.

GCC uses a PCH mechanism to reduce compile time locally. If the distributed build system distributes the GCC PCHs, it is likely to flood the network (because a GCC PCH is significantly larger than the source files), which may have an impact on scalability.

The ideal solution 1) does not impose significant compilation-related duties on the local host, 2) does not incur huge network traffic during job distribution, and 3) lets the local host focus on efficient distribution of tasks and collection of results.

Yep. I don't know of such a solution, though.

    An advanced solution: a file-sharing protocol is used to share the local source files (for inclusion);
    parsing is then done on the Node side, and file inclusion happens via the file-sharing protocol.

IMO, such a setup would work well in an environment where the availability of Nodes is stable.

One variation of this advanced solution would be to distribute just the build instructions (command line flags etc.), the source file names, and the project source repository revision number (e.g. svn revision number) to the Nodes, and let the Nodes get the project source files from the repository directly. This would free the local host from the duty of distributing files and take advantage of the existing bandwidth provided by the source code repository server.

The problem with this idea is that somehow the distributing machine would have to figure out exactly which files were needed to compile the file. Doing that would require effectively preprocessing the file to generate such a list. _And_ then the target machine would have to get all those files from the repository.

Also, this would require programmers to check source into source control in order to compile it.
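(For reference, this is essentially what gcc's -M option does - it runs the full preprocessor just to emit the make-style dependency list, e.g.:

    $ gcc -M main.c
    main.o: main.c /usr/include/stdio.h ... utils.h

so the "figure out which files are needed" step costs about as much as preprocessing the file itself.)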