zapcc compiler

Does anyone know anything about this?

-Chris

Hi Chris,

I am the principal developer of zapcc and can add some more technical details. zapcc is a heavily modified clang (the diff is about 200K) with additional code outside the llvm/clang codebase. zapcc operates in a client-server compilation mode. The compilation server (think of it as clang -cc1) stays in memory and accepts compilation commands from the driver. The client runs up to cc1_main(), which then communicates with the server rather than rerunning another clang as usual.
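
To make the split concrete, here is a minimal sketch of the general client/server idea over a Unix domain socket. This is not zapcc's actual code; the socket path, protocol, and names are invented for illustration, and error handling is omitted:

```cpp
// Minimal sketch of a client/server compiler split (invented names and
// protocol, error handling omitted). The server stays resident, so parsed
// state for "system" files can survive between requests; the client stands
// in for the point where cc1_main() would normally run.
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static const char *kSockPath = "/tmp/ccserver.sock"; // illustrative only

int serverMain() {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, kSockPath, sizeof(addr.sun_path) - 1);
  unlink(kSockPath);
  bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
  listen(fd, 8);
  while (true) { // resident: cached ASTs etc. would live across iterations
    int client = accept(fd, nullptr, nullptr);
    char buf[4096];
    ssize_t n = read(client, buf, sizeof(buf) - 1);
    buf[n > 0 ? n : 0] = '\0';
    // A real server would now run the equivalent of "clang -cc1" on the
    // received command line, reusing in-memory state for "system" files.
    char exitCode = 0;
    write(client, &exitCode, 1);
    close(client);
  }
}

int clientMain(const std::string &commandLine) {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, kSockPath, sizeof(addr.sun_path) - 1);
  connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
  write(fd, commandLine.data(), commandLine.size());
  char exitCode = 1;
  read(fd, &exitCode, 1); // block until the server finishes the compile
  close(fd);
  return exitCode;
}
```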

zapcc makes a distinction between two classes of source files: the “system” ones, whose compilation state is all kept in memory, and the “user” ones, whose compilation state is discarded once compiled. The programmer selects the “user” files with wildcards in a configuration file. The default user set is .c, .cpp, .cxx, and .CC files, but it could just as easily be all files under /home/user/yaron or whatever. The expectation is that the system files do not change (such a change would not be recognized until a server restart anyhow), while the user files are the ones being modified. As an example, you could make llvm/lib/MC/MachObjectWriter.cpp the only “user” file, so that every other file’s compilation results would be kept in memory.

Not only is a header file parsed just once; all of its template instantiations and generated code are kept in memory, ready for the next compilation. zapcc is very careful to undo anything related to the “user” files in the clang/LLVM data structures. This is very complex, which is why zapcc is not yet ready for a public beta. We prefer to release a more reliable product rather than waste your time.
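
To illustrate the kind of redundant work this avoids, consider a generic header-only template (a constructed example, not zapcc code):

```cpp
// heavy.h -- a generic header-only template, standing in for the kind of
// header found in Boost or Eigen (constructed example).
#pragma once
#include <map>
#include <string>

template <typename T>
class Registry {
  std::map<std::string, T> entries_;
public:
  void add(const std::string &key, T value) { entries_[key] = value; }
  T lookup(const std::string &key) const { return entries_.at(key); }
};

// a.cpp and b.cpp each contain:
//   #include "heavy.h"
//   Registry<int> r;   // instantiates Registry<int> and its std::map
// A stock compiler parses heavy.h and instantiates Registry<int> once per
// translation unit; a compilation server can do it once and keep the AST,
// IR, and generated code warm for the next file.
```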

There are limitations to this approach: previously declared entities are still visible in subsequent compilations, a limitation we hope to address someday, though not in the near future. With a good-quality modern codebase such clashes are rare. In the LLVM/clang codebase there are just a few, and they can easily be fixed by renaming one of the clashing entities; some of the renamings would be required by the new code style anyhow… In such cases zapcc automatically resets the compilation cache and retries the compilation before giving up. It also resets if the compilation flags change or, in some situations, when it finds it cannot undo a compilation.
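
For a concrete picture of such a clash, here is a constructed example (not one of the actual LLVM/clang cases): two files that each compile fine in isolation, but collide when declarations from the first remain visible while the second is compiled:

```cpp
// a.cpp -- compiles fine on its own.
struct Pos { int line; };
int firstLine(Pos p) { return p.line; }

// b.cpp -- also fine on its own, but if a.cpp's declarations are still
// visible when b.cpp is compiled, this is rejected as a redefinition of
// Pos with a different body. Renaming one of the two types (the fix
// described above) resolves the clash.
struct Pos { int line; int column; };
int firstColumn(Pos p) { return p.column; }
```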

Having everything ready in memory saves time, especially where the headers are much more complex than the source code. With a short C++ program using boost::numeric, boost::graph, etc., or Eigen, we see a 10-50x speedup. We had some code examples on the web site, which I asked to have removed for now until we can provide you with a beta release, so that the results can be independently replicated. These may be considered best-case examples, but they are actually realistic for programmers modifying and rebuilding a smaller program based on heavily templated C++ infrastructure.
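
As a generic illustration of this pattern (not one of the removed website examples, and assuming Eigen is installed), a tiny program like the following spends nearly all of its compile time in the header:

```cpp
// Generic illustration: a small source file on top of a heavy template
// library. Almost all compile time goes into parsing and instantiating
// the Eigen headers, which is exactly the state a compilation server
// keeps warm between builds.
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd m = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd g = m * m.transpose(); // instantiates expression templates
  std::cout << g.trace() << "\n";
}
```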

For full LLVM builds we don’t yet have complete results, as not all zapcc bugs are solved, but we do see about a 1.5x speedup up to the 55% point of the build or so. This timing includes some linking and tablegenning, which take just as long under zapcc, so the compilation speedup is actually somewhat better.

We haven’t compared with precompiled headers, as they are really not equivalent. Using precompiled headers is a non-trivial change to a project’s build and will not always help build time, depending on include patterns. I’m not sure precompiled headers would benefit LLVM build time. OTOH, zapcc builds the project as-is without redesign, with the sole exception of renaming name clashes, a trivial refactoring.

Hoping to release a beta version soon,

Yaron

This sounds like a very interesting approach, but also (as you say) very complex :slight_smile:

Have you looked at the modules work in clang? It seems that building on that infrastructure could help simplify things. In principle you could load “all the modules” and then treat any specific translation unit as a filter over the available decls. This is also, uhm, nontrivial, but building on properly modular headers could simplify things a lot.

-Chris

zapcc maintains as much as possible from previous compilations: AST, IR, MC, and DebugInfo. I’m not sure that module support goes that far. This would indeed be easier to implement if we knew that the C++ code was properly modularized.

One example: if a compile unit instantiates StringMap&lt;bool&gt; and the next compile unit also requires it, StringMap&lt;bool&gt; should not be reinstantiated, codegenned, and optimized. This could mostly be achieved using extern + explicit template instantiations; however, this approach is quite rare. Maybe because extern template wasn’t supported before C++11, because programmers are unfamiliar with the technique, or because it’s cleaner and easier to #include the template header and let the compiler handle the housekeeping. Whatever the reason, zapcc handles this automatically.
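
For reference, this is the manual pattern being automated, shown with a stand-in template (any template, including llvm::StringMap, follows the same shape):

```cpp
// The manual extern + explicit instantiation pattern, with a stand-in
// template rather than zapcc or LLVM code.

// table.h
template <typename T>
class Table {
  T slots[16] = {};
public:
  T &at(int i) { return slots[i]; }
};
extern template class Table<bool>; // C++11: don't instantiate in includers

// table.cpp -- the one TU that instantiates, codegens, and optimizes it.
template class Table<bool>;

// user.cpp -- includes table.h and uses Table<bool>, but the definition is
// neither reinstantiated nor reoptimized here; the linker resolves to
// table.cpp's copy. zapcc gets the same effect without the declarations.
```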

Oh, neat. This reminds me of the incremental compiler server efforts with GCC (IncrementalCompiler - GCC Wiki). We also briefly played with this notion at Google a few years back.

The big blockers at the time were the tricky implementation details. GCC's code base is extremely toxic to multi-threading and server approaches.

Diego.

> zapcc maintains as much as possible from previous compilations: AST, IR, MC, and DebugInfo. I'm not sure that module support goes that far.

ASTs are preserved in modules; that's what they're for (parsing time tends
to dominate, at least in our world/experiments/data as I understand it, so
that's the first thing to fix). Duplicate IR/MC/DebugInfo is still present,
though; it'd be the next thing to solve. We're talking about deduplicating
some of the debug info, and Adrian Prantl is working on that at the moment:
putting debug info for types into the module files themselves and
referencing it directly as a split DWARF file.

Duplicate IR/MC comes from comdat/linkonce_odr functions - and at some
point it'd be nice to put those in a module too, if there's a clear single
ownership (oh, you have an inline function in your modular header - OK,
we'll IRGen it, make an available_externally copy of it in the module to be
linked into any users of the module, and a standard external definition
will be codegen'd down to object code and put in the module to be passed to
the linker). This wouldn't solve the problems with templates that have no
'home' to put their definition.
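
To illustrate where that duplicate IR/MC comes from today (a constructed example; the available_externally scheme above is a proposal, not current behavior):

```cpp
// util.h -- an ordinary inline function in a header (constructed example).
#pragma once
inline int clampByte(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

// Every TU that includes util.h and calls clampByte() emits its own
// linkonce_odr definition, roughly:
//
//   define linkonce_odr i32 @_Z9clampBytei(i32 %v) comdat { ... }
//
// and the linker keeps one comdat copy and discards the rest -- work that
// was done once per translation unit and then thrown away.
```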

- David

> This would indeed be easier to implement if we knew that the C++ code was properly modularized.

> One example: if a compile unit instantiates StringMap&lt;bool&gt; and the next compile unit also requires it, StringMap&lt;bool&gt; should not be reinstantiated, codegenned, and optimized. This could mostly be achieved using extern + explicit template instantiations; however, this approach is quite rare. Maybe because extern template wasn't supported before C++11,

Actually it was available in '98, so far as I know.

> because programmers are unfamiliar with the technique, or because it's cleaner and easier to #include the template header and let the compiler handle the housekeeping.

Yeah, the usual problem is that it's a maintenance burden to couple
template definitions to the types they're instantiated with, and often
impossible, because the template lives in a library that doesn't know about
the instantiated types at all: std::vector, say, can't know all the types
in the world it might be instantiated with.
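
A small constructed example of why the library side can't write those declarations itself:

```cpp
// vec.h, shipped by a library (constructed example).
template <typename T>
struct Vec { T *data = nullptr; int size = 0; };

// The library can pre-instantiate the cases it knows about...
extern template struct Vec<int>;

// ...but a user TU instantiating Vec<MyWidget> involves a type the library
// has never heard of, so that instantiation still happens (and gets
// duplicated) in every TU that uses it.
```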

> Have you looked at the modules work in clang? It seems that building on that infrastructure could help simplify things. In principle you could load “all the modules” and then treat any specific translation unit as a filter over the available decls.

This is actually exactly what clang's current modules infrastructure
already does. Submodules are simply a visibility filter on top of the
loaded AST. This is e.g. what the `Hidden` bit on Decl is for:
http://clang.llvm.org/doxygen/classclang_1_1Decl.html#ad58279c91e474c764e418e5a09d32073
(among other places inside clang touched by implementing it this way).
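
A small sketch of that filtering in action, assuming a module map along these lines (the file and function names are invented for illustration):

```cpp
// Assume a module.modulemap like:
//
//   module Lib {
//     module A { header "a.h" }   // a.h declares: int fromA();
//     module B { header "b.h" }   // b.h declares: int fromB();
//   }
//
// main.cpp, compiled with -fmodules:
#include "a.h" // rewritten by clang into an import of submodule Lib.A

int main() {
  return fromA();    // OK: Lib.A's decls are visible
  // return fromB(); // error: fromB() lives in the same loaded AST, but
                     // Lib.B was not imported, so its Decl stays hidden
}
```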

-- Sean Silva

I guess it depends on the build setup: if you spread the build across
multiple machines then... never mind.

But if the whole build is on one machine and it has enough memory, and
as long as something like zapcc is retaining the whole program's AST
anyway, it could be a win for it to complete that whole-program AST
before any IR is generated. Presumably, the compiler could then
invent the 'home' and do each instantiation exactly once in the entire
build.

Or... it might still help the multi-machine setup. In the worst case,
an instantiated function would get instantiated once per machine.

But in that case it might be nice to get a fix-it hint from the linker
to automatically extern-templateize all such instantiations. (:

--James

That reminds me: is there any public data that shows the percentage of
build time spent doing IRGen/opt/CodeGen for duplicates that end up
getting discarded?

--James

> That reminds me: is there any public data that shows the percentage of build time spent doing IRGen/opt/CodeGen for duplicates that end up getting discarded?

I have information on a couple of large (1-10 MLOC) codebases indicating
that time spent outside of parsing is typically ~20% of total CPU time at
-O2/-O3. IIRC, with lower optimization levels, I saw 10-15%.

So that ~20% number is a rough upper bound for the time spent in the LLVM
optimizers and code generation, and hence an upper bound on the time for
duplicates.

The fact that clang does IRGen as it parses (hence it fell under "parsing
time" in my measurements) makes it somewhat difficult to pinpoint how much
time is spent on duplicates during IRGen. If you want to measure this, you
could do it similarly to how I describe measuring per-file time in
http://permalink.gmane.org/gmane.comp.compilers.clang.devel/42127 but with
extra probes tracking calls into IRGen, plus additional probes inside the
middle end and back end to track per-function time.
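
One possible shape for such a probe (an RAII timer with invented names, not clang's actual instrumentation):

```cpp
// Hypothetical RAII probe of the kind described above; instances would be
// dropped into IRGen entry points and per-function middle/back-end paths.
#include <chrono>
#include <cstdio>
#include <string>

class ScopedProbe {
  std::string label_;
  std::chrono::steady_clock::time_point start_;

public:
  explicit ScopedProbe(std::string label)
      : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
  ~ScopedProbe() {
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start_)
                  .count();
    // Emit one record per labeled region; DTrace or a log scraper can
    // aggregate these per function and per phase afterwards.
    std::fprintf(stderr, "probe %s %lld us\n", label_.c_str(),
                 static_cast<long long>(us));
  }
};

// Hypothetical use at a per-function codegen entry point:
//   void emitFunction(const Function &F) {
//     ScopedProbe probe(("codegen:" + F.getName()).str());
//     ...
//   }
```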

By combining this information with information from the linker about which
functions end up becoming "duplicates", you should have a decent empirical
estimate for the data that you want. You might do this by placing probes in
the linker so that you can easily measure any project by just building it
with the instrumented toolchain and using DTrace to funnel out all the
data, which can then be fed into a script.

-- Sean Silva