RFC: Add an LLVM CAS library and experiment with fine-grained caching for builds
Goal
Long-term: Speed up builds by enabling fine-grained caching in the compilers and linkers.
Short-term (this RFC): Add a builtin content-addressable storage (CAS) to LLVM, designed to integrate with external CAS instances, to facilitate experiments with using content-based caching in LLVM tools.
TL;DR: Let’s give LLVM-based tools access to a CAS for storing structured data, as a foundation for experimenting with caching.
Motivation
Opportunity: Builds do redundant work
The computations and data in a build tend to be redundant, both within an individual build and across subsequent builds. Within a build, multiple compiler and linker invocations parse and process different subsets of the same inputs, repeating many computations (such as type-checking, generating code, and optimizing the same C++ template functions). Across consecutive builds, usually at least one input will change, but often many of the computations match between builds.
Problem #1: Caching is hard
It’s hard to cache computations internal to the compiler and linker. One challenge is to define a cache key that is sound to use across different contexts: filesystem inputs are specified by path, which can change independently of the input content; and some inputs are found implicitly, discovered only after the computation begins. Another challenge is to avoid cache misses when an input changes in a way that’s irrelevant for a particular computation.
Problem #2: Caching is expensive
It’s expensive to store compiler and linker outputs, which limits the lifetime of caches (thereby limiting their effectiveness). The main challenge is that current output artifacts are monolithic, such that a small semantic change in the input has a ripple effect across the binary representation of the output artifact. Even when two artifacts are semantically similar, where most of the represented data is fundamentally equivalent, it’s hard to avoid storing that data redundantly.
Observation: Distributed build systems operate in the same space, but at a wider scope
Distributed build systems, such as Bazel and Buck, have a wider lens but target the same opportunity. In particular, a single build system invocation includes many compiler and linker invocations, which are largely redundant across builds when many users ask for builds of the same software stack in parallel.
These build systems use content-addressable storage (CAS) for storing inputs and outputs of compiler and linker invocations. Using a CAS simplifies Problem #1 (caching is hard) by factoring out cache invalidation (quick explainer a bit later). Storage cost of inputs and outputs is amortized by scale, relying on objects in the CAS being referenced many times to mitigate Problem #2 (caching is expensive).
A few leading questions…
- Can compiler and linker invocations make internal use of a CAS to simplify Problem #1 (caching is hard), and speed up builds by caching internal computations?
- Where would minor refactoring enable existing algorithms to be cached?
- Where would architectural changes allow more effective or finer-grained caching?
- Can we improve the schemas for compiler outputs / linker inputs to address Problem #2 (caching is expensive) for the compiler, linker, and build system?
- If the compiler, linker, and build system all use the same CAS, are there other opportunities that emerge?
Proposal: Let’s add an LLVMCAS library as a foundation for content-based caching
Let’s add an LLVMCAS library, to establish a foundation for experimenting with persistent, content-based caching in LLVM tools.
- Content-addressable storage (CAS) uses strong hashing to provide content-based UUIDs for arbitrary data. A CAS object can be structured, referencing sub-objects by UUID as part of its content, forming a DAG. If two objects (or sub-objects) have the same content, then they have the same UUID and are implicitly deduplicated. A prominent example of using a CAS for storage is Git, where each data structure is a CAS object.
- Persistent caching across contexts is often easy when inputs and outputs are stored in a CAS. The action cache can be a key-value map from an action (which includes the UUIDs of its inputs) to a result (the UUID of the output). There is no need to check the validity of a cached result, since a cache hit can only occur when the result is correct. (A minimal sketch of both abstractions follows this list.)
- This library should provide abstractions for object storage (the CAS itself) and action caching, with a builtin implementation optimized for local use. Later, we can design a plugin system to hook up external CAS instances (such as those used by distributed build systems).
- With a foundation for persistent, content-based caching in place, we can collaborate on ideas for rearchitecting the tools and artifacts to take advantage of it.
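To make the two abstractions concrete, here is a minimal sketch of an object store and an action cache, assuming a simple in-memory map; all names here are illustrative, not the prototype’s actual API:

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// A content-based UUID: in practice, a strong hash (e.g., SHA-256) computed
// over an object's data and the IDs of its references.
using CASID = std::string;

struct CASObject {
  std::string Data;        // Raw bytes.
  std::vector<CASID> Refs; // Sub-objects by ID, forming a DAG.
};

CASID hashObject(const CASObject &Obj); // Assumed: strong hash of Data+Refs.

class ObjectStore {
  std::map<CASID, CASObject> Objects;

public:
  // Same content => same ID => stored once (implicit deduplication).
  CASID store(CASObject Obj) {
    CASID ID = hashObject(Obj);
    Objects.emplace(ID, std::move(Obj));
    return ID;
  }
  const CASObject *load(const CASID &ID) const {
    auto It = Objects.find(ID);
    return It == Objects.end() ? nullptr : &It->second;
  }
};

// The action cache maps an action key (embedding the IDs of all inputs) to
// the ID of the result. A hit is correct by construction: the key only
// matches when the inputs are identical, so no invalidation is needed.
class ActionCache {
  std::map<CASID, CASID> Results;

public:
  void put(const CASID &ActionKey, const CASID &Result) {
    Results[ActionKey] = Result;
  }
  std::optional<CASID> get(const CASID &ActionKey) const {
    auto It = Results.find(ActionKey);
    if (It == Results.end())
      return std::nullopt;
    return It->second;
  }
};
```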
There is already a prototype for such a library (not ready for review! caveats and design problems to fix!). This prototype currently lives in a branch called experimental/cas/main (in Apple’s fork), alongside a few experiments (also not ready for review!).
Here’s a high-level view of what’s in the prototype LLVMCAS library:
- A few thread-safe data structures
- Abstractions to model a CAS and an action cache
- Three object types: “Blob” (data), “Node” (data + references), “Tree” (map: name → reference + kind), sketched after this list
- Builtin CAS implementations for in-memory and on-disk
- A thread-safe utility for caching lazily-discovered directories/files, supporting concurrent views that have different working directory state, used by:
- Filesystem that reads from a CAS, and treats a CAS tree as the filesystem root
- Filesystem that provides an immutable view of the system filesystem (`sys::fs`), tracking accesses to enable creation of CAS trees
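Here is a minimal sketch of the three object types and a tree-based path lookup, roughly what a CAS-backed filesystem does; it reuses `CASID`/`ObjectStore` from the sketch above, and `decodeTree` and the other names are illustrative assumptions:

```cpp
enum class TreeEntryKind { File, Directory, Symlink };

struct TreeEntry {
  CASID Ref;          // The referenced blob/node/tree.
  TreeEntryKind Kind; // How to interpret it.
};

// "Blob" is raw data; "Node" is data plus references; "Tree" maps names to
// entries, so a whole directory hierarchy is a DAG named by its root tree ID.
using Tree = std::map<std::string, TreeEntry>;

Tree decodeTree(const CASObject &Obj); // Assumed: parse a tree object.

// Walk a path one component at a time from a root tree, loading each subtree
// from the object store.
std::optional<TreeEntry> lookup(ObjectStore &CAS, CASID RootTree,
                                const std::vector<std::string> &Path) {
  TreeEntry Entry{RootTree, TreeEntryKind::Directory};
  for (const std::string &Name : Path) {
    const CASObject *Obj = CAS.load(Entry.Ref);
    if (!Obj || Entry.Kind != TreeEntryKind::Directory)
      return std::nullopt;
    Tree T = decodeTree(*Obj);
    auto It = T.find(Name);
    if (It == T.end())
      return std::nullopt;
    Entry = It->second;
  }
  return Entry;
}
```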
Assuming consensus on this proposal, the next steps are to prepare patches for the pieces of LLVMCAS, discuss any important design points, review them, and land them at github.com/llvm/llvm-project.
Feedback needed! (Questions for the reader!)
Firstly, I’m seeking high-level guidance on preparing LLVMCAS for review, mostly around what to “fix” before landing vs. evolving in tree. (Most design questions can wait for code review, I think.)
- Should LLVMCAS exist at all, or should its pieces just land in LLVMSupport? (Proposal: Preference for LLVMSupport, allowing LLVMSupport to use it)
- Does LLVMCAS need to support Windows immediately? (Proposal: No; Windows can be added later)
- On design of the CAS object storage and action cache abstractions
- Should the abstractions be stable to help downstream code? (Proposal: No; they evolve incrementally, as usual)
- Should the abstractions support plugins? (Proposal: Eventually; and the plugin interface will need to be stable)
- Should plugins be “figured out” or examples implemented before landing? (Proposal: No)
- On the implementation of the builtin CAS
- Is the serialization of CAS objects stable? (Proposal: Eventually—to allow tools with different revisions of LLVMCAS to talk to the same builtin CAS—but not yet)
- Is the CAS hash function stable? (Proposal: Eventually—same goal—but not yet)
- Is the persistent on-disk format stable? (Proposal: Eventually—same goal—but not yet)
- Should clients be able to configure which stable serialization/hash/etc. to use? (Proposal: Yes! (eventually))
- Do we need users of the library in-tree? (Proposal: Not immediately)
- Does accepting this proposal commit us to the long-term vision outlined below? (Proposal: No)
Secondly, where should we be working on follow-up experiments? Two obvious options:
- Iterate in `main`, off-by-default (with usual code quality and review expectations). This is best for collaboration, IMO.
- Iterate in experimental/cas/main (in Apple’s fork), only sending experiments for review on `main` once they’re “ready”.
If there’s consensus for `main`, we’ll prioritize cleaning up the experiments for review and can iterate in tree from there (abandoning the existing branch). Otherwise, what should the bar be for “ready”? (Is there a third option to consider?)
Finally, please ask (and/or answer) important questions that I missed!
Experiments
Below are short descriptions of a few experiments living at experimental/cas/main (in Apple’s fork) that depend on the LLVMCAS prototype.
- None of these is being proposed here; they’re just examples to show some potential.
- Most of them are documented in DEMO.md on that branch.
- I’m “living on” the Clang and TableGen caching experiments (18-core iMac Pro). My experience:
- Compile times of clean builds of Clang with an unprimed cache are roughly on par with builds that don’t use the experimental features (slightly faster for me, but by <1%).
- Clean builds of “Release” Clang with a primed cache are ~10s (after CMake). “RelWithDebInfo” and “Debug” are slower than that since I’m not living on the linker experiments (and they have more I/O and more work for the linker).
- Rebuilds after switching branches or reordering history are often fully cached (~10s).
- Rebuilds after rebasing on `main` are often faster (depending on what changed upstream).
Clang experiment: Cache full compilation across worktrees
This experiment adds `-fdepscan` and `-fdepscan-prefix-map` options to clang.
- `-fdepscan` causes the driver to launch or connect to a daemon that discovers and ingests `-cc1` inputs into the CAS. It adds the ID of the resulting CAS tree to the `-cc1` command-line, which will read from a `CASFileSystem` instead of the disk. As a result, the `-cc1` command-line lists all its inputs explicitly and can be used as a cache key for the outputs (sketched below).
- `-fdepscan-prefix-map` applies a prefix map to the discovered CAS tree and the `-cc1` command-line. This allows cache hits across different worktrees (or build directories).
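A minimal sketch of how such a cache key could be formed, reusing the types from the earlier sketch (the helper name is an illustrative assumption):

```cpp
// The full, explicit -cc1 command line plus the ID of the input tree fully
// determines the outputs, so the key object's CAS ID is a sound cache key.
CASID computeActionKey(ObjectStore &CAS,
                       const std::vector<std::string> &Args,
                       CASID InputTreeID) {
  CASObject Key;
  for (const std::string &Arg : Args)
    Key.Data += Arg + '\0';          // Serialize the command line.
  Key.Refs.push_back(InputTreeID);   // Every input, as one CAS tree.
  return CAS.store(std::move(Key));
}
```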
On macOS, clean builds with an empty/never-used CAS and action cache slow down by 1-2% (a regression recovered by caching raw tokenization, described later). Clean builds with a primed CAS and action cache are “fast” (hundreds of compiles per second), with the build dominated by linking, running tablegen, and process launch.
One interesting side effect of `-fdepscan-prefix-map` is that the output is fully canonical. Even diagnostics are relative to the remapped files (`.d` dependency files are the exception: they are emitted by the dependency scanner in the daemon, keeping incremental builds working). This canonicalization is mostly great (reproducible builds!), but it can be annoying for active development. Maybe some things (like diagnostics) should be de-canonicalized, as a layer on top of caching?
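A minimal sketch of prefix mapping; the mapping entries and the “/^src” placeholder are illustrative, not the actual option syntax:

```cpp
using PrefixMap = std::vector<std::pair<std::string, std::string>>;

std::string applyPrefixMap(std::string Path, const PrefixMap &Map) {
  for (const auto &[Old, New] : Map)
    if (Path.compare(0, Old.size(), Old) == 0) // Path starts with Old.
      return New + Path.substr(Old.size());
  return Path;
}

// Two worktrees map to the same canonical path, so their CAS trees and -cc1
// command lines (and therefore cache keys) are identical:
//   applyPrefixMap("/Users/me/wt1/Foo.cpp", {{"/Users/me/wt1", "/^src"}})
//   applyPrefixMap("/Users/me/wt2/Foo.cpp", {{"/Users/me/wt2", "/^src"}})
// both yield "/^src/Foo.cpp".
```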
Clang experiment: Cache raw tokenization
This experiment adds a `-cc1` option that turns on “raw token caching”. Before entering a file, the lexer first tokenizes it in “raw” mode (without the preprocessor) to create a binary token stream and raw identifier table. This result is cached across compilations (sketched below).
On macOS, this speeds up clean builds that have an empty/never-used CAS and action cache by 1-2%, enough to recover the overhead from `-fdepscan`.
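A minimal sketch of raw-token caching, reusing the sketch types above. Because raw tokenization depends only on the file’s bytes, the file blob’s CAS ID is a sound cache key; `rawTokenize` is an assumed helper:

```cpp
std::string rawTokenize(const std::string &Bytes); // Lex, no preprocessing.

CASID getRawTokens(ObjectStore &CAS, ActionCache &Cache, CASID FileBlobID) {
  if (std::optional<CASID> Hit = Cache.get(FileBlobID))
    return *Hit;                                  // Reuse across compilations.
  const CASObject *File = CAS.load(FileBlobID);   // Already ingested.
  CASID Result = CAS.store({rawTokenize(File->Data), {}});
  Cache.put(FileBlobID, Result);
  return Result;
}
```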
TableGen experiment: Cache llvm-tblgen and clang-tblgen
This experiment adds `-depscan` and `-depscan-prefix-map` to the TableGen executable, which operate like the clang options above without the use of any daemons. They turn on dependency scanning and caching for running TableGen.
On macOS, this eliminates tablegen-related overhead in a (cached) clean build, as long as tablegen was built with optimizations. When built without optimizations, CAS ingestion of the tablegen executable itself (to use in the cache key) shows up, but we could use a daemon to avoid repeating this work in each invocation.
TableGen reverses the `-depscan-prefix-map` when emitting diagnostics and `.d` files: inputs are canonicalized with the prefix map, actions are cached using canonical results, and then the cached output is de-canonicalized for tool output. Unlike full compilation caching in clang (above), the prefix map should not affect the output of the tool; it only improves cache hits. Maybe this model would be better for Clang as well!
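A minimal sketch of the canonicalize/cache/de-canonicalize pipeline, reusing `applyPrefixMap` above; `reversePrefixMap` simply applies the mapping in the opposite direction (illustrative):

```cpp
std::string reversePrefixMap(std::string Path, const PrefixMap &Map) {
  for (const auto &[Old, New] : Map)
    if (Path.compare(0, New.size(), New) == 0)
      return Old + Path.substr(New.size());
  return Path;
}
// 1. Canonicalize inputs:    "/Users/me/wt1/X.td" -> "/^src/X.td"
// 2. Cache on canonical key: outputs are stored with "/^src/..." paths.
// 3. De-canonicalize output: "/^src/X.td" -> "/Users/me/wt1/X.td" before
//    printing diagnostics or writing the .d file.
```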
Clang / llvm-libtool-darwin / ld64.lld experiment: Skip writing object file content, writing out CAS UUIDs instead
This experiment aims to reduce I/O during a build by writing the contents of object files into a CAS and then writing only the associated CAS ID references as the `.o` files on disk. This is particularly beneficial in the context of a compilation-caching build, where the object file contents are already stored in a CAS as part of caching compilation outputs. It also makes the “ld64.lld experiment: Cache linking” (see later) more efficient by having CAS IDs readily available as the inputs of the linker invocation.
- Clang: A `-fcasid-output` flag makes clang write an embedded CAS ID as the `.o` file output. A special `file_magic` is added as a prefix so this kind of file can be recognized by other tools (sketched after this list).
- llvm-libtool-darwin: Gets additional options for connecting with a builtin LLVMCAS; if the `.o` files contain embedded CAS IDs, it writes out just the CAS ID in the static archive output instead of the full object file contents.
- ld64.lld: Modified to be able to read `.o` files with embedded CAS IDs.
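A minimal sketch of a CAS-ID stand-in object file, reusing the sketch types above; the magic string is a made-up placeholder, not the actual `file_magic` value:

```cpp
#include <string_view>

constexpr std::string_view CASIDFileMagic = "LLVM-CAS-ID:";

std::string makeCASIDFile(const CASID &ObjectID) {
  return std::string(CASIDFileMagic) + ObjectID; // Tiny .o: magic + ID.
}

// A consumer (llvm-libtool-darwin, ld64.lld) checks the magic and, on a
// match, resolves the real contents through the CAS.
std::optional<std::string> loadObject(ObjectStore &CAS,
                                      std::string_view FileBytes) {
  if (FileBytes.substr(0, CASIDFileMagic.size()) != CASIDFileMagic)
    return std::string(FileBytes);       // A regular object file.
  CASID ID(FileBytes.substr(CASIDFileMagic.size()));
  if (const CASObject *Obj = CAS.load(ID))
    return Obj->Data;                    // The full contents, from the CAS.
  return std::nullopt;                   // Missing from the CAS.
}
```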
ld64.lld experiment: Cache linking
This experiment adds `--fcas-cache-results` to `lld`’s MachO linker. When enabled, linking is performed in two stages:
1. Do a pass and record all the inputs that are relevant to the linker invocation into a CAS tree. This forms the cache key for the invocation.
2. If the linker output associated with the given cache key was recorded in the action cache, write out the cached output. Otherwise, run the normal linker invocation and record its output in the action cache.
While doing the linker work in stage 2, `lld` only accesses data from the CAS tree derived in stage 1; it doesn’t read from the filesystem again. This ensures that no input file gets used that was not accounted for as part of the cache key. This caching, in combination with the experiment above (Clang / llvm-libtool-darwin / ld64.lld: skip writing object file content), showed a ~23% reduction in build time for a clean build of the clang executable with a primed CAS and action cache.
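A minimal sketch of the two-stage cached link, reusing the sketch types and `computeActionKey` from above; `ingestLinkerInputs` and `linkFromTree` are assumed helpers standing in for the real work:

```cpp
CASID ingestLinkerInputs(ObjectStore &CAS,
                         const std::vector<std::string> &Inputs);
std::string linkFromTree(ObjectStore &CAS, CASID InputTree,
                         const std::vector<std::string> &Args);

std::string cachedLink(ObjectStore &CAS, ActionCache &Cache,
                       const std::vector<std::string> &Inputs,
                       const std::vector<std::string> &Args) {
  // Stage 1: record every relevant input into a CAS tree; the tree ID plus
  // the command line forms the cache key.
  CASID InputTree = ingestLinkerInputs(CAS, Inputs);
  CASID Key = computeActionKey(CAS, Args, InputTree);

  // Stage 2: on a hit, write out the cached output; on a miss, link reading
  // only from the CAS tree (never the filesystem again), then record it.
  if (std::optional<CASID> Hit = Cache.get(Key))
    return CAS.load(*Hit)->Data;
  std::string Output = linkFromTree(CAS, InputTree, Args);
  Cache.put(Key, CAS.store({Output, {}}));
  return Output;
}
```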
CAS-optimized object file experiments: Split object files into a DAG of CAS objects
This experiment splits “object files” into a DAG of CAS objects. It’s disjoint from the caching experiments above.
The idea is to exploit the redundant semantic information in collections of object files to reduce aggregate storage costs (in a CAS). It’s also an opportunity to design compiler outputs as a static linking format (what is convenient for emitters and linkers?), rather than as an executable format (what do runtime loaders need?).
It adds a tool, `llvm-cas-object-format`, that ingests object files into the CAS in various ways and computes stats on the aggregate storage cost.
There are two experimental schemas so far, both focused on the storage-cost question above. They show a 2-4x smaller growth rate than native Mach-O when storing a series of build directories for consecutive commits (with or without optimizations, but with no debug info).
The plan is to make a third prototype (soon) after thinking through builder and reader APIs.
Ideas for other experiments and investigations
Here are a few scattered ideas for other experiments and investigations in this direction:
- Can we make Clang “lazy” about parsing function bodies, but still type-check correctly (as-if parsed inline) if/when it’s triggered later? That’s an initial step in converting Clang to a demand-driven model (similar to Rust and Swift); what would the next step be?
- Where could Clang use relative source locations instead of absolute? Where could it isolate computations from absolute source locations (e.g., by virtualizing them) so that computations could be cached even when code moves?
- How could we change Clang module (`.pcm`) “files” if we were building against a CAS? What validation information could we strip? Can we split up the content (e.g., outline the AST block) to get redundancy between two compilations of a module with different command-line options but the same semantics in practice (and can we get cache hits between them)? Can we reference input files by content instead of full filesystem path… or, even better, drop input files from the artifact entirely and rely on the main compilation to provide them when needed?
- If the consumers of a module were expected to share an action cache, could we cache code generation on-demand? When would it be profitable to compute less at module generation time, deferring computation to be cached on-demand later?
- Bitcode’s structure is hostile to fine-grained caching. For example, the instructions in a function reference global values based on their enumeration order; this order is deterministic, but it changes independently of function bodies. Can we change how globals are referenced so that a function’s instruction stream is bit-for-bit identical when the function body has not changed?
- One idea is to enumerate, for each function, the globals it references (an edge-list between globals), and then have the instructions reference indexes into that function-local array; this isolates the content of the instruction stream for the function from the order that globals are enumerated in the module (see the sketch after this list).
- The function-local array of global references would add a small storage-cost overhead for an isolated function. But if the instruction stream is outlined and stored as a sub-object in a CAS, then the aggregate storage costs for related bitcode files would see big savings overall, since function bodies tend to repeat (e.g., consider a function that doesn’t change between two compilations, or a `linkonce_odr` function that shows up in multiple LLVM modules).
- Relatedly, using a per-function constant pool (already an optional bitcode encoding!) solves an analogous problem with instructions that reference pure constants.
- Relatedly, in-memory IR has an analogous structure, which prevents running function passes in parallel, since two threads can race when modifying the use-list of a global. An analogous refactoring could allow function passes to run concurrently. First, imagine opaque pointers are “done”, type names are moved to the module, and types and “pure” constants are made immutable (no names, no use-lists) and stored in a thread-safe/lock-free data structure in the LLVMContext; thus, “pure” constants can be created/referenced concurrently. Second, the analogous refactoring: make instructions point at function-local global-imports that have local use-lists, instead of pointing directly at global values; thus, a function pass would only modify function-local use-lists.
- One reason debug info IR is hostile to CAS-optimized bitcode is that function-specific metadata isn’t known a priori to be function-local and is serialized in the global pool (e.g., a DISubprogram is pinned to a single function declaration). Can we move ownership of such metadata to the function (and use the global-values idea above to import the global metadata it references)?
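A minimal sketch of the function-local global-references idea in a CAS-optimized bitcode schema; the struct is illustrative, not a proposed encoding:

```cpp
struct FunctionBody {
  // Edge list: module-level value IDs of the globals this function uses,
  // in a function-local order.
  std::vector<unsigned> GlobalImports;
  // Encoded instructions reference GlobalImports by index, so the stream's
  // bytes depend only on the body, not on module-level enumeration order.
  // Stored as its own CAS sub-object, identical bodies deduplicate across
  // bitcode files (e.g., unchanged functions, linkonce_odr definitions).
  std::string InstructionStream;
};
```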
Sketch of a long-term vision
These experiments are working in the direction of a long-term vision to dramatically speed up builds. Here is a sketch of the main pieces:
- Caching.
- Speed up the compiler by isolating functional computations from filesystem and execution environment and modelling input discovery explicitly, caching functional computations based on explicit inputs in a CAS. Increase cache hits between related compiler invocations by caching fine-grained actions/requests that prune and canonicalize their inputs. Further increase cache hits by refactoring data models for inputs, intermediates, and outputs to isolate data that change independently.
- Reduce cost of long-lived caches by using CAS-optimized bitcode and object file formats designed to share sub-objects. Bridge with other tools through conversions to/from existing object file formats, or by updating them to read directly from the CAS.
- Speed up the linker by refactoring along similar principles, using fine-grained computations with pruned and canonicalized inputs to cache between related linker invocations. CAS-optimized linker inputs should be designed to facilitate this.
- Scheduling.
- Improve scheduling of high-level sub-tasks by sending “discovered work” graphs to a toolchain daemon that’s integrated with the CAS (suitable for distributed ThinLTO, implicitly-discovered explicitly-built Clang modules, or the new Swift driver).
- Reduce process overhead by designing a daemonization protocol for tools to adopt, allowing the toolchain daemon to send tasks via IPC rather than spawning a process.
- Empower the build system to schedule and distribute work by designing a toolchain API for build systems to adopt, which sends tasks to the toolchain daemon over IPC and can handle receiving graphs of “discovered work”.
- Interactive workflows (vs. artifact workflows).
- Speed up tooling that doesn’t need artifacts by transitioning workloads from waterfall to demand-driven, requesting only the observed results.
- Speed up interactive workflows (such as live editor tooling in clangd, edit-and-continue in JITLink, or ultra-fast local incremental builds) by adding fine-grained dependency tracking for cacheable actions, enabling sound reuse of still-valid computations from a previous run (e.g., see Rust’s incremental compilation and Swift’s request evaluator).