RFC: Add an LLVM CAS library and experiment with fine-grained caching for builds
Goal
Long-term: Speed up builds by enabling fine-grained caching in the compilers and linkers.
Short-term (this RFC): Add a builtin content-addressable storage (CAS) to LLVM, designed to integrate with external CAS instances, to facilitate experiments with using content-based caching in LLVM tools.
TL;DR: Let’s give LLVM-based tools access to a CAS for storing structured data, as a foundation for experimenting with caching.
Motivation
Opportunity: Builds do redundant work
The computations and data in a build tend to be redundant, both within an individual build and across subsequent builds. Within a build, multiple compiler and linker invocations parse and process different subsets of the same inputs, repeating many computations (such as type-checking, generating code, and optimizing the same C++ template functions). Across consecutive builds, usually at least one input will change, but often many of the computations match between builds.
Problem #1: Caching is hard
It’s hard to cache computations internal to the compiler and linker. One challenge is to define a cache key that is sound to use across different contexts: filesystem inputs are specified by path, which can change independently of the input content; and some inputs are found implicitly, discovered only after the computation begins. Another challenge is to avoid cache misses when an input changes in a way that’s irrelevant for a particular computation.
Problem #2: Caching is expensive
It’s expensive to store compiler and linker outputs, which limits the lifetime of caches (thereby limiting their effectiveness). The main challenge is that current output artifacts are monolithic, such that a small semantic change in the input has a ripple effect across the binary representation of the output artifact. Even when two artifacts are semantically similar, where most of the represented data is fundamentally equivalent, it’s hard to avoid storing that data redundantly.
Observation: Distributed build systems operate in the same space, but at a wider scope
Distributed build systems, such as Bazel and Buck, have a wider lens but target the same opportunity. In particular, a single build system invocation includes many compiler and linker invocations, which are largely redundant across builds when many users ask for builds of the same software stack in parallel.
These build systems use content-addressable storage (CAS) for storing inputs and outputs of compiler and linker invocations. Using a CAS simplifies Problem #1 (caching is hard) by factoring out cache invalidation (quick explainer a bit later). Storage cost of inputs and outputs is amortized by scale, relying on objects in the CAS being referenced many times to mitigate Problem #2 (caching is expensive).
A few leading questions…
- Can compiler and linker invocations make internal use of a CAS to simplify Problem #1 (caching is hard), and speed up builds by caching internal computations?
- Where would minor refactoring enable existing algorithms to be cached?
- Where would architectural changes allow more effective or finer-grained caching?
- Can we improve the schemas for compiler outputs / linker inputs to address Problem #2 (caching is expensive) for the compiler, linker, and build system?
- If the compiler, linker, and build system all use the same CAS, are there other opportunities that emerge?
Proposal: Let’s add an LLVMCAS library as a foundation for content-based caching
Let’s add an LLVMCAS library, to establish a foundation for experimenting with persistent, content-based caching in LLVM tools.
- Content-addressable storage (CAS) uses strong hashing to provide content-based UUIDs for arbitrary data. A CAS object can be structured, referencing sub-objects by UUID as part of its content, forming a DAG. If two objects (or sub-objects) have the same content, then they have the same UUID and are implicitly deduplicated. A prominent example of using a CAS for storage is Git, where each data structure is a CAS object.
- Persistent caching across contexts is often easy when inputs and outputs are stored in a CAS. The action cache can be a key-value map from an action (which includes the UUIDs of its inputs) to a result (the UUID of the output). There is no need to check the validity of a cached result, since a cache hit can only occur when the result is correct. (A minimal sketch of both abstractions follows this list.)
- This library should provide abstractions for object storage (the CAS itself) and action caching, with a builtin implementation optimized for local use. Later, we can design a plugin system to hook up external CAS instances (such as those used by distributed build systems).
- With a foundation for persistent, content-based caching in place, we can collaborate on ideas for rearchitecting the tools and artifacts to take advantage of it.
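To make the two abstractions concrete, here is a minimal sketch of an object store and an action cache, assuming a simple in-memory map; all names here are illustrative, not the prototype’s actual API:

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// A content-based UUID: in practice, a strong hash (e.g., SHA-256) computed
// over an object's data and the IDs of its references.
using CASID = std::string;

struct CASObject {
  std::string Data;        // Raw bytes.
  std::vector<CASID> Refs; // Sub-objects by ID, forming a DAG.
};

CASID hashObject(const CASObject &Obj); // Assumed: strong hash of Data+Refs.

class ObjectStore {
  std::map<CASID, CASObject> Objects;

public:
  // Same content => same ID => stored once (implicit deduplication).
  CASID store(CASObject Obj) {
    CASID ID = hashObject(Obj);
    Objects.emplace(ID, std::move(Obj));
    return ID;
  }
  const CASObject *load(const CASID &ID) const {
    auto It = Objects.find(ID);
    return It == Objects.end() ? nullptr : &It->second;
  }
};

// The action cache maps an action key (embedding the IDs of all inputs) to
// the ID of the result. A hit is correct by construction: the key only
// matches when the inputs are identical, so no invalidation is needed.
class ActionCache {
  std::map<CASID, CASID> Results;

public:
  void put(const CASID &ActionKey, const CASID &Result) {
    Results[ActionKey] = Result;
  }
  std::optional<CASID> get(const CASID &ActionKey) const {
    auto It = Results.find(ActionKey);
    if (It == Results.end())
      return std::nullopt;
    return It->second;
  }
};
```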
There is already a prototype for such a library (not ready for review! caveats and design problems to fix!). This prototype currently lives in a branch called experimental/cas/main (in Apple’s fork), alongside a few experiments (also not ready for review!).
Here’s a high-level view of what’s in the prototype LLVMCAS library:
- A few thread-safe data structures
- Abstractions to model a CAS and an action cache
- Three object types: “Blob” (data), “Node” (data + references), “Tree” (map: name → reference + kind), sketched after this list
- Builtin CAS implementations for in-memory and on-disk
- A thread-safe utility for caching lazily-discovered directories/files, supporting concurrent views that have different working directory state, used by:
- Filesystem that reads from a CAS, and treats a CAS tree as the filesystem root
- Filesystem that provides an immutable view of the system filesystem (`sys::fs`), tracking accesses to enable creation of CAS trees
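Here is a minimal sketch of the three object types and a tree-based path lookup, roughly what a CAS-backed filesystem does; it reuses `CASID`/`ObjectStore` from the sketch above, and `decodeTree` and the other names are illustrative assumptions:

```cpp
enum class TreeEntryKind { File, Directory, Symlink };

struct TreeEntry {
  CASID Ref;          // The referenced blob/node/tree.
  TreeEntryKind Kind; // How to interpret it.
};

// "Blob" is raw data; "Node" is data plus references; "Tree" maps names to
// entries, so a whole directory hierarchy is a DAG named by its root tree ID.
using Tree = std::map<std::string, TreeEntry>;

Tree decodeTree(const CASObject &Obj); // Assumed: parse a tree object.

// Walk a path one component at a time from a root tree, loading each subtree
// from the object store.
std::optional<TreeEntry> lookup(ObjectStore &CAS, CASID RootTree,
                                const std::vector<std::string> &Path) {
  TreeEntry Entry{RootTree, TreeEntryKind::Directory};
  for (const std::string &Name : Path) {
    const CASObject *Obj = CAS.load(Entry.Ref);
    if (!Obj || Entry.Kind != TreeEntryKind::Directory)
      return std::nullopt;
    Tree T = decodeTree(*Obj);
    auto It = T.find(Name);
    if (It == T.end())
      return std::nullopt;
    Entry = It->second;
  }
  return Entry;
}
```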
Assuming consensus on this proposal, the next steps are to prepare patches for the pieces of LLVMCAS, discuss any important design points, review them, and land them at github.com/llvm/llvm-project.
Feedback needed! (Questions for the reader!)
Firstly, I’m seeking high-level guidance on preparing LLVMCAS for review, mostly around what to “fix” before landing vs. evolving in tree. (Most design questions can wait for code review, I think.)
- Should LLVMCAS exist at all, or should its pieces just land in LLVMSupport? (Proposal: Preference for LLVMSupport, allowing LLVMSupport to use it)
- Does LLVMCAS need to support Windows immediately? (Proposal: No; Windows can be added later)
- On design of the CAS object storage and action cache abstractions
- Should the abstractions be stable to help downstream code? (Proposal: No; they evolve incrementally, as usual)
- Should the abstractions support plugins? (Proposal: Eventually; and the plugin interface will need to be stable)
- Should plugins be “figured out” or examples implemented before landing? (Proposal: No)
- On the implementation of the builtin CAS
- Is the serialization of CAS objects stable? (Proposal: Eventually—to allow tools with different revisions of LLVMCAS to talk to the same builtin CAS—but not yet)
- Is the CAS hash function stable? (Proposal: Eventually—same goal—but not yet)
- Is the persistent on-disk format stable? (Proposal: Eventually—same goal—but not yet)
- Should clients be able to configure which stable serialization/hash/etc. to use? (Proposal: Yes! (eventually))
- Do we need users of the library in-tree? (Proposal: Not immediately)
- Does accepting this proposal commit us to the long-term vision outlined below? (Proposal: No)
Secondly, where should we be working on follow-up experiments? Two obvious options:
- Iterate in `main`, off-by-default (with usual code quality and review expectations). This is best for collaboration, IMO.
- Iterate in experimental/cas/main (in Apple’s fork), only sending experiments for review on `main` once they’re “ready”.
If there’s consensus for `main`, we’ll prioritize cleaning up the experiments for review and can iterate in tree from there (abandoning the existing branch). Otherwise, what should the bar be for “ready”? (Is there a third option to consider?)
Finally, please ask (and/or answer) important questions that I missed!
Experiments
Below are short descriptions of a few experiments living at experimental/cas/main (in Apple’s fork) that depend on the LLVMCAS prototype.
- None of these is being proposed here; they’re just examples to show some potential.
- Most of them are documented in DEMO.md on that branch.
- I’m “living on” the Clang and TableGen caching experiments (18-core iMac Pro). My experience:
- Compile times of clean builds of Clang with an unprimed cache are roughly on par with builds that don’t use the experimental features (slightly faster for me, but by <1%).
- Clean builds of “Release” Clang with a primed cache are ~10s (after CMake). “RelWithDebInfo” and “Debug” are slower than that since I’m not living on the linker experiments (and they have more I/O and more work for the linker).
- Rebuilds after switching branches or reordering history are often fully cached (~10s).
- Rebuilds after rebasing on `main` are often faster (depending on what changed upstream).
Clang experiment: Cache full compilation across worktrees
This experiment adds `-fdepscan` and `-fdepscan-prefix-map` options to clang.
- `-fdepscan` causes the driver to launch or connect to a daemon that discovers and ingests `-cc1` inputs into the CAS. It adds the ID of the resulting CAS tree to the `-cc1` command-line, which will read from a `CASFileSystem` instead of the disk. As a result, the `-cc1` command-line lists all its inputs explicitly and can be used as a cache key for the outputs (sketched below).
- `-fdepscan-prefix-map` applies a prefix map to the discovered CAS tree and the `-cc1` command-line. This allows cache hits across different worktrees (or build directories).
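A minimal sketch of how such a cache key could be formed, reusing the types from the earlier sketch (the helper name is an illustrative assumption):

```cpp
// The full, explicit -cc1 command line plus the ID of the input tree fully
// determines the outputs, so the key object's CAS ID is a sound cache key.
CASID computeActionKey(ObjectStore &CAS,
                       const std::vector<std::string> &Args,
                       CASID InputTreeID) {
  CASObject Key;
  for (const std::string &Arg : Args)
    Key.Data += Arg + '\0';          // Serialize the command line.
  Key.Refs.push_back(InputTreeID);   // Every input, as one CAS tree.
  return CAS.store(std::move(Key));
}
```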
On macOS, clean builds with an empty/never-used CAS and action cache slow down by 1-2% (a regression recovered by caching raw tokenization, described later). Clean builds with a primed CAS and action cache are “fast” (hundreds of compiles per second), with the build dominated by linking, running tablegen, and process launch.
One interesting side effect of `-fdepscan-prefix-map` is that the output is fully canonical. Even diagnostics are relative to the remapped files (`.d` dependency files are the exception: they are emitted by the dependency scanner in the daemon, keeping incremental builds working). This canonicalization is mostly great (reproducible builds!), but it can be annoying for active development. Maybe some things (like diagnostics) should be de-canonicalized, as a layer on top of caching?
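A minimal sketch of prefix mapping; the mapping entries and the “/^src” placeholder are illustrative, not the actual option syntax:

```cpp
using PrefixMap = std::vector<std::pair<std::string, std::string>>;

std::string applyPrefixMap(std::string Path, const PrefixMap &Map) {
  for (const auto &[Old, New] : Map)
    if (Path.compare(0, Old.size(), Old) == 0) // Path starts with Old.
      return New + Path.substr(Old.size());
  return Path;
}

// Two worktrees map to the same canonical path, so their CAS trees and -cc1
// command lines (and therefore cache keys) are identical:
//   applyPrefixMap("/Users/me/wt1/Foo.cpp", {{"/Users/me/wt1", "/^src"}})
//   applyPrefixMap("/Users/me/wt2/Foo.cpp", {{"/Users/me/wt2", "/^src"}})
// both yield "/^src/Foo.cpp".
```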
Clang experiment: Cache raw tokenization
This experiment adds a `-cc1` option that turns on “raw token caching”. Before entering a file, the lexer first tokenizes it in “raw” mode (without the preprocessor) to create a binary token stream and raw identifier table. This result is cached across compilations (sketched below).
On macOS, this speeds up clean builds that have an empty/never-used CAS and action cache by 1-2%, enough to recover the overhead from `-fdepscan`.
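A minimal sketch of raw-token caching, reusing the sketch types above. Because raw tokenization depends only on the file’s bytes, the file blob’s CAS ID is a sound cache key; `rawTokenize` is an assumed helper:

```cpp
std::string rawTokenize(const std::string &Bytes); // Lex, no preprocessing.

CASID getRawTokens(ObjectStore &CAS, ActionCache &Cache, CASID FileBlobID) {
  if (std::optional<CASID> Hit = Cache.get(FileBlobID))
    return *Hit;                                  // Reuse across compilations.
  const CASObject *File = CAS.load(FileBlobID);   // Already ingested.
  CASID Result = CAS.store({rawTokenize(File->Data), {}});
  Cache.put(FileBlobID, Result);
  return Result;
}
```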
TableGen experiment: Cache llvm-tblgen and clang-tblgen
This experiment adds `-depscan` and `-depscan-prefix-map` to the TableGen executable, which operate like the clang options above without the use of any daemons. They turn on dependency scanning and caching for running TableGen.
On macOS, this eliminates tablegen-related overhead in a (cached) clean build, as long as tablegen was built with optimizations. When built without optimizations, CAS ingestion of the tablegen executable itself (to use in the cache key) shows up, but we could use a daemon to avoid repeating this work in each invocation.
TableGen reverses the `-depscan-prefix-map` when emitting diagnostics and `.d` files: inputs are canonicalized with the prefix map, actions are cached using canonical results, and then the cached output is de-canonicalized for tool output. Unlike full compilation caching in clang (above), the prefix map should not affect the output of the tool; it only improves cache hits. Maybe this model would be better for Clang as well!
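A minimal sketch of the canonicalize/cache/de-canonicalize pipeline, reusing `applyPrefixMap` above; `reversePrefixMap` simply applies the mapping in the opposite direction (illustrative):

```cpp
std::string reversePrefixMap(std::string Path, const PrefixMap &Map) {
  for (const auto &[Old, New] : Map)
    if (Path.compare(0, New.size(), New) == 0)
      return Old + Path.substr(New.size());
  return Path;
}
// 1. Canonicalize inputs:    "/Users/me/wt1/X.td" -> "/^src/X.td"
// 2. Cache on canonical key: outputs are stored with "/^src/..." paths.
// 3. De-canonicalize output: "/^src/X.td" -> "/Users/me/wt1/X.td" before
//    printing diagnostics or writing the .d file.
```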
Clang / llvm-libtool-darwin / ld64.lld experiment: Skip writing object file content, writing out CAS UUIDs instead
This experiment aims to reduce I/O during a build by writing the contents of object files into a CAS and then writing only the associated CAS ID references as the `.o` files on disk. This is particularly beneficial in the context of a compilation-caching build, where the object file contents are already stored in a CAS as part of caching compilation outputs. It also makes the “ld64.lld experiment: Cache linking” (see later) more efficient by having CAS IDs readily available as the inputs of the linker invocation.
- Clang: A `-fcasid-output` flag makes clang write an embedded CAS ID as the `.o` file output. A special `file_magic` is added as a prefix so this kind of file can be recognized by other tools (sketched after this list).
- llvm-libtool-darwin: Gets additional options for connecting with a builtin LLVMCAS; if the `.o` files contain embedded CAS IDs, it writes out just the CAS ID in the static archive output instead of the full object file contents.
- ld64.lld: Modified to be able to read `.o` files with embedded CAS IDs.
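A minimal sketch of a CAS-ID stand-in object file, reusing the sketch types above; the magic string is a made-up placeholder, not the actual `file_magic` value:

```cpp
#include <string_view>

constexpr std::string_view CASIDFileMagic = "LLVM-CAS-ID:";

std::string makeCASIDFile(const CASID &ObjectID) {
  return std::string(CASIDFileMagic) + ObjectID; // Tiny .o: magic + ID.
}

// A consumer (llvm-libtool-darwin, ld64.lld) checks the magic and, on a
// match, resolves the real contents through the CAS.
std::optional<std::string> loadObject(ObjectStore &CAS,
                                      std::string_view FileBytes) {
  if (FileBytes.substr(0, CASIDFileMagic.size()) != CASIDFileMagic)
    return std::string(FileBytes);       // A regular object file.
  CASID ID(FileBytes.substr(CASIDFileMagic.size()));
  if (const CASObject *Obj = CAS.load(ID))
    return Obj->Data;                    // The full contents, from the CAS.
  return std::nullopt;                   // Missing from the CAS.
}
```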
ld64.lld experiment: Cache linking
This experiment adds `--fcas-cache-results` to `lld`’s MachO linker. When enabled, linking is performed in two stages:
1. Do a pass and record all the inputs that are relevant to the linker invocation into a CAS tree. This forms the cache key for the invocation.
2. If the linker output associated with the given cache key was recorded in the action cache, write out the cached output. Otherwise, run the normal linker invocation and record its output in the action cache.
While doing the linker work in stage 2, `lld` only accesses data from the CAS tree derived in stage 1; it doesn’t read from the filesystem again. This ensures that no input file gets used that was not accounted for as part of the cache key. This caching, in combination with the experiment above (Clang / llvm-libtool-darwin / ld64.lld: skip writing object file content), showed a ~23% reduction in build time for a clean build of the clang executable with a primed CAS and action cache.
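A minimal sketch of the two-stage cached link, reusing the sketch types and `computeActionKey` from above; `ingestLinkerInputs` and `linkFromTree` are assumed helpers standing in for the real work:

```cpp
CASID ingestLinkerInputs(ObjectStore &CAS,
                         const std::vector<std::string> &Inputs);
std::string linkFromTree(ObjectStore &CAS, CASID InputTree,
                         const std::vector<std::string> &Args);

std::string cachedLink(ObjectStore &CAS, ActionCache &Cache,
                       const std::vector<std::string> &Inputs,
                       const std::vector<std::string> &Args) {
  // Stage 1: record every relevant input into a CAS tree; the tree ID plus
  // the command line forms the cache key.
  CASID InputTree = ingestLinkerInputs(CAS, Inputs);
  CASID Key = computeActionKey(CAS, Args, InputTree);

  // Stage 2: on a hit, write out the cached output; on a miss, link reading
  // only from the CAS tree (never the filesystem again), then record it.
  if (std::optional<CASID> Hit = Cache.get(Key))
    return CAS.load(*Hit)->Data;
  std::string Output = linkFromTree(CAS, InputTree, Args);
  Cache.put(Key, CAS.store({Output, {}}));
  return Output;
}
```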
CAS-optimized object file experiments: Split object files into a DAG of CAS objects
This experiment splits “object files” into a DAG of CAS objects. It’s disjoint from the caching experiments above.
The idea is to exploit the redundant semantic information in collections of object files to reduce aggregate storage costs (in a CAS). It’s also an opportunity to design compiler outputs as a static linking format (what is convenient for emitters and linkers?), rather than as an executable format (what do runtime loaders need?).
It adds a tool, `llvm-cas-object-format`, that ingests object files into the CAS in various ways and computes stats on the aggregate storage cost.
There are two experimental schemas so far, both focused on the storage-cost question above. They show a 2-4x smaller growth rate than native Mach-O when storing a series of build directories for consecutive commits (with or without optimizations, but with no debug info).
The plan is to make a third prototype (soon) after thinking through builder and reader APIs.
Ideas for other experiments and investigations
Here are a few scattered ideas for other experiments and investigations in this direction:
- Can we make Clang “lazy” about parsing function bodies, but still type-check correctly (as-if parsed inline) if/when it’s triggered later? That’s an initial step in converting Clang to a demand-driven model (similar to Rust and Swift); what would the next step be?
- Where could Clang use relative source locations instead of absolute? Where could it isolate computations from absolute source locations (e.g., by virtualizing them) so that computations could be cached even when code moves?
- How could we change Clang module (`.pcm`) “files” if we were building against a CAS? What validation information could we strip? Can we split up the content (e.g., outline the AST block) to get redundancy between two compilations of a module with different command-line options but the same semantics in practice (and can we get cache hits between them)? Can we reference input files by content instead of full filesystem path… or, even better, drop input files from the artifact entirely and rely on the main compilation to provide them when needed?
- If the consumers of a module were expected to share an action cache, could we cache code generation on-demand? When would it be profitable to compute less at module generation time, deferring computation to be cached on-demand later?
- Bitcode’s structure is hostile to fine-grained caching. For example, the instructions in a function reference global values based on their enumeration order; this order is deterministic, but it changes independently of function bodies. Can we change how globals are referenced so that a function’s instruction stream is bit-for-bit identical when the function body has not changed?
- One idea is to enumerate, for each function, the globals it references (an edge-list between globals), and then have the instructions reference indexes into that function-local array; this isolates the content of the instruction stream for the function from the order that globals are enumerated in the module (see the sketch after this list).
- The function-local array of global references would add a small storage-cost overhead for an isolated function. But if the instruction stream is outlined and stored as a sub-object in a CAS, then the aggregate storage costs for related bitcode files would see big savings overall, since function bodies tend to repeat (e.g., consider a function that doesn’t change between two compilations, or a `linkonce_odr` function that shows up in multiple LLVM modules).
- Relatedly, using a per-function constant pool (already an optional bitcode encoding!) solves an analogous problem with instructions that reference pure constants.
- Relatedly, in-memory IR has an analogous structure, which prevents running function passes in parallel, since two threads can race when modifying the use-list of a global. An analogous refactoring could allow function passes to run concurrently. First, imagine opaque pointers are “done”, type names are moved to the module, and types and “pure” constants are made immutable (no names, no use-lists) and stored in a thread-safe/lock-free data structure in the LLVMContext; thus, “pure” constants can be created/referenced concurrently. Second, the analogous refactoring: make instructions point at function-local global-imports that have local use-lists, instead of pointing directly at global values; thus, a function pass would only modify function-local use-lists.
- One reason debug info IR is hostile to CAS-optimized bitcode is that function-specific metadata isn’t known a priori to be function-local and is serialized in the global pool (e.g., a DISubprogram is pinned to a single function declaration). Can we move ownership of such metadata to the function (and use the global-values idea above to import the global metadata it references)?
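A minimal sketch of the function-local global-references idea in a CAS-optimized bitcode schema; the struct is illustrative, not a proposed encoding:

```cpp
struct FunctionBody {
  // Edge list: module-level value IDs of the globals this function uses,
  // in a function-local order.
  std::vector<unsigned> GlobalImports;
  // Encoded instructions reference GlobalImports by index, so the stream's
  // bytes depend only on the body, not on module-level enumeration order.
  // Stored as its own CAS sub-object, identical bodies deduplicate across
  // bitcode files (e.g., unchanged functions, linkonce_odr definitions).
  std::string InstructionStream;
};
```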
Sketch of a long-term vision
These experiments are working in the direction of a long-term vision to dramatically speed up builds. Here is a sketch of the main pieces:
- Caching.
- Speed up the compiler by isolating functional computations from filesystem and execution environment and modelling input discovery explicitly, caching functional computations based on explicit inputs in a CAS. Increase cache hits between related compiler invocations by caching fine-grained actions/requests that prune and canonicalize their inputs. Further increase cache hits by refactoring data models for inputs, intermediates, and outputs to isolate data that change independently.
- Reduce cost of long-lived caches by using CAS-optimized bitcode and object file formats designed to share sub-objects. Bridge with other tools through conversions to/from existing object file formats, or by updating them to read directly from the CAS.
- Speed up the linker by refactoring along similar principles, using fine-grained computations with pruned and canonicalized inputs to cache between related linker invocations. CAS-optimized linker inputs should be designed to facilitate this.
- Scheduling.
- Improve scheduling of high-level sub-tasks by sending “discovered work” graphs to a toolchain daemon that’s integrated with the CAS (suitable for distributed ThinLTO, implicitly-discovered explicitly-built Clang modules, or the new Swift driver).
- Reduce process overhead by designing a daemonization protocol for tools to adopt, allowing the toolchain daemon to send tasks via IPC rather than spawning a process.
- Empower the build system to schedule and distribute work by designing a toolchain API for build systems to adopt, which sends tasks to the toolchain daemon over IPC and can handle receiving graphs of “discovered work”.
- Interactive workflows (vs. artifact workflows).
- Speed up tooling that doesn’t need artifacts by transitioning workloads from waterfall to demand-driven, requesting only the observed results.
- Speed up interactive workflows (such as live editor tooling in clangd, edit-and-continue in JITLink, or ultra-fast local incremental builds) by adding fine-grained dependency tracking for cacheable actions, enabling sound reuse of still-valid computations from a previous run (e.g., see Rust’s incremental compilation and Swift’s request evaluator).