LLVMCAS Upstreaming
What feels like ages ago, we posted our RFC about integrating Content Addressable Storage (CAS) into LLVM to enable compiler caching (RFC: Add an LLVM CAS library and experiment with fine-grained caching for builds). If you don’t remember the details, you can watch our LLVM dev meeting talk (https://youtu.be/E9GdNKjGZ7Y) and read our round-table summary from the meeting (Round Table about CAS and Compiler Caching in 2022 LLVM Dev Mtg).
While the initial feedback from the community was good, we struggled to find reviewers for our patches once we started the upstreaming process. Thanks to @dblakie and others who put in time and effort to provide valuable feedback. Even so, we didn’t get enough feedback to comfortably land changes for such a major new component. Although there hasn’t been much visible activity here, we definitely didn’t give up on CAS or compiler caching. We have continued to work downstream (apple/llvm-project on GitHub) and to improve on what we proposed. Since then, we have gotten clang modules working with CAS and have prototyped CAS support in the Swift compiler as well.
Now that llvm-17 has branched, it is a good time to revisit CAS upstreaming, and we need your help. If you are interested in CAS, caching, and build performance, please help us upstream our implementation. The overall change is big, so I will try to break it down into parts to make things easier for reviewers. Let me know if you can help with any of them so I can add you as a reviewer (as I create even more patches).
CAS Implementation
We have both in-memory and on-disk implementations with basic functionality. Even though we haven’t done much performance tuning yet, our implementation is already space-efficient (e.g. small on-disk size without compression) and fast (loading/traversing CAS objects). The work can be broken down into 3 categories:
- Underlying data structures, mostly contributions to ADT: both in-memory and on-disk lock-free trie data structures, plus other concurrent data structures used to implement Content Addressable Storage.
- CAS implementation on top of those data structures.
- On top of that, we also need feedback on the CAS APIs that are used to integrate CAS into different tools and provide higher-level functions. An example from downstream: https://github.com/apple/llvm-project/blob/experimental/cas/main/llvm/include/llvm/CAS/UnifiedOnDiskCache.h
Clang Integration
- llvm VirtualOutputBackends: utilities to virtualize compiler outputs so outputs can be redirected/mirrored into different OutputBackend implementations. It is a useful tool for integrating CAS into various tools (clang, llvm-tblgen, swift, etc).
- clang dependency scanning daemon: an out-of-process dependency scanner daemon, which an ongoing GSoC project is also actively exploring:
  - clang -cc1depscand starts a daemon from the clang binary
  - clang-driver can coordinate a caching build with the daemon, without build system support
  - Currently, it has a very simple protocol and communicates with clang processes via a Unix socket (needs to support non-Unix platforms)
- Clang cache integration: with CAS and the dependency scanner, implement a clang cache that is sound by nature.
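To illustrate why such a cache can be sound, here is a rough sketch under stated assumptions. It assumes the cache key covers the command line plus the contents of every input the dependency scan reports. The source only says the cache builds on CAS and the dependency scanner, so that key layout is my simplification and not the actual clang design. `ToyCompileJob`, `cacheKey` and `ToyActionCache` are hypothetical names, and FNV-1a again stands in for a cryptographic hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Digest = std::uint64_t;

// Feeds Data plus a separator byte into a running FNV-1a hash, so that
// {"ab", "c"} and {"a", "bc"} are fed as different byte streams.
inline Digest mix(Digest H, const std::string &Data) {
  for (unsigned char C : Data) {
    H ^= C;
    H *= 1099511628211ULL; // FNV-1a 64-bit prime
  }
  H ^= 0xff;
  H *= 1099511628211ULL;
  return H;
}

// Hypothetical job description: the arguments plus the contents of every
// input file found by a dependency scan (path -> contents).
struct ToyCompileJob {
  std::vector<std::string> Args;
  std::map<std::string, std::string> InputContents;
};

// Assumption of this sketch: the key covers all arguments and all inputs.
inline Digest cacheKey(const ToyCompileJob &Job) {
  Digest H = 14695981039346656037ULL; // FNV-1a 64-bit offset basis
  for (const std::string &Arg : Job.Args)
    H = mix(H, Arg);
  for (const auto &Entry : Job.InputContents) {
    H = mix(H, Entry.first);  // path
    H = mix(H, Entry.second); // contents
  }
  return H;
}

class ToyActionCache {
public:
  // Returns the cached output for Job, or runs Compile and caches its result.
  template <typename CompileFn>
  std::string getOrCompile(const ToyCompileJob &Job, CompileFn Compile) {
    Digest Key = cacheKey(Job);
    auto It = Results.find(Key);
    if (It != Results.end()) {
      ++Hits;
      return It->second;
    }
    ++Misses;
    std::string Output = Compile(Job);
    Results.emplace(Key, Output);
    return Output;
  }

  unsigned hits() const { return Hits; }
  unsigned misses() const { return Misses; }

private:
  std::map<Digest, std::string> Results;
  unsigned Hits = 0;
  unsigned Misses = 0;
};
```

Because the key changes whenever any scanned input or argument changes, a hit can only return a result computed from the same inputs. In this sketch, editing a header that the scan reported forces a miss even though the command line is unchanged.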
Other
- MCCAS ObjectFormat: using CAS to efficiently store object files, which was very well received at the dev meeting. It is lower on the priority list because it depends on the rest of the work.
Current patches that need review:
- https://reviews.llvm.org/D133503: Support: Add proxies for raw_ostream and raw_pwrite_stream
- https://reviews.llvm.org/D133504: Support: Add vfs::OutputBackend and OutputFile to virtualize compiler outputs
- https://reviews.llvm.org/D133509: Frontend: Adopt llvm::vfs::OutputBackend in CompilerInstance
- https://reviews.llvm.org/D133715: [ADT] Add TrieRawHashMap
- https://reviews.llvm.org/D133716: [CAS] Add LLVMCAS library with InMemoryCAS implementation
There are still lots of patches that need to be created. If you want to see some of the areas mentioned above prioritized, please let me know so I can adjust my work.