LLVM bitstream integration with CAS (content-addressable storage)

Description of the project: The LLVM bitstream file format is used for serialization of intermediate compiler artifacts, such as LLVM IR or Clang modules. There are situations where multiple bitstream files store identical information, and this duplication leads to increased storage requirements.

This project aims to integrate the LLVM CAS library into the LLVM bitstream file format. If we factor out the frequently duplicated part of a bitstream file into a separate CAS object, we can replace all copies with a small reference to the canonical CAS object, saving storage.

The primary motivating use-case for this project is the dependency scanner that’s powering “implicitly-discovered, explicitly-built” Clang modules. There are real-world situations where even coarse de-duplication on the block level could halve the size of the scanning module cache.

Expected result: There’s a way to configure the LLVM bitstream writer/reader to use CAS as the backing storage.

Skills: Intermediate knowledge of C++, some familiarity with data serialization, self-motivation.

Project size: Medium or large

Confirmed Mentors: Jan Svoboda, Steven Wu


Resources:

Note: The LLVM CAS library is currently in the process of being upstreamed from the Apple LLVM fork.

Dear @jansvoboda11,

My name is Sahil Patidar, and I am excited about the opportunity to contribute to the LLVM project. I have a strong background in C++ and am familiar with the LLVM IR API. After reviewing the information you supplied, I am particularly interested in contributing to the Bitstream-related aspects of the project.

Once I have thoroughly reviewed all the details, I will get back to you with any questions or clarifications. If there are specific challenges or areas of focus you recommend, please let me know; I want to ensure I align my efforts with the project’s goals.

Thank you for your time and consideration. I look forward to the possibility of working together.

Best regards,
Sahil Patidar

One quick question: when you mention coarse deduplication, does that mean treating each bitstream file as a single data blob handed to the CAS? If so, would there be situations where multiple bitstream files share some common subsections, meaning splitting bitstream files up would improve deduplication even more?
Thanks!

Hi,

  • By “single data blob”, I think you were referring to a CASObject, right? As I understand it, having one CASObject per bitstream file alone won’t achieve the deduplication that Jan is talking about above.
  • Since there are situations where multiple bitstream files store similar information, creating a CASObject per common section of a bitstream file would let us replace that section (by reference) with the relevant CASObject in every file that uses it, resulting in deduplication.

PS: Please correct me if my understanding is wrong. I’m also a beginner contributor to LLVM, looking to work on this project for GSoC 2024.

Thanks!

In the context of a bitstream, i.e. a sequential arrangement of blocks and data records, a key optimization is coarse de-duplication at the block level: identify redundant blocks or data records, create a CASObject for each, and have the bitstream reference those CASObjects instead, minimizing storage redundancy and improving overall efficiency.

I would appreciate your insights on this interpretation. Do you agree with my understanding, or would you offer any clarifications or adjustments?

Hi all, it’s good to see the discussion going!

The coarse de-duplication I mentioned was on the block and record level. In that experiment, we simulated de-duplication of parts of Clang’s module files written by these functions:

  • ASTWriter::WriteSourceManagerBlock()
  • ASTWriter::WritePreprocessor()
  • ASTWriter::WriteHeaderSearch()
  • ASTWriter::WriteIdentifierTable()
  • ASTWriter::WriteSubmodules()

As it stands now, trying to de-duplicate entire files probably won’t bring much benefit, since lots of these files will have some small differences.

I think it would be good to explore what other parts of bitstream files could benefit from de-duplication. I did not investigate the LLVM IR format at all, for example.

Maybe we can even tweak our serialization formats to be more amenable to de-duplication as well.

2 Likes

I think your interpretation is correct, but I should clarify that the redundancy I had in mind was not within a single file, but across multiple files.

Take “implicitly-discovered, explicitly-built” Clang modules as an example (talk). For caching purposes, the current implementation of the dependency scanner creates lightweight “scanning module files” for discovered modular dependencies. When it’s given two command lines, one with -DMACRO=1 and the other with -DMACRO=2, it needs to create two module files just in case the different macro values somehow affect the semantics of the module. More often than not, such command-line options are benign, and the scanning module files end up being almost identical. This is where de-duplicating across files would prove a big win.

1 Like

I’ve been thoroughly exploring LLVM documentation and related resources, particularly focusing on bitstream encoding.

In my research, I’ve encountered various key points regarding block handling in bitstream encoding, particularly those related to deduplication:

  • MODULE_BLOCK (Code: 8): Contains overarching module-level information.
  • PARAMATTR_BLOCK (Code: 9): Enumerates parameter attributes.
  • PARAMATTR_GROUP_BLOCK (Code: 10): Describes the attribute group table.
  • CONSTANTS_BLOCK (Code: 11): Describes constants for a module or function.
  • FUNCTION_BLOCK (Code: 12): Describes a function body.
  • VALUE_SYMTAB_BLOCK (Code: 14): Describes a value symbol table.
  • METADATA_BLOCK (Code: 15): Describes metadata items.
  • METADATA_ATTACHMENT (Code: 16): Contains records associating metadata with function instruction values.
  • TYPE_BLOCK (Code: 17): Describes all types within the module.
  • STRTAB_BLOCK (Code: 23): Represents the bitcode file’s string table.

I have a specific query regarding the deduplication of TYPE_BLOCK within the bitstream. Suppose we encounter two bitcode files with identical TYPE_BLOCK content. If one of these TYPE_BLOCKs is already present in the CAS, how does the system recognize that an identical block exists (and similarly for records)? Is there internal logic within the CAS that facilitates this recognition, or would such logic need to be implemented explicitly?

Additionally, I’ve been investigating methods to handle identical block content within the Bitstream. I’ve considered a method where each bitstream block is created as a CAS object, with each record inside the block also being a CAS object, allowing for the deduplication of records across multiple bitstream files through references within the Block CASObject. Could you confirm if my understanding aligns with best practices, or if there are alternative methods I should explore further?

Your insights and guidance on these matters would be greatly appreciated.

The problem of deduplicating blocks efficiently is handled by the CAS, so it is not part of this project. The downstream branch has a very efficient CAS implementation that works on most file systems and provides a good baseline for this project.

In general, you don’t need to worry about whether the block exists before storing it. The CAS guarantees that only one copy exists: storing an existing block results in no extra storage and returns the same ID as the previous insert.

3 Likes

Adding more details to the project to help break down the task.

One of the main goals of the project is to create a close-to-drop-in-replacement CAS-based bitstream reader/writer so that current users of the LLVM bitstream can get the benefits of CAS storage. The main candidate for such a user is ASTWriter/ASTReader in Clang.

Some aspects of LLVM bitcode map easily into the CAS world (e.g. Block), while other properties will be much harder (e.g. Cursor). The biggest discrepancy is that a bitstream is a contiguous flat layout while CAS objects are tree-like, so something like a global offset is hard to map directly. With the current abstractions of BitstreamWriter and BitstreamReader, the Reader will take more effort to model and implement.

The serialization format of the bitstream is mostly defined by the client (e.g. ASTWriter), but the CAS streamer has some freedom in how it lays out blocks/records. Passing control of the CAS layout to the client is also an option. We will explore the benefits of CAS streaming by tweaking Clang’s ASTWriter and its CAS layout to maximize the space savings relative to the serialization overhead.

2 Likes

Hi Steven, Can you please clarify if my understanding is correct?

Since different clients have their own serialization format for the bitstream (the content and structure of records/blocks), we want to configure the CAS layout from the client side in this new BitStreamCASWriter by passing some extra parameters. This helps us deduplicate better, as the CAS layout can be specialized to the calling client.

Thanks!

Correct.

Let me clarify the goal a bit more. From my point of view, if we can achieve good results with no or minimal API changes to the Writer/Reader, it would be a big plus. That means we can adopt CAS with little effort in the existing codebase. However, it is also an option to update or even redesign the current APIs to make it work. That would likely further increase the project size.

1 Like

If possible, could you please share a code pointer to this experiment (if it’s open source)? It might help while writing the proposal. I looked at the llvm/llvm-project and apple/llvm-project repos and could not find it.

Thanks!

I managed to find the patch on my other computer. It’s basically just dumping the sizes and hashes of various parts of PCM file to stdout.

ast-duplication.patch (2.1 KB)

1 Like