Experiments with a replacement for the bitcode format

Hello,

Disclaimer: I have no plan to break 3.x bitcode compatibility in any way.

I’m experimenting with ideas to improve the current bitcode serialization and deserialization in general.
I’m interested in improving the speed and the flexibility of the reader/writer, as well as making the code easier to deal with.

My main grievances with our bitcode format are the lack of built-in random access (which prevents efficient lazy loading), and an implementation that does not cleanly separate the high-level structure of the serialization from how it is emitted (and compressed). On the speed side, the bit-level management of records also has non-negligible overhead right now.
The two key issues with the current format that prevent random access are:

- Abbrevs are emitted lazily in the stream and can be interleaved with records, so you have to scan the whole block to gather the context needed to read a record.
- Because records have variable size, you need to parse them in sequence to find the “interesting” one.
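To make the second point concrete, here is a toy variable-width encoding in the spirit of bitcode's VBR chunks (this is a simplified sketch, not LLVM's actual implementation): because each value's length depends on its magnitude, record boundaries are data-dependent, and reaching the k-th value means decoding everything before it.

```python
def vbr_encode(value, width=6):
    """Encode a non-negative int in width-bit chunks; the top bit of
    each chunk says whether another chunk follows."""
    payload = width - 1
    chunks = []
    while True:
        chunk = value & ((1 << payload) - 1)
        value >>= payload
        if value:
            chunks.append(chunk | (1 << payload))  # continuation bit set
        else:
            chunks.append(chunk)
            break
    return chunks

def vbr_decode(chunks, pos, width=6):
    """Decode one value starting at chunks[pos]; return (value, next_pos)."""
    payload = width - 1
    value, shift = 0, 0
    while True:
        chunk = chunks[pos]
        pos += 1
        value |= (chunk & ((1 << payload) - 1)) << shift
        shift += payload
        if not (chunk >> payload):  # continuation bit clear: done
            return value, pos

# There is no way to compute the position of the third value from its
# index alone; you have to decode the first two to find it.
stream = []
for v in [5, 1000, 7]:
    stream.extend(vbr_encode(v))
pos, values = 0, []
while pos < len(stream):
    v, pos = vbr_decode(stream, pos)
    values.append(v)
assert values == [5, 1000, 7]
```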

I just finished a quick prototype using Google FlatBuffers ( http://google.github.io/flatbuffers/ ). This is a solution that provides “zero-copy” serialization/deserialization by avoiding variable-length integers and aligning every field appropriately. The structure of the serialization is described using a DSL, and the serialization/deserialization code is generated from this description.
My prototype is a naive description of the IR, but complete enough to handle all the .ll files present in the LLVM validation, and to round-trip an 800 MB file from the LTO link of Clang itself.
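The “zero-copy” idea can be illustrated with fixed-width, aligned records (the layout below is invented for illustration, not the real FlatBuffers wire format): fields live at offsets computable from the index alone, so any record can be read straight out of the buffer with no upfront parse.

```python
import struct

# Hypothetical fixed-width record: opcode, operand count, first operand.
RECORD = struct.Struct("<IIQ")

def write_records(records):
    buf = bytearray()
    for r in records:
        buf += RECORD.pack(*r)
    return bytes(buf)

def read_record(buf, index):
    # Random access: the offset is index * record size, nothing to decode.
    return RECORD.unpack_from(buf, index * RECORD.size)

buf = write_records([(1, 2, 100), (7, 0, 0), (3, 1, 42)])
assert read_record(buf, 2) == (3, 1, 42)  # records 0 and 1 never parsed
```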

The result is that the extra flexibility given by random access helps beyond lazy loading: for instance, loading metadata is usually an issue because of forward references, while random access allows the reader to “recurse” on the operands when loading a metadata node. This simplifies the algorithms on both the reader and the writer.
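The “recurse on the operands” pattern can be sketched as follows (the node table is a made-up stand-in for the metadata block): because any node can be fetched by index, the reader resolves operands on demand instead of emitting forward-reference placeholders and patching them afterwards.

```python
# Hypothetical metadata block: index -> operand indices (acyclic graph).
# Node 0 references nodes 2 and 1, which appear later in stream order.
nodes = {
    0: [2, 1],
    1: [2],
    2: [],
}

cache = {}

def load(index):
    """Materialize a node, recursing into operands first; random access
    makes this possible even for operands later in the stream."""
    if index in cache:
        return cache[index]
    ops = tuple(load(i) for i in nodes[index])
    cache[index] = ("node", index, ops)
    return cache[index]

root = load(0)
assert root[2][0] == ("node", 2, ())  # operand was fully resolved first
```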

On the drawback side, FlatBuffers seems to target small use cases (not many different types of records, and not large buffers). The implementation is not fully optimized; for example, 40% of the writer time is spent on “vtable” (the equivalent of our abbrevs) deduplication in some of my test cases.
Also, several obvious size optimizations are not possible with current FlatBuffers: my prototype currently emits files that are ~2 times larger than our bitcode format. By working on the schema I can probably get this down to 50%, but that may still be too high for us (especially considering that some of these limits are inherent to the FlatBuffers implementation itself).

So my current thought is that we could still learn from other recently developed formats and bring some of their ideas into ours where it makes sense. For example, we don’t need random access into the instruction array, but it would be valuable for the metadata. We’d want zero-copy for a symbol table (see https://llvm.org/bugs/show_bug.cgi?id=27551 ) but not for the instruction stream.
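A zero-copy symbol table of the kind that bug asks for could look like this (the layout is invented for illustration): fixed-width entries sorted by name let a reader binary-search the raw buffer directly, without deserializing the table at all.

```python
import struct

# Hypothetical entry: fixed 16-byte name, 8-byte offset into the module.
ENTRY = struct.Struct("<16sQ")

def build_table(symbols):  # symbols: {name: offset}
    buf = bytearray()
    for name in sorted(symbols):  # sorted so lookup can binary-search
        buf += ENTRY.pack(name.encode().ljust(16, b"\0"), symbols[name])
    return bytes(buf)

def lookup(buf, name):
    key = name.encode().ljust(16, b"\0")
    lo, hi = 0, len(buf) // ENTRY.size
    while lo < hi:  # binary search over raw, fixed-width entries
        mid = (lo + hi) // 2
        entry_name, offset = ENTRY.unpack_from(buf, mid * ENTRY.size)
        if entry_name == key:
            return offset
        if entry_name < key:
            lo = mid + 1
        else:
            hi = mid
    return None

table = build_table({"main": 0x1000, "foo": 0x2000, "bar": 0x3000})
assert lookup(table, "foo") == 0x2000
```

The point is that lookup cost is O(log n) buffer reads with zero allocation, versus parsing every record of today's symbol-table blocks.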

I’m also interested in other people’s stakes in the bitcode format (beyond backward compatibility): what tradeoffs are you willing to make between size and speed/flexibility, for example?

What are the most common IR entities that need fast random access?

Do you have read/write time comparison data (for sequential and random
access)?

A hybrid format (using fixed-length representations or indirection for the parts that need fast random access, with compression for the rest) seems promising. Why can't backward compatibility be kept?

David

It really depends on the use case: if you want to load a full module at once, we can get away without random access for everything but metadata.
The only things that have cycles are named types and instructions, but these have fast RAUW support anyway, and the writer can lay them out in an order such that, most of the time, operands are parsed before their users.
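That “operands before users” layout amounts to a post-order walk over the def-use graph (the `uses` map below is a made-up stand-in for a module's acyclic values, which is the common case once the cyclic entities are set aside): every value is emitted after its operands, so the reader never sees a forward reference.

```python
# Hypothetical def-use edges: value -> the operands it uses.
uses = {
    "ret": ["add"],
    "add": ["x", "y"],
    "x":   [],
    "y":   [],
}

def layout(roots):
    """Emit values in an order where operands precede their users."""
    order, seen = [], set()
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for op in uses[v]:
            visit(op)       # operands first...
        order.append(v)     # ...then the user
    for r in roots:
        visit(r)
    return order

order = layout(["ret"])
assert order.index("x") < order.index("add") < order.index("ret")
```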

For lazy loading or selective loading, random access helps a lot for everything but instructions (if you materialize a function, you read all of its instructions anyway).

Keeping in mind that I have a naive implementation (in both size and speed): the writer is 10% slower (but spends 40% of its time in the not-so-smart FlatBuffers vtable deduplication).
The reader can be ~40% faster for loading a full module (the LTO-merged module of Clang with debug info), and much faster still when you want to load only one function from the module.
The current lazy-loading scheme also forces us to create a declaration for every global/alias/function and to load all the global constant expressions. (Duncan did a bunch of work this year to help ThinLTO on this issue by storing more records nested inside the function blocks rather than at the global level.)

I don’t plan to lose backward compatibility in any way. Was your question intended to be “why can’t backward compatibility be dropped?”, or did you misread my disclaimer at the top of the email?

Ok. The reason I asked is that by understanding the most important use cases, it is possible to come up with a design that maximizes the benefit with very low overhead (e.g., size increase).

With an un-tuned prototype, those results sound pretty good to me. Besides, reader performance is presumably more important than writer performance.

No, I'm not proposing to drop backward compatibility; I simply misread your last sentence.

David

So, talking about important use cases, my Grand Plan is ultimately to “kill” the IR Linker in favor of a “Bitcode Linker”: being able to link / materialize all or part of a module in bitcode directly into an existing module (i.e. without first creating a separate Module and then moving IR from one Module to the other). Of course this implies reworking the Reader from the ground up, and it'll take a few months :)

I believe we should be able to achieve this without regressing the existing write/load time for the “non-lazy” case, and for a “not so much” increase in size.

In general, yes, I care more about the reader than the writer, but keep in mind that in an LTO build, every module that gets read had to be written first :)