Bitcode format

Greetings,

I am working on a project (unrelated to LLVM) that needed a
bytecode-like format. I found Bitcode and it seems to fit the bill
really nicely.

I am writing an independent implementation of Bitcode in C (I really
want to keep my runtime pure C). In the process of doing this, I've
discovered a few things I want to ask about.

I notice the Bitcode format documentation [0] is somewhat incomplete --
there have been a few questions I had to resolve by looking at LLVM
source code. This is totally understandable (I'm actually impressed
that it's documented so well for something so new), but I wonder if you
would accept a patch that clarified some things. For example, the
document doesn't mention endianness at all, but from the source code I
discover that bytes are read in order and bits are read
least-significant first.

I also have a few questions about the format:

- it appears that the only magic number in the file is
  application-specific. This seems unfortunate, because it means that
  application-neutral tools cannot be built that process bitcode files,
  since they could not reliably detect that the file is a bitcode file.
  It might seem like there is little room for application- neutral tools
  since almost all the data in the file is application-specific, but off
  the top of my head I can think of a few, like a tool to suggest
  abbreviations that would give a file better compression.

- the LLVM code assumes that several VBR fields can be at most 32 bits
  (block ids, number of elements in an array, etc). These assumptions
  seem quite reasonable: can they be considered part of the format and
  added to the document?

Cheers,
Josh

[0] LLVM Bitcode File Format

Hi Joshua,

Greetings,

I am working on a project (unrelated to LLVM) that needed a
bytecode-like format. I found Bitcode and it seems to fit the bill
really nicely.

I am writing an independent implementation of Bitcode in C (I really
want to keep my runtime pure C). In the process of doing this, I've
discovered a few things I want to ask about.

Okay, cool :slight_smile:

I notice the Bitcode format documentation [0] is somewhat incomplete --
there have been a few questions I had to resolve by looking at LLVM
source code. This is totally understandable (I'm actually impressed
that it's documented so well for something so new), but I wonder if you
would accept a patch that clarified some things. For example, the
document doesn't mention endianness at all, but from the source code I
discover that bytes are read in order and bits are read
least-significant first.

Absolutely, patches are always welcome.

I also have a few questions about the format:

- it appears that the only magic number in the file is
  application-specific. This seems unfortunate, because it means that
  application-neutral tools cannot be built that process bitcode files,
  since they could not reliably detect that the file is a bitcode file.
  It might seem like there is little room for application- neutral tools
  since almost all the data in the file is application-specific, but off
  the top of my head I can think of a few, like a tool to suggest
  abbreviations that would give a file better compression.

Wouldn't those abbreviations imply some application-specific knowledge
of the format? I can't think of any situation where a generic tool could
do anything other than outline the contents of the file by looking at
the blocks, e.g. llvm-bcanalyzer.

- the LLVM code assumes that several VBR fields can be at most 32 bits
  (block ids, number of elements in an array, etc). These assumptions
  seem quite reasonable: can they be considered part of the format and
  added to the document?

I don't see why not. The llvm bitcode documentation is specific to llvm
anyway so the limits should be defined.

Thanks for your interest, Joshua.

Reid.

Reid Spencer <rspencer <at> reidspencer.com> writes:

Hi Joshua,

Hi, thanks for the reply.

> I also have a few questions about the format:
>
> - it appears that the only magic number in the file is
> application-specific. This seems unfortunate, because it means that
> application-neutral tools cannot be built that process bitcode files,
> since they could not reliably detect that the file is a bitcode file.
> It might seem like there is little room for application- neutral tools
> since almost all the data in the file is application-specific, but off
> the top of my head I can think of a few, like a tool to suggest
> abbreviations that would give a file better compression.

Wouldn't those abbreviations imply some application-specific knowledge
of the format?

No -- unless I am mistaken about bitcode, abbreviations have no semantic
meaning. They are nothing but a space optimization, and needn't even be
exposed to the application (at least for reading).

I can't think of any situation where a generic tool could
do anything other than outline the contents of the file by looking at
the blocks, e.g. llvm-bcanalyzer.

I can think of several tools that could be application-neutral and very
useful:

- a tool that prints the hierarchy of blocks and their sizes in the file

- a tool that does more sophisticated space usage analysis. For
  example, how much space are the abbreviations saving? Could the file
  be more efficiently packed with different VBR choices? Or with
  different abbreviations? It could even spit out a file that is
  semantically identical but smaller.

- a tool that dumps selected blocks from the file (possibly giving you
  fine-grained control over what blocks, how many records from each
  block, etc).

- a tool for creating arbitrary bitcode files by reading textual input
  on stdin written in some language that is isomorphic to bitcode. This
  could be useful for things like constructing bitcode files that don't
  follow the expectations of your application, so you can test that your
  application handles such corruption gracefully.

I know there must already be bitcode files flying around, and so I can
see the difficulty of making any changes to the format. What if we just
said that BC (2 bytes) is the magic number for the Bitcode format, and
that 0xC0DE (2 bytes) is the application-specific magic number for LLVM?

> - the LLVM code assumes that several VBR fields can be at most 32 bits
> (block ids, number of elements in an array, etc). These assumptions
> seem quite reasonable: can they be considered part of the format and
> added to the document?

I don't see why not. The llvm bitcode documentation is specific to llvm
anyway so the limits should be defined.

Hmm, the Bitcode documentation [0] seems to contain both application-neutral
and LLVM-specific parts. Specifically, only section 4 of that document
seems specific to LLVM.

In any case, when I submit my patch to that document, I'll put any
of these assumptions that seem supremely reasonable in there and see
what the feedback is. I'm especially interested to hear from Chris,
since he wrote that document (and I assume designed the format).

Thanks for your interest, Joshua.

Thanks for encouraging my participation. And you can call me Josh. :slight_smile:

Josh

[0] LLVM Bitcode File Format

  the top of my head I can think of a few, like a tool to suggest
  abbreviations that would give a file better compression.

Wouldn't those abbreviations imply some application-specific knowledge
of the format?

No -- unless I am mistaken about bitcode, abbreviations have no semantic
meaning. They are nothing but a space optimization, and needn't even be
exposed to the application (at least for reading).

You're absolutely right, that is by design. :slight_smile:

I can't think of any situation where a generic tool could
do anything other than outline the contents of the file by looking at
the blocks, e.g. llvm-bcanalyzer.

I can think of several tools that could be application-neutral and very
useful:

Absolutely. The best way to think about the bitcode format is as an application-independent container format... just like XML. Unlike XML, it is designed to be a dense binary format, but it should be trivial to transform bitcode to XML and back (for example). There are lots of tools that work on arbitrary XML files so I don't see why there wouldn't be application independent tools for bitcode.

- a tool for creating arbitrary bitcode files by reading textual input
on stdin written in some language that is isomorphic to bitcode. This
could be useful for things like constructing bitcode files that don't
follow the expectations of your application, so you can test that your
application handles such corruption gracefully.

Yep. If this is your goal, I think it would be useful to add some new standardized records to blockinfo, which allows the file to give textual names to blocks. For example, a bitcode file could say that block id #1234 is the "foo" block. If converting to/from XML, you'd then print it as <foo>...</foo>.

I know there must already be bitcode files flying around, and so I can
see the difficulty of making any changes to the format. What if we just
said that BC (2 bytes) is the magic number for the Bitcode format, and
that 0xC0DE (2 bytes) is the application-specific magic number for LLVM?

Sure, this makes sense to me.

In any case, when I submit my patch to that document, I'll put any
of these assumptions that seem supremely reasonable in there and see
what the feedback is. I'm especially interested to hear from Chris,
since he wrote that document (and I assume designed the format).

It sounds like you're right on in all counts. Thanks for helping improving the documentation, I'm glad you find the format interesting and useful. In the future, I'll probably also extend it to encode tree structures more efficiently (important for ASTs in the new C frontend).

-Chris