[RFC] A binary serialization format for MLIR

Hi all, I’d like to propose the addition of a binary format for MLIR (i.e. an MLIR equivalent of LLVM bitcode). This is something that we have discussed many times in the past, but there was never a strong enough justification/need to have one (plus the general volatility of MLIR). That being said, I think the time is right for us to start building this out. We are starting to hit areas where we really want the benefits that a binary format brings to the table; namely serialization speed and size, mmap capabilities, more easily enabled versioning, etc. From my perspective (both from my work at Modular, and as a maintainer of various other things), I have at least two recent-ish things that sparked sending this RFC now (I could dig up more on request):

At Modular we have large swathes of data (e.g. ML weights) that want/need to live alongside the IR (e.g. see the previous discussion on adding “resources” to MLIR), which is extremely difficult/expensive to do with a textual format: firstly, the cost of transcoding from hex/base64 is significant (this data can be several MBs/GBs), and secondly, given that we can’t mmap from a text file, we have to allocate space for this data (which, as noted, can be huge).

For PDLL, we generate PDL (an MLIR dialect) that gets parsed/compiled/etc. at runtime. We currently embed this in .cpp source files using the textual MLIR format, but this ends up being slower to parse and creates larger source files.

Structure (likely the least controversial portion of the RFC?)

Given the very generic structure of MLIR, the binary format (from a high level) is actually fairly simple. The format really boils down to a few sections (don’t @ me too much here, there will of course likely be some additional things during the actual implementation, but these are the bigger ones):

  • Dialect name section

    • Containing referenced dialect/operation names
  • Attribute+Type section

    • A table containing each of the referenced attributes/types within the input IR. By default attributes/types can be encoded using their string form (which enables all user attributes/types to be supported out-of-the-box), but dialects can define explicit/optimal binary encodings if desired.
  • IR section

    • This section (which doesn’t necessarily need to be one section in the actual implementation) contains the actual IR, operations/blocks/regions/etc. The IR encodings are extremely simple given that all operations have the same underlying generic structure, think of the “generic” operation form in the textual format as an indicator of how simple this can be.
  • Resource section

    • This section holds any dialect/external resources.

That’s… kind of it (from a high level of course). There may be some other things like a string section to collate common strings, etc., but the general structure is just as above. Compared to other IR representations, the generality of MLIR makes the format on its own fairly simple to conceptualize.
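To make that layout concrete, here is a rough sketch of what the top level of the file could look like. Everything here (the magic value, section IDs, field ordering) is purely illustrative, not a committed design:

    // Hypothetical top-level structure of an MLIR bytecode file.
    enum SectionID : uint8_t {
      kDialectNames = 0, // referenced dialect/operation names
      kAttrType     = 1, // uniqued attribute/type table
      kIR           = 2, // operations/blocks/regions
      kResource     = 3, // dialect/external resources
      kString       = 4, // optional: collated common strings
    };
    // File = header, then a series of length-prefixed sections:
    //   [magic:4 bytes][version:varint]
    //   ([sectionID:1 byte][length:varint][data:`length` bytes])*

Length-prefixing each section would keep the format decomposable: a reader could seek to the sections it cares about and skip (or mmap) the rest.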

Encoding

This aspect for me is the most interesting and I think this likely merits the most discussion and scrutiny. From my perspective, our main options for encoding effectively boil down to either: bitcode (using LLVM’s bitstream), or a custom bytecode. I’ve got working prototypes of both that I’ve used to develop some preliminary opinions, but they aren’t of a quality to share at this point (and I also want to make sure we agree on a direction before expending a bunch more effort cleaning up a specific path).

Bitcode (Using LLVM’s Bitstream)

Bitcode is likely what most people would expect from an MLIR binary format, given the precedent of LLVM’s bitcode. The benefit of using a Bitcode/Bitstream encoding is that many encoding complexities are already taken care of (e.g. endian conversions, VBR values, records/blocks, etc.), the underlying infrastructure is well tuned and tested, and there is precedent within LLVM for using it. Given that precedent, Bitstream seems like a fairly solid choice for an encoding.

Custom Bytecode

While it’s always good practice to reuse (or at least try to reuse) known-good formats when possible, given all of the complexities/subtleties that inevitably arise, we should also consider what would be best for our particular use case. In my experience so far, Bitcode/Bitstream isn’t as good a fit for MLIR as it is for LLVM, which creates some interesting complexities when trying to map MLIR onto it. For example, the concept of a “Block” in bitstream seems appealing for modeling things like regions, but “Blocks” are heavy enough that IR with lots of regions quickly bloats in both write/read speed and size. Another semi-problematic area (at least in my testing for MLIR) was the use of “bits” as the boundary for the encoding: every value emitted or read requires fiddling with bit boundaries. When prototyping, I wasn’t actually able to get the bitstream encoding to be smaller or faster than a VBR-based bytecode encoding (even after trying various tricks: tracking the sizes of indices, using VBR, etc.). Finally, and anecdotally (i.e. not something I would lean on strongly either way), it takes a bit of time to understand bitstream “abbreviations”. It may just be a matter of me not having looked at them in forever, but if we use bitstream we will be requiring users to understand abbreviations as well (if they want more optimal attribute/type encodings, that is). This could create friction with users, though we could combat that with effective examples and documentation.

Which to use?

There are pros and cons to both, but my slight leaning would be to use a bytecode encoding instead of reusing bitstream. My bytecode prototype ended up being faster, produced a smaller encoding (though there is likely a way bitcode could do better with more magic?), and was easier to understand/extend, both from the internal-implementation and the dialect-interface point of view. This is my anecdotal experience leading up to this RFC, so it would be nice for others to provide their perspective (I’m sure there are plenty with a preference for either side).

A binary encoding means MLIR is stable now, right?

Well, no. This RFC is explicitly not about establishing/enforcing stability guarantees within MLIR. A binary format definitely makes some aspects easier, and whatever we land on will be set up to work with some stability guarantees, but this is an explicit non-goal from the start. Stability in MLIR requires a much broader discussion, mostly because the more important thing there is not really the format encoding, but agreement from dialect owners that they want to support stability and all that it entails (it doesn’t matter how stable the encoding itself is if the IR goes out of date with no upgrade path). Stabilizing MLIR would also be a good time to think about “breaking” changes to representational constructs that we want to push through before committing to anything. Either way, I don’t foresee it being difficult to build stability guarantees (e.g. upgrade on import) on top of the format; it’s an inevitable goal, just not an initial one. Not stabilizing initially also gives the binary format room to stabilize itself, as we evolve it and figure out more optimal encodings/tune it to be useful in the general case.

Would love everyone’s thoughts here. This is a significant addition to the infra and will have long-term, wide-reaching impacts.

– River

14 Likes

I think this is a good idea. I’m not sure how much time we usually spend parsing and producing text, but your motivation for a binary format seems quite solid to me.

A few questions/comments:

One of the major differences between LLVM IR and MLIR (in this discussion) is that an MLIR module can contain a wide range of dialects, including downstream ones. Dialects can also change instructions (number, semantics, shape) much more easily than MLIR overall, which is already unstable on its own.

But this is true for the text format as well, so, my assumption here is that a binary form will be as stable as the environment that reads it or writes to it.

This may look limiting, but it also frees you to simplify the dialect/instruction representation from a unique ID per dialect/instruction to an auto-incremented numeric ID on a first-come-first-served basis, and to simplify the list of IDs to an array (where the index is the instruction ID).

Is that a fair assessment?

Honestly, I think this can use the same mechanism as above. And we could even encode this similarly to UTF-8 or ULEB128, where the size is variable depending on the number of instructions/attributes and, say, the first byte encodes what it is (inst/attr/type), another byte has the dialect ID, and the remaining bytes are its index in the table above.

You only store the string representation once, in a compact serialised form, in a header, and that’s it.

I worry about the special parsing rules from an instruction’s ODS definition. In textual form, you propagate the actual representation that will be parsed by the special rules, but in binary form, where everything is function-based, how do you reconstruct the textual form afterwards?

I’m not sure this is valid, but if we had a “generic” binary reader for all instructions (without user intervention?), then creating text from binary would be just a matter of reading the binary and dumping the text.

In your experiments, did you conclude that no matter how complex the parsing is, we can always convert it to a “simple binary form” in a way that no user will ever have to care about it?

And this is the meat of your proposal, but being dialect/tool specific, it can literally be a memory dump and the “IR format” will not care at all.

If we restrict these cases only to the run-time dialects, then it should be fine. Or at least, not worse than keeping everything else as text.

And unless we come up with a stable binary representation for PDL, which would be at odds with all other dialects, I can’t see how we could have a binary run-time representation either.

Thanks for making this an RFC!

Is this something independent of the (for lack of a better phrase :slight_smile: ) container (a la Comparison of data-serialization formats - Wikipedia) that would be stored in one of those, or is this meant as the format? E.g., here it seems positioned as “use LLVM’s or create our own”, but there seems to be a 3rd option to be discussed (e.g., why not flatbuffer? :slight_smile: ).

I think you’ve highlighted one of the aspects you want here (direct memory mapping) and relative compactness (sounds like there is a preference for giving up some size for performance); are there other considerations by which you (or folks in the community) are evaluating the trade-offs?

Could we slap in a version number close’ish to the start, even if not stable, to at least be able to fail with a good error message? (speaking as someone who keeps having to answer “why did this fail to parse/verify?” :slight_smile: [the verify failure is also a parse failure, really])

It sounds like you’ve prototyped each far enough to be able to evaluate them. Is there perhaps a benchmark we could extract some stats from? (I was thinking something like an ML model in TFG format with region-based control flow could give a data point as to the space used, and by what, to quantify the bloat you mention.)

This would just be a print: the textual form is still captured in the printers associated with the ops; AFAICS this format does not remove the need to have ops registered.

1 Like

I don’t think ops are the issue here: the binary serialization should be like the generic textual format from this point of view.
The more interesting part is rather types and attributes which don’t have a generic form.

Didn’t we have some issues with bitstream in the context of ThinLTO? I forget the details, but the bit-level-ness proved at odds with desires for mmapability or something?

When comparing size on disk, it is probably interesting to look at how compression plays into everything. I guess for some use cases the on disk / over-the-network size is the most important, while for others fast encoding/decoding is most important. I somehow recall bytecodes compressing better?

Overall my sense is that rolling our own bytecode, or using an existing thing like flatbuffer (mind the 4GB limit) or capnproto, would be best and simplest.

Right, to some extent, regardless of how stable the format is, we still need each of the dialects to commit to stability. This is part of what I was trying to convey in the “stability” section. We can set up the interfaces/format/framework for supporting stability, but at the end of the day this is largely up to the individual dialects.

Yes. In the serialization itself, we can assign a unique ID for every encountered dialect and operation name. The ID assignment is purely for that specific serialization though, meaning that it bears no correlation to the in-memory form. We do want the format to be “stable” at some point, but just like the textual form we don’t want to be unable to parse if, e.g., a new operation was added to a dialect (that shouldn’t break anything serialized previously). The same goes for an operation that was deleted but wasn’t within the serialized IR. We don’t make assumptions about the order of operations within a dialect, and try to keep things self-contained/local to the serialized IR.
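As a minimal sketch of that first-come-first-served numbering (the helper struct here is hypothetical, and error handling is omitted):

    #include "llvm/ADT/StringMap.h"
    #include <vector>

    // IDs are purely per-serialization: assigned in encounter order, and
    // written out as a table (index == ID) in the dialect name section.
    struct NameNumberer {
      llvm::StringMap<unsigned> ids;
      std::vector<llvm::StringRef> table;

      unsigned getOrAssign(llvm::StringRef name) {
        auto [it, inserted] =
            ids.try_emplace(name, static_cast<unsigned>(table.size()));
        if (inserted)
          table.push_back(it->first()); // key storage is owned by the map
        return it->second;
      }
    };

Since the table is rebuilt on every write, adding new operations to a dialect never invalidates previously serialized files, and deletions only matter if the deleted op actually appears in the serialized IR.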

Yep. The unique’d aspect makes this extremely simple: encoded operations simply reference the IDs of the attributes/types that they use.

Operations in MLIR have a unified generic form, meaning that we can serialize every user operation with zero work from the user themselves. This means that we only really need to define a single encoding for operations, and it works everywhere. As @mehdi_amini mentioned, the interesting aspect here is really the attributes+types, which have no generic encoding, so we either use the string form (i.e. the current textual form) or an encoding explicitly provided by the dialect. For the string-form fallback we are literally calling back into the MLIR parser, so regardless of the complexity of the format we can still parse it. Given that attributes/types are “uniqued”, we also don’t need to worry about any of the ODS-related complexities of operations.
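To illustrate how little per-operation work the generic form requires, here is a hedged sketch of what emitting a single operation could look like. The Writer type, its varint/ID helpers, and emitRegion are all hypothetical; only the mlir::Operation accessors are real:

    #include "mlir/IR/Operation.h"

    // Every field is just an index into a previously written table.
    void emitOp(Writer &w, mlir::Operation *op) {
      w.emitVarInt(w.getOpNameID(op->getName()));         // dialect name section
      w.emitVarInt(w.getAttrID(op->getAttrDictionary())); // attr+type section
      w.emitVarInt(op->getNumOperands());
      for (mlir::Value operand : op->getOperands())
        w.emitVarInt(w.getValueID(operand));
      w.emitVarInt(op->getNumResults());
      for (mlir::Type type : op->getResultTypes())
        w.emitVarInt(w.getTypeID(type));
      w.emitVarInt(op->getNumSuccessors());
      for (mlir::Block *succ : op->getSuccessors())
        w.emitVarInt(w.getBlockID(succ));
      w.emitVarInt(op->getNumRegions());
      for (mlir::Region &region : op->getRegions())
        emitRegion(w, region); // recurse
    }

This is essentially the generic textual form, flattened into table indices.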

Yeah, one of the things about resources is that the serialized forms they can take are limited. This means that in the encoding we can define the most optimal ways to store the resources, with no additional work for users there. The interface for providing resources is the same regardless of whether we are using text or binary.

Right. One thing about this use of PDL (w.r.t. PDLL) is that it is effectively similar to a tablegen invocation (PDLL doesn’t use tablegen, it’s a custom language, but the way it works here is similar). The source containing the PDL is generated at build time, meaning that stability isn’t as much of a problem there given that we are using the locally built tooling to generate the IR in the first place.

– River

It’s true, there is definitely a third possibility of using a different pre-existing format. If people have strong contenders, please push them forward. I didn’t consider many too strongly, given that having our own brings various benefits: simplicity, both in code complexity/maintainability and in not needing to bring in a new external project dependency; support for arbitrarily sized files, i.e. not being limited to 4GB; support on the compilers/platforms we target, e.g. capnproto (mentioned above) requires a minimum of clang 6.0, which is higher than the new minimum for LLVM when we hit 16.0 (we could try to bump the upcoming minimum to clang 6.0, but that would require an RFC discussion with the broader community, or we could hope that capnproto compiles with a lower version); and flexibility to evolve/better suit our needs.

Yeah, the prototypes I’ve had so far all have a version field. We could bump the version with all big changes, but only claim “stability”/“upgrade” support after a certain version number (when we get to that point).

The tests I’ve conducted so far use IR pulled from the tests of each of the upstream dialects (combined into a larger ~100MB file), plus some larger tests from CIRCT (there are a few “large” tests there used for perf tracking), and whatever else has been lying around on my machine. Happy to have other test cases to guide different edge cases (regardless of what we end up with).

– River

Would this just be special handling for StringAttr or do you have something else in mind?

Awesome, I’m thrilled you’re pushing on this. This will also be helpful for folks in the CIRCT community, some of whom have very large designs. The .mlir parser being single-threaded (and slow in general) is a significant performance problem that prevents using .mlir files in any significant workflows. This is an issue in systems that have .fir files and Verilog files as the alternatives; both are pretty problematic. FWIW, CIRCT has a parallelized .fir file parser, and it would be good to consider that as well for this.

I’d advocate strongly for /not/ using LLVM bitstream for MLIR. I’ve shared the origin story of that with many folks, but it was a different time and world than what MLIR needs. LLVM doesn’t have a simple and uniform representation like MLIR does, so it needed a bunch of clever stuff to define a logical AST and be able to decode and encode it in a relatively stable way. LLVM .bc files were also designed to be “small” rather than fast to read and write, which the bitstream format is really, really not. The bitcode format also has various fragile context-sensitive encoding things going on, and (as you say) its block abstractions are not what MLIR needs/wants.

I’d recommend going for a very simple byte-centric encoding, using a bespoke zero-dependency one rather than adopting a new dependency. There are many simple VBR encodings that are well known (I’ve probably written a dozen of them for various projects over the years, it isn’t difficult) and this gives us a lot of control.
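For reference, a byte-aligned LEB128-style VBR of the kind being suggested really is only a few lines (a generic sketch, not a proposed wire format):

    #include <cstdint>
    #include <vector>

    // 7 payload bits per byte; the high bit marks "more bytes follow".
    void encodeVBR(std::vector<uint8_t> &out, uint64_t value) {
      while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
      }
      out.push_back(static_cast<uint8_t>(value));
    }

    uint64_t decodeVBR(const uint8_t *&p) {
      uint64_t result = 0;
      for (unsigned shift = 0;; shift += 7) {
        uint8_t byte = *p++;
        result |= uint64_t(byte & 0x7f) << shift;
        if (!(byte & 0x80))
          return result;
      }
    }

Unlike bitstream’s VBR, every read and write here stays byte-aligned, which presumably accounts for much of the speed difference River observed.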

One thing you don’t mention that is also useful: it is helpful for some clients to be able to lazily deserialize “function bodies” on demand, e.g. for LTO-like use cases. Have you considered how to support this? Beyond the encoding question (which I think is the simple part) there is the question of how the API for the loader works.

-Chris

Does this imply the need for some sort of index within the MLIR binary file format?

I could be on board with this and consider the choice to be between bitcode or bespoke. IMO, the main thing in favor of bitcode is that we already have it, but I only have experience with it as a black-box user of the artifacts, not in a position to critique the implementation.

I would not use flatbuffers/capnproto/etc for something like this. I could come up with a few reasons but most are subjective. Solidly, I just don’t want to be fighting with 64-bit support again, and would discard anything from that heritage (we just finished adapting our fb-based setup to 64-bit because of size challenges – no thank you). Also, MLIR is a relatively uniform “meta schema” already. My experience describing such “meta schemas” in those tools is not good (as opposed to concrete entities): it just adds layers of indirection that are neither size, speed, nor cache efficient (to say nothing of obtuse code).

I think we need to prioritize speed to read, mapped access efficiency, speed to emit, and size in that order. Keeping the door open to lazy loading of nested IsolatedFromAboves is really interesting and I would mainly be looking now at whether the binary format boxes us out of that.

One clever hack that we’ve exploited with similar things is to append a zipfile trailer to a bespoke binary format and expose independent sections and resources as “files”. In MLIR, maybe those are each IsolatedFromAbove and resource (or some other indication of “top-level-ness”). This is purely a debugging aid (the tools just emit it/don’t rely on it): it gives a nice way for users to break the thing apart for inspection. We’ve found this helpful with everything from large resources (i.e. constants) to embedded binaries… There is something compelling about answering support questions for debugging help with “can you unzip it and send me file X” vs bespoke extraction tooling (which tends to be the limiting factor with custom formats). Note that this is purely an add-on from a file-format perspective, although it does encourage a “Russian doll like” decomposability of the format that is often the right thing for the main priorities as well.

Perhaps, but it depends on how you implement it. Given a region-tree-like structure, the encoding of a region would just need a “this is how big I am” field to allow the reader to skip over the body. The reader could then produce its own index on the fly.
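A hedged sketch of that reader-side behavior (the Reader interface here is entirely hypothetical):

    // Each region record carries a "this is how big I am" prefix, so a
    // lazy reader can record the offset and skip the body entirely.
    void readOrSkipRegion(Reader &r, bool lazy) {
      uint64_t numBytes = r.readVarInt();
      if (lazy) {
        r.recordLazyRegion(r.currentOffset(), numBytes); // materialize later
        r.skip(numBytes);
        return;
      }
      readRegionBody(r, numBytes);
    }

The same trick composes recursively, which fits the “Russian doll like” decomposability mentioned above.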

Totally agreed. Such a path isn’t likely to fly with the wider LLVM community as a dependency for a number of reasons.

I think we need to prioritize speed to read, mapped access efficiency, speed to emit, and size in that order.

+1. As @_sean_silva mentions above, folks who care a LOT about file size at the expense of access speed can always do an additional compression phase (gzip or whatever).

Right. I think that River’s orthogonal work to allow large BLOB data to be attached to MLIR files will help out with this. The two efforts compose nicely of course.

-Chris

What is the thinking on debug info here? Is the plan to treat it as a regular attribute and load it into memory/etc, or will there be some special handling to keep it nicely isolated in a side table / allow lazy loading?

I would hope that the table of global attributes is also lazily loaded, so that a partial load of the IR only pulls in the minimum number of attributes.
The same should naturally apply to debug info (you mean locations?).

I mean, if we have an op like this which has been the result of a bunch of inlining:

my.addi %0, %1 : ... loc(callsite(callsite(...("/path/to/foo"))))

My current understanding is that we will have N malloc’ed objects, one for each nested “callsite”. So I’m wondering if we can deserialize this operation with zero computation spent deserializing the loc. Maybe there is enough sharing (flyweight pattern) of debug-info locations across ops that this is not an issue in practice, though.

I seem to recall in LLVM that we had to be super careful about debug info and want to be sure that we avoid the same issue here.

I’m not really concerned about the serialization format itself; it is possible to lazily load attributes (e.g. based on the IR actually being loaded). The interesting thing here is what actually happens when we load the operations, which is something that could be built later (based on need/a real use case) without affecting how we actually store the IR.

I’m just reiterating that anything that would help with improved serialization/deserialization of MLIR would be useful on the CIRCT side. The circt/perf “large” regression tests are several orders of magnitude smaller than what we are dealing with in practice today. A “large” internal design is > 2GiB of FIRRTL text (which has no type annotations and is much smaller than the equivalent, verbose FIRRTL dialect / MLIR representation). This is reasonable to parse with the CIRCT parallel FIRRTL parser, but MLIR serialization or parsing isn’t tractable, which has the knock-on effect of making debugging tedious (something like -mlir-print-ir-after-all is a complete non-starter :joy:).

3 Likes

This all sounds very exciting, and it sounds like everyone is on board with a custom bytecode (also +1 on the priority order for the format).

Finally catching up with this RFC. Reading it brought up several experiences we’ve had while working on TOSA serialization, with particular emphasis on ML-accelerator targeting. Sharing some of those insights and thoughts:

  • Choice of formats

TOSA serialization supports flatbuffers and JSON. The former made sense because the fundamental stability intent of the dialect could be expressed by aligning the dialect and serialization schemas, and also because it was quick to get working. But we’ve hit some pain with fb 1.x vs 2.0 lately, and also trouble compiling flatc-generated content on clang vs gcc. JSON is really just a more human-readable re-expression into a more ‘platform independent form’ for the reference model, not fundamentally a serialization.

The choice of any performant serialization format that satisfies the priorities in the order listed (which sounds appropriate), plus explicit/optional attribute-encoding support (more further down), would be ideal.

  • Weight vs op stream encoding

We went back and forth on whether weights should be embedded within the serialized form, stored as links to external files (NumPy in our case), or even a suffixed trailer as @stellaraccident mentioned.

Our experience with backend compiler development is that op compilation and weight optimization are parallel efforts. For custom ML accelerators there’s a substantial amount of iterative effort into bespoke encoding tied to the underlying hardware design, and thus the effort splits off into work targeting the compute engine and work targeting the weight codecs. Subsequently there are the low level passes orchestrating the two.

So this means we’ve had reasons to like having the weight content sitting adjacent to the serialized IR as opposed to embedded within it: subsequent lowering stages enable us to quickly run metrics on compression ratios, code vs weight footprints, and more. Since the weight organization may evolve during dialect lowering, the proposed explicit/optional encodings for them are useful. For example, passes may implement a combination of pruning, clustering, and either block-based or VBR encodings for weights.

  • Target ElementsAttr?

Some internal experiments on end-to-end MLIR-based compilation of full models (ResNet etc.) targeting Arm Ethos accelerators indicated 70-75% of the runtime being spent on transcoding ElementsAttr to raw binary. That’s the structure that large weight attributes tend to start out in, and it somehow needs to be converted to a binary form. This one-time cost alone constituted the significant share of compile time, as we found. Presumably this proposal intends to define an mmapable form of these basic MLIR types too? Just a focused rearchitecting of this construct for speed and efficiency would be a substantial early win.
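For what it’s worth, a binary format should be able to reduce that transcoding to a raw write of the uniqued storage. A minimal sketch, assuming the attribute is a DenseElementsAttr whose underlying bytes are reachable via getRawData(), and ignoring endianness/alignment concerns:

    #include "mlir/IR/BuiltinAttributes.h"
    #include "llvm/Support/raw_ostream.h"

    // Write the underlying storage bytes directly, instead of printing a
    // hex string and transcoding it back later.
    void writeDenseElements(llvm::raw_ostream &os, mlir::DenseElementsAttr attr) {
      llvm::ArrayRef<char> data = attr.getRawData();
      // A real emitter would also record the element type and shape, and
      // align the data so a reader could mmap it back in without copying.
      os.write(data.data(), data.size());
    }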