[RFC] Bytecode: op fallback path

GitHub: [mlir][RFC] Bytecode: op fallback path by nikalra · Pull Request #129784 · llvm/llvm-project

Bytecode dialect versioning today is built around the premise that MLIR bytecode is used as a long-term storage format: specifically, that the producing process has knowledge of the dialect version to target, and is able to downgrade the dialect prior to serialization. If it cannot, serialization fails with an error.

This poses a problem if bytecode is used as an exchange format between distributed compilers with different versions of the dialect. If a given module references an operation that the receiving process doesn’t know about, or utilizes a newer version of the op that the receiving process doesn’t have support for, the receiving process will fail to parse the bytecode in its entirety, regardless of whether or not the failing operation is relevant to the receiving process.

This proposal adds a fallback mechanism for dialects to construct an operation that maintains the semantics of the unknown operation, while supporting roundtrip bytecode serialization in a bitwise exact manner. Specifically, the flow changes to the following:

  1. Attempt to deserialize bytecode using the standard dialect/op interface mechanisms.
  2. If an op cannot be parsed, attempt to parse it again using the dialect fallback mechanism. This uses a sideband path to register the original operation’s name with the fallback op, and requires that properties are encoded using a preset dialect-wide properties encoding scheme. The fallback op has flexible semantics to allow it to represent any op in the dialect.
  3. When round-tripping back to bytecode, the numbering scheme uses the sideband path to reference the original operation’s name, and uses it in place of the fallback op’s name. Properties are serialized according to the dialect’s standard encoding scheme.
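The three steps above can be sketched as a toy model. This is plain Python, not the MLIR C++ API; all names (`FallbackOp`, `mydialect.*`, the dict-based properties) are illustrative assumptions, not anything from the actual patch:

```python
# Toy model of the fallback deserialization flow (illustrative only;
# not the actual MLIR bytecode reader/writer).

class Op:
    def __init__(self, name, properties):
        self.name = name            # serialized op name
        self.properties = properties

class FallbackOp(Op):
    """Generic op with flexible semantics that can stand in for any
    op in the dialect."""
    def __init__(self, original_name, properties):
        super().__init__("mydialect.fallback", properties)
        # Sideband path: remember the original op name so that
        # re-serialization can use it instead of the fallback name.
        self.original_name = original_name

REGISTERED_OPS = {"mydialect.add"}   # ops this process knows about

def deserialize_op(name, properties):
    # Step 1: standard dialect/op interface path.
    if name in REGISTERED_OPS:
        return Op(name, properties)
    # Step 2: dialect fallback path. The properties blob is decodable
    # because the dialect uses a preset dialect-wide encoding scheme.
    return FallbackOp(name, properties)

def serialize_op(op):
    # Step 3: the numbering scheme consults the sideband name, so a
    # fallback op round-trips under the original operation's name.
    name = getattr(op, "original_name", op.name)
    return (name, op.properties)

known = deserialize_op("mydialect.add", {"lhs": 1})
unknown = deserialize_op("mydialect.new_op", {"mode": "fast"})
assert serialize_op(known) == ("mydialect.add", {"lhs": 1})
assert serialize_op(unknown) == ("mydialect.new_op", {"mode": "fast"})
```

The key design point the sketch captures is that the fallback op itself never appears in the re-emitted bytecode; only the sideband-recorded name does.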

Existing options considered:

  • Unregistered operations: this does not work if the operation is registered on the sending side, but not on the receiving side. Unregistered operations require that properties are serialized as attributes, which breaks down if the sending side uses custom properties encoding for versioning. The registration state is also serialized into the bytecode, so it will be incorrect on the receiving side.
  • Passing through the properties section as-is: the original version of this patch envisioned a fallback interface where the properties blob was read directly from the bytecode and then serialized back as-is during the roundtrip, eliminating the need for a standard property encoding scheme between dialect ops. Unfortunately, the numbering scheme in the round tripped bytecode doesn’t match up with the numbers encoded in the attribute and type sections, because we’re not able to number the attributes and types stored in the opaque properties blob.
  • Downgrade on serialization: this adds significant overhead. It requires negotiating the minimum supported dialect version between the receiving processes, which may result in the semantics of the program changing significantly to accommodate. Furthermore, it does not solve the problem if the module cannot be downgraded. Alternatively, the serialization process could convert newer ops to opaque fallback ops at serialization time, but that would still require determining the dialect version of the receiving process.
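The numbering problem described in the second option can be made concrete with a toy sketch (again illustrative Python, not MLIR internals; the attribute table and index layout here are assumptions for demonstration):

```python
# Toy illustration of why passing through an opaque properties blob
# breaks attribute numbering on the round trip.

def number_attrs(ops):
    """Assign each distinct visible attribute an index, as the
    bytecode writer's numbering pass would."""
    table = []
    for op in ops:
        for attr in op.get("attrs", []):
            if attr not in table:
                table.append(attr)
    return table

# Sender numbers three attributes; the blob for the unknown op
# refers to attribute #2 by index into the attribute section.
sender_ops = [{"attrs": ["a", "b"]}, {"attrs": ["c"]}]
sender_table = number_attrs(sender_ops)      # ['a', 'b', 'c']
opaque_blob = {"attr_index": 2}              # points at 'c'

# The receiver can't parse the blob, so 'c' never reaches the
# numbering pass; the rebuilt table has no entry 2 and the index
# stored inside the opaque blob dangles.
receiver_ops = [{"attrs": ["a", "b"]}]       # blob contents invisible
receiver_table = number_attrs(receiver_ops)  # ['a', 'b']
assert opaque_blob["attr_index"] >= len(receiver_table)
```

Fully parsing the properties via a dialect-wide encoding scheme avoids this, because every attribute and type becomes visible to the numbering pass.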

Seems like this design could expand the scope of bytecode and IR; not to mention that carrying dead code or a library implementation is a security risk.

Why not store dialect metadata and declare the serialization unusable, like current protobuf environments do?

Can you elaborate on the security risk? This shouldn’t introduce any new data into the bytecode, and doesn’t hold onto any binary blobs: everything is fully parsed during deserialization, and emitted back into the bytecode using the standard infrastructure. If the dialect uses a standardized encoding scheme, the emitted bytecode can be bitwise exact.

Good to see consideration for other flows! :slight_smile:

I’d like to understand a bit more about your setup (this is independent of the proposal). In distributed compute, one often does have a window of compatibility, and changing file formats and distributing them without updating readers isn’t done in general. One often first has a window where readers are updated, and then the writer. In your setup, is this something that can’t happen in a coordinated manner?

Top-level question from me: does this change the bytecode serialization format? E.g., could this be a lazily loaded metadata op that doesn’t affect any encoding, or are changes to the format needed? (We don’t require processing all ops fully, so one can mix and match unsupported ops.)

Can you expand on this? Why would an irrelevant op even be sent?

How? It seems difficult to retain the semantics of an operation the receiver isn’t even aware of. (Else a downgrade would need to be possible, but I’ll ask below.)

So this encoding is fixed at dialect ~v1, and no new attribute or property encodings are ever used there? We have had a partial discussion on making the built-in dialect properties stable, in which case these could perhaps be a usable subset (I have no idea of the range of attribute kinds you are proposing).

Not sure I’m following. Is this saying when serializing back a received module one can retain the original encoding of ops one didn’t know about? So a receiver that doesn’t understand the op, would parse, mutate module/ops and then reserialize with the original un-parsed op emitted again in place of “opaque” one?

Would this then work with flag “always serialize as attributes”?

Above it is assumed this can be done in a semantics-preserving way at the receiver, IMHO. Else, if it can’t, it seems the receiver can’t handle the semantics. But maybe this is related to the discardable part.

We unfortunately don’t have control over the receiving side and need to support them being on any previous version of the dialect. In most distributed setups, there’s an assumption that the receiving side can handle whatever the sending side produces; instead, we use bytecode as a transfer mechanism to query the receiving side for its capabilities. On a substantially older version of the dialect, the inability to deserialize an op is equivalent to not having that capability.

It shouldn’t – rather, it’s an opt-in mechanism for a dialect to allow for processing a bytecode buffer where the encoding can be understood, but the Op/Attribute/Type definitions don’t exist. That being said, the onus is still on the dialect to maintain a compatible encoding such that the missing elements can be deserialized and serialized faithfully. If the dialect doesn’t support that, parsing should still fail gracefully as it did before.

In our case, the receiving process is expected to mutate the Module based on supported capabilities, returning a bytecode buffer back to the sending process once work has been completed. If the unknown op were to be lazily loaded, would we still be able to encode it back in its original form to the sending process? I’d expect it to be missing from the Module altogether.

I added some details in response to one of the other questions, but the workflow essentially uses bytecode as a medium to determine (a) what capabilities are available and (b) what is needed to leverage those capabilities. The host compiler uses that information to drive downstream work.

Using bytecode as that medium removes the need for us to define a separate sideband encoding scheme that would effectively duplicate the data already stored in the Module when communicating with the downstream process.

It’s true that it isn’t possible to retain all semantics, but it’s possible to retain relevant semantics: number of operands, number of results, number of regions, attributes, etc. The semantics and structure of the module are preserved, as are the semantics of how the recognized operations interact with the unknown operation.

If the receiver needs to know the specifics of the operation in order to operate on it, it’s a fatal failure just as failing deserialization would be. But, if the op describes work that could alternatively happen downstream of the receiving process (albeit maybe less efficiently), as long as it’s able to produce something that can be consumed by the downstream workload, the specifics of the operation aren’t relevant to the receiver.

Effectively, yes! Attributes or types can be added as long as their members can be serialized using the preset encoding scheme. If something is added that can’t use the fixed encoding scheme, the dialect would need to rev to ~v2, and that’d be considered a dialect-breaking change. In that case, we wouldn’t expect receiving processes to parse the bytecode without upgrading to the newest version of the dialect.

Would you mind linking to the discussion on making the built-in properties stable? I wasn’t aware that they weren’t considered stable, and I think that’d be something we’d want for our use-case.

Yes, exactly! Since all of the dialect ops would use the same encoding scheme, the only difference between serializing the “opaque” one and the real one is the name field written into the Dialect section. The proposal includes a change that adds a sideband path to use the name of the original operation when generating that section instead of using the name of the “opaque” op.

If I’m understanding right, I think that would still have the standard op versioning problem: if an attribute is serialized that doesn’t exist on the dialect version of the op, there’d still need to be a custom hook that is called to perform the upgrade.

Ah, yeah, I think that was in reference to the module containing ops that cannot be decomposed or represented in another way. Handling it at the receiver allows the op to be treated as “opaque”, whereas a proper downgrade at the sender wouldn’t be able to.

There’s an alternative where we query the receiver and transform the unsupported ops into an opaque representation at the sender, and then legalize them back to dialect ops on the roundtrip. It still has the overhead of negotiating the dialect version and performing the transforms on both sides, and also requires us to maintain a strict table containing which ops need to be replaced for each version of the dialect, but is probably workable if there isn’t interest to move forward with this proposal.