[RFC] Adding a generalized “config” section to MLIR files

TL;DR

We have various use cases that want to attach external non-IR data alongside IR when serializing, and we’d like to standardize a mechanism to do it.

Background

The MLIR textual format currently only contains operations (technically attribute/type aliases too, but those are immaterial), meaning that any external configuration used to process the IR must either be attached directly to the IR, or serialized in some other, often hacky, format. This generally establishes a good practice of encouraging users to encode the information they need directly in the IR, but in certain cases this just isn’t practical or desirable. For example, MLIR’s pass crash reproducer currently encodes the reproducer configuration as a comment in the output IR file, which mlir-opt magically tries to detect and parse. Putting aside how extremely hacky this is (I’ll take fault here), this solution obviously won’t work once MLIR has a bitcode format (stay tuned in the next week or so :wink: ).

Another important use case for us (at Modular AI) is the representation of constants/weights in ML programs. Right now we (and I suspect many others in the ecosystem) generally encode these as DenseElementsAttr, which is extremely undesirable for a large variety of reasons. This data is often many MB/GB/etc. in size, and we don’t want MLIR to own/unique/allocate/copy it. We’d ideally like to keep the data adjacent to the IR, and have very controlled access to it. The problem with moving data out of the IR is that once you want to serialize, whether that be for generating reproducers/writing tests/<insert-flow-specific-thing>/etc., you have to get creative with how you interact with the rest of the ecosystem.

There are various other potential use cases that could be enumerated here, but IMO these all boil down to a general desire to encode external “configurations” alongside the IR.

Pitch: A generalized “config” section in MLIR files

I’d like to propose extending the MLIR serialization format with a “config” section to capture the use cases described in the background above.

This section is designed to be a mechanism by which dialects, and external clients, can attach additional information when parsing/printing IR without that information being encoded in the IR itself. Configurations are not unique’d within the MLIR context, are not attached directly to any operation, and are solely intended to live and be processed outside of the immediate IR.

Configurations are encoded as key-value pairs nested within dictionaries anchored on either a dialect, or an externally registered entity. Dictionaries anchored on dialects use the dialect namespace directly. Dictionaries anchored on external entities use a provided identifier wrapped within <> (to differentiate them from dialects). The configuration key is an identifier used to uniquely identify and disambiguate the data. The configuration value can be encoded in a few limited forms (for now either a bool, a string (human readable), or a blob (binary)). We can expand these as necessary, but the intention isn’t for this to be super open ended; we want to be able to optimally encode these as we see fit (in both the textual and our inevitable bitcode format). Within the textual format, an example may be of the form:

{-#
  config: {
    // Here is a dictionary anchored on "mlir_reproducer", which is an
    // external entity representing MLIR's crash reproducer functionality.
    // External entity anchors are wrapped in `<>` to differentiate them
    // from dialect names.
    <mlir_reproducer>: {
      // `pipeline` is an entry that holds a crash reproducer pipeline
      // configuration.
      pipeline: "func.func(canonicalize,cse)"
    },
    // Here is a dictionary anchored on "foo_dialect", which is a dialect
    // namespace.
    foo_dialect: {
      // `some_dialect_config` is a key to be interpreted by the dialect,
      // and used to initialize/configure/etc.
      some_dialect_config: "Some important config value"
    }
  }
#-}

The wrapping {-# / #-} is intended to represent a new top-level “file metadata dictionary” section within the MLIR file that holds all of the non-IR extensions that we may want to add in the future (and easily differentiates them from operations/other IR constructs). The design of this was influenced by some of the discussion on the semi-recent proposal for IR versioning.

I’ve uploaded a few commits that illustrate what the design could look like:

  • D126446
    • Adds support for the config section described above, with an example “external constants” attribute added to the test dialect.
  • D126447
    • Updates the Pass Crash Reproducer to use a config section instead of the hacky magic comment encoding.

Note: I’m not attached to any of the naming or formatting of the “config” section, happy to take any better suggestions here.

7 Likes

Is the wrapping required because the location in the file/string being parsed where the metadata dictionary may appear is not fixed? E.g., if it always had to be at the end of the file, why would we need both delimiters?

Conceptually this feels equivalent to having a dedicated top-level “op” with a creative parser. Previously we had module as the implicit top level, but we removed that requirement so that any op could be top level; now we have a different top level that (conceptually) has a region with a (today) “top-level” op and another with metadata. So it feels like we are making a dedicated top-level op again, with the difference that this isn’t actually an op in the IR/not the parent of the top-level op it contains. So perhaps, could you clarify why this isn’t just a dedicated top-level op that we introduce? What is the tradeoff of doing this vs. a dedicated op?

(And you weren’t the only one responsible for the reproducer “magic comment” usage; it has also been very useful, just time to make it less magic :slight_smile: )

Right. The delimiters are about encapsulating a section of the file to use for the non-IR stuff. We could always have it at the end of the file, but I didn’t want to preemptively require that all of the non-IR stuff we may add in the future be at the end.

Of course. I would say that it’s only similar to a top-level op in the sense that it’s a top-level construct. Given that we don’t really ever want to treat this like IR, using an operation would be counter to many of the goals. We would either be using an operation just for serialization, with magic handling for this op in the parser/printer so that it never gets presented to users as an op, or we wouldn’t actually satisfy the goals we have. I didn’t really consider using a dedicated operation given the complexity, and I didn’t want to pretend to users that this is IR when it’s never intended to be. On that, some specific points:

  • We don’t want anything to be allocated via MLIR context/attributes/etc., none of this is intended to be inherently attached to IR lifetime (especially when the size of the data gets large).
  • Given the previous point, we need to introspect/utilize aspects of the parser that shouldn’t be exposed to operation parsers (for good reason).
  • In bitcode form we want an optimal encoding for configs, enabling mmap’ing data whenever possible, which would immediately be at odds with the use of an operation (given that operation encodings are optimized in ways specific to them).

(I didn’t want to @ you alongside me with the use of “hacky” :slight_smile:. It’s served its purpose well until now, but I’d like to kill it.)

– River

ODM this week? :slight_smile:

I remember in our discussion about introducing external constants that without introducing a new concept to MLIR (i.e. modifying the parser), something would have to give.

I’m going to nit on the name here. I personally prefer “data” section since “config” seems a little bit specific. A large ML weight doesn’t feel like it fits the label “external config”.

Just to clarify for other readers, the parser will dispatch to dialects to handle the memory of the blobs right?

Happy to have a discussion during the ODM this week if people want to discuss there.

I actually started with “data” as a name, but ultimately it doesn’t really capture what the intent of this section is supposed to be. For pass crash reproducers, it really is a config. For ML weights, I would argue that the data is actually part of configuring the surrounding system (at least for the use cases I’m interested in). “data” inevitably just felt too opaque to me, especially given that this section isn’t just about weights. Happy to be convinced otherwise if enough folks think that all of the use cases we intend to capture are really just “data”.

Right. The parser doesn’t try to interpret any of the data (just enough to lex/parse), and always dispatches to dialects (or external clients) to handle allocation/processing/etc.

– River

This sounds very promising and relevant. Some background here: we’ve previously sought to embed low-level information with IR for some specific codegen requirements around custom hardware. These are finer grained than loop/tiling concerns and could be broadly viewed in two categories: low-level hardware-specific configuration parameters (e.g. one of multiple weight compression modes), better suited to an embedded dict than a long set of cmdline parameters, and scheduling heuristics generated in prior steps.

Particularly because some of this was generated during the conversion process, and is not just predefined cfg parameters, a carry-through dict-like construct was meaningful. So this RFC resonates well with such a requirement.

In terms of feedback, there’s a potential use case for also tying config values to one or more ops in the IR but I can see how this might start to look messy - probably worth a separate RFC with a concrete use case built upon this one.

<ducks>
Maybe “metadata”?
</ducks>

3 Likes

How about something like “resource” (akin to Windows executable resources used for icons, string data, etc.)?

1 Like

I’m strongly supportive of proceeding with this proposal as presented (give or take specific naming of config/data/metadata/resource). I’m also very supportive of getting this part landed so that we can keep moving and get bitcode integrated properly.

River was nice enough to spend some time talking me through personally how ml_program could be a good first user of these facilities to provide real ML external data management APIs that we can all use. If we land this patch/approach, it will unlock us doing some really nice things there that can unify how the various ML frontends interact.

So, +1 from me on the approach in general, and +1 on the first step patches specifically.

2 Likes

+1 on moving forward too.

It is really exciting to see this feature being added. It would be nice if we didn’t need to use a very boilerplate serialization/deserialization API to update/persist the config. I’m worried that overusing config could lead to using MLIR as protobuf.
What I want to propose is a mechanism to hide the serialization/deserialization of the config behind common attributes and JIT.

  • In the beginning, the IR is built with an extra config.attribute in the module:

    config.attribute @config_1(attribute_type: tensor<6xf64>) {
        %path_to_state_dict = config.string_to_cstring {
            attr = "path_to_state_dict"
        } : () -> !llvm.ptr<i8>
        %var_name = config.string_to_cstring {
            attr = "var_name"
        } : () -> !llvm.ptr<i8>
        %2 = func.call @load_tensor_from_pytorch_state_dict(%path_to_state_dict, %var_name) : (!llvm.ptr<i8>, !llvm.ptr<i8>) -> tensor<6xf64>
        return %2 : tensor<6xf64>
    }
    
    func.func @some_func() ->() {
      %1 = some_dialect.constant dense<#config_1> : tensor<6xf64>
      ...
      return
    }
    
  • after running a persist-config pass, @config_1 gets compiled and run in the JIT. The final config is generated and persisted.

    {-#
      config: {
        // actual config generated, weight hex, etc
      }
    #-}
    func.func @some_func() ->() {
      %1 = some_dialect.constant dense<#config_1> : tensor<6xf64>
      ...
      return
    }
    

I’d love to see something like this, but I’m deeply skeptical of things that exist only with file scope. Could this instead be attached to a module, or, for that matter, any op? Or perhaps exist alongside attributes as an explicitly non-uniqued object? Brainstorming slightly: could symbol tables be built on the same (scoped) machinery as a core concept?

In practice, I suspect that the ‘non-uniqued’ aspect is just going to push the complexity of managing key collisions or other aspects onto users of the API. I’m not sure that it’s a net-win, but I agree that there are certain cases for which the existing Attribute concept is not a great fit, so something is necessary.

Sure, you can already do that but:

  1. What needs to change in MLIR? It seems like the existing structure is enough for anyone to set up this kind of technique.
  2. This does not make MLIR file “standalone” and breaks reproducibility (at least you need to handle reproducibility with your own mechanism).
1 Like

One of the main goals is to avoid copying and even uniquing data into the context. Unowned attributes can be achieved right now with a few hacks to the storage uniquer infra, but there are a lot of things that would not be as efficient as they should be, and that would break reproducibility. The solution many have used already is to store the data (“configs”) in separate files, but managing multiple files works against all the infrastructure and utilities provided for standalone .mlir files.

Jacques’ question about whether this could’ve been a top-level op with a “creative” parser/printer is in a similar vein. The only decent standing solution is to introduce a new concept to MLIR files entirely.

One thing that came out when I was discussing this with River w.r.t. using it for ml_program is that the expected users of these APIs are primarily dialect implementations, and we do expect it to be an advanced feature that many of them don’t touch. Since the ones that do need it for out-of-band storage will already need to expose new user-level APIs, I’m ok biasing this towards a simple flat namespace of metadata and pushing the responsibility for proper usage (i.e. scoping, etc.) to the direct users.

I agree that if this were a user-facing feature intended for wide use, we may want something else. In my mind, though, this is a) largely a side storage for out-of-band data for things that really need it (of which, I don’t think there will be very many), and b) a place to stash actual file-scoped metadata.

I don’t quite understand the details, but I’ll take your word that adapting Attributes wouldn’t be an easy solution.

I’m still concerned that file-scope config data is going to be problematic, given that MLIR is fundamentally hierarchically scoped. It seems like a solution that doesn’t take this into account is going to be challenging to leverage.

FWIW, I’m also +1 on this proposal, I think it is a sorely missing piece of functionality and will make many MLIR based tools way better. I’m also not super convinced by the “config” name, it comes loaded with a bunch of associations that are not ideal. I’d throw out “resource” and “asset” for consideration.

Stephen, the major issue with extending attributes and extending operations is that nothing in the MLIR IR constructs currently has custom destructor support, so we can’t do proper management of these things. Attributes (and custom attributes) are a handy way to form references to these data blobs from the IR though!

-Chris

1 Like

Right, there are also a lot of complications that arise when factoring in cloning/deletion, nested operation insertion/removal, efficiency (both in memory and runtime), etc., which can greatly complicate aspects of ownership w.r.t. non-unique’d things. We’ve considered such things at various points in the past, but there is a general amount of complexity surrounding them that complicates things.

We support destructors for attributes now; they are executed during context teardown (which is limiting, of course).
(I may have misunderstood your point though)

I’m +1 on resource/resources for naming (mentioned a couple of times up-thread).

3 Likes