RFC: Introduce ml_program dialect and top-level ops (proposal v2)

High level summary

We propose beginning in-tree development of an experimental ml_program dialect which can help provide some standard containers and structural ops to unify the outer level of ML computations. This is a completely reworked v2 proposal based on the discussions that came out of RFC: Introduce ml_program dialect (deprecated v1 proposal).

We expect development to proceed incrementally with the following initial design scope:

  • An outer ml_program.subgraph container representing a graph of ML operations.
  • An ml_program.global operation for declaring the storage of globals.
  • Accessors for loading/storing values from/to an ml_program.global.

Completing this work will be sufficient to enable us to upstream IREE’s frontend interfaces for JAX and TFLite to their respective projects without having a dependency on IREE’s internals. It would also enable a similar integration for torch-mlir with respect to features needed that go beyond simple inference graphs. Removing the IREE dependency corrects the layering and will enable broader access by other community projects to programs extracted from these frameworks.

While discussing this topic with the community, a lot of interest has been expressed in potentially adjacent topics, many of which we also agree are important to work out (and would benefit from having an ml_program foothold upstream in MLIR). However, in the interests of having a concrete, minimal starting point, we are recommending this initial scope and have aligned it with the minimum set of features needed to enable community benefit via greater interop and access to programs extracted from ML frontends. We consider everything outside of this initial scope to be interesting next steps but out of scope of this RFC.

This RFC is separable: convergence on any of the points (dialect name/subgraph op, global, or global accessors) is sufficient to start work while the rest is resolved.

Further discussion on initial design scope

ml_program dialect name

We should define a new ml_program dialect in tree for the development of high level, structural operations/types/utilities for the cross-framework definition of “ML Programs”. As a point of principle, we seek to provide direct implementations of concepts that are either universal or provide a high plateau for interop, while also allowing toolchain specific MLIR dialects to compose with the ml_program dialects to support non-standardized features.

Why / What’s in a name?

The ultimate goal of claiming such a name is to create a well-lit point of top-level interop between toolchains and to provide the scaffolding needed to enable them to differentiate features in a reasonable way. We propose starting with a more experimental, qualified name like ml_program with an eye towards evolving a suite of inter-operating dialects rooted on the ml name. We believe that there is sufficient technical and product consolidation in this area to warrant such a callout. Doing so will have a strong, in-tree unifying effect for the community of contributors who are currently fragmented across downstreams and incubators.

Ultimately, even if significant parts of what makes up an “ML Program” live outside of the ml_program dialect, a place to define the norms (or even leave design comments) is an important next step for this space.

We are taking significant inspiration from the approach and success of the CIRCT project, which took a similar development approach with respect to providing an MLIR-based unification point for a family of toolchains and topics in the hardware design space.

ml_program.subgraph container op

As noted in the discussion about moving builtin.func to the FuncDialect, the widespread use of func.func is considered an anti-pattern that we want to move away from. This is the perfect time to create a subgraph op which implements FunctionLike and is meant for interop of graph-based callables between ML tools. The name subgraph is proposed as it is thought to better connote a graph-based callable with explicit argument and result values vs some of the original “graph” constructs in this space, which were more sea-of-nodes based. Note that “graph based” here refers to the typical nomenclature used by ML frameworks and does not necessarily imply an “MLIR Graph Region”, although whether it should also imply that is a point of technical discussion that can be debated.

The proposed ml_program.subgraph op is defined as:

  • Implementing FunctionLike.
  • Having a single region and single block within the region.
  • Having no restrictions on argument and result types.
  • Allowing ops from any dialect (naturally including existing dialects like TOSA, MHLO, and ONNX).
  • Terminated by ml_program.return.
  • Enforcing Dominance.

(This last point is likely the most controversial part of this op; it is included as a forcing function for debate.)
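
For illustration, a subgraph under this proposal might be spelled as follows (a minimal sketch; the exact assembly syntax is not final, and the op names are as proposed above):

  ml_program.subgraph @scale(%arg0: tensor<4xi32>, %arg1: tensor<4xi32>) -> tensor<4xi32> {
    // Ops from any dialect are allowed; arith.muli applies elementwise on tensors.
    %0 = arith.muli %arg0, %arg1 : tensor<4xi32>
    ml_program.return %0 : tensor<4xi32>
  }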

ml_program.global op

The ml_program.global op is inspired by the LLVM GlobalVariable and associates a symbol name with storage for a value (needed for “large” tensor values but not restricted to them). A global is defined as:

  • Implementing Symbol.
  • Being mutable or immutable.
  • Having a bounding Type.
  • Having an initialization state that is one of:
    • Uninitialized (it is not legal for immutable globals to be uninitialized)
    • Inline initialized with a value based on an Attribute appropriate for the bounding Type
    • Externally initialized

Here are some plausible spellings for this op:

  • Private, immutable global that is inline initialized, with its bounding type equal to the type of its defining attribute:
    • ml_program.global @foobar = dense<4> : tensor<4xi32>
  • Global with a wider bounding type than its initial, inline value:
    • ml_program.global @foobar : tensor<?xi32> = dense<4> : tensor<4xi32>
  • Public, mutable global that is externally initialized where its bounding type matches the data externally referenced:
    • ml_program.global public mutable @foobar : tensor<4xi32> extern
  • Private, immutable global with a bounding type different from its externally initialized storage type:
    • ml_program.global @foobar : tensor<?xi32> extern : tensor<4xi32>

Design note: We could unify these forms further by defining an ExternalAttribute, which would enable spellings like:

  • ml_program.global public mutable @foobar = extern : tensor<4xi32>
  • ml_program.global @foobar : tensor<?xi32> = extern : tensor<4xi32>

It is intended that ML frontends and downstream tools will make the effort to use external linkage for sufficiently large initializers, which will be an improvement over the current status quo where most tools represent those inline, uniqued within the context. To better support this case, we expect that some simple utilities will be created upstream for creating file-based bundles of such initializer values, supporting the use case where tools will “link” them by name. This can be something not unlike Glow’s weight files but perhaps with a trailing table of contents and wrapped by an API that supports some interop with existing attributes and interfaces.

The global op itself only defines storage and initialization. It is plausible that toolchains can also extend it with notions such as placement via attributes, and it is out of scope for this RFC to take an opinion on that. In addition, the memory model for accessing globals is defined by the accessor ops.

Simple global accessors

As stated in the previous section, the memory model and load/store behavior for accessing globals is defined by the accessor ops. It is expected that downstream, framework-specific dialects will define accessors specific to any number of exotic cases, and we can evaluate inclusion of those in ml_program itself over time as the space matures. To get the ball rolling, we propose defining two such accessors with the simplest possible memory model and semantics:

  • %0 = ml_program.global_load @foobar -> tensor<?xi32>
  • ml_program.global_store @foobar %0 : tensor<4xi32>

These ops perform whole-global loads and stores, with implementation-defined synchronization and visibility constraints with respect to concurrent and distributed access. They have minimal invariants, which apply only within the same top-level invocation of a public subgraph entry point:

  • A global_load following a global_store will access a consistent view of the entire stored value (i.e. it cannot observe a partial write).
  • Built-in canonicalizations will not re-order global_load and global_store operations accessing the same global.
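
As a small sketch of what these invariants guarantee (using the spellings above and assuming @foobar is a mutable global):

  // Within a single top-level invocation:
  ml_program.global_store @foobar %0 : tensor<4xi32>
  // This load observes the entire stored value (never a partial write), and
  // canonicalization will not reorder it above the store.
  %1 = ml_program.global_load @foobar -> tensor<4xi32>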

It is expected that these load and store ops will only be sufficient for a limited number of cases constrained to a single execution node. A full, concurrency aware, distributed memory model is out of scope of this RFC. The primary purpose of having these operations at all is:

  • For the use of globals for external linkage of large, immutable constants, the provided ml_program.global_load op should be sufficient for all downstream cases.
  • They serve as “starter ops” for the wide set of cases where a frontend just needs to emit a simple, whole-value load or store that can then be further specialized for advanced cases later in the toolchain.
  • Even though simple, they are sufficient for a large body of single-node inference and training cases.
  • They provide a place for us to build in-tree interfaces that all load/store type ops should implement in downstream dialects, allowing for more common transformations to be written.

Thanks for the updated proposal!
A few comments below, but otherwise I’m very supportive of course :slight_smile:

While it makes sense for people more familiar with ML frameworks than MLIR to adopt a “subgraph” terminology independently of the “MLIR Graph Region”, we’re talking about an MLIR dialect here and I’d be concerned about confusing MLIR concepts. If this ml_program.subgraph isn’t defined as holding a “graph region”, then I would rather see it take another name aligned with MLIR terminology.

A graph representation is appealing, but impacts the design of the operations: we need something like a control token (or a “chain”) to carry scheduling of things like load/store to globals or any other operation with side-effects.

I don’t quite understand what this means here? Why do we have this notion of “bounding type”?

The “implementation defined” aspect isn’t satisfactory to me here: it prevents reasoning about transformations upstream. I am not really concerned about the runtime implementation of this kind of thing, but the IR semantics should be well defined.
For example, when we have a global_load immediately following a global_store, we can’t conclude anything if we don’t define the possibility of a concurrent invocation of this subgraph (which would expose other global_stores that could happen between the store and the load). An easy way out, I think, is to define these as non-atomic/non-volatile/non-synchronized; this would force some other mechanism for users who need stronger guarantees, and allows us to perform dead-store elimination and other similar simplifications.

I ran into this with the TF SavedModel importer. There is a distinction between the type that the global is initialized to hold, and the set of allowable values that can be dynamically stored into it. The “bounding type” bounds the types of all the values dynamically stored into it. The bounding type is the type that global_load returns, for example, or the type that the verifier enforces to be stored by global_store.

Makes sense, thanks!
But also that’s assuming that the global is mutable, right? I think that the example I quoted is an immutable global with an initialized value.

I agree, that bound could be tightened up for an immutable value.

Indeed: this is a bug. I shouldn’t have included that permutation.

(a note on process: I’m giving some time to collect a batch of feedback and then will try to “rev” the RFC in a bigger chunk vs point by point)

Since we are in a new “v2” proposal thread, I’d like to quote some of the feedback/questions from the other thread for reconsideration:

As for naming the dialect ml_program, I’m old enough to remember when “ML” in MLIR stood for “multi-level”, and not “machine learning.” :wink:

The first two concepts being discussed (global constants and graph-based callables) don’t seem entirely specific to machine learning. (Maybe there are some machine learning “wrinkles”…)

I’m also supportive - I think that this is an ambitious undertaking. But as someone who would like to use MLIR for applications both in machine learning and otherwise, I also hope that there is interest in decomposing what the frameworks need and curating dialects so that a wider range of applications can benefit.

Would a top-down design approach be possible? Instead of creating and populating an ml_program dialect just to get the ball rolling, could we try to map out an initial dialect map? This would have advantages from a division-of-labor perspective (if you are looking for help implementing), and might allow for more graceful design iteration than the alternative of moving things out of ml_program after frameworks have already started using them.

It seems like defining norms and leaving design comments can be done even if things are distributed among multiple dialects.

One advantage to the ml_program dialect approach being proposed is that it avoids extended discussions on “what should this dialect be called?” or “what should be in this dialect?” Does that outweigh the effort that would be required to extract things from ml_program into a different dialect down the road? I will defer to your expertise - just wanted to re-raise these issues one more time.

Haha, well, the benefit of this is that one can still read it as such :wink: (although this would probably be “top level program” rather than multi-level; an argument could be made that most of these are appropriate for the “mid level”, and then the acronym could be expanded thus).

That is a good point. Was there a reason to refer to this as subgraph rather than func?

Are these globals effectively unordered, with no expectation on relative ordering? E.g., could these be sorted/shuffled, or does input order need to be retained? And can they be DCE’d?

Indeed, it makes the rubber hit the road quicker and allows for trying out ideas and evolving them rather than having purely conceptual discussions (while starting minimal and with what is known to be useful). Having 3 frontends that need to be supported (and they don’t have to be ML, but this is geared to at least be useful in unifying there) is good to weed out some ideas. Having seen a couple of these go through many months of design only to find the world has moved on and then get abandoned makes me biased towards code :slight_smile: (of course without being tied to the code/unwilling to change it if we find there is a need).

So I think you all misunderstood my assertion/note: ‘Note that “graph based” here refers to the typical nomenclature used by ML frameworks and does not necessarily imply an “MLIR Graph Region”, although whether it should also imply that is a point of technical discussion that can be debated.’

I am open to it being argued that this be a graph region/not enforce dominance. There are pros and cons to such a decision (and choosing it will change the formulation of the load/store ops as well as future work), and I staked out the position and naming like this to force the decision to be made.

Separately, there was an opinion in a pre-review of this draft that when modeling a domain, it is important to use names from that domain and to consider infra-nomenclature alignment secondary. There was also the belief that func carries a lot of imperative baggage that does not carry forward here. I was convinced by these arguments and spelled it as presented. If we ultimately decide to make this a Graph Region, then we have no dissonance. If we don’t, I have been convinced that it is still fine.

Where “graph” is used for the classical dataflow computing execution model, and so doesn’t correspond to the dominance of an llvm func, AFAIK.

I believe TF, JAX, and PyTorch all use functions as part of their nomenclature, while subgraph was used in TF1 and associated with specifying feeds and fetches within a graph (at least for me).

(I can see the benefit of removing connotations, but these aren’t connotations I have, though I could be the exception ;-))

Ok, since no one is taking me up on the debate, I will contradict myself in an attempt to arrive at the right answer here. (I will still answer the other points brought up in the thread, but this is likely the most important technical discussion point and I want to make sure we have it.)

Alternative proposal

ml_program.subgraph op

The subgraph op contains a single MLIR Graph Region, meaning that its execution schedule is not based on dominance but must be explicitly encoded in the IR using SSA edges. We disallow use-def cycles in the graph. For convenience, a pass will be provided to topologically sort the subgraph to aid downstream consumers who will be implementing it with a dominance-based schedule.

We propose this form because it is higher-level: frontends often know more about side-effect modeling, and this gives them a mechanism for preserving those relationships explicitly within the IR (versus only statically in terms of effects on operations).

For NoSideEffect operations, this does not change anything. For side-effecting operations, it forces additional constraints on the op design:

  • Operations that may need to be constrained to run after another operation must accept an access token.
  • Operations that may need to precede other operations need to produce an access token.

Note that, in general, stateful operations defined in terms of access tokens for a Graph-based container can remain well-defined in a dominance-based container, but the inverse is not true.

In order to facilitate interop, we define access tokens as:

  • A new !ml_program.AccessToken type.
  • A new %token = ml_program.token operation.

The existence of this type and op does not constrain downstream op-sets from providing their own tokens (e.g. MHLO defines an !mhlo.TokenType and mhlo.token op). To enable compatibility, we may want to define this in terms of interfaces and/or recommend that downstreams have a token_cast as appropriate.

Implications for co-operating dialects

Generically, if existing op-sets have stateful operations that are not modeled in terms of access tokens, we could introduce a %token, ... = ml_program.execute(%token) {} operation which encapsulates them. Doing so is out of scope of this RFC.

CF dialect

People use cf.assert %pred in various ways. This is a side-effecting operation that does not return a value. Usage within a graph region would push us to define and use a %value = cf.assert_with %pred, %value op which returns an explicit guarded value (or values). This has been discussed in various forums but never implemented.
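
As a purely illustrative sketch (to be clear, cf.assert_with does not exist today), such an op would tie the assertion into the dataflow:

  // The guarded result only becomes available once the assertion has executed,
  // encoding the scheduling constraint as an explicit SSA edge.
  %guarded = cf.assert_with %pred, %value : tensor<4xi32>
  // Consumers use %guarded instead of %value, so they cannot be scheduled
  // before the assertion.
  %0 = arith.addi %guarded, %guarded : tensor<4xi32>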

MHLO

As mentioned, MHLO already models all of its stateful operations with tokens, so the only accommodations needed are with respect to interop of token types.

TOSA

TOSA does not define any stateful operations and is therefore not impacted in terms of its current design. Should it ever add stateful operations, we would recommend that it follow the conventions defined here.

ONNX

ONNX-MLIR does not appear to define stateful operations and should not be impacted.

ml_program

The dominance-based ml_program proposal would be amended such that:

  • global_load accepts an access token.
  • global_store accepts and produces an access token.
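
Threading tokens through the amended accessors might then look like this (an illustrative sketch; the exact assembly format is an assumption of this example, not part of the RFC):

  // Tokens encode the store-before-load ordering as explicit SSA dataflow.
  %t0 = ml_program.token
  %t1 = ml_program.global_store @foobar %0, %t0 : tensor<4xi32>
  // This load is constrained to execute after the store via %t1.
  %1 = ml_program.global_load @foobar, %t1 -> tensor<4xi32>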

Further, while out of scope for this RFC, future definitions of callable semantics would likely require more discussion about side-effecting calls.

Discussion Points

  • Yes/no: Should we define the subgraph op to contain a Graph Region?
    • Yes/no: Should we prohibit cycles in the graph (i.e. effectively just removing the dominance requirement)?
  • Yes/no: Regardless of the above point, should we design stateful ops with explicit access tokens and strongly prefer that co-operating op-sets do likewise (potentially adding compatibility shims like ml_program.execute {} to bridge)?

Please discuss. In informal discussions, I have generally found weak preferences pro and con and am therefore forcing the issue.


There’s a proposal in the “Statefulness Support for TOSA” thread on Discourse. It’s been implemented and recently successfully prototyped in both TOSA MLIR and the TOSA reference model. It doesn’t appear that proposal is at odds with what’s listed here, but we’re happy to take feedback on it - ideally in that thread so as to not derail this one.

Maybe I’ve had my head in a hole, but personally I don’t feel like the “actually treat the graph as a graph region with no sequential execution semantics” design space has really been explored enough in the MLIR ecosystem to warrant attempting to build for that first. Or in other words, it’s hard to justify proposing something with the goal of ecosystem interop when nothing (AFAIK) actually interoperates IR with the proposed semantics between components.

Happy to be proven wrong here – I think that moving to actual graphs with no sequential ordering actually has a lot of benefits for modeling ops with shape preconditions (otherwise the implicit UB of wrong shapes makes reasoning about the relationship between ops harder and conservatively restrictive in sequentially ordered programs).

Speaking with my Torch-MLIR hat on, such a graph-based op doesn’t really solve any problems for me. And if backends started requiring IR in that form, it would actually be a non-trivial amount of work for me to support that well.


Thanks, Sean - indeed, in my survey, I found no modern IRs being fed to a compiler backend where the serialized form was not at least required to be topologically sorted and free of cycles. Those that I found from more of the “inference side” typically encode their schedule sequentially and are naturally represented by a dominance based container. Further, I found no modern frontends that feed to a compiler which would be disadvantaged by emitting “function/subgraph units” to a dominance based container. In fact, the most popular (TFLite and PyTorch) encode their execution schedule for such units in precisely that way. This is why I wrote the RFC the way I did – but am in the same state as you are: open to an alternative viewpoint.

In my mind, us making this decision now is largely about future-proofing: true graph-based ML constructs do exist and it is conceivable that they will want to interoperate with what we define here. It would be best if such use later could be incremental and not find representational issues that cannot be easily bridged. The place where this tends to crop up is around parallelism and distribution. I would argue that unlike these “ML subgraph programs”, that space is far from converged, and frameworks seeking representations will absolutely be bringing their own IR constructs to model it to their wishes for some time.

Practically, the conflict only arises when looking to interoperate with stateful operations. I think it is defensible if we define these subgraphs assuming a dominance-based schedule which respects effects. This does not preclude future explicitly-dataflow-oriented representations, either by way of a new top-level container type (i.e. dataflow_graph) and/or ops that encase arbitrary stateful ops by either tying or carrying dataflow edges explicitly (i.e. %token, ... = execute %token { ... }). If we took this stance, it would allow us to bias towards the simplest, common representation (dominance-based) while leaving the door open for interop with more purely dataflow-based systems in the future.

I agree with you: if we made these literal graph regions, every toolchain currently in existence would need to do extra work to get into that form, and most of the backend compilers would run some passes to establish a valid topological order. I remain theoretically interested in something beyond this, but there isn’t a lot in the commons on which to anchor it.

I don’t think there is really much cost, if we can agree on “ml_program.func” (single block, dominance-based region), to also define “ml_program.subgraph” (single block, graph region) to be a default graph container for this space (perhaps part of the reason we don’t have more graph-based compilers is that we don’t have a “starter graph” container op).

I think the two can actually interoperate pretty easily, and I think there is a pretty strong technical argument that there is a minimal, technically-“opinionless” way to define it. We would adopt the “old TF” behavior of calls to subgraphs not being a “wait for all” kind of thing. Despite the somewhat unintuitive behavior coming from the imperative side, those semantics preserve the invariant that inlining is a no-op transformation (modulo wrapping in an ml_program.execute_inline_graph kind of op to protect the graph region from an outer dominance region, if applicable) and that outlining can be done composably. Everything else can be open-coded around that if necessary, and anything else would prevent open-coding certain patterns. Eventually we could incrementally introduce new call-like/subgraph-like ops that encode common patterns of token threading that emerge (which, in the interim, would be open-coded with the regular “starter” ml_program.call_subgraph/ml_program.subgraph), and that would also be the natural time to introduce a starter token type and some starter token ops.

But the opposite seems quite true as well: there shouldn’t be any disadvantage for any “sequential input” to be fed to a graph-based IR. It is even trivial to write an API that takes a sequence of operations and implicitly threads the “chain”/“token” through them when building the IR under the hood.

What is more important to me is the level of information that we carry: I believe that the graph representation embeds strictly more information than the sequential representation. This is because the scheduling information is made explicit instead of implicit: the question of whether two stores are sequenced with respect to each other is explicitly present in the dataflow, instead of implicitly carried through the combination of the control sequence (implicit ordering in a block) and the memory effects (and aliasing), which have to be reconstructed by some analysis. In practice, the implicit sequential representation necessarily makes the compiler more conservative in the presence of side effects. Getting to a “graph” representation is the kind of “raising” done by MemorySSA in LLVM, for example.
In a “pure” dataflow, there isn’t much difference as far as I can tell.

(I’m excluding cycles from consideration here).

I don’t think this is accurate: XLA’s internal representation is actually a graph today.

There is some level of “cost”, or at least immediate technical impact: operations with side effects would all have to be defined to return a “token” (and accept a variadic list of tokens); otherwise you cannot use them in a graph. The tokens could be ignored in the “func”-based representation.

I think we all agree that the graph representation encodes more precise information. From the perspective of this RFC though, which is about interop, we are looking at the real world of the semantics of the IR that is going between components in the MLIR ecosystem presently, and there is a real need for the single-block, sequential execution semantics func-like op.

Also, I personally think we should have two different ops – one for graph regions (which carries a token), and one for sequential execution regions (which does not carry a token). Besides just being inconvenient to have the token all the time in sequential regions, there is a real possibility that somebody incorrectly thinks that the tokens are the entire dependency graph and does a transformation that is incorrect according to sequential execution semantics.

I think there could be a lot of ecosystem value to having a single-block, graph-region, “func-like” container op upstream (possibly in ml_program), but I think that merits its own discussion outside this RFC.

I would challenge this: can you elaborate on why there is such a “need”? What advantage is there for anyone to not use a graph (other than superficial aspects of convenience)?
I would claim the contrary: because this is at the level of interop, this seems like exactly the kind of place to avoid dropping information on the floor that will have to be “raised” conservatively/incompletely later. If you have a graph with the token edges, going to a sequential representation is fairly trivial (any topological sort is valid).

I don’t quite get this: since this RFC is about starting the very core building blocks of interop, that seems very fundamental to me, on the contrary.

Well, we have frontends that are inherently sequential (like Torch-MLIR). Going to the graph representation in the presence of control flow is pretty non-trivial (I think we would need to build graph-compatible versions of scf.if/for/while upstream?). In the simpler cases we can just add tokens between every two ops, but I’m worried about the general case.

Also, backends that don’t natively support graph regions (which is all MLIR-based compiler backends I’m aware of) are going to need to sequentialize the graph, which is pretty hard and subtle (I know you know this better than anyone from working on tf_executor et al :slight_smile: ). Even handling special cases (like graphs with no cycles) requires knowing information about the data types involved (the conservative behavior is to treat every datatype as if it were a tf.FIFOQueue and potentially have weird un-sequentializable semantics – we don’t even have a TypeInterface or trait for such a property). It would be weird to make graph regions the default interchange format, but then not provide an upstream pass “that we are happy about” to sequentialize it for all the backends that would need that component.

I think the high-order bit here is that we don’t really know for sure that a graph-region-based MLIR ecosystem is the right thing, while we know that we have today a working ecosystem in the ML space based on a single-block sequential “func”. So I’m pretty comfortable leaving the discussion of a graph-based “starter func” and ancillary operations to a future RFC.