High level summary
We propose beginning in-tree development of an experimental `ml_program` dialect which can help provide some standard containers and structural ops to unify the outer level of ML computations. This is a completely reworked v2 proposal based on the discussions that came out of RFC: Introduce ml_program dialect (deprecated v1 proposal).
We expect development to proceed incrementally with the following initial design scope:
- An outer `ml_program.subgraph` container representing a graph of ML operations.
- An `ml_program.global` operation for declaring the storage of globals.
- Accessors for loading/storing values from/to an `ml_program.global`.
Completing this work will be sufficient to enable us to upstream IREE's frontend interfaces for JAX and TFLite to their respective projects without a dependency on IREE's internals. It would also enable a similar integration for torch-mlir with respect to features needed that go beyond simple inference graphs. Removing the IREE dependency corrects the layering and will enable broader access by other community projects to programs extracted from these frameworks.
While discussing this topic with the community, a lot of interest has been expressed in potentially adjacent topics, many of which we also agree are important to work out (and would benefit from having an `ml_program` foothold upstream in MLIR). However, in the interests of having a concrete, minimal starting point, we are recommending this initial scope and have aligned it with the minimum set of features needed to enable community benefit via greater interop and access to programs extracted from ML frontends. We consider everything outside of this initial scope to be interesting next steps, but out of scope of this RFC.
This RFC is separable: convergence on any of the points (dialect name, `subgraph` op, `global` op, or `global` accessors) is sufficient to start work while the rest is resolved.
Further discussion on initial design scope
`ml_program` dialect name
We should define a new `ml_program` dialect in-tree for the development of high-level, structural operations/types/utilities for the cross-framework definition of "ML Programs". As a point of principle, we seek to provide direct implementations of concepts that are either universal or provide a high plateau for interop, while also allowing toolchain-specific MLIR dialects to compose with the `ml_program` dialects to support non-standardized features.
Why / What's in a name?
The ultimate goal of claiming such a name is to create a well-lit point of top-level interop between toolchains and to provide the scaffolding needed to enable them to differentiate features in a reasonable way. We propose starting with a more experimental, qualified name like `ml_program` with an eye towards evolving a suite of inter-operating dialects rooted on the `ml` name. We believe that there is sufficient technical and product consolidation in this area to warrant such a callout. Doing so will have a strong, in-tree unifying effect for the community of contributors who are currently fragmented across downstreams and incubators.
Ultimately, even if significant parts of what makes up an "ML Program" live outside of the `ml_program` dialect, a place to define the norms (or even leave design comments) is an important next step for this space.
We are taking significant inspiration from the approach and success of the CIRCT project, which took a similar development approach with respect to providing an MLIR-based unification point for a family of toolchains and topics in the hardware design space.
`ml_program.subgraph` container op
As noted in the discussion about moving `builtin.func` to the `FuncDialect`, the widespread use of `func.func` is considered an anti-pattern that we want to move on from. This is the perfect time to create a `subgraph` op which implements `FunctionLike` and is meant for interop of graph-based callables between ML tools. The name `subgraph` is proposed because it better connotes a graph-based callable with explicit argument and result values, versus some of the original "graph" constructs in this space, which were more sea-of-nodes based. Note that "graph based" here refers to the typical nomenclature used by ML frameworks and does not necessarily imply an "MLIR Graph Region", although whether it should also imply that is a point of technical discussion that can be debated.
The proposed `ml_program.subgraph` op is defined as:
- Implementing `FunctionLike`.
- Having a single region and a single block within the region.
- Having no restrictions on argument and result types.
- Allowing ops from any dialect (naturally including existing dialects like TOSA, MHLO, and ONNX).
- Terminated by `ml_program.return`.
- Enforcing dominance.

(This last point is likely the most controversial part of this op; it is included as a forcing function for debate.)
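As a rough illustration, a `subgraph` under this definition might be spelled as follows. The concrete syntax here is an assumption of this sketch, not a finalized design; the body mixes in `arith` ops simply to show that ops from any dialect are allowed:

```mlir
// Illustrative only: concrete syntax is not yet fixed.
// A subgraph with explicit arguments/results, a single block,
// ops from another dialect (arith), and an ml_program.return terminator.
ml_program.subgraph @add_one(%arg0: tensor<4xi32>) -> tensor<4xi32> {
  %cst = arith.constant dense<1> : tensor<4xi32>
  %sum = arith.addi %arg0, %cst : tensor<4xi32>
  ml_program.return %sum : tensor<4xi32>
}
```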
`ml_program.global` op
The `ml_program.global` op is inspired by the LLVM GlobalVariable and associates a symbol name with storage for a value (required for "large" tensor values but not restricted as such). A global is defined as:
- Implementing `Symbol`.
- Being `mutable` or `immutable`.
- Having a bounding `Type`.
- Having an initialization state that is one of:
  - Uninitialized (it is not legal for `immutable` globals to be uninitialized)
  - Inline initialized with a value based on an `Attribute` appropriate for the bounding `Type`
  - Externally initialized
Here are some plausible spellings for this op:
- Private, immutable global that is inline initialized, with its bounding type equal to the type of its defining attribute:
  `ml_program.global @foobar = dense<4> : tensor<4xi32>`
- Global with a wider bounding type than its initial, inline value:
  `ml_program.global @foobar : tensor<?xi32> = dense<4> : tensor<4xi32>`
- Public, mutable global that is externally initialized, where its bounding type matches the data externally referenced:
  `ml_program.global public mutable @foobar : tensor<4xi32> extern`
- Private, immutable global with a bounding type different from its externally initialized storage type:
  `ml_program.global @foobar : tensor<?xi32> extern : tensor<4xi32>`
Design note: we could unify these forms further by defining an `ExternalAttribute`, which would enable spellings like:

`ml_program.global public mutable @foobar = extern : tensor<4xi32>`
`ml_program.global @foobar : tensor<?xi32> = extern : tensor<4xi32>`
It is intended that ML frontends and downstream tools will make the effort to use external linkage for sufficiently large initializers, which will be an improvement over the current status quo where most tools represent those inline, uniqued within the context. To better support this case, we expect that some simple utilities will be created upstream for creating file-based bundles of such initializer values, supporting the use case where tools will "link" them by name. This can be something not unlike Glow's weight files but perhaps with a trailing table of contents and wrapped by an API that supports some interop with existing attributes and interfaces.
The `global` op itself only defines storage and initialization. It is plausible that toolchains can also extend it with notions such as placement via attributes, and it is out of scope for this RFC to take an opinion on that. In addition, the memory model for accessing globals is defined by the accessor ops.
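For example, a downstream toolchain might attach its own discardable attribute to a global without `ml_program` taking any opinion on it. The `mytool.placement` attribute below is purely hypothetical and only sketches how such an extension could compose:

```mlir
// Hypothetical downstream extension: a toolchain-specific placement
// attribute attached to an externally initialized global.
ml_program.global public mutable @weights : tensor<1024xf32> extern
    {mytool.placement = "device:0"}
```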
Simple global accessors
As stated in the previous section, the memory model and load/store behavior for accessing globals is defined by the accessor ops. It is expected that downstream, framework-specific dialects will define accessors specific to any number of exotic cases, and we can evaluate inclusion of those in `ml_program` itself over time as the space matures. To get the ball rolling, we propose defining two such accessors with the simplest possible memory model and semantics:
```mlir
%0 = ml_program.global_load @foobar -> tensor<?xi32>
ml_program.global_store @foobar %0 : tensor<4xi32>
```
These ops perform whole-global loads and stores with implementation-defined synchronization and visibility constraints with respect to concurrent and distributed access. They have minimal invariants that apply only within the same top-level invocation of a public `subgraph` entry point:
- A `global_load` following a `global_store` will access a consistent view of the entire data type (i.e. it cannot observe a partial write).
- Built-in canonicalizations will not re-order `global_load` and `global_store` operations accessing the same global.
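Putting these pieces together, the invariants above permit a simple read-modify-write sequence on a mutable global. The following sketch assumes the spellings proposed in this RFC; the op syntax is illustrative, not finalized:

```mlir
// Illustrative: a mutable global updated in place by a subgraph.
ml_program.global mutable @counter = dense<0> : tensor<i32>

ml_program.subgraph @increment() -> tensor<i32> {
  // Whole-value load of the global.
  %0 = ml_program.global_load @counter -> tensor<i32>
  %one = arith.constant dense<1> : tensor<i32>
  %1 = arith.addi %0, %one : tensor<i32>
  // Canonicalizations will not reorder this store past the load above,
  // since both access the same global.
  ml_program.global_store @counter %1 : tensor<i32>
  ml_program.return %1 : tensor<i32>
}
```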
It is expected that these load and store ops will only be sufficient for a limited number of cases constrained to a single execution node. A full, concurrency-aware, distributed memory model is out of scope of this RFC. The primary purpose of having these operations at all is:
- For use of globals for external linkage of large, immutable constants, the provided `ml_program.global_load` op should be sufficient for all downstream cases.
- "Starter ops" for the wide set of cases where a frontend just needs to emit a simple, whole-value load or store that can then be further specialized for advanced cases later in the toolchain.
- Even though simple, they are sufficient for a large body of single-node inference and training cases.
- They provide a place for us to build in-tree interfaces that all load/store type ops should implement in downstream dialects, allowing for more common transformations to be written.