RFC: Linalg OpDSL

Hi folks - it was brought up last week that the work that we’ve been doing on Linalg Op Generation could benefit from a refresher/public RFC. A lot of these ideas are not new, having been prototyped long ago, but we are just now getting to the point where enough utility has been obtained from the core abstractions that we are implementing the right mechanisms to scale things.

Refresher: Linalg Structured Ops

The Linalg dialect aims to provide:

  • An orthogonal set of op forms, capable of representing the broad set of linear algebra and related tensor manipulation operations that show up in typical tensor compute programs.
  • An extensive, open set of ops for the space of computations that can be derived from one of the supported op forms:
    • Each op form provides a named generic variant which can fully represent the computations it supports in generality, through an explicit, verbose IR structure.
    • Compared to “high level” (i.e. user-level ML library) operations, named operations are polymorphic to the degree useful within an op form, and as a result are usually more restrictive: such high level operations may map to multiple Linalg ops, either via multi-versioning or composition.
    • When an operation maps to a specific library call and/or hardware feature or useful form, it must be instantiated as a named operation.
  • Transformations for tiling, fusing and otherwise analyzing graphs of Linalg ops.
  • Exits to lower level dialects (i.e. affine, vector, et al).

Up to this point in the project, we have been relatively constrained, working primarily on simple variants of Linalg Structured Ops, sometimes just called “Linalg ops”, as represented by linalg.generic.

As this layer has become more mature, we have hit the tooling limits with respect to defining the full library of named operations that represent instances of the Structured Op form. This RFC concerns further realizing the tooling flow for scaling the flow of defining, tuning, experimenting and implementing such named structured ops.

Also see this thread from one year ago which prototyped things to the point they are prior to this RFC.

Requirements of an op definition tooling flow

Unlike lower level parts of the MLIR compiler stack, definition of Linalg ops is an end-user centric activity, with a result, for ops that prove useful or require more interop with transformations, of instantiating a named op within the compiler stack (either within the linalg dialect or in a derived project, depending on utility). It is the belief of the authors that this places a couple of extra burdens on the implementation (i.e. over the usual approach of open/hand coding an op directly in an ODS file):

  • Syntax matters: Ops must be specified concisely in the syntax of math, and preferably, in an otherwise general purpose language which allows experimentation, inspection, execution and the management of variants.
  • Experimentation and iteration is important: The presence of generic ops means that computations can be created verbosely through explicit IR, allowing further hacking, scaled out tuning/simulation, etc, without including compiler rebuilds in the loop.
  • Publishing should be an extension of the tooling flow: It should be a straightforward extension of the experimentation flow to update the MLIR (or other MLIR-derived) repository to build an op in as a permanent named operation.
  • Scales to O(1000) ops: While not all tensor-compute operations in a modern ML toolkit are directly representable by Linalg, many are (and more will be as additional forms are developed). Further, since Linalg represents a lower level of abstraction, there should be an expected multiplier (i.e. one high level op may map to more than one Linalg op). Spread across all frontends and backends, we could end up with ~thousands of such operations, and we want their definitions to be cheap to manage in order to be a counter-balance to the traditional pressure to overly generalize high level operations such that they lose an obvious connection with the actual structure of the computation being performed (i.e. they lose their transformation power and do not map directly to physical realizations).

Linalg OpDSL Sketch

For structured ops that map to a linalg.generic, we propose to start with a simple DSL, implemented in Python, which is inspired by the syntax proposed in Tensor Comprehensions. An in-tree prorotype of some common ops can be found here with some further description here. For glancability, here are a few matmul variants:

def matmul(A=TensorDef(T1, S.M, S.K),
           B=TensorDef(T2, S.K, S.N),
           C=TensorDef(U, S.M, S.N, output=True)):
  """Performs a matrix multiplacation of two 2D inputs.
  Numeric casting is performed on the operands to the inner multiply, promoting
  them to the same data type as the accumulator/output.
  C[D.m, D.n] += cast(U, A[D.m, D.k]) * cast(U, B[D.k, D.n])
def batch_matmul(A=TensorDef(T1, Batch, S.M, S.K),
                 B=TensorDef(T2, Batch, S.K, S.N),
                 C=TensorDef(U, Batch, S.M, S.N, output=True)):
  """Performs a batched matrix multiplacation of two 3D inputs.
  Numeric casting is performed on the operands to the inner multiply, promoting
  them to the same data type as the accumulator/output.
  C[D.b, D.m, D.n] += cast(U, A[D.b, D.m, D.k]) * cast(U, B[D.b, D.k, D.n])
# Note: not yet expressible as a named op in linalg
def log_matmul_exp(
    A=TensorDef(T, S.M, S.K),
    B=TensorDef(T, S.K, S.N),
    C=TensorDef(T, S.M, S.N, output=True),
    Interim=TensorDef(T, S.K, S.M, S.N),  # TODO: Mark temp
    TmpShift=TensorDef(T, S.M, S.N),  # TODO: Mark temp
    Tmp=TensorDef(T, S.M, S.N)):  # TODO: Mark temp
  Interim[D.k, D.m, D.n] = A[D.m, D.k] + B[D.k, D.n]
  TmpShift[D.m, D.n] = ReduceFn.max(D.k)(Interim[D.k, D.m, D.n])
  Tmp[D.m, D.n] += PrimFn.exp(Interim[D.k, D.m, D.n] - TmpShift[D.m, D.n])
  C[D.m, D.n] = PrimFn.log(Tmp[D.m, D.n]) + TmpShift[D.m, D.n]

Setting syntax aside for a moment (which we intend to iterate on as we go), the above decorates python functions, transforming them into a callable with an attached model data structure, which fully describes the linalg operation. The callable, when invoked, serves as an IR builder, allowing direct construction of the underlying linalg.generic operation(s) that implement the computation (note that the IR generation is not yet in-repo and is only in the original side-repository in a crude form).

In addition, an entry point is provided to, given modules of such op definitions, dump them in a YAML-based, declarative intermediate form, which is used for testing, inspection, and as an input to the static ODS/CC code generator which runs as part of the LLVM build.

On YAML and Splitting OpDSL from ODS Generation

As mentioned in the requirements, it is deemed a strong requirement to have a continuity of tools and representations for these ops, with the majority of the work being done by users of the compiler, not necessarily core compiler developers. Building an op into the compiler is seen as the final step of a user-centric development flow, and as such, we seek to reflect that in the tooling flow. Earlier prototypes (of which there have been 3 – two mentioned here) attempted both a TableGen centric approach (critique: violated API layering and did not satisfy user-development/language needs) and a completely custom language+monolithic code generator (critique: very hard to extend/reason about and did not satisfy user-development/language needs).

Development of the approach proposed in this RFC took place by:

  • Separating the parser/language from mlir-linalg-ods-gen.cpp into a Python DSL and backing object model.
  • Extracting the minimal data structures which declaratively define a Linalg op into both C++ and Python.
  • Re-implementing the code generation aspects of mlir-linalg-ods-gen.cpp into mlir-linalg-ods-yaml-gen.cpp, based on discrete data structures (versus intertwined with a parser).
  • Teaching both the C++ and Python side to interop via a YAML serialization of the op configuration structures, leveraging existing MLIR text forms wherever possible (i.e. Affine maps, types, etc).

The choice of YAML to tie these layers together started pragmatically:

  • When modeled cleanly, it is a reasonably human-readable format that retains its ability for machine manipulation pretty well (esp. compared to other options).
  • LLVM has an extremely nice binding library that ends up just being a few lines of code to map C++ structs to a YAML file.
  • Python easily interops with YAML – again with just a few lines of code and a bit of care in the object model.
  • I didn’t want to spend all of my time inventing something new to transport structs between two languages.

Side note: I’ve done a lot of object mapping over the years and I presently work for Larry and Sergey’s Proto Moving Company. I usually find myself underwhelmed by either the heft, usability or readability of such things, but I found this to be a pretty stark exception in practice – I was pretty skeptical that this would pass my bar for “a good idea”, but it did.

After having used this for about a month (mostly in a side repo, and as discussed on IREE’s #codegen channel on Discord), I found that having put care into making the YAML intermediate form readable helped with more than just interop:

  • Unit Tests are relatively easy to write.
  • The YAML declarative form has good “glancability”: I don’t necessarily want to be writing it, but it is easy to read and determine whether the op is defined as you expect it to be. I found myself dumping to YAML a lot during development (even transitioning from more repr based doctests) to good effect.

At this point, I’m happy with the outcome (and didn’t really think I would be), but I expect skepticism. Here is an example of the definition of a matmul (ignore all of the !Foo tags: python adds those by default and they are just noise – I need to remove them):

--- !LinalgOpConfig
metadata: !LinalgOpMetadata
  name: matmul
  cpp_op_name: MatmulOp
  doc: |-
    Performs a matrix multiplacation of two 2D inputs.
    Numeric casting is performed on the operands to the inner multiply, promoting
    them to the same data type as the accumulator/output.
  - LinalgContractionOpInterface
structured_op: !LinalgStructuredOpConfig
  - !<LinalgTensorDef>
    name: A
    usage: input
    shape: affine_map<()[s0, s1, s2] -> (s0, s2)>
    element_type_var: T1
  - !<LinalgTensorDef>
    name: B
    usage: input
    shape: affine_map<()[s0, s1, s2] -> (s2, s1)>
    element_type_var: T2
  - !<LinalgTensorDef>
    name: C
    usage: output
    shape: affine_map<()[s0, s1, s2] -> (s0, s1)>
    element_type_var: U
  indexing_maps: !LinalgIndexingMapsConfig
    - affine_map<(d0, d1, d2)[s0, s1, s2] -> (d0, d2)>
    - affine_map<(d0, d1, d2)[s0, s1, s2] -> (d2, d1)>
    - affine_map<(d0, d1, d2)[s0, s1, s2] -> (d0, d1)>
  - parallel
  - parallel
  - reduction
  - !ScalarAssign
    arg: C
    value: !ScalarExpression
        fn_name: add
        - !ScalarExpression
          scalar_arg: C
        - !ScalarExpression
            fn_name: mul
            - !ScalarExpression
                type_var: U
                - !ScalarExpression
                  scalar_arg: A
            - !ScalarExpression
                type_var: U
                - !ScalarExpression
                  scalar_arg: B

Alternatives Considered

Have an all in one flow in Python to generate ODS/C++

While nice on the surface from 10,000 feet, the LLVM codebase is in no way ready to include a Python dep in such a central part of its actual build process in this way. In addition to likely having some controversy as to whether we want to do this, the Python API (which the language depends on for core IR structures) is not layered in a way (currently) that is amenable to such a use (i.e. it would introduce circular dependencies). This could all, conceivably, be fixed and leave us purely with a policy decision. I lean towards it not being a great idea for the near future: people working on the compiler should not need to read through/debug/etc a Python code generator that is being used to build the core. If the code generator isn’t doing the right thing, it should at least have the benefit of looking like every other code generator in the project.

Instead, the Python side can be a perfectly serviceable development tool, and there is precedent for this kind of split that I’ve been referred to in other codebases.

Note that we did implement such a Python code generation step for the ATen dialect in the npcomp project. We ended up just checking in the generated .td and .cpp.inc files, for similar layering reasons. Note that the requirements there are slightly different: while in Linalg, we merely strongly prefer that the Python op libraries be the source of truth, in PyTorch, parts of the definitions are only accessible from Python.

Having reviewed both, it is actually meaningful to review the Linalg YAML files for correctness as a standalone entity (since they are declarative and convey the essence of the op), whereas for npcomp, the generated code is not really something you look at closely during review.

Note that stock Python isn’t an awesome code generator on its own: you would want to add third party packages like jinja2, adding yet more friction to the overall LLVM/MLIR development flow, which is currently already struggling to get the Python Bindings to a level of maturity and unobtrusiveness to enable by default.

Make ODS/Tablegen more generic and sufficient to model Linalg ops

This was tried in a previous prototype and was strongly objected to as something we did not want to do. This was re-iterated when we started this exploration.

Completely separate the runtime and compiler side of the toolchain

In this view, we have one toolchain based on a Python DSL for all of the runtime/experimental parts and then some other, yet to be defined, thing that let us declaratively say the same thing in a way more “LLVM-ish” way.

Given that ODS is the most natural thing different from what we’ve done to base this on, that is not desired, and it isn’t clear what the “other thing” would be, I think that the YAML format we have developed is a pragmatic compromise. In the future, we could teach the Python side to emit whatever this other thing is.

In addition, it should be noted that the structs that both sides operate on are pretty simple, and should we ever wish to change the textual representation, it seems like if the tool of choice couldn’t handle them in a similarly concise way to what we have now, it is a pretty good indicator that it is a poor choice. In other words: we have some low risk future optionality with the current proposed factoring.


Please speak up with feedback or concerns. We are a couple of patches deep in working towards the direction outlined above and intend to proceed on it incrementally until we have all of the expressibility and development flow we want. There are a number of expressibility issues with the preceeding approach that are starting to hinder independent progress by multiple individuals who are working on building out coverage for the op library, but we aren’t so deep that we can’t pivot based on feedback.

If there are serious objections to either the principles or the technical approach, we should probably talk about it at an upcoming ODM.

1 Like

Thanks for sending the RFC! This is all very exciting :slight_smile:
I personally have no concern with YAML: it is a step up to the previous custom DSL from the build system point of view, and it makes it for an easy target for the python DSL. The new DSL does not interact with the build system (and so does not cause any maintainability issue), and is both nicer than the previous one and available to python users. Looks like a win on every aspect here!

After the dust settle and we get a bit more end-to-end glue, it’d make it for a good ODM presentation indeed, but we can also have a more casual about it if enough people are interested into this!

Thanks for taking this piece upon your shoulders, rethinking it from first principles and for the great outcome!

Looks good. Glad the YAML worked out well (we added it originally as a nice way to write test cases when we were developing the old LLD – compromise between readability and machine parsability).

What is the definition of the semantics of add and mul (and presumably others like exp and max) in the assignments section? Maybe just say std.addf / math.exp to be precise?

I should be more precise, but can’t just say std.addf / math.exp: The exact instruction sequence is chosen based on types. It’s not a terribly surprising translation but should be called out specifically.