[RFC] Splitting the Standard dialect

Authors: { @ftynse, @mehdi_amini, @_sean_silva, @River707 } (order not important)

Context

MLIR has a Standard dialect which dates back to the project inception. Originally, MLIR had two kinds of functions (first-level IR concepts back then): LLVM-style CFG functions with branching control flow and polyhedral-inspired ML functions with explicit affine loops and conditionals. A set of “core” operations were available in both kinds of functions. After the two kinds of functions were unified, the hitherto “core” operations became what is currently known as the Standard dialect, and the affine constructs became the backbone of the Affine dialect. Since dialects had always been intended as a modularity mechanism, this reorganization allowed us to push for a leaner “core” IR with fewer built-in concepts [Section 2 paragraph 1 in the MLIR paper].

The Standard dialect persisted in the MLIR code base and grew significantly over time. New ops are being proposed for inclusion on a regular basis [1, 2, 3, 4].

Yet, concerns about maintaining a single, monolithic Standard dialect reappear in many discussions proposing new ops. This leads to a Kafkaesque situation where a contributor, often relatively new to the community, who proposes a simple new operation is quasi-systematically asked by commenters/reviewers to consider the entirety of the Standard dialect, with the goal of either splitting it up to accommodate the op or proving that the dialect still makes sense as a unit, depending on the reviewer’s own inclinations.

libMLIRStandardOps.a is around 2.8M, which may be unacceptable on, e.g., embedded platforms, especially if only a couple of ops are actually required. A recent RFC asked for moving ReturnOp to the (tinier) built-in dialect for this reason, and PDL/Interp dialect duplicated it to avoid a dependency.

Proposal

We propose to split the Standard dialect into multiple individual components by progressively factoring out well-scoped groups of operations into new dialects. Each new dialect will be the subject of a separate RFC, which will follow the guidelines for new dialects and, in particular, define the goal of the dialect and the criteria for including existing and new operations. Therefore, we are looking for consensus on the idea and process of the splitting. While we do provide an example of splitting below, this example is not final and only serves as an illustration. We will not discuss the scope of individual dialects in this proposal.

Our goal is to replace the Standard dialect completely. This will help eliminate implicit expectations of better support and the privileged status of one dialect, as well as prevent the associated feature creep in the hitherto standard ops. It will also reduce the pressure for other dialects to target or otherwise support those ops. On the flip side, the absence of a single common dialect creates a risk of duplicate work on operations and conversions; this risk can be partially mitigated by ensuring new dialects have few overlapping concerns and by providing better documentation on the overall upstream dialect ecosystem. We believe that the modularization benefits in terms of code size and general support effort required (today, virtually every contributor is a stakeholder in Standard, but they may not have a stake in all individual components) outweigh the risks.

We propose to identify prospective dialects by finding groups of operations that are frequently used together (for example, “simple” floating point arithmetic operations such as add and sub) as well as abstractions common to a group of operations (for example, the tensor type or CFG-related control flow).

Discussion

Arguments for splitting

  • lib/StandardOps/Ops.cpp is the single largest file in the code base (without the ODS-generated parts!), followed by StandardToLLVM.cpp that needs to handle most of the standard ops.
  • Lack of contextual connection between operations: the dialect contains standard integer/FP arithmetic, complex arithmetic, trigonometric functions, memref/view construction and casts, tensor construction and casts, DMA, etc. It is not clear that anybody needs all of those together. This has actively led to the duplication of operations in several downstream projects, which have opted to redefine simple operations (e.g. return/cond_br/br) because the cost of including Standard is so high.
  • This dialect does not correspond to the guidelines on components (no clear objective), yet it is the most likely source of inspiration for other dialects.
  • Simultaneous privileged status conferred by the notion of “standard”, and experimental-level quality, because this dialect ended up being the default choice for ops that don’t belong anywhere else.
  • The lack of a clear scope for the Standard dialect leads to confusion among users and developers. For example, the issues discussed in this thread could have been avoided had the tensor “component” of the Standard dialect been separate.

Arguments against splitting

  • Having many smaller dialects makes it hard to navigate the ecosystem.
    • This can be solved by technical means that do not rely on having a huge monolithic library, e.g., search in op documentation.
    • This is something to address and improve regardless, and splitting the standard dialect can be a forcing function here. Otherwise the pain points of manipulating multiple dialects will exist with scf for example. It is a claim of MLIR that dialects can mix and match seamlessly.
    • Ultimately, the size of dialects is a trade-off. Having few huge dialects will lead to the same navigation problem within a dialect as one could have across dialects.
  • The contextual connection between ops is that they all operate on standard types.
    • This does not hold as a general guideline for including an op in the dialect unless dialect == type system. There exist ops that operate on standard types, e.g. in TensorFlow, that don’t belong to the Standard dialect. Some Standard ops can operate on non-standard types, e.g. std.constant with an opaque value and tensor-of-custom-type. The tuple type is standard, yet we explicitly decided not to have standard ops operating on values of this type.
  • Concerns related to increasing build system complexity.
    • These are justified and can be partially addressed by maintaining a clean code tree structure and build system compartmentalization.

Miscellaneous concerns

  • The privileged status of the Standard dialect allows one to omit the std. prefix in the custom syntax. This will not be the case if there are multiple dialects.
    • MLIR favors mix-of-dialects [Section 6.2 in the MLIR paper]. In this context, not prefixing some operations with their dialects stands out by breaking the common pattern. In practice, handwritten IR that heavily relies on multiple dialects often uses an explicit std prefix, e.g. [1].
    • The verbosity is not problematic as long as dialect names are short (and pronounceable).
  • Standard dialect serves as the central point of the “hourglass”-shape lowering graph: higher-level dialects funnel into Standard, and lower-level dialects fan out from it. With many dialects, it becomes harder to configure the lowerings.
    • This is partly the legacy of Standard being a generalization of LLVM IR and LLVM IR being the main lowering target. Neither of these is true anymore.
    • In fact, the Standard dialect being lowerable into other representations is a misconception, given the growing number of Standard ops that need to be expanded within Standard before being suitable for further lowering.
    • Better documentation and a cohesive story about how upstream dialects fit together largely mitigate this concern, and are something we should do anyway.

One Possible Splitting

As an example and to support our point about mostly independent groups of operations, we propose one possible splitting of the operations currently in Standard into multiple dialects.

We propose to split the Standard dialect into multiple, smaller dialects according to the groups of semantically-connected operations, listed below, that are likely to get used together.

One of the grouping principles is the common data abstraction (type, or set of types) that the operations operate upon. In particular, we separate out the complex, memref and tensor dialects with the associated operations. Note that the corresponding types remain builtin, i.e. registered in the always-available built-in dialect. This separation practically reinforces two tendencies that have naturally appeared in the ecosystem: (1) the vector dialect contains most of the operations on vectors and was able to grow fast and gather adoption; (2) the naming scheme for ops in the Standard dialect increasingly tends to include the abstraction name in the op name: memref_reshape, tensor_from_elements, or even the dual ops with type names as a disambiguation mechanism: subview/subtensor, load/tensor_load. For the latter, having dedicated dialects would help reduce ambiguity and verbosity, e.g. memref.reshape, tensor.from_elements, memref.subview, tensor.subview. Specifically, the splitting can look as follows.

Control flow - cf dialect - br, cond_br, return, call, call_indirect; also move func in here.

Integer arithmetic - int dialect - addi, cmpi, subi, muli, divi/remi signed and unsigned, and, or, xor, shift_*, sexti, zexti, trunci, index_cast; also remove the trailing i.

Basic float arithmetic - float dialect - addf, cmpf, subf, mulf, divf, remf, absf, copysign, negf, ceilf, floorf, fpext, fptrunc, fptos/ui, u/sitofp; also remove the trailing f.

Trigonometry/math special functions - math dialect - ceildivi, floordivi (also add the mod equivalent), cos, sin, atan, tanh, exp, exp2, log, log10, log2, rsqrt, sqrt. These can have expansions into basic integer and float arithmetic as a conversion.

Complex numbers - complex dialect - addcf, subcf, create_complex, im, re.

MemRef operations - memref dialect - load, store, prefetch, atomic_rmw, generic_atomic_rmw, atomic_yield, global_memref.

Tensor Operations - tensor dialect - tensor_cast, extract_element, tensor_from_elements, subtensor / subtensor_insert, dynamic_tensor_from_elements, tensor_load, tensor_store.

Needs further discussion - yield, constant, select, dim, rank, splat - we may choose to duplicate some things and add a common interface/trait, have a “utility” dialect, etc.
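To make the example split concrete, a small piece of IR under the proposed grouping could look as follows. This is illustrative only: the dialect and op names follow the example split above and are not final, and constant placement is still an open question.

```mlir
// Illustrative only: op names assume the example split above.
func @example(%x: f32, %y: f32, %cond: i1) -> f32 {
  cf.cond_br %cond, ^bb1, ^bb2       // today: std.cond_br
^bb1:
  %sum = float.add %x, %y : f32      // today: std.addf
  cf.return %sum : f32               // today: std.return
^bb2:
  %r = math.sqrt %x : f32            // today: std.sqrt
  cf.return %r : f32                 // today: std.return
}
```

Each op carries its dialect prefix, so the provenance of every operation is explicit at the use site.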

We are looking for consensus on the idea and process of splitting, NOT on the specific example proposed above.


Thanks for capturing this. In general, I agree that the standard dialect should be refactored, but I think it’s going to be very difficult to take a criterion like “ops that are used together” as guidance without having this reduced to bikeshedding. A few questions/comments, without putting in my personal opinions here:

  1. If these new dialects don’t include their corresponding types (int, float, complex, tensor, memref), and those types are still in the ‘standard’ or builtin dialect, does this ‘solve’ the library size problem? Or put another way: if all ops are removed, how big is libMLIRStandardOps?

  2. If dialects are refactored this way, does this actually simplify conversion to LLVM, or make it more complicated? I think that most of the complexity in StandardToLLVM is because there exist a number of types that have non-trivial lowering to LLVM and these types are all converted atomically. I suspect that factoring out some of these would simplify the process (e.g. “LegalizeIndexType” that lowers index into a particular int type, LegalizeMathOps that converts rsqrt() -> 1/sqrt(), LegalizeComplex that converts complex to a tuple type, LowerUnrankedMemrefs, etc.).

  3. Going back to ‘ops that are used together’: I see 2 ways to interpret this: a) ops appear together in user programs b) ops are manipulated by transformations in a similar way, or must be transformed together. At first I thought you meant a), but maybe you actually mean b), which makes much more sense to me. For instance, all of the ‘control flow’ operations must be considered to build a CFG.
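The rsqrt() -> 1/sqrt() expansion mentioned in point 2 is representative of the kind of purely algebraic rewrites such incremental legalization passes would apply. A standalone sketch of the underlying algebra in plain C++ (the function names are made up for this illustration; a real MLIR pass would rewrite ops rather than call functions):

```cpp
#include <cmath>

// Algebraic identities a hypothetical LegalizeMathOps-style pass would
// apply, expressed as plain functions for illustration only.

// rsqrt(x) expands to 1 / sqrt(x).
double expandRsqrt(double x) { return 1.0 / std::sqrt(x); }

// ceildivi(a, b) for positive operands expands to (a + b - 1) / b,
// using ordinary truncating integer division.
int expandCeilDivI(int a, int b) { return (a + b - 1) / b; }

// floordivi for positive operands is plain truncating division.
int expandFloorDivI(int a, int b) { return a / b; }
```

After such an expansion, only basic integer and float arithmetic remains, which is exactly what the proposed int/float dialects would carry.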

It’s probably also worth thinking about the process by which this happens, in addition to the end goal. It seems to me like an incremental approach can happen here, by splitting off portions of operations, with the quantifiable goal of simplifying StandardOps.cpp. Low-hanging fruit would seem to be: DMA ops, tensor ops and the tensor type, complex numbers and the complex type. In addition, and somewhat coincidentally, there seem to be ways of simplifying StandardToLLVM.cpp with more incremental lowering that do not require refactoring the standard dialect and are probably independent.

We are intentionally trying to avoid discussing the final split or criterion for splitting in this thread (We can discuss the “ops that are used together” criterion in separate threads when splitting). This thread is about getting consensus around the desire to split and the process of doing so (via multiple RFC incrementally splitting the dialect, following our ordinary “new dialect” RFC procedure).

Regarding your point 1, if all ops are removed, then libMLIRStandardOps will no longer exist (the std dialect only contains ops). The “builtin” dialect contains all the “standard” types/attributes. This RFC focuses on ops, not on types/attributes.

The resulting size is 0. “Standard” types belong to the BuiltIn dialect and are linked into libMLIRIR. So are “standard” attributes. This is yet another reason to rethink the notion of standard.

There could be some additional complexity in the beginning, but the net result should be simpler. I totally agree that type conversion is what currently makes the conversion monolithic and complex. We don’t have standard ops that accept LLVM types and vice versa, so we convert everything together. (Worse, we also have vector-to-LLVM and Linalg-to-LLVM conversions that subsume standard-to-LLVM and other things). We need to break this anyway and we arguably have the appropriate mechanism - source/target materialization in dialect conversion and cast operations. Having separate dialects will force us to actually use and improve this mechanism.
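Concretely, with materializations in place, a partially-converted function could carry explicit casts at the dialect boundary instead of requiring one monolithic conversion. A sketch, using the LLVM dialect type syntax of the time; the cast op spelling here is hypothetical, standing in for whatever the type converter materializes:

```mlir
// Hypothetical cast-decoupled lowering: std and llvm ops coexist,
// stitched together by materialized casts. "my.cast" is not a real op.
%sum  = std.addf %a, %b : f32
%cast = my.cast %sum : f32 to !llvm.float   // source/target materialization
%prod = llvm.fmul %cast, %cast : !llvm.float
```

Each dialect-to-LLVM conversion can then run independently, with the casts folding away once both sides are converted.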

This is exactly what we want to do.

This can indeed be an independent refactoring, but like any refactoring it needs a driving force. :)

+1 long awaited topic, thanks for pushing on this.

In ramping up on MLIR, one of the things that struck me is the duality of Operations and Types. On one hand Operations have a generic structure that can be reasoned about at a high level, but types are effectively opaque. You can see the practical implication of this in the C Bindings where we provide a generic API for creating an arbitrary operation which I can use to build up a valid operation of a specific type, but for types (and attributes) the generic API is just “parse” and we have manually created a set of C bindings for the standard types.
I don’t really have a concrete suggestion, but I wonder if we can get to a better place by developing a more generic structure for types akin to what we have for operations, and then leverage this structure to modularly translate Standard to LLVM and create a more structured story about defining types using the bindings.

My overall sentiment is that this is a really good direction. I never liked the distinction between “builtin” and “standard” and it never felt like the standard really is “standard”. I think the back-of-the-envelope proposal (int, float, math, complex, etc) makes sense, too, especially the concerns around vector/tensor/memref.

The two main concerns, as already said, are lowering to LLVM and the separation criteria. I believe the former, coupled with implementation patterns, should drive the latter.

Stephen’s point that the “used together” concept is really “must be transformed together” is pertinent. There are patterns of usage like dialects using other dialect’s types/ops, or transformations that look into control flow, and thus need each other to successfully complete.

So far, I have considered the “standard” dialect the goal for LLVM lowering, because it’s already implemented. This means that I may want to convert more than I should to it just to get the free conversion, because it’s simpler to lower to “standard” than to LLVM (due to the natural impedance mismatch that the existing lowering already covers).

If the split dialects have their own conversions to LLVM, then from my (biased) point of view, I see no reason to have a monolithic “standard” dialect.

But the maintenance cost is equal or greater. Some have already said there could be duplication of operations, which would translate into a duplication of LLVM lowering. There could also be combinatorial problems between mini-dialects that start to be developed separately and may diverge without enough test coverage.

I would add perhaps two new ideas to the mix:

  1. We already have dynamic declaration of dialects from passes; we could also have them from dialects themselves, where each dialect explicitly states which other dialects it needs to function (represent code and transform code).
  2. Have a more generic LLVM lowering infrastructure that different dialects can use, so that even if we do have duplication of operations across dialects, we don’t need to have duplication of lowering. This is harder, I know, but it would be incredibly powerful (and automatically increase test coverage).

This is a bit orthogonal to the current proposal, but we can think about it. The problem I see is that operations do have inherent structure: name/opcode, list of operands, list of result types, list of regions, list of successors, dictionary of attributes. All operations have these, although some containers may be empty. Types (and attributes; they are almost the same thing) don’t have such structure. We can say a type always has a list of “parameters”, but it is unclear whether these parameters can be anything other than opaque void *, or char * for a byte-level representation, where we are back to parsing strings, just not human-readable strings.

The current Standard dialect fails to support this mode… There are many ops in there that cannot be converted into the LLVM dialect. In my opinion, this is an extra argument for splitting. I suppose we can clearly specify which dialects can be lowered to the LLVM dialect and which are not intended for this (e.g., a hypothetical memref dialect is lowerable, and a hypothetical tensor dialect is not).

Even if it’s greater, it is also arguably distributed between more people. There is also a question of scale, navigating huge libraries has its cost.

I am not as concerned about op and lowering duplication. We have mechanisms to share implementations even if we don’t share the ops. And direct lowerings are literally one-liners.

I like this direction, thanks! This could be helpful for cross-dialect canonicalization patterns, which is a partly orthogonal issue we have discussed some time ago.

Are you interested in the lowering to the LLVM dialect or to the LLVM IR? There is slow progress on the former, and we ultimately want to make it more modular than today by having better support on type-conversion edges.

We already have a form of that, in that MyDialect::initialize() is expected to load any dialects that its canonicalization patterns might create. code example.
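As a sketch of that mechanism (the dialect names below are placeholders; MLIRContext::loadDialect is the relevant hook):

```cpp
// Sketch: a dialect pre-loads the dialects that its canonicalization
// patterns may create. MyDialect and OtherDialect are placeholders.
void MyDialect::initialize() {
  addOperations<
#define GET_OP_LIST
#include "MyDialect/MyOps.cpp.inc"
      >();
  // Canonicalizations of MyDialect ops may build OtherDialect ops,
  // so make sure that dialect is loaded whenever this one is.
  getContext()->loadDialect<OtherDialect>();
}
```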

For dialect interfaces, in IREE we have hit situations where we need to load any dialects that might be created by that interface. code example.

For OpInterfaces/TypeInterfaces/AttributeInterfaces, I don’t think we even have a good concentration point like we do in the constructor of the dialect interface. For those, the dialect that the op belongs to needs to loadDialect any dialects that might be created by any Op/Type/AttributeInterface of any op in the dialect, which is kind of error-prone.

Some previous discussion on this: Structured, custom attributes and types (for non-C++ bindings) (not elaborated on yet)


Yes, this would be a much clearer separation. I’d go further and say that any dialect that converts “some” operations to LLVM must convert “all”. This would make it much easier for users of those dialects to trust the dialect will work as expected in all cases wrt. LLVM lowering.

It then becomes much simpler to only implement LLVM lowering for the things in “my dialect” that I can’t find in any other existing dialect “that lowers to LLVM”.

Of course, this would only make sense if we split the standard dialect in the first place.

As a “user” of those dialects, to me they are the same thing. I.e., the only reason why I would want to use the LLVM dialect is to get LLVM IR at the end. I think it would be a serious break of contract if individual dialects had to lower to LLVM IR directly.

From the tutorials and examples I have seen so far, the further I’d need to go is to lower concepts that don’t exist in any other dialect directly to LLVM dialect, and that would be a “full conversion”. The lowering to LLVM IR would be 1:1 from the LLVM dialect.

This may not be entirely true in the current implementation (I really don’t know, and would hope it was), but it’s something that I think would make it much easier for dialect writers to separate concerns and only do what’s minimally necessary.

Perhaps there could be a list of dialects somewhere in the interfaces that user dialects can use to load all the necessary ones.

Optimally, no dialect creator would have to load any dialect by hand.

A simple way would be for each dialect to list all of the things it needs, which would themselves have lists of all the dialects they need, and a simple loop over the resulting set would load each dialect.

A more elaborate way would register dialects onto a set argument to each constructor (dialects, interfaces, etc), which would then insert (or ignore) all their needed dialects. Then a wrapper function loadAllDialects would make sure they’re all loaded.

This proposal is fantastic, and I think you’ve really nailed it (including using individual RFCs for the details of the splits). Some random comments:

  1. I completely agree that the privilege of “not having to write std.” is wrong. Perhaps we should phase that in for the standard dialect as well? I think that even requiring that for “func” would help make MLIR more consistent and easier to understand for newcomers.

  2. Among other problems with the standard dialect, it is really old, so it does not use best practices. It would be nice to move things to use new ODS features, either as part of this shift or in preparation for it.

  3. I don’t think the goal of having an “hourglass”-shape lowering graph is a particular problem for this. Lowering to two ops in 2 dialects is the same as lowering to two ops in one dialect conceptually.

It looks like there is pretty strong support for this direction. I’ve gone ahead and proposed an RFC for splitting out a tensor dialect.

This is exciting!

Super exciting, I’m thrilled to see this happen!