[RFC] Proposal for a high-level ML dialect in MLIR

And continuing down the path of asking controversial questions: the two approaches to this are what I call a reduction opset and a union opset. Examples of the former are MHLO/TOSA; of the latter, ONNX. I think we could also include the framework dialects themselves in the latter.

Which way do we want to go? What about beyond ops? With all of the existing dialects, with the possible exception of torch, we are representing a subset from a capability perspective. I’m less concerned about op deltas, which can always be filled with more ops or an extension mechanism. But if we’re heading down this path, we need to know when we’re ok limiting the expressivity of the platform features.

On the torch side, by lowering to a simple SSA-value based form, we are erasing mutability, but more importantly, we are leaving out possibly the most important feature of its runtime representation: the fact that all of its tensors are strided and that, to a first approximation, PyTorch basically is an engine for turning as much as it can into a metadata operation on those data structures. If we throw that out, there are certain implications, and we should be ok with them (chief among them is that it becomes very difficult for the resulting compiler to efficiently use any parts of the torch ecosystem or kernels – thereby needing to be a complete island). Currently, if I had to design such a system, I would need to add the “delegate bits” at the torch level and constrain the compiler stack to just certain code generation activities within that framework. That might be ok, but it is a limitation.
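To make the “metadata operation” point concrete, here is a small illustration (a sketch assuming a stock PyTorch install, offered purely as a data point):

```python
# Many torch "operations" only rewrite shape/stride/offset metadata over the
# same underlying storage and never touch the data itself.
import torch

x = torch.arange(12.0).reshape(3, 4)
y = x.t()        # transpose: a view, no data movement
z = x[:, 1:3]    # slicing: also a view

print(x.stride(), y.stride(), z.stride())  # (4, 1) (1, 4) (4, 1)
print(y.data_ptr() == x.data_ptr())        # True: same underlying storage
print(z.data_ptr() == x.data_ptr() + x.element_size())  # True: just an offset

# A pure SSA value-semantics form erases this: each op conceptually yields a
# fresh tensor, so the compiler must rediscover (or give up on) the fact that
# these were zero-cost metadata operations.
```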

Since Jax basically is MHLO, it suffers less from this problem, but you do see some of the gaps emerge around distribution and state management. Ditto TF but also with more “leakiness”.

We’ve got to abstract over something, but we are going to lose things in the process, or drive certain parts of the resulting design to the frameworks. It seems to me that all of the existing proposals are abstracting over a lot of ground on the torch side, when actually the torch execution model is the generalization.

1 Like

That’s a very good point. My personal take on things as vague and high-level as these is to be as pragmatic as possible from the beginning without losing sight of the overall goal, which seems to be exactly what you’re proposing, too.

I do not have a good answer, unfortunately, but if I had to guess, I’d say there are two paths to this:

  1. We pick one side and limit the other side’s ability to perform, which is what you described above and which I read as “not the best idea”, because it limits what you can do in MLIR compared to what you could do in the original framework.
  2. We compose. Each front-end does what it can on its own dialects and only lowers to the upstream dialect when it has run out of things to do. This is the MLIR way, but with a twist: you know you’re losing information, so it will be impossible to do certain things after that.

The second point was our approach with Verona. The highest level representation had information we needed to do Verona-specific stuff that wasn’t possible in any other dialect, and once we lowered to others, we were literally throwing away any hope of performing those same transformations again. It wasn’t just hard, it was probably impossible.

That is a trade-off, similar to the first point, but time-wise rather than space-wise, which I think allows us to milk the framework a bit more. So this more generic high-level dialect could be less expressive than Torch or even MHLO, designed for mid-to-low-level optimisation passes, not the high-to-mid-level ones that are still better done in their original dialects.

This is one way to do it, not the only way and probably not the best way; more likely one part of a combination, even. But I think we should explore a more tiered approach rather than a catch-all approach.

1 Like

IMO the former does not imply the latter.

For instance, an ML compiler could still “fuse” (grouping may be a better term for this) a convolution and a relu and map that to a pre-fused cuDNN kernel. Doing this on an orthogonal op set (TCP / MHLO etc.) means we don’t need a large set of pre-fused convolution ops (that largely mirror cuDNN) at the Torch / TensorFlow level.

cuDNN is one thing. I think what I was referring to was more interop with the native PyTorch kernels (including those that are user defined). And I was using that somewhat as a strawman for other integration laden things in that vein (although there are real use cases for this precise thing that keep coming up).

(And for the record again, I’m not concluding anything, just poking to try to understand beyond what level certain parts of the problem become misaligned)

1 Like

<wild tangent>
Let me expand on this, as an unrelated tangent, outside of the main discussion, just as a data point, not as a discussion point…

Things like alias analysis or vectorisation in LLVM look at bare LLVM IR, try to find patterns, map those patterns into compiler structures and then look at those (TBAA and SCEV annotations, VPlan, Poly). This is hard to do, but there is enough info to grasp some things.

If we lower loops in a canonical form, it’s easier to vectorise than if we need to shuffle basic blocks around, hoist declarations, change induction ranges, etc. If we pass restrict annotations, it’s easier to do alias analysis.

So what I mean by time-wise trade-offs is that we do what we can early on, but when we lower, we try to lower into the most canonical form possible, so that, even while destroying precise information (dialect ops), we keep imprecise information (canonical forms, annotations), which may make it possible (but not certain, due to cleanups and transforms) to do the same thing again, later on, on a less expressive representation (dialect).

This is necessary because, even with a high-level precise representation, we don’t always have the right shapes. For example, inlining exposes many opportunities for other optimisations that we just don’t have before inlining. But it also destroys shapes, annotations, etc.

The approach I propose above (and similar to what I proposed to Verona) is to do just that:

  • Progressively lower to less expressive dialects only after doing what we can with the information we have, fully knowing we cannot do it all and giving up the idea that we can do it all.
  • When lowering, try hard to keep canonical forms and annotations, so that a late pass can still do again what you did originally, but with less information. This can be the same code but after a local raise of the (now low-level) IR, if it works.
  • When doing cleanups, transforms, etc., try hard to keep canonical forms and annotations, just like we do with LLVM IR.

Perhaps this is my brain after 14 years of working with LLVM. Perhaps this is how it can be done with such a diverse set of front-mid-low ends. But I’m hoping someone has a better idea…
</wild tangent>

Another wild tangent, but then I need to sign off for the day and let both the timezones and my day job catch up.

What if we based this new thing on the torch type system and op interfaces but practiced “ethical non-monogamy” with respect to ops? Type systems and interfaces are what have the highest mis-abstraction cost. We should try to get the op sets to something approaching canonical, at least for some 80% case, but beyond that we need freedom more than we need uniformity.

1 Like

I really like the “MLIR way” description; let’s try looking at this from that perspective a bit more. IMO, it is not really a twist that lowering = losing information: this is the main reason why we originally insisted so much on progressive lowering, which is basically a more conscious approach to discarding information. That being said, converting between abstractions (e.g., between a framework-specific and an MLIR-generic opset) does not have to be a lowering. This ties with @stellaraccident’s comment on reduction opset (lowering) vs union opset (horizontal, reversible conversion), and maybe there is some middle ground where only some information is deliberately lost.

A “more MLIR way” could be a mix-of-dialects: there is a “reduction”-style generic opset in MLIR that frameworks use alongside framework-specific ops for the things that cannot be represented otherwise. This can include wrapper ops that bridge abstractions, of the kind we have to connect tensors and buffers, but this comes at a cost for the framework.

An “even more MLIR way” is to consider the things frameworks are willing to do on their high-level representation and, if they are common, try implementing them on interfaces instead. This could side-step the question of the common dialect, but may end up being harder to design. Specifically, mapping to library calls would need some sort of marker that an op or a combination thereof is equivalent to a library call, with potentially as many interface methods as the library has functions.

Both “more MLIR” ways come at a cost for the frameworks compared to rather happily living in their own sandbox, and the time scale at which the benefits from using those will pay off is not very clear.

2 Likes

The other cost is the added complexity (potentially combinatorial) for front-end-agnostic optimising pipelines absorbing those modules: they need to know which ops from different dialects are equivalent, and what effect ops from one dialect have when they appear as operands of ops from another.

We may be able to use traits and interfaces to common those things up, but we may end up with silly traits like ElementWiseMulLikeOp and ElementWiseAddLikeOp so that I can merge tosa.add(mhlo.mul) into an ElementWiseMLALikeOp. It’s a silly example, but you get the idea.

I can see this. Designing interfaces upfront is hard; I would be inclined to just try and define them to see how soon (if at all) we will actually run into the problem.

For this example specifically, there shouldn’t be that many fused ops we would need to support (e.g., to target libraries), for the exact same reason we would otherwise need a lot of patterns: combinatorial cost. We can also think about this problem as generalizing InstCombine to work across dialects. We can try generalizing traits, e.g. an ElementWiseOp interface that returns the core operation as arith.something. We can also decide that this part is not worth generalizing, and each framework can just keep doing it separately because it’s easier, but then we will know for sure.
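To make the interface idea a bit more tangible, here is a toy sketch in plain Python (not MLIR’s actual APIs; the interface method, the op names and the middle.fused_mla result op are all hypothetical) of how one cross-dialect rewrite could key off an elementwise interface instead of per-dialect traits:

```python
# Toy model of a cross-dialect "InstCombine"-style rewrite keyed off an
# elementwise interface instead of dialect-specific ops.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                       # e.g. "mhlo.multiply", "tosa.add"
    operands: list = field(default_factory=list)

    def elementwise_kind(self):
        # Hypothetical interface method: which scalar arith op this
        # elementwise op applies, or None if it is not elementwise.
        if self.name.endswith((".multiply", ".mul")):
            return "arith.mulf"
        if self.name.endswith(".add"):
            return "arith.addf"
        return None

def fuse_mul_add(op):
    # Match add(mul(a, b), c) -> fused multiply-add, regardless of dialect.
    if op.elementwise_kind() != "arith.addf":
        return None
    lhs, rhs = op.operands
    for mul, other in ((lhs, rhs), (rhs, lhs)):
        if isinstance(mul, Op) and mul.elementwise_kind() == "arith.mulf":
            return Op("middle.fused_mla", [*mul.operands, other])
    return None

prog = Op("tosa.add", [Op("mhlo.multiply", ["a", "b"]), "c"])
print(fuse_mul_add(prog))  # Op(name='middle.fused_mla', operands=['a', 'b', 'c'])
```

The point is only that a single pattern covers tosa.add(mhlo.mul) as well as any other spelling that exposes the same interface, instead of a combinatorial set of MulLike/AddLike traits.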

3 Likes

I am relatively new to MLIR, but I see modules containing ops from more than one dialect in the same module/function.
So, why a new dialect? Why not just pick and choose whichever ops are best suited from the existing array of dialects and mix them as needed, and only add new ones if none already exist?
Over time and with experience, the best ones to use will be used the most.
I know that there is a lot of focus on inference models at the moment, but there will soon be ways to train systems with a single picture instead of thousands, and for that, the ops needed are somewhat different from the existing ones.
I guess what I am trying to say is that things keep changing at the moment, so whatever high-level ML dialect we come up with will need updating and changing for some time to come before it stabilizes, so we need considerable flexibility in the ops we use.

1 Like

I think one important point worth considering is that it has to be somebody’s job to decide on the hardware-efficient lowering of a bunch of different ops. This is something we can definitively pronounce does not belong in the frontends.

I gave the list above: sort, fft, topk, scatter-with-repeated-indices, qr decomposition, cumsum, embedding bag, “things with >4 dimensions”, “things with data-dependent dimension-sizes” (nonzero, unique), quantized-softmax-without-messing-up-the-final-fusion, etc.

Each of these has multiple different ways it could be lowered, and a LOT of ingenuity and domain expertise goes into it. E.g. for FFT: do you do DIT or DIF? do you have a big enough batch dimension you can use for most parallelism? at what size should you fall back on a DFT on this hardware? Even something simple like “exp minus 1” can require precision considerations etc. (it might be okay to unfuse sometimes in return for requiring backends to support fewer ops, or perf, or whatever).
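As a tiny, concrete example of the “exp minus 1” precision point (a NumPy sketch, not tied to any particular stack):

```python
# The naive decomposition exp(x) - 1 loses nearly all significant digits for
# small x in float32; a dedicated expm1 keeps them. Deciding whether a backend
# may unfuse this is exactly the kind of call a decomposition layer must own.
import numpy as np

x = np.float32(1e-7)
naive = np.exp(x) - np.float32(1.0)  # catastrophic cancellation in float32
fused = np.expm1(x)                  # dedicated kernel keeps the precision

print(naive)  # ~1.19e-07, off by roughly 19%
print(fused)  # ~1.00e-07
```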

The skillset required for making those design judgments based on target information and domain expertise is a very distinct island that I think is somewhat orthogonal to various other parts of the community. It sits above codegen and backends like IREE (to name one off the top of my head). And it sits below frontends like Torch-MLIR.

We’ve been trying to shove more of this into the codegen space (e.g. linalg_ext.fft) but in its current form it feels too low-level when you consider the vastness and specificity of different decisions that need to be made. Maybe some day these technologies will be generalized enough to subsume all this, but we need a place TODAY for doing those things which we can then gradually subsume. Actually this could accelerate the generalization of linalg/etc. precisely by showing all the transformations that are needed in practice. Writing these transformations to achieve a certain performance/functional goal is a different skillset from staring at multiple transformations and distilling a set of principles that allow unifying them.

I wonder if we can stake out some space in the “ML middle-end” based on that definition: input: broad coverage of ML operators; output: efficient decompositions/lowerings taking some degree of target hardware/low-level backend details into consideration. The input would not be a single orthogonalized form (but would have enough stability guarantees for frontends – probably with a stability-guaranteed “union” dialect that initially is a light layer of indirection to internal dialects, but will gradually get further from them). The output would not be a single canonical form either. But it is the responsibility of this layer to “impedance match” all the frontend operators to the different backend requirements. And we would build composable infra to perform a variety of transformations that incorporate various notions of hardware efficiency which are absent at the frontend level, and non-recoverable at the level of detail of codegen/linalg. As a rough guess here, the input would be non-destination-passing-style and the output would be destination-passing-style.

Another thing that I would want to be handled at this layer is converting complex numbers to real numbers and emulating data types that are not available on a target backend (e.g. a backend doesn’t support f64 – emulate it, or have a great place to make a policy decision about truncating it). This is one of the biggest pain points when lowering from Torch-MLIR to IREE for example, and begs for a layer to handle it.
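To illustrate the complex-to-real part of this, here is a hedged NumPy sketch of the kind of rewrite meant (not the actual Torch-MLIR/IREE code):

```python
# Lowering a complex matmul into four real matmuls; emulating a missing f64
# with pairs of f32 is analogous in spirit (though it needs much more care).
import numpy as np

def complex_matmul_as_real(ar, ai, br, bi):
    # (ar + i*ai) @ (br + i*bi) = (ar@br - ai@bi) + i*(ar@bi + ai@br)
    return ar @ br - ai @ bi, ar @ bi + ai @ br

a = np.random.rand(3, 4) + 1j * np.random.rand(3, 4)
b = np.random.rand(4, 2) + 1j * np.random.rand(4, 2)
cr, ci = complex_matmul_as_real(a.real, a.imag, b.real, b.imag)
assert np.allclose(cr + 1j * ci, a @ b)
```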


So, the Torch dialect is not a good place for this in its current form, because the work to get a nicely decomposed and “orthogonal” op set hasn’t happened yet. I mean, Torch-MLIR does do decompositions, but it’s a “get things working” sort of thing, and we haven’t (and don’t intend to) do the principled work to orthogonalize and layer that decomposed op set. Frontend-towards efforts like PrimTorch or backend-towards efforts like this thread should be responsible for that.

I think that JAX has opened up a bunch of interesting questions regarding the programming model and interaction with the compiler, which traditional compilers (except in lisp/etc.) don’t bring in. Things like grad/vmap/etc. are really user-controlled compiler transformations, and so naturally live closer to the frontend (or require very careful layering). I don’t have full visibility into this, but I have some intuition that distribution (data-parallel and pipeline) are similar things that intersect with the user programming model, and so the layering has to be carefully considered to really understand what precisely is the information that the frontend has to tunnel down to which layer.
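To make this concrete for readers who haven’t used JAX, here is a small example (plain jax.grad/jax.vmap usage, shown only to illustrate the layering point, not any proposed design):

```python
# grad and vmap are transformations the *user* applies and composes in the
# program itself, so that information naturally lives at the frontend rather
# than in a generic middle-end.
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum(jnp.tanh(x @ w))

# Differentiate w.r.t. w, then map over a batch of examples (w is shared).
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0))

w = jnp.ones((4, 3))
xs = jnp.ones((8, 4))                  # batch of 8 examples
print(per_example_grads(w, xs).shape)  # (8, 4, 3): one gradient per example
```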

Also, practically speaking, not all frontends necessarily have MLIR as a dependency. Maybe they should, but there are valid reasons not to.

We definitely need a reduction somewhere, because many transformations, including grad/vmap but also SPMD/etc., essentially require writing O(#ops) transfer functions. Because grad is more framework/frontend aligned, I think there is some argument to move the reduction op set towards the frontend (PyTorch is in fact moving this way, and JAX is already there). This then leaves a pretty large space in the stack to put all the other transformations. It would be useful if an expert could explain how JAX separates responsibility for these (grad, vmap, SPMD, pipeline-parallel distribution, data-parallel distribution) between itself and XLA.

@jekbradbury perhaps?

The direction that Torch-MLIR is collaborating with PyTorch on has the framework give us graphs with value semantics for those islands where the framework wants to commit to a more comprehensive compilation stack (TorchDynamo is an example where such “what goes in the graph the backend compiler sees” decisions happen). I don’t see a super great compilation story possible with strided tensors – it is too constraining for any significant compilation work to happen; at best (and even then I would ignore it and recover later) it is more of an “inline hint”.

3 Likes

That is exactly what we’re trying to do.

Precisely. We don’t want to restrict to any specific front-end, but we still have to apply pattern matches and understand the semantics somehow.

Exactly. Specific back-ends will need some meaningful semantics to lower efficiently. Some passes will want to just convert between dialects and forms, while others will want to lower closer to a specific architecture. It’s the job of those passes to know what to generate, but hopefully not need to know of all variations of input semantics out there.

2 Likes

The thing I’m watching for is when the spark jumps to the next step: i.e., some specific principles get laid down, and that excites a small number of people enough to look at priors and write some concrete code or sketches.

In my experience, with a long history of work at this level, in this area more than any, there are no perfect solutions – pragmatism rules the day and the value of having an agreed thing that exists and can be counted on far outweighs a lot of other concerns.

But these things often also turn into kitchen sinks if not well conceived from the get go. Writing down the principles and, most critically, enunciating the layering with examples of the kinds of things that are done at this level above/below/to the side is pretty important to guard against that. Beyond that, I think it just needs (semi) consistent application of the principles and criteria for inclusion and a commitment to some practical iteration before nailing things down for common use. At the end of the day, you do just need to settle on the three ways to spell “add” (and then do that between 50 and 1000 more times) :slight_smile:

I think the other thing that tends to trip these efforts up is trying to aim for being too future proof. Reality today is these numpy derived/inspired tensor programming models as embodied by PyTorch, Jax, and TF. Even if we only succeed in making a common/accepted, compiler aligned mapping for some critical subset of that present state – that is enough. Maybe it generalizes to the future and the next thing, maybe it doesn’t. Either is ok.

My 2 cents.

3 Likes

I suggest we use the 18 August ODM for this. We’re happy to present an initial skeleton design to kick-start the technical discussion.

3 Likes

Yes, that’s also what Alex was pointing out above, and me above that. Passes are already written to consider different ops / be robust against unknown ops, one can mix and match (Linalg and SCF are also really nice in how they enable optimizations without fixing on specific ops; there are some nice SPMD partitioning abstraction ops in progress too …), conversion targets can be combinations, pattern sets are not constrained to specific dialects, interfaces are also a thing, etc. Identifying good set(s) of complementary ops, finding boundaries, identifying gaps, mapping gaps in abstractions: all can be quite well done here, as you say. A lot of good data and good discussions can be had that drive evolution along with many contributors, IMHO.

But I think folks like naming things. Having a shorthand for a set of ops is convenient: a conversion target is a naming mechanism for the target, but there isn’t such a mechanism for the source. A pseudo-dialect that operates as a dialect but consists of ops from many dialects may be a useful container to add, though, and then we can leave sed scripting and duplication/deduplication for later.

I think that will be very exciting and a great jump-start to an incubator project!

Hi @raghavanr, nice to see you around here :slight_smile:

Thanks for reviving the “TCP discussion”, it would be great to see some progress on this front.

Lots of fun stuff to catch up on, after I was out for 2 weeks of vacation :slight_smile:

First, some questions / comments.

From these comments, I seem to infer that you are principally interested in defining an input dialect with an objective of expressiveness first; is this accurate? FWIW, our experience is that there are many tradeoffs involved when additionally considering transformations. Some transformations are more straightforward than others:

  a. fusion-grouping nodes in an SSA-graph, DAG->DAG and DAG->library call and algebraic rewriting can (and should) be done at the TASO-level of abstraction.
  b. transformations related to memory hierarchies and mapping on parallel architectures / accelerators benefit from first-principles co-design of IR with a transformations-first mindset (more below). Preserving the ability to do transformations described in class a. after applying transformations of class b. comes with nice invariants.
  c. other classical transformations can be applied later, once in pure loop form and once we are done exploiting representational IR properties for classes a. and b. (which seems to intersect with what @rengolin is investigating).

We have found many instances of IR-design / expressiveness / transformation power tradeoffs (e.g. limiting the power/scope of transformations can increase expressiveness and vice versa).

Additionally, is your view that this new dialect should be closed (e.g. does it need to have its own tcp.while/tcp.for ops and its own types, or should one mix it with scf.while/scf.for and existing types)?
There are interesting tradeoffs in both cases.

This seems to suggest higher interest in algebraic rewrites and mapping to libraries, which I’d intuitively map to class a. of transformations above.

Up to this point of the discussion, I believe @_sean_silva’s breakdown post conveys the frontend-facing tradeoffs best:

Now, I see that a little lower in this long thread, the discussion starts to touch on bufferization and more codegen-like abstractions and transformations. A similar breakdown post of the high-order bits learned while building tensor-based codegen from first principles seems warranted and follows.
Much more details are available in our tech report.

Structured IR Principles

Here are the principles we have iterated to, so far, also prefetching a few extra things we learned since then:

  1. Define away unhappy codegen paths via UB (e.g. static ranks only, no out-of-bounds guaranteed by op semantics, dynamic size-1 broadcasting and shape mismatches are UB). Already discussed by Sean.

  2. Transformation-driven IR design with interfaces rather than concrete ops:
    a. DestinationPassingStyleOpInterface (in flight): ops are in destination-passing style and accept both tensors and buffers. SSA-based analysis allows unsurprising in-place bufferization and avoids spurious copies that are hard to remove post-hoc.
    b. TilingInterface: Ops decompose themselves into smaller ops, either in SSA or side-effecting form, all the way down to loops around scalar data. Bufferization may happen before or after decompositions into loops where temporary storage opportunities appear.
    c. StructuredOpInterface: Key op categories embed their iteration space and indexing pattern (e.g. a matmul is fully defined in the IR with first-class attributes and is not a blackbox op where the magic is carried by an arcane C++ implementation). Useful op categories encompass map, reduce, broadcast, transpose, window ops with arbitrary region and compositions thereof. A limited opset represents a large chunk of ML compute and avoids explosion. (A toy sketch of this idea follows the list below.)
    d. A rich set of transforms on SSA value operands are supported by N-D set / subset abstractions from the tensor and memref dialects. Subject to ongoing generalizations.
    e. Preservation of information: Information is carefully discarded by lowering out progressively (often by decomposing via more loop levels and rank-reducing sizes of 1). Transformations retain information. Raising is unnecessary to lower to library calls or to special ISAs / intrinsics, even in the presence of many transformations.

  3. Concrete ops (e.g. matmul) are declarative and implemented as much as possible as syntactic sugar above DestinationPassingStyleOpInterface + TilingInterface + StructuredOpInterface. This automates the creation of new “named” ops and makes transformations available on new “named” ops by construction. Getting this right is important to get transformation class a. (DAG, fusions and algebraic rewrites) to be idiomatic and delightful to use.

  4. Intentional composition with other dialects (tensor, memref, linalg, vector, sparse_tensor and scf) ops and types in non-surprising ways:
    a. Regions compose with elemental scalar, quantized and vector types and ops.
    b. Operands compose well with tensor types (sparse, dense, future extensions) and buffer types (dense, strided, future extensions).
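To make principles 2a and 2c more concrete for readers who haven’t looked at linalg.generic, here is a toy Python sketch (my own illustration, not MLIR’s actual representation or API) of an op defined purely by its iteration space, per-operand indexing maps and a scalar region, writing into an explicit destination operand:

```python
# A "structured" matmul: nothing matmul-specific hides in the executor; the
# op is fully described by its iteration space, indexing maps and region.
import itertools
import numpy as np

def generic(iter_space, indexing_maps, region, inputs, init):
    out = init.copy()  # destination-passing style: the result operand is explicit
    for ivs in itertools.product(*map(range, iter_space)):
        loaded = [inp[m(ivs)] for inp, m in zip(inputs, indexing_maps[:-1])]
        out_idx = indexing_maps[-1](ivs)
        out[out_idx] = region(*loaded, out[out_idx])
    return out

A, B = np.random.rand(4, 8), np.random.rand(8, 5)
C = np.zeros((4, 5))

result = generic(
    (4, 5, 8),                    # iteration space (m, n, k)
    [lambda i: (i[0], i[2]),      # A indexed by (m, k)
     lambda i: (i[2], i[1]),      # B indexed by (k, n)
     lambda i: (i[0], i[1])],     # C (the destination) indexed by (m, n)
    lambda a, b, c: c + a * b,    # scalar region: multiply-accumulate
    [A, B], C,
)
assert np.allclose(result, A @ B)
```

Tiling, fusion and lowering to loops can then be written once against this generic form rather than per named op, which is roughly what the interfaces above factor out.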

To be clear, linalg by itself is not the one true thing we want; it is not expressive enough and was not meant to be (as was noted, by design). Still, it has been pragmatic in allowing concrete progress in navigating a virtual minefield.

Somewhat of a tangent from this thread, I know there are various critiques of linalg from various parties. As one of the authors and maintainers, I would like to find a way so that we can try to get down to specifics so that we can learn and evolve – I think we’d all like this part of the ecosystem to be better and have a more engaged development process. I sometimes get vague or third-hand commentary that is hard to parse, and I feel that if we could get to the concrete issues, we could probably align on them.

To put things in perspective, note that linalg has been the only upstream e2e path to make ops with tensors executable and transform them to high-performance binaries. It provided a concrete anchor point for the following abstractions that did not exist before or were just broken:

  1. Interfaces using concept-based polymorphism proposed by @mehdi_amini IIRC.
  2. Regions added by @ftynse IIRC.
  3. Progressive lowering principles and practice.
  4. Turning the tensor dialect from a pumpkin into something usable with high performance.
  5. Redoing bufferization after it was deemed unsuitable.
  6. Various abstractions extracted into the tensor, memref, and scf dialects.
  7. The entry point to the sparse compiler work.
  8. The principles and interfaces described above.

There is still a lot of ground to cover to make good use of core infra and continue refactoring interfaces and APIs that sometimes predate more modern core ways. It’s clear linalg contributors have sometimes found it hard to stay completely aligned as the core abstractions have moved forward, a lot, up until very recently as also pointed out by @mehdi_amini. Words do matter indeed, and actions speak a thousand words: I think there are opportunities to engage and spell all this out in the right way, together, and that the time is now ripe for it.

4 Likes

Some additional questions and comments.

Are you referring to linalg.generic as a way to write TASO-like and algebraic rewrites? I think this is an important area that has been underserved so far (beyond showing that automation of a custom-defined opset is possible, following a design from first-principles). The harder opinionated decisions and spellings are not there yet and are important to get right (more on this later).

IMO this is more a symptom of passes: phase ordering is a well-documented problem, as you know. I’d characterize it further as: passes + analyses + low-level IR create ordering interferences that make controlling the compiler very hard.

High-level op design and targeted transformations following the principles I outlined above seem to help reduce the issue, sometimes significantly.

This sounds very much in line with principles of the structured approach described above. Could we have an ODM to go through the principles you came away with and what drove you to those? I suspect the topic of concurrent ownership may not be the easiest to grasp for all but focusing on principles would be super interesting IMO.

@rengolin I would also turn the question around: given a skim of the tech report, and the principles I extracted above, do you see principles / examples you disagree with, beyond more opinionated spellings of how a particular op should be written / what dialect it should live in?

Strong +1: the real elephant in the room is the type system. Interfaces are often easier as they percolate up after enough massaging of the code and refactorings.

This sounds great!

Our experience is that the spelling of middle-end ML ops is an area of highly opinionated discussions for which community consensus is required (e.g. how does one spell a convolution and what are the tradeoffs?), beyond the principles and mechanisms described above. Because of the potential contentious nature, we have punted on taking hard decisions but I believe it is now time to work on this, together.

We prefer an approach by which classes of ops that transform well, e2e, to high-performance implementation are gradually evolved and don’t suddenly turn into pumpkins. But we are also looking for collective wisdom on the topic and hopefully contributing to that wisdom too.

Looking forward to it !

2 Likes

Yes. Each individual lowering looks ok, but when trying to look across ops for common patterns and there are other ops in between, it gets ugly. Especially when the original sequence was good enough, for example, relu(linear(splat())).

Perhaps. But it would probably be a lot less interesting than you may be hoping.

Many of those decisions were short lived, because the region semantics changed a lot over the years. The only high-level design decision was that we didn’t want to carry annotations/metadata throughout lower IRs because that is a known time sink for trying to raise the level again. Standard MLIR stuff. But that generated the conundrum of “there’s no good place to do X because it’s either too high-level or too stiff IR without enough information at the low level” (a bit like XLA/HLO passes).

I’ll study the report and get back to you.

+2

This is good for e2e frameworks, but it’s really bad for anyone else trying to use dialects to start a new path. If we have one (or more) dialect(s) per framework, not interchangeable, then hardware vendors will have to either join every framework’s effort directly or develop their own e2e framework, and neither of those scales.

Avoiding the pumpkin problem is hard, I know, but to me it already looks like a zucchini, so potaytoes potahtoes. :smiley:

Trying to do something with heterogeneous hardware (on the same program) becomes virtually impossible.

Thanks for those detailed comments @nicolasvasilache.

That is mostly correct; the only difference is that we don’t see it as an input dialect, but rather a dialect that all frontend dialects can translate to.

This has been brought up a few times in the thread by @jpienaar and @jcdutton. We don’t see a strong reason why this has to be closed. We should be able to reuse existing ops from other dialects, as long as the needs don’t diverge in the future.

One factor to consider here is that this would constrain the kind of types the dialect uses. For example, scf uses builtin types. So, once we decide to use scf.while/scf.for, we can’t be using ops from some other dialect that has its own types.

Thanks for listing these principles. That is very useful. While this structured approach is interesting, I’m not sure if loop-level transformations as you propose here should be the target of this dialect. The idea we had was to keep this dialect at a higher-level and defer the loop-level transformations / optimizations to lower-level dialects like Linalg.

This is a dependency nightmare for frontends. Frontends need something with a consistent project integration story. I think the value here is split at least along 3 main axes:

  1. to be the disciplined layer that has a consistent and convenient “ingestion” story from the frontends (stable format, “shallow/interface dialect”, etc.).
  2. shielding frontends from having to prematurely bake in hardware-specific details when lowering certain ops (by having a solid “union” dialect that can grow and doesn’t have to be perfect/orthogonalized).
  3. the internal design of the project for how it internally orthogonalizes/lowers from what frontends produce to what backends want (and how it defines what backends want)

I see a lot of work in this space around trying to solve a “what is a nice orthogonal set of ops” problem. But I think initially most of the effort is about carving out project structure and “ingestion strategy” that will get wide adoption, and inside of which a more nuanced internal layering/lowering can be designed iteratively.

I think Stella is right that pragmatism is the name of the game, and the more that this project can shield frontends from a lot of stuff they are doing now, the more successful it will be. For example, if you think about it, it’s kind of silly that Torch-MLIR has Linalg, TOSA, and MHLO lowerings. I don’t see a technical reason that this new middle end couldn’t itself have Linalg, TOSA, and MHLO lowerings, and then Torch-MLIR would incrementally delete its lowerings to those various dialects as the “Common MLIR ML Middle End” adds support for them (please - I WANT THIS). And then nested within that project we could home a new dialect that we think provides some advantages over the others already in the ecosystem, but coming from a point of already having “tribal knowledge” of the other ones to ground design decisions.

I actually am seeing a very long tail of effort in Torch-MLIR to get the more weird ops (say, non-linalg-structured ops) working. We have found these to be extremely time-consuming to even implement correctly, let alone in a way that seems to avoid baking in target details or is efficient. I see a significant value-add to having a good story for those in this middle layer too, with all the corresponding difficulties in lowering them. Even if we initially start with the “easy” ops ingestion strategy, I think it would be useful to deliberately design for shielding a potentially large amount of “mini-compilers” for various non-structured ops.