[RFC] Proposal for a high-level ML dialect in MLIR

As I replied to @stellaraccident, this wasn’t a confusion as much as lack of clarity on my part, for which I apologise. I wasn’t trying to restrict how we communicate in the community, nor do I have any illusions that anyone can (or should try to) restrict how people communicate in general. I didn’t expect to have to make that clear (because it’s abundantly obvious to me), but I am repeating it here, for the record: my comment was that, in addition to private conversations, we should have a public summary on which to base public decisions.

My (weak) concern with your reply is that it wasn’t shedding any light on the reasons behind your stance and was only offering off-list communication for sharing it further. I was not disagreeing with your stance (or choice of personal communication), just making sure that when we have a position that we want to use to influence decisions, the reasons are clear and public (in addition to the private ones, as Stella remarked).

That is why I provided mine in my reply (quite verbosely, I apologise), not as a disagreement or counter-stance, but for completeness and clarity.

1 Like

Hey everyone,

As directed above, we’ll open a thread under the LLVM Project category proposing an incubator for the TCP project.

Also, answering / echoing some points by @ftynse @nicolasvasilache above: I’d prefer we review TCP incrementally as we develop it in the incubator. IMO early community involvement is critical for a successful graduation. Given this thread it seems safe to assume that all stakeholders are aware of TCP, but let us know if there are other forums that we should advertise it in.

8 Likes

Started a thread in the LLVM Project category to request incubating TCP: [RFC][Incubation] Request for incubating TCP dialect for MLIR

5 Likes

We plan to have all technical discussion related to TCP in the TCP-WG category in MLIR. Please let us know if there are any concerns regarding that. To maximize the chances of graduating TCP to be in tree, we’d like to ensure that TCP-related discussions are broadly visible.

4 Likes

Hi,

Given the feedback shared during the ODM and offline, we propose altering our plan regarding the location of TCP to the following, assuming it is OK with @_sean_silva & other Torch-MLIR leads:

  1. We bootstrap TCP in torch-mlir/externals.
  2. We set a trigger condition for exiting out of Torch-MLIR. We suggest that the trigger be that something runs end to end, e.g. perhaps MNIST on CPU.
  3. When (2) triggers we reopen the discussion to move to LLVM upstream directly, following an incremental code & design review process.
  4. We keep the TCP incubator discussed thus far as a backup option in case (3) fails.

We believe this maximizes the chances of merging TCP fully into MLIR upstream in Q4.

6 Likes

Ok. Beyond the dialect itself, please keep in mind that we need all the testing infra etc. for such a thing. I personally don’t think MNIST is actually the threshold you should be shooting for; I don’t think that “we can do matmul” is actually saying anything over what linalg already supports.

Sanjoy was originally suggesting ResNet, but then we get into the situation of “we are shooting e2e for ResNet which does non-dilated, non-strided, non-grouped convolutions, but we have to design a convolution op”. Batch norm and average pooling also enter there, but open their own design questions. I think the question is how much design work we want to happen before “graduation”.

For me at least, the most important part of the dialect is the consistent dynamic shape handling and related topics. For that, spending a bunch of cycles on whether depthwise convolution should be a separate op doesn’t seem useful. Maybe a BERT encoder would be a good first step since it has more transposey/viewey stuff? Turning decompositions of softmax on or off could also make for an interesting exploration.

1 Like

That’s sort of my point - the entire challenge of this project is answering those questions well. Everyone already knows you can throw some new “convolution” or “matmul” operations into a dialect and “allow dynamic shapes”, but that isn’t actually solving the goals defined in the initial post. Coming back to my concern from before, LLVM incorporates code into the monorepo when it is found useful-in-practice to an important segment of the community.

Tossing a few operations into a dialect isn’t the nature of the project here - it is working through “all the hard parts” which many teams have tackled and have come up with different solutions, all of which are “not good enough to use as is” (on a technical design basis, neglecting any political concerns). Success means prevailing where these other projects haven’t.

That includes answering questions like “what are the semantics of a tensor” (e.g. strided pytorch style, tiled MKL style, abstracted completely etc?), “what are the boundaries of the dialect” (what is included and what not), “how do you represent shapes”, and even narrow questions like “what are the scatter semantics when indices overlap” and “do you support dynamic rank”.
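To make the overlapping-index question concrete, here is a minimal plain-Python sketch. It does not reflect the semantics of TCP, MHLO, or any existing dialect; `scatter` and its `combine` parameter are hypothetical names used only for illustration. It shows that the same inputs produce different results depending on whether duplicate indices overwrite or accumulate, which is the kind of choice a spec would have to pin down.

```python
def scatter(base, indices, updates, combine=None):
    """Illustrative scatter over a 1-D list (hypothetical helper, not a real API).

    With combine=None, a later update to the same index overwrites earlier ones.
    With a combine function, updates to the same index are merged in order.
    """
    out = list(base)  # copy so the input is left untouched
    for i, u in zip(indices, updates):
        out[i] = u if combine is None else combine(out[i], u)
    return out


base = [0, 0, 0, 0]
indices = [1, 3, 1]      # index 1 appears twice: the overlapping case
updates = [10, 20, 30]

last_write = scatter(base, indices, updates)
accumulated = scatter(base, indices, updates, combine=lambda a, b: a + b)

print(last_write)    # index 1 holds the last update written to it
print(accumulated)   # index 1 holds the sum of both updates
```

A third option, not shown here, is to leave the overlapping case unspecified, which trades portability for implementation freedom.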

This project is taking on all of these, and is similar in scope to the original ONNX project in terms of community-scale concerns. This is /far bigger/ scope than deciding if depthwise convolution is part of the IR spec, and even that one tiny question is essential for the project to be adopted by the community: if you drop something into LLVM/MLIR that isn’t useful to people, then you’re doing active harm to the community.

Beyond the design of the operators themselves, the LLVM questions will include things like “how do you test it”? How do you validate the design? etc. This isn’t a small scope project, which is specifically why I’m concerned.

-Chris

When scoped like that, I think that “Q4 in the monorepo” isn’t a realistic goal and it doesn’t make sense to incubate in Torch-MLIR. Do we need such a big scope and green-field design though?

When looking at the OP, “Existing Dialects” fails to mention MHLO, which basically solves all the stated goals, but has some targeted issues to address. In fact, MHLO is being used today by multiple teams for exactly the goals stated in the OP.

Personally, I think a more pragmatic approach is possible here by building off of MHLO. HLO/MHLO is time tested, has well-known issues we can address, and already has multiple stakeholders in the community (Torch-MLIR has support, ONNX-MLIR is adding support, JAX produces it natively, TF has good connectivity, multiple downstreams ingest it, and that number would be even larger if governed within the LLVM community like TCP will be). MHLO easily meets the bar of “found useful-in-practice to an important segment of the community”. I would not recommend wholesale adopting it for various reasons, but I think there is potentially a path here with enormous value to the community with significantly smaller design scope and green-fieldness.

edit: scrolling through the centithread we actually already discussed this, and agreed that it would essentially meet all the needs. Was there a reason we did not go that route?

1 Like

I’m wary of branding TCP as HLO 2.0 because that may obscure the fact that this is a community owned and community driven effort.

Having said that, we’ll obviously leverage the hard earned experience we all have from working on (M)HLO (and everything else!). As you said, this isn’t a green field design; the long term implications of many of the core design points are well understood IMO.

And I don’t know if Google has concrete plans to upstream MHLO into LLVM as @stellaraccident suggested above. If there are such plans please mention them here since that will (obviously) materially affect this TCP discussion.

Finally, I want to avoid legislating sufficient conditions for inclusion into LLVM right now (in retrospect I should not have broached the exit condition). I believe once we have had a chance to work together for a few months we’ll be able to make a high-trust judgement call on whether we’ve sufficiently de-risked TCP to continue development in the MLIR monorepo or if it should be moved into an incubator. And bootstrapping in Torch-MLIR will help us get to that point faster.

1 Like

I suppose that the point is:

So I’m specifically not advocating actually having TCP be “HLO 2.0” in any branding, project governance, or any other sense.

As Chris said, I think the scope creeps extremely fast if we don’t anchor to a really clear empirical community need. There are many design considerations where weeks of discussions can be bypassed by “we have N projects using MHLO and this decision seems to work well”. If we don’t make a conscious effort to allow that sort of reasoning, then the timelines are going to drag out a lot.

1 Like

+1 from me on this – this is what I meant by “the long term implications of many of the core design points are well understood”.

Channeling /just my personal opinion/ as someone who cares a lot about LLVM: “we’re willing to wait”. LLVM is >20 years old and I fully intend it to survive another 20+ years: we’re not in a hurry to “claim success” here. It is better to get something later-but-good than something sooner-but-wrong.

ML is a rapidly evolving field and if you need a year or two to get to something that the broader community as a whole can adopt, then that is far better than prematurely trying to standardize on one particular vendor’s prior investment that they are trying to ensconce as a standard. Trying to use LLVM as the torchbearer is something that I’ve seen attempted several times before across the vast story arc of the project, and it has always been best to anchor on what serves a wide variety of stakeholders (or decline to participate until things settle out more).

-Chris

Governance.

Doesn’t fix the issue. OpenXLA/StableHLO governance is still outside LLVM and under the control of a single company, Google. It makes little difference if this is an “open” community or not.

The only equivalent to creating TCP is adopting StableHLO into MLIR proper and letting the LLVM community decide its fate.

Putting it another way, creating TCP in a separate repo is just like using an existing dialect instead, but with the problem that it’ll be incomplete and with a fragmented community.

I (personally) don’t see value in having another high-level dialect that isn’t aimed at common optimisations. I don’t think all front-ends can (or want to) agree on a common high-level dialect either.

The original proposal was for an optimisation friendly dialect, lowered FROM the likes of MHLO and Torch, not as a replacement, to sit between those and (coexist with) Linalg. It seems now people are saying MHLO is a replacement for TCP, and I don’t understand the reasoning.

1 Like

Something to clarify if it isn’t clear: StableHLO and MHLO aren’t intended to serve the same purpose. The former is more like TOSA in that it intends to provide a spec, a reference implementation in basic C++, etc. It positions itself as an interface for frontends, and nothing else.
It also exists so that MHLO can stay a compiler IR: it is unstable and its evolution is driven by the need of the compiler internals. We have some vested interest to have it cohabit well with linalg as well since XLA CPU/GPU is adopting linalg for its code generation right now.
So in the world of XLA, MHLO is serving the same purpose as TCP from my point of view: it is enabling optimizations at a slightly higher level than Linalg.

Now, MHLO evolution is driven by XLA, which is its own project. Everything is possible, but I still wouldn’t bet on Google just migrating MHLO and proposing it as a dialect in LLVM. For example, XLA cares a lot about “bounded shape” and large scale training features.
That said, taking MHLO as a baseline to bootstrap TCP initially does not seem like a bad approach to me: it’s fairly pragmatic and allows us to get started quickly.

Two reasons why we didn’t propose MHLO with the initial TCP proposal (>2y ago, right before the pandemic):

  1. At the time MHLO didn’t yet have dynamic shape support, so some redesign was necessary anyway.
  2. We felt that it would be better for a community driven project to avoid ambiguity by picking a new name and having each new op reviewed one by one. Our intent though was to propose ops from MHLO that aren’t here “just because of history” to TCP, while also discussing the dynamic shape support for each of them.

I feel the current situation isn’t that different, except that dynamic shapes are more mature in MHLO (I didn’t say perfect…) and XLA is accelerating its MLIR adoption and making it a separate project from TensorFlow. It changes the dynamic on the side of the XLA team, but does not make it so that we’re more likely to propose MHLO for upstream. I suspect we will have a vested common interest in having OpInterfaces and general transformation and infrastructure that spans well across TCP and MHLO.

3 Likes

Isn’t this enough, in a true spirit of creating a common asset, to bootstrap TCP?

My only doubt was whether all parties are really interested in concretely working on GitHub in a repository like Torch-MLIR, whose formal governance doesn’t seem any better defined than that of e.g. OpenXLA.

Right, exactly why I wasn’t in favor of doing that in torch-mlir (or mlir-hlo). I personally prefer to have a new incubator instead, but I’m not the one doing the hard work, so my opinion has very limited effect.

That’s why I suggested that the HLO and Torch teams propose an intersection from their point of view, and then we (optimizers) intersect from ours, and find a common ground. If implementing it in torch-mlir is enough for folks, I can live with that, too.

Looking at TOSA and the current TCP proposal (esp. tcp.group), I think a combination of those would be a good start. If StableHLO is similar (in scope and/or level at least), then it also is a good start.

But neither StableHLO nor TOSA solves the governance problem. I hate to propose yet another dialect, but I think the governance issue is really important.

For example, when StableHLO becomes a mature dialect, is it going to be included in MLIR upstream like TOSA was? If so, and if they have so much in common, aren’t we duplicating dialects in a way that’ll confuse users and increase maintenance?

The fact that they have their own specs is a good thing (it’s not just a compiler IR), but the fact that those specs are fragmented isn’t.

Also, my understanding is that TOSA is more about representing the computation semantics (which is great for code generation) than being a transform dialect itself (even though it is that). So the spec really cares about the semantics of each op. If StableHLO follows suit, then we might end up with two dialects with slightly different semantics, probably in edge cases, and the situation won’t have improved.

1 Like

Doesn’t fix the issue. OpenXLA/StableHLO governance is still outside LLVM and under the control of a single company, Google. It makes little difference if this is an “open” community or not.

We are actively building out a governance proposal that would provide a concrete pathway for OpenXLA/StableHLO to evolve to shared technical leadership and code ownership. When you say “control” here, what specifically are you referring to? At a minimum, anyone is free to fork the codebase at any time, regardless of governance.

This is a common fallacy when describing open source governance. Forking doesn’t solve the problem either.

The most important point I made is that both StableHLO and TOSA are separate communities, outside of the LLVM umbrella, with similar goals but different underlying pressures, and if they both want to be accepted as standard dialects in MLIR, we’ll have redundancy and a high maintenance burden.

And if we’re also creating TCP, then we now have three different approaches with similar (or at least intersecting) goals, two of them in-tree and one out-of-tree, being driven by three different (but overlapping) communities.

Maybe it’s just me, but that does not look stable long term…

1 Like