[RFC] Proposal for a high-level ML dialect in MLIR

Agreed, they have different uses and promises. I’d still want TOSA even if TCP is a thing. In fact, the first legalization I’d want is TOSA to TCP; that’s how I’d want to interact with it and feed models to it. TOSA and {MHLO, TCP} don’t have as much overlap for me as others seem to feel.

I think the discussion here oversold the frontend aspect. I’d go so far as to say the current positioning, post discussions, is “convenient common ops for entry into codegen, or ops that restricted backends need as library calls” rather than a frontend target. I’m not convinced any frontend should target TCP, in fact. TCP sits lower in the stack for me. Same as XLA HLO: it’s not convenient to target directly from a frontend.

There is a maintenance cost too. In tree, it could also be made to build optionally, which would make the opt-in explicit initially.

It is from groups new to MLIR that haven’t been part of the community, which could benefit from working alongside folks who’ve been in the community to ensure it works well (e.g., being new may actually be a reason to do it with maximal visibility).

Folks don’t seem to mention the other benefit here: when bootstrapping you can go all out making changes whichever way you like, you can update to the latest LLVM and torch-mlir repos when it suits you (no maintenance burden while experimenting), and it doesn’t accidentally become load-bearing until you are ready (e.g., no churn for users who accidentally started using it because it wasn’t behind a flag or whatnot). There is a lot of freedom during this initial phase, which helps.

Now, if this is very close to some other external dialects, then perhaps not much baking is needed, as there is a lot of existing code to look at.

This will indeed be in a central spot, and hence it is important that the folks it aims to serve be involved. There are different mechanisms to do that. And perhaps the initial version of this is a dialect with ~3 ops (scatter, gather, FFT), and then the semantics of those can be hashed out.
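To make that concrete, here is a minimal sketch of what one such seed op could look like in generic MLIR syntax. This is purely hypothetical: the op name, operand order, and attributes are placeholders for discussion, not an agreed spec, and it would only parse with an unregistered-dialect flag or an actual `tcp` dialect registered.

```mlir
// Hypothetical sketch of a tcp.gather op, written in generic form so there is
// something concrete to hash the semantics out against.
func.func @gather_rows(%src: tensor<8x16xf32>, %idx: tensor<4xi32>) -> tensor<4x16xf32> {
  // Select rows of %src along dimension 0 using the indices in %idx.
  %0 = "tcp.gather"(%src, %idx) {dim = 0 : i64}
      : (tensor<8x16xf32>, tensor<4xi32>) -> tensor<4x16xf32>
  return %0 : tensor<4x16xf32>
}
```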

On a lighter note … this escalated quickly :slight_smile:


6 Likes

I actually thought about that too, but we don’t have enough information on TCP to know that for sure. Nor do I want to make (or push for) a decision on another project that I have no stake in (for or against).

We shall cross that bridge when we get there.

You seem to be focusing on the front-end side, where the current proposal (as @raghavanr’s summary states) is that TCP will be a transformation dialect, not a frontend-ingress one.

I don’t think that’s fair. There are people working directly on this (including the OP) who have nothing to do with PyTorch or HLO. We are users, and we want to common up what they generate, which seems to be the opposite of the goal you’re assuming.

I agree with @mehdi_amini’s reply, but I want to reinforce what to me is the most important point.

We are working on MLIR, LLVM, Clang, etc., and we’re making them the best we can. The design decisions that go into those projects are a confluence of the people working on them at the time they’re working on them.

You seem to be expecting a democratic committee that will take in the views of people all over the world and deliberate for years until a perfect dialect has reached consensus, and only then start implementing it.

But the real world doesn’t work that way. By the time we get the votes in, the constraints have changed, users have different needs, and developers have gone to work on some other project.

LLVM is an upstream open source project. We work with what we have, when we have it, and we make the best of it. It is what it is.

If you have actual technical concerns, please make them known and we’ll address them. Otherwise, we will not ponder hypothetical or imaginary future constraints in case someone someday needs something that no one needs today.

3 Likes

This is quite eloquent and on point, maybe worth rephrasing and lifting the principles into the relevant MLIR guideline doc?

2 Likes

(Focusing on the positive: this doesn’t need to be like this. I know you have a lot of experience in this area and are working on something that could be convergent – your design engagement and thought process would be welcome)

5 Likes

This is the most active MLIR topic of all time.

1 Like

To all, I apologize if my last post came off as angry or negative, by the way. Please take my comments as merely a data point that I wanted to raise. I am not strongly opposed to any location for this dialect. I merely want to point out that people actively working on the codebase now can, and most likely will, very easily overlook future use cases, and they will have an advantage in pushing their designs over other people’s. This is unavoidable and natural, and it arises in many situations. It is also not necessarily bad, but I do think it is a relevant consideration. It is also likely to happen more often here, since this is still a newer field compared to something like the scf or arith dialects.

That was the agreement, but it was then followed by discussions around finding the intersection of different opsets rather than something like interfaces. That is why I think it is important to see a more final design: I see mixed signals (which is expected and fine at this stage).

You’re right, and I apologize. I created a situation that invites debates beyond a purely technical discussion. However, the expectations for MLIR dialects in core also discourage largely overlapping dialects (which you answered before the quoted message).

The reason I still have concerns is the final discussions around ops such as sort and gather. The larger, more convoluted ops can drive design decisions in the transformation dialect. I’ve seen this in various forms, both in TensorFlow->XLA transformations and in decisions about HLO’s design itself. But this is just an abstract concern; I don’t think it has any value beyond warning against tying the dialect too closely to old design decisions of ops in other frameworks, and hopefully I’ll either be proven wrong or this will be remembered and such a situation prevented.

The decision of where to put this code is not a technical decision in the end. It is an organizational decision.

I don’t expect the real world to work this way. I’m pointing out a shortcoming of a large repository that other, newer languages sometimes purposefully avoid through robust package management systems. There is nothing wrong with MLIR core playing a larger and larger role and potentially becoming even closer to a framework. There is nothing wrong with decisions having costs. I’m just pointing out some of those costs, and costs are always hypothetical until they happen.

Many decisions are not purely technical and many people would claim that one important expectation of senior engineers is predicting future constraints, so I think it’s difficult to make blanket statements.

Reiterating, because I think I gave an impression that was not my intent: I’m not trying to be negative. I’m not opposed to this, and I will try to be involved in later discussions. I’m also a big fan of programming languages that tried to keep the standard library small, and I have dealt with many headaches from systems that can’t change, or need to change too quickly, because of some downstream or upstream situation. So I’m trying to point out some disadvantages of MLIR continuing to grow: that growth is very convenient, but the convenience will, as a side effect, cause fewer people to use a mix of their own out-of-tree IRs in their own pipelines to test new ideas (to contradict myself: at the same time, a more successful MLIR will cause more people to potentially test new ideas).

4 Likes

This was a specific discussion point on the call. We do not want to propagate the mistakes of previous front-end ingress dialects or their parent representations.

MLIR is a framework. And as such, it will have redundancies and inaccuracies. It’s also a point-in-time representation of its uses and users, so it will be incomplete and usually neither backward nor forward compatible.

MLIR is not a language. Like LLVM, MLIR can grow without passing down the pains of that growth to all users in full.

This is a topic that goes beyond MLIR; it has been with LLVM for more than a decade. Strong layering in a framework makes things a lot harder than loose coupling. Yes, that creates some problems for its users, but if we start controlling compatibility and dependencies, we’ll spend most of our time handling those, and not compiler code.

What each of us loses in efficiency individually in our downstream projects, we as a community gain back with dividends by not having to reinvent the wheel or segment and control every little piece of the project.

And if there are inefficiencies that can be fixed, then we propose a fix, and people accept the fix, and onward we go.

Perhaps the missing bit here is the assumption that, after almost 20 years, our community still hasn’t learnt the pros and cons of project management, or how to build a framework that works for the majority with minimal common cost, but some added individual cost.

We’re not perfect, that’s for sure. But your warnings, while relevant in general, are already being considered by the various parties in this thread, with the baggage of LLVM and MLIR, which changes the cost models and may not be straightforward to an outsider.

Perhaps trying to understand our intentions (with questions) before assuming (with statements) that we’re missing the big picture would help frame your point of view (which is very valid!) without seeming a little too negative.

(PS: To be clear, no offence intended, nor taken. I’m just trying to make sure we go back to the discussion at hand, and hopefully you can review the thread, document and the proposal that will come soon and give your opinion on that).

4 Likes

Going back to this aspect of the discussion:

One consideration is the speed of bootstrapping TCP. From my limited understanding, it appears that there is some workflow friction between the downstream repos, which track different “green” commits of LLVM. Maybe this is helpful in guiding this decision, at least in the short term.

For context, at a recent torch-mlir community meeting, folks from Microsoft (@sstamenova , @ashay) reported syncing overhead with multiple repos depending on different “green” commits of llvm (e.g. torch-mlir → llvm and onnx-mlir → llvm).

Assuming TCP were in-tree, are there any mechanisms to move TCP fast enough and simultaneously test torch-mlir <> TCP integrations without also bumping the llvm hashes used by torch-mlir as frequently (which may be non-trivial to do; @_sean_silva may comment)?

On the other hand, if TCP were an incubator repo, we could try to “sync” the green llvm commits in unison. That is, someone (possibly on a rotating basis) identifies a “green” llvm commit per week, and TCP, torch-mlir, onnx-mlir, … would all build against that commit for that week. This way TCP can progress fast and torch-mlir can bump TCP just as fast (a submodule update), while the two stay mutually compatible in terms of the llvm commit they build against.

In the mid-to-long term I do agree with @mehdi_amini and @rengolin on the benefits of keeping TCP in-tree (as discussed at the ODM and on this thread/Discord). It’s just in the short term (while we’re bootstrapping it and need closed-loop testing with torch-mlir) that I’m worried about the workflow friction this introduces, but I’m very eager and open to ideas from the community to address this (in llvm or torch-mlir).

1 Like

I think this highlights perhaps the weakest point of keeping it out of tree.

The choice of upstream LLVM commit for each project depends on what’s new upstream and what’s needed downstream. Different projects have different priorities, different ways to merge upstream, different internal rotas to do that work, etc.

Projects like mlir-hlo, torch-mlir, iree, and onnx-mlir belong to different communities altogether, and it may be hard, or even impossible, to get them all to agree on a cadence and adopt it as part of their internal processes.

By all means, if you can get them to align in such a way, being in a separate repo becomes a lot cheaper.

I agree, I’m also worried about it. This isn’t an isolated case, and it has been discussed before in other settings.

Perhaps this could be one of the rare exceptions where bootstrapping could live in a feature branch?

(PS: I know how ironic it sounds, given that I have previously vocally opposed feature branches in other threads, but hey, this may be a good reason? Just throwing it out there…)

3 Likes

I agree this is hard and not ideal, but it has already started to happen in some ways. For example, torch-mlir adds mlir-hlo as a submodule (here), and there are feature branches in mlir-hlo (named greencommit/date-sha) to sync the llvm commit to that of torch-mlir. Similarly, onnx-mlir and torch-mlir are syncing to the same “green” llvm commit (updated once a week).

2 Likes

While I think there is a lot of excitement about this proposal, I’m not yet sure we’re “all touching the same part of the elephant”. I would personally feel more comfortable with at least some short, directed out-of-tree iteration on it, so that we have something more concrete to look at and evaluate. And then I further think it goes directly in tree from there, as it will be co-developed with a bunch of other peer parts. I wouldn’t set up a new incubator for that (too heavyweight). A feature branch or a directory in torch-mlir would be fine with me. I think the outcomes are one of: a) [UNLIKELY] we find that we are massively diverged and this effort goes out of scope again, or b) we get some good focused iteration in and converge on a starting point that goes in tree in a “month or two”. I’d plan for the infra investment to hold together for that timeframe.

(again: this is just a feeling of mine and not a strong position)

6 Likes

Responding to many disconnected posts all at once, I’m sorry if I missed a key concern:

I disagree with the opinion that this is just a new dialect and support code and that therefore the LLVM “new project” policy doesn’t apply. This is a massive expansion of the charter of MLIR and LLVM, it carries significant design risk, and there are no adopters of it. I agree there is a big community that wants this to exist, but LLVM doesn’t take new projects into the monorepo based on desire. This was true for major projects like clang and mlir, as well as backend targets and many other smaller things that are major expansions of the project.

The LLVM policies are published and exist for very good reasons that have been rehashed and discussed many times before on this forum and elsewhere. LLVM does not take early “in progress” projects into the monorepo for a wide variety of reasons (including exposing people to thrash, concerns contributors will walk away, and concerns that the project never develops into something “actually good”). The policies are built for the “long view” of maintaining a healthy and vibrant ecosystem that can scale for decades, and this puts additional burden on contributors of major new components by design (even if it causes “friction” during development).

We want innovation in the LLVM community, which is exactly why we created the LLVM Incubator process, to foster innovation and provide a place within the LLVM project to host this sort of work. Creating a new incubator for this, or expanding torch-mlir should both be fine because it provides a “graduation” process for code to go into the monorepo when it matures and is understood.

More broadly, there are many things from the early days of MLIR that were not done by proper process. That was a mistake, but that “precedent” should not be used to justify future mistakes. Making analogies to existing code in the MLIR tree with similar character and similar levels of immaturity isn’t justification.

MLIR is part of the LLVM project: if its policies for accepting changes are broader than the LLVM policy, then we should fix the MLIR policy or revise the LLVM one.

-Chris

I’d like to clarify one thing here:

The current proposal is exactly this: a new dialect (TCP) plus support code (optimizations on TCP & lowering to/from other dialects).

In addition, it looks like there is strong interest in having an end-to-end flow, including from us (Cruise). But nobody, afaict, is proposing putting these end-to-end utilities in the monorepo.

5 Likes

For the record, I have less strong opinions about this than Chris, but I do basically agree with his perspective. I’m also generally optimistic about where this can go once we have a look at something concrete and I want us to get to that point and evaluate. Right now, without mind reading, I can’t quite say how incremental/adjacent this is.

I might not say this is a “massive expansion of the charter”, but I do agree that it is an expansion, and we often use such points to consider whether we need to prioritize deferred project-organization needs. In this case, we have been incrementally expanding the MLIR charter for a while in the direction of some specific forms of ML compilers. It will be better for all of us at some point if we actually organize that into its own project/structure instead of just adding more drops to the ocean: I do think we are approaching the scalability limit of the project charter and organization that we have. Maybe, once we can see it, this “one more thing” passes through, but we really do need to consolidate this stuff into a critical mass and a specific project at some point soon, I think.

In any case, I think we left Thursday’s ODM agreeing to talk about this more f2f next week. It’d be great to parallel-track the technical prototyping and the structural issues, and I’d recommend more talking face to face vs. arguing project management philosophy on this thread.

And also, great work everyone – this has been a great discussion over the last couple of weeks and has had some really good thought and analysis put into it. Let’s keep it up and land it right.

My 2 cents.
Stella

5 Likes

Sorry for the confusion; I clarified the point I was trying to make above.

For the record, I was talking about long term.

There has been a change in focus to split short and long term, and my view is similar to @stellaraccident’s.

3 Likes

I do feel that this is the type of thing that a focused group could iterate on in a regular personal github repo for a short period (say in a fork of Torch-MLIR) and bring up, say, ResNet end-to-end, then present the result to the community.

I’d be wary of tying this to a particular front end, to avoid the kind of effect that @tpopp was warning about.

I’ve been collecting MLIR generation scripts from different front ends at GitHub (rengolin/mlir-generator: Generator for MLIR files from known front-ends), and the difference is visible.

My approach is to create ways into MLIR, then run my passes and output code. This helps with going end to end but doesn’t force a particular path.

I’d propose something similar, with a standalone dialect implementation that is easy to build, and a way to export the dialect, so front ends can take that small dependency and work their way to e2e independently.

1 Like

Given all the feedback, here’s what we (Cruise) recommend regarding the location of the code:

  1. We’ll create an incubator repository under the LLVM org that contains the TCP dialect and support code: transformations, lowerings to upstream MLIR dialects like linalg & affine, utilities for testing, etc.
  2. The Torch->TCP legalization will live in Torch-MLIR, which means Torch-MLIR will depend on TCP. We’ll disable these legalizations from building by default until they’re deemed sufficiently stable to not slow down Torch-MLIR development.
  3. We’ll revisit this arrangement in a few months, perhaps after the MLIR/LLVM dev meeting in November.

The three alternatives were: (a) putting it in the MLIR monorepo, (b) putting it in Torch-MLIR, and (c) creating a feature branch. There is some pushback on (a), including strong pushback from Chris. (b) sends a mixed message on framework neutrality, especially if we’re not able to migrate TCP out relatively soon. And (c) would have been fine if we had a fast path to merge back into main, but it looks like we don’t as of now. Given that, we recommend the above.

Please let us know if you disagree with this assessment.

The next steps for us are:

  1. Present a spec for the initial set of TCP ops. We’re currently collaborating with various folks from the community on this; we hope to be able to share something before the ODM on Sep 1.
  2. Kick off the mechanical process of setting up an incubator repo.

12 Likes