For the record, I have less strong opinions about this than Chris, but I do basically agree with his perspective. I’m also generally optimistic about where this can go once we have a look at something concrete and I want us to get to that point and evaluate. Right now, without mind reading, I can’t quite say how incremental/adjacent this is.
I might not say this is a “massive expansion of the charter”, but I do agree that it is an expansion and we often use such points to consider whether we need to prioritize deferred project organization needs. In this case, we have been incrementally expanding the MLIR charter for a while into the direction of some specific forms of ML compilers. It will be better for all of us at some point if we actually organized that into its own project/structure instead of just adding more drops to the ocean: I do think we are approaching the scalability limit of the project charter and organization that we have. Maybe once we can see it, this “one more thing” passes through, but we really do need to consolidate this stuff into a critical mass and specific project at some point soon, I think.
In any case, I think we left Thursday’s ODM with agreeing to talk about this more f2f next week. It’d be great to parallel track the technical prototyping and the structural issues, and I’d recommend more talking face to face vs arguing project management philosophy on this thread.
And also, great work everyone – this has been a great discussion over the last couple of weeks and has had some really good thought and analysis put into it. Let’s keep it up and land it right.
I do feel that this is the type of thing that a focused group could iterate on in a regular personal github repo for a short period (say in a fork of Torch-MLIR) and bring up, say, ResNet end-to-end, then present the result to the community.
My approach is to create ways into MLIR, then run my passes and output code. This helps end to end but doesn’t force a particular path.
I’d propose something similar, with a standalone dialect implementation that is easy to build, and a way to export the dialect, so front ends can take that small dependency and work their way to e2e independently.
Given all the feedback, here’s what we (Cruise) recommend regarding the location of the code:
We’ll create an incubator repository under the LLVM org that contains the TCP dialect and support code like transformations and lowerings into MLIR upstream dialects like linalg & affine, utilities for testing etc.
The Torch->TCP legalization will live in Torch-MLIR, which means Torch-MLIR will depend on TCP. We’ll disable these legalizations from building by default until they’re deemed sufficiently stable to not slow down Torch-MLIR development.
We’ll revisit this arrangement in a few months, perhaps after the MLIR/LLVM dev meeting in November.
The three alternatives were a) putting it in the MLIR monorepo, b) putting it in Torch-MLIR and c) creating a feature branch. There is some push back on (a), including strong push back from Chris. (b) sends a mixed message on framework neutrality, especially if we’re not able to migrate TCP out relatively soon. And (c) would have been fine if we had a fast path to merge back into main, but it looks like we don’t as of now. Given that we recommend the above.
Please let us know if you disagree with this assessment.
The next steps for us are:
Present a spec for the initial set of TCP ops. We’re currently collaborating with various folks from the community on this; we hope to be able to share something before the ODM on Sep 1.
Kick off the mechanical process of setting up an incubator repo.
The guidelines mention that the project: Must be proposed through the LLVM RFC process, and have its addition approved by the LLVM community.
I don’t think this has been done: this RFC was about adding a dialect to MLIR along the line of the MLIR policy here: Developer Guide - MLIR
I don’t believe this RFC is representative of what the LLVM community at large for adding an incubator project.
So: please start another thread in LLVM focused on adding an incubator project.
Not trying to argue but just clarify: this is an RFC that meets the Developer Guide’s definition of such, and the conclusion/consensus was that an incubator project is called for. These guidelines were written prior to the consolidation onto discourse and I don’t think any thought has been put into what the appropriate category is for requesting such things. We should probably update the policy regardless to clarify:
The category to post in (replacing the “mailing list” terminology)
Whether an LLVM RFC which concludes that an incubator should be created is sufficient or whether a top-level RFC to create the incubator is called for
Sorry but I disagree with your “clarification”: this clearly does not count as “approved by the LLVM community” to me, as this likely didn’t even reach the LLVM community in the first place. Of course we should update the policy, but in the meantime I don’t understand how one can claim to consult the “LLVM community” that way: at best we may say that pending a category, there is no way to perform such an RFC, but it does not mean this meets the requirements…
Moreover neither the title of the RFC nor the body calls for an incubator at all.
I’m happy though to poll the community outside of MLIR to see who is aware that an incubator project is proposed outside of the MLIR people…
In case it isn’t clear: I am opposed to consider a new “dialect” as a "project’ and I don’t consider the scope presented here to be suitable to be an incubator.
So here you have it since it didn’t seem clear to you: there is no consensus.
I only saw Chris arguing this direction, the conclusion looks to me like coming only from this. And I’m happy to write a rebuttal there, because I find the arguments dubious right now.
On the other hand many other contributors and/or maintainers of MLIR subsystems (@nicolasvasilache, @sanjoyd, Myself, for example) are considering this as a dialect under the MLIR guidelines.
That escalated quickly and I don’t understand the magnitude of your response. You will note that my requested action is to update the policy to conform with your read of the situation. I was trying to put myself in the shoes of the author when I re-read the policy, and it seemed to me that it would be a perfectly reasonable thing to read it and assume that an RFC had been raised to the LLVM community via this thread.
I also independently argued this direction, although with less directness. My reasons for doing so remain: I do not believe we have reached consensus on the design and I would like to see a concrete proposal prior to engaging in upstream development. Especially given the design scope that is likely to be traversed in the related areas to this over the coming months, I do believe that elaborating it out of tree would be helpful. I had originally considered that such POC work in torch-mlir would be fine, but I do agree with @sanjoyd’s desire that if going that way, somewhere neutral would be better.
I am perfectly willing to acknowledge that you disagree and we can continue discussing, but I was taking it for granted that the authors of the RFC did spell out a plan forward that seemed to have good support on the thread at the end. The conversation at this point read as pretty indicative of consensus to me and what appears to be the other participants on the thread.
Having had a few hours to think about it, here is my proposal to get things moving:
We leverage the energy that exists between the stakeholders here and get them writing code for a strong poc that we can evaluate. I was serious in that I’ve evaluated and talked to a number of people and I do not believe we are at a level of detail to just start letting upstream phab reviews fly – maybe that resolves at the scheduled ODM and maybe (my estimate), it needs a couple of rounds of poc code we can look at and critique. Sanjoy’s estimate of November doesn’t seem wrong to me, having been down this path some number of times before:
It would be great for the community if folks did this in an llvm aligned repo, and the incubator process is how we do that.
If that turns in to a big, theoretical discussion vs its intent as a lightweight check for realness and alignment with the goals of the llvm project, then someone creates a personal GitHub repo, invites collaborators and writes some code there.
We have a look at the concrete proposal and code as it evolves and evaluate whether we move it upstream and continue development there. I personally am optimistic on that point after a few rounds on it, but we’ll need to discuss actuals.
A concrete advantage to not doing this upstream is that it isn’t clear to me yet that we are talking about one dialect or a suite: if we did this upstream, the hurdle of having this discussion at each stage will bias us towards the one and I’d rather have a friction free design space. If it turns out we are talking about multiple things, we let the stakeholders push in those directions so we can see.
One of us updates the policies for incubators to account for the move to discourse and clarify the nature of the RFC.
We start a new top level thread on MLIR project structure and contribution model and flush the queue on these topics there vs in what has become a tax spread over a number of threads I’ve observed over the last few months. There’s clearly a lot of thoughts and feelings about this and it may be more helpful to hash it out as a dedicated topic.
I agree too. And on re-reading it, I felt i the policy was ambiguous to the uninitiated and think we should update the verbiage: as stated, it is not surprising to me that folks would think this RFC is sufficient.
Honestly, this is not the right way of doing it. Neither 1-1 nor zoom, as none of this is permanent nor have a way to refer back to it. If you have concerns and reasons that influence community wide decisions, then they must be in a public forum, ie. here.
There are three ways we can go about this: existing repo (ex. Torch-MLIR), new incubator and in-tree. IIUC, most people here are strongly opposed to existing repo, which leaves us the other two.
I don’t have a strong preference for either, but I agree an actual RFC with the two clear options would make it easier to find and claim consensus.
To me, the arguments are mostly mechanical. I have no doubt that TCP will become an official in-tree dialect soon enough, there are just too many people and projects interested in it for it to fly. But there’s still the question of what would be its place in the dialect cloud, and that could converge faster with more speed and less stability.
Here’s my list…
A separate project exposing a standalone dialect that strongly depends on and interacts with existing in-tree dialects only.
POC can be cross-developed from very early days to a working prototype only by the people that really care about it, at a much greater speed than in-tree.
Speed and stability can be tuned independently without upsetting our existing buildbot infrastructure (upstream and downstream).
We can have different branches, different experiments, whatever, without affecting the in-tree policies.
We could use pull requests directly, have issues unique to this repo and not have to create filters for it on existing main GH issues.
Side-effect: We’d strengthen the standalone infrastructure, making it easier for people to compose dialects (new and existing) downstream.
Existing projects would need to add another dependency, albeit a very small and focused one, not unlike other dialect-carrying ones we all work with anyway.
Existing projects will pick an LLVM commit via two different paths (their own dep + tcp’s dep) and they’d have to be the same. It’d be hard to get all projects working with TCP to agree to the same base commit. This is perhaps the strongest CON.
Current in-tree MLIR developers would have to work on a separate repo. This is a weak CON, as I’m assuming we all work on multiple repos anyway.
We’d have to use it as a standalone dialect and connect to other dialects in perhaps less straightforward ways. This is a weak CON and has the positive side effect above.
This will create the problem that we all hate: at the end, there will be a massive number of commits that someone will have to go through, sanitize, re-write to the main repo and slowly apply.
This isn’t going to be a new directory in the main repo, but a move from standalone to in-tree dialect, which different structures and build flags.
Adding an experimental dialect in-tree, making sure we don’t break other people’s work/CI in the process.
Reviews on the same place, still on Phab (for those who like it), same cadence as existing MLIR.
Zero abstraction cost, dialect composition is in-tree and flow with the rest of MLIR development process and pace.
There is no multiple dependency to care about. Projects working on TCP will just pick straight from LLVM at a point when the necessary commits have landed.
We can treat this exactly like experimental targets in LLVM, for which there’s extensive policies about and we know how to handle well.
Projects will be forced to pick newer commits of LLVM as a whole just because TCP has changed. This is the same as when the incubator project updates LLVM, but more fine grained and never stopping.
It’s much easier to break other dialects/infra in MLIR with a commit to TCP (that leaks changes elsewhere) if the commit is in the monorepo, especially if done by experienced developers, certain that they’re doing the right thing.
To avoid buildbot instability (up/down-stream), we’d have to create a way to not build this dialect by default, for which I’m not sure we have that yet in MLIR.
It’d be harder to experiment, create different branches, share work via git remotes directly to other people’s repos, etc. because the main repo is huge and the number of commits between a couple of days is large.
To be honest, my critique applies to all incubator projects (including my own) as we are yet to see an attempt at graduation. I am generally concerned that we keep spawning more such projects without clear understanding of the longer-term path forward for them. FWIW, making mlir-tcp a short-lived incubator project as suggested in
may give us valuable information on what the graduation process should look like. This is not exactly the strategy nor the criteria I was referring to, but having a deadline is a start.
I have several hats in the broader incubator/graduation discussion, but my concerns with all of them are about visibility and co-design. Specifically, at which point is the broader community expected to chime in? As one of MLIR maintainers, should I actively follow the TCP incubator to make sure it follows our IR design practices? Will there be a design / code review upon graduation where other people can make suggestions or request changes? If so, what is the amount of material to review? If not, should we have some limits on what is explored by the project (e.g., it should not be replacing tensor or linalg dialects)? Conversely, as an owner of an incubator project, should I proactively seek feedback from future stakeholders? Or can I just write “intentionally draft” code with the assumption that it will have to be revised/rewritten upon graduation?
The comments from @stellaraccident and others suggest that there will be an extra review, but it would be better to have clear expectations.
All my comments aside, I would want us to bias towards action and take the least worst solution, something suboptimal but acceptable, so I can just side with the majority (presumably, in favor of the incubator, maybe a quick poll?) and have a separate discussion on the project charter extension policy.
+1. Maybe lift this to the LLVM level if incubators are involved.