We are looking to build a compilation stack for ML models at Cruise using MLIR (which @sanjoyd alluded to in this thread last week). This stack is meant to generate code natively for GPUs and accelerators. We would like to keep as much of this as possible upstream in MLIR.
Requirements
We believe that we need a dialect that is at a higher level of abstraction than, say, Linalg. More specifically:
We need a compiler-focused dialect that is backend agnostic and amenable to codegen for GPUs and accelerators.
We need to support op-level fusion (for grouping operations to map to libraries like cuDNN, etc.).
We need to support dynamism at least in shapes (but not necessarily rank dynamism).
We need to support quantization as well as sparsity.
We need it to work for inference now and potentially for training in future.
Ideally, we would like the dialect to support both implicit and explicit broadcasts, with a lowering from the former to the latter (a sketch follows the list).
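As a rough illustration of the last point, here is a minimal sketch of the two broadcast styles and the lowering between them, written with CHLO/MHLO op spellings purely as an example (not a proposal to use those dialects):

```mlir
// Implicit: the op itself carries numpy-style broadcast semantics,
// so operands of different shapes are legal.
%0 = "chlo.broadcast_add"(%lhs, %rhs)
    : (tensor<4xf32>, tensor<2x4xf32>) -> tensor<2x4xf32>

// Explicit: the broadcast is materialized first, and the add then
// only accepts shape-equal operands.
%b = "mhlo.broadcast_in_dim"(%lhs)
    {broadcast_dimensions = dense<[1]> : tensor<1xi64>}
    : (tensor<4xf32>) -> tensor<2x4xf32>
%1 = "mhlo.add"(%b, %rhs)
    : (tensor<2x4xf32>, tensor<2x4xf32>) -> tensor<2x4xf32>
```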
Existing Dialects
Among the existing dialects, TOSA seems to be the closest to our requirements. But we have some concerns with TOSA:
TOSA seems to be tied to this spec. What is the relationship between the TOSA dialect and the spec? Does the spec need to be updated before updating the TOSA dialect? If so, what does it take to update the spec?
Is TOSA amenable to changes, if there is a clear gap? For example:
The scatter op in TOSA does not support reductions over repeated indices.
TOSA does not support f64 types.
Reductions with generic accumulation are not supported.
Proposal
Assuming the community is interested in the features listed above, we see a couple of options.
Option 1: If TOSA is not tied to the spec and is amenable to changes, we could update it to include some of the features that we listed above.
Option 2: Come up with a new dialect that supports the features listed above (or maybe resurrect the TCP discussion?).
We are open to other alternatives as long as there is a way to support our requirements.
Thanks! I’d love it if someone were motivated to resurrect TCP and actually make it a reality.
Feel free to book any of the weekly ODM slots if you want to organize a discussion among the stakeholders interested in participating in such an effort!
This is really missing in MLIR right now for closing the loop toward having more end-to-end solutions available in-tree, and I’d love it if someone were motivated to drive this part of it.
The goal for TOSA is to keep the dialect aligned with the spec you pointed to. That includes updating the spec before updating the dialect. The TOSA spec is open for contributions; probably the easiest way to start a discussion is on the discourse where the spec lives: Discourse (mlplatform.org). That gives us a place for spec-specific discussions which might be slightly off topic for this board, although we do also post here when people have TOSA questions. It’s also possible to post proposed patches on Gerrit Code Review (mlplatform.org).
To address some of the specific questions:
Scatter intentionally doesn’t support repeated indices: you either have to force an ordering in which the indices are traversed, or accept a race condition as to which value gets written to the tensor. Avoiding repeated indices allows for more implementation options without those problems.
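(For contrast, HLO-style scatter makes repeated indices well-defined by folding duplicates through a user-supplied combiner region rather than racing. A rough sketch in MHLO syntax, with the dimension-number attribute details approximate:

```mlir
// Hypothetical: indices [0, 2, 0] scatter three updates into a
// tensor<4xf32>. Index 0 appears twice; the combiner region below
// (addition) defines what happens, instead of a write-write race.
%0 = "mhlo.scatter"(%operand, %indices, %updates) ({
^bb0(%old: tensor<f32>, %new: tensor<f32>):
  %sum = "mhlo.add"(%old, %new) : (tensor<f32>, tensor<f32>) -> tensor<f32>
  "mhlo.return"(%sum) : (tensor<f32>) -> ()
}) {
  scatter_dimension_numbers = #mhlo.scatter<
      update_window_dims = [],
      inserted_window_dims = [0],
      scatter_dims_to_operand_dims = [0],
      index_vector_dim = 1>,
  indices_are_sorted = false,
  unique_indices = false
} : (tensor<4xf32>, tensor<3x1xi32>, tensor<3xf32>) -> tensor<4xf32>
```

The price is exactly the implementation burden described above: the target has to honor the combiner for duplicates.)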
So far, we haven’t seen demand for f64 in models. Adding f64 would increase the complexity of accelerators and would have a negative impact on bandwidth/storage for the values.
Sorry, I’m probably missing something obvious, but could you expand on what you want from reductions with generic accumulation?
I don’t have complete standing to actually make the following offer, but I think the feedback would provide some timely facts to the situation:
What if Google were to detach MHLO, CHLO and the lowerings (to Linalg et al.) from mlir-hlo, clean them up, port existing framework connections to them, and place them under unambiguous community governance, licensing and contribution models (i.e. possibly up to the extent of what we previously sponsored to make torch-mlir a community project as an LLVM Incubator repository)? Would that satisfy the technical need? And what elements of community governance are deemed important by potential collaborators on such a project (i.e. anything from “in the LLVM Foundation” to “aligned with an independent, open-source-friendly other governance model”)?
Also, thanks for the explanations of why some of the decisions were made. If TOSA is to be used as a dialect for “all” ML models (which is what we are proposing here), we need to have a way to represent these, irrespective of their implications for implementation / performance.
could you expand on what you want from reductions with generic accumulation?
I meant a reduce op that takes a lambda (as opposed to the fixed set of reduction operators that are currently present).
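Something along the lines of how MHLO models reduce, where the combiner is a region rather than an operator picked from a fixed list. A minimal sketch (syntax approximate):

```mlir
// The reducer is a user-supplied region, not a fixed operator: any
// binary combiner can appear in the body (maximum is just an example).
%0 = "mhlo.reduce"(%input, %init) ({
^bb0(%a: tensor<f32>, %b: tensor<f32>):
  %m = "mhlo.maximum"(%a, %b) : (tensor<f32>, tensor<f32>) -> tensor<f32>
  "mhlo.return"(%m) : (tensor<f32>) -> ()
}) {dimensions = dense<[1]> : tensor<1xi64>}
    : (tensor<4x8xf32>, tensor<f32>) -> tensor<4xf32>
```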
This is a nice coincidence! Later today, we were planning to create a GitHub repository for StableHLO - a stable version of MHLO (and CHLO).
At Google, we have a team staffed to contribute to e.g. a spec, a reference implementation and a test suite, as well as new feature development (dynamism, quantization and sparsity are the big ones that come to mind, and I see that you mentioned them as well above).
An open question is the governance model - let’s figure it out together. At the moment, MHLO is pretty much Google-driven, but this is something that we want to change with StableHLO.
@burmako Is that more of a stable input format into the compiler (like TOSA) or is it going to be a compiler IR? E.g. will removing an op be a breaking change?
The current thinking is that StableHLO would be a stable input format, with backward compatibility guarantees, based on something like [RFC] IR Versioning. Removing an op would be a breaking change, and that would need to respect the agreed upon compatibility window.
@burmako Just to clarify, are you suggesting community governance for StableHLO? More specifically, is StableHLO going to be an “LLVM Incubator repository” as @stellaraccident had suggested?
We are listening/evaluating and making that decision ~now. For pragmatism, it will start as a google repo (as torch-mlir did) under a Google administered organization. But we would like it to be a community asset and are trying to figure out the governance model/final location/etc. If it were to become an “LLVM Incubator repository”, that would be because a) there is demand for that (and we debate it internally and conclude that is a good direction to go), and b) the LLVM community accepts it. Getting feedback on this thread informs both of those aspects.
Also, thanks for the explanations of why some of the decisions were made. If TOSA is to be used as a dialect for “all” ML models (which is what we are proposing here), we need to have a way to represent these, irrespective of their implications for implementation / performance.
Yes, TOSA takes an opinionated stand on operators, with the assumption that implementation / performance are important characteristics for models.
I meant a reduce op that takes a lambda (as opposed to the fixed set of reduction operators that are currently present).
Hmm. Yes, that would be a tough one to fit under TOSA’s current principles. It would be interesting to see how that would fit under MHLO / StableHLO.
Yes, TOSA attempts to balance the hardware view of the spec operators with the compiler IR view, within reason. But since the spec defines the functional implementation of each operator, there’s an assumption that operators translate from their TOSA forms to hardware implementations without a substantial gulf in abstraction down to the hardware / microcode level.
Viewed from such an abstraction level, an operator like a reduction with a lambda sits somewhat higher up, but it could potentially be decomposed into TOSA primitives. This doesn’t mean TOSA cannot accommodate new operators - as @eric-k, who maintains the spec, says, there are defined processes through which contributions are indeed welcome. @stellaraccident recently contributed the tosa.fft op.
What do folks think about developing this incrementally in tree (i.e. not as an incubator project), after first presenting & reviewing the high level design? I feel like at this point the design space is well-characterized and there aren’t major unknowns.
Personally, I’ll defer to community consensus on this, but I’m also skeptical of our ability at this juncture to arrive at that consensus for in-tree development of something of this scale and category. If anything, I have a slight preference for seeing the “ML bits” come out of the main tree and into a more domain-specific repository (or set of repositories) where they can grow/mature and interop with each other more directly (and carry the dependencies common to this domain). I know that we need to improve the infra for managing the detached “ML repos”, but I’m interested in seeing that happen without special privilege being granted to those parts that happened to exist at the right point in time to have reserved a spot in the monorepo. I’ve argued for more inclusion in the monorepo based on policies before, but I believe the community has been pretty clear on holding a higher standard there. In general, ML compilers are still young, varied and fast moving. I’d like us to have a repository positioning that reflects that vs continuing to add bulk to the monorepo, whose primary purpose continues to be the long-term, high-stability core APIs and toolchains.
I feel like this should exist at the same “privilege level” as torch-mlir and onnx-mlir.
Makes sense, I too don’t want TCP to get a “free pass” because of timing. Maybe let’s discuss this next Thursday as @mehdi_amini suggested. By that time hopefully we’ll have a decision on StableHLO’s location as well.
These are more complex organizationally though since they’re tied to external projects & specs. I’d expect TCP to be fully controlled by the community.
One of the things we’ve found is that beyond the dialects, there are tooling and integrations that are useful for converting in/out, testing, code generation, CI, deployment artifacts, etc. There really isn’t a place in the monorepo for such things to exist with any fidelity. Torch-mlir and onnx-mlir are also fully controlled by the community, but they are free to handle these other parts a bit better, and I think that makes them stronger projects that we can put more weight on. Every time the upstream dialects need to grow a new layer of integration/testing/etc., it is a tax that everyone pays – we end up doing the bare minimum because of that, which still adds up but never quite gets us to where we would be quality-wise vs if there were a more dedicated project structure for things that are “crossroad” components.
(The answer could be “start a new top level project in the monorepo” but that is an even higher bar – and easier to approach by way of incubator)
This could be an interesting discussion wrt profiles and the like. E.g., the TFLite dialect allows types that the TFLite flatbuffer and runtime don’t support. It allows using the ops with different types, but of course that creates a gap with respect to numerics: one won’t have the same guarantees or conformance, but could use the same computational description. This has been the case in the TFL dialect for a couple of years without much issue. It would fall outside the spec and its guarantees, though.
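As a hypothetical illustration of that gap (this assumes the dialect verifier accepts element types the runtime has no kernels for, which is the situation described above):

```mlir
// Usable as a computational description in the TFL dialect, even if
// the TFLite flatbuffer/runtime has no f64 kernel to execute it, so
// the spec's conformance guarantees would not apply.
%0 = "tfl.add"(%a, %b) {fused_activation_function = "NONE"}
    : (tensor<4xf64>, tensor<4xf64>) -> tensor<4xf64>
```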
I think that is a key component no matter where this goes, with all these potential candidates (or combinations of candidates).
This is an interesting one, as this is a case where the corresponding HLO op has gotten actively negative feedback from stakeholders, and even JAX doesn’t use this functionality in general. I sometimes feel like folks want multiple dialects and abstractions just concatenated into one dialect for some reason (“we have a single input, yes it has bitshift and nD einsum and inter-device communication primitives”). That is to say, if an SCF op fits the goal, why not use it? What are the constraints here? (Speaking from current experience on reduce, the number of actual uses of the lambda is not something that would motivate me to add it; I like it from a generality point of view only, e.g., it’s cute.)
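To make “use an SCF op” concrete, here is a minimal sketch of a generic accumulation carried as loop state instead of a frontend reduce-with-lambda op (names and types are illustrative only):

```mlir
// Reduce a 1-D tensor with an arbitrary combiner.
// Assumes %t : tensor<?xf32>, %init : f32, and index values
// %c0, %c1, %n for the loop bounds.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%acc = %init) -> (f32) {
  %v = tensor.extract %t[%i] : tensor<?xf32>
  // Any combiner can go here; addition is just an example.
  %acc2 = arith.addf %acc, %v : f32
  scf.yield %acc2 : f32
}
```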
HLO scatter is probably the most disliked op in XLA (well, that’s an emotion, but quantitatively it is the op with the largest number of bugs by some margin). What functionality are you after with repeated indices?
That is very interesting. I know you can have that without external-project constraints. But without spec constraints I’m not sure how much stability or versioning you should expect. I think it’s important to define the goals: TCP was not stable; it was an IR with no guarantees except being useful for more codegen-oriented optimization and being a target for multiple frameworks. TCF was a different story. Perhaps that’s all part of the ODM (I think there was a different one scheduled for next week, but I could be misremembering).
Speaking for myself, I don’t think I want to invest in another frontend “reduction” opset which doesn’t have some guarantees around this. The cost for community projects is just too high: outside of corporate codebases, it becomes prohibitively hard for projects to interop at the exact same commit. Even if these are soft requirements, they become constraints for testing infra, frontend integrations, and the resulting CIs, resulting in poor-quality software (since nothing can ever be tested together). Everything below that, sure, let it drift. But if it is serving an integration role similar to LLVM IR for the domain, then it needs to be designed for some level of compatibility.
(True stories from the CI pit.)
(But we may now be talking about two different things)