As the primary author and de facto maintainer of the quant dialect, I would like to solicit feedback on whether we should remove it from upstream MLIR.
To make one thing clear: quantization algorithms and approaches are incredibly important for ML-MLIR stacks. However, while the quant dialect was intended to be an implementation of such things, it was invented very early in the evolution of MLIR (trivia: the quantized types were the first non-builtin types), and the only concrete implementation of an actual use was in TFLite (to my knowledge). I suspect that if re-approached today, the path forward would look quite different, and having a bag of un-assembled components occupying such an important namespace can make it hard to talk about the next things.
I’m being somewhat provocative by requesting comments on removing it entirely. It is more that it is not being maintained and, judging by the current folks I know who work in this area, it is not on a path to the generality/utility that we would want.
For the uninitiated, the quant dialect defines several things:
Several QuantizedType implementations which are suitable for use as tensor element types and model typical affine quantization schemes.
A small library of structural ops intended to aid in the construction of algorithms for converting from simulated quantization to native low-precision arithmetic:
Casts between simulated and native forms (quant.qcast, quant.dcast)
Casts to/from the native storage type (quant.scast)
A quant.region op which encapsulates a sequence of simulated arithmetic and aids in lowering fused sections to native arithmetic.
quant.const_fake_quant and quant.const_fake_quant_per_axis simulate the effects of quantization given constant (bit depth, min, max) values.
quant.stats presents an observation point for use as part of guided quantization, in combination with a runtime which can run the computation and refine quantization parameters based on actual values.
quant.coupled_ref defines a “join point” between two SSA values which must resolve to the same quantization parameters.
Various folders and passes to propagate casts and materialize constants into their native forms.
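To make the cast ops above concrete, here is a rough sketch of how they compose with the quantized types (written in the generic op form; the exact custom assembly has varied across MLIR versions, and the scale/zero-point values are arbitrary examples, not from the RFC):

```mlir
// Sketch: simulated f32 arithmetic quantized to i8 storage and back.
// !quant.uniform<i8:f32, 0.1:0> reads as: i8 storage type, f32
// expressed type, scale 0.1, zero point 0.
%q = "quant.qcast"(%x)
    : (tensor<4xf32>) -> tensor<4x!quant.uniform<i8:f32, 0.1:0>>
%d = "quant.dcast"(%q)
    : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
```

A lowering would typically propagate the qcast/dcast pair until the enclosed arithmetic can be rewritten in native i8, at which point quant.scast bridges to/from the raw storage type.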
Of the above, I think that the QuantizedTypes have demonstrated utility and cross-project/framework integration (in-tree, they are used by TOSA, and out of tree, I know that they are used by TFLite and various other stacks that I can see). I think their implementation could use some uplift (MLIR has improved a lot around them in the intervening years), but as a normalizing force across components, I think they are paying their way. I would propose that we keep (modernized) versions of them as builtin types.
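For readers unfamiliar with the affine schemes these types model, the core mapping is just real ≈ scale * (stored - zero_point). A minimal sketch in Python (the helper names are mine, purely for illustration, not part of the dialect):

```python
def quantize(real, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: map a real value to an integer code.

    Defaults mirror i8 storage, e.g. a type like
    !quant.uniform<i8:f32, 0.1:0> (scale 0.1, zero point 0).
    """
    q = round(real / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the storage range

def dequantize(q, scale, zero_point):
    """Inverse map: recover an approximation of the real value."""
    return scale * (q - zero_point)

# Round-trip: 2.5 stored as the i8 code 25, recovered exactly here
# because it is a multiple of the scale.
q = quantize(2.5, scale=0.1, zero_point=0)
x = dequantize(q, scale=0.1, zero_point=0)
```

The per-axis variants of the types carry a scale/zero-point pair per slice along one dimension, but the arithmetic per element is the same.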
For the rest, I could see this going a couple of different ways:
We move the parts that are used down to TFLite and delete the rest.
Folks step up who have an interest in solving this problem and believe that what is there is worth building on – and express a desire to take it forward from its current state. We may want to think about moving/renaming some of the things, since we’ve learned a lot about quantization in the intervening years, and this is but one approach.
As part of making this decision, I’d like to hear from any folks who are using more of this facility and have just been silent. There is no rush to do something here or break something that is useful – just trying to be tidy and acknowledge the current state.
I support this. MLIR has accreted a lot of interesting things, but many of them are under-invested in, and this dilutes the project and leads to confusion. It would be great to consider dropping other dialects and routines that aren’t getting active love/use as well.
I agree the types themselves are the most important part of this, and I support keeping them: they help with cross-dialect interchange and communication.
LGTM for the rest of the proposal aside from this.
Why should these move into the builtin dialect? I feel like there is a dangerous mindset (in the ecosystem as a whole) that only operations really deserve a new dialect, and that attributes/types can be added to the builtin dialect if there isn’t a suitable op-dialect in existence. I find this particularly dangerous because this is the exact mindset that led to the current standard dialect (except with attributes/types instead of operations). Have you considered just leaving the quant dialect in existence, but without the current operations (which are the part of the dialect that is to be dropped by this RFC)? What would be the pros/cons of such an approach?
That would be strictly easier. This came up as an option from an offline conversation regarding cleaning up the codebase, and that particular part (moving types to builtin or elsewhere) was from a remark made in the discussion. I don’t feel very strongly about it.
Leaving the types in a dialect (either a cleaned up quant or tensor perhaps) would work better with the ability to manage quantized constants and such (via dialect hooks). If they are to go somewhere, I am somewhat partial to putting them in tensor since, as defined, they really only make sense as a tensor element type – putting them there would reduce some cross dialect referencing.
We also use the quant types quite extensively, but none of the quant ops, so the proposal looks good to me.
If they are to go somewhere, I am somewhat partial to putting them in tensor since, as defined, they really only make sense as a tensor element type.
I think they are also useful in the memref world: a) when they can be the element type for memref (which wasn’t possible until the recent RFC that opened up the memref element type interface), it adds more flexibility in the ordering of bufferization and quantization-arithmetic lowering passes; b) in an execution system that generally works on buffers but also requires quantization parameters in some ABIs, it is useful to reuse the quant type as metadata on the memref types in the IR.
QuantizedType: like the others indicated, we should keep them as the most convenient way to encode quantization parameters for tensors. Keeping them in the quant dialect looks fine. (Or do you have a different use for this namespace?)
cast ops, regions and passes: Some of the ops and annotations were added as an attempt to generalize the TFLite quantization algorithm. Eventually we chose to legalize the quantized TFLite ops to the targets (like TOSA does), so most of these can be removed for now.
utilities: These utilities were used to convert quantization parameters from TF to QuantizedTypes. I think this part can be moved to TF/TFLite, or they can be kept as utilities for the QuantizedTypes.
Thank you for the feedback. I am not opposed to keeping development upstream going in this area, but in my opinion, that would require a renewed focus on actually implementing a full algorithm upstream. The situation we find ourselves in is that we have some disconnected pieces but no visibility into the uses. The original intent when this was written was that a more complete implementation would be built upstream, but the TFLite MOT team decided to do all of the work in their own repository instead, and these parts have basically been orphaned.
Having too many disconnected pieces could actually make it harder to restart real development in this area, because there is no way to know what is in use vs not. Biasing towards more of a clean slate is what we are aiming for.
Of course, if you have active uses of any of this, we don’t want to needlessly cause you pain, so please speak up. Ultimately, though, I do think that we need to trim out the parts that never went anywhere, and I would love it if a new project in this area started building the infra upstream for various quantization algorithms and approaches.
Removal of the ops looks ok to me. We had an internal pattern to remove instances of quant.stats that we saw being emitted by training flows.
Are the quantized types intended to be retained and supported? There’s a fundamental difference between how TFLite expresses quantization in-tensor and how TOSA expresses it in-op. When TOSA was originally open sourced, there was a question as to what happens to the tensor-carried quantization information. Our answer then was to leave it alone, even at the cost of duplicating information. The duplication made some backend analysis easier, e.g. concat needing to establish whether input tensor scales match.
Hopefully the tensor-carried quantized types will remain supported and can also be augmented to support novel mechanisms, e.g. VSQuant.
I think there is pretty clear consensus that we keep them, and there seems to be consensus that we keep them in the quant dialect.
Just to reset this conversation a bit: I’m not looking to delete things that are thought out/useful, but when some of us were discussing offline the topic of cleaning up some of our old experiments, this area came up. I’m mainly trying to tidy up things that I think haven’t gone anywhere. This is both for the purpose of just having a cleaner repo and because quantization is very important and deserves more investment in our ecosystem. But when you have a half-developed, experimental thing sitting there, it is just going to confuse people and make it harder to get new development going.
That’s what I’m trying to balance. There is no rush or solid line here – just doing janitorial work on old experiments.
They were a part of a specific quantization algorithm being developed at Google during the early days of the project, but it was the victim of organizational turmoil and left in a zombie state. We should have cleared it out long ago.
I believe we confirmed offline that these cases don’t really align with development upstream and they will be worked out locally to Google’s repositories.
I’ll plan to land the deletion patch in a couple of weeks to allow time to sever any incidental dependencies. Feel free to ask for more time if this creates a scheduling problem. Just trying to tidy things up… not create emergencies.
Just following up as this came up in a recent chat. I think there was still a blocking part in the TF repo that prevented just deleting the majority of the ops – is that the current state? (Some of these would just move to the TFL or tf_type dialect I’d presume, since both sides use them during their respective lowerings, and the TOSA converter side potentially needs additional work [I didn’t check if it’s done].)
So they are retained, followed by a round of modernization given the changes in infra since these were added years ago. Is there someone looking at that, or are there fixed plans or a “help wanted” section?
I’ve heard people wishing we had actual quantization algorithms and such here, which I think would make sense. I don’t think we want more stuff here that is just ops without users. Upgrades to the types make sense to me in principle.