Hi Vooblin - welcome.
For some historical context: most of this was written in the very early days of MLIR (`QuantizedType` was literally the first dialect-specific custom type), and it turned out to be quite early to be implementing something this generic in MLIR. Since the process of quantization (as formulated today – I'm not convinced this is the only way to formulate it) is largely an analysis/transformation done on frontend ops and types, it was hard to build complete tooling in core. Since then, things like extensible traits and op interfaces have matured, which would make it more tractable to extend the generic support in core. Given the constraints back then, liufengdb@ and I ended up collaborating to keep the `QuantizedType` type system in core while he implemented most of the rest of the tooling in TensorFlow, interoperating with the TFLite dialect. I also did some experiments, specifically to prove to myself where we could get with future codegen infra in core, and as you have noted, that is still there in the form of `FxpMathOps`, the `Quantizer`, and some common utilities for solving the constraints. What I didn't have the tools to handle at the time were the tie-ins to the source dialects. I had implemented a POC flow on top of XLA HLO, but since MLIR lacks any frontend ops itself, I ended up leaving the top-level bits that drew it all together in a private repo, waiting for the infra to mature a bit before coming back to it.
However, fast-forwarding to now, I believe we have the infra we need in core to do this right (or at least we will: this is why I'm specifically interested in seeing the Development of high-level Tensor Compute Primitives thread get traction). Even without that, we can get somewhere with some example ops if needed.
For me, these next few months are all about picking up and finishing the things we started in the early days. Right now, I'm doing a couple-week sprint to thread dynamic shapes through, and then I was planning to come back, take a fresh look, and write an RFC advocating a way forward.
At a high level, I was probably going to advocate for:
- Dropping `FxpMathOps`.
- Keeping the common utilities that the `Quantizer` uses but removing the high-level algorithm (which, if we need such a global algorithm, could be implemented much more cleanly these days).
- Introducing some new IR constructs that we wished we had a year ago, aimed at solving the problem generically for arbitrary frontend dialects.
Regarding that last point, Feng and I had been discussing a new `quant_region` op which could be used to wrap arbitrary source-dialect high-precision ops and carry the conversion information that is typically frontend-specific. This would eliminate a lot of the boilerplate in both `FxpMathOps` and in how TFLite had to couple itself to constructs in its dialect. I hadn't finished it and was working on it in my repo. Here is the draft. See specifically `quantized_op_validation.mlir` and `CompressOps.td` (in my project, we reason about quantization as a specific form of compression, hence the name).
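To make the idea concrete, here is a rough sketch of what such an op could look like in MLIR's generic syntax (the op/attribute names and exact form are illustrative only, since the design was never finalized):

```mlir
// Hypothetical quant_region wrapping a source-dialect (XLA HLO) op.
// The region body stays in high precision; the conversion constraints
// ride along as input/output specs expressed with QuantizedTypes.
%result = "quant.region"(%lhs, %rhs) ({
^bb0(%a: tensor<4xf32>, %b: tensor<4xf32>):
  %sum = "xla_hlo.add"(%a, %b)
      : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
  "quant.return"(%sum) : (tensor<4xf32>) -> ()
}) {
  input_specs = [!quant.uniform<i8:f32, 0.1>, !quant.uniform<i8:f32, 0.2>],
  output_specs = [!quant.uniform<i8:f32, 0.25>],
  logical_kernel = "generic.add"
} : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
```

The important property is that the wrapper is agnostic to what is inside the region, so generic passes only need to read and rewrite the specs rather than know about every frontend op.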
With this, source frameworks could provide a pass to outline supported ops into `quant_region` ops, and then we could largely write generic passes on top of that. I provided one simple example for XLA HLO in the `XlaOutlineQuantizable.cpp` pass.
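In before/after terms (again just a sketch), such an outlining pass would rewrite a plain high-precision op into the wrapped form, leaving the specs unconstrained for a later solver to fill in:

```mlir
// Before: a plain high-precision frontend op.
%0 = "xla_hlo.add"(%arg0, %arg1)
    : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>

// After: the same op outlined into a quant_region, with "any"
// quantized types as placeholder specs until solving assigns
// concrete parameters.
%0 = "quant.region"(%arg0, %arg1) ({
^bb0(%a: tensor<4xf32>, %b: tensor<4xf32>):
  %sum = "xla_hlo.add"(%a, %b)
      : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
  "quant.return"(%sum) : (tensor<4xf32>) -> ()
}) {input_specs = [!quant.any<i8:f32>, !quant.any<i8:f32>],
    output_specs = [!quant.any<i8:f32>],
    logical_kernel = "generic.add"}
    : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
```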
Regarding your point about per-axis: the type system supports this, and TFLite has proven the path pretty well. I generally prefer to get the infra working in terms of per-layer first, since per-axis has odd dependencies on the tensor layout and codegen regime that need to be solved carefully (TFLite has neither a concept of layout nor of codegen and gets to ignore these). That is why you don't see it in the current samples in the MLIR repo. It is fully possible; it is just a preference to solve it once the simpler cases are established.
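For reference, the `!quant.uniform` type already expresses both forms; the per-axis variant adds a quantized dimension and a list of per-slice parameters (the numeric values here are made up):

```mlir
// Per-layer (per-tensor): one scale (zero point defaults to 0)
// for the whole tensor.
!quant.uniform<i8:f32, 0.02>

// Per-axis: quantized along dimension 1, with one scale per slice
// of that dimension (here, a tensor whose dim 1 has size 2).
!quant.uniform<i8:f32:1, {0.02, 0.015}>
```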
Finally, as Feng notes above, TFLite uses a fairly local algorithm for resolving quantization parameters, whereas the work derived from mine uses a global algorithm. We'll need to resolve this, and there may end up being a need for elements of both: the computations that the TFLite quantizer works on today are not very semantically complicated, and I expect more complexity with more advanced inputs (but that is just an intuition, albeit one backed by fiddling with some examples).
HTH with context. Does any of that sound unreasonable? Or do you have any interest in collaborating on the path forward?