Currently, UniformQuantizedType only supports built-in MLIR storage types such as Integer. LLM quantization research has introduced NF4 as a low-precision datatype (see https://arxiv.org/pdf/2305.14314), and there is a growing need to make the system extensible and maintainable as more types are added. Ensuring that MLIR can natively support NF4 through a clean, extensible interface is essential for both current and future quantization workflows.
Current Approach and Its Limitations:
- The present implementation relies on dynamic checks (e.g., type switches or if-else chains) to determine the storage type and retrieve type-specific information for legality checks (illustrated in the sketch after this list).
- This approach works for a small, fixed set of types, but as the number of supported types grows, the code becomes harder to read, maintain, and extend.
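For illustration, this is the kind of per-type branching meant above (a hypothetical sketch, not the exact upstream code); every new storage type means another hand-written case:

  #include "mlir/IR/BuiltinTypes.h"
  #include "llvm/Support/MathExtras.h"

  // Hypothetical sketch of the current style: one branch per supported storage type.
  static int64_t getDefaultStorageMin(mlir::Type storageType, bool isSigned) {
    if (auto intType = mlir::dyn_cast<mlir::IntegerType>(storageType))
      return isSigned ? llvm::minIntN(intType.getWidth()) : 0;
    if (mlir::isa<mlir::Float8E5M2Type>(storageType))
      return -57344; // hard-coded finite minimum of f8E5M2
    // ... every additional storage type adds another branch here ...
    return 0;
  }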
Proposed Interface-Based Approach:
- Define a StorageTypeInterface that specifies the required methods any storage type must implement in order to be used in UniformQuantizedType.
- Each storage type (Integer, Float8E5M2, Float8E4M3FN, and new types like NF4) would implement this interface, encapsulating its type-specific logic.
- When UniformQuantizedType needs to check legality or retrieve information, it can use MLIR's dyn_cast mechanism to check whether the type implements the interface and then call the required methods (see the sketch after this list).
- This design decouples UniformQuantizedType from the specifics of each storage type, making it easy to add new types (such as NF4) without modifying the core logic or introducing more type checks.
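As a rough sketch of that last point (assuming an interface named QuantizationInterface, as in the PR below, with the methods shown in the NF4 skeleton later in this thread), a legality check could look like:

  #include "mlir/IR/Types.h"
  #include "mlir/Support/LLVM.h"

  // Sketch only: accept any storage type that implements the interface and use
  // its methods for the bounds check, with no per-type branches.
  static mlir::LogicalResult verifyStorageType(mlir::Type storageType,
                                               int64_t storageTypeMin,
                                               int64_t storageTypeMax) {
    auto iface = mlir::dyn_cast<mlir::QuantizationInterface>(storageType);
    if (!iface)
      return mlir::failure(); // not a supported storage type
    if (storageTypeMin < iface.getDefaultMinimum() ||
        storageTypeMax > iface.getDefaultMaximum())
      return mlir::failure(); // requested range exceeds what the type can hold
    return mlir::success();
  }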
Benefits:
- Extensibility: New storage types can be added by simply implementing the interface, without touching the core UniformQuantizedType logic.
- Readability: The code is cleaner, as it avoids large switch statements or if-else chains.
- Maintainability: Type-specific logic is encapsulated within each type, reducing the risk of errors and making the codebase easier to understand and update.
Example Implementation:
To demonstrate this approach, I have prepared a pull request (see Extending UniformQuantizedType with interface-based support for new storage types in Quant dialect by Roman-Pevnyi · Pull Request #152966 · llvm/llvm-project · GitHub) where I implemented the QuantizationInterface. This PR serves as a concrete example of how we can use an interface for legality checks on arbitrary storage datatypes in MLIR’s quantization infrastructure in a scalable and maintainable way.
I’ve commented on the PR, but the comment is perhaps more high-level design discussion than code review, so I’ll duplicate it here.
I think we need to decide first if we want a contiguous storage type for sub-byte types or not.
For example:
int4 and fp4 can have size = 4, storage_size = 8 but still pack two elements per byte. Or we call storage_type the actual storage (i.e., 4) and have an additional alignment = 8 to mean the last element has padding.
fp6 can be represented in two lists (4 + 2 bits), and those lists themselves be packed or not.
What this interface would look like depends on those answers (a toy packing sketch follows below).
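To make the first question concrete, here is a purely illustrative packing helper: two 4-bit elements per byte, so the value width is 4 while the container/alignment is 8, and an odd element count leaves padding in the last byte.

  #include <cstdint>
  #include <vector>

  // Hypothetical illustration of the int4/fp4 case: two 4-bit elements packed
  // per byte. The open question is whether the quant storage type models the
  // 4-bit value or the 8-bit container it lives in.
  static std::vector<uint8_t> packInt4(const std::vector<int8_t> &values) {
    std::vector<uint8_t> packed((values.size() + 1) / 2, 0);
    for (size_t i = 0; i < values.size(); ++i) {
      uint8_t nibble = static_cast<uint8_t>(values[i]) & 0xF;
      packed[i / 2] |= (i % 2 == 0) ? nibble : uint8_t(nibble << 4);
    }
    return packed; // an odd element count leaves the final high nibble as padding
  }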
The second question is if we want to have an MX type (tuple of vectors, with payload, scaling factor, storage type and element type). If we do, then the conversion between the MX and non-MX would be in an MX dialect (potentially quant), and if we make MX types native in MLIR, then in theory, we could tile and fuse them by teaching those patterns to descend into the sub-types.
I don’t mind experimenting with it like your implementation, but it would be good to know what folks would prefer as a final destination, so that we go all to the same place.
Thanks Roman for this effort.
Similar to @rengolin, I added some PR-related comments in the PR. My main other question concerns “currently MLIR quantization..storage type only as Integer”. Could you add another use-case besides Builtin_Integer implementing QuantizationInterface, which I guess you already have, to make the case stronger? It would also serve as an instructive example for other future users of this interface. Thanks.
Thank you for the review!
I will add another use-case other than Builtin_Integer so we can see the picture better.
Thanks @Roman-Pevnyi for the PR/RFC.
I am not aware of the inception of this dialect. However, I feel it can only represent the scale as a constant, but we may have the case where the scale is computed on the fly (dynamic quantization) and we need to encode this computed scale. Is that supported, or am I missing something?
I’ve added support for Float8E5M2 and Float8E4M3FN.
As shown in the second commit, this only required extending the float types themselves by implementing the QuantizationInterface methods. The Quant dialect code was not touched because it can already accept any type that implements this interface.
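For reference, here is a minimal sketch of what such an implementation could look like for one of the float8 types (not the PR’s exact code; the 448 bound is the largest finite f8E4M3FN value, and the method names follow the NF4 skeleton below):

  // Sketch only: QuantizationInterface methods for Float8E4M3FNType, assuming
  // the method declarations have been added to the builtin type as described above.
  bool Float8E4M3FNType::isStorageSigned() const { return true; }
  unsigned Float8E4M3FNType::getStorageWidth() const { return 8; }
  int64_t Float8E4M3FNType::getDefaultMaximum() const { return 448; } // largest finite f8E4M3FN
  int64_t Float8E4M3FNType::getDefaultMinimum() const { return -448; }
  std::string Float8E4M3FNType::getStorageType() const { return "f8E4M3FN"; }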
Below I will present NF4, which is not a built-in type and has its own defined structure.
You can see the current NF4 type implementation here.
NF4 consists of a storage type, a quantile type, and a list of quantile values (see the example IR below).
It already has get, print, and parse methods implemented.
To make NF4 usable as a UniformQuantizedType storage type, I would add the QuantizationInterface method implementations, following the same approach used for the float8 formats in the second commit.
Example code skeleton:
class NF4Type : public mlir::Type::TypeBase<NF4Type, QuantileFloatType,
                                             vpux::detail::QuantileFloatTypeStorage,
                                             mlir::QuantizationInterface::Trait> {
public:
  /// Existing code...

  // QuantizationInterface method implementations
  bool isStorageSigned() const { return true; }

  /// Get the bit width of this 4-bit normalized floating point type.
  unsigned getStorageWidth() const { return 4; }

  /// Get the default maximum value for this 4-bit normalized floating point type.
  int64_t getDefaultMaximum() const { return 16; }

  /// Get the default minimum value for this 4-bit normalized floating point type.
  int64_t getDefaultMinimum() const { return -getDefaultMaximum(); }

  /// Get the storage type name as a string, e.g.
  /// "!QuantileFloat.nf4<ui4:f16, {...}>".
  std::string getStorageType() const {
    std::string result = "!QuantileFloat.nf4<";
    llvm::raw_string_ostream os(result);
    // Print the underlying storage (ui4) rather than recursing into this method.
    os << "ui" << getStorageWidth();
    os << ":";
    os << getQuantileType();
    os << ", {";
    // Print the quantile table as a comma-separated list.
    llvm::interleaveComma(this->getQuantiles(), os);
    os << "}>";
    return result;
  }
};
Example IR:
!qalias = !quant.uniform<!QuantileFloat.nf4<ui4:f16, {-1.000000e+00, ..., 1.000000e+00}> : f16, 1.0 : 0>
!qalias = !quant.uniform<!QuantileFloat.nf4<storage_type:quantiles_type, {quantiles}> : expressed_type, scale : zero_point>
Related to MX types, I would argue that we can separate their datatype and quantization details and represent them, for example MXFP4, as a simple f4E2M1FN storage type together with a sub-channel quantization scheme which also stores its scales as low-precision f8E8M0FN data.
Like in the original proposal https://discourse.llvm.org/t/rfc-supporting-sub-channel-quantization-in-mlir/82694
Let's imagine a tensor with the following specs:
// tensor<16384x3072xMXFP4>
// quantizationDimensions : [0,1]
// blockSizes: [1,32]
// scales: [[s0/0, s0/95], [s1/0,s1/95], .., [s16383/0, ... s16383/95]] : tensor<16384x96xf8E8M0FN>
Then we'd have the following
tensor<16384x3072x!quant.uniform<i8:f32:{0:1, 1:32}:{f8E8M0FN}, {{s0/0, s0/95}, {s1/0,s1/95}, .., {s16383/0, ... s16383/95}}>>
Please correct me if I’m missing any MXFP detail that can’t be covered by the above representation.
The one thing missing is to separate the “storage type” from the “storage container type”.
For example, PyTorch’s AO quantizes fp6 into INT8 containers (which they call naive), while other implementations create two separate tensors of INT2 and INT4.
Depending on the pair { storage, container } type, the extraction could require a sequence of vector loads and shuffles (for the fp6-in-two-separate-tensors case). This is the exception, though, so not high priority right now.
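As a toy illustration of that split-container case (purely hypothetical bit layout: the low 4 bits of each fp6 element in one tensor and the high 2 bits in another), the scalar recombination is a shift-and-or, which vectorizes into the loads and shuffles mentioned above:

  #include <cstdint>

  // Hypothetical recombination of one fp6 element from its two containers:
  // 'low4' holds the low 4 bits, 'high2' the high 2 bits. Returns the 6-bit
  // payload in the low bits of a byte.
  static uint8_t assembleFp6(uint8_t low4, uint8_t high2) {
    return static_cast<uint8_t>(((high2 & 0x3u) << 4) | (low4 & 0xFu));
  }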