[RFC] Should the OCP microscaling float scalars be added to APFloat and FloatType?

Background

The Open Compute Project has defined a new standard for block-scaled microfloat (MX) data types.

The standard defines data types that consist of a block of microfloats (such as the 4-bit E2M1 type or the 6-bit E3M2 and E2M3 types) along with a shared scale, which holds 8 exponent bits and no mantissa (the E8M0 format).

The intended usage of these microfloats is as part of a block-scaled type: blocks of K microfloats (with K = 32 required by the specification) along with their shared scale. This is intended to enable efficient matrix multiplication and to allow compressed storage of tensors, most typically the weight tensors in machine learning models.

Since there’s already software emulation for these types in PyTorch, and hardware vendors will likely add support for the MX types in the future (having signed on to the spec), it’ll be useful to have these types representable in MLIR’s core library in the interests of interoperability.

The types

The components of a microscale block are somewhat like floats, in that they have sign, exponent, and mantissa bits. However, due to their small size, the individual components have no infinities or NaNs. When a NaN is needed, it applies to the entire block of microfloats and is encoded by setting the shared scale to 0xff.

The alternatives

Between their very limited range, sub-byte length, lack of special values, and the fact that the OCP specification doesn’t contemplate operations on individual microfloats, it’s not clear that individual microfloat types (like a hypothetical FloatE2M1FN) should be added to MLIR’s FloatType hierarchy (which would require adding them to APFloat first).

However, adding those scalars to FloatType would enable microfloats to be handled with much of MLIR’s existing infrastructure. For example, a block of 4-bit floats could be a vector<32 x fE2M1FN> instead of needing a custom type.
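
To make that concrete, here’s a minimal sketch of what decoding a block could look like with only existing ops, assuming a hypothetical fE2M1FN scalar and assuming the shared scale has already been converted to its f32 value (2^(exponent - 127) per the spec). Note that arith.extf only sees the element bits, so the scale multiply has to happen separately:

// Hypothetical: fE2M1FN does not exist today; the shared scale arrives pre-converted to f32.
func.func @decode_block(%block: vector<32xfE2M1FN>, %scale: f32) -> vector<32xf32> {
  %elems = arith.extf %block : vector<32xfE2M1FN> to vector<32xf32>
  %scale_vec = vector.broadcast %scale : f32 to vector<32xf32>
  %decoded = arith.mulf %elems, %scale_vec : vector<32xf32>
  return %decoded : vector<32xf32>
}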

An alternative approach would be to define a new microscaling dialect with custom types like !microscaling.fe2m1, !microscaling.block<32xfe2m1>, and !microscaling.scale. Combined with a sufficiently general bitcast operation, this would allow defining operations on microscale types, which will, at the end of the day, still lower to an appropriate-width integer (or a vector of them), since, to my knowledge, there are no plans to add these microfloats to LLVM’s or SPIR-V’s type systems.
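
To make the “lowers to integers anyway” point concrete, such a bitcast might look like the following, with a 32-element E2M1 block occupying 32 x 4 = 128 bits and the scale occupying one byte (op and type names are placeholders following the spellings above):

%raw_block = microscaling.bitcast %block : !microscaling.block<32xfe2m1> to vector<4xi32>
%raw_scale = microscaling.bitcast %scale : !microscaling.scale to i8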

However, because these custom types would sit outside the float type hierarchy, they wouldn’t be permissible in many high-level dialects like Tosa, creating substantial friction when generating high-level descriptions of model fragments that take or produce microscaled floats.

I’m not seeing a clear best path forward on this design question, so I’m writing to get the opinions of other people who’ll be hooking up microfloat support in MLIR at some future time.

Thanks for bringing this up. We’ve started to look at microscaling format support for TOSA. It’s still early, so we don’t have a specific proposal yet, but it would be great to have something aligned across dialects.

In terms of the TOSA operator set, our current idea is to only allow the microscaling formats as inputs/outputs to the tosa.cast operator. That way you would cast them to a standard FP type to do your calculations and cast back to a microscaling format afterward. I think that would fit better conceptually with your second approach, but would also like to hear other opinions.
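
As a rough sketch of that flow (the shapes are arbitrary and !mx is only a placeholder for whatever MX representation ends up being chosen):

%a_f32 = tosa.cast %a_mx : (tensor<1x32x64x!mx>) -> tensor<1x32x64xf32>
// ... ordinary TOSA compute on the f32 tensors ...
%c_mx = tosa.cast %c_f32 : (tensor<1x32x64xf32>) -> tensor<1x32x64x!mx>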

@eric-k I’m in agreement with the mechanism in Tosa being casts, since that’s something anyone can pattern-match on if they’ve got a hardware-specific microscaling convolution or matmul. I’d argue that, right now, you’d want to ensure you can cast to/from the 8-bit float types (f8E5M2, f8E4M3FN, and so on) in Tosa, since 8-bit float matrix multiplication acceleration already exists.

Now, that being said, the cast operation for microscaled floats is a bit weird compared to arith.extf and arith.truncf. I’ve got a preliminary sketch of a microscaling dialect bouncing around my machine, but I haven’t posted it because I wanted to solve this design issue first. Relevantly, I’ve been considering cast operations along these lines:

%block, %scale = microscaling.trunc %floats : vector<Kxf32> to !block<fE3M2, K>, !scale
%floats = microscaling.ext %block, %scale : !block<fE2M1, K>, !scale to vector<Kxf32>

(though that f32 might be f16 or something like that)

This is operating on the assumption that the computation is being passed a tensor of blocks and a tensor of scales, which is one solution for the cache locality vs. alignment tradeoff that comes with attaching 8 bits to a 128-256 bit packed vector. In other words, at a tensor level, we’d see

func.func @ext_microscaled_soa(%blocks: tensor<Nx!microscaling.block<fE5M2, 32>>, %scales: tensor<Nx!microscaling.scale>) -> tensor<Nx32xf32> {
  %ret = tosa.cast %blocks, %scales : ...
  return %ret : tensor<Nx32xf32>
}

There’s an alternative storage scheme that looks like

func.func @ext_microscaled_aos(%data : tensor<Nx!microscaling.packed_block_scale<fE5M2, 32>>) -> tensor<Nx32xf32> {
  %ret = tosa.cast %data : ...
  return %ret : tensor<Nx32xf32>
}

Which of these schemes makes more sense is, as far as I can tell, architecture and problem dependent, so we’d want to support both options.

Now, if we’re going this route, it’ll probably be useful to have %packed_block_scales = microscaling.pack %blocks, %scales and %blocks, %scales = microscaling.unpack %packed_block_scales for getting everything to fit into APIs. (Those operations would work on the tensor level, and, ideally, would be folded away during codegen)
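
With the types from the earlier sketches spelled out, that would look something like this (syntax still entirely hypothetical):

%packed = microscaling.pack %blocks, %scales
    : tensor<Nx!microscaling.block<fE5M2, 32>>, tensor<Nx!microscaling.scale>
    to tensor<Nx!microscaling.packed_block_scale<fE5M2, 32>>
%unpacked_blocks, %unpacked_scales = microscaling.unpack %packed
    : tensor<Nx!microscaling.packed_block_scale<fE5M2, 32>>
    to tensor<Nx!microscaling.block<fE5M2, 32>>, tensor<Nx!microscaling.scale>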

There’s also something like

func.func @ext_microscaled_elems(%floats : tensor<Nx32x!microscaling.elem<fE2M1>>, %scales : tensor<Nx!microscaling.scale>) -> tensor<Nx32xf32> {
  %ret = tosa.cast %floats, %scales : ...
  return %ret : tensor<Nx32xf32>
}

but that has the problem that you’ve now got a tensor with sub-byte elements, meaning that you can’t index into it post-bufferization and expect that to work without a bunch of special-casing, so I’m not particularly enthusiastic about this approach.

I hope this is making some sort of sense.

Thanks. One of the problems I’ve been having while parsing the MX specs is that they say nothing about layouts. I’m having a bit of trouble connecting a concrete design to such an abstract description.

But if I were imagining how this might work out and borrowing from other spaces, it kind of has to break down to some form of planar or interleaved/packed formats, and it seems to me that at the level we operate here, we will need to have a way to represent both.

I find it much easier to think of these things from a bottom up perspective, and there are already a ton of priors. Particularly, this looks a lot like some of the structs that llama.cpp uses for its various numeric formats. And that connects to some experiments that were done last year to literally use llvm.struct as a tensor element type. I’m not saying we should literally do that, but the result was that such a representation composed reasonably well.
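
Just to make that analogy concrete (purely illustrative, not a proposal): llama.cpp’s block_q4_0 is an f16 scale plus 32 4-bit quants packed into 16 bytes, and an MX block has the same shape with an 8-bit E8M0 shared scale, so as llvm.struct tensor element types they’d look something like:

// llama.cpp-style 4-bit block: f16 scale + 32 packed 4-bit quants.
tensor<?x!llvm.struct<(f16, array<16 x i8>)>>
// MX-style block: 8-bit E8M0 shared scale + 32 packed E2M1 elements.
tensor<?x!llvm.struct<(i8, array<16 x i8>)>>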

It seems to me there might be 3-ish ingredients:

  1. At the very high level, potentially some form of encoding attributes/types and corresponding constant attributes/ops for representation of logical literals prior to imbuing them with layout.
  2. MX structured types and various pack/unpack ops to transform them between planar and packed forms.
  3. Type interfaces for struct-like types and work to make those operate as element types for tensor and memref. Possibly make vector implement the interface.

Thinking about a generic, non-hardware-accelerated CPU lowering pipeline should inform what pieces are needed.

One issue I’ve seen in the past is for people, when faced with an abstract type like this, to assume that nothing can be assumed about layout and how it decomposes. In reality, the universe of how it decomposes, while platform-specific, is still bounded by the usual concepts of locality and alignment. I.e., arrangements that are nonsensical can be ignored rather than joined over, and unless I miss my guess, there will end up being a manageable number of permutations. The hint I’ve found that you’re in one of these nonsensical areas is that you end up with unaligned sub-byte types that are not blocked in a coherent way. I’ve yet to see that arise in nature in a way that is legal for any problem or platform. The answer there is always to propagate the constraint upward toward the user-level programming model, in my experience.

Reading back over your message, I think we’re on a similar page, but it helps me nonetheless to write out my thoughts.

And to answer the question posed at the top, I don’t think these are related to APFloat or FloatType. I think that, from a code generation perspective and for targeting lower-level hardware intrinsics, they are something new for MLIR and more closely related to packing/blocking. It would be a good opportunity to connect those concepts up properly, as they also come up frequently as we deal with various implementation-defined sub-byte formats.

As for the basic scalar types themselves, I think that is an option we should keep an open mind about. The hint that we should define them in APFloat would be if we end up using them as scalars or as basic element types in constants. I could see such a situation arising naturally as part of the rest of the design, and the advantage of defining them properly is that you get basic parse/print/typing and arithmetic emulation suitable for folding and such. If we find ourselves there, it makes a lot of sense to add them as real types, and I’ve found that to be a much better option than trying to define such things another way.
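
For comparison, the 8-bit float types that are already in APFloat/FloatType give us this for free today, and the MX element types would presumably slot in the same way if we got there:

// f8E5M2 already round-trips through the parser/printer like any other float,
// and APFloat gives folders real arithmetic to work with.
%cst = arith.constant 1.5 : f8E5M2
%ext = arith.extf %cst : f8E5M2 to f32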