RFC: Add APFloat and MLIR type support for fp8 (e5m2)

Given the announcement by Nvidia, ARM, and Intel and the corresponding whitepaper, we would like to add support for the FP8 datatypes so that they can be real MLIR types and (optionally) start to be used in LLVM IR and backends. As a first step, I have prepared a patch for one of these types (e5m2).

It looks (to me) like prior types (like BFLOAT16) were added by patch and did not have a dedicated RFC. However, given how many parties are engaged with FP8 support, I decided that it was best to raise it as an RFC.

I think there are two decisions to make:

  • Do we want to have the ability to support these FP8 types natively within LLVM/MLIR? (I have assumed yes, but this can be debated)
  • Is the naming and integration approach taken in the patch appropriate?

Thank you.

Patch comments below for discussion

This is a first step towards a high-level representation for fp8 types
that have been built into hardware with near-term roadmaps. Like the
BFLOAT16 type, the family of fp8 types is inspired by IEEE-754 binary
floating point formats but, due to the size limits, has been tweaked in
various ways in order to maximally use the range/precision in various
scenarios. The list of variants is small/finite and bounded by real

This patch introduces the E5M2 FP8 format as proposed by Nvidia, ARM,
and Intel in the paper: https://arxiv.org/pdf/2209.05433.pdf
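
To make the bit layout concrete, here is a minimal sketch of decoding an E5M2 byte, assuming the layout I read from the paper: 1 sign bit, 5 exponent bits with bias 15, 2 mantissa bits, and IEEE-style Inf/NaN encodings. The function names are hypothetical and not part of the patch. Because E5M2 shares its exponent width and bias with binary16, the second helper cross-checks the result by treating the byte as the upper half of a half-precision value.

```python
import math
import struct

E5M2_BIAS = 15  # assumed: same exponent bias as IEEE-754 binary16


def decode_e5m2(byte: int) -> float:
    """Decode one E5M2 bit pattern (0..255) into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 2) & 0x1F  # 5 exponent bits
    mantissa = byte & 0x03         # 2 mantissa bits
    if exponent == 0x1F:           # all-ones exponent: Inf or NaN, IEEE style
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:              # zero or subnormal: no implicit leading 1
        return sign * (mantissa / 4.0) * 2.0 ** (1 - E5M2_BIAS)
    return sign * (1.0 + mantissa / 4.0) * 2.0 ** (exponent - E5M2_BIAS)


def decode_via_fp16(byte: int) -> float:
    """Cross-check: use the byte as the high byte of a little-endian binary16."""
    return struct.unpack("<e", bytes([0x00, byte]))[0]
```

For example, `decode_e5m2(0x3C)` gives 1.0 and `decode_e5m2(0x7B)` gives 57344.0, the largest finite value under these assumptions.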

As the more conformant of the two implemented datatypes, we are plumbing
it through LLVM’s APFloat type and MLIR’s type system first as a
template. It will be followed by the range optimized E4M3 FP8 format
described in the paper. Since that format deviates further from the
IEEE-754 norms, it may require more debate and implementation

Given that we see two parts of the FP8 implementation space represented
by these cases, we are recommending naming of:

  • F8M<N> : For FP8 types that can be conceived of as following the
    same rules as FP16 but with a smaller number of mantissa/exponent
    bits. Including the number of mantissa bits in the type name is enough
    to fully specify the type. This naming scheme is used to represent
    the E5M2 type described in the paper.
  • F8M<N>F : For FP8 types such as E4M3 which only support finite
    values (and NAN).

The first of these (this patch) seems fairly non-controversial. The
second is previewed here to illustrate options for extending to the
other known variant (but can be discussed in detail in the patch
which implements it).

Many conversations about these types focus on the Machine-Learning
ecosystem where they are used to represent mixed-datatype computations
at a high level. At that level (which is why we also expose them in
MLIR), it is important to retain the actual type definition so that when
lowering to actual kernels or target specific code, the correct
promotions, casts and rescalings can be done as needed. We expect that
most LLVM backends will only experience these types as opaque I8
values that are applicable to some instructions.

MLIR does not make it particularly easy to add new floating point types
(i.e. the FloatType hierarchy is not open). Given the need to fully
model FloatTypes and make them interop with tooling, such types will
always be “heavy-weight” and it is not expected that a highly open type
system will be particularly helpful. There is also a bounded number of
floating point types in use for current and upcoming hardware, and we
can just implement them like this (perhaps looking for some cosmetic
ways to reduce the number of places that need to change). Creating a
more generic mechanism for extending floating point types seems like it
wouldn’t be worth it and we should just deal with defining them one by
one on an as-needed basis when real hardware implements a new scheme.
Hopefully, with some additional production use and complete software
stacks, hardware makers will converge on a set of such types that is not
terribly divergent at the level that the compiler cares about.

(I cleaned up some old formatting and sorted some items for this case:
If we converge on landing this in some form, I will land the format-only
changes as a separate NFC commit)


I might be in the minority here, but this subject touches close to home for me. I’m very interested in an extensible floating-point type (in particular for the MLIR world).

As far as I’m aware the following have all been proposed:

FP8: s1e5m2 (https://arxiv.org/pdf/2209.05433.pdf), s1e4m3 (https://arxiv.org/pdf/2209.05433.pdf), s1e3m4 (Untether AI Unveils Its Second-Generation At-Memory Compute Architecture at Hot Chips 2022 — Untether AI)

FP8 with custom bias: CFloat8_1_4_3, CFloat8_1_5_2 (https://tesla-cdn.thron.com/static/SBY4B9_tesla-dojo-technology_OPNZ0M.pdf)

All of these could potentially benefit from deviating from the conventional floating-point spec for representing NaNs/Infs (though this is architecture- and potentially application-specific).

Further floating-point types in the wild:
TF32: s1e8m10 (What is the TensorFloat-32 Precision Format? | NVIDIA Blog)
CB16: s1e6m9 (Data Formats — Software Documentation (Version 1.5.0))
SHP: s1e5m10 (with custom bias, https://tesla-cdn.thron.com/static/SBY4B9_tesla-dojo-technology_OPNZ0M.pdf)
UHP: e6m11 (with custom bias, https://tesla-cdn.thron.com/static/SBY4B9_tesla-dojo-technology_OPNZ0M.pdf)

In the FPGA community you can end up supporting very broad formats (some of which I explored in “FPGA-based training of convolutional neural networks with a reduced precision floating-point library”, IEEE Xplore, e.g. s1e6m5).
Another example of a broad FPGA library can be found here: https://www.flopoco.org/
Another even more esoteric floating-point format (block floating-point): https://www.microsoft.com/en-us/research/blog/a-microsoft-custom-data-type-for-efficient-inference/

Some of the architectures I’ve worked with also have a fairly easy path toward something like FP24 (i.e. s1e8m15).

With some of the above in mind, a path toward flexibility in specifying FloatTypes in MLIR seems fairly compelling to me, though we could hypothetically go down the route of just enumerating all of these types (but that feels weird given the already extensible IntegerType).

What would likely need to be captured going down that route would be (and I’m likely missing something else):

  1. Min Exponent Value
  2. Max Exponent Value
  3. Inf Semantics
  4. NaN Semantics
  5. Subnormal support
  6. Sign-bit inclusion
  7. Mantissa width
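
As a rough illustration of what such a parameterized description might look like, here is a sketch of a record holding the seven properties above. Everything in it is hypothetical (the class, enum, and field names are mine, not MLIR's), and it assumes that the min/max exponent are the unbiased exponents of the smallest and largest normal binades. The one derived quantity shown, the smallest positive value, depends only on the min exponent, the mantissa width, and subnormal support.

```python
from dataclasses import dataclass
from enum import Enum


class InfSemantics(Enum):
    IEEE = "ieee"  # all-ones exponent, zero mantissa
    NONE = "none"  # no infinities (finite-only format)


class NaNSemantics(Enum):
    IEEE = "ieee"              # all-ones exponent, non-zero mantissa
    SINGLE_PATTERN = "single"  # one reserved NaN pattern
    NONE = "none"


@dataclass(frozen=True)
class FloatFormat:
    """Hypothetical descriptor covering the seven properties listed above."""
    min_exponent: int  # unbiased exponent of the smallest normal binade
    max_exponent: int  # unbiased exponent of the largest finite binade
    inf: InfSemantics
    nan: NaNSemantics
    has_subnormals: bool
    has_sign_bit: bool
    mantissa_bits: int

    def smallest_positive(self) -> float:
        """Smallest representable positive value."""
        if self.has_subnormals:
            return 2.0 ** (self.min_exponent - self.mantissa_bits)
        return 2.0 ** self.min_exponent


# The two whitepaper formats, as I read them (an assumption, not the patch).
E5M2 = FloatFormat(-14, 15, InfSemantics.IEEE, NaNSemantics.IEEE, True, True, 2)
E4M3FN = FloatFormat(-6, 8, InfSemantics.NONE, NaNSemantics.SINGLE_PATTERN, True, True, 3)
```

A custom exponent bias is not a separate field here because, under the stated assumption, it is implied by the min/max exponent pair.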

Extending APFloat with one more IEEE float format isn’t really a significant change; it’s minimally invasive, and there isn’t really any stability commitment because APFloat is only exposed in the C++ API.

If you want an LLVM IR type, I think we’d want a more serious discussion of how we actually want to handle this sort of construct. There’s a significant stability commitment for LLVM types, and supporting code generation is a significant amount of code. The existing implementation is designed under the assumption that new floating-point types are added rarely, but machine learning seems to be leading to a proliferation of new floating-point types.

This assessment is why I opted to raise this as an RFC rather than just a patch. I think adding a couple of these FP types is not a significant change, but I also wanted to solicit comments because we may want to rethink our approach based on what we see happening.

Right, I think the complexity of representing arbitrary widths is very different for integers and for floating-point.

Having a first-class variable-FP would be more stable long term, but could also add unnecessary complexity to the majority of use cases (non-ML/DSP/FPGA). But adding individual formats doesn’t scale.

I don’t know if there’s a way to derive conversions between generic FP formats (like sext / zext / trunc for INTs), so they could end up as different types anyway (like FP80). If there is, we’d still have to re-implement the semantics, int conversions, casts and have some hard-coded (or templated) versions for the standard ones.
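
For one specific pair the conversion is close to a plain trunc: binary16 and E5M2 share the exponent width and bias, so narrowing amounts to dropping the low 8 mantissa bits with rounding. The sketch below assumes round-to-nearest-even and IEEE-style Inf/NaN on both sides; the function names are hypothetical. It does not generalize to formats with a different exponent width or bias, which is exactly where the hard part would be.

```python
import struct


def fp16_bits(value: float) -> int:
    """IEEE-754 binary16 bit pattern of a Python float (raises OverflowError
    for finite values too large for binary16)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def fp16_bits_to_e5m2(bits: int) -> int:
    """Narrow a binary16 bit pattern to an E5M2 byte, round-to-nearest-even."""
    upper, lower = bits >> 8, bits & 0xFF
    is_nan = (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0
    if is_nan:
        # Keep the sign and force a mantissa bit so the result stays a NaN.
        return upper | 0x02
    # Round on the discarded low byte; exact ties go to the even neighbour.
    if lower > 0x80 or (lower == 0x80 and (upper & 1)):
        upper += 1  # a carry into the exponent moves to the next binade or Inf
    return upper
```

For example, 1.125 sits exactly between 1.0 and 1.25 and rounds to the even pattern 0x3C (1.0), while 1.375 rounds up to 0x3E (1.5).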

It’s probably doable, but as Eli said, standard code gen isn’t expecting that. In a way, we can probably emulate that with isDoubleTy wrappers and completely ignore any FP that isn’t “standard”, but I’m not sure that’d be enough to avoid all problems down the line.

Of course, we can always “do this one for now” once more and think about it later…

To be clear, I think we should just “do this one [two] more for now” and think about it later, but my conscience pricked enough to raise the question.


Oh nice. I’m +1 on adding formats like these to APFloat.

The only concern I have is about “naming” of these. For example, calling something “float8” is potentially problematic given there are various 8 bit types with different mantissa/exp tradeoffs or if they implicitly ignore denormals etc (e.g. both E4M3 and E5M2 are in that whitepaper). I’d rather that we have multiple fp semantics that precisely match the various numerics that are used by accelerators rather than try to have one thing that isn’t an exact match.

So I guess my concrete question is “should this be called f8, or fE5M2” (or something less yuck)?


I agree with the naming concern. How about Float8e5m2 and Float8e4m3?

The F8 part in F8M2 can be confused with TensorFloat32, which is e8m10, if someone focuses on just the F8 part.

From a user perspective, one has intN, with N being how many bits are used to store the integer, and the instruction determines whether calculations with that number are signed or unsigned.
For floating-point numbers we have an exponent and a significand (or mantissa).
So, the type could be Float8e4m3 or Float8e5m2.
Would the instruction therefore be the thing that determines whether denormals are understood, which rounding mode is used, etc.?

I figured that “naming” would be a main discussion point. My goal was to align them precisely with what is implemented (vs inexact match).

A couple of points of clarification:

  • I believe that both of these types presented in the whitepaper do not implicitly ignore denormals (that is how I read Table 1). Open to a second opinion here.
  • The described E4M3 type deviates from IEEE-like behavior in two additional ways (using the saved bit patterns to extend the representable range): a) infinities are not represented, b) there is only one NaN bit pattern.
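
To illustrate those two deviations, here is a minimal decoding sketch for E4M3 as I read Table 1 of the paper: bias 7, no infinities, and only the all-ones exponent with all-ones mantissa (with either sign bit) reserved for NaN. The function name is hypothetical, and the details should be checked against the paper rather than taken from this sketch.

```python
import math

E4M3_BIAS = 7  # assumed exponent bias, per my reading of the paper


def decode_e4m3fn(byte: int) -> float:
    """Decode one E4M3 (finite-only, single-NaN) bit pattern into a float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F  # 4 exponent bits
    mantissa = byte & 0x07         # 3 mantissa bits
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan            # the only NaN encoding (S.1111.111)
    if exponent == 0:              # zero or subnormal
        return sign * (mantissa / 8.0) * 2.0 ** (1 - E4M3_BIAS)
    # The all-ones exponent is an ordinary binade here, not Inf/NaN.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - E4M3_BIAS)
```

Under these assumptions `decode_e4m3fn(0x7E)` is 448.0, the largest finite value, and `decode_e4m3fn(0x78)` is 256.0 where an IEEE-style layout would encode infinity.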

In the wild (as noted above), there are a couple of other variations, including exponent biases that deviate from IEEE-754 conventions.

I’d propose that we use IEEE-754 conventions as a baseline and attempt to fully qualify deviations in the name, possibly by tacking on suffix terms (I don’t think there is any avoiding the “yuck” and also being clear). Maybe something like:

  • F<VARIANT>? : Supports finite values, not infinities. If more weird infinity semantics are needed, we can just tack an integer on the end to disambiguate.
  • N<VARIANT>? : NAN variant.
  • B<OFFSET> : Bias offset.

We don’t need to decide on all of those in this patch but just agree in principle to something like this.

With the above, and cherry-picking from some of the other comments, how about, for the two listed types:

  • f8E5M2 / Float8E5M2
  • f8E4M3FN / Float8E4M3FN
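
The proposed scheme is mechanical enough to sketch as a small helper. This is purely illustrative: the function and its parameters are hypothetical, it only covers the F and N suffixes, and it leaves out the B<OFFSET> and <VARIANT> parts, which are not settled above.

```python
def float8_type_name(exponent_bits: int, mantissa_bits: int, *,
                     finite_only: bool = False, nan_variant: bool = False,
                     short: bool = False) -> str:
    """Build a name such as Float8E5M2 or f8E4M3FN from the format details."""
    if 1 + exponent_bits + mantissa_bits != 8:
        raise ValueError("sign + exponent + mantissa must total 8 bits")
    prefix = "f8" if short else "Float8"
    # F: finite values only (no infinities); N: non-standard NaN encoding.
    suffix = ("F" if finite_only else "") + ("N" if nan_variant else "")
    return f"{prefix}E{exponent_bits}M{mantissa_bits}{suffix}"
```

With it, `float8_type_name(5, 2)` yields `Float8E5M2` and `float8_type_name(4, 3, finite_only=True, nan_variant=True, short=True)` yields `f8E4M3FN`.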

I have no strong opinions and, having batted this around for some time, have also come to the conclusion that whatever we do is going to be a mouthful, so we might as well err on the side of encoding all of the details.

Sounds good to me, I think that Manish’s suggestions are reasonable:

Stella, you’re right the letters should probably be capitalized (Float8E5M2) as well, but I’m not 100% sure about that. Both work for me.


Great list,
you might want to add a configurable exponent bias to that list. Unless that is covered by max exponent and min exponent.

There are also the Graphcore/AMD/Qualcomm formats, which represent the extreme of dynamic range, preserving only one of the 256 bit patterns for “NaN/Inf/Negative0”.

I believe the full list of potential deviations from IEEE is quite hard to enumerate. nVidia et al have 4 NaN/Inf codes, but one might easily choose to use only two of them. I believe that to cover all cases mentioned above, even in E5M2, we need to specify the meaning of something like the following bit patterns


These are so many that, realistically, there will be fewer real-world formats than the 2^6 this implies, and “market forces” will dictate which are included.


Noting too that most FP8 experimentation today proceeds by type-punning on uint8, defining e.g.

my_f8_dot_product(uint8* A, uint8* B, ....)

Maybe there is utility in defining a single float8 which serves the purpose of allowing slightly nicer names than uint8 for such blocks of memory, but punts all arithmetic to vendor libraries. Given configurable exponent bias, the only thing we can do with this type in core LLVM is create zeroes; or reinterpret cast from other types. Nevertheless it may be a useful tool.
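
A rough sketch of what such a “storage only” type could look like, written in Python for brevity. The class and method names are hypothetical. It deliberately offers nothing beyond creating zeroes and reinterpreting raw bytes, and it assumes that the all-zero bit pattern means zero in every variant.

```python
class Float8Storage:
    """Opaque buffer of 8-bit floats: no arithmetic, only zeroes and
    reinterpretation of existing uint8 data."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes):
        self._data = bytes(data)

    @classmethod
    def zeros(cls, count: int) -> "Float8Storage":
        """All-zero bit patterns, assumed to mean 0.0 in every variant."""
        return cls(bytes(count))

    @classmethod
    def reinterpret(cls, raw: bytes) -> "Float8Storage":
        """Reinterpret raw uint8 data as float8 storage without conversion."""
        return cls(raw)

    def to_uint8(self) -> bytes:
        """Hand the untouched bytes back, e.g. to a vendor library."""
        return self._data

    def __len__(self) -> int:
        return len(self._data)
```

Since no arithmetic operators are defined, an expression like `a + b` on two such buffers raises `TypeError`, which mirrors the idea of punting all arithmetic to vendor libraries.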

A particular distinction to uint8 might be that float8 would participate in automatic differentiation, e.g. in enzyme, or might satisfy IsTrainable in TensorFlow etc.

Of course we would still implement the “market forces” formats explicitly, e.g. if an IEEE float8 (or float8s) were to emerge, but this “storage only” type would still offer value to practitioners.

Naming aside, we already have prior art for how to handle target-specific numerics in APFloat: we have both S_PPCDoubleDouble and S_x87DoubleExtended, which are target-specific. I think we can follow that approach to support other weird things.

Looks like the only real commentary is naming and suggestions for the future. As such, I think D133823 (Add APFloat and MLIR type support for fp8 (e5m2)) is ready for review. PTAL.

While not for the current patch, I was originally thinking of “vendoring” the E4M3 type in “FP8 Formats for Deep Learning” (arXiv:2209.05433) by qualifying it with “NV” or something, since it is opinionated about its bit patterns in a specific way. But then Nvidia/ARM/Intel were nice enough to announce joint support for it. As such, I’m considering just calling it Float8E4M3FN (where the ‘F’ indicates that it only supports finite values and the ‘N’ indicates that it supports NaN values but varies from the “standard” bit patterns for such). Open to suggestions. I’ll try to get that patch out next week.

Yeah, somehow neither Intel_Arm_NVidia_Float8 nor Graphcore_AMD_Qualcomm_Float8 work as well as the other “vendored” naming schemes, not to mention they all propose a few variations, so we’d still need additional tags in the names.

Also, which company name comes first? We don’t want to go there, really.

Makes sense to me!

FYI, cross-posting for ML framework integration: [RFC] FP8 in XLA (Discussion #22, openxla/xla on GitHub)