AMD and GraphCore have developed an 8-bit floating point format that differs somewhat from the ones already added to APFloat. Since we have hardware (current or, in AMD’s case, upcoming) that supports this format, we need to represent it in APFloat so that we can add it to MLIR and target our 8-bit float instructions. (For example, an upcoming AMD GPU will have accelerated 8-bit floating point matrix math instructions that use the formats defined here; these instructions are already present in the intrinsics table.)

I have already published D141863, "[llvm][APFloat] Add NaN-in-negative-zero formats by AMD and GraphCore", to add the formats to APFloat.

We’re posting this RFC to:

- Get an external reviewer on the patch
- Raise awareness of alternate 8-bit floating point formats in order to prevent premature standardization on Nvidia’s proposal
- Get comments on naming

# The formats

In most respects, the formats we propose - Float8E5M2FNUZ and Float8E4M3FNUZ - are similar to the existing proposed types, most notably Float8E4M3FN.

Like Float8E4M3FN, our formats have no infinities: the one NaN pattern also serves as the infinity value. This means that overflow in either direction goes to NaN.

Unlike Float8E4M3FN, our formats have exactly one NaN value, which is unsigned and uses the encoding normally used for negative zero. This means that, in our proposed formats, zero and NaN are both unsigned. That is, all NaN values are represented by `0x80`, and all zero values by `0x00`.

This choice allows more of the 256 values an 8-bit float can take on to be used for numbers that are meaningful at these low precisions.
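To make the encoding budget concrete, here is a minimal Python sketch that classifies all 256 bit patterns under the scheme described above and, for comparison, under Float8E4M3FN (where NaN is an all-ones exponent and mantissa with either sign, i.e. `0x7F` and `0xFF`, and zero is signed). The function names are hypothetical and exist only for this illustration; they are not part of the patch.

```python
def classify_fnuz(code: int) -> str:
    """Classify an 8-bit code under the NaN-in-negative-zero scheme."""
    if code == 0x80:  # the would-be negative zero is the single NaN
        return "nan"
    if code == 0x00:  # the only zero
        return "zero"
    return "nonzero"


def classify_e4m3fn(code: int) -> str:
    """Classify an 8-bit code under Float8E4M3FN for comparison."""
    if code & 0x7F == 0x7F:  # exponent and mantissa all ones, either sign
        return "nan"
    if code & 0x7F == 0x00:  # +0 and -0
        return "zero"
    return "nonzero"


def count_codes(classify) -> dict:
    """Count how many of the 256 codes fall into each class."""
    counts = {"nan": 0, "zero": 0, "nonzero": 0}
    for code in range(256):
        counts[classify(code)] += 1
    return counts


fnuz_counts = count_codes(classify_fnuz)
e4m3fn_counts = count_codes(classify_e4m3fn)
```

Under these assumptions the FNUZ scheme leaves 254 codes for nonzero finite numbers, versus 252 for Float8E4M3FN.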

In addition, our formats use minimum exponents one smaller than those used by the existing Float8 formats. In the case of Float8E5M2FNUZ, this was achieved by dropping support for IEEE NaN and infinity, just like in Float8E4M3FN. For Float8E4M3FNUZ, this was a design choice - compared to Float8E4M3, we have a smaller overall range but higher precision.
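As a rough illustration of how a bit pattern maps to a value in these formats, the following Python sketch decodes a single byte. It assumes exponent biases of 8 for Float8E4M3FNUZ and 16 for Float8E5M2FNUZ, which is my reading of "minimum exponents one smaller" relative to the usual biases of 7 and 15; treat those numbers, and the helper names, as assumptions for this example rather than as the patch's implementation.

```python
import math


def decode_fnuz(code: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one 8-bit FNUZ code: 1 sign bit, exp_bits, man_bits."""
    assert 0 <= code <= 0xFF and 1 + exp_bits + man_bits == 8
    if code == 0x80:
        return math.nan  # the negative-zero encoding is the only NaN
    sign = -1.0 if code & 0x80 else 1.0
    exp_field = (code >> man_bits) & ((1 << exp_bits) - 1)
    man_field = code & ((1 << man_bits) - 1)
    if exp_field == 0:
        # Zero and subnormals: no implicit leading one.
        return sign * man_field * 2.0 ** (1 - bias - man_bits)
    # Normal numbers: every nonzero exponent field is finite, since there
    # is no reserved all-ones exponent for infinity or NaN.
    return sign * (1 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)


def decode_e4m3fnuz(code: int) -> float:
    return decode_fnuz(code, exp_bits=4, man_bits=3, bias=8)  # assumed bias


def decode_e5m2fnuz(code: int) -> float:
    return decode_fnuz(code, exp_bits=5, man_bits=2, bias=16)  # assumed bias


largest_e4m3fnuz = decode_e4m3fnuz(0x7F)  # 1.875 * 2**7 = 240.0
largest_e5m2fnuz = decode_e5m2fnuz(0x7F)  # 1.75 * 2**15 = 57344.0
```

With these biases the largest Float8E4M3FNUZ value is 240, which is smaller than the 448 of Float8E4M3FN, while the smallest normal value drops from 2^-6 to 2^-7.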

# Naming

I’ve mostly followed the existing format for float8 types - `Float8E[exponent bits]M[mantissa bits][flags]`. As in the existing format, the `FN` suffix is used for `Finite, NaN-only`, and we have added `UZ` for `Unsigned Zero`.