[RFC] Making space for a flush-to-zero flag in FastMathFlags

Hi,

I need to add a flush-denormals-to-zero (FTZ) flag to FastMathFlags,
but we've already used up the 7 bits available in
Value::SubclassOptionalData (the "backing storage" for
FPMathOperator::getFastMathFlags()). These are the possibilities I
can think of:

1. Increase the size of FPMathOperator. This gives us some additional
bits for FTZ and other fastmath flags we'd want to add in the future.
Obvious downside is that it increases LLVM's memory footprint.

2. Steal some low bits from pointers already present in Value and
expose them as part of SubclassOptionalData. We can steal at least 3
bits from the first two words in Value, which are both pointers. The
LSB of the first pointer needs to be 0; otherwise we could steal 4
bits.
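As a sketch of what (2) might look like mechanically (not LLVM's actual classes; llvm::PointerIntPair already does something very similar, and the `Node`/`demo` names here are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not LLVM's actual classes: pack Bits spare flag
// bits into the low bits of a pointer whose pointee type guarantees at
// least 2^Bits-byte alignment (llvm::PointerIntPair works the same way).
template <typename T, unsigned Bits>
class PtrWithFlags {
  static_assert(alignof(T) >= (1u << Bits), "not enough spare low bits");
  static constexpr uintptr_t Mask = (uintptr_t(1) << Bits) - 1;
  uintptr_t Val;

public:
  explicit PtrWithFlags(T *P) : Val(reinterpret_cast<uintptr_t>(P)) {}
  T *getPointer() const { return reinterpret_cast<T *>(Val & ~Mask); }
  unsigned getFlags() const { return unsigned(Val & Mask); }
  void setFlags(unsigned F) { Val = (Val & ~Mask) | (F & Mask); }
};

// Usage: an 8-byte-aligned object leaves 3 low bits free for flags.
struct alignas(8) Node { int X; };

inline unsigned demo() {
  static Node N{42};
  PtrWithFlags<Node, 3> P(&N);
  P.setFlags(5);
  return (P.getPointer() == &N && P.getPointer()->X == 42) ? P.getFlags()
                                                           : 0;
}
```

The pointer and the flags round-trip independently, which is what lets the ugly masking stay hidden behind an accessor.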

3. Allow only specific combinations in FastMathFlags. In practice, I
don't think folks are equally interested in all 2^N combinations
expressible in FastMathFlags, so we could compromise and allow only the
most "typical" 2^7 combinations (e.g. we could merge nonan and noinf
into a single bit, under the assumption that users want to enable or
disable them as a unit). I'm unsure whether establishing the most
typical 2^7 combinations would be straightforward, though.
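To make option (3) concrete, here is a minimal sketch of the lossy compression involved. All names and bit assignments are invented (they only loosely mirror FastMathFlags); the point is that merging nonan and noinf into one stored bit frees a bit for FTZ inside the same 7-bit budget:

```cpp
#include <cassert>

// Invented illustration of option (3); not LLVM's real flag encoding.
enum FullFlags : unsigned {
  NoNaNs = 1 << 0, NoInfs = 1 << 1, NoSignedZeros = 1 << 2,
  AllowReciprocal = 1 << 3, AllowContract = 1 << 4,
  ApproxFunc = 1 << 5, Reassoc = 1 << 6, FlushToZero = 1 << 7,
};

enum StoredBits : unsigned {
  S_NoNanInf = 1 << 0,  // nnan and ninf travel together
  S_NoSignedZeros = 1 << 1, S_AllowReciprocal = 1 << 2,
  S_AllowContract = 1 << 3, S_ApproxFunc = 1 << 4,
  S_Reassoc = 1 << 5, S_FlushToZero = 1 << 6,  // still fits in 7 bits
};

inline unsigned compress(unsigned F) {
  unsigned S = 0;
  if (F & (NoNaNs | NoInfs)) S |= S_NoNanInf;  // lossy: either implies both
  if (F & NoSignedZeros)     S |= S_NoSignedZeros;
  if (F & AllowReciprocal)   S |= S_AllowReciprocal;
  if (F & AllowContract)     S |= S_AllowContract;
  if (F & ApproxFunc)        S |= S_ApproxFunc;
  if (F & Reassoc)           S |= S_Reassoc;
  if (F & FlushToZero)       S |= S_FlushToZero;
  return S;
}

inline unsigned expand(unsigned S) {
  unsigned F = 0;
  if (S & S_NoNanInf)        F |= NoNaNs | NoInfs;
  if (S & S_NoSignedZeros)   F |= NoSignedZeros;
  if (S & S_AllowReciprocal) F |= AllowReciprocal;
  if (S & S_AllowContract)   F |= AllowContract;
  if (S & S_ApproxFunc)      F |= ApproxFunc;
  if (S & S_Reassoc)         F |= Reassoc;
  if (S & S_FlushToZero)     F |= FlushToZero;
  return F;
}
```

The lossiness is the compromise: expand(compress(nonan)) comes back as nonan plus noinf, which is exactly the "enable them as a unit" assumption.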

4. Function level attributes. Instead of wasting precious
instruction-level space, we could move all FP math attributes onto the
containing function. I'm not sure this will work for all frontends,
and it also raises annoying tradeoffs around inlining and other
inter-procedural passes.

My gut feeling is to go with (2). It should be semantically
invisible, have no impact on memory usage, and the ugly bit
manipulation can be abstracted away. What do you think? Any other
possibilities I missed?

Why I need an FTZ flag: some ARM Neon vector instructions have FTZ
semantics, which means we can't vectorize instructions when compiling
for Neon unless we know the user is okay with FTZ. Today we pretend
that the "fast" variant of FastMathFlags implies FTZ
(rG5cb666add7d4), which is not ideal. Moreover
(this is the immediate reason), for XLA CPU I'm trying to generate FP
instructions without nonan and noinf, which breaks vectorization on
ARM Neon. An explicit FTZ bit will let me generate FP operations
tagged with FTZ and all fast-math flags except nonan and noinf, and
still have them vectorize on Neon.

-- Sanjoy

Can we move HasValueHandle out of the byte used for SubclassOptionalData and move it to the flags at the bottom of Value by shrinking NumUserOperands to 27?
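For illustration, a standalone sketch of that rearrangement. The field names echo llvm::Value, but this layout is invented and simplified; the point is only that moving HasValueHandle into the 32-bit flag word frees the eighth bit of the byte it currently shares with SubclassOptionalData, at no size cost:

```cpp
#include <cassert>

// Invented, simplified layouts -- not the real llvm::Value.
struct Before {
  unsigned char SubclassID;
  unsigned char HasValueHandle : 1;       // shares a byte with the 7 bits
  unsigned char SubclassOptionalData : 7;
  unsigned short SubclassData;
  unsigned NumUserOperands : 28;
  unsigned IsUsedByMD : 1;
  unsigned HasName : 1;
  unsigned HasHungOffUses : 1;
  unsigned HasDescriptor : 1;
};

struct After {
  unsigned char SubclassID;
  unsigned char SubclassOptionalData;     // full 8 bits now available
  unsigned short SubclassData;
  unsigned NumUserOperands : 27;          // was 28
  unsigned HasValueHandle : 1;            // moved into the flag word
  unsigned IsUsedByMD : 1;
  unsigned HasName : 1;
  unsigned HasHungOffUses : 1;
  unsigned HasDescriptor : 1;
};

static_assert(sizeof(Before) == sizeof(After),
              "moving the bit does not grow Value");
```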

We knew the day when we needed another FMF bit was coming back in:
https://reviews.llvm.org/D39304
…it was just a question of ‘when’. :)

I’m guessing that an FTZ bit won’t be the last new bit needed if we consider permutations between strict FP and fast-math. Even without that, denormals-as-zero (DAZ) might also be useful?

So rather than continuing to carve these out bit-by-bit, it’s worth considering a more general solution: instruction-level metadata.

IIUC, the main argument for making FMF part of the instruction was that per-instruction metadata gets expensive if we’re applying it to a significant chunk of the instructions.
But let’s think about that: even the most FP-heavy code tops out around 10% FP math ops out of the total instruction count. Typical FP benchmark code is only 2-5% FP ops. The rest is the same load/store/control-flow/ALU stuff found in integer code.

I’m not exactly sure yet what it would take to do the experiment, but it seems worth exploring moving the existing FMF to metadata.

One point in favor of this approach is that we already have an “MD_fpmath” enum. It’s currently only used to convey reduced precision requirements to the AMDGPU backend. We could extend that to include arbitrary FMF settings.

A couple of related points for FMF-as-metadata:

  1. It might encourage fixing a hack added for reciprocals: we use a function-level attribute for those (grep for “reciprocal-estimates”). IIRC, that was just a quicker fix than using MD_fpmath. The existing squished boolean FMF can’t convey the more general settings that we need for reciprocal optimizations.

  2. These don’t require new bits, but FMF isn’t applied correctly today as-is:
    https://reviews.llvm.org/D48085
    https://bugs.llvm.org/show_bug.cgi?id=38086

https://bugs.llvm.org/show_bug.cgi?id=39535

https://reviews.llvm.org/D51701
…so we need to make FMF changes regardless of FTZ.

If this is true, what do you think about option (1)? It might be
simpler to increase the size of FPMathOperator by a word, giving us 64
more bits of fastmath flags. We could also allocate this extra word
only for those instances of FPMathOperator that have non-zero
FastMathFlags (this would force us to remove setFastMathFlags, since
we'd need to know the FastMathFlags at Instruction construction time).

-- Sanjoy

Can we move HasValueHandle out of the byte used for SubclassOptionalData and move it to the flags at the bottom of Value by shrinking NumUserOperands to 27?

I like this approach because it is less work for me. :)

But I agree with Sanjay below that this only kicks the can slightly
further down the road (solutions (2) and (3) have the same problem).
Let's see if we can agree on a more future-proof solution.

-- Sanjoy

Another thing to consider: the current bitcode reader/writer handles backwards compatibility with the previous IR version with some mapping done to preserve context. If we change the bitcode layout, we effectively have a new version of the IR, which raises once more the question of compatibility with prior versions.
It is just another item to add to the work list...

Regards,
Michael

Hi Michael,

Another thing to consider: the current bitcode reader/writer handles backwards compatibility with the previous IR version with some mapping done to preserve context. If we change the bitcode layout, we effectively have a new version of the IR, which raises once more the question of compatibility with prior versions.
It is just another item to add to the work list...

That's good to keep in mind, though I don't quite understand why this
would be non-trivial. It seems like we already have a split between
bitc::FastMathMap and FastMathFlags with an explicit encode/decode
step. Why would a different storage scheme for FastMathFlags
influence reading/writing bitcode?
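To illustrate the split I mean, here is a toy encode/decode step (all constants invented; the real on-disk values live in the bitcode reader/writer): because the on-disk bits are mapped explicitly, an old record that simply lacks the new FTZ bit decodes to FTZ-off without disturbing the existing bits.

```cpp
// Invented illustration of an explicit in-memory <-> on-disk flag mapping,
// in the spirit of the bitc::FastMathMap vs. FastMathFlags split. The
// constants are made up; the real ones are defined by the bitcode format.
namespace mem {  enum : unsigned { NNan = 1, NInf = 2, FTZ = 4 }; }
namespace disk { enum : unsigned { NNan = 1, NInf = 2, FTZ = 64 }; }

inline unsigned encode(unsigned M) {  // in-memory -> bitcode record
  unsigned D = 0;
  if (M & mem::NNan) D |= disk::NNan;
  if (M & mem::NInf) D |= disk::NInf;
  if (M & mem::FTZ)  D |= disk::FTZ;
  return D;
}

inline unsigned decode(unsigned D) {  // bitcode record -> in-memory
  unsigned M = 0;
  if (D & disk::NNan) M |= mem::NNan;
  if (D & disk::NInf) M |= mem::NInf;
  if (D & disk::FTZ)  M |= mem::FTZ;
  return M;
}
```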

-- Sanjoy

Hi Sanjoy.

Just the storage scheme by itself isn’t. However, there is a similar size constraint in FastMathMap on the number of bits used/mapped for each IR version: the bitcode utilities currently read/write 8 bits of FMF, covering the 7 FMF flags plus Unsafe. New FMF flags at this point mean a bigger record in the bitcode and a bump in the version of the IR, something vendors will have to act on, as quite a few are still on current-1 IR. We will need to be clear in communicating the divergence when it happens.

On another front, IMO we should leave nan and inf processing under their respective flags and module guards, keeping the contexts separate. Joining them in a common context muddies too much functionality.

Regards,
Michael

I don’t have any objections to increasing the size of FPMathOperator, but I also don’t know what perf impact that would have.
I made this comment in D39304:
“I don’t think we can just add a field to FPMathOperator because Operator is not intended to be instantiated.”

That could just be me not understanding the class hierarchy?

Yes, FPMathOperator isn’t part of the class hierarchy. I would interpret “increasing the size of FPMathOperator” as shorthand for increasing the size of ConstantExpr and the relevant subclasses of Instruction. (Exactly which subclasses of Instruction is a little sticky, because most floating-point operations are represented by BinaryOperator, which is also used for integer arithmetic, but it’s probably solvable somehow.)

-Eli

As for me, I lean toward Sanjay’s proposal and Sanjoy’s #4, as both seem to me more future-proof and better able to mimic the behavior of GCC.

On another note, do y’all have any thoughts about changing the FP math semantics to FTZ and DAZ for the whole program? Some, if not all, current targets support such FP modes through bits in their FP-unit control register or similar.

As Hal once pointed out to me, the way GCC works is a bit unnerving: any DSO that changes the FP mode to use such semantics affects all modules, even those written without this change in mind. Perhaps emitting the initialization code that changes the FP mode for DSOs could be suppressed, leaving this run-time change in the hands of the program developer rather than the library developer. Although this raises some questions as well.

GCC accomplishes this in libgcc, whereas, should the same behavior be copied by LLVM, it would likely reside in compiler-rt.

Cheers,