Static rounding mode in IR

Hi all.

I need help interpreting the LLVM Language Reference with regard to the design of constrained intrinsics.

Background

Some targets support a static rounding mode for floating-point operations. In this case the rounding mode is specified in the instruction itself rather than read from a register. Code that needs a particular rounding mode becomes faster and more compact because access to the FP control register is not needed. This feature is popular in processors designed for ML, and users want to take advantage of it.

The latest C standard draft introduces the notion of static (or constant) rounding, which is managed by the pragma FENV_ROUND (https://www.iso-9899.info/n3047.html#7.6.2). This rounding mode can differ from the dynamic one, which is managed by calls to fesetround. Implementing this pragma also requires support for static rounding in the compiler.

The question is how to represent the static rounding mode in IR.

Problem

The obvious solution is to use constrained intrinsics. As the documentation claims (https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics), they provide handling of floating-point operations when a specific rounding mode or floating-point exception behavior is required. However, the description also states:

For values other than “round.dynamic” optimization passes may assume that the actual runtime rounding mode (as defined in a target-specific manner) matches the specified rounding mode, but this is not guaranteed. Using a specific non-dynamic rounding mode which does not match the actual rounding mode at runtime results in undefined behavior.

This statement effectively prevents constrained intrinsics from being used to represent operations with a static rounding mode. Consequently, it blocks the implementation of static rounding support in codegen.

We could treat values other than “round.dynamic” as static rounding modes, not just as an optimization hint. In that case the paragraph cited above should be removed.

Does any code rely on the statement cited above? Can the requirement stated in it be removed? Maybe there are other ways to represent static rounding?

Thanks.

What if hardware doesn’t support static rounding mode for a particular operation or a particular data type? For example, static rounding mode is only supported for 512-bit vectors and scalars in AVX512. RISC-V only supports static rounding mode for scalars. Would the backend need to change the global rounding mode around instructions in order to guarantee the requested rounding mode is used? Or is it the responsibility of the IR producer to only use it where it is supported?

As such this has nothing to do with the constrained intrinsics. The constrained intrinsics and strictfp are specifically for dealing with the dynamic mode and fp exceptions. The only relation here could be you could replace the constrained intrinsics with static rounding operations, if you know both the mode and that fp exceptions are ignored.

A set of intrinsics with a constant rounding argument, like we already have in llvm.fptrunc.round. Full strictfp and fp exception support should not be required to implement these functions. The backend may be required to manage mode switch and restore to implement these intrinsics.

The intent of these intrinsics was the compiler does not need to insert code to manage the rounding mode. It’s expressing what the user set the known rounding mode to. IMO these hints are kind of useless and we’re not doing anything with them. But they definitely should not be changed to mean the compiler should be managing the FP mode, like you’re effectively stating.

What if hardware doesn’t support static rounding mode for a particular operation or a particular data type?

The C standard draft proposes emulation using dynamic rounding (7.6.2p5), and asks that the dynamic rounding mode be preserved in that case. Some instructions would use static rounding, others would use emulation.

Would the backend need to change the global rounding mode around instructions in order to guarantee the requested rounding mode is used? Or is it the responsibility of the IR producer to only use it where it is supported?

In the case of FENV_ROUND, it can be the IR producer (Implementation of '#pragma STDC FENV_ROUND' by spavloff · Pull Request #89617 · llvm/llvm-project · GitHub). But nothing prevents a more general implementation.

As such this has nothing to do with the constrained intrinsics. The constrained intrinsics and strictfp are specifically for dealing with the dynamic mode and fp exceptions.

Does this mean that static rounding should have its own representation in IR? What about operations with static rounding when strict exception handling is needed? They should use constrained intrinsics but cannot, because of the static rounding.

The intent of these intrinsics was the compiler does not need to insert code to manage the rounding mode. It’s expressing what the user set the known rounding mode to.

I would expect these intrinsics to express the side effects caused by reading control modes and setting exception flags. Static rounding removes the dependency on some bits in the control register but still affects exceptions.

IMO these hints are kind of useless and we’re not doing anything with them.

They acquire meaning if they are used to represent static rounding. Otherwise, yes, they are useless, since the compiler cannot even rely on them:

the actual runtime rounding mode (as defined in a target-specific manner) matches the specified rounding mode, but this is not guaranteed.

A set of intrinsics with a constant rounding argument, like we already have in llvm.fptrunc.round.

That means a new representation for FP operations, different from constrained intrinsics but also limited. And the case of static rounding combined with strict exception handling is still not covered. I would like to reuse the existing mechanism if possible.

The best case would be instructions that list the resources they access together with the access type, as was discussed in Thought on strictfp support - #7 by arsenm, but that is in the distant future.

Yes

Yes, it would be yet another strictfp variant of the same operation.

Yes, having to have additional strictfp intrinsics for every operation is unmanageable, and part of why target intrinsic support is still broken. I still think we should just have some callsite attribute / bundle to avoid this.

This seems like it’ll be a pretty tricky issue to implement well, in a way that works for all ISAs.

The IR representation could have an additional attribute on all the floating-point instructions and intrinsics which specifies a static rounding mode – including adding it to the constrained intrinsics. This is different from the existing arg which is an assumption of the current dynamic rounding mode – and we don’t necessarily need to prevent reordering of these instructions.

However, most CPU targets don’t have instructions that specify a static rounding mode, so will need to modify the dynamic rounding mode. Changing the rounding mode is expensive, so we need to ensure it’s changed as little as possible – e.g. don’t do it around every floating-point operation, or every basic-block.

That may argue for a more memory-like interface, where we explicitly represent modifications to the rounding mode as IR instructions, and then constant-propagate known rounding-modes into target instructions if possible, and delete writes to the rounding-mode if we can determine that there are no reads remaining?
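As a rough illustration of that memory-like view, here is a toy sketch in C over a straight-line instruction list. Everything in it (the Inst struct, the SET_RM/FP_OP kinds, the function prune_mode_writes) is hypothetical and is not LLVM API. Writes to the rounding mode are deferred until an FP operation actually reads the mode; the known mode is recorded on each operation, and writes that are redundant or overwritten before any read are dropped. The final pending write is kept because the mode is assumed to be observable after the block.

```c
#include <assert.h>
#include <stddef.h>

enum Kind { SET_RM, FP_OP };
enum Mode { RM_UNKNOWN = -1, RM_NEAREST, RM_UP, RM_DOWN, RM_ZERO };

struct Inst {
    enum Kind kind;
    enum Mode mode; /* SET_RM: mode written; FP_OP: mode known to be in effect */
};

/* Rewrites `code` in place and returns the new length.
   Handles straight-line code only; control flow is out of scope here. */
size_t prune_mode_writes(struct Inst *code, size_t n) {
    assert(code != NULL || n == 0);
    enum Mode current = RM_UNKNOWN; /* mode established by emitted writes */
    enum Mode wanted = RM_UNKNOWN;  /* mode requested by the latest write seen */
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        struct Inst inst = code[i];  /* copy before overwriting earlier slots */
        if (inst.kind == SET_RM) {
            wanted = inst.mode;      /* defer: it may be dead or redundant */
            continue;
        }
        if (wanted != current) {     /* the op reads the mode, so commit the write */
            code[out].kind = SET_RM;
            code[out].mode = wanted;
            out++;
            current = wanted;
        }
        inst.mode = current;         /* propagate the known mode into the op */
        code[out++] = inst;
    }
    if (wanted != current) {         /* mode is observable after the block */
        code[out].kind = SET_RM;
        code[out].mode = wanted;
        out++;
    }
    return out;
}
```

For the sequence set(up), set(down), op, set(down), op, set(nearest), this leaves set(down), op, op, set(nearest). A target with static rounding could then encode the propagated mode in each op and, if nothing else reads the register, drop the remaining writes as well.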

The IR representation could have an additional attribute on all the floating-point instructions and intrinsics which specifies a static rounding mode – including adding it to the constrained intrinsics.

As the documentation states, there is no guarantee that the dynamic rounding mode matches the rounding specified by the metadata operand, which makes this operand useless. We can change the interpretation and treat the operand as the rounding mode that must be used for the specified operation. For static rounding it would represent the rounding to encode in the instruction. If only dynamic rounding is available, it would be the rounding expected to be in the FP control register. This does not invalidate the assumptions we have made in the compiler so far, but it would allow support of static rounding.

This is different from the existing arg which is an assumption of the current dynamic rounding mode – and we don’t necessarily need to prevent reordering of these instructions.

Constrained intrinsics always have a side effect, so they always prevent reordering of FP operations.

However, most CPU targets don’t have instructions that specify a static rounding mode, so will need to modify the dynamic rounding mode. Changing the rounding mode is expensive, so we need to ensure it’s changed as little as possible – e.g. don’t do it around every floating-point operation, or every basic-block.

This is a matter of optimization. A special pass could eliminate unneeded operations on the FP control register.

That may argue for a more memory-like interface, where we explicitly represent modifications to the rounding mode as IR instructions, and then constant-propagate known rounding-modes into target instructions if possible, and delete writes to the rounding-mode if we can determine that there are no reads remaining?

IIUC, you propose to emulate static rounding using dynamic rounding on all targets? It makes sense to some extent. Clang could always emit rounding-setting instructions to make the resulting IR more target-independent. But why deduce the static rounding mode instead of keeping it in IR explicitly? What are the benefits?

The IR representation for “this is a static rounding mode, which the compiler generates code to enforce” needs to be different from “this is a dynamic rounding mode, which the IR sets before executing this code”, so the backend knows whether it needs to modify the rounding mode register for operations where that’s required. (We can’t assume the frontend knows whether the backend supports static rounding modes for a given operation.)

On the strictfp side, if you want to add new possible values for the existing metadata argument (“round.static.tonearest” or something like that), I think that would be okay.

I think it would be a mistake to implement it this way. It binds the implementation of these functions to strictfp support, which is avoidable. As it is, strictfp is still experimental, barely implemented, and cannot support target intrinsics.