Static rounding mode in IR

Hi all.

I need help in the interpretation of LLVM Language Reference concerning the design of constrained intrinsics.

Background

Some targets support a static rounding mode for floating-point operations. In this case the rounding mode is encoded in the instruction itself rather than read from a control register. Code that needs a particular rounding mode becomes faster and more compact because access to the FP control register is not needed. This feature is popular in processors designed for ML workloads, and users want to take advantage of it.

The latest C standard draft introduces the notion of static (or constant) rounding, managed by the pragma FENV_ROUND (https://www.iso-9899.info/n3047.html#7.6.2). This rounding can differ from the dynamic rounding, which is managed by calls to fesetround. Implementing this pragma also requires support for static rounding in the compiler.

The question is how to represent the static rounding mode in IR.

Problem

The obvious solution is to use the constrained intrinsics. As the documentation claims (https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics), they provide handling of floating-point operations when a specific rounding mode or floating-point exception behavior is required. However, the description also states:

For values other than “round.dynamic” optimization passes may assume that the actual runtime rounding mode (as defined in a target-specific manner) matches the specified rounding mode, but this is not guaranteed. Using a specific non-dynamic rounding mode which does not match the actual rounding mode at runtime results in undefined behavior.
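For concreteness, this is what a constrained operation looks like in IR today (the fadd variant; the two metadata arguments carry the rounding assumption and the exception behavior):

```llvm
; Under the current wording, the rounding argument is only an assertion
; about the dynamic mode at runtime; the backend is not required to
; encode upward rounding into the instruction.
%sum = call float @llvm.experimental.constrained.fadd.f32(
                      float %a, float %b,
                      metadata !"round.upward",
                      metadata !"fpexcept.strict")
```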

This statement effectively prevents constrained intrinsics from representing operations with a static rounding mode. Consequently, it blocks implementation of static rounding support in codegen.

We could treat the values other than “round.dynamic” as static rounding modes rather than just optimization hints. In that case the paragraph cited above would have to be removed.

Does any code rely on the statement cited above? Can the requirement it states be removed? Or maybe there are other ways to represent static rounding?

Thanks.

What if hardware doesn’t support a static rounding mode for a particular operation or a particular data type? For example, in AVX512 a static rounding mode is only supported for 512-bit vectors and scalars, and RISC-V only supports a static rounding mode for scalars. Would the backend need to change the global rounding mode around instructions in order to guarantee the requested rounding mode is used? Or is it the responsibility of the IR producer to only use it where it is supported?

As such, this has nothing to do with the constrained intrinsics. The constrained intrinsics and strictfp are specifically for dealing with the dynamic mode and FP exceptions. The only relation here is that you could replace the constrained intrinsics with static rounding operations if you know both the mode and that FP exceptions are ignored.

A set of intrinsics with a constant rounding argument, like we already have in llvm.fptrunc.round. Full strictfp and FP exception support should not be required to implement these functions. The backend may be required to manage mode switches and restores to implement these intrinsics.
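For reference, llvm.fptrunc.round already takes the rounding mode as a constant metadata argument that the backend is expected to honor:

```llvm
; Double-to-float truncation with an explicitly requested rounding mode.
%res = call float @llvm.fptrunc.round.f32.f64(double %x,
                                              metadata !"round.downward")
```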

The intent of these intrinsics was that the compiler does not need to insert code to manage the rounding mode. They express what the user set the known rounding mode to. IMO these hints are kind of useless and we’re not doing anything with them. But they definitely should not be changed to mean the compiler should be managing the FP mode, as you’re effectively proposing.

What if hardware doesn’t support static rounding mode for a particular operation or a particular data type?

The C standard draft proposes emulation using dynamic rounding (7.6.2p5) and requires preserving the dynamic rounding mode in that case. Some instructions would use static rounding; others would use emulation.

Would the backend need to change the global rounding mode around instructions in order to guarantee the requested rounding mode is used? Or is the responsibility of the IR producer to only use it where it is supported?

In the case of FENV_ROUND, it can be the IR producer (Implementation of '#pragma STDC FENV_ROUND' by spavloff · Pull Request #89617 · llvm/llvm-project · GitHub). But nothing prevents a more general implementation.

As such this has nothing to do with the constrained intrinsics. The constrained intrinsics and strictfp are specifically for dealing with the dynamic mode and fp exceptions.

Does this mean that static rounding should have its own representation in IR? What about operations with static rounding when strict exception handling is needed? They should use constrained intrinsics but cannot, due to static rounding.

The intent of these intrinsics was the compiler does not need to insert code to manage the rounding mode. It’s expressing what the user set the known rounding mode to.

I would expect that these intrinsics express the side effects caused by reading control modes and setting exception flags. Static rounding removes the dependency on some bits in the control register but still affects exception flags.

IMO these hints are kind of useless and we’re not doing anything with them.

They acquire meaning if they are used to represent static rounding. Otherwise, yes, they are useless, since the compiler cannot even rely on them:

the actual runtime rounding mode (as defined in a target-specific manner) matches the specified rounding mode, but this is not guaranteed.

A set of intrinsics with a constant rounding argument, like we already have in llvm.fptrunc.round.

It means a new representation for FP operations, different from the constrained intrinsics, but also limited. The case of static rounding combined with strict exception handling is also not covered. I would like to reuse the existing mechanism if possible.

The best case would be instructions that list accessed resources with access types, as was discussed in Thought on strictfp support - #7 by arsenm, but that is a distant future.

Yes

Yes, it would be yet another strictfp variant of the same operation.

Yes, having to have additional strictfp intrinsics for every operation is unmanageable, and is part of why target intrinsic support is still broken. I still think we should just have some callsite attribute / bundle to avoid this.
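As a rough sketch of the bundle idea (the "fp.round" bundle tag is invented here for illustration; no such bundle exists in LLVM today), the rounding mode would ride along on an ordinary intrinsic call instead of requiring a dedicated strictfp variant per operation:

```llvm
; Hypothetical: rounding carried by an operand bundle rather than by a
; separate constrained/strictfp intrinsic for each operation.
%r = call float @llvm.sqrt.f32(float %x) [ "fp.round"(metadata !"round.towardzero") ]
```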

This seems like it’ll be a pretty tricky issue to implement well in a way that works for all ISAs.

The IR representation could have an additional attribute on all the floating-point instructions and intrinsics which specifies a static rounding mode – including adding it to the constrained intrinsics. This is different from the existing arg which is an assumption of the current dynamic rounding mode – and we don’t necessarily need to prevent reordering of these instructions.

However, most CPU targets don’t have instructions that specify a static rounding mode, so will need to modify the dynamic rounding mode. Changing the rounding mode is expensive, so we need to ensure it’s changed as little as possible – e.g. don’t do it around every floating-point operation, or every basic-block.

That may argue for a more memory-like interface, where we explicitly represent modifications to the rounding mode as IR instructions, and then constant-propagate known rounding-modes into target instructions if possible, and delete writes to the rounding-mode if we can determine that there are no reads remaining?
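A minimal sketch of such a memory-like lowering, using the llvm.get.rounding and llvm.set.rounding intrinsics that exist today (the integer encoding follows FLT_ROUNDS, so 2 means upward; in a strictfp function the fadd would have to be a constrained intrinsic so it cannot be reordered across the mode changes):

```llvm
; Save the dynamic mode, switch it, perform the operation, restore it.
%saved = call i32 @llvm.get.rounding()
call void @llvm.set.rounding(i32 2)          ; FE_UPWARD
%sum = fadd float %a, %b                     ; executes under the new mode
call void @llvm.set.rounding(i32 %saved)     ; restore the caller's mode
```

A later pass could then constant-propagate the mode into instructions on targets with static rounding and delete the writes when no reads remain.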

The IR representation could have an additional attribute on all the floating-point instructions and intrinsics which specifies a static rounding mode – including adding it to the constrained intrinsics.

As the documentation states, there is no guarantee that the dynamic rounding mode matches the rounding specified by the metadata operand, which makes this operand useless. We can modify the interpretation by treating the operand as the rounding mode that must be used for the specified operation. For static rounding it would represent the rounding to encode in the instruction. If only dynamic rounding is available, it would be the rounding mode expected to be in the FP control register. This does not invalidate assumptions we have made in the compiler so far, but it would allow support of static rounding.

This is different from the existing arg which is an assumption of the current dynamic rounding mode – and we don’t necessarily need to prevent reordering of these instructions.

Constrained intrinsics always have a side effect, so they always prevent reordering of FP operations.

However, most CPU targets don’t have instructions that specify a static rounding mode, so will need to modify the dynamic rounding mode. Changing the rounding mode is expensive, so we need to ensure it’s changed as little as possible – e.g. don’t do it around every floating-point operation, or every basic-block.

This is a matter of optimization. A special pass could eliminate unneeded operations on the FP control register.

That may argue for a more memory-like interface, where we explicitly represent modifications to the rounding mode as IR instructions, and then constant-propagate known rounding-modes into target instructions if possible, and delete writes to the rounding-mode if we can determine that there are no reads remaining?

IIUC, you propose to emulate static rounding using dynamic rounding on all targets? It makes sense to some extent. Clang could always emit rounding-mode-setting instructions to make the resulting IR more target-independent. But why deduce the static rounding mode instead of keeping it in the IR explicitly? What are the benefits?

The IR representation for “this is a static rounding mode, which the compiler generates code to enforce” needs to be different from “this is a dynamic rounding mode, which the IR sets before executing this code”, so the backend knows whether it needs to modify the rounding mode register for operations where that’s required. (We can’t assume the frontend knows whether the backend supports static rounding modes for a given operation.)

On the strictfp side, if you want to add new possible values for the existing metadata argument (“round.static.tonearest” or something like that), I think that would be okay.
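Such a value might look like this (note that "round.static.upward" is not an existing LangRef value; it is shown only to illustrate distinguishing an enforced static mode from an assumed dynamic one):

```llvm
; Hypothetical: the metadata names a static mode the compiler must enforce,
; not merely an assumption about the current dynamic mode.
%r = call float @llvm.experimental.constrained.fadd.f32(
                     float %a, float %b,
                     metadata !"round.static.upward",
                     metadata !"fpexcept.strict")
```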

I think it would be a mistake to implement it this way. It binds the implementation of these operations to strictfp support, which is avoidable. As it is, strictfp is still experimental, barely implemented, and cannot support target intrinsics.

I agree that it is worthwhile to support static rounding mode instructions in LLVM IR. It’s something that’s increasingly common across architectures (especially offload accelerators, though I’ll note that there are examples in architectures older than I am). But I also share some of @arsenm’s concerns about constrained intrinsics: they have clear problems with regard to target intrinsics, for example.

Support for static rounding mode on architectures without appropriate instructions, by generating fpenv-changing instructions, seems appropriate, and it shouldn’t be too hard to marry that with optimizations that minimize unnecessary environment changes (although these would probably have to be late-stage LLVM IR passes rather than a SelectionDAG/GlobalISel thing). This also suggests that proper modelling of the floating-point environment is kind of a precondition to usable static rounding mode support.

Playing with bundle operands to replace the existing constrained intrinsic logic has long been on my todo list, albeit regrettably in a position that keeps getting preempted by other work.

IR should always model the requested semantics in a straightforward, high-level way. For operations with static rounding modes, that means recording the static rounding mode directly on the FP instruction; whether that’s done with new intrinsics, a new flag on BinaryOperator, or an operand bundle is an open choice.

To simplify writing backends that don’t directly support static rounding modes, we can also provide a late legalization pass that rewrites instructions with static rounding modes into dynamic environment changes. (Presumably this would need to be sensitive to the exact instruction, rounding mode, and subtarget, since even an ISA that generally supports static rounding modes might not support a specific rounding mode on a specific operation.)

I apologize for being late to this discussion, but I would like to share a few thoughts.

First, Matt is correct that the intention of the rounding mode arguments to the constrained intrinsics was to provide hints to the compiler when we could prove in some way what the rounding mode would be for all paths to that instruction. The idea was that we’d generate everything with “round.dynamic” in the front end and then if during optimization we could prove that all paths to that instruction went through a call to fesetround() or something equivalent, we could update the rounding mode argument, which would enable constant-folding and allow backends to generate instructions with the static rounding mode encoded (which is allowed, but not required). Of course, Matt is also correct that such optimizations have not been implemented and I don’t think it’s likely that they will be any time soon.

At the time the constrained intrinsics were originally implemented, I believe the proposal for #pragma STDC FENV_ROUND had a similar semantic: descriptive rather than effective. I could be wrong about that, but that was at least how I understood it at the time. Obviously, the final form is different.

I’m skeptical that it is possible to implement static rounding support without constrained intrinsics for targets that do not support encoding the rounding mode directly into all FP operations. There is nothing to prevent code motion of the instructions with static rounding relative to instructions with default rounding, which could lead to a lot of extra calls to change the dynamic rounding mode.

My preference would be to have some intrinsics that begin and end a static rounding region and use the constrained intrinsics as they are currently defined between those markers. Targets that fully support static rounding could drop these begin/end intrinsics (perhaps even early in the optimization phase), and targets that require explicit rounding mode changes could lower these intrinsics to the appropriate code to change the dynamic rounding mode. Targets that support static rounding for some instructions but not all could analyze the region to see if the explicit rounding mode change was required.
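A sketch of what such region markers might look like (the intrinsic names and the token type are invented for illustration; nothing like this exists today):

```llvm
; Hypothetical: constrained intrinsics bracketed by static-rounding markers.
; Targets with full static-rounding support would drop the markers; others
; would lower them to dynamic rounding mode changes.
%tok = call token @llvm.static.rounding.begin(metadata !"round.downward")
%r = call float @llvm.experimental.constrained.fadd.f32(
                    float %a, float %b,
                    metadata !"round.downward",
                    metadata !"fpexcept.strict")
call void @llvm.static.rounding.end(token %tok)
```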

I am against adding additional intrinsics that are yet another set of the basic math functions. As Joshua mentioned, there has long been a rough plan to replace the constrained intrinsics with operand bundles. This is needed to mix strictfp support with vector predication.

BTW, I have put this topic on the agenda for the Floating-Point WG meeting that is scheduled for tomorrow at 5 PM UTC (10 AM Pacific). I don’t mean to imply that the discussion would lead to any kind of binding decision, but perhaps we could at least make some progress.

@spavloff I understand that this time isn’t particularly convenient for you, but if you are able to join the meeting I think it could be helpful. Here’s the meeting link: https://meet.google.com/kxo-bayk-nnd

You can do this by changing the mode around certain instructions during legalization, and changing it back. Or you can use pseudos and have a legalization pass determine the minimum place to insert required mode changes. AMDGPU has both legalization strategies implemented in different scenarios where the mode needs to be changed (and this is all in non-strictfp functions, and none of it requires strictfp support).

If we do the lowering of rounding mode changes during/after isel, we don’t need any representation before isel, sure. But post-isel transformations are significantly harder to implement.

I’m skeptical that it is possible to implement static rounding support without constrained intrinsics for targets that do not support encoding the rounding mode directly into all FP operations. There is nothing to prevent code motion of the instructions with static rounding relative to instructions with default rounding, which could lead to a lot of extra calls to change the dynamic rounding mode.

Lowering to dynamic rounding mode changes already requires some sort of FP environment-aware representation of FP instructions. If such lowering is done before ISel, then we either need to lower to constrained intrinsics today (which is the only such representation we have right now), or maybe the pass could be scheduled so that there is essentially a handshake agreement that no optimization will do such code motion before or during ISel. If it’s done during or after ISel, there’s no need to worry about these problems at all, at the cost of making the environment-change minimization pass much harder to implement.

My preference would be to have some intrinsics that begin and end a static rounding region and use the constrained intrinsics as they are currently defined between those markers.

I’m opposed to this approach, because it makes life really bad for architectures that support static rounding mode instructions but not a dynamic rounding mode. Additionally, this kind of approach to IR seems likely to generate miscompiles in optimization, since it requires much more analysis to know whether you can move an FP operation.

I am against adding additional intrinsics that are yet another set of the basic math functions. As Joshua mentioned, there has long been a rough plan to replace the constrained intrinsics with operand bundles. This is needed to mix strictfp support with vector predication.

I agree with this. I think we shouldn’t be attempting to introduce any more semantics to constrained intrinsics, given that they are more or less a known failed experiment. But given that they are a partially-working solution, I can support relying on them in the interim to implement a pre-ISel lowering of static rounding mode to dynamic rounding mode.

Yes, technically, you can achieve the numerically correct results by inserting dynamic changes around the instruction, but if you have to change back and forth frequently it will kill performance. I’d much rather keep instructions with the same rounding mode together. Depending on how the user’s code is written, that may not be a problem, but the user can mix rounding modes within a function, or code with a different rounding mode can get inlined. The reason we needed the constrained intrinsics in the first place is that the normal FP operations are assumed to have no side effects, so nothing blocks their motion.

As I mentioned above, if the target has full support for static rounding operations, it would be trivial to just discard the end markers.

We’re already doing this with constrained intrinsics. The intrinsics are modeled as accessing inaccessible memory. This blocks the code motion we need blocked. It may be more conservative than necessary in some cases, but I think it’s OK.

But then this requires implementing strictfp support, which I cannot possibly do given the current state of support.

I don’t see these ever being used frequently enough for this to be important, but you can use the deferred mode-switch insertion strategy.