AllowFPOpFusion vs. SDNodeFlags::hasAllowContract()

I recently posted an RFC under the clang category proposing that -ffp-contract=fast should honor pragmas rather than requiring -ffp-contract=fast-honor-pragmas (see [RFC] Honor pragmas with -ffp-contract=fast). There seemed to be good support for this as a general concept, but it hit a roadblock when I looked at how various backends handle floating-point contraction.

I proposed implementing the change by having clang set TargetOptions::AllowFPOpFusion = FPOpFusion::Standard rather than FPOpFusion::Fast for -ffp-contract=fast. Unfortunately, this caused several backends to stop generating FMA operations entirely. An implication of this discovery, by the way, is that these backends will not generate FMA operations if -ffp-contract=fast-honor-pragmas is used.

After some investigation, I found that there is a general lack of consistency in the way that backends handle floating-point contraction, and that many backends haven’t been updated to recognize that contraction can be enabled or disabled at the instruction level. These backends rely on either a global flag that enables contraction everywhere, or the UnsafeFPMath attribute, which enables it at the function level.

The big problem with the global option is that it doesn’t allow for mixed modes within a module. The option naming points to C pragmas as one way that contraction can be turned on or off, but you can also get into a mixed state by using LTO to link IR from sources that were compiled with different -ffp-contract options. I’m sure this is OK for a lot of use cases, but given that we have the ability to express at the instruction level whether contraction is allowed, I’d like to see us bring all in-tree backends up to date with that aspect of the IR.

As I said, the implementation of this is fairly inconsistent. I do see two common patterns. One is some form of:

allowFMA = Options.AllowFPOpFusion == FPOpFusion::Fast ||
           Options.UnsafeFPMath;

The other is some form of:

allowFMA = Options.AllowFPOpFusion == FPOpFusion::Fast ||
           Options.UnsafeFPMath ||
           Node->getFlags().hasAllowContract();

Unfortunately, it’s not a simple matter of just adding the node-level checks everywhere that the global flag is being checked. In many cases, the global check is being performed in a way that makes it difficult to check the node flags on all nodes that would be involved. For example, the NVPTX backend has allowFMA and noFMA predicates in its pattern matchers that don’t reference the nodes involved at all.

Does anyone have any ideas on how to get started on moving all backends to a common level of support for the contract flag at the node level? Would anyone object to such a project?

After further investigation, I re-opened my clang pull request (https://github.com/llvm/llvm-project/pull/105746), but I’m still not sure the general state of the backends is ready for it.

I did an experiment in Compiler Explorer, trying this IR with various llc options:

define float @test_fma(float %x, float %y, float %z) {
    %mul = fmul contract float %x, %y
    %add = fadd contract float %mul, %z
    ret float %add
}

define float @test_fma_intrin(float %x, float %y, float %z) {
    %ret = call float @llvm.fma.f32(float %x, float %y, float %z)
    ret float %ret
}

define <2 x double> @test_vec_fma(<2 x double> %x, <2 x double> %y, <2 x double> %z) {
    %mul = fmul contract <2 x double> %x, %y
    %add = fadd contract <2 x double> %mul, %z
    ret <2 x double> %add
}

define float @test_nocontract_muladd(float %x, float %y, float %z) {
    %mul = fmul float %x, %y
    %add = fadd float %mul, %z
    ret float %add
}

declare float @llvm.fma.f32(float %x, float %y, float %z)

I meant for the case with the intrinsic to be a control to verify that I was using command-line options that allowed FMA at the target feature level. I compiled with the following llc command-line options:

-mtriple=aarch64-linux-gnu
-mtriple=amdgcn -mcpu=gfx1010
-mtriple armv8.6a-arm-none-eabi
-mtriple=loongarch32 -mattr=+d
-march=nvptx64
-mtriple=powerpc64
-mtriple=s390x-linux-gnu -mcpu=z14
-mtriple=thumbv8.1-m-none-eabi -mattr=+fullfp16
-mattr=+fma (for x86)

These were chosen based on LIT tests that checked for FMA with these targets. Most of them generated FMA for the first three cases and not for the last, as expected. The thumbv8.1 case didn’t produce FMA for the vector case, but I think that might have been because of the data types.

I’m trying to reconcile this with what I saw when looking at the pattern matching files. I suspect that the target-independent code to form FMA is handling the simple cases I tried. There may be other cases that get missed. Certainly there are some patterns that become obsolete.

Another sticking point is the existing tests. I see 83 CodeGen tests using -fp-contract=fast on their RUN lines. Of those, 51 fail if I map the -fp-contract=fast option to FPOpFusion::Standard. I looked at the six failing X86 tests, and five of them were easy to fix just by adding explicit contract flags to the affected operations. The sixth could probably have been made to work that way, but the test was written to run both with -fp-contract=fast and with no -fp-contract option, in which case it expected no FMA to be formed.

Even if the existing target-independent code is sufficient to form FMA everywhere it is expected, it will be a fair bit of work to sort out the tests and get them to behave as intended with just the contract flag to enable fused operations. It’s also possible that some of these tests would expose problems where we really do rely on the global flag. I can’t say one way or the other.

My biggest concern remains the pattern matching files. I see the following fragments in AArch64SVEInstrInfo.td:

def AArch64fadd_p_contract : PatFrag<(ops node:$op1, node:$op2, node:$op3),
                                     (AArch64fadd_p node:$op1, node:$op2, node:$op3), [{
  return N->getFlags().hasAllowContract();
}]>;
def AArch64fsub_p_contract : PatFrag<(ops node:$op1, node:$op2, node:$op3),
                                     (AArch64fsub_p node:$op1, node:$op2, node:$op3), [{
  return N->getFlags().hasAllowContract();
}]>;

And in ARMInstrInfo.td:

// An 'fadd' node which can be contracted into a fma
def fadd_contract : PatFrag<(ops node:$lhs, node:$rhs), (fadd node:$lhs, node:$rhs), [{
  return N->getFlags().hasAllowContract();
}]>;

That seems to me to be the right way to handle this in the pattern files, but it’s notable that neither of these has a corresponding fragment that checks for the contract flag on an fmul operation. The AArch64 patterns seem to be merging an add or sub into an existing fma node, so maybe(?) there’s no need to check for contract on that? I’m not certain what the ARM case is doing.
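For reference, if such a fragment were wanted, it would presumably just mirror the existing ones (this is a sketch, not code from the tree):

```
// Hypothetical contract-gated multiply fragment, by analogy with
// fadd_contract; I did not find one like this in any in-tree target.
def fmul_contract : PatFrag<(ops node:$lhs, node:$rhs),
                            (fmul node:$lhs, node:$rhs), [{
  return N->getFlags().hasAllowContract();
}]>;
```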

I’m not sure if I’m giving a soliloquy to the void here. I don’t know who reads this category or who is active in the various backends. It looks like @jholewinski introduced the NVPTX allowFMA predicates ten years ago, which was before the contract fast-math flag was added. @AlexM seems to be a frequent contributor to the NVPTXInstrInfo.td file, but maybe doesn’t look at Discourse a lot? @majnemer? @gonzalobg (I’m obviously fishing for someone who knows anything about the NVPTX backend here.)

Other backends? @arsenm @sjoerdmeijer @topperc @anb

I think the global flag just needs to be removed. It’s just going to be grinding through updating backends and tests.

I would assume there would be separate PatFrags for the 2 operations with the flags set, and another helper PatFrag that uses the two contracted variants

I would start by removing all of the UnsafeFPMath checks to allow FMA.

That’s what I was expecting to find, but I don’t see a PatFrag for the multiply instruction with the contract flag set. I’m not sure it’s worth trying to work explicit PatFrags into targets that don’t have them already, since the target-independent combiner seems to be doing what needs to be done in most cases, but I guess each target can make that decision for itself.

I agree that the global flag should be removed, but I don’t want to break things that are relying on it. The NVPTX backend particularly concerns me because CUDA is the only target that sets the fp-contract state to fast by default (as opposed to HIP which uses fast-honor-pragmas), and also because the NVPTX backend appears to have the most pattern matching that currently relies on the global flag.

I can imagine various front ends wanting to enable contraction unconditionally, but I strongly believe that global flags shouldn’t be enabling any changes to the semantics expressed in the IR. The use of the UnsafeFPMath attribute worries me less, because clang takes care of not setting it unless contract is enabled everywhere in a function, and it is at least something in the IR. I do agree that it should be eliminated, though.

You can trivially define such inst+contract-flag PatFrags, but, yes, I don’t expect this to be useful. The contraction logic is nontrivial and can require consideration of multiple uses. I would expect the contracting backend patterns to be removable.

I think this is more due to Sam noticing the bug first, and then the unwillingness to change any behavior for any reason anywhere else


The global options for FMA contraction and other fast-math optimizations have a long history in the NVPTX backend and throughout the CUDA stack. As you say, they were introduced before LLVM had the concept of contract and we’ve lived with them ever since. Unfortunately, they’re well entrenched in our toolchain and it would take a very large amount of effort to move completely to per-instruction flags, and validate that nothing breaks. That said, I’m in complete agreement that getting rid of the global flags is the right thing to do. We just need to carefully plan and stage how it gets applied to the NVPTX backend.

Thanks for the reply, @jholewinski

I think the next step, for all backends, is to update the existing LIT tests to not rely on the global flag. If the existing LIT tests can be made to pass with just the contract flag while still generating fused operations in all the cases currently tested, I’ll feel pretty good about where the backends are. At that point we should be able to land my patch to stop having clang set the global flag, perhaps with a way for backends that aren’t confident in the change to opt in to using the global flag a while longer. Then maybe we could declare the global flag deprecated and remove it sometime later, once everyone is comfortable with the change.