FPOpFusion = Fast and Multiply-and-add combines

Hi all,

The AllowFPOpFusion option passed to a target can currently take three different values: Fast, Standard, or Strict (see TargetOptions.h), with Standard being the default.

In the DAGCombiner, during the combination of mul and add/subtract into multiply-and-add/subtract, this option is expected to be Fast in order to enable the combine. This means that, by default, no multiply-and-add opcodes are going to be generated. If I understand it correctly, this is undesirable, given that multiply-and-add for targets like PPC (I am not sure about all the other targets) does not pose any rounding problem and can even be more accurate than performing the two operations separately.

Also, in TargetOptions.h I read:

Standard, // Only allow fusion of ‘blessed’ ops (currently just fmuladd)

which made me suspect that the check against Fast in the DAGCombiner is not correct.

I was wondering if this is something to be fixed in the DAGCombiner, or if the target should set a different option, checked by the DAGCombiner, saying that mul-add/subtract is okay.

Any comments?

Thanks in advance!
Samuel

Hi Samuel,

> In the DAGCombiner, during the combination of mul and add/subtract into
> multiply-and-add/subtract, this option is expected to be Fast in order to
> enable the combine. This means that, by default, no multiply-and-add
> opcodes are going to be generated. If I understand it correctly, this is
> undesirable, given that multiply-and-add for targets like PPC (I am not
> sure about all the other targets) does not pose any rounding problem and
> can even be more accurate than performing the two operations separately.

That extra precision is actually what we're being very careful to avoid
unless we're specifically told it's allowed. It can be just as harmful
to carefully written floating-point code as dropping precision would be.

> Also, in TargetOptions.h I read:
>
> Standard, // Only allow fusion of 'blessed' ops (currently just fmuladd)
>
> which made me suspect that the check against Fast in the DAGCombiner is
> not correct.

I think it's OK. In the IR there are three different ways to express
mul + add:

1. fmul + fadd. This must not be fused into a single step without
intermediate rounding (unless we're in Fast mode).
2. call @llvm.fmuladd. This *may* be fused or not, depending on
profitability (unless we're in Strict mode, in which case it's kept
separate).
3. call @llvm.fma. This must not be split into two operations (unless
we're in Fast mode).

That middle one is there because C actually lets you enable and disable
contraction within a limited region with "#pragma STDC FP_CONTRACT ON".
So we need a way to represent the idea that it's not usually OK to fuse
(i.e. we're not in Fast mode), but this particular operation actually is
OK to fuse.
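
To make that concrete, here is a minimal C sketch (the function names are
mine, and the exact lowering of fmaf can depend on flags like
-fmath-errno):

#include <math.h>

/* (1) fmul + fadd: two operations, each rounded separately. */
float two_ops(float a, float b, float c) { return a * b + c; }

/* (2) call @llvm.fmuladd: the pragma marks the expression as
   contractible, so the backend may fuse it if profitable. */
float contractible(float a, float b, float c) {
#pragma STDC FP_CONTRACT ON
    return a * b + c;
}

/* (3) call @llvm.fma: an explicit fused multiply-add that must not be
   split back into two rounded operations. */
float always_fused(float a, float b, float c) { return fmaf(a, b, c); }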

Cheers.

Tim.

Hi Tim,

Thanks for the thorough explanation. It makes perfect sense.

I was not aware that, outside of fast-math, the compiler is supposed to prevent more precision being used than the standard specifies.

I came across this issue while looking into the output of different compilers. XL and the Microsoft compiler seem
to have that turned on by default, but I assumed that clang follows what gcc does and has it turned off.

Thanks again,
Samuel

Hi Samuel,

I don't think clang follows what gcc does regarding FMA, at least by default. I don't have a PPC compiler to test with, but for x86-64, using clang trunk and gcc 4.9:

$ cat fma.c
float foo(float x, float y, float z) { return x * y + z; }

$ ./clang -march=core-avx2 -O2 -S fma.c -o - | grep ss
vmulss %xmm1, %xmm0, %xmm0
vaddss %xmm2, %xmm0, %xmm0

$ ./gcc -march=core-avx2 -O2 -S fma.c -o - | grep ss
vfmadd132ss %xmm1, %xmm2, %xmm0

Hi Sanjay,

You are right. I tried XL and gcc 4.8.2 for PPC and I also got multiply-and-add operations.

I based my statement on what I read in the gcc man page. In clang, -ffast-math sets fp-contract to fast (the default is standard), while in gcc it activates (among others) the flag -funsafe-math-optimizations, whose description includes:

“Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.”

I am not a floating-point expert; for the applications I care about, more precision is usually better, and that is what muladd provides. Given Tim's explanation, I thought that muladd would conflict with (b), and that some users would expect the exact roundings for the mul and the add. However, I found this statement in Section 5 of the IEEE floating-point standard:

“Each of the computational operations that return a numeric result specified by this standard shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, …”

which perfectly fits what the muladd instructions in PPC, and also in AVX2, are doing: using infinite precision after the multiply.

It may be that there is something in the C/C++ standards I am not aware of that makes the fusing illegal. As you said, another reason may be just an implementation choice. But in that case I believe we would be making a bad choice, as I suspect there are many more users looking for faster execution than users relying on a particular rounding property.

Maybe there is someone who can shed some light on this?

Thanks,
Samuel

"Each of the computational operations that return a numeric result specified
by this standard shall be performed as if it first produced an intermediate
result correct to infinite precision and with unbounded range, and then
rounded that intermediate result, ..."

which perfectly fits what the muladd instructions in PPC and also in avx2
are doing: using infinite precision after the multiply.

There are two operations in "a + b * c". Using muladd omits the second
requirement ("and then rounded that intermediate result") for the first
of them, the multiply.

IEEE describes a completely separate "fusedMultiplyAdd" operation with
the "muladd" semantics.

Cheers.

Tim.

From: "Tim Northover" <t.p.northover@gmail.com>
To: "Samuel F Antao" <sfantao@us.ibm.com>
Cc: "Olivier H Sallenave" <ohsallen@us.ibm.com>, llvmdev@cs.uiuc.edu
Sent: Wednesday, August 6, 2014 10:59:43 PM
Subject: Re: [LLVMdev] FPOpFusion = Fast and Multiply-and-add combines

> "Each of the computational operations that return a numeric result
> specified
> by this standard shall be performed as if it first produced an
> intermediate
> result correct to infinite precision and with unbounded range, and
> then
> rounded that intermediate result, ..."
>
> which perfectly fits what the muladd instructions in PPC and also
> in avx2
> are doing: using infinite precision after the multiply.

There are two operations in "a + b * c". Using muladd omits the
second
requirement ("and then rounded that intermediate result") on the
first.

IEEE describes a completely separate "fusedMultiplyAdd" operation
with
the "muladd" semantics.

Samuel,

To add to Tim's (correct) response...

C11, for example, addresses this. Section 6.5 paragraph 8 says: "A
floating expression may be contracted, that is, evaluated as though it
were a single operation, thereby omitting rounding errors implied by the
source code and the expression evaluation method. The FP_CONTRACT pragma
in <math.h> provides a way to disallow contracted expressions." Section
7.12.2 then says: "The default state ('on' or 'off') for the pragma is
implementation-defined."

There are a few implications here, the most important being that C allows contraction only within floating-point expressions, not across statement boundaries. This immediately imposes great challenges on performing mul+add fusion late in the optimizer (in the SelectionDAG, for example), because all notion of source-level statement boundaries has been lost. Furthermore, the granularity of the effects of the FP_CONTRACT pragma is defined in terms of source-level constructs (in 7.12.2).
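
A tiny hypothetical illustration of that boundary (my code, not from the
standard):

/* Even with FP_CONTRACT on, C11 permits contraction only within a
   single expression. */
float may_contract(float a, float b, float c) {
    return a * b + c;   /* one expression: may become a fused mul-add */
}

float must_not_contract(float a, float b, float c) {
    float t = a * b;    /* statement boundary: t must hold the rounded
                           product */
    return t + c;       /* fusing across the boundary would be
                           non-conforming */
}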

Many compilers, including GCC on PowerPC, use a non-standard-compliant mode by default. GCC's manual documents:

[from GCC man page]
-ffp-contract=style
     -ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression
     contraction such as forming of fused multiply-add operations if the target has native support for them.
     -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is currently
     not implemented and treated equal to -ffp-contract=off.

     The default is -ffp-contract=fast.
[end from GCC man page]

Clang, however, chooses to provide standard compliance by default. When -ffp-contract=fast is provided, we enable aggressive fusion in DAGCombine. We also enable this whenever fast-math is enabled. When -ffp-contract=on is in effect, we form contractions only where allowed (within expressions). This is done by having Clang itself emit the @llvm.fmuladd intrinsic. We use -ffp-contract=off by default. The benefit of this is that programs compiled with Clang should produce stable answers, as dictated by the relevant standard, across different platforms.
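
Repeating Sanjay's experiment with contraction enabled shows the effect
(the exact instruction selected may vary):

$ ./clang -march=core-avx2 -O2 -ffp-contract=fast -S fma.c -o - | grep ss
vfmadd213ss %xmm2, %xmm1, %xmm0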

On PowerPC, LLVM's test-suite uses -ffp-contract=off so that the output is stable against optimizer fusion decisions across multiple compilers.

Finally, although counter-intuitive, extra precision is not always a good thing. Many numerical algorithms function correctly only in the presence of unbiased rounding that provides symmetric error cancellation across various expressions. If some of those expressions are computed with different amounts of effective precision, these errors don't cancel as they should, and the resulting program can produce inferior answers. Admittedly, I believe such situations are relatively rare, but do certainly exist in thoughtfully-constructed production code.
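
As a hedged sketch of how that can bite (my example; build with
-ffp-contract=off so the unfused expression stays unfused):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* x is chosen so that x*x rounds *up* in single precision. */
    float x = 0x1.001002p0f;            /* 1 + 2^-12 + 2^-23 */
    float unfused = x * x - x * x;      /* 0 exactly: both sides round alike */
    float fused = fmaf(x, x, -(x * x)); /* exact x*x minus rounded x*x: < 0 */
    printf("sqrt(unfused) = %g\n", sqrtf(unfused)); /* 0 */
    printf("sqrt(fused)   = %g\n", sqrtf(fused));   /* nan */
    return 0;
}

Contracting only one occurrence of a product in an expression like
x*x - y*y can turn an exact zero into a small negative number, and a
subsequent square root into a NaN.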

-Hal

Hal, Tim,

Thanks for the thorough explanation. That is very clarifying.

Thanks again!
Samuel