question about fused multiply add and Clang GNU modes

Hi folks,

GNU GCC allows fused multiply add instruction generation in -std=gnu* modes (the default mode in GCC) on both ARM 32-bit and 64-bit targets. See the outputs below.

Clang 3.8 defaults to gnu11 for C programs, according to http://clang.llvm.org/docs/UsersManual.html#c-language-features and function CompilerInvocation::setLangDefaults in ./lib/Frontend/CompilerInvocation.cpp in the Clang source code.

So why is -ffp-contract=fast not made the default in Clang, as it is in GNU GCC?

I am just trying to understand the rationale behind this decision. We know the instruction produces results with higher precision and is compliant with the IEEE 754 standard.

This difference in default behavior puts Clang/LLVM at a performance disadvantage compared to GNU GCC.

Thanks!

Ana.

$ cat t.c
double f(double a, double b)
{
return b*b+a*a;
}

$ gcc-linaro-4.9-2015.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -S -O3 -o- -std=c99 t.c
.cpu generic+fp+simd
.file "t.c"
.text
.align 2
.global f
.type f, %function
f:
fmul d1, d1, d1
fmul d0, d0, d0
fadd d0, d1, d0
ret
.size f, .-f
.ident "GCC: (Linaro GCC 4.9-2015.05) 4.9.3 20150413 (prerelease)"
.section .note.GNU-stack,"",%progbits

$ gcc-linaro-4.9-2015.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -S -O3 -o- -std=gnu99 t.c
.cpu generic+fp+simd
.file "t.c"
.text
.align 2
.global f
.type f, %function
f:
fmul d0, d0, d0
fmadd d0, d1, d1, d0
ret
.size f, .-f
.ident "GCC: (Linaro GCC 4.9-2015.05) 4.9.3 20150413 (prerelease)"
.section .note.GNU-stack,"",%progbits

$ gcc-linaro-4.9-2015.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -S -O3 -o- t.c
.cpu generic+fp+simd
.file "t.c"
.text
.align 2
.global f
.type f, %function
f:
fmul d0, d0, d0
fmadd d0, d1, d1, d0
ret
.size f, .-f
.ident "GCC: (Linaro GCC 4.9-2015.05) 4.9.3 20150413 (prerelease)"
.section .note.GNU-stack,"",%progbits

Ana Pazos
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

Hi Ana,

It would change the behavior of a lot of existing software in subtle ways to let -std=gnu11 license fp-contract=fast. I’m honestly rather surprised that GCC made that choice.

I’m not sure what you mean by "We know the instruction produces results with higher precision and is compliant with the IEEE 754 standard." FMA produces *different* results than FMUL + FADD, but they are not always more accurate. The classical example of naive FMA formation gone wrong is multiplying a complex number by its conjugate. The imaginary part *should* be zero, but when FMA formation is licensed, one generally gets a small non-zero imaginary part.
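To make that concrete, here is a small self-contained sketch (illustrative only, not taken from the thread; it forces the fused form by hand with fma() from <math.h> instead of relying on the compiler's contraction):

/* Im(z * conj(z)) for z = a + b*i is b*a - a*b.  Without contraction both
 * products round identically and the difference is exactly zero.  If one
 * product is fused into the subtraction (forced explicitly here with fma()),
 * the result is the rounding error of the other product, which is generally
 * a tiny non-zero value. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 / 3.0, b = 1.0 / 7.0;

    double unfused = b * a - a * b;        /* exactly 0.0 (absent contraction) */
    double fused   = fma(b, a, -(a * b));  /* what contraction can turn it into */

    printf("unfused = %g\nfused   = %g\n", unfused, fused);
    return 0;
}

Built without contraction, the first difference is exactly zero while the second is on the order of half an ulp of a*b.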

IEEE doesn’t actually license fma formation. I’m not sure where you got the idea that it does. It doesn’t expressly forbid it either. Rather it makes the following recommendations:

"A language standard should require that by default, when no optimizations are enabled and no alternate exception handling is enabled, language implementations preserve the literal meaning of the source code.”

This means that, by default, an implementation should not transform FMUL + FADD into FMADD. It encourages this transform to be available as an option, however:

"A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to:

― Applying the associative or distributive laws.
― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition.
― Synthesis of a formatOf operation from an operation and a conversion of the result of the operation.
― Use of wider intermediate results in expression evaluation."

Note that the other transforms that IEEE-754 groups in with FMA formation here are all things that we license only under fast-math.

Now, it so happens that fma formation makes results more accurate more often than it makes them less accurate. It is *usually* a good thing, so the case isn’t quite as cut and dried as I’m presenting it to be. It’s also quite beneficial for performance on many platforms (but rather detrimental to performance on some other platforms with hardware FMA support, so again the case is not terribly clear).

It should also be noted that -ffp-contract=fast goes beyond what is allowed by the C rules for #pragma STDC FP_CONTRACT ON (which allows fma formation only within an expression):

scanon$ cat foo.c
#pragma STDC FP_CONTRACT ON
float foo(float x, float y, float z) {
    return x*y + z; // fma formation is licensed here.
}

float bar(float x, float y, float z) {
    float p = x*y;
    return p + z; // fma formation is not licensed here.
}

scanon$ clang fma.c -Os -c -arch arm64 && otool -tvV fma.o
fma.o:
(__TEXT,__text) section
_foo:
0000000000000000 fmadd s0, s0, s1, s2 // fma only where licensed
0000000000000004 ret
_bar:
0000000000000008 fmul s0, s0, s1
000000000000000c fadd s0, s0, s2
0000000000000010 ret

scanon$ clang fma.c -Os -c -arch arm64 -ffp-contract=fast && otool -tvV fma.o
fma.o:
(__TEXT,__text) section
_foo:
0000000000000000 fmadd s0, s0, s1, s2
0000000000000004 ret
_bar:
0000000000000008 fmadd s0, s0, s1, s2 // fma even where not licensed
000000000000000c ret

Now, it *does* appear to me that we do not default to having STDC FP_CONTRACT ON, which is inhibiting fma formation *even within an expression*. Given that we support STDC FP_CONTRACT OFF, we could certainly choose to make ON the default, and I would encourage doing so.

– Steve

From: "Stephen Canon via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Ana Pazos" <apazos@codeaurora.org>
Cc: cfe-dev@lists.llvm.org
Sent: Saturday, September 19, 2015 3:00:53 PM
Subject: Re: [cfe-dev] question about fused multiply add and Clang GNU modes

[...]

> Now, it *does* appear to me that we do not default to having STDC
> FP_CONTRACT ON, which is inhibiting fma formation *even within an
> expression*. Given that we support STDC FP_CONTRACT OFF, we could
> certainly choose to make ON the default, and I would encourage doing
> so.

I agree. Also, our behavior here appears somewhat buggy. Not only do we not set -ffp-contract=on by default (as I recall had been our intention), but -ffp-contract=on does not even work correctly. The code in lib/Frontend/CompilerInvocation.cpp does call Opts.setFPContractMode(CodeGenOptions::FPC_On) when passed -ffp-contract=on, but only in OpenCL mode do we set Opts.DefaultFPContract = 1. Setting CodeGenOptions::FPC_On does pass the right flag to the backend, and does enable generating @llvm.fmuladd when an operation is tagged as 'FPContractable', but...

1. The STDC FP_CONTRACT pragma's DEFAULT option always resets to getLangOpts().DefaultFPContract, and thus is unaffected by the -ffp-contract flag (because that's always 0 except in OpenCL mode).

2. FPFeatures.fp_contract is initialized to 0 in include/clang/Basic/LangOptions.h, and this is never changed (except by the STDC FP_CONTRACT pragma handlers). When we create BinaryOperator AST nodes (etc.) we use the current state of FPFeatures.fp_contract to set the node's FPContractable flag, and because this always defaults to 0, regardless of how -ffp-contract is set (except setting it to fast, which bypasses all of this), none of the AST nodes are marked as contractable, and we don't generate FMAs at all.

I think that the first step here is fixing all of this so that -ffp-contract=on actually works.

-Hal


--

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Thanks Steve, an excellent summary of where IEEE-754 stands. I knew
about the recommendation for languages to provide controls, but hadn't
tracked down the one for the default.

> Now, it *does* appear to me that we do not default to having STDC
> FP_CONTRACT ON, which is inhibiting fma formation *even within an
> expression*. Given that we support STDC FP_CONTRACT OFF, we could certainly
> choose to make ON the default, and I would encourage doing so.

I doubt we really support FP_CONTRACT ON: I've never seen any attempt
to track C expressions in the IR, which seems like a necessity. So the
choice is probably between a conformant OFF and a buggy ON for the
default.

Cheers.

Tim.

From: "Tim Northover via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Stephen Canon" <scanon@apple.com>
Cc: cfe-dev@lists.llvm.org
Sent: Sunday, September 20, 2015 11:42:56 PM
Subject: Re: [cfe-dev] question about fused multiply add and Clang GNU modes

[...]

> I doubt we really support FP_CONTRACT ON: I've never seen any attempt
> to track C expressions in the IR, which seems like a necessity. So the
> choice is probably between a conformant OFF and a buggy ON for the
> default.

As someone who helped review the patches, I can assure you that we do support the pragmas. They work better than the command-line flags (for the reasons I pointed out in my previous e-mail). We don't track the C-level expressions in the IR, but Clang will directly form @llvm.fmuladd intrinsics where allowed. CodeGen then converts these into FMA nodes, or expands them into ADD + MUL depending on target hooks.
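A minimal source-level sketch of that path (hypothetical example; the function name is made up and the IR shown in the comment is the expected shape, not verified output):

/* With the pragma in effect, the return expression below should come out of
 * Clang as something like
 *     %0 = call double @llvm.fmuladd.f64(double %x, double %y, double %z)
 * which the backend can later turn into a single FMA instruction or expand
 * back into separate fmul + fadd, depending on the target hooks. */
#pragma STDC FP_CONTRACT ON
double fused_candidate(double x, double y, double z) {
    return x * y + z;   /* a single C expression, so contraction is licensed */
}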

-Hal

Wouldn't that be suboptimal from a CSE PoV? Consider something like:

r = a + b * c + b * c * d;

If we are greedy, the first b * c would end up in an (a + b * c) FMA
intrinsic and the multiplication would be computed twice?
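A hand-written sketch of the two possible lowerings (illustrative only, not compiler output; the function names are made up and fma() merely stands in for the fused operation) makes the duplicated multiply explicit:

/* Two ways to lower r = a + b*c + b*c*d.
 * fma(x, y, z) computes x*y + z with a single rounding. */
#include <math.h>

double greedy_fusion(double a, double b, double c, double d) {
    double t = fma(b, c, a);   /* the first b*c is absorbed into the add */
    return fma(b * c, d, t);   /* so b*c has to be recomputed here */
}

double reuse_product(double a, double b, double c, double d) {
    double m = b * c;          /* common subexpression computed once */
    return a + m + m * d;      /* at the cost of forgoing the fused forms */
}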

Joerg

From: "Joerg Sonnenberger via cfe-dev" <cfe-dev@lists.llvm.org>
To: cfe-dev@lists.llvm.org
Sent: Monday, September 21, 2015 8:06:00 AM
Subject: Re: [cfe-dev] question about fused multiply add and Clang GNU modes


> Wouldn't that be suboptimal from a CSE PoV? Consider something like:
>
> r = a + b * c + b * c * d;
>
> If we are greedy, the first b * c would end up in an (a + b * c) FMA
> intrinsic and the multiplication would be computed twice?

Yes, I think that it could. That's a good point.

-Hal

For what I would call “modern” hardware FMA implementations (where an FMA is no more costly than a multiply, and often as cheap as an add), this can never be too bad: adding the different addends to the common product isn’t actually significantly cheaper than doing partially-redundant FMAs, and if the product is re-used in a non-FMA expression, computing an FMA plus the product is no more expensive than computing the product plus a sum.

There definitely exist some FMA implementations where it’s as expensive as a separate multiply and add, however, and on those machines this *can* indeed be a hazard.

– Steve

As long as FMA and a plain multiply are more expensive than an add, the above
can be trivially extended by another term or two to still highlight the
problem. But this goes back to the core of the issue: which form is better is
target-specific, and C -> IR lowering is too early for that decision to be
made.

Joerg

Semantically llvm.fmuladd is just a “fusable” multiply-add pair. The decision hasn’t been made yet.

– Steve


> As someone who helped review the patches, I can assure you that we do support the pragmas.

Ah, sorry about that misinformation then. It didn't even occur to me
that it could be done at the clang level.

Tim.