Poor floating point optimizations?

I wanted to use LLVM for my math parser but it seems that floating point
optimizations are poor.

For example, consider this C code:
float foo(float x) { return x+x+x; }

and here is the code generated by the "optimized" live demo:
define float @foo(float %x) nounwind readnone {
entry:
  %0 = fmul float %x, 2.000000e+00   ; <float> [#uses=1]
  %1 = fadd float %0, %x             ; <float> [#uses=1]
  ret float %1
}

So what is going on? I mean, the expression was not replaced by %0 = fmul float %x, 3.0.
I thought that maybe LLVM follows some strict IEEE floating-point rules concerning
exceptions, subnormals, NaN, Inf and so on. But (x+x) was actually transformed into
%x * 2.0, which is even stranger, because on many architectures MUL takes more
cycles than ADD (the reason you'd rather use LEA than IMUL in x86 code).

Could someone explain what is going on? Maybe there are some special
optimization passes for such scenarios, but I've been playing with them (in a
C++ app) and haven't achieved anything. And to be honest, those optimization
passes are not well documented.

With regards,
Bob D.

And also the resulting assembly code is very poor:

00460013 movss xmm0,dword ptr [esp+8]
00460019 movaps xmm1,xmm0
0046001C addss xmm1,xmm1
00460020 pxor xmm2,xmm2
00460024 addss xmm2,xmm1
00460028 addss xmm2,xmm0
0046002C movss dword ptr [esp],xmm2
00460031 fld dword ptr [esp]

Especially the pxor & add sequence instead of a movss (which is unnecessary
anyway) is just pure madness.

Bob D.

X+0.0 isn't the same as X if X is -0.0. Have you tried setting 'UnsafeFPMath' in TargetOptions.h?
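
A minimal sketch of what that looks like when driving LLVM from C++, assuming
an LLVM of roughly this vintage where UnsafeFPMath is the global flag declared
in llvm/Target/TargetOptions.h (the flag's exact location has moved between
releases, so check your version's headers):

  #include "llvm/Target/TargetOptions.h"

  void enableUnsafeFPMath() {
    // Set this before running code generation (e.g. before creating the
    // target machine or JIT); the backend consults this global flag during
    // instruction selection.
    llvm::UnsafeFPMath = true;
  }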

If you're still having problems, it would be best to indicate more about your configuration, how you're using the llvm tools/libs etc.

-Chris

Thanks for replying so fast. The UnsafeFPMath trick does in fact solve the
"pxor add" case, but the resulting code is still not as good as I expected from LLVM.

For example, expressions like "1+x+1+x+1+x+1+x" (basically adding a lot of
constants and variables) are compiled to a long series of <add>s both in IR
and in assembly code.
Both GCC and MSVC generate C1*x + C2 (mov + mul + add).

I am new to using LLVM. I am using Visual Studio 2008 on Windows, targeting
32-bit x86 code, and using LLVM as a library. I've added
setOptLevel(CodeGenOpt::Aggressive) to the engine, but with no effect.
Please let me know if there are some optimization options/passes in LLVM that
could optimize expressions such as "1+x+1+x+1+x+1+x".

Bob

Hi Bob,

> For example, expressions like "1+x+1+x+1+x+1+x" (basically adding a lot of
> constants and variables) are compiled to a long series of <add>s both in IR
> and in assembly code.
> Both GCC and MSVC generate C1*x + C2 (mov + mul + add).
>
> I am new to using LLVM. I am using Visual Studio 2008 on Windows, targeting
> 32-bit x86 code, and using LLVM as a library. I've added
> setOptLevel(CodeGenOpt::Aggressive) to the engine, but with no effect.
> Please let me know if there are some optimization options/passes in LLVM that
> could optimize expressions such as "1+x+1+x+1+x+1+x".

The reassociate pass deliberately does not perform this kind of optimization
because it is not right to do so if strict floating point correctness is
required. Probably it should do it if the user supplied an "unsafe math"
flag. There's actually a plan to add an "unsafe math" flag to each fp
operation in the IR, so that this setting can be per-operation, but it hasn't
happened yet.
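
For reference, a rough sketch of scheduling reassociation and related passes
when using LLVM as a library (the PassManager headers and constructors differ
between LLVM releases, so treat the details as approximate):

  #include "llvm/PassManager.h"
  #include "llvm/Transforms/Scalar.h"

  void runScalarOpts(llvm::Module &M) {
    llvm::PassManager PM;
    PM.add(llvm::createInstructionCombiningPass()); // e.g. x+x -> x*2.0
    PM.add(llvm::createReassociatePass());          // reassociates expressions,
                                                    // but leaves FP chains alone
                                                    // for the reason given above
    PM.add(llvm::createGVNPass());                  // clean up redundancies
    PM.run(M);
  }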

Ciao,

Duncan.

To do FP optimization correctly the compiler needs quite detailed knowledge
of the FP model, knowledge that goes beyond just "safe/unsafe".

For example as Chris pointed out, x + 0.0 -> x is not a valid rewrite if
the FP model has signed zeroes (i.e. zeroes whose sign is observable by
well-defined operations in the language). It is valid in Standard C because
the sign of zero is not observable in Standard C. On the other hand
your example needs reassociation, to change 1 + x + 1 to x + (1 + 1),
and reassociation generally isn't valid even in Standard C, because it
changes the numeric value of the result.
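
To make the reassociation point concrete, here is a small made-up illustration
(mine, not an example from any particular compiler) using IEEE single precision:

  #include <cstdio>

  int main() {
    float big = 1.0e20f;
    float x = 1.0f;
    float a = (x + big) - big;    // as written: x + big rounds to big, so a == 0.0f
    float b = x + (big - big);    // reassociated: big - big == 0.0f, so b == 1.0f
    std::printf("%g %g\n", a, b); // prints "0 1"
    return 0;
  }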

Other examples: to optimize away a voided computation you need to know
"does the FP model trap or set cumulative exception bits". To rewrite
x + y as y + x you need to know "does the FP model trap". There are other
optimizations whose validity depends on whether the rounding mode can be
changed at runtime.
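
As a sketch of the rounding-mode point (an illustrative example of mine; it
relies on the compiler honouring FENV_ACCESS, or a flag such as
-frounding-math, so that it does not fold or common up the divisions):

  #include <cfenv>
  #include <cstdio>

  int main() {
    double x = 1.0;
    std::fesetround(FE_UPWARD);
    double up = x / 3.0;                      // quotient rounded towards +infinity
    std::fesetround(FE_DOWNWARD);
    double down = x / 3.0;                    // quotient rounded towards -infinity
    std::printf("%.17g\n%.17g\n", up, down);  // the two differ in the last bit
    return 0;
  }

A compiler that assumes the default rounding mode is free to fold x / 3.0 to a
constant, which would silently give the same value for both.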

Overall the FP model needs to be able to answer at least these queries:

  - does it trap (i.e. invoke user-defined trap handler with the original
    operands in the original order)
  - does it set cumulative exception bits
  - can the rounding mode be changed at runtime
  - does it have Inf and NaN
  - are denormals flushed to zero (and if so how)

Some of the flags tend to imply others, e.g. you don't in practice have the
ability to trap without the ability to set cumulative exception bits.
To support Standard C/C++ you only need a minimal model; to claim full
IEEE support you need pretty much everything. The "Java numerics" model
is somewhere in between.

All of these are within the space of "safe" models; the difference is the
set of values and observable events.

As well as this you have a concept of "unsafe" (or "fast") modes which aren't
really models at all - they are modes where the compiler doesn't implement a
well-defined predictable model of FP arithmetic and where numeric results can
vary between different optimization levels, different programming styles (e.g.
inline functions vs. macros) and from one day's build of the compiler to the next.

And within the space of fast modes there are some choices for varying the
heuristics - e.g. replacing division-by-constant by multiply-by-reciprocal
is generally numerically "safer" than reassociation.
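
For instance, the rewrite for a constant divisor looks like this (an
illustrative sketch of mine; the loop just hunts for inputs where the two
forms disagree):

  #include <cstdio>

  int main() {
    const float recip = 1.0f / 3.0f;      // reciprocal rounded once, "at compile time"
    for (float x = 1.0f; x <= 100.0f; x += 1.0f) {
      float exact = x / 3.0f;             // correctly rounded quotient
      float fast  = x * recip;            // the multiply-by-reciprocal rewrite
      if (exact != fast)
        std::printf("x = %g: %.9g vs %.9g\n", x, exact, fast);  // typically off by one ulp
    }
    return 0;
  }

The error stays around an ulp, unlike reassociation, which as noted above can
be catastrophically worse.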

At the very least (I believe) a compiler for serious FP work needs a
fast/unsafe mode, a minimal Standard C mode, and an IEEE mode. Attaching
a safe/unsafe bit to each FP operation is only sufficient if there is only
one safe model in effect at any time. Even then, exactly what the safe
model is ought to be configurable.

Al

I'm aware that there are IEEE requirements for floating point. But C/C++
compilers like GCC or MSVC all have unsafe/fast-math switches that trade away
the NaN/Inf/+-0/precision guarantees, because in some applications those are
not as important as performance (for example graphics calculations).

But as I understand it, LLVM currently does not have such unsafe/fast-math
optimizations built in, and support for them is still only planned?
I've found this
document: http://nondot.org/sabre/LLVMNotes/FloatingPointChanges.txt but I
suppose it has not been implemented yet?

Bob

> I'm aware that there are IEEE requirements for floating point. But C/C++
> compilers like GCC or MSVC all have unsafe/fast-math switches that trade away
> the NaN/Inf/+-0/precision guarantees, because in some applications those are
> not as important as performance (for example graphics calculations).

Right, but there's a middle ground of "performance with repeatability" where
you want to have a minimal model of FP arithmetic without NaN/Inf/+-0, but at
the same time you don't want numerical results to be sensitive to which
optimizations the compiler happens to do, which will themselves be sensitive
to compiler version, optimization level, structure of source code etc.

> I've found this
> document: http://nondot.org/sabre/LLVMNotes/FloatingPointChanges.txt

It says

"GCC provides relaxed floating point support through the flags -ffast-math,
-funsafe-math-optimizations, -fhonor-{s}nans, and some other flags that noone
really uses (-ffast-math is all most people know about)...
"Note that if we were to magically make our own C/C++ compilers, we should
definitely default to the equivalent of -ffast-math. Empirically, this produces
faster and more precise code for GCC"

This seems to be an x86-centric view, i.e. from the fact that on x86
it's faster not to reduce intermediate results to their source-type
precision. So by relaxing _this particular_ requirement you really do
get results that are "faster and more precise". On most architectures
this isn't relevant though.
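
A small example of that x86 effect (mine; it assumes an x87 build where
intermediates are kept in extended precision, i.e. FLT_EVAL_METHOD != 0):

  #include <cstdio>

  // With strict per-operation float rounding (e.g. SSE scalar code),
  // 1e30f * 1e30f overflows to +inf and the final result is +inf.
  // If the whole expression is kept in an x87 register and only rounded
  // once at the end, the intermediate 1e60 is representable and the
  // result is 1e30, i.e. "faster and more precise".
  float scale(float a, float b, float c) {
    return a * b / c;
  }

  int main() {
    std::printf("%g\n", scale(1.0e30f, 1.0e30f, 1.0e30f));
    return 0;
  }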

Other "relaxed" optimizations are faster and generally _less_ precise;
some of them gracefully so, but reassociation can be catastrophically
less precise. And all of them can lead to unrepeatable results, or
situations where two semantically equivalent sources give different
results. So even within the space of relaxed optimizations, some fine-grained
control is useful.

Al
