Floating-point performance question

We have been comparing the performance of code generated by Clang++ 3.3 with G++ 4.5.1. The results have been mixed.

We ran a profiler to look for what could cause some cases to run slower with Clang++ and found that some floating-point routines were taking a lot of time:

samples  %        image name    symbol name
596677   19.7935  studio++      gcopy2
274870   9.1182   libm-2.13.so  feholdexcept
262358   8.7032   libm-2.13.so  fesetenv
258225   8.5661   studio++      cgi...
207915   6.8971   libm-2.13.so  fesetround
193316   6.4129   studio++      dcopy2
126933   4.2107   libm-2.13.so  __ieee754_exp2
122614   4.0675   studio++      fcopy2

For g++ the top contributors were these:

samples  %        image name    symbol name
466893   21.3064  studio++      gcopy2
300240   13.7013  studio++      cgi...
176191   8.0404   studio++      dcopy2
132491   6.0462   studio++      cgi...
129580   5.9133   libm-2.13.so  __ieee754_pow
126938   5.7928   studio++      ecopy2
119610   5.4583   studio++      fcopy2

The libm floating-point routines 'fe...' only show up with Clang++, so I suspect they account for the slower performance.

We are not purposely changing the floating-point precision or rounding mode, so I am looking for a way to avoid code that uses these functions unnecessarily.

We are compiling with these options:

-march=core2 -msse4.1 -m64 -std=c++0x -fPIC -pthread -gcc-toolchain /opt/gcc-4.7.2 -Wno-logical-op-parentheses -Wno-shift-op-parentheses -O2

There isn't any obvious reason why feholdexcept etc. would be called from
clang-compiled code, but not gcc-compiled code; clang never generates calls
to it implicitly.

Can you hop into a debugger and get a stack trace from a call to
feholdexcept?

-Eli

Usually the reason these symbols show up on Linux is that you’re hitting the errno-setting versions of the libm entry points (i.e. GCC is likely generating calls to a different set of more streamlined libm entry points, while clang is hitting the default versions).

– Steve

glibc's expf() function changes the FP rounding mode on every call (those are the fe* calls you're seeing), which results in dreadful performance (IIRC there's a pipeline stall when the rounding mode changes).

Have a look at sysdeps/ieee754/flt-32/e_expf.c in the glibc sources to verify. This is true as of glibc 2.14, at least.

We had to roll our own to work around it.

  - ½

Same applies to exp2f, btw, since the two have fairly similar implementations.

  - ½
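To make the cost pattern described above concrete, here is a minimal sketch of a function that saves the floating-point environment, forces round-to-nearest, computes, and restores the environment. This is not glibc's code; the name exp2_with_env_save is hypothetical, and std::exp2 merely stands in for the real core computation. The three <cfenv> calls are the same ones that appear in the profile.

```cpp
#include <cfenv>
#include <cmath>

// Illustrative only: NOT glibc's implementation. It shows the
// save / set-rounding / restore pattern that produces the fe* calls.
double exp2_with_env_save(double x) {
    std::fenv_t saved;
    std::feholdexcept(&saved);      // save environment, clear exception flags
    std::fesetround(FE_TONEAREST);  // force round-to-nearest for the computation
    double result = std::exp2(x);   // stand-in for the core computation
    std::fesetenv(&saved);          // restore the caller's environment
    return result;
}
```

Each call pays for three environment operations regardless of whether the caller ever changes the rounding mode, which would be consistent with the fe* entries ranking so high in the Clang++ profile.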

Thanks for all the clues. Here is the stack trace:

 feholdexcept,
 __ieee754_exp2,
 exp2,
 _ZN9cgi...

Based on your various hints, I’m guessing that our code ‘pow (2.0, x)’ is being optimized by Clang++ to ‘exp2 (x)’ and not by G++. We will try using exp2 explicitly and see what happens with the G++ version.

Perhaps we are running into a floating-point standards issue that our old version of G++ is ignoring.

We’ll continue investigating tomorrow.

We changed our code to use ‘exp2 (x)’ instead of ‘pow (2.0, x)’ and verified that the G++ version now calls feholdexcept. We’ll also run our benchmarks again to compare Clang++ with G++ on our modified code. So, the question for Clang developers is: how can we avoid the optimization that converts ‘pow (2.0, x)’ to ‘exp2 (x)’? I don’t know why the library functions differ in their need to call feholdexcept, but regardless of the explanation, I want to pick the faster one for this particular usage.
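For anyone who wants to reproduce the comparison, a rough timing sketch follows. All names here (seconds_for, pow_opaque_two, exp2_direct) are hypothetical helpers, not part of our code base. The base is read through a volatile so that the compiler cannot see it is 2.0 and rewrite the pow call; whether that is sufficient on a given compiler is an assumption to check in the generated assembly.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper: wall-clock seconds spent applying f to every input.
// The accumulated sum is written to 'sink' so the loop is not optimized away.
template <typename F>
double seconds_for(F f, const std::vector<double>& xs, double& sink) {
    auto start = std::chrono::steady_clock::now();
    double acc = 0.0;
    for (double x : xs) acc += f(x);
    auto stop = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration<double>(stop - start).count();
}

// pow with a base the optimizer cannot prove to be 2.0.
inline double pow_opaque_two(double x) {
    static volatile double two = 2.0;
    return std::pow(two, x);
}

// Direct call to exp2, i.e. what the pow(2.0, x) rewrite produces.
inline double exp2_direct(double x) { return std::exp2(x); }
```

Calling seconds_for(pow_opaque_two, xs, sink) and seconds_for(exp2_direct, xs, sink) on the same large input vector gives a first impression of the gap; the numbers will depend on the libm version in use.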

This happens in Transforms/Utils/SimplifyLibCalls.cpp -- Maybe we should add a function to TargetLibraryInfo in order to mark exp2 as expensive on some platforms?

-Hal

This is really a library bug — exp2(x) should be at least as fast as pow(2,x) with any sane implementation (in particular, the library writers can implement exp2 by calling pow(2,x), so there’s no excuse for it to be slower). That said, it might be worth working around in the meantime.

– Steve

Our workaround: 'const double TWO = 2.;' is defined in a library separate from the rest of our code, so the compiler cannot see its value and can't change 'pow (TWO, x)' to 'exp2 (x)'. This improved performance by quite a bit. I'll address the next performance issue in a separate message.
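A minimal sketch of the workaround, squeezed into one file for illustration. The function name two_to_the is hypothetical. In the real setup the definition of TWO lives in a separately compiled library; with the definition visible in the same translation unit (as here), or with link-time optimization, the compiler may still see the value 2.0 and perform the rewrite.

```cpp
#include <cmath>

// Declaration only: at this point the compiler does not know the value.
extern const double TWO;

// Hypothetical caller: pow with a base that is opaque when TWO is
// defined in a different library.
double two_to_the(double x) {
    return std::pow(TWO, x);
}

// In the actual workaround this definition is compiled into a separate
// library. It is placed here only to keep the sketch self-contained.
const double TWO = 2.0;
```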

It would still be nice to have some control over whether translations of math functions are performed automatically by the compiler. Are there many of them?

Compiler options could help deal with this kind of library implementation problem.

IMO, if glibc's implementation of exp2 is so bad, LibCall info should get a bit to indicate this. This could disable the pow(2,x) -> exp2(x) transformation, and could enable transformations from exp2(x) -> pow(2,x).

-Chris

Maybe a cost table / cost info in LC, so that optimizers could work out if
it's worth exchanging N pows for M exps, or something like that, on a
target specific basis.

cheers,
--renato

This may not be necessary. I found another clue here, in section 1.1: https://sourceware.org/newlib/libm.html

When I tested the initial value of libm’s _LIB_VERSION, I found that my G++ version starts up with _IEEE_, the fastest mode (no exception handling, no warnings, ignoring errno). The Clang++ version starts up with _POSIX_, which sets errno correctly.

I will test our code's behavior when I set _LIB_VERSION to _IEEE_ explicitly, and I'll let you know if this explains the difference.

Unfortunately, that did not help. The version change did not improve the behavior of exp2 – it is still slower than pow. I will have to stick with my workaround, giving pow a const double with the value 2.0 instead of a literal 2.0. You may continue to consider how Clang++ could be controlled to avoid converting to the slower function.