Complex arithmetic ignores -ffast-math after clang r219557, serious performance regressions

After building with clang 3.7svn recently, I saw a huge speed hit across much of our HPC and floating point DSP code. I looked at the asm output and it’s riddled with calls to ___mulsc3, which is never inlined (preventing lots of other optimizations) and which includes a bunch of C99 Annex G-recommended branch conditions for range checks and whatnot. One of the purposes of -ffast-math has always been to disable these sorts of checks, trusting the developer to ensure that they can’t happen or will be handled upstream.

Explicitly writing out the real and imaginary component math in one of my critical sections was enough to confirm that the problem lies here and not elsewhere. However, doing this throughout all of our code would be prohibitive, and of course it greatly reduces the readability of the code and presumably the ability of future compilers to optimize it in a way that I haven’t thought of yet.
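
For illustration, a minimal sketch of the two forms being compared: the built-in operator, which per this report ends up as a __mulsc3 call, and a hand-expanded version. The function names are hypothetical and not taken from our code base; the expanded form deliberately omits any NaN/infinity handling.

```c
#include <assert.h>
#include <complex.h>

/* Readable form: the built-in operator. Per the report above, this is what
   ends up as a __mulsc3 library call. */
static float _Complex mul_operator(float _Complex a, float _Complex b)
{
    return a * b;
}

/* Hand-expanded form (hypothetical helper): plain component arithmetic with
   no NaN/infinity recovery. */
static float _Complex mul_expanded(float ar, float ai, float br, float bi)
{
    float re = ar * br - ai * bi;
    float im = ar * bi + ai * br;
    return re + im * I;
}

/* Sanity check on exactly representable inputs: (1+2i)(3+4i) = -5+10i. */
static void check_mul_expanded(void)
{
    assert(mul_expanded(1.0f, 2.0f, 3.0f, 4.0f) == -5.0f + 10.0f * I);
}
```

For finite inputs without overflow the two forms should agree, up to rounding differences if the compiler contracts the expanded form into fused multiply-adds; they differ only in how NaN and infinity cases are treated.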

The relevant patch discussion in the mailing list is here: http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20141006/116248.html and includes a comment from hfinkel also requesting that the libcalls be skipped in fast-math mode. From what I can see there was no followup on this.

At the bare minimum I think these checks should be disabled within __mulsc3 when -ffast-math or the relevant subflag is enabled, and preferably the library calls should be skipped entirely as before, so that other compiler optimizations aren’t prevented.

Hey Chandler,

What's the status on this? As Richard points out, we really do need to elide these runtime calls in fast-math mode. It seems like this is just as simple as conditionally restoring some pieces of code from CGExprComplex.cpp that you removed. Are there additional complications or objections to this?

-Hal

A temporary workaround is defining __mulsc3 in your own code... clang seems
to pick up on it correctly, e.g.:

__attribute__(( always_inline ))
static inline float _Complex
__mulsc3( float ar, float ai, float br, float bi)
{
  return (float _Complex){ ar * br - ai * bi, ar * bi + ai * br };
}

I've noticed it really needs to be static always_inline to get optimized
properly. At least using latest clang-3.7 from debian sid with:
-target arm-linux-gnueabihf -mfloat-abi=hard -mcpu=cortex-a8 -mfpu=neon
-Ofast

Different storage class specifications give fascinating differences, even
with a function as simple as return a * b; where a and b are its complex
float arguments.

Two curious observations:
* If my __mulsc3 is declared "extern inline", clang nevertheless emits code
for it. I had expected any non-inlineable uses to become references to the
standard one.
* If it is declared static (inline or not) it acquires soft float ABI
calling conventions (with associated terrible overhead), and it still gets
called in places where __mulsc3 would normally get called. Using
always_inline avoids this.

(Since you're declaring complex mul, you can of course take the opportunity
to see if there's any benefit in a different implementation of complex
multiply, e.g.

  float t = ai * ( br - bi );
  return (float _Complex){ br * (ar - ai) + t, bi * (ar + ai) + t };

or one of its many variants. Probably not unless your target has a slow
multiplier or the relevant sums/differences are needed already anyway, but
who knows...)
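
As a rough sketch, the three-multiply variant can be written out in full and checked against the usual four-multiply form. The names mul4 and mul3 are hypothetical, and the complex value is built with the I macro from <complex.h> rather than a compound literal, on the assumption that this is accepted by more compilers.

```c
#include <assert.h>
#include <complex.h>

/* Usual form: four real multiplies, two adds. */
static float _Complex mul4(float ar, float ai, float br, float bi)
{
    return (ar * br - ai * bi) + (ar * bi + ai * br) * I;
}

/* Variant from the text: three real multiplies, five adds.
   t is the term shared by the real and imaginary parts. */
static float _Complex mul3(float ar, float ai, float br, float bi)
{
    float t = ai * (br - bi);
    float re = br * (ar - ai) + t;   /* = ar*br - ai*bi */
    float im = bi * (ar + ai) + t;   /* = ar*bi + ai*br */
    return re + im * I;
}

/* Sanity check on exactly representable inputs: (1+2i)(3+4i) = -5+10i. */
static void check_mul3(void)
{
    assert(mul3(1.0f, 2.0f, 3.0f, 4.0f) == mul4(1.0f, 2.0f, 3.0f, 4.0f));
}
```

The two forms are algebraically identical but round differently, so for general inputs the results can differ in the last bits.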

Matthijs

Thanks for the tip, it seems to be working just fine. I’ll leave it in my code until this gets fixed in Clang.

Along similar lines, couldn’t we define the -ffast-math/-freciprocal-math version of __divsc3 as:

__attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
    const float one_over_denominator = 1.0f / (br * br + bi * bi);
    return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
}

To the best of my knowledge, I’ve always seen both gcc and clang emit two [v]divss instructions when dividing two complex numbers, or taking the reciprocal of one complex number, even though only a single real divide is necessary. In my critical code whenever a divide is absolutely necessary, I have to write this out, in order to get the single divss instruction.
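
A minimal standalone sketch of the single-divide form, under a hypothetical name so it does not shadow the runtime's __divsc3, with a check on inputs where every intermediate is exact. It assumes the caller accepts -freciprocal-math style rounding and that br * br + bi * bi neither overflows nor underflows; it does no scaling and no NaN/infinity handling.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper: complex divide using one real division.
   (ar + ai*i) / (br + bi*i) = (ar + ai*i) * (br - bi*i) / (br^2 + bi^2) */
static float _Complex div_one_reciprocal(float ar, float ai, float br, float bi)
{
    const float inv = 1.0f / (br * br + bi * bi);  /* the only divide */
    const float re = (ar * br + ai * bi) * inv;
    const float im = (ai * br - ar * bi) * inv;
    return re + im * I;
}

/* Sanity check: (3+1i) / (1+1i) = 2 - 1i, with denominator 2 so that the
   reciprocal 0.5 is exact. */
static void check_div_one_reciprocal(void)
{
    assert(div_one_reciprocal(3.0f, 1.0f, 1.0f, 1.0f) == 2.0f - 1.0f * I);
}
```

Multiplying by a rounded reciprocal can differ from a true division in the last bit, which is the usual trade-off of reciprocal math.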

R Campbell