Clang ignoring -ffast-math for complex division, serious performance hit

Similar to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls wherever complex division occurs, irrespective of the -ffast-math setting. This causes a single division (which should be one or at most two divss instructions, with minimal performance impact) to significantly slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving). I’m not sure how long this has been the case; for some time now, in my critical loop, I have instead been calling an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.

Can we fix this (again), and maybe add a code comment wherever it’s needed so that this doesn’t get broken again in a couple of years?

These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code. The best way to prevent regressions in cases like this is to add some tests checking that the source patterns generate the desired output.

I can't imagine how the general case of complex division could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary). I wouldn't be that surprised if that got broken, given that we didn't have a good, exhaustive set of tests for it.
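
(For that scalar special case, a sketch with a hypothetical helper name: each component is divided independently, so two divss suffice, or two multiplies plus one reciprocal under -freciprocal-math.)

#include <complex.h>  /* CMPLXF (C11) */

/* Dividing a complex by a real scalar: componentwise divides only. */
static inline float _Complex div_by_real(float ar, float ai, float c) {
    return CMPLXF(ar / c, ai / c);
}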

John.

#include <complex.h>  /* CMPLXF (C11); the compound-literal form (float _Complex){re, im} is a non-portable Clang extension */

/* Full complex division with a single real divide: one reciprocal of |b|^2. */
__attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
    const float one_over_denominator = 1.0f / (br * br + bi * bi);
    return CMPLXF((ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator);
}

This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Until this regression, Clang and GCC both normally implemented a complex divide using two divide instructions.
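
(For contrast, the two-divide lowering has roughly this shape; a sketch with a hypothetical name, not either compiler's literal output:)

#include <complex.h>  /* CMPLXF (C11) */

/* Textbook formula: shared denominator, but two separate divides. */
static inline float _Complex div_two_divides(float ar, float ai, float br, float bi) {
    const float d = br * br + bi * bi;
    return CMPLXF((ar * br + ai * bi) / d, (ai * br - ar * bi) / d);
}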

I see. You're saying that, under -ffast-math, we can implement a full complex division with one reciprocal plus a bunch of other operations, not that we can implement it with just one division. I believe you're correct that that's acceptable under the usual assumptions of -ffast-math, and that it's generally likely to be faster, but I'm not an expert; I've CC'ed a few people who are. One of those people, Chandler, happens to be the person who last changed this (with r219557 in October 2014). If they sign off on the idea, I'd be happy to accept a patch making Clang emit an inlined version under -ffast-math using a single reciprocal.

John.

Yes, this is fine for fast-math.
- Steve

The much bigger issue is not one division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.

By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself? Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.

My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors. If this is something that interests you, we'd be very open to contributions.

John.

Are multiplication and division really the same thing for -ffast-math? There are a lot of cases where division will trivially result in horrible loss of precision. I think the situation is far worse than for multiplication.

Joerg

I suppose I mean both of those optimisations, although I don’t know the actual breakdown of the performance hit of one vs the other vs just the fact of the function call. When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.

While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.

> I suppose I mean both of those optimisations, although I don’t know the actual breakdown of the performance hit of one vs the other vs just the fact of the function call. When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.

Complex divide is a large, complicated operation when full precision and infinity-correctness is required. We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.

> While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.

Richard, let me be clear about your options here. If you're interested in working on this, that would be great, and I'd be happy to review your patches. If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.

John.

> Complex divide is a large, complicated operation when full precision and infinity-correctness is required. We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.

Perhaps not, except that the compiler was previously doing the right thing without the associated very serious performance hit.

> Richard, let me be clear about your options here. If you're interested in working on this, that would be great, and I'd be happy to review your patches. If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.

I’d be happy to submit a patch, but I don’t know where to begin with that process. I expect the solution will look nearly identical to the fix for the same problem when it occurred with __mulsc3 in July 2015. That fix seems to have happened with no explanation on the mailing list, or I’d have a head start on fixing this myself.

[+Alex]

I'd like to add that Alex L. looked at this in some detail in 2013. For some relevant notes, see PR17248 (and how divide is handled in https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). There are indeed more- and less-numerically-stable ways to implement complex division. For an extended discussion, I recommend https://arxiv.org/pdf/1210.4539v2.pdf. There are certainly versions that are reasonable to inline, especially in the fast-math context, and I support doing so. Alex found that we had to use Smith's algorithm in order to pass LAPACK's regression tests.
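
(For reference, Smith's algorithm is only a few operations more than the textbook formula; a single-precision sketch, not the LAPACK or compiler-rt source:)

#include <complex.h>  /* CMPLXF (C11) */
#include <math.h>

/* Smith (1962): compute (a+bi)/(c+di), scaling by whichever of c, d
   has larger magnitude so the intermediate products stay in range. */
static float _Complex smith_div(float a, float b, float c, float d) {
    if (fabsf(c) >= fabsf(d)) {
        const float r = d / c;
        const float t = 1.0f / (c + d * r);
        return CMPLXF((a + b * r) * t, (b - a * r) * t);
    } else {
        const float r = c / d;
        const float t = 1.0f / (c * r + d);
        return CMPLXF((a * r + b) * t, (b * r - a) * t);
    }
}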

  -Hal

> Perhaps not, except that the compiler was previously doing the right thing without the associated very serious performance hit.

It was doing an acceptable thing for -ffast-math. Chandler's patch was largely a correctness fix.

> I’d be happy to submit a patch, but I don’t know where to begin with that process. I expect the solution will look nearly identical to the fix for the same problem when it occurred with __mulsc3 in July 2015. That fix seems to have happened with no explanation on the mailing list, or I’d have a head start on fixing this myself.

I would look into the commit history for Clang's lib/CodeGen/CGExprComplex.cpp.

John.

(actually cc'ing Alex this time)

> I'd like to add that Alex L. looked at this in some detail in 2013. For some relevant notes, see PR17248 (and how divide is handled in https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). There are indeed more- and less-numerically-stable ways to implement complex division. For an extended discussion, I recommend https://arxiv.org/pdf/1210.4539v2.pdf. There are certainly versions that are reasonable to inline, especially in the fast-math context, and I support doing so. Alex found that we had to use Smith's algorithm in order to pass LAPACK's regression tests.

One more thing: we can use the cheaper (but less numerically stable) formula when we have #pragma STDC CX_LIMITED_RANGE ON.
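
(For illustration, the pragma is block-scoped in standard C; a minimal sketch with a hypothetical function name, leaving aside whether a given compiler actually honors it:)

#include <complex.h>

float _Complex limited_range_div(float _Complex a, float _Complex b) {
#pragma STDC CX_LIMITED_RANGE ON
    /* Within this block the implementation may use the plain formulas
       for complex multiply, divide, and abs. */
    return a / b;
}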

  -Hal

Again, by far the biggest performance problem right now is that it’s making a function call, rather than the specifics of the arithmetic. One would hope that simply inlining the existing __divsc3() and allowing the compiler to eliminate the inf and nan branches (which it should do automatically with -ffinite-math-only or CX_LIMITED_RANGE) would result in performance not noticeably slower than the previous baseline.

Unfortunately, no one seems to be able to tell why this problem went away for __mulsc3 in July 2015 (a change in CGExprComplex.cpp does NOT seem to have been involved) after Chandler’s earlier mods in r219557 originally caused __mulsc3 to start issuing function calls. If we knew why __mulsc3 started being inlined again, we’d be able to apply the same fix to __divsc3.

Richard

Sorry about the delay; I only saw the thread today.

There’s an existing bug report about this already, btw: https://bugs.llvm.org/show_bug.cgi?id=31872. When I looked into it last time, ICC and GCC had different fast division implementations. With -fp-model fast=1, ICC promotes floats to doubles and doubles to 80-bit long doubles to avoid loss of precision. With -fp-model fast=2, ICC uses the original type, but does one division for floats to get a reciprocal instead of doing two divisions, and two divisions (in one instruction) for complex double. GCC just uses two divisions every time, without any type promotion.
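
(A sketch of that promotion strategy applied to single precision; the helper name is hypothetical and this is not ICC's actual code:)

#include <complex.h>  /* CMPLXF (C11) */

/* Do the intermediate arithmetic in double so |b|^2 and the products
   stay in range, then round back to float once at the end. */
static float _Complex divsc_promote(float ar, float ai, float br, float bi) {
    const double r = 1.0 / ((double)br * br + (double)bi * bi);
    return CMPLXF((float)(((double)ar * br + (double)ai * bi) * r),
                  (float)(((double)ai * br - (double)ar * bi) * r));
}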

> Again, by far the biggest performance problem right now is that it’s making a function call, rather than the specifics of the arithmetic.

I think that we're all well aware of that.

> One would hope that simply inlining the existing __divsc3() and allowing the compiler to eliminate the inf and nan branches (which it should do automatically with -ffinite-math-only or CX_LIMITED_RANGE) would result in performance not noticeably slower than the previous baseline.

To be clear: CX_LIMITED_RANGE and -ffinite-math-only are different in this regard. C says, "The usual mathematical formulas for complex multiply, divide, and absolute value are problematic because of their treatment of infinities and because of undue overflow and underflow." It's the numerical stability, the "undue overflow and underflow" part, that makes them different.
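
(To make "undue overflow" concrete: b/b is exactly 1, but the textbook denominator |b|^2 overflows single precision; a small self-contained example:)

#include <stdio.h>

int main(void) {
    /* Inputs are finite and the true quotient b/b is 1.0f, yet
       |b|^2 = 2.5e61 is far beyond FLT_MAX (~3.4e38). */
    const float br = 3e30f, bi = 4e30f;
    const float denom = br * br + bi * bi;         /* overflows to inf */
    const float re = (br * br + bi * bi) / denom;  /* inf / inf = NaN  */
    printf("denom = %g, re = %g\n", denom, re);    /* prints: inf, nan */
    return 0;
}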

> Unfortunately, no one seems to be able to tell why this problem went away for __mulsc3 in July 2015 (a change in CGExprComplex.cpp does NOT seem to have been involved) after Chandler’s earlier mods in r219557 originally caused __mulsc3 to start issuing function calls. If we knew why __mulsc3 started being inlined again, we’d be able to apply the same fix to __divsc3.

I vaguely recall when this happened, and I'm pretty sure that there was indeed a change to CodeGen somewhere. I do plan to look at this.

  -Hal