Hi there, I notice that our current implementation of fast division transformation (turn `a / b` into `a * (1/b)`) is worse in precision compared with GCC. Like this case in ppc64le:

```
float fdiv\(unsigned int a, unsigned int b\) \{
return \(float\)a / \(float\)b;
\}
```

Result of Clang -Ofast is 41A00001 (in Hex), while GCC produces 41A00000 which is the same as no optimizations opened.

Currently, DAGCombiner uses `BuildReciprocalEstimate` to calculate the reciprocal (`1/b`) first and multiply it with `a`. But if we put the operand `a` into iterations in the estimate function, the result would be better.

Patching such a change may break several existing test cases in different platforms since it’s target-independent code. So any suggestions are welcome. Thanks.

Regards,

Qiu Chaofan

Qiu Chaofan,

Yes, clearly, two floating point operations instead of one will increase the degree of resulting error already present in the necessarily or commonly fixed length number representations.

The reason for the two operations appears to be that there may be machine instructions for a reciprocal that when combined with a multiplication obtains fewer machine cycles than a division.

The trade-off is then precision vs. speed. There may be additional computations along this line and perhaps an additional compile flag, along with code changes, would allow that choice.

Regards, Neil Nelson

Hi there, I notice that our current implementation of fast division transformation (turn `a / b` into `a * (1/b)`) is worse in precision compared with GCC. Like this case in ppc64le:

```
float fdiv\(unsigned int a, unsigned int b\) \{
return \(float\)a / \(float\)b;
\}
```

Result of Clang -Ofast is 41A00001 (in Hex), while GCC produces 41A00000 which is the same as no optimizations opened.

Currently, DAGCombiner uses `BuildReciprocalEstimate` to calculate the reciprocal (`1/b`) first and multiply it with `a`. But if we put the operand `a` into iterations in the estimate function, the result would be better.

Patching such a change may break several existing test cases in different platforms since it’s target-independent code. So any suggestions are welcome. Thanks.

Test cases can be changed if the result is universally better, and

alternatively, we can introduce a way for the target to control the

behavior (e.g., how we choose between buildSqrtNROneConst and

buildSqrtNRTwoConst). What's the effect on performance?

-Hal

Hal,

Yes, speed is an important factor of making dicision. Here I just put the numerator into estimation, so it won't add any more instructions. A simple benchmark below keeps the same running time between the demo and current master:

```
float fdiv(unsigned int a, unsigned int b) {
return (float)a / (float)b;
}
float m;
__attribute__((noinline)) void foo() {
m = 0.0;
}
int main() {
for (int i = 1; i < 1000000; ++i)
for (int j = 1; j < 30000; ++j) {
m = fdiv(i, j);
foo();
}
}
```

Regards,

Qiu Chaofan

I think that it’s certainly worth posting a patch and then we can evaluate it.

Thanks again,

Hal

Hal,

Here is the patch. Thanks.

Regards,

Qiu Chaofan

recip-new.patch (3.23 KB)

Hi, Qiu Chaofan,

Can you please upload the patch to reviews.llvm.org? It’s much easier for me to review patches there (see https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions).

Thanks again,

Hal