Hi,
I want an efficient way to implement the pow function in LLVM instead of invoking the pow() math builtin. I am clear on the algorithm part, but I am not quite sure which parts of LLVM I should change to replace the builtin pow with a more efficient pow implementation. Any comments and feedback are appreciated. Thanks!
The llvm.pow() intrinsic just calls pow() from libm; if you want to replace it with a different implementation, you can just write a function called pow() and link it into your program.
If you're looking for the part of LLVM which optimizes pow() calls, see LibCallSimplifier::optimizePow in lib/Transforms/Utils/SimplifyLibCalls.cpp.
Eli
In Numba, we have decided to optimize some usages of the power function
in our own frontend, so that LLVM IR gets an already optimized form,
as we have found that otherwise LLVM may miss some optimization
opportunities. YMMV.
(e.g. we detect that the exponent is a compile-time constant and
transform `x**3` into `x*x*x`)
Note that not only can `pow(x, 3)` be slower than `x*x*x`, but it may
also have some precision issues, as it goes through log() and exp()
calls.
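The general form of this strength reduction can be sketched in C (an illustration of the technique, not Numba's actual implementation): when the exponent is known at compile time to be a small integer n, pow(x, n) can be expanded into multiplications via exponentiation by squaring.

```c
/* Expand x**n into multiplications, one squaring per bit of n.
   A frontend would emit the unrolled multiplications directly for a
   constant n; this loop shows the underlying scheme. */
static double pow_const_exp(double x, unsigned n) {
    double result = 1.0;
    double base = x;
    while (n > 0) {
        if (n & 1u)        /* current bit of n set: multiply it in */
            result *= base;
        base *= base;      /* square the base for the next bit */
        n >>= 1;
    }
    return result;
}
```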
Regards
Antoine.
It seems to me that it would be more interesting to gather these missed optimizations and fix LLVM to catch them.
Detecting a constant exponent and turning `x**3` into `x*x*x` seems definitely within the scope of what LLVM could do, potentially with TTI.
LLVM already does this... but only if the pow() call is marked "fast". IEEE 754 pow() is supposed to be correctly rounded, but (x*x)*x has an extra rounding step.
Eli
pow( ) is not supposed to be correctly rounded.
IEEE 754 recommends that a correctly rounded power function exist [9.2], but does not recommend or require that this be the default pow( ) function. [Note: it was not actually shown until quite late in the IEEE 754 revision process that it was even possible to have a reasonably efficient correctly-rounded pow( ) function, and such a function is at minimum an order of magnitude slower than a good sub-ulp-accurate-but-not-correctly-rounded implementation. Most 754 committee members would recommend that the default pow( ) function *not* be correctly rounded.]
In a good math library, however, pow( ) should absolutely be sub-ulp accurate, which means that it should *not* be implemented via log( ) and exp( ), and indeed, most math libraries don’t do that.
Just to provide some example data, for single-precision x in [1,2) on current OS X:
The worst-case error of x*x*x is 1.28736 ulp
The worst-case error of powf(x, 3) is 0.500013 ulp
The RMS error of x*x*x is 0.585066 ulp
The RMS error of powf(x,3) is 0.499984 ulp
– Steve
So we should use x*x*x even without fast-math…
… or maybe not!
I’m not sure what to think of these results. For instance, what’s the perf impact of calling into powf(x, 3) on OS X vs using x*x*x?
At the source level, what could the user do to decide that 1.28 ulp is acceptable for their use case?
Large! x*x*x has a reciprocal throughput of 1 cycle, plus the opportunity for the compiler to auto-vectorize. powf(x, 3) is much harder to vectorize, and has a reciprocal throughput of ~40 cycles on OS X (which is on the faster side; it’s more like 100 cycles on some platforms).
Write x*x*x. =)
I can see an argument for adding a flag that allows a few ulp of error in math functions (roughly what Intel’s LA versions in MKL or some of the YEPPP implementations provide), but there is a question of how many users would actually find it and use it.
– Steve
Ahah!
Just to be sure (even though I’m sure you already know my thoughts on this), the case I’m thinking about is a function calling powf(x, y) where, after inlining and other optimizations, y is identified as 3.
Yeah, I didn’t imagine any good way to make it discoverable and easy to use, unfortunately.
Working as a consultant in high-performance computing, I have come across a lot of these problems. Here are my thoughts on it:
- The IEEE 754-2008 standard defines the rounding modes for +, -, *, / and sqrt, which should give the result of those operations rounded toward -infinity, toward +infinity, toward 0, or to the nearest floating point, the default mode being rounding to nearest.
- For all other functions, I would say that “it is open bar” for the platform. The reason is that even though it is possible to write libraries such that, given a double x, std::exp(x) will be the exact value rounded to the nearest floating point, these algorithms are extremely costly. Most people don’t need such accuracy, and an “engineering trade-off” has to be made. Think, for instance, about the following:
Bear in mind that writing a very efficient x^n is not that straightforward. Even if it’s a corner case, the example of x^7 is interesting. If you apply the divide-and-conquer algorithm, you get the following:
double pow_7(double x) {
  const double x2 = x * x;
  const double x3 = x * x2;
  const double x6 = x3 * x3;
  const double x7 = x * x6;
  return x7;
}
But you can also write
double pow_7(double x) {
  const double x2 = x * x;
  const double x3 = x * x2;
  const double x4 = x2 * x2;
  const double x7 = x3 * x4;
  return x7;
}
which is a different algorithm using the same number of multiplications. But, because of instruction-level parallelism, they don’t get the same performance. In the first implementation, every single instruction has to wait for the previous one to finish before it can start. In the second one, x3 and x4 can be computed in parallel, as they don’t depend upon each other.
The performance difference between the two (compiled with -O3 and -mavx2):
- clang 3.9.1: the first one is 25% slower than the second one
- intel 17.0.1: the first one is 0-5% slower than the second one
- gcc 6.2.0: the first one is 3% slower than the second one (but both are 2 times slower than clang and intel)
Attached is the code supporting my claim.
François Fayard
Is the original question here not about pow(int, int), rather than pow(float, int)? In which case the problem of rounding or precision becomes less of an issue (of course, for larger values, there will be overflows instead, but that’s still a problem if you do `int a = pow(x, y)` where `x**y` is out of range for the int type).
In my experience, it’s not at all unlikely to find code like `int a = pow(2, 3);`
in the wild, often with the resulting questions about “why is my result 7, not 8”, because even with 0.5 ulp accuracy, it’s not (always) EXACTLY the integer value that comes out of the function…
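One way to sidestep this class of bug is a dedicated integer power function, so no floating-point rounding is involved at all. A sketch, using exponentiation by squaring (no overflow checking; the result silently wraps for large inputs, so real code should guard for that):

```c
/* Integer power by squaring: exact for results that fit in long long. */
static long long ipow(long long base, unsigned e) {
    long long result = 1;
    while (e > 0) {
        if (e & 1u)
            result *= base;  /* multiply in the current bit of e */
        base *= base;        /* square for the next bit; may overflow
                                before the last bit for large inputs */
        e >>= 1;
    }
    return result;
}
```

With this, `int a = ipow(2, 3);` is exactly 8, with no rounding step anywhere.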