We have some code that is hand-written with AVX intrinsics, but LLVM optimizes it further because the --fast-math flag is set. For us, the hand-written intrinsic version performs better than the LLVM-optimized one. Example source code:
    // Explicit specialization for 8-wide float vectors.
    template <>
    inline __m256 simd_evaluate_polynomial<__m256, APPROX_DEFAULT>(__m256 x, const std::array<__m256, APPROX_DEFAULT + 1>& coeff)
    {
        __m256 power = _mm256_set1_ps(1.0f);  // x^0
        __m256 res = _mm256_set1_ps(0.0f);    // running sum
        for (unsigned int i = 0; i <= APPROX_DEFAULT; i++) {
            __m256 term = _mm256_mul_ps(coeff[i], power);  // coeff[i] * x^i
            power = _mm256_mul_ps(power, x);               // x^(i+1)
            res = _mm256_add_ps(res, term);
        }
        return res;
    }
For the above function, LLVM generates the following assembly (as shown in the profiler):
Address      Source Line  Assembly                                              CPU Time: Total  CPU Time: Self
0x1402bbf7d  0            Block 1:
0x1402bbf7d  19           vmovaps ymm5, ymmword ptr [rip+0x50e4b5b]             0.1%             15.584ms
0x1402bbf85  19           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b32]   0.1%             15.595ms
0x1402bbf8e  19           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b09]   0.6%             93.654ms
0x1402bbf97  19           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ae0]   0.2%             31.178ms
0x1402bbfa0  21           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ab7]   0.3%             46.992ms
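
As far as I can tell, the separate multiplies and adds from the source are gone, and what remains is a chain of fused multiply-adds (each vfmadd213ps computes ymm5 = ymm5 * ymm3 + constant), which looks like Horner's scheme with the coefficients folded into constants. To make the difference concrete, here is a minimal scalar sketch of the two evaluation orders. The function names eval_power_form and eval_horner_fma are made up for illustration and are not part of our code base; plain float stands in for __m256, and the coefficient count N is a placeholder for APPROX_DEFAULT + 1.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Scalar stand-in for the intrinsic loop in the question (hypothetical name):
// accumulate coeff[i] * x^i while building up the power of x step by step.
template <std::size_t N>
float eval_power_form(float x, const std::array<float, N>& coeff)
{
    float power = 1.0f;
    float res = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
        res += coeff[i] * power;  // separate multiply and add
        power *= x;               // extra multiply per term
    }
    return res;
}

// Horner's scheme with fused multiply-add (hypothetical name). Each step
// computes acc = acc * x + coeff[i], the same operation shape as one
// vfmadd213ps in the listing.
template <std::size_t N>
float eval_horner_fma(float x, const std::array<float, N>& coeff)
{
    static_assert(N > 0, "need at least one coefficient");
    float acc = coeff[N - 1];                 // start from the highest-order coefficient
    for (std::size_t i = N - 1; i-- > 0;) {   // visits N-2, N-3, ..., 0
        acc = std::fma(acc, x, coeff[i]);
    }
    return acc;
}
```

Both functions evaluate the same polynomial, but the operations are ordered differently and each FMA rounds only once, so the results can differ in the last bits for general inputs. With five coefficients the Horner form needs four FMAs, which would be consistent with the four vfmadd213ps instructions in the listing.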
Can anyone please explain why this is happening?