LLVM is optimizing the intrinsic code as well

We have some code that is written by hand with intrinsics, but LLVM optimizes it further because of the --fast-math flag. The hand-written intrinsic version is better than the one LLVM produces. Example source code:

template <>
inline __m256 simd_evaluate_polynomial<__m256, APPROX_DEFAULT>(__m256 x, const std::array<__m256, APPROX_DEFAULT + 1>& coeff)
{
  __m256 power = _mm256_set1_ps(1.0f);  // x^0
  __m256 res = _mm256_set1_ps(0.0f);    // running sum
  for (unsigned int i = 0; i <= APPROX_DEFAULT; i++) {
    __m256 term = _mm256_mul_ps(coeff[i], power);  // coeff[i] * x^i
    power = _mm256_mul_ps(power, x);               // x^(i+1)
    res = _mm256_add_ps(res, term);
  }
  return res;
}

For the above function, LLVM generates the following assembly:

Address      Source Line  Assembly                                                 CPU Time: Total  CPU Time: Self
0x1402bbf7d  0            Block 1:
0x1402bbf7d  19           vmovaps ymm5, ymmword ptr [rip+0x50e4b5b]                0.1%             15.584ms
0x1402bbf85  19           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b32]      0.1%             15.595ms
0x1402bbf8e  19           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b09]      0.6%             93.654ms
0x1402bbf97  19           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ae0]      0.2%             31.178ms
0x1402bbfa0  21           vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ab7]      0.3%             46.992ms

Can anyone please explain why this is happening?

Has anyone seen the same issue with the LLVM compiler? Why is hand-written intrinsic code optimized further by the compiler?

This behavior is intentional and expected. Those are compiler intrinsics,
and the compiler is free to implement them as it sees fit. There is no strict
guarantee that they will codegen into the assembly their name implies.
If you really need the latter, you might need to look into inline assembly,
although I would advise against that.
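
To make the transformation concrete: each vfmadd213ps in the listing computes
ymm5 = ymm3 * ymm5 + [memory operand], so the chain looks like Horner's scheme
rather than the separate multiply/multiply/add sequence in the source loop.
Below is a minimal scalar sketch of the two forms. The names eval_power_form
and eval_horner are hypothetical and plain float stands in for __m256; this is
only an illustration of the rewrite, not the code from the question.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar stand-in for the hand-written loop:
// keep a running power of x and add coeff[i] * x^i each iteration.
template <std::size_t N>
float eval_power_form(float x, const std::array<float, N>& coeff) {
    static_assert(N > 0, "need at least one coefficient");
    float power = 1.0f;  // x^0
    float res = 0.0f;    // running sum
    for (std::size_t i = 0; i < N; ++i) {
        res += coeff[i] * power;
        power *= x;
    }
    return res;
}

// Horner's scheme: start from the highest coefficient and do one
// multiply-add per remaining coefficient. This is the shape the
// vfmadd213ps chain in the listing appears to have.
template <std::size_t N>
float eval_horner(float x, const std::array<float, N>& coeff) {
    static_assert(N > 0, "need at least one coefficient");
    float res = coeff[N - 1];
    for (std::size_t i = N - 1; i-- > 0;) {
        res = res * x + coeff[i];
    }
    return res;
}
```

The two functions are equal in exact arithmetic, but they round differently
in floating point, so a compiler would normally need fast-math style
permission to turn one into the other.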