Should LLVM optimize 1.0 / x?

Hi,

Here is a small C++ program:

vec.cc:

#include <cmath>

using v4f32 = float __attribute__((__vector_size__(16)));

v4f32 fct1(v4f32 x)
{
  return 1.0 / x;
}

v4f32 fct2(v4f32 x)
{
  return __builtin_ia32_rcpps(x);
}

Which is compiled to:

vec.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_Z4fct1Dv4_f>:
   0: c4 e2 79 18 0d 00 00 vbroadcastss 0x0(%rip),%xmm1 # 9 <_Z4fct1Dv4_f+0x9>
   7: 00 00
   9: c5 f0 5e c0 vdivps %xmm0,%xmm1,%xmm0
   d: c3 retq
   e: 66 90 xchg %ax,%ax

0000000000000010 <_Z4fct2Dv4_f>:
  10: c5 f8 53 c0 vrcpps %xmm0,%xmm0
  14: c3 retq

As you can see, 1.0 / x is not turned into vrcpps. Is it because of
precision or a missing optimization?

Regards,

Hi Alexandre,

Have you tried to compile this with fast-math enabled (`-ffast-math` https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior)?

I would expect LLVM to require the `arcp` flag to perform this optimization (https://www.llvm.org/docs/LangRef.html#fast-math-flags).
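As a rough illustration of why such a flag is needed: rewriting `a / b` as `a * (1 / b)` rounds twice instead of once, so the two forms can produce different floats. The sketch below counts such cases over small integers. The function name `count_reciprocal_mismatches` is made up for this example, and `volatile` is used only to keep the compiler from simplifying either form.

```cpp
// Count pairs (a, b) in [1, n] x [1, n] for which a / b and a * (1 / b)
// round to different single-precision results.
// Hypothetical helper for illustration; not part of LLVM or clang.
int count_reciprocal_mismatches(int n)
{
  int mismatches = 0;
  for (int i = 1; i <= n; ++i) {
    for (int j = 1; j <= n; ++j) {
      // volatile forces each intermediate to be stored as a float and
      // prevents the compiler from rewriting one form into the other.
      volatile float a = static_cast<float>(i);
      volatile float b = static_cast<float>(j);
      volatile float quotient = a / b;     // one rounding
      volatile float recip = 1.0f / b;     // first rounding
      volatile float product = a * recip;  // second rounding
      if (quotient != product)
        ++mismatches;
    }
  }
  return mismatches;
}
```

A nonzero count for a small `n` shows that the rewrite changes results, which is why it is only allowed when reciprocal math has been explicitly permitted.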

Cheers,
-Quentin

Hi Quentin,

You are correct. I managed to get clang to use vrcpps, but not in
a satisfying way:

clang++ -O3 -march=native -mtune=native \
-Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
-Rpass-analysis=loop-vectorize \
-ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
-c -o vec.o vec.cc

0000000000000140 <_Z4fct4Dv4_f>:
  140: c5 f8 53 c8 vrcpps %xmm0,%xmm1
  144: c4 e2 79 18 15 00 00 vbroadcastss 0x0(%rip),%xmm2 # 14d <_Z4fct4Dv4_f+0xd>
  14b: 00 00
  14d: c4 e2 71 ac c2 vfnmadd213ps %xmm2,%xmm1,%xmm0
  152: c4 e2 71 98 c1 vfmadd132ps %xmm1,%xmm1,%xmm0
  157: c3 retq
  158: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
  15f: 00

0000000000000160 <_Z4fct5Dv4_f>:
  160: c5 f8 53 c0 vrcpps %xmm0,%xmm0
  164: c3 retq

As you can see, fct4 is not equivalent to fct5.

Regards,
Alexandre Bique

Perhaps it's better :wink:

It looks like the compiler has generated one Newton iteration after the estimate to increase the precision of the answer. The reciprocal estimate is, after all, only an estimate, and for many applications, is not sufficient on its own.
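To make the refinement step concrete, here is a minimal scalar sketch of one Newton-Raphson iteration for the reciprocal. It appears to match the two FMA instructions in the listing, assuming the broadcast constant is 1.0. `rough_recip` is a hypothetical stand-in for the hardware estimate, with an arbitrary error chosen for illustration; it does not model the actual accuracy of vrcpps.

```cpp
#include <cmath>

// Hypothetical stand-in for a hardware reciprocal estimate: the exact
// reciprocal with a deliberate relative error of about 1e-3. This is an
// assumption so the sketch runs on any target, not a model of vrcpps.
float rough_recip(float x)
{
  return (1.0f / x) * 1.001f;
}

// One Newton-Raphson step for the reciprocal of x, starting from r0:
//   e  = 1 - x * r0     (compare vfnmadd213ps in the listing)
//   r1 = r0 + r0 * e    (compare vfmadd132ps in the listing)
float refine_recip(float x, float r0)
{
  float e = std::fma(-x, r0, 1.0f);
  return std::fma(r0, e, r0);
}

// Relative error of r as an approximation of 1/x, evaluated in double.
double rel_error(float x, float r)
{
  return std::fabs(static_cast<double>(x) * r - 1.0);
}
```

If the estimate has relative error d, one step leaves an error of roughly d squared, so each iteration about doubles the number of correct bits, up to the limits of float rounding.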

This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or -mrecip=all:0) to turn off all of the Newton iterations.

-Hal

> Perhaps it's better :wink:
>
> It looks like the compiler has generated one Newton iteration after the
> estimate to increase the precision of the answer. The reciprocal
> estimate is, after all, only an estimate, and for many applications, is
> not sufficient on its own.

Yes.

> This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or
> -mrecip=all:0) to turn off all of the Newton iterations.

Thank you very much! It did the job!