AVX2 codegen - question reg. FMA generation

Hello,

On the appended, reasonably simple test case that has an fmul/fadd
sequence on <8 x float> vector types, I don't see the x86-64 code
generator (with the CPU set to haswell or a later type) turning it into
an AVX2 FMA instruction. Here's the snippet in the output it generates:

$ llc -O3 -mcpu=skylake

---------------------
.LBB0_2: # =>This Inner Loop Header: Depth=1
vbroadcastss (%rsi,%rdx,4), %ymm0
vmulps (%rdi,%rcx), %ymm0, %ymm0
vaddps (%rax,%rcx), %ymm0, %ymm0
vmovups %ymm0, (%rax,%rcx)
incq %rdx
addq $32, %rcx
cmpq $15, %rdx
jle .LBB0_2
-----------------------

$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake
(llvm commit 198009ae8db11d7c0b0517f17358870dc486fcfb from Aug 31)

Using opt -O3 followed by llc leads to the same vmulps / vaddps
sequence. (Adding -mattr=fma doesn't help, although I assume it
isn't needed given the CPU type.) The result is the same even with
-mcpu=haswell.

This is a common pattern in a reduction with two operands on the
RHS. The three operands in play here are (%rax,%rcx), (%rdi,%rcx),
and %ymm0. If another register were used to hold one of the loaded
values, a vfmadd instruction could be used in several ways. I suspect
I'm missing something, which is why I'm not already posting this on
llvm-bugs. Is this expected behavior?

-------------------------------------------------------------------------------------------
; ModuleID = 'LLVMDialectModule'
source_filename = "LLVMDialectModule"

declare i8* @malloc(i64)

declare void @free(i8*)

define <8 x float>* @fma(<8 x float>* %0, float* %1, <8 x float>* %2) {
  br label %4

4: ; preds = %7, %3
  %5 = phi i64 [ %19, %7 ], [ 0, %3 ]
  %6 = icmp slt i64 %5, 16
  br i1 %6, label %7, label %20

7: ; preds = %4
  %8 = getelementptr <8 x float>, <8 x float>* %0, i64 %5
  %9 = load <8 x float>, <8 x float>* %8, align 16
  %10 = getelementptr float, float* %1, i64 %5
  %11 = load float, float* %10, align 16
  %12 = getelementptr <8 x float>, <8 x float>* %2, i64 %5
  %13 = load <8 x float>, <8 x float>* %12, align 16
  %14 = insertelement <8 x float> undef, float %11, i32 0
  %15 = shufflevector <8 x float> %14, <8 x float> undef, <8 x i32> zeroinitializer
  %16 = fmul <8 x float> %15, %9
  %17 = fadd <8 x float> %16, %13
  %18 = getelementptr <8 x float>, <8 x float>* %2, i64 %5
  store <8 x float> %17, <8 x float>* %18, align 16
  %19 = add i64 %5, 1
  br label %4

20: ; preds = %4
  ret <8 x float>* %2
}

It appears you need 'reassoc' on fmul/fadd:
https://godbolt.org/z/nuTzx2

Roman

Fusing of the fadd and fmul is not allowed by default.
http://llvm.org/docs/LangRef.html#floating-point-environment

‘contract’ on the fadd (plus an FMA-capable target) is the minimum requirement; ‘reassoc’ will also work, but it may enable other (possibly unintended) transforms.
https://godbolt.org/z/-k6G2h

define float @fma(float %x, float %y, float %z) {
  %m = fmul float %x, %y
  %a = fadd contract float %m, %z
  ret float %a
}

> It appears you need 'reassoc' on fmul/fadd:
> https://godbolt.org/z/nuTzx2

Thanks very much, that was it. Either that or providing
-enable-unsafe-fp-math to llc yielded FMAs. I didn't expect this since
using FMAs here instead of mul/add appears to be safer (the reverse is
unsafe).

~ Uday

It goes in both directions. There are expressions that are more accurate when evaluated with FMAs, but also cases (albeit less common) where replacing a mul-then-add with an FMA causes problems.

One example (due to Kahan) is that x^2 - y^2 evaluated as fma(-y, y, fmul(x, x)) can result in a negative value for x = y (if the product x*x rounded down), whereas fsub(fmul(x, x), fmul(y, y)) won't, since both products round identically.

-Fabian

Thank you very much, Fabian and others.

~ Uday