Fusing contract fadd/fsub with normal fmul

Hi,

On LLVM 5.0 (current trunk), an fadd/fsub and an fmul that are both
marked with `contract` or `fast` can be merged into an fma instruction
by the backend.

I'm wondering about the exact semantics of this new flag (and of
`fast`), and in particular whether it would be valid to do this when
only the `fadd`/`fsub` (and not the `fmul`) is marked with `contract`,
or at least with `fast`. The reasoning is that fusing has an effect
similar to performing the `fadd`/`fsub` outside the IEEE spec, so a
single flag on that instruction should be enough to permit the
transformation.
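
To make the question concrete (the value names below are made up, not
from real output), the case I have in mind is roughly:

  %mul = fmul double %a, %b                      ; plain IEEE multiply, no flags
  %acc.next = fadd contract double %acc, %mul    ; only the add carries a flag

and the question is whether the backend may select this as if it were

  %acc.next = call double @llvm.fma.f64(double %a, double %b, double %acc)

i.e. whether the `contract` (or `fast`) flag on the add alone licenses
dropping the intermediate rounding of the multiply.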

The particular case I'm interested in is a vectorized loop with a
reduction, as in the pseudo-C code `s += a[i] * b[i]`. Our front end
recognizes this pattern and marks the `+` as `fast` to enable
vectorization. It would be great if this were also enough to let the
reduction be done with `fma` instructions.

Yichao Yu

It seems like the contract flag is underspecified in this regard. I'd lean, however, toward requiring it on both instructions in order to contract them. That way inlining a function where contraction was prohibited into a function where contraction was permitted would not be able to effectively remove the final-result rounding from the callee.

  -Hal
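
As a minimal sketch of that hazard (hypothetical functions, not taken
from the original example):

  ; Callee built without contraction: callers may rely on the rounded product.
  define double @callee(double %a, double %b) {
    %mul = fmul double %a, %b          ; no 'contract' flag: result is rounded
    ret double %mul
  }

  ; Caller built with contraction enabled on its own operations.
  define double @caller(double %a, double %b, double %c) {
    %p = call double @callee(double %a, double %b)
    %r = fadd contract double %p, %c
    ret double %r
  }

After inlining, %p becomes an unflagged fmul feeding a `contract`
fadd. If the flag on the fadd alone were sufficient, the pair could be
fused and the final-result rounding that @callee guaranteed would
silently disappear; requiring the flag on both instructions avoids that.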

For reference, the FMF ‘contract’ patches are listed here:
https://bugs.llvm.org/show_bug.cgi?id=25721#c6

If we can make the documentation better, that would certainly be a welcome patch.

It would be better to see the IR for your example(s), but I think you’d need ‘contract’ on both the fmul and fadd to generate an FMA. Conservatively, we wouldn’t alter the result if either component somehow required strict FP. To vectorize, you probably need ‘fast’ on both ops because vectorization would be changing the order of operations (reassociation).
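
In other words (a rough sketch of the intended form, not exact compiler
output), contraction is licensed when both operations carry the flag:

  %mul = fmul contract double %a, %b
  %acc.next = fadd contract double %acc, %mul

which an FMA-capable backend may then select as the equivalent of

  %acc.next = call double @llvm.fma.f64(double %a, double %b, double %acc)

while vectorizing the surrounding reduction additionally needs `fast`
on the adds, since it reassociates the accumulation.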

The IR of the scalar loop is

if13:                                             ; preds = %scalar.ph, %if13
 %s.124 = phi double [ %51, %if13 ], [ %bc.merge.rdx, %scalar.ph ]
 %"i#672.023" = phi i64 [ %52, %if13 ], [ %bc.resume.val, %scalar.ph ]
 %46 = getelementptr double, double* %13, i64 %"i#672.023"
 %47 = load double, double* %46, align 8
 %48 = getelementptr double, double* %15, i64 %"i#672.023"
 %49 = load double, double* %48, align 8
 %50 = fmul double %47, %49
 %51 = fadd fast double %s.124, %50
 %52 = add nuw nsw i64 %"i#672.023", 1
 %53 = icmp slt i64 %52, %9
 br i1 %53, label %if13, label %L11.outer.split.L11.outer.split.split_crit_edge.outer.loopexit

And it can be vectorized to

vector.body:                                      ; preds = %vector.body, %vector.ph
 %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
 %vec.phi = phi <4 x double> [ %19, %vector.ph ], [ %40, %vector.body ]
 %vec.phi94 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %41, %vector.body ]
 %vec.phi95 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %42, %vector.body ]
 %vec.phi96 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %43, %vector.body ]
 %20 = getelementptr double, double* %13, i64 %index
 %21 = bitcast double* %20 to <4 x double>*
 %wide.load = load <4 x double>, <4 x double>* %21, align 8
 %22 = getelementptr double, double* %20, i64 4
 %23 = bitcast double* %22 to <4 x double>*
 %wide.load100 = load <4 x double>, <4 x double>* %23, align 8
 %24 = getelementptr double, double* %20, i64 8
 %25 = bitcast double* %24 to <4 x double>*
 %wide.load101 = load <4 x double>, <4 x double>* %25, align 8
 %26 = getelementptr double, double* %20, i64 12
 %27 = bitcast double* %26 to <4 x double>*
 %wide.load102 = load <4 x double>, <4 x double>* %27, align 8
 %28 = getelementptr double, double* %15, i64 %index
 %29 = bitcast double* %28 to <4 x double>*
 %wide.load103 = load <4 x double>, <4 x double>* %29, align 8
 %30 = getelementptr double, double* %28, i64 4
 %31 = bitcast double* %30 to <4 x double>*
 %wide.load104 = load <4 x double>, <4 x double>* %31, align 8
 %32 = getelementptr double, double* %28, i64 8
 %33 = bitcast double* %32 to <4 x double>*
 %wide.load105 = load <4 x double>, <4 x double>* %33, align 8
 %34 = getelementptr double, double* %28, i64 12
 %35 = bitcast double* %34 to <4 x double>*
 %wide.load106 = load <4 x double>, <4 x double>* %35, align 8
 %36 = fmul <4 x double> %wide.load, %wide.load103
 %37 = fmul <4 x double> %wide.load100, %wide.load104
 %38 = fmul <4 x double> %wide.load101, %wide.load105
 %39 = fmul <4 x double> %wide.load102, %wide.load106
 %40 = fadd fast <4 x double> %vec.phi, %36
 %41 = fadd fast <4 x double> %vec.phi94, %37
 %42 = fadd fast <4 x double> %vec.phi95, %38
 %43 = fadd fast <4 x double> %vec.phi96, %39
 %index.next = add i64 %index, 16
 %44 = icmp eq i64 %index.next, %n.vec
 br i1 %44, label %middle.block, label %vector.body

If contracting a normal fmul with a fast fadd is allowed, both loops can use fma.
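
As a rough sketch (reusing the value names from the dumps above), the
fused bodies would then be equivalent to something like:

  ; scalar body
  %51 = call double @llvm.fma.f64(double %47, double %49, double %s.124)

  ; vector body (one of the four accumulator updates)
  %40 = call <4 x double> @llvm.fma.v4f64(<4 x double> %wide.load, <4 x double> %wide.load103, <4 x double> %vec.phi)

with the actual fusion happening in the backend (e.g. as vfmadd-style
machine instructions) rather than as literal llvm.fma calls in the IR.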