For reference, the FMF 'contract' patches are listed here:
PR25721 – fp-contract (FMA) isn't always captured in LLVM IR
If we can make the documentation better, that would certainly be a welcome
patch.
It would be better to see the IR for your example(s), but I think you'd need
the 'contract' flag on the fmul as well before the backend could form an FMA.
The IR of the scalar loop is:

if13:                                    ; preds = %scalar.ph, %if13
  %s.124 = phi double [ %51, %if13 ], [ %bc.merge.rdx, %scalar.ph ]
  %"i#672.023" = phi i64 [ %52, %if13 ], [ %bc.resume.val, %scalar.ph ]
  %46 = getelementptr double, double* %13, i64 %"i#672.023"
  %47 = load double, double* %46, align 8
  %48 = getelementptr double, double* %15, i64 %"i#672.023"
  %49 = load double, double* %48, align 8
  %50 = fmul double %47, %49
  %51 = fadd fast double %s.124, %50
  %52 = add nuw nsw i64 %"i#672.023", 1
  %53 = icmp slt i64 %52, %9
  br i1 %53, label %if13, label %L11.outer.split.L11.outer.split.split_crit_edge.outer.loopexit
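For illustration, if contraction were allowed here, the fmul and the
reduction fadd could collapse into one fused operation. A minimal
hand-written sketch using the llvm.fmuladd intrinsic (not actual compiler
output; llvm.fmuladd leaves it to the backend to fuse when profitable,
which matches the 'contract' semantics):

  declare double @llvm.fmuladd.f64(double, double, double)

  ; fused replacement for the %50/%51 pair above
  %51 = call fast double @llvm.fmuladd.f64(double %47, double %49, double %s.124)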
And the scalar loop can be vectorized to:
vector.body:                             ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ %19, %vector.ph ], [ %40, %vector.body ]
  %vec.phi94 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %41, %vector.body ]
  %vec.phi95 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %42, %vector.body ]
  %vec.phi96 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %43, %vector.body ]
  %20 = getelementptr double, double* %13, i64 %index
  %21 = bitcast double* %20 to <4 x double>*
  %wide.load = load <4 x double>, <4 x double>* %21, align 8
  %22 = getelementptr double, double* %20, i64 4
  %23 = bitcast double* %22 to <4 x double>*
  %wide.load100 = load <4 x double>, <4 x double>* %23, align 8
  %24 = getelementptr double, double* %20, i64 8
  %25 = bitcast double* %24 to <4 x double>*
  %wide.load101 = load <4 x double>, <4 x double>* %25, align 8
  %26 = getelementptr double, double* %20, i64 12
  %27 = bitcast double* %26 to <4 x double>*
  %wide.load102 = load <4 x double>, <4 x double>* %27, align 8
  %28 = getelementptr double, double* %15, i64 %index
  %29 = bitcast double* %28 to <4 x double>*
  %wide.load103 = load <4 x double>, <4 x double>* %29, align 8
  %30 = getelementptr double, double* %28, i64 4
  %31 = bitcast double* %30 to <4 x double>*
  %wide.load104 = load <4 x double>, <4 x double>* %31, align 8
  %32 = getelementptr double, double* %28, i64 8
  %33 = bitcast double* %32 to <4 x double>*
  %wide.load105 = load <4 x double>, <4 x double>* %33, align 8
  %34 = getelementptr double, double* %28, i64 12
  %35 = bitcast double* %34 to <4 x double>*
  %wide.load106 = load <4 x double>, <4 x double>* %35, align 8
  %36 = fmul <4 x double> %wide.load, %wide.load103
  %37 = fmul <4 x double> %wide.load100, %wide.load104
  %38 = fmul <4 x double> %wide.load101, %wide.load105
  %39 = fmul <4 x double> %wide.load102, %wide.load106
  %40 = fadd fast <4 x double> %vec.phi, %36
  %41 = fadd fast <4 x double> %vec.phi94, %37
  %42 = fadd fast <4 x double> %vec.phi95, %38
  %43 = fadd fast <4 x double> %vec.phi96, %39
  %index.next = add i64 %index, 16
  %44 = icmp eq i64 %index.next, %n.vec
  br i1 %44, label %middle.block, label %vector.body
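Assuming the four fmul/fadd pairs may be contracted as well, the same
sketch applies to the vector body; the first pair would become

  declare <4 x double> @llvm.fmuladd.v4f64(<4 x double>, <4 x double>, <4 x double>)

  ; fused replacement for the %36/%40 pair
  %40 = call fast <4 x double> @llvm.fmuladd.v4f64(<4 x double> %wide.load, <4 x double> %wide.load103, <4 x double> %vec.phi)

and likewise for %41 through %43, so an x86 target with FMA could lower
each pair to a single vfmadd231pd.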
If contracting a plain fmul with a fast fadd is allowed, both loops could use fma.
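As far as I can tell from the FMF rules, that is what 'contract' is meant
to express: the backend only fuses when the fmul itself carries 'contract'
(the fadd's 'fast' already implies contract), or when the whole compilation
uses -ffp-contract=fast. So a frontend that wants the fma here, without
making the multiply fully 'fast', could emit something like

  %50 = fmul contract double %47, %49
  %51 = fadd fast double %s.124, %50

and the backend would then be free to form the fma in both the scalar and
the vector loop.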