Commutability of X86 FMA3 instructions.

Hi all,

The 213 variant of the FMA3 instructions is currently marked
commutable (see X86InstrFMA.td). Is that safe? According to the ISA
the FMA3 instructions aren't commutable for non-numeric results, so
I'd have thought commuting this would only be valid in fast-math mode?

For the curious, the reason that I'm asking is that we currently
always select the 213 variant, but this introduces an extra copies in
accumulator-style loops. Something like:

while (...)
  accumulator = x * y + accumulator;

yields:

loop:
  vfmadd.213 y, x, acc
  vmovaps acc, x
  decl count
  jne loop

instead of

loop:
  vfmadd.231 acc, x, y
  decl count
  jne loop

I have started writing a patch to generate the 231 variant by default,
and I want to know whether I need to go to the trouble of adding
custom commute logic. If these things aren't commutable then I don't
need to worry at all. If they are commutable, but only in fast-math
mode, then I can support that too.

Thanks for the help!

- Lang.

Hi Lang,

Unfortunately, I don’t have an answer on the commutability question, but I wanted to let you know that I filed a bug on this:
http://llvm.org/bugs/show_bug.cgi?id=17229

This also shows a memory operand variant of the fma that you may want to consider in your patch and testcases.

Thanks!

Hi Kay,

My patch will partially address your bug. For now I'm just looking to
switch the default FMA from vfmadd213xx to vfmadd231xx. That will
cause the code in PR17229 to compile as desired, but would regress
code like:

double foo(double a, double b, double c) {
  return a * b + c;
}

Which will now require a vmovaps + vfmadd231.

If this impacts real benchmarks we could add an optimization to change
the FMA variant based on how it's used.

- Lang.

1) Our architects say that vmovaps %reg, %reg causes only register renaming in HW and does not affect performance.

2) The non-numeric result of FMA is not commutative for multiplicand and the multiplier. You are right. This property should be removed.

- Elena

Hi Elena,

Thank you very much for looking in to that.

I'll go ahead and remove the isCommutable flag from those
instructions, since it sounds like that's the right thing to do. I
would still like to change the default from the 231 variant to 213
too, as this will reduce code-size for accumulator-style loops. I have
at least one benchmark that shows significant speedups when this
change is made (presumably the smaller loop fits within the decode
buffer, though I haven't verified this). Please let me know if there's
any reason to keep the 231 variant.

Cheers,
Lang.

I remember that I tried all variants and checked many benchmarks.
I looked for a variant where more memory operands can be folded into instruction.
I, probably, found that the current solution is better than others.

I also should say that FMA does not promise speed up at all. We measure many benchmarks (~30) and saw performance regression for many of them. We went after solution where the average performance was higher.
So, I propose to check many benchmarks before switch and see what happens with memory folding.

- Elena

Hi Elena,

Ok. I will do some wider benchmarking before considering any change to
the default.

From your observations it sounds like right solution is to add a

post-isel peephole optimization to chose the most appropriate variant.
I'll give that a try.

- Lang.

It should be done before live range calculations because one of the source registers becomes a destination.
If we take FMA as A*B+C,
C is killed in the form "321"
B (or A) is killed in the form "213"

- Elena