[AArch64] Is the cost of the MSUB instruction significantly higher than that of the MADD instruction?

Based on ⚙ D40306 [AArch64] Add patterns to replace fsub fmul with fma fneg., we transform (fsub (fmul x y) z) into (fma x y (fneg z)) instead of using the fmls instruction.

So, is the cost of the MSUB instruction significantly higher than that of the MADD instruction on the AArch64 target?
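For concreteness, here is a minimal reproducer (my own sketch; the function and type names are made up). Built with clang -O3 and contraction enabled (e.g. -ffp-contract=fast or -ffast-math), the AArch64 backend emits an fneg followed by an fmla for this, not a single fmls:

  // x*y - z on a 4 x float vector, using the Clang/GCC vector extension.
  using v4f32 = float __attribute__((vector_size(16)));

  v4f32 mulsub(v4f32 x, v4f32 y, v4f32 z) {
    return x * y - z; // (fsub (fmul x y) z)
  }
  // Expected codegen, roughly (register assignment illustrative):
  //   fneg v2.4s, v2.4s
  //   fmla v2.4s, v0.4s, v1.4s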

  • related code in the MachineCombiner pass (genAlternativeCodeSequence in AArch64InstrInfo.cpp):
  case MachineCombinerPattern::FMLSv4f32_OP1:
  case MachineCombinerPattern::FMLSv4i32_indexed_OP1: {
    RC = &AArch64::FPR128RegClass;
    Register NewVR = MRI.createVirtualRegister(RC);
    MachineInstrBuilder MIB1 =
        BuildMI(MF, Root.getDebugLoc(), TII->get(AArch64::FNEGv4f32), NewVR)
            .add(Root.getOperand(2));
    InsInstrs.push_back(MIB1);
    InstrIdxForVirtReg.insert(std::make_pair(NewVR, 0));
    if (Pattern == MachineCombinerPattern::FMLSv4i32_indexed_OP1) {
      Opc = AArch64::FMLAv4i32_indexed;
      MUL = genFusedMultiply(MF, MRI, TII, Root, InsInstrs, 1, Opc, RC,
                             FMAInstKind::Indexed, &NewVR);
    } else {
      Opc = AArch64::FMLAv4f32;
      MUL = genFusedMultiply(MF, MRI, TII, Root, InsInstrs, 1, Opc, RC,
                             FMAInstKind::Accumulator, &NewVR);
    }
    break;
  }
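What this does: Root is the fsub, whose operand 2 is z; the FNEG of z goes into the fresh virtual register NewVR, which genFusedMultiply then uses as the accumulator, folding the operand-1 multiply into fmla NewVR, x, y. A scalar before/after model of the dataflow (illustrative only, not LLVM code):

  // Before combining: two serially dependent instructions.
  float before(float x, float y, float z) {
    float mul = x * y; // FMUL
    return mul - z;    // FSUB, must wait for the FMUL
  }

  // After combining: the FNEG depends only on z, so it can issue
  // while x and y are still being computed.
  float after(float x, float y, float z) {
    float negz = -z;      // FNEG -> NewVR (InsInstrs[0])
    // FMLA: negz + x*y; fused in hardware, written unfused here.
    return negz + x * y;
  }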

From the review you linked: “This has a lower latency on micro architectures where fneg is cheap.” The assumption is that fmul has the same cost as fmla, and that fneg is cheaper than fsub.
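A worked example with made-up latencies (purely illustrative, not numbers from any real core). Say all operands are ready at t = 0, lat(fmul) = lat(fmla) = 4 cycles, lat(fsub) = 3, lat(fneg) = 1:

  fmul + fsub:  4 + 3 = 7 cycles   (the fsub depends on the fmul)
  fneg + fmla:  1 + 4 = 5 cycles   (the fmla depends on the fneg)

With lat(fmla) == lat(fmul), the rewrite shortens the critical path exactly when lat(fneg) < lat(fsub).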

Oh, wait, I think I see your question. I don’t think fmls has the right semantics for the transform you’re thinking of.

Am I missing something?

FMLS (vector): fmls x, y, z = x*y - z

“Floating-point fused Multiply-Subtract from accumulator (vector).”

FMLA (by element): fmla x, y, z = x*y + z

“Floating-point fused Multiply-Add to accumulator (by element).”

fmls computes z - (x*y): the product is subtracted from the accumulator, not the accumulator from the product.
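In other words, the accumulator would have to enter the operation already negated for the semantics to line up, which is exactly why the combiner materialises an fneg and uses fmla. A tiny model of the two semantics (my own sketch):

  // FMLS vd, vn, vm:  vd = vd - vn*vm  (product subtracted FROM accumulator).
  float fmls_model(float acc, float n, float m) { return acc - n * m; }

  // What (fsub (fmul x y) z) needs:  x*y - z.
  float wanted(float x, float y, float z) { return x * y - z; }

  // fmls_model(z, x, y) == z - x*y == -(x*y - z), so a lone fmls yields the
  // negated result; fmla onto fneg(z) yields the right one: (-z) + x*y.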


Thanks for your patient reply!