In lib/CodeGen/MachineCombiner.cpp, the function MachineCombiner::preservesCriticalPathLen determines whether the new combined instruction lengthens the critical path.
To do this it computes the depth and latency of both the current instruction sequence (MUL+ADD) and the alternate instruction (MADD).
However, two different sets of APIs are called for the current and new instructions:
For the new instruction we use:
unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);
unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);
while for the current instruction we use:
unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;
unsigned RootLatency = TSchedModel.computeInstrLatency(Root);
These calls were introduced in the following commit:
Author: Gerolf Hoflehner email@example.com
MachineCombiner Pass for selecting faster instruction sequence on AArch64
For this example code sequence:
%mul = mul nuw nsw i32 %conv2, %conv
%mul7 = mul nuw nsw i32 %conv6, %conv4
%add = add nuw nsw i32 %mul7, %mul
ret i32 %add
We generate the following assembly:
mul w8, w0, w1
mul w9, w2, w3
add w0, w9, w8
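What I expected instead is roughly the following (the exact register allocation here is my guess, not compiler output):

mul  w8, w2, w3
madd w0, w0, w1, w8

i.e. the second MUL and the ADD fused into a single MADD (w0 = w0*w1 + w8), saving an instruction on the critical path.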
I expected the MUL+ADD to be combined into a MADD; without that combine I see degraded performance in several of my tests.
Could someone please explain why we use two different APIs to compute depth and latency for the two instructions?