Question on Machine Combiner Pass

Ping

From: "Mandeep Singh Grang" <mgrang@codeaurora.org>
To: llvmdev@cs.uiuc.edu
Sent: Wednesday, February 4, 2015 12:58:27 PM
Subject: Re: [LLVMdev] Question on Machine Combiner Pass

Ping

From: Mandeep Singh Grang [mailto:mgrang@codeaurora.org]
Sent: Tuesday, February 03, 2015 4:34 PM
To: 'llvmdev@cs.uiuc.edu'
Cc: 'ghoflehner@apple.com'; 'apazos@codeaurora.org'; 'mgrang@codeaurora.org'
Subject: Question on Machine Combiner Pass

Hi,

In the file lib/CodeGen/MachineCombiner.cpp I see that the function
MachineCombiner::preservesCriticalPathLen tries to determine whether the
new combined instruction lengthens the critical path.

In order to do this we compute the depth and latency for the current
instruction sequence (MUL + ADD) and the alternate instruction (MADD).

But we call two different sets of APIs for the current and the new
instructions:

For new instruction we use:

unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg,
BlockTrace);

unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);

While for the current instruction we use:

unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;

unsigned RootLatency = TSchedModel.computeInstrLatency(Root);

The BlockTrace comes from MachineTraceMetrics, which is an analysis pass, and thus might have its data pre-computed. This only strictly applies to the current instruction sequence. We need to use a different method to compute information for potential new instructions. Are you finding that these methods compute inconsistent results?

-Hal

Hi Hal,

We came across many missed opportunities to generate madd instructions on
AArch64.

In this particular test case, a mul instruction that feeds into a madd
instruction as its accumulator should have a reduced latency. But when
dumping MachineCombiner decisions we do not see the reduced latency value
(should be 1).

We debugged a bit and noticed the difference in how NewRoot and Root latency
and depth are computed. We were not sure they were leading to the same
queries to the SchedModel. You can see, for example, that in one instance it
queries with the machine instruction pointer, in the other with only the
machine instruction opcode:

TSchedModel.computeInstrLatency(Root)
TSchedModel.computeInstrLatency(NewRoot->getOpcode())

We made a local change that adds the new candidate instructions to the basic
block (with a different output virtual register for the NewRoot), and then
ran the trace analysis on the modified BB. This allowed us to use the same
code for computing the latency and depth of both NewRoot and Root. Still we
did not get the reduced latency value for the mul.

We asked Dave Estes to take a look at our reduced test and he may have found
a bug in SchedAlias. He will bring it up to the community in a separate
discussion. With his fix we can now see the reduced mul latency and madd
being generated for the reduced test case. We are verifying the fix with
other complex tests.

But we still are left with this question about MachineCombiner: shouldn't
the trace analysis run on a modified basic block that contains the new
candidate instructions, and shouldn't the NewRoot/Root latency and depth be
computed with the same SchedModel APIs? In our local change we rerun the
trace analysis for each candidate instruction we add to the basic block, and
remove the instructions from the basic block if they are not profitable.

Thanks,
Ana.

From: "Ana Pazos" <apazos@codeaurora.org>
To: "Hal Finkel" <hfinkel@anl.gov>, "Mandeep Singh Grang" <mgrang@codeaurora.org>
Cc: llvmdev@cs.uiuc.edu
Sent: Friday, February 6, 2015 12:00:21 PM
Subject: RE: [LLVMdev] Question on Machine Combiner Pass

Hi Hal,

We came across many missed opportunities to generate madd instructions on
AArch64.

In this particular test case, a mul instruction that feeds into a madd
instruction as its accumulator should have a reduced latency. But when
dumping MachineCombiner decisions we do not see the reduced latency value
(should be 1).

We debugged a bit and noticed the difference in how NewRoot and Root latency
and depth are computed. We were not sure they were leading to the same
queries to the SchedModel. You can see, for example, that in one instance it
queries with the machine instruction pointer, in the other with only the
machine instruction opcode:

TSchedModel.computeInstrLatency(Root)
TSchedModel.computeInstrLatency(NewRoot->getOpcode())

I think the idea is that what is currently done represents a minimal modification to the existing analysis. Perhaps this is not true, however.

We made a local change that adds the new candidate instructions to the basic
block (with a different output virtual register for the NewRoot), and then
ran the trace analysis on the modified BB. This allowed us to use the same
code for computing the latency and depth of both NewRoot and Root. Still we
did not get the reduced latency value for the mul.

We asked Dave Estes to take a look at our reduced test and he may have found
a bug in SchedAlias. He will bring it up to the community in a separate
discussion. With his fix we can now see the reduced mul latency and madd
being generated for the reduced test case. We are verifying the fix with
other complex tests.

Okay, sounds good.

But we still are left with this question about MachineCombiner: shouldn't
the trace analysis run on a modified basic block that contains the new
candidate instructions, and shouldn't the NewRoot/Root latency and depth be
computed with the same SchedModel APIs? In our local change we rerun the
trace analysis for each candidate instruction we add to the basic block, and
remove the instructions from the basic block if they are not profitable.

I assume this was not done because of compile-time concerns. Does this scale as (N^2)*(# candidates) in the size of the BB?