Machine LICM and cheap instructions?

Hi everyone,

The MachineLICM pass has a heuristic such that, even in low-register-pressure situations, it will refuse to hoist "cheap" instructions out of loops. By default, when an itinerary is available, this means that all of the defined operands are available in at most 1 cycle. ARM overrides this, and provides this more-customized definition:

bool ARMBaseInstrInfo::
hasLowDefLatency(const InstrItineraryData *ItinData,
                 const MachineInstr *DefMI, unsigned DefIdx) const {
  if (!ItinData || ItinData->isEmpty())
    return false;

  unsigned DDomain = DefMI->getDesc().TSFlags & ARMII::DomainMask;
  if (DDomain == ARMII::DomainGeneral) {
    unsigned DefClass = DefMI->getDesc().getSchedClass();
    int DefCycle = ItinData->getOperandCycle(DefClass, DefIdx);
    return (DefCycle != -1 && DefCycle <= 2);
  }
  return false;
}

So it won't hoist instructions that have defined operands ready in at most two cycles for general-domain instructions. Regardless, I don't understand the logic behind this heuristic. High-register-pressure situations are one thing, but why is it ever profitable in low-register-pressure situations not to hoist even "cheap" instructions out of loop bodies?

Thanks again,
Hal

Hi Hal,

I haven’t looked at the implementation but could it be that we leave room for expensive instructions to be hoisted?
The rational being that hoisting a cheap instruction will increase the register pressure and then we may block expensive instruction (assuming we use some kind of iterative algorithm).

Other than that, I would say that seem rather silly.

Cheers,
Q.

From: "Quentin Colombet" <qcolombet@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, January 8, 2015 5:06:24 PM
Subject: Re: [LLVMdev] Machine LICM and cheap instructions?

Hi Hal,

I haven’t looked at the implementation but could it be that we leave
room for expensive instructions to be hoisted?
The rational being that hoisting a cheap instruction will increase
the register pressure and then we may block expensive instruction
(assuming we use some kind of iterative algorithm).

Interesting point. That cuts both ways, however, because hoisting a cheap instruction can then make an expensive instruction loop invariant. Perhaps we need two passes over the instructions to make a good decision here.

Thanks,
Hal

Some cheap and trivially remateriablizable instructions are free. It makes little sense to hoist them and increase register pressure. In some cases, the register allocator will end up rematerialize them at use sites again.

The heuristics is obviously conservative. Due to phase ordering reasons, register pressure estimates cannot be 100% accurate. It’s here to prevent harm. During my testings from way back then, hoisting these cheap instructions can indeed harm performance. It probably can help in some cases as well. But I’d wager it’s pretty much a wash at best.

Evan

From: “Quentin Colombet” <qcolombet@apple.com>
To: “Hal Finkel” <hfinkel@anl.gov>
Cc: “Dev” <llvmdev@cs.uiuc.edu>
Sent: Thursday, January 8, 2015 5:06:24 PM
Subject: Re: [LLVMdev] Machine LICM and cheap instructions?

Hi Hal,

I haven’t looked at the implementation but could it be that we leave
room for expensive instructions to be hoisted?
The rational being that hoisting a cheap instruction will increase
the register pressure and then we may block expensive instruction
(assuming we use some kind of iterative algorithm).

Interesting point. That cuts both ways, however, because hoisting a cheap instruction can then make an expensive instruction loop invariant. Perhaps we need two passes over the instructions to make a good decision here.

Yes, I think up to this point we’ve been heavily reliant on IR-level LICM to handle chains of computation.

It’s kind of hard for me to generalize what’s good/bad in MachineLICM without looking at concrete code for a particular architecture. Maybe the target hook should be more expressive.
-Andy

On Hexagon it's generally better to hoist everything out of the loop and then rematerialize cheap instructions if necessary. Hexagon executes instruction in packets, and each extra instruction in the loop adds the likelihood of having more packets than necessary. In case of small loops this can be particularly expensive and predicting packetization is extremely difficult without actually performing it.
The "hasLowDefLatency" is not really a good tool to model this, since cheap instructions are still cheap, and yet they are better hoisted out.

-Krzysztof