Scheduler: modelling long register reservations?

Johnson_Nicholas_Pau · April 3, 2017, 7:37pm

Hello,

My out-of-tree target features some high-latency instructions (let's call them FXLV). When an FXLV issues, it reserves its destination register and execution continues; if a subsequent instruction attempts to read or write that register, the pipeline will stall until the FXLV completes. I have attempted to encode this constraint in the machine scheduler (excerpt at bottom of email). This solves half of the problem: the scheduler moves any instruction that reads the FXLV result register to a much later position.

However, this doesn't solve all of the problem. In particular, the scheduler seems indifferent to an instruction which overwrites the FXLV's result register---including instructions which overwrite only one lane of the vector result. Am I specifying the scheduling constraints incorrectly? Can llvm support this kind of constraint?
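To make the stall rule concrete, here is a minimal toy model of a single-issue, in-order pipeline in which a pending result reserves its destination register. It is only a sketch of the behaviour described above: `ToyInst`, `issueCycles`, and the 32-entry register file are hypothetical names and sizes chosen for illustration, not anything from the real target or from LLVM.

```cpp
#include <algorithm>
#include <array>
#include <initializer_list>
#include <vector>

// Hypothetical toy instruction: one destination and up to two sources.
// Register numbers must be in [0, 32); use -1 for "no register".
struct ToyInst {
  int dst;
  int src0;
  int src1;
  int latency;  // cycles for which dst stays reserved after issue
};

// Returns the cycle at which each instruction issues, in program order.
std::vector<int> issueCycles(const std::vector<ToyInst> &prog) {
  std::array<int, 32> readyAt{};  // cycle at which each register is free again
  std::vector<int> issued;
  int cycle = 0;
  for (const ToyInst &inst : prog) {
    int earliest = cycle;
    // Stall on a read (RAW) or a write (WAW) of a reserved register.
    for (int reg : {inst.src0, inst.src1, inst.dst})
      if (reg >= 0)
        earliest = std::max(earliest, readyAt[reg]);
    issued.push_back(earliest);
    if (inst.dst >= 0)
      readyAt[inst.dst] = earliest + inst.latency;
    cycle = earliest + 1;  // one instruction per cycle (IssueWidth = 1)
  }
  return issued;
}
```

With a 25-cycle write to register 0 followed by an unrelated write to register 1 and then a one-cycle overwrite of register 0, the third instruction cannot issue before cycle 25. That write-after-write stall is the one the scheduler currently ignores.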

Thank you,
Nick Johnson
D. E. Shaw Research

// Excerpted from lib/Target/MyTarget/MyTargetSchedule.td:
//
def DesGCv3GenericModel : SchedMachineModel
{
  let IssueWidth = 1;
  let MicroOpBufferSize = 0;

  let CompleteModel = 1;
}
// ...
def FlexU : ProcResource<64> { let BufferSize = 1; }
def : WriteRes<IIFlexRead, [FlexU]> { let Latency = 25; let ResourceCycles = [25]; }
class SchedFlexRead : Sched< [IIFlexRead] >; // I apply this to the definition of FXLV instruction
// ...

Johnson_Nicholas_Pau · April 10, 2017, 4:50pm

(Thank you Alex Bradbury for publicizing this thread in the weekly)

I'll update the thread with my partial solution. I have introduced a pseudo-instruction 'DontOverwriteFlexResults', as in Snippet 1 (below). That instruction has no effect. Then, I updated some instruction selection patterns so that they wrap every occurrence of FXLV within a DontOverwriteFlexResults pseudo-instruction (Snippet 2, below). The scheduler will attempt to schedule the pseudo-instruction to satisfy the long latency. This extends the live interval of the FXLV's result vector register, and prevents the register allocator from prematurely overwriting subvectors of the result register.

This solution works in some cases, but doesn't yet support the case in which the FXLV result is completely unused, since the 'DontOverwriteFlexResults' pseudo will get dead-code-eliminated. I'm planning on marking the pseudo as side-effecting to inhibit dead code elimination, but I still need a plan to prevent that from pessimizing the scheduler.

Nick Johnson
D. E. Shaw Research

// Snippet 1
// Here is a fancy fake instruction which prevents the compiler
// from clobbering all or part of a flex api instruction's result.
let hasNoSchedulingInfo = 1, mayLoad=0, mayStore=0, hasSideEffects=0, isAsCheapAsAMove=1 in
{
  def DontOverwriteFlexResults :
    DesGCv3PseudoInst<
      (outs VecRegs:$rd),
      (ins VecRegs:$rs),
      "# DontOverwriteFlexResults_v4i32\t$rd",
      >
  {
    let Constraints = "$rd = $rs";
  }
}

// Snippet 2
def : Pat<
(v4i32 (Aligned16LoadFromFlex (i32 DesGCv3RegPlusInt26:$ptr) )),
(DontOverwriteFlexResults (v4i32 (FXLV_UNCOUNTED (i32 DesGCv3RegPlusInt26:$ptr) )))>;

JonPsson · April 12, 2017, 7:25am

Hi Nick,

ScheduleDAGInstrs::addPhysRegDeps(SUnit *SU, unsigned OperIdx) is the method that adds the edges with their latencies for output dependencies (def -> def). Unfortunately, there currently doesn't seem to be a way to specify the latency for output deps with computeOperandLatency() or similar.

So I am thinking that one option might be to add a DAG mutation (a ScheduleDAGMutation) where you could manually set the latency of the output-dependence edge to 25, after the DAG has been built.
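The shape of such a post-build pass over the def -> def edges might look like the sketch below. To keep it self-contained it uses stand-in types (`ToyNode`, `ToyEdge`, `DepKind`, `stretchOutputEdges`), all of which are hypothetical and are not the LLVM classes; it only illustrates the logic of lengthening edges once the graph exists.

```cpp
#include <vector>

// Stand-in types for illustration only; these are NOT the LLVM classes.
enum class DepKind { Data, Anti, Output };

struct ToyNode {
  bool isLongLatencyDef;  // e.g. an FXLV
  unsigned latency;       // latency of the node's result
};

struct ToyEdge {
  int pred;          // index of the earlier node
  int succ;          // index of the later node
  DepKind kind;
  unsigned latency;  // minimum issue distance from pred to succ
};

// After the graph is built, raise the latency of every output (def -> def)
// edge whose predecessor is a long-latency definition. Other edges are
// left untouched.
void stretchOutputEdges(const std::vector<ToyNode> &nodes,
                        std::vector<ToyEdge> &edges) {
  for (ToyEdge &e : edges) {
    if (e.kind != DepKind::Output)
      continue;
    const ToyNode &pred = nodes[e.pred];
    if (pred.isLongLatencyDef && e.latency < pred.latency)
      e.latency = pred.latency;
  }
}
```

In LLVM itself the equivalent would presumably be written against SUnit and SDep (checking for SDep::Output and calling setLatency). Since each dependence is recorded on both the predecessor's and the successor's edge lists there, a real mutation would need to update both copies.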

If you have a problem with subregs, did you try to model the stalling subreg def as defining the whole vector reg, while in the output adjusting the register operand text, or similar?

/Jonas

atrick · May 22, 2017, 10:29pm

Wow, this was in the digest and I still missed it! Anyway, for future reference…

The scheduler has bits and pieces of in-order support. In this case, the DAG builder assumes that the WAW instructions are fully pipelined and take the same latency, hence the one-cycle edge:

unsigned TargetSchedModel::
computeOutputLatency(const MachineInstr *DefMI, unsigned DefOperIdx,
                     const MachineInstr *DepMI) const {
  if (!SchedModel.isOutOfOrder())
    return 1;

It seems perfectly reasonable to me to use the difference in latency between the two instructions (when the first instruction has higher latency), plus one cycle.

Note that if the second, dependent instruction is also high latency, but uses different resources, you don’t want to delay it.
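As a rough illustration of that rule (the latency difference plus one cycle, falling back to the existing one-cycle edge otherwise), here is a small standalone helper. `inOrderOutputLatency` is a hypothetical name and this is not the LLVM implementation; it assumes each instruction writes its result at the end of its own latency.

```cpp
// Hypothetical helper, not LLVM code: minimum issue distance between two
// writes to the same register on an in-order machine.
unsigned inOrderOutputLatency(unsigned firstDefLatency,
                              unsigned secondDefLatency) {
  // Only delay the second write when the first one finishes later.
  if (firstDefLatency > secondDefLatency)
    return firstDefLatency - secondDefLatency + 1;
  return 1;  // fully pipelined case: the existing one-cycle edge
}
```

For a 25-cycle FXLV followed by a one-cycle overwrite this gives 25 cycles, while two back-to-back 25-cycle writes keep the one-cycle edge, which is consistent with not delaying a second high-latency instruction.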

-Andy
