How does Hexagon handle OperandCycles?

Hi, community.
I am trying to understand how Hexagon uses InstrItinData, rather than WriteRes, to define operand latency.
As an experiment, I modified the InstrItinData entry for Hexagon’s L2_loadri_io instruction and changed its OperandCycles:
[4, 1, 2] → [20, 1, 2]
However, I don’t see any stall or rescheduling in the generated assembly.
The command line I used is:

./build/bin/llc -march=hexagon  -mcpu=hexagonv55 ../strlcpy.ll --debug

The test file strlcpy.ll is here:

I did notice that the SDep latency has increased:

Scheduling DAG of the packetize region
SU(0):   renamable $r0 = L2_loadri_io $r29, 16 :: (dereferenceable load (s32) from %ir.src.addr)
  # preds left       : 0
  # succs left       : 5
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 40
  Successors:
    SU(5): Out  Latency=1
    SU(5): Data Latency=19 Reg=$r0
    SU(1): Data Latency=18 Reg=$r0
    SU(2): Ord  Latency=0 Memory
    SU(3): Ord  Latency=1 Artificial
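
For readers following along, here is a rough sketch of where numbers like 19 and 18 can come from. As far as I know, the itinerary-based data latency is computed as the def operand cycle minus the use operand cycle plus one, with one cycle removed when pipeline forwarding applies. The Python below only models that arithmetic. The consumer read cycles (2 and 3) are assumptions picked to match the dump above, not values read from the Hexagon itineraries, and the target may adjust edge latencies further in its own hooks.

```python
def operand_latency(def_cycle, use_cycle, forwarding=False):
    """Toy model of the itinerary rule: def_cycle - use_cycle + 1,
    reduced by one cycle when a forwarding path exists."""
    latency = def_cycle - use_cycle + 1
    if latency > 0 and forwarding:
        latency -= 1
    return latency


# Result operand of L2_loadri_io: first OperandCycles entry, before and after the edit.
OLD_DEF_CYCLE = 4
NEW_DEF_CYCLE = 20

# Hypothetical cycles at which a consumer reads its source operand (assumed values).
for use_cycle in (2, 3):
    old = operand_latency(OLD_DEF_CYCLE, use_cycle)
    new = operand_latency(NEW_DEF_CYCLE, use_cycle)
    print(f"consumer reads at cycle {use_cycle}: latency {old} -> {new}")
```

Under these assumptions, a consumer reading at cycle 2 gets an edge latency of 19 and one reading at cycle 3 gets 18, which is consistent with the two Data edges in the dump.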

I want to figure out:

  1. Does altering the OperandCycles this way mean the instruction’s result will be delayed by 20 cycles, as expected?
  2. Is anything else needed to define an instruction’s operand latency in Hexagon?

Hi, altering it this way tells the schedulers that the operand is only available at cycle 20.

To see the effect, you will most probably need to try it on a large basic block, one that spans at least 20 cycles, so that the scheduler has enough room to move the instructions that use the loaded value 20 cycles later. Otherwise, the scheduler might assume they will stall anyway (I am not sure what happens in that case).
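
To make the "large basic block" point concrete, here is a toy single-issue list scheduler in Python. It is only an illustration under simplifying assumptions (one instruction per cycle, no packets, no resource limits) and is not how the LLVM schedulers are implemented. The instruction names and the latency of 19 are made up for the example.

```python
def schedule(instrs, deps):
    """Greedy single-issue scheduling.

    instrs: instruction names in program order.
    deps:   {(producer, consumer): latency in cycles}.
    Returns (issue cycle per instruction, number of idle cycles).
    """
    issue, idle, cycle = {}, 0, 0
    pending = list(instrs)
    while pending:
        chosen = None
        for name in pending:
            # An instruction is blocked until every producer has issued
            # and the edge latency has elapsed.
            blocked = any(
                consumer == name
                and (producer not in issue or issue[producer] + latency > cycle)
                for (producer, consumer), latency in deps.items()
            )
            if not blocked:
                chosen = name
                break
        if chosen is None:
            idle += 1  # nothing is ready: the cycle is wasted
        else:
            issue[chosen] = cycle
            pending.remove(chosen)
        cycle += 1
    return issue, idle


def block(num_fillers):
    """A load, its consumer, and some independent filler instructions."""
    return ["load", "use"] + [f"filler{i}" for i in range(num_fillers)]


deps = {("load", "use"): 19}
for n in (0, 5, 25):
    issue, idle = schedule(block(n), deps)
    print(f"{n:2d} independent instructions: use at cycle {issue['use']}, idle = {idle}")
```

In this model the consumer always lands at cycle 19, but only the block with enough independent work fills the gap with useful instructions. With a tiny block there is nothing to reorder, so the emitted order looks unchanged even though the latency edge grew.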

Keep in mind that the schedulers work on basic blocks, not superblocks or traces. Also, there are 2 or 3 schedulers: DAGScheduler (possibly MachineScheduler as a prepass?), then PostRAScheduler. When examining the debug output, instructions might be moved more than once… Perhaps try examining what happens in the very last scheduling pass.


Thank you, csix.
I still have some follow-up questions.

  1. “Otherwise, the scheduler might assume they will stall anyway”
    Does this mean the compiler should insert stalls or empty bundles rather than do nothing?
    If I don’t have a big BB, would inserting nops be necessary in this scenario?
    I was expecting to see something like LeonPasses inserting nops.
  2. I confirmed that the PostRA scheduler is not run in this test case. ScheduleDAGMILive::scheduleMI is also not triggered when debugging.

It depends on the architecture. If the architecture has an interlocked pipeline (the pipeline stalls when a dependency is not ready, so not respecting latencies only gives a slower program and does not hurt correctness), then you do not have to emit nops. But if your architecture does not have interlocks, then the scheduler has to emit nops.
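
As a small illustration of the non-interlocked case, the Python sketch below pads an already ordered instruction list with nops so that each consumer issues at least `latency` cycles after its producer. It assumes one issue slot per cycle and uses made-up instruction names and latencies. It is a toy model, not what LeonPasses or any LLVM hazard recognizer actually does.

```python
def insert_nops(sequence, deps):
    """Pad a fixed instruction order with nops.

    sequence: instruction names in final order, one issue per cycle.
    deps:     {(producer, consumer): latency in cycles}.
    Returns a new list in which every consumer issues at least
    `latency` cycles after its producer.
    """
    out = []
    issue = {}
    for name in sequence:
        earliest = len(out)  # next free cycle
        for (producer, consumer), latency in deps.items():
            if consumer == name and producer in issue:
                earliest = max(earliest, issue[producer] + latency)
        out.extend(["nop"] * (earliest - len(out)))  # fill the gap explicitly
        issue[name] = len(out)
        out.append(name)
    return out


# Hypothetical three-instruction block with a 4-cycle load-to-use latency.
padded = insert_nops(["load", "add", "use"], {("load", "use"): 4})
print(padded)
```

Here two nops end up between `add` and `use`. On an interlocked pipeline the instruction stream would stay `load, add, use` and the hardware would spend those two cycles stalling instead.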

I see a PreEmitNoops method in ScheduleHazardRecognizer.h, which is then used in PostRASchedulerList. Perhaps this method can be overridden to tell the PostRASchedulerList scheduler to insert nops?

To know which passes get run, you can use the command llc -mtriple=arm -O3 -debug-pass=Structure empty.ll, where empty.ll is an empty LLVM IR file, replacing arm with your triple, just like the ARM test file O3-pipeline.ll.
