[insn schedule] Is it reasonable to issue load instructions preferentially?

For a small case where there is no register pressure issue, is it reasonable to issue load instructions preferentially, since loads generally have higher latency and therefore matter more for performance? For example:

.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ldp	q1, q2, [x10, #-16]
	subs	x8, x8, #4
	add	x10, x10, #32
	ldp	q3, q4, [x9, #-16]      // hoist this insn before the subs?
	fmla	v3.2d, v0.2d, v1.2d
	fmla	v4.2d, v0.2d, v2.2d
	stp	q3, q4, [x9, #-16]
	add	x9, x9, #32
	b.ne	.LBB0_1

See the details in Compiler Explorer.

After register allocation, I would say it is preferable to hoist the loads as far as possible (without altering their order) so that the penalty is minimal if a load misses in the cache.
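To make the argument concrete, here is a minimal sketch of a toy in-order, single-issue pipeline that compares the original loop body with a version where the second ldp is hoisted above the subs. The latencies (20 cycles for a load, 5 for fmla, 1 for everything else) are made-up numbers chosen for illustration, not values from any real scheduling model, and an out-of-order core would hide much of this difference on its own.

```python
# Toy model: one instruction issues per cycle, in program order, and an
# instruction stalls until all of its source registers are available.
# All latencies below are hypothetical, chosen only for illustration.
LATENCY = {"ldp": 20, "fmla": 5, "subs": 1, "add": 1, "stp": 1, "b.ne": 1}

def loop_body_cycles(schedule, latency=LATENCY):
    """Return the cycle after the last instruction of the body has issued."""
    ready_at = {}  # register name -> cycle at which its value is available
    cycle = 0
    for op, dests, srcs in schedule:
        start = max([cycle] + [ready_at.get(reg, 0) for reg in srcs])
        for reg in dests:
            ready_at[reg] = start + latency[op]
        cycle = start + 1
    return cycle

LDP_A = ("ldp", ["q1", "q2"], ["x10"])
LDP_B = ("ldp", ["q3", "q4"], ["x9"])
SUBS = ("subs", ["x8", "nzcv"], ["x8"])
ADD_X10 = ("add", ["x10"], ["x10"])
TAIL = [
    ("fmla", ["q3"], ["q3", "q0", "q1"]),
    ("fmla", ["q4"], ["q4", "q0", "q2"]),
    ("stp", [], ["q3", "q4", "x9"]),
    ("add", ["x9"], ["x9"]),
    ("b.ne", [], ["nzcv"]),
]

ORIGINAL = [LDP_A, SUBS, ADD_X10, LDP_B] + TAIL
HOISTED = [LDP_A, LDP_B, SUBS, ADD_X10] + TAIL

print("original:", loop_body_cycles(ORIGINAL))
print("hoisted: ", loop_body_cycles(HOISTED))
```

In this model the hoisted order finishes two cycles earlier, which is exactly the number of instructions the second load was moved across. Whether real hardware shows any difference depends on the microarchitecture.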

Why it chose this particular schedule depends on the architecture backend. It is hard to say why it did not hoist further without doing a deep dive in the backend. A starting point would be to look at the debug output for the post-RA scheduler.

Perhaps the heuristics of the post-RA scheduler need some tuning to issue load instructions first.

Execution ports also play a role in scheduling; it's possible the scheduler tried to put some distance between the ldps because of that.

Thanks very much, @csix and @nhaehnle. I'll try to debug it following your suggestions.

Whether it is reasonable in this specific case depends on whether it makes a difference on your uarch.

AFAIK the scheduling model used by AArch64 by default assumes out-of-order execution and won’t try to aggressively schedule for latency.

Hi @fhahn,
I see the comment you are referring to, but I don't see any condition that actually checks for out-of-order execution. Did I miss something?

llvm-project/MachineScheduler.cpp at main · llvm/llvm-project · GitHub

// Schedule aggressively for latency in PostRA mode. We don't check for
// acyclic latency during PostRA, and highly out-of-order processors will
// skip PostRA scheduling.
if (!OtherResLimited &&
    (IsPostRA || shouldReduceLatency(Policy, CurrZone, !RemLatencyComputed,
                                     RemLatency))) {
  Policy.ReduceLatency |= true;
  LLVM_DEBUG(dbgs() << " " << CurrZone.Available.getName()
                    << " RemainingLatency " << RemLatency << " + "
                    << CurrZone.getCurrCycle() << "c > CritPath "
                    << Rem.CriticalPath << "\n");
}
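The boolean structure of the quoted condition can be restated as a small sketch. The function and parameter names below are hypothetical stand-ins for the C++ variables in the excerpt, and the call to shouldReduceLatency is reduced to a plain boolean. It shows that nothing in the condition itself tests for out-of-order execution; going by the quoted comment, such targets are expected to skip the post-RA pass altogether rather than be filtered out here.

```python
from itertools import product

def sets_reduce_latency(other_res_limited, is_post_ra, should_reduce_latency):
    """Mirror of the quoted condition, with each C++ term passed as a bool.

    `should_reduce_latency` stands in for the result of the
    shouldReduceLatency(...) call in the excerpt.
    """
    return (not other_res_limited) and (is_post_ra or should_reduce_latency)

# Enumerate every combination to see which inputs actually matter.
for other, post_ra, should in product([False, True], repeat=3):
    print(other, post_ra, should, "->", sets_reduce_latency(other, post_ra, should))
```

In post-RA mode the outcome depends only on the other-zone resource limit, so the latency heuristic is enabled regardless of what shouldReduceLatency would return.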

Based on the following information (see the attached file for details), can it be confirmed that this is due to an execution resource limitation?

  • When scheduling node SU(4), SU(1) does not appear in the Cand list even though it is already in TopQ.A. After SU(4) is issued, SU(1) is moved from TopQ.A into TopQ.P (starting at line 1045 of the attached file).
  renamable $q1, renamable $q3 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8) -- SU(0)
  renamable $q2, renamable $q4 = LDPQi renamable $x8, -1 :: (load (s128) from %ir.scevgep5, align 8), (load (s128) from %ir.lsr.iv13, align 8) -- SU(1)
  renamable $q2 = nnan ninf nsz arcp contract afn reassoc FMLAv2f64 killed renamable $q2(tied-def 0), renamable $q0, killed renamable $q1 -- SU(2)
  renamable $q4 = nnan ninf nsz arcp contract afn reassoc FMLAv2f64 killed renamable $q4(tied-def 0), renamable $q0, killed renamable $q3 -- SU(3)
  renamable $x2 = SUBSXri killed renamable $x2, 4, 0, implicit-def $nzcv -- SU(4)
  STPQi renamable $q2, renamable $q4, renamable $x8, -1 :: (store (s128) into %ir.scevgep4, align 8), (store (s128) into %ir.lsr.iv13, align 8) -- SU(5)

** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 0 1 4 
  TopQ.A RemainingLatency 0 + 0c > CritPath 17
  Cand SU(0) ORDER                              
Pick Top TOP-PATH  
Scheduling SU(0) renamable $q1, renamable $q3 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8)
  Ready @0c
  A57UnitL +2x6u
  *** Critical resource A57UnitL: 2c
  TopQ.A BotLatency SU(0) 17c
  *** Max MOps 3 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
  Retired: 3
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 4 1 7 
  TopQ.A RemainingLatency 0 + 1c > CritPath 17
  TopQ.A ResourceLimited: A57UnitL
  Cand SU(4) ORDER                              
Pick Top RES-REDUCE
Scheduling SU(4) renamable $x2 = SUBSXri renamable $x2, 4, 0, implicit-def $nzcv
  Ready @1c
  A57UnitI +1x3u
TopQ.A @1c
  Retired: 4
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
  SU(1) uops=3
Queue TopQ.P: 1 
Queue TopQ.A: 7 
Pick Top ONLY1     
Scheduling SU(7) renamable $x9 = ADDXri renamable $x9, 32, 0
  Ready @1c
  A57UnitI +1x3u
TopQ.A @1c
  Retired: 5
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
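One possible reading of the log above, sketched as a toy model rather than the actual MachineScheduler code: once the zone is reported as limited by A57UnitL, the pick favours the candidate that consumes the least of that resource (the RES-REDUCE reason), and an instruction whose micro-ops no longer fit in the current cycle is moved to the pending queue (the "SU(1) uops=3" line). The issue width of 3 is an assumption inferred from "Max MOps 3 at cycle 0", the per-unit cycle counts come from the "+2x6u" and "+1x3u" lines, and the one-micro-op counts for SU(4) and SU(7) are inferred from the Retired counter. The helper names are hypothetical.

```python
ISSUE_WIDTH = 3  # assumption: inferred from "*** Max MOps 3 at cycle 0"

def pick_resource_reduce(candidates, critical_resource):
    """Pick the candidate using the fewest cycles of the critical resource.

    min() returns the first minimal element, so ties keep queue order.
    """
    return min(candidates, key=lambda su: su["units"].get(critical_resource, 0))

def has_issue_hazard(curr_mops, uops, issue_width=ISSUE_WIDTH):
    """True if the instruction does not fit in what is left of this cycle."""
    return curr_mops > 0 and curr_mops + uops > issue_width

# Queue TopQ.A after SU(0) has been scheduled, in the order shown in the log.
su4 = {"name": "SU(4)", "uops": 1, "units": {"A57UnitI": 1}}  # SUBSXri
su1 = {"name": "SU(1)", "uops": 3, "units": {"A57UnitL": 2}}  # LDPQi
su7 = {"name": "SU(7)", "uops": 1, "units": {"A57UnitI": 1}}  # ADDXri
queue = [su4, su1, su7]

picked = pick_resource_reduce(queue, "A57UnitL")
print("picked:", picked["name"])

# SU(4) has issued one micro-op in cycle 1; check what still fits.
curr_mops = picked["uops"]
for su in (su1, su7):
    state = "pending" if has_issue_hazard(curr_mops, su["uops"]) else "available"
    print(su["name"], state)
```

Under these assumptions SU(1) is considered when SU(4) is picked but loses on the resource comparison, and it then goes to TopQ.P because its three micro-ops exceed the remaining issue slots in cycle 1. That is consistent with the trace, but the attached full log is the place to confirm it.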

all.txt (38.0 KB)

candidate MR: ⚙ D129927 [MachineScheduler] Try to issue the load instruction preferentially