Allen
July 8, 2022, 3:14am
1
For a small case where register pressure is not an issue, is it reasonable to issue load instructions preferentially, since loads generally have higher latency? For example:
.LBB0_1: // %vector.body
// =>This Inner Loop Header: Depth=1
ldp q1, q2, [x10, #-16]
subs x8, x8, #4
add x10, x10, #32
ldp q3, q4, [x9, #-16] -- hoist this insn before the subs ?
fmla v3.2d, v0.2d, v1.2d
fmla v4.2d, v0.2d, v2.2d
stp q3, q4, [x9, #-16]
add x9, x9, #32
b.ne .LBB0_1
See the details in Compiler Explorer.
csix
July 12, 2022, 10:04am
2
After register allocation, I would say it is preferable to hoist the loads as far as possible (without altering their order), so that the penalty is minimal if a load misses in the cache.
Why it chose this particular schedule depends on the architecture backend. It is hard to say why it did not hoist further without doing a deep dive in the backend. A starting point would be to look at the debug output for the post-RA scheduler.
Perhaps the heuristics of the post-RA scheduler need some tuning to issue load instructions first.
Execution ports also play a role in scheduling, it’s possible the scheduler tried to put some distance between the ldps because of that.
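To make the hoisting argument concrete, here is a minimal sketch of a toy single-issue, in-order cycle model applied to the loop body from the first post. Everything in it is an assumption for illustration: the `Insn` struct, the `inOrderCycles` and `loopBody` helpers, and the latencies passed in are hypothetical and are not taken from any real AArch64 scheduling model. It tracks only read-after-write dependencies and ignores the branch, caches, and execution ports.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One instruction in the toy model: registers written, registers read,
// and an assumed result latency in cycles.
struct Insn {
  std::vector<std::string> defs;
  std::vector<std::string> uses;
  int latency;
};

// Single-issue, in-order model: an instruction issues no earlier than one
// cycle after its predecessor and stalls until all of its sources are ready.
// Returns the cycle following the issue of the last instruction.
int inOrderCycles(const std::vector<Insn> &body) {
  std::map<std::string, int> readyAt; // register -> cycle its value is ready
  int cycle = 0;
  for (const Insn &insn : body) {
    int issue = cycle;
    for (const std::string &reg : insn.uses) {
      auto it = readyAt.find(reg);
      if (it != readyAt.end())
        issue = std::max(issue, it->second);
    }
    for (const std::string &reg : insn.defs)
      readyAt[reg] = issue + insn.latency;
    cycle = issue + 1;
  }
  return cycle;
}

// The loop body from the first post (branch omitted), either in the order
// the compiler emitted or with the second ldp hoisted above the subs.
// loadLat and fmaLat are assumed latencies, not measured values.
std::vector<Insn> loopBody(bool hoistSecondLoad, int loadLat, int fmaLat) {
  assert(loadLat > 0 && fmaLat > 0);
  Insn ldpA{{"q1", "q2"}, {"x10"}, loadLat};
  Insn subs{{"x8"}, {"x8"}, 1};
  Insn addA{{"x10"}, {"x10"}, 1};
  Insn ldpB{{"q3", "q4"}, {"x9"}, loadLat};
  Insn fmla3{{"q3"}, {"q3", "q0", "q1"}, fmaLat};
  Insn fmla4{{"q4"}, {"q4", "q0", "q2"}, fmaLat};
  Insn stp{{}, {"q3", "q4", "x9"}, 1};
  Insn addB{{"x9"}, {"x9"}, 1};
  if (hoistSecondLoad)
    return {ldpA, ldpB, subs, addA, fmla3, fmla4, stp, addB};
  return {ldpA, subs, addA, ldpB, fmla3, fmla4, stp, addB};
}
```

With both latencies set to 4, `inOrderCycles(loopBody(false, 4, 4))` gives 14 and `inOrderCycles(loopBody(true, 4, 4))` gives 12 in this model. An out-of-order core can hide much of that gap, so the sketch only illustrates why hoisting helps when instructions issue in program order.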
Allen
July 12, 2022, 2:40pm
4
Thanks very much, @csix and @nhaehnle. I’ll try to debug it following your suggestions.
fhahn
July 12, 2022, 4:11pm
5
Whether it is reasonable in this specific case depends on whether it makes a difference on your uarch.
AFAIK the scheduling model used by AArch64 by default assumes out-of-order execution and won’t try to aggressively schedule for latency.
Allen
July 16, 2022, 3:00am
6
Hi @fhahn,
I see the comment you pointed me to, but I don’t see any condition that actually checks for out-of-order execution. Did I miss something?
llvm-project/MachineScheduler.cpp at main · llvm/llvm-project · GitHub
// Schedule aggressively for latency in PostRA mode. We don’t check for
// acyclic latency during PostRA, and highly out-of-order processors will
// skip PostRA scheduling.
if (!OtherResLimited &&
(IsPostRA || shouldReduceLatency(Policy, CurrZone, !RemLatencyComputed,
RemLatency))) {
Policy.ReduceLatency |= true;
LLVM_DEBUG(dbgs() << " " << CurrZone.Available.getName()
<< " RemainingLatency " << RemLatency << " + "
<< CurrZone.getCurrCycle() << "c > CritPath "
<< Rem.CriticalPath << "\n");
}
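The boolean shape of the quoted condition can be isolated in a small standalone sketch. `LatencyInputs`, `wantsLatencyReduction`, and `checkLatencyPolicyExamples` are hypothetical names introduced here. They only stand in for the values the snippet reads (`OtherResLimited`, `IsPostRA`, and the result of `shouldReduceLatency`) and are not LLVM APIs.

```cpp
#include <cassert>

// Hypothetical stand-ins for the three values the quoted condition reads.
struct LatencyInputs {
  bool otherResLimited;     // stands in for OtherResLimited
  bool isPostRA;            // stands in for IsPostRA
  bool shouldReduceLatency; // stands in for the shouldReduceLatency(...) result
};

// Same boolean structure as the quoted `if`: latency reduction is requested
// whenever the other zone is not resource limited and either the scheduler
// is running post-RA or the latency heuristic asks for it.
bool wantsLatencyReduction(const LatencyInputs &in) {
  return !in.otherResLimited && (in.isPostRA || in.shouldReduceLatency);
}

// A few illustrative cases.
void checkLatencyPolicyExamples() {
  LatencyInputs postRA{false, true, false};
  LatencyInputs postRALimited{true, true, false};
  LatencyInputs preRAQuiet{false, false, false};
  LatencyInputs preRAHeuristic{false, false, true};
  assert(wantsLatencyReduction(postRA));         // post-RA alone is enough
  assert(!wantsLatencyReduction(postRALimited)); // resource limit wins
  assert(!wantsLatencyReduction(preRAQuiet));
  assert(wantsLatencyReduction(preRAHeuristic));
}
```

None of the inputs describes whether the core is out-of-order. Going by the quoted comment, that decision is presumably made elsewhere, by whether post-RA scheduling runs at all for a given processor, rather than inside this condition.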
Allen
July 16, 2022, 6:48am
7
Based on the following information (see the attached file for details), can it be confirmed that this is due to an execution-resource limitation?
When SU(4) is scheduled, SU(1) is not in the candidate list, even though it is already in TopQ.A. Then, after SU(4) is issued, SU(1) is moved from TopQ.A to TopQ.P (starting at line 1045 of the attached file).
renamable $q1, renamable $q3 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8) -- SU(0)
renamable $q2, renamable $q4 = LDPQi renamable $x8, -1 :: (load (s128) from %ir.scevgep5, align 8), (load (s128) from %ir.lsr.iv13, align 8) -- SU(1)
renamable $q2 = nnan ninf nsz arcp contract afn reassoc FMLAv2f64 killed renamable $q2(tied-def 0), renamable $q0, killed renamable $q1 -- SU(2)
renamable $q4 = nnan ninf nsz arcp contract afn reassoc FMLAv2f64 killed renamable $q4(tied-def 0), renamable $q0, killed renamable $q3 -- SU(3)
renamable $x2 = SUBSXri killed renamable $x2, 4, 0, implicit-def $nzcv -- SU(4)
STPQi renamable $q2, renamable $q4, renamable $x8, -1 :: (store (s128) into %ir.scevgep4, align 8), (store (s128) into %ir.lsr.iv13, align 8) -- SU(5)
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 0 1 4
TopQ.A RemainingLatency 0 + 0c > CritPath 17
Cand SU(0) ORDER
Pick Top TOP-PATH
Scheduling SU(0) renamable $q1, renamable $q3 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8)
Ready @0c
A57UnitL +2x6u
*** Critical resource A57UnitL: 2c
TopQ.A BotLatency SU(0) 17c
*** Max MOps 3 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
Retired: 3
Executed: 2c
Critical: 2c, 2 A57UnitL
ExpectedLatency: 0c
- Resource limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 4 1 7
TopQ.A RemainingLatency 0 + 1c > CritPath 17
TopQ.A ResourceLimited: A57UnitL
Cand SU(4) ORDER
Pick Top RES-REDUCE
Scheduling SU(4) renamable $x2 = SUBSXri renamable $x2, 4, 0, implicit-def $nzcv
Ready @1c
A57UnitI +1x3u
TopQ.A @1c
Retired: 4
Executed: 2c
Critical: 2c, 2 A57UnitL
ExpectedLatency: 0c
- Resource limited.
** ScheduleDAGMI::schedule picking next node
SU(1) uops=3
Queue TopQ.P: 1
Queue TopQ.A: 7
Pick Top ONLY1
Scheduling SU(7) renamable $x9 = ADDXri renamable $x9, 32, 0
Ready @1c
A57UnitI +1x3u
TopQ.A @1c
Retired: 5
Executed: 2c
Critical: 2c, 2 A57UnitL
ExpectedLatency: 0c
- Resource limited.
all.txt (38.0 KB)
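One possible reading of the `SU(1) uops=3` line in the log is a micro-op issue-width check: once SU(4) has taken one slot in the current cycle, a 3-micro-op load pair no longer fits. The sketch below shows that check under assumptions. `exceedsIssueWidth` and `checkIssueWidthExamples` are hypothetical helpers, and the issue width of 3 is only inferred from the `Max MOps 3 at cycle 0` line, not confirmed against the scheduling model.

```cpp
#include <cassert>

// Hypothetical helper: true when a candidate with candUops micro-ops cannot
// join an issue group that already holds currMOps micro-ops. An empty group
// always accepts the candidate, however large it is.
bool exceedsIssueWidth(unsigned currMOps, unsigned candUops,
                       unsigned issueWidth) {
  assert(issueWidth > 0);
  return currMOps > 0 && currMOps + candUops > issueWidth;
}

// Cases mirroring the log, assuming an issue width of 3.
void checkIssueWidthExamples() {
  const unsigned width = 3; // assumption inferred from "Max MOps 3"
  assert(!exceedsIssueWidth(0, 3, width)); // SU(0): first in its cycle
  assert(exceedsIssueWidth(1, 3, width));  // SU(1) after SU(4): deferred
  assert(!exceedsIssueWidth(1, 1, width)); // SU(7) after SU(4): still fits
}
```

Under this reading, SU(1) lands in the pending queue because of the issue group's width, while the `RES-REDUCE` pick of SU(4) reflects the A57UnitL resource pressure reported just above it.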
Allen
July 16, 2022, 8:56am
8