This example is on AMDGPU:
Code:
SU(0): %3588:sreg_32 = nsw S_MUL_I32 %163:sreg_32_xm0_xexec, 50
...
%3049:vgpr_32 = V_ADD3_U32_e64 %673:vgpr_32, %6970:vgpr_32, %674:vgpr_32, implicit $exec
%3050:vgpr_32 = V_ADD3_U32_e64 %3049:vgpr_32, %706:vgpr_32, %707:vgpr_32, implicit $exec
%3051:vgpr_32 = V_ADD3_U32_e64 %3050:vgpr_32, %739:vgpr_32, %740:vgpr_32, implicit $exec
%3052:vgpr_32 = V_ADD3_U32_e64 %3051:vgpr_32, %772:vgpr_32, %773:vgpr_32, implicit $exec
%3053:vgpr_32 = V_ADD3_U32_e64 %3052:vgpr_32, %807:vgpr_32, %808:vgpr_32, implicit $exec
%3054:vgpr_32 = V_ADD3_U32_e64 %3053:vgpr_32, %842:vgpr_32, %843:vgpr_32, implicit $exec
%3055:vgpr_32 = V_ADD3_U32_e64 %3054:vgpr_32, %877:vgpr_32, %878:vgpr_32, implicit $exec
%3056:vgpr_32 = V_ADD3_U32_e64 %3055:vgpr_32, %912:vgpr_32, %913:vgpr_32, implicit $exec
%3057:vgpr_32 = V_ADD3_U32_e64 %3056:vgpr_32, %947:vgpr_32, %948:vgpr_32, implicit $exec
%3058:vgpr_32 = V_ADD3_U32_e64 %3057:vgpr_32, %982:vgpr_32, %983:vgpr_32, implicit $exec
%3059:vgpr_32 = V_ADD3_U32_e64 %3058:vgpr_32, %1017:vgpr_32, %1018:vgpr_32, implicit $exec
%3060:vgpr_32 = V_ADD3_U32_e64 %3059:vgpr_32, %1052:vgpr_32, %1053:vgpr_32, implicit $exec
%3061:vgpr_32 = V_ADD3_U32_e64 %3060:vgpr_32, %1087:vgpr_32, %1088:vgpr_32, implicit $exec
%3062:vgpr_32 = V_ADD3_U32_e64 %3061:vgpr_32, %1122:vgpr_32, %1123:vgpr_32, implicit $exec
...
Debug statements for the first 2 iterations of bidirectional scheduling
** ScheduleDAGMILive::schedule picking next node
Queue BotQ.P:
Queue BotQ.A: 2626 2410
Queue TopQ.P:
Queue TopQ.A: 0 2 51 52 304 561 818 1089 1095 1380 1637 1894 2610 2611 2614
TopQ.A + Remain MOps: 2627
BotQ.A RemLatency SU(2626) 131c
BotQ.A + Remain MOps: 2628
TopQ.A RemLatency SU(0) 131c
Picking from Bot:
Cand SU(2626) ORDER
Picking from Top:
Cand SU(0) ORDER
Top Cand: Cand SU(0) ORDER
Bot Cand: Cand SU(2626) ORDER
Picking: Cand SU(2626) ORDER
Scheduling SU(2626) BUFFER_STORE_BYTE_OFFEN_exact
--------------------------------------------------------------------------------------------------------------
** ScheduleDAGMILive::schedule picking next node
Queue BotQ.P:
Queue BotQ.A: 2410 2543 2623 2622 2625
Queue TopQ.P:
Queue TopQ.A: 0 2 51 52 304 561 818 1089 1095 1380 1637 1894 2610 2611 2614
TopQ.A + Remain MOps: 2626
BotQ.A RemLatency SU(2623) 131c
BotQ.A + Remain MOps: 2628
TopQ.A RemLatency SU(0) 131c
Picking from Bot:
Cand SU(2410) ORDER
Cand SU(2543) ORDER
Cand SU(2623) ORDER
Cand SU(2625) ORDER
Picking from Top:
Cand SU(0) NOCAND
Top Cand: Cand SU(0) NOCAND
Bot Cand: Cand SU(2625) ORDER
Picking: Cand SU(2625) ORDER
Scheduling SU(2625) %6477:vgpr_32 = V_ADD_U32_e32 4, %6476:vgpr_32, implicit $exec
In the example above, the scheduler enqueues SU(0) into TopQ. It decides to pick a buffer_store_byte
instruction with reason Order
. SU(0) is also the TopCand
in the next iteration, however this time it’s reason is set to NoCand
because no heuristic in tryCandidate
could alter the reason to anything else. So the scheduler ends up picking the alternative until the very end when it picks SU(0) from BotQ
after dequeuing it from TopQ
. This pattern is problematic because often the scheduler is forced to pick nodes which lead to excessive register pressure. Remark that the V_ADD3’s operands above depend on three operands: result from the previous V_ADD3 and from V_MUL’s that are about 2000 instructions above. This leads to too many long live ranges interfering with each other which further causes too many vgpr spills.
I tried forcing the scheduler to pick the TopCand
if it’s SU# is <= 100. Yet, the scheduler ends up scheduling bottom-up like the above case.
Thanks,
Anshil