Preventing bidirectional scheduler from only scheduling in one direction

Hi,

I found a case where a bidirectional scheduler ends up scheduling bottom-up throughout the block in question because some SU “prevents” the scheduler from picking the top candidate. I am thinking of adding a counter to SchedCandidate that keeps track of how many times the alternative node was picked; if the counter goes over some limit, the scheduler would add some priority toward picking that node. This should also help avoid eviction chains during register allocation, since it has the potential to reduce the number of interfering live ranges.
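Very roughly, the shape I have in mind is something like the sketch below. This is not a patch against GenericScheduler; the names (TimesPassedOver, PassedOverLimit, notePassedOver) are made up purely to illustrate the idea:

// Illustrative only: a SchedCandidate-like struct extended with a counter
// recording how often this boundary's pick lost to the other boundary.
struct CandidateSketch {
  unsigned SUNum = 0;           // stand-in for the wrapped SUnit's NodeNum
  unsigned TimesPassedOver = 0; // times the other boundary's pick won instead
};

// Made-up threshold; a real value would need tuning.
constexpr unsigned PassedOverLimit = 16;

// Called when the other boundary's candidate was chosen over this one.
// Returns true once the candidate has been passed over often enough that
// the scheduler should bias toward it on the next pick.
bool notePassedOver(CandidateSketch &Cand) {
  ++Cand.TimesPassedOver;
  return Cand.TimesPassedOver > PassedOverLimit;
}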

Cheers,
Anshil

Can you explain what you mean by “some SU ‘prevents’ the scheduler from picking the top candidate”? Adding a counter sounds like a workaround. There might be a better way to fix the underlying problem.

Do you have an example?

This example is on AMDGPU:

Code:

SU(0):   %3588:sreg_32 = nsw S_MUL_I32 %163:sreg_32_xm0_xexec, 50
...
  %3049:vgpr_32 = V_ADD3_U32_e64 %673:vgpr_32, %6970:vgpr_32, %674:vgpr_32, implicit $exec
  %3050:vgpr_32 = V_ADD3_U32_e64 %3049:vgpr_32, %706:vgpr_32, %707:vgpr_32, implicit $exec
  %3051:vgpr_32 = V_ADD3_U32_e64 %3050:vgpr_32, %739:vgpr_32, %740:vgpr_32, implicit $exec
  %3052:vgpr_32 = V_ADD3_U32_e64 %3051:vgpr_32, %772:vgpr_32, %773:vgpr_32, implicit $exec
  %3053:vgpr_32 = V_ADD3_U32_e64 %3052:vgpr_32, %807:vgpr_32, %808:vgpr_32, implicit $exec
  %3054:vgpr_32 = V_ADD3_U32_e64 %3053:vgpr_32, %842:vgpr_32, %843:vgpr_32, implicit $exec
  %3055:vgpr_32 = V_ADD3_U32_e64 %3054:vgpr_32, %877:vgpr_32, %878:vgpr_32, implicit $exec
  %3056:vgpr_32 = V_ADD3_U32_e64 %3055:vgpr_32, %912:vgpr_32, %913:vgpr_32, implicit $exec
  %3057:vgpr_32 = V_ADD3_U32_e64 %3056:vgpr_32, %947:vgpr_32, %948:vgpr_32, implicit $exec
  %3058:vgpr_32 = V_ADD3_U32_e64 %3057:vgpr_32, %982:vgpr_32, %983:vgpr_32, implicit $exec
  %3059:vgpr_32 = V_ADD3_U32_e64 %3058:vgpr_32, %1017:vgpr_32, %1018:vgpr_32, implicit $exec
  %3060:vgpr_32 = V_ADD3_U32_e64 %3059:vgpr_32, %1052:vgpr_32, %1053:vgpr_32, implicit $exec
  %3061:vgpr_32 = V_ADD3_U32_e64 %3060:vgpr_32, %1087:vgpr_32, %1088:vgpr_32, implicit $exec
  %3062:vgpr_32 = V_ADD3_U32_e64 %3061:vgpr_32, %1122:vgpr_32, %1123:vgpr_32, implicit $exec
...

Debug output for the first two iterations of bidirectional scheduling:

** ScheduleDAGMILive::schedule picking next node
Queue BotQ.P: 
Queue BotQ.A: 2626 2410 
Queue TopQ.P: 
Queue TopQ.A: 0 2 51 52 304 561 818 1089 1095 1380 1637 1894 2610 2611 2614 
  TopQ.A + Remain MOps: 2627
BotQ.A RemLatency SU(2626) 131c
  BotQ.A + Remain MOps: 2628
TopQ.A RemLatency SU(0) 131c
Picking from Bot:
  Cand SU(2626) ORDER                              
Picking from Top:
  Cand SU(0) ORDER                              
Top Cand:   Cand SU(0) ORDER                              
Bot Cand:   Cand SU(2626) ORDER                              
Picking:   Cand SU(2626) ORDER                              
Scheduling SU(2626) BUFFER_STORE_BYTE_OFFEN_exact
--------------------------------------------------------------------------------------------------------------
** ScheduleDAGMILive::schedule picking next node
Queue BotQ.P: 
Queue BotQ.A: 2410 2543 2623 2622 2625 
Queue TopQ.P: 
Queue TopQ.A: 0 2 51 52 304 561 818 1089 1095 1380 1637 1894 2610 2611 2614 
  TopQ.A + Remain MOps: 2626
BotQ.A RemLatency SU(2623) 131c
  BotQ.A + Remain MOps: 2628
TopQ.A RemLatency SU(0) 131c
Picking from Bot:
  Cand SU(2410) ORDER                              
  Cand SU(2543) ORDER                              
  Cand SU(2623) ORDER                              
  Cand SU(2625) ORDER                              
Picking from Top:
  Cand SU(0) NOCAND                             
Top Cand:   Cand SU(0) NOCAND                             
Bot Cand:   Cand SU(2625) ORDER                              
Picking:   Cand SU(2625) ORDER
Scheduling SU(2625) %6477:vgpr_32 = V_ADD_U32_e32 4, %6476:vgpr_32, implicit $exec

In the example above, the scheduler enqueues SU(0) into TopQ. It decides to pick a buffer_store_byte instruction with reason Order. SU(0) is also the TopCand in the next iteration, but this time its reason is set to NoCand because no heuristic in tryCandidate could change the reason to anything else. So the scheduler keeps picking the alternative until the very end, when it picks SU(0) from BotQ after it has been dequeued from TopQ. This pattern is problematic because the scheduler is often forced to pick nodes that lead to excessive register pressure. Note that each V_ADD3 above depends on three operands: the result of the previous V_ADD3 and two values produced by V_MULs about 2000 instructions earlier. This produces many long live ranges interfering with each other, which in turn causes a large number of vgpr spills.

I tried forcing the scheduler to pick the TopCand if its SU number is <= 100. Yet the scheduler still ends up scheduling bottom-up, as in the case above.

Thanks,
Anshil

Per offline discussion –

It seems that implementing a new heuristic in tryCandidate that compares the absolute state of register pressure across the !SameBoundary case may address the concern more directly – the idea is for the scheduler to produce patterns amenable to eviction.

We could introduce a class, say InterferingLiveRanges, that tracks the number of interfering live ranges during scheduling. More precisely, every time an SU is scheduled, the tracker inserts the SU’s live range into a linked list of previously inserted live ranges. To determine whether candidate TryCand would introduce more interfering live ranges than Cand, we can walk the list and check whether TryCand’s live range intersects any of the ranges already in it. Results can be cached for speedup. This could probably be extended to work with the register allocator as well. Thoughts?
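To make this a bit more concrete, here is a rough sketch of what such a tracker could look like. It deliberately uses a simplified [Start, End) pair instead of LLVM’s LiveInterval, and the names (InterferingLiveRanges, countInterferences) are just placeholders:

#include <cstdint>
#include <list>

// Simplified stand-in for a live range: [Start, End) in schedule slots.
struct SimpleRange {
  uint64_t Start;
  uint64_t End;
};

// Hypothetical tracker for the live ranges of already-scheduled SUs.
class InterferingLiveRanges {
  std::list<SimpleRange> Ranges;

public:
  // Record the live range introduced by a just-scheduled SU.
  void insert(const SimpleRange &R) { Ranges.push_back(R); }

  // Count how many previously inserted ranges a candidate's range overlaps.
  unsigned countInterferences(const SimpleRange &Cand) const {
    unsigned N = 0;
    for (const SimpleRange &R : Ranges)
      if (Cand.Start < R.End && R.Start < Cand.End) // half-open overlap test
        ++N;
    return N;
  }
};

tryCandidate could then prefer whichever of TryCand and Cand yields the smaller countInterferences result, with caching layered on top if the linear walk turns out to be too slow.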

cc @byrnesj1

Hi —

I’m not sure we can use LiveIntervals in the way you’ve described, as they will be modified while scheduling decisions are being made. Besides, this seems very similar to what we already have with the RPTracker; perhaps we can use that instead?

I would also not be surprised if the heuristic needed redesigning to accomplish its goals (1. tiebreaker between directions, 2. good eviction order), but I guess this is something that will need to be evaluated experimentally.


I’m not sure, but it sounds like this might be a bug. It is very vaguely reminiscent of D68338 “[AMDGPU] Remove dubious logic in bidirectional list scheduler”, which dealt with logic that also caused the scheduler to get stuck in a state where it made bad decisions about whether to pick TopCand or BotCand.

The scheduler already tracks register pressure, which is how we measure the number of interfering live ranges. Maybe the problem you’re running into is that the scheduler only considers the register pressure at the point where it adds a new instruction to the schedule. It does not consider that the peak register pressure for the whole basic block might be somewhere else, in the middle of the instructions that have not been scheduled yet. It would be great if we could constantly track the pressure throughout the entire basic block, and how it changes as each instruction is scheduled, but I don’t know how efficiently that can be implemented. The scheduler is already pretty slow.
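To illustrate with a toy example (plain C++, not LLVM code): if we track a pressure profile across a whole block, the peak can easily sit in the middle of the not-yet-scheduled region, where a boundary-only view never looks.

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  // Toy per-instruction pressure deltas for a block: +1 when an instruction
  // starts a live range, -1 when it ends one.
  std::vector<int> Deltas = {+1, +1, +1, +1, -1, +1, -1, -1, -1, -1};

  int Pressure = 0, Peak = 0;
  std::size_t PeakIdx = 0;
  for (std::size_t I = 0; I < Deltas.size(); ++I) {
    Pressure += Deltas[I];
    if (Pressure > Peak) { Peak = Pressure; PeakIdx = I; }
  }
  // Prints "peak pressure 4 at instruction 3": the maximum is in the middle
  // of the block, not at either boundary.
  std::printf("peak pressure %d at instruction %zu\n", Peak, PeakIdx);
  return 0;
}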


The following condition seems contrary to its corresponding comment. Shouldn’t the second argument be CandP.getUnitInc() > 0 to check if the candidate is increasing RP?