Preventing bidirectional scheduler from only scheduling in one direction

Hi,

I found an instance where the bidirectional scheduler ends up scheduling bottom-up throughout the block in question because some SU “prevents” the scheduler from picking the top candidate. I am thinking of adding a counter to SchedCandidate that keeps track of how many times the alternative node was picked; if the counter goes over some limit, the scheduler would add some priority towards picking that node. This should also help avoid eviction chains during register allocation, since it has the potential to reduce the number of interfering live ranges.

Cheers,
Anshil

Can you explain how “some SU ‘prevents’ the scheduler from picking the top candidate”? Adding a counter sounds like a workaround. There might be a better way to fix the underlying problem.

Do you have an example?

This example is on AMDGPU:

Code:

SU(0):   %3588:sreg_32 = nsw S_MUL_I32 %163:sreg_32_xm0_xexec, 50
...
  %3049:vgpr_32 = V_ADD3_U32_e64 %673:vgpr_32, %6970:vgpr_32, %674:vgpr_32, implicit $exec
  %3050:vgpr_32 = V_ADD3_U32_e64 %3049:vgpr_32, %706:vgpr_32, %707:vgpr_32, implicit $exec
  %3051:vgpr_32 = V_ADD3_U32_e64 %3050:vgpr_32, %739:vgpr_32, %740:vgpr_32, implicit $exec
  %3052:vgpr_32 = V_ADD3_U32_e64 %3051:vgpr_32, %772:vgpr_32, %773:vgpr_32, implicit $exec
  %3053:vgpr_32 = V_ADD3_U32_e64 %3052:vgpr_32, %807:vgpr_32, %808:vgpr_32, implicit $exec
  %3054:vgpr_32 = V_ADD3_U32_e64 %3053:vgpr_32, %842:vgpr_32, %843:vgpr_32, implicit $exec
  %3055:vgpr_32 = V_ADD3_U32_e64 %3054:vgpr_32, %877:vgpr_32, %878:vgpr_32, implicit $exec
  %3056:vgpr_32 = V_ADD3_U32_e64 %3055:vgpr_32, %912:vgpr_32, %913:vgpr_32, implicit $exec
  %3057:vgpr_32 = V_ADD3_U32_e64 %3056:vgpr_32, %947:vgpr_32, %948:vgpr_32, implicit $exec
  %3058:vgpr_32 = V_ADD3_U32_e64 %3057:vgpr_32, %982:vgpr_32, %983:vgpr_32, implicit $exec
  %3059:vgpr_32 = V_ADD3_U32_e64 %3058:vgpr_32, %1017:vgpr_32, %1018:vgpr_32, implicit $exec
  %3060:vgpr_32 = V_ADD3_U32_e64 %3059:vgpr_32, %1052:vgpr_32, %1053:vgpr_32, implicit $exec
  %3061:vgpr_32 = V_ADD3_U32_e64 %3060:vgpr_32, %1087:vgpr_32, %1088:vgpr_32, implicit $exec
  %3062:vgpr_32 = V_ADD3_U32_e64 %3061:vgpr_32, %1122:vgpr_32, %1123:vgpr_32, implicit $exec
...

Debug statements for the first 2 iterations of bidirectional scheduling

** ScheduleDAGMILive::schedule picking next node
Queue BotQ.P: 
Queue BotQ.A: 2626 2410 
Queue TopQ.P: 
Queue TopQ.A: 0 2 51 52 304 561 818 1089 1095 1380 1637 1894 2610 2611 2614 
  TopQ.A + Remain MOps: 2627
BotQ.A RemLatency SU(2626) 131c
  BotQ.A + Remain MOps: 2628
TopQ.A RemLatency SU(0) 131c
Picking from Bot:
  Cand SU(2626) ORDER                              
Picking from Top:
  Cand SU(0) ORDER                              
Top Cand:   Cand SU(0) ORDER                              
Bot Cand:   Cand SU(2626) ORDER                              
Picking:   Cand SU(2626) ORDER                              
Scheduling SU(2626) BUFFER_STORE_BYTE_OFFEN_exact
--------------------------------------------------------------------------------------------------------------
** ScheduleDAGMILive::schedule picking next node
Queue BotQ.P: 
Queue BotQ.A: 2410 2543 2623 2622 2625 
Queue TopQ.P: 
Queue TopQ.A: 0 2 51 52 304 561 818 1089 1095 1380 1637 1894 2610 2611 2614 
  TopQ.A + Remain MOps: 2626
BotQ.A RemLatency SU(2623) 131c
  BotQ.A + Remain MOps: 2628
TopQ.A RemLatency SU(0) 131c
Picking from Bot:
  Cand SU(2410) ORDER                              
  Cand SU(2543) ORDER                              
  Cand SU(2623) ORDER                              
  Cand SU(2625) ORDER                              
Picking from Top:
  Cand SU(0) NOCAND                             
Top Cand:   Cand SU(0) NOCAND                             
Bot Cand:   Cand SU(2625) ORDER                              
Picking:   Cand SU(2625) ORDER
Scheduling SU(2625) %6477:vgpr_32 = V_ADD_U32_e32 4, %6476:vgpr_32, implicit $exec

In the example above, the scheduler enqueues SU(0) into TopQ. It decides to pick a buffer_store_byte instruction with reason Order. SU(0) is also the TopCand in the next iteration, but this time its reason is set to NoCand because no heuristic in tryCandidate could change the reason to anything else. So the scheduler keeps picking the alternative until the very end, when it finally picks SU(0) from BotQ after dequeuing it from TopQ. This pattern is problematic because the scheduler is often forced to pick nodes that lead to excessive register pressure. Note that each V_ADD3 above depends on three operands: the result of the previous V_ADD3 and two values coming from V_MULs about 2000 instructions above. This creates many long live ranges interfering with each other, which in turn causes many VGPR spills.

I tried forcing the scheduler to pick the TopCand if its SU number is <= 100. Yet the scheduler still ends up scheduling bottom-up, as in the case above.

Thanks,
Anshil

Per offline discussion –

It seems that implementing a new heuristic in tryCandidate that compares the absolute state of RP across !SameBoundary may more directly address the concern. The idea is for the scheduler to produce patterns that are amenable to eviction.

We could introduce a class, say InterferingLiveRanges, that tracks the number of interfering live ranges during scheduling. More precisely, every time an SU is scheduled, the tracker inserts the SU’s live range into a linked list of previously inserted live ranges. To determine whether candidate TryCand would introduce more interfering live ranges than Cand, we loop over the list and check whether TryCand’s live range intersects any of the ranges in it. Results can be cached for speed. This could probably be extended to work with the register allocator as well. Thoughts?
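A very rough sketch of what I have in mind (the class and its methods are made up, not existing LLVM API, and this glosses over how the live ranges would be obtained and kept up to date while scheduling):

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/LiveInterval.h"

  using namespace llvm;

  // Hypothetical tracker: remembers the live range of every value defined by
  // an already-scheduled SU and counts how many of them a candidate's live
  // range would overlap.
  class InterferingLiveRanges {
    SmallVector<const LiveInterval *, 32> Scheduled;
    DenseMap<const LiveInterval *, unsigned> Cache; // memoized counts

  public:
    // Call when an SU is scheduled, with the live interval of its def.
    void recordScheduled(const LiveInterval &LI) {
      Scheduled.push_back(&LI);
      Cache.clear(); // cached counts go stale once a new range is recorded
    }

    // How many already-recorded ranges does CandLI interfere with?
    unsigned countInterferences(const LiveInterval &CandLI) {
      auto It = Cache.find(&CandLI);
      if (It != Cache.end())
        return It->second;
      unsigned N = 0;
      for (const LiveInterval *LI : Scheduled)
        if (LI->overlaps(CandLI)) // LiveRange::overlaps
          ++N;
      return Cache[&CandLI] = N;
    }
  };

A heuristic in tryCandidate could then prefer whichever of TryCand and Cand has the smaller count.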

cc @byrnesj1

Hi —

I’m not sure we can use LiveIntervals in the way you’ve described, as they will be modified while we are making scheduling decisions. Besides, this seems very similar to what we already have with the RPTracker; perhaps we can use that instead?

I would also not be surprised if the heuristic needed redesigning to accomplish its goals (1. tiebreaker between directions, 2. good eviction order), but I guess this is something that will need to be evaluated experimentally.


I’m not sure, but it sounds like this might be a bug. It is very vaguely reminiscent of D68338 ([AMDGPU] Remove dubious logic in bidirectional list scheduler), which also dealt with the scheduler getting stuck in a state where it made bad decisions about whether to pick TopCand or BotCand.

The scheduler already tracks register pressure, which is how we measure the number of interfering live ranges. Maybe the problem you’re running into is that the scheduler only considers the register pressure at the point where it adds a new instruction to the schedule. It does not consider that the peak register pressure for the whole basic block might be somewhere else, in the middle of the instructions that have not been scheduled yet. It would be great if we could constantly track the pressure throughout the entire basic block, and how it changes as each instruction is scheduled, but I don’t know how efficiently that can be implemented. The scheduler is already pretty slow.


The following condition seems contrary to its corresponding comment. Shouldn’t the second argument be CandP.getUnitInc() > 0 to check if the candidate is increasing RP?
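For reference, the check I mean is the first comparison in tryPressure() in MachineScheduler.cpp (quoted from memory, so please double-check against trunk):

  // If one candidate decreases and the other increases, go with it.
  // Invalid candidates have UnitInc==0.
  if (tryGreater(TryP.getUnitInc() < 0, CandP.getUnitInc() < 0, TryCand, Cand,
                 Reason)) {
    return true;
  }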

Hello there,

I think it is indeed a bug. I also noticed it recently while working on our downstream KVX backend.

The problem lies here: https://github.com/llvm/llvm-project/blob/52b33ff760b01e73dc4a0960f97812f9effe18f3/llvm/lib/CodeGen/MachineScheduler.cpp#L3677

  // Pick best from BotCand and TopCand.
  SchedCandidate Cand = BotCand;
  TopCand.Reason = NoCand;
  if (tryCandidate(Cand, TopCand, nullptr)) {
    Cand.setBest(TopCand);
    LLVM_DEBUG(traceCandidate(Cand));
  }

TopCand.Reason is set to NoCand. I believe that comes from older versions where the code looked like this:

  SchedCandidate Cand = BotCand;
  TopCand.Reason = NoCand;
  tryCandidate(Cand, TopCand, nullptr);
  if (TopCand.Reason != NoCand) {
    Cand.setBest(TopCand);
    LLVM_DEBUG(traceCandidate(Cand));
  }

The information “Was the candidate picked?” was computed by checking TopCand.Reason != NoCand: if TopCand.Reason was still NoCand, then no heuristic modified the reason, so TopCand was not picked.

As TopCand.Reason is not restored after deciding Bot vs Top, the cached candidate for the SU (in the OP’s example, SU(0)) will always be stuck at NoCand. Therefore Bot vs Top will always prefer Bot, because any reason (including Order) is better than NoCand. As a result, bidirectional scheduling is really just bottom-up scheduling, except in the few cases where Top wins from the get-go.

I fixed it downstream by restoring the state of TopCand.Reason. This corrected the behavior, but I found no real difference either way, since Bottom still gets priority over Top: tryCandidate favors its first argument, in the sense that if no reason is found to prefer the second argument, the first one gets picked.

I tried something a little different, where I alternate the default candidate between Top and Bottom: if Top was picked last, then the default candidate is Bottom; conversely, if Bottom was picked last, then Top is the default candidate.
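Roughly, it looks like this (just a sketch of the idea, not the actual downstream patch; PreferTop is made-up bookkeeping kept on the scheduling strategy):

  // Inside pickNodeBidirectional: seed the Bot-vs-Top comparison with
  // whichever boundary did NOT win last time, so ties no longer always fall
  // back to Bot.
  SchedCandidate Cand = PreferTop ? TopCand : BotCand;   // default winner
  SchedCandidate &Other = PreferTop ? BotCand : TopCand; // challenger
  if (tryCandidate(Cand, Other, /*Zone=*/nullptr)) {
    Cand.setBest(Other);
    LLVM_DEBUG(traceCandidate(Cand));
  }
  // Alternate the default for the next pick.
  PreferTop = !Cand.AtTop;
  IsTopNode = Cand.AtTop;
  return Cand.SU;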

I got some nice performance improvements overall with that alternation (although it’s not clear-cut: scheduling changes a lot of things, so there are regressions too). My downstream patch is under code review right now; I can share it once it is done.

(The emphasis is mine.) I don’t think that’s true. The next time we compare Bot vs Top, we don’t look at Bot.Reason or Top.Reason; we actually compare the SUs to see which one is better and return a new reason.

I do think the API of tryCandidate is weird and confusing though. Maybe it should return a signed integer indicating which one is better and by how much.


It’s interesting that that patch gave you an overall improvement. I’d like to understand why.

We’re talking about the case where the scheduler’s heuristics cannot distinguish two candidates X and Y. Currently it always picks Y. You’re proposing to alternate picking X and Y. You could imagine any other arbitrary choice, like always picking X, or picking one of them pseudo-randomly, but these are all still completely arbitrary.

It would be much better to understand why you’re getting better scheduling in some cases, and code that up as a new heuristic (or improve one of the many existing heuristics) so that the scheduler could reliably make better choices.


Indeed you are right, thanks for bringing it up. I had implicitly assumed that Cand.isValid() was checking Cand.Reason. That explains why I did not see any difference with the fix (except that the debug log now correctly showed the Reason for the Top candidate instead of NoCand).

Just for context, I was examining the pre-RA scheduler (on our target we always schedule top-down post-RA).

Top-down scheduling tends to give good results in terms of latency, but because it schedules instructions as soon as possible, it also tends to increase register pressure.

Bottom-up scheduling gives better results for register pressure, but the latency is sometimes worse than what you would have gotten with top-down scheduling.

We originally picked bidirectional scheduling to get something between the two approaches.

After enabling MachinePipeliner, I noticed examples where the scheduler ended up far from the schedule originally planned by MachinePipeliner. In particular, I had one example where the main loop differed by more than 5 cycles between the schedule planned by MachinePipeliner and the pre-RA schedule produced by MachineScheduler.

I tried to dig into why, examining the debug logs of a few examples that did not perform as well as MachinePipeliner said they would. That’s when I noticed that, on those examples, the schedule was really just bottom-up: Top was neither winning nor losing the heuristics, so Bottom was chosen by default. When I tried top-down on that specific example, the scheduling got better.

My first attempt was to switch to top-down scheduling on all basic blocks pipelined by MachinePipeliner. But it turned out this hurt some other examples, because register pressure rose and, ultimately, spills were inserted.

My second attempt is what I described above: alternate between Top and Bottom to get something more bidirectional and less bottom-up. That seems to work well enough, although I cannot say exactly why right now; it is mostly empirical.

You are right that there is something to dig into there. I will try looking into it.