Considering register pressure when deciding Initiation Interval in MachinePipeliner

Now I’d like to append to the MachinePipeliner an algorithm that takes into account register pressure when searching Initiation Interval (II).

Background

Modulo scheduling of software pipelining tends to increase register pressure due to the overlaps of each iteration.
Large register pressure can result in additional register spill/fill, which can cause performance degradation.
Therefore, considering register pressure when scheduling may yield better results.
In fact, there are several scheduling algorithms that attempt to reduce the resister pressure, such as the one in this paper.

Algorithm

The approach in the above paper is based on Iterative Modulo Scheduling algorithm, which differs from current MachinePipeliner’s one (Swing Modulo Scheduling).
So I come up with a simple algorithm to reduce the register pressure.
The idea is to reject an II where the maximum register pressure (corresponds to the term Maxlives in the paper) is so high that register spill/fill may occur.
Calculating the maximum register pressure is not so difficult: simply unroll the scheduled instructions, then checks the live intervals of each register.

Details

In current MachinePipeliner’s implementation, searching II is done in SwingSchedulerDAG::schedulePipeline.
Here is a code fragment of the function and what I’d like to add.

bool scheduleFound = false;
for (unsigned II = MII; II <= MAX_II && !scheduleFound; II++) {
  // Try to schedule with II
  ...

  // At this point, the `scheduleFound` is true if there are no problems (i.e. scheduling went well)
  // Our proposal is to insert an additional check that calculates the maximum register pressure of the schedule and rejects the II if the pressure is too high, as shown below
  if (Register pressure is too high) 
    scheduleFound = false
}

The following is a pseudo code that calculates the maximum register pressure.

function CalcMaxRegPress(ScheduledInstructions)
  N <- Number of Pressure Set
  CurRegPress <- [0, 0, ..., 0]  // N elements array, indexed by Pressure Set ID
  MaxRegPress <- [0, 0, ..., 0]
  Unrolled <- Unrolled ScheduleInstructions for simulating the loop from prolog to kernel

  for Instr in Unrolled
    for Def in AllDefs(Instr)
      Increase CurRegPress corresponds to Def
    end for

    for LastUse in AllLastUses(Instr)
      Decrease CurRegPress corresponds to LastUse
    end for

    for i in 1 to N
      MaxRegPress[i] <- Max(MaxRegPress[i], CurRegPress[i])
    end for
  end for

  return MaxRegPress
end function

Does anyone have any advice or thoughts?

Thanks.

We implemented the above algorithm and found a case where this algorithm is effective.

We used a code of Gauss-Seidel method that is based on this one, experimented on Graviton3 (Neoverse V1). Please see this PR comment for details.

If the pipeliner simply applied, the II will be 11. On the other hand, when our algorithm is also applied, the II becomes 20. This is because this algorithm rejects schedules that may cause a lot of register spills. In fact, when 11 is used for II, many register spill/fill occur, but hardly at all when II is 20. Also, as shown in the table of the PR comment, this change in II makes the code more efficient. The cycles changes from 19.6 to 15.7, meaning an improvement of about 20% (please note that we also apply the MVE algorithm).

I would appreciate any comments or advice.