Register pressure calculation in the machine scheduler and live-through registers

Hello,

In a previous email, Matthias mentioned that register pressure estimates in the machine scheduler are not absolute; they only account for the registers that are used in the block. I assume that he meant that registers that are live-through (both live-in and live-out) are not accounted for in register pressure calculations. If a register is either live-in or live-out but not both, it must be accounted for, because its live range length may be affected by the scheduling decisions within the region.

I understand that the live range of a live-through register is not affected by the scheduling decisions within the region, but unfortunately, the scheduler that we are integrating into the machine scheduler does need to know the absolute register pressure. Our scheduler is a combinatorial scheduler with an explicit cost function. That cost function is what we call the Peak Excess Register Pressure (PERP), which is basically the number of registers above the physical limit at the highest pressure point in the schedule. If the peak pressure does not exceed the physical limit for any register type, the PERP is zero.
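To make the cost function concrete, here is a rough sketch of how PERP could be computed from a per-point pressure trace. The types and names are made up for illustration (they are not part of any LLVM API), and combining the per-type excesses by taking the maximum is just one possible choice; our actual cost function may aggregate them differently.

// Illustrative sketch of the PERP cost function described above.
#include <algorithm>
#include <cstddef>
#include <vector>

struct PressurePoint {
  std::vector<unsigned> PressurePerType; // one entry per register type
};

unsigned computePERP(const std::vector<PressurePoint> &Schedule,
                     const std::vector<unsigned> &PhysicalLimit) {
  unsigned PERP = 0;
  for (const PressurePoint &P : Schedule)
    for (std::size_t T = 0; T < PhysicalLimit.size(); ++T)
      if (P.PressurePerType[T] > PhysicalLimit[T])
        PERP = std::max(PERP, P.PressurePerType[T] - PhysicalLimit[T]);
  return PERP; // zero if no point exceeds the limit for any register type
}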

Our combinatorial scheduler first invokes the existing machine scheduler and computes the register pressure (RP) of the schedule that it produces. If that RP is within the physical limits (PERP = 0), the schedule is considered optimal and no further action is taken. If the computed PERP is greater than zero, our scheduler performs an exhaustive search for an optimal schedule (a schedule with minimum PERP) using a branch-and-bound (B&B) approach. Without accounting for live-through registers, two bad things can happen:

  1. A schedule whose actual peak pressure exceeds the physical limit may be considered optimal and thus will not get passed to the B&B phase. For example, if the physical limit is 10 and the peak pressure without counting live-through registers is 7, our scheduler will think that this is optimal and will not attempt to find a better schedule. If the number of live-through registers happens to be 4 or more, this will be a bad decision.

  2. Even if a schedule is passed to the B&B phase, the B&B search may terminate with a sub-optimal schedule thinking that it is optimal. For example, if the physical limit is 10, our B&B scheduler may find a schedule with a peak pressure of 8 and terminate thinking it is optimal, while it may not be optimal if the number of live-through registers happens to be 3 or more. Without knowing about live-through registers, our B&B scheduler will not continue to search for a better schedule in this case.

You may think that setting the physical limits for all register types to zero would resolve this problem. That is true, but it would cause our computationally expensive scheduler to do a lot of unnecessary work, wasting a lot of time processing many small, trivial scheduling regions.

So, is there an easy way to account for all registers and compute the absolute register pressure?

Thanks

Ghassan Shobaki
Assistant Professor of Computer Science

California State University, Sacramento

Hello,

In a previous email, Matthias mentioned that register pressure estimates in the machine scheduler are not absolute; they only account for the registers that are used in the block. I assume that he meant that registers that are live-through (both live-in and live-out) are not accounted for in register pressure calculations. If a register is either live-in or live-out but not both, it must be accounted for, because its live range length may be affected by the scheduling decisions within the region.

The pressure estimates account for registers used inside a scheduling region (which may be smaller than a basic block in the presence of calls or other scheduling barriers).
Yes, that means live-through registers are not accounted for. By live-through I mean: alive inside the region but not used inside the region (the register, or one of its aliases, is not used in any machine operand and not clobbered by a register mask). In loops you may have a register that is live-in, live-out, and also used inside the region; that is not what I consider live-through, and it will be accounted for.

I understand that the live range of a live-through register is not affected by the scheduling decisions within the region, but unfortunately, the scheduler that we are integrating into the machine scheduler does need to know the absolute register pressure. Our scheduler is a combinatorial scheduler with an explicit cost function. That cost function is what we call the Peak Excess Register Pressure (PERP), which is basically the number of registers above the physical limit at the highest pressure point in the schedule. If the peak pressure does not exceed the physical limit for any register type, the PERP is zero.

That means you cannot use the code from RegisterPressure.{cpp|h} to compute this. The other liveness analysis we have in LLVM codegen is LiveIntervals (LiveIntervalAnalysis), which gives you a list of liveness segments for a given vreg (the same representation is used in most linear scan allocators, even though LLVM is not using a linear scan approach any more). This representation is not optimized for register pressure queries though: if you want to know how many variables are alive at a certain point in the program, you have to check all virtual registers to see whether that point is contained in the live range of that variable.
To make this efficient, you probably need some form of precomputation over the whole function.

Our combinatorial scheduler first invokes the existing machine scheduler and computes the register pressure (RP) of the schedule that it produces. If that RP is within the physical limits (PERP = 0), the schedule is considered optimal and no further action is taken. If the computed PERP is greater than zero, our scheduler performs an exhaustive search for an optimal schedule (a schedule with minimum PERP) using a branch-and-bound (B&B) approach. Without accounting for live-through registers, two bad things can happen:

  1. A schedule whose actual peak pressure exceeds the physical limit may be considered optimal and thus will not get passed to the B&B phase. For example, if the physical limit is 10 and the peak pressure without counting live-through registers is 7, our scheduler will think that this is optimal and will not attempt to find a better schedule. If the number of live-through registers happens to be 4 or more, this will be a bad decision.

  2. Even if a schedule is passed to the B&B phase, the B&B search may terminate with a sub-optimal schedule thinking that it is optimal. For example, if the physical limit is 10, our B&B scheduler may find a schedule with a peak pressure of 8 and terminate thinking it is optimal, while it may not be optimal if the number of live-through registers happens to be 3 or more. Without knowing about live-through registers, our B&B scheduler will not continue to search for a better schedule in this case.

You may think that setting the physical limits for all register types to zero would resolve this problem. That is true, but it would cause our computationally expensive scheduler to do a lot of unnecessary work, wasting a lot of time processing many small, trivial scheduling regions.

So, is there an easy way to account for all registers and compute the absolute register pressure?

I think the current design is inspired by the fact that scheduling is most important in the inner loop. Producing a better schedule there (considering goals like minimizing critical-path length and hiding latencies) is worth using extra registers, even if it leads to additional pressure/spills/reloads outside the loop.

See above for ideas on how to compute absolute register pressure.

  • Matthias

The code in RegisterPressure.cpp is meant to work with LiveIntervals. Those queries only see within a block but are meant to be “seeded” with live-through information. That could be done by directly calling addLiveRegs. Alternatively, you can record live-through pressure separately via initLiveThru. It’s just that the MachineScheduler does not bother initializing the live-through information.
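Roughly, the seeding could look something like the sketch below. This is an outline only: LiveThroughRegs and LiveThruPressure stand in for the output of a global liveness analysis that does not exist today, and the exact addLiveRegs/initLiveThru signatures differ between LLVM versions.

// Sketch only, not the MachineScheduler's actual initialization code.
#include "llvm/CodeGen/RegisterPressure.h"
using namespace llvm;

static void seedWithLiveThrough(RegPressureTracker &RPTracker,
                                ArrayRef<RegisterMaskPair> LiveThroughRegs,
                                ArrayRef<unsigned> LiveThruPressure) {
  // Option 1: fold the live-through registers into the tracker's live set,
  // so they are counted in every pressure measurement inside the region.
  RPTracker.addLiveRegs(LiveThroughRegs);

  // Option 2: record their pressure per pressure set separately; pressure
  // queries can then add it back in when comparing against the limits.
  RPTracker.initLiveThru(LiveThruPressure);
}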

As Matthias said, actually determining live-through information requires a separate global liveness analysis, because LiveIntervals doesn’t tell you “what’s live at this point”.

-Andy

Ghassan, have you managed to try this yet? This seems interesting to me on SystemZ, as I am still seeing increased spilling when activating the machine scheduler for SystemZ and switching to source-order scheduling in isel. /Jonas

We have run an experiment in which we set all physical register limits to zero. The purpose of this experiment was to estimate the potential impact of getting complete liveness info. The results of this experiment were better than the results that we got using the real physical limits (the total number of spills in CPU2006 was reduced), but the difference was not substantial. We took this as an indicator that getting complete live-through information is unlikely to buy us much. So, we focused on investigating other problems that could have a higher impact on the results.

Overall, we are not currently getting the expected results on Intel x86 (the reduction in spilling is not as significant as expected), and we attribute this to two major problems:

  1. There are some cases in which our scheduler reduces the register pressure but the number of spills generated by the register allocator increases. We understand that the correlation between register pressure and the amount of spilling is not perfect, because it depends on the register allocation algorithm, which may often make bad decisions. However, across the statistically significant data set that we are using (all 900K+ regions in CPU2006), the correlation should be stronger. The weakness of the correlation between register pressure and spilling indicates that there is something wrong with the register info (Def/Use and liveness) that we are using to compute register pressure. Would you be able to take a look at the part of the code that collects register info and passes it to our scheduler and tell us if you spot any problems in it?
  2. The register pressure reductions that we are getting in the machine scheduler are not as significant as the register pressure reductions that we were getting in the selection DAG scheduler. We suspect that the cause of this is that the machine-level DAGs are more heavily constrained than the selection DAGs. So, if you are getting more spills when you enable the machine scheduler, that may be because the machine-level DAGs are more constrained, not because the machine-level scheduling algorithm is weaker. In order to test this hypothesis, we plan on enabling our combinatorial scheduler at the selection DAG level in the version that we are currently using and then doing a direct comparison between selection DAG scheduling and machine scheduling in the same version of the code.

As for live-through information, we found that the machine scheduler does call initLiveThru() and here is a pointer to the code:

https://gitlab.com/CSUS_LLVM/LLVM_DRAGONEGG/blob/master/Generic/llvmTip/llvm-master/lib/CodeGen/MachineScheduler.cpp#L921

And here is a pointer to the function initLiveThru()

https://gitlab.com/CSUS_LLVM/LLVM_DRAGONEGG/blob/master/Generic/llvmTip/llvm-master/lib/CodeGen/RegisterPressure.cpp#L318

Thanks

Ghassan Shobaki
Assistant Professor of Computer Science

California State University, Sacramento

Hi Ghassan,

As for live-through information, we found that the machine scheduler does call initLiveThru() and here is a pointer to the code:

https://gitlab.com/CSUS_LLVM/LLVM_DRAGONEGG/blob/master/Generic/llvmTip/llvm-master/lib/CodeGen/MachineScheduler.cpp#L921

The first part of the comment above initLiveThru() says “The register tracker is unaware of global liveness so ignores normal live-thru ranges…”. It is then of course confusing to see methods like initLiveThru()…

My understanding is that (please correct me if I’m wrong)

  1. All instructions are traversed bottom-up during DAG building. While doing this, register pressure is tracked based on just looking at those instructions. So if a def has no use in the MBB, it is a “live-out” reg, and if there is a use with no def, it becomes “live-in”. This is then a kind of local live-through concept, in contrast to a true live-through analysis, which would be aware of registers not used/defed in the region as well.
  2. We should ideally have an analysis of global liveness so that the regpressure trackers could be properly initialized, but this is currently missing. Of course, one might hope that it wouldn’t be too hard to extend LiveIntervals to also provide this information… It would be interesting to simply try this and see how valuable it would be…

/Jonas

Hi Ghassan,

As for live-through information, we found that the machine scheduler does call initLiveThru() and here is a pointer to the code:

https://gitlab.com/CSUS_LLVM/LLVM_DRAGONEGG/blob/master/Generic/llvmTip/llvm-master/lib/CodeGen/MachineScheduler.cpp#L921

The first part of the comment above initLiveThru() says “The register tracker is unaware of global liveness so ignores normal live-thru ranges…”. It is then of course confusing to see methods like initLiveThru()…

My understanding is that (please correct me if I’m wrong)

  1. All instructions are traversed bottom-up during DAG building. While doing this, register pressure is tracked based on just looking at those instructions. So if a def has no use in the MBB, it is a “live-out” reg, and if there is a use with no def, it becomes “live-in”. This is then a kind of local live-through concept, in contrast to a true live-through analysis, which would be aware of registers not used/defed in the region as well.

Yes, the first pass during DAG construction determines the maximum register pressure and the list of live-out values. I think the code consults LiveIntervals to differentiate dead defs from true live-outs or to detect killing uses that aren’t marked as such.

  2. We should ideally have an analysis of global liveness so that the regpressure trackers could be properly initialized, but this is currently missing. Of course, one might hope that it wouldn’t be too hard to extend LiveIntervals to also provide this information… It would be interesting to simply try this and see how valuable it would be…

Again, a global minimum number of spills/reloads is not necessarily better than a good schedule inside a loop with extra spills/reloads outside the loop, but it would certainly be worth exploring.
Writing a naive function to determine all live values at a certain point is easy: just iterate over all vregs and check for each whether the point in question is covered by a live interval. However, doing this efficiently enough to be used in production would probably require something like keeping live-in/live-out lists on basic blocks up to date. To put that into production, we should first demonstrate that it is worth it.
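Something along those lines (untested sketch, virtual registers only, O(#vregs) per query; header paths and helpers such as index2VirtReg have moved around between LLVM versions):

// Naive sketch: which virtual registers are live at this instruction?
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/LiveIntervals.h" // LiveIntervalAnalysis.h in older trees
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

static SmallVector<Register, 32>
liveVirtRegsAt(const MachineInstr &MI, const LiveIntervals &LIS,
               const MachineRegisterInfo &MRI) {
  SmallVector<Register, 32> Live;
  SlotIndex Idx = LIS.getInstructionIndex(MI);
  // Check every virtual register's live interval for coverage of this point.
  for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I) {
    Register Reg = Register::index2VirtReg(I);
    if (LIS.hasInterval(Reg) && LIS.getInterval(Reg).liveAt(Idx))
      Live.push_back(Reg);
  }
  return Live;
}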

  • Matthias

Jonas and Matthias,

Thank you for the informative replies!

Implementing that naive and inefficient procedure for getting complete live-in/live-out info sounds like a suitable solution for us at this point. We are working on a research project in which we are experimenting with relatively slow scheduling algorithms to explore the limits of register pressure (RP) reduction. I agree with Matthias that once our experimentation shows that these algorithms make a significant difference, we will worry about putting them into production. So, for now, we will try to implement this naive approach.

We realize that the register allocator will always try to place spills in basic blocks with lower frequencies of execution, which implies that locally reducing RP within a basic block (or a few basic blocks) will not necessarily reduce the static spill count in the whole function, which is the metric that we are currently using. However, with the large size of our data set (all CPU2006 benchmarks with about 900 thousand scheduling regions), we should statistically get a very strong (but not perfect) correlation between local RP and the static spill count. The current correlation is not as strong as expected, and that’s why we think that there is a problem with the cost function that we are minimizing. It does not seem to be fully capturing the register info.

After reading Matthias’s comment though, I concluded that the weighted spill count (with each spill multiplied by its block’s frequency of execution) would be a better metric for us to use. Of course, we do run the benchmarks and look at the actual execution time once in a while, but we cannot afford to do this after every change that we make. We have started looking into calculating a weighted spill count, but we are having problems with this. We will address that in a separate email.
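For reference, here is roughly the kind of computation we have in mind (a sketch we have not verified: it counts every stack-slot load/store that TargetInstrInfo recognizes, which over-counts accesses to non-spill stack objects and misses folded spills, and the header paths/signatures are from recent LLVM):

// Rough sketch of a frequency-weighted spill count for one function.
#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/TargetInstrInfo.h" // llvm/Target/... in older trees
using namespace llvm;

static double weightedSpillCount(const MachineFunction &MF,
                                 const MachineBlockFrequencyInfo &MBFI) {
  const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
  double Total = 0.0;
  for (const MachineBasicBlock &MBB : MF) {
    unsigned SpillsInBlock = 0;
    for (const MachineInstr &MI : MBB) {
      int FI;
      // Treat recognizable stack-slot loads/stores as reloads/spills.
      if (TII->isLoadFromStackSlot(MI, FI) || TII->isStoreToStackSlot(MI, FI))
        ++SpillsInBlock;
    }
    // Weight each block's spills by its execution frequency relative to
    // the function entry block.
    Total += SpillsInBlock * MBFI.getBlockFreqRelativeToEntryBlock(&MBB);
  }
  return Total;
}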

Overall, we have reasons to believe that precise scheduling algorithms can significantly impact the performance of scientific programs (at least), because most FP2006 benchmarks have significant amounts of spilling in their hot functions, and significant reductions in these hot spills always lead to significant performance gains. Our framework is general enough to support balancing RP and Instruction-level parallelism (ILP), but we are currently focusing on the RP part, targeting processors with powerful out-of-order execution, such as Intel x86. Once we complete our exploration of the limits of RP reduction, we will start experimenting with the balance between RP and ILP on in-order processors.

Thank you again for your insightful comments!

Ghassan Shobaki
Assistant Professor of Computer Science

California State University, Sacramento