LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Hi,

We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM and have some questions about LLVM’s pre-allocation scheduler. The answers to these questions will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM’s pre-allocation scheduling algorithm.

First, here is a brief description of our work:

We have developed a combinatorial algorithm for balancing instruction-level parallelism (ILP) and register pressure (RP) during pre-allocation scheduling. The algorithm is based on a branch-and-bound (B&B) approach, wherein the objective function is a linear combination of schedule length and register pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as an alternate pre-allocation scheduler. We then compared the performance of our B&B scheduler with that of LLVM’s default scheduler on x86 (the BURR scheduler on x86-32 and the ILP scheduler on x86-64) using SPEC CPU2006. The results show that our B&B scheduler significantly improves the performance of some benchmarks relative to LLVM’s default scheduler, by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM’s ILP scheduler on x86-64 uses “rough” latency values, so we added the precise latency values published by Agner Fog (http://www.agner.org/optimize/), which led to more speedup relative to LLVM’s ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both LLVM’s rough latencies and Agner’s precise latencies.

This work makes two points:

  • A B&B algorithm can discover significantly better schedules than a heuristic can for some larger hard-to-schedule blocks, and if such blocks happen to occur in hot code, their scheduling will have a substantial impact on performance.

  • A B&B algorithm is generally slower than a heuristic, but it is not as slow as most people think. By applying such an algorithm selectively to the hot blocks that are likely to benefit from it and setting some compile-time budget, a significant performance gain may be achieved with a relatively small increase in compile time.
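To make the idea concrete, here is a minimal sketch of a branch-and-bound search over instruction orders for one basic block. It is not the implementation described in this message. The machine model (single issue, a result live from its definition to its last use), the `Instr` type, the weights `w_len` and `w_rp`, and the node budget `max_nodes` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int
    uses: tuple = ()  # names of the instructions whose results are read


def evaluate(instrs, order):
    """Return (schedule_length, peak_pressure) for a valid (partial) order."""
    by_name = {i.name: i for i in instrs}
    # Number of consumers that each result still has to feed.
    pending = {i.name: sum(i.name in j.uses for j in instrs) for i in instrs}
    issue, live, peak, length, cycle = {}, set(), 0, 0, -1
    for name in order:
        ins = by_name[name]
        # Single-issue machine: wait for the next cycle and for all operands.
        cycle = max([cycle + 1] + [issue[u] + by_name[u].latency for u in ins.uses])
        issue[name] = cycle
        length = max(length, cycle + ins.latency)
        for u in set(ins.uses):  # a value dies at its last use
            pending[u] -= 1
            if pending[u] == 0:
                live.discard(u)
        if pending[name] > 0:  # the result stays live until it is consumed
            live.add(name)
        peak = max(peak, len(live))
    return length, peak


def schedule_bnb(instrs, w_len=1.0, w_rp=1.0, max_nodes=100_000):
    """Depth-first branch-and-bound with a node budget (hypothetical API)."""
    best = {"cost": float("inf"), "order": None}
    nodes = 0

    def search(order):
        nonlocal nodes
        nodes += 1
        if nodes > max_nodes:  # compile-time budget exhausted
            return
        length, peak = evaluate(instrs, order)
        # Length and peak pressure can only grow as the order is extended,
        # so the cost of a partial order is a lower bound for its completions.
        bound = w_len * length + w_rp * peak
        if bound >= best["cost"]:  # prune this subtree
            return
        if len(order) == len(instrs):
            best["cost"], best["order"] = bound, list(order)
            return
        done = set(order)
        for ins in instrs:
            if ins.name not in done and all(u in done for u in ins.uses):
                search(order + [ins.name])

    search([])
    return best["order"], best["cost"]


# Toy block: a long-latency value "a" and a short chain b -> d, joined in "e".
block = [
    Instr("a", 3), Instr("b", 1),
    Instr("c", 1, ("a",)), Instr("d", 1, ("b",)),
    Instr("e", 1, ("c", "d")),
]
order, cost = schedule_bnb(block, w_len=1.0, w_rp=2.0)
print(order, cost)
```

On this toy block the search settles on the order a, b, d, c, e, which hides the latency of `a` behind `b` and `d`. If the budget runs out before any complete order is found, the sketch returns `None` for the order, so a real integration would fall back to the heuristic schedule in that case.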

My questions are:

  1. Our current experimental results are based on LLVM 2.9. We definitely plan on upgrading to the latest LLVM version in our future work, but is there a fundamentally compelling reason for us to upgrade now to 3.1 for the sake of making the above points in the publication?

  2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64 appears to set all latencies to 10 except for a few long-latency instructions. For the sake of documenting this in the paper, does anyone know (or can point me to) a precise description of how the scheduler sets latency values? In the revised paper, I will add experimental results based on precise latency values (see the attached spreadsheet) and would like to clearly document how LLVM’s rough latencies for x86 are determined.

  3. Was the choice to use rough latency values in the ILP scheduler based on the fact that using precise latencies makes it much harder for a heuristic non-backtracking scheduler to balance ILP and RP, or was the choice made simply because nobody bothered to write an x86 itinerary?

  4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle empty) when there are ready instructions? Here is a small hypothetical example that explains what I mean:

Suppose that at Cycle C the register pressure (RP) is equal to the physical limit and all ready instructions in that cycle start new live ranges, thus increasing the RP above the physical register limit. However, in a later cycle C+Delta some instruction X that closes a currently open live range will become ready. If the objective is minimizing RP, the right choice to make in this case is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP scheduler ever make such a choice, or will it always schedule an instruction when the ready list is not empty?
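The choice in this hypothetical example can be written down as a small decision function. This is only a sketch of the question being asked, not a description of how any LLVM scheduler behaves. The function `pick`, its parameters, and the `STALL` marker are made up for illustration.

```python
STALL = "<stall>"


def pick(ready, pressure, limit, cycles_until_closer=None, may_stall=True):
    """Choose what to issue in the current cycle.

    ready: list of (name, rp_delta) pairs, where rp_delta is the change in
           register pressure caused by scheduling that instruction.
    cycles_until_closer: Delta, the number of cycles until an instruction that
           closes a live range becomes ready, or None if there is none.
    """
    # Prefer the ready instruction that is cheapest in register pressure.
    name, delta = min(ready, key=lambda item: item[1])
    if pressure + delta <= limit:
        return name
    # Every ready instruction would push RP above the limit.
    if may_stall and cycles_until_closer is not None:
        return STALL  # leave the cycle empty and wait for Instruction X
    return name  # a scheduler that never stalls accepts the higher RP


ready = [("i1", 1), ("i2", 2)]
with_stall = pick(ready, pressure=16, limit=16, cycles_until_closer=2)
no_stall = pick(ready, pressure=16, limit=16, cycles_until_closer=2, may_stall=False)
print(with_stall, no_stall)
```

With RP at the limit, the stalling variant returns the stall marker, while the non-stalling variant issues `i1` and raises the RP.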

Thank you in advance!

-Ghassan

Ghassan Shobaki
Assistant Professor
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan

Attachments inlined:

Rough Latencies

| Benchmark | Branch-and-Bound SPEC Score | LLVM SPEC Score | % Score Difference |
| - | - | - | - |
| 400.perlbench | 21.2 | 20.2 | 4.95% |
| 401.bzip2 | 13.9 | 13.6 | 2.21% |
| 403.gcc | 19.5 | 19.8 | -1.52% |
| 429.mcf | 20.5 | 20.5 | 0.00% |
| 445.gobmk | 18.6 | 18.6 | 0.00% |
| 456.hmmer | 11.1 | 11.1 | 0.00% |
| 458.sjeng | 19.3 | 19.3 | 0.00% |
| 462.libquantum | 39.5 | 39.5 | 0.00% |
| 464.h264ref | 28.5 | 28.5 | 0.00% |
| 471.omnetpp | 15.6 | 15.6 | 0.00% |
| 473.astar | 13 | 13 | 0.00% |
| 483.xalancbmk | 21.9 | 21.9 | 0.00% |
| GEOMEAN | 19.0929865 | 19.00588287 | 0.46% |
| 410.bwaves | 15.2 | 15.2 | 0.00% |
| 416.gamess | CE | CE | n/a |
| 433.milc | 19 | 18.6 | 2.15% |
| 434.zeusmp | 14.2 | 14.2 | 0.00% |
| 435.gromacs | 11.6 | 11.3 | 2.65% |
| 436.cactusADM | 8.31 | 7.89 | 5.32% |
| 437.leslie3d | 11 | 11 | 0.00% |
| 444.namd | 16 | 16 | 0.00% |
| 447.dealII | 25.4 | 25.4 | 0.00% |
| 450.soplex | 26.1 | 26.1 | 0.00% |
| 453.povray | 20.5 | 20.5 | 0.00% |
| 454.calculix | 8.44 | 8.3 | 1.69% |
| 459.GemsFDTD | 10.7 | 10.7 | 0.00% |
| 465.tonto | CE | CE | n/a |
| 470.lbm | 38.1 | 31.5 | 20.95% |
| 481.wrf | 11.6 | 11.6 | 0.00% |
| 482.sphinx3 | 28.2 | 26.9 | 4.83% |
| GEOMEAN | 15.91486307 | 15.54419555 | 2.38% |
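For readers who want to check the arithmetic, the sketch below recomputes the percentage difference and the second geometric mean from the scores listed in the table. The two rows marked CE have no score and are left out. The helper names `geomean` and `pct_diff` are chosen here for illustration and do not come from the spreadsheet.

```python
import math


def geomean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def pct_diff(new, base):
    """Percentage by which `new` exceeds `base`."""
    return (new / base - 1.0) * 100.0


# (Branch-and-Bound score, LLVM score), copied from the second group of rows.
fp_scores = {
    "410.bwaves": (15.2, 15.2), "433.milc": (19.0, 18.6),
    "434.zeusmp": (14.2, 14.2), "435.gromacs": (11.6, 11.3),
    "436.cactusADM": (8.31, 7.89), "437.leslie3d": (11.0, 11.0),
    "444.namd": (16.0, 16.0), "447.dealII": (25.4, 25.4),
    "450.soplex": (26.1, 26.1), "453.povray": (20.5, 20.5),
    "454.calculix": (8.44, 8.3), "459.GemsFDTD": (10.7, 10.7),
    "470.lbm": (38.1, 31.5), "481.wrf": (11.6, 11.6),
    "482.sphinx3": (28.2, 26.9),
}

bnb_gm = geomean([bnb for bnb, _ in fp_scores.values()])
llvm_gm = geomean([llvm for _, llvm in fp_scores.values()])
fp_gain = pct_diff(bnb_gm, llvm_gm)
lbm_gain = pct_diff(*fp_scores["470.lbm"])

print(f"lbm: {lbm_gain:.2f}%  geomean: {fp_gain:.2f}%")
```

Because the geometric mean of ratios equals the ratio of geometric means, the overall gain can be computed either way.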

Precise Latencies

x86-64_BB_vs_LLVM_roughLatencies.xls (10 KB)

x86-64_BB_vs_LLVM_preciseLatencies.xls (10 KB)

Hi Ghassan, this is very interesting, however...

We are currently working on revising a journal article that describes our work
on pre-allocation scheduling using LLVM and have some questions about LLVM's
pre-allocation scheduler. The answers to these questions will help us better
document and analyze the results of our benchmark tests that compare our
algorithm with LLVM's pre-allocation scheduling algorithm.

First, here is a brief description of our work:

We have developed a combinatorial algorithm for balancing instruction-level
parallelism (ILP) and register pressure (RP) during pre-allocation scheduling.
The algorithm is based on a branch-and-bound (B&B) approach, wherein the
objective function is a linear combination of schedule length and register
pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as
an alternate pre-allocation scheduler. Then we compared the performance of our
(B&B) scheduler with that of LLVM's default scheduler on x86 (BURR scheduler on
x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that our B&B
scheduler significantly improves the performance of some benchmarks relative to
LLVM's default scheduler by up to 21%.

... are these differences statistically significant? In my experience SPEC
scores can vary considerably on different runs, so how many runs did you do
and what was the estimated standard deviation?

Best wishes, Duncan.

  1. Our current experimental results are based on LLVM 2.9. We definitely plan on upgrading to the latest LLVM version in our future work, but is there a fundamentally compelling reason for us to upgrade now to 3.1 for the sake of making the above points in the publication?

Yes, there is. While the pre-allocation scheduler has not had algorithmic changes during the past year, it has received minor tweaks that can impact performance. Also note that the scheduler is on its way out. I don’t know when the article will be published, but it’s possible that by the time the paper is published, you would essentially be comparing against deprecated technology.

  2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64 appears to set all latencies to 10 except for a few long-latency instructions. For the sake of documenting this in the paper, does anyone know (or can point me to) a precise description of how the scheduler sets latency values? In the revised paper, I will add experimental results based on precise latency values (see the attached spreadsheet) and would like to clearly document how LLVM’s rough latencies for x86 are determined.

I don’t think your information is correct. The ILP scheduler is not setting the latencies to 10. LLVM does not have machine models for x86 (except for Atom), so it’s using a uniform latency model (one cycle).

  3. Was the choice to use rough latency values in the ILP scheduler based on the fact that using precise latencies makes it much harder for a heuristic non-backtracking scheduler to balance ILP and RP, or was the choice made simply because nobody bothered to write an x86 itinerary?

No one has bothered to write the itinerary.

  4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle empty) when there are ready instructions? Here is a small hypothetical example that explains what I mean:

Suppose that at Cycle C the register pressure (RP) is equal to the physical limit and all ready instructions in that cycle start new live ranges, thus increasing the RP above the physical register limit. However, in a later cycle C+Delta some instruction X that closes a currently open live range will become ready. If the objective is minimizing RP, the right choice to make in this case is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP scheduler ever make such a choice, or will it always schedule an instruction when the ready list is not empty?

I don’t believe so.

Evan


Duncan,

Yes, there is a significant random variance among runs for some (but not all) SPEC benchmarks (examples are libquantum, bwaves and cactus). However, we run a complete SPEC test with three or five iterations after every significant change we make to the code and make sure that we reproduce all previously measured differences. If we can’t reproduce some previously seen result, that flags a bug in our latest changes that we trace and fix. For the production test that was used to generate the results for the paper, we ran SPEC using 9 iterations (as documented in the paper). Since we have been working on this for over a year, making multiple significant changes every week, most of the differences that are reported here have been reproduced tens if not hundreds of times. Any difference that cannot be reproduced many times is reported as a zero difference, hence the many zero differences in the tables.

Furthermore, we have analyzed most of the benchmarks on which significant differences have been measured (lbm, gromacs, cactus, sphinx, etc.) and identified the cause of the performance difference in each case. In most cases, the cause is a reduction in register pressure in some hot basic block that causes a significant reduction in spill code, as reported by the register allocator. For example, on the lbm benchmark, using LLVM’s scheduler causes the register allocator to spill 12 virtual registers in the hottest function, which accounts for 99% of the execution time, while using the branch-and-bound scheduler causes the register allocator to spill only 2 registers in that hot function.

So, we are quite certain that the reported differences are real and reproducible.
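One way to put a number on run-to-run variance is to report the sample standard deviation of each configuration together with Welch's t statistic for the difference of the means. The sketch below is only an illustration of that calculation. The run scores in it are placeholders invented for the example and are not measurements from this work.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return (mean_a - mean_b) / math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))


# Placeholder scores for one benchmark (NOT real data).
bnb_runs = [10.1, 10.3, 9.9, 10.2, 10.0]
llvm_runs = [9.6, 9.8, 9.5, 9.7, 9.9]

print("B&B  mean/stdev:", summarize(bnb_runs))
print("LLVM mean/stdev:", summarize(llvm_runs))
print("Welch t:", welch_t(bnb_runs, llvm_runs))
```

A large absolute t value relative to the number of runs suggests the difference is unlikely to be noise. Turning it into a p-value needs the t distribution, which is left out of this sketch.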

Evan,

Please see my inlined answers below:

Thanks
-Ghassan

Ghassan-

LLVM version 3.0 onwards has a better register allocator: the greedy register allocator, used at higher optimization levels. It might handle the not-so-great schedules better than the register allocator used in version 2.9.

-Prashantha

Ghassan,

Evan is right. Your speedups from enhancing the scheduler’s latency are probably different with 3.1. But that doesn’t really change your two main points that (1) you can further reduce register pressure and (2) your compile time isn’t horrible. I think the major heuristic changes in the PreRA scheduler went in before LLVM 2.9. OTOH, Prashantha makes a good point that the new register allocator may diminish the impact of a bad schedule.
At the very least, I would check the code being generated by 3.1 for the few benchmarks that interest you to see how much the baseline changed.

Another thing that you may want to experiment with…

I implemented a register pressure tracking scheduler earlier this year. You can run it yourself on x86-64 with the flag -enable-misched. The current implementation is a platform for experimentation only. I haven’t invested any time developing heuristics, which will need to work across a range of interesting benchmarks and subtargets. Others have done some performance analysis with the framework, and may be able to chime in on its good points or weak points.

This is not based on Sethi-Ullman in any way. There are a number of interesting heuristics that I can think of implementing, but will not be doing so until I can justify adding the cost and complexity.

The current (experimental) MI scheduler implements a 3-level “back-off”:

  1. Respect the target’s register limits at all times.

  2. Identify critical register classes (pressure sets) before scheduling.
    Track pressure within the currently scheduled region.
    Avoid increasing scheduled pressure for critical registers.

  3. Avoid exceeding the max pressure of the region prior to scheduling (don’t make things locally worse).

Think of these crude heuristics as baseline. This is what can be done easily and cheaply. Anything more sophisticated has to surpass this low bar.
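The three levels can be read as a lexicographic comparison between candidate instructions. The sketch below shows that reading on plain dictionaries. It is not the MI scheduler's code, and every name in it (`backoff_key`, `pick_candidate`, the pressure-set labels) is hypothetical.

```python
def backoff_key(delta, current, limits, critical, region_max):
    """Rank a candidate; smaller tuples are better.

    delta:      pressure set -> change caused by scheduling the candidate
    current:    pressure set -> pressure scheduled so far in the region
    limits:     pressure set -> target register limit
    critical:   pressure sets identified as critical before scheduling
    region_max: pressure set -> max pressure of the region before scheduling
    """
    after = {s: current.get(s, 0) + delta.get(s, 0) for s in limits}
    # Level 1: amount by which the target's register limits are exceeded.
    excess = sum(max(0, after[s] - limits[s]) for s in limits)
    # Level 2: increase in pressure for the critical sets.
    critical_increase = sum(max(0, delta.get(s, 0)) for s in critical)
    # Level 3: amount by which the original region maximum is exceeded.
    over_max = sum(max(0, after[s] - region_max[s]) for s in limits)
    return (excess, critical_increase, over_max)


def pick_candidate(candidates, current, limits, critical, region_max):
    """candidates: list of (name, delta) pairs; returns the best name."""
    best = min(
        candidates,
        key=lambda c: backoff_key(c[1], current, limits, critical, region_max),
    )
    return best[0]


limits = {"GPR": 4, "FPR": 4}
current = {"GPR": 4, "FPR": 1}
region_max = {"GPR": 4, "FPR": 2}
candidates = [("load", {"GPR": 1}), ("fmul", {"FPR": 2}), ("fadd", {"FPR": 1})]
choice = pick_candidate(candidates, current, limits, {"GPR"}, region_max)
print(choice)
```

Here `load` would break the GPR limit and `fmul` would exceed the region's earlier FPR maximum, so `fadd` is chosen.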

All of the heuristics that I have planned are greedy, and none are sophisticated, but some require precomputing register lineages (dependence chains that reuse a single register). An interesting twist is that the MI scheduler can alternate between top-down and bottom-up, which doesn’t fundamentally change the problem, but avoids the common cases in which greedy schedulers “get stuck”.

Here’s my plan for the near future:

  • SpillCost: Map register units onto a spill cost that is more meaningful for heuristics.

  • Pressure Query: (compile time) Redesign the pressure tracker to summarize information at the instruction level for fast queries during scheduling.

  • Pressure Range: Before scheduling, compute the high pressure region as a range of instructions. If the scheduler is not currently under pressure, prioritize instructions from within the range.

  • Register Lineages: Before scheduling, use a heuristic to select desirable lineages. Select the longest lineage from the queue. After scheduling an instruction, look at the next instruction in the lineage. If it has an unscheduled operand, mark that operand’s lineage as pending, and prioritize the head of that lineage. This solves some interesting cases where a greedy scheduler is normally unable to choose among a set of identical looking instructions by knowing how their dependence chain relates to any already scheduled instructions.
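As an illustration of the “Pressure Range” item, the sketch below computes a per-instruction pressure profile for a straight-line instruction sequence and then reports the range of instructions at or above a threshold. It assumes one pressure set, ignores values that are live into or out of the block, and uses made-up helper names. It is not taken from the MI scheduler.

```python
def pressure_profile(instrs):
    """instrs: list of (defs, uses) tuples of virtual register names.

    Returns the number of live registers after each instruction.
    """
    last_use = {}
    for idx, (_, uses) in enumerate(instrs):
        for reg in uses:
            last_use[reg] = idx
    live, profile = set(), []
    for idx, (defs, uses) in enumerate(instrs):
        for reg in uses:
            if last_use[reg] == idx:  # the live range closes here
                live.discard(reg)
        for reg in defs:
            if reg in last_use:  # only values that are read later stay live
                live.add(reg)
        profile.append(len(live))
    return profile


def high_pressure_range(profile, threshold):
    """Return (first, last) instruction index with pressure >= threshold."""
    hot = [i for i, p in enumerate(profile) if p >= threshold]
    return (hot[0], hot[-1]) if hot else None


seq = [
    (("a",), ()), (("b",), ()),   # two independent definitions
    (("c",), ("a", "b")),         # closes a and b, opens c
    (("d",), ()),
    (("e",), ("c", "d")),         # closes c and d, opens e
    ((), ("e",)),                 # final use of e
]
profile = pressure_profile(seq)
hot_range = high_pressure_range(profile, threshold=2)
print(profile, hot_range)
```

A scheduler that is not currently under pressure could then prioritize instructions whose index falls inside the reported range.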

-Andy

  2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64 appears to set all latencies to 10 except for a few long-latency instructions. For the sake of documenting this in the paper, does anyone know (or can point me to) a precise description of how the scheduler sets latency values? In the revised paper, I will add experimental results based on precise latency values (see the attached spreadsheet) and would like to clearly document how LLVM’s rough latencies for x86 are determined.

I don’t think your information is correct. The ILP scheduler is not setting the latencies to 10. LLVM does not have machine models for x86 (except for atom) so it’s using a uniform latency model (one cycle).

Evan’s description is precise. Everything is one cycle, unless it is 10 cycles ;) But it’s easy to reconfigure to use itineraries, as I guess you’ve done.

  3. Was the choice to use rough latency values in the ILP scheduler based on the fact that using precise latencies makes it much harder for a heuristic non-backtracking scheduler to balance ILP and RP, or was the choice made simply because nobody bothered to write an x86 itinerary?

No one has bothered to write the itinerary.

I recently committed infrastructure that allows machine models to be developed incrementally, at the level of detail appropriate for the processor. I have a feeling we will start to see models begin to evolve for x86 processors very soon. The framework is documented in TargetSchedule.td and you’re welcome to contribute.

The only feature that I still plan to implement is the ability of a machine model to specify that it is derived from another. It will be trivial to add though when the time comes.

  4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle empty) when there are ready instructions? Here is a small hypothetical example that explains what I mean:

Suppose that at Cycle C the register pressure (RP) is equal to the physical limit and all ready instructions in that cycle start new live ranges, thus increasing the RP above the physical register limit. However, in a later cycle C+Delta some instruction X that closes a currently open live range will become ready. If the objective is minimizing RP, the right choice to make in this case is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP scheduler ever make such a choice, or will it always schedule an instruction when the ready list is not empty?

The standard ILP scheduler does not have a “ReadyFilter” so instructions are inserted in the ready queue the moment their predecessors are scheduled. So, yes, it will effectively “impose stalls” to reduce register pressure. Note that things work differently with an itinerary though. And the answer will depend on how you’ve written the itinerary.

-Andy

... are these differences statistically significant? In my experience SPEC
scores can vary considerably on different runs, so how many runs did you do
and what was the estimated standard deviation?

It is also subject to measurement bias. You probably want to look at

http://www.inf.usi.ch/faculty/hauswirth/publications/asplos09.pdf

Cheers,
Rafael