MI Scheduler vs SD Scheduler?

Hi,

We are currently in the process of upgrading from LLVM 2.9 to LLVM 3.3. We are working on instruction scheduling (mainly for register pressure reduction). I have been following the llvmdev mailing list and have learned that a machine instruction (MI) scheduler has been implemented to replace (or work with?) the selection DAG (SD) scheduler. However, I could not find any document that describes the new MI scheduler and how it differs from and relates to the SD scheduler. So, I would appreciate any pointer to a document (or a blog) that may help us understand the difference and the relation between the two schedulers and figure out how to deal with them. We are trying to answer the following questions:

  • A comment at the top of the file ScheduleDAGInstrs says that this file implements re-scheduling of machine instructions. So, what does re-scheduling mean? Does it mean that the real scheduling algorithms (such as reg pressure reduction) are currently implemented in the SD scheduler, while the MI scheduler does some kind of complementary work (fine tuning) at a lower level representation of the code?
    And what’s the future plan? Is it to move the real scheduling algorithms into the MI scheduler and get rid of the SD scheduler? Will that happen in 3.4 or later?

  • Based on our initial investigation of the default behavior at -O3 on x86-64, it appears that the SD scheduler is called while the MI scheduler is not. That’s consistent with the above interpretation of re-scheduling, but I’d appreciate any advice on what we should do at this point. Should we integrate our work (an alternate register pressure reduction scheduler) into the SD scheduler or the MI scheduler?

  • Our SPEC testing on x86-64 has shown a significant performance improvement of LLVM 3.3 relative to LLVM 2.9 (about 5% in geomean on INT2006 and 15% in geomean on FP2006), but our spill code measurements have shown that LLVM 3.3 generates significantly more spill code on most benchmarks. We will be doing more investigation on this, but are there any known facts that explain this behavior? Is this caused by a known regression in scheduling and/or allocation (which I doubt) or by the implementation (or enabling) of some new optimization(s) that naturally increase(s) register pressure?

Thank you in advance!

Ghassan Shobaki

Assistant Professor

Department of Computer Science

Princess Sumaya University for Technology

Amman, Jordan

Hi,

We are currently in the process of upgrading from LLVM 2.9 to LLVM 3.3. We are working on instruction scheduling (mainly for register pressure reduction). I have been following the llvmdev mailing list and have learned that a machine instruction (MI) scheduler has been implemented to replace (or work with?) the selection DAG (SD) scheduler. However, I could not find any document that describes the new MI scheduler and how it differs from and relates to the SD scheduler.

MI is now the place to implement any heuristics for profitable scheduling. SD scheduler will be directly replaced by a new pass that orders the DAG as close as it can to IR order. We currently emulate this with -pre-RA-sched=source.
The only thing necessarily different about MI sched is that it runs after reg coalescing and before reg alloc, and maintains live interval analysis. As a result, register pressure tracking is more accurate. It also uses a new target interface for precise register pressure.
MI sched is intended to be a convenient place to implement target specific scheduling. There is a generic implementation that uses standard heuristics to reduce register pressure and balance latency and CPU resources. That is what you currently get when you enable MI sched for x86.
The generic heuristics are implemented as a priority function that makes a greedy choice over the ready instructions based on the current pressure and the resources and latency of the scheduled and unscheduled set of instructions.
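
To make the idea of a greedy priority function concrete, here is a minimal, self-contained sketch of top-down list scheduling. It is not the LLVM implementation: the `Instr` record, the `pressureDelta` field, the `pressureLimit` parameter, and the exact priority rule are all assumptions chosen for illustration. Below the limit it prefers the ready instruction on the longest latency path; at or above the limit it prefers the one that frees the most registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction record; the fields are illustrative, not LLVM data structures.
struct Instr {
    std::vector<std::size_t> preds; // indices of instructions this one depends on
    int latency;                    // cycles until the result is available
    int pressureDelta;              // net change in live registers once scheduled
};

// Longest latency path from each instruction to the end of the block.
// Requires a topologically ordered input (every pred index < own index).
std::vector<int> computeHeights(const std::vector<Instr>& dag) {
    std::vector<int> height(dag.size(), 0);
    for (std::size_t i = dag.size(); i-- > 0;) {
        height[i] += dag[i].latency; // height[i] already holds the max over successors
        for (std::size_t p : dag[i].preds) {
            assert(p < i && "instructions must be topologically ordered");
            height[p] = std::max(height[p], height[i]);
        }
    }
    return height;
}

// Greedy choice over the ready list; ties fall back to original order.
std::vector<std::size_t> greedySchedule(const std::vector<Instr>& dag, int pressureLimit) {
    const std::vector<int> height = computeHeights(dag);
    std::vector<std::size_t> pendingPreds(dag.size());
    std::vector<std::vector<std::size_t>> succs(dag.size());
    std::vector<std::size_t> ready;
    for (std::size_t i = 0; i < dag.size(); ++i) {
        pendingPreds[i] = dag[i].preds.size();
        for (std::size_t p : dag[i].preds) succs[p].push_back(i);
        if (pendingPreds[i] == 0) ready.push_back(i);
    }

    std::vector<std::size_t> order;
    int pressure = 0;
    while (!ready.empty()) {
        const bool reducePressure = pressure >= pressureLimit;
        // Priority key: larger is better.
        auto key = [&](std::size_t n) {
            return reducePressure ? -dag[n].pressureDelta : height[n];
        };
        std::size_t best = 0; // position within `ready`
        for (std::size_t k = 1; k < ready.size(); ++k) {
            const int a = key(ready[k]), b = key(ready[best]);
            if (a > b || (a == b && ready[k] < ready[best])) best = k;
        }
        const std::size_t node = ready[best];
        ready.erase(ready.begin() + static_cast<std::ptrdiff_t>(best));
        order.push_back(node);
        pressure += dag[node].pressureDelta;
        for (std::size_t s : succs[node])
            if (--pendingPreds[s] == 0) ready.push_back(s);
    }
    return order;
}
```

With three independent loads feeding an add and a multiply, a generous limit hoists all loads to the top, while a limit of 2 makes the sketch schedule the add before the third load.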
A DAG subtree analysis also exists (ScheduleDFS), which can be used for register pressure avoidance. This isn’t hooked up to the generic heuristics yet for lack of interesting test cases.

So, I would appreciate any pointer to a document (or a blog) that may help us understand the difference and the relation between the two schedulers and figure out how to deal with them. We are trying to answer the following questions:

  • A comment at the top of the file ScheduleDAGInstrs says that this file implements re-scheduling of machine instructions. So, what does re-scheduling mean?

Rescheduling just means optional scheduling. That’s really what the comment should say. It’s important to know that MI sched can be skipped for faster compilation.

Does it mean that the real scheduling algorithms (such as reg pressure reduction) are currently implemented in the SD scheduler, while the MI scheduler does some kind of complementary work (fine tuning) at a lower level representation of the code?
And what’s the future plan? Is it to move the real scheduling algorithms into the MI scheduler and get rid of the SD scheduler? Will that happen in 3.4 or later?

I would like to get rid of the SD scheduler so we can reduce compile time by streamlining the scheduling data structures and interfaces. There may be some objection to doing that in 3.4 if projects haven’t been able to migrate. It will be deprecated though.

  • Based on our initial investigation of the default behavior at -O3 on x86-64, it appears that the SD scheduler is called while the MI scheduler is not. That’s consistent with the above interpretation of re-scheduling, but I’d appreciate any advice on what we should do at this point. Should we integrate our work (an alternate register pressure reduction scheduler) into the SD scheduler or the MI scheduler?

Please refer to my recent messages on llvmdev regarding enabling MI scheduling by default on x86.
http://article.gmane.org/gmane.comp.compilers.llvm.devel/63242/match=machinescheduler

I suggest integrating with the MachineScheduler pass.
There are many places to plug in. MachineSchedRegistry provides the hook. At that point you can define your own ScheduleDAGInstrs or ScheduleDAGMI subclass. People who only want to define new heuristics should reuse ScheduleDAGMI directly and only define their own MachineSchedStrategy.
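
The plug-in structure described here (a registry hook that builds a scheduler around a user-supplied strategy) can be illustrated with a small standalone mock. Everything below is hypothetical: `ToyStrategy`, `toyRegistry`, `ToyRegistration`, and `runToyScheduler` are invented names that only mirror the shape of the idea and are not the LLVM classes or their signatures. Consult the MachineScheduler headers for the real interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical strategy interface: choose which ready instruction goes next.
struct ToyStrategy {
    virtual ~ToyStrategy() = default;
    // Returns the position in `ready` of the instruction to schedule.
    virtual std::size_t pick(const std::vector<int>& ready) = 0;
};

// Keeps the original order by always taking the lowest instruction number.
struct SourceOrder : ToyStrategy {
    std::size_t pick(const std::vector<int>& ready) override {
        return static_cast<std::size_t>(
            std::min_element(ready.begin(), ready.end()) - ready.begin());
    }
};

// An alternate heuristic, only here to show that strategies are swappable.
struct ReverseOrder : ToyStrategy {
    std::size_t pick(const std::vector<int>& ready) override {
        return static_cast<std::size_t>(
            std::max_element(ready.begin(), ready.end()) - ready.begin());
    }
};

using Factory = std::function<std::unique_ptr<ToyStrategy>()>;

// Hypothetical registry mapping a command-line style name to a factory.
std::map<std::string, Factory>& toyRegistry() {
    static std::map<std::string, Factory> registry;
    return registry;
}

// Registering is done by constructing a static object, one per strategy.
struct ToyRegistration {
    ToyRegistration(const std::string& name, Factory f) {
        toyRegistry()[name] = std::move(f);
    }
};
static ToyRegistration regSource("source", [] { return std::make_unique<SourceOrder>(); });
static ToyRegistration regReverse("reverse", [] { return std::make_unique<ReverseOrder>(); });

// The shared driver stays the same; only the strategy changes.
std::vector<int> runToyScheduler(const std::string& name, std::vector<int> ready) {
    auto it = toyRegistry().find(name);
    assert(it != toyRegistry().end() && "unknown strategy name");
    std::unique_ptr<ToyStrategy> strategy = it->second();
    std::vector<int> order;
    while (!ready.empty()) {
        const std::size_t k = strategy->pick(ready);
        order.push_back(ready[k]);
        ready.erase(ready.begin() + static_cast<std::ptrdiff_t>(k));
    }
    return order;
}
```

The point of the split is that someone who only wants new heuristics writes one small class and one registration line, and reuses the driver unchanged.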

  • Our SPEC testing on x86-64 has shown a significant performance improvement of LLVM 3.3 relative to LLVM 2.9 (about 5% in geomean on INT2006 and 15% in geomean on FP2006), but our spill code measurements have shown that LLVM 3.3 generates significantly more spill code on most benchmarks. We will be doing more investigation on this, but are there any known facts that explain this behavior? Is this caused by a known regression in scheduling and/or allocation (which I doubt) or by the implementation (or enabling) of some new optimization(s) that naturally increase(s) register pressure?

There is not a particular known regression. It’s not surprising that optimizations increase pressure.

Andy

Thank you for the answers! We are currently trying to test the MI scheduler. We are using LLVM 3.3 with Dragon Egg 3.3 on an x86-64 machine. So far, we have run one SPEC CPU2006 test with the MI scheduler enabled using the option -fplugin-arg-dragonegg-llvm-option=’-enable-misched:true’ with -O3. This enables the machine scheduler in addition to the SD scheduler. We have verified this by adding print messages to the source code of both schedulers. In terms of correctness, enabling the MI scheduler did not cause any failure. However, in terms of performance, we have seen a mix of small positive and negative differences with the geometric mean difference being near zero. The maximum improvement that we have seen is 3% on the Gromacs benchmark. Is this consistent with your test results?

We have then tried to run a test in which the MI scheduler is enabled but the SD scheduler is disabled (or neutralized) by adding the option: -fplugin-arg-dragonegg-llvm-option=’-pre-RA-sched:source’ to the flags that we have used in the first test. However, this did not work; we got the following error message:

GCC_4.6.4_DIR/install/bin/gcc -c -o lbm.o -DSPEC_CPU -DNDEBUG -O3 -march=core2 -mtune=core2 -fplugin='DRAGON_EGG_DIR/dragonegg.so' -fplugin-arg-dragonegg-llvm-option='-enable-misched:true' -fplugin-arg-dragonegg-llvm-option='-pre-RA-sched:source' -DSPEC_CPU_LP64 lbm.c
cc1: for the -pre-RA-sched option: may only occur zero or one times!
specmake: *** [lbm.o] Error 1

What does this message mean?

Is this a bug, or are we doing something wrong?

How can we test the MI scheduler by itself?

Is it interesting to test 3.3, or are there interesting features that were added to the trunk after branching 3.3? In the latter case, we are willing to test the trunk.

Thanks

Ghassan Shobaki


Thank you for the answers! We are currently trying to test the MI scheduler. We are using LLVM 3.3 with Dragon Egg 3.3 on an x86-64 machine. So far, we have run one SPEC CPU2006 test with the MI scheduler enabled using the option -fplugin-arg-dragonegg-llvm-option=’-enable-misched:true’ with -O3. This enables the machine scheduler in addition to the SD scheduler. We have verified this by adding print messages to the source code of both schedulers. In terms of correctness, enabling the MI scheduler did not cause any failure. However, in terms of performance, we have seen a mix of small positive and negative differences with the geometric mean difference being near zero. The maximum improvement that we have seen is 3% on the Gromacs benchmark. Is this consistent with your test results?

I haven’t benchmarked Fortran. On x86-64, I regularly see wild swings in performance, 10-20% for small codegen changes (small benchmarks with a primary hot loop). This is not a natural consequence of scheduling, unless spill code changed in the hot loop (rare on x86-64). Quite often, a somewhat random change in copy coalescing results in different register allocation and code layout. The results are chaotic and very platform (linker) and microarchitecture specific. Large benchmarks are immune to wild swings, but the small changes you see could just be the accumulation of chaotic behavior of individual loops. It’s hard for me to draw conclusions without looking at hardware counters and isolating the data to individual loops.

The MI scheduler’s generic heuristics are much more about avoiding worst-case scheduling in pathological situations (very large unrolled loops) than they are about tuning for a microarchitecture. People who want to do that may want to plug in their own scheduling strategy. The precise machine model and register pressure information are all there now.

The broadest statement I can make is that we should not unnecessarily spill within loops (with rare exceptions). If you see that, file a bug. I know there are still situations that we don’t handle well, but haven’t had a compelling enough reason to add the complexity to the generic heuristics. If good test cases come in, then I’ll do that.

We have then tried to run a test in which the MI scheduler is enabled but the SD scheduler is disabled (or neutralized) by adding the option: -fplugin-arg-dragonegg-llvm-option=’-pre-RA-sched:source’ to the flags that we have used in the first test. However, this did not work; we got the following error message:

GCC_4.6.4_DIR/install/bin/gcc -c -o lbm.o -DSPEC_CPU -DNDEBUG -O3 -march=core2 -mtune=core2 -fplugin='DRAGON_EGG_DIR/dragonegg.so' -fplugin-arg-dragonegg-llvm-option='-enable-misched:true' -fplugin-arg-dragonegg-llvm-option='-pre-RA-sched:source' -DSPEC_CPU_LP64 lbm.c
cc1: for the -pre-RA-sched option: may only occur zero or one times!
specmake: *** [lbm.o] Error 1

What does this message mean?

Is this a bug, or are we doing something wrong?

I’m not sure why the driver is telling you this. Maybe someone familiar with dragonegg can help?

You can always rebuild llvm with the enableMachineScheduler() hook implemented.
http://article.gmane.org/gmane.comp.compilers.llvm.devel/63242/match=machinescheduler

Then -enable-misched=true/false simply toggles MI Sched without changing anything else.

How can we test the MI scheduler by itself?

Is it interesting to test 3.3, or are there interesting features that were added to the trunk after branching 3.3? In the latter case, we are willing to test the trunk.

It doesn’t look like my June checkins made it into 3.3. If you’re enabling MI Sched, and actually evaluating performance of the default heuristics, then it’s best to use trunk.

-Andy

Hi Andy,

We have done some experimental evaluation of the different schedulers in LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64 using SPEC CPU2006. We have measured both the amount of spill code as well as the execution time as detailed below.

Here are our main findings:

  1. The SD schedulers significantly impact the spill counts and the execution times for many benchmarks, but the machine instruction (MI) scheduler in 3.3 has very limited impact on both spill counts and execution times. Is this because most of your work on MI did not make it into the 3.3 release? We don’t have a strong motivation to test the trunk at this point (we’ll wait for 3.4), because we are working on a publication and prefer to base that on an official release. However, if you tell me that you expect things to be significantly different in the trunk, we’ll try to find the time to give that a shot (unfortunately, we only have one test machine, and SPEC tests take a lot of time as detailed below).

  2. The BURR scheduler gives the minimum amount of spill code and the best overall execution time (SPEC geo-mean).

  3. The source scheduler is the second best scheduler in terms of spill code and execution time, and its performance is very close to that of BURR in both metrics. This result is surprising for me, because, as far as I understand, this scheduler is a conservative scheduler that tries to preserve the original program order, isn’t it? Does this result surprise you?

  4. The ILP scheduler has the worst execution times on FP2006 and the second worst spill counts, although it is the default on x86-64. Is this surprising? BTW, Dragon Egg sets the scheduler to source. On Line 368 in Backend.cpp, we find:
    if (!flag_schedule_insns)
      Args.push_back("--pre-RA-sched=source");

Here are the details of our results:

Spill Counts

I find the results surprising, too. Which CPU did you run your tests on? Scheduler performance can vary a lot depending on the microarchitecture of your chip.

- Ben

Our test machine has two Intel Xeon E5540 processors running at 2.53 GHz with 24 GB of memory. Each CPU has 8 threads (16 threads in total). All our tests, however, were single threaded. Which result is particularly surprising for you? The low impact of the MI scheduler, the relatively good performance of the source scheduler or the relatively poor performance of the ILP scheduler?

Thanks
-Ghassan

We have done some experimental evaluation of the different schedulers in
LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64
using SPEC CPU2006. We have measured both the amount of spill code as well
as the execution time as detailed below.

Hi Ghassan,

This is an amazing piece of work, thanks for doing this. We need more
benchmarks like yours, and more often, too.

3. The source scheduler is the second best scheduler in terms of spill code
and execution time, and its performance is very close to that of BURR in
both metrics. This result is surprising for me, because, as far as I
understand, this scheduler is a conservative scheduler that tries to
preserve the original program order, isn't it? Does this result surprise
you?

Well, SPEC is an old benchmark, when code was written to accommodate the
hardware requirements, so preserving the code order might not be that big
of a deal on SPEC, as it is on other types of code. So far, I haven't found
SPEC to be very good for judging a compiler's overall performance, only
specific micro-optimized features.

Besides, hardware and software are designed nowadays based on some version
of Dhrystone, EEMBC, SPEC or CoreMark, so it's not impossible to see 50%
increase in performance with little changes in either.

4. The ILP scheduler has the worst execution times on FP2006 and the second
worst spill counts, although it is the default on x86-64. Is this
surprising? BTW, Dragon Egg sets the scheduler to source. On Line 368 in
Backend.cpp, we find:
if (!flag_schedule_insns)
    Args.push_back("--pre-RA-sched=source");

This looks like someone ran a similar test and did the sensible thing. How
that reflects with Clang, or how important it is to be the default, I don't
know. This is the same discussion as the optimization levels, and what
passes should be included in what. It also depends on which scheduler will
evolve faster or further in time, and what kind of code you're compiling...

This is not a perfectly accurate metric, but, given the large sample size
(> 10K functions), the total number of spills across such a statistically
significant sample is believed to give a very strong indication about each
scheduler's performance at reducing register pressure.

I agree this is a good enough metric, but I'd be cautious in stating that
there is a "very strong indication about each scheduler's performance".
SPEC is, after all, a special case in compiler/hardware world, and anything
you see here might not happen anywhere else.

Real world, modern code, (such as LAMP stack, browsers, office suites, etc)
are written expecting the compiler to do magic, while old-school benchmarks
weren't, and they were optimized for decades by both compiler and hardware
engineers.

The %Diff Max (Min) is the maximum (minimum) percentage difference on a
single benchmark between each scheduler and the source scheduler. These
numbers show the differences on individual FP benchmarks can be quite
significant.

I'm surprised that you didn't run "source" 5/9 times, too. Did you get the
exact performance numbers multiple times? Would be good to have a more
realistic geo-mean for source as well, so we could estimate how much the
other geo-means vary in comparison to source's.

Most of the above performance differences have been correlated with
significant changes in spill counts in hot functions.

Which is a beautiful correlation between spill-rate and performance,
showing that your metrics are at least reasonably accurate, for all
purposes.

We should probably report this as a performance bug if ILP stays the
default scheduler on x86-64.

You should, regardless of what's the default choice.

cheers,
--renato

I'm not a HyperThreading expert, but does leaving HT enabled mess up
single-threaded experience?

I once had a similar box and left HT off for benchmarks, but that was more
out of ignorance from my part.

--renato

Hi Renato,

Please see my answers below.

Thanks
-Ghassan

Ghassan: You have made me so curious to try other benchmarks in our future
work. Most academic publications on CPU performance though use SPEC. You
can even find some recent publications that are still using SPEC CPU2000!
When I was at AMD in 2009, performance optimization and benchmarking was
all about SPEC CPU2006. Have things changed so much in the past 4 years?

Unfortunately, no. Most manufacturers still use SPEC (and others) to
design, test and certify their products.

This is not a problem per se, as SPEC is very good and reasonably generic,
but any single benchmark can't cover the wide range of applications a CPU
is likely to undergo along its life. So, my grudge is that there isn't much
effort into understanding how to benchmark the different uses of a CPU, not
necessarily against SPEC. I think SPEC is a good match for your project.

And the more important question is: what specific features do these
non-SPEC benchmarks have that are likely to affect the scheduler's register
pressure reduction behavior?

No idea. ;) Mind you that I don't know any decent benchmark that will give
you the "general user" case, but there are a number of specific benchmarks
(browsers, codecs, databases, web servers all have benchmark features
enabled).

Also, for your project, you're only interested in a very specific behaviour
of a very specific part of the compiler (spills), so any benchmark will
give you a way to test it, but every one will have some form of bias.

What I recommend is not to spend much time running a plethora of
benchmarks, only to find out that they all tell you the same story, but try
to find a benchmark that is completely different from SPEC (say,
Browsermark or the MySQL benchmark suite) and see if the spill correlation
is similar.

If it is, ignore it. If not, just mention that this correlation may not be
seen with other benchmarks. ;)

Ghassan: Can you please give more specific features in these modern
benchmarks that affect spill code reduction? Note that our study included
over ten thousand functions with spills. Such a large sample is expected to
cover many different kinds of behavior, and that's why I am calling it a
"statistically significant" sample.

I was being a bit pedantic in pointing out that 10K data points are only
statistically relevant if they're independent, which they might not be if
each individual test was created / crafted with the same intent in mind
(similar function size, number of functions, number of temporaries, etc).

Most programmers don't pay that much attention to good code and end up
writing horrible code that stresses specific parts of the compiler. If you
have access to PlumHall suite, I encourage you to compile the chapter 7.22
test as an example.

Also, related to register pressure, different bad codes will stress
different algorithms, so you also have to be careful in stating that one
algorithm is much better than others only based on one badly-written
program.

Ghassan: Sorry if I did not include a clear enough description of the
numbers' meanings. Let me explain that more precisely:
First of all, the "source" scheduler was indeed run for 9 iterations
(which took about 2 days), and that was our baseline. All the numbers in
the execution-time table are percentage differences relative to "source".
Of course, there were random variations in the numbers, but we did the
standard SPEC practice of taking the median. For most benchmarks, the
random variation was not significant.
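
As a small worked illustration of the reporting method described in this quote (median over repeated runs, then a percentage difference against the "source" baseline), here is a sketch. The function names and the sample timings in the usage note are made up for the example and are not measurements from this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of run times. Takes the vector by value so the caller's
// data is left unsorted.
double median(std::vector<double> runs) {
    assert(!runs.empty());
    std::sort(runs.begin(), runs.end());
    const std::size_t mid = runs.size() / 2;
    return runs.size() % 2 ? runs[mid] : (runs[mid - 1] + runs[mid]) / 2.0;
}

// Percentage difference of `value` relative to `baseline`.
// Positive means `value` is larger (slower, if these are execution times).
double percentDiff(double value, double baseline) {
    assert(baseline != 0.0);
    return 100.0 * (value - baseline) / baseline;
}
```

For example, hypothetical run times of 112, 110 and 109 seconds against baseline runs of 101, 100 and 99 seconds have medians 110 and 100, so `percentDiff` reports a 10% slowdown.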

I see, my mistake.

There was one particular benchmark though (libquantum), on which we thought
that the random variation is too large to make a meaningful comparison, and
therefore we decided to exclude that.

Quite amusing, having the libquantum behaving erratically. ;)

cheers,
--renato

I should note here that although SPEC provided us with a sufficiently large sample for our spill-count experiment, I don’t think that SPEC has enough hot functions with spills to make our execution-time results statistically significant. That’s because SPEC has many benchmarks with peaky profiles, where one or two functions dominate the execution time. So, if one heuristic gets very lucky (or unlucky) on a few hot functions, it may get a deceptively high (or low) score. That’s why I think if someone runs the same kind of test on a different benchmark suite with comparable size, he may get different execution-time results, but most likely he will get the same spill count results that we got (of course, I mean the relative results).

-Ghassan

I see what you mean, and I think you'll only know after you run the same
analysis on a very different benchmark, on software that doesn't have very
few very hot functions, like browsers, databases and the like.

--renato

Ghassan, and anyone else interested in the scheduler:

This is a good time for me to give a thorough update of the MI scheduler. Hopefully that will answer many of your questions.

Some important things changed between the time I introduced the MI scheduler a year ago, and the release of 3.3. The biggest change was loop vectorization, which reduces register pressure and somewhat preschedules loops. Since 3.3 was released, the generic MI scheduler’s heuristics were reevaluated in preparation for making it the default for targets without a custom scheduling strategy–more on that later. The source order scheduler was also fixed so that it actually preserves IR order, which is at least closer to source order.

For many benchmarks we’ve looked at, source order scheduling approaches the lower bound on register pressure–heuristics can only hurt–making it difficult to distinguish between a lucky scheduler and a good scheduler.

It’s not surprising that SelectionDAG scheduling with BURR reduces spill code on average. It is fully register pressure aggressive. It gives highest priority to Sethi-Ullman number, which is typically nonsense, but does prevent some of the worst register pressure situations. It then does an expensive check to determine the shortest live range. This is also inaccurate, but on average reduces pressure.
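
For readers unfamiliar with the Sethi-Ullman number mentioned above, here is a sketch of the textbook labeling on a binary expression tree, in the variant where every leaf is assumed to need one register. The `Expr` type and helper names are invented for the example. This is not the BURR code in LLVM, which works on a DAG rather than a tree (one reason the number can be inaccurate there).

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>

// Toy expression tree: a node with no children is a leaf (an operand).
struct Expr {
    std::unique_ptr<Expr> lhs, rhs;
};

std::unique_ptr<Expr> leaf() { return std::make_unique<Expr>(); }

std::unique_ptr<Expr> node(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    assert(l && r && "binary nodes need two children");
    auto n = std::make_unique<Expr>();
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Minimum number of registers needed to evaluate the tree without spilling.
int sethiUllman(const Expr& e) {
    if (!e.lhs && !e.rhs) return 1;                                     // leaf
    if (!e.lhs || !e.rhs) return sethiUllman(e.lhs ? *e.lhs : *e.rhs);  // unary
    const int l = sethiUllman(*e.lhs);
    const int r = sethiUllman(*e.rhs);
    // Equal needs: one result must be held while the other side is computed.
    // Unequal needs: evaluate the hungrier side first, then reuse its registers.
    return l == r ? l + 1 : std::max(l, r);
}
```

A balanced tree such as (a+b)*(c+d) gets label 3, while the left-leaning chain ((a+b)+c)+d gets label 2, which is why a scheduler that prioritizes by this number tends to serialize work into chains.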

The reason we switched from BURR to ILP a couple years ago was that although BURR is very aggressive, it is not very smart. Giving highest priority to inaccurate heuristics means generating pathologically bad schedules for some class of code. Regardless of how the programmer wrote the code, or what earlier passes have done, it will reschedule everything, fully serializing dependence chains. At that time, we noticed horrible performance on some crypto benchmarks. We decided to pay a small price in spill code for avoiding worst-case performance. We also realized, after performance analysis, that incrementally tuning these heuristics to avoid test-suite regressions was not leading toward an overall better scheduler for real programs. We decided that, since some targets need an MI-level scheduler anyway, we should redirect efforts into that project.

The high-level design goal of MI scheduler is to allow subtargets to plug in custom scheduling strategies, while providing a “safe” generic scheduler. The generic scheduler is safe in that it preserves instruction order until it detects a performance problem according to the subtarget’s machine model. This is a nice feature. It means that the scheduler should not often introduce a performance problem that did not already exist, and it makes the scheduled code much easier to understand and debug. So the close correlation between source order and MI scheduler is natural. In fact, you’ll find that, when scheduling for SandyBridge, the scheduler seldom perturbs the instruction sequence. This is a fundamental departure from the conventional approach of scheduling for out-of-order processors as if they execute in-order.

This does raise a difficult challenge of how the scheduler can know when the out-of-order processor is likely to stall. The new machine model has enough information to roughly estimate stalls if a long enough execution trace can be fed through it. However, for very heavily out-of-order processors (Nehalem+) it is extremely rare for acyclic code to saturate any resources. As a cheap, partial solution, the MI scheduler now computes the cyclic critical path, allowing it to estimate.

One major advantage of the MI scheduler is that it models register pressure with almost perfect precision. This is great for analyzing register pressure, but by itself isn’t a solution, and greedy heuristics are often unable to solve the problem without backtracking. The difficulty hasn’t been thinking of new heuristics and solving individual cases. Rather, it has been finding a strong justification to add cost and complexity to the scheduler.
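
To show what measuring the register pressure of a schedule means in the simplest case, here is a sketch that counts the peak number of simultaneously live virtual registers for a given instruction order. It is a simplification built on assumptions: one register class, at most one definition per instruction, a valid order in which definitions precede uses, live-in registers ignored, and unused definitions treated as live-out. The `Inst` type and `maxPressure` name are invented; the real MI scheduler tracks pressure per register class using live intervals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct Inst {
    int def;               // virtual register defined, or -1 for none
    std::vector<int> uses; // virtual registers read
};

// Peak number of live registers when `insts` are executed in `order`.
std::size_t maxPressure(const std::vector<Inst>& insts,
                        const std::vector<std::size_t>& order) {
    assert(order.size() == insts.size());
    // Position in the schedule of the last read of each register.
    std::unordered_map<int, std::size_t> lastUse;
    for (std::size_t pos = 0; pos < order.size(); ++pos)
        for (int r : insts[order[pos]].uses) lastUse[r] = pos;

    std::unordered_set<int> live;
    std::size_t peak = 0;
    for (std::size_t pos = 0; pos < order.size(); ++pos) {
        const Inst& in = insts[order[pos]];
        // Operands read for the last time die here, so the result may reuse
        // one of their registers.
        for (int r : in.uses)
            if (lastUse[r] == pos) live.erase(r);
        if (in.def >= 0) live.insert(in.def);
        peak = std::max(peak, live.size());
    }
    return peak;
}
```

For five instructions where instruction 2 combines the results of 0 and 1, and instruction 4 combines 2 and 3, the order 0,1,3,2,4 peaks at three live registers while the order 0,1,2,3,4 peaks at two.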

A month ago, Arnold Schwaighofer and I investigated this issue. We didn’t do this because spilling was a serious performance problem, but because the performance of the scheduler is annoyingly random when governed by greedy heuristics. If the scheduler always did the right thing, that would simplify performance tracking. We were able to solve each individual case with some combination of heuristics. The most efficient approach I’ve found so far involves partitioning the DAG into subtrees (see computeDFSResult–I think the implementation of subtree is still somewhat flawed though). We’ve tried biased scheduling by subtree, computing Sethi-Ullman numbers according to the subtree partition, and tracking live-ins that are reachable from dag nodes, among other things.

Ultimately, we decided not to enable any of these techniques in the generic scheduler–targets are still free to do what they like. The problem is that there are always cases in which these cheap heuristics do the wrong thing. So, while we could engineer good results for SPEC, we would not be solving the underlying problem of unstable scheduling heuristics. Given the primary goals of reducing compile time and maintaining instruction order unless performance is at stake, the bar for adding heuristics is high. Complicating the heuristics now also means making them harder to understand and improve in the future.

I would like to see a general solution to scheduling for register pressure. I had plenty of ideas for more ad-hoc heuristics within the bounds of list scheduling, but given that we haven’t demonstrated the value of simple heuristics, I don’t want to pursue anything more complicated. I think better solutions will have to transcend list scheduling. I do like the idea of constraining the DAG prior to scheduling [Touati, “Register Saturation in Superscalar and VLIW Codes”, CC 2001], because that entirely separates the problem from list scheduler heuristics. However, I won’t be able to justify adding more complexity, beyond list scheduling heuristics, to the LLVM codebase to solve this problem. Work in this area would need to be done as a side project. I don’t expect to do any more work on it.

In my next message I’ll explain the near-term plans for the scheduler.

-Andy

In my last message, I explained the goals of the generic MI scheduler and current status. This week, I'll see if we can enable MI scheduling by default for x86. I'm not sure which flags you're using to test it now. But by making it default and enabling the corresponding coalescer changes, we can be confident that benchmarking efforts are improving on the same baseline. At that point, I expect bugs to be filed for specific instances of badly scheduled code. Getting a fix committed may not be easy, because we have to show that new heuristics aren't likely to pessimize other code. But at least I'll be able to provide an explanation for why MI isn't currently handling it.

There are other reasons that MI sched should be enabled now on x86 anyway:

(1) The Selection DAG scheduler will be disabled as soon as I can implement a complete replacement. That should eliminate about 10% of codegen (llc) compile time. The Selection DAG scheduler has also long suffered from unacceptable worst-case compile time behavior and unresolved defects. We've been chipping away at the problems, but some remain: PR15941, PR16365. This is a fundamentally bad place to perform scheduling.

(2) The postRA scheduler will also be eliminated. That will eliminate another 10% of compile time for targets that currently enable it. It also eliminates a maintenance problem because its dependence on kill flags and implicit operands is frightening--these can easily be valid for some targets but not others.

(3) Non-x86 targets have been using MI sched for the past year to achieve important performance and compile time benefits. For quality and maintenance reasons, we should use the same scheduling infrastructure for mainstream targets.

The basic theme here is that we want a single scheduling infrastructure that is efficient enough to enable by default--even if it is typically performance-neutral, can leverage verification across many targets, and can be safely customized by plugging in heuristics.

-Andy

  1. The SD schedulers significantly impact the spill counts and the execution times for many benchmarks, but the machine instruction (MI) scheduler in 3.3 has very limited impact on both spill counts and execution times. Is this because most of your work on MI did not make it into the 3.3 release? We don’t have a strong motivation to test the trunk at this point (we’ll wait for 3.4), because we are working on a publication and prefer to base that on an official release. However, if you tell me that you expect things to be significantly different in the trunk, we’ll try to find the time to give that a shot (unfortunately, we only have one test machine, and SPEC tests take a lot of time as detailed below).

Most of the above performance differences have been correlated with significant changes in spill counts in hot functions. Note that the ILP scheduler causes a degradation of 22% on one benchmark (lbm) relative to the source scheduler. We have verified that this happens because of poor scheduling that increases the register pressure and thus leads to generating excessive spills in this benchmark’s hottest loop. We should probably report this as a performance bug if ILP stays the default scheduler on x86-64.

ILP's days as the default scheduler are numbered.

Regarding your publication: it’s fine to use BURR with 3.3 and call it the best LLVM scheduler for your set of benchmarks. Even though the default MI scheduler on trunk has improved, I doubt it will beat BURR at average register pressure reduction (as I mentioned in a previous email, we decided not to enable global register pressure heuristics). I did notice a 10% improvement on 470.lbm at some point after 3.3, but you would need to verify.

In 3.4, the SD scheduler may still be available, but won’t be maintained. So I recommend using MI scheduler for baselines.

Note that you can easily plug in alternate scheduling heuristics using the MI scheduler. It wouldn’t be hard to implement a BURR strategy for the new scheduler.
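To make the idea of a pluggable heuristic concrete, here is a minimal sketch of a list scheduler whose selection policy is passed in as a function. This is a toy model in Python, not LLVM's API: the real interface lives in LLVM's MachineScheduler.h and is not reproduced here. The names `list_schedule`, `source_order`, and `consumer_first`, and the small dependence graph, are all hypothetical and chosen only for illustration.

```python
def list_schedule(deps, pick):
    """Toy list scheduler. deps maps node -> set of predecessor nodes.

    pick(ready, order, deps) is the pluggable heuristic: it chooses which
    ready node to schedule next.
    """
    remaining = {node: set(preds) for node, preds in deps.items()}
    order = []
    while remaining:
        # A node is ready once all of its predecessors have been scheduled.
        ready = sorted(n for n, preds in remaining.items() if not preds)
        if not ready:
            raise ValueError("dependence graph has a cycle")
        node = pick(ready, order, deps)
        order.append(node)
        del remaining[node]
        for preds in remaining.values():
            preds.discard(node)
    return order


def source_order(ready, order, deps):
    """Heuristic 1: keep the original order (nodes are numbered in source order)."""
    return ready[0]


def consumer_first(ready, order, deps):
    """Heuristic 2: prefer the node whose operand was produced most recently,
    which tends to shorten live ranges. Ties go to the lowest node number."""
    def key(node):
        positions = [order.index(p) for p in deps[node]]
        return (max(positions, default=-1), -node)
    return max(ready, key=key)


# Hypothetical block: 2 uses 0, 3 uses 1, 4 uses 2 and 3.
DEPS = {0: set(), 1: set(), 2: {0}, 3: {1}, 4: {2, 3}}

print(list_schedule(DEPS, source_order))    # [0, 1, 2, 3, 4]
print(list_schedule(DEPS, consumer_first))  # [0, 2, 1, 3, 4]
```

The scheduling loop never changes; only the `pick` function does, which is the sense in which a different strategy (BURR-like or otherwise) can be swapped in without touching the rest of the scheduler.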

For the purpose of evaluating register reduction, I agree that your spill metric is a reasonable indicator. It is true that we will readily trade a single spill inside a loop for multiple spills outside. But it seems good enough for evaluating different versions of the scheduler before gathering your final numbers for publication. That gets you results in minutes instead of days.
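As a rough illustration of why a static spill count and run-time cost can disagree, the sketch below compares the two on made-up numbers. The function names and the execution frequencies are assumptions for the example only; they are not taken from LLVM or from any measurement.

```python
def static_spill_count(spills):
    """Count spill instructions, ignoring how often each one executes."""
    return sum(count for count, _freq in spills)


def weighted_spill_cost(spills):
    """Weight each spill instruction by an estimated execution frequency."""
    return sum(count * freq for count, freq in spills)


# Each entry is (number of spill instructions, estimated executions of the block).
# The frequencies are invented for illustration.
one_spill_in_loop = [(1, 1000)]
three_spills_outside_loop = [(3, 1)]

print(static_spill_count(one_spill_in_loop))           # 1
print(static_spill_count(three_spills_outside_loop))   # 3
print(weighted_spill_cost(one_spill_in_loop))          # 1000
print(weighted_spill_cost(three_spills_outside_loop))  # 3
```

The static count prefers the single spill inside the loop, while the weighted cost prefers the three spills outside it. The static count is still cheap to collect, which is why it can serve for quick comparisons between scheduler versions.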

I’m not really sure why spills have such an impact on your SPEC2006 performance, whether it’s critical path load latency, load bandwidth (SandyBridge has doubled this), decode bandwidth, or even loop stream detector capacity.

-Andy

While I'm generally really excited by this, I would ask for a bit more
staging of this change.

Specifically, I would really like for a single, clear switch to enable
exactly what you want benchmark data on *before* it becomes the default,
and to give various folks time to run benchmarks and report serious
regressions.

I don't want our ability to ship LLVM from top-of-tree to be seriously
impaired by this, and enabling a feature that can have dramatic performance
impact without giving folks a really simple way to try it out and a
period of time to run benchmarks and collect data seems to do that. =/

Once it is the default, it would be really good to leave in the single,
simple switch for a period of time for folks to disable it if need be.

Ok. I tried to make that clear when I went through this process in July. But I’ll add another flag (in addition to the subtarget hook) to “flip the switch” and remove the flag later.
The purpose of changing SD scheduler policy, register coalescer policy, and MI scheduler simultaneously is only to avoid having folks who are watching performance results waste time chasing transient regressions.

For x86, this is primarily about compile time (while continuing to avoid worst-case scheduling). Switching the default now is to give people time to file bugs before the next step: disabling SD scheduler.

-Andy

Andy,

Thank you for the explanation! Using a statistical approach, we have also come to the conclusion that it is extremely hard to find one good register pressure heuristic that works well in all cases or even a large enough percentage of the cases. In our statistical study, we applied about 20 different heuristics to the 7216 functions with spills in FP2006. The BURR heuristic gave the best overall result (total sum of spills), but it was the best heuristic on only 64% of the functions. This means that on more than one third of the functions, BURR resulted in extra spills relative to the best heuristic for each function.

So, I agree that a greedy heuristic cannot give a general solution to this problem. We have been exploring combinatorial approaches to the problem, but the problem with our current algorithm is that the scheduling cost function, which is the peak excess register pressure (PERP), does not always correlate well with the amount of spill code. That’s because in a sufficiently large basic block, you may minimize the peak register pressure in the block but still have unnecessarily high register pressure at non-peak points in the block (see Section 5 in our recent paper: Preallocation Instruction Scheduling with Register Pressure Minimization using a Combinatorial Approach, ACM TACO, V10, Issue 3, 2013, doi:10.1145/2512432). We are currently exploring alternate cost functions. However, we do not expect any combinatorial algorithm that will come out of this work to be fast enough to become the default scheduling algorithm in a production compiler in the present or even in the near future. It may be a good reference for evaluating heuristics though.
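A small sketch of the PERP issue, using invented pressure profiles: two schedules of the same block can have identical peak excess pressure while one of them sits above the register limit at many more points. The `total_excess` function is only a hypothetical comparison metric for this example; it is not one of the alternate cost functions under study, and the register count and profiles are assumptions.

```python
def peak_excess(pressure, num_regs):
    """PERP: how far the worst program point exceeds the available registers."""
    return max(0, max(pressure) - num_regs)


def total_excess(pressure, num_regs):
    """Hypothetical comparison metric: excess pressure summed over all points."""
    return sum(max(0, p - num_regs) for p in pressure)


NUM_REGS = 4  # assumed number of available physical registers

# Live-value counts at each program point for two schedules of one block
# (made-up numbers).
schedule_a = [2, 3, 6, 3, 2, 3, 2, 1]
schedule_b = [2, 5, 6, 5, 5, 6, 5, 1]

print(peak_excess(schedule_a, NUM_REGS), peak_excess(schedule_b, NUM_REGS))    # 2 2
print(total_excess(schedule_a, NUM_REGS), total_excess(schedule_b, NUM_REGS))  # 2 8
```

Both schedules have a PERP of 2, so a cost function based only on the peak cannot tell them apart, yet schedule B exceeds the limit at six points and would be expected to need more spill code.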

For the time being, and until a good general solution to the problem is discovered, I think it makes sense for an open-source compiler to only support, by default, a basic scheduler that uses the minimum amount of compile time. Then people who are interested in optimizing the performance of a specific application or the performance of their hardware on a specific benchmark suite can write more complex heuristics if necessary and tune them for their target programs. It’d be nice though to have multiple heuristics optionally available in addition to the default heuristic.

Regards
Ghassan

I just added a flag: -misched-bench. You can use it to flip back and forth between your target’s default SD scheduler and the machine scheduler. It doesn’t affect whether the postRA scheduler is also run.

-Andy