[RFC] Scheduler Rework

Hey Everyone,

I'd like to begin a project to rework the scheduler to address some
problems we've discovered on this end. The goal is to get a more
configurable/flexible scheduler while simplifying maintenance by
separating policy from implementation to get independent and
interchangeable parts.

This is going to be challenging because we are still stuck on LLVM 2.9.
We will be upgrading to LLVM 3.1 when it is released but I really need
to start this work now to meet work deadlines. But I want to make sure
I have something I can contribute upstream so I want to present the idea
and get some input on whether it is a good design and how best to move
it upstream. It looks like the fundamentals of the scheduler haven't
changed much from 2.9 to 3.1 so that makes it a bit easier.

Specifically, I would like some comment on whether it is reasonable
to just start a new RegReduction-like scheduler from scratch rather than
try to incrementally change the existing one. I'm looking at a pretty
invasive reengineering of the scheduler structure so that it will have
familiar elements but be put together quite differently. That's why I'm
not sure incremental change to the existing code makes sense. The two
implementations could live side-by-side for a while if that makes sense.

The existing RegReduction scheduler design goes something like this:

            SchedulingPriorityQueue
                      ^
                      |
              RegReductionPQBase
                      ^
                      |
          RegReductionPriorityQueue<SF>
                                    ^
                                   /|\
                            ______/ | \______
                           /        |        \
                      Heuristic 1  ...  Heuristic N

The only point of customization right now is the template parameter to
RegReductionPriorityQueue. This is the sort criterion for prioritizing
which node gets scheduled next. SchedulingPriorityQueue has some
virtual functions which can be (and are) overridden by
RegReductionPQBase and/or RegReductionPriorityQueue. However, the
current inheritance structure makes it impossible to tweak much even by
subclassing RegReductionPriorityQueue, either because the customization
points aren't virtual (easy to fix) or because customization points are
coupled (hard to fix without a redesign).
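
To make this concrete, here is a rough sketch of how the pieces fit
together today, heavily simplified from ScheduleDAGRRList.cpp (the
bodies and exact signatures below are approximations, not the real
code):

#include <vector>
class SUnit; // from ScheduleDAG.h

class SchedulingPriorityQueue {
public:
  virtual ~SchedulingPriorityQueue() {}
  virtual void push(SUnit *SU) = 0;
  virtual SUnit *pop() = 0;
  virtual bool empty() const = 0;
};

class RegReductionPQBase : public SchedulingPriorityQueue {
protected:
  // The ready list and the register pressure state both live here,
  // coupled together.
  std::vector<SUnit*> Queue;
public:
  virtual void push(SUnit *SU) { Queue.push_back(SU); }
  virtual bool empty() const { return Queue.empty(); }
};

// SF is a comparison functor such as bu_ls_rr_sort. This template
// parameter is the one real customization point.
template <class SF>
class RegReductionPriorityQueue : public RegReductionPQBase {
  SF Picker;
public:
  virtual SUnit *pop() {
    // Linear scan with Picker: O(N) per pop, but it sees dynamic
    // priority changes.
    std::vector<SUnit*>::iterator Best = Queue.begin();
    for (std::vector<SUnit*>::iterator I = Queue.begin(),
           E = Queue.end(); I != E; ++I)
      if (Picker(*Best, *I))
        Best = I;
    SUnit *SU = *Best;
    Queue.erase(Best);
    return SU;
  }
};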

Here are some things we have found useful to tune. Most are not
currently orthogonal and thus are not easily separated from each other:

- Node priority function (the SF parameter above). This is
  well-separated and easily changed.

- Queue implementation. The current implementation uses a std::vector<>
  and has poor asymptotic performance (O(N^2) currently). Previously
  this was a std::priority_queue<> but was changed because it missed
  some of the dynamic variance of node priorities. We have found it
  essential to be able to replace this, but even after inheriting from
  RegReductionPriorityQueue it's pretty ugly because there are two
  separate queues in the scheduler, one completely unused. Different
  queue types exhibit different heuristic behaviors based on how much
  dynamism they provide. Thus the queue implementation is policy that
  really ought to be separated from the scheduler proper. (See the
  queue sketch after this list.)

- Register pressure heuristic. The ILP scheduler uses a fairly
  simplistic register pressure calculation which is pretty good in most
  cases but we have found it to not work well in a few important cases.
  It would be nice to make this a customization point. It is a separate
  heuristic from the node priority function. In fact some node priority
  functions look at the register pressure heuristic to calculate their
  result. The two are orthogonal and we should be able to vary them
  independently. Currently the register pressure heuristic is coupled
  tightly to RegReductionPQBase. Again, because the data structures live
  in RegReductionPQBase, it is not convenient to simply make some of the
  heuristic methods virtual and override them. Such an arrangement
  doesn't allow easy mix-and-match among register pressure and node
  priority heuristics. Moreover, more elaborate register pressure
  calculations will almost certainly involve very different data
  structures.
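
To illustrate the queue point, here are two interchangeable queue
policies of the kind I have in mind (all names here are hypothetical).
The linear-scan queue re-evaluates priorities on every pop, so it sees
dynamic priority changes at O(N) per pop; the heap is O(log N) per
operation but only orders nodes by their priority as of push time:

#include <algorithm>
#include <vector>
class SUnit; // from ScheduleDAG.h

// Tracks dynamic priorities; O(N) per pop.
template <class SF>
struct LinearScanQueue {
  std::vector<SUnit*> Nodes;
  void push(SUnit *SU, SF &Cmp) { Nodes.push_back(SU); }
  SUnit *pop(SF &Cmp) {
    // Cmp is a less-than on priority, so max_element finds the best
    // ready node using current priorities.
    std::vector<SUnit*>::iterator Best =
      std::max_element(Nodes.begin(), Nodes.end(), Cmp);
    SUnit *SU = *Best;
    Nodes.erase(Best);
    return SU;
  }
};

// Cheaper per operation, but priorities are frozen at push time.
template <class SF>
struct HeapQueue {
  std::vector<SUnit*> Nodes;
  void push(SUnit *SU, SF &Cmp) {
    Nodes.push_back(SU);
    std::push_heap(Nodes.begin(), Nodes.end(), Cmp);
  }
  SUnit *pop(SF &Cmp) {
    std::pop_heap(Nodes.begin(), Nodes.end(), Cmp);
    SUnit *SU = Nodes.back();
    Nodes.pop_back();
    return SU;
  }
};

Swapping one for the other changes compile time and heuristic behavior
without touching the rest of the scheduler, which is exactly the
separation we are after.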

Here's a quick sketch of a new design floating in my head. I don't have
any code yet so this is subject to change but it gets the general idea
across. The idea is to configure the scheduler via policy classes.

            SchedulingPriorityQueue
                      ^
                      |
          RegReductionPriorityQueue<Queue, Pressure, Sort>
                                      ^        ^       ^
                                      |        |       |
                    Queue 1 ... Queue N        |       |
                                               |       |
                       Pressure 1 ... Pressure N       |
                                                       |
                                       Sort 1 ... Sort N

I'll have to work out exactly how to allow the policy classes to
communicate, for example to allow the Sort heuristic to contact the
Pressure heuristic for information. There are a few ways to do that.
One is to reintroduce RegReductionPQBase with some pure virtual
functions and use multiple inheritance from RegReductionPriorityQueue to
let the policy classes override them. A better way, I think, would be
to use composition and pass pointers to each policy object to the
others.
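
As a sketch of the composition approach (every name below is invented
for illustration; the real class would derive from
SchedulingPriorityQueue as today):

class SUnit; // from ScheduleDAG.h

template <typename Queue, typename Pressure, typename Sort>
class RegReductionPriorityQueue /* : public SchedulingPriorityQueue */ {
  Queue    ReadyQ;   // container and update discipline
  Pressure RP;       // register pressure heuristic and its data
  Sort     Picker;   // node priority function
public:
  // The Sort policy gets a pointer to the Pressure policy so priority
  // functions can query register pressure without inheritance tricks.
  RegReductionPriorityQueue() : Picker(&RP) {}

  void push(SUnit *SU) { ReadyQ.push(SU, Picker); }
  SUnit *pop()         { return ReadyQ.pop(Picker); }

  // Keep the pressure model current as nodes are scheduled.
  void scheduledNode(SUnit *SU) { RP.update(SU); }
};

The queue policies sketched after the list above would drop into the
Queue slot here unchanged.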

The above design also should allow simple expansion in the future. If
we identify more customization points we can introduce new policy
classes pretty easily.

Does the above look like a good design? If so, I believe it is easiest
to build this from scratch (borrowing a lot of code from the current
scheduler) rather than try to incrementally morph the current scheduler
to this design, but I am open to guidance.

Thanks for your help!

                             -Dave

Dave writes:

I’d like to begin a project to rework the scheduler to address some
problems we’ve discovered on this end. The goal is to get a more
configurable/flexible scheduler while simplifying maintenance by
separating policy from implementation to get independent and
interchangeable parts.

This is going to be challenging because we are still stuck on LLVM 2.9.
We will be upgrading to LLVM 3.1 when it is released but I really need
to start this work now to meet work deadlines. But I want to make sure
I have something I can contribute upstream so I want to present the idea
and get some input on whether it is a good design and how best to move
it upstream. It looks like the fundamentals of the scheduler haven’t
changed much from 2.9 to 3.1 so that makes it a bit easier.

We plan to move to the MachineScheduler by 3.2. The structure is:

ScheduleDAG: Abstract DAG of SUnits and SDeps
  |
  v
ScheduleDAGInstrs: Build the DAG from MachineInstrs, each SUnit tied to an MI
                   Delimit the current “region” of code being scheduled.
  |
  v
ScheduleDAGMI: Concrete implementation that supports both top-down and bottom-up scheduling
               with live interval update. It divides the region into three zones:
               Top-scheduled, bottom-scheduled, and unscheduled.

The ScheduleDAGMI constructor takes an instance of MachineSchedStrategy. This is currently a very simple interface that provides pickNode(), releaseTopNode(), and releaseBottomNode().
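
A trivial strategy looks roughly like this (a sketch only; exact
signatures may differ by the time this lands, and the priority logic
here is a placeholder):

#include <vector>
class SUnit; // from ScheduleDAG.h

class TrivialSchedStrategy /* : public MachineSchedStrategy */ {
  std::vector<SUnit*> TopReady;
public:
  void releaseTopNode(SUnit *SU)    { TopReady.push_back(SU); }
  void releaseBottomNode(SUnit *SU) { /* top-down only in this sketch */ }

  // Pick the next node to schedule; IsTopNode tells ScheduleDAGMI which
  // zone boundary the node was taken from.
  SUnit *pickNode(bool &IsTopNode) {
    if (TopReady.empty()) return 0;
    SUnit *SU = TopReady.back(); // stand-in for a real priority choice
    TopReady.pop_back();
    IsTopNode = true;
    return SU;
  }
};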

The MachineScheduler is pluggable at every level.

  1. The pass itself is optional. Targets may disable or override it
    completely. For example, a target that implements global scheduling
    would need to override the standard pass.

  2. Targets may create a MachineSchedRegistry instance that knows how
    to construct a custom ScheduleDAGInstrs. This is an easy way to
    reuse the standard scheduling pass and DAG builder. The driver will
    determine the scheduling region, but everything else is
    customizable. (See the registration sketch below this list.)

  3. New scheduling algorithms can be introduced by implementing new
    MachineSchedStrategies. We can add a target hook to select the
    strategy if needed.
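
Registration for point 2 looks roughly like this (using the toy
strategy sketched above; constructor signatures are approximate):

#include "llvm/CodeGen/MachineScheduler.h" // header location may vary
using namespace llvm;

// Factory the registry calls to build the custom scheduler.
static ScheduleDAGInstrs *createCustomSched(MachineSchedContext *C) {
  return new ScheduleDAGMI(C, new TrivialSchedStrategy());
}

// Makes the scheduler selectable, e.g. via -misched=custom.
static MachineSchedRegistry
CustomSchedRegistry("custom", "Custom MI scheduler (example)",
                    createCustomSched);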

There’s still plenty of opportunity to restructure the code and add customization points. Keep in mind that the goal for the target independent code is clarity and maintainability. Code reuse across custom schedulers is secondary and I’m not currently thinking of designing a template library for custom schedulers.

As much as possible, the standard scheduling algorithm will be built from standalone utilities and data structures. The customizations that you describe would all be handled by providing a new MachineSchedStrategy. Start by composing your scheduler from the pieces that are available, e.g. HazardChecker, RegisterPressure… (There’s not much value providing a scheduling queue abstraction on top of vector or priority_queue). From that point, we can improve the design.

Here’s what you can expect in the MachineScheduler:

  1. Precise register pressure tracking with back off. I’m in the process of checking in the pieces for this. It should be usable sometime in the next couple of weeks.

  2. Division of the target-defined resources into “interlocked” vs. “buffered”. The HazardChecker will continue to handle the interlocks for the resources that the hardware handles in order. Buffered resources are those the hardware can handle out of order; these will be considered separately for scheduling priority. We will also make a distinction between minimum and expected operation latency. (See the sketch after this list.)

  3. A register pressure reducing scheduler that considers interlocks and register pressure. This will prioritize register reuse above the pressure limits and include a very simple heuristic for avoiding high register pressure situations.

  4. A balanced scheduler that still prioritizes interlocks and register pressure limits, but balances buffered resource usage with register pressure avoidance.

  5. Improved heuristics. For example, we can precompute register lineages and use that information to avoid high register pressure situations.
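
To make the resource division in point 2 concrete, here is one
hypothetical shape it could take; none of these names exist today and
the real tables will differ:

// Purely illustrative; not an existing LLVM structure.
enum ResourceKind {
  Interlocked, // handled in order; the HazardChecker models the stalls
  Buffered     // handled out of order; balanced against avg. throughput
};

struct SchedResource {
  const char  *Name;     // e.g. "ALU", "LoadQueue" (made-up examples)
  ResourceKind Kind;
  unsigned     PerCycle; // units available per cycle, on average
};

struct OpLatency {
  unsigned Min;      // cycles before a dependent op can legally issue
  unsigned Expected; // cycles to cover for full pipeline throughput
};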

The overarching goal of the standard scheduling algorithm is to handle the most important cases that the SDScheduler and PostRAScheduler handle today without unnecessarily shuffling instructions. This should result in more debuggable generated code with more stable performance. The goal of the scheduler framework is to support the standard scheduler while providing a place for targets to introduce a custom scheduler, or simply any optimization that requires the scheduling DAG.

-Andy

Andrew Trick <atrick@apple.com> writes:

We plan to move to the MachineScheduler by 3.2. The structure is:

How hard will this be to backport to 3.1? Has work on this started
yet?

ScheduleDAG: Abstract DAG of SUnits and SDeps
  |
  v
ScheduleDAGInstrs: Build the DAG from MachineInstrs, each SUnit tied to an MI
                   Delimit the current "region" of code being scheduled.
  |
  v
ScheduleDAGMI: Concrete implementation that supports both top-down and bottom-up scheduling
               with live interval update. It divides the region into three zones:
               Top-scheduled, bottom-scheduled, and unscheduled.

So does this happen after SDNodes are converted to MachineInstrs? It's
not clear to me given your description of ScheduleDAGInstrs. I assume
it uses the current SDNode->SUnit mechanism but I just want to make
sure.

3. New scheduling algorithms can be introduced by implementing new
   MachineSchedStrategies. We can add a target hook to select the
   strategy if needed.

This sounds like what we want and yes, it should be target-selectable.

There's still plenty of opportunity to restructure the code and add
customization points. Keep in mind that the goal for the target
independent code is clarity and maintainability. Code reuse across
custom schedulers is secondary and I'm not currently thinking of
designing a template library for custom schedulers.

The two are not contradictory, but I care less about the specific
structure than I do about the flexibility to mix and match heuristics.
It just struck me that templates and policy classes are the natural way
to do it.

I'm glad to hear the top-down scheduler will get some attention. We'll
be wanting to use that.

As much as possible, the standard scheduling algorithm will be built
from standalone utilities and data structures. The customizations that
you describe would all be handled by providing a new
MachineSchedStrategy.

Makes sense to me.

Start by composing your scheduler from the pieces that are available,
e.g. HazardChecker, RegisterPressure... (There's not much value
providing a scheduling queue abstraction on top of vector or
priority_queue).

What do you mean by this last point? We absolutely want to be able to
swap out different queue implementations. There is a significant
compile time tradeoff to be made here.

1. Precise register pressure tracking with back off. I'm in the
process of checking in the pieces for this. It should be usable
sometime in the next couple of weeks.

Great!

2. Division of the target-defined resources into "interlocked"
vs. "buffered". The HazardChecker will continue to handle the
interlocks for the resources that the hardware handles in
order.

So by "interlocks" you mean hardware-implemented interlocks? So that
the scheduler will attempt to avoid them. Not that we have a machine
like this, but what about machines with no interlocks where software is
responsible for correctness (VLIW, early MIPS, etc.)? I suppose they
can be handled with a similar mechanism but the latter is much more
strict than the former.

Buffered resources are those the hardware can handle out of order; these
will be considered separately for scheduling priority. We will also
make a distinction between minimum and expected operation latency.

Does this mean you want to model the various queues, hardware scheduling
windows, etc.? I'm just trying to get a clearer idea of what this
means.

3. A register pressure reducing scheduler that considers interlocks
and register pressure. This will prioritize register reuse above the
pressure limits and include a very simple heuristic for avoiding high
register pressure situations.

4. A balanced scheduler that still prioritizes interlocks and register
pressure limits, but balances buffered resource usage with register
pressure avoidance.

5. Improved heuristics. For example, we can precompute register
lineages and use that information to avoid high register pressure
situations.

This all sounds great and is exactly what we're looking for.

As I said, this is a time-critical thing for us. Is there any way I can
help to move this along?

Thanks!

                             -Dave

Andrew Trick <atrick@apple.com> writes:

We plan to move to the MachineScheduler by 3.2. The structure is:

How hard will this be to backport to 3.1? Has work on this started
yet?

In my previous message I outlined the steps that I would take to bring up the new scheduler. I’m about to check in the register pressure reducing scheduler. The next step will be plugging in the target itinerary.

ScheduleDAG: Abstract DAG of SUnits and SDeps
  |
  v
ScheduleDAGInstrs: Build the DAG from MachineInstrs, each SUnit tied to an MI
                   Delimit the current “region” of code being scheduled.
  |
  v
ScheduleDAGMI: Concrete implementation that supports both top-down and bottom-up scheduling
               with live interval update. It divides the region into three zones:
               Top-scheduled, bottom-scheduled, and unscheduled.

So does this happen after SDNodes are converted to MachineInstrs? It’s
not clear to me given your description of ScheduleDAGInstrs. I assume
it uses the current SDNode->SUnit mechanism but I just want to make
sure.

Machine scheduling occurs in the vicinity of register allocation. It uses the existing MachineInstr->SUnit mechanism.


I’m glad to hear the top-down scheduler will get some attention. We’ll
be wanting to use that.

Out of curiosity what about top-down works better for your target?

Start by composing your scheduler from the pieces that are available,
e.g. HazardChecker, RegisterPressure… (There’s not much value
providing a scheduling queue abstraction on top of vector or
priority_queue).

What do you mean by this last point? We absolutely want to be able to
swap out different queue implementations. There is a significant
compile time tradeoff to be made here.

Use whatever data structure you like for your queue. I don’t have plans to make a reusable one yet. They’re not complicated.

  2. Division of the target-defined resources into “interlocked”
    vs. “buffered”. The HazardChecker will continue to handle the
    interlocks for the resources that the hardware handles in
    order.

So by “interlocks” you mean hardware-implemented interlocks, which the
scheduler will attempt to avoid? Not that we have a machine
like this, but what about machines with no interlocks where software is
responsible for correctness (VLIW, early MIPS, etc.)? I suppose they
can be handled with a similar mechanism but the latter is much more
strict than the former.

I’m not designing a mechanism for inserting nops to pad latency. If someone needs that, it’s easy to add.

Buffered resources are those the hardware can handle out of order; these
will be considered separately for scheduling priority. We will also
make a distinction between minimum and expected operation latency.

Does this mean you want to model the various queues, hardware scheduling
windows, etc.? I’m just trying to get a clearer idea of what this
means.

I don’t intend to directly model microarchitectural features of OOO processors at the level of buffer sizes. Although that could be done by a highly motivated developer.

I do intend to allow a target to identify arbitrary categories of resources, how many are available in each cycle on average, and indicate which resources are used by an operation. I’ll initially piggyback on the existing functional units to avoid rewriting target itineraries.
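
For example, the existing itineraries already carry enough to derive a
rough per-operation resource usage; something like the following (the
itinerary accessors are the existing ones, but aggregating cycles this
way is only my illustration):

#include "llvm/MC/MCInstrItineraries.h" // Target/TargetInstrItineraries.h on 2.9
using namespace llvm;

// Count the cycles an itinerary class spends on the functional units
// in FUMask; one way to approximate "resources used by an operation"
// from the existing tables.
static unsigned countFuncUnitCycles(const InstrItineraryData &Itins,
                                    unsigned ItinClass, unsigned FUMask) {
  unsigned Cycles = 0;
  for (const InstrStage *IS = Itins.beginStage(ItinClass),
                        *E = Itins.endStage(ItinClass); IS != E; ++IS)
    if (IS->getUnits() & FUMask)
      Cycles += IS->getCycles();
  return Cycles;
}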

As I said, this is a time-critical thing for us. Is there any way I can
help to move this along?

In general, fixing LLVM bugs and keeping build bots green is always helpful.

As far as the scheduler, pieces are starting to fall into place. People are starting to use those pieces and contribute. This is a pluggable pass, so there’s no reason you can’t develop your own machine scheduler in parallel and gradually take advantage of features as I add them. Please expect to do a little code copy-pasting into your target until the infrastructure is mature enough to expose more target interfaces. I’m not going to waste time redesigning APIs until we have working components.

It would be very useful to have people testing new features, finding bugs early, and hopefully fixing those bugs. I would also like people to give me interesting bits of code with performance issues that can work as test cases. That’s hard if I can’t run your target.

-Andy

Andrew Trick <atrick@apple.com> writes:

    How hard will this be to backport to 3.1? Has work on this started
    yet?

In my previous message I outlined the steps that I would take to bring up the new scheduler. I'm about to check in the register pressure reducing
scheduler. The next step will be plugging in the target itinerary.

Ok, but that doesn't really answer the question. Do you have a feel for
whether this can be backported relatively easily?

Machine scheduling occurs in the vicinity of register allocation. It uses the existing MachineInstr->SUnit mechanism.

I'm not familiar with this mechanism. I thought that after
ISel/ScheduleDAG we no longer have SUnits.

    ...
    I'm glad to hear the top-down scheduler will get some attention. We'll
    be wanting to use that.

Out of curiosity what about top-down works better for your target?

It's easier to analyze certain constructs for heuristic purposes. For
example, I posted a while back about the problem that the current
scheduler moves calls around. It's easier to schedule those "in the
right place" if you don't have to look back through chains to find them.

The problem with bottom-up schedulers in general is that if you need
some node to be scheduled in a certain spot, you have to make sure all
its uses are scheduled "early" (that is, late in the static schedule) so
you can even _see_ the node that might cause a problem (like a call).
You don't know you have a potential problem until you've already
scheduled a bunch of stuff in the wrong place. With a top-down
scheduler you can see problem nodes earlier.

    What do you mean by this last point? We absolutely want to be able to
    swap out different queue implementations. There is a significant
    compile time tradeoff to be made here.

Use whatever data structure you like for your queue. I don't have
plans to make a reusable one yet. They're not complicated.

That's fine but there have to be some hooks to plug in a different data
structure.

I do intend to allow a target to identify arbitrary categories of
resources, how many are available in each cycle on average, and
indicate which resources are used by an operation. I'll initially
piggyback on the existing functional units to avoid rewriting target
itineraries.

Ok.

As far as the scheduler, pieces are starting to fall into
place. People are starting to use those pieces and contribute. This is
a pluggable pass, so there's no reason you can't develop your own
machine scheduler in parallel and gradually take advantage of features
as I add them.

That's a bit disappointing. It sounds like a lot of throwaway work. I
understand it's a work in progress. I'll take a look at what's there
and see where I might be of some help.

Please expect to do a little code copy-pasting into your target until
the infrastructure is mature enough to expose more target
interfaces. I'm not going to waste time redesigning APIs until we have
working components.

Which APIs are you talking about?

It would be very useful to have people testing new features, finding
bugs early, and hopefully fixing those bugs. I would also like people
to give me interesting bits of code with performance issues that can
work as test cases. That's hard if I can't run your target.

We just use x86, so you can almost certainly run it. But to do testing
I'll need something I can backport to 3.1. We can't afford to develop
against buggy trunk. We have to work off a release.

                            -Dave