Bottom-Up Scheduling?

Is there documentation somewhere for the bottom-up scheduling? I'm
trying to figure out what changes are necessary in order to support it
in the PPC backend.

Thanks in advance,
Hal

> Evan,
>
> Thanks for the heads up! Is there a current target that implements the
> scheduling as it will be? And does the bottom-up scheduling also account

ARM is a good model.

What part of ARM's implementation is associated with the bottom-up
scheduling? I am confused because it looks like it is essentially using
the same kind of ScoreboardHazardRecognizer that was commented out of
the PPC 440 code.

Thanks in advance,
Hal

Hi Hal,

The best way to ensure the PPC scheduling isn't hosed now or in the
future is probably to make it work as much like ARM as possible.

This means (1) defaulting to the "hybrid" scheduler, (2) implementing the
register pressure limit, and (3) reenabling the hazard recognizer.

(1) TargetLowering::setSchedulingPreference(Sched::Hybrid)

(2) TargetRegisterInfo::getRegisterPressureLimit(...) should probably
return something a bit less than 32, depending on register class.

(3) The standard hazard recognizer works either bottom-up or top-down
on the itinerary data, so it *should* work out of the box. The problem is
that PPC has overridden the API to layer some custom "bundling" logic
on top of basic hazard detection. This logic needs to be reversed for
bottom-up, or you could start by simply disabling it instead of the
entire hazard recognizer.
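
For concreteness, a minimal sketch of what steps (1) and (2) might look like
in a PPC-like target is below. The class names, register-class IDs, and
return values are illustrative assumptions (the hook name follows the one
used above), and the real signatures vary between LLVM revisions.

// Hypothetical sketch; names, register-class IDs, and limits are
// illustrative rather than the actual PPC sources.

// (1) Somewhere in PPCTargetLowering's constructor:
//       setSchedulingPreference(Sched::Hybrid);

// (2) Register pressure limit, using the hook named above:
unsigned PPCRegisterInfo::getRegisterPressureLimit(const TargetRegisterClass *RC,
                                                   MachineFunction &MF) const {
  switch (RC->getID()) {
  default:
    return 0;               // no pressure limit tracked for this class
  case PPC::GPRCRegClassID:
  case PPC::G8RCRegClassID:
    return 28;              // "a bit less than 32" integer registers
  case PPC::F4RCRegClassID:
  case PPC::F8RCRegClassID:
    return 30;              // likewise for the FP register files
  }
}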

Now, to generate the best PPC schedules, there is one thing you may
want to override. The scheduler's priority function has a
HasReadyFilter attribute (enum). It can be overridden by specializing
hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
scheduling, and maximizes the instructions that can issue in one
group, regardless of register pressure. We still care about register
pressure enough in ARM to avoid enabling this. I'm really not sure how
much it will help on modern PPC implementations though.
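
As a rough illustration of the kind of change being described
(hybrid_ls_rr_sort lives in lib/CodeGen/SelectionDAG/ScheduleDAGRRList.cpp,
and the exact field layout may differ by revision), flipping the attribute
would look something like:

struct hybrid_ls_rr_sort : public queue_sort {
  enum {
    IsBottomUp = true,
    HasReadyFilter = true  // default is false; true makes the ready queue
                           // filter out nodes the hazard checker rejects
  };
  // ... comparison operator and isReady() unchanged ...
};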

I realize this is confusing because we have a scheduler mode named
"ILP". That mode is intended for targets that do not have an
itinerary. It's currently set up for x86 and would need some tweaking
to work well for other targets. Again, if your target has an
itinerary, you probably want the "hybrid" mode.

-Andy

Is EmitInstruction used in bottom-up scheduling at all? The version in
the ARM recognizer seems essential, but in all of the regression tests
(and some other .ll files I have lying around), it is never called. It
seems that only Reset() and getHazardType() are called. Could you please
explain the calling sequence?

Thanks again,
Hal

I feel that I should clarify my comment: For PPC, now that Hybrid
scheduling is enabled, EmitInstruction seems never to be called (at
least it is not called when running any PPC codegen test in the
regression-test collection).

Thanks again,
Hal

Hal,

Since PPCHazardRecognizer is not derived from ScoreboardHazardRecognizer, you’ll need to initialize MaxLookAhead to the max depth of your target’s itinerary.

See how this is done in the ScoreboardHazardRecognizer ctor:

MaxLookAhead = ScoreboardDepth;
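
A rough sketch of that initialization for a hand-rolled recognizer (assuming
it keeps its own ItinData pointer; the itinerary walk mirrors the scoreboard
constructor, but the exact details vary by LLVM revision):

PPCHazardRecognizer::PPCHazardRecognizer(const InstrItineraryData *Itin)
  : ItinData(Itin) {
  unsigned Depth = 0;
  if (ItinData && !ItinData->isEmpty()) {
    // Walk every itinerary class and track the deepest pipeline stage.
    for (unsigned Class = 0; !ItinData->isEndMarker(Class); ++Class) {
      unsigned Cycle = 0;
      for (const InstrStage *IS = ItinData->beginStage(Class),
                            *E = ItinData->endStage(Class); IS != E; ++IS) {
        Cycle += IS->getCycles();        // cycles occupied by this stage
        Depth = std::max(Depth, Cycle);  // deepest stage seen so far
      }
    }
  }
  MaxLookAhead = Depth;  // lookahead window used by the scheduler
}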

-Andy

Andy,

Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
anyway, is there any reason not to have it derive from
ScoreboardHazardRecognizer at this point? It looks like the custom
bundling logic could be implemented on top of the scoreboard recognizer
(that seems similar to what ARM's recognizer is doing).
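
Hypothetically, such a refactoring might look something like the sketch
below; the constructor arguments and the way the bundling check is layered
on top are assumptions for illustration, not the actual PPC or ARM code.

class PPCHazardRecognizer : public ScoreboardHazardRecognizer {
public:
  PPCHazardRecognizer(const InstrItineraryData *ItinData,
                      const ScheduleDAG *DAG)
    : ScoreboardHazardRecognizer(ItinData, DAG, "pre-RA-sched") {}

  virtual HazardType getHazardType(SUnit *SU, int Stalls) {
    // Consult the itinerary-driven scoreboard first...
    HazardType H = ScoreboardHazardRecognizer::getHazardType(SU, Stalls);
    if (H != NoHazard)
      return H;
    // ...then apply the PPC-specific bundling constraints here.
    return NoHazard;
  }
};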

-Hal

Also, how does the ARM hazard recognizer get away with not implementing
RecedeCycle?

Thanks again,
Hal

ARM can reuse all of the default scoreboard hazard recognizer logic, such as RecedeCycle (naturally, since it's the primary client). If you can do the same with PPC, that's great.

Andy

Andy,

I should have been clearer: the ARM implementation has:
void ARMHazardRecognizer::RecedeCycle() {
  llvm_unreachable("reverse ARM hazard checking unsupported");
}

How does that work?

Thanks again,
Hal

Andy,

  Is there any good info/docs on scheduling strategy in LLVM? As I was
complaining to you at the LLVM meeting, I end up reverse engineering and
second-guessing more than I would like to... This thread shows that I am
not exactly alone in this... Thanks.

Sergei Larin

Hal,

My first answer was off the top of my head, so I missed the subtle issue. Just so you know, to answer questions like this I usually need to instrument the code with tracing or step through in the debugger. Even though I've hacked on the code quite a bit, the interaction between the scheduler and target hooks is still not obvious to me from glancing at the code. FWIW, I'm hoping it can be cleaned up gradually, maybe for the next release.

The preRA scheduler is bottom-up, for register pressure tracking. The postRA scheduler is top-down, for simpler hazard detection logic.

On ARM, the preRA scheduler uses an unspecialized instance of ScoreboardHazardRecognizer. The machine-independent RecedeCycle() logic that operates on the scheduler itinerary is sufficient.

The ARM postRA scheduler specializes the HazardRecognizer to handle additional constraints that cannot be expressed in the itinerary. Since this is a top-down scheduler, RecedeCycle() is not applicable.
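
As a concrete illustration of that split (paraphrased and approximate, not
the exact ARM sources): the preRA and postRA schedulers obtain their
recognizers through different TargetInstrInfo hooks, and ARM only
specializes the postRA one, which is why its RecedeCycle() can simply be
unreachable.

// preRA (bottom-up): a plain scoreboard driven by the itinerary is enough,
// so the generic RecedeCycle() implementation is the one that actually runs.
ScheduleHazardRecognizer *ARMBaseInstrInfo::
CreateTargetHazardRecognizer(const TargetMachine *TM,
                             const ScheduleDAG *DAG) const {
  const InstrItineraryData *II = TM->getInstrItineraryData();
  return new ScoreboardHazardRecognizer(II, DAG, "pre-RA-sched");
}

// postRA (top-down): specialize to model constraints the itinerary cannot
// express; this recognizer never needs RecedeCycle().
ScheduleHazardRecognizer *ARMBaseInstrInfo::
CreateTargetPostRAHazardRecognizer(const InstrItineraryData *II,
                                   const ScheduleDAG *DAG) const {
  return new ARMHazardRecognizer(/* itinerary, subtarget, ... */);
}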

-Andy

Sergei,

I would say that each target has its own scheduling strategy that has changed considerably over time. We try to maximize code reuse across targets, but it's not easy and done ad hoc. The result is confusing code that makes it difficult to understand the strategy for any particular target.

The right thing to do is:
1) Make it as easy as possible to understand how scheduling works for each of the primary targets (x86 and ARM) independent of each other.
2) Make it easy for very similar targets to piggyback on one of those implementations, without having to worry about other targets
3) Allow dissimilar targets (e.g. VLIW) to completely bypass the scheduler used by other targets and reuse only nicely self-contained parts of the framework, such as the DAG builder and individual machine description features.

We've recently moved further from this ideal scenario in that we're now forcing targets to implement the bottom-up selection dag scheduler. This is not really so bad, because you can revert to "source order" scheduling, -pre-RA-sched=source, and you don't need to implement many target hooks. It burns compile time for no good reason, but you can probably live with it. Then you're free to implement your own MI-level scheduler.

The next step in making it easier to maintain an llvm scheduler for "interesting" targets is to build an MI-level scheduling framework and move at least one of the primary targets to this framework so it's well supported. This would separate the nasty issues of serializing the selection DAG from the challenge of microarchitecture-level scheduling, and provide a suitable place to inject your own scheduling algorithm. It's easier to implement a scheduler when starting from a valid instruction sequence where all dependencies are resolved and no register interferences exist.

To answer your question, there's no clear way to describe the current overall scheduling strategy. For now, you'll need to ask porting questions on llvm-dev. Maybe someone who's faced a similar problem will have a good suggestion. We do want to improve that situation and we intend to do that by first providing a new scheduler framework. When we get to that point, I'll be sure that the new direction can work for you and is easy to understand. All I can say now is that the new design will allow a target to compose a preRA scheduler from an MI-level framework combined with target-specific logic for selecting the optimal instruction order. I don't see any point in imposing a generic scheduling algorithm across all targets.

-Andy

This was actually the source of my question: it was clear that the ARM
RecedeCycle function was not being called, but, running the PPC code in
the debugger, it was clear that PPC's RecedeCycle function was being
called. I did not appreciate the preRA vs. postRA distinction, so thank
you for explaining that.

Instead of trying to port the PPC bundling logic for use with the
bottom-up scheduler, maybe it would be sufficient for now to use it
postRA (which seems like what ARM is doing).

-Hal

ScoreboardHazardRecognizer::EmitInstruction checks the reservation table, where cycle=0 corresponds to the first pipeline stage and cycle=n is the stage that occurs after n cycles in the itinerary. It doesn’t care if we’re scheduling top-down or bottom-up. For top-down, the reservation table models resources used by previous instructions, and for bottom-up it models the resources of subsequent instructions, but the EmitInstruction logic is the same either way.

AdvanceCycle is called by the current top-down scheduler (now postRA only) to shift the reservation table forward in time. The earliest pipeline stage is dropped.

RecedeCycle is called by the bottom-up scheduler (preRA) to shift the reservation table backward in time. The latest pipeline stage is dropped.
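
To make the symmetry concrete, here is a small toy model (not LLVM code) of
a ring-buffer reservation table: the per-cycle check is identical in both
directions, and only the direction of the shift differs.

// Toy model of the reservation table behavior described above.
#include <cstdint>
#include <cstring>

struct ToyScoreboard {
  static const unsigned Depth = 8;   // lookahead window in cycles
  uint32_t Units[Depth];             // Units[c]: FU mask reserved c cycles away
  unsigned Head;                     // ring-buffer index of "cycle 0"

  ToyScoreboard() : Head(0) { std::memset(Units, 0, sizeof(Units)); }

  uint32_t &slot(unsigned Cycle) { return Units[(Head + Cycle) % Depth]; }

  // The reservation check is direction-agnostic: cycle 0 is "now" and
  // cycle n is n cycles away in whichever order we happen to schedule.
  bool hasHazard(unsigned Cycle, uint32_t UnitMask) {
    return (slot(Cycle) & UnitMask) != 0;
  }
  void reserve(unsigned Cycle, uint32_t UnitMask) { slot(Cycle) |= UnitMask; }

  // Top-down: step forward in time; the earliest slot is dropped and a
  // fresh slot appears at the far end of the window.
  void AdvanceCycle() {
    slot(0) = 0;
    Head = (Head + 1) % Depth;
  }

  // Bottom-up: step backward in time; the latest slot is dropped and a
  // fresh slot becomes the new "cycle 0".
  void RecedeCycle() {
    slot(Depth - 1) = 0;
    Head = (Head + Depth - 1) % Depth;
  }
};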

It’s easier to think about hazards in a top-down scheduler.

-Andy

Andy,

   Thank you for the extended and prompt answer. Let me try to summarize my
current position so you (and everyone interested) have a better view of the
world through my eyes :wink:

  1) LLVM's first robust VLIW target is currently in review. Its needs for
scheduling strategy/quality are rather different from what the current
scheduler(s) can provide.
  2) My first attempt at porting (while I was on 2.9) resulted in a new
top-down pre-RA VLIW-enabled scheduler that I was hoping to upstream as soon
as our back end is accepted. I guess I have missed the window, since our
commit took a bit longer than planned. Now Evan has told me (and you have
confirmed) that it would need to change to a bottom-up version for 3.0.
Moreover, the current "level" (exact placement in the DAG->DAG pass) of
pre-RA scheduling is less than optimal (and I agree with that, since I have
to bend over backwards to extract info that is readily available in MIs).
  3) Your group is working on a "new" scheduler, and as best I understand
it, it would be the same general algorithm moved "closer" to RA. I also
understand that at first it would not add support for "packets"/bundles/
multi-ops in the VLIW sense (or will it?). If they are present, an
interesting discussion on how subsequent passes will be modified to
recognize them would follow... but we had another thread on this topic not
that long ago.

  So, IMHO, the following would make sense:

  1) It would be very nice if we could have some sort of write-up detailing
the proposed changes, and maybe defining the overall strategy for
instruction scheduling in LLVM __before__ major decisions are made. It
should later be converted into a "how to" or a simple doc chapter on porting
the scheduler(s) to new targets. Public discussion should follow, and we
need to try to accommodate all needs (as much as possible).
  2) Any attempt on my part to further the VLIW scheduler design for my
target would be unwise until such a discussion takes place. I also do not
separate this process from the bundle/packet representation. If you perceive
an overhead associated with this activity, I could volunteer to help.

  Also, please see my comments embedded below.

Thanks.

Sergei Larin

> This is not really so bad, because you can revert to “source order”
> scheduling, -pre-RA-sched=source, and you don’t need to implement many
> target hooks. It burns compile time for no good reason, but you can
> probably live with it. Then you’re free to implement your own MI-level
> scheduler.

[Larin, Sergei]
I am not 100% sure about this statement, but as I get closer to
re-implementing my scheduler I might grasp a better picture.

One thing that would be nice to have ASAP is a SelectionDAG serialization pass that satisfies dependencies and physical register interferences, while preserving the IR instruction order whenever possible. This should be totally separate from the SelectionDAG scheduler. It should not work on SUnits.

I realize this is quite disjoint from the work needed to port a new target. I’m just pointing out that it would be a welcome feature.

If we had that pass, I could tell you that it would be fairly straightforward to reenable the top-down SD scheduler. At this point, since you’d rather schedule MIs anyway, you may choose to focus on that strategy instead.

> The next step in making it easier to maintain an llvm scheduler for
> “interesting” targets is to build an MI-level scheduling framework and
> move at least one of the primary targets to this framework so it’s well
> supported. This would separate the nasty issues of serializing the
> selection DAG from the challenge of microarchitecture-level scheduling,
> and provide a suitable place to inject your own scheduling algorithm.
> It’s easier to implement a scheduler when starting from a valid
> instruction sequence where all dependencies are resolved and no register
> interferences exist.

[Larin, Sergei]
Agree, and my whole point is that it needs to be done with preceding
public discussion, and not de-facto with code drops.

It will be an incremental process. I’m not going to design a complete scheduling framework for all microarchitectures “on paper” before making any changes. Design decisions will be deferred as late as they can be without holding up progress. You’ll know when they’re being made and have the opportunity to influence them. In fact, any new design will be strongly influenced by the scheduler work that you and others have done recently.

I think you’re reacting to the recent dropping of preRA top-down scheduling without public discussion. As you know it was not part of a planned strategy, and not a desirable outcome for anyone. The fact is that we couldn’t wait to fix an existing design flaw in DAG serialization. The bottom-up scheduler has the ability to overcome this problem, but implementing a fix that doesn’t require running the bottom-up scheduler requires significant work. The right thing to do is to implement the SD serialization pass I mentioned above. That solution would be preferable to everyone, but someone needs to make the investment.

Of course, anyone is welcome to fix the existing top-down scheduler as well. It requires implementing the inverse of the bottom-up scheduler’s physical register tracking, see LiveRegDefs, plus some really hairy logic for resolving interferences that the SelectionDAG builder has created.

FWIW, we’re not going to run into this issue with the MI scheduling framework that I’m referring to because no part of it will be imposed on any targets.

> To answer your question, there’s no clear way to describe the current
> overall scheduling strategy. For now, you’ll need to ask porting
> questions on llvm-dev. Maybe someone who’s faced a similar problem will
> have a good suggestion. We do want to improve that situation and we
> intend to do that by first providing a new scheduler framework. When we
> get to that point, I’ll be sure that the new direction can work for you

[Larin, Sergei]
Any clue on time frame?

2012 :slight_smile:

-Andy

> Now, to generate the best PPC schedules, there is one thing you may
> want to override. The scheduler's priority function has a
> HasReadyFilter attribute (enum). It can be overridden by specializing
> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
> scheduling, and maximizes the instructions that can issue in one
> group, regardless of register pressure. We still care about register
> pressure enough in ARM to avoid enabling this. I'm really not sure how
> much it will help on modern PPC implementations though.

Can this be done without modifying common code? It looks like
hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.

Thanks again,
Hal

Right. You would need to specialize the priority queue logic. A small amount of common code.
Andy

Andy,

I played around with this some today for my PPC 440 chips. These are
embedded chips (multiple pipelines but in-order), and may be more
similar to your ARMs than to the PPC-970 style designs...

I was able to get reasonable PPC 440 code generation by using the ILP
scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
load/store reordering patch). This worked significantly better than
using either hybrid or ilp alone (with or without setting
HasReadyFilter). I was looking at my primary use case which is
partially-unrolled loops with loads, stores and floating-point
calculations.
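
For reference, one way a target could wire up that combination itself
(rather than via command-line flags) is sketched below; the hook signature
is roughly what LLVM exposed around this time and should be treated as an
assumption, not a quote of the PPC sources.

bool PPCSubtarget::enablePostRAScheduler(
    CodeGenOpt::Level OptLevel,
    TargetSubtargetInfo::AntiDepBreakMode &Mode,
    RegClassVector &CriticalPathRCs) const {
  Mode = TargetSubtargetInfo::ANTIDEP_ALL;  // break all anti-dependencies
  CriticalPathRCs.clear();                  // no special critical-path classes
  return OptLevel >= CodeGenOpt::Default;   // run the postRA scheduler at -O1+
}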

This seems to work b/c ILP first groups the instructions to extract
parallelism and then the post-RA scheduler breaks up the groups to avoid
stalls. This allows the scheduler to find its way out of what seems to
be a "local minimum" of sorts, whereby it wants to schedule each
unrolled iteration of the loop sequentially. The reason why this seems
to occur is that the hybrid scheduler would prefer to suffer a large
data-dependency delay over a shorter full-pipeline delay. Do you know
why it would do this? (you can see PR11589 for an example if you'd
like).

Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
used? Is there a reason that this is a compile-time constant? Both
Hybrid and ILP have isReady() functions. I can certainly propose a patch
to make them command-line options.

Thanks again,
Hal

The "ilp" scheduler has several heuristics designed to compensate for lack of itinerary. Each of those heuristics has a flag, so you can see what works for your target. I've never used that scheduler with an itinerary, but it should work. It's just that some of the heuristics effectively override the hazard checker.

The "hybrid" scheduler depends more on the itinerary/hazard checker. It's less likely to schedule instructions close together if they may induce a pipeline stall, regardless of operand latency.

It's a compile time constant because it's clearly on the scheduler's critical path and not used by any active targets. Enabling HasReadyFilter turns the preRA scheduler into a strict scheduler such that the hazard checker overrides all other heuristics. That's not what you want if you're also enabling postRA scheduling!

-Andy