MachineScheduler not scheduling for latency

Hi,

I'm trying to understand why MachineScheduler does a poor job in
straight-line code in cases like the one in the attached debug dump.
This is on AMDGPU, an in-order target, and the problem is that the
IMAGE_SAMPLE instructions have very high (80 cycle) latency, but in
the resulting schedule they are often placed right next to their uses
like this:

1784B %140:vgpr_32 = IMAGE_SAMPLE_LZ_V1_V2 %533:vreg_64,
%30:sreg_256, %26:sreg_128, 8, 0, 0, 0, 0, 0, 0, 0, 0, implicit $exec
:: (dereferenceable load 4 from custom TargetCustom8)
1792B %142:vgpr_32 = V_MUL_F32_e32 %44:sreg_32, %140:vgpr_32, implicit $exec
...

This can be improved slightly in post-ra scheduling, but not much. The
post-ra scheduler simply doesn't have enough freedom to move
instructions around once physical registers have been assigned, so I
contend that MachineScheduler needs to consider latency.

I've looked at what's going on in the debugger and the problem seems
to be that GenericSchedulerBase::setPolicy does not set
Policy.ReduceLatency because it thinks that the other zone (Top in
this case) is issue limited. There are lots of things I don't
understand here:

1. "Issue limited" seems to mean that the number of instructions is
greater than the length of the critical path. I don't really
understand why this is an interesting criterion. It seems to me like a
fairly normal state of affairs.
2. Why does the fact that it's issue limited mean that it's a good
idea to turn off the latency heuristic? Moving instructions around
can't really change whether the function is issue limited or not, but
it can definitely improve latency problems.
3. Why do we completely turn off the latency heuristic, rather than
just letting it take its usual (very low) priority in
GenericScheduler::tryCandidate?
4. Stepping back a bit, is MachineScheduler really trying to consider
latency at all in pre-ra mode? I see a comment that it "Schedules
aggressively for latency in PostRA mode", but what about pre-ra?

Of course we can and do override some of the generic logic in our
target, in lib/Target/AMDGPU/GCNSchedStrategy.cpp, but before going
further down that route I'd like to try to understand the intent of
the generic logic.

Thanks for any insights,
Jay.

machine-scheduler.txt.gz (61.2 KB)

Hi Jay,

General disclaimer: The GenericScheduler is intended to exercise the key features that cover most of what backends will need, and to have heuristics that are good enough for a typical out-of-order CPU. The intent was always that backend developers who care enough about scheduling will plug in a target-specific MachineSchedStrategy. The GenericScheduler also tends to take a "do no harm" approach and leave things in source order until it can determine that scheduling will help.

That said...

Yes, pre-RA is definitely the place to hide latency, but GenericScheduler currently does it very conservatively when MicroOpBufferSize > 1. Backends that need aggressive latency hiding will set MicroOpBufferSize = 1. I would expect you to be doing that for your target.
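
For reference, this is the knob in question; a minimal sketch of checking it from C++ (the helper name is mine, the accessor is LLVM's TargetSchedModel API):

#include "llvm/CodeGen/TargetSchedModel.h"

// 0 = classic in-order list scheduling, 1 = in-order but willing to trade
// stall cycles against register pressure, >1 = treated as out-of-order.
static bool isInOrderLike(const llvm::TargetSchedModel &SM) {
  return SM.getMicroOpBufferSize() <= 1;
}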

One reason latency is suppressed/ignored at times with MicroOpBufferSize > 1 is that some loops make good use of registers when left in source order, but exhibit spill code as soon as the scheduler starts shuffling things.

"Issue limited" means that forming instructions groups is on average more important than reducing the critical path. IsResourceLimited is similar but across all modeled processor resources. As you said, that doesn't imply that latency is irrelevant. That's probably being controlled by 'shouldReduceLatency', which looks to me like it suppresses latency hiding until the critical path becomes relevant. My guess is that it's extremely conservative about latency hiding because defaulting to latency hiding can get you backed into a corner with register pressure.

Again, shouldReduceLatency should not be relevant at MicroOpBufferSize = 1, because the scheduler should prioritize issue stalls over other heuristics.

Looking at the key decision point in your trace, I see the scheduler is being driven by the `REG-CRIT` heuristic. So, it thinks that failing to schedule the def and its use together would increase pressure in a critical register class.

There's no expectation that the GenericScheduler heuristics are anywhere near as good as they can be over the universe of workloads, but changing those heuristics means checking in with current backend maintainers to be sure you're not regressing their benchmarks in the process. I haven't touched these heuristics in 6 years, but several others have had a hand in it recently.

-Andy

Hi Andy,

Thanks for the explanations. Yes AMDGPU is in-order and has
MicroOpBufferSize = 1.

Re "issue limited" and instruction groups: could it make sense to
disable the generic scheduler's detection of issue limitation on
in-order CPUs, or on CPUs that don't define instruction groups, or
some similar condition? Something like:

--- a/lib/CodeGen/MachineScheduler.cpp
+++ b/lib/CodeGen/MachineScheduler.cpp
@@ -2062,10 +2062,13 @@ getOtherResourceCount(unsigned &OtherCritIdx) {
   if (!SchedModel->hasInstrSchedModel())
     return 0;
 
-  unsigned OtherCritCount = Rem->RemIssueCount
-    + (RetiredMOps * SchedModel->getMicroOpFactor());
-  LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
-                    << OtherCritCount / SchedModel->getMicroOpFactor() << '\n');
+  unsigned OtherCritCount = 0;
+  if (some condition) {
+    OtherCritCount = Rem->RemIssueCount
+      + (RetiredMOps * SchedModel->getMicroOpFactor());
+    LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
+                      << OtherCritCount / SchedModel->getMicroOpFactor() << '\n');
+  }
   for (unsigned PIdx = 1, PEnd = SchedModel->getNumProcResourceKinds();
        PIdx != PEnd; ++PIdx) {
     unsigned OtherCount = getResourceCount(PIdx) + Rem->RemainingCounts[PIdx];

As for "shouldReduceLatency should not be relevant at
MicroOpBufferSize = 1": are you suggesting that shouldReduceLatency
should effectively be changed to always return true on in-order CPUs?
Even with that change, latency comes pretty far down the list of
criteria tested in GenericScheduler::tryCandidate.

Thanks,
Jay.

> Thanks for the explanations. Yes AMDGPU is in-order and has
> MicroOpBufferSize = 1.

Ok good. When MicroOpBufferSize==1, I would expect the Generic strategy to do something close to what you want.

More background:

MicroOpBufferSize==0 enables classic in-order scheduling, where reducing stalls takes priority at each step. This works via the traditional list-scheduling approach of leaving instructions in the pending queue until their latency is met.
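
Roughly, as a sketch (SUnit's ready-cycle fields are real; the helper and the simplification are mine):

#include "llvm/CodeGen/ScheduleDAG.h"

// Sketch of the classic rule: a node stays in the pending queue and only
// becomes a scheduling candidate once the current cycle reaches its ready
// cycle, so latency is enforced by construction rather than weighed
// against other heuristics.
static bool readyWithoutStall(const llvm::SUnit &SU, unsigned CurrCycle,
                              bool IsTopDown) {
  unsigned ReadyCycle = IsTopDown ? SU.TopReadyCycle : SU.BotReadyCycle;
  return ReadyCycle <= CurrCycle;
}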

MicroOpBufferSize==1 allows for more nuanced heuristics. You can now pay to reduce register pressure by spending stall cycles. But, that doesn't mean the scheduler should totally ignore stall cycles.

[Incidentally, I think controlling these scheduling modes with MicroOpBufferSize is a bit silly. It would be nice to have three explicit modes: vliw-ish, inorder-ish, out-of-order, and have MicroOpBufferSize just default to 1.]

In this in-order mode, the scheduler does account for the stall cycles due to latency. Sometimes, that means the scheduler actually deprioritizes latency because it can now see that some latency can be hidden by other long latency ops.
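
As a sketch of that accounting (the helper is made up; getLatencyStallCycles on SchedBoundary is the real query behind heuristic #4 below):

// Sketch: the stall a candidate would cause is just the gap between the
// cycle it becomes ready (its predecessors' latency covered) and the
// current cycle; tryCandidate charges this via its Stall rule.
static unsigned stallCycles(unsigned ReadyCycle, unsigned CurrCycle) {
  return ReadyCycle > CurrCycle ? ReadyCycle - CurrCycle : 0;
}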

Let's step through the relevant prioritization in GenericScheduler::tryCandidate:

1. RegExcess: this means something is going to spill right here right now

2. RegCritical: this means that the overall register pressure exceeds
   your target's limit. I think this is the reason for your "bad"
   scheduling decision. You should check if this register pressure
   heuristic is doing the right thing for you, and maybe fix that
   instead. Or, if stalls are always more important than register
   pressure, move to MicroOpBufferSize==0 or plug in your own strategy
   (see the sketch after this list).

3. PathReduce | DepthReduce via Rem.IsAcyclicLatencyLimited &&
   tryLatency: I expect checkAcyclicLatency will probably work ok for
   your machine model, but it only applies to single-block loops. It
   looks like this particular case is not a loop?

4. Stall via getLatencyStallCycles: I expect this to generally avoid
   the kind of latency-induced stalls that you are referring to. If
you aren't being caught by a register pressure problem above, then
   this should handle your case.

5. PathReduce | DepthReduce via TryCand.Policy.ReduceLatency: ensures
   that latency hiding is preferred over source order. First,
   shouldReduceLatency needs to return true; is that already the case
   for you? Next, the result of shouldReduceLatency is overridden
   when setPolicy sees that the code is resource limited. This override
   appears to be what you're proposing to fix...
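
If you do go the custom-strategy route, the plumbing is small. A hedged sketch, assuming a recent LLVM where GenericScheduler::tryCandidate is virtual (the strategy name, registry name, and the particular biasing are invented for illustration):

#include "llvm/CodeGen/MachineScheduler.h"
#include <memory>
using namespace llvm;

namespace {
// Invented name: a strategy that consults latency before falling through
// to the stock GenericScheduler rules (register pressure included).
class LatencyFirstStrategy : public GenericScheduler {
public:
  LatencyFirstStrategy(const MachineSchedContext *C) : GenericScheduler(C) {}

  bool tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
                    SchedBoundary *Zone) const override {
    if (!Cand.isValid()) {
      TryCand.Reason = NodeOrder;
      return true;
    }
    // Prefer the critical path first; tryLatency is the same helper the
    // generic rules consult much further down their priority list.
    if (Zone && tryLatency(TryCand, Cand, *Zone))
      return TryCand.Reason != NoCand;
    return GenericScheduler::tryCandidate(Cand, TryCand, Zone);
  }
};

ScheduleDAGInstrs *createLatencyFirstSched(MachineSchedContext *C) {
  return new ScheduleDAGMILive(C, std::make_unique<LatencyFirstStrategy>(C));
}

// Makes the strategy selectable with -misched=latency-first for experiments.
MachineSchedRegistry LatencyFirstRegistry("latency-first",
                                          "Latency-first generic scheduling",
                                          createLatencyFirstSched);
} // end anonymous namespace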

Re "issue limited" and instruction groups: could it make sense to
disable the generic scheduler's detection of issue limitation on
in-order CPUs, or on CPUs that don't define instruction groups, or
some similar condition? Something like:

--- a/lib/CodeGen/MachineScheduler.cpp
+++ b/lib/CodeGen/MachineScheduler.cpp
@@ -2062,10 +2062,13 @@ getOtherResourceCount(unsigned &OtherCritIdx) {
  if (!SchedModel->hasInstrSchedModel())
    return 0;

- unsigned OtherCritCount = Rem->RemIssueCount
- + (RetiredMOps * SchedModel->getMicroOpFactor());
- LLVM_DEBUG(dbgs() << " " << Available.getName() << " + Remain MOps: "
- << OtherCritCount /
SchedModel->getMicroOpFactor() << '\n');
+ unsigned OtherCritCount = 0;
+ if (some condition) {
+ OtherCritCount = Rem->RemIssueCount
+ + (RetiredMOps * SchedModel->getMicroOpFactor());
+ LLVM_DEBUG(dbgs() << " " << Available.getName() << " + Remain MOps: "
+ << OtherCritCount /
SchedModel->getMicroOpFactor() << '\n');
+ }
  for (unsigned PIdx = 1, PEnd = SchedModel->getNumProcResourceKinds();
       PIdx != PEnd; ++PIdx) {
    unsigned OtherCount = getResourceCount(PIdx) + Rem->RemainingCounts[PIdx];

As for "shouldReduceLatency should not be relevant at
MicroOpBufferSize = 1": are you suggesting that shouldReduceLatency
should effectively be changed to always return true on in-order CPUs?
Even with that change, latency comes pretty far down the list of
criteria tested in GenericScheduler::tryCandidate.

Thanks,
Jay.

I don't think that there should be any difference between issue limitations and other resources. Whether one is more "normal" than the other totally depends on the machine model.

I also don't think it makes sense to always prioritize the critical path in in-order mode, because heuristic #4 above should already handle any stalls that occur as a result of the machine being in-order.

-Andy

>>
>> Thanks for the explanations. Yes AMDGPU is in-order and has
>> MicroOpBufferSize = 1.

> Ok good. When MicroOpBufferSize==1, I would expect the Generic strategy to do something close to what you want.
>
> More background:
>
> MicroOpBufferSize==0 enables classic in-order scheduling, where reducing stalls takes priority at each step. This works via the traditional list-scheduling approach of leaving instructions in the pending queue until their latency is met.
>
> MicroOpBufferSize==1 allows for more nuanced heuristics. You can now pay to reduce register pressure by spending stall cycles. But, that doesn't mean the scheduler should totally ignore stall cycles.
>
> [Incidentally, I think controlling these scheduling modes with MicroOpBufferSize is a bit silly. It would be nice to have three explicit modes: vliw-ish, inorder-ish, out-of-order, and have MicroOpBufferSize just default to 1.]
>
> In this in-order mode, the scheduler does account for the stall cycles due to latency. Sometimes, that means the scheduler actually deprioritizes latency because it can now see that some latency can be hidden by other long latency ops.
>
> Let's step through the relevant prioritization in GenericScheduler::tryCandidate:
>
> 1. RegExcess: this means something is going to spill right here right now
>
> 2. RegCritical: this means that the overall register pressure exceeds
>    your target's limit. I think this is the reason for your "bad"
>    scheduling decision. You should check if this register pressure
>    heuristic is doing the right thing for you, and maybe fix that
>    instead. Or, if stalls are always more important than register
>    pressure, move to MicroOpBufferSize==0 or plug in your own strategy.

Yes I think you're basically right about the bad scheduling being
caused by #2. We have a fundamental tension between register pressure
and stalls. Both of them are always more important than the other :-)

> 3. PathReduce | DepthReduce via Rem.IsAcyclicLatencyLimited &&
>    tryLatency: I expect checkAcyclicLatency will probably work ok for
>    your machine model, but it only applies to single-block loops. It
>    looks like this particular case is not a loop?
>
> 4. Stall via getLatencyStallCycles: I expect this to generally avoid
>    the kind of latency-induced stalls that you are referring to. If
>    you aren't being caught by a register pressure problem above, then
>    this should handle your case.

Thanks. I will take a look at why #4 is not handling stalls in the way
I'd expect.

> 5. PathReduce | DepthReduce via TryCand.Policy.ReduceLatency: ensures
>    that latency hiding is preferred over source order. First,
>    shouldReduceLatency needs to return true; is that already the case
>    for you? Next, the result of shouldReduceLatency is overridden
>    when setPolicy sees that the code is resource limited. This override
>    appears to be what you're proposing to fix...

Yes!

One of the effects I'm seeing is that small changes in the input cause
drastic changes in the resulting schedule. This makes life difficult
because innocuous changes elsewhere in the compiler can apparently
cause large performance regressions which we have to track down.

And one reason for this is that getOtherResourceCount /
checkResourceLimit decide whether the code is issue limited or not
based on the number of ops vs the length of the critical path, which
can easily flip-flop; and this overrides the result of
shouldReduceLatency, which affects whether tryCandidate calls
tryLatency, or falls back to source order.

Given that #5 is a last resort criterion before falling back to source
order, I don't understand why it ever makes sense to disable it.

> Re "issue limited" and instruction groups: could it make sense to
> disable the generic scheduler's detection of issue limitation on
> in-order CPUs, or on CPUs that don't define instruction groups, or
> some similar condition? Something like:
>
> --- a/lib/CodeGen/MachineScheduler.cpp
> +++ b/lib/CodeGen/MachineScheduler.cpp
> @@ -2062,10 +2062,13 @@ getOtherResourceCount(unsigned &OtherCritIdx) {
> if (!SchedModel->hasInstrSchedModel())
> return 0;
>
> - unsigned OtherCritCount = Rem->RemIssueCount
> - + (RetiredMOps * SchedModel->getMicroOpFactor());
> - LLVM_DEBUG(dbgs() << " " << Available.getName() << " + Remain MOps: "
> - << OtherCritCount /
> SchedModel->getMicroOpFactor() << '\n');
> + unsigned OtherCritCount = 0;
> + if (some condition) {
> + OtherCritCount = Rem->RemIssueCount
> + + (RetiredMOps * SchedModel->getMicroOpFactor());
> + LLVM_DEBUG(dbgs() << " " << Available.getName() << " + Remain MOps: "
> + << OtherCritCount /
> SchedModel->getMicroOpFactor() << '\n');
> + }
> for (unsigned PIdx = 1, PEnd = SchedModel->getNumProcResourceKinds();
> PIdx != PEnd; ++PIdx) {
> unsigned OtherCount = getResourceCount(PIdx) + Rem->RemainingCounts[PIdx];
>
> As for "shouldReduceLatency should not be relevant at
> MicroOpBufferSize = 1": are you suggesting that shouldReduceLatency
> should effectively be changed to always return true on in-order CPUs?
> Even with that change, latency comes pretty far down the list of
> criteria tested in GenericScheduler::tryCandidate.
>
> Thanks,
> Jay.

> I don't think that there should be any difference between issue limitations and other resources. Whether one is more "normal" than the other totally depends on the machine model.
>
> I also don't think it makes sense to always prioritize the critical path in in-order mode, because heuristic #4 above should already handle any stalls that occur as a result of the machine being in-order.
>
> -Andy

Thanks again,
Jay.

> One of the effects I’m seeing is that small changes in the input cause
> drastic changes in the resulting schedule. This makes life difficult
> because innocuous changes elsewhere in the compiler can apparently
> cause large performance regressions which we have to track down.

Yes. This sort of thing also happens a lot because of unpredictable register coalescing decisions and even spilling. It’s definitely a major problem for performance tracking.

> And one reason for this is that getOtherResourceCount /
> checkResourceLimit decide whether the code is issue limited or not
> based on the number of ops vs the length of the critical path, which
> can easily flip-flop; and this overrides the result of
> shouldReduceLatency, which affects whether tryCandidate calls
> tryLatency, or falls back to source order.
>
> Given that #5 is a last resort criterion before falling back to source
> order, I don’t understand why it ever makes sense to disable it.

The other, higher-priority heuristics should handle situations where latency may cause a stall. This is an aggressive heuristic that overall tries to bias scheduling in one direction, to the likely detriment of register pressure. Enabling an aggressive heuristic unconditionally introduces regressions where it would have been better to do no scheduling at all. That would actually make the scheduler a bigger source of performance noise.

Of course, every target and set of important workloads is different, which is why it’s easy to plug in your own heuristics. Or, if this change is useful for other targets as well, we could consider adding a property to the machine model that tells the generic scheduler to be unconditionally latency aggressive. Before considering that, it’s important to be sure which heuristics are really in play and verify that it would improve your performance noise problem across a wide range of programs.
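
To make that concrete, an entirely hypothetical sketch of the shape such a property could take (neither the field nor the helper exists upstream; the names are invented):

// Hypothetical inputs to the decision GenericSchedulerBase::setPolicy makes
// today, plus the proposed per-CPU machine-model property.
struct ProposedPolicyInputs {
  bool OtherZoneResourceLimited; // what the existing analysis computed
  bool ModelIsLatencyAggressive; // the proposed machine-model property
};

// Today the resource-limited override always wins; under the proposal a
// target could declare that latency hiding stays on regardless.
static bool shouldSetReduceLatency(const ProposedPolicyInputs &In) {
  return !In.OtherZoneResourceLimited || In.ModelIsLatencyAggressive;
}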

As an aside, if I had continued working on instruction scheduling, I would have worked on classifying code patterns before scheduling and using that classification to determine a strategy.

-Andy