I’ve been trying to determine how to properly set the MicroOpBufferSize parameter. Based on the documentation in the header file:
// “> 1” means the processor is out-of-order. This is a machine independent
// estimate of highly machine specific characteristics such as the register
// renaming pool and reorder buffer.
Power 9 is out-of-order and so it makes sense to use a value greater than 1. However we don’t quite have a reorder buffer in the theoretical sense where we can pick any instruction in the buffer to dispatch next. As a result, I’m looking for a better understanding of how this parameter is used in the scheduler.
I’ve noticed that the only place where it matters if the value of MicroOpBufferSize is 2 or 200 is in GenericScheduler::checkAcyclicLatency(). In other places in the code we only care if we have values that fall in one of the three categories: 0, 1, or > 1.
Here is my understanding of what that function does (and I could be very wrong on this one) and please correct me if I am wrong.
MicroOpBufferSize is a confusing machine parameter. Most backends only
need to tell generic machine scheduler whether to do strict VLIW-style
in-order (=0), heuristic in-order (=1), or out-of-order (>1). We
should have some kind of alias for those modes so that backends don’t
need to explicitly mention the buffer size (patches please).
As far as picking the buffer size value for OOO hardware, it is meant
to estimate very roughly the number of instructions in-flight before
the cpu either runs out of physical registers or the number of entries
in some queue of instructions waiting for retirement (reoder buffer).
The generic machine scheduler doesn’t try to simulate the reorder
buffer because it is pointless to do so within the scope of a single
block on any modern hardware. The only time I’ve seen a case where the
compiler could determine that OOO resources would be overutilized is
in floating point loop kernel characterized by long latency operations and
interesting cyclic dependencies.
Aside from the one heuristic where it’s used, some backends may just
want to and express this level of detail about their microarchitecture
for documentation purpose and future use. I imagine a separate static
performance tool that would simulate the reservation stations and
reorder buffer. It would be nice if someone would write such a thing
based on LLVM.
Back to the scheduler… the generic machine scheduler has two purposes:
(1) A quick way to bootstrap a backend with at least some reasonable
(hopefully do no harm) scheduling.
(2) An example of how to piece together pieces of the scheduling
infrastructure into a custom scheduler, with in-tree testing of those
You’re far beyond the point of using the generic scheduler as a black
box. I suggest running benchmarks with and without the
checkAcyclicLatency heuristics to see which loops it helps/hurts and
possibly writing your own heuristic here.
We can have two critical paths through a loop. One is cyclic (use-def cross loop iterations) and one is acyclic (everything can be computed in one iteration). If the acyclic path is greater than the cyclic path by more than the size of the instruction buffer then we have to make sure that we don’t run out of instructions to dispatch in the instruction buffer. So, we set a flag Rem.IsAcyclicLatencyLimited and then tryCandidate checks this flag when making decisions.
Is this correct?
If that is what we are doing here then I think the proper value for this parameter is the full size of the instruction buffer on the P9 hardware. All we really care about is whether or not we will run out of instructions to dispatch while waiting for an instruction. We don’t really care about HOW the out-of-order dispatch is done as long as we don’t run out of instructions.
Does my logic make sense?
I want to set the parameter but I want to make sure I understand what is going on.
Yes, that’s exactly the intention, but I need to clarify one
point. It’s very unlikely that the code within a single iteration
would exceed the buffer size. Instead, we’re looking for cases where
the OOO engine will be able to speculatively execute many loop
iterations before retiring some instructions from the first
iteration. These are the cases where I’ve seen the OOO engine running
out of resources. The heuristic estimates the number of iterations that
can be speculatively executed in the “shadow” of the loop’s acyclic
path. Then it basically multiplies the number of in-flight iterations
by the size of the loop.
/// CyclesPerIteration = max( CyclicPath, Loop-Resource-Height )
/// InFlightIterations = AcyclicPath / CyclesPerIteration
/// InFlightResources = InFlightIterations * LoopResources
Sorry, the implementation of the math is really hard to understand
because of all the scaling that happens so that heuristics can compute
everything using integers (no floating-point allowed in the
compiler). I realize the comments aren’t quite adequate.
The comment above refers to “resources” but the only resource that’s
currently modeled here is the number of micro-ops being issued. You
could conceivably model the per-functional-unit buffer sizes here, but
I don’t think those values are currently used at all.
If the number of in-flight micro-ops exceeds the buffer size, then
it’s important to hide latency within the loop. When the buffer runs
out, it stalls the processor front-end, so you want to have already
issued as many independent long-latency instructions as possible when