Why did Intel change his static branch prediction mechanism during these years?

( I don't know if it's allowed to ask such question, if not, please remind me. )

I know Intel implemented several static branch prediction mechanisms
these years:
  * 80486 age: Always-not-take
  * Pentium4 age: Backwards Taken/Forwards Not-Taken
  * PM, Core2: Didn't use static prediction, randomly depending on
what happens to be in corresponding BTB entry , according to agner's
optimization guide ¹.
  * Newer CPUs like Ivy Bridge, Haswell have become increasingly
intangible, according to Matt G's experiment ².

And Intel seems don't want to talk about it any more, because the
latest material I found within Intel Document was written about ten
years ago.

I know static branch prediction is (far?) less important than dynamic,
but in quite a few situations, CPU will be completely lost and
programmers(with compiler) are usually the best guide. Of course these
situations are usually not performance bottleneck, because once a
branch is frequently executed, the dynamic predictor will capture it.

Since Intel no longer clearly statements the dynamic prediction
mechanism in its document, the builtin_expect() of GCC can do nothing
more than removing the unlikely branch from hot path or reversely for
likely branch.

I am not familiar with CPU design and I don't know what exactly
mechanism Intel use nowadays for its static predictor, I just feel the
best static mechanism for Intel should be to clearly document his CPU
"where I plan to go when dynamic predictor failed, forward or
backward", because usually the programmer is the best guide at that
time.

APPENDIX:
¹ Agner's optimization guide:
https://www.agner.org/optimize/microarchitecture.pdf , section 3.5
.

² Matt G's experiment: Branch prediction - part two — Matt Godbolt’s blog

I think almost everyone now is using some variation on “gshare”, or possibly a hybrid of gshare with another scheme (and a prediction of which scheme is better).

Maybe the best thing a compiler can do now with static information such as builtin_expect() is to keep the high probability path as straight-line code and move the low probability path to out of line code.

The 486 "assume not taken" strategy (also common in other in-order cores) is not a branch prediction mechanism as such. It can be interpreted as one (in that it's essentially equivalent to predicting all branches not taken), but first and foremost it's simply what happens when the instruction fetch stage just keeps going through memory sequentially unless redirected.

In other words, it's what naturally happens in a pipelined processor in the _absence_ of any intentionally designed branch prediction mechanism.

As for static vs. dynamic prediction, to be making explicit static predictions, you actually need to know whether you have seen a given branch before, and many branch predictors don't! That is, they simply have no way to tell reliably whether a given branch history (or branch target) entry is actually for the branch it's trying to predict. The various branch prediction structures are mostly just bags of bits indexed by a hash code.

Now some do have a few tag bits that allow them to detect collisions, probabilistically with high likelihood anyway, but others don't. It's just a matter of whether the architects think it's worthwhile to spend the area on tag bits, or whether they have another thing they'd rather do that is a better trade-off for their target criteria.

Netburst (Pentium 4) is an anomaly, AFAIK because of the way the branch prediction interacts with the trace cache. Namely, I believe that branch history etc. in Netburst was kept at the trace level. For trace cache entries, unlike branch history/target entries, there is a well-defined moment when they're new, namely when they're initialized after a trace cache miss (trace cache miss detection _has_ to do full tagging and be precise for the CPU to work); that's an opportunity to set up the new trace with the static prediction heuristic, and also branch hints. [But I'm no expert on this, and I bet there are Intel folks on this list that could be more precise.]

Zooming out a bit, real designs don't generally have a single monolithic branch prediction mechanism; they have lots of them, in different places in the pipeline. Going roughly in pipeline order, some common mechanisms are:

1. Next I$ fetch line prediction. The address for the next instruction fetch needs to be determined before the current fetch even completes, for there to be no bubbles.

2. Early branch target prediction (indexed by instruction fetch address); instruction fetch is usually multiple cycles, and again you need to know the likely address to fetch next before you've even seen any instruction bytes or know where branches are!

3. A later, more precise target/direction predictor. The early predictors are "quick and dirty", and not very accurate. In parallel, you start working on a more accurate prediction; if the late predictor disagrees with the early predictor, you need to squash a few cycles worth of instructions, but this is still fairly early in the pipeline, so a lot cheaper than full mispredictions.

4. A return stack predictor for function returns, the most common type of indirect branch by far. Tracks function calls and returns and supplies what it thinks are likely targets for return statements.

5. A separate predictor for other indirect branch targets (e.g. through virtual or jump tables). BTBs can cover some of this but dedicated indirect predictors work better. Again, an indirect predictor can decide than an earlier-guessed branch target was bogus and initiate a partial flush.

5. There are sometimes also special loop predictors to catch counted loops that (almost) always execute the same number of iterations. Innermost loop nests are usually covered well by the regular global branch history, but loop predictors help with outer loop nests that are periodic on longer time scales than the global branch history captures.

6. Once instructions are decoded, branch target addresses (for direct branches) are verified. Branch target addresses in the BTBs and other structures are commonly stored with a reduced number of bits, and in certain (hopefully rare) cases the "decompressed" address will be incorrect. Again, if this is detected, a partial flush is initiated; we're now several cycles downstream from initial instruction fetch, but this is still far from a full branch misprediction penalty.

CPU vendors are usually close-mouthed about what exactly they're doing (it's some of their secret sauce), but some are more open than others. One not-too-old CPU design that has a relatively thorough description of the branch prediction architecture (though not the nitty-gritty details) is the AMD Jaguar (low-power x86):

   http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip

Section 2.7.1 in the PDF has a (very) brief description of the various branch prediction mechanisms.

-Fabian

From Matt’s page, “The target program consists of 1000 back to back branches”.

Perhaps the processors got better at detecting when they may be running off into some kind of data? Wouldn’t I want it to predict that kind of thing as not taken?

Just a thought…

-G

Hi, Fabian Giesen. Sorry for reply so late.

As for static vs. dynamic prediction, to be making explicit static predictions, you actually need to know whether you have seen a given branch before, and many branch predictors don't! That is, they simply have no way to tell reliably whether a given branch history (or branch target) entry is actually for the branch it's trying to predict. The various branch prediction structures are mostly just bags of bits indexed by a hash code.

As you said, the branch predictor has no way to tell if met the branch
before, because BHB is indexed by hash value. But for branch target
predictor, BTB on early Intel CPUs is organized like common cache,
with set/way/tag identified by the full address of a branch.

So, I guess, is there a chance that the dynamic prediction phase quits
on BTB missing? Namely, if the branch target predictor(rather than
branch predictor) failed to find out a target, then it made the
conclusion that never see that branch before, no enough evidence for
dynamic prediction and pass control to static predictor. Otherwise,
how to explain early Intel CPUs(Pentium I, II, III) before Pentium4
support static prediction without trace cache ²?

I found some words from Intel document ¹ which may support my guess :
"Branches that do not have a history in the BTB (see Section 3.4.1,
“Branch Prediction Optimization”) are predicted using a static
prediction algorithm... "
Though I believe it's no longer applied for newer Intel CPUs, it
should be at least an evidence from Pentium age.

And I found another thing interesting.
Since Core architecture Intel stopped using static prediction
heuristic: (still quote from Intel document 3.4.1.3)
"The Intel Core microarchitecture does not use the static prediction
heuristic. However, to maintain consistency across Intel 64 and IA-32
processors, software should maintain the static prediction heuristic
as the default."

The interesting point is: Just on Core micro architecture, the
organization of BTB changed. Intel no longer use the full address of
a branch to identify BTB entry, but 0~21 bits of the address. And thus
the branch target predictor can't tell precisely whether seen a branch
before, just like branch predictor.

Thanks again.

APPENDIX:
¹ Intel Document, section 3.4.1.3
  https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
² Agner's optimization guide, section 3.2 3.3
https://www.agner.org/optimize/microarchitecture.pdf