[RFC] BlockFrequency is the wrong metric; we need a new one

Right now, all profile information is funneled through two analysis passes prior to any part of the optimizer using it.

First, we have BranchProbabilityInfo, which provides a simple interface to the simplest form of profile information: local and relative branch probabilities. These merely express the likelihood of taking one of a mutually exclusive set of exit paths from a basic block. They are very simple, and the foundation of the profile information. Even the other analysis is merely built on top of this one.

Second, we have BlockFrequencyInfo, which attempts to provide a more “global” (function-wide, not actually program-wide) view of the statistical frequency with which any particular basic block is executed. This is a nicely principled analysis that simply computes the probabilistic flow of control through the various branches according to the probabilities established by the first analysis.

However, I think that BlockFrequencyInfo provides the wrong set of information. There is one critical reason why. Let’s take a totally uninteresting CFG:

A → B1, B2
B1 → C1, C2
B2 → C3, C4
C1 → D1, D2
C2 → ret
C3 → D3, D4
C4 → ret
D1 → E1, E2
D2 → ret
D3 → E3, E4
D4 → ret

You can imagine this repeating for as many levels as you like. This isn't an uncommon situation with real code. BlockFrequencyInfo computes for this a very logical answer:

---- Block Freqs ----
a = 1.0
a → b1 = 0.5
a → b2 = 0.5
b1 = 0.5
b1 → c1 = 0.25
b1 → c2 = 0.25
b2 = 0.5
b2 → c3 = 0.25
b2 → c4 = 0.25
c1 = 0.25
c1 → d1 = 0.125
c1 → d2 = 0.125
c2 = 0.25
c3 = 0.25
c3 → d3 = 0.125
c3 → d4 = 0.125
c4 = 0.25
d1 = 0.125
d1 → e1 = 0.0625
d1 → e2 = 0.0625
d2 = 0.125
d3 = 0.125
d3 → e3 = 0.0625
d3 → e4 = 0.0625
d4 = 0.125

One way of thinking about this is that for any basic block X which is predicated by N branches of unbiased probability (50/50), the frequency computed for that block is 2^(-N).
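
To make the computation concrete, here is a minimal Python sketch (my illustration, not the actual BlockFrequencyInfo code) that reproduces the table above for this acyclic CFG. The real analysis also has to handle loops via cyclic probabilities; this sketch ignores that entirely:

  # Successor lists for the example CFG; the 'ret' blocks are left out.
  succs = {'a': ['b1', 'b2'], 'b1': ['c1', 'c2'], 'b2': ['c3', 'c4'],
           'c1': ['d1', 'd2'], 'c2': [], 'c3': ['d3', 'd4'], 'c4': [],
           'd1': ['e1', 'e2'], 'd2': [], 'd3': ['e3', 'e4'], 'd4': []}

  freq = {'a': 1.0}                    # entry block is the "basis" frequency
  for b in succs:                      # the dict above is in topological order
      for s in succs[b]:
          prob = 1.0 / len(succs[b])   # every branch left at 50/50
          # frequency flows along an edge in proportion to its probability
          freq[s] = freq.get(s, 0.0) + freq[b] * prob

  # freq: b1 = 0.5, c1 = 0.25, d1 = 0.125, e1 = 0.0625, matching the table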

The problem is that this doesn't represent anything close to reality. Processors' branch prediction works precisely because very, very few branches in programs are 50/50. Most programs do not systematically explore, breadth-first, the full diversity of paths through the program. And yet, in the absence of better information, our heuristics would lead us to believe (and act!) as though this were true.

Now, I'm not saying that the computation of block frequencies is wrong. Merely that it cannot possibly be used for at least one of its purposes -- its relative frequency (to the entry block "basis" frequency) is completely useless for detecting hot or cold regions of a function -- it will simply claim that all regions of the function are cold. What it is useful for is establishing a total ordering over the basic blocks of a function. So it works well for some things like code layout, but is grossly misleading for others.

There are several possible solutions here. I’ll outline my proposal as well as some other ideas.

BlockWeights instead of BlockFrequencies. My idea is that we don't really care about the depth of the control dependence for a particular basic block. We care about the accumulated bias toward or away from a basic block. This is predicated on the idea that branches are overwhelmingly predictable. As a consequence, evenly distributed probabilities are really just uncertainties. My proposed way of implementing this would take the exact algorithm already used in BlockFrequency, but instead of computing an edge's frequency as the probability of the branch times the frequency of the predecessor, it would compute it as the frequency of the predecessor times how "biased" the branch is relative to the other branches in the predecessor. Essentially, this would make a branch probability set of {0.5, 0.5} produce edge frequencies equal to the predecessor's frequency.
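
As a sketch (illustrative, not the actual patch), reusing the succs table from the example above and treating an edge's bias as its probability relative to an even split, i.e. prob * num_successors, so that a {0.5, 0.5} branch propagates its predecessor's weight unchanged:

  weight = {'a': 1.0}
  for b in succs:
      for s in succs[b]:
          prob = 1.0 / len(succs[b])      # still 50/50 everywhere
          bias = prob * len(succs[b])     # == 1.0 for an even split
          # how weights should combine at CFG merge points is glossed over
          # here; this example CFG has no merges, so accumulation suffices
          weight[s] = weight.get(s, 0.0) + weight[b] * bias

  # every block comes out with weight 1.0, equal to the entry block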

The result of such a system would produce weights for every block in the above CFG as ‘1.0’, or equivalent to the entry block weight. This to me is a really useful metric – it indicates that no block in the CFG is really more or less likely than any other. Only biases in a specific direction would cause the block weights to go up or down significantly. This would immediately make the analysis useful to consumers such as the vectorizer, unroller, or when we have the capability, the inliner and an outliner which respect cold regions of functions.

One alternative proposed by Nick was using the average of block frequencies across a domtree as the denominator to which a particular block’s frequency is compared. I haven’t really thought as much about this because it seems quite expensive to compute and is less intuitive to me, but I wanted to mention it. Nick might be able to provide an explanation of it that would help folks.

Related to the idea of using the domtree, I can think of many layers on top of the existing block frequencies that could be used to accurately do “cold region” detection, but they would all be significantly more analysis on top of the existing system and so are somewhat less appealing to me.

Perhaps others have still more ideas? Also comments or suggestions on mine are very welcome.

I have a really simple patch that implements my suggestion using the existing frequency infrastructure (as it is almost exactly the same). I would also want to rename everything to avoid confusion as these would no longer be “frequencies” in any strict sense. I can provide test data as needed on the result of this change if people like the direction.

-Chandler

> BlockWeights instead of BlockFrequencies. My idea is that we don't really care about the depth of the control dependence for a particular basic block. We care about the accumulated *bias* toward or away from a basic block. This is predicated on the idea that branches are overwhelmingly predictable. As a consequence, evenly distributed probabilities are really just *uncertainties*.

This sounds like an interesting way to merge unknown branch probabilities, static heuristics, statistical profiles, and partial profiles.

> My proposed way of implementing this would take the exact algorithm already used in BlockFrequency, but instead of computing an edge's frequency as the probability of the branch times the frequency of the predecessor, it would compute it as the frequency of the predecessor times how "biased" the branch is relative to the other branches in the predecessor. Essentially, this would make a branch probability set of {0.5, 0.5} produce edge frequencies equal to the predecessor's frequency.

> The result of such a system would produce weights for every block in the above CFG as '1.0', or equivalent to the entry block weight. This to me is a really useful metric -- it indicates that no block in the CFG is really more or less likely than any other. Only *biases* in a specific direction would cause the block weights to go up or down significantly.

I don't like this statement, or don't understand it. It is useful to know a branch is unbiased. Currently we assume branches are unbiased then optimize conservatively in those cases (do no harm). But if we had greater confidence in unbiased branches (because the branch was actually profiled), we could if-convert much more aggressively.

> This would immediately make the analysis useful to consumers such as the vectorizer, unroller, or when we have the capability, the inliner and an outliner which respect cold regions of functions.

I'm a little concerned that you're adding complexity that may not be necessary. Algorithms like if-conversion make use of the fact that the block weights are additive. I'm sure they could be adapted to something more sophisticated, but these heuristics are notoriously hard to tune. There is value in a simple model. You say that your approach is simple, but I would have to think hard about handling block frequencies that are inconsistent with their immediate branch probabilities.

So the question is: why do you really need more sophistication?

Part of your problem is that you're looking at frequency relative to the entry block when you should be looking at frequency relative to the region being optimized. When deciding to unroll, compare the loop header with the preheader.

I think you're trying to avoid unrolling even high trip count loops that are seldom reached, and you're using a non-negligible frequency for "cold" code (e.g. > 2^-10). In that case it sounds like we need a static heuristic to handle the pattern of chains of branches with no CFG merge. The fall-thru path should be > 0.5.

So, I think this is a cool idea. If I better understood how it worked I might be less concerned about the complexity. I'm not very convinced that it is necessary yet.

-Andy

>> The result of such a system would produce weights for every block in the above CFG as '1.0', or equivalent to the entry block weight. This to me is a really useful metric -- it indicates that no block in the CFG is really more or less likely than any other. Only *biases* in a specific direction would cause the block weights to go up or down significantly.

> I don't like this statement, or don't understand it. It is useful to know a branch is unbiased. Currently we assume branches are unbiased then optimize conservatively in those cases (do no harm). But if we had greater confidence in unbiased branches (because the branch was actually profiled), we could if-convert much more aggressively.

I think the key is that I'm not talking about changing how we model
branches at all. I agree that it would be useful to model both our
confidence in the relative probabilities and the probabilities themselves
for things like if-conversion. We can still do that.

But the current users of 'block frequency' are not trying to understand a
specific branch as biased or un-biased, or deal with our confidence.
They're just trying to understand how "important" a specific basic block is
to the execution of the function.

>> This would immediately make the analysis useful to consumers such as the vectorizer, unroller, or when we have the capability, the inliner and an outliner which respect cold regions of functions.

> I'm a little concerned that you're adding complexity that may not be necessary. Algorithms like if-conversion make use of the fact that the block weights are additive.

Are you talking about branch probabilities here? I don't believe we use
block frequencies in anything other than:
1) MachineBlockPlacement
2) Spill cost, placement, and generally the register allocator
3) StackSlotColoring (but really only for spill weight computation)
4) LoopVectorizer (disabled currently)

I just want to make sure we're on the same page. I don't think it really
invalidates your concern.

> I'm sure they could be adapted to something more sophisticated, but these heuristics are notoriously hard to tune. There is value in a simple model. You say that your approach is simple, but I would have to think hard about handling block frequencies that are inconsistent with their immediate branch probabilities.

If they were inconsistent, I would agree. But they are in fact consistent
and additive. They even preserve information at CFG merge points. The
only difference is that they scale (both up and down) relative to the bias
of a branch (the probability's delta from the mean) rather than relative to
the flat probability.

So I don't think this makes the computation, propagation or mathematical
model more complex. What it does is shift it from a flow frequency modeling
to biased branch modeling. That is certainly a change and worth justifying,
I just don't want you to imagine it as having more impact than it does.

> So the question is: why do you really need more sophistication?

> Part of your problem is that you're looking at frequency relative to the entry block when you should be looking at frequency relative to the region being optimized. When deciding to unroll, compare the loop header with the preheader.

Actually, I did that already. ;] I really thought that would work back when
we were first discussing how to use block frequencies, and I have not
forgotten it. However, that doesn't actually capture the frequency
relative to the region. I think an example is best. Let's imagine that in
my CFG above "E1" is a loop preheader. There is some loop header below it
in the CFG that I hadn't drawn yet.

The first thing to realize is that the heuristic we want is not how hot (or
cold) the loop header is relative to the preheader. That tells us, once we
execute this loop, how many times we are likely to take the backedge (or
for a nested loop, one of the backedges). But this number may correctly
be arbitrarily large if we have a loop in a cold region which, when executed,
loops for a very long time. What we actually want to check is how
frequently we enter the loop from outside the loop at all.

So the test is how "hot" the loop preheader is relative to its "region".
Now, if the loop preheader is nested inside another loop, we might
usefully compare the inner loop's preheader to the outer loop's header and
get a useful metric of "hot" or "cold". But what happens when we are trying
to test for this in the outermost loop? We need some way to define a
"region" of the function body itself then. There are a bunch of viable
definitions. The one which maps to the loop header for a loop nest is the
function entry block. However, that is the precise circumstance that
doesn't work -- in a branchy program, most blocks are "cold" relative to the
function entry if we simply trust the block frequency.

Now, there are other ways to define a "region". I mentioned one as an
alternative to my proposal -- we could look at block frequencies only
relative to the root of their domtree. Another would be to use SESE
regions, or any number of others. However, they all have the problem that
we must do substantial work and add substantial complexity and
sophistication to compute the region, the frequency for that region, and
then compare ours against it.

As a practical matter, they also have the problem of accumulating
significant error rapidly due to continual division of the entry frequency.
My proposal actually mitigates such problems significantly.
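
To illustrate with made-up numbers (assuming a 64-bit fixed-point representation and a hypothetical 2^32 entry frequency):

  entry_freq = 1 << 32        # hypothetical fixed-point "basis" frequency
  freq = entry_freq
  for _ in range(33):         # one unbiased 50/50 branch per level
      freq //= 2
  # freq is now 0: after 33 levels of branching the block frequency has
  # underflowed to "never executed" without a single biased branch in sight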

> I think you're trying to avoid unrolling even high trip count loops that are seldom reached, and you're using a non-negligible frequency for "cold" code (e.g. > 2^-10). In that case it sounds like we need a static heuristic to handle the pattern of chains of branches with no CFG merge. The fall-thru path should be > 0.5.

I'm really not looking at a single problem or solution. There are a wide
variety of consumers of profile information that want to specially handle
both cold and hot regions of a function. Looking at the frequencies as they
exist today, these transformations cannot conclude either cold or hot for a
region of code from them. We need something better.

If we had any static heuristic that was good at predicting *which* long
chain should be >0.5 probability, we would be using it already. The problem
is that there is no single idea of "fallthrough". We canonicalize the
branches, and even programmers cannot make up their mind. Is the early exit
the fast-path or is the early exit the error handling? Both patterns exist
widely. If we have raw profile information, then yes, this is *much* less
of a problem because we will actually know what direction all of the
predictable branches in the program actually take. But our analyses
currently are deceptive in the absence of strong profile information by
giving the impression that the resulting branches are *unpredictable*
instead of being predictable and unknown.

> So, I think this is a cool idea. If I better understood how it worked I might be less concerned about the complexity. I'm not very convinced that it is necessary yet.

Ok, for the necessity, simply take any function from a simple benchmark,
and look at its static block frequencies. Arnold posted a great example
from libquantum. Here we have the most boring code (1 loop in a tiny basic
block) in a boring benchmark that is completely predictable. Yet still,
we think that the loop is reached less than 20% of the times that the
function is called. That just doesn't make sense.

If you want to take a more complex approach, we could definitely add a
confidence dimension to the branch probability. This would allow us to
express a high-confidence unpredictable branch and optimize based on that.
However, we would *still* want block weights to be computed more like I
have proposed. Instead of only branch probabilities that were far outside
of 50/50 influencing it, we would want branch probabilities that were
highly confident to influence it. The practical result would be exactly the
same as what I have proposed; it is just that I'm making the simplifying
assumption that, at the moment, there are no evenly distributed branch
probabilities which we have high confidence in.
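
Purely as illustration, one way such a confidence dimension might fold into the bias computation -- the linear interpolation below is a strawman of mine, not part of the proposal:

  def effective_bias(prob, num_succs, confidence):
      # decay low-confidence probabilities toward an even split before
      # computing the bias, so that only branches which are both biased
      # *and* trusted move the block weights
      even = 1.0 / num_succs
      adjusted = even + confidence * (prob - even)   # confidence in [0, 1]
      return adjusted / even

  effective_bias(0.9, 2, 1.0)   # 1.8 -> strongly biased, fully trusted
  effective_bias(0.9, 2, 0.0)   # 1.0 -> same bias, but no confidence in it
  effective_bias(0.5, 2, 1.0)   # 1.0 -> a trusted even split stays neutral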

My suspicion is that we will find, even with direct profile information,
this will still be true. Mostly, I suspect there are relatively few cases
where it is important that we model an even distribution of branches with
high confidence. I'm happy to be wrong about this, but *that* is where I
want to wait for more data before making a decision that adds complexity to
the system.

As an aside, I'd like to point out that knowing a branch is unbiased is not sufficient to know that it is poorly predicted and therefore profitable to if-convert. Consider a case where a branch is evaluated 100 times, going to the true destination 50 times and the false destination 50 times. If the direction it takes is strongly correlated with prior branches in the execution trace, the branch predictor's history table will pick up on that and successfully predict this branch, despite it being unbiased when considered in isolation.

—Owen

That’s a good point. We could profile misprediction, but we have no way to express that either in the current BranchProbability/BlockFrequency API or in Chandler’s variant.
-Andy

But I don't think we *need* to profile misprediction at all. It simply
isn't a problem that PGO needs to solve. The problems we're looking at for
PGO to help with are:

- i-cache locality / general code layout
- spill placement / live range splitting
- loop vectorization and unrolling[1]
- inlining[1]

For all of these, we're really looking at hot or cold region
identification, and clustering. I think that in these cases it is
perfectly reasonable to model "hot" and "cold" (in the branch probability
sense) as both heavily biased and high confidence in that bias, and
"not-hot" or "not-cold" as either not heavily biased *or* low confidence in
any indicated bias.

This gives us a simple linear model.

[1]: Note that even these are merely *speculative* use cases based on
experience with other systems. I'm not really trying to make sweeping
claims about *how* effective PGO will be here, only that it is an area that
various folks are looking at improving with PGO based on experience in
other systems.
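
Spelled out as a toy classification (the 0.8 cutoff is an arbitrary placeholder, not a number anyone has proposed):

  def classify_branch(prob, confidence, cutoff=0.8):
      if confidence >= cutoff and prob >= cutoff:
          return "hot"        # heavily biased toward, and we trust the bias
      if confidence >= cutoff and prob <= 1.0 - cutoff:
          return "cold"       # heavily biased away, and we trust the bias
      return "neither"        # weak bias *or* weak confidence in it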

Adding a confidence dimension seems like overkill to me. It seems like we should be able to distinguish branches with known 50/50 probability, e.g., from real profile data, from those where we have no idea, simply by omitting the branch probability information in the latter case. I suppose that is essentially adding a confidence dimension, but only in a very simple way.

That’s a separate question from what to do when you don’t know anything about the probability of a branch. Andy has thought a lot more about this than I have, so for now, I will defer to his opinion. I will just point out that Duncan Exon Smith is currently looking at rewriting the BlockFrequency pass in a way that Jakob had suggested to help fix some serious problems with the loss of precision. Your proposed change to BlockFrequency would almost certainly have an impact on his work, so hopefully we can figure out soon what we want to do.

Incidentally, while looking at this, I noticed that our current static heuristics are assigning naive 50/50 probabilities due to a bug. I filed PR18705 to describe it, but basically we’re often doing the wrong thing for anything inside a loop.

> Adding a confidence dimension seems like overkill to me. It seems like we should be able to distinguish branches with known 50/50 probability, e.g., from real profile data, from those where we have no idea, simply by omitting the branch probability information in the latter case. I suppose that is essentially adding a confidence dimension, but only in a very simple way.

BranchProbabilityInfo exposes *some* probability for every branch as a
significant simplification to the consumers. I don't think we want to
trivially give this property up.

> That's a separate question from what to do when you don't know anything about the probability of a branch. Andy has thought a lot more about this than I have, so for now, I will defer to his opinion. I will just point out that Duncan Exon Smith is currently looking at rewriting the BlockFrequency pass in a way that Jakob had suggested to help fix some serious problems with the loss of precision. Your proposed change to BlockFrequency would almost certainly have an impact on his work, so hopefully we can figure out soon what we want to do.

Yes, indeed. My change would make the precision issues significantly less
severe, but they're still fundamentally present.

> Incidentally, while looking at this, I noticed that our current static heuristics are assigning naive 50/50 probabilities due to a bug. I filed PR18705 to describe it, but basically we're often doing the wrong thing for anything inside a loop.

Yea, just a bug. Most of the profile input that I care about comes before
the loop heuristic though, so this hasn't been hurting us heavily. I do
worry that it may have been hiding significant static heuristic problems
further down the stack.

-Chandler

Yes, I understand that. My concern was that passes may assume frequency/weight is consistent with probability (path weights add up at merge/join points). At least, I'm used to making that assumption. Take a simple CFG:

A
|\
B C
| |\
D E F
| |/
|/
G

If each basic block has cost=1, path A,B,D,G has weighted cost=4 and path A,C,E|F,G has weighted cost=5. But they should be equal.
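
Spelling out the arithmetic (the frequencies below assume every branch is 50/50, and under the proposed scheme every block gets weight 1.0; E|F is flattened into one path set):

  freq   = {'A': 1.0, 'B': 0.5, 'C': 0.5, 'D': 0.5,
            'E': 0.25, 'F': 0.25, 'G': 1.0}
  weight = {b: 1.0 for b in freq}      # no biased branches anywhere

  path_bd  = ['A', 'B', 'D', 'G']
  path_cef = ['A', 'C', 'E', 'F', 'G']

  sum(weight[b] for b in path_bd)      # 4.0 -> the weighted cost=4 above
  sum(weight[b] for b in path_cef)     # 5.0 -> the weighted cost=5 above
  sum(freq[b] for b in path_bd)        # 3.0
  sum(freq[b] for b in path_cef)       # 3.0 -> frequencies stay consistent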

You’re right. I thought early if-conversion was multiplying the cost of each path by its weight. We aren’t doing that. Instead we’re taking a do-no-harm approach of estimating critical path increase without considering probability. It’s actually more consistent with your approach.

That makes a fair amount of sense to me, but I see your point that it’s not particularly useful.

I think there are three sets of facts we’re trying to ascertain from block frequencies:

(1) Are blocks expected to be executed more or less frequently than their neighbors? This is effectively a way to summarize the CFG with a single block frequency number. Consider a global code motion pass. If it is guided by block frequencies, it doesn't need to know about loops or CFG regions. It always places instructions outside loops and sinks them below branches and above merges. Spill placement is similar.

(2) Does a loop iterate enough times to justify loop versioning or unrolling?

(3) Which blocks are "cold"? In other words, identify blocks that are almost never reached and not worth optimizing if it means increasing code size and possibly compile time.

The proposed change makes (3) easier in the case of deep acyclic control flow. It allows us to more aggressively prune the optimization space.

(2) is not currently a problem. But with the new approach we could not simply estimate trip count as freq(header) / freq(preheader).
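
That is, today the estimate is a simple ratio (a sketch; with bias-scaled weights, weight(header) / weight(preheader) would no longer mean "iterations per loop entry"):

  def estimate_trip_count(freq_header, freq_preheader):
      # with current frequencies: expected iterations per entry to the loop
      if freq_preheader == 0.0:
          return None                  # the loop is never entered
      return freq_header / freq_preheader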

Regarding (1). You’ve probably already considered any impact on block layout, so I’ll ignore that. For code motion/placement, I’m also not sure about your method of “scaling upward” within acyclic control flow. I’m used to thinking that if a block has higher weight it must be more frequently executed. So if I place code on the lightest block, I am either (a) hoisting it outside a loop or (b) sinking it below branches or above merges.

With a new approach where we scale for a branch bias, if you have a block with frequency 1.0 terminated by a branch with 0.25/0.75 probability, regardless of whether it’s profiled or subject to static heuristic, I could imagine giving its successors frequency 0.33 and 1.0. I think what you’re proposing is to give the successors frequency 0.5 and 1.5. That’s weird to me because I don’t know how we would identify loops from the frequency info.

Also, I honestly don’t see how the block frequency analysis will work. It’s not clear to me how you can compute cyclic probability for the loops if you’ve been scaling frequencies along the way.

-Andy

> ---- Block Freqs ----
> a = 1.0
>   a -> b1 = 0.5
>   a -> b2 = 0.5
> b1 = 0.5
>   b1 -> c1 = 0.25
>   b1 -> c2 = 0.25

But this looks as if none of the edge probabilities were taking static
indicators into account? An edge going into a return, abort(), etc.

> BlockWeights instead of BlockFrequencies. My idea is that we don't really care about the depth of the control dependence for a particular basic block.

I agree. I naively thought that's what was already done today. If you
follow this, then the relative hotness of blocks/regions can be
computed based on the aggregate weight along the paths into them.

Now, when we have sample information, we have raw block counts.
Wouldn't it make sense to just use them in that case?

Diego.

> That makes a fair amount of sense to me, but I see your point that it's not particularly useful.

> I think there are three sets of facts we're trying to ascertain from block frequencies:

> (1) Are blocks expected to be executed more or less frequently than their neighbors? This is effectively a way to summarize the CFG with a single block frequency number. Consider a global code motion pass. If it is guided by block frequencies, it doesn't need to know about loops or CFG regions. It always places instructions outside loops and sinks them below branches and above merges. Spill placement is similar.

> (2) Does a loop iterate enough times to justify loop versioning or unrolling?

> (3) Which blocks are "cold"? In other words, identify blocks that are almost never reached and not worth optimizing if it means increasing code size and possibly compile time.

> The proposed change makes (3) easier in the case of deep acyclic control flow. It allows us to more aggressively prune the optimization space.

The current code in LoopVectorize.cpp (which Chandler disabled in r200579) falls
into this category (3): is the loop too cold to bother vectorizing?

I agree with Chandler that block frequency is the wrong metric to answer *this*
sort of question. I haven’t thought enough about block weights to be sure that
it’s the right metric, but it seems much closer.

[I’m using block “frequency” to refer to the current metric, and block “weight”
to refer to Chandler’s proposal.]

> (2) is not currently a problem. But with the new approach we could not simply estimate trip count as freq(header) / freq(preheader).

I agree. Block frequency seems to be the right metric for (2).

> Regarding (1). You've probably already considered any impact on block layout, so I'll ignore that. For code motion/placement, I'm also not sure about your method of "scaling upward" within acyclic control flow. I'm used to thinking that if a block has higher weight it must be more frequently executed. So if I place code on the lightest block, I am either (a) hoisting it outside a loop or (b) sinking it below branches or above merges.

I also think block frequency is more useful for (1).

> With a new approach where we scale for a branch bias, if you have a block with frequency 1.0 terminated by a branch with 0.25/0.75 probability, regardless of whether it's profiled or subject to static heuristic, I could imagine giving its successors frequency 0.33 and 1.0. I think what you're proposing is to give the successors frequency 0.5 and 1.5. That's weird to me because I don't know how we would identify loops from the frequency info.

> Also, I honestly don't see how the block frequency analysis will work. It's not clear to me how you can compute cyclic probability for the loops if you've been scaling frequencies along the way.

Why not compute two metrics, and use them appropriately?

Generating both in parallel also solves the problem of generating cyclic
probabilities. The loop scale can be calculated from the block frequency
analysis, and then used to scale up block weights. However, I’m not entirely
sure that block weights *should* be scaled in loops. If loop scale is already
available through block frequency, block weight may be more useful unscaled.

Block frequency itself is a combination of two separate metrics: probability
mass and loop scale. Perhaps even these should be available separately in the
BlockFrequencyInfo API.
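
Roughly, such an API might look like this (hypothetical; this is not the existing BlockFrequencyInfo interface):

  class BlockFrequencyInfoSketch:
      def probability_mass(self, block):
          # probability of reaching `block` per function entry, with
          # loops collapsed to a single pass
          ...
      def loop_scale(self, block):
          # expected iteration multiplier from the enclosing loop nest
          ...
      def frequency(self, block):
          # today's metric is just the product of the two
          return self.probability_mass(block) * self.loop_scale(block)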

Raw block counts will give the same type of information as the current metric of
block frequency (although more accurate for the profiled use case). If you have
profile data, running BranchProbabilityInfo followed by BlockFrequencyInfo
should recreate block frequencies roughly equivalent to the raw data scaled to a
“standard” entry frequency for each function, modulo adjustments for sanity and
precision.

Where this metric is unhelpful, raw block counts will also be unhelpful.

Will respond in more detail to the other points, but just to isolate this
one briefly:

>> Regarding (1). You've probably already considered any impact on block layout, so I'll ignore that. For code motion/placement, I'm also not sure about your method of "scaling upward" within acyclic control flow. I'm used to thinking that if a block has higher weight it must be more frequently executed. So if I place code on the lightest block, I am either (a) hoisting it outside a loop or (b) sinking it below branches or above merges.

> I also think block frequency is more useful for (1).

I actually think they are both equivalent. We only really use them to scale
an edge probability prior to comparing two edge probabilities from
different source basic blocks, and to form a total ordering over the basic
blocks when we've completely given up on laying out code according to any
part of the CFG.

The latter case doesn't matter much -- source order is probably just as
good of a fallback. At this point, we've lost all real hope of doing "good"
code layout.

The former case is interesting, but if anything I think it would do better
with the new approach: we won't consider one edge "cold" and thus OK to
break out of the CFG's natural ordering (or out of the loop nest) merely
because it happens to have 5 50/50 branches predicating it while the other
block has only 1. That case doesn't seem likely to come up often, but if it
did come up, I would prefer the logic of using weights at the relative
scale for the edges.

Right now, all profile information is funneled through two analysis passes prior to any part of the optimizer using it.

First, we have BranchProbabilityInfo, which provides a simple interface to the simplest form of profile information: local and relative branch probabilities. These merely express the likelihood of taking one of a mutually exclusive set of exit paths from a basic block. They are very simple, and the foundation of the profile information. Even the other analysis is merely built on top of this one.

Second we have BlockFrequencyInfo which attempts to provide a more "global" (function-wide, not actually program wide) view of the statistical frequency with which any particular basic block is executed. This is nicely principled analysis that just computes the probabilistic flow of control through the various branches according to their probabilities established in the first analysis.

However, I think that BlockFrequencyInfo provides the wrong set of information. There is one critical reason why. Let's take a totally uninteresting CFG:

A -> B1, B2
B1 -> C1, C2
B2 -> C3, C4
C1 -> D1, D2
C2 -> ret
C3 -> D3, D4
C4 -> ret
D1 -> E1, E2
D2 -> ret
D3 -> E3, E4
D4 -> ret

You can imagine this repeating on for as many levels as you like. This isn't an uncommon situation with real code. BlockFrequencyInfo computes for this a very logical answer:

---- Block Freqs ----
a = 1.0
a -> b1 = 0.5
a -> b2 = 0.5
b1 = 0.5
b1 -> c1 = 0.25
b1 -> c2 = 0.25
b2 = 0.5
b2 -> c3 = 0.25
b2 -> c4 = 0.25
c1 = 0.25
c1 -> d1 = 0.125
c1 -> d2 = 0.125
c2 = 0.25
c3 = 0.25
c3 -> d3 = 0.125
c3 -> d4 = 0.125
c4 = 0.25
d1 = 0.125
d1 -> e1 = 0.0625
d1 -> e2 = 0.0625
d2 = 0.125
d3 = 0.125
d3 -> e3 = 0.0625
d3 -> e4 = 0.0625
d4 = 0.125

One way of thinking about this is that for any basic block X which is predicated by N branches of unbiased probability (50/50), the frequency computed for that block is 2^(-N).

The problem is that this doesn't represent anything close to reality. Processors' branch prediction works precisely because very, *very* few branches in programs are 50/50. Most programs do not systematically explore breadth first the full diversity of paths through the program. And yet, in the absense of better information our heuristics would lead us to believe (and act!) as though this were true.

Now, I'm not saying that the computation of block frequencies is wrong. Merely that it cannot possibly be used for at least one of it purposes -- it's relative frequency (to the entry block "basis" frequency) is completely useless for detecting hot or cold regions of a function -- it will simply claim that all regions of the function are cold. What it *is* useful for is establishing a total ordering over the basic blocks of a function. So it works well for some things like code layout, but is grossly misleading for others.

There are several possible solutions here. I'll outline my proposal as well as some other ideas.

BlockWeights instead of BlockFrequencies. My idea is that we don't really care about the depth of the control dependence for a particular basic block. We care about the accumulated *bias* toward or away from a basic block. This is predicated on the idea that branches are overwhelmingly predictable. As a consequence, evenly distributed probabilities are really just *uncertainties*.

This sounds like an interesting way to merge unkown branch probabilities, static heuristics, statistical profiles, and partial profiles.

My proposed way of implementing this would take the exact algorithm already used in BlockFrequency, but instead of computing a frequency for an edge based on the probability of the branch times the frequency of the predecessor, instead compute it as the frequency of the predecessor times how "biased" the branch is relative to other branches in the predecessor. Essentially, this would make a branch probability set of {0.5, 0.5} produce edge frequencies equal to the predecessor's edge frequency.

The result of such a system would produce weights for every block in the above CFG as '1.0', or equivalent to the entry block weight. This to me is a really useful metric -- it indicates that no block in the CFG is really more or less likely than any other. Only *biases* in a specific direction would cause the block weights to go up or down significantly.

I don't like this statement, or don't understand it. It is useful to know a branch is unbiased. Currently we assume branches are unbiased then optimize conservatively in those cases (do no harm). But if we had greater confidence in unbiased branches (because the branch was actually profiled), we could if-convert much more aggressively.

I think the key is that I'm not talking about changing how we model branches at all. I agree that it would be useful to model both our confidence in the relative probabilities and the probabilities themselves for things like if-conversion. We can still do that.

But the current users of 'block frequency' are not trying to understand a specific branch as biased or un-biased, or deal with our confidence. They're just trying to understand how "important" a specific basic block is to the execution of the function.

Yes, I understand that. My concern was that passes may assume frequency/weight is consistent with probability (path weights add up to their merge/join points). At least, I'm used to making that assumption. Take a simple CFG:

A
>\
B C
> >\
D E F
> >/
>/
G

If each basic block has cost=1. Path A,B,D,G has weighted cost=4. Path A,C,E|F,G has weighted cost=5. But they should be equal.

This would immediately make the analysis useful to consumers such as the vectorizer, unroller, or when we have the capability, the inliner and an outliner which respect cold regions of functions.

I'm a little concerned that you're adding complexity that may not be necessary. Algorithms like if-conversion make use of the fact that the block weights are additive.

Are you talking about branch probabilities here? I don't believe we use block frequencies in anything other than:
1) MachineBlockPlacement
2) Spill cost, placement, and generally the register allocator
3) StackSlotColoring (but really only for spill weight computation)
4) LoopVectorizer (disabled currently)

I just want to make sure we're on the same page. I don't think it really invalidates your concern.

You're right. I thought early if-conversion was multiplying the cost of each path by its weight. We aren't doing that. Instead we're taking a do-no-harm approach of estimating critical path increase without considering probability. It's actually more consistent with your approach.

I'm sure they could be adapted to something more sophisticated, but these heuristics are notoriously hard to tune. There is value in a simple model. You say that your approach is simple, but I would have to think hard about handling block frequencies that are inconsistent with their immediate branch probabilities.

If they were inconsistent, I would agree. But they are in fact consistent and additive. They even preserve information during CFG merge points. The only difference is that they scale (both up and down) relative to the bias of a branch (probability's delta from the mean) rather than relative to the flat probability.

So I don't think this makes the computation, propagation, or mathematical model more complex. What it does is shift it from flow-frequency modeling to biased-branch modeling. That is certainly a change and worth justifying; I just don't want you to imagine it as having more impact than it does.

So the question is: why do you really need more sophistication?

Part of your problem is that you're looking at frequency relative to the entry block when you should be looking at frequency relative to the region being optimized. When deciding whether to unroll, compare the loop header with the preheader.

Actually, I did that already. ;] I really thought that would work back when we were first discussing how to use block frequencies, and I have not forgotten it. However, that doesn't actually capture the frequency relative to the region. I think an example is best. Let's imagine that in my CFG above "E1" is a loop preheader. There is some loop header below it in the CFG that I hadn't reached yet.

The first thing to realize is that the heuristic we want is not how hot (or cold) the loop header is relative to the preheader. That tells us, once we execute this loop, how many times we are likely to take the backedge (or for a nested loop, one of the backedges). But this number may correctly be arbitrarily large if we have a loop in a cold region which, when executed, loops for a very long time. What we actually want to check is how frequently we enter the loop from outside the loop at all.

So the test is how "hot" the loop preheader is relative to its "region". Now, if the loop preheader is nested inside another loop, we might usefully compare the inner loop's preheader to the outer loop's header and get a useful metric of "hot" or "cold". But what happens when we are trying to test for this in the outermost loop? We need some way to define a "region" of the function body itself. There are a bunch of viable definitions. The one which maps to the loop header for a loop nest is the function entry block. However, that is the precise circumstance that doesn't work -- in a branchy program, most blocks are "cold" relative to the function entry if we simply trust the block frequency.

Now, there are other ways to define a "region". I mentioned one as an alternative to my proposal -- we could look at block frequencies only relative to the root of their domtree. Another would be to use SESE regions, or any number of others. However, they all have the problem that we must do substantial work and add substantial complexity and sophistication to compute the region, the frequency for that region, and then compare ours against it.

As a practical matter, they also have the problem of accumulating significant error rapidly due to continual division of the entry frequency. My proposal actually mitigates such problems significantly.
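For illustration, such a region-relative test might look like this sketch (hypothetical helper and threshold; nothing like this exists today):

    // Sketch of a region-relative coldness test. RegionEntryFreq could
    // come from a domtree root, an SESE region entry, or an enclosing
    // loop header, per the alternatives above.
    bool isColdInRegion(double BlockFreq, double RegionEntryFreq) {
      const double ColdFraction = 0.01; // illustrative threshold
      return BlockFreq < RegionEntryFreq * ColdFraction;
    }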

I think you're trying to avoid unrolling even high trip count loops that are seldom reached, and you're using a non-negligible frequency for "cold" code (e.g. > 2^-10). In that case it sounds like we need a static heuristic to handle the pattern of chains of branches with no CFG merge. The fall-thru path should be > 0.5.

I'm really not looking at a single problem or solution. There are a wide variety of consumers of profile information that want to specially handle both cold and hot regions of a function. Looking at the frequencies as they exist today, these transformations cannot conclude that a region of code is either hot or cold. We need something better.

If we had any static heuristic that was good at predicting *which* long chain should be >0.5 probability, we would be using it already. The problem is that there is no single idea of "fallthrough". We canonicalize the branches, and even programmers cannot make up their minds. Is the early exit the fast path, or is it the error handling? Both patterns exist widely. If we have raw profile information, then yes, this is *much* less of a problem because we will actually know what direction all of the predictable branches in the program actually take. But our analyses currently are deceptive in the absence of strong profile information, giving the impression that the resulting branches are *unpredictable* instead of being predictable but unknown.

So, I think this is a cool idea. If I better understood how it worked I might be less concerned about the complexity. I'm not very convinced that it is necessary yet.

Ok, for the necessity, simply take any function from a simple benchmark and look at its static block frequencies. Arnold posted a great example from libquantum. Here we have the most boring code (one loop in a tiny basic block) in a boring benchmark that is completely predictable. Yet still, we think that the loop is reached less than 20% of the times that the function is called. That just doesn't make sense.

That makes a fair amount of sense to me, but I see your point that it’s not particularly useful.

I think there are three sets of facts we’re trying to ascertain from block frequencies:

(1) Are blocks expected to be executed more or less frequently than their neighbors? This is effectively a way to summarize the CFG with a single block frequency number. Consider a global code motion pass. If it is guided by block frequencies, it doesn't need to know about loops or CFG regions. It always places instructions outside loops and sinks them below branches and above merges. Spill placement is similar.

(2) Does a loop iterate enough times to justify loop versioning or unrolling?

(3) Which blocks are "cold"? In other words, identify blocks that are almost never reached and not worth optimizing if it means increasing code size and possibly compile time.

The proposed change makes (3) easier in the case of deep acyclic control flow. It allows us to more aggressively prune the optimization space.

The current code in LoopVectorize.cpp (which Chandler disabled in r200579) falls
into this category (3): is the loop too cold to bother vectorizing?

I agree with Chandler that block frequency is the wrong metric to answer *this*
sort of question. I haven’t thought enough about block weights to be sure that
it’s the right metric, but it seems much closer.

[I’m using block “frequency” to refer to the current metric, and block “weight”
to refer to Chandler’s proposal.]

(2) is not currently a problem. But with the new approach we could no longer simply estimate the trip count as freq(header) / freq(preheader).

I agree. Block frequency seems to be the right metric for (2).

Regarding (1). You've probably already considered any impact on block layout, so I'll ignore that. For code motion/placement, I'm also not sure about your method of "scaling upward" within acyclic control flow. I'm used to thinking that if a block has higher weight it must be more frequently executed. So if I place code on the lightest block, I am either (a) hoisting it outside a loop or (b) sinking it below branches or above merges.

I also think block frequency is more useful for (1).

With a new approach where we scale for branch bias, if you have a block with frequency 1.0 terminated by a branch with 0.25/0.75 probability, regardless of whether it's profiled or subject to a static heuristic, I could imagine giving its successors frequencies 0.33 and 1.0. I think what you’re proposing is to give the successors frequencies 0.5 and 1.5. That’s weird to me because I don’t know how we would identify loops from the frequency info.
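For concreteness (my reading of the two schemes): dividing by the hottest successor gives 0.25/0.75 ≈ 0.33 and 0.75/0.75 = 1.0, while dividing by the unbiased mean of 0.5 gives 0.25/0.5 = 0.5 and 0.75/0.5 = 1.5, which is how a successor can end up heavier than its predecessor.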

Also, I honestly don’t see how the block frequency analysis will work. It’s not clear to me how you can compute cyclic probability for the loops if you’ve been scaling frequencies along the way.

Why not compute two metrics, and use them appropriately?

Computing both in parallel also solves the problem of generating cyclic
probabilities. The loop scale can be calculated from the block frequency
analysis, and then used to scale up block weights. However, I’m not entirely
sure that block weights *should* be scaled in loops. If loop scale is already
available through block frequency, block weight may be more useful unscaled.

Block frequency itself is a combination of two separate metrics: probability
mass and loop scale. Perhaps even these should be available separately in the
BlockFrequencyInfo API.
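Roughly, as a sketch (hypothetical names, not the current API):

    // Sketch of the decomposition: block frequency as the product of
    // acyclic probability mass and the containing loops' scales.
    struct BlockFreqParts {
      double Mass;      // probability mass reaching the block, loops ignored
      double LoopScale; // product of the containing loops' scales
      double freq() const { return Mass * LoopScale; }
    };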

That’s fine with me! Especially if we need to compute frequency anyway to find cyclic probability. Using separate numbers to compute probability mass, scale, and bias sidesteps numeric range limits.

As a final clarification on my end: the block frequency value by itself (as currently computed) is of course not an ideal cold-code indicator. In the past I have used it to measure coldness in two ways, which seemed to work ok, but I admit it’s not a great system:
(1) Identify almost-never-taken code using a very small fraction (say 2^-10). This looks like it won’t work well for us, though, with static heuristics and __builtin_expect.
(2) Compare the frequency relative to the enclosing loop header, e.g. to move “cold” stuff out of the loop.
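A minimal sketch of those two checks, with illustrative names and thresholds (not the real BlockFrequencyInfo API):

    #include <cstdint>
    // (1) Almost never taken: a tiny fraction of the entry frequency.
    bool isAlmostNeverTaken(uint64_t BlockFreq, uint64_t EntryFreq) {
      return BlockFreq < (EntryFreq >> 10); // below 2^-10 of entry
    }
    // (2) Cold relative to the enclosing loop header.
    bool isColdInLoop(uint64_t BlockFreq, uint64_t HeaderFreq) {
      return BlockFreq * 8 < HeaderFreq; // the ratio is a tuning knob
    }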

-Andy

This is definitely the important case here since our default "we don't
know" is going to be a 50/50 probability where everything is equally
likely, not less likely because of conditionals.

-eric

To resurrect this thread, I just posted a patch series on llvm-commits [1]
that replaces the block frequency calculation with a new algorithm.

[1]: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140303/207861.html

I’ve attached an extra patch to this email that adds (my interpretation of)
Chandler’s proposed new metric, but doesn’t use it anywhere. The
calculation is as follows:

1. Calculate block frequency normally.

2. In parallel, calculate a "reference" block frequency for every
    block, which represents an unbiased block frequency.

3. Take the ratio of the two, and call it the block bias.

The reference block frequency calculations assume that all branch
weights are even. E.g., if a block has three successor edges, then each
edge has a probability of 1/3. If a block has five successor edges and
two of them go to the same block, the effective probability for that
successor is 2/5.
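In other words (a sketch with hypothetical names, not the patch's actual code):

    // Bias(B) = Freq(B) / RefFreq(B), per steps 1-3 above. RefFreq runs
    // the same propagation but with every branch's weights evened out.
    struct BlockBias {
      double Freq;    // step 1: normal block frequency
      double RefFreq; // step 2: reference frequency, even branch weights
      double bias() const { return Freq / RefFreq; } // step 3
    };

A bias above 1.0 means the branches leading to the block favor it; below 1.0 means they bias away from it.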

It's not clear how to scale the reference frequency in loops. There are
three possibilities that might be reasonable:

1. Use the same loop scale as calculated by the block frequency.

2. Use a loop scale that assumes all branch weights are even. (I.e.,
    calculate the loop scale the same way as for the normal block
    frequency in parallel.)

3. Don't scale loops.

The current version of this patch chooses option #3, and has option #2
commented out with FIXMEs. Option #1 would be trivial to implement by
modifying code around the same FIXMEs.

I've updated the debugging output (from -debug-only=block-freq) to give
some visibility into the calculations.

0007-RFC-blockfreq-Add-bias-metric.patch (30.5 KB)

I’ve already reviewed the design of your new algorithm, which I like. It was easier for me to review the design of the BlockFrequency code as a whole, which has become quite baroque. Most of my issues understanding it are not new to your patch, so feel free to defer cleanup until later.

FrequencyData and MassData need comments explaining their contents. Names like “Floating” and “Integer” don’t tell me anything beyond the type.

“WorkingData” implies that this is the only transient data in the class. This should probably be “LoopBlockData” with a much better comment.

In general, it is confusing when transient data is mixed with the analysis storage. Ideally, working data should be in a separate object. It might help to elevate the FrequencyData vector into BlockFrequencyInfo and std::move this vector from the transient BlockFrequencyImpl into the Analysis storage when it is fully computed. runOnFunction could even delete BFI so we can immediately reclaim the memory. We don’t need the checks for NULL BFI. We can assert if someone calls into an analysis before it has been run by the pass manager.
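As a sketch of that separation (stand-in types and member names, not the actual pass):

    #include <utility>
    #include <vector>
    struct Function {};
    struct FrequencyData { /* per-block results */ };
    struct BlockFrequencyImpl {          // transient: exists only during compute
      std::vector<FrequencyData> Freqs;
      void compute(Function &F) { /* ... run the analysis ... */ }
    };
    struct BlockFrequencyInfo {          // analysis: stores only results
      std::vector<FrequencyData> Freqs;
      bool runOnFunction(Function &F) {
        BlockFrequencyImpl Impl;
        Impl.compute(F);
        Freqs = std::move(Impl.Freqs);   // move the finished vector out
        return false;                    // Impl and its working data die here
      }
    };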

Almost nothing really needs to be in the BlockFrequencyInfoImpl.h header. I’m not sure if this was supposed to be for readability, but technically all the internal data types can be defined in BlockFrequencyInfoImpl.cpp.

Hi Andy,

Thanks for your review. I have some responses and follow-up questions.

Originally, we did not want to make BlockFrequency depend on LoopInfo, but I no longer see a good reason to avoid it. On the other hand, LoopInfo is not a helpful representation for you. The only drawback to rolling your own is the little LoopStack utility. Its current position in the file is confusing to me because it appears to be part of the main algorithm. If you comment that it is a helper for initializeLoops and place it adjacent in the code, then it's probably simpler than trying to reuse LoopInfo--unless you have an issue with irreducible CFGs.

LoopInfo effectively ignores backedges that don't target a natural loop header, whereas it looks like LoopStack will attempt to create a loop for every backedge. You will end up visiting blocks for different "loops" before finishing either loop. Have you reasoned through this and determined how it will play out? Does the comment "frequencies can be distorted past the end of the loop" still completely characterize the problem?

I did reason through this, and some slightly weird things happen. I
decided that the weird things are *less* weird than artifacts from
ignoring such back edges entirely, but this was somewhat arbitrary.

There are some other problems with my loop detection. While I was
running through some examples to confirm my answer to your question
below, I noticed a significant regression that must have crept in
while I was reorganizing code for review.

Rolling my own solution here was dangerous because it’s really hard to
test (and, thus, to prevent regressions). I think the right solution is
to use LoopInfo; as unhelpful as it is, the complex part (loop detection)
is already well-tested.

If the overhead is too high, that’s motivation to fix LoopInfo.

I.e., I plan to migrate to LoopInfo before turning this on. Does that
sound fine to you?

---

How does loop mass distribution work when an inner loop exits to an outer loop more than one level up? Is the exit recorded in multiple loop packages? I fail to see how the mass will be propagated in the outermost loop, and how the scale will be computed in the middle loop.

It’s an exit from the inner loop, so it’ll be recorded in the inner loop
package. The inner loop package (or pseudo-node) will have an exit out
of the middle loop, too, so it’ll get recorded again in the middle loop
package.

The middle loop will have two exits that contribute to its exit
probability: the inner loop pseudo-node, and the middle loop condition.
The scale is just the inverse of its exit probability.

Consider this example:

    while (outer()) {
        while (middle()) {
            while (inner()) {
                if (breaktwice()) {
                    goto outer; // if.then
                }
            }
        }
    outer: {}
    }

Note: there’s something strange here. The if.then block (which contains
the goto) is *not* in the inner loop; it’s outside of the middle loop.
Optimizations should merge this block with its successor (outer); for
simplicity, I’m going to assume that’s already happened.

Let’s say all three loop conditions are false with 1/5 probability, and
breaktwice() is true with probability 1/4. The inner loop will have exit
probability 2/5 (1/5 + 4/5*1/4), and will get a loop scale of 5/2. When
it’s packaged, it will have two outgoing edges of equal weight.

The middle loop will have two exits: from the condition with probability
1/5, and from the inner loop with probability 1/2. The exit probability
will be 3/5 (1/5 + 4/5*1/2), so the scale will be 5/3. The middle loop
will have one outgoing edge to outer. (If we don’t assume that if.then
has been merged with outer, then the middle loop will actually have two
exits: one to outer, and one to if.then (whose sole successor is outer).)

Finally, the outer loop has just one exit, with probability 1/5. The
pseudo-node for the middle loop will distribute all of its mass to outer.
The outer loop scale is 5/1.

Assuming the outer loop’s sole predecessor has frequency of 1:

  - the outer condition will have frequency of 5 (5*1),
  - the middle condition will have frequency of 20/3 (5*5/3*4/5),
  - the inner condition will have frequency of 40/3 (20/3*5/2*4/5),
  - the if condition will have frequency 32/3 (40/3*4/5), and
  - outer will have frequency 4 (5*4/5).
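For anyone who wants to check the arithmetic, a tiny standalone recomputation (same assumed probabilities as above):

    #include <cstdio>
    int main() {
      double InnerExit  = 1.0/5 + 4.0/5 * (1.0/4); // 2/5 -> scale 5/2
      double MiddleExit = 1.0/5 + 4.0/5 * (1.0/2); // 3/5 -> scale 5/3
      double OuterCond  = 5.0 * 1.0;               // outer scale 5, pred freq 1
      double MiddleCond = OuterCond * 4.0/5 / MiddleExit; // 20/3
      double InnerCond  = MiddleCond * 4.0/5 / InnerExit; // 40/3
      double IfCond     = InnerCond * 4.0/5;              // 32/3
      double Outer      = OuterCond * 4.0/5;              // 4
      printf("%g %g %g %g %g\n",
             OuterCond, MiddleCond, InnerCond, IfCond, Outer);
      return 0;
    }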

---

How expensive is calculateBias? Should it be done lazily?

It’s a floating point divide, so it’s relatively expensive. Lazy might
be better.

---

I know you implemented some form of verification, but I don't see it. Anyone else touching the code needs to at least know how to do this, if it's not already part of the IR/MI verifier.

The only verification I implemented is the lit tests.

What else do you have in mind here?

Yes. I agree with your and Chandler’s points on this. I just didn’t want to block progress on a LoopInfo rewrite. I was afraid that API changes would perturb other passes in nontrivial ways. If you can migrate to LoopInfo and have a plan for improving it incrementally that’s great.

-Andy