Loss of precision with very large branch weights

In PR 22718, we are looking at issues with long running applications producing non-representative frequencies. For example, in these two loops:

int g = 0;
__attribute__((noinline)) void bar() {
g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;

  for (i = 0; i < 500000; i++)
    bar();

  printf ("g = %d\n", g);
}

I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 500001.5, int = 4000011
- for.body: float = 500000.5, int = 4000003
- for.inc: float = 500000.5, int = 4000003
- for.end: float = 1.0, int = 8
- for.cond1: float = 250001.5, int = 2000011
- for.body3: float = 250000.5, int = 2000003
- for.inc4: float = 250000.5, int = 2000003
- for.end6: float = 1.0, int = 8

But if I manually modify the frequencies of both to get close to MAX_INT32, the ratios between the frequencies do not reflect reality. For example, if I change branch_weights in both loops to be 4,294,967,295 and 2,147,483,647

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 1073741824.4, int = 8589934595
- for.body: float = 1073741823.4, int = 8589934587
- for.inc: float = 1073741823.4, int = 8589934587
- for.end: float = 1.0, int = 8
- for.cond1: float = 1073741824.4, int = 8589934595
- for.body3: float = 1073741823.4, int = 8589934587
- for.inc4: float = 1073741823.4, int = 8589934587
- for.end6: float = 1.0, int = 8

Now both loops are considered equally hot.
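
For reference, this is roughly what the metadata edit being described looks like when done programmatically; a hedged sketch using MDBuilder rather than the hand edit to the .ll file (BI, setFirstLoopWeights and the exit weight of 1 are illustrative placeholders, not values taken from the experiment):

#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"

// Attach !prof branch_weights of {4294967295, 1} to a loop's conditional
// branch; BI is assumed to be that branch instruction. The second loop
// would get 2147483647 in place of 4294967295.
void setFirstLoopWeights(llvm::BranchInst *BI) {
  llvm::MDBuilder MDB(BI->getContext());
  BI->setMetadata(llvm::LLVMContext::MD_prof,
                  MDB.createBranchWeights(4294967295u, 1));
}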

Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I’m not sure if I’m not missing something else here.

Thanks. Diego.

Since we have decided to capture the global hotness using function
entry count, it is ok to CAP the max weight as of today. I think the
remaining bug in PR22718 is that when capping happens, the scaling is
not done. That needs to be fixed.

The efficiency of the interface is a different issue.

David

Since we have decided to capture the global hotness using function
entry count, it is ok to CAP the max weight as of today. I think the
remaining bug in PR22718 is that when capping happens, the scaling is
not done. That needs to be fixed.

The problem here is that the two loops show up equally hot. So the function
entry count for bar() won't really help since it seems like both loops call
bar() equally frequently.

The efficiency of the interface is a different issue.

Yeah. I'm looking at the hotness calculation first.

Diego.

Since we have decided to capture the global hotness using function
entry count, it is ok to CAP the max weight as of today. I think the
remaining bug in PR22718 is that when capping happens, the scaling is
not done. That needs to be fixed.

The problem here is that the two loops show up equally hot. So the function
entry count for bar() won't really help since it seems like both loops call
bar() equally frequently.

Isn't that the direct result of the branch weights not being scaled
(after reaching the cap) -- thus leading to wrong branch probability
(computed from weights)? Wrong probability leads to wrong Frequency
propagation.

David

Yup, I'm trying to see if we couldn't just use 64bit values all over to make things easier. The drawback would be that we are just punting the problem to a different scale of values (but it may be enough).

Diego.

FWIW. Intel compiler's profile instrumentation uses 64 bit integer counters.
We wrestled with similar problems for a long time before biting the bullet and switching to 64 bit counters.

For 32 bit architectures this is definitely not ideal, as the code must now use multi-instruction sequences
to perform the counter increments.

With 64 bits, even with 1 billion counter increments/second, the range still wouldn't cause counter overflow
for something like 800 years. So I think 64 bit scale of values is definitely enough.

Kevin Smith
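
For concreteness, a back-of-envelope check of the overflow horizon Kevin describes (a rough sketch only, assuming a steady rate of one billion increments per second):

#include <cstdint>
#include <cstdio>

int main() {
  const double IncrementsPerSecond = 1e9;
  const double SecondsPerYear = 3600.0 * 24 * 365;
  // Time until an unsigned 64-bit counter wraps at this rate.
  double Years = static_cast<double>(UINT64_MAX) /
                 (IncrementsPerSecond * SecondsPerYear);
  std::printf("~%.0f years until a uint64_t counter overflows\n", Years);
  return 0;
}

This prints a figure in the hundreds of years, which supports the point that 64 bits is plenty.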

yes -- for count representation, 64 bit is needed. The branch weight
here is different and does not need to be 64bit to represent branch
probability precisely.

David

Actually, the branch weights are really counts. They get converted to
frequencies. For frequencies, we don't really need 64bits, as they're just
comparative values that can be squished into 32bits. It's the branch
weights being 32 bit quantities that are throwing off the calculations.

Diego.
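
As a generic illustration of the "squished into 32bits" point (a sketch only, not BlockFrequencyInfo's actual implementation; squishTo32Bits is a made-up helper): frequencies are purely comparative, so they can be rescaled so that the hottest block maps to the top of the 32-bit range while the ratios are kept.

#include <algorithm>
#include <cstdint>
#include <vector>

// Rescale comparative frequencies so the largest fits exactly in 32 bits.
// Assumes at least one frequency is nonzero.
std::vector<uint32_t> squishTo32Bits(const std::vector<double> &Freqs) {
  if (Freqs.empty())
    return {};
  double Max = *std::max_element(Freqs.begin(), Freqs.end());
  std::vector<uint32_t> Out;
  for (double F : Freqs)
    Out.push_back(static_cast<uint32_t>(F / Max * UINT32_MAX));
  return Out;
}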

In PR 22718, we are looking at issues with long running applications producing non-representative frequencies. For example, in these two loops:

int g = 0;
__attribute__((noinline)) void bar() {
g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;

  for (i = 0; i < 500000; i++)
    bar();

  printf ("g = %d\n", g);
}

I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 500001.5, int = 4000011
- for.body: float = 500000.5, int = 4000003
- for.inc: float = 500000.5, int = 4000003
- for.end: float = 1.0, int = 8
- for.cond1: float = 250001.5, int = 2000011
- for.body3: float = 250000.5, int = 2000003
- for.inc4: float = 250000.5, int = 2000003
- for.end6: float = 1.0, int = 8

But if I manually modify the frequencies of both to get close to MAX_INT32, the ratios between the frequencies do not reflect reality. For example, if I change branch_weights in both loops to be 4,294,967,295 and 2,147,483,647

What does -analyze -branch-prob give?

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 1073741824.4, int = 8589934595
- for.body: float = 1073741823.4, int = 8589934587
- for.inc: float = 1073741823.4, int = 8589934587
- for.end: float = 1.0, int = 8
- for.cond1: float = 1073741824.4, int = 8589934595
- for.body3: float = 1073741823.4, int = 8589934587
- for.inc4: float = 1073741823.4, int = 8589934587
- for.end6: float = 1.0, int = 8

Now both loops are considered equally hot.

Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure if I'm not missing something else here.

I don't understand where the fidelity is being lost; this seems
strange to me. 4B fits inside UINT32_MAX, so I don't see how we
hit a cap here.

But certainly there *is* a limit. We can't store a ratio of larger
than ~4B:1. Is this really a problem? I understand that you can't
tell which loop is hotter, but why does that matter? Is there some
optimization that will do the wrong thing? Which one?

Regarding changing branch weights to 64-bits, I think we should be
sure it's valuable before we change it. I imagine it's doable, but
it'll complicate logic in a few places that don't currently have to
worry about overflowing `uint64_t`.
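
A small illustration of the extra care Duncan means (sumWeights32 and sumWeights64 are hypothetical helpers, just to show the hazard): with 32-bit weights an accumulator held in uint64_t can never wrap for any realistic number of successors, while with 64-bit weights even a single addition can.

#include <cstdint>
#include <vector>

uint64_t sumWeights32(const std::vector<uint32_t> &Weights) {
  uint64_t Total = 0;
  for (uint32_t W : Weights)
    Total += W;  // safe: needs ~4 billion operands before this could wrap
  return Total;
}

uint64_t sumWeights64(const std::vector<uint64_t> &Weights) {
  uint64_t Total = 0;
  for (uint64_t W : Weights)
    Total += W;  // two weights near UINT64_MAX already wrap here
  return Total;
}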

yes -- for count representation, 64 bit is needed. The branch weight
here is different and does not need to be 64bit to represent branch
probability precisely.

Actually, the branch weights are really counts.

No -- I think that is our original proposal (including changing the
meaning of MD_prof meta data) :). After many rounds of discussions, I
think what we eventually settled to is to
1) use 64 bit value to represent function entry count
2) keep branch weight representation and meaning as it is

Changing weights to 64 bit can slightly increase memory usage. In
fact, what we want longer term is to get rid of 'weights' and just use a
fixed point representation for branch probability. For blocks with 2
targets, such info can be attached at Block (source) level, thus
further saving memory.
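
A sketch of the kind of fixed-point representation being alluded to (FixedPointProb is illustrative, not an existing LLVM type): the taken-probability of a two-way branch stored as a numerator over a fixed power-of-two denominator, so one 32-bit word attached at the source block is enough and no per-edge weights are needed.

#include <cstdint>

struct FixedPointProb {
  // Probability of taking the first successor is Numerator / 2^31.
  static constexpr uint32_t Denominator = 1u << 31;
  uint32_t Numerator;

  // Build from today's 32-bit branch weights; the 64-bit intermediates
  // cannot overflow because both inputs fit in 32 bits.
  static FixedPointProb fromWeights(uint32_t Taken, uint32_t NotTaken) {
    uint64_t Total = uint64_t(Taken) + NotTaken;
    if (Total == 0)
      return {Denominator / 2};  // no information, assume 50/50
    return {uint32_t((uint64_t(Taken) * Denominator) / Total)};
  }
};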

They get converted to
frequencies. For frequencies, we don't really need 64bits, as they're just
comparative values that can be squished into 32bits. It's the branch
weights being 32 bit quantities that are throwing off the calculations.

Do you still see the issue after fixing the bug (limit without scaling) in
BranchProbabilityInfo::calcMetadataWeights ?

David

In PR 22718, we are looking at issues with long running applications producing non-representative frequencies. For example, in these two loops:

int g = 0;
__attribute__((noinline)) void bar() {
g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;

  for (i = 0; i < 500000; i++)
    bar();

  printf ("g = %d\n", g);
}

I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 500001.5, int = 4000011
- for.body: float = 500000.5, int = 4000003
- for.inc: float = 500000.5, int = 4000003
- for.end: float = 1.0, int = 8
- for.cond1: float = 250001.5, int = 2000011
- for.body3: float = 250000.5, int = 2000003
- for.inc4: float = 250000.5, int = 2000003
- for.end6: float = 1.0, int = 8

But if I manually modify the frequencies of both to get close to MAX_INT32, the ratios between the frequencies do not reflect reality. For example, if I change branch_weights in both loops to be 4,294,967,295 and 2,147,483,647

What does -analyze -branch-prob give?

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 1073741824.4, int = 8589934595
- for.body: float = 1073741823.4, int = 8589934587
- for.inc: float = 1073741823.4, int = 8589934587
- for.end: float = 1.0, int = 8
- for.cond1: float = 1073741824.4, int = 8589934595
- for.body3: float = 1073741823.4, int = 8589934587
- for.inc4: float = 1073741823.4, int = 8589934587
- for.end6: float = 1.0, int = 8

Now both loops are considered equally hot.

Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure if I'm not missing something else here.

I don't understand where the fidelity is being lost; this seems
strange to me. 4B fits inside UINT32_MAX, so I don't see how we
hit a cap here.

The max weight is UINT32_MAX / BB->getTerminator()->getNumSuccessors().

But certainly there *is* a limit. We can't store a ratio of larger
than ~4B:1. Is this really a problem?

Right, this is not the issue. The issue is that when one branch's
weight reaches the cap, the other branch's weight is not scaled
accordingly:

uint32_t WeightLimit = getMaxWeightFor(BB);
Weights.reserve(TI->getNumSuccessors());
for (unsigned i = 1, e = WeightsNode->getNumOperands(); i != e; ++i) {
  ConstantInt *Weight =
      mdconst::dyn_extract<ConstantInt>(WeightsNode->getOperand(i));
  if (!Weight)
    return false;
  Weights.push_back(
      std::max<uint32_t>(1, Weight->getLimitedValue(WeightLimit)));
<---- may hit limit
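
A minimal standalone sketch of the missing scaling (illustrative only, not the actual patch; scaleWeights and its parameters are made up): every weight of the same terminator is divided by a common factor chosen so the largest one fits under the cap, which preserves the ratio between the weights and hence the branch probability.

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> scaleWeights(const std::vector<uint64_t> &Raw,
                                   uint32_t WeightLimit) {
  if (Raw.empty())
    return {};
  uint64_t MaxRaw = *std::max_element(Raw.begin(), Raw.end());
  // Smallest divisor that brings the largest weight under the limit.
  uint64_t Scale = MaxRaw > WeightLimit ? MaxRaw / WeightLimit + 1 : 1;
  std::vector<uint32_t> Scaled;
  for (uint64_t W : Raw)
    Scaled.push_back(std::max<uint64_t>(1, W / Scale));
  return Scaled;
}

For example, weights of {4294967295, 1000} with a limit of 2147483647 become roughly {1431655765, 333}: both shrink by the same factor, so the ~4.3M:1 probability survives, whereas clamping only the large operand leaves {2147483647, 1000} and roughly halves the implied ratio.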

I understand that you can't
tell which loop is hotter, but why does that matter? Is there some
optimization that will do the wrong thing? Which one?

Two calls (in the loop) not equally hot may be considered equally hot
-- confusing the cost/benefit analysis. This is just an example.

Our goal is to make sure profile data is faithfully kept. Which pass
or whether it will be used now or in the future is a totally different
matter.

Regarding changing branch weights to 64-bits, I think we should be
sure it's valuable before we change it. I imagine it's doable, but
it'll complicate logic in a few places that don't currently have to
worry about overflowing `uint64_t`.

I think we can fix this without using 64bit weight.

David

What does -analyze -branch-prob give?

Similar indication of hotness. Both loops appear equally hot.

$ bin/opt -analyze -branch-prob -S unbiased-branches.ll
Printing analysis 'Branch Probability Analysis' for function 'bar':
---- Branch Probabilities ----
Printing analysis 'Branch Probability Analysis' for function 'main':
---- Branch Probabilities ----
  edge entry -> for.cond probability is 16 / 16 = 100% [HOT edge]
  edge for.cond -> for.body probability is 2147483647 / 2147483649 = 100% [HOT edge]
  edge for.cond -> for.end probability is 2 / 2147483649 = 9.31323e-08%
  edge for.body -> for.inc probability is 16 / 16 = 100% [HOT edge]
  edge for.inc -> for.cond probability is 124 / 124 = 100% [HOT edge]
  edge for.end -> for.cond1 probability is 16 / 16 = 100% [HOT edge]
  edge for.cond1 -> for.body3 probability is 2147483647 / 2147483649 = 100% [HOT edge]
  edge for.cond1 -> for.end6 probability is 2 / 2147483649 = 9.31323e-08%
  edge for.body3 -> for.inc4 probability is 16 / 16 = 100% [HOT edge]
  edge for.inc4 -> for.cond1 probability is 124 / 124 = 100% [HOT edge]
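
The identical probabilities fall straight out of the cap mentioned earlier (UINT32_MAX divided by the number of successors, i.e. 2147483647 for a two-way branch); a quick check with the weights quoted above (a sketch, not analysis output):

#include <algorithm>
#include <cassert>
#include <cstdint>

int main() {
  const uint64_t Cap = UINT32_MAX / 2;                    // 2147483647
  uint64_t Loop1 = std::min<uint64_t>(4294967295u, Cap);  // first loop's weight
  uint64_t Loop2 = std::min<uint64_t>(2147483647u, Cap);  // second loop's weight
  // Both clamp to the same value, so both back edges end up with the same
  // probability even though the real trip counts differ by 2x.
  assert(Loop1 == Loop2);
  return 0;
}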

But certainly there *is* a limit. We can't store a ratio of larger
than ~4B:1. Is this really a problem? I understand that you can't
tell which loop is hotter, but why does that matter? Is there some
optimization that will do the wrong thing? Which one?

It will matter for cost analysis in the inliner, for instance. I don't
think this matters in the few places that profile is used now. But long
term, we want to make sure that we have a good measure of relative
temperatures within a CFG and across CFGs.

Regarding changing branch weights to 64-bits, I think we should be
sure it's valuable before we change it. I imagine it's doable, but
it'll complicate logic in a few places that don't currently have to
worry about overflowing `uint64_t`.

Sure, OK. I'll see about making sure we scale things properly during
frequency computation.

Diego.

In PR 22718, we are looking at issues with long running applications producing non-representative frequencies. For example, in these two loops:

int g = 0;
__attribute__((noinline)) void bar() {
g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;

  for (i = 0; i < 500000; i++)
    bar();

  printf ("g = %d\n", g);
}

I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 500001.5, int = 4000011
- for.body: float = 500000.5, int = 4000003
- for.inc: float = 500000.5, int = 4000003
- for.end: float = 1.0, int = 8
- for.cond1: float = 250001.5, int = 2000011
- for.body3: float = 250000.5, int = 2000003
- for.inc4: float = 250000.5, int = 2000003
- for.end6: float = 1.0, int = 8

But if I manually modify the frequencies of both to get close to MAX_INT32, the ratios between the frequencies do not reflect reality. For example, if I change branch_weights in both loops to be 4,294,967,295 and 2,147,483,647

What does -analyze -branch-prob give?

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
- entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 1073741824.4, int = 8589934595
- for.body: float = 1073741823.4, int = 8589934587
- for.inc: float = 1073741823.4, int = 8589934587
- for.end: float = 1.0, int = 8
- for.cond1: float = 1073741824.4, int = 8589934595
- for.body3: float = 1073741823.4, int = 8589934587
- for.inc4: float = 1073741823.4, int = 8589934587
- for.end6: float = 1.0, int = 8

Now both loops are considered equally hot.

Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure if I'm not missing something else here.

I don't understand where the fidelity is being lost; this seems
strange to me. 4B fits inside UINT32_MAX, so I don't see how we
hit a cap here.

The max weight is UINT32_MAX / BB->getTerminator()->getNumSuccessors().

But certainly there *is* a limit. We can't store a ratio of larger
than ~4B:1. Is this really a problem?

Right, this is not the issue. The issue is that when one branch's
weight reaches the cap, the other branch's weight is not scaled
accordingly:

uint32_t WeightLimit = getMaxWeightFor(BB);
Weights.reserve(TI->getNumSuccessors());
for (unsigned i = 1, e = WeightsNode->getNumOperands(); i != e; ++i) {
   ConstantInt *Weight =
       mdconst::dyn_extract<ConstantInt>(WeightsNode->getOperand(i));
   if (!Weight)
     return false;
   Weights.push_back(
     std::max<uint32_t>(1, Weight->getLimitedValue(WeightLimit)));
<---- may hit limit

Ah, right, this looks like a bug.

>
>
>>
>> yes -- for count representation, 64 bit is needed. The branch weight
>> here is different and does not need to be 64bit to represent branch
>> probability precisely.
>
>
> Actually, the branch weights are really counts.

No -- I think that is our original proposal (including changing the
meaning of MD_prof meta data) :).

Sure. Though they kind of are. They get massaged and smoothed when
branch_weights are placed from the raw counts, but for sufficiently small
values they are very close to counts.

>They get converted to
> frequencies. For frequencies, we don't really need 64bits, as they're
just
> comparative values that can be squished into 32bits. It's the branch
> weights being 32 bit quantities that are throwing off the calculations.

Do you still see the issue after fixing the bug (limit without scaling) in
BranchProbabilityInfo::calcMetadataWeights ?

That's the fix I was contemplating initially. I was curious whether
moving to 64bit would make this easier.

Diego.

>
>
>>
>> yes -- for count representation, 64 bit is needed. The branch weight
>> here is different and does not need to be 64bit to represent branch
>> probability precisely.
>
>
> Actually, the branch weights are really counts.

No -- I think that is our original proposal (including changing the
meaning of MD_prof meta data) :).

Sure. Though they kind of are. They get massaged and smoothed when
branch_weights are placed from the raw counts, but for sufficiently small
values they are very close to counts.

right.

>They get converted to
> frequencies. For frequencies, we don't really need 64bits, as they're
> just
> comparative values that can be squished into 32bits. It's the branch
> weights being 32 bit quantities that are throwing off the calculations.

Do you still see the issue after fixing the bug (limit without scaling) in
BranchProbabilityInfo::calcMetadataWeights ?

That's the fix I was contemplating initially. I was curious whether
moving to 64bit would make this easier.

It is certainly easier but with a cost :)

david

Having branch weights as 64 bit seems entirely reasonable to me. Increasing the range of the value stored doesn't change the semantics no matter how you interpret them. It does change the calculated frequencies, but only to make them more accurate. I don't see any problem with that.

Philip