RFC - Improvements to PGO profile support

Hi Diego,

I work for Qualcomm, and we are also interested in improvements to PGO
profile support, especially support for relative hotness across
functions. We are willing to contribute in this area, so please include me
in this discussion and in the work distribution. Some questions and
comments are below.

Date: Tue, 10 Mar 2015 13:14:22 -0400
From: Diego Novillo <dnovillo@google.com>
To: Bob Wilson <bob.wilson@apple.com>
Cc: Xinliang David Li <davidxl@google.com>, LLVM Developers Mailing
  List <llvmdev@cs.uiuc.edu>
Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support
Message-ID:
  <CAD_=9DSzNSqpku0FjpeLyugaxshcxY9fLyyZ5EmD=i5TVDiqRA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I've created a few bugzilla issues with details of some of the things I'll
be looking into. I'm not yet done wordsmithing the overall design document.
I'll try to finish it by early next week at the latest.

The document is available at

Improvements to LLVM's Profile Infrastructure - Google Docs

There are several topics covered. Ideally, I would prefer that we discuss
each topic separately. The main ones I will start working on are the ones
described in the bugzilla links we have in the doc.

This is just a starting point for us. I am not at all concerned with
implementing exactly what is proposed in the document. In fact, if we can
get the same value using the existing support, all the better.

OTOH, any other ideas that folks may have that work better than this are
more than welcome. I don't have really strong opinions on the matter. I am
fine with whatever works.

Thanks for the detailed write-up on this. Some of the issues definitely
need to be addressed. I am concerned, though, that some of the ideas may be
leading toward a scenario where we have essentially two completely
different ways of representing profile information in LLVM IR. It is great
to have two complementary approaches to collecting profile data, but two
representations in the IR would not make sense.

Yeah, I don't think I'll continue to push for a new MD_count attribute. If
we were to make MD_prof be a "real" execution count, that would be enough.
Note that by re-using MD_prof we are not changing its meaning at all. The
execution count is still a weight and the ratio is still the branch
probability. All we are changing are the absolute values of the numbers,
and we are increasing the data type width to remove the 32-bit limitation.

[ivan] What new data type do you suggest? Increasing it to 64 bits for
64-bit architectures would be reasonable, and would reduce the need for
scaling counts.

The first issue raised is that profile execution counts are not
represented in the IR. This was a very intentional decision. I know it
goes against what other compilers have done in the past. It took me a
while to get used to the idea when Andy first suggested it, so I know it
seems awkward at first. The advantage is that branch probabilities are
much easier to keep updated in the face of compiler transformations,
compared to execution counts.

Sorry. I don't follow. Updating counts as the CFG is transformed is not
difficult at all. What examples do you have in mind? The big advantage of
making MD_prof an actual execution count is that it is a meaningful metric
wrt scaling and transformation.

Say, for instance, that we have a branch instruction with two targets with
counts {100, 300} inside a function 'foo' that has entry count 2. The edge
probability for the first edge (count 100) is 100/(100+300) = 25%.

If we inline foo() inside another function bar() at a callsite with
profile count == 1, the cloned branch instruction gets its counters scaled
with the callsite count. So the new branch has counts
{100 * 1 / 2, 300 * 1 / 2} = {50, 150}. But the branch probability did not
change. Currently, we are cloning the branch without changing the edge
weights.

This scaling is not difficult at all and can be done incrementally very
quickly.
We cannot afford to recompute all frequencies on the fly because it would
be detrimental to compile time. If foo() itself has N callees inlined into
it, each inlined callee needs to trigger a re-computation. When foo() is
inlined into bar(), the frequencies will need to be recomputed for foo()
and all N callees inlined into foo().

[ivan] Scaling branch counts when inlining looks good, but the current PGO
support doesn't maintain function entry counts at LLVM IR. How do you plan
to maintain these at LLVM IR?

We are definitely missing the per-function execution counts that are
needed to be able to compare relative "hotness" across functions, and I
think that would be a good place to start making improvements. In the long
term, we should keep our options open to making major changes, but before
we go there, we should try to make incremental improvements to fix the
existing infrastructure.

Right, and that's the core of our proposal. We don't really want to make
major infrastructure changes at this point. The only thing I'd like to
explore is making MD_prof a real count. This will be useful for the
inliner changes and it would also make incremental updates easier, because
the scaling that needs to be done is very straightforward and quick.

Note that this change should not modify the current behaviour we get from
profile analysis. Things that were hot before should continue to be hot
now.

Many of the other issues you raise seem like they could also be addressed
without major changes to the existing infrastructure. Let's try to fix
those first.

That's exactly the point of the proposal. We definitely don't want to make
major changes to the infrastructure at first. My thinking is to start
working on making MD_prof a real count. One of the things that are
happening is that the combination of real profile data plus the frequency
propagation that we are currently doing is misleading the analysis.

For example (thanks David for the code and data). In the following code:

int g;
__attribute__((noinline)) void bar() {
  g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  g = 0;

  // Loop 1.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
       for (k = 0; k < 100; k++)
           bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 2.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 10000; j++)
        bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 3.
  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;
}

When compiled with profile instrumentation, frequency propagation is
distorting the real profile because it gives different frequency to the
calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
times, but after frequency propagation, the first call to bar() gets a
weight of 520,202 in loop #1, 210,944 in loop #2 and 4,096 in loop #3. In
reality, every call to bar() should have a weight of 1,000,000.

[ivan] I ran the example with

opt -analyze -block-freq test1.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 101.0, int = 807
- for.cond1: float = 10100.0, int = 80799
- for.cond4: float = 1010000.0, int = 8079999
- for.body6: float = 1000000.0, int = 7999999
- for.inc7: float = 10000.0, int = 79999
- for.inc10: float = 100.0, int = 799
- for.end12: float = 1.0, int = 8
- for.cond13: float = 101.0, int = 807
- for.cond16: float = 409600.0, int = 3276799
- for.body18: float = 409559.0441, int = 3276472
- for.inc22: float = 100.0, int = 799
- for.end24: float = 1.0, int = 8
- for.cond26: float = 4096.0, int = 32768
- for.body28: float = 4095.995904, int = 32767
- for.end31: float = 1.0, int = 8

As shown above, the three calls to bar() get float BBFreq of 1000000 (in
for.body6), 409559.0441 (in for.body18), and 4095.995904 (in for.body28).

When the example is modified for trip counts of 10, 100, and 1000

opt -analyze -block-freq test1.10.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 11.0, int = 87
- for.cond1: float = 110.0, int = 879
- for.cond4: float = 1100.0, int = 8799
- for.body6: float = 1000.0, int = 7999
- for.inc7: float = 100.0, int = 799
- for.inc10: float = 10.0, int = 79
- for.end12: float = 1.0, int = 8
- for.cond13: float = 11.0, int = 87
- for.cond16: float = 1010.0, int = 8079
- for.body18: float = 1000.0, int = 7999
- for.inc22: float = 10.0, int = 79
- for.end24: float = 1.0, int = 8
- for.cond26: float = 1001.0, int = 8007
- for.body28: float = 1000.0, int = 7999
- for.end31: float = 1.0, int = 8

the three calls to bar() get the correct count of 1000.
We should try to remove the limitation of 4096 in the block frequency
propagation algorithm if possible.

Ivan

[ivan] What new data type do you suggest? Increasing it to 64 bits for
64-bit architectures would be reasonable, and would reduce the need for
scaling counts.

A 64-bit int in MD_prof.


[ivan] Scaling branch counts when inlining looks good, but the current PGO
support doesn't maintain function entry counts at LLVM IR. How do you plan
to maintain these at LLVM IR?

MD_prof is the only thing that needs to be maintained at the IR level. For
the in-memory CFG profile representation, the function entry count is
needed.

This means that when CFG profile data is incrementally updated during
inlining, the MD_prof weight/count for the cloned branches needs to be
scaled in the same way.

When the example is modified for trip counts of 10, 100, and 1000

opt -analyze -block-freq test1.10.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 11.0, int = 87
- for.cond1: float = 110.0, int = 879
- for.cond4: float = 1100.0, int = 8799
- for.body6: float = 1000.0, int = 7999
- for.inc7: float = 100.0, int = 799
- for.inc10: float = 10.0, int = 79
- for.end12: float = 1.0, int = 8
- for.cond13: float = 11.0, int = 87
- for.cond16: float = 1010.0, int = 8079
- for.body18: float = 1000.0, int = 7999
- for.inc22: float = 10.0, int = 79
- for.end24: float = 1.0, int = 8
- for.cond26: float = 1001.0, int = 8007
- for.body28: float = 1000.0, int = 7999
- for.end31: float = 1.0, int = 8

the three calls to bar() get the correct count of 1000.
We should try to remove the limitation of 4096 in the block frequency
propagation algorithm if possible.

4096 is needed for static profile estimation -- there is no need to
remove the limit when real profile data is available -- the solution
is to simply not use the frequency propagation at all.

For your example, the values 11, 1010, etc. are all bogus. When the loop
trip count is small, the ratio of the loop body count to the preheader
count (the average trip count) can be very different from the real trip
count (see PR22719). The same bug has another example (a multi-entry loop)
showing the weakness.

Storing profile count in MD_prof allows the compiler to skip the freq
propagation completely. It also allows faster incremental update
during inlining via simple scaling.

David

[ivan] What new data type do you suggest? Increasing it to 64 bits for
64-bit architectures would be reasonable, and would reduce the need for
scaling counts.

Yes. That's what we have in mind. MD_prof would eventually become a 64-bit integer.

[ivan] Scaling branch counts when inlining looks good, but the current PGO support doesn't maintain function entry counts at LLVM IR. How do you plan to maintain these at LLVM IR?

Well, we'd have to incrementally update the entry counts as inlining happens while scaling the MD_prof counts in the IR. This is a bit far out right now. The inliner is not even capable of reading profiles today. This is the part of the plan that will happen much further down the road.

We should try to remove the limitation of 4096 in the block frequency
propagation algorithm if possible.

Long term, the presence of real profile data would make the frequency propagator unnecessary. In the shorter term, coming up with some intermediate fix to this is certainly possible. I'd rather continue this discussion in the bug itself (https://llvm.org/bugs/show_bug.cgi?id=22719).

Diego.