RFC - Improvements to PGO profile support

We (Google) have started to look more closely at the profiling infrastructure in LLVM. Internally, we have a large dependency on PGO to get peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM (e.g., the inliner) but we will still need to incorporate changes to support our work on these optimizations. Some of the changes may be addressed as individual bug fixes on the existing profiling infrastructure. Other changes may be better implemented as either new extensions or as replacements of existing code.

I think we will try to minimize infrastructure replacement at least in the short/medium term. After all, it doesn’t make too much sense to replace infrastructure that is broken for code that doesn’t exist yet.

David Li and I are preparing a document where we describe the major issues that we’d like to address. The document is a bit on the lengthy side, so it may be easier to start with an email discussion. This is a summary of the main changes we are looking at:

  1. Need to faithfully represent the execution count taken from dynamic profiles. Currently, MD_prof does not really represent an execution count. This makes things like comparing hotness across functions hard or impossible. We need a concept of global hotness.
  2. When the CFG or call graph changes, there needs to be an API for incrementally updating/scaling counts. For instance, when a function is inlined or partially inlined, when the CFG is modified, etc. These counts need to be updated incrementally (or perhaps re-computed as a first step in that direction).
  3. The inliner (and other optimizations) needs to use profile information and update it accordingly. This is predicated on Chandler's work on the pass manager, of course.
  4. Need to represent global profile summary data. For example, for global hotness determination, it is useful to compute additional global summary info, such as a histogram of counts that can be used to determine hotness and working set size estimates for a large percentage of the profiled execution.

There are other changes that we will need to incorporate. David, Teresa, Chandler, please add anything large that I missed.
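To make point 2 concrete, here is a minimal sketch (in Python, with hypothetical names; this is not an existing LLVM API) of the kind of count scaling inlining needs: the inlined copy receives callsite_count / callee_entry_count of the callee's profile, and the remaining out-of-line copy keeps the rest.

```python
def scale_inlined_counts(callee_counts, callee_entry_count, callsite_count):
    """Scale a callee's block counts for one inlined copy.

    callee_counts: dict mapping block name -> execution count.
    The inlined instance accounts for callsite_count / callee_entry_count
    of the callee's original profile.
    """
    if callee_entry_count == 0:
        # No recorded executions of the callee: nothing to attribute.
        return {block: 0 for block in callee_counts}
    scale = callsite_count / callee_entry_count
    return {block: round(count * scale)
            for block, count in callee_counts.items()}

def residual_counts(callee_counts, callee_entry_count, callsite_count):
    """Counts left on the out-of-line callee after inlining one site."""
    inlined = scale_inlined_counts(callee_counts, callee_entry_count,
                                   callsite_count)
    return {b: callee_counts[b] - inlined[b] for b in callee_counts}
```

For example, inlining a callee with entry count 100 at a call site executed 25 times attributes a quarter of each block's count to the inlined body and leaves three quarters on the residual function.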

My main question at the moment is what would be the best way of addressing them. Some seem to require new concepts to be implemented (e.g., execution counts). Others could be addressed as simple bugs to be fixed in the current framework.

Would it make sense to present everything in a unified document and discuss that? I’ve got some reservations about that approach because we will end up discussing everything at once and it may not lead to concrete progress. Another approach would be to present each issue individually either as patches or RFCs or bugs.

I will be taking on the implementation of several of these issues. Some of them involve the SamplePGO harness that I added last year. I would also like to know what other bugs or problems people have in mind that I could also roll into this work.

Thanks. Diego.

I don't mind whatever ways we take to upstream -- all these are
generally good problems to solve. I expect discussions more on
approaches to tackle the problem, not the problems themselves.

David

Diego Novillo <dnovillo@google.com> writes:

David Li and I are preparing a document where we describe the major issues
that we'd like to address. The document is a bit on the lengthy side, so it
may be easier to start with an email discussion. This is a summary of the main
changes we are looking at:

1. Need to faithfully represent the execution count taken from dynamic
    profiles. Currently, MD_prof does not really represent an execution count.
    This makes things like comparing hotness across functions hard or
    impossible. We need a concept of global hotness.
2. When the CFG or call graph changes, there needs to be an API for
    incrementally updating/scaling counts. For instance, when a function is
    inlined or partially inlined, when the CFG is modified, etc. These counts
    need to be updated incrementally (or perhaps re-computed as a first step
    in that direction).
3. The inliner (and other optimizations) needs to use profile information and
    update it accordingly. This is predicated on Chandler's work on the pass
    manager, of course.
4. Need to represent global profile summary data. For example, for global
    hotness determination, it is useful to compute additional global summary
    info, such as a histogram of counts that can be used to determine hotness
    and working set size estimates for a large percentage of the profiled
    execution.

Great! Looking forward to hearing more.

My main question at the moment is what would be the best way of addressing
them. Some seem to require new concepts to be implemented (e.g., execution
counts). Others could be addressed as simple bugs to be fixed in the current
framework.

Would it make sense to present everything in a unified document and discuss
that? I've got some reservations about that approach because we will end up
discussing everything at once and it may not lead to concrete progress.
Another approach would be to present each issue individually either as patches
or RFCs or bugs.

While a unified document is likely to lead to an unfocused discussion,
talking about the problems piecemeal will make it harder to think about
the big picture. I suspect that an overview of the issues and some brief
notes on how you're thinking of approaching each will be useful. We can
break out separate discussions from there on any points that are
contentious or otherwise need to be discussed in further detail.


We (Google) have started to look more closely at the profiling
infrastructure in LLVM. Internally, we have a large dependency on PGO to get
peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM
(e.g., the inliner) but we will still need to incorporate changes to support
our work on these optimizations. Some of the changes may be addressed as
individual bug fixes on the existing profiling infrastructure. Other changes
may be better implemented as either new extensions or as replacements of
existing code.

I think we will try to minimize infrastructure replacement at least in the
short/medium term. After all, it doesn't make too much sense to replace
infrastructure that is broken for code that doesn't exist yet.

David Li and I are preparing a document where we describe the major issues
that we'd like to address. The document is a bit on the lengthy side, so it
may be easier to start with an email discussion.

I would personally be interested in seeing a copy of that document, but it
might be more appropriate for a blog post than a discussion on llvm-dev. I
worry that we'd end up with a very unfocused discussion. It might be better
to frame this as your plan of attack and reserve discussion on llvm-dev for
things that are being proposed semi near term. Just my 2 cents.

This is a summary of the main changes we are looking at:

Need to faithfully represent the execution count taken from dynamic
profiles. Currently, MD_prof does not really represent an execution count.
This makes things like comparing hotness across functions hard or
impossible. We need a concept of global hotness.

What does MD_prof actually represent when used from Clang? I know I've been
using it for execution counters in my frontend. Am I approaching that
wrong?

As a side comment: I'm a bit leery of a consistent notion of hotness based
on counters across functions. These counters are almost always approximate
in practice and counting problems run rampant. I'd almost rather see a
consistent count inferred from data that's assumed to be questionable than
make the frontend try to generate consistent profiling metadata. I think
either approach could be made to work; we just need to think about it
carefully.

When the CFG or call graph changes, there needs to be an API for
incrementally updating/scaling counts. For instance, when a function is
inlined or partially inlined, when the CFG is modified, etc. These counts
need to be updated incrementally (or perhaps re-computed as a first step
in that direction).

Agreed. Do you have a sense of how much of an issue this is in practice?
I haven't seen it kick in much, but it's also not something I've been
looking for.

The inliner (and other optimizations) needs to use profile information and
update it accordingly. This is predicated on Chandler's work on the pass
manager, of course.

It's worth noting that the inliner work can be done independently of the
pass manager work. We can always explicitly recompute relevant analyses in
the inliner if needed. This will cost compile time, so we might need to
make this an off-by-default option. (Maybe -O3 only?) Being able to work on
the inliner independently of the pass management structure is valuable
enough that we should probably consider doing this.

PGO inlining is an area I'm very interested in. I'd really encourage you to
work incrementally in tree. I'm likely to start putting non-trivial amounts
of time into this topic in the next few weeks. I just need to clear a few
things off my plate first.

Other than the inliner, can you list the passes you think are profitable to
teach about profiling data? My list so far is: PRE (particularly of
loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
(avoiding code size explosion in cold code). I'm much more interested in
sources of improved performance than I am simply code size reduction.
(Reducing code size can improve performance of course.)

Also, code layout (bb layout, function layout, function splitting).

Need to represent global profile summary data. For example, for global
hotness determination, it is useful to compute additional global summary
info, such as a histogram of counts that can be used to determine hotness
and working set size estimates for a large percentage of the profiled
execution.

Er, not clear what you're trying to say here?

The idea is to get a sense of a good global profile count threshold to
use given an application's profile, i.e., when determining whether a
profile count is hot in the given profile. For example: what is the
minimum profile count contributing to the hottest 99% of the
application's profile?
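That computation can be sketched as follows (Python, with a hypothetical helper name; LLVM's eventual profile summary representation may look quite different): sort the counts and walk them from hottest to coldest until the chosen fraction of total execution is covered, then report the smallest count in that set as the hotness threshold.

```python
def hot_count_threshold(counts, fraction=0.99):
    """Return the minimum profile count contributing to the hottest
    `fraction` of the application's total profiled execution."""
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    for c in sorted(counts, reverse=True):
        running += c
        if running >= fraction * total:
            # Any count >= c is "hot" by this working-set definition.
            return c
    return min(counts)
```

With counts [1000, 500, 10, 5, 1], the hottest 99% of execution is covered once the count-10 block is included, so 10 becomes the hot threshold.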

Teresa

We (Google) have started to look more closely at the profiling
infrastructure in LLVM. Internally, we have a large dependency on PGO to get
peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM
(e.g., the inliner) but we will still need to incorporate changes to support
our work on these optimizations. Some of the changes may be addressed as
individual bug fixes on the existing profiling infrastructure. Other changes
may be better implemented as either new extensions or as replacements of
existing code.

I think we will try to minimize infrastructure replacement at least in the
short/medium term. After all, it doesn't make too much sense to replace
infrastructure that is broken for code that doesn't exist yet.

David Li and I are preparing a document where we describe the major issues
that we'd like to address. The document is a bit on the lengthy side, so it
may be easier to start with an email discussion.

I would personally be interested in seeing a copy of that document, but it
might be more appropriate for a blog post than a discussion on llvm-dev. I
worry that we'd end up with a very unfocused discussion. It might be better
to frame this as your plan of attack and reserve discussion on llvm-dev for
things that are being proposed semi near term. Just my 2 cents.

This is a summary of the main changes we are looking at:

Need to faithfully represent the execution count taken from dynamic
profiles. Currently, MD_prof does not really represent an execution count.
This makes things like comparing hotness across functions hard or
impossible. We need a concept of global hotness.

What does MD_prof actually represent when used from Clang? I know I've been
using it for execution counters in my frontend. Am I approaching that
wrong?

As a side comment: I'm a bit leery of a consistent notion of hotness based
on counters across functions. These counters are almost always approximate
in practice and counting problems run rampant.

Having representative training runs is a prerequisite for using FDO/PGO.

I'd almost rather see a consistent count inferred from data that's assumed
to be questionable than make the frontend try to generate consistent
profiling metadata.

The frontend does not generate profile data -- it is just a messenger that
should pass the data faithfully to the middle end. That messenger
(the profile reader) can be in the middle end too.

I think either approach could be made to work; we just need to
think about it carefully.

When the CFG or call graph changes, there needs to be an API for
incrementally updating/scaling counts. For instance, when a function is
inlined or partially inlined, when the CFG is modified, etc. These counts
need to be updated incrementally (or perhaps re-computed as a first step
in that direction).

Agreed. Do you have a sense of how much of an issue this is in practice?
I haven't seen it kick in much, but it's also not something I've been
looking for.

The inliner (and other optimizations) needs to use profile information and
update it accordingly. This is predicated on Chandler's work on the pass
manager, of course.

It's worth noting that the inliner work can be done independently of the
pass manager work. We can always explicitly recompute relevant analyses in
the inliner if needed. This will cost compile time, so we might need to
make this an off-by-default option. (Maybe -O3 only?) Being able to work on
the inliner independently of the pass management structure is valuable
enough that we should probably consider doing this.

PGO inlining is an area I'm very interested in. I'd really encourage you to
work incrementally in tree. I'm likely to start putting non-trivial amounts
of time into this topic in the next few weeks. I just need to clear a few
things off my plate first.

Other than the inliner, can you list the passes you think are profitable to
teach about profiling data? My list so far is: PRE (particularly of
loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
(avoiding code size explosion in cold code). I'm much more interested in
sources of improved performance than I am simply code size reduction.
(Reducing code size can improve performance of course.)

PGO is very effective in code size reduction. In reality, a large
percentage of functions are globally cold.

David

Other than the inliner, can you list the passes you think are profitable to
teach about profiling data? My list so far is: PRE (particularly of
loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
(avoiding code size explosion in cold code). I'm much more interested in
sources of improved performance than I am simply code size reduction.
(Reducing code size can improve performance of course.)

Also, code layout (bb layout, function layout, function splitting).

Correct me if I'm wrong, but we already have "function layout" (i.e. basic block placement). It may need improvement, but I've found it to be reasonably effective.

What do you mean by "bb layout"?

I'm assuming you're referring to a form of outlining as "function splitting". This seems like a clearly good idea.

Need to represent global profile summary data. For example, for global
hotness determination, it is useful to compute additional global summary
info, such as a histogram of counts that can be used to determine hotness
and working set size estimates for a large percentage of the profiled
execution.

Er, not clear what you're trying to say here?

The idea is to get a sense of a good global profile count threshold to
use given an application's profile, i.e., when determining whether a
profile count is hot in the given profile. For example: what is the
minimum profile count contributing to the hottest 99% of the
application's profile?

Ah, got it. That makes sense for a C/C++ use case.

In my use case, we're only compiling the hottest methods. As such, I care mostly about relative hotness within a method or a small set of methods.

Need to faithfully represent the execution count taken from dynamic
profiles. Currently, MD_prof does not really represent an execution count.
This makes things like comparing hotness across functions hard or
impossible. We need a concept of global hotness.

What does MD_prof actually represent when used from Clang? I know I've been
using it for execution counters in my frontend. Am I approaching that
wrong?

As a side comment: I'm a bit leery of a consistent notion of hotness based
on counters across functions. These counters are almost always approximate
in practice and counting problems run rampant.

Having representative training runs is a prerequisite for using FDO/PGO.

Representativeness is not the issue I'm raising. Profiling systems (particularly instrumentation based ones) have systemic biases. Not accounting for that can lead to some very odd results. As an example:

void foo() {
  if (member)
    for (int i = 0; i < 100000; i++)
      if (member2)
        bar();
}

With multiple threads in play, it's entirely possible that the sum of the absolute weights on the second branch is lower than the sum of the absolute counts on the first branch (i.e., due to racy updating). While you can avoid this by using race-free updates, I know of very few systems that actually do.

If your optimization is radically unstable in such scenarios, that's a serious problem. Pessimization is bad enough (if tolerable), incorrect transforms are not. It's very easy to write a transform that implicitly assumes the counts for the first branch must be less than the counts for the second.
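One way to catch this kind of inconsistency is the sort of profile sanity check David mentions later in the thread: verify flow conservation, i.e. for every interior block, the counts flowing in must equal the counts flowing out. This is a sketch over an invented edge-list representation, not LLVM's actual data structures:

```python
from collections import defaultdict

def find_inconsistent_blocks(edges):
    """edges: list of (src, dst, count) profile edge counts.
    Returns blocks whose inflow != outflow (entry blocks with no
    inflow and exit blocks with no outflow are excluded), e.g. due
    to racy, non-atomic counter updates."""
    inflow, outflow = defaultdict(int), defaultdict(int)
    for src, dst, count in edges:
        outflow[src] += count
        inflow[dst] += count
    blocks = set(inflow) | set(outflow)
    return sorted(b for b in blocks
                  if b in inflow and b in outflow
                  and inflow[b] != outflow[b])
```

A consumer could ignore, rescale, or warn about the flagged blocks instead of letting a transform implicitly trust the impossible counts.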

I'd almost rather see a consistent count inferred from data that's assumed
to be questionable than make the frontend try to generate consistent
profiling metadata.

The frontend does not generate profile data -- it is just a messenger that
should pass the data faithfully to the middle end. That messenger
(the profile reader) can be in the middle end too.

Er, we may be arguing terminology here. I was including the profiling system as part of the "frontend" - I'm working with a JIT - whereas you're assuming a separate collection system. It doesn't actually matter which terms we use. My point was that assuming clean profiling data is just not reasonable in practice. At minimum, some type of normalization step is required.

Other than the inliner, can you list the passes you think are profitable to
teach about profiling data? My list so far is: PRE (particularly of
loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
(avoiding code size explosion in cold code). I'm much more interested in
sources of improved performance than I am simply code size reduction.
(Reducing code size can improve performance of course.)

PGO is very effective in code size reduction. In reality, a large
percentage of functions are globally cold.

For a traditional C++ application, yes. For a JIT which is only compiling warm code paths in hot methods, not so much. It's still helpful, but the impact is much smaller.

Philip

Other than the inliner, can you list the passes you think are profitable to
teach about profiling data? My list so far is: PRE (particularly of
loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
(avoiding code size explosion in cold code). I'm much more interested in
sources of improved performance than I am simply code size reduction.
(Reducing code size can improve performance of course.)

Also, code layout (bb layout, function layout, function splitting).

Correct me if I'm wrong, but we already have "function layout" (i.e. basic
block placement). It may need improvement, but I've found it to be
reasonably effective.

What do you mean by "bb layout"?

By bb layout I was referring to basic block placement - I am not overly
familiar with LLVM's implementation, but I know that this typically
benefits from profile information.

By function layout, I meant layout of functions within the module and then
the executable. This could simply be marking/separating hot vs cold
functions, or could be fancier via a linker plugin to use profile data to
colocate functions with affinity.
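The simplest form of what Teresa describes could look like this sketch (Python, hypothetical helper; a real implementation would live in the compiler or a linker plugin): partition functions into hot and cold layout groups by a count threshold, so hot code packs together in the address space.

```python
def partition_functions(func_counts, hot_threshold):
    """Split functions into hot/cold layout groups by profile count.

    func_counts: dict mapping function name -> entry count.
    Returns (hot, cold), each ordered hottest-first for locality."""
    hot = sorted((f for f, c in func_counts.items() if c >= hot_threshold),
                 key=lambda f: -func_counts[f])
    cold = sorted((f for f, c in func_counts.items() if c < hot_threshold),
                  key=lambda f: -func_counts[f])
    return hot, cold
```

A fancier version would order the hot group by call affinity rather than raw count, which is where the linker-plugin idea comes in.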

I'm assuming you're referring to a form of outlining as "function
splitting". This seems like a clearly good idea.

Right.

Thanks,
Teresa

Need to faithfully represent the execution count taken from dynamic
profiles. Currently, MD_prof does not really represent an execution count.
This makes things like comparing hotness across functions hard or
impossible. We need a concept of global hotness.

What does MD_prof actually represent when used from Clang? I know I've been
using it for execution counters in my frontend. Am I approaching that
wrong?

As a side comment: I'm a bit leery of a consistent notion of hotness based
on counters across functions. These counters are almost always approximate
in practice and counting problems run rampant.

Having representative training runs is a prerequisite for using FDO/PGO.

Representativeness is not the issue I'm raising. Profiling systems
(particularly instrumentation based ones) have systemic biases. Not
accounting for that can lead to some very odd results. As an example:
void foo() {
  if (member)
    for (int i = 0; i < 100000; i++)
      if (member2)
        bar();
}

With multiple threads in play, it's entirely possible that the sum of the
absolute weights on the second branch is lower than the sum of the absolute
counts on the first branch (i.e., due to racy updating). While you can
avoid this by using race-free updates, I know of very few systems that
actually do.

Are you speculating, or do you have data to show it? We have large
programs run with hundreds of threads; race conditions contribute only
very small count variations -- and there are ways to smooth out the
difference.

If your optimization is radically unstable in such scenarios, that's a
serious problem. Pessimization is bad enough (if tolerable), incorrect
transforms are not.

That has never been our experience with using PGO in the past. We also
have tools to compare profile consistency from one training run to
another.

If you experience such problems in real apps, can you file a bug?

It's very easy to write a transform that implicitly
assumes the counts for the first branch must be less than the counts for the
second.

The compiler can detect an insane profile -- it can either ignore it,
correct it, or use it with warnings, depending on options.

I'd almost rather see a consistent count inferred from data that's assumed
to be questionable than make the frontend try to generate consistent
profiling metadata.

The frontend does not generate profile data -- it is just a messenger that
should pass the data faithfully to the middle end. That messenger
(the profile reader) can be in the middle end too.

Er, we may be arguing terminology here. I was including the profiling
system as part of the "frontend" - I'm working with a JIT - whereas you're
assuming a separate collection system. It doesn't actually matter which
terms we use. My point was that assuming clean profiling data is just not
reasonable in practice. At minimum, some type of normalization step is
required.

If you are talking about making a slightly inconsistent profile
flow-consistent, yes, there are mechanisms to do that.

David

We (Google) have started to look more closely at the profiling infrastructure in LLVM. Internally, we have a large dependency on PGO to get peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM (e.g., the inliner) but we will still need to incorporate changes to support our work on these optimizations. Some of the changes may be addressed as individual bug fixes on the existing profiling infrastructure. Other changes may be better implemented as either new extensions or as replacements of existing code.

I think we will try to minimize infrastructure replacement at least in the short/medium term. After all, it doesn’t make too much sense to replace infrastructure that is broken for code that doesn’t exist yet.

David Li and I are preparing a document where we describe the major issues that we’d like to address. The document is a bit on the lengthy side, so it may be easier to start with an email discussion. This is a summary of the main changes we are looking at:

  1. Need to faithfully represent the execution count taken from dynamic profiles. Currently, MD_prof does not really represent an execution count. This makes things like comparing hotness across functions hard or impossible. We need a concept of global hotness.

The plan that we have discussed in the past (I don’t remember when) was to record simple function entry execution counts. Those could be combined with the BlockFrequencyInfo to compare “hotness” across functions.
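A sketch of how those two pieces could combine (Python; the names are hypothetical, and real BlockFrequencyInfo uses scaled integer frequencies rather than floats): a block's estimated global count is the function's entry count scaled by the block's frequency relative to the entry block's frequency.

```python
def global_block_count(entry_count, block_freq, entry_block_freq):
    """Estimate a block's absolute execution count by combining the
    function entry count with intra-function relative frequencies."""
    if entry_block_freq == 0:
        return 0
    return round(entry_count * block_freq / entry_block_freq)
```

Because the result is an absolute count rather than a per-function relative weight, hotness comparisons across functions become meaningful.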

  2. When the CFG or call graph changes, there needs to be an API for incrementally updating/scaling counts. For instance, when a function is inlined or partially inlined, when the CFG is modified, etc. These counts need to be updated incrementally (or perhaps re-computed as a first step in that direction).

One of the main reasons that we chose to use branch weights to represent profile information within functions is that it makes this problem easier. Of course, we still need to update the branch weights when transforming the CFG, but I believe most of that work has already been done. Are you suggesting that we should work on incremental BlockFrequencyInfo updates? We have discussed that in the past, but so far, it has worked reasonably well to just re-run that analysis. (I wouldn’t be surprised if we’re missing some places where the analysis needs to be invalidated so that it gets re-run.)

We (Google) have started to look more closely at the profiling
infrastructure in LLVM. Internally, we have a large dependency on PGO to get
peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM
(e.g., the inliner) but we will still need to incorporate changes to support
our work on these optimizations. Some of the changes may be addressed as
individual bug fixes on the existing profiling infrastructure. Other changes
may be better implemented as either new extensions or as replacements of
existing code.

I think we will try to minimize infrastructure replacement at least in the
short/medium term. After all, it doesn't make too much sense to replace
infrastructure that is broken for code that doesn't exist yet.

David Li and I are preparing a document where we describe the major issues
that we'd like to address. The document is a bit on the lengthy side, so it
may be easier to start with an email discussion. This is a summary of the
main changes we are looking at:

Need to faithfully represent the execution count taken from dynamic
profiles. Currently, MD_prof does not really represent an execution count.
This makes things like comparing hotness across functions hard or
impossible. We need a concept of global hotness.

The plan that we have discussed in the past (I don’t remember when) was to
record simple function entry execution counts. Those could be combined with
the BlockFrequencyInfo to compare “hotness” across functions.

Yes -- there are two aspects to the problem:
1) the raw profile data representation in the IR, and
2) the profile count data represented for the CFG.

What you said addresses 2), which is one of the possibilities. There is a
third issue that is also going to be covered in more detail: the block
frequency propagation algorithm is limited (leading to information loss).
When profile counts are available, block frequency data can be computed
directly via simple normalization and scaling. This requires the raw edge
count data in 1) to be represented truthfully.

When the CFG or call graph changes, there needs to be an API for
incrementally updating/scaling counts. For instance, when a function is
inlined or partially inlined, when the CFG is modified, etc. These counts
need to be updated incrementally (or perhaps re-computed as a first step
in that direction).

One of the main reasons that we chose to use branch weights to represent
profile information within functions is that it makes this problem easier.
Of course, we still need to update the branch weights when transforming the
CFG, but I believe most of that work has already been done. Are you
suggesting that we should work on incremental BlockFrequencyInfo updates? We
have discussed that in the past, but so far, it has worked reasonably well
to just re-run that analysis. (I wouldn’t be surprised if we’re missing some
places where the analysis needs to be invalidated so that it gets re-run.)

Diego is going to share the proposal in more detail. I will give a
brief summary to answer your question:

1) Making the raw profile data (in MD_prof) truthful does not change
its original meaning -- the branch count is still a branch weight -- but
it happens to also be an execution count, which contains more information.
2) At the CFG level (BranchProbabilityInfo), using real branch probabilities
does not change the way branch weights are used either -- the branch
probability is still a branch weight (but normalized so that 0 <= prob
<= 1). Benefits and reasons behind that will be provided.

The infrastructure is ready to update the raw MD_prof data during
intra-procedural transformations when branch instructions are cloned,
but not the CFG-level branch probabilities or block
frequencies/counts. This is especially important for inter-procedural
transformations like the inliner (i.e., updating the profile data
associated with an inline instance in the caller context). Recomputing
via frequency propagation is not only imprecise (as mentioned above),
but also very expensive in compile time.

thanks,

David

Folks,

I’ve created a few bugzilla issues with details of some of the things I’ll be looking into. I’m not yet done wordsmithing the overall design document. I’ll try to finish it by early next week at the latest.

In the meantime, these are the specific bugzilla issues I’ve opened:

I’m hoping the descriptions in the bugs make sense on their own. If not, please use the bugs to beat me with a clue stick and I’ll clarify.

Thanks. Diego.

I've created a few bugzilla issues with details of some of the things
I'll be looking into. I'm not yet done wordsmithing the overall design
document. I'll try to finish it by early next week at the latest.

The document is available at

https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing

There are several topics covered. Ideally, I would prefer that we discuss
each topic separately. The main ones I will start working on are the ones
described in the bugzilla links we have in the doc.

This is just a starting point for us. I am not at all concerned with
implementing exactly what is proposed in the document. In fact, if we can
get the same value using the existing support, all the better.

OTOH, any other ideas that folks may have that work better than this are
more than welcome. I don't have really strong opinions on the matter. I am
fine with whatever works.

Thanks. Diego.

Thanks for the detailed write-up on this. Some of the issues definitely need to be addressed. I am concerned, though, that some of the ideas may be leading toward a scenario where we have essentially two completely different ways of representing profile information in LLVM IR. It is great to have two complementary approaches to collecting profile data, but two representations in the IR would not make sense.

The first issue raised is that profile execution counts are not represented in the IR. This was a very intentional decision. I know it goes against what other compilers have done in the past. It took me a while to get used to the idea when Andy first suggested it, so I know it seems awkward at first. The advantage is that branch probabilities are much easier to keep updated in the face of compiler transformations, compared to execution counts. We are definitely missing the per-function execution counts that are needed to be able to compare relative “hotness” across functions, and I think that would be a good place to start making improvements. In the long term, we should keep our options open to making major changes, but before we go there, we should try to make incremental improvements to fix the existing infrastructure.

Many of the other issues you raise seem like they could also be addressed without major changes to the existing infrastructure. Let’s try to fix those first.

After reading the document, I agree with Bob's perspective here. I would strongly recommend that you start with the optimizations that can be implemented within the current framework. The current infrastructure gives a fairly reasonable idea of relative hotness within a function. There's a lot to be done to exploit that information (even in the inliner!) without resorting to cross-function analysis. If, after most of those have been implemented, we need more fundamental changes, we could consider them. Starting with a fundamental rewrite of the profiling system within LLVM seems like a mistake.

At a meta level, as someone who uses LLVM for JITing, I would be opposed to a system that assumed consistent profiling counts across function boundaries and gave up on relative hotness information. At least if I'm understanding your proposal, this would completely break a multi-tiered JIT. In practice, you generally stop collecting instrumentation profiling once something is compiled at a high enough tier. When compiling its caller, you'll get very deceptive results if you rely on the execution counts to line up across functions. On the other hand, merging two relative hotness profiles by scaling based on the hotness of the callsite works out quite well in practice. You can use some information about global hotness to make decisions, but those decisions need to be resilient to such systematic under-counting.

Philip

Bob, Philip, thanks for the feedback.

Diego is planning to give a more detailed reply next Monday. There
seems to be some misunderstanding about the proposals, so I will just
give some highlights here:

1) The proposal does not intend to fundamentally change the current
framework, but to enhance it so that:
  a) more profile information is preserved
  b) block/edge counts/frequencies become faster to compute
  c) profile information becomes faster to access and update
(inter-procedurally)

2) Changes to profile APIs and profile client code will be minimized,
except that we will add IPA clients (once Chandler's pass manager
change is ready)

3) The proposed change does *not* give up relative hotness, as Philip
suggested it might. All clients that rely on relative hotness are
unaffected -- except that the data becomes better and more reliable.

4) With real profile data available, the current infrastructure does
*not* provide reasonable hotness information (e.g., try comparing BBs
that execute the same number of times but sit in loops of different
depths in the same function, and see how big the difference is), let
alone fast updating.

I am reasonably confident that the proposal:
1) does not affect compilations using static profiles (with branch
prediction), and
2) is strictly better for -fprofile-instr-use optimizations.

The area I am less sure about is the JIT, but I am really interested
in learning the details and proposing solutions if the current
proposal does not work for you (which I doubt -- because if the
current framework works, the new one should work too :) ).

I am looking forward to more detailed discussions next week! We should
sit down together and discuss the changes, rationale, and concerns one
by one -- with concrete examples.

thanks,

David

Thanks for the detailed write-up on this. Some of the issues definitely
need to be addressed. I am concerned, though, that some of the ideas may be
leading toward a scenario where we have essentially two completely
different ways of representing profile information in LLVM IR. It is great
to have two complementary approaches to collecting profile data, but two
representations in the IR would not make sense.

Yeah, I don't think I'll continue to push for a new MD_count
attribute. If we make MD_prof a "real" execution count, that will be
enough. Note that by re-using MD_prof we are not changing its meaning
at all. The execution count is still a weight, and the ratio is still
a branch probability. All we are changing are the absolute values of
the numbers, and we are widening the data type to remove the 32-bit
limitation.

The first issue raised is that profile execution counts are not
represented in the IR. This was a very intentional decision. I know it goes
against what other compilers have done in the past. It took me a while to
get used to the idea when Andy first suggested it, so I know it seems
awkward at first. The advantage is that branch probabilities are much
easier to keep updated in the face of compiler transformations, compared to
execution counts.

Sorry, I don't follow. Updating counts as the CFG is transformed is
not difficult at all. What examples do you have in mind? The big
advantage of making MD_prof an actual execution count is that it is a
meaningful metric with respect to scaling and transformation.

Say, for instance, that we have a branch instruction with two targets with
counts {100, 300} inside a function 'foo' that has entry count 2. The edge
probability for the first edge (count 100) is 100/(100+300) = 25%.

If we inline foo() inside another function bar() at a callsite with profile
count == 1, the cloned branch instruction gets its counters scaled with the
callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} =
{50, 150}. But the branch probability did not change. Currently, we are
cloning the branch without changing the edge weights.

This scaling is not difficult at all and can be applied incrementally
very quickly. We cannot afford to recompute all frequencies on the
fly, because it would be detrimental to compile time. If foo() itself
has N callees inlined into it, each inlined callee would need to
trigger a re-computation: when foo() is inlined into bar(), the
frequencies would need to be recomputed for foo() and all N callees
inlined into foo().

We are definitely missing the per-function execution counts that are
needed to be able to compare relative “hotness” across functions, and I
think that would be a good place to start making improvements. In the long
term, we should keep our options open to making major changes, but before
we go there, we should try to make incremental improvements to fix the
existing infrastructure.

Right, and that's the core of our proposal. We don't really want to make
major infrastructure changes at this point. The only thing I'd like to
explore is making MD_prof a real count. This will be useful for the inliner
changes and it would also make incremental updates easier, because the
scaling that needs to be done is very straightforward and quick.

Note that this change should not modify the current behaviour we get from
profile analysis. Things that were hot before should continue to be hot now.

Many of the other issues you raise seem like they could also be addressed
without major changes to the existing infrastructure. Let’s try to fix
those first.

That's exactly the point of the proposal. We definitely don't want to
make major changes to the infrastructure at first. My thinking is to
start working on making MD_prof a real count. One of the things that
is happening is that the combination of real profile data with the
frequency propagation we currently do is misleading the analysis.

For example (thanks to David for the code and data), consider the following code:

int g;

__attribute__((noinline)) void bar() {
  g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  g = 0;

  // Loop 1.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
      for (k = 0; k < 100; k++)
        bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 2.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 10000; j++)
      bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 3.
  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;
}

When compiled with profile instrumentation, frequency propagation
distorts the real profile because it gives a different frequency to
the calls to bar() in the three loops. All three loops execute
1,000,000 times, but after frequency propagation, the call to bar()
gets a weight of 520,202 in loop #1, 210,944 in loop #2 and 4,096 in
loop #3. In reality, every call to bar() should have a weight of
1,000,000.

Thanks. Diego.

(Sorry for the delay responding; I've been on holiday.)

There are two things going on here.

Firstly, the loop scales are being capped at 4096. I propagated this
approximation from the previous version of BFI. If it's causing a
problem (which it looks like it is), we should drop it and fix what
breaks. You can play around with this by commenting out the `if`
statement at the end of `computeLoopScale()` in
BlockFrequencyInfoImpl.cpp.

For example, without that logic this testcase gives:

    Printing analysis 'Block Frequency Analysis' for function 'main':
    block-frequency-info: main
     - entry: float = 1.0, int = 8
     - for.cond: float = 51.5, int = 411
     - for.body: float = 50.5, int = 403
     - for.cond1: float = 5051.0, int = 40407
     - for.body3: float = 5000.5, int = 40003
     - for.cond4: float = 505001.0, int = 4040007
     - for.body6: float = 500000.5, int = 4000003
     - for.inc: float = 500000.5, int = 4000003
     - for.end: float = 5000.5, int = 40003
     - for.inc7: float = 5000.5, int = 40003
     - for.end9: float = 50.5, int = 403
     - for.inc10: float = 50.5, int = 403
     - for.end12: float = 1.0, int = 8
     - for.cond13: float = 51.5, int = 411
     - for.body15: float = 50.5, int = 403
     - for.cond16: float = 500051.0, int = 4000407
     - for.body18: float = 500000.5, int = 4000003
     - for.inc19: float = 500000.5, int = 4000003
     - for.end21: float = 50.5, int = 403
     - for.inc22: float = 50.5, int = 403
     - for.end24: float = 1.0, int = 8
     - for.cond26: float = 500001.5, int = 4000011
     - for.body28: float = 500000.5, int = 4000003
     - for.inc29: float = 500000.5, int = 4000003
     - for.end31: float = 1.0, int = 8

(Now we get 500000.5 for all the inner loop bodies.)

Secondly, instrumentation-based profiling intentionally fuzzes the
profile data in the frontend using Laplace's Rule of Succession (look at
`scaleBranchWeight()` in CodeGenPGO.cpp).

For example, "loop 1" (which isn't affected by the 4096 cap) should give
a loop scale of 500000.5, not 1000000. (The profile data says
1000000/10000 for the inner loop, 10000/100 for the middle, and 100/1
for the outer loop. Laplace says that we should fuzz these branch
weights to 1000001/10001, 10001/101, and 101/2, which works out to
1000001/2 == 500000.5 total.)


Thanks, Duncan. I've started working on this as a first step. Long
term, the capping done during propagation is not much of a concern,
because we would probably not be using propagation at all if real
profile information is available (at least that's the initial
thinking).

Thanks for all the feedback, folks. I'll start sending out patches soon
that address specific issues. We can continue the discussion there.

Diego.