RFC: Synthetic function entry counts

Functions in LLVM IR carry function_entry_count metadata that is attached during PGO compilation. Using the entry count together with block frequency info, the compiler computes the profile count of call instructions, from which the hotness or coldness of callsites can be determined. Experiments have shown that using a higher inlining threshold for hot callsites improves the runtime performance of the generated code without a significant increase in code size. We propose to generate synthetic function entry counts in non-PGO compilation and use them to boost hot callsites during inlining.
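
For concreteness, a callsite's count is the caller's entry count scaled by the callsite's block frequency relative to the entry block. A minimal sketch of the arithmetic, with made-up numbers in the comments (this is not an existing LLVM helper):

    #include <cstdint>

    // Example: EntryCount = 1000, the callsite's block frequency is 60
    // and the entry block's frequency is 10 (both from BFI), so the
    // callsite's profile count is 1000 * 60 / 10 = 6000. A profile
    // summary can then classify 6000 as hot or cold against its
    // count thresholds.
    uint64_t callSiteCount(uint64_t EntryCount, uint64_t BlockFreq,
                           uint64_t EntryFreq) {
      return EntryCount * BlockFreq / EntryFreq;
    }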

Synthetic entry counts are initialized from properties of the function, such as whether it is visible outside the module or carries the inline keyword. The SCCs of the call graph are then traversed in reverse post-order. Each callsite's count is computed from the caller's entry count and the block frequency of the callsite, and is added to the entry count of the callee. For indirect calls, we will use the !callees metadata to find the possible targets and distribute the count equally among them. For functions in a non-trivial SCC, the algorithm has to ensure that the computed counts are stable and deterministic.
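
A minimal sketch of this propagation over the SCC DAG, using simplified stand-in types rather than the real LLVM call graph and BFI interfaces (the seed constants and field names are made up):

    #include <cstdint>
    #include <vector>

    struct Function;

    // A call edge annotated with the callsite's block frequency
    // relative to the caller's entry frequency, as BFI would report it.
    struct CallEdge {
      Function *Callee;
      double RelBlockFreq;
    };

    struct Function {
      std::vector<CallEdge> Calls;
      bool ExternallyVisible = false;
      bool HasInlineKeyword = false;
      uint64_t SyntheticCount = 0;
    };

    // Seed each function's count from its static properties, then walk
    // the functions in reverse post-order over the SCC DAG, pushing
    // callsite counts into callees. Indirect-call distribution via
    // !callees and the handling of non-trivial SCCs are omitted.
    void propagateSyntheticCounts(std::vector<Function *> &FunctionsInRPO) {
      for (Function *F : FunctionsInRPO)
        F->SyntheticCount = F->ExternallyVisible  ? 100
                            : F->HasInlineKeyword ? 10
                                                  : 20;
      for (Function *F : FunctionsInRPO)
        for (CallEdge &CE : F->Calls)
          // Callsite count = caller count scaled by relative frequency;
          // it accumulates into the callee's entry count.
          CE.Callee->SyntheticCount +=
              static_cast<uint64_t>(F->SyntheticCount * CE.RelBlockFreq);
    }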

In ThinLTO mode, the function summary contains the list of call edges from the function. We propose to add the relative block frequency to these edges. During the thin-link phase, we propagate the function counts over the entire call graph and update the function summaries with the synthetic counts. Additionally, we plan to use the computed counts to drive importing decisions.
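
A hedged sketch of what such an extended summary edge could carry; the field names are hypothetical, and the real ThinLTO summary types differ:

    #include <cstdint>
    #include <vector>

    // Hypothetical shape of a summary call edge that also records the
    // callsite's relative block frequency, which is enough for the
    // thin link to propagate counts without touching IR.
    struct SummaryCallEdge {
      uint64_t CalleeGUID;   // global unique ID of the callee
      uint64_t RelBlockFreq; // callsite block freq relative to entry freq
    };

    struct FunctionSummary {
      uint64_t GUID;
      uint64_t SyntheticEntryCount = 0; // filled in during the thin link
      std::vector<SummaryCallEdge> CallEdges;
    };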

Alternative approach

This really sounds like it should be an analysis pass. Summarizing the results into the IR using the existing entry_count metadata sounds like a workaround for an invalidation problem more than anything else.

You acknowledge this in your alternatives discussion, but I don't follow why BFI invalidation at the function level has to invalidate the interprocedural result. You'd certainly need BFI when running your analysis, but once computed, the results should be stable across updates within the function. You might need update logic for inlining, outlining, and mergefunc, but that should be it in terms of current LLVM transforms?

Philip

> This really sounds like it should be an analysis pass. Summarizing the results into the IR ... sounds like a workaround for an invalidation problem more than anything else.

I think it is not a workaround. The strength of the proposal is that it fits really naturally with the PGO infrastructure -- all the existing profile update mechanisms (currently mainly the inliner) can be reused. Most of the PGO-related query interfaces can also be reused, so there is very little change on the client side.

The inter-procedural frequency propagation also needs to happen at thin-link time for ThinLTO, which should not touch IR except for summary data.

David

> all the existing profile update mechanisms (currently mainly the inliner) can be reused ...

Understood. On the other hand, these could also be structured as analysis updates done by the transforms, and the users could consume the analysis result.

> The inter-procedural frequency propagation also needs to happen at thin-link time for ThinLTO ...

Hm, this is a really good point; I'd missed that in the original doc. Given this, the design makes a lot of sense. Personally, I'd still lean towards trying to structure this as an analysis, but that's a weakly held opinion. Don't let it block you from forward progress.

IIUC, this proposal is just saying that we should infer a static profile for entry counts just like we do for branch probabilities. In the case of entry counts, we do not hide that information behind an analysis like BPI, so currently just annotating synthetic PGO entry counts is a simple solution that piggybacks on the PGO mechanism and Just Works.

If that is correct then this makes perfect sense to me.

It could be argued that we ought to refactor things so that the raw PGO metadata is only ever accessed via a wrapper CGSCC analysis that falls back to interprocedural analysis (i.e. static profile heuristics) when the entry count metadata is missing, just like BPI does with static intraprocedural analysis and branch weight metadata. However, we probably don't want to do that while folks are still depending on the old PM in production, since CGSCC analyses don't exist there, which would force us to maintain both an old and a new way of doing it.

(Also, it sounds like you want to compute this with a top-down CGSCC traversal, so it might not actually be computable incrementally as a bottom-up CGSCC analysis, which is what CGSCC analyses currently do; an auxiliary module analysis for the top-down part might work around this, though.)

Also, the need to run this logic (or similar logic) as a "ThinLTO analysis" suggests not wedding it too tightly to the intricacies of IR-level pass management (although admittedly we already do that with the inliner, and ThinLTO then has to approximate those inlining decisions, so it might not be the end of the world to have some divergence).

-- Sean Silva

Does this have any interaction with mayBeDerefined? E.g., different results of the static profile in different compilation units causing problems. I think it should be okay, since the transformations that look at the PGO data are, I believe, the ones responsible for being safe when !mayBeDerefined, but I want to get a second opinion.

-- Sean Silva

> IIUC, this proposal is just saying that we should infer a static profile for entry counts just like we do for branch probabilities. ... If that is correct then this makes perfect sense to me.

Yes, that is the proposal.

> It could be argued that we ought to refactor things so that the raw PGO metadata is only ever accessed via a wrapper CGSCC analysis that falls back to interprocedural analysis ...

Wrapping function entry counts behind an analysis might be doable but requires more careful thought. Right now getEntryCount is called from various passes: the inliner, loop passes (unroll, loop sink), CodeGenPrepare (for function layout), machine block placement, module summary analysis, ... This might be worth attempting when we fully migrate to the new pass manager, but not now.


> Wrapping function entry counts behind an analysis might be doable but requires more careful thought.

Definitely.

> Right now getEntryCount is called from various passes: the inliner, loop passes (unroll, loop sink), CodeGenPrepare (for function layout), machine block placement, module summary analysis, ...

It seems like many of those places that directly reference getEntryCount
are using it to check "HasProfileData". So adding synthetic values would
cause us to take a lot of code paths in the compiler that were intended to
use real profile data.

As a simple example, MachineBlockPlacement::collectLoopBlockSet uses MBFI
and MBPI guarded by a check of getEntryCount to see if PGO data is present
(the comment even says "This needs precise profile data and we only do this
when profile data is available."). Do we need to refactor things first to
use a more semantically meaningful check, like "hasRealProfileData" so that
synthetic counts don't fool this code into thinking it has real profile
data? At least, comments like the one I quoted need to be updated to
indicate that a static profile is considered precise enough.

Also, one other question: in the case of FullLTO, how do we ensure that the synthetic profile data from static heuristics is consistent across TUs? Do we need to recalculate the synthetic profile data? If we need to recalculate it, presumably we need to know which profile data is synthetic or not. E.g., in a scenario where most TUs are compiled with profile data but some are not (e.g., a vendor or runtime lib delivered as bitcode), recalculated synthetic counts at FullLTO time should honor real profile counts but ignore synthetic counts from per-TU compilation.
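
A sketch of that recalculation policy, assuming counts are tagged with their provenance (the types here are illustrative stand-ins, not LLVM's):

    #include <cstdint>
    #include <optional>
    #include <vector>

    enum class CountKind { Real, Synthetic };
    struct ProfileCount { uint64_t Count; CountKind Kind; };
    struct Function { std::optional<ProfileCount> EntryCount; };

    // At FullLTO time, drop per-TU synthetic counts so they can be
    // recomputed consistently over the whole program; real (measured)
    // counts are kept and act as fixed anchors for the propagation.
    void reseedCountsForLTO(std::vector<Function *> &AllFunctions) {
      for (Function *F : AllFunctions)
        if (F->EntryCount && F->EntryCount->Kind == CountKind::Synthetic)
          F->EntryCount.reset();
    }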

The PGO+FullLTO case with a runtime lib delivered as bitcode is not
hypothetical; this was actually a common configuration used in production
while I was at PlayStation. CC'ing Davide, River to confirm that they still
care about this case.

Also, more generally, can you clarify how your static entry count
calculation algorithm is expected to work in the presence of partial
profile data?

-- Sean Silva

> Do we need to refactor things first to use a more semantically meaningful check, like "hasRealProfileData", so that synthetic counts don't fool this code into thinking it has real profile data?

I think the existing code that checks 'hasEntryCount()' for
'isProfileAvailable' information will need to be refactored to query
'isRealProfileAvailable'.
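
A minimal sketch of what that split could look like; ProfileCount, CountKind, and the query names are illustrative, not the existing LLVM interfaces:

    #include <cstdint>
    #include <optional>

    enum class CountKind { Real, Synthetic };

    struct ProfileCount {
      uint64_t Count;
      CountKind Kind; // provenance: measured profile vs. static heuristic
    };

    struct Function {
      std::optional<ProfileCount> EntryCount;

      // True whenever any entry count is present, real or synthetic.
      bool hasProfileData() const { return EntryCount.has_value(); }

      // What passes that need measured data (e.g. the block-placement
      // code quoted above) would query instead of hasProfileData().
      bool hasRealProfileData() const {
        return EntryCount && EntryCount->Kind == CountKind::Real;
      }
    };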

> Also, more generally, can you clarify how your static entry count calculation algorithm is expected to work in the presence of partial profile data?

I think the BFI information from the modules with real profile data can be used for the IPA frequency propagation, but the 'real' entry count information will need to be dropped. The entry counts for those library functions are collected and aggregated from the contexts of different hosting binaries, so their usefulness is very questionable. They need to be 'normalized' in the context of the target binary, which is what the synthetic entry count computation does.

David

> The entry counts for those library functions are collected and aggregated from the contexts of different hosting binaries, so their usefulness is very questionable.

The case at PlayStation was one where the vendor bitcode libraries simply
did not have PGO data at all.

> They need to be 'normalized' in the context of the target binary, which is what the synthetic entry count computation does.

Real entry counts might still be useful. For example, in a GUI or web server app there may be a large number of functions that are called only indirectly via an event loop that lives in a system library (e.g. inside a libevent or Qt .so file). The collected branch probabilities are not enough to decide that one handler is much hotter than another. Is the plan to not support this case for now? (I.e., we would require you to build the event loop library with PGO so that indirect call profiling can infer the hotness data.)

-- Sean Silva

> Real entry counts might still be useful. ... Is the plan to not support this case for now? (I.e., we would require you to build the event loop library with PGO so that indirect call profiling can infer the hotness data.)

Unless we can indirect-call promote those handlers and inline them, the global entry info for these handlers won't matter much, though.

David

> Unless we can indirect-call promote those handlers and inline them, the global entry info for these handlers won't matter much, though.

What if two handlers call the same helper, but one handler is very hot
compared with the other? We should prefer inlining into the hot handler.
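
To make the scenario concrete (the entry counts in the comments are made up):

    // helper() has two callers. With entry counts on the handlers, the
    // callsite inside hot_handler() carries a much larger profile
    // count, so the inliner can prefer inlining helper() there.
    void helper() { /* shared helper, inline candidate */ }

    void hot_handler()  { helper(); }  // entry count ~1,000,000
    void cold_handler() { helper(); }  // entry count ~10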

-- Sean Silva

> What if two handlers call the same helper, but one handler is very hot compared with the other? We should prefer inlining into the hot handler.

Yes, the entry count can differentiate in this case. On the other hand, with a partial profile we really do not have good/complete global profile summary information. This makes it harder to estimate global hotness even with entry counts.

By the way, the partial-profile case is probably also interesting to us -- but in the context of sample PGO. We will need to find a good way to deal with it once we have the basic infrastructure ready for synthetic profile data.

thanks,

David

> Yes, the entry count can differentiate in this case. On the other hand, with a partial profile we really do not have good/complete global profile summary information. ...

Yes, this is the main challenge for the partial-profile case, and I do not plan to handle the mixed-profile case initially.