PGO information at LTO/thinLTO link step

Hello,

My team and I noticed that callsite hotness information is not preserved from the compile step to the link step with LTO/thinLTO enabled. As a result, the link-step inlining pass remains conservative when inlining callsites known to be hot (i.e. it does not apply the 'HotCallSiteThreshold', which defaults to 3000). Many cross-module inlining opportunities are likely lost this way, which diminishes the benefit of using LTO/thinLTO and PGO together.

In general, does LLVM pass any profiling information through the IR to the link step other than branch probabilities and function entry counts? If not, are there plans to do so in the future? For inlining specifically, perhaps we could mark callsites with hot/cold attributes during the compile step to ensure that LTO inlining applies the appropriate threshold bonuses/penalties.
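
For illustration, a rough sketch of what per-callsite profile annotation looks like in LLVM IR today: SamplePGO records direct-call counts as `!prof` branch_weights metadata on the call itself, alongside the function entry count. The names and numbers here are invented for illustration:

```llvm
; Illustrative only: a caller whose entry count and a hot callsite's
; call count are both recorded as !prof metadata in the IR.
define void @caller() !prof !0 {
entry:
  call void @hot_callee(), !prof !1
  ret void
}

declare void @hot_callee()

!0 = !{!"function_entry_count", i64 1000}  ; PGO entry count for @caller
!1 = !{!"branch_weights", i32 9000}        ; callsite executed ~9000 times
```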

Any thoughts/insights/comments would be appreciated.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu@ca.ibm.com

With the new PM, the profile information should be properly updated across inline transformations. If you see otherwise, please file a bug …

David

Hello,

> My team and I noticed that callsite hotness information is not preserved
> from compile to link step with LTO/thinLTO enabled. As a result, the link
> step inlining pass remains conservative when inlining callsites known to be
> hot (ie. without the 'HotCallSiteThreshold' which is set at 3000 by
> default). There are likely many cross-module inlining opportunities lost
> this way, and diminishes the benefit of using LTO/thinLTO+PGO together.

The callsite hotness is passed via the IR, so it should be there in the
LTO/ThinLTO backends (during the link step). Can you provide a reproducer
where that isn't happening?
Teresa

Hi Teresa,

Actually, enabling the new pass manager manually seems to have solved this issue, so this problem only occurs with the old pass manager.

Thanks,


> Hi Teresa,
>
> Actually, enabling the new pass manager manually seems to have solved this
> issue, so this problem is only valid for the old pass manager.

It should not be an issue in the old PM either - the callsite hotness is
passed via IR. As David mentioned, the new PM inliner does a better job of
updating call hotness after inlining, but it should be there (some things
might look hotter than they should, which seems to be the opposite of the
problem you are hitting). Can you send me a reproducer with the old PM?

Teresa


>> Hi Teresa,
>>
>> Actually, enabling the new pass manager manually seems to have solved
>> this issue, so this problem is only valid for the old pass manager.
>
> It should not be an issue in the old PM either - the callsite hotness is
> passed via IR.

More precisely, the function entry counts are passed via IR. With the old
PM, we don't have callsite hotness information, but the callee's entry count
is used to boost the threshold.


>>> Hi Teresa,
>>>
>>> Actually, enabling the new pass manager manually seems to have solved
>>> this issue, so this problem is only valid for the old pass manager.
>>
>> It should not be an issue in the old PM either - the callsite hotness is
>> passed via IR.
>
> More precisely, the function entry counts are passed via IR. With the old
> PM, we don't have callsite hotness information, but callee's entry count
> is used to boost the threshold.

Thanks for the clarification. (But essentially there should be no
difference between the profile info in the IR in the compile step vs. the
link, aka ThinLTO backend, step - the inliner in both cases is working off
the same profile info in the IR.)


Thanks Easwaran. This is what we've observed as well: the old PM inliner was only looking at hot/cold callee information, which carries significantly smaller boosts/penalties compared to callsite information.

Teresa, do you know if there is any documentation/video/presentation on how PGO information is represented in LLVM and what information is passed via the IR? I'm having some difficulty getting the big picture from the code.


Teresa Johnson (10/03/2017 05:00:11 PM) wrote:

> Thanks Easwaran. This is what we've observed as well, where the old PM
> inliner was only looking hot/cold callee information, which have
> signficantly smaller boosts/penalties compared to callsite information.
>
> Teresa, do you know if there is some documentation/video/presentation on
> how PGO information is represented in LLVM and what information is passed
> via the IR? I'm finding some difficulty in getting the big picture via the
> code.

The documentation I am aware of is in the Language Ref and a subpage linked
from here:
https://llvm.org/docs/LangRef.html#prof-metadata

If that doesn't help let me know and I can point you to someone who would
know (if I can't answer it myself).

Teresa


> Thanks Easwaran. This is what we've observed as well, where the old PM
> inliner was only looking hot/cold callee information, which have
> signficantly smaller boosts/penalties compared to callsite information.
>
> Teresa, do you know if there is some documentation/video/presentation on
> how PGO information is represented in LLVM and what information is passed
> via the IR? I'm finding some difficulty in getting the big picture via the
> code.

In a nutshell, there are two main types of profile data (BB/edge related):

1) branch probability
2) function entry count.

PGO instrumentation actually collects BB counts. During the profile annotation
pass, the branch probabilities (weights) are computed from that profile data
and annotated onto the IR (the branch instructions). After profile annotation,
the block count information is dropped, except for the function entry count,
which is annotated on the function entry.
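
The annotation described above might look roughly like this in the IR after the PGO annotation pass (a hedged sketch; the numbers are invented): branch weights live on the branch instruction, and only the entry count survives at the function level:

```llvm
; Sketch of post-annotation IR: per-block counts are gone; what remains
; is the function entry count plus branch weights on terminators.
define i32 @foo(i1 %c) !prof !0 {
entry:
  br i1 %c, label %then, label %else, !prof !1

then:                ; taken ~1000 times
  ret i32 1

else:                ; taken ~24 times
  ret i32 2
}

!0 = !{!"function_entry_count", i64 1024}
!1 = !{!"branch_weights", i32 1000, i32 24}
```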

The BPI/BFI passes can recompute the block frequency data from the branch
probability info. The BB frequency is intra-function and has no global
meaning. To recompute the profile count for a BB, the function entry profile
count is multiplied by the ratio of the BB's frequency to the entry
frequency. For example, with an entry count of 1000, an entry frequency of 8,
and a BB frequency of 24, the BB's recomputed count is 1000 * 24 / 8 = 3000.

This scheme works really well except for CFGs with irreducible loops, for
which BFI's frequency propagation cannot reconstruct the frequency/count
properly. Hiroshi is working on a patch to fix the problem.

For AutoFDO, there is also a sample count profile for callsites, which is
needed because the function entry may not have any samples.

The value profiler also has its own metadata for value histograms at a given
value site. At the module level, the program profile summary data is also
represented so that optimization passes can query global hotness info.
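
For reference, a hedged sketch of the two kinds of metadata mentioned here, as they appear in the IR (the hashes and counts are invented; kind `i32 0` in the `"VP"` tuple denotes the indirect-call-target value kind):

```llvm
; Illustrative only: value-profile metadata on an indirect call, plus the
; module-level profile summary that passes query for global hotness.
define void @dispatch(void ()* %fp) {
entry:
  call void %fp(), !prof !0   ; value profile: (target hash, count) pairs
  ret void
}

!0 = !{!"VP", i32 0, i64 1600, i64 123456789, i64 1500, i64 987654321, i64 100}

!llvm.module.flags = !{!1}
!1 = !{i32 1, !"ProfileSummary", !2}
!2 = !{!3, !4}
!3 = !{!"ProfileFormat", !"InstrProf"}
!4 = !{!"TotalCount", i64 10000}
```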

David


Awesome. Thanks David, Teresa.
