I’ve recently implemented profile-guided section layout in llvm + lld using the Call-Chain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . On the programs I’ve tested it on, I’ve seen from 0% to 5% performance improvement over standard PGO, with zero cases of slowdowns and up to 15% reduction in ITLB misses.
There are three parts to this implementation.
The first is a new llvm pass which uses branch frequency info to get counts for each call instruction and then adds a module flags metadata table of function -> function edges along with their counts.
The second takes the module flags metadata and writes it into a .note.llvm.callgraph section in the object file. This currently just dumps it as text, but could save space by reusing the string table.
The last part is in lld. It reads the .note.llvm.callgraph data from each object file and merges them into a single table. It then builds a call graph from the profile data and iteratively merges the hottest call edges using the C³ heuristic, as long as doing so would not create a cluster larger than the page size. All clusters are then sorted by a density metric to further improve locality.
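As a rough illustration of the merging and density-sort steps described above, here is a minimal Python sketch of the C³ heuristic. The function names, the 4 KiB page-size limit, and the exact density metric (accumulated edge weight divided by cluster size) are assumptions for illustration, not the actual lld implementation.

```python
PAGE_SIZE = 4096  # assumed cluster size limit (one page)

def c3_layout(sizes, edges):
    """sizes: {function: size in bytes}; edges: [(weight, caller, callee)].
    Returns the functions in their final layout order."""
    clusters = {f: [f] for f in sizes}   # cluster leader -> member functions
    leader = {f: f for f in sizes}       # function -> its cluster's leader
    hotness = {f: 0 for f in sizes}      # accumulated edge weight per cluster

    # Visit call edges hottest-first; merge the callee's cluster into the
    # caller's unless the combined cluster would exceed the page size.
    for weight, caller, callee in sorted(edges, reverse=True):
        a, b = leader[caller], leader[callee]
        if a != b:
            merged_size = sum(sizes[f] for f in clusters[a] + clusters[b])
            if merged_size <= PAGE_SIZE:
                for f in clusters[b]:
                    leader[f] = a
                clusters[a].extend(clusters.pop(b))
                hotness[a] += hotness.pop(b)
        hotness[leader[caller]] += weight

    # Sort clusters by density (hotness per byte) to improve locality.
    def density(members):
        return hotness[leader[members[0]]] / sum(sizes[f] for f in members)

    ordered = sorted(clusters.values(), key=density, reverse=True)
    return [f for members in ordered for f in members]
```

With a hot main -> a -> b call chain, the three functions end up adjacent in one cluster, mirroring the call order.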
There are a couple issues with the current implementation that shouldn’t be too hard to fix.
The .note.llvm.callgraph sections add about 10% to the size of each object file. This can be fixed by reusing the symbol string table instead of duplicating all the strings.
The ordering algorithm is O(n²) in the number of call graph edges. This is because it uses a trivial algorithm, and can be fixed by doing something more intelligent, such as processing the edges in a single pass after sorting them by weight.
It doesn’t currently work for LTO as the llvm pass needs to be run after all inlining decisions have been made and LTO codegen has to be done with -ffunction-sections.
It’s only implemented for ELF.
I’ve attached the llvm and lld patches and would appreciate feedback on the overall design, the method of threading the profile data to the linker, and ideas for an improved ordering algorithm. No need for a detailed code review as I’ll be making changes, adding tests, and posting proper patches for review.
pgcl-lld.diff (9.15 KB)
pgcl-llvm.diff (8.66 KB)
Hi Michael,
This is cool stuff, thanks for sharing!
The first is a new llvm pass which uses branch frequency info to get counts for each call instruction and then adds a module flags metadata table of function -> function edges along with their counts.
The second takes the module flags metadata and writes it into a .note.llvm.callgraph section in the object file. This currently just dumps it as text, but could save space by reusing the string table.
Have you considered reading the profile in the linker and extracting that information directly from the profile? The profile should contain call sites and their sample counts and you could match these up with relocations (calls) in the section?
It doesn't currently work for LTO as the llvm pass needs to be run after all inlining decisions have been made and LTO codegen has to be done with -ffunction-sections.
So this is just an implementation issue, right? You can make LTO run with -ffunction-sections (by setting TargetOptions.FunctionSections=true) and insert your pass in the appropriate place in the pipeline.
Thanks,
Tobias
Have you considered reading the profile in the linker and extracting that
information directly from the profile? The profile should contain call
sites and their sample counts and you could match these up with relocations
(calls) in the section?
I did this using IR PGO instead of sample PGO, so the profile data can only
be applied at the same place in the pipeline it is generated. Even for
sample-based profiles this would be complicated, as the linker would actually
need to reconstruct machine basic blocks from sections to be able to
accurately match sample counts to relocations, since there may be cold calls
in hot functions.
It may be useful, however, for the linker to directly accept an externally
generated call graph profile. The current approach can actually do this by
embedding it into an extra object file.
It doesn't currently work for LTO as the llvm pass needs to be run after
all inlining decisions have been made and LTO codegen has to be done with
-ffunction-sections.
So this is just an implementation issue, right? You can make LTO run with
-ffunction-sections (by setting TargetOptions.FunctionSections=true) and
insert your pass in the appropriate place in the pipeline.
Yeah, just an implementation issue. I just need to build the pass pipeline
differently for LTO and add a way to do -ffunction-sections in lld.
- Michael Spencer
Great! Cc’ing Sriraman, who implemented something like this several years back on our gcc google branch. We may want to utilize your work in the gold-plugin as well.
Have you considered reading the profile in the linker and extracting that
information directly from the profile? The profile should contain call
sites and their sample counts and you could match these up with relocations
(calls) in the section?
The main reason is that IPO transformations such as inlining and cloning
will change the hotness of functions, so the original profile cannot be used
directly for function layout. There is similar support in the Gold plugin
for Google GCC.
David
Have you considered reading the profile in the linker and extracting
that information directly from the profile? The profile should contain call
sites and their sample counts and you could match these up with relocations
(calls) in the section?
The main reason is that IPO transformations such as inlining and cloning
will change the hotness of functions, so the original profile cannot be used
directly for function layout. There is similar support in the Gold plugin
for Google GCC.
Will this cause issues with ThinLTO? E.g. the ThinLTO backends are doing
inlining of imported functions. Do we have a mechanism for those decisions
to be reflected in a global profile for the linker to look at?
In theory the ThinLTO backends could keep a history of their IPO decisions in
metadata or something and emit a section for the linker to aggregate and
reconstruct an accurate global profile, but that seems relatively invasive.
-- Sean Silva
Will this cause issues with ThinLTO? E.g. the thinlto backends are doing
inlining of imported functions. Do we have a mechanism for those decisions
to be reflected in a global profile for the linker to look at?
In theory the thinlto backends can keep a history of their IPO decisions
in metadata or something and emit a section for the linker to aggregate and
reconstruct an accurate global profile, but that seems relatively invasive.
Yes, it will cause problems, an issue also known from GCC's LIPO. We have an
intern working on that problem.
David
Yes, it will cause problems, an issue also known from GCC's LIPO. We have an
intern working on that problem.
Nice! What approach is being used to solve it?
-- Sean Silva
Nice! What approach is being used to solve it?
The method and results will be shared at some point. One approach is what
you mentioned, which uses metadata and passes it to the linker plugin. Another
way to be explored is to let the thin-link phase make very quick cross-module
inlining decisions globally and teach the backend inliner to honor the
results. The global inline decisions are not intended to replace the backend
inline analysis. The global analysis can, for instance, 1) expose linker GC
opportunities; 2) focus on hot functions with few callsites or very few
really hot callsites; 3) identify the single module that has the most # of
calls to a function and assign that module to be the final info updater;
4) decide the post-link hotness for those functions whose decisions are made
at thin-link time.
David
Michael,
I have tried this patch, adapting it to our own linker, and it works great. I would like to see this implemented on master.
Please let me know if I can do anything to facilitate this.
Sergei
Since the branch frequency info is in an llvm-specific format, it makes
sense for llvm to read it instead of expecting lld to do it again. Since
.o files are how the compiler talks to the linker, it also makes sense
for llvm to record the required information there.
In the same way, since the linker is the first place with global
knowledge, it makes sense for it to be the one that implements a section
ordering heuristic, instead of just being told an order by some other tool,
which would complicate the build.
However, do we need to start with instrumentation? The original paper
uses sampling with good results and current intel cpus can record every
branch in a program.
I would propose starting with just an lld patch that reads the call
graph from a file. The format would be very similar to what you propose,
just weight,caller,callee.
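To make the decoupling concrete, a reader for such a file could be as small as the following Python sketch. The comment syntax, whitespace handling, and the choice to sum duplicate edges are illustrative assumptions, not part of the proposal.

```python
from collections import defaultdict

def read_call_graph(lines):
    """Parse 'weight,caller,callee' lines into an aggregated edge table.
    Weights for duplicate edges (e.g. from several profile runs) are summed."""
    edges = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and (assumed) comment lines
        weight, caller, callee = line.split(",")
        edges[(caller, callee)] += int(weight)
    return dict(edges)
```

The resulting table is the same function -> function edge map the .note.llvm.callgraph path produces, so both inputs could feed the same ordering code.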
In another patch we can then look at instrumentation: why it is more
convenient for some uses and what performance advantage it might have.
I have written a small tool that uses intel_bts and 'perf script' to
construct the call graph. I am giving it a try with your lld patch and
will hopefully post results today.
Cheers,
Rafael
Hi Rafael,
However, do we need to start with instrumentation? The original paper
uses sampling with good results and current intel cpus can record every
branch in a program.
I would propose starting with just an lld patch that reads the call
graph from a file. The format would be very similar to what you propose,
just weight,caller,callee.
The advantage of the proposed approach (weighted callgraph section) is that it's completely transparent: it works regardless of the particular profiling methodology (as long as there's !prof metadata when the pass runs). For this reason, it fits neatly into an *existing* PGO-based build flow. I only need to add one compiler flag to enable it. That's a big plus.
On the other hand, I could see how your idea (callgraph input file for linker) would be useful in situations where I just want to do section layout but no PGO in the compiler... and of course for testing of the linker's sorting algorithm.
So there are merits in both, but for my use cases Michael's original approach is the most practical.
Tobias
Tobias Edler von Koch <tobias@codeaurora.org> writes:
The advantage of the proposed approach (weighted callgraph section) is
that it's completely transparent: it works regardless of the particular
profiling methodology (as long as there's !prof metadata when the pass
runs). For this reason, it fits neatly into an *existing* PGO-based
build flow. I only need to add one compiler flag to enable it. That's a
big plus.
On the other hand, I could see how your idea (callgraph input file for
linker) would be useful in situations where I just want to do section
layout but no PGO in the compiler... and of course for testing of the
linker's sorting algorithm.
So there's merits in both, but for my use cases Michael's original
approach is the most practical.
Yes, I must stress that I am not proposing having the option of reading
the callgraph from another file *instead* of reading it from .o
files, just that doing it first decouples most of the lld patch from the
llvm changes and can be useful in cases where only samples are available.
Cheers,
Rafael
A rebased version of the lld patch is attached.
Cheers,
Rafael
lld.diff (9.13 KB)
Absolutely!
Michael/Rafael, would you mind uploading the patches to Phabricator so people can review them there? I think there is enough interest in this feature to move towards merging it in tree.
Thanks,
Tobias
I'm definitely interested in seeing this move forward (Phabricator
will make the review easy). Quick question: why is the option phrased
as a negative (i.e. --disable instead of --enable)?
Thanks,
Both sample and instrumentation profiles are turned into the same profile metadata in the IR, which is what feeds BranchFrequencyInfo. So it should be naturally agnostic to which method is used.
– Sean Silva
I updated the patch to read a call graph from a text file.
I tested it with the attached calls.txt from lld linking chromium.
Unfortunately the resulting lld doesn't seem any faster. One thing I
noticed is that the most used symbols seem to be at the end of the
file.
In any case, can you add tests and send the lld patch for review?
Thanks,
Rafael
lld.diff (8.47 KB)
calls.txt (434 KB)
I'm fine with approaching this from lld first, as the input reading is the
simplest part of this. I started with instrumentation-based profiling
because my target program already had the infrastructure set up for it.
- Michael Spencer
I updated the patch to read a call graph from a text file.
I tested it with the attached calls.txt from lld linking chromium.
Unfortunately the resulting lld doesn't seem any faster. One thing I
noticed is that the most used symbols seem to be at the end of the
file.
Not all programs will get faster, as it really depends on the working set
size of the code. While lld is kinda large, if you're not running LTO then
the instruction working set is pretty small.
In any case, can you add tests and send the lld patch for review?
Yep, should have that tomorrow.
- Michael Spencer