[RFC] Propeller: A frame work for Post Link Optimizations

tmsriram · October 9, 2019, 9:10pm

Next, we would like to address questions on performance data regarding
clang improvements with BOLT and Propeller:

TLDR; We have provided a Makefile in the git repo, under
llvm-propeller/splo, to compare BOLT and Propeller performance. “make
check-performance” with pointers to the BOLT binaries should do the
comparisons.

In terms of performance optimization for code layout, fundamentally,
BOLT and Propeller are similar. In fact, in one experiment, we
extracted the basic block order from BOLT and just used basic block
sections and linker symbol ordering to perform the same order during
linking. The performance was identical and the ordering was exactly
as expected.

Some recent changes in Propeller has caused it to regress by a small
amount and is sometimes getting hidden by the small noise. We are
debugging the performance issue and it looks like an alignment issue
or a second order effect. Currently, because of this bug, BOLT is
faster than Propeller by about 0.5% to 1% on Clang benchmark.

tmsriram · October 10, 2019, 8:34pm

We have addressed concerns about Debug Info and Overhead below and
talked about Debug Fission which is very important when talking about
scalability of builds.

Also, we have been addressing each topic piecemeal to keep discussions
focussed. I will provide a consolidated document putting together our
clarifications on all the concerns raised.

Large binaries are built using debug fission to as it significantly
improves the scalability of debug builds,
DebugFission - GCC Wiki. Quoting from the page which
again reiterates the importance of scalability and incremental builds,
Fission was invented to mitigate the following issues for debug builds
of large applications:

* Out of memory errors
* Slow link times
* Slow gdb startup times

Debug Fission essentially splits the dwarf info into a separate object
file and the link action memory overheads are smaller as the input
object files and the resulting binary are smaller.

Propeller supports both fission and non-fission builds. Even with
fission, the dwarf objects are not relocatable, so the extra
relocations needed by Propeller would still go into the main object
file that contains code. Also, Propeller uses extra relocations for
location lists too.

We looked at BOLT and it does not seem to have any support for
binaries built with fission, something we missed mentioning in the
original RFC. Fission supports a mode which allows the individual
dwarf objects, dwo files, to be linked into a single dwp file and
BOLT would have to also rewrite the fission object file and likely in
the same process.

Another concern raised was regarding debugging overhead. The specific
concern here is that having more debug ranges with basic block
sections might increase overheads of debugging tools. We have not
evaluated this, any benchmarks that could measure these overheads?
Also, this can be mitigated by just exposing debug_ranges to lld. The
various ranges created for each basic block are contiguous as most
basic blocks end up together and these ranges can be collapsed by the
linker into a larger range.

jyknight · October 11, 2019, 5:45pm

Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?

Or, could the propellor-optimized compile step instead split each function into only 2 pieces – one containing an “optimally-ordered” set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn’t actually need that flexibility?

Apologies if this is obvious for those who actually know what they’re talking about here.

Xinliang_David_Li · October 11, 2019, 6:25pm

Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?

Or, could the propellor-optimized compile step instead split each function into only 2 pieces – one containing an “optimally-ordered” set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn’t actually need that flexibility?

Apologies if this is obvious for those who actually know what they’re talking about here.

It is a fair question.

We believe the flexibility to do fine grained layout in whole program context is important. PostLinkOptimization is aimed at getting as much performance improvement as possible (usually applied on top of ThinLTO+PGO), so the framework is designed to enable it.

In particular, it allows the linker to stitch hot bb traces from different functions to be stitched together. It also allows hot trace duplication across procedure boundaries (kind of interprocedural tailDup). Besides, code alignment decisions to minimize branch mispredictions may require global context (e.g, too conflicting branches residing in two different functions). Other micro-arch specific optimizations to improve processor front-end throughput may also require global context.

It is conceivable to have an option to control the level of granularity at the possible cost of performance.

thanks,

David

tmsriram · October 14, 2019, 6:43pm

Hello,

I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it.

BOLT’s performance gains inspired this work and we believe BOLT

is a great piece of engineering. However, there are build environments where
scalability is critical and memory limits per process are tight :

Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily
invented to achieve scalability and better incremental build times while
building large binaries with debug information.
ThinLTO,
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html was
primarily invented to make LLVM’s full LTO scalable and keep the memory and
time overheads low. ThinLTO has enabled much broader adoption of whole
program optimization, by making it non-monolithic.
For Chromium builds,
https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha
in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO.
It was 26GB with Full LTO before that and individual processes will run of out
of memory beyond that.
Here,
https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib
utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale
is shown where 5 million binary and test builds are performed every day on
several thousands of machines, each with a limitation of 12G of memory per
process and 15 minute time-out on tests. Memory overheads of 35G (clang) are
well above these thresholds.

We have developed Propeller like ThinLTO that can be used to obtain similar
performance gains like BOLT in such environments.

Thanks
Sri

concerns.txt (11.1 KB)

maksfb · October 16, 2019, 10:52pm

Hi Sri,

I want to clarify one thing before sending a detailed reply: did you evaluate

BOLT on Clang built with basic block sections? In the makefile you reference,

there are two versions: a “vanilla” and a default built with function sections.

High overheads you see with BOLT on Clang do not match our experience.

Thanks,

Maksim

tmsriram · October 17, 2019, 9:58pm

Hello Maksim,

Hi Sri,

I want to clarify one thing before sending a detailed reply: did you evaluate

BOLT on Clang built with basic block sections?

In the makefile you reference,

there are two versions: a “vanilla” and a default built with function sections.

High overheads you see with BOLT on Clang do not match our experience.

Thanks for pointing that out in the Makefile. We double-checked and noticed a bug in our Makefile. For clang, we noticed that we are BOLTING with basic block symbols which seems to affect the memory consumption of BOLT. So, we have re-measured with recent bolt and for full transparency we have uploaded the binaries, BOLT’s yaml files and perf.data files and the commands so that you can independently verify our results and check the binaries. We have gzipped all the files to keep it under 2G limit for git lfs, everything is here : https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment We have run our experiments on a 192G machine with Intel 18 core.

We built llvm-bolt with most recent sources and is pristine with none of our patches and uploading the binary we used here, https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt That’s a very recent BOLT binary, git hash: 988a7e8819b882fd14e18d149f8d3f702b134680

The https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains two sets of binaries. The first binary is pristine recent clang-10 built from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary also only additionally built with -q. There are no labels or sections or any other Propeller flags used to build these clang binaries. Here is the command we are using to optimize with BOLT, all the commands have been checked in too.

You should be able to run llvm-bolt now on these binaries as all the files are provided. We have also provided the raw perf data files in case you want to independently convert.

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 --update-debug-sections

For version 2, this is the number:

Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40

Maximum resident set size (kbytes): 18742688

That is 125 seconds and ~18G of RAM.

For version 1, this hangs and we stopped it after several minutes and the maximum RSS size crossing 50G. This is most likely just a bug and you should be able to reproduce this. The binary seems to be running and printing update messages.

We also measured without update-debug-sections too with the command :

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1

For version1 :
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74

Maximum resident set size (kbytes): 14824444

93 seconds and ~14G of RAM

version 2 :
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90

Maximum resident set size (kbytes): 14511912

similar 91 secs and ~14G

Now, coming back to the bug in the Makefile, we originally reported ~35G. That is wrong since the clang binary used to measure bolt overheads was built with basic block labels. Our sincere apologies for this, this showed BOLT as consuming more memory than is actual for clang. We double-checked BOLT numbers with the internal benchmark search2 for sanity and that is built without any labels and only with “-Wl,-q”. We are checking the other large internal benchmarks too. We cannot disclose internal benchmarks. So, we will get more large open-source benchmarks like Chrome or gcc built with bolt and share the binaries and results so you can independently verify.

Thanks
Sri

maksfb · October 18, 2019, 5:56pm

Cool. The new numbers look good. If you run BOLT with jemalloc library

preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that

compared to the default malloc, it improves the multithreaded

performance and brings down memory usage significantly.

Thanks,

Maksim

tmsriram · October 18, 2019, 6:20pm

Hello Maksim,

Cool. The new numbers look good. If you run BOLT with jemalloc library

preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that

compared to the default malloc, it improves the multithreaded

performance and brings down memory usage significantly.

Great, thanks for confirming! Would you be willing to share specific numbers, how significant is the reduction in memory with jemalloc for clang? We double-checked our numbers with the larger benchmarks and we can confirm they were not built with labels. One of our large benchmarks, search1, is about 5 times the size of clang in terms of text size as reported by size command, and we are seeing a 70G memory overhead on this. Do you have data on the memory consumption of BOLT with larger benchmarks with jemalloc. We are trying to build Chrome with latest BOLT so that we can share the memory overheads and the binaries with you for transparency but we are struggling with the disassembly errors. If you have data on large benchmarks we would appreciate it if you could share it.

Further, if you have a recipe to use jemalloc with BOLT, please point it at us. We could try it out too and share our findings.

Thanks much,
Sri

maksfb · October 22, 2019, 5:07am

Hi Sri,

Thank you for replying to our feedback. 7 out 12 high-level concerns have been

answered; 2 of them are fully addressed. The rest are being tracked at the

following Google doc:

To keep the discussion at a high level, I have to reference the LTO vs ThinLTO

comparison since that appears to be the central theme in your response to the

feedback.

Unlike LTO, BOLT does not have to keep the entire program in memory.

Furthermore, as we have previously mentioned, most of the passes are run in

parallel, and the performance scales well with the number of CPUs.

To demonstrate that running BOLT on a hot subset of functions is not just a

speculation, we have prototyped a “Thin” version that optimizes Clang-7 in under

15 seconds using less than 4GB of memory. No modifications to the linker or

compiler were required. And by the way, that appears to be faster than just the

re-linking phase of the Propeller. Larger loads show similar improvements

providing 2x-5x savings over the original processing specs.

Let me reiterate that current BOLT requires large amounts of memory not because

it’s a fundamental limitation, unlike LTO. For us, system memory was never a

constraint. The runtime of the application, not BOLT, was the primary goal

during the development.

ThinLTO design solves a real problem and dramatically improves compilation time

even when building on a single node. ThinLTO results provide "end-to-end build

time" comparison to LTO. I’ve asked you to show a similar comparison for

Propeller vs. BOLT. I haven’t seen the results, and I suspect the total overhead

will exceed that of even the oldest non-parallel version of BOLT.

One argument I’ve heard is that BOLT is not taking advantage of the distributed

build system. That’s correct. It does not have to since it does not require to

rebuild the application. In “Thin” mode, the overhead is similar to a regular

linker running with a linker script.

You are right that we do not support debug fission packages. It is unimplemented

for a simple reason: no one asked for it previously. And as we like to say in

the open-source community: “patches are welcome.”

Maksim

P.S. We have updated https://github.com/facebookincubator/BOLT with instructions on running BOLT with jemalloc or tcmalloc.

tmsriram · October 22, 2019, 5:38am

We are going to be at the llvm-dev meeting the next two days. We will get back to you after that.

Sri

Xinliang_David_Li · October 22, 2019, 7:03am

Hi Maksim, see my reply inline below.

Hi Sri,

Thank you for replying to our feedback. 7 out 12 high-level concerns have been

answered; 2 of them are fully addressed. The rest are being tracked at the

following Google doc:

https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/

To keep the discussion at a high level, I have to reference the LTO vs ThinLTO

comparison since that appears to be the central theme in your response to the

feedback.

Unlike LTO, BOLT does not have to keep the entire program in memory.

IMO, there is no fundamental difference there. Partial (non Whole Program) LTO does not need to put everything in memory either, but it is still MonoLTO, and does not make it ThinLTO.

Like ThinLTO, Propeller is designed with the following principles in mind:

The serial step of Propeller (either an integrated component of linker or a plugin to it) is made as thin as possible – it works on lightweight summary like data
The serial step makes global decisions and do not make transformations
Code transformations are jobs of compiler backend and linker (reordering etc) and there will be clear interface between Propeller and the transformers (longer term with persistent McInst support)
Propeller is designed to be turned on for tens of thousands of build targets, in other words, it should be turned on for everyone by default. This means the memory footprint for the optimization much be reasonably small. It should integrate easily with distributed build systems.
Related to goal 4) there, disassembling should be avoid as much as possible. We have experienced all kinds of problems with it, and it can be a constant source of pain (not to mention the memory overhead).

Furthermore, as we have previously mentioned, most of the passes are run in

parallel, and the performance scales well with the number of CPUs.

If a pass can be run in parallel (e.g, function level), then PLO may be the wrong place to do such optimizations.

To demonstrate that running BOLT on a hot subset of functions is not just a

speculation, we have prototyped a “Thin” version that optimizes Clang-7 in under

15 seconds using less than 4GB of memory. No modifications to the linker or

compiler were required. And by the way, that appears to be faster than just the

re-linking phase of the Propeller. Larger loads show similar improvements

providing 2x-5x savings over the original processing specs.

If you can provide a patch so that we can try it with large benchmarks, that will be great.

On the other hand, Clang’s text is relatively small (<100M?), so using 4GB memory is still quite large IMO. Sri and team tried to bring up BOLT for Chrome (which is about 2 or 3x larger than Clang) for the last two days, but have not been successful. They encountered all sorts of problems — after fixing many sources that led to disassembly errors, they got repeatedly hit by ‘jump into the middle of insn’ errors. See also 5) above. (Similarly, a few months time were spent on bringing up BOLT for an internal benchmark which is 8x larger than Clang)

Let me reiterate that current BOLT requires large amounts of memory not because

it’s a fundamental limitation, unlike LTO. For us, system memory was never a

constraint. The runtime of the application, not BOLT, was the primary goal

during the development.

BOLT is an amazing engineering pierce that inspired Propeller. See the objectives above, system memory is one of the primary design goals for it , and it is architected to enforce the policy.

ThinLTO design solves a real problem and dramatically improves compilation time

even when building on a single node. ThinLTO results provide "end-to-end build

time" comparison to LTO. I’ve asked you to show a similar comparison for

Propeller vs. BOLT. I haven’t seen the results, and I suspect the total overhead

will exceed that of even the oldest non-parallel version of BOLT.

I have explained this above and we are confident that Propeller can be turned on by default (out of box) for any targets just like ThinLTO today (and this is indeed the goal). We will continue to iterate on later versions to make it happen.

One argument I’ve heard is that BOLT is not taking advantage of the distributed

build system. That’s correct. It does not have to since it does not require to

rebuild the application. In “Thin” mode, the overhead is similar to a regular

linker running with a linker script.

4GB in Thin Mode for a Clang sized program is not the scalability goal to shoot for wider adoption of PLO.

thanks,

David

tmsriram · January 18, 2020, 12:17am

We have been working on addressing most of the concerns raised and we
have updated the patches with the following changes.

1) Exception Handling

TLDR; We now handle exceptions and can optimize exception heavy
benchmarks and clang with exceptions enabled.

We have updated Propeller patches to handle exceptions. In contrast to
the previous approach which grouped all exception-handling related
basic blocks into one section, we now group just the landing pads
within one section and create a separate section for every other basic
block (even those which potentially throw exceptions by calling other
functions). This allows us to reorder code for exception-heavy
programs with greater flexibility. Grouping all landing pad basic
blocks in the same section lets us satisfy the ABI requirement of “one
landing pad fragment per procedure fragment.” This could be further
relaxed to separately grouping together all the landing pads which are
used by every basic block section. We will leave this for future
updates.

With this change, we are able to speedup the SPEC benchmark xalancbmk
by 0.5% which we were unable to do so previously. (Without exception
handling support this benchmark regressed by 1%).

We also built clang with exceptions enabled (DLLVM_ENABLE_EH=ON) and
were able to get performance improvements similar to what we got
without exceptions.

2) Minimizing the number of basic block sections

TLDR; We have significantly cut down the number of basic blocks for
which sections are created.

Previously, basic block sections were created for all the basic
blocks in a function that had samples. We have modified the patches
to allow creation of sections for only those basic blocks that have
samples. This greatly reduces the number of sections created. This
does not affect benchmark performance in any manner. For clang,
making this change reduces the object size bloat from 100% (2x) to 50%
(1.5x). Clang is an extreme benchmark for object size bloats since the
perf samples touch 10% of all functions. For larger datacenter
workload benchmarks we evaluated, the object size bloats are less than
5% with this change.

We are working on further reducing this so that the object file size
bloats can be kept as minimal as possible. Currently, we create a
section for a basic block even if it received just one perf sample.
This is overkill and can be restricted to only basic blocks that were
really hot.

3) Interprocedural Basic Block Reordering

TLDR; Interprocedural basic block reordering further improves
performance by 1% (clang benchmark) but is slower and we are working
on a better implementation.

Previously, the function reordering done was intra-procedural and one
of the comments was that if this could be done inter-procedurally. We
have now added an option to do this inter-procedurally, it is not the
default. While the inter-procedural reordering algorithm is slower by
10x (increasing from 3 seconds to 30 seconds for clang) because the
code layout search space is much larger, the performance of the final
benchmark improves by an additional 1%. We are working on improving
the implementation to make this much faster.

4) Propeller functionality in lld is separated out

After discussing with Rui, we have moved Propeller code into a
separate directory. Propeller interfaces with lld through
well-defined API. This provides the advantage that Propeller code
itself and future changes can be reviewed independently.

5) Reusing optimized LLVM IR files to re-link

TLDR; We are working on re-using native object files from previous
link and patches will be available soon

The final re-link with Propeller can directly start from the high
level optimized IR files. We are working on a patch to support
options “--save-precodegen” which will automatically save optimized IR
files which will later be reused by Propeller to reduce link time,
Branch : https://github.com/google/llvm-propeller/tree/plo-codegen-backend.

We are also working on directly using native object files from the
previous link for those objects which did not accrue any profiling
samples. Even for the clang benchmark, only 7% of the objects needed
to be re-generated with basic block sections and the cached native
object files can be used for the rest. This approach works
particularly well with ThinLTO as the lto.cache can directly be used
to obtain the native object files for modules that are not affected.
We will send out patches for this soon.

6) Time Overheads of CFI and Debug Info

TLDR; Experiments show varying overheads with Debug Info depending
on the symbolizers. We did not see any overheads with CFI.
Collapsing .debug_ranges in the linker solves the overheads associated
with this.

We tried to measure two things. Overheads from accessing .eh_frame
and .debug_ranges. With Basic block sections, a new CFI FDE is
created for every basic block that gets a new section and a new range
pair is added to debug_ranges. We wrote a benchmark , using
libunwind, to dump the stack trace (no symbolization) and then do then
dump the stack trace running the symbolizer (symbolization) using a
Google Internal API.

The benchmark had ~12000 functions and basic block sections for all
basic blocks resulted in ~130000 more basic block sections and
equivalent debug range entries, an order of magnitude more.

For a vanilla binary, on average over several iterations, it took
~70ns to dump the stack trace without symbolization and ~235000 ns
with symbolization. We then enabled basic block sections across the
board and re-measured and found no measurable impact. Symbolization
seems to dominate the time.

We then measured overheads with addr2line and llvm-symbolizer. We
just timed symbolizing a couple of addresses. With addr2line, the
binary with sections slows down by 5% to 10%. With llvm-symbolizer,
the slowdown is much more, by upto 30% sometimes. Looking at the
symbolizer code, we found places where the symbols are scanned
linearly and also places where the .debug_ranges are not collapsed.
Looking at the different symbolizer implementations, we found that
some use binary search, some collapse debug ranges as and when they
are added and some just do a linear search.

The primary reason for this overhead is that the .debug_ranges are
much larger. We are looking at improving this by doing one of these:

a) Have the linker collapse the debug_ranges which are pretty much
contiguous as most of the basic blocks end up in the same function.
This is the easier solution and we are working on a patch. This should
completely solve the problem.
b) Alternately, modify the symbolizers to collapse the ranges as and
when they are added.

All our experiments were with full basic block sections. When basic
block sections are turned on selectively, these problems should not
manifest even now.

7) New Relocation types to simplify linker relaxation of Basic Block
Section Jumps

We are working on adding two new relocation types: R_X86_64_PC32_JUMPX
and R_X86_64_PC8_JUMPX to the x86-64 ps-ABI to make linker handing of
relaxation simplified. Specifically, these two relocation types will
allow the linker to easily identify the jump instructions that were
created for bb sections. The PC8 relocations will only have to be
checked for growing to 32 bits during relaxation.

Proposal here: Redirecting to Google Groups

Thanks,
Sri

MaskRay · February 7, 2020, 7:05pm

Thank you for replying to our feedback. 7 out 12 high-level concerns have been

answered; 2 of them are fully addressed. The rest are being tracked at the

following Google doc:

Propeller Concerns - Google Docs

To keep the discussion at a high level, I have to reference the LTO vs ThinLTO

comparison since that appears to be the central theme in your response to the

feedback.

Unlike LTO, BOLT does not have to keep the entire program in memory.

Furthermore, as we have previously mentioned, most of the passes are run in

parallel, and the performance scales well with the number of CPUs.

To demonstrate that running BOLT on a hot subset of functions is not just a

speculation, we have prototyped a "Thin" version that optimizes Clang-7 in under

15 seconds using less than 4GB of memory. No modifications to the linker or

compiler were required. And by the way, that appears to be faster than just the

re-linking phase of the Propeller. Larger loads show similar improvements

providing 2x-5x savings over the original processing specs.

Let me reiterate that current BOLT requires large amounts of memory not because

it's a fundamental limitation, unlike LTO. For us, system memory was never a

constraint. The runtime of the application, not BOLT, was the primary goal

during the development.

ThinLTO design solves a real problem and dramatically improves compilation time

even when building on a single node. ThinLTO results provide "end-to-end build

time" comparison to LTO. I've asked you to show a similar comparison for

Propeller vs. BOLT. I haven't seen the results, and I suspect the total overhead

will exceed that of even the oldest non-parallel version of BOLT.

One argument I've heard is that BOLT is not taking advantage of the distributed

build system. That's correct. It does not have to since it does not require to

rebuild the application. In "Thin" mode, the overhead is similar to a regular

linker running with a linker script.

You are right that we do not support debug fission packages. It is unimplemented

for a simple reason: no one asked for it previously. And as we like to say in

the open-source community: "patches are welcome."

Maksim

P.S. We have updated GitHub - facebookarchive/BOLT: Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries with instructions on running BOLT with jemalloc or tcmalloc.

Hello Maksim,

Cool. The new numbers look good. If you run BOLT with jemalloc library

preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that

compared to the default malloc, it improves the multithreaded

performance and brings down memory usage significantly.

Great, thanks for confirming! Would you be willing to share specific numbers, how significant is the reduction in memory with jemalloc for clang? We double-checked our numbers with the larger benchmarks and we can confirm they were *not built with labels*. One of our large benchmarks, search1, is about 5 times the size of clang in terms of text size as reported by size command, and we are seeing a 70G memory overhead on this. Do you have data on the memory consumption of BOLT with larger benchmarks with jemalloc. We are trying to build Chrome with latest BOLT so that we can share the memory overheads and the binaries with you for transparency but we are struggling with the disassembly errors. If you have data on large benchmarks we would appreciate it if you could share it.

Further, if you have a recipe to use jemalloc with BOLT, please point it at us. We could try it out too and share our findings.

Thanks much,

Sri

Thanks,

Maksim

Hi Sri,

I want to clarify one thing before sending a detailed reply: did you evaluate

BOLT on Clang built with basic block sections?

In the makefile you reference,

there are two versions: a “vanilla” and a default built with function sections.

High overheads you see with BOLT on Clang do not match our experience.

Thanks for pointing that out in the Makefile. We double-checked and noticed a bug in our Makefile. For clang, we noticed that we are BOLTING with basic block symbols which seems to affect the memory consumption of BOLT. So, we have re-measured with recent bolt and for *full transparency* we have uploaded the binaries, BOLT's yaml files and perf.data files and the commands so that you can independently verify our results and check the binaries. We have gzipped all the files to keep it under 2G limit for git lfs, everything is here : https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment We have run our experiments on a 192G machine with Intel 18 core.

We built llvm-bolt with most recent sources and is *pristine* with none of our patches and uploading the binary we used here, https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt That's a very recent BOLT binary, git hash: 988a7e8819b882fd14e18d149f8d3f702b134680

The https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/\{v1,v2\} contains two sets of binaries. The first binary is pristine recent clang-10 built from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary also only additionally built with -q. There are no labels or sections or any other Propeller flags used to build these clang binaries. Here is the command we are using to optimize with BOLT, all the commands have been checked in too.

You should be able to run llvm-bolt now on these binaries as all the files are provided. We have also provided the raw perf data files in case you want to independently convert.

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 --update-debug-sections

For version 2, this is the number:

Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40

Maximum resident set size (kbytes): 18742688

That is 125 seconds and ~18G of RAM.

For version 1, this hangs and we stopped it after several minutes and the maximum RSS size crossing 50G. This is most likely just a bug and you should be able to reproduce this. The binary seems to be running and printing update messages.

We also measured without update-debug-sections too with the command :

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1

For version1 :

Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74

Maximum resident set size (kbytes): 14824444

93 seconds and ~14G of RAM

version 2 :

Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90

Maximum resident set size (kbytes): 14511912

similar 91 secs and ~14G

Now, coming back to the bug in the Makefile, we originally reported ~35G. That is *wrong* since the clang binary used to measure bolt overheads was built with basic block labels. Our *sincere apologies* for this, this showed BOLT as consuming more memory than is actual for clang. We double-checked BOLT numbers with the internal benchmark search2 for sanity and that is built *without any labels* and only with "-Wl,-q". We are checking the other large internal benchmarks too. We cannot disclose internal benchmarks. So, we will get more large open-source benchmarks like Chrome or gcc built with bolt and share the binaries and results so you can independently verify.

Thanks

Sri

Thanks,

Maksim

Hello,

I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it.

BOLT’s performance gains inspired this work and we believe BOLT

is a great piece of engineering. However, there are build environments where
scalability is critical and memory limits per process are tight :

* Debug Fission, DebugFission - GCC Wiki was primarily
invented to achieve scalability and better incremental build times while
building large binaries with debug information.

* ThinLTO,
ThinLTO: Scalable and Incremental LTO - The LLVM Project Blog was
primarily invented to make LLVM’s full LTO scalable and keep the memory and
time overheads low. ThinLTO has enabled much broader adoption of whole
program optimization, by making it non-monolithic.

* For Chromium builds,
https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha
in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO.
It was 26GB with Full LTO before that and individual processes will run of out
of memory beyond that.

* Here,
https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib
utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale
is shown where 5 million binary and test builds are performed every day on
several thousands of machines, each with a limitation of 12G of memory per
process and 15 minute time-out on tests. Memory overheads of 35G (clang) are
well above these thresholds.

We have developed Propeller like ThinLTO that can be used to obtain similar
performance gains like BOLT in such environments.

Thanks

Sri

Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?

Or, could the propellor-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn't actually need that flexibility?

Apologies if this is obvious for those who actually know what they're talking about here.

It is a fair question.

We believe the flexibility to do fine grained layout in whole program context is important. PostLinkOptimization is aimed at getting as much performance improvement as possible (usually applied on top of ThinLTO+PGO), so the framework is designed to enable it.

In particular, it allows the linker to stitch hot bb traces from different functions to be stitched together. It also allows hot trace duplication across procedure boundaries (kind of interprocedural tailDup). Besides, code alignment decisions to minimize branch mispredictions may require global context (e.g, too conflicting branches residing in two different functions). Other micro-arch specific optimizations to improve processor front-end throughput may also require global context.

It is conceivable to have an option to control the level of granularity at the possible cost of performance.

thanks,

David

You’re correct, except that, in Propeller, CFI duplication happens for every basic block as it operates with the conservative assumption that a block can be put anywhere by the linker. That’s a significant bloat that is not cleaned up later. So, during link time, if N blocks from the same function are contiguous in the final layout, as it should happen most of the time for any sane BB order, we would have several FDEs for a region that only needs one. The bloat goes to the final binary (a lot more FDEs, specifically, one FDE per basic block).

BOLT will only split a function in two parts, and only if it has profile. Most of the time, a function is not split. It also has an option not to split at all. For internally reordered basic blocks of a given function, it has CFI deduplication logic (it will interpret and build the CFI states for each block and rewrite the CFIs in a way that uses the minimum number of instructions to encode the states for each block).

From: llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of James Y Knight via llvm-dev <llvm-dev@lists.llvm.org>
Reply-To: James Y Knight <jyknight@google.com>
Date: Wednesday, October 2, 2019 at 1:59 PM
To: Maksim Panchenko <maks@fb.com>
Cc: "llvm-dev@lists.llvm.org" <llvm-dev@lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

I'm a bit confused by this subthread -- doesn't BOLT have the exact same CFI bloat issue? From my cursory reading of the propellor doc, the CFI duplication is _necessary_ to represent discontiguous functions, not anything particular to the way Propellor happens to generate those discontiguous functions.

And emitting discontiguous functions is a fundamental goal of this, right?

Thanks for clarifying. This means once you move to the next basic block (or any other basic

block in the function) you have to execute an entirely new set of CFI instructions

except for the common CIE part. While indeed this is not as bad, on average, the overall

active memory footprint will increase.

Creating one FDE per basic block means that .eh_frame_hdr, an allocatable section,

will be bloated too. This will increase the FDE lookup time. I don’t see .eh_frame_hdr

being mentioned in the proposal.

Maksim

*Pessimization/overhead for stack unwinding used by system-wide profilers and
for exception handling*

Larger CFI programs put an extra burden on unwinding at runtime as more CFI
(and thus native) instructions have to be executed. This will cause more
overhead for any profiler that records stack traces, and, as you correctly note
in the proposal, for any program that heavily uses exceptions.

The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block section, but when performing unwinding only one such a set is executed -- the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections.

We are going to be at the llvm-dev meeting the next two days. We will get back to you after that.

Sri

We have been working on addressing most of the concerns raised and we
have updated the patches with the following changes.

1) Exception Handling

TLDR; We now handle exceptions and can optimize exception heavy
benchmarks and clang with exceptions enabled.

We have updated Propeller patches to handle exceptions. In contrast to
the previous approach which grouped all exception-handling related
basic blocks into one section, we now group just the landing pads
within one section and create a separate section for every other basic
block (even those which potentially throw exceptions by calling other
functions). This allows us to reorder code for exception-heavy
programs with greater flexibility. Grouping all landing pad basic
blocks in the same section lets us satisfy the ABI requirement of “one
landing pad fragment per procedure fragment.” This could be further
relaxed to separately grouping together all the landing pads which are
used by every basic block section. We will leave this for future
updates.

With this change, we are able to speedup the SPEC benchmark xalancbmk
by 0.5% which we were unable to do so previously. (Without exception
handling support this benchmark regressed by 1%).

We also built clang with exceptions enabled (DLLVM_ENABLE_EH=ON) and
were able to get performance improvements similar to what we got
without exceptions.

2) Minimizing the number of basic block sections

TLDR; We have significantly cut down the number of basic blocks for
which sections are created.

Previously, basic block sections were created for all the basic
blocks in a function that had samples. We have modified the patches
to allow creation of sections for only those basic blocks that have
samples. This greatly reduces the number of sections created. This
does not affect benchmark performance in any manner. For clang,
making this change reduces the object size bloat from 100% (2x) to 50%
(1.5x). Clang is an extreme benchmark for object size bloats since the
perf samples touch 10% of all functions. For larger datacenter
workload benchmarks we evaluated, the object size bloats are less than
5% with this change.

We are working on further reducing this so that the object file size
bloats can be kept as minimal as possible. Currently, we create a
section for a basic block even if it received just one perf sample.
This is overkill and can be restricted to only basic blocks that were
really hot.

3) Interprocedural Basic Block Reordering

TLDR; Interprocedural basic block reordering further improves
performance by 1% (clang benchmark) but is slower and we are working
on a better implementation.

Previously, the function reordering done was intra-procedural and one
of the comments was that if this could be done inter-procedurally. We
have now added an option to do this inter-procedurally, it is not the
default. While the inter-procedural reordering algorithm is slower by
10x (increasing from 3 seconds to 30 seconds for clang) because the
code layout search space is much larger, the performance of the final
benchmark improves by an additional 1%. We are working on improving
the implementation to make this much faster.

4) Propeller functionality in lld is separated out

After discussing with Rui, we have moved Propeller code into a
separate directory. Propeller interfaces with lld through
well-defined API. This provides the advantage that Propeller code
itself and future changes can be reviewed independently.

5) Reusing optimized LLVM IR files to re-link

TLDR; We are working on re-using native object files from previous
link and patches will be available soon

The final re-link with Propeller can directly start from the high
level optimized IR files. We are working on a patch to support
options “--save-precodegen” which will automatically save optimized IR
files which will later be reused by Propeller to reduce link time,
Branch : https://github.com/google/llvm-propeller/tree/plo-codegen-backend.

We are also working on directly using native object files from the
previous link for those objects which did not accrue any profiling
samples. Even for the clang benchmark, only 7% of the objects needed
to be re-generated with basic block sections and the cached native
object files can be used for the rest. This approach works
particularly well with ThinLTO as the lto.cache can directly be used
to obtain the native object files for modules that are not affected.
We will send out patches for this soon.

6) Time Overheads of CFI and Debug Info

TLDR; Experiments show varying overheads with Debug Info depending
on the symbolizers. We did not see any overheads with CFI.
Collapsing .debug_ranges in the linker solves the overheads associated
with this.

We tried to measure two things. Overheads from accessing .eh_frame
and .debug_ranges. With Basic block sections, a new CFI FDE is
created for every basic block that gets a new section and a new range
pair is added to debug_ranges. We wrote a benchmark , using
libunwind, to dump the stack trace (no symbolization) and then do then
dump the stack trace running the symbolizer (symbolization) using a
Google Internal API.

The benchmark had ~12000 functions and basic block sections for all
basic blocks resulted in ~130000 more basic block sections and
equivalent debug range entries, an order of magnitude more.

For a vanilla binary, on average over several iterations, it took
~70ns to dump the stack trace without symbolization and ~235000 ns
with symbolization. We then enabled basic block sections across the
board and re-measured and found no measurable impact. Symbolization
seems to dominate the time.

We then measured overheads with addr2line and llvm-symbolizer. We
just timed symbolizing a couple of addresses. With addr2line, the
binary with sections slows down by 5% to 10%. With llvm-symbolizer,
the slowdown is much more, by upto 30% sometimes. Looking at the
symbolizer code, we found places where the symbols are scanned
linearly and also places where the .debug_ranges are not collapsed.
Looking at the different symbolizer implementations, we found that
some use binary search, some collapse debug ranges as and when they
are added and some just do a linear search.

The primary reason for this overhead is that the .debug_ranges are
much larger. We are looking at improving this by doing one of these:

a) Have the linker collapse the debug_ranges which are pretty much
contiguous as most of the basic blocks end up in the same function.
This is the easier solution and we are working on a patch. This should
completely solve the problem.
b) Alternately, modify the symbolizers to collapse the ranges as and
when they are added.

All our experiments were with full basic block sections. When basic
block sections are turned on selectively, these problems should not
manifest even now.

7) New Relocation types to simplify linker relaxation of Basic Block
Section Jumps

We are working on adding two new relocation types: R_X86_64_PC32_JUMPX
and R_X86_64_PC8_JUMPX to the x86-64 ps-ABI to make linker handing of
relaxation simplified. Specifically, these two relocation types will
allow the linker to easily identify the jump instructions that were
created for bb sections. The PC8 relocations will only have to be
checked for growing to 32 bits during relaxation.

Proposal here: Redirecting to Google Groups

I took some time to understand the lld side changes
(D68065+D73497+D68073; the patch do not have clear relations and do not
apply at HEAD cleanly), attended a LLVM Social yesterday and asked some
people about their feelings. We are still dubious whether doing all the
heavy lifting on the linker side is the right approach. Disassembly at
linker side may give some short term benefits, but the interaction with
debug info/asynchronous unwind tables/symbol information is muddy. (From
the existing posted patch series it is not clear how they are solved.)

Doing non-LTO things on the linkder side is rather inflexible.
Linkers don't really understand the semantics. So I see things like:

* Get the input section size, check if there is a relocation (for JMP)
   r_offset = input_section_size-4, if there is a JCC at r_offset =
   input_section_size-9, ...
* Update st_value to take shrunk input sections into account.
* An algorithm to the extended travelling salesman problem looks fancy.
   It also introduced a bunch of parameters. Have other ordering approaches been considered?
   I am concerned that the raw ELF fields (like sh_size) are used for
   heuristics. It seems to try reconstructing some information about
   Machine{Function,BasicBlock,Instr,...} about doing that in the
   Machine* stage may just be simpler and more accurate.

We feel that the available optimizations may be limited. The lld focused
implementation may have some improvement in the near term but the
benefits will dilute quickly when the Machine{Function,BasicBlock}
interfaces get improved.

Also, doing heavy-lifting work on the linker side:

Needs non-trivial work to port to another binary format. Another target
may be a very different story. (Even for x86-64, I see the patches
attend to address R_X86_64_PC8/R_X86_64_PC32/R_X86_64_PLT32 calls, but
the two relocation types are not handled.)
The newly lld/ELF code overlaps with jobs of MachineBlockPlacement/BranchFolding/AsmPrinter/etc.

If the current MachineFunction interface is not flexible (e.g. it does
not allow sharable MachineBasicBlocks between MachineFunctions.) enough,
let's fix it.

Rahman_Lavaee1 · February 10, 2020, 7:13pm

I respond with regards to “unwind tables” and “code layout”.

WIth regards to “unwind tables”, the idea of splitting call-sites tables has been implemented in D73739.

I think “asynchronous unwind tables” are already supported by our patch (Meaning that the unwind tables are precise at instruction boundary).
Please let us know if you believe otherwise.

An algorithm to the extended travelling salesman problem looks fancy.
It also introduced a bunch of parameters. Have other ordering approaches been considered?

The Extended-TSP algorithm is published in https://arxiv.org/abs/1809.04676 and shows a speedup improvement of about 0.7% compared to TSP (on Clang), which I confirmed independently.
The general idea of considering jump distance in the layout decision is ubiquitous (https://dl.acm.org/doi/10.1145/3302516.3307358 (codestitcher), https://dl.acm.org/doi/10.5555/3049832.3049858 (hfsort) ).
We have considered other ordering approaches such as pettis-hansen and codestitcher (both of which are based on TSP) and we have found Ext-TSP to be the best.
You can also refer to the codestitcher and ext-TSP paper for more extensive study.

Let me point out that although the formulation of ext-TSP is more involved than TSP, the heuristic algorithm that is used to solve it is essentially the same as the one for TSP. It’s an incremental changing algorithm aimed at maximizing the gain in the score at each step.

On the other hand, most of the complication in the algorithm is caused by split-and-remerge (guided by -propeller-chain-split-threshold) which is highly beneficial for escaping local optima, resulting in 2% more performance improvement. The original ext-TSP work sets the split threshold at 128 bytes, while we can achieve better speedup by setting it to 1024bytes, without sacrificing the complication time (The whole reordering algorithm takes about 4 seconds on clang).

I am concerned that the raw ELF fields (like sh_size) are used for
heuristics. It seems to try reconstructing some information about
Machine{Function,BasicBlock,Instr,…} about doing that in the
Machine* stage may just be simpler and more accurate.

Yes. Basic block (binary) sizes are used to compute the binary distance for jumps in the final layout (as used by the ext-TSP algorithm). We can currently get this information both from the object code (sections), or the BB symbol sizes (propagated through the propeller profile). Getting this information in the machine stage is not as accurate and convenient. For instance, StackMapShadowTracker has been implemented in llvm to count the instruction bytes after the last StackMap and it shares some code between different classes (X86AsmPrinter and X86MCInstLower).

tmsriram · February 11, 2020, 6:31pm

>>> Thank you for replying to our feedback. 7 out 12 high-level concerns have been
>>>
>>> answered; 2 of them are fully addressed. The rest are being tracked at the
>>>
>>> following Google doc:
>>>
>>>
>>>
>>> Propeller Concerns - Google Docs
>>>
>>>
>>>
>>> To keep the discussion at a high level, I have to reference the LTO vs ThinLTO
>>>
>>> comparison since that appears to be the central theme in your response to the
>>>
>>> feedback.
>>>
>>>
>>>
>>> Unlike LTO, BOLT does not have to keep the entire program in memory.
>>>
>>> Furthermore, as we have previously mentioned, most of the passes are run in
>>>
>>> parallel, and the performance scales well with the number of CPUs.
>>>
>>>
>>>
>>> To demonstrate that running BOLT on a hot subset of functions is not just a
>>>
>>> speculation, we have prototyped a "Thin" version that optimizes Clang-7 in under
>>>
>>> 15 seconds using less than 4GB of memory. No modifications to the linker or
>>>
>>> compiler were required. And by the way, that appears to be faster than just the
>>>
>>> re-linking phase of the Propeller. Larger loads show similar improvements
>>>
>>> providing 2x-5x savings over the original processing specs.
>>>
>>>
>>>
>>> Let me reiterate that current BOLT requires large amounts of memory not because
>>>
>>> it's a fundamental limitation, unlike LTO. For us, system memory was never a
>>>
>>> constraint. The runtime of the application, not BOLT, was the primary goal
>>>
>>> during the development.
>>>
>>>
>>>
>>> ThinLTO design solves a real problem and dramatically improves compilation time
>>>
>>> even when building on a single node. ThinLTO results provide "end-to-end build
>>>
>>> time" comparison to LTO. I've asked you to show a similar comparison for
>>>
>>> Propeller vs. BOLT. I haven't seen the results, and I suspect the total overhead
>>>
>>> will exceed that of even the oldest non-parallel version of BOLT.
>>>
>>>
>>>
>>> One argument I've heard is that BOLT is not taking advantage of the distributed
>>>
>>> build system. That's correct. It does not have to since it does not require to
>>>
>>> rebuild the application. In "Thin" mode, the overhead is similar to a regular
>>>
>>> linker running with a linker script.
>>>
>>>
>>>
>>> You are right that we do not support debug fission packages. It is unimplemented
>>>
>>> for a simple reason: no one asked for it previously. And as we like to say in
>>>
>>> the open-source community: "patches are welcome."
>>>
>>>
>>>
>>> Maksim
>>>
>>>
>>>
>>> P.S. We have updated GitHub - facebookarchive/BOLT: Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries with instructions on running BOLT with jemalloc or tcmalloc.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Hello Maksim,
>>>
>>>
>>>
>>>
>>> Cool. The new numbers look good. If you run BOLT with jemalloc library
>>>
>>> preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that
>>>
>>> compared to the default malloc, it improves the multithreaded
>>>
>>> performance and brings down memory usage significantly.
>>>
>>>
>>>
>>> Great, thanks for confirming! Would you be willing to share specific numbers, how significant is the reduction in memory with jemalloc for clang? We double-checked our numbers with the larger benchmarks and we can confirm they were *not built with labels*. One of our large benchmarks, search1, is about 5 times the size of clang in terms of text size as reported by size command, and we are seeing a 70G memory overhead on this. Do you have data on the memory consumption of BOLT with larger benchmarks with jemalloc. We are trying to build Chrome with latest BOLT so that we can share the memory overheads and the binaries with you for transparency but we are struggling with the disassembly errors. If you have data on large benchmarks we would appreciate it if you could share it.
>>>
>>>
>>>
>>> Further, if you have a recipe to use jemalloc with BOLT, please point it at us. We could try it out too and share our findings.
>>>
>>>
>>>
>>> Thanks much,
>>>
>>> Sri
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Maksim
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Hi Sri,
>>>
>>>
>>>
>>> I want to clarify one thing before sending a detailed reply: did you evaluate
>>>
>>> BOLT on Clang built with basic block sections?
>>>
>>> In the makefile you reference,
>>>
>>> there are two versions: a “vanilla” and a default built with function sections.
>>>
>>> High overheads you see with BOLT on Clang do not match our experience.
>>>
>>>
>>>
>>> Thanks for pointing that out in the Makefile. We double-checked and noticed a bug in our Makefile. For clang, we noticed that we are BOLTING with basic block symbols which seems to affect the memory consumption of BOLT. So, we have re-measured with recent bolt and for *full transparency* we have uploaded the binaries, BOLT's yaml files and perf.data files and the commands so that you can independently verify our results and check the binaries. We have gzipped all the files to keep it under 2G limit for git lfs, everything is here : https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment We have run our experiments on a 192G machine with Intel 18 core.
>>>
>>>
>>>
>>> We built llvm-bolt with most recent sources and is *pristine* with none of our patches and uploading the binary we used here, https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt That's a very recent BOLT binary, git hash: 988a7e8819b882fd14e18d149f8d3f702b134680
>>>
>>>
>>>
>>> The https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/\{v1,v2\} contains two sets of binaries. The first binary is pristine recent clang-10 built from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary also only additionally built with -q. There are no labels or sections or any other Propeller flags used to build these clang binaries. Here is the command we are using to optimize with BOLT, all the commands have been checked in too.
>>>
>>>
>>>
>>> You should be able to run llvm-bolt now on these binaries as all the files are provided. We have also provided the raw perf data files in case you want to independently convert.
>>>
>>>
>>>
>>> $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 --update-debug-sections
>>>
>>>
>>>
>>> For version 2, this is the number:
>>>
>>>
>>>
>>> Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
>>>
>>> Maximum resident set size (kbytes): 18742688
>>>
>>>
>>>
>>> That is 125 seconds and ~18G of RAM.
>>>
>>>
>>>
>>> For version 1, this hangs and we stopped it after several minutes and the maximum RSS size crossing 50G. This is most likely just a bug and you should be able to reproduce this. The binary seems to be running and printing update messages.
>>>
>>>
>>>
>>> We also measured without update-debug-sections too with the command :
>>>
>>>
>>>
>>> $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1
>>>
>>>
>>>
>>> For version1 :
>>>
>>> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74
>>>
>>> Maximum resident set size (kbytes): 14824444
>>>
>>>
>>>
>>> 93 seconds and ~14G of RAM
>>>
>>>
>>>
>>> version 2 :
>>>
>>> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90
>>>
>>> Maximum resident set size (kbytes): 14511912
>>>
>>>
>>>
>>> similar 91 secs and ~14G
>>>
>>>
>>>
>>> Now, coming back to the bug in the Makefile, we originally reported ~35G. That is *wrong* since the clang binary used to measure bolt overheads was built with basic block labels. Our *sincere apologies* for this, this showed BOLT as consuming more memory than is actual for clang. We double-checked BOLT numbers with the internal benchmark search2 for sanity and that is built *without any labels* and only with "-Wl,-q". We are checking the other large internal benchmarks too. We cannot disclose internal benchmarks. So, we will get more large open-source benchmarks like Chrome or gcc built with bolt and share the binaries and results so you can independently verify.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Sri
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Maksim
>>>
>>>
>>>
>>>
>>>
>>>
>>> Hello,
>>>
>>>
>>>
>>> I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it.
>>>
>>>
>>>
>>> BOLT’s performance gains inspired this work and we believe BOLT
>>>
>>> is a great piece of engineering. However, there are build environments where
>>> scalability is critical and memory limits per process are tight :
>>>
>>> * Debug Fission, DebugFission - GCC Wiki was primarily
>>> invented to achieve scalability and better incremental build times while
>>> building large binaries with debug information.
>>>
>>> * ThinLTO,
>>> ThinLTO: Scalable and Incremental LTO - The LLVM Project Blog was
>>> primarily invented to make LLVM’s full LTO scalable and keep the memory and
>>> time overheads low. ThinLTO has enabled much broader adoption of whole
>>> program optimization, by making it non-monolithic.
>>>
>>> * For Chromium builds,
>>> https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha
>>> in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO.
>>> It was 26GB with Full LTO before that and individual processes will run of out
>>> of memory beyond that.
>>>
>>> * Here,
>>> https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib
>>> utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale
>>> is shown where 5 million binary and test builds are performed every day on
>>> several thousands of machines, each with a limitation of 12G of memory per
>>> process and 15 minute time-out on tests. Memory overheads of 35G (clang) are
>>> well above these thresholds.
>>>
>>> We have developed Propeller like ThinLTO that can be used to obtain similar
>>> performance gains like BOLT in such environments.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Sri
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?
>>>
>>>
>>>
>>> Or, could the propellor-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn't actually need that flexibility?
>>>
>>>
>>>
>>> Apologies if this is obvious for those who actually know what they're talking about here.
>>>
>>>
>>>
>>> It is a fair question.
>>>
>>>
>>>
>>> We believe the flexibility to do fine grained layout in whole program context is important. PostLinkOptimization is aimed at getting as much performance improvement as possible (usually applied on top of ThinLTO+PGO), so the framework is designed to enable it.
>>>
>>>
>>>
>>> In particular, it allows the linker to stitch hot bb traces from different functions to be stitched together. It also allows hot trace duplication across procedure boundaries (kind of interprocedural tailDup). Besides, code alignment decisions to minimize branch mispredictions may require global context (e.g, too conflicting branches residing in two different functions). Other micro-arch specific optimizations to improve processor front-end throughput may also require global context.
>>>
>>>
>>>
>>> It is conceivable to have an option to control the level of granularity at the possible cost of performance.
>>>
>>>
>>>
>>> thanks,
>>>
>>>
>>>
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> You’re correct, except that, in Propeller, CFI duplication happens for every basic block as it operates with the conservative assumption that a block can be put anywhere by the linker. That’s a significant bloat that is not cleaned up later. So, during link time, if N blocks from the same function are contiguous in the final layout, as it should happen most of the time for any sane BB order, we would have several FDEs for a region that only needs one. The bloat goes to the final binary (a lot more FDEs, specifically, one FDE per basic block).
>>>
>>> BOLT will only split a function in two parts, and only if it has profile. Most of the time, a function is not split. It also has an option not to split at all. For internally reordered basic blocks of a given function, it has CFI deduplication logic (it will interpret and build the CFI states for each block and rewrite the CFIs in a way that uses the minimum number of instructions to encode the states for each block).
>>>
>>>
>>>
>>> From: llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of James Y Knight via llvm-dev <llvm-dev@lists.llvm.org>
>>> Reply-To: James Y Knight <jyknight@google.com>
>>> Date: Wednesday, October 2, 2019 at 1:59 PM
>>> To: Maksim Panchenko <maks@fb.com>
>>> Cc: "llvm-dev@lists.llvm.org" <llvm-dev@lists.llvm.org>
>>> Subject: Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
>>>
>>>
>>>
>>> I'm a bit confused by this subthread -- doesn't BOLT have the exact same CFI bloat issue? From my cursory reading of the propellor doc, the CFI duplication is _necessary_ to represent discontiguous functions, not anything particular to the way Propellor happens to generate those discontiguous functions.
>>>
>>>
>>>
>>> And emitting discontiguous functions is a fundamental goal of this, right?
>>>
>>>
>>>
>>>
>>> Thanks for clarifying. This means once you move to the next basic block (or any other basic
>>>
>>> block in the function) you have to execute an entirely new set of CFI instructions
>>>
>>> except for the common CIE part. While indeed this is not as bad, on average, the overall
>>>
>>> active memory footprint will increase.
>>>
>>>
>>>
>>> Creating one FDE per basic block means that .eh_frame_hdr, an allocatable section,
>>>
>>> will be bloated too. This will increase the FDE lookup time. I don’t see .eh_frame_hdr
>>>
>>> being mentioned in the proposal.
>>>
>>>
>>>
>>> Maksim
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Pessimization/overhead for stack unwinding used by system-wide profilers and
>>> for exception handling*
>>>
>>> Larger CFI programs put an extra burden on unwinding at runtime as more CFI
>>> (and thus native) instructions have to be executed. This will cause more
>>> overhead for any profiler that records stack traces, and, as you correctly note
>>> in the proposal, for any program that heavily uses exceptions.
>>>
>>>
>>>
>>> The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block section, but when performing unwinding only one such a set is executed -- the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections.
>>
>> We are going to be at the llvm-dev meeting the next two days. We will get back to you after that.
>>
>> Sri
>>
>We have been working on addressing most of the concerns raised and we
>have updated the patches with the following changes.
>
>1) Exception Handling
>
>TLDR; We now handle exceptions and can optimize exception heavy
>benchmarks and clang with exceptions enabled.
>
>We have updated Propeller patches to handle exceptions. In contrast to
>the previous approach which grouped all exception-handling related
>basic blocks into one section, we now group just the landing pads
>within one section and create a separate section for every other basic
>block (even those which potentially throw exceptions by calling other
>functions). This allows us to reorder code for exception-heavy
>programs with greater flexibility. Grouping all landing pad basic
>blocks in the same section lets us satisfy the ABI requirement of “one
>landing pad fragment per procedure fragment.” This could be further
>relaxed to separately grouping together all the landing pads which are
>used by every basic block section. We will leave this for future
>updates.
>
>With this change, we are able to speedup the SPEC benchmark xalancbmk
>by 0.5% which we were unable to do so previously. (Without exception
>handling support this benchmark regressed by 1%).
>
>We also built clang with exceptions enabled (DLLVM_ENABLE_EH=ON) and
>were able to get performance improvements similar to what we got
>without exceptions.
>
>2) Minimizing the number of basic block sections
>
>TLDR; We have significantly cut down the number of basic blocks for
>which sections are created.
>
>Previously, basic block sections were created for all the basic
>blocks in a function that had samples. We have modified the patches
>to allow creation of sections for only those basic blocks that have
>samples. This greatly reduces the number of sections created. This
>does not affect benchmark performance in any manner. For clang,
>making this change reduces the object size bloat from 100% (2x) to 50%
>(1.5x). Clang is an extreme benchmark for object size bloats since the
>perf samples touch 10% of all functions. For larger datacenter
>workload benchmarks we evaluated, the object size bloats are less than
>5% with this change.
>
>We are working on further reducing this so that the object file size
>bloats can be kept as minimal as possible. Currently, we create a
>section for a basic block even if it received just one perf sample.
>This is overkill and can be restricted to only basic blocks that were
>really hot.
>
>3) Interprocedural Basic Block Reordering
>
>TLDR; Interprocedural basic block reordering further improves
>performance by 1% (clang benchmark) but is slower and we are working
>on a better implementation.
>
>Previously, the function reordering done was intra-procedural and one
>of the comments was that if this could be done inter-procedurally. We
>have now added an option to do this inter-procedurally, it is not the
>default. While the inter-procedural reordering algorithm is slower by
>10x (increasing from 3 seconds to 30 seconds for clang) because the
>code layout search space is much larger, the performance of the final
>benchmark improves by an additional 1%. We are working on improving
>the implementation to make this much faster.
>
>4) Propeller functionality in lld is separated out
>
>After discussing with Rui, we have moved Propeller code into a
>separate directory. Propeller interfaces with lld through
>well-defined API. This provides the advantage that Propeller code
>itself and future changes can be reviewed independently.
>
>5) Reusing optimized LLVM IR files to re-link
>
>TLDR; We are working on re-using native object files from previous
>link and patches will be available soon
>
>The final re-link with Propeller can directly start from the high
>level optimized IR files. We are working on a patch to support
>options “--save-precodegen” which will automatically save optimized IR
>files which will later be reused by Propeller to reduce link time,
>Branch : https://github.com/google/llvm-propeller/tree/plo-codegen-backend.
>
>We are also working on directly using native object files from the
>previous link for those objects which did not accrue any profiling
>samples. Even for the clang benchmark, only 7% of the objects needed
>to be re-generated with basic block sections and the cached native
>object files can be used for the rest. This approach works
>particularly well with ThinLTO as the lto.cache can directly be used
>to obtain the native object files for modules that are not affected.
>We will send out patches for this soon.
>
>6) Time Overheads of CFI and Debug Info
>
>TLDR; Experiments show varying overheads with Debug Info depending
>on the symbolizers. We did not see any overheads with CFI.
>Collapsing .debug_ranges in the linker solves the overheads associated
>with this.
>
>We tried to measure two things. Overheads from accessing .eh_frame
>and .debug_ranges. With Basic block sections, a new CFI FDE is
>created for every basic block that gets a new section and a new range
>pair is added to debug_ranges. We wrote a benchmark , using
>libunwind, to dump the stack trace (no symbolization) and then do then
>dump the stack trace running the symbolizer (symbolization) using a
>Google Internal API.
>
>The benchmark had ~12000 functions and basic block sections for all
>basic blocks resulted in ~130000 more basic block sections and
>equivalent debug range entries, an order of magnitude more.
>
>For a vanilla binary, on average over several iterations, it took
>~70ns to dump the stack trace without symbolization and ~235000 ns
>with symbolization. We then enabled basic block sections across the
>board and re-measured and found no measurable impact. Symbolization
>seems to dominate the time.
>
>We then measured overheads with addr2line and llvm-symbolizer. We
>just timed symbolizing a couple of addresses. With addr2line, the
>binary with sections slows down by 5% to 10%. With llvm-symbolizer,
>the slowdown is much more, by upto 30% sometimes. Looking at the
>symbolizer code, we found places where the symbols are scanned
>linearly and also places where the .debug_ranges are not collapsed.
>Looking at the different symbolizer implementations, we found that
>some use binary search, some collapse debug ranges as and when they
>are added and some just do a linear search.
>
>The primary reason for this overhead is that the .debug_ranges are
>much larger. We are looking at improving this by doing one of these:
>
>a) Have the linker collapse the debug_ranges which are pretty much
>contiguous as most of the basic blocks end up in the same function.
>This is the easier solution and we are working on a patch. This should
>completely solve the problem.
>b) Alternately, modify the symbolizers to collapse the ranges as and
>when they are added.
>
>All our experiments were with full basic block sections. When basic
>block sections are turned on selectively, these problems should not
>manifest even now.
>
>7) New Relocation types to simplify linker relaxation of Basic Block
>Section Jumps
>
>We are working on adding two new relocation types: R_X86_64_PC32_JUMPX
>and R_X86_64_PC8_JUMPX to the x86-64 ps-ABI to make linker handing of
>relaxation simplified. Specifically, these two relocation types will
>allow the linker to easily identify the jump instructions that were
>created for bb sections. The PC8 relocations will only have to be
>checked for growing to 32 bits during relaxation.
>
>Proposal here: Redirecting to Google Groups

I took some time to understand the lld side changes
(D68065+D73497+D68073; the patch do not have clear relations and do not
apply at HEAD cleanly), attended a LLVM Social yesterday and asked some
people about their feelings. We are still dubious whether doing all the
heavy lifting on the linker side is the right approach. Disassembly at
linker side may give some short term benefits, but the interaction with
debug info/asynchronous unwind tables/symbol information is muddy. (From
the existing posted patch series it is not clear how they are solved.)

Doing non-LTO things on the linkder side is rather inflexible.
Linkers don't really understand the semantics. So I see things like:

* Get the input section size, check if there is a relocation (for JMP)
   r_offset = input_section_size-4, if there is a JCC at r_offset =
   input_section_size-9, ...
* Update st_value to take shrunk input sections into account.
* An algorithm to the extended travelling salesman problem looks fancy.
   It also introduced a bunch of parameters. Have other ordering approaches been considered?
   I am concerned that the raw ELF fields (like sh_size) are used for
   heuristics. It seems to try reconstructing some information about
   Machine{Function,BasicBlock,Instr,...} about doing that in the
   Machine* stage may just be simpler and more accurate.

We feel that the available optimizations may be limited. The lld focused
implementation may have some improvement in the near term but the
benefits will dilute quickly when the Machine{Function,BasicBlock}
interfaces get improved.

Also, doing heavy-lifting work on the linker side:

Needs non-trivial work to port to another binary format. Another target
may be a very different story. (Even for x86-64, I see the patches
attend to address R_X86_64_PC8/R_X86_64_PC32/R_X86_64_PLT32 calls, but
the two relocation types are not handled.)

I will specifically answer the relocations part as Rahman has
addressed the other things. This was discussed previously and Eli and
Rui suggested we use specific relocations for jmp instructions which
can be relaxed. We are working on that one here:
https://groups.google.com/g/x86-64-abi/c/-2AkYw1QJuI With the
relocations added, the relaxation code would become a lot simpler.

Topic		Replies	Views
Making Clang/LLVM faster using code layout optimizations LLVM Dev List Archives	3	241	October 19, 2018
[RFC] Propeller LLVM Dev List Archives	0	103	September 30, 2019
[RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization LLVM Dev List Archives	0	111	October 26, 2020
Propeller can work with pgo use one profile? IR & Optimizations pgo , clang	9	230	April 11, 2024
Preparing BOLT for LLVM monorepo LLVM Dev List Archives	33	323	January 12, 2022

[RFC] Propeller: A frame work for Post Link Optimizations

Related topics