llvm and clang are getting slower

I have just benchmarked building trunk llvm and clang in Debug,
Release, and LTO modes (see the attached script for the cmake lines).

The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
cases I used the system libgcc and libstdc++.

For Release builds there is a monotonic increase with each version:
from 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
5.3.2 takes 205 minutes.

Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.

Cheers,
Rafael

run.sh (936 Bytes)

LTO.time (262 Bytes)

Debug.time (259 Bytes)

Release.time (326 Bytes)

Hi Rafael,

Thanks for sharing. We also noticed this internally, and I know that Bruno and Chris are working on some infrastructure and tooling to help track compile-time regressions more closely.

We had this conversation internally about the tradeoff between compile time and runtime performance, and I planned to bring up the topic on the list in the coming months; this looks like a good occasion to plant the seed. Apparently in the past (years or a decade ago?) the project was very conservative about adding any optimization that would impact compile time; however, there is no explicit policy (that I know of) addressing this tradeoff.
The closest I could find would be what Chandler wrote in http://reviews.llvm.org/D12826; for instance, for O2 he stated that "if an optimization increases compile time by 5% or increases code size by 5% for a particular benchmark, that benchmark should also be one which sees a 5% runtime improvement".

My hope is that with better tooling for tracking compile time in the future, we'll reach a state where we'll be able to consider "breaking" the compile-time regression test as important as breaking any other test: i.e., the offending commit should be reverted unless it has been shown to significantly (hand-wavy...) improve the runtime performance.

<troll>
With the current trend, the Polly developers don't have to worry about improving their compile time, we'll catch up with them ;)
</troll>

Hi,

There is a possibility that r259673 could play a role here.

For the buildSchedGraph() method, there is the -dag-maps-huge-region option, which has a default value of 1000. When I committed the patch, I was expecting people to lower this value as needed and suggested as much, but that has not happened. 1000 is very high, basically "unlimited".

It would be interesting to see what results you get with e.g. -mllvm -dag-maps-huge-region=50. Of course, since this is a trade-off between compile time and scheduler freedom, some care should be taken before lowering this in trunk.
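
As an illustration (not part of the original message), here is a minimal sketch of how one might compare compile times of a single translation unit under different values of that flag; the source file name and the values tried below are placeholders, not recommendations.

```python
# Hypothetical sketch: time one translation unit under several
# -dag-maps-huge-region values. The source file and the values tried
# are placeholders; adjust for whatever TU is slow in your build.
import subprocess
import time

SOURCE = "SomeLargeFile.cpp"  # assumed: a large TU from the LLVM build

for limit in (1000, 200, 50):
    cmd = ["clang++", "-O2", "-c", SOURCE, "-o", "/dev/null",
           "-mllvm", f"-dag-maps-huge-region={limit}"]
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    elapsed = time.monotonic() - start
    print(f"dag-maps-huge-region={limit}: {elapsed:.1f}s")
```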

Just a thought,

Jonas

From: "Mehdi Amini via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Rafael Espíndola" <rafael.espindola@gmail.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "cfe-dev" <cfe-dev@lists.llvm.org>
Sent: Tuesday, March 8, 2016 11:40:47 AM
Subject: Re: [cfe-dev] [llvm-dev] llvm and clang are getting slower

Hi Rafael,

CC: cfe-dev

Thanks for sharing. We also noticed this internally, and I know that
Bruno and Chris are working on some infrastructure and tooling to
help tracking closely compile time regressions.

We had this conversation internally about the tradeoff between
compile-time and runtime performance, and I planned to bring-up the
topic on the list in the coming months, this looks like a good
occasion to plant the seed. Apparently in the past (years/decade
ago?) the project was very conservative on adding any optimizations
that would impact compile time, however there is no explicit policy
(that I know of) to address this tradeoff.
The closest I could find would be what Chandler wrote in:
⚙ D12826 [PM] Wire up optimization levels and default pipeline construction APIs in the PassBuilder. ; for instance for O2 he stated that
"if an optimization increases compile time by 5% or increases code
size by 5% for a particular benchmark, that benchmark should also be
one which sees a 5% runtime improvement".

My hope is that with better tooling for tracking compile time in the
future, we'll reach a state where we'll be able to consider
"breaking" the compile-time regression test as important as breaking
any test: i.e. the offending commit should be reverted unless it has
been shown to significantly (hand wavy...) improve the runtime
performance.

<troll>
With the current trend, the Polly developers don't have to worry
about improving their compile time, we'll catch up with them :wink:
</troll>

My two largest pet peeves in this area are:

1. We often use functions from ValueTracking (to get known bits, the number of sign bits, etc.) as though they're low cost. They're not really low cost. The problem is that they *should* be. These functions do bottom-up walks, and could cache their results. Instead, they do a limited walk and recompute everything each time. This is expensive, and a significant amount of our InstCombine time goes to ValueTracking, and that shouldn't be the case. The more we add to InstCombine (and related passes), and the more we run InstCombine, the worse this gets. On the other hand, fixing this will help both compile time and code quality.

  Furthermore, BasicAA has the same problem.
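
To make the caching point concrete, here is a toy sketch (plain Python, not LLVM's actual ValueTracking API) of the difference between re-walking a value's operands on every query and memoizing the per-value result:

```python
# Toy illustration only; not LLVM's actual ValueTracking API.
# A "known trailing zero bits" property is computed bottom-up over a
# chain of shift nodes. Querying it at every node without a cache (as an
# InstCombine-style pass effectively does) re-walks the whole chain each
# time; memoizing the per-node answer computes each result only once.
from functools import lru_cache

steps = 0

def trailing_zeros(node):
    """node is ('const', value) or ('shl', operand_node, shift_amount)."""
    global steps
    steps += 1
    if node[0] == "const":
        value = node[1]
        return 64 if value == 0 else (value & -value).bit_length() - 1
    return min(64, trailing_zeros(node[1]) + node[2])

@lru_cache(maxsize=None)
def trailing_zeros_cached(node):
    global steps
    steps += 1
    if node[0] == "const":
        value = node[1]
        return 64 if value == 0 else (value & -value).bit_length() - 1
    return min(64, trailing_zeros_cached(node[1]) + node[2])

# Build a chain of 100 shifts and query the property at every node.
node = ("const", 8)
nodes = [node]
for _ in range(100):
    node = ("shl", node, 1)
    nodes.append(node)

for n in nodes:
    trailing_zeros(n)
print("uncached walk steps:", steps)   # ~5000: quadratic in chain length

steps = 0
for n in nodes:
    trailing_zeros_cached(n)
print("cached computations:", steps)   # ~100: each node computed once
```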

2. We have "cleanup" passes in the pipeline, such as those that run after loop unrolling and/or vectorization, that run regardless of whether the preceding pass actually did anything. We've been adding more of these, and they catch important use cases, but we need a better infrastructure for this (either with the new pass manager or otherwise).

Also, I'm very hopeful that as our new MemorySSA and GVN improvements materialize, we'll see large compile-time improvements from that work. We spend a huge amount of time in GVN computing memory-dependency information (that dwarfs the time spent by GVN doing actual value numbering work by an order of magnitude or more).

-Hal


My two largest pet peeves in this area are:

  1. We often use functions from ValueTracking (to get known bits, the number of sign bits, etc.) as though they're low cost. They're not really low cost. The problem is that they should be. These functions do bottom-up walks, and could cache their results. Instead, they do a limited walk and recompute everything each time. This is expensive, and a significant amount of our InstCombine time goes to ValueTracking, and that shouldn't be the case. The more we add to InstCombine (and related passes), and the more we run InstCombine, the worse this gets. On the other hand, fixing this will help both compile time and code quality.

Furthermore, BasicAA has the same problem.

  2. We have "cleanup" passes in the pipeline, such as those that run after loop unrolling and/or vectorization, that run regardless of whether the preceding pass actually did anything. We've been adding more of these, and they catch important use cases, but we need a better infrastructure for this (either with the new pass manager or otherwise).

A related issue is that if an analysis is not preserved by a pass, it gets invalidated even if the pass doesn't end up modifying the code. Because of this, for example, we invalidate SCEV's cache unnecessarily. The new pass manager should fix this.
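
For illustration, here is a small Python sketch (not the actual LLVM pass manager API) of that point: the fix is to check whether the pass actually reported a change before dropping unpreserved analysis results.

```python
# Toy sketch of analysis invalidation; not LLVM's actual pass manager API.
# The problematic behavior drops every unpreserved analysis after a pass
# runs, even if the pass changed nothing; the fixed behavior invalidates
# only when the pass reports a modification.
class AnalysisCache:
    def __init__(self):
        self.results = {}  # analysis name -> cached result, e.g. "SCEV"

    def invalidate_unpreserved(self, preserved):
        self.results = {name: r for name, r in self.results.items()
                        if name in preserved}

class NoOpCleanupPass:
    preserved_analyses = set()  # declares that it preserves nothing

    def run(self, ir):
        return False            # ran, but did not modify the IR

def run_pipeline(passes, ir, cache, invalidate_only_on_change):
    for p in passes:
        changed = p.run(ir)
        if changed or not invalidate_only_on_change:
            cache.invalidate_unpreserved(p.preserved_analyses)

cache = AnalysisCache()
cache.results["SCEV"] = "expensive trip-count results"
run_pipeline([NoOpCleanupPass()], ir=None, cache=cache,
             invalidate_only_on_change=False)
print("old behavior keeps:", cache.results)  # {} -- cache dropped for nothing

cache.results["SCEV"] = "expensive trip-count results"
run_pipeline([NoOpCleanupPass()], ir=None, cache=cache,
             invalidate_only_on_change=True)
print("new behavior keeps:", cache.results)  # SCEV result survives
```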

Adam

> From: "Mehdi Amini via cfe-dev" <cfe-dev@lists.llvm.org>
> To: "Rafael Espíndola" <rafael.espindola@gmail.com>
> Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "cfe-dev" <
cfe-dev@lists.llvm.org>
> Sent: Tuesday, March 8, 2016 11:40:47 AM
> Subject: Re: [cfe-dev] [llvm-dev] llvm and clang are getting slower
>
> Hi Rafael,
>
> CC: cfe-dev
>
> Thanks for sharing. We also noticed this internally, and I know that
> Bruno and Chris are working on some infrastructure and tooling to
> help tracking closely compile time regressions.
>
> We had this conversation internally about the tradeoff between
> compile-time and runtime performance, and I planned to bring-up the
> topic on the list in the coming months, this looks like a good
> occasion to plant the seed. Apparently in the past (years/decade
> ago?) the project was very conservative on adding any optimizations
> that would impact compile time, however there is no explicit policy
> (that I know of) to address this tradeoff.
> The closest I could find would be what Chandler wrote in:
> ⚙ D12826 [PM] Wire up optimization levels and default pipeline construction APIs in the PassBuilder. ; for instance for O2 he stated that
> "if an optimization increases compile time by 5% or increases code
> size by 5% for a particular benchmark, that benchmark should also be
> one which sees a 5% runtime improvement".
>
> My hope is that with better tooling for tracking compile time in the
> future, we'll reach a state where we'll be able to consider
> "breaking" the compile-time regression test as important as breaking
> any test: i.e. the offending commit should be reverted unless it has
> been shown to significantly (hand wavy...) improve the runtime
> performance.
>
> <troll>
> With the current trend, the Polly developers don't have to worry
> about improving their compile time, we'll catch up with them :wink:
> </troll>

My two largest pet peeves in this area are:

I think you hit on something that I would expand on:

We don't hold the line very well on adding little things to passes and
analyses over time.
We add 1000 little walkers and pattern matchers to try to get better code,
and then often add knobs to try to control their overall compile time.
At some point, these all add up. You end up with the same flat profile if
you do this everywhere, but your compiler gets slower.
At some point, someone has to stop and say "well, wait a minute, are there
better algorithms or architecture we should be using to do this", and
either do it, or not let it get worse :) I'd suggest that, in most cases, we
know better ways to do almost all of these things.

Don't get me wrong, I don't believe there is any theoretically pure way to
do everything that we can just implement and never have to tweak. But it's
a continuum, and at some point you have to stop and re-evaluate whether the
current approach is really the right one if you have to add a billion
little things to it to get what you want.
We often don't do that.
We go *very* far down the path of a billion tweaks and adding knobs, and
what we have now, compile-time-wise, is what you get when you do that :)
I suspect this is because we don't really want to try to force work on
people who are just trying to get crap done. We're all good contributors
trying to do the right thing, and saying no often seems obstructionist, etc.
The problem is that at some point you end up with the tragedy of the commons.

(Also, not everything in the compiler has to catch every case to get good
code.)

1. We often use functions from ValueTracking (to get known bits, the
number of sign bits, etc.) as though they're low cost. They're not really
low cost. The problem is that they *should* be. These functions do
bottom-up walks, and could cache their results. Instead, they do a limited
walk and recompute everything each time. This is expensive, and a
significant amount of our InstCombine time goes to ValueTracking, and that
shouldn't be the case. The more we add to InstCombine (and related passes),
and the more we run InstCombine, the worse this gets. On the other hand,
fixing this will help both compile time and code quality.

(LVI is another great example. Fun fact: If you ask for value info for
everything, it's no longer lazy ....)

  Furthermore, BasicAA has the same problem.

2. We have "cleanup" passes in the pipeline, such as those that run after
loop unrolling and/or vectorization, that run regardless of whether the
preceding pass actually did anything. We've been adding more of these, and
they catch important use cases, but we need a better infrastructure for
this (either with the new pass manager or otherwise).

Also, I'm very hopeful that as our new MemorySSA and GVN improvements
materialize, we'll see large compile-time improvements from that work. We
spend a huge amount of time in GVN computing memory-dependency information
(that dwarfs the time spent by GVN doing actual value numbering work by an
order of magnitude or more).

I'm working on it ;)

I'm curious how these times divide across Clang and various parts of
LLVM; rerunning with -ftime-report and summing the numbers across all
compiles could be interesting.
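
A rough sketch of how one might do that summing over a captured build log (the "Total Execution Time" line format matched below is an assumption about clang's -ftime-report output, not a documented interface):

```python
# Rough sketch: sum per-compile timer totals out of a build log that
# captured clang's -ftime-report output. The exact line format matched
# here is an assumption, not a documented interface.
import re
import sys

TOTAL_RE = re.compile(r"Total Execution Time:\s*([0-9.]+)\s*seconds")

total_seconds = 0.0
matches = 0
with open(sys.argv[1]) as log:  # e.g. the captured ninja/make output
    for line in log:
        m = TOTAL_RE.search(line)
        if m:
            total_seconds += float(m.group(1))
            matches += 1

print(f"{matches} timer-group totals, {total_seconds / 3600:.2f} CPU-hours")
```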

> From: "Mehdi Amini via cfe-dev" <cfe-dev@lists.llvm.org>
> To: "Rafael Espíndola" <rafael.espindola@gmail.com>
> Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "cfe-dev" <
cfe-dev@lists.llvm.org>
> Sent: Tuesday, March 8, 2016 11:40:47 AM
> Subject: Re: [cfe-dev] [llvm-dev] llvm and clang are getting slower
>
> Hi Rafael,
>
> CC: cfe-dev
>
> Thanks for sharing. We also noticed this internally, and I know that
> Bruno and Chris are working on some infrastructure and tooling to
> help tracking closely compile time regressions.
>
> We had this conversation internally about the tradeoff between
> compile-time and runtime performance, and I planned to bring-up the
> topic on the list in the coming months, this looks like a good
> occasion to plant the seed. Apparently in the past (years/decade
> ago?) the project was very conservative on adding any optimizations
> that would impact compile time, however there is no explicit policy
> (that I know of) to address this tradeoff.
> The closest I could find would be what Chandler wrote in:
> ⚙ D12826 [PM] Wire up optimization levels and default pipeline construction APIs in the PassBuilder. ; for instance for O2 he stated that
> "if an optimization increases compile time by 5% or increases code
> size by 5% for a particular benchmark, that benchmark should also be
> one which sees a 5% runtime improvement".
>
> My hope is that with better tooling for tracking compile time in the
> future, we'll reach a state where we'll be able to consider
> "breaking" the compile-time regression test as important as breaking
> any test: i.e. the offending commit should be reverted unless it has
> been shown to significantly (hand wavy...) improve the runtime
> performance.
>
> <troll>
> With the current trend, the Polly developers don't have to worry
> about improving their compile time, we'll catch up with them :wink:
> </troll>

My two largest pet peeves in this area are:

I think you hit on something that i would expand on:

We don't hold the line very well on adding little things to passes and
analysis over time.
We add 1000 little walkers and pattern matchers to try to get better code,
and then often add knobs to try to control their overall compile time.
At some point, these all add up. You end up with the same flat profile if
you do this everywhere, but your compiler gets slower.
At some point, someone has to stop and say "well, wait a minute, are there
better algorithms or architecture we should be using to do this", and
either do it, or not let it get worse :slight_smile: I'd suggest, in most cases, we
know better ways to do almost all of these things.

Don't get me wrong, i don't believe there is any theoretically pure way to
do everything that we can just implement and never have to tweak. But it's
a continuum, and at some point you have to stop and re-evaluate whether the
current approach is really the right one if you have to have a billion
little things to it get what you want.
We often don't do that.
We go *very* far down the path of a billion tweaks and adding knobs, and
what we have now, compile time wise, is what you get when you do that :slight_smile:
I suspect this is because we don't really want to try to force work on
people who are just trying to get crap done. We're all good contributors
trying to do the right thing, and saying no often seems obstructionist, etc.
The problem is at some point you end up with the tragedy of the commons.

(also, not everything in the compiler has to catch every case to get good
code)

1. We often use functions from ValueTracking (to get known bits, the
number of sign bits, etc.) as through they're low cost. They're not really
low cost. The problem is that they *should* be. These functions do
bottom-up walks, and could cache their results. Instead, they do a limited
walk and recompute everything each time. This is expensive, and a
significant amount of our InstCombine time goes to ValueTracking, and that
shouldn't be the case. The more we add to InstCombine (and related passes),
and the more we run InstCombine, the worse this gets. On the other hand,
fixing this will help both compile time and code quality.

(LVI is another great example. Fun fact: If you ask for value info for
everything, it's no longer lazy ....)

Yep -- see the bug Wei is working on:
https://llvm.org/bugs/show_bug.cgi?id=10584

David

I have noticed that LLVM doesn't seem to "like" large functions, as a general rule. Admittedly, my experience is similar with gcc, so I'm not sure it's something that can be easily fixed. And I'm probably sounding like a broken record, because I have said this before.

My experience is that the time it takes to compile something grows faster than linearly with the size of the function.

Of course, the LLVM code is growing over time, both to support more features and to support more architectures, new processor types and instruction sets, at least some of which will lead to larger functions in general [and this is the function "after inlining", so splitting out small 'called once' functions doesn't really help much].

I will have a little play to see if I can identify more of a culprit [at the very least, whether it's "large basic blocks" or "large functions" that is the problem] - of course, this could be unrelated and irrelevant to the problem Daniel is pointing at, and it may or may not be easily resolved.

On a somewhat smaller (but hopefully more actionable) scale, we noticed that build time regressed ~10% recently in 262315:262447. I'm still trying to repro locally (no luck so far; maybe it's a bot config thing, not a clang-side problem), but if this rings a bell for anyone, please let me know :)

https://bugs.chromium.org/p/chromium/issues/detail?id=593030

I have noticed that LLVM doesn't seem to "like" large functions, as a
general rule. Admittedly, my experience is similar with gcc, so I'm not
sure it's something that can be easily fixed. And I'm probably sounding
like a broken record, because I have said this before.

My experience is that the time it takes to compile something grows
faster than linearly with the size of the function.

The number of BBs -- Kostya can point you to the compile-time bug that is
exposed by ASan.

David

In case someone finds it useful, this is some indication of the breakdown of where time is spent during a build of Clang.

tl;dr: in Debug+Asserts about 10% of the time is spent in the backend, and in Release without asserts (and without debug info, IIRC) about 33% of the time is spent in the backend.

These are the charts I collected a while back breaking down the time it takes clang to compile itself.
See the thread "[cfe-dev] Some DTrace probes for measuring per-file time" for how I collected this information. The raw data is basically aggregated CPU time spent textually parsing each header (and IRGen'ing them, since clang does that as it parses). There are also a couple of "phony" headers to cover stuff like the backend/optimizer.

Since there are a large number of files, the pie charts below are grouped into rough categories. E.g. the "llvm headers" slice includes the time spent on include/llvm/Support/raw_ostream.h and all other headers in include/llvm. The "libc++" pie slice contains the time spent in the libc++ system headers (this data was collected on a Mac, so libc++ was the C++ standard library). "system" covers the C system headers.

All time spent inside the LLVM optimizer is in the “after parsing” pie slice.

Debug with asserts:

Release without asserts (and without debug info IIRC):

– Sean Silva

> I have just benchmarked building trunk llvm and clang in Debug,
> Release and LTO modes (see the attached script for the cmake lines).
>
> The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
> cases I used the system libgcc and libstdc++.
>
> For release builds there is a monotonic increase in each version. From
> 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
> 5.3.2 takes 205 minutes.
>
> Debug and LTO show an improvement in 3.7, but have regressed again in
3.8.

I'm curious how these times divide across Clang and various parts of
LLVM; rerunning with -ftime-report and summing the numbers across all
compiles could be interesting.

Based on the results I posted upthread about the relative time spent in the
backend for debug vs release, we can estimate this.
To summarize:
10% of time spent in LLVM for Debug
33% of time spent in LLVM for Release
(I'll abbreviate "in LLVM" as just "backend"; this is "backend" from
clang's perspective)

Let's look at the difference between 3.5 and trunk.

For debug, the user time jumps from 174m50.251s to 197m9.932s.
That's {10490.3, 11829.9} seconds, respectively.
For release, the corresponding numbers are:
{9826.71, 12714.3} seconds.

debug35 = 10490.251
debugTrunk = 11829.932

debugTrunk/debug35 == 1.12771
debugRatio = 1.12771

release35 = 9826.705
releaseTrunk = 12714.288

releaseTrunk/release35 == 1.29385
releaseRatio = 1.29385

For simplicity, let's use a simple linear model for the distribution of
slowdown between the frontend and backend: a constant factor slowdown for
the backend, and an independent constant factor slowdown for the frontend.
This gives the following linear system:
debugRatio = .1 * backendRatio + (1 - .1) * frontendRatio
releaseRatio = .33 * backendRatio + (1 - .33) * frontendRatio

Solving this linear system we find that under this simple model, the
expected slowdown factors are:
backendRatio = 1.77783
frontendRatio = 1.05547

Intuitively, backendRatio comes out larger in this comparison because we
see the biggest slowdown during release (1.29 vs 1.12), and during release
we are spending a larger fraction of time in the backend (33% vs 10%).
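
For concreteness, here is that 3.5-to-trunk calculation as a small worked example (the 0.10 and 0.33 backend shares are the estimates quoted above):

```python
# Worked example of the two-equation model for the 3.5 -> trunk transition:
#   debugRatio   = 0.10 * backendRatio + 0.90 * frontendRatio
#   releaseRatio = 0.33 * backendRatio + 0.67 * frontendRatio
debug_ratio = 11829.932 / 10490.251    # ~1.128
release_ratio = 12714.288 / 9826.705   # ~1.294

f_debug, f_release = 0.10, 0.33        # backend share of CPU time

# Solve the 2x2 linear system by elimination.
backend_ratio = ((1 - f_debug) * release_ratio
                 - (1 - f_release) * debug_ratio) / (f_release - f_debug)
frontend_ratio = (debug_ratio - f_debug * backend_ratio) / (1 - f_debug)

print(f"backendRatio  = {backend_ratio:.4f}")   # ~1.778, as quoted above
print(f"frontendRatio = {frontend_ratio:.4f}")  # ~1.055, as quoted above
```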

Applying this same model across Rafael's data, we find the following
(numbers have been rounded for clarity):

transition backendRatio frontendRatio
3.5->3.6 1.08 1.03
3.6->3.7 1.30 0.95
3.7->3.8 1.34 1.07
3.8->trunk 0.98 1.02

Note that in Rafael's measurements LTO is pretty similar to Release from a
CPU time (user time) standpoint. While the final LTO link takes a large
amount of real time, it is single threaded. Based on the real time numbers
the LTO link was only spending about 20 minutes single-threaded (i.e. about
20 minutes CPU time), which is pretty small compared to the 300-400 minutes
of total CPU time. It would be interesting to see the numbers for -O0 or
-O1 per-TU together with LTO.

-- Sean Silva


Based on the results I posted upthread about the relative time spent in
the backend for debug vs release, we can estimate this.
To summarize:

That is, "to summarize" the post upthread that I'm referring to. The summary
of that post is that most of the slowdown seems to be in the backend.

-- Sean Silva

I have noticed that LLVM doesn't seem to "like" large functions, as a
general rule. Admittedly, my experience is similar with gcc, so I'm not
sure it's something that can be easily fixed. And I'm probably sounding
like a broken record, because I have said this before.

My experience is that the time it takes to compile something grows
faster than linearly with the size of the function.

The number of BBs -- Kostya can point you to the compile-time bug that is
exposed by ASan.

I believe we also have some superlinear behavior with BB size.

-- Sean Silva

See https://llvm.org/bugs/show_bug.cgi?id=17409

I believe much of the compile-time issue related to BlockFrequency computation has been fixed by Cong's recent work converting weight-based interfaces to BranchProbability-based interfaces. The other issue, related to the spiller, probably still remains.

David

Just a note about LTO being sequential: Rafael mentioned he was "building trunk llvm and clang". By default I believe there are ~56 link targets that can be run in parallel (provided you have enough RAM to avoid swapping).

Just a note about LTO being sequential: Rafael mentioned he was "building
trunk llvm and clang". By default I believe there are ~56 link targets that
can be run in parallel (provided you have enough RAM to avoid swapping).

Correct. The machine has no swap :)

But some targets (clang) are much larger and I have the impression
that the last minute or so of the build is just finishing that one
link.

Cheers,
Rafael

+1


Note that in Rafael's measurements LTO is pretty similar to Release from a
CPU time (user time) standpoint. While the final LTO link takes a large
amount of real time, it is single threaded. Based on the real time numbers
the LTO link was only spending about 20 minutes single-threaded (i.e. about
20 minutes CPU time), which is pretty small compared to the 300-400 minutes
of total CPU time. It would be interesting to see the numbers for -O0 or
-O1 per-TU together with LTO.

Just a note about LTO being sequential: Rafael mentioned he was "building
trunk llvm and clang". By default I believe it is ~56 link targets that can
be run in parallel (provided you have enough RAM to avoid swapping).

D'oh! I was looking at the data wrong since I broke my Fundamental Rule of
Looking At Data, namely: don't look at raw numbers in a table since you are
likely to look at things wrong or form biases based on the order in which
you look at the data points; *always* visualize. There is a significant
difference between release and LTO. About 2x consistently.

[chart: Release vs. LTO build times]

This is actually curious because during the release build, we were spending
33% of CPU time in the backend (as clang sees it; i.e. mid-level optimizer
and codegen). This data is inconsistent with LTO simply being another run
through the backend (which would be just +33% CPU time at worst). There
seems to be something nonlinear happening.
To make it worse, the LTO build has approximately a full Release
optimization running per-TU, so the actual LTO step should be seeing
inlined/"cleaned up" IR which should be much smaller than what the per-TU
optimizer is seeing, so naively it should take *even less* than "another
33% CPU time" chunk.
Yet we see a 1.5x-2x difference:

[chart: Release vs. LTO build times]

-- Sean Silva