llvm (the middle-end) is getting slower, December edition

First of all, sorry for the long mail.
Inspired by the excellent analysis Rui did for lld, I decided to do
the same for llvm.
I'm personally very interested in build time for the LTO
configuration, with particular attention to the time spent in the
optimizer.
Rafael did something similar back in March, so this can be considered
an update. This one tries to include a more accurate high-level
analysis of where llvm is spending CPU cycles.
Here I present 2 cases: clang building itself with `-flto` (Full), and
clang building an internal codebase which I'm going to refer to as
`game7`.
It's a mid-sized program (it's actually a game), more or less the
size of clang, which we use internally as a benchmark to track
compile-time/runtime improvements/regressions.
I picked two random revisions of llvm: trunk (December 16th 2016) and
trunk (June 2nd 2016), so, roughly, a six-month period.
My setup is a Mac Pro running Linux (NixOS).
These are the numbers I collected (including the output of -mllvm
-time-passes; the four timing columns are user time, system time,
user+system, and wall time, each with its share of the total).
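For reference, the collection boils down to something along these
lines (a sketch; flags spelled from memory, with "..." standing in
for the real link inputs):

  # -time-passes reaches the LTO codegen through the linker:
  time clang -flto -fuse-ld=lld -Wl,-mllvm,-time-passes ... -o clang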
For clang:

June 2nd:
real 22m9.278s
user 21m30.410s
sys 0m38.834s
  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
  289.8102 ( 23.5%)   18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 ( 24.3%)  X86 DAG->DAG Instruction Selection
   97.2730 (  7.9%)    0.7656 (  2.2%)   98.0386 (  7.7%)   98.0010 (  7.7%)  Global Value Numbering
   62.4091 (  5.1%)    0.4779 (  1.4%)   62.8870 (  4.9%)   62.8665 (  5.0%)  Function Integration/Inlining
   58.6923 (  4.8%)    0.4767 (  1.4%)   59.1690 (  4.7%)   59.1323 (  4.7%)  Combine redundant instructions
   53.9602 (  4.4%)    0.6163 (  1.8%)   54.5765 (  4.3%)   54.5409 (  4.3%)  Combine redundant instructions
   51.0470 (  4.1%)    0.5703 (  1.6%)   51.6173 (  4.1%)   51.5425 (  4.1%)  Loop Strength Reduction
   47.4067 (  3.8%)    1.3040 (  3.7%)   48.7106 (  3.8%)   48.7034 (  3.8%)  Greedy Register Allocator
   36.7463 (  3.0%)    0.8133 (  2.3%)   37.5597 (  3.0%)   37.4612 (  3.0%)  Induction Variable Simplification
   37.0125 (  3.0%)    0.2699 (  0.8%)   37.2824 (  2.9%)   37.2478 (  2.9%)  Combine redundant instructions
   34.2071 (  2.8%)    0.2737 (  0.8%)   34.4808 (  2.7%)   34.4487 (  2.7%)  Combine redundant instructions
   25.6627 (  2.1%)    0.3215 (  0.9%)   25.9842 (  2.0%)   25.9509 (  2.0%)  Combine redundant instructions

Dec 16th:
real 27m34.922s
user 26m53.489s
sys 0m41.533s

  287.5683 ( 18.5%)   19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 ( 19.3%)  X86 DAG->DAG Instruction Selection
  197.9211 ( 12.7%)    0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 ( 12.5%)  Function Integration/Inlining
  106.9669 (  6.9%)    0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (  6.8%)  Global Value Numbering
   89.7571 (  5.8%)    0.4840 (  1.3%)   90.2411 (  5.7%)   90.2067 (  5.7%)  Combine redundant instructions
   79.0456 (  5.1%)    0.7534 (  2.0%)   79.7990 (  5.0%)   79.7630 (  5.0%)  Combine redundant instructions
   55.6393 (  3.6%)    0.3116 (  0.8%)   55.9509 (  3.5%)   55.9187 (  3.5%)  Combine redundant instructions
   51.8663 (  3.3%)    1.4090 (  3.7%)   53.2754 (  3.3%)   53.2684 (  3.3%)  Greedy Register Allocator
   52.5721 (  3.4%)    0.3021 (  0.8%)   52.8743 (  3.3%)   52.8416 (  3.3%)  Combine redundant instructions
   49.0593 (  3.2%)    0.6101 (  1.6%)   49.6694 (  3.1%)   49.5904 (  3.1%)  Loop Strength Reduction
   41.2602 (  2.7%)    0.9608 (  2.5%)   42.2209 (  2.7%)   42.1122 (  2.6%)  Induction Variable Simplification
   38.1438 (  2.5%)    0.3486 (  0.9%)   38.4923 (  2.4%)   38.4603 (  2.4%)  Combine redundant instructions

so, llvm is around 20% slower than it used to be.

For our internal codebase the situation seems slightly worse:

`game7`

June 2nd:

Total Execution Time: 464.3920 seconds (463.8986 wall clock)

   88.0204 ( 20.3%)    6.0310 ( 20.0%)   94.0514 ( 20.3%)   94.0473 ( 20.3%)  X86 DAG->DAG Instruction Selection
   27.4382 (  6.3%)   16.2437 ( 53.9%)   43.6819 (  9.4%)   43.6823 (  9.4%)  X86 Assembly / Object Emitter
   34.9581 (  8.1%)    0.5274 (  1.8%)   35.4855 (  7.6%)   35.4679 (  7.6%)  Function Integration/Inlining
   27.8556 (  6.4%)    0.3419 (  1.1%)   28.1975 (  6.1%)   28.1824 (  6.1%)  Global Value Numbering
   22.1479 (  5.1%)    0.2258 (  0.7%)   22.3737 (  4.8%)   22.3593 (  4.8%)  Combine redundant instructions
   19.2346 (  4.4%)    0.3639 (  1.2%)   19.5985 (  4.2%)   19.5870 (  4.2%)  Post RA top-down list latency scheduler
   15.8085 (  3.6%)    0.2675 (  0.9%)   16.0760 (  3.5%)   16.0614 (  3.5%)  Combine redundant instructions

Dec 16th:

Total Execution Time: 861.0898 seconds (860.5808 wall clock)

  135.7207 ( 15.7%)    0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 ( 15.2%)  Combine redundant instructions
  103.6609 ( 12.0%)    0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 ( 11.7%)  Combine redundant instructions
   97.1083 ( 11.3%)    6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 ( 11.6%)  X86 DAG->DAG Instruction Selection
   72.6125 (  8.4%)    0.1701 (  0.5%)   72.7826 (  8.1%)   72.7678 (  8.1%)  Combine redundant instructions
   69.2144 (  8.0%)    0.6060 (  1.9%)   69.8204 (  7.8%)   69.8007 (  7.8%)  Function Integration/Inlining
   60.7837 (  7.1%)    0.3783 (  1.2%)   61.1620 (  6.8%)   61.1455 (  6.8%)  Global Value Numbering
   56.5650 (  6.6%)    0.1980 (  0.6%)   56.7630 (  6.4%)   56.7476 (  6.4%)  Combine redundant instructions

so, using LTO, lld now takes 2x as long as it used to (and all the
extra time seems to be spent in the optimizer).

As an (extra) experiment, I decided to take the unoptimized output of
game7 (via lld -save-temps) and pass it to opt -O2. That shows another
significant regression (with different characteristics).
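Roughly, the experiment was along these lines (a sketch; the exact
name of the temp file lld emits varies by version, and game7.rsp is a
placeholder for the real link line):

  # the LTO link also dumps the pre-optimization bitcode,
  # e.g. something like game7.0.0.preopt.bc:
  ld.lld -save-temps @game7.rsp -o game7
  time opt -O2 -time-passes game7.0.0.preopt.bc -o /dev/null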

June 2nd:
time opt -O2
real 6m23.016s
user 6m20.900s
sys 0m2.113s

   35.9071 ( 10.0%)    0.7996 ( 10.9%)   36.7066 ( 10.0%)   36.6900 ( 10.1%)  Function Integration/Inlining
   33.4045 (  9.3%)    0.4053 (  5.5%)   33.8098 (  9.3%)   33.7919 (  9.3%)  Global Value Numbering
   27.1053 (  7.6%)    0.5940 (  8.1%)   27.6993 (  7.6%)   27.6995 (  7.6%)  Bitcode Writer
   25.6492 (  7.2%)    0.2491 (  3.4%)   25.8984 (  7.1%)   25.8805 (  7.1%)  Combine redundant instructions
   19.2686 (  5.4%)    0.2956 (  4.0%)   19.5642 (  5.4%)   19.5471 (  5.4%)  Combine redundant instructions
   18.6697 (  5.2%)    0.2625 (  3.6%)   18.9323 (  5.2%)   18.9148 (  5.2%)  Combine redundant instructions
   16.1294 (  4.5%)    0.2320 (  3.2%)   16.3614 (  4.5%)   16.3434 (  4.5%)  Combine redundant instructions
   13.5476 (  3.8%)    0.3945 (  5.4%)   13.9421 (  3.8%)   13.9295 (  3.8%)  Combine redundant instructions
   13.1746 (  3.7%)    0.1767 (  2.4%)   13.3512 (  3.7%)   13.3405 (  3.7%)  Combine redundant instructions

Dec 16th:

real 20m10.734s
user 20m8.523s
sys 0m2.197s

  208.8113 ( 17.6%)    0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 ( 17.5%)  Value Propagation
  179.6863 ( 15.2%)    0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 ( 15.1%)  Value Propagation
   92.0158 (  7.8%)    0.2674 (  3.0%)   92.2832 (  7.7%)   92.2613 (  7.7%)  Combine redundant instructions
   72.3330 (  6.1%)    0.6026 (  6.7%)   72.9356 (  6.1%)   72.9210 (  6.1%)  Combine redundant instructions
   72.2505 (  6.1%)    0.2167 (  2.4%)   72.4672 (  6.1%)   72.4539 (  6.1%)  Combine redundant instructions
   66.6765 (  5.6%)    0.3482 (  3.9%)   67.0247 (  5.6%)   67.0040 (  5.6%)  Combine redundant instructions
   65.5029 (  5.5%)    0.4092 (  4.5%)   65.9121 (  5.5%)   65.8913 (  5.5%)  Combine redundant instructions
   61.8355 (  5.2%)    0.8150 (  9.0%)   62.6505 (  5.2%)   62.6315 (  5.2%)  Function Integration/Inlining
   54.9184 (  4.6%)    0.3359 (  3.7%)   55.2543 (  4.6%)   55.2332 (  4.6%)  Combine redundant instructions
   50.2597 (  4.2%)    0.2187 (  2.4%)   50.4784 (  4.2%)   50.4654 (  4.2%)  Combine redundant instructions
   47.2597 (  4.0%)    0.3719 (  4.1%)   47.6316 (  4.0%)   47.6105 (  4.0%)  Global Value Numbering

I don't have an infrastructure to measure the runtime performance
benefits/regressions of clang, but I do for `game7`.
I wasn't able to notice any fundamental speedup (at least, not
something that justifies a 2x build time).

tl;dr:
There are quite a few things to notice:
1) GVN used to be the top pass in the middle-end in some cases, and
pretty much always in the top 3. This is not the case anymore, but
it's still a pass where we spend a lot of time. This is being worked
on by Daniel Berlin and me (NewGVN, https://reviews.llvm.org/D26224),
so there's some hope that it will be sorted out (or at least there's a
plan for it).
2) For clang, we spend 35% more time inside instcombine, and for game7
instcombine seems to largely dominate the amount of time we spend
optimizing IR. I tried to bisect (which is not easy considering the
test takes a long time to run), but I wasn't able to identify a single
point in time responsible for the regression. It seems to be an
additive effect. My wild (or not so wild) guess is that every day
somebody adds a matcher or two because it improves their testcase, and
at some point all these things add up. I'll try to do some additional
profiling, but I guess a large part of our time is spent solving
bitwise-domain dataflow problems (ComputeKnownBits and the like). Once
GVN is in a more stable state, I plan to experiment with caching
results.
3) Something peculiar is that we spend 2x the time in the inliner. I'm
not familiar with the inliner; IIRC there were some changes to the
thresholds recently, so any help here would be appreciated, both in
reproducing the results and with the analysis (see the command sketch
after point 4).
4) For the last testcase (opt -O2 on a large unoptimized chunk of
bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
think it's not as lazy as it claims to be (or at least, not the way we
use it). This doesn't show up in a full LTO run because we don't run
CVP as part of the default LTO pipeline, but the case shows how LVI
can be a bottleneck for large TUs (or at least how -O2 compile time
degraded on large TUs). I haven't thought about the problem very
carefully, but there seems to be some progress on this front
(https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
original bitcode file, but I can probably do some profiling on it as
well.
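For points 3) and 4), a rough way to poke at each in isolation (a
sketch; the pass names/flags are the legacy pass manager spellings,
and game7.preopt.bc is a placeholder for the -save-temps output):

  # 3) pin the inliner threshold, to rule the threshold changes in
  #    or out (225 being, if memory serves, the long-standing -O2
  #    default):
  time opt -O2 -inline-threshold=225 -disable-output game7.preopt.bc

  # 4) time CVP (and thus LVI) on its own:
  time opt -correlated-propagation -disable-output game7.preopt.bc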

As next steps I'll try to get a more detailed analysis of the
problems. In particular, I'll try to do what Rui did for lld but with
coarser granularity (every week), to get a chart of the compile-time
trend for these cases over the last 6 months, and post it here.

I think (I know) some people are aware of the problems I outline in
this e-mail, but apparently not everybody. We're in a situation where
compile time is increasing without real control. I'm happy that Apple
is making a serious effort to track build time, so hopefully things
will improve. There are, however, some cases (like adding matchers or
knobs in instcombine) where the compile-time regression is hard to
track until it's too late. LLVM as a project tries not to stop people
from getting things done, and that's great, but from time to time
it's good to take a step back and re-evaluate approaches.
The purpose of this e-mail was to outline where we regressed, for
those interested.

Thanks for your time, and of course, feedback welcome!

Davide,

Thank you for this analysis.

As a front-end developer using LLVM I have one piece of feedback:

I’m not really concerned about compilation time for -O2 or -O3, but I am extremely interested in minimizing compilation time for -O0.

Thank you for all your hard work,
Andrew (http://ziglang.org/)

[...]

2) For clang, we spend 35% more time inside instcombine, and for game7
instcombine seems to largely dominate the amount of time we spend
optimizing IR. I tried to bisect (which is not easy considering the
test takes a long time to run), but I wasn't able to identify a single
point in time responsible for the regression. It seems to be an
additive effect. My wild (or not so wild) guess is that every day
somebody adds a matcher or two because it improves their testcase, and
at some point all these things add up. I'll try to do some additional
profiling, but I guess a large part of our time is spent solving
bitwise-domain dataflow problems (ComputeKnownBits and the like). Once
GVN is in a more stable state, I plan to experiment with caching
results.

We have seen a similar thing when compiling the Swift standard library. We have talked about making a small, simple instcombine pass that doesn't iterate to a fixed point; IIRC Andy/Mark looked at this (so my memory might be wrong).

[...]

I don't have an infrastructure to measure the runtime performance
benefits/regressions of clang, but I do for `game7`.
I wasn't able to notice any fundamental speedup (at least, not
something that justifies a 2x build time).

Just to provide numbers (using Sean's Mathematica template, thanks):
here's a plot of the CDF of the frame times in a particular
level/scene. The curves pretty much match, and the picture in the
middle shows a relative difference of 0.4% between Jun and Dec (which
could very possibly be in the noise).
On the same scene, the difference between -O3 and -flto is 12%, FWIW.
So it seems that, at least for this particular case, all the extra
compile time didn't buy any runtime improvement.

An efficient way to bisect this is to:

1) dump the IR right before instcombine, and then run only opt -instcombine and confirm the regression shows up.
2) Then reduce the input: you should ultimately be able to single out a single function. (Maybe with bugpoint or with -opt-bisect-limit.)
3) With a single function that shows the regression, it should be fairly easy to plot the time to run opt -instcombine for almost every revision between June and now; roughly as in the sketch below.
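For step 3, the per-revision timing loop could be as simple as (a
sketch; assuming a git mirror of llvm, a reduced reproducer in
reduced.ll, and $JUNE/$DEC being the two endpoint revisions):

  for rev in $(git rev-list --reverse $JUNE..$DEC); do
    git checkout -q $rev && ninja -C build opt
    echo -n "$rev "
    # opt's timing goes to stderr; keep just the wall time
    { time build/bin/opt -instcombine -disable-output reduced.ll ; } 2>&1 | grep real
  done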

I tried 1) and I'm able to reproduce the increase in compile time. 2)
is on my todo list. I plan to use (and I can see how to use) bugpoint
or delta (with `ulimit`), but it's not entirely clear to me how to
reduce using -opt-bisect-limit. As far as I know, that just runs
passes up to a given point of the pipeline, while here the regression
shows up even with a single pass, i.e. opt -instcombine. Can you
please elaborate?
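(For reference, the bugpoint + ulimit trick I have in mind is roughly
the following; a sketch, where check.sh is a hypothetical script that
fails whenever the compile takes too long, which bugpoint then treats
as "interesting" while it shrinks the input:

  cat > check.sh <<'EOF'
  #!/bin/sh
  ulimit -t 30                       # die if instcombine needs > 30s of CPU
  opt -instcombine -disable-output "$@"
  EOF
  chmod +x check.sh
  bugpoint -compile-custom -compile-command=./check.sh repro.bc
)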

[...]

We have seen a similar thing when compiling the Swift standard library.
We have talked about making a small, simple instcombine pass that
doesn't iterate to a fixed point; IIRC Andy/Mark looked at this (so my
memory might be wrong).

I would expect that if the slowdown was an increase in time spent iterating
to a fixed point, we would also be matching more patterns and simplifying
the code more, but the small performance delta on game7 doesn't seem to
corroborate that.

One issue with instcombine is that its matching time (even if it
doesn't find anything) is still O(# of patterns tested). That is very
consistent with the slowdown model where, as new checks are added, it
gradually gets slower. Approaches like Nuno's Alive avoid this
overhead by building a matching automaton. From the above, it sounds
like we're spending 10s of percents of our time in instcombine, so it
may be worth it.

Without having thought about it too closely, my gut reaction is that I
don't think it's a good idea to make an "instcombine V2"; incrementally
improving the one we already have (e.g. moving a large portion of its
matches into an automaton-based matcher; or conditionalizing (deleting?)
checks that aren't worth it) seems like a better long-term approach.

-- Sean Silva

[...]

4) For the last testcase (opt -O2 on a large unoptimized chunk of
bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
think it's not as lazy as it claims to be (or at least, not the way we
use it). This doesn't show up in a full LTO run because we don't run
CVP as part of the default LTO pipeline, but the case shows how LVI
can be a bottleneck for large TUs (or at least how -O2 compile time
degraded on large TUs). I haven't thought about the problem very
carefully, but there seems to be some progress on this front
(https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
original bitcode file, but I can probably do some profiling on it as
well.

LVI is one of those analyses with quadratic runtime, but it has a
cutoff to its search depth so that it is technically not quadratic. So
increased inlining could easily exacerbate it more than
non-"quadratic" passes (increased inlining would also cause a general
slowdown, too).

-- Sean Silva

Unless I’m mistaken, -opt-bisect-limit operates per execution of a pass. So even if you schedule a single pass it will be efficient, actually bisecting through the execution over each function, and it may help find which function in the module is causing the largest increase.
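Concretely, something like this (a sketch; repro.ll being the reduced
input):

  opt -instcombine -opt-bisect-limit=100 -disable-output repro.ll
  # prints a "BISECT: running pass (N) ... on function (...)" line
  # per pass execution; binary-searching on N while timing narrows
  # down which function the time goes into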

[...]

> I don't have an infrastructure to measure the runtime performance
> benefits/regressions of clang, but I do for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build time).
>

Just to provide numbers (using Sean's Mathematica template, thanks):
here's a plot of the CDF of the frame times in a particular
level/scene. The curves pretty much match, and the picture in the
middle shows a relative difference of 0.4% between Jun and Dec (which
could very possibly be in the noise).

With 5-10 runs per binary you should be able to reliably measure down to
0.5% difference on game7 (50 usec difference per frame). With just one run
per binary like you have here I would not trust a 0.4% difference.

-- Sean Silva

Yeah, I agree. This was mostly a sanity check to understand whether there was a significant improvement at runtime. I ran the same test 7 times in the last hour, but the difference is always in the noise, FWIW.

LVI is one of those analyses with quadratic runtime, but it has a
cutoff to its search depth so that it is technically not quadratic. So
increased inlining could easily exacerbate it more than
non-"quadratic" passes (increased inlining would also cause a general
slowdown, too).

LVI is only quadratic because of the way we've built it (it's actually
worse than quadratic, but let's just normalize to "quadratic").
Non-lazy versions are not quadratic, and you could likely build an
incremental lazy version that is also not quadratic in practice.
At some point, none of this will change unless people hold the line
somewhere. Compilers usually get slower 0.1% at a time, not in huge leaps
and bounds.
Without people saying "If we really want to get this case, in a thing not
really designed to get that case sanely, we need to stop and think about
it", it doesn't get thought about.

Obviously, you can go too far into that extreme, but I think we are
still too far on one side of that one (at least, in most places in
LLVM).

Good point. One thing that struck me when randomly browsing the GCC
source code is that GCC seems to use far more sophisticated algorithms
(e.g. lots of files with comments at the top citing different papers
and things; not that counting citations means that much, but it's a
rough proxy for keeping up with the literature). With LLVM, it seems
like most of the stuff is homegrown, or at least not frequently
revisited with an eye to using a new algorithm. And once something
starts picking up a bunch of special cases and things, it becomes
harder and harder to replace with a new, better thing, because nobody
will want to throw the old thing away until the new thing covers the
same set of test cases.

This is precisely what happened with GVN, for example (and happens with
other things, like InstCombine, like LVI, like BasicAA, ...). All of
these things started by serving a purpose and with some useful design
lifetime. They made sense for those things. People push them past their
useful design lifetime and original purposes, but they don't get redesigned
or rethought. Pretty soon, you have a mess that is very hard to clean up
for the reason you stated (in practice, you may have to split analyses
or passes, and accept that you can't and shouldn't try to catch
absolutely everything in every single pass).

Thankfully, a number of companies/people in our community can get the
time to spend on "technical debt" instead of just improving
performance, but it's, IMHO, still weighted a little too much toward
the performance-at-any-cost side.
It happens with all compilers, of course; it's not specific to LLVM.

As for why GCC tends to use more literature, again IMHO:
GCC has more areas than LLVM with specific code owners, which has the
upside that someone is ostensibly thinking about these things and has
some thought about where that area should go. They can provide
explicit guidance, or at least make sure something is going in the
right direction. The downside, of course, is that it comes with a bit
of bureaucracy (though, IMHO, LLVM has this in practice, it's just
hidden more), and sometimes a little less innovation (again, it
varies). It also means people have to not view it as an honorific,
and when they move on or change or stop caring or what have you, the
area gets reassigned to someone else. The other downside is that it
often involves saying "no, let's not do that, let's do this instead" a
lot, and that makes people feel bad when they know someone else is
just trying to achieve a task someone assigned them.

The only thing any of these approaches change, in the end, is the
useful lifetime of the end result. It doesn't mean you end up with a
thing that is perfect, but you may get a thing that works well for 10
years instead of 5. Whether that is the right tradeoff for what you
are doing also varies wildly (it probably makes no sense to spend 6
months designing a thing that is only meant to last 6 months).

Also, this isn't, of course, to say that homegrown is necessarily bad.
But doing it well at least requires having read and understood enough
theory to understand how to grow your own thing well, or be doing a
thing small enough that it just doesn't matter :)
Personally, I learned compilers and optimization in an environment
surrounded by people who had pretty much done everything we are trying
to do before, and so I always learned to go hunting to see what the
state of the art was, and the theories behind it, before I tried to
roll my own.
That didn't always make the state of the art good, sane, or even
implementable, mind you, but it meant it at least got thought about,
and often, even when the papers were crazy, they had ideas or
approaches you could reuse.

--Dan

[...]

Dec 16th:

real 20m10.734s
user 20m8.523s
sys 0m2.197s

  208.8113 ( 17.6%)    0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 ( 17.5%)  Value Propagation
  179.6863 ( 15.2%)    0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 ( 15.1%)  Value Propagation
[...]

I'd really like to see a profile that breaks down the time spent in Value Propagation and LVI. As the person who has touched both most recently, I am probably the responsible party. At the same time, there are a number of known fixes we can apply depending on the way this particular compile-time problem exhibits itself. I'm happy to take a look at this particular issue if you can give me enough information to analyze it.


I'll try to reduce the testcase and do some profiling on it, then come
back to you.
Something to keep in mind: please note I'm not pointing fingers. The
main motivation behind this e-mail is showing that multiple passes in
the compiler seem to have regressed on large testcases. Many people
don't notice because they are (un)lucky enough to not run with LTO.
There are probably/maybe other regressions left uncovered by my
non-exhaustive analysis. I wasn't particularly pleased, after a fair
amount of time working on GVN to clean up technical debt, to realize
that the compiler got slower in other areas.

From our own offline regression testing, one of the biggest culprits in our experience is Instcombine’s known-bits calculation. A number of new known-bits checks have been added in the past few years (e.g. to infer nuw, nsw, etc. on various instructions) and the cost adds up quite a lot, because *the cost is paid even if Instcombine does nothing*, since it’s a significant cost on visiting every relevant instruction.

This IME is one of the greatest ways performance gets lost: a tiny bit at a time, whenever a new combine/transformation is added that is *expensive to test for*. The test has to be done every time no matter what (and instcombine gets called a lot!), so the cost adds up.

—escha

In that case we wanted a lighter-weight instcombine cleanup pass to run right after LLVM IR generation. But, in general, it makes sense to separate cleanup and canonicalization from optimization. I’ve always been strongly in favor of splitting InstCombine into a set of cheap, canonical transforms that run frequently, vs. optimizations that can run once later in the pipeline. It’s particularly annoying when somewhat target-specific codegen optimizations get thrown into InstCombine; I’m not sure how prevalent that still is. It would be a lot of work to go through all the patterns and figure out what they’re needed for. But it might make sense to try reducing the number of times InstCombine is run, replace the earlier runs with an EarlyInstCombine, and gradually move the most important canonical transforms into EarlyInstCombine as they’re needed.

-Andy

[...]

Yeah, I agree. This was mostly a sanity check to understand whether
there was a significant improvement at runtime. I ran the same test 7
times in the last hour, but the difference is always in the noise,
FWIW.

The reason to add more runs is that you can measure performance differences
smaller than the run to run variation for the same binary. Most of the
time, if you just plot 10 runs of each binary all on the same plot you can
easily eyeball it to see that one is consistently faster. If you want to
formalize this you could do something like:

1. pair up each run of one binary with another run from the other binary
2. see which (say) median is higher, to get a boolean value of which
is faster in that pair (you can be fancier and compute an integral
over e.g. the 20-80%'ile frame times; or you can investigate the tail
by looking at the 99th percentile)
3. treat that as a coin toss (i.e., as a Bernoulli random variable
with parameter p; a "fair coin" would be p=0.5)
4. use Bayesian methods to propagate the "coin toss" observations
backward to infer a distribution on the possible values of the
parameter p of the Bernoulli random variable.

Step 4 is actually fairly simple; an example of how to do it is here:

Notice (in the picture on that page) how the prior assumption that p
might be anywhere in [0,1] with uniform probability is refined as we
incorporate the result of each coin flip (that's really all the
"Bayesian" means; Bayes' theorem tells you how to incorporate the new
evidence to update your estimated distribution (often called the
"posterior" distribution)). As we incorporate more coin tosses, we
refine that initial distribution. For a fair coin, after enough
tosses, the posterior distribution becomes very concentrated around
p=0.5.
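(For the record, the closed form here: starting from a uniform prior
on p, after observing h "heads" out of n tosses, Bayes' theorem gives

  posterior(p) ∝ p^h * (1-p)^(n-h)

i.e. a Beta(h+1, n-h+1) distribution, whose mass concentrates around
h/n as n grows.)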

For the "coin flip", it's just a matter of plugging into a closed form
formula to get the distribution, but for more sophisticated models there is
no closed form formula.
This is where MCMC libraries (like the ones that the "Probabilistic
Programming and Bayesian Methods for Hackers" book shows how to use)
come into play.
E.g. note that in step 2 you are actually losing a lot of information
(you reduce two frame-time CDFs, thousands of data points each, to a
single bit of information). Using an MCMC library you can have
fine-grained control
over the amount of detail in the model and it can propagate your
observations back to the parameters of your model in a continuous and
information-preserving way (to the extent that you design your model to
preserve information; there are tradeoffs between model accuracy and
computation time obviously).

(step 1 also loses information, btw. You also lose information when you
start looking at the frame time distribution because you lose information
about the ordering of the frames)

-- Sean Silva

Just wanted to mention http://lab.llvm.org:8080/green/view/Compile%20Time/ again. We have set up continuous compile-time tracking of the CTMark test suite in 4 popular flag combinations there (and yes, it's another example showing the slowdowns). It also has a feature to track regressions!

- Matthias