llvm and clang are getting slower

The LTO time could be explained by second-order effects: increased dcache/dTLB pressure from the larger memory footprint and poor locality.

David

The LTO time could be explained by second-order effects: increased
dcache/dTLB pressure from the larger memory footprint and poor locality.

Actually thinking more about this, I was totally wrong. Mehdi said that we
LTO ~56 binaries. If we naively assume that each binary is like clang and
links in "everything" and that the LTO process takes CPU time equivalent to
"-O3 for every TU", then we would expect that *for each binary* we would
see +33% (total increase >1800% vs Release). Clearly that is not happening
since the actual overhead is only 50%-100%, so we need a more refined
explanation.

There are a few factors that I can think of.
a) there are 56 binaries being LTO'd (this will tend to increase our
estimate)
b) not all 56 binaries are the size of clang (this will tend to decrease
our estimate)
c) per-TU processing only is doing mid-level optimizations and no codegen
(this will tend to decrease our estimate)
d) IR seen during LTO has already been "cleaned up" and has less overall
size/amount of optimizations that will apply during the LTO process (this
will tend to decrease our estimate)
e) comdat folding in the linker means that we only codegen each comdat
function once (this will tend to decrease our estimate)

Starting from a (normalized) release build with
releaseBackend = .33
releaseFrontend = .67
release = releaseBackend + releaseFrontend = 1

Let us try to obtain
LTO = (some expression involving releaseFrontend and releaseBackend) = 1.5-2

For starters, let us apply a), with a naive assumption that for each of the
numBinaries = 52 binaries we add the cost of releaseBackend (I just checked
and 52 is the exact number for LLVM+Clang+LLD+clang-tools-extra, ignoring
symlinks). This gives
LTO = release + 52 * releaseBackend = 18.16, which is way high.

Let us apply b). A quick check gives 371,515,392 total bytes of text in a
release build across all 52 binaries (Mac, x86_64). Clang is 45,182,976
bytes of text. So using final text size in Release as an indicator of the
total code seen by the LTO process, we can use a coefficient of 1/8, i.e.
the average binary links in about avgTextFraction = 1/8 of "everything".
LTO = release + 52 * (.125 * releaseBackend) = 3.14

We are still high. For c), let us assume that half of releaseBackend is
spent after mid-level optimizations. So let codegenFraction = .5 be the
fraction of releaseBackend that is spent after mid-level optimizations. We
can discount this time from the LTO build since it does not do that work
per-TU.
LTO = release + 52 * (.125 * releaseBackend) - (codegenFraction *
releaseBackend) = 2.98
So this is not a significant reduction.

I don't have a reasonable estimate a priori for d) or e), but altogether
they reduce to a constant factor otherSavingsFraction that multiplies the
second term
LTO = release + 52 * (.125 * otherSavingsFraction * releaseBackend) -
(codegenFraction * releaseBackend) =? 1.5-2x

Given the empirical data, this suggests that otherSavingsFraction must have
a value around 1/2, which seems reasonable.

For a moment I was rather surprised that we could have 52 binaries and it
would be only 2x, but this closer examination shows that between
avgTextFraction = .125 and releaseBackend = .33 the "52" is brought under
control.
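
For concreteness, here is a small Python sketch that just recomputes the
steps of this back-of-envelope model; the variable names follow the text
above and all of the fractions are the assumed values, not measurements:

# Back-of-envelope LTO model; everything normalized so a Release build == 1.0.
releaseBackend = 0.33
releaseFrontend = 0.67
release = releaseBackend + releaseFrontend       # = 1.0

numBinaries = 52
avgTextFraction = 0.125       # clang text / total text, 45182976 / 371515392 ~= 1/8
codegenFraction = 0.5         # assumed share of releaseBackend spent after mid-level opts
otherSavingsFraction = 0.5    # d) + e), the value the empirical 1.5-2x range suggests

# a) naive: every binary repeats the full backend work
naive = release + numBinaries * releaseBackend                                 # ~18.2
# b) scale by the average fraction of "everything" each binary links in
scaled = release + numBinaries * avgTextFraction * releaseBackend              # ~3.14
# c) discount the codegen work that is no longer done per-TU
minus_codegen = scaled - codegenFraction * releaseBackend                      # ~2.98
# d) + e) folded into otherSavingsFraction
lto = (release
       + numBinaries * avgTextFraction * otherSavingsFraction * releaseBackend
       - codegenFraction * releaseBackend)                                     # ~1.91

print(naive, scaled, minus_codegen, lto)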

-- Sean Silva

I have noticed that LLVM doesn't seem to "like" large functions, as a
general rule. Admittedly, my experience is similar with gcc, so I'm not
sure it's something that can be easily fixed. And I'm probably sounding
like a broken record, because I have said this before.

My experience is that the time it takes to compile something grows faster
than linearly with the size of the function.

The number of BBs -- Kostya can point you to the compile time bug that is
exposed by ASan.

17409 – llvm::SpillPlacement::addLinks takes all the time with asan or msan and -O2

Not just asan, this bug reproduces in a wide range of cases.
By now I am not even sure if this is a single bug or a set of independent
(but similar-looking) problems.

> LTO = release + 52 * releaseBackend = 18.16, which is way high.

Some bitcode .o files (such as those in support libs) are linked into more
than one target, but not all .o files are. Suppose the average duplication
factor is DupFactor; then the LTO time should be approximated by

LTO = releaseFrontend + DupFactor * releaseBackend

Now consider comdat elimination, and let DedupFactor be the ratio of the
total number of unique functions to the total number of functions produced
by the FE; then the LTO time is approximated by:

LTO = releaseFrontend + DupFactor * DedupFactor * releaseBackend
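
A minimal sketch of this refinement, with made-up illustrative values for
DupFactor and DedupFactor (the text above does not measure either):

# Refined model: frontend work happens once per TU; backend work is repeated
# once per binary each bitcode .o is linked into, discounted by comdat folding.
releaseFrontend = 0.67
releaseBackend = 0.33

def lto_estimate(dup_factor, dedup_factor):
    # dup_factor:   average number of binaries each bitcode .o is linked into
    # dedup_factor: unique functions / total functions emitted by the FE
    return releaseFrontend + dup_factor * dedup_factor * releaseBackend

# Purely illustrative values: each .o linked into ~5 binaries on average,
# with comdat folding removing ~20% of the function bodies.
print(lto_estimate(5, 0.8))   # 0.67 + 5 * 0.8 * 0.33 = 1.99, in the observed 1.5-2x range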

David

A historical note:

Back in the pre-Clang LLVM 1.x dark ages you could, if you
pressed the right buttons, run LLVM as a very fast portable
codegen. MB/s was a reasonable measure as the speed was (or
could be made to be) fairly independent of the input structure.

Since ~2006, as LLVM has shifted from "awesome research
plaything" to "compiler people depend on", there has been a
focus on ensuring that typical software compiles quickly and
well. Many good things have followed as a result, but you are
certainly correct that LLVM doesn't handle large input
particularly well. Having said that, some projects (the Gambit
Scheme->C and Verilator Verilog->C compilers come to mind)
routinely see runtimes 10~100x that of GCC in typical use. So
perhaps we are thinking of different things if you're seeing
similar issues with GCC.

I suspect that despite the passage of time the problem remains
solvable - there's probably *more* work to be done now, but I
don't think there are any massively *difficult* problems to be
solved. Properly quantifying/tracking the problem would be a
good first step.

Best,
Duraid

Bug reports (with pre-processed source files preferably) are always welcome.

Collecting the test cases in a "compile time test suite" is what should follow naturally.

Best,

Hi,

There is a possibility that r259673 could play a role here.

For the buildSchedGraph() method, there is the -dag-maps-huge-region option,
which has a default value of 1000. When I committed the patch, I was expecting
people to lower this value as needed and also suggested this, but this has
not happened. 1000 is very high, basically "unlimited".

It would be interesting to see what results you get with e.g. -mllvm
-dag-maps-huge-region=50. Of course, since this is a trade-off between
compile time and scheduler freedom, some care should be taken before
lowering this in trunk.

Indeed we hit this internally, filed a PR:
https://llvm.org/bugs/show_bug.cgi?id=26940

As a general comment on this thread and as mentioned by Mehdi, we care
a lot about compile time and we're looking forward to contributing more
in this area in the following months; by collecting compile time
testcases into a testsuite and publicly tracking results on those we
should be able to start an RFC on a tradeoff policy.

Honza recently posted some benchmarks for building libreoffice with
GCC 6 and LTO and found a similar compile time regression for recent
llvm trunk...

Compared to llvm 3.5.0, the builds with llvm 3.9.0 svn were 24% slower.

LLVM has a wonderful policy regarding broken commits: we revert to green. We ask that a test case be available within a reasonable time frame (preferably before, but some exceptions can be made), but otherwise we revert the offending patch, even if it contains nice features that people want, and keep the tree green. This is an awesome policy.

I would like to suggest we adopt and follow the same policy for compile time regressions that are large, and especially for ones that are super-linear. As an example from the previous thread:

Chandler Carruth via llvm-dev <llvm-dev@lists.llvm.org> writes:

LLVM has a wonderful policy regarding broken commits: we revert to green.
We ask that a test case be available within a reasonable time frame
(preferably before, but some exceptions can be made), but otherwise we
revert the offending patch, even if it contains nice features that people
want, and keep the tree green. This is an awesome policy.

I would like to suggest we adopt and follow the same policy for compile
time regressions that are large, and especially for ones that are
super-linear. As an example from the previous thread:

> There is a possibility that r259673 could play a role here.
>
> For the buildSchedGraph() method, there is the -dag-maps-huge-region that
> has the default value of 1000. When I committed the patch, I was expecting
> people to lower this value as needed and also suggested this, but this has
> not happened. 1000 is very high, basically "unlimited".
>
> It would be interesting to see what results you get with e.g. -mllvm
> -dag-maps-huge-region=50. Of course, since this is a trade-off between
> compile time and scheduler freedom, some care should be taken before
> lowering this in trunk.

Indeed we hit this internally, filed a PR:
26940 – [Scheduler] Recent improvements lead to compile time bloat

I think we should have rolled back r259673 as soon as the test case was
available.

Thoughts?

+1. Reverting is easy when a commit is fresh, but gets rapidly more
difficult as other changes (related or not) come after it, whereas
re-applying a commit later is usually straightforward.

Keeping the top of tree compiler in good shape improves everyone's
lives.

LLVM has a wonderful policy regarding broken commits: we revert to green. We
ask that a test case be available within a reasonable time frame (preferably
before, but some exceptions can be made), but otherwise we revert the
offending patch, even if it contains nice features that people want, and
keep the tree green. This is an awesome policy.

I would like to suggest we adopt and follow the same policy for compile time
regressions that are large, and especially for ones that are super-linear.
As an example from the previous thread:

+1

> There is a possibility that r259673 could play a role here.
>
> For the buildSchedGraph() method, there is the -dag-maps-huge-region option,
> which has a default value of 1000. When I committed the patch, I was
> expecting people to lower this value as needed and also suggested this,
> but this has not happened. 1000 is very high, basically "unlimited".
>
> It would be interesting to see what results you get with e.g. -mllvm
> -dag-maps-huge-region=50. Of course, since this is a trade-off between
> compile time and scheduler freedom, some care should be taken before
> lowering this in trunk.

Indeed we hit this internally, filed a PR:
26940 – [Scheduler] Recent improvements lead to compile time bloat

I think we should have rolled back r259673 as soon as the test case was
available.

I agree, but since we didn't have a policy about it, I was kind of
unsure what to do about it. Glad you started this discussion :)

Thoughts?

Ideally it would be good to have more compile time sensitive
benchmarks in the test-suite to detect those. We're working on
collecting what we have internally and upstreaming it to help track
the results in a public way.

Hi,

TLDR: I totally support considering compile time regression as bug.

I’m glad you bring this topic. Also it is worth pointing at this recent thread: http://lists.llvm.org/pipermail/llvm-dev/2016-March/096488.html
And also this blog post comparing the evolution of clang and gcc on this aspect: http://hubicka.blogspot.nl/2016/03/building-libreoffice-with-gcc-6-and-lto.html

I will repeat myself here, since we also noticed internally that compile time was slowly degrading with time. Bruno and Chris are working on some infrastructure and tooling to help track compile time regressions closely.

We had this conversation internally about the tradeoff between compile-time and runtime performance, and I planned to bring up the topic on the list in the coming months, but was waiting for more tooling to be ready.
Apparently in the past (years/a decade ago?) the project was very conservative about adding any optimizations that would impact compile time; however, there is no explicit policy (that I know of) to address this tradeoff.
The closest I could find would be what Chandler wrote in: http://reviews.llvm.org/D12826 ; for instance for O2 he stated that “if an optimization increases compile time by 5% or increases code size by 5% for a particular benchmark, that benchmark should also be one which sees a 5% runtime improvement”.
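
As a concrete reading of that rule of thumb, here is a tiny sketch; the function name and signature are made up for illustration and are not an existing tool:

# The O2 rule of thumb quoted above, as a check. Deltas are fractional changes:
# compile_time_delta = 0.05 means compile time got 5% worse, and
# runtime_delta = 0.05 means the generated code got 5% faster.
def acceptable_for_O2(compile_time_delta, code_size_delta, runtime_delta):
    regresses = compile_time_delta >= 0.05 or code_size_delta >= 0.05
    return (not regresses) or runtime_delta >= 0.05

print(acceptable_for_O2(0.05, 0.0, 0.02))   # False: +5% compile time buys only 2% runtime
print(acceptable_for_O2(0.05, 0.0, 0.06))   # True: the 5% cost buys >= 5% runtime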

My hope is that with better tooling for tracking compile time in the future, we’ll reach a state where we’ll be able to consider “breaking” the compile-time regression test as important as breaking any test: i.e. the offending commit should be reverted unless it has been shown to significantly (hand wavy…) improve the runtime performance.

Since you raise the discussion now, I take the opportunity to push on the “more aggressive” side: I think the policy should be a balance between the improvement the commit brings compared to the compile time slow down. Something along the lines of what you wrote in my quote above.
You are referring to “large” compile time regressions (aside: what is “large”?), while Bruno has graphs that show that the compile time regressions are mostly a lot of 1-3% regressions in general, spread over tens of commits.
Also (and this is where we need better tooling), unexpected compile-time slowdowns are what worry me: i.e. the author of the commit adds something but didn’t expect the compile time to be “significantly” impacted. This is motivated by Bruno/Chris data.
Tracking this more closely may also help to triage things between O2 and O3 when a commit introduces a compile time slowdown but also brings significant enough runtime improvements.

TLDR: I totally support considering compile time regression as bug.

Me too.

I also agree that reverting fresh and reapplying is *much* easier than
trying to revert late.

But I'd like to avoid dubious metrics.

The closest I could find would be what Chandler wrote in:
http://reviews.llvm.org/D12826 ; for instance for O2 he stated that "if an
optimization increases compile time by 5% or increases code size by 5% for a
particular benchmark, that benchmark should also be one which sees a 5%
runtime improvement".

I think this is a bit limited and can lead to witch hunts, especially
wrt performance measurements.

Chandler's title is perfect though... Large can be vague, but
"super-linear" is not. We used to have the concept that any large
super-linear (quadratic+) compile time introductions had to be in O3
or, for really bad cases, behind additional flags. I think we should
keep that mindset.

My hope is that with better tooling for tracking compile time in the future,
we'll reach a state where we'll be able to consider "breaking" the
compile-time regression test as important as breaking any test: i.e. the
offending commit should be reverted unless it has been shown to
significantly (hand wavy...) improve the runtime performance.

In order to have any kind of threshold, we'd have to monitor with some
accuracy the performance of both compiler and compiled code for the
main platforms. We do that to a certain extent with the test-suite bots,
but that's very far from ideal.

So, I'd recommend we steer away from any kind of percentage or ratio
and keep at least the quadratic changes and beyond on special flags
(n.logn is ok for most cases).
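
To put rough numbers on why quadratic behaviour gets singled out (illustrative
arithmetic only, nothing measured):

# Growth when the input size (say, number of basic blocks) goes from 1,000 to 100,000:
import math
n1, n2 = 1000, 100000
print((n2 * math.log2(n2)) / (n1 * math.log2(n1)))   # n log n: ~167x more work
print((n2 ** 2) / (n1 ** 2))                          # n^2:     10,000x more work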

Since you raise the discussion now, I take the opportunity to push on the
"more aggressive" side: I think the policy should be a balance between the
improvement the commit brings compared to the compile time slow down.

This is a fallacy.

Compile time often regresses across all targets, while execution
improvements are focused on specific targets and can have negative
effects on those that were not benchmarked on. Overall, though,
compile time regressions dilute over the improvements, but not on a
commit per commit basis. That's what I meant by witch hunt.

I think we should keep an eye on those changes, ask for numbers in
code review and even maybe do some benchmarking on our own before
accepting it. Also, we should not commit code that we know hurts
performance that badly, even if we believe people will replace them in
the future. It always takes too long. I myself have done that last
year, and I learnt my lesson.

Metrics are often more dangerous than helpful, as they tend to be used
as a substitute for thinking.

My tuppence.

--renato

Hi Renato,

TLDR: I totally support considering compile time regression as bug.

Me too.

I also agree that reverting fresh and reapplying is *much* easier than
trying to revert late.

But I'd like to avoid dubious metrics.

I'm not sure how "this commit regresses the compile time by 2%" is a dubious metric.
The metric is not dubious IMO, it is what it is: a measurement.
You just have to build a good process around it to exploit this measurement in a useful way for the project.

The closest I could find would be what Chandler wrote in:
http://reviews.llvm.org/D12826 ; for instance for O2 he stated that "if an
optimization increases compile time by 5% or increases code size by 5% for a
particular benchmark, that benchmark should also be one which sees a 5%
runtime improvement".

I think this is a bit limited and can lead to witch hunts, especially
wrt performance measurements.

Chandler's title is perfect though... Large can be vague, but
"super-linear" is not. We used to have the concept that any large
super-linear (quadratic+) compile time introductions had to be in O3
or, for really bad cases, behind additional flags. I think we should
keep that mindset.

My hope is that with better tooling for tracking compile time in the future,
we'll reach a state where we'll be able to consider "breaking" the
compile-time regression test as important as breaking any test: i.e. the
offending commit should be reverted unless it has been shown to
significantly (hand wavy...) improve the runtime performance.

In order to have any kind of threshold, we'd have to monitor with some
accuracy the performance of both compiler and compiled code for the
main platforms. We do that to a certain extent with the test-suite bots,
but that's very far from ideal.

I agree. Did you read the part where I mentioned that we're working on the tooling and that I was waiting for it to be done to start this thread?

So, I'd recommend we steer away from any kind of percentage or ratio
and keep at least the quadratic changes and beyond on special flags
(n.logn is ok for most cases).

How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
Because there *is* a problem here, and I'd really like someone to come up with a solution for that.
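
To illustrate how a long trail of small regressions adds up (illustrative arithmetic only; the 1-3% figure is from Bruno's data mentioned earlier):

# Many small, individually "acceptable" regressions compound:
print(1.02 ** 11 - 1)   # eleven 2% regressions compound to ~24%, roughly the 3.5.0 -> 3.9.0 figure above
print(1.02 ** 30 - 1)   # thirty of them compound to ~81%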

Since you raise the discussion now, I take the opportunity to push on the
"more aggressive" side: I think the policy should be a balance between the
improvement the commit brings compared to the compile time slow down.

This is a fallacy.

Not sure why or what you mean? The fact that an optimization improves only some targets does not invalidate the point.

Compile time often regresses across all targets, while execution
improvements are focused on specific targets and can have negative
effects on those that were not benchmarked on.

Yeah, as usual in LLVM: if you care about something on your platform, set up a bot and track trunk closely, otherwise you're less of a priority.

Overall, though,
compile time regressions dilute over the improvements, but not on a
commit per commit basis. That's what I meant by witch hunt.

There is no "witch hunt", at least that's not my objective.
I think everyone is pretty enthusiastic with every new perf improvement (I do), but just like without bot in general (and policy) we would break them all the time unintentionally.
I talking about chasing and tracking every single commit were a developer would regress compile time *without even being aware*.
I'd personally love to have a bot or someone emailing me with compile time regression I would introduce.

I think we should keep an eye on those changes, ask for numbers in
code review and even maybe do some benchmarking on our own before
accepting it. Also, we should not commit code that we know hurts
performance that badly, even if we believe people will replace them in
the future. It always takes too long. I myself have done that last
year, and I learnt my lesson.

Agree.

Metrics are often more dangerous than helpful, as they tend to be used
as a substitute for thinking.

I don't relate this sentence to anything concrete at stake here.
I think this list is full of people who are very good at thinking and won't substitute it :)

Best,

I'm not sure about how "this commit regress the compile time by 2%" is a dubious metric.
The metric is not dubious IMO, it is what it is: a measurement.

Ignoring for a moment the slippery slope we recently had on compile
time performance, 2% is an acceptable regression for a change that
improves most targets around 2% execution time, more than if only one
target was affected.

Different people see performance with different eyes, and companies
have different expectations about it, too, so those percentages can
have different impact on different people for the same change.

I guess my point is that no threshold will please everybody, and
people are more likely to "abuse" the metric if the results are far
from what they see as acceptable, even if everyone else is ok with it.

My point about metrics as a substitute for thinking is not aimed at lazy
programmers (of which there are very few here), but at how far the
encoded threshold falls from your own. Bias is a *very* hard thing
to remove, even for extremely smart and experienced people.

So, while "which hunt" is a very strong term for the mild bias we'll
all have personally, we have seen recently how some discussions end up
in rage when a group of people strongly disagree with the rest,
self-reinforcing their bias to levels that they would never reach
alone. In those cases, the term stops being strong, and may be
fitting... Makes sense?

I agree. Did you read the part where I mentioned that we're working on the tooling and that I was waiting for it to be done to start this thread?

I did, and should have mentioned it in my reply. I think you guys (and
ARM) are doing an amazing job at quality measurement. I wasn't trying
to diminish your efforts, but IMHO, the relationship between effort and
bias removal is not linear, i.e. you'll have to improve quality
exponentially to remove bias linearly. So, the threshold we're
prepared to stop at might not remove all the problems, and metrics could
still play a negative role.

I think I'm just asking for us to be aware of the fact, not to stop
any attempt to introduce metrics. If they remain relevant to the final
objective, and we're allowed to break them with enough arguments, it
should work fine.

How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
Because there *is* a problem here, and I'd really like someone to come up with a solution for that.

Indeed, we're now slower than GCC, and that's a place that looked
impossible two years ago. But I doubt reverting a few patches will
help. For this problem, we'll need a task force to hunt for all the
dragons, and surgically alter them, since at this time, all relevant
patches are too far in the past.

For the future, emailing on compile time regressions (as well as run
time) is a good thing to have and I vouch for it. But I don't want
that to become a tool that will increase stress in the community.

Not sure why or what you mean? The fact that an optimization improves only some targets does not invalidate the point.

Sorry, I seem to have misinterpreted your point.

The fallacy is about the measurement of "benefit" versus the
regression "effect". The former is very hard to measure, while the
latter is very precise. Comparisons with radically different standard
deviations can easily fall into "undefined behaviour" land, and be
seed for rage threads.

I'm talking about chasing and tracking every single commit where a developer would regress compile time *without even being aware*.

That's a goal worth pursuing, regardless of the patch's benefit, I
agree wholeheartedly. And for that, I'm very grateful of the work you
guys are doing.

cheers,
--renato

I'm not sure how "this commit regresses the compile time by 2%" is a dubious metric.
The metric is not dubious IMO, it is what it is: a measurement.

Ignoring for a moment the slippery slope we recently had on compile
time performance, 2% is an acceptable regression for a change that
improves most targets around 2% execution time, more than if only one
target was affected.

Sure, I don't think I have suggested anything else; if I did, it is because I didn't express myself correctly :)
I'm excited about runtime performance, and I'm willing to spend compile-time budget to achieve it.
I'd even say that my view is that by tracking compile time on everything else, we'll help preserve more compile-time budget for the kind of commit you mention above.

Different people see performance with different eyes, and companies
have different expectations about it, too, so those percentages can
have different impact on different people for the same change.

I guess my point is that no threshold

I don't suggest a threshold that says "a commit can't regress more than x%" and that would be set in stone.

What I have in mind is more: if a commit regresses the build above a threshold (1% on average for instance), then we should be able to have a discussion about this commit to evaluate if it belongs in O2 or if it should go to O3 for instance.
Also, if the commit is about refactoring, or introducing a new feature, the regression might not be intended at all by the author!

will please everybody, and
people are more likely to "abuse" the metric if the results are far
from what they see as acceptable, even if everyone else is ok with it.

The metric is "the commit regressed 1%". The natural thing that follows is what happens usually in the community: we look at the data (what is the performance improvement), and decide on a case by case if it is fine as is or not.
I feel like you're talking about the "metric" as if it were an automatic threshold that triggers an automatic revert and blocks things; this is not the goal, and that is not what I mean when I use the word metric (but hey, I'm not a native speaker!).
As I said before, I'm mostly chasing *untracked* and *unintentional* compile time regressions.

My point about metrics as a substitute for thinking is not aimed at lazy
programmers (of which there are very few here), but at how far the
encoded threshold falls from your own. Bias is a *very* hard thing
to remove, even for extremely smart and experienced people.

So, while "which hunt" is a very strong term for the mild bias we'll
all have personally, we have seen recently how some discussions end up
in rage when a group of people strongly disagree with the rest,
self-reinforcing their bias to levels that they would never reach
alone. In those cases, the term stops being strong, and may be
fitting... Makes sense?

I agree. Did you read the part where I mentioned that we're working on the tooling and that I was waiting for it to be done to start this thread?

I did, and should have mentioned it in my reply. I think you guys (and
ARM) are doing an amazing job at quality measurement. I wasn't trying
to diminish your efforts, but IMHO, the relationship between effort and
bias removal is not linear, i.e. you'll have to improve quality
exponentially to remove bias linearly. So, the threshold we're
prepared to stop at might not remove all the problems, and metrics could
still play a negative role.

I'm not sure I really totally understand everything you mean.

I think I'm just asking for us to be aware of the fact, not to stop
any attempt to introduce metrics. If they remain relevant to the final
objective, and we're allowed to break them with enough arguments, it
should work fine.

How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
Because there *is* a problem here, and I'd really like someone to come up with a solution for that.

Indeed, we're now slower than GCC, and that's a place that looked
impossible two years ago. But I doubt reverting a few patches will
help. For this problem, we'll need a task force to hunt for all the
dragons, and surgically alter them, since at this time, all relevant
patches are too far in the past.

Obviously, my immediate concern is "what tools and process will make sure it does not get worse", and starting with "community awareness" is not bad. Improving and recovering from the current state is valuable, but orthogonal to what I'm trying to achieve.
Another thing is the complaints from multiple people who are trying to JIT using LLVM: we know LLVM is not designed in a way that helps with latency and memory consumption, but getting worse is not nice.

For the future, emailing on compile time regressions (as well as run
time) is a good thing to have and I vouch for it. But I don't want
that to become a tool that will increase stress in the community.

Sure, I'm glad you stepped up to make sure it does not happen. So please continue to speak up in the future as we try to roll things out.
I hope we're on the same track past the initial misunderstanding we had with each other?

What I'd really like is to have a consensus on the goal to pursue (knowing I'm not alone in caring about compile time is a great start!), so that the tooling can be set up to serve this goal the best way possible (and decrease stress instead of increasing it).

Best,

What I have in mind is more: if a commit regresses the build above a threshold (1% on average for instance), then we should be able to have a discussion about this commit to evaluate if it belongs in O2 or if it should go to O3 for instance.
Also, if the commit is about refactoring, or introducing a new feature, the regression might not be intended at all by the author!

Thresholds as a trigger for discussion are exactly what I was looking for.

But Chandler goes further (or so I gathered): some commits are
really bad and could be candidates for reversion before discussion.
Those more extreme measures may be justified if, for example, the
commit is quadratic or worse in a core part of the compiler, or doubles
the testing time, etc.

I agree with both proposals, but we have to make sure what goes where,
to avoid (unintentionally) being heavy-handed with other people's work.

The metric is "the commit regressed 1%". The natural thing that follows is what happens usually in the community: we look at the data (what is the performance improvement), and decide on a case by case if it is fine as is or not.
I feel like you're talking about the "metric" as if it were an automatic threshold that triggers an automatic revert and blocks things; this is not the goal, and that is not what I mean when I use the word metric (but hey, I'm not a native speaker!).

I wasn't talking about automatic reversal, but about pre-discussion
reversal, as I mentioned above.

As I said before, I'm mostly chasing *untracked* and *unintentional* compile time regressions.

That's obviously good. :)

I'm not sure I really totally understand everything you mean.

It's about the threshold between what promotes discussion and what
promotes pre-discussion reverts. This is a hard line to draw with so
many people (and companies) involved.

Sure, I'm glad you stepped up to make sure it does not happen. So please continue to speak up in the future as we try to roll things out.
I hope we're on the same track past the initial misunderstanding we had with each other?

Yes. :)

cheers,
--renato

> Metrics are often more dangerous than helpful, as they tend to be used
> as a substitute for thinking.

One of my favorite quotes:
"The results are definitely numbers, but do they have very much at all to
do with the problem?" - Forman Acton, "Numerical Methods that (usually)
Work"

-- Sean Silva