LLVM is getting faster, April edition

Hi,

It’s been a while since I sent the last compile time report [1], which showed that LLVM was getting slower over time. But now I’m happy to bring some good news: finally, LLVM is getting faster, not slower :)

*** Current status ***
Many areas of LLVM have been examined and improved since then, including InstCombine, SCEV, and the APInt implementation, resulting in an almost 10% improvement compared to the January compiler. I remeasured compile-time data for the CTMark tests and annotated the biggest changes; the graphs for Os and O0-g are attached below. The thick black line represents the geomean, and the thin colored lines represent individual tests. The data is normalized to the first revision in the range (~Jun 2015).
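For anyone reproducing the graphs from raw timings, the normalization and the geomean line work roughly like this (a minimal sketch; the function names are mine, not from the actual measurement scripts):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Divide each measured compile time by the time at the first revision in the
// range, giving a per-test ratio relative to the baseline.
std::vector<double> normalizeToBaseline(const std::vector<double> &Times,
                                        double BaselineTime) {
  std::vector<double> Ratios;
  for (double T : Times)
    Ratios.push_back(T / BaselineTime);
  return Ratios;
}

// The thick black line is the geometric mean of those per-test ratios,
// computed in log space for numerical stability.
double geomean(const std::vector<double> &Ratios) {
  double LogSum = 0.0;
  for (double R : Ratios)
    LogSum += std::log(R);
  return std::exp(LogSum / Ratios.size());
}
```

The geomean (rather than the arithmetic mean) is the natural summary for ratios: a test that doubles and a test that halves cancel out exactly.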

*** Future work ***
There are still plenty of opportunities to make LLVM faster. Here is a list of some ideas that could further help compile time:

  • KnownBits Cache. InstCombine and other passes query known bits, which is often quite expensive. Hal posted a patch [2] that implements a cache for known bits, but there are still some issues to fix there.
  • SCEV. Some parts of SCEV still need to be improved. For instance, the createAddRecFromPHI function seems to be very inefficient: it can perform many expensive traversals over the entire function/loop nest, most of which are probably redundant.
  • Forming LCSSA. PR31851 reports that the current implementation of LCSSA formation can be expensive. A WIP patch [3] should address the problem, but there is probably more to improve here.
  • InstCombine vs. InstSimplify. Currently we run InstCombine six times in our O3 pipeline. We probably don’t need the full InstCombine all six times, and some of its invocations could be replaced with a cheaper clean-up pass.
  • Unnecessary pass dependencies. There are cases in which computing a pass’s dependencies is much more expensive than running the pass itself (especially at O0). It might make sense to find such passes and try replacing their dependencies with lazy computation of the required analyses (see e.g. [4]).
  • libcxx. r249742 split a number of headers and resulted in noticeable compile-time slowdowns. While the change itself seems to be necessary, it would be nice to find a way to mitigate the induced slowdowns.
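The KnownBits-cache idea can be sketched as a memoization layer in front of the expensive recursive computation. This is only an illustration with toy stand-in types; the actual patch [2] works on llvm::Value and the real computeKnownBits in ValueTracking, and getting invalidation right is exactly where the remaining issues are:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Toy stand-in for llvm::Value carrying precomputed answers, so the sketch
// is self-contained; in LLVM the answer comes from a use-def chain walk.
struct Value {
  uint64_t KnownZero, KnownOne; // bits known to be 0 / known to be 1
};

class KnownBitsCache {
  std::unordered_map<const Value *, std::pair<uint64_t, uint64_t>> Cache;
  unsigned Misses = 0;

  // In LLVM this is a recursive traversal over the use-def chain -- the
  // expensive part the cache is meant to amortize across repeated queries.
  std::pair<uint64_t, uint64_t> computeKnownBits(const Value *V) {
    ++Misses;
    return {V->KnownZero, V->KnownOne};
  }

public:
  std::pair<uint64_t, uint64_t> get(const Value *V) {
    auto It = Cache.find(V);
    if (It != Cache.end())
      return It->second; // hit: no recomputation
    auto KB = computeKnownBits(V);
    Cache.emplace(V, KB);
    return KB;
  }

  // The hard part: any transform that mutates a value (or anything it
  // transitively depends on) must invalidate the stale entries.
  void invalidate(const Value *V) { Cache.erase(V); }

  unsigned misses() const { return Misses; }
};
```

The design tension is visible even in this sketch: the cache pays off only if passes query the same values repeatedly, and a too-coarse invalidation policy can erase most of the benefit.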

Of course, the list is far from complete, so if you happen to know of other problematic areas, please let me know. Some of these ideas are already being worked on, but there is always room for volunteers! So, if you’d like to work on LLVM compile time, please let me know and let’s join our efforts.

Thanks for your time,
Michael

[1] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109188.html

[2] https://reviews.llvm.org/D31239
[3] https://reviews.llvm.org/D31843
[4] https://reviews.llvm.org/D31302

CTMark -Os:

CTMark - Os.pdf (33.5 KB)

CTMark - O0-g.pdf (19.1 KB)

I am interested in knowing more.

  1. What benchmarks does the LLVM community use for compile-time studies? I see CTMark, but is that the only one being analyzed?
  2. Is ASM parsing treated as a bottleneck in the flow? It is not part of the default compilation flow, though.
  3. Do we have a target here? How fast does LLVM want to be?

Hi Madhur,

> I am interested in knowing more.
>
>   1. What benchmarks does the LLVM community use for compile-time studies? I see CTMark, but is that the only one being analyzed?

I used CTMark, which is a subset of the standard LLVM test suite. We created CTMark some time ago specifically to serve as a benchmark suite for compile-time studies, with the idea that new tests would be added in the future. It’s usually also a good idea to check compile time on established suites, e.g. SPEC, but this time I didn’t do it (building SPEC takes much longer than building CTMark).

>   2. Is ASM parsing treated as a bottleneck in the flow? It is not part of the default compilation flow, though.

It depends: if there is a use case where it’s a bottleneck, I don’t see a reason not to look into it. But in my investigations I don’t remember seeing it as a bottleneck.

>   3. Do we have a target here? How fast does LLVM want to be?

It’s an open-ended goal. Ideally, compile time should improve from release to release, but in practice that’s almost impossible, because LLVM keeps gaining features. My current focus is to identify bottlenecks (using CTMark or perhaps other tests) and try to optimize them one by one. Since we haven’t cared much about compile time in the recent past, I think there should be plenty of low-hanging fruit here.

Thanks,
Michael

Thanks Michael. These reports are great both for me personally and for the LLVM community. It’s very much appreciated.

-eric

Hi,

> It's been a while since I sent the last compile time report [1], which
> showed that LLVM was getting slower over time. But now I'm happy to bring
> some good news: finally, LLVM is getting faster, not slower :)

Thanks a lot for the update, this is very good to hear :)

> - Forming LCSSA. PR31851 reports that the current implementation of LCSSA
> formation can be expensive. A WIP patch [3] should address the problem, but
> there is probably more to improve here.

https://reviews.llvm.org/rL300255 is a first step towards the goal.
For some large tests, LCSSA is still slow, and that's due to a lot of
time spent in the updater. I'll try to fix that one next.

To clarify, here's the profiler output
https://reviews.llvm.org/F3221946
I think there's not much we can (easily) do for `getExitBlocks()`, as
we already cache the call for each loop in `formLCSSAForInstructions`,
but switching to a faster renamer should help. Dan, do you think it's
possible to move your O(def + use) renamer out of PredicateInfo into a
common file and use it here?

> I am interested in knowing more.
>
>   1. What benchmarks does the LLVM community use for compile-time studies? I see CTMark, but is that the only one being analyzed?

CTMark is not set in stone. Its purpose is to give the community a trackable proxy for the overall LLVM test suite. This assumption is supposed to be evaluated (and the benchmarks in CTMark possibly adjusted) on a regular basis, roughly twice a year at compiler release times. As far as open source is concerned, only CTMark is tracked on Green Dragon, for O0-g, Os and, forward-looking, O3 LTO. This means O0-g and Os are watched very closely, while O3 LTO at this stage is just "getting a look" (a double-digit increase will certainly raise eyebrows, though). The tracking data is at http://lab.llvm.org:8080/green/view/Compile%20Time/.

>   2. Is ASM parsing treated as a bottleneck in the flow? It is not part of the default compilation flow, though.
>   3. Do we have a target here? How fast does LLVM want to be?

Our data showed that compile time increased steadily by double digits over the last two years for Os, and a little less for O0-g. Unfortunately, and for many reasons, it is not straightforward to get that compile time back by simply setting an X% goal over, e.g., Y months. Instead it requires establishing a process that allows the open source community to pursue better compile time. This involves:
a) identifying compile-time increases quickly,
b) reasoning about an increase in “real time”, and
c) taking action and implementing improvements.

CTMark gives focus and the sense of achievability for a).

b) requires finishing some work in progress, and hopefully will become a process ingrained in the LLVM/Clang DNA. Michael’s analysis of compile-time bumps is the basis for classifying the reasons, and many times an increase, e.g. for a new feature, optimization, or tuning, is the right trade-off. But then it becomes a group decision, based on data and insight, to accept longer compile times, not something that simply happened. To help with root-causing/reasoning, I think the work by Matthias and others [2] on timers/stats, combined with per-commit tracking on Green Dragon (which Chris enabled), will form the basis of the methodology.

c) is different. While a) and b) put in place barriers to ongoing compile-time increases, c) delivers the improvements. Michael’s work on SCEV [1] is an example of this. This (and similar work by others, e.g. the refs in this mail thread) shows that clang can get ahead of the game and improve compile time significantly. It is now time to shoot the azimuth and see where future improvements can come from and where they can take clang compile time. Some ideas are in Michael’s mail thread, but with additional analysis and insight, including from compiling clang itself, more opportunities should become apparent. Expect most of the issues to take many weeks or months to analyze and implement. Still ready to join the effort? It takes commitment, not just interest!

To answer your question about the target: the process is working when compile-time changes are assessed immediately, improvement ideas are followed up on, and blogs/articles reporting clang compile times are all praise. When the best talents in the community devote some of their time and effort, this lofty goal will be hit.

Cheers
Gerolf

References:
[1] https://reviews.llvm.org/D30477
[2] http://lists.llvm.org/pipermail/llvm-dev/2016-December/108088.html (and more comments in https://reviews.llvm.org/D31566#716880)

Is it just renaming you need, or general updating?
The RewriteUse call there is one that inserts phis, so I’m not sure renaming alone will help.

Thanks, Gerolf, for explaining the philosophy behind the analysis. I totally agree it requires consistency and vigilance.

I got the purpose of CTMark; its intent is clear to me. Apart from this, have we ever thought of having synthetic tests focused on compile time only? We could write a test-suite generator that generates LLVM IR files that stress-test a particular phase of LLVM. E.g., we could have a performance test that stress-tests just LICM or LSE and so on.

This would allow us to keep an eye on that phase and track it more efficiently.

Thoughts?
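To make the suggestion concrete, such a generator could be as simple as a small program that prints textual LLVM IR with a tunable repetition count. Here is a hypothetical sketch (function and label names are mine): it emits one function with NumLoops sequential loops, each containing a loop-invariant load that LICM should hoist, so NumLoops scales the stress on that one pass:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Emit textual LLVM IR for a function with NumLoops sequential loops. Each
// loop body loads through %p, and the load does not depend on the induction
// variable, so LICM should hoist every one of them.
std::string emitLICMStress(int NumLoops) {
  std::ostringstream OS;
  OS << "define i32 @stress(i32* %p) {\nentry:\n  br label %header0\n";
  for (int I = 0; I < NumLoops; ++I) {
    // Loop I is entered from the previous loop's exit edge (or from entry),
    // and falls through to the next loop (or to the function exit).
    std::string Pred = I == 0 ? "entry" : "header" + std::to_string(I - 1);
    std::string Next =
        I + 1 == NumLoops ? "exit" : "header" + std::to_string(I + 1);
    OS << "header" << I << ":\n"
       << "  %i" << I << " = phi i32 [ 0, %" << Pred << " ], [ %next" << I
       << ", %header" << I << " ]\n"
       // The loop-invariant load a good LICM implementation hoists out.
       << "  %inv" << I << " = load i32, i32* %p\n"
       << "  %next" << I << " = add i32 %i" << I << ", %inv" << I << "\n"
       << "  %cmp" << I << " = icmp slt i32 %next" << I << ", 1000\n"
       << "  br i1 %cmp" << I << ", label %header" << I << ", label %" << Next
       << "\n";
  }
  OS << "exit:\n  ret i32 0\n}\n";
  return OS.str();
}
```

The output could then be fed to something like `opt -licm` under a timer, and the pass's scaling behavior read off directly as NumLoops grows.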

> I got the purpose of CTMark; its intent is clear to me. Apart from this, have we ever thought of having synthetic tests focused on compile time only? We could write a test-suite generator that generates LLVM IR files that stress-test a particular phase of LLVM. E.g., we could have a performance test that stress-tests just LICM or LSE and so on.
>
> This would allow us to keep an eye on that phase and track it more efficiently.

Maybe. Sometimes stress-testing (generated, or e.g. by turning the compiler loose with inline-all, unroll-all, and other thresholds set to infinite) points to low-hanging fruit. But mostly it tends to result in threshold tuning, cutting off/protecting from outliers. It is a bigger challenge to improve compile time for a wide range of apps/benchmarks. But it is also more fun, when you can commit the time.

I suspect this would track a particular instance of a pass's execution without necessarily catching/covering the edge cases that we may end up regressing, so I'm not sure about the expected effectiveness of such an approach.
I also wonder how hard it would be to maintain such tests as the passes and the pipeline evolve over time.