CTMark - regular LLVM and CLANG compile-time tracking

Hi,

this is about kicking off regular compile-time tracking for LLVM and Clang on Green Dragon: http://lab.llvm.org:8080/green/view/Compile%20Time/. The goal is to catch compile-time issues immediately when they occur so they can be assessed, rather than letting them creep in unnoticed. The methodology is simple: form a CTMark suite out of 10 “long”-compiling tests from the LLVM test suite and track it closely for ARM64 at O0g and Os. When there is a jump in compile time of more than 2.5% in one of the tests, an email notification is sent to the committer and a bug is filed in Bugzilla. The 2.5% threshold is large enough to be above the noise level and should be motivating enough to root-cause the issue. We also watch for spikes of >10% at O3 LTO.
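As a minimal sketch (not the actual bot's implementation; the test names and timings below are invented), the threshold check described above amounts to comparing each test's current compile time against its previous measurement and flagging jumps above 2.5%:

```python
# Hypothetical sketch of the 2.5% threshold check described above.
# Test names and timings are illustrative, not real CTMark data.
THRESHOLD = 0.025  # 2.5% relative increase

def find_regressions(previous, current, threshold=THRESHOLD):
    """Return names of tests whose compile time grew by more than `threshold`."""
    regressions = []
    for test, old_time in previous.items():
        new_time = current.get(test)
        if new_time is None:
            continue  # test not present in the current run
        if (new_time - old_time) / old_time > threshold:
            regressions.append(test)
    return regressions

# Illustrative compile times in seconds:
previous = {"7zip": 32.0, "sqlite3": 14.0, "Bullet": 21.0}
current  = {"7zip": 33.1, "sqlite3": 14.1, "Bullet": 21.2}
print(find_regressions(previous, current))  # → ['7zip']
```

Only 7zip crosses the threshold here (about 3.4% slower); the other tests stay within the assumed noise band.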

This will take process and effort. Chris Matthews put CTMark together with Michael and Matthias, keeps the servers running, and constantly improves the tooling. Michael stepped up to help with the ongoing process of watching for issues and filing PRs until that part is automated. Keeping compile time under control, and possibly improving it over time, will require effort from everyone. And everybody is invited to root-cause past regressions.

The need for this work has been recognized both within the community and externally. It has been observed more than once that llvm and clang are getting slower over time, for example in this thread: http://lists.llvm.org/pipermail/llvm-dev/2016-March/096491.html. Phoronix recently reported compile-time slowdowns in clang 3.9 vs. 3.8: http://www.phoronix.com/scan.php?page=article&item=llvm-clang-39&num=4. At the LLVM Developer Meeting, Michael showed double-digit compile-time increases in clang 3.7 and 3.8 at Os/O3 and O0 in his Loop Passes presentation (https://llvmdevelopersmeetingbay2016.sched.org/event/8Z0B/loop-passes-adding-new-features-while-reducing-technical-debt):
[Image: PastedGraphic-3.png — compile-time comparison from the Loop Passes presentation]

The selection of 10 tests out of the LLVM test suite invites criticism. The motivation here is to pick something “reasonable” and adjust as needed going forward. Specifically, this means running a wider set of tests from time to time and adjusting the tests in CTMark, either by adding more tests or by removing ineffective ones. Internally we also track a set of benchmarks and check how their compile-time regressions correlate with CTMark. And most importantly, I think the community's shared interest in and commitment to compile time will carry this forward.

Special thanks to Chris and Michael for getting this started!

-Gerolf

Hi Gerolf,

This is really cool!
I’m very excited about this initiative, and I hope we’ll be able to get to a stage where compile-time regressions are handled like other regressions: if a regression is not promptly justified or declared expected by the commit author, the commit should be reverted in the meantime!

I’d like to suggest adding to CTMark the “empty” compile test (and maybe “empty + one empty function”), unless it is too noisy to measure.
It is an interesting test to complement the existing ones because it measures the general overhead of setting up all the “infrastructure” (static initializers, creating a pass pipeline, etc.)

> Hi Gerolf,
>
> This is really cool!
> I’m very excited about this initiative, and I hope we’ll be able to get to a stage where compile-time regressions are handled like other regressions: if a regression is not promptly justified or declared expected by the commit author, the commit should be reverted in the meantime!
>
> I’d like to suggest adding to CTMark the “empty” compile test (and maybe “empty + one empty function”), unless it is too noisy to measure.
> It is an interesting test to complement the existing ones because it measures the general overhead of setting up all the “infrastructure” (static initializers, creating a pass pipeline, etc.)

That would indeed be a very interesting test; however, it would be way too short to measure predictably on its own.
I could see it working if we had a flag that artificially runs the compilation pipeline hundreds of times, or if we alternatively put the whole compiler into something like Google Benchmark.

- Matthias

Another outstanding idea is a generator for certain patterns of source code.
For a case like this, we could have a generator produce N empty functions. All of this would be more of a performance unit test stressing the compiler with particular patterns. In that sense, CTMark is more of an end-to-end test, measuring the experience of an end user compiling a large project.
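Such a generator could be very simple. The following sketch (the file name and function naming scheme are made up for illustration) emits N empty C functions to stress the compiler's per-function overhead:

```python
# Illustrative sketch of a source-code generator for the "N empty functions"
# stress pattern described above. Names here are invented, not an actual tool.
def generate_empty_functions(n):
    """Return C source text containing `n` empty functions."""
    return "\n".join(f"void fn_{i}(void) {{}}" for i in range(n)) + "\n"

# Emit a file with 1000 empty functions, then compile it with e.g.:
#   clang -c -O0 stress_empty.c
with open("stress_empty.c", "w") as f:
    f.write(generate_empty_functions(1000))
```

Scaling N up and plotting compile time against it would expose the per-function setup cost separately from optimization work.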

- Matthias

>> Hi Gerolf,
>>
>> This is really cool!
>> I’m very excited about this initiative, and I hope we’ll be able to get to a stage where compile-time regressions are handled like other regressions: if a regression is not promptly justified or declared expected by the commit author, the commit should be reverted in the meantime!
>>
>> I’d like to suggest adding to CTMark the “empty” compile test (and maybe “empty + one empty function”), unless it is too noisy to measure.
>> It is an interesting test to complement the existing ones because it measures the general overhead of setting up all the “infrastructure” (static initializers, creating a pass pipeline, etc.)
>
> That would indeed be a very interesting test; however, it would be way too short to measure predictably on its own.

Are you afraid of the measurement noise for this?

> I could see it working if we had a flag that artificially runs the compilation pipeline hundreds of times, or if we alternatively put the whole compiler into something like Google Benchmark.

Since I’m interested in startup time in general, it’d have to be a loop in a shell script that invokes clang ~1000 times (or better: with a statistical measure of “confidence” that stops the loop when it reaches a threshold or stops making progress), and returns something like the geometric mean.

I think I remember Michael G. doing something like that for Swift performance testing?
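The loop described above could be sketched roughly as follows. Everything here is illustrative: the stopping rule, the constants, and the `sample()` placeholder are assumptions, and in practice `sample()` would time one `clang` invocation via `subprocess`:

```python
# Rough sketch of a measurement loop with a confidence-based stopping rule,
# reporting the geometric mean. Constants and structure are illustrative.
import math
import statistics

def measure(sample, rel_ci=0.01, min_runs=10, max_runs=1000):
    """Call `sample()` (one timed run, in seconds) until the 95% confidence
    half-width is below `rel_ci` of the mean or `max_runs` is hit, then
    return the geometric mean of the collected timings."""
    times = []
    for _ in range(max_runs):
        times.append(sample())
        if len(times) >= min_runs:
            mean = statistics.mean(times)
            half_width = 1.96 * statistics.stdev(times) / math.sqrt(len(times))
            if half_width < rel_ci * mean:
                break  # estimate is tight enough; stop early
    return math.exp(statistics.fmean(math.log(t) for t in times))
```

A real harness would also pin the machine down (CPU frequency, background load) before trusting runs this short.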

>>> Hi Gerolf,
>>>
>>> This is really cool!
>>> I’m very excited about this initiative, and I hope we’ll be able to get to a stage where compile-time regressions are handled like other regressions: if a regression is not promptly justified or declared expected by the commit author, the commit should be reverted in the meantime!
>>>
>>> I’d like to suggest adding to CTMark the “empty” compile test (and maybe “empty + one empty function”), unless it is too noisy to measure.
>>> It is an interesting test to complement the existing ones because it measures the general overhead of setting up all the “infrastructure” (static initializers, creating a pass pipeline, etc.)
>>
>> That would indeed be a very interesting test; however, it would be way too short to measure predictably on its own.
>
> Are you afraid of the measurement noise for this?
>
>> I could see it working if we had a flag that artificially runs the compilation pipeline hundreds of times, or if we alternatively put the whole compiler into something like Google Benchmark.
>
> Since I’m interested in startup time in general, it’d have to be a loop in a shell script that invokes clang ~1000 times (or better: with a statistical measure of “confidence” that stops the loop when it reaches a threshold or stops making progress), and returns something like the geometric mean.

Can someone report actual experience with this? I would expect extra noise once the operating system gets involved in creating thousands of processes.

> I think I remember Michael G. doing something like that for Swift performance testing?

Well, to come back to the end-to-end vs. unit-test comparison: if startup time gets out of hand, it will show up in CTMark as well. If it does not get out of hand and requires special measurement techniques because it is well below 0.5s, then end users probably won’t care either.

- Matthias

We have measured null/empty tests before. On a well-stabilized machine there was no problem detecting regressions.

(Repeating here what we discussed offline:) a 2x increase on the empty test may not show up in a test where, for instance, SelectionDAG takes most of the time.
It may still matter on some codebases at O0, or for LTO, where clang does not go through CodeGen.
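A back-of-the-envelope sketch makes that trade-off concrete (all numbers below are invented): if startup overhead doubles, the visible slowdown depends entirely on what fraction of total compile time startup represents.

```python
# Illustrative arithmetic only: how a multiplied startup cost translates
# into overall compile-time slowdown, depending on startup's share.
def total_slowdown(startup_fraction, startup_factor=2.0):
    """Relative increase in total compile time when the startup portion
    is multiplied by `startup_factor`."""
    return startup_fraction * (startup_factor - 1.0)

# A SelectionDAG-heavy compile where startup is ~1% of the time:
print(f"{total_slowdown(0.01):.1%}")  # → 1.0%
# An O0 compile of a small file where startup is ~30% of the time:
print(f"{total_slowdown(0.30):.1%}")  # → 30.0%
```

So a 2x startup regression is nearly invisible in the first case but dominant in the second, which is exactly why O0 and LTO configurations are the ones to watch.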