I don't know what to expect, but the results seem to be quite noisy and unstable. E.g., I've done two runs on two different commits that only differ by a space in CODE_OWNERS.txt on my 12-core Ubuntu 14.04 machine with:
The numbers bounce around a lot if I do more runs.
Given the amount of noise I see here, I don't know how to sort out significant regressions if I actually make a real change in the compiler.
Are the above results expected?
How should I use this?
As a bonus question, if I instead run the benchmarks with an added -m32:
lnt runtest nt --sandbox SANDBOX --cflag=-m32 --cc <path-to-my-clang> --test-suite /data/repo/test-suite -j 8
I get three failures:
--- Tested: 2465 tests --
FAIL: MultiSource/Applications/ClamAV/clamscan.compile_time (1 of 2465)
FAIL: MultiSource/Applications/ClamAV/clamscan.execution_time (494 of 2465)
FAIL: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.execution_time (495 of 2465)
Is this known/expected, or am I doing something stupid?
Some noisiness in benchmark results is expected, but the numbers you see seem to be higher than I'd expect.
A number of tricks people use to get lower-noise results are (with the lnt runtest nt command line options that enable them in parentheses):
* Only build the benchmarks in parallel, but run the actual benchmark code at most one at a time. (--threads 1 --build-threads 6)
* Make lnt use Linux perf to get more accurate timing for short-running benchmarks. (--use-perf=1)
* Pin the running benchmark to a specific core, so the OS doesn't move the benchmark process from core to core. (--make-param="RUNUNDER=taskset -c 1"; the quotes keep the shell from splitting the value at the spaces)
* Only run the programs that are marked as a benchmark; some of the tests in the test-suite are not intended to be used as a benchmark. (--benchmarking-only)
* Make sure each program gets run multiple times, so that LNT has a higher chance of recognizing which programs are inherently noisy. (--multisample=3)
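As a rough illustration, the options listed above could be combined into one invocation along these lines. This is only a sketch: the helper name `lnt_nt_command`, its parameters, and the compiler path are made up for the example, and the script only prints the command rather than running it.

```python
import shlex


def lnt_nt_command(cc, test_suite, sandbox="SANDBOX",
                   build_threads=6, samples=3, core=1):
    """Build an 'lnt runtest nt' argv using the low-noise options above.

    Hypothetical helper: only the flags themselves come from this thread.
    """
    return [
        "lnt", "runtest", "nt",
        "--sandbox", sandbox,
        "--cc", cc,
        "--test-suite", test_suite,
        # Build in parallel, but run one benchmark at a time.
        "--threads", "1",
        "--build-threads", str(build_threads),
        # Use Linux perf for timing short-running benchmarks.
        "--use-perf=1",
        # Pin the benchmark to one core; this is a single argument
        # even though it contains spaces.
        f"--make-param=RUNUNDER=taskset -c {core}",
        # Skip tests that are not meant as benchmarks.
        "--benchmarking-only",
        # Run each program several times.
        f"--multisample={samples}",
    ]


# Placeholder paths; substitute your own compiler and test-suite checkout.
cmd = lnt_nt_command("/path/to/clang", "/data/repo/test-suite")
print(shlex.join(cmd))
```

Building the command as a list and printing it with `shlex.join` makes the quoting of the RUNUNDER value explicit, which is easy to get wrong when typing the command by hand.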
I hope this is the kind of answer you were looking for?
Do the above measures reduce the noisiness to acceptable levels for your setup?
I get massively more stable execution times on 16.04 than on 14.04 on both x86 and ARM because 16.04 does far fewer gratuitous moves from one core to another, even without explicit pinning.
Turn off ASLR: “echo 0 > /proc/sys/kernel/randomize_va_space”. As well as giving stable addresses for repeatable debugging, it also reduces execution time variability due to “random” conflicts in caches, hash collisions in branch prediction or the BTB, maybe even the uop cache.
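Before a benchmarking session it can help to record which ASLR setting the machine is actually using. A minimal sketch, assuming a Linux host; the function names are invented for the example, and the script only reads the setting, it does not change it:

```python
from pathlib import Path

# Linux exposes the ASLR mode here: 0 = off, 1 = partial, 2 = full.
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")


def aslr_setting(path=ASLR_PATH):
    """Return the ASLR mode as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_aslr(value):
    """Map the numeric mode to a short label for a benchmark log."""
    labels = {0: "disabled", 1: "partial randomization", 2: "full randomization"}
    return labels.get(value, "unknown")


print("ASLR:", describe_aslr(aslr_setting()))
```

Logging this next to each set of results makes it possible to tell later whether two runs were taken under the same setting.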
FWIW, I personally think it’s better to keep ASLR turned on. It’s better to get the performance fluctuations in your experiments from the slight changes in code layout from ASLR, as that gives some kind of indication of how sensitive the specific program, core, environment is to layout changes. If you disable ASLR and get a big speed difference when evaluating a compiler patch, you still won’t know if it’s down to some code layout change in a hot piece of code that your patch otherwise didn’t change at all. Keeping ASLR turned on isn’t perfect by far: if you really want to evaluate this properly, you might need to introduce more code layout randomization in your experiments. I’ve talked about this in a bit more detail at EuroLLVM last year, see https://www.youtube.com/watch?v=COmfRpnujF8.
Being able to more quickly determine whether a performance change is due to the intent of the compiler patch you’ve written or due to a micro-architectural non-linearity (such as a big speed difference due to a small code layout change) was one of the main motivations to add profile-annotated disassembly views to LNT, as demonstrated at http://blog.llvm.org/2016/06/using-lnt-to-track-performance.html, or https://fosdem.org/2017/schedule/event/lnt/. Beware that to use this feature, you’ll need to use the cmake+lit infrastructure in the test-suite rather than the older make infrastructure. From lnt runtest, this can be done by using “lnt runtest test-suite” rather than “lnt runtest nt”.
You should try it both ways, certainly. But it’s good to isolate different effects from unrelated library code, especially if you’re specifically working on things such as whether aligning branch targets is worthwhile, or choosing different instruction encodings to maximize dispatch width from 16 or 32 byte blocks.
To put it another way: N improvements, each individually below the ASLR noise level, can together add up to something significant, but you need to be able to tell whether they are each individually improvements.
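To make the arithmetic behind this concrete, here is a small sketch with made-up numbers: five independent improvements of 0.5% each, measured against an assumed noise level of 1%. Both figures are purely illustrative and are not taken from any measurement in this thread.

```python
def combined_speedup(improvements):
    """Total fractional time reduction from independent improvements.

    Each entry is the fraction of execution time one change removes;
    the effects are assumed to multiply.
    """
    remaining = 1.0
    for frac in improvements:
        remaining *= 1.0 - frac
    return 1.0 - remaining


noise = 0.01           # assumed run-to-run noise (hypothetical)
steps = [0.005] * 5    # five changes of 0.5% each (hypothetical)
total = combined_speedup(steps)

print(f"each step: {steps[0]:.1%}, noise: {noise:.0%}, combined: {total:.2%}")
```

Each step on its own sits below the assumed noise level, while the combined effect is roughly 2.5%, which is why being able to measure the individual steps matters.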
I think LNT should use taskset for the benchmarks if there is more than one core. We usually taskset the scripts to core zero and the benchmark to a specific core (A53, A57) if they are different, or core 1 if they're all the same.
In addition to all the good points given in this thread:
- Nowadays I'd recommend using 'lnt runtest test-suite' instead of 'nt' to use the cmake/lit based variant.
- Alternatively, if you just need an A/B comparison, run the benchmarks directly as described in http://www.llvm.org/docs/TestSuiteMakefileGuide.html#running-the-test-suite-via-cmake and use test-suite/utils/compare.py
- Use --benchmarking-only (lnt) / -DTEST_SUITE_BENCHMARKING_ONLY (cmake) to remove a number of tests that are useless for performance testing (like all the unittests in there)
- I created a blacklist of benchmarks that are noisy for my target by rerunning the test-suite a few times with the same compiler. I can feed this blacklist to `utils/compare.py --filter-blacklist`
- While we are on the topic: I recommend this talk from last year's dev meeting to dampen the expectation that every good compiler transformation must lead to better (or at least neutral) performance: 2016 LLVM Developers’ Meeting: Z. Ansari "Causes of Performance Instability due to Code ..." - YouTube. I think one lesson we should draw from this is that we can use benchmarking as an indicator of problems, but there is no way around manually checking the assembly differences wherever we measured different performance.
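The blacklist idea mentioned in the list above could be approximated along the following lines. This is a sketch under assumptions: the timings are invented, the 2% threshold is arbitrary, the function name is hypothetical, and it is not how the blacklist described here was actually produced.

```python
from statistics import mean, pstdev


def noisy_benchmarks(runs, max_rel_spread=0.02):
    """Return benchmarks whose timing varies too much across identical runs.

    runs: list of {benchmark name: execution time} dicts, all produced
    with the same compiler. A benchmark is flagged when its standard
    deviation divided by its mean exceeds max_rel_spread.
    """
    if not runs:
        return []
    # Only consider benchmarks present in every run.
    names = set.intersection(*(set(r) for r in runs))
    noisy = []
    for name in sorted(names):
        times = [r[name] for r in runs]
        avg = mean(times)
        if avg > 0 and pstdev(times) / avg > max_rel_spread:
            noisy.append(name)
    return noisy


# Hypothetical timings (seconds) from three runs with the same compiler.
runs = [
    {"bench_a": 1.00, "bench_b": 2.0},
    {"bench_a": 1.01, "bench_b": 2.6},
    {"bench_a": 0.99, "bench_b": 1.7},
]
blacklist = noisy_benchmarks(runs)
print("\n".join(blacklist))
```

The printed names could then be saved to a file; the exact file format that `--filter-blacklist` expects is not stated in this thread, so check compare.py before relying on one name per line.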
I hope this is the kind of answer you were looking for?
Spot on! Thanks!
Do the above measures reduce the noisiness to acceptable levels for your setup?
I ran with all your suggestions above and now I have:
* Only build the benchmarks in parallel, but do the actual running of the benchmark code at most one at a time. (--threads 1 --build-threads 6).
This seems critical, I always do that.
* Make sure each program gets run multiple times, so that LNT has a higher chance of recognizing which programs are inherently noisy (--multisample=3)
This as well, with usually 5 multisamples.
I’d add to this good list: disable frequency scaling / turbo boost. In case of thermal throttling it can skew the results.
Only build the benchmarks in parallel, but do the actual running of the benchmark code at most one at a time. (--threads 1 --build-threads 6).
This seems critical, I always do that.
+1.
Make sure each program gets run multiple times, so that LNT has a higher chance of recognizing which programs are inherently noisy (--multisample=3)
This as well, with usually 5 multisamples.
As far as I remember, LNT uses some advanced statistics if the number of samples is >= 4, so I’d recommend using at least 4.
I’d add to this good list: disable frequency scaling / turbo boost. In case of thermal throttling it can skew the results.
+1.
I also usually rerun suspiciously improved or regressed tests to verify the performance change. Most of the time, if it was just noise, the test doesn’t appear on another run. I wish LNT (or any other script) could do that for me.
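A script for this kind of rerun check might look roughly like the sketch below. Everything in it is hypothetical: the function names, the 5% threshold, and the timings are invented, and it does not reflect how LNT implements anything.

```python
def relative_change(before, after):
    """Fractional change in execution time; positive means slower."""
    return (after - before) / before


def confirmed_changes(baseline, first_run, rerun, threshold=0.05):
    """Keep tests whose change shows up in both the first run and the rerun.

    A test is confirmed only if both runs differ from the baseline by
    more than `threshold` and in the same direction.
    """
    confirmed = {}
    for name, base in baseline.items():
        if name not in first_run or name not in rerun or base <= 0:
            continue
        d1 = relative_change(base, first_run[name])
        d2 = relative_change(base, rerun[name])
        if abs(d1) > threshold and abs(d2) > threshold and d1 * d2 > 0:
            confirmed[name] = (d1, d2)
    return confirmed


# Hypothetical execution times in seconds.
baseline = {"test_x": 1.00, "test_y": 2.00}
first_run = {"test_x": 1.20, "test_y": 2.30}
rerun = {"test_x": 1.19, "test_y": 2.01}

for name, (d1, d2) in confirmed_changes(baseline, first_run, rerun).items():
    print(f"{name}: {d1:+.1%} then {d2:+.1%}")
```

In this made-up data, test_x is slower in both runs and is kept, while test_y looked slower only in the first run and is dropped as likely noise.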
Now, I'm sure that I haven't read every piece of documentation about the test suite, but don't you think the tips and tricks you've responded with here should make it into the quick start web page to help the next test-suite newbie who wants to run this and get stable results?
There could be a bullet 3 under "Running Tests" or just some extra proposed flags under "2" to describe a few things one could do if the results bounce around a lot.
Doesn’t the lnt runtest nt --rerun command line option allow you to do this?
If not, what is missing? I don’t use this option at the moment, but it would be nice to know if it does scratch your itch or not. Also, we still need to implement that functionality for lnt runtest test-suite.
Now, I’m sure that I haven’t read every piece of documentation about the test suite, but don’t you think the tips and tricks you’ve responded with here should make it into the quick start web page to help the next test-suite newbie who wants to run this and get stable results?
It definitely should.
It’s sometimes hard for the non-newbies to figure out what documentation is missing the most, so thank you very much for pointing this out!
I’ve added some documentation in the patch under review at https://reviews.llvm.org/D30488.
Please have a look and leave your comments. I’ll leave the patch in review until the end of the week before committing it.
I also usually rerun suspiciously improved or regressed tests to verify the performance change. Most of the time, if it was just a noise, the test doesn’t appear on another run. I wish LNT (or any other script) could do that for me
Michael
Doesn’t the lnt runtest nt --rerun command line option allow you to do this?
Hmm, I think I tried to use it in the past, and it didn’t work for me for some reason - but I don’t remember for sure. Maybe that’s exactly what I asked for.
If not, what is missing? I don’t use this option at the moment, but it would be nice to know if it does scratch your itch or not. Also, we still need to implement that functionality for lnt runtest test-suite.
Yeah, I’m using ‘lnt runtest test-suite’ now most of the time.
It would also be good to implement re-running directly at the test-suite/litsupport level. In principle that is trivial to do, however last time I looked at it lit would only support outputting 1 value per benchmark/metric …