RFC: LNT Improvements

Dear all,

Following the Benchmarking BoF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.

The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each method.
- Increase the execution time of the benchmark so it runs long enough to avoid noisy results
        Currently there are two options for running benchmarks, namely small and large problem sizes. I propose adding a third option: adaptive. In adaptive mode, each benchmark scales its problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to verify correctness on a separate board with the small problem size. (A sketch of the scaling idea follows after this list.)
        LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
        Test suite: [PATCH 1/2] Add support for adaptive problem size
                        [PATCH 2/2] A subset of test suite programs modified for adaptive
- Show and graph total compile time
        There is no obvious way to scale up the compile time of individual benchmarks, so total time is the best thing we can do to minimize error.
        LNT: [PATCH 1/3] Add Total to run view and graph plot
- Only show performance changes with high confidence in summary report
        To investigate the correlation between program run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this knowledge, we can hide results with a low confidence level from the summary report. They remain available, marked in colour, in the detailed report for anyone interested.
        LNT: [PATCH 3/3] Ignore tests with very short run time
- Make sure board has low background noise
        Perform a system performance benchmark before each run and compare the value with a reference (obtained during machine set-up). If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
        LNT: benchmark.sh
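To make the adaptive idea above more concrete, here is a rough Python sketch of the scaling logic only. This is hypothetical illustration, not the attached patches: the real proposal uses a pre-measured system performance value, while the sketch simply times the kernel once and assumes run time grows roughly linearly with problem size.

import time

TARGET_SECONDS = 10.0

def calibrate(kernel, base_size):
    # Time one run of the benchmark kernel at the base problem size.
    start = time.perf_counter()
    kernel(base_size)
    return max(time.perf_counter() - start, 1e-6)

def adaptive_size(kernel, base_size):
    # Pick a problem size so a full run takes roughly TARGET_SECONDS.
    # A real implementation would clamp the factor and round to a size
    # the benchmark actually accepts.
    factor = TARGET_SECONDS / calibrate(kernel, base_size)
    return max(base_size, int(base_size * factor))

def kernel(n):
    # Toy stand-in for a benchmark kernel: sum of squares over n iterations.
    return sum(i * i for i in range(n))

print("chosen problem size:", adaptive_size(kernel, base_size=1000000))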

With my prototype implementation, the summary report becomes much more useful: there are almost no noisy readings, while small regressions remain detectable for long-running benchmark programs. The implementation is backwards compatible with older databases.

Screenshots from a sample run are attached.

Thanks for reading!


patchset.tar.gz (14.4 KB)

Hi Yi Kong,

Thanks for working on this. I think there is a lot we can improve here. I've copied Mingxing Tan, who has worked on a couple of patches in this area before, and Chris, who maintains LNT.

Dear all,

Following the Benchmarking BoF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.

The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

Right.

After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each method.
- Increase the execution time of the benchmark so it runs long enough to avoid noisy results
         Currently there are two options for running benchmarks, namely small and large problem sizes. I propose adding a third option: adaptive. In adaptive mode, each benchmark scales its problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to verify correctness on a separate board with the small problem size.
         LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
         Test suite: [PATCH 1/2] Add support for adaptive problem size
                         [PATCH 2/2] A subset of test suite programs modified for adaptive

I think it will be easier to review such patches one by one on the commits mailing lists, especially as this one is a little larger.

In general, I see such changes as a second step. First, we want a system in place that allows us to reliably detect whether a benchmark is noisy; second, we want to increase the number of benchmarks that are not noisy and whose results we can use.

- Show and graph total compile time
         There is no obvious way to scale up the compile time of individual benchmarks, so total time is the best thing we can do to minimize error.
         LNT: [PATCH 1/3] Add Total to run view and graph plot

I did not see the effect of these changes in your images and honestly also do not fully understand what you are doing. What is the total compile time? Don't we already show the compile time in the run view? How is the total time different from this compile time?

Maybe you can answer this in a separate patch email.

- Only show performance changes with high confidence in summary report
         To investigate the correlation between program run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this knowledge, we can hide results with a low confidence level from the summary report. They remain available, marked in colour, in the detailed report for anyone interested.
         LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is the most important point and the one we should address first.
In fact, I would prefer to go even further and actually compute the confidence, and make the required confidence an option. This would allow us to understand both how stable or noisy a machine is and how well the other changes you propose work in practice.

We had a longer discussion on llvmdev named 'Questions about results reliability in LNT infrastructure'. Anton suggested doing the following:

1. Get 5-10 samples per run
2. Do the Wilcoxon/Mann-Whitney test

I already set up -O3 buildbots that provide 10 runs per commit, and the noise for them is very low:

http://llvm.org/perf/db_default/v4/nts/25151?num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=median&compare_to=25149&submit=Update

If you are interested in performance data to test your changes, you can extract the results from the LLVM buildmaster at:

http://lab.llvm.org:8011/builders/polly-perf-O3/builds/2942/steps/lnt.nightly-test/logs/report.json

with 2942 being one of the latest successful builds. By going backwards
or forwards you should get other builds if they have been successful.

There should be a standard function for the Wilcoxon/Mann-Whitney test in scipy, so in case you are interested, adding these reliability numbers as a first step seems to be a simple and purely beneficial commit.
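For reference, a minimal sketch of what that could look like, assuming scipy's mannwhitneyu; the helper name and the sample data below are made up:

from scipy.stats import mannwhitneyu

def significant_change(samples_before, samples_after, alpha=0.05):
    # Two independent sets of execution-time samples for one benchmark,
    # e.g. 10 samples each as produced by the -O3 builders above.
    stat, p_value = mannwhitneyu(samples_before, samples_after,
                                 alternative='two-sided')
    return p_value < alpha, p_value

before = [1.02, 1.01, 1.03, 1.02, 1.00, 1.01, 1.02, 1.03, 1.01, 1.02]
after  = [1.08, 1.07, 1.09, 1.08, 1.07, 1.08, 1.09, 1.07, 1.08, 1.08]
print(significant_change(before, after))   # -> (True, <small p-value>)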

- Make sure board has low background noise
         Perform a system performance benchmark before each run and compare the value with a reference (obtained during machine set-up). If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
         LNT: benchmark.sh

I am a little sceptical about this. Machines should generally not be noisy. However, if for some reason there is noise on the machine, the noise is as likely to appear during this pre-run noise test as during the actual benchmark runs, maybe during both, but maybe also only during the benchmark. So I am afraid we might often run into the situation where this test says OK but the later run still suffers from noise.

I would probably prefer to first make the previous point, reporting reliability, work well; then we can see for each test/benchmark whether noise was involved.

All the best,
Tobias

In general, I see such changes as a second step. First, we want a system in place that allows us to reliably detect whether a benchmark is noisy; second, we want to increase the number of benchmarks that are not noisy and whose results we can use.

I personally use the test-suite for correctness, not performance, and would not like to have its run time increased by any means.

As discussed in the BoF last year, I'd appreciate it if we could separate the test run from the benchmark run before we make any changes.

I want to have a separate benchmark bot running the subset that makes sense as benchmarks, but I don't want the noise of the rest.

1. Get 5-10 samples per run
2. Do the Wilcoxon/Mann-Whitney test

5-10 samples on an ARM board is not feasible. Currently it takes 1
hour to run the whole set. Making it run for 5-10 hours will reduce
its value to zero.

I am a little sceptical about this. Machines should generally not be noisy.

ARM machines work at a much lower power level than Intel ones. The
scheduler is a lot more aggressive and the quality of the peripherals
is *a lot* worse.

Even if you set up the board for benchmarks (fix the scheduler, put
everything up to 11), the quality of the external hardware (USB, SD,
eMMC, etc) and their drivers do a lot of damage to any meaningful
number you may extract if the moon is full and Jupiter is in
Sagittarius.

So...

However, if for some reason there is noise on the machine, the noise is as likely to appear during this pre-run noise test as during the actual benchmark runs, maybe during both, but maybe also only during the benchmark. So I am afraid we might often run into the situation where this test says OK but the later run still suffers from noise.

...this is not entirely true, on ARM.

We may be getting server quality hardware for AArch64 any time now,
but it's very unlikely that we'll *ever* get quality 32-bit test
boards.

cheers,
--renato

Hi Renato,

In general, I see such changes as a second step. First, we want a system in place that allows us to reliably detect whether a benchmark is noisy; second, we want to increase the number of benchmarks that are not noisy and whose results we can use.

I personally use the test-suite for correctness, not performance, and would not like to have its run time increased by any means.

I agree, we should not complicate the current use as a correctness test-suite, though I don't think any of the proposed changes does this.

I also believe that, as a first step, we should not touch the test suite at all, but just improve how LNT reports the results it gets.

As discussed in the BoF last year, I'd appreciate it if we could separate the test run from the benchmark run before we make any changes.

To my understanding, the first patches should just improve LNT to report how reliable the results it reports are. So there is no way this can affect the test suite runs, which means I do not see why we would want to delay such changes.

In fact, if we have a good idea which kernels are reliable and which ones are not, we can probably use this information to actually mark benchmarks that are known to be noisy.

I want to have a separate benchmark bot running the subset that makes sense as benchmarks, but I don't want the noise of the rest.

Right. There are two steps to get here:

1) Measure and show if a benchmark result is reliable

2) Avoid running the benchmarks known to be noisy/unreliable

1. Get 5-10 samples per run
2. Do the Wilcoxon/Mann-Whitney test

5-10 samples on an ARM board is not feasible. Currently it takes 1
hour to run the whole set. Making it run for 5-10 hours will reduce
its value to zero.

Reporting numbers that are not 100% reliable makes the results useless as well. As ARM boards are cheap, you could just put 5 boxes in place and we would get the samples we need. Even if this is not yet feasible, I would rather run 5 samples of the benchmarks you really care about than run everything once and get unreliable numbers.

I am a little sceptical about this. Machines should generally not be noisy.

Let me rephrase. "Machines on which you would like to run benchmarks should have a consistent and low enough level of noise"

ARM machines work at a much lower power level than Intel ones. The
scheduler is a lot more aggressive and the quality of the peripherals
is *a lot* worse.

Even if you set up the board for benchmarks (fix the scheduler, put
everything up to 11), the quality of the external hardware (USB, SD,
eMMC, etc) and their drivers do a lot of damage to any meaningful
number you may extract if the moon is full and Jupiter is in
Sagittarius.

So...


However, if for some reason there is noise on the machine, the noise is as likely to appear during this pre-run noise test as during the actual benchmark runs, maybe during both, but maybe also only during the benchmark. So I am afraid we might often run into the situation where this test says OK but the later run still suffers from noise.

...this is not entirely true, on ARM.

So do you think the benchmark.sh script proposed by Yi Kong is useful for ARM?

Cheers,
Tobias

To my understanding, the first patches should just improve LNT to report how reliable the results it reports are. So there is no way this can affect the test suite runs, which means I do not see why we would want to delay such changes.

In fact, if we have a good idea which kernels are reliable and which ones
are not, we can probably use this information to actually mark benchmarks
that are known to be noisy.

Right, yes, that'd be a good first step. I just wanted to make sure
that we don't just assume 10 runs is ok for everyone and consider it
done.

Reporting numbers that are not 100% reliable makes the results useless as well. As ARM boards are cheap, you could just put 5 boxes in place and we would get the samples we need. Even if this is not yet feasible, I would rather run 5 samples of the benchmarks you really care about than run everything once and get unreliable numbers.

That'd be another source of noise. You can't consider 5 boards'
results to be the same as 5 results in 1 board.

They're cheap (as in quality) and different boards (of the same brand
and batch) have different manufacturing defects that are only exposed
when we crush them to death with compiler tests and benchmarks. Nobody
in the factory has ever tested for that, since they only expect you to
run light stuff like media players, web servers, routers.

Let me rephrase. "Machines on which you would like to run benchmarks should
have a consistent and low enough level of noise"

No 32-bit ARM machine I have tested so far fits that bill.

So do you think the benchmark.sh script proposed by Yi Kong is useful for
ARM?

I'm also sceptical about that. I don't think that the noise on setup
will be any better or worse than noise during tests.

The only way to be sure is to run it every time and to understand the curve, find a cut, and warn on every noise level above the cut. Mind you, this cut will be dynamic as the number of results grows, but once we have a few dozen runs, it should stabilise.
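Roughly something like this, just to sketch the idea (my own sketch, not an existing LNT feature; the cut-off factor and numbers are arbitrary):

from statistics import median

def noise_cut(history, k=3.0):
    # Cut-off = median + k * MAD over the historical baseline-check results.
    med = median(history)
    mad = median(abs(x - med) for x in history)
    return med + k * mad

def baseline_ok(history, latest):
    if len(history) < 12:      # wait for a dozen runs before trusting the cut
        return True
    return latest <= noise_cut(history)

past = [4.01, 4.03, 4.00, 4.02, 4.05, 4.01, 4.02, 4.04, 4.00, 4.03, 4.02, 4.01]
print(baseline_ok(past, 4.02))   # True  - within the usual noise level
print(baseline_ok(past, 4.60))   # False - warn, the board looks noisy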

But that is not a replacement for running the test multiple times or
for longer times. We need statistical significance.

cheers,
--renato

To my understanding, the first patches should just improve LNT to report how reliable the results it reports are. So there is no way this can affect the test suite runs, which means I do not see why we would want to delay such changes.

In fact, if we have a good idea which kernels are reliable and which ones
are not, we can probably use this information to actually mark benchmarks
that are known to be noisy.

Right, yes, that'd be a good first step. I just wanted to make sure
that we don't just assume 10 runs is ok for everyone and consider it
done.

Right.

Reporting numbers that are not 100% reliable makes the results useless as well. As ARM boards are cheap, you could just put 5 boxes in place and we would get the samples we need. Even if this is not yet feasible, I would rather run 5 samples of the benchmarks you really care about than run everything once and get unreliable numbers.

That'd be another source of noise. You can't consider 5 boards'
results to be the same as 5 results in 1 board.

They're cheap (as in quality) and different boards (of the same brand
and batch) have different manufacturing defects that are only exposed
when we crush them to death with compiler tests and benchmarks. Nobody
in the factory has ever tested for that, since they only expect you to
run light stuff like media players, web servers, routers.

I see the point. There are ways around this, e.g. by running different benchmarks on different boards, but what it boils down to is that we first need to reliably measure and report the quality of the results.
Only then can we judge the effects of changes aimed at increasing the quality.

Let me rephrase. "Machines on which you would like to run benchmarks should
have a consistent and low enough level of noise"

No 32-bit ARM machine I have tested so far fits that bill.

So do you think the benchmark.sh script proposed by Yi Kong is useful for
ARM?

I'm also sceptical about that. I don't think that the noise on setup
will be any better or worse than noise during tests.

The only way to be sure is to run it every time and to understand the curve, find a cut, and warn on every noise level above the cut. Mind you, this cut will be dynamic as the number of results grows, but once we have a few dozen runs, it should stabilise.

But that is not a replacement for running the test multiple times or
for longer times. We need statistical significance.

Agreed.

My proposal is to do this right away. As there is enough data from the public X86 -O3 runs (10 samples per run, with 3-5 commits between runs), the only missing piece seems to be the LNT changes to report on statistical significance. Yi Kong has already started to hack on this and might be able to adjust his changes. Let's wait for his opinion.

Cheers,
Tobias

Only then can we judge the effects of changes aimed at increasing the quality.

Agreed.

My proposal is to do this right away. As there is enough data from the public X86 -O3 runs (10 samples per run, with 3-5 commits between runs), the only missing piece seems to be the LNT changes to report on statistical significance. Yi Kong has already started to hack on this and might be able to adjust his changes. Let's wait for his opinion.

Ok.

--renato

Hi Tobias, Renato,

Thanks for your attention to my RFC.

> We had a longer discussion on llvmdev named 'Questions about
> results reliability in LNT infrastructure'. Anton suggested doing the
> following:
>
> 1. Get 5-10 samples per run
> 2. Do the Wilcoxon/Mann-Whitney test

My current analysis uses Student's t-test, assuming that programs with
similar run time have similar standard deviation, which seems to be an
over-simplification after going through your previous discussion about
results reliability. I will go ahead and implement the Mann-Whitney test
and see if that produces better results.
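The minimal-difference idea from the RFC boils down to something like the following simplified sketch (assuming scipy; not the exact patch code): under a normality assumption, how big a change must be before a two-sample t-test at level alpha can call it significant.

from math import sqrt
from scipy.stats import t

def minimal_significant_diff(stdev, n, alpha=0.05):
    # Smallest mean difference a two-sided two-sample t-test can call
    # significant, given n samples per run and a common standard deviation.
    t_crit = t.ppf(1 - alpha / 2, 2 * n - 2)
    return t_crit * stdev * sqrt(2.0 / n)

# With 10 samples and 2% noise, changes below roughly 1.9% of the mean
# run time are indistinguishable from noise.
print(minimal_significant_diff(stdev=0.02, n=10))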

>> - Show and graph total compile time
>> There is no obvious way to scale up the compile time of
>> individual benchmarks, so total time is the best thing we can do to
>> minimize error.
>> LNT: [PATCH 1/3] Add Total to run view and graph plot
>
> I did not see the effect of these changes in your images and also
> honestly do not fully understand what you are doing. What is the
> total compile time? Don't we already show the compile time in run
> view? How is the total time different to this compile time?

It is hard to spot minor improvements or regressions over a large number
of tests amid independent machine noise. So I added a "total time"
analysis to the run report, together with the ability to graph its trend,
hoping that the noise will cancel out and help us spot changes easily.
(Screenshot attached)

> I am a little sceptical on this. Machines should generally not be
> noisy. However, if for some reason there is noise on the machine, the
> noise is as likely to appear during this pre-noise-test than during
> the actual benchmark runs, maybe during both, but maybe also only
> during the benchmark. So I am afraid we might often run in the
> situation where this test says OK but the later test is still
> suffering noise.

I agree that measuring before each run may not be very useful. Its main
purpose is adaptive problem scaling.

Hi Tobias, Renato,

Thanks for your attention to my RFC.

>> - Show and graph total compile time
>> There is no obvious way to scale up the compile time of
>> individual benchmarks, so total time is the best thing we can do to
>> minimize error.
>> LNT: [PATCH 1/3] Add Total to run view and graph plot
>
> I did not see the effect of these changes in your images and also
> honestly do not fully understand what you are doing. What is the
> total compile time? Don't we already show the compile time in run
> view? How is the total time different to this compile time?

It is hard to spot minor improvements or regressions over a large number
of tests amid independent machine noise. So I added a "total time"
analysis to the run report, together with the ability to graph its trend,
hoping that the noise will cancel out and help us spot changes easily.
(Screenshot attached)

I understand the picture, but I still don't get how to compute the "total time". Is this a well-known term?

When looking at the plots of our existing -O3 testers, I also look for some kind of less noisy line. The first thing coming to my mind would just be the median of the set of run samples. Are you doing something similar? Or are you computing a value across different runs?

> I am a little sceptical on this. Machines should generally not be
> noisy. However, if for some reason there is noise on the machine, the
> noise is as likely to appear during this pre-noise-test than during
> the actual benchmark runs, maybe during both, but maybe also only
> during the benchmark. So I am afraid we might often run in the
> situation where this test says OK but the later test is still
> suffering noise.

I agree that measuring before each run may not be very useful. Its main
purpose is adaptive problem scaling.

I see. If it is OK with you, I would propose to first get your LNT improvements in, before we move to adaptive problem scaling.

That's the total time taken to compile/execute; put another way, the sum of the compile/execution times of all tests.

Cheers,
Yi Kong


I have so many comments about this thread! I will start here.

I think having a total compile time metric is a great idea. The summary report code already does this. The one problem with this metric is that it does not work well as the test suite evolves and we add and remove tests, so it should be computed on a subset of the tests that is not going to change. I would love to see that feature in the nightly reports.

In the past we have toyed with a total execution time metric (the sum of the execution times of all benchmarks), and it has not worked well. Some benchmarks run for so long that they alone can swing the metric, and all the other little tests amount to nothing. How the SPEC benchmarks do their calculations might be relevant: they have a baseline run, and the metric is the geometric mean of the ratio of current execution time to baseline execution time. That fixes the differently sized benchmarks problem.
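Something along these lines, as a sketch only (not existing LNT code; the benchmark names and numbers are made up):

from math import exp, log

def spec_style_score(current, baseline):
    # Geometric mean of per-benchmark current/baseline execution-time ratios,
    # so a single long-running benchmark cannot dominate the aggregate.
    ratios = [current[name] / baseline[name] for name in current]
    return exp(sum(log(r) for r in ratios) / len(ratios))

base = {"dhrystone": 10.0, "sqlite": 50.0, "oggenc": 2.0}
curr = {"dhrystone": 10.0, "sqlite": 50.0, "oggenc": 1.0}  # only oggenc improved
print(spec_style_score(curr, base))  # ~0.79, visible despite oggenc's short run time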

OK. I understand your intentions now.

I currently have little intuition about whether this works. It seems you don't know yet either, do you?

My personal hope is that the reliability reporting allows us to get rid of almost all noise, such that most runs would just report no performance changes at all. If this is the case, the actual performance changes would stand out nicely and we could highlight them better in LNT.

If this does not work, some aggregated performance numbers such as the ones you propose may be helpful. The total time is a reasonable first metric, I suppose, but we may want to check whether statistics gives us a better tool (Anton may be able to help).

Thanks again for your explanation,
Tobias

Dear all,

Following the Benchmarking BoF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.

The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each method.
- Increase the execution time of the benchmark so it runs long enough to avoid noisy results
       Currently there are two options for running benchmarks, namely small and large problem sizes. I propose adding a third option: adaptive. In adaptive mode, each benchmark scales its problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to verify correctness on a separate board with the small problem size.
       LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
       Test suite: [PATCH 1/2] Add support for adaptive problem size
                       [PATCH 2/2] A subset of test suite programs modified for adaptive
- Show and graph total compile time
       There is no obvious way to scale up the compile time of individual benchmarks, so total time is the best thing we can do to minimize error.
       LNT: [PATCH 1/3] Add Total to run view and graph plot
- Only show performance changes with high confidence in summary report
       To investigate the correlation between program run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this knowledge, we can hide results with a low confidence level from the summary report. They remain available, marked in colour, in the detailed report for anyone interested.
       LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is harder than it sounds. I just looked through some results from today and found a real regression of 0.01s in a benchmark running ~0.05s. That would have been filtered out by your patch. Do you have some intuition that short-running tests are where the noise is coming from? I feel like the problem is not unique to short runs, but rather specific to particular benchmarks.

- Make sure board has low background noise
       Perform a system performance benchmark before each run and compare the value with a reference (obtained during machine set-up). If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
       LNT: benchmark.sh

I wrote a very similar Python script for checking system baselines. I think this is a great idea. My script ran several non-compiler-related tasks which *should* be stable on any machine; by "should" I mean they are long-running and intentionally test only one aspect of the system. I did not gate results on these runs, but instead submitted the results to LNT and let it report on any anomalies it detected. So far this process has detected some problems on our testing machines. If there is interest, I can share that script.
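The shape of it is roughly the following (a much-simplified sketch of the idea, not the actual script; the task choices and sizes are arbitrary):

import hashlib
import time

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def cpu_task():
    # Pure CPU work: repeatedly hash a fixed buffer.
    data = b"x" * 65536
    for _ in range(2000):
        data = (hashlib.sha256(data).digest() * 2048)[:65536]

def memory_task():
    # Memory traffic: repeatedly reverse-copy a large buffer.
    buf = bytearray(32 * 1024 * 1024)
    for _ in range(20):
        buf[:] = buf[::-1]

results = {"baseline.cpu": timed(cpu_task), "baseline.memory": timed(memory_task)}
# A real harness would submit these as ordinary LNT samples instead of
# printing them, so LNT's anomaly detection can flag a noisy board.
for name, seconds in results.items():
    print("%s: %.3fs" % (name, seconds))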

Hi Chris,

I think it would definitely be useful to share your script, so that we don't need to reinvent the wheel.

Thanks!

Kristof

> - Make sure board has low background noise
>        Perform a system performance benchmark before each run and compare the value with a reference (obtained during machine set-up). If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
>        LNT: benchmark.sh

I wrote a very similar Python script for checking system baselines. I think this is a great idea. My script ran several non-compiler-related tasks which *should* be stable on any machine; by "should" I mean they are long-running and intentionally test only one aspect of the system. I did not gate results on these runs, but instead submitted the results to LNT and let it report on any anomalies it detected. So far this process has detected some problems on our testing machines. If there is interest, I can share that script.

I agree with Chris. Following the way of SPEC benchmarks is a good idea.

We already have many benchmarks in the LLVM test-suite. Why not select some representative benchmarks (e.g. NPB, MiBench, nBench, Polybench) and take a simple run as the baseline? That way we get a score for each benchmark and can easily tell the relative performance.

For the noise problem, I agree we should not change the test-suite for performance evaluation, since the test-suite is primarily used for correctness checking, but I believe we can hack LNT and provide different options for correctness checking and performance evaluation. Previously, I (with Tobias) mainly changed the LNT framework by: 1. running each benchmark 10 times; 2. adding some reliability tests (e.g. a t-test) to check reliability and dropping those benchmarks with low reliability; 3. dropping benchmarks whose total runtime is too small (less than 0.002s). Of course, these changes should only be applied for performance evaluation.
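The filtering amounts to something like the following sketch (assuming scipy; not our actual patches, and the sample data is made up):

from scipy.stats import ttest_ind

MIN_RUNTIME = 0.002   # seconds, the threshold mentioned above
ALPHA = 0.05

def report_change(name, samples_before, samples_after):
    # Return a line for the summary report, or None if the result is dropped.
    if min(samples_before + samples_after) < MIN_RUNTIME:
        return None                      # too short to measure reliably
    stat, p_value = ttest_ind(samples_before, samples_after, equal_var=False)
    if p_value >= ALPHA:
        return None                      # difference not statistically reliable
    before = sum(samples_before) / len(samples_before)
    after = sum(samples_after) / len(samples_after)
    return "%s: %+.1f%% (p=%.3f)" % (name, 100.0 * (after - before) / before, p_value)

print(report_change("dhrystone",
                    [1.02, 1.01, 1.03, 1.02, 1.00, 1.01, 1.02, 1.03, 1.01, 1.02],
                    [1.08, 1.07, 1.09, 1.08, 1.07, 1.08, 1.09, 1.07, 1.08, 1.08]))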

BTW, I like the idea of “adaptive mode”, but keep in mind it should be enabled only for performance evaluation, not by default.

Best,
Star Tan