[LNT] Question about results reliability in LNT infrastructure

Hi all,

When we compare two test runs, each of which consists of three samples, how does LNT show whether the comparison is reliable or not?

I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), then the data status is set to UNCHANGED. However, this is clearly not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is only slightly larger than self.stddev: LNT will report a huge performance improvement to readers without taking the huge stddev into account. I think one way to address this is to normalize the performance improvements by the stddev, but I am not sure whether that has been implemented in LNT.
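
For illustration, here is a minimal sketch (hypothetical Python, not LNT's actual code) of the check described above together with one possible stddev-based normalization:

```python
# Hypothetical sketch only: this mirrors the check described above, it is not
# LNT's real get_value_status() implementation.

def value_status(delta, stddev, confidence_interval):
    """Classify a delta as UNCHANGED when it lies within the noise band."""
    if abs(delta) <= stddev * confidence_interval:
        return "UNCHANGED"
    # Assuming a metric where a larger value is worse (e.g. run time).
    return "REGRESSED" if delta > 0 else "IMPROVED"

def normalized_delta(delta, stddev):
    """One possible normalization: express the change in units of the noise.

    A 60% delta with a 50% stddev gives only 1.2, which makes it clear that
    the "huge improvement" is barely larger than the measurement noise.
    """
    return float("inf") if stddev == 0 else delta / stddev
```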

Could anyone give some suggestions on how I can find out whether the test results are reliable in LNT? Specifically, how can I get the normalized performance improvement/regression by considering the stderr?

Best wishes,
Star Tan.

Hi Daniel, Michael, Paul,

do you happen to have some insights on this? Basically, the stddev shown
when a run is compared to a previous run does not seem to be useful for
measuring the reliability of the results shown. We are looking for a good
way/value to show the reliability of individual results in the UI. Do you have any experience with what a good measure of the reliability of test results would be?

Thanks,
Tobias

We don't really have a great answer yet. For now, the best we do is try
to keep our testing machines as quiet as possible and then mostly look
at the daily trend, not individual reports.

- Daniel

Hi Tobi,

I had a look at this a while ago, but never got around to actually working on
it. My idea was to never use point changes as an indication of
progress/regressions unless there was a significant change (2/3 sigma).
What we should do is compare the current moving average (of K runs) with the
past moving averages, both the last one and the (N-K)th one (to make sure
previous values included in the current moving average are not toning it
down/up), and keep the biggest difference as the final result.

We should also compare the current mov-avg with the M non-overlapping mov-avgs
before it, and calculate whether we are monotonically increasing or decreasing,
or whether there is a difference of 2/3 sigma between the current mov-avg (N)
and the (N-M)th mov-avg. That would give us an idea of the trend of each test.
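
As a rough sketch of that idea (illustrative Python; k and the sigma threshold are arbitrary choices, not tuned values):

```python
from statistics import mean, stdev

def flag_change(samples, k=5, sigmas=2.0):
    """Compare the moving average of the last k runs with the moving average
    of the k runs before that, and flag a change when the gap exceeds
    `sigmas` standard deviations of the older window.

    k and sigmas are illustrative choices, not tuned values."""
    if len(samples) < 2 * k:
        return False
    current = mean(samples[-k:])          # current moving average (N)
    older_window = samples[-2 * k:-k]     # the non-overlapping window before it
    older = mean(older_window)            # the (N-k)th moving average
    noise = stdev(older_window) or 1e-12  # avoid division issues on flat data
    return abs(current - older) > sigmas * noise
```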

cheers,
--renato

Chris Matthews has recently been working on implementing something similar to that. Chris, can you share some details?

There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we have are artificially created by our test protocols rather than by the compiler changes themselves. I have been doing a lot of large-sample runs of single benchmarks to characterize them better. Some key points:

  1. Some benchmarks are bi-modal or multi-modal; single means won’t describe these well
  2. Some runs are pretty noisy and sometimes have very large single-sample spikes
  3. Most benchmarks don’t regress most of the time
  4. Compile time is a pretty stable metric; execution time is not always

and depending on what you are using LNT for:

  1. A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples)
  2. We also need to catch long slow regressions
  3. Some of the “benchmarks” are really just correctness tests, and were not designed with repeatable measurement in mind.

As it stands now, we really can’t detect small regressions or slow regressions, and there are a lot of false positives.

There are two things I am working on right now to help make regression detection more reliable: adaptive sampling and cluster-based regression flagging.

First, we need more samples per revision. But we really don’t have time to do --multisample=10, since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply, it reruns benchmarks which are reported as regressed or improved. The idea here is that if something is going to be flagged as a regression or improvement, we get more data on those specific benchmarks to make sure that is the case. Adaptive sampling should reduce the false positive regression flagging rate we see. We are able to do this based on LNT’s provisional commit system. After a run, we submit all the results, but don’t commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.
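
In rough pseudo-Python, the flow described above could look like the sketch below; the three callables are hypothetical stand-ins for the actual LNT client/server interactions, not real LNT APIs:

```python
# Hypothetical sketch of client-side adaptive sampling.  The callables
# run_benchmark, submit_provisional and commit stand in for real LNT
# client/server calls; they are not actual LNT APIs.

def adaptive_run(benchmarks, run_benchmark, submit_provisional, commit,
                 extra_samples=5):
    # One sample of everything first.
    results = {name: [run_benchmark(name)] for name in benchmarks}

    # Submit without committing; the server compares against its history and
    # reports which benchmarks it would flag as regressed or improved.
    flagged = submit_provisional(results)

    # Re-run only the flagged benchmarks to gather more evidence before the
    # results are committed for real.
    for name in flagged:
        results[name] += [run_benchmark(name) for _ in range(extra_samples)]

    return commit(results)
```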

As far as regression flagging goes, I have been working on a k-means discovery/clustering-based approach: first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.
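
As a very rough illustration of the idea (not the actual prototype), one could cluster the historical samples of a benchmark and then ask which historical mode a new sample is closest to:

```python
import numpy as np
from sklearn.cluster import KMeans

def characterize(history, n_clusters=2):
    """Cluster historical samples of one benchmark (1-D k-means) and return
    the sorted cluster centers.  Illustrative only."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(np.asarray(history, dtype=float).reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())

def nearest_mode(sample, centers):
    """Index of the historical mode a new sample is closest to."""
    return int(np.argmin(np.abs(centers - sample)))

# A bimodal history: a new sample near the slower mode is not by itself a
# regression; only a shift of the whole distribution would be.
history = [0.010, 0.011, 0.010, 0.020, 0.021, 0.010, 0.020]
centers = characterize(history)
print(centers, nearest_mode(0.0205, centers))
```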

Probably obvious anyway, but since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is “stable” produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark by 5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.

Chris Matthews
chris.matthews@.com
(408) 783-6335

Hi Chris,

Amazing that someone is finally looking at this with a proper background. You’re much better equipped than I am to deal with it, so I’ll trust your judgement, as I haven’t paid much attention to benchmarks, more to correctness. Some comments inline.

Just forwarding this to the list, my original reply was bounced.

As far as regression flagging goes, I have been working on a k-means discovery/clustering-based approach: first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.

Basic question: I’m imagining the volume of data being dealt with isn’t that large (as statistical datasets go) and you’re discarding old values anyway (since we care whether we’re regressing wrt now rather than LLVM 1.1), so can’t you just build a kernel density estimator of the “baseline” runtime and then estimate the probability that samples from a given codebase are going to come out “slower” than the baseline? I suppose the drawback of not explicitly modelling the modes (with all its complications and tunings) is that you can’t attempt to determine when a value is bigger than a lower cluster, even though it’s smaller than the bigger cluster, and estimate whether that is evidence of a slowdown within the small-cluster regime. Still, that seems a bit complicated to do automatically.
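
For what it’s worth, a sketch of that kind of estimate using scipy’s Gaussian KDE (the sample values below are invented for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Build a kernel density estimate of the baseline run times, then ask how
# likely a baseline sample is to be below each new observation, i.e. how
# "slow" the new samples look against the baseline distribution.
baseline = np.array([0.0120, 0.0080, 0.0200, 0.0150, 0.0130, 0.0110])
new_samples = np.array([0.0200, 0.0240, 0.0200])

kde = gaussian_kde(baseline)
for x in new_samples:
    # integrate_box_1d gives P(baseline < x) under the estimated density.
    p_slower = kde.integrate_box_1d(-np.inf, x)
    print(f"{x:.4f}: P(baseline < sample) = {p_slower:.2f}")
```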

(Incidentally, responding to the earlier email below, I think you don’t really want to compare moving averages but rather use some statistical test to quantify whether the set of points within the “earlier window” is statistically significantly higher than the “later window”; all moving averages do is smear out information, which can be useful if you’ve just got far too many data points, but otherwise it doesn’t really help.)

Cheers,

Dave


When your data is explicitly grouped, I'd agree with you. But I can see
no distinct modal signal in it. Chris said he knows of some; I haven't
looked deeply enough, so I trust his judgement. What I don't want is to be
treating noise groups as signal, that's all.

I think we probably need a few different approaches, depending on the
benchmark, with moving averages being the simplest, which is why I suggested
we implement it first. Sometimes, smoothing the line is all you need... :wink:

cheers,
--renato

I think we probably need a few different approaches, depending on the benchmark, with moving averages being the simplest, which is why I suggested we implement it first. Sometimes, smoothing the line is all you need… :wink:

That’s one viewpoint; another is that statisticians might well have very good reasons for spending so long devising the most powerful statistical tests, namely so that they can deal with marginal quantities of data.

Cheers,

Dave

87.35% of all statistics are made up, 55.12% of them could have been done a
lot simpler, a lot quicker and only 1.99% (AER) actually make your life
better.

I'm glad that Chris already has working solutions, and I'd be happy to see
them go live before any professional statistician has had a look at them. :wink:

cheers,
--renato

I should describe the cost of false negatives and false positives, since I think it matters for how this problem is approached. A false negative means we miss a real regression; we don’t want that. A false positive means somebody has to spend time looking at and reproducing a regression when there is none; that is bad too. Given this trade-off, I think we want to tend towards false positives (over false negatives) strictly as a matter of compiler quality, but if we can throw more data at the problem to reduce false positives, that is good.

I have discussed the classification problem before with people off-list. The problem we face is that the space is pretty big for manual classification, at worst: number of benchmarks * number of architectures * sets of flags * metrics collected. Perhaps some sensible defaults could overcome that. Also, to classify well, you probably need a lot of samples as a baseline.

There certainly are lots of tests for small data. As far as I know, though, they rely more heavily on assumptions that in our case would have to be proven. That said, I’d never object to a professional’s opinion on this problem!

Chris Matthews
chris.matthews@.com
(408) 783-6335

Given this tradeoff I think we want to tend towards false positives (over
false negatives) strictly as a matter of compiler quality.

False hits are not binary, but (at least) two-dimensional. You can't say
it's better to have any amount of false positives than any amount of false
negatives (pretty much like the NSA spying on *everybody* to avoid *any*
false negative). Nor can you say that N false positives are equivalent to N
false negatives, because a false hit can be huge in itself, or not.

What we have today is a huge amount of false positives and very few (or
no) false negatives. But even the real positives that we could spot despite
this amount of noise we miss, because people don't normally look at the
regressions. If I had to skim through the regressions on every build, I'd
do nothing else.

Given that proportion, I'd rather accept a few small false negatives and
considerably reduce the number of false positives with a hammer approach,
and only later try to nail down the options and do some fine tuning, than
do the fine tuning now, while still nobody cares about any result because
they're not trustworthy.

That said, I’d never object to a professional’s opinion on this problem!

Absolutely! And David can help you a lot there. But I wouldn't try to get
it perfect before we get it acceptable.

cheers,
--renato

Wow. Thanks a lot for the insights into what LNT is currently doing and
what people are planning for the future. It seems there is a lot of
interesting stuff on the way.

I agree with Renato that one of the major problems currently is not that we
miss regressions because we do not detect them, but that we miss them because nobody looks at the results due to the large amount of noise.

To make this more concrete, I want to point you to the experiments that
Star Tan has run. He hosted his LNT results here [1]. One of the top changes in the reports is a 150% compile time increase for SingleSource/UnitTests/2003-07-10-SignConversions.c.

Looking at the data of the original run, we get:

~$ cat /tmp/data-before
0.0120
0.0080
0.0200

~$ cat /tmp/data-after
0.0200
0.0240
0.0200

It seems there is a lot of noise involved. Still, LNT is reporting this result without understanding that the results for this benchmark are unreliable.

In contrast, the ministat [2] tool is perfectly capable of understanding
that those results are insufficient to prove any statistical difference
at 90% confidence.
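
For reference, the same kind of check ministat performs can be reproduced in a few lines of Python on the exact samples above (a sketch using scipy's pooled-variance t-test rather than ministat itself):

```python
from scipy.stats import ttest_ind

# The exact samples from the report above.
before = [0.0120, 0.0080, 0.0200]
after = [0.0200, 0.0240, 0.0200]

# Pooled-variance Student's t-test, comparable to what ministat does.
t_stat, p_value = ttest_ind(before, after, equal_var=True)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.3f}")
# The p-value should land just above 0.10, i.e. no significant difference
# at the 90% level, in line with what ministat reports for this data.
```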


Hi Tobi,

First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :slight_smile:

The statistical test ministat is performing seems simple and pretty
standard. Is there any reason we could not do something similar? Or are we
doing it already and it just does not work as expected?

The main problem with this sort of test is that we cannot trust it unless:
1. The data has a normal distribution
2. The sample size is large (say, > 50)

Here we have only 3 points and, no, I won't trust ministat's
t-test and normal-approximation based confidence bounds. They are *too
short* (i.e. the real confidence level is not 99.5%, but actually 40-50%,
for example).

I'd ask for:

1. Increasing sample size to at least 5-10
2. Do the Wilcoxon/Mann-Whitney test

What do you think?


Hi Anton,

I trust your knowledge about statistics, but am wondering why ministat (and its t-test) is promoted as a statistically sane tool for benchmarking results. Is the use of the t-test for benchmark results a bad idea in general? Would ministat be a better tool if it implemented the Wilcoxon/Mann-Whitney test?

I'd ask for:

1. Increasing sample size to at least 5-10
2. Do the Wilcoxon/Mann-Whitney test

Reading about the Wilcoxon/Mann-Whitney test, it seems to be a more robust test that frees us from the normal-approximation assumption. As its implementation also does not look overly complicated, it may be a good choice.
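
To give an impression of how little code such a test needs, here is a sketch using scipy's existing implementation (the sample arrays are made up; LNT would of course implement or ship its own version):

```python
from scipy.stats import mannwhitneyu

# Hypothetical samples: 8 runs before and 8 runs after a change.
before = [0.0120, 0.0125, 0.0118, 0.0122, 0.0119, 0.0124, 0.0121, 0.0123]
after = [0.0131, 0.0135, 0.0128, 0.0133, 0.0130, 0.0136, 0.0129, 0.0132]

# Two-sided Mann-Whitney U test: no normality assumption, only independent
# samples and the question whether one group tends to be larger.
stat, p_value = mannwhitneyu(before, after, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```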

Regarding the number of samples: I think the most important point is that we get some measure of confidence by which we can sort our results and make it visible in the UI. For different use cases we can adapt the number of samples based on the required confidence and the amount of noise/lost regressions we can accept. This may also be a great use for the adaptive sampling that Chris suggested.

Is there anything stopping us from implementing such a test and exposing its results in the UI?

Cheers,
Tobi

That's not feasible on slower systems. A single data point takes 1 hour on
the fastest ARM board I can get (a Chromebook). Getting 10 samples at
different commits will give you similar accuracy if the behaviour doesn't
change, and you can rely on the 10-point blocks before and after each change
having the same result.

What won't happen is that one commit makes it truly faster and the very next
makes it slow again (or slow/fast). So all we need to measure, for each
commit, is whether it was the one that made all subsequent runs
slower/faster, and we can get that with several commits after the culprit,
since the probability that another (unrelated) commit will change the
behaviour is small.

This is why I proposed something like moving averages. Not because it's the
best statistical model, but because it works around a concrete problem we
have. I don't care which model/tool you use, as long as it doesn't mean
I'll have to wait 10 hours for a result, or sift through hundreds of
commits every time I see a regression in performance. What that would do,
for sure, is make me ignore small regressions, since they wouldn't be worth
the massive work of finding the real culprit.

If I had a team of 10 people just to look at regressions all day long, I'd
ask them to make a proper statistical model and go do more interesting
things...

cheers,
--renato

Hi Tobias,

I trust your knowledge about statistics, but am wondering why ministat (and
its t-test) is promoted as a statistically sane tool for benchmarking
results.

I do not know... Ask the author of ministat?

Is the use of the t-test for benchmark results a bad idea in
general?

Not in general. But one should be aware of the assumptions of the
underlying theory. The t-test is fine as long as our data follows the
normal distribution (and hence the test is exact) or the sample
size is large (then we have the asymptotic normality of the mean due
to the CLT).

Would ministat be a better tool if it implemented the
Wilcoxon/Mann-Whitney test?

The precision would be much better for small sample sizes (say, in the range 10-50).

But in any case, never trust someone who claims he can reliably
estimate the variance from 3 data points.
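
A quick simulation (purely illustrative) makes that last point concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many groups of 3 samples from the same normal distribution and look
# at how much the 3-sample standard deviation estimate bounces around.
true_sigma = 1.0
estimates = [np.std(rng.normal(0.0, true_sigma, size=3), ddof=1)
             for _ in range(10_000)]

print(f"true sigma: {true_sigma}")
print(f"3-sample estimates, 5th..95th percentile: "
      f"{np.percentile(estimates, 5):.2f}..{np.percentile(estimates, 95):.2f}")
# The estimate routinely lands anywhere between roughly 0.2 and 1.7 sigma,
# which is why confidence bounds built from 3 points are not trustworthy.
```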

Is there anything stopping us from implementing such a test and exposing its
results in the UI?

I do not think so...

Getting 10 samples at different commits will give you similar accuracy if
behaviour doesn't change, and you can rely on 10-point blocks before and after each change to have the same result.

Right. But this way you will have a 10-commit delay, so you will need
3-4 additional test runs to pinpoint the offending commit in the worst
case (essentially bisecting the 10-commit window).

This is why I proposed something like moving averages.

A moving average will "smooth" the result, so only really big changes
will be caught by it.