There are a few things we have looked at with LNT runs, so I will share the insights we have so far. A lot of the problems we see are artificially created by our test protocols rather than by the compiler changes themselves. I have been doing a lot of large-sample runs of single benchmarks to characterize them better. Some key points:
- Some benchmarks are bi-modal or multi-modal; a single mean won’t describe these well (see the toy example after these lists)
- Some runs are pretty noisy, and some have very large single-sample spikes
- Most benchmarks don’t regress most of the time
- Compile time is a pretty stable metric; execution time is not always
and depending on what you are using LNT for:
- A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples)
- We also need to catch long slow regressions
- Some of the “benchmarks” are really just correctness tests, and were not designed with repeatable measurement in mind.
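To make the multi-modal point concrete, here is a toy Python example (the numbers are made up and this is not LNT code) showing how the mean of a bi-modal benchmark describes a value that is never actually observed:

import statistics

# Hypothetical execution times (seconds): the benchmark alternates between
# a "fast" mode near 1.0s and a "slow" mode near 1.5s.
samples = [1.01, 1.49, 0.99, 1.51, 1.02, 1.48, 1.00, 1.50]

mean = statistics.mean(samples)     # ~1.25s, a value never actually measured
stdev = statistics.pstdev(samples)  # large, driven by the mode gap, not noise
print("mean=%.3fs stdev=%.3fs" % (mean, stdev))

# Comparing two revisions by their means alone can flag a "regression" that
# is really just a different mix of the two modes in the sample sets.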
As it stands now, we really can’t detect small regressions or slow regressions, and we get a lot of false positives.
There are two things I am working on right now to help make regression detection more reliable: adaptive sampling and cluster-based regression flagging.
First, we need more samples per revision, but we really don’t have time to run --multisample=10 since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply put, it reruns the benchmarks which are reported as regressed or improved. The idea is that if a benchmark is going to be flagged as a regression or improvement, we should get more data on it to make sure that is really the case. Adaptive sampling should reduce the false-positive flagging rate we see. We are able to do this thanks to LNT’s provisional commit system: after a run, we submit all the results but don’t commit them, the server reports the regressions, then we rerun the regressing benchmarks more times before committing. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.
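For illustration, here is a rough, self-contained sketch of that loop in Python. The fake runner, baselines, and 3% threshold below are invented for the example; they are not the actual LNT client/server interfaces:

import random
import statistics

BASELINE = {"a": 1.00, "b": 2.00, "c": 0.50}      # historical means (seconds)
TRUE_CURRENT = {"a": 1.00, "b": 2.10, "c": 0.50}  # "b" has really regressed

def run(benchmark, samples):
    """Pretend to run a benchmark, returning noisy timings."""
    return [random.gauss(TRUE_CURRENT[benchmark], 0.03) for _ in range(samples)]

def server_flags(results):
    """Stand-in for the provisional-commit diff: flag anything whose mean
    moved more than 3% from the stored baseline."""
    return [b for b, times in results.items()
            if abs(statistics.mean(times) - BASELINE[b]) / BASELINE[b] > 0.03]

# 1. One quick sample of everything, submitted provisionally (not committed).
results = {b: run(b, samples=1) for b in BASELINE}

# 2. Rerun only the flagged benchmarks with more samples, so the extra time
#    is spent where the signal is actually in question.
for b in server_flags(results):
    results[b] += run(b, samples=5)

# 3. Commit the combined results; the flagged benchmarks now have enough
#    data to confirm or reject the change.
for b, times in sorted(results.items()):
    print(b, len(times), round(statistics.mean(times), 3))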
As far as regression flagging goes, I have been working on a k-means clustering based approach: first discover a set of means (modes) in the historical dataset, then characterize newer data against them. My hope is that this can describe multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.
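A rough sketch of the idea (this is not the LNT prototype, and the data and drift criterion below are made up for illustration):

def kmeans_1d(values, k, iters=20):
    """Tiny Lloyd's-algorithm k-means for 1-D timing data."""
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Historical samples: a bi-modal benchmark with modes near 1.0s and 1.5s.
history = [1.01, 1.49, 0.99, 1.51, 1.02, 1.48, 1.00, 1.50, 1.01, 1.52]
centers = kmeans_1d(history, k=2)   # roughly [1.0, 1.5]

# Classify recent samples against the learned modes. A single spike falls
# into an existing mode and is ignored; a sustained shift away from all of
# the modes is the kind of long-term motion we want to flag.
recent = [1.62, 1.63, 1.61, 1.64]
nearest = [min(centers, key=lambda c: abs(v - c)) for v in recent]
drift = [abs(v - c) / c for v, c in zip(recent, nearest)]
print(centers, nearest, [round(d, 3) for d in drift])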
Probably obvious anyway, but: since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is “stable” produces much better results, but achieving this is more complex than not playing StarCraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of one core out of 8 move the stddev of a good benchmark by 5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.
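For what it’s worth, a rough pre-flight check along these lines can catch most of that before a run starts. This assumes a Linux host; the cpufreq path and the load threshold are illustrative, not part of LNT:

import os

def machine_looks_quiet(max_load_per_core=0.05):
    cores = os.cpu_count() or 1
    load1, _, _ = os.getloadavg()  # 1-minute load average
    if load1 / cores > max_load_per_core:
        print("load %.2f on %d cores: something else is running" % (load1, cores))
        return False
    # Check that the cpufreq governor is pinned to "performance" so power
    # management is not changing clock rates mid-run.
    gov_path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
    if os.path.exists(gov_path):
        with open(gov_path) as f:
            if f.read().strip() != "performance":
                print("cpufreq governor is not 'performance'")
                return False
    return True

if __name__ == "__main__":
    print("ok to benchmark" if machine_looks_quiet() else "machine is not quiet")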
Chris Matthews
chris.matthews@.com
(408) 783-6335