Also, a couple notes about the use of standard deviation (and general stuff about benchmarking):

It is vital to keep in mind the distinction between a “sample statistic” and a “population statistic”. A sample statistic is a statistic derived from the dataset of measured values, while a population statistic is a mathematical function of the underlying probability distribution of the phenomenon (some abstract “population” of possibilities) being sampled.

As you take more measurements, the standard deviation sample statistic converges to the standard deviation population statistic regardless of the underlying population distribution.

Every probability distribution has a defined standard deviation (it is just a mathematical function of the distribution). For a Gaussian distribution, the standard deviation (in conjunction with the mean) completely describes the probability distribution. This is emphatically not the case for all probability distributions [1]. In order to derive insight from a standard deviation sample statistic, you must a priori have a good reason to believe that the underlying probability distribution’s standard deviation provides useful information about the actual distribution.

Any probability distribution can be made to have any standard deviation without changing its distribution shape (technically, at most a linear transformation of the coordinate is needed). To pick a random example, for a highly bimodal distribution the standard deviation doesn’t really give you any useful information about the overall shape (and it doesn’t exactly correspond between the spacing between the modes either).

The standard deviation is nothing more than a rootmeansquare average. Basically, it is a very popular statistic because there is an extremely powerful way to model phenomena for which the rootmeansquare average is the actual meaningful value that ties into the model. The theory I’m talking about is used extensively in almost every science and engineering discipline outside of software engineering (and the purely digital part of hardware) [2]. Since this modeling technique is rarely directly relevant in software, there is little reason to assume that the standard deviation is a good statistic (there may be a good reason to, but I haven’t seen one).
Okay, so I’ve cast doubt on the use of standard deviation (and actually an argument could be made against the mean too for similar reasons). On a more constructive note, what is a good small set of parameters to summarize measurements of program run time? One interesting model that was brought to my attention by Nick Lewycky is based on the following observation: most “perturbations” of the run time of a fixed program binary on a fixed chip and fixed OS are due to externalities that contend for (or otherwise deprive the program of) execution resources and therefore almost always make it slower.
The model is then that there is a “true” execution time for the program, which is a completely deterministic function of the program, chip, and OS; intuitively, this is time occurs when the program is e.g. able to execute by itself, on a single execution unit of the CPU, with no interrupts knocking it off CPU, and no other threads using any of the execution resources of this CPU. On top of this, we assume that there is a set of perturbations that contend for execution resources and slow the program down.
For this model, it makes sense to approximate the “true” value by the maximum of the recorded values. If we want to use two numbers to summarize the data, it probably makes sense to look at a way of characterizing the extent of the variation from this “true” value. In the absence of proper characterization of the perturbations (such as the distribution of when they occur and how much effect they have when they occur), one simple solution is the minimum. Max and min might be too sensitive to the perturbations (especially disk cache of the program itself on the first run), so a pair of percentiles might be a bit more reliable, such as 10th percentile and 90th percentile.
Next time I’m measuring program run time, I’m going to try out this model. It should be possible to look at a higherdegreeoffreedom representation of the dataset, like a histogram of the distribution, in order to evaluate how well this twoparameter characterization of the data works compared with the typical mean and stddev. Also, ultimately we only need as much precision here as is necessary for what we are doing with the summarized data, which typically is to alert us of a significant change. On a sufficiently quiet machine, the variance of the measurements might be significantly smaller than the thresholds that we consider “significant” so that the mean, median, max, min, 10%ile etc. are all so close that it does not matter which we use, and we can summarize the data with just a single number, which can be any one of them (and just trigger an error if total variation exceeds a set amount).
However, the noisier the machine (and hence our data), the more important is is to properly model and analyze things to avoid coming to a faulty conclusion (note: reliable conclusions can be made from data with large amounts of noise as long as there is a good model which allows isolating the data we are interested in from the noise [3]).
Nick, I forget from when we talked, but did you guys ever settle on a model like this for your internal benchmarking?
[1] If you are familiar with Fourier theory, using only two statistics to describe a probability distribution is sort of like using only two Fourier components to describe a function. In order to conclude anything you must a priori know that the other components are not important!
[2] The modeling technique I’m talking about is decomposition of squareintegrable function spaces into an orthonormal basis (this is by no means the most general way of describing the idea). This is a farreaching concept and pops up under many names. Things like “frequency domain”, “Fourier transform”, “Laplace transform”, and “spectrum” are among the most elementary. More advanced ones are “wavelets”, “Hilbert space”, “eigenfunctions of a linear operator”. A smattering of use cases: determining the shape of the electron orbitals of hydrogen, designing the fuel control system of a jet engine to make sure the right amount of fuel is being provided at any given moment, designing the circuits sitting at the end of a communication line (or a cell phone antenna) that have to detect incoming signals with the least chance of error, analyzing the structure of molecules based on Xray diffraction, determining the chemical composition of the atmosphere of a planet orbiting another star.
[3] For example if you know that there is Gaussian noise (even a very large amount) on top of an underlying value, then the underlying value is just the mean of the population (which will be a Gaussian distribution) and can be reliably determined from the mean sample statistic.