Use perf tool for more accurate time measuring on Linux

Hi all,

The LLVM benchmarking system produces very noisy results even on quiet
machines. One source of inaccuracy is the timing tool we are using:
because it is a user-space tool, the OS can context-switch it and we
get outlier results. Perf stat uses the SW_TASK_CLOCK counter in the
kernel to measure time, which makes it more accurate, and the
measurement is not affected by context switches.

I've implemented a wrapper script over perf stat which mimics the
behaviour of the timeit tool in the test suite, so that nothing else
needs to be modified. The script is not yet as feature-complete as
timeit, but it is enough to run the nightly tests.
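
For readers who have not looked at the attached patch, a rough sketch of the idea follows. This is not the attached perfit.diff (which, judging from the hunk quoted later in the thread, is a small shell script); it is only an illustration, in Python, of wrapping perf stat and writing a timeit-style report, and the report field names and the perf output parsing are assumptions that vary across perf versions.

#!/usr/bin/env python3
# Illustrative sketch only -- not the attached perfit.diff. It wraps
# `perf stat -e task-clock` around a command and writes a timeit-style
# report. The report field names and the perf output parsing are
# assumptions; perf's textual output differs between versions.
import re
import subprocess
import sys

def run_with_perf(report_path, cmd):
    proc = subprocess.run(["perf", "stat", "-e", "task-clock", "--"] + cmd,
                          stderr=subprocess.PIPE, universal_newlines=True)
    out = proc.stderr  # perf stat prints its counters to stderr

    # task-clock is reported in milliseconds; "seconds time elapsed" is
    # the wall-clock time. Both regexes are best-effort.
    task = re.search(r"([\d.,]+)\s+(?:msec\s+)?task-clock", out)
    real = re.search(r"([\d.]+)\s+seconds time elapsed", out)
    task_s = float(task.group(1).replace(",", "")) / 1000.0 if task else 0.0
    real_s = float(real.group(1)) if real else 0.0

    with open(report_path, "w") as report:
        report.write("exit %d\n" % proc.returncode)
        report.write("real %.4f\n" % real_s)
        report.write("user %.4f\n" % task_s)
        # Same FIXME as the patch: perf stat gives no sys time.
        report.write("sys 0.0000\n")
    return proc.returncode

if __name__ == "__main__":
    sys.exit(run_with_perf(sys.argv[1], sys.argv[2:]))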

I carried out experiments on several machines and saw varying levels
of improvement. I am no longer seeing outlier results, and the MAD is
considerably lower. The number of run-to-run changes reported over the
same revision shrank from around 10 to only 2-3, and the MAD dropped
from around 0.01 to 0.003 on a quiet machine.

I've attached the patch; please experiment with it.

Cheers,
Yi Kong

perfit.diff (1.3 KB)

Also some screenshots, both taken on a quiet x86 machine.

Timeit: 10 samples per run (aggregated by minimum, 0.05 MWU significance level)
Perf stat: 8 samples per run (aggregated by minimum, 0.05 MWU significance level)

From: "Yi Kong" <kongy.dev@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Renato Golin" <renato.golin@linaro.org>, "Tobias Grosser" <tobias@grosser.es>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, May 16, 2014 11:37:28 AM
Subject: Use perf tool for more accurate time measuring on Linux

Hi all,

The LLVM benchmarking system produces very noisy results even on quiet
machines. One source of inaccuracy is the timing tool we are using:
because it is a user-space tool, the OS can context-switch it and we
get outlier results. Perf stat uses the SW_TASK_CLOCK counter in the
kernel to measure time, which makes it more accurate, and the
measurement is not affected by context switches.

I've implemented a wrapper script over perf stat which mimics the
behaviour of the timeit tool in the test suite, so that nothing else
needs to be modified. The script is not yet as feature-complete as
timeit, but it is enough to run the nightly tests.

I carried out experiments on several machines and saw varying levels
of improvement. I am no longer seeing outlier results, and the MAD is
considerably lower. The number of run-to-run changes reported over the
same revision shrank from around 10 to only 2-3, and the MAD dropped
from around 0.01 to 0.003 on a quiet machine.

That sounds great, thanks for working on this!

First, we'd definitely need more documentation on what perf is and how to get it. The testing guide (and the LNT dependencies), at least, need to be updated. FWIW, I don't have any machines on which this is already installed (so it certainly is not installed by default). On Ubuntu, both linux-base and linux-tools-common claim to provide a 'perf' utility, and on RPM-based systems there appears to be a perf package.

+# FIXME: How to measure sys time?
+echo sys 0.0000 >> $REPORT

Is this just the difference between the real time and the task time (or would that be a reasonable approximation)?

Thanks again,
Hal

From: "Yi Kong" <kongy.dev@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Renato Golin" <renato.golin@linaro.org>, "Tobias Grosser" <tobias@grosser.es>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, May 16, 2014 11:37:28 AM
Subject: Use perf tool for more accurate time measuring on Linux

Hi all,

The LLVM benchmarking system produces very noisy results even on quiet
machines. One source of inaccuracy is the timing tool we are using:
because it is a user-space tool, the OS can context-switch it and we
get outlier results. Perf stat uses the SW_TASK_CLOCK counter in the
kernel to measure time, which makes it more accurate, and the
measurement is not affected by context switches.

I've implemented a wrapper script over perf stat which mimics the
behaviour of the timeit tool in the test suite, so that nothing else
needs to be modified. The script is not yet as feature-complete as
timeit, but it is enough to run the nightly tests.

I carried out experiments on several machines and saw varying levels
of improvement. I am no longer seeing outlier results, and the MAD is
considerably lower. The number of run-to-run changes reported over the
same revision shrank from around 10 to only 2-3, and the MAD dropped
from around 0.01 to 0.003 on a quiet machine.

That sounds great, thanks for working on this!

First, we'd definitely need more documentation on what perf is and how to get it. The testing guide (and the LNT dependencies), at least, need to be updated. FWIW, I don't have any machines on which this is already installed (so it certainly is not installed by default). On Ubuntu, both linux-base and linux-tools-common claim to provide a 'perf' utility, and on RPM-based systems there appears to be a perf package.

+# FIXME: How to measure sys time?
+echo sys 0.0000 >> $REPORT

Is this just the difference between the real time and the task time (or would that be a reasonable approximation)?

Not in some cases, but for most programs it should be a reasonable
approximation. Since the test suite does not care about sys time, we
can leave it for now. For example:
$ time -p sleep 5
real 5.00
user 0.00
sys 0.00
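
In other words, sys ≈ real − task only holds for CPU-bound programs; sleep is the degenerate case where almost all of the real time is spent off-CPU. A small illustration of the approximation, with hypothetical numbers:

# Hypothetical numbers, only to illustrate the "sys ~= real - task"
# approximation discussed above.
def approx_sys(real_seconds, task_clock_seconds):
    # Time not accounted to the task's task-clock: system time plus any
    # time spent blocked or preempted.
    return max(real_seconds - task_clock_seconds, 0.0)

print(approx_sys(9.87, 9.85))  # CPU-bound benchmark: ~0.02, plausible
print(approx_sys(5.00, 0.00))  # sleep 5: ~5.00, although true sys time is ~0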

Why not use the cycle count which perf exposes from hardware? That would seem even better to me, but data would be better. =]

I think this is a great idea. I installed perf on all my performance trackers. As you already experimented with it on x86, I would be fine with just committing the patch. We can adjust it if the results are
not as expected.

Tobias

From: "Tobias Grosser" <tobias@grosser.es>
To: "Yi Kong" <kongy.dev@gmail.com>, "Hal Finkel" <hfinkel@anl.gov>, "Renato Golin" <renato.golin@linaro.org>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, May 16, 2014 1:13:43 PM
Subject: Re: Use perf tool for more accurate time measuring on Linux

> Hi all,
>
> The LLVM benchmarking system produces very noisy results even on
> quiet machines. One source of inaccuracy is the timing tool we are
> using: because it is a user-space tool, the OS can context-switch it
> and we get outlier results. Perf stat uses the SW_TASK_CLOCK counter
> in the kernel to measure time, which makes it more accurate, and the
> measurement is not affected by context switches.
>
> I've implemented a wrapper script over perf stat which mimics the
> behaviour of the timeit tool in the test suite, so that nothing else
> needs to be modified. The script is not yet as feature-complete as
> timeit, but it is enough to run the nightly tests.
>
> I carried out experiments on several machines and saw varying levels
> of improvement. I am no longer seeing outlier results, and the MAD is
> considerably lower. The number of run-to-run changes reported over
> the same revision shrank from around 10 to only 2-3, and the MAD
> dropped from around 0.01 to 0.003 on a quiet machine.
>
> I've attached the patch; please experiment with it.

I think this is a great idea. I installed perf on all my performance
trackers. As you already experimented with it on x86, I would be fine
with just committing the patch. We can adjust it if the results are
not as expected.

Since this is a general adjustment for anything running the test suite on a Linux target, I'd wait until we have some testing on non-x86 platforms before we enable it by default for all architectures. Also, we have other x86 buildbots besides yours, right?

-Hal

Why not use the cycle count which perf exposes from hardware? That would seem even better to me, but data would be better. =]

That's an interesting idea. However, I'm concerned that it would miss some aspects of compiler optimization. For example, frequent cache misses would have a much smaller impact on the result if the processor drops to a lower frequency during the stall period. Nonetheless, it's definitely worth trying out.

Sure, but we should disable frequency throttling on any machine from which
we want numbers that look *remotely* stable.

The other thing you might try doing while you're wrapping these tools is to
use schedtool to pin the process to a single core. On most modern x86
machines you can see a 2-3% swing due to lots of small details, and when the
process migrates between cores, that makes the numbers very hard to analyze.
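
For reference, the same pinning can be expressed without schedtool; the sketch below is an assumption, not what the patch does (the patch uses schedtool), and it restricts a child process to one core via Linux's sched_setaffinity:

# Illustrative alternative to pinning with schedtool or `taskset -c 0`:
# keep a child process on a single core so it cannot migrate during the
# measurement. Linux-only.
import os
import subprocess

def run_pinned(cmd, core=0):
    def pin_child():
        # Runs in the child between fork and exec; the affinity mask is
        # inherited across the exec.
        os.sched_setaffinity(0, {core})
    return subprocess.call(cmd, preexec_fn=pin_child)

if __name__ == "__main__":
    run_pinned(["./some-benchmark"], core=0)  # hypothetical binary name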

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I'm running tests on ARM Cortex boards to verify the improvements.
Please also check if this works on your system.

Thanks,
Yi

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

Hi Yi Kong,

thanks for testing these changes.

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I think this sounds like a very good direction.

How did you evaluate the improvements exactly? The following run shows, e.g., two execution-time changes:

http://parkas16.inria.fr:8000/db_default/v4/nts/9

Are they expected? If I change, e.g., the aggregation function to median,
they disappear. Similarly, the graph for one of them does not suggest an
actual performance change:

http://parkas16.inria.fr:8000/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.0=1.428.3&submit=Update

Cheers,
Tobias

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

Hi Yi Kong,

thanks for testing these changes.

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I think this sounds like a very good direction.

How did you evaluate the improvements exactly? The following run shows,
e.g., two execution-time changes:

I sent a screenshot of the original results in the previous mail. We used
to have lots of noisy readings, both small machine background noise and
large noise from the timing tool. Now the noise from the timing tool is
eliminated and only a little machine background noise is left. This
makes manual investigation possible.

http://parkas16.inria.fr:8000/db_default/v4/nts/9

Are they expected? If I change, e.g., the aggregation function to median,
they disappear. Similarly, the graph for one of them does not suggest an
actual performance change:

Yes, some false positives due to machine noise are expected. The median
is more tolerant of machine noise, so they disappear.
As suggested by Chandler, we should also lock the CPU frequency to
further reduce machine noise.

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

Hi Yi Kong,

thanks for testing these changes.

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I think this sounds like a very good direction.

How did you evaluate the improvements exactly? The following run shows,
e.g., two execution-time changes:

I sent a screenshot of the original results in the previous mail. We used
to have lots of noisy readings, both small machine background noise and
large noise from the timing tool. Now the noise from the timing tool is
eliminated and only a little machine background noise is left. This
makes manual investigation possible.

I think we need to get this down to zero, even at the cost of missing regressions. We have many commits and runs per day; having one or two noisy results per run means people will still not look at the performance changes.

http://parkas16.inria.fr:8000/db_default/v4/nts/9

Are they expected? If I change, e.g., the aggregation function to median,
they disappear. Similarly, the graph for one of them does not suggest an
actual performance change:

Yes, some false positives due to machine noise are expected. The median
is more tolerant of machine noise, so they disappear.

Right.

What I find interesting is that this change filters out several results that do not seem to be filtered out by our statistical test. Is that right?

In the optimal case, we should be able to set the confidence level we require high enough to filter out these results as well. Is this right?

Is there currently anything that blocks us from increasing the confidence level, further reducing the noise level at the cost of some missed regressions?

As suggested by Chandler, we should also lock the CPU frequency to
further reduce machine noise.

I set it to 2667.000 MHz on parkas16. You can check whether this improves anything.
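
For anyone reproducing this setup, one hedged way to pin the frequency is through the cpufreq sysfs interface, assuming the userspace governor is available; the paths and units vary by driver and kernel, and root access is required:

# Sketch only: lock every core to a fixed frequency via cpufreq sysfs.
# Requires root and a cpufreq driver that offers the "userspace" governor.
import glob

def lock_frequency(khz):
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
        with open(cpu_dir + "/scaling_governor", "w") as f:
            f.write("userspace")
        with open(cpu_dir + "/scaling_setspeed", "w") as f:
            f.write(str(khz))

lock_frequency(2667000)  # 2667.000 MHz, as on parkas16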

Cheers,
Tobias

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

Hi Yi Kong,

thanks for testing these changes.

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I think this sounds like a very good direction.

How did you evaluate the improvements exactly? The following run shows,
e.g., two execution-time changes:

I sent a screenshot of the original results in the previous mail. We used
to have lots of noisy readings, both small machine background noise and
large noise from the timing tool. Now the noise from the timing tool is
eliminated and only a little machine background noise is left. This
makes manual investigation possible.

I think we need to get this down to zero, even at the cost of missing
regressions. We have many commits and runs per day; having one or two
noisy results per run means people will still not look at the
performance changes.

http://parkas16.inria.fr:8000/db_default/v4/nts/9

Are they expected? If I change, e.g., the aggregation function to median,
they disappear. Similarly, the graph for one of them does not suggest an
actual performance change:

Yes, some false positives due to machine noise are expected. The median
is more tolerant of machine noise, so they disappear.

Right.

What I find interesting is that this change filters out several results
that do not seem to be filtered out by our statistical test. Is that right?

Yes. The MWU test is nonparametric; it examines the order rather than
the actual values of the samples. However, eliminating with the median
uses the actual values (if the medians of two samples are close enough,
we treat them as equal).
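
To make the rank-based vs. value-based distinction concrete, here is an illustrative sketch with made-up timings, using scipy's MWU implementation rather than LNT's own tables or code: a consistent shift well below the noise threshold is still "significant" to the rank test, while a median-closeness check discards it.

# Illustration only: invented timings, scipy instead of LNT's MWU tables.
from scipy.stats import mannwhitneyu

baseline = [1.001, 1.002, 1.002, 1.003, 1.003, 1.004, 1.004, 1.005]
current  = [1.003, 1.004, 1.004, 1.005, 1.005, 1.006, 1.006, 1.007]

# Rank-based: only the ordering matters, so a consistent ~0.2% shift
# comes out significant at the 0.05 level.
stat, p = mannwhitneyu(baseline, current, alternative="two-sided")
print("MWU p-value:", p)

# Value-based: compare the medians and ignore deltas below a threshold.
def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return (s[mid - 1] + s[mid]) / 2.0 if len(s) % 2 == 0 else s[mid]

delta = abs(median(current) - median(baseline)) / median(baseline)
print("flagged" if delta > 0.01 else "discarded as noise", "delta=%.4f" % delta)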

In the optimal case, we should be able to set the confidence level we
require high enough to filter out these results as well. Is this right?

Yes. The lowest confidence we can set is still quite high (90%). We can
certainly add a lower-confidence option, but I can't find any MWU table
that goes lower than that on the Internet.

Also, we should modify the value analysis (based on how close the
medians/minimums are) to vary according to the confidence level as well.
However, this analysis is parametric; we need to know how the data is
actually distributed for every test. I don't think there is a
non-parametric test which does the same thing.

Is there currently anything that blocks us from increasing the
confidence level, further reducing the noise level at the cost of some
missed regressions?

As suggested by Chandler, we should also lock the CPU frequency to
further reduce machine noise.

I set it to 2667.000 MHz on parkas16. You can check whether this
improves anything.

Sure. I'm shutting down the server to run the tests.

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

Hi Yi Kong,

thanks for testing these changes.

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I think this sounds like a very good direction.

How did you evaluate the improvements exactly? The following run shows,
e.g., two execution-time changes:

I sent a screenshot of the original results in the previous mail. We used
to have lots of noisy readings, both small machine background noise and
large noise from the timing tool. Now the noise from the timing tool is
eliminated and only a little machine background noise is left. This
makes manual investigation possible.

I think we need to get this down to zero, even at the cost of missing
regressions. We have many commits and runs per day; having one or two
noisy results per run means people will still not look at the
performance changes.

http://parkas16.inria.fr:8000/db_default/v4/nts/9

Are they expected? If I change, e.g., the aggregation function to median,
they disappear. Similarly, the graph for one of them does not suggest an
actual performance change:

Yes, some false positives due to machine noise are expected. The median
is more tolerant of machine noise, so they disappear.

Right.

What I find interesting is that this change filters out several results
that do not seem to be filtered out by our statistical test. Is that right?

Yes. The MWU test is nonparametric; it examines the order rather than
the actual values of the samples. However, eliminating with the median
uses the actual values (if the medians of two samples are close enough,
we treat them as equal).

I see. So some of the useful eliminations come from the fact that we actually run a parametric test? So we _do_, in this case, make some assumptions about the distribution of the values, right?

In the optimal case, we should be able to set the confidence level we
require high enough to filter out these results as well. Is this right?

Yes. The lowest confidence we can set is still quite high (90%). We can
certainly add a lower-confidence option, but I can't find any MWU table
that goes lower than that on the Internet.

Why the lowest confidence? I would be interested in maximal confidence to reduce noise.

I found this table:

http://www.stat.purdue.edu/~bogdanm/wwwSTAT503_fall/Tables/Wilcoxon.pdf

I am not sure if those are the right values. Inside it says Wilcoxon-Mann-Whitney U, but the filename suggests that the tables may be for the Wilcoxon signed-rank test.

Also, we should modify the value analysis (based on how close the
medians/minimums are) to vary according to the confidence level as well.
However, this analysis is parametric; we need to know how the data is
actually distributed for every test. I don't think there is a
non-parametric test which does the same thing.

What kind of problem would we get if we assume a normal distribution and the values are in fact not normally distributed?

Would we just fail to find a significant change? Or would we possibly let non-significant changes through?

Under the assumption that there is a non-zero percentage of test cases
where the performance results are normally distributed, it may be OK for a special low-noise configuration to only get results from these test cases, but possibly ignore performance changes from the non-normally-distributed cases.

Cheers,
Tobias

I've set up a public LNT server to show the results of perf stat. There
is a huge improvement compared with the timeit tool.
http://parkas16.inria.fr:8000/

Hi Yi Kong,

thanks for testing these changes.

The patch is updated to pin the process to a single core; the readings
are even more accurate. It's hard-coded to run everything on core 0, so
don't run parallel testing with it for now. The tool now depends on
Linux perf and schedtool.

I think this sounds like a very good direction.

How did you evaluate the improvements exactly? The following run shows,
e.g., two execution-time changes:

I sent a screenshot of the original results in the previous mail. We used
to have lots of noisy readings, both small machine background noise and
large noise from the timing tool. Now the noise from the timing tool is
eliminated and only a little machine background noise is left. This
makes manual investigation possible.

I think we need to get this down to zero, even at the cost of missing
regressions. We have many commits and runs per day; having one or two
noisy results per run means people will still not look at the
performance changes.

http://parkas16.inria.fr:8000/db_default/v4/nts/9

Are they expected? If I change, e.g., the aggregation function to median,
they disappear. Similarly, the graph for one of them does not suggest an
actual performance change:

Yes, some false positives due to machine noise are expected. The median
is more tolerant of machine noise, so they disappear.

Right.

What I find interesting is that this change filters out several results
that do not seem to be filtered out by our statistical test. Is that right?

Yes. The MWU test is nonparametric; it examines the order rather than
the actual values of the samples. However, eliminating with the median
uses the actual values (if the medians of two samples are close enough,
we treat them as equal).

I see. So some of the useful eliminations come from the fact that we
actually run a parametric test? So we _do_, in this case, make some
assumptions about the distribution of the values, right?

Yes. You can check get_value_status() in
lnt/server/reporting/analysis.py to see how we determine significance.
I don't think making such an assumption is a good idea, as some tests
have very different distributions from others.

In the optimal case, we should be able to set the confidence level we
require high enough to filter out these results as well. Is this right?

Yes. The lowest confidence we can set is still quite high (90%). We can
certainly add a lower-confidence option, but I can't find any MWU table
that goes lower than that on the Internet.

Why the lowest confidence? I would be interested in maximal confidence to
reduce noise.

Ah... I got it the wrong way around. I agree with you.

I found this table:

http://www.stat.purdue.edu/~bogdanm/wwwSTAT503_fall/Tables/Wilcoxon.pdf

I am not sure if those are the right values. Inside it says
Wilcoxon-Mann-Whitney U, but the filename suggests that the tables may
be for the Wilcoxon signed-rank test.

That's indeed for the Wilcoxon signed-rank test.

Also, we should modify the value analysis (based on how close the
medians/minimums are) to vary according to the confidence level as well.
However, this analysis is parametric; we need to know how the data is
actually distributed for every test. I don't think there is a
non-parametric test which does the same thing.

What kind of problem would we get if we assume a normal distribution and
the values are in fact not normally distributed?

If the distribution is in fact skewed, we will get lots of false negatives.

Would we just fail to find a significant change? Or would we possibly let
non-significant changes through?

Under the assumption that there is a non-zero percentage of test cases
where the performance results are normally distributed, it may be OK for
a special low-noise configuration to only get results from these test
cases, but possibly ignore performance changes from the
non-normally-distributed cases.

It's hard to test whether execution time is normally distributed. The
samples are definitely not normally distributed, because each measurement
is guaranteed to be an upper bound on the true execution time.

I haven't looked at this particular data, but I've done a lot of work in
general trying to detect small changes in performance.

My feeling is that there is usually a "true" execution time PLUS the sum of
some random amount of things that happened during the run. Nothing random
ever makes the code run faster than it should! (which by itself makes the
normal distribution completely inappropriate, as it always has a finite
chance of negative values)

Each individual random thing that might happen in a run probably actually
has a binomial or hypergeometric distribution, but p is so small and n so
large (and p*n constant) that you might as well call it a Poisson
distribution.

Note that while the sum of a number of arbitrary independent random
variables is normal (Central Limit Theorem), the sum of independent Poisson
variables is Poisson! And you only need one number to characterise a
Poisson distribution: the expected value (which also happens to be the same
as the variance).
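
A quick simulation of that model, with invented constants and purely for illustration, shows the resulting samples: never below the true time, right-skewed, and with the minimum tracking the true time closely.

# Simulate "true time + Poisson-ish one-sided noise". Constants are
# invented for illustration; they are not measurements.
import random

TRUE_TIME = 1.000        # hypothetical noise-free runtime in seconds
COST_PER_EVENT = 0.002   # seconds added by each interference event
EXPECTED_EVENTS = 3.0    # Poisson lambda

def one_run(rng, trials=10000):
    # Binomial(n, p) with tiny p and large n, i.e. approximately
    # Poisson(EXPECTED_EVENTS), exactly as argued above.
    events = sum(rng.random() < EXPECTED_EVENTS / trials for _ in range(trials))
    return TRUE_TIME + events * COST_PER_EVENT

rng = random.Random(0)
samples = sorted(one_run(rng) for _ in range(20))
print("min    %.3f" % samples[0])                   # close to TRUE_TIME
print("median %.3f" % samples[len(samples) // 2])
print("max    %.3f" % samples[-1])                  # long right tail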

Also, we should modify the value analysis (based on how close the
medians/minimums are) to vary according to the confidence level as well.
However, this analysis is parametric; we need to know how the data is
actually distributed for every test. I don't think there is a
non-parametric test which does the same thing.

What kind of problem would we get if we assume a normal distribution
and the values are in fact not normally distributed?

I haven't looked at this particular data, but I've done a lot of work in
general trying to detect small changes in performance.

I've set up another server.
http://cloud-vm-44-158.doc.ic.ac.uk:55080/

My feeling is that there is usually a "true" execution time PLUS the sum of
some random amount of things that happened during the run. Nothing random
ever makes the code run faster than it should! (which by itself makes the
normal distribution completely inappropriate, as it always has a finite
chance of negative values)

I agree.

Each individual random thing that might happen in a run probably actually
has a binomial or hypergeometric distribution, but p is so small and n so
large (and p*n constant) that you might as well call it a Poisson
distribution.

Note that while the sum of a number of arbitrary independent random
variables is normal (Central Limit Theorem), the sum of independent Poisson
variables is Poisson! And you only need one number to characterise a Poisson
distribution: the expected value (which also happens to be the same as the
variance).

I'll implement it and see how it works out.

I don't see any sign of improvement.

Since we can't get the raw data any better, I think we should get this
merged and tested on the build bots for a while and see how it goes.

Cheers,
Yi Kong

My builders are prepared for this (and I can fix upcoming issues in case I forgot something). Chris, Hal, Renato? What do you think?

Tobias