Using Google Benchmark Library

Hi,
I am adding some benchmarks to the test-suite as part of my GSoC project. I am planning to use the Google Benchmark library for some of them. I would like to know your opinions/suggestions on how I should proceed with this library and how the design should look (e.g., limiting the number of times a kernel is executed so that the overall runtime of the test-suite can be controlled, supporting multiple input sizes, adding it to small kernels, etc.).
I would love to hear any other suggestions that you may have.

Thanks,
Pankaj

The benchmark library can dynamically pick the number of iterations, and we definitely need to keep a handle on the overall runtime of the test suite. I don’t think that we yet have guidelines for microbenchmark timing, but for full applications, we try to keep the runtime down to ~1s. -Hal

I would like to add some details of the intent.

We would like to benchmark common algorithms that appear in many
sources (think of linear algebra, image processing, etc.). Usually
they consist of only a single function. Only the kernel itself is of
interest, not e.g. the data initialization. Depending on the
optimization (e.g. by Polly, parallelization, offloading), the
execution time can vary widely.

Pankaj already added a review at
https://reviews.llvm.org/D46735
where the idea to use Google Benchmark came up. However, by itself it
does not fulfill all the requirements; in particular, it does not
check for correctness.

Here is a list of things I would like to see:
- Check correct output
- Execution time
- Compilation time
- LLVM pass statistics
- Code size
- Hardware performance counters
- Multiple problem sizes
- Measuring the above for the kernel only.
- Optional very large problem sizes (that test every cache level),
disabled by default
- Repeated execution to average-out noise
- Taking cold/warm cache into account

Here is an idea for how this could be implemented (a rough sketch
follows the list below):
Every benchmark consists of two files: the source for the kernel and a
driver that initializes the input data, knows how to call the kernel
in the other file, and can check that the output is correct.
The framework recognizes all benchmarks and their drivers and does the
following:
- Compile the driver
- Time the compilation of the kernel using -save-stats=obj
- Get the kernel's code size using llvm-size
- Link driver, kernel and Google Benchmark together.
- Instruct the driver to run the kernel with a small problem size and
check the correctness.
- Instruct Google Benchmark to run the kernel to get a reliable
average execution time of the kernel (without the input data
initialization).
- LNT's --exec-multisample does not need to run the benchmarks
multiple times, as Google Benchmark already does so.
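
To make this more concrete, here is a rough sketch of what such a pair
of files could look like (the matmul kernel, the file names, and the
correctness check are purely illustrative, not a proposal for the
actual benchmarks):

// kernel.c (compiled separately, e.g. with -save-stats=obj):
void matmul(int n, const double *A, const double *B, double *C) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      double sum = 0.0;
      for (int k = 0; k < n; ++k)
        sum += A[i * n + k] * B[k * n + j];
      C[i * n + j] = sum;
    }
}

// driver.cpp (initializes data, checks correctness, registers the benchmark):
#include <benchmark/benchmark.h>
#include <cstdlib>
#include <vector>

extern "C" void matmul(int n, const double *A, const double *B, double *C);

static void BM_matmul(benchmark::State &state) {
  int n = static_cast<int>(state.range(0));
  std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
  for (auto _ : state)
    matmul(n, A.data(), B.data(), C.data());
}
// Multiple problem sizes; a very large one could be guarded by a CMake option.
BENCHMARK(BM_matmul)->Arg(64)->Arg(256);

int main(int argc, char **argv) {
  // Correctness check with a small problem size, done before any timing.
  const int n = 8;
  std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
  matmul(n, A.data(), B.data(), C.data());
  for (double c : C)
    if (c != n)
      return EXIT_FAILURE;

  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  return EXIT_SUCCESS;
}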

Michael

Here is an idea for how this could be implemented:
Every benchmark consists of two files: the source for the kernel and a
driver that initializes the input data, knows how to call the kernel
in the other file, and can check that the output is correct.
The framework recognizes all benchmarks and their drivers and does the
following:
- Compile the driver
- Time the compilation of the kernel using -save-stats=obj
- Get the kernel's code size using llvm-size
- Link driver, kernel and Google Benchmark together.

I think you might run into artificial overhead here if you’re not careful. In particular:

- Missed inlining opportunities in the benchmark. If you expect the kernels to be potentially inlined, this might be a problem.

- The link order might cause interference depending on the linker being used.

- If you’re doing LTO then that would add an additional wrinkle.

They’re not show-stoppers, but these are some of the things to look out for and consider.

- Instruct the driver to run the kernel with a small problem size and
check the correctness.

In practice, what I’ve seen is writing unit tests that perform the correctness checks (using Google Test/Mock) and co-locating the benchmarks in the same file. This way you can choose to run just the tests or just the benchmarks in the same compilation mode. I’m not sure whether there’s already a copy of the Google Test/Mock libraries in the test-suite, but I’d think those shouldn’t be too hard (nor controversial) to add.
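
Just as an illustration (the saxpy kernel and all names here are made up, and this assumes both the gtest and benchmark headers are available), such a combined file could look roughly like this:

#include <benchmark/benchmark.h>
#include <gtest/gtest.h>
#include <vector>

// Placeholder for the kernel that would normally live in its own file.
extern "C" void saxpy(int n, float a, const float *x, float *y) {
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

// Correctness check via Google Test.
TEST(SaxpyTest, SmallProblemSize) {
  std::vector<float> x(16, 1.0f), y(16, 2.0f);
  saxpy(16, 3.0f, x.data(), y.data());
  for (float v : y)
    EXPECT_FLOAT_EQ(v, 5.0f);
}

// Timing via Google Benchmark, co-located in the same file.
static void BM_saxpy(benchmark::State &state) {
  int n = static_cast<int>(state.range(0));
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  for (auto _ : state)
    saxpy(n, 3.0f, x.data(), y.data());
}
BENCHMARK(BM_saxpy)->Arg(1 << 10)->Arg(1 << 20);

int main(int argc, char **argv) {
  // One binary runs both; a command-line flag could select between them.
  ::testing::InitGoogleTest(&argc, argv);
  int failed = RUN_ALL_TESTS();
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  return failed;
}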

- Instruct Google Benchmark to run the kernel to get a reliable
average execution time of the kernel (without the input data
initialization).

There are ways to write the benchmarks so that you only measure a small part of the actual benchmark. The documentation will be really helpful in pointing out how to do that.

https://github.com/google/benchmark#passing-arguments

In particular, you can pause the timing when you’re doing the data initialisation and then resume just before you run the kernel.
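
Something along these lines should work (the kernel and the initialisation here are just placeholders):

#include <benchmark/benchmark.h>
#include <algorithm>
#include <numeric>
#include <vector>

// Placeholder for the real kernel living in another translation unit.
static double kernel(const std::vector<double> &data) {
  return std::accumulate(data.begin(), data.end(), 0.0);
}

static void BM_kernel(benchmark::State &state) {
  std::vector<double> data(state.range(0));
  for (auto _ : state) {
    state.PauseTiming();                  // exclude the (re)initialisation
    std::fill(data.begin(), data.end(), 1.0);
    state.ResumeTiming();                 // time only the kernel itself
    benchmark::DoNotOptimize(kernel(data));
  }
}
BENCHMARK(BM_kernel)->Arg(1 << 16);
BENCHMARK_MAIN();

Note that pausing/resuming the timer has some overhead of its own, so for very cheap kernels it may be better to initialise the data once, outside the measurement loop.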

- LNT's --exec-multisample does not need to run the benchmarks
multiple times, as Google Benchmark already does so.

I thought recent patches already do some of this. Hal would know.

Cheers

PS. I’d be happy to do some reviews of uses of the Google Benchmark library, if you need additional reviewers.

-- Dean

Thanks for your remarks.

I think you might run into artificial overhead here if you’re not careful. In particular:

- Missed inlining opportunities in the benchmark. If you expect the kernels to be potentially inlined, this might be a problem.

For the kind of benchmarks we have in mind, the overhead of one
function call is not significant, nor would we expect compilers to
inline them in the applications that use them.
Inlining may even be counterproductive: after inlining, the compiler
might see from the array initialization code that elements are
initialized to 0 (or some other deterministic value) and use that
knowledge while optimizing the kernel.
We might prevent such things by annotating the kernel with
__attribute__((noinline)).
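
Just to illustrate the annotation (the kernel here is made up):

/* In the kernel's translation unit; prevents the kernel from being
   inlined into the driver, so the deterministic initialization values
   cannot be folded into it. */
__attribute__((noinline))
void scale(int n, double factor, double *data) {
  for (int i = 0; i < n; ++i)
    data[i] *= factor;
}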

- The link order might cause interference depending on the linker being used.

- If you’re doing LTO then that would add an additional wrinkle.

They’re not show-stoppers, but these are some of the things to look out for and consider.

I'd consider the benchmark to be specific to a compiler+linker
combination. As long as we can measure the kernel in isolation (and
consider cold/warm caches), it should be fine.
I'd switch off LTO here since any code that could be inlined into the
kernel should already be in its translation unit.

- Instruct the driver to run the kernel with a small problem size and
check the correctness.

In practice, what I’ve seen is writing unit tests that perform the correctness checks (using Google Test/Mock) and co-locating the benchmarks in the same file. This way you can choose to run just the tests or just the benchmarks in the same compilation mode. I’m not sure whether there’s already a copy of the Google Test/Mock libraries in the test-suite, but I’d think those shouldn’t be too hard (nor controversial) to add.

Google Test is already part of LLVM. Since the test-suite already has
a dependency on LLVM (e.g. for llvm-lit, itself already supporting
Google Test), we could just use that one.
I don't know yet how one would combine them to run the same code.

- Instruct Google Benchmark to run the kernel to get a reliable
average execution time of the kernel (without the input data
initialization).

There are ways to write the benchmarks so that you only measure a small part of the actual benchmark. The documentation will be really helpful in pointing out how to do that.

https://github.com/google/benchmark#passing-arguments

In particular, you can pause the timing when you’re doing the data initialisation and then resume just before you run the kernel.

Sounds great.

- LNT's --exec-multisample does not need to run the benchmarks
multiple times, as Google Benchmark already does so.

I thought recent patches already do some of this. Hal would know.

I haven't found any special handling for --exec-multisample in MicroBenchmarks.

PS. I’d be happy to do some reviews of uses of the Google Benchmark library, if you need additional reviewers.

Thank you. We will notify you when we have something.

Michael

Not going into all the detail, but from my side the big question is whether the benchmarks' inner loops are small/fine-grained enough that stabilization with Google Benchmark doesn't lead to benchmark runtimes of dozens of seconds, given that you typically see thousands or millions of invocations for small functions...


Google Benchmark executes the kernel at most 1e9 times, or until the
CPU time exceeds the minimum time, or until the wall-clock time is 5x
the minimum time. By default the minimum time is 0.5s, but we can
change it using "MinTime(X)" or "--benchmark_min_time=X".
So the stabilization of small/fine-grained kernels with Google
Benchmark should not be a problem.
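
For example (the kernel is just a placeholder):

#include <benchmark/benchmark.h>

static void BM_small_kernel(benchmark::State &state) {
  int x = 0;
  for (auto _ : state) {
    x += 1;                        // placeholder for a very small kernel
    benchmark::DoNotOptimize(x);
  }
}
// Accumulate only 0.1s of CPU time instead of the 0.5s default; the same
// limit can also be changed at run time with --benchmark_min_time.
BENCHMARK(BM_small_kernel)->MinTime(0.1);
BENCHMARK_MAIN();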

>>> - LNT's --exec-multisample does not need to run the benchmarks
>>> multiple times, as Google Benchmark already does so.
>>
>> I thought recent patches already do some of this. Hal would know.
>
> I haven't found any special handling for --exec-multisample in
> MicroBenchmarks.

This LNT option means running the whole benchmarking process multiple
times. It's more or less a loop around the whole benchmark run; this
even means we compile the source code multiple times.

FWIW I'd really like to see an alternative option to compile once and then
run each benchmark executable multiple times before moving to the next
executable; but given that we don't have that today I'm just not using the
multisample option personally as it's not better than submitting multiple
benchmarking jobs anyway...

I think it's better to use the "--benchmark_repetitions=n" option of
Google Benchmark rather than "--exec-multisample=n": with multisample
the data initialization is executed again along with the main kernel,
whereas the benchmark_repetitions option runs only the main kernel n
times. The repeated execution of the data initialization would only
add unwanted execution time.

Also, the Google Benchmark library gives the result for each run
along with the mean, stddev and median of all runs. This can be
useful if there is some tool which can parse the stdout and write a
summary into a sheet.
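
For example, this could be set per benchmark in the source (again just a sketch):

#include <benchmark/benchmark.h>

static void BM_kernel(benchmark::State &state) {
  double sum = 0.0;
  for (auto _ : state) {
    sum += 1.0;                    // placeholder for the kernel call
    benchmark::DoNotOptimize(sum);
  }
}
// Repeat the whole measurement 5 times; with ReportAggregatesOnly(true)
// only the mean, median and stddev are reported instead of the raw runs.
BENCHMARK(BM_kernel)->Repetitions(5)->ReportAggregatesOnly(true);
BENCHMARK_MAIN();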

One thing that I would like to have is some tool/API that can
automatically verify the output against a reference output, like lit does.

Regards,
Pankaj Kukreja
Computer Science Department
IIT Hyderabad

Also, the Google Benchmark library gives the result for each run along
with the mean, stddev and median of all runs. This can be useful if
there is some tool which can parse the stdout and write a summary into a sheet.

I would like to point out that the normal console table is not the only
output format; it can also produce JSON, and even natively store it into a file:
https://github.com/google/benchmark#output-formats


Roman.