Questions About the LLVM Test Suite: Time Units, Re-running Benchmarks

Hi,

I’m not very familiar with the LLVM test suite, so I’d like to ask some questions.
I wanted to get a feel for the runtime-performance impact of some changes inside LLVM,
so I thought of running the llvm-test-suite benchmarks.

Build options: the O3.cmake cache + -DTEST_SUITE_BENCHMARKING_ONLY=True
Run:
llvm-lit -v -j 1 -o out.json .
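
For reference, the full sequence was roughly the following (the compiler and source paths are placeholders for my local setup):

cmake -DCMAKE_C_COMPILER=/path/to/clang \
      -DCMAKE_CXX_COMPILER=/path/to/clang++ \
      -C /path/to/test-suite/cmake/caches/O3.cmake \
      -DTEST_SUITE_BENCHMARKING_ONLY=True \
      /path/to/test-suite
make
llvm-lit -v -j 1 -o out.json .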

What I think I should be looking for is the “metrics” → “exec_time” field in the JSON file.
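
Each test entry in out.json looks roughly like this, if I remember the shape correctly (the name and values are made up):

{
  "name": "test-suite :: SingleSource/Benchmarks/.../foo.test",
  "metrics": { "exec_time": 1.234, ... },
  ...
}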

Now, to the questions. First, there doesn’t seem to be a common time unit for
“exec_time” among the different tests. For instance, SingleSource/ seem to use
seconds while MicroBenchmarks seem to use μs. So, we can’t reliably judge
changes. I do get that micro-benchmarks are different in nature from
Single/MultiSource benchmarks, so maybe one should focus only on one or the other
depending on what one is interested in.

In any case, it would at least be great if the JSON data contained the time unit per test,
but that doesn’t seem to be the case either.

Do you think that the lack of time-unit info is a problem? If yes, do you like the
idea of adding the time unit to the JSON, or would you rather propose an alternative?

The second question has to do with re-running the benchmarks: I do
cmake + make + llvm-lit -v -j 1 -o out.json .
but if I try to run the llvm-lit step a second time, it just does/shows nothing. Is there any reason
that the benchmarks can’t be run a second time? Could I somehow run them a second time?

Lastly, slightly off-topic, but while we’re on the subject of benchmarking:
do you think it’s reliable to run with -j <number of cores>? I’m a little bit afraid of
the shared caches (because misses should be counted in the CPU time, which
is what is measured in “exec_time”, AFAIU)
and of any potential multi-threading that the tests may use.

Best,
Stefanos

Now, to the questions. First, there doesn't seem to be a common time unit for
"exec_time" among the different tests. For instance, SingleSource/ seem to use
seconds while MicroBenchmarks seem to use μs. So, we can't reliably judge
changes. I do get that micro-benchmarks are different in nature from
Single/MultiSource benchmarks, so maybe one should focus only on one or the other
depending on what one is interested in.

Usually one does not compare executions of the entire test-suite, but
looks for which programs have regressed. In this scenario only the relative
change of each program matters, so μs are only compared to μs and
seconds only to seconds.
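
(If it helps for that kind of per-program comparison: the test-suite ships a helper script for diffing two result files; if I remember its path correctly, it is used roughly like

python3 /path/to/test-suite/utils/compare.py baseline.json patched.json

where baseline.json and patched.json are just example names for two llvm-lit result files.)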

In any case, it would at least be great if the JSON data contained the time unit per test,
but that doesn't seem to be the case either.

What do you mean? Don't you get the exec_time per program?

Do you think that the lack of time-unit info is a problem? If yes, do you like the
idea of adding the time unit to the JSON, or would you rather propose an alternative?

You could also normalize the time unit that is emitted to JSON to s or ms.

The second question has to do with re-running the benchmarks: I do
cmake + make + llvm-lit -v -j 1 -o out.json .
but if I try to run the llvm-lit step a second time, it just does/shows nothing. Is there any reason
that the benchmarks can't be run a second time? Could I somehow run them a second time?

Running the programs a second time did work for me in the past.
Remember to change the output to another file or the previous .json
will be overwritten.
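
For example, something like (out2.json just being any other output name):

llvm-lit -v -j 1 -o out2.json .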

Lastly, slightly off-topic, but while we're on the subject of benchmarking:
do you think it's reliable to run with -j <number of cores>? I'm a little bit afraid of
the shared caches (because misses should be counted in the CPU time, which
is what is measured in "exec_time", AFAIU)
and of any potential multi-threading that the tests may use.

It depends. You can run in parallel, but then you should increase the
number of samples (executions) appropriately to counter the increased
noise. Depending on how many cores your system has, it might not be
worth it; instead, try to make the system as deterministic as
possible (single thread, thread affinity, avoid background processes,
use perf instead of timeit, avoid context switches, etc.). To avoid
the systematic bias of the same cache-sensitive programs always running
in parallel, use the --shuffle option.
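
A rough sketch of the two approaches (the -j value, core number and number of repeated runs are just examples, adjust them to your machine):

# parallel: randomize the order and take more samples
llvm-lit -v -j 8 --shuffle -o run1.json .
llvm-lit -v -j 8 --shuffle -o run2.json .

# or serial: pin llvm-lit (and thus the benchmarks it spawns) to one core
taskset --cpu-list 2 llvm-lit -v -j 1 -o run.json .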

Michael

llvm-dev <llvm-dev@lists.llvm.org>:

Also, depending on what you are trying to achieve (and what your platform target is), you could enable perfcounter collection; if instruction counts are sufficient (for example), the value will probably not vary much with multi-threading.

…but it’s probably best to avoid system noise altogether. On Intel, AFAIK that includes disabling turbo boost and hyperthreading, along with Michael’s recommendations.
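
(On a typical Linux machine that would be something along these lines; the exact sysfs paths depend on the CPU driver and kernel, so treat this as a sketch:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo    # disable turbo boost
echo off | sudo tee /sys/devices/system/cpu/smt/control            # disable hyperthreading/SMT
)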

Hi,

Usually one does not compare executions of the entire test-suite, but
looks for which programs have regressed. In this scenario only the relative
change of each program matters, so μs are only compared to μs and
seconds only to seconds.

That’s true, but there are different insights one can get from, say, a 30%
increase in a program that initially took 100μs versus one that initially
took 10s.

What do you mean? Don’t you get the exec_time per program?

Yes, but the JSON file does not include the time unit. Actually, I think the correct phrasing
is “unit of time”, not “time unit”, my bad. In any case, I mean that you get,
e.g., “exec_time”: 4, but you don’t know whether this 4 means 4 seconds or
4 μs or some other unit of time.

For example, the only reason it seems that MultiSource/ uses
seconds is that I ran a bunch of them manually (and because
some outputs saved by llvm-lit, which measure in seconds, match
the numbers in the JSON).

If we knew the unit of time per test case (or per grouping of tests,
for that matter), we could then, e.g., normalize the times, as you
suggest, or at least know the unit of time and act accordingly.

Running the programs a second time did work for me in the past.

Ok, it seems to work for me if I wait, but it behaves differently
the second time. Anyway, not important.

It depends. You can run in parallel, but then you should increase the
number of samples (executions) appropriately to counter the increased
noise. Depending on how many cores your system has, it might not be
worth it; instead, try to make the system as deterministic as
possible (single thread, thread affinity, avoid background processes,
use perf instead of timeit, avoid context switches, etc.). To avoid
the systematic bias of the same cache-sensitive programs always running
in parallel, use the --shuffle option.

I see, thanks. I didn’t know about the --shuffle option, interesting.

Btw, when using perf (i.e., TEST_SUITE_USE_PERF in cmake), it seems that perf runs both during the
build (i.e., make) and the run (i.e., llvm-lit) of the tests. It’s not important, but do you happen to know
why this happens?

Also, depending on what you are trying to achieve (and what your platform target is), you could enable perfcounter collection;

Thanks, that can be useful in a bunch of cases. I should note that the perf stats are not included in the
JSON file. Is the “canonical” way to access them to follow the CMakeFiles/*.dir/*.time.perfstats pattern?

For example, let’s say that I want the perf stats for test-suite/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp.
To find them, I go to the same path but in the build directory, i.e., test-suite-build/SingleSource/Benchmarks/Adobe-C++/,
and then follow the pattern above, so the .perfstats file is at: test-suite-build/SingleSource/Benchmarks/Adobe-C++/CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats

Sorry for the long path strings, but I couldn’t make it clear otherwise.
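
(A simpler way to collect them all, instead of reconstructing each path by hand, is probably just something like:

find test-suite-build -name '*.perfstats'
)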

Thanks to both,
Stefanos

On Mon, Jul 19, 2021 at 5:36 PM, Mircea Trofin <mtrofin@google.com> wrote:

Btw, when using perf (i.e., TEST_SUITE_USE_PERF in cmake), it seems that perf runs both during the
build (i.e., make) and the run (i.e., llvm-lit) of the tests. It’s not important, but do you happen to know
why this happens?

It seems one gathers measurements for the compilation command and the other for the run. My bad, I hadn’t noticed.

- Stefanos

On Mon, Jul 19, 2021 at 10:46 PM, Stefanos Baziotis <stefanos.baziotis@gmail.com> wrote:

You know the unit of time from the top-level folder. MicroBenchmarks
is microseconds (because Google Benchmark reports microseconds);
everything else is seconds.

That might be confusing when you don't know about it, but if you do,
there is no ambiguity.

Michael

Yes, I agree. And as I mentioned, one can figure it out by manually reproducing some measurements.
It’s just that it leaves you wondering “is there some other dir that uses something different?” if nobody tells you about it.
OK, good, I’ll try to add some documentation on that.

By the way, does lit have any flag to set core affinity? Currently, I have modified timeit.sh to use taskset, as in: taskset --cpu-list 2,4,6 perf stat ...
It seems reliable, although I’d like to find a way to actually test that reliability. But if lit already has such an option, I could use that instead.
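
(One quick sanity check might be to confirm, while a test is running, that the affinity mask actually took effect, e.g. with something like

taskset -cp $(pgrep -n perf)

which should print the CPU list I passed to taskset.)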

Best,
Stefanos

On Tue, Jul 20, 2021 at 1:25 AM, Michael Kruse <llvmdev@meinersbur.de> wrote:

Also, depending on what you are trying to achieve (and what your platform target is), you could enable perfcounter collection;

Thanks, that can be useful in a bunch of cases. I should note that the perf stats are not included in the
JSON file. Is the “canonical” way to access them to follow the CMakeFiles/*.dir/*.time.perfstats pattern?

You need to specify which counters you want collected, up to 3 - see the link above (also, you need to opt in to linking libpfm)