test-suite: a new proposal for how to move forward to make "test-suite" more automatic, more flexible, and more maintainable, especially WRT reference outputs

Dear all,

Today I had an idea that might satisfy all the needs for improvement we currently have "on the plate" WRT the repo: the repository-size cost of reference outputs, and the issues surrounding FP optimizations -- specifically, how to allow them while still allowing test programs in "test-suite" whose output depends on FP computations [and for which relatively small changes in FP accuracy, whether up/more-accurate or down/less-accurate, change the actual observed output].

For non-FP-dependent, fully-deterministic programs, we can choose the shortest [in # of bytes as reported by "ls"] of the following:

   * hash
   * compressed output
   * raw output

[in increasing order of "likely" size]

... or we can establish some minimum differentiating factors, e.g. "compressed output must be at least 2x smaller than raw output, otherwise stick to raw output" and "hash must be at least 10x smaller than compressed output, otherwise stick to compressed output". If needed or strongly desired, the rules can even be a little more complicated than that, e.g. "compressed output must be at least 2x smaller than raw output OR at least 4096 bytes smaller than raw output, otherwise stick to raw output".
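As a sketch of what such a selection rule could look like in code [the 2x/10x thresholds are the example numbers from above, and "choose_format" is a hypothetical helper, not anything that exists in test-suite today]:

```c
#include <stddef.h>

/* Hypothetical helper, not existing test-suite code: pick the reference
 * format for a fully-deterministic test using the example thresholds
 * from the proposal above. */
enum ref_format { REF_RAW, REF_COMPRESSED, REF_HASH };

static enum ref_format choose_format(size_t raw_bytes,
                                     size_t compressed_bytes,
                                     size_t hash_bytes)
{
    /* compressed output must be at least 2x smaller than raw output,
     * otherwise stick to raw output */
    if (compressed_bytes * 2 > raw_bytes)
        return REF_RAW;
    /* hash must be at least 10x smaller than compressed output,
     * otherwise stick to compressed output */
    if (hash_bytes * 10 > compressed_bytes)
        return REF_COMPRESSED;
    return REF_HASH;
}
```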

For programs that _are_ FP-dependent, not fully deterministic, or both, I propose that we choose only from the set {compressed output, raw output} because:

   1) small-enough variation in the result is expected, normal, and tolerated

and

   2) since this way the raw reference output will be available on the "lit"-running host [after decompression, if needed],
      the "fpcmp" program can be told how much tolerance to allow for each run.

If we choose only from the set {compressed ref. output, raw ref. output} for these tests, then it should be relatively easy to run some tests with output-changing FP optimizations enabled, since those runs won't depend on the {no-output-changing-FP-optimizations} build having run first. Although Hal's suggestion to have the {no-output-changing-FP-optimizations} build produce the output that will be analyzed by the {output-changing-FP-optimizations-enabled} builds is an excellent one, implementing it in the context of "lit" appears to be considerably more difficult than we had hoped. If anybody reading this knows how to make "lit" start one test only after another has finished, please chime in.

If compressed reference outputs are acceptable to the community, then please let me know which of the following it would be acceptable to depend on for decompression:

   bz2
   gzip
   xz

I'm perfectly willing to write a wrapper [or wrappers] that will probe the system for programs that can decompress the chosen format and will choose the best one available.
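Roughly, such a probe could look like the following [a hypothetical sketch: the preference order and the use of the shell's "command -v" are my assumptions, and the real wrapper would live in the test-suite's build/lit glue]:

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of a decompressor probe: ask the shell whether a
 * tool is installed, trying the formats in an assumed preference order
 * (best typical compression ratio first). */
static int have_tool(const char *name)
{
    char cmd[128];
    snprintf(cmd, sizeof cmd, "command -v %s >/dev/null 2>&1", name);
    return system(cmd) == 0;
}

static const char *pick_decompressor(void)
{
    static const char *const candidates[] = { "xz", "bzip2", "gzip" };
    for (size_t i = 0; i < 3; ++i)
        if (have_tool(candidates[i]))
            return candidates[i];
    return NULL; /* no decompressor found: fall back to raw outputs */
}
```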

Regards,

Abe

Hi Abe,

My 2 cents:
I have been using the test-suite mainly in benchmarking mode as a convenient way to track performance changes in top-of-trunk.
I’ve observed that some of the programs (IIRC, especially the ones in SingleSource/Benchmarks/Polybench/) produce a lot of output (megabytes).
This caused a lot of noise in performance measurements, as the execution time was dominated by printing out the data, rather than the actual useful computations. Renato removed the worst noise in http://reviews.llvm.org/D10991.

That experience made me think that the programs in the test-suite should ideally print only a small amount of output to be checked,
for example by adapting individual programs that output a lot of data so they print only a summary/aggregate of the data, one that is
likely to change when a miscomputation happens.

If we could go in that direction, I don’t see much need for storing hashes or even compressed output as reference data.
I think that needing compressed reference data may make the test-suite ever so slightly harder to set up: another dependency on an external tool. Not that I can imagine that having a dependency on e.g. gzip would be problematic on any platform.

Anyway, I thought I'd just share my opinion that it would be ideal for the programs in the test-suite to produce only small outputs, to avoid noisy benchmark results. If that is a direction we could go in, there may not be much need for storing hashes or compressed reference output.

Thanks,

Kristof

Kristof, I agree with your point of view.

There is a very easy way to output only one double from the polybench:
- compile the kernel with fp-contract=off and -fno-fast-math
- add a "+" reduction loop of all the elements in the output array
(also compiled with strict FP computations such that the output is
deterministic)
- print the result of the reduction instead of printing the full array.
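Concretely, the reduction could look something like this [an illustrative sketch, not existing Polybench code; "reduce_sum" and the array shape are made up for the example, and this translation unit would be the one built with fp-contract=off and -fno-fast-math]:

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative sketch, not existing Polybench code: sum every element
 * of the kernel's output array.  Compiled with strict FP flags
 * (fp-contract=off, -fno-fast-math) so the printed value is
 * deterministic. */
static double reduce_sum(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Then, instead of dumping the whole output array C[N][N]:
 *     printf("%.15e\n", reduce_sum(&C[0][0], (size_t)N * N));
 */
```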

Thanks,
Sebastian

Adding Tobi in CC to get his review about the proposed change to Polybench.

Thanks,
Sebastian

> There is a very easy way to output only one double from the polybench:
> - compile the kernel with fp-contract=off and -fno-fast-math

Sebastian, please stop crossing the wires. This is a separate discussion.

> - add a "+" reduction loop of all the elements in the output array
> (also compiled with strict FP computations such that the output is
> deterministic)

addition can saturate/overflow and lose precision, especially if we
have hundreds of thousands of results or if the type is float, not
double. Whatever the aggregation function we use has to be meaningful.

One approach I used in the past was to aggregate in blocks when the results
weren't likely to saturate/overflow/lose precision, i.e. the end result
had a similar magnitude to the individual results.

This gave us huge benefits in I/O and comparison times, and can work
with polybench, but someone will have to go through it and make sure
the aggregated numbers are not orders of magnitude greater than the
individual results.
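A block-wise scheme along those lines might look like this [illustrative only: the block size is an assumed tuning knob, and emitting per-block *means* rather than sums is one way to keep every printed value at the magnitude of the individual elements]:

```c
#include <stddef.h>

/* Illustrative sketch of block-wise aggregation: accumulate each chunk
 * in double to limit rounding, then emit the block mean so the printed
 * values stay at roughly the magnitude of the individual elements.
 * BLOCK is an assumed tuning knob; n is assumed a multiple of BLOCK. */
#define BLOCK 256

static void block_means(const float *data, size_t n, double *out)
{
    for (size_t b = 0; b < n / BLOCK; ++b) {
        double partial = 0.0;
        for (size_t i = 0; i < BLOCK; ++i)
            partial += data[b * BLOCK + i];
        out[b] = partial / BLOCK;
    }
}
```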

cheers,
--renato

> There is a very easy way to output only one double from the polybench:
> - compile the kernel with fp-contract=off and -fno-fast-math

> Sebastian, please stop crossing the wires. This is a separate discussion.

We need to get deterministic output for all possible combinations of
CFLAGS the users will compile the test-suite with.

> - add a "+" reduction loop of all the elements in the output array
> (also compiled with strict FP computations such that the output is
> deterministic)

> addition can saturate/overflow and lose precision, especially if we
> have hundreds of thousands of results or if the type is float, not
> double. Whatever the aggregation function we use has to be meaningful.

> Agreed.
> I'm also fine using any stable hashing function and link polybench
> tests against that.

The problem with hashes is that they only work when the results have to
be exact, which is not suitable for FP comparisons and thus would fail
under fp-contract=on.

To speed things up, I guess we could reduce the output via meaningful
aggregation, get a small set of FP numbers to compare against and set
an FP_TOLERANCE that makes sense around FP contraction and leave it
on.

Then we commit the change to fp-contract=on.

After that, we change the test harness to have two runs, one with
tolerance=0 and fp-contract=off, and the other with tolerance=delta
and fp-contract=on.

cheers,
--renato