a proposed script to help with test-suite programs that output _lots_ of FP numbers

Dear all,

As part of working on making test-suite less demanding of exact FP results so my FP-contraction patch can go back into trunk and stay there, today I analyzed "MultiSource/Benchmarks/VersaBench/beamformer". I found that the raw output from that program is 2789780 bytes [i.e. ~2.7 _megabytes_] of floating-point text, which IMO is too much to put into a patch -- or at least a _civilized_ patch. :wink:

As a result, I wrote the below Python program, which I think should deal with the problem fairly well, or at least is a good first attempt at doing so and can be improved later. For the "MultiSource/Benchmarks/VersaBench/beamformer" test, using the same CLI arg.s as in the test-suite script, passing the output of the test through the below script gives me the following results.

Vanilla compiler, i.e. without FP-contraction patch

Cumulating errors is a bad idea.
As others have suggested, please prepare a patch that disables
fp-contract on those testcases.

Thanks,
Sebastian

No, please, let's not disable things just because they fail.

If the test is not meaningful or if the results are not good, let's
just change the test in a meaningful way that can work with any FP
optimisation without changing meaning.

If it does change meaning, it's a bug and we *want* to catch.

cheers,
--renato

As part of working on making test-suite less demanding of exact FP results
so my FP-contraction patch can go back into trunk and stay there, today I
analyzed "MultiSource/Benchmarks/VersaBench/beamformer". I found that the
raw output from that program is 2789780 bytes [i.e. ~2.7 _megabytes_] of
floating-point text, which IMO is too much to put into a patch -- or at
least a _civilized_ patch. :wink:

Not to mention having to debug the whole thing every time it breaks. :S

How I "fixed" it in the past was to do some heuristics like you did,
while still trying to keep the meaning.

I think the idea of having a "number of results" is good, but I also
think you can separate the 300k values in logical groups, maybe adding
them up.

Of course, the more you do to the results, the higher will be the
rounding errors, and the less meaning the results will have.

I don't know much about this specific benchmark, but if it has some
kind of internal aggregation values, you can dump those instead?

As a result, I wrote the below Python program, which I think should deal
with the problem fairly well, or at least is a good first attempt at doing
so and can be improved later.

The python script is a good prototype for what you want, but I'd
rather change the source code of the benchmark to print less, more
valuable, stuff.

The more you print, the more your run will be tied to stdio and less
to what you wanted to benchmark in the fist place.

Vanilla compiler, i.e. without FP-contraction patch
---------------------------------------------------
286720
9178782.5878

Compiler WITH FP-contraction patch
----------------------------------
286720
9178782.58444

This looks like a small enough change to me, given the amount of
precision you're losing. But it'd be better to make sure the result
has at least some meaning related to the benchmark.

If "fpcmp" currently cannot ignore tolerances for integers and cannot easily
be extended to be able to do so, then I propose that the two calculated
results go to separate outputs [2 files?] to be tested separately.

That'd be fine, too.

cheers,
--renato

These tests are checking the results against a "golden file" output
computed with fp-contract=off.
IMHO the most sensible solution is to continue checking those tests
with the same flag
as at the time when the reference output has been recorded.

Thanks,
Sebastian

From: "Sebastian Pop via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Renato Golin" <renato.golin@linaro.org>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Sebastian Paul Pop" <s.pop@samsung.com>, "Abe Skolnik"
<a.skolnik@samsung.com>, "cfe-dev" <cfe-dev@lists.llvm.org>
Sent: Thursday, September 29, 2016 4:26:23 PM
Subject: Re: [cfe-dev] [llvm-dev] a proposed script to help with test-suite programs that output _lots_ of FP numbers

>> Cumulating errors is a bad idea.
>> As others have suggested, please prepare a patch that disables
>> fp-contract on those testcases.
>
> No, please, let's not disable things just because they fail.
>
> If the test is not meaningful or if the results are not good, let's
> just change the test in a meaningful way that can work with any FP
> optimisation without changing meaning.
>
> If it does change meaning, it's a bug and we *want* to catch.

These tests are checking the results against a "golden file" output
computed with fp-contract=off.
IMHO the most sensible solution is to continue checking those tests
with the same flag
as at the time when the reference output has been recorded.

We don't want to lose the more-stringent test coverage just because that no longer might be the default mode. We'll also want, where practical, some looser test mode that will work regardless of the FP contraction setting. Both are important.

-Hal

Summing up errors cancels them up: it makes no sense to do this.
You may be missing errors that are an order of magnitude off both directions
and still end up with an overall small delta after reduction.
And by the way, what does it mean "small enough change"?
Based on what metrics?
Is there a way to quantify what is an acceptable error margin?
There is a lot of information that is thrown away with the sum reduction.

Sebastian

What about recording two golden files with fp-cotract=on and off, then
diff the output against both, and if none is similar enough error.
I see two problems with recording the output of fp-contract=on:
- based on the target there may be more or less contractions done,
- and also the results may vary over time as the compiler gets better
at folding.

From: "Sebastian Pop" <sebpop.llvm@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Sebastian Paul Pop" <s.pop@samsung.com>, "Abe Skolnik"
<a.skolnik@samsung.com>, "cfe-dev" <cfe-dev@lists.llvm.org>, "Renato Golin" <renato.golin@linaro.org>
Sent: Thursday, September 29, 2016 4:41:23 PM
Subject: Re: [cfe-dev] [llvm-dev] a proposed script to help with test-suite programs that output _lots_ of FP numbers

> We don't want to lose the more-stringent test coverage just because
> that no longer might be the default mode. We'll also want, where
> practical, some looser test mode that will work regardless of the
> FP contraction setting. Both are important.

What about recording two golden files with fp-cotract=on and off,
then
diff the output against both, and if none is similar enough error.
I see two problems with recording the output of fp-contract=on:
- based on the target there may be more or less contractions done,
- and also the results may vary over time as the compiler gets better
at folding.

Recording the output with -ffp-contract=on to use as a reference output does not seem useful, exactly for the reasons you specify. We should generate the reference outputs with -ffp-contract=off as we do now. That's the only good target-independent configuration. Maybe we could do this:

- Have a reference output with -ffp-contract=off
- Run program compiled with -ffp-contract=off and the default
- Compare the former against the reference output for validation
- Compare the latter (the default build) against the former with fpcmp with some tolerance

This also has the advantage that we still only need the hashed reference outputs in the repository.

-Hal

[On Thu, Sep 29, 2016 at 4:32 PM, Hal Finkel <hfinkel@anl.gov> wrote:]

We don't want to lose the more-stringent test coverage just because that no longer might be the default mode.
We'll also want, where practical, some looser test mode that will work regardless of the FP

contraction setting.

Both are important.

[On 09/29/2016 04:41 PM, Sebastian Pop wrote:]

What about recording two golden files with fp-contract=on and off,
then diff the output against both, and if none is similar enough error.
I see two problems with recording the output of fp-contract=on:
- based on the target there may be more or less contractions done,
- and also the results may vary over time as the compiler gets better
at folding.

Yes, separate "golden files" with FP contraction on [or "fast"] is problematic.

However, if I/we can get compression working, then in some sense we can "have our cake and eat it too": have small reference files in the repository and in downloads, and have any precision/tolerance we like [including requiring exact matching] at test time.

If all the big outputs compress very nicely, as does the one from "MultiSource/Benchmarks/VersaBench/beamformer", then we can probably do this across all tests the output of which depends on FP computations.

-- Abe

["Hal Finkel" <hfinkel@anl.gov> wrote:]

Recording the output with -ffp-contract=on to use as a reference output does not seem useful,
exactly for the reasons you specify. We should generate the reference outputs with -ffp-contract=off as we do now.
That's the only good target-independent configuration. Maybe we could do this:

- Have a reference output with -ffp-contract=off
- Run program compiled with -ffp-contract=off and the default
- Compare the former against the reference output for validation
- Compare the latter (the default build) against the former with fpcmp with some tolerance

This also has the advantage that we still only need the hashed reference outputs in the repository.

Ah. Interesting idea. So please tell me if I have understood you correctly. I think you [Hal] are suggesting something like this:

   1) compile the program with FP fusion off, run program, capture output and save it, hash it and compare against reference hash.

   2) if comparison against reference hash says "not equal", fail the test and stop [stop testing this particular subtest, that is]

   3) compile the program with FP fusion on/"fast", capture the output,
      compare using "fpcmp" and some positive tolerance against the output of the non-fusion build of the same source code;
      fail only if outside the tolerance limit[s]

Is that right?

Regards,

Abe

From: "Abe Skolnik" <a.skolnik@samsung.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Sebastian Pop" <sebpop.llvm@gmail.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Sebastian Paul Pop" <s.pop@samsung.com>, "cfe-dev"
<cfe-dev@lists.llvm.org>, "Renato Golin" <renato.golin@linaro.org>
Sent: Thursday, September 29, 2016 5:01:10 PM
Subject: Re: [cfe-dev] [llvm-dev] a proposed script to help with test-suite programs that output _lots_ of FP numbers

["Hal Finkel" <hfinkel@anl.gov> wrote:]

> Recording the output with -ffp-contract=on to use as a reference
> output does not seem useful,
> exactly for the reasons you specify. We should generate the
> reference outputs with -ffp-contract=off as we do now.
> That's the only good target-independent configuration. Maybe we
> could do this:

> - Have a reference output with -ffp-contract=off
> - Run program compiled with -ffp-contract=off and the default
> - Compare the former against the reference output for validation
> - Compare the latter (the default build) against the former with
> fpcmp with some tolerance

> This also has the advantage that we still only need the hashed
> reference outputs in the repository.

Ah. Interesting idea. So please tell me if I have understood you
correctly. I think you
[Hal] are suggesting something like this:

   1) compile the program with FP fusion off, run program, capture
   output and save it, hash it
and compare against reference hash.

   2) if comparison against reference hash says "not equal", fail the
   test and stop [stop
testing this particular subtest, that is]

   3) compile the program with FP fusion on/"fast", capture the
   output,
      compare using "fpcmp" and some positive tolerance against the
      output of the non-fusion
build of the same source code;
      fail only if outside the tolerance limit[s]

Is that right?

Correct.

Thanks again,
Hal

That's why Abe proposed running both modes. I think this is the only
long term solution.

My (limited) understanding so far was that -fp-contract=on does not give you deterministic results because depending on optimization levels (and architecture support) you could get more or less multiply-add instructions, so recording a fp-contract=on reference file seems pointless to me...

- Matthias

So we do agree that currently the testsuite runs only one of the
modes, that is fp-contract=off.
The test-suite does not run the other mode: fp-contract=on, that is a
new feature, i.e., more testing.

If we agree on these two points, then let's move forward, commit the
patch that Abe has just posted to add CFLAGS += -ffp-contract=off
and then in a second stage add an extra step for the new mode fp-contract=on.

Thanks,
Sebastian