[test-suite] making polybench/symm succeed with "-Ofast" and "-ffp-contract=on"

Hi,

I would need some help to fix polybench/symm:

void kernel_symm(int ni, int nj,
DATA_TYPE alpha,
DATA_TYPE beta,
DATA_TYPE POLYBENCH_2D(C,NI,NJ,ni,nj),
DATA_TYPE POLYBENCH_2D(A,NJ,NJ,nj,nj),
DATA_TYPE POLYBENCH_2D(B,NI,NJ,ni,nj))
{
  int i, j, k;
  DATA_TYPE acc;

  /* C := alpha*A*B + beta*C, A is symmetric */
  for (i = 0; i < _PB_NI; i++)
    for (j = 0; j < _PB_NJ; j++)
      {
        acc = 0;
        for (k = 0; k < j - 1; k++)
          {
             C[k][j] += alpha * A[k][i] * B[i][j];
             acc += B[k][j] * A[k][i];
          }
        C[i][j] = beta * C[i][j] + alpha * A[i][i] * B[i][j] + alpha * acc;
      }
}

Compiling this kernel with __attribute__((optnone)) and printing the
contents of the C array gives output that does not match the reference output.
Furthermore, compiling this kernel at -Ofast and comparing against -O0
only passes for FP_ABSTOLERANCE=10.
All 10 other polybench tests that I have transformed to check FP
are passing at FP_ABSTOLERANCE=1e-5 (and most likely they could pass
at an even tighter tolerance).

The symm benchmark seems to accumulate all the errors as it is a big
reduction from the first elements of the C array into the last
elements.
I'm not sure we can rely on this benchmark to check FP correctness.

One option is to completely specify which optimization flags have been
used to compute the reference output and only use that to compile this
benchmark.

Please share your ideas on how to deal with this particular test.

Thanks,
Sebastian

From: "Sebastian Pop" <sebpop.llvm@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Sebastian Paul Pop" <s.pop@samsung.com>, "llvm-dev" <llvm-dev@lists.llvm.org>, "Matthias Braun"
<matze@braunis.de>, "Clang Dev" <cfe-dev@lists.llvm.org>, "nd" <nd@arm.com>, "Abe Skolnik" <a.skolnik@samsung.com>,
"Renato Golin" <renato.golin@linaro.org>
Sent: Monday, October 10, 2016 9:10:01 AM
Subject: [test-suite] making polybench/symm succeed with "-Ofast" and "-ffp-contract=on"

Hi,

I would need some help to fix polybench/symm:

void kernel_symm(int ni, int nj,
DATA_TYPE alpha,
DATA_TYPE beta,
DATA_TYPE POLYBENCH_2D(C,NI,NJ,ni,nj),
DATA_TYPE POLYBENCH_2D(A,NJ,NJ,nj,nj),
DATA_TYPE POLYBENCH_2D(B,NI,NJ,ni,nj))
{
  int i, j, k;
  DATA_TYPE acc;

  /* C := alpha*A*B + beta*C, A is symmetric */
  for (i = 0; i < _PB_NI; i++)
    for (j = 0; j < _PB_NJ; j++)
      {
        acc = 0;
        for (k = 0; k < j - 1; k++)
          {
             C[k][j] += alpha * A[k][i] * B[i][j];
             acc += B[k][j] * A[k][i];
          }
        C[i][j] = beta * C[i][j] + alpha * A[i][i] * B[i][j] + alpha * acc;
      }
}

Compiling this kernel with __attribute__((optnone)) and printing the
contents of the C array gives output that does not match the reference output.

Why is this? What compiler are you using? Are we not using IEEE FP @ -O0 (e.g. using x87 floating point)? IEEE FP, without FMA, should be completely deterministic. Sounds like a bug.

Furthermore, compiling this kernel at -Ofast and comparing against -O0
only passes for FP_ABSTOLERANCE=10.
All 10 other polybench tests that I have transformed to check FP
are passing at FP_ABSTOLERANCE=1e-5 (and most likely they could pass
at an even tighter tolerance).

The symm benchmark seems to accumulate all the errors as it is a big
reduction from the first elements of the C array into the last
elements.
I'm not sure we can rely on this benchmark to check FP correctness.

One option is to completely specify which optimization flags have been
used to compute the reference output and only use that to compile this
benchmark.

Please share your ideas on how to deal with this particular test.

If the test is not numerically stable, we can:

1. Only test the non-FP-contracted output
2. Run the FP-contracted test only for a very small size (so that we'll stay within some reasonable tolerance of the reference output)
3. Change the matrix to something that will make the test numerically stable (it does not look like the matrix itself matters to the performance; where do the values come from?).

-Hal

From: "Sebastian Pop" <sebpop.llvm@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Sebastian Paul Pop" <s.pop@samsung.com>, "llvm-dev" <llvm-dev@lists.llvm.org>, "Matthias Braun"
<matze@braunis.de>, "Clang Dev" <cfe-dev@lists.llvm.org>, "nd" <nd@arm.com>, "Abe Skolnik" <a.skolnik@samsung.com>,
"Renato Golin" <renato.golin@linaro.org>
Sent: Monday, October 10, 2016 9:10:01 AM
Subject: [test-suite] making polybench/symm succeed with "-Ofast" and "-ffp-contract=on"

Hi,

I would need some help to fix polybench/symm:

void kernel_symm(int ni, int nj,
DATA_TYPE alpha,
DATA_TYPE beta,
DATA_TYPE POLYBENCH_2D(C,NI,NJ,ni,nj),
DATA_TYPE POLYBENCH_2D(A,NJ,NJ,nj,nj),
DATA_TYPE POLYBENCH_2D(B,NI,NJ,ni,nj))
{
  int i, j, k;
  DATA_TYPE acc;

  /* C := alpha*A*B + beta*C, A is symmetric */
  for (i = 0; i < _PB_NI; i++)
    for (j = 0; j < _PB_NJ; j++)
      {
        acc = 0;
        for (k = 0; k < j - 1; k++)
          {
             C[k][j] += alpha * A[k][i] * B[i][j];
             acc += B[k][j] * A[k][i];
          }
        C[i][j] = beta * C[i][j] + alpha * A[i][i] * B[i][j] + alpha * acc;
      }
}

Compiling this kernel with __attribute__((optnone)) and printing the
contents of the C array gives output that does not match the reference output.

Why is this? What compiler are you using? Are we not using IEEE FP @ -O0 (e.g. using x87 floating point)? IEEE FP, without FMA, should be completely deterministic. Sounds like a bug.

This is with clang top of tree, on a x86_64-linux.
I created https://reviews.llvm.org/D25465 with the changes that I have
to the symm benchmark.

Furthermore, compiling this kernel at -Ofast and comparing against -O0
only passes for FP_ABSTOLERANCE=10.
All 10 other polybench tests that I have transformed to check FP
are passing at FP_ABSTOLERANCE=1e-5 (and most likely they could pass
at an even tighter tolerance).

The symm benchmark seems to accumulate all the errors as it is a big
reduction from the first elements of the C array into the last
elements.
I'm not sure we can rely on this benchmark to check FP correctness.

One option is to completely specify which optimization flags have been
used to compute the reference output and only use that to compile this
benchmark.

Please share your ideas on how to deal with this particular test.

If the test is not numerically stable, we can:

1. Only test the non-FP-contracted output

Yes, this is what I'm doing.

2. Run the FP-contracted test only for a very small size (so that we'll stay within some reasonable tolerance of the reference output)
3. Change the matrix to something that will make the test numerically stable (it does not look like the matrix itself matters to the performance; where do the values come from?).

The values may be very large towards the end of the C array.
The test now passes with FP_ABSTOLERANCE=1e-5 when lowering the values
in the input arrays with this patch:

diff --git a/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.c
b/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.c
index 0a1bdf3..7fc3cb1 100644
--- a/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.c
+++ b/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.c
@@ -35,12 +35,12 @@ void init_array(int ni, int nj,
   *beta = 2123;
   for (i = 0; i < ni; i++)
     for (j = 0; j < nj; j++) {
-      C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i*j) / ni;
-      B[i][j] = ((DATA_TYPE) i*j) / ni;
+      C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i-j) / ni;
+      B[i][j] = ((DATA_TYPE) i-j) / ni;
     }
   for (i = 0; i < nj; i++)
     for (j = 0; j < nj; j++)
-      A[i][j] = ((DATA_TYPE) i*j) / ni;
+      A[i][j] = ((DATA_TYPE) i-j) / ni;
}

Of course we need to update the reference output hash if we decide to
use this patch.

Sebastian

1. Only test the non-FP-contracted output

Yes, this is what I'm doing.

If the whole test is about testing multiplications, what's the point of this?

2. Run the FP-contracted test only for a very small size (so that we'll stay within some reasonable tolerance of the reference output)
3. Change the matrix to something that will make the test numerically stable (it does not look like the matrix itself matters to the performance; where do the values come from?).

3 is more sound, 2 may be more practical.

- C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i*j) / ni;
- B[i][j] = ((DATA_TYPE) i*j) / ni;
+ C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i-j) / ni;
+ B[i][j] = ((DATA_TYPE) i-j) / ni;
     }
   for (i = 0; i < nj; i++)
     for (j = 0; j < nj; j++)
- A[i][j] = ((DATA_TYPE) i*j) / ni;
+ A[i][j] = ((DATA_TYPE) i-j) / ni;

Changing from multiplication to subtraction completely changes the
nature of the test and goes towards "return 0;", i.e., fiddling with the
code so that the compiler "behaves" better. This is *not* a solution.

Hal,

For large scale numerical programs, if fp-contract can result in large
scale differences, we need to think carefully before enabling it by default.

If the loop above cannot be kept within a 1e-8 range for double
values over a large dataset, then I guess the transformation is going
a bit too far.

If not, we should be able to come up with a reasonable tolerance that
makes the test still be relevant.

cheers,
--renato

1. Only test the non-FP-contracted output

Yes, this is what I'm doing.

If the whole test is about testing multiplications, what's the point of this?

2. Run the FP-contracted test only for a very small size (so that we'll stay within some reasonable tolerance of the reference output)
3. Change the matrix to something that will make the test numerically stable (it does not look like the matrix itself matters to the performance; where do the values come from?).

3 is more sound, 2 may be more practical.

2 sounds like you are asking to only run checkFP on the first elements
of the array.
In that case what would be the last element to check?

- C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i*j) / ni;
- B[i][j] = ((DATA_TYPE) i*j) / ni;
+ C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i-j) / ni;
+ B[i][j] = ((DATA_TYPE) i-j) / ni;
     }
   for (i = 0; i < nj; i++)
     for (j = 0; j < nj; j++)
- A[i][j] = ((DATA_TYPE) i*j) / ni;
+ A[i][j] = ((DATA_TYPE) i-j) / ni;

Changing from multiplication to subtraction completely changes the
nature of the test and goes towards "return 0;", i.e., fiddling with the
code so that the compiler "behaves" better. This is *not* a solution.

Another observation: when replacing * with + the test only passes at
-Ofast with FP_ABSTOLERANCE=1e-4.

Sebastian

From: "Renato Golin" <renato.golin@linaro.org>
To: "Sebastian Pop" <sebpop.llvm@gmail.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Sebastian Paul Pop" <s.pop@samsung.com>, "llvm-dev" <llvm-dev@lists.llvm.org>,
"Matthias Braun" <matze@braunis.de>, "Clang Dev" <cfe-dev@lists.llvm.org>, "nd" <nd@arm.com>, "Abe Skolnik"
<a.skolnik@samsung.com>
Sent: Tuesday, October 11, 2016 6:33:43 AM
Subject: Re: [test-suite] making polybench/symm succeed with "-Ofast" and "-ffp-contract=on"

>> 1. Only test the non-FP-contracted output
>
> Yes, this is what I'm doing.

If the whole test is about testing multiplications, what's the point
of this?

>> 2. Run the FP-contracted test only for a very small size (so that
>> we'll stay within some reasonable tolerance of the reference
>> output)
>> 3. Change the matrix to something that will make the test
>> numerically stable (it does not look like the matrix itself
>> matters to the performance; where do the values come from?).

3 is more sound, 2 may be more practical.

> - C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i*j) / ni;
> - B[i][j] = ((DATA_TYPE) i*j) / ni;
> + C_StrictFP[i][j] = C[i][j] = ((DATA_TYPE) i-j) / ni;
> + B[i][j] = ((DATA_TYPE) i-j) / ni;
> }
> for (i = 0; i < nj; i++)
> for (j = 0; j < nj; j++)
> - A[i][j] = ((DATA_TYPE) i*j) / ni;
> + A[i][j] = ((DATA_TYPE) i-j) / ni;

Changing from multiplication to subtraction completely changes the
nature of the test and goes towards "return 0;", i.e., fiddling with the
code so that the compiler "behaves" better. This is *not* a solution.

Hal,

For large scale numerical programs, if fp-contract can result in large
scale differences, we need to think carefully before enabling it by default.

Obviously a lot of people have done an awful lot of thinking about this over many years, and contractions-by-default is the reality on many systems. If you have a program that is numerically unstable, simulating a chaotic system, etc. then any difference, often no matter how small, will lead to large-scale differences in the output. As a result, there will be some tests that don't have a useful tolerance; sometimes these are badly-implemented tests, but sometimes the sensitivity represents an underlying physical reality of a simulated system (there's a lot of very interesting mathematical theory behind this, e.g. chaos theory).

From a user-experience perspective, this can be very unfortunate. It can be hard to understand why compiler optimizations, or different compilers, produce executables that produce different outputs for identical input configurations. It contributes to feelings that floating point is hard and confusing. However, not using the contractions also leads to equally-confusing performance discrepancies between our compiler and others (and between the observed and expected performance). We have a classic "Damned if you do, damned if you don't" situation. However, I lean toward enabling the contractions by default because other compilers do it (so users need to learn about what's going on anyway - we can't shield them from this regardless of what we do) and it gives users the performance they expect (which increases our user base and makes many users happier).

-Hal

Thanks Hal for the explanations and summary of why we need to fix this
in the compiler and in the test-suite.

For a "non FP expert" like myself, could one of you "FP experts"
choose from the proposed solutions on how to fix symm, and let me know
what I should implement?
To get polybench/symm out of my todo list, the sooner "FP experts"
make up their mind on what they would like the test-suite to look
like, the better.
;-)

Thanks,
Sebastian

It is not uncommon to see adjustments to the initial values in several
polybench tests:

  /*
  LLVM: This change ensures we do not calculate nan values, which are
        formatted differently on different platforms and which may also
        be optimized unexpectedly.
  Original code:
  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      A[i][j] = ((DATA_TYPE) i*j) / ni;
      Q[i][j] = ((DATA_TYPE) i*(j+1)) / nj;
    }
  for (i = 0; i < nj; i++)
    for (j = 0; j < nj; j++)
      R[i][j] = ((DATA_TYPE) i*(j+2)) / nj;
  */
  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      A[i][j] = ((DATA_TYPE) i*j+ni) / ni;
      Q[i][j] = ((DATA_TYPE) i*(j+1)+nj) / nj;
    }
  for (i = 0; i < nj; i++)
    for (j = 0; j < nj; j++)
      R[i][j] = ((DATA_TYPE) i*(j+2)+nj) / nj;

git grepping gives us:

linear-algebra/kernels/cholesky/cholesky.c: LLVM: This change ensures we do not calculate nan values, which are
linear-algebra/kernels/cholesky/cholesky.c: LLVM: This change ensures we do not calculate nan values, which are
linear-algebra/kernels/cholesky/cholesky.c: LLVM: This change ensures we do not calculate nan values, which are
linear-algebra/kernels/trisolv/trisolv.c: LLVM: This change ensures we do not calculate nan values, which are
linear-algebra/solvers/gramschmidt/gramschmidt.c: LLVM: This change ensures we do not calculate nan values, which are
linear-algebra/solvers/lu/lu.c: LLVM: This change ensures we do not calculate nan values, which are

polybench/linear-algebra/solvers/gramschmidt/ exposes the same problems as symm.
It does not match the reference output at -O0 -ffp-contract=off,
and it only passes all element comparisons at FP_ABSTOLERANCE=1 for
"-Ofast" vs. "-O0 -ffp-contract=off".

Obviously a lot of people have done an awful lot of thinking about this over many years, and contractions-by-default is the reality on many systems. If you have a program that is numerically unstable, simulating a chaotic system, etc. then any difference, often no matter how small, will lead to large-scale differences in the output. As a result, there will be some tests that don't have a useful tolerance; sometimes these are badly-implemented tests, but sometimes the sensitivity represents an underlying physical reality of a simulated system (there's a lot of very interesting mathematical theory behind this, e.g. chaos theory).

Hi Hal,

I think we're crossing the wires, here.

There are three sources of uncertainties on chaotic systems:

1. Initial conditions, not affected by the compiler and "part of the
problem, part of the solution".
2. Evolution, affected by the compiler, not limited by FP-reordering
passes (UB can also play a role here).
3. Expectations, affected by the evolution and the nature of the
problem and too high level to be of any consequence to the compiler.

Initial conditions change in real life, but they must be the same in
tests. Same for evolution and expectation. You can't use an external
random number generator, you can't rely on different RNGs (that's why
I added hand-coded ones to some tests).

If the FP-contract pass affects (2), that's perfectly fine. But if it
affects (3), for example via changing the precision / errors / deltas,
then we have a problem.

From what I understand, FP-contraction actually makes calculations
*more* precise, by removing one rounding operation in every two. This
means to me that the tolerance of a well designed *test* must be
kept as low as possible.

And this is the key: if the tolerance of a test needs to be
*increased* because of FP-contract, then the test is wrong. Either the
code, or the reference output, or how we get to the reference values
is wrong. Disabling the checks, or increasing the tolerance beyond
what's meaningful in this case will make an irrelevant test useless.
Right now, it may be irrelevant and non-representative, but it can
catch compiler FP errors. Adding a huge bracket or disabling
FP-contract will remove even that small benefit.

Right now, the tests have one value, which happens to be identical in
virtually all platforms. This means the compiler is pretty good at
keeping the semantics and lucky in keeping the same precision. But we
both know this is wrong.

And now we have a chance to make those tests better. Not more accurate
per se, but more accurately testing the compiler. There is a big
difference here.

If we change the semantics of the code (mul -> sub), we're killing the
original test. If we increase the tolerance without analysis or
disable default passes, we're killing any chance to spot compiler
problems in the future.

From a user-experience perspective, this can be very unfortunate. It can be hard to understand why compiler optimizations, or different compilers, produce executables that produce different outputs for identical input configurations. It contributes to feelings that floating point is hard and confusing.

On the contrary, it serves as an education that FP is a finite
representation of real numbers, and people shouldn't be expecting to
get byte-exact values anyway. I have strong reservations against
scientific code that doesn't take into account rounding issues, error
calculations, and that takes the results at face value. It's like
running one single Monte Carlo simulation and taking international
politics decisions based on that result.

However, not using the contractions also leads to equally-confusing performance discrepancies between our compiler and others (and between the observed and expected performance).

Let's not mix conformance and performance. Different compilers have
different flags and behave differently. Enabling FP-contract in LLVM
has *nothing* to do with what GCC does, and everything to do with "what's a
better strategy for LLVM". We have refrained from following GCC
blindly for a number of years and it would be really sad if we started
now.

If FP-contract=on is a good decision for LLVM, on merits of precision,
performance and overall quality, then let's do it. If not, then let's
put it under some flag and tell all people comparing with GCC to use
that flag.

But if we do go with it, we need to make sure our current tests don't
just break apart and get hidden under a corner. GCC compatibility
isn't *that* important.

I'm not advocating against turning it on, I'm advocating against the
easy path of hiding the tests. We might just as well remove them.

I'll reply to Sebastian on a more practical way, but I wanted to make
it clear that we're talking about the test and not the transformation
itself, which needs to be analysed on its own merits, not on what GCC
does.

cheers,
--renato

This comment is there since it was originally introduced by Tobias.
We'll have to ask him what changes were done to understand how this is
relevant to your current proposal.

cheers,
--renato

I think we're going about this in a completely wrong way.

The current reference output is specific to fp-contract=off, and
making it work for fp-contract=on makes no sense at all.

For all we know, fp-contract=on generates *more accurate* results, not
less. But it may also have less predictable results *across* different
targets, thus the need to a tolerance.

FP_TOLERANCE is *not* about making the new results match an old
reference, but about showing the *real* uncertainties of FP
transformation on *different* targets.

So, if you want to fix this test for good, here are the steps you need to take:

1. Checkout the test-suite on different platforms, x86_64, ARM,
AArch64, PPC, MIPS. The more the merrier.
2. Enable fp-contract=on, run the tests on all platforms, record the
outputs, ignore the differences.
3. Collate each platform's output for each test and see how different they are.

To make it easier to compare, in the past, I've used this trick:

1. Run in one platform, ex. x86_64, ignored the reference
2. Copy the output of those tests back to the reference_output
3. Run on a different platform, tweaking the tolerance until it "passes"
4. Run on yet another platform, making sure you don't need to tweak
the tolerance yet again

If the tolerance is "too high" for that test, we can further discuss
how to change it to make it better. If not, you found a solution.

If you want to make it even better, do some analysis on the
distribution of the results, per test, and pick the average as the
reference output and one or two standard deviations as the tolerance.
This should pass on most architectures.

To simplify the analysis, you can reduce the output into a single
number, say, adding all the results up. This will generate more
inaccuracies than comparing each value, and if that's too large an
error, then you reduce the number of samples.

For example, on cholesky, we sampled every 16th item of the array:

  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
      print_element(A[i][j], j*16, printmat);
    fputs(printmat, stderr);
  }

using "print_element" because calling printf sucks.

These modifications are ok, because they don't change the tests nor
hide them from compiler changes.

cheers,
--renato


polybench/linear-algebra/solvers/gramschmidt/ exposes the same problems as symm.
It does not match the reference output at -O0 -ffp-contract=off,
and it only passes all element comparisons at FP_ABSTOLERANCE=1 for
"-Ofast" vs. "-O0 -ffp-contract=off".

I think we're going about this in a completely wrong way.

The current reference output is specific to fp-contract=off, and
making it work for fp-contract=on makes no sense at all.

Yes.

I want to mention that there are two problems: one is with the FP tolerance
as you describe below.
The other problem is that the reference output does not match
at "-O0 -ffp-contract=off". It might be that the reference output was recorded
at "-O3 -ffp-contract=off". I think that this hides either a compiler
bug or a test bug.

Sebastian

Ah, yes! You mentioned before and I forgot to reply, you're absolutely right.

If the tolerance is zero, then it's "ok" to "fail" at O0, because
whatever O3 produces is "some" version of the expected value +- some
delta. The error is expecting the tolerance to be zero (or smaller
than delta).

My point, since the beginning, has been to understand what the
expected value is (with its inherent error bars), and make that the
reference output. Only then will the test be meaningful *and*
accurate.

But there are so many overloaded terms in this conversation that it's
really hard to get a point across without going to great lengths to
explain each one. :-)

cheers,
--renato

PS: the term "accurate" above is meant to "accurately test the
expected error ranges the compiler is allowed to produce", not that
the test will have a lower error bar. It demonstrates the term
overloading quite well. :-)

Correct me if I misunderstood: you would be ok changing the
reference output to exactly match the output of "-O0 -ffp-contract=off".

I am asking this for practical reasons:
clang currently only supports __attribute__((optnone)) to compile
a function at -O0; attributes for other optimization levels are not yet supported.
In the updated patch for Proposal 2 (https://reviews.llvm.org/D25346,
"[test-suite] [Polybench] run tests twice with -ffp-contract=on/off")
we do use that attribute together with
#pragma STDC FP_CONTRACT OFF
to compile the kernel_StrictFP() function at "-O0 -ffp-contract=off".
The output of kernel_StrictFP is then used in exact matching against
the reference output.

In polybench there are 5 benchmarks that need adjustment of the
reference output to match the output of optnone.

polybench/linear-algebra/kernels/symm
polybench/linear-algebra/solvers/gramschmidt
polybench/medley/reg_detect
polybench/stencils/adi
polybench/stencils/seidel-2d

Thanks,
Sebastian

No, that's not at all what I said.

Matching identical outputs to FP tests makes no sense because there's
*always* an error bar.

The output of O0, O1, O2, O3, Ofast, Os, Oz should all be within the
boundaries of an average and its associated error bar.

By understanding what's the *expected* output and its associated error
range we can accurately predict what will be the correct
reference_output and the tolerance for each individual test.

Your solution 2 "works" because you're doing the matching yourself, in
the code, and for that, you pay the penalty of running it twice. But
it's not easy to control the tolerance, nor is it stable for all
platforms where we don't yet run the test suite.

My original proposal, and what I'm still proposing here, is to
understand the tests and make them right, by giving them proper
references and tolerances. If the output is too large, reduce/sample
in a way that doesn't increase the error ranges too much, enough to
keep the tolerance low, so we can still catch bugs in the FP
transformations.

cheers,
--renato

The code before is in the comments: we know exactly what Tobi has changed.
Most of these changes are in the initialization of the arrays, though
there are also changes to the computational kernel.

Polybench was designed to stress loop optimizations in the polyhedral model.
The intent of adding Polybench to the test-suite was to stress loop
optimizations in Polly.
Those initial changes by Tobi reflect this intent: neither the FP
computation nor the initial values matter much.
I would appreciate if Tobi could share his point of view on Polybench:
I added him to the CC list.

We are currently trying to modify Polybench to test something
different from what it was designed for.
This goes along with my earlier comment about the SPEC benchmarks:
there are benchmarks that have been designed to test FP computations.
If we need more FP benchmarks in the test-suite, we should try to
identify and add benchmarks in which "FP expert" people put thought in
correctly designing the tests to check FP computations.

Sebastian

Correct me if I misunderstood: you would be ok changing the
reference output to exactly match the output of "-O0 -ffp-contract=off".

No, that's not at all what I said.

Thanks for clarifying your previous statement: I stand corrected.

Matching identical outputs to FP tests makes no sense because there's
*always* an error bar.

Agreed.

The output of O0, O1, O2, O3, Ofast, Os, Oz should all be within the
boundaries of an average and its associated error bar.

Agreed.

By understanding what's the *expected* output and its associated error
range we can accurately predict what will be the correct
reference_output and the tolerance for each individual test.

Agreed.

Your solution 2 "works" because you're doing the matching yourself, in
the code, and for that, you pay the penalty of running it twice. But
it's not easy to control the tolerance, nor is it stable for all
platforms where we don't yet run the test suite.

My original proposal, and what I'm still proposing here, is to
understand the tests and make them right, by giving them proper
references and tolerances. If the output is too large, reduce/sample
in a way that doesn't increase the error ranges too much, enough to
keep the tolerance low, so we can still catch bugs in the FP
transformations.

This goes in the same direction as what you said earlier in:

To simplify the analysis, you can reduce the output into a single
number, say, adding all the results up. This will generate more
inaccuracies than comparing each value, and if that's too large an
error, then you reduce the number of samples.

For example, on cholesky, we sampled every 16th item of the array:

for (i = 0; i < n; i++) {
   for (j = 0; j < n; j++)
     print_element(A[i][j], j*16, printmat);
   fputs(printmat, stderr);
}

Wrt "we sampled every 16th item of the array", not really in that test,
but I get your point:

k = 0;
for (i = 0; i < n; i++) {
   for (j = 0; j < n; j+=16) {
     print_element(A[i][j], k, printmat);
     k += 16;
   }
   fputs(printmat, stderr);
}

Ok, let's do this for the 5 benchmarks that do not exactly match.

Thanks,
Sebastian

From: "Renato Golin" <renato.golin@linaro.org>
To: "Sebastian Pop" <sebpop.llvm@gmail.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Sebastian Paul Pop" <s.pop@samsung.com>, "llvm-dev" <llvm-dev@lists.llvm.org>,
"Matthias Braun" <matze@braunis.de>, "Clang Dev" <cfe-dev@lists.llvm.org>, "nd" <nd@arm.com>, "Abe Skolnik"
<a.skolnik@samsung.com>
Sent: Wednesday, October 12, 2016 8:35:16 AM
Subject: Re: [test-suite] making polybench/symm succeed with "-Ofast" and "-ffp-contract=on"

> Correct me if I misunderstood: you would be ok changing the
> reference output to exactly match the output of "-O0
> -ffp-contract=off".

No, that's not at all what I said.

Matching identical outputs to FP tests makes no sense because there's
*always* an error bar.

This is something we need to understand. No, there's not always an error bar. Without FMA formation and without non-IEEE-compliant optimizations (i.e. fast-math), the optimized answer should be identical to the non-optimized answer. If these don't match, then we should understand why. This used to be a large problem because of fp80-related issues on x86 processors, but even on x86 if we stick to SSE (etc.) FP instructions, this is not an issue any more. We still do see cross-system discrepancies sometimes because of differences in denormal handling, but on the same system that should be consistent (aside, perhaps, from compiler-level constant-folding issues).

-Hal