TSVC/Equivalencing-dbl

Hi Hal, I was looking into why this fails with dragonegg, and noticed the
following: if I compile with GCC (-O0) then I get as output:

Running each loop 3125 times...

Loop Time(Sec) Checksum
S421 0.00 32010.620068485
S1421 0.00 16000
S422 0.00 3.7377231414078
S423 0.00 32000.736895702
S424 0.00 32822.36069424

This is the same as the reference output. If I run exactly the same program
under valgrind then I get:

Running each loop 3125 times...

Loop Time(Sec) Checksum
S421 0.00 32010.620068485
S1421 0.00 17208.404325315
S422 0.00 3.7377231414078
S423 0.00 32000.736895702
S424 0.00 32822.36069424

This is the same except for the S1421 line.

When built in the testsuite with dragonegg (which means optimized) I get:

Running each loop 3125 times...

Loop Time(Sec) Checksum
S421 0.00 32010.620068485
S1421 0.00 17208.404325315
S422 0.00 3.7377231414078
S423 0.00 32000.736895702
S424 0.00 32822.36069424

Which is *exactly* the same as when using valgrind!

Interestingly, the main difference between valgrind emulated floating point and
the real behaviour of the processor is that valgrind doesn't support 80 bit
extended precision floating point: it does everything in 64 bits instead. So I
wonder if these differences are basically due to whether operations are going in
and out of memory (-> 64 bits) or using 80 bit precision, or something else that
may change rounding...

Any thoughts?

Ciao, Duncan.

Oops, I ran the testsuite wrong: read clang output for dragonegg output.

From: "Duncan Sands" <duncan.sands@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu
Sent: Friday, October 5, 2012 12:10:03 PM
Subject: Re: TSVC/Equivalencing-dbl

> Oops, I ran the testsuite wrong: read clang output for dragonegg
> output.

Okay, can you resummarize? Do you mean that?

gcc -O0:
S1421 0.00 16000

gcc -O0 under valgrind:
S1421 0.00 17208.404325315

clang:
S1421 0.00 17208.404325315

This is all on Darwin, right?

I would certainly tend to suspect an 80-bit-intermediate issue, but, both gcc and clang give 16000 on PowerPC (which has no 80-bit). It could be a rounding issue, but would Darwin really have a different default rounding mode?

The computation being performed here is [in s1421() in tsc.inc]:
                for (int i = 0; i < LEN/2; i++) {
                        b[i] = xx[i] + a[i];
                }
So *if* we're adding up the same numbers in the same order, the answer should be the same everywhere :wink: Can you put in some print statements and confirm?

Thanks again,
Hal

Hi Hal,

>> From: "Duncan Sands" <duncan.sands@gmail.com>
>> To: "Hal Finkel" <hfinkel@anl.gov>
>> Cc: llvmdev@cs.uiuc.edu
>> Sent: Friday, October 5, 2012 12:10:03 PM
>> Subject: Re: TSVC/Equivalencing-dbl
>>
>> Oops, I ran the testsuite wrong: read clang output for dragonegg
>> output.
>
> Okay, can you resummarize? Do you mean that?
>
> gcc -O0:
> S1421 0.00 16000
>
> gcc -O0 under valgrind:
> S1421 0.00 17208.404325315
>
> clang:
> S1421 0.00 17208.404325315

exactly. For "clang" this is only when building like the testsuite does
(i.e. with link-time optimization + llc): if you directly do:
   clang tsc.c dummy.c -std=gnu99 -O3
then you get 16000.

> This is all on Darwin, right?

No, this is on x86-64 (ubuntu) linux.

> I would certainly tend to suspect an 80-bit-intermediate issue, but, both gcc and clang give 16000 on PowerPC (which has no 80-bit).

Not sure what you are saying here. The issue is that the x86 internally uses
80 bits for the 64-bit (double) type, so as long as everything is in registers
you get lots more precision, but the moment you store to memory only 64 bits
are stored. The fact that gcc and clang give the same on powerpc confirms that
it is coming from x86 using an extra 16 bits of precision beyond what you would
expect.

> It could be a rounding issue, but would Darwin really have a different default rounding mode?

As I'm seeing this on linux, I guess not :slight_smile:

> The computation being performed here is [in s1421() in tsc.inc]:
>                  for (int i = 0; i < LEN/2; i++) {
>                          b[i] = xx[i] + a[i];
>                  }
>
> So *if* we're adding up the same numbers in the same order, the answer should be the same everywhere :wink:

No, why would it be the same everywhere? If the whole thing is done in
double registers, an x86 processor will maintain 80 bits of precision
even though these are 64 bit (double) types, while if things are loaded
and stored to memory at every step instead then only 64 bits will be used.
This can lead to very different results.
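
As a rough illustration of the register-versus-memory effect described above, the following C sketch sums the same series twice: once rounding to a 64-bit double after every addition (forced here with volatile), and once accumulating in long double and rounding only at the end. The function names and the choice of series are illustrative assumptions, not code from TSVC, and whether long double is actually the 80-bit x87 format depends on the compiler and target.

```c
#include <assert.h>

/* Sum 1/(i+1)^2 for i in [0, n), rounding to a 64-bit double after every
 * addition. The volatile qualifier forces each partial sum to be stored,
 * so it cannot stay in a wider register between iterations. */
double sum_rounded_each_step(int n)
{
    assert(n >= 0);
    volatile double acc = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)(i + 1);
        acc = acc + 1.0 / (d * d);
    }
    return acc;
}

/* Same sum, accumulated in long double (80-bit extended on many x86
 * toolchains, but this is platform dependent) and rounded to double
 * only once, at the end. */
double sum_extended(int n)
{
    assert(n >= 0);
    long double acc = 0.0L;
    for (int i = 0; i < n; i++) {
        long double d = (long double)(i + 1);
        acc += 1.0L / (d * d);
    }
    return (double)acc;
}
```

Printing both results with printf("%.17g\n", ...) shows whether and where they differ on a given machine. For a short, well-conditioned sum like this one, any difference is confined to the trailing digits.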

> Can you put in some print statements and confirm?

Not sure what you want me to confirm, but anyway I now have 1/2 an hour to
look into this some more :slight_smile:

Ciao, Duncan.

PS: Here's how I can reproduce with clang on linux:

   clang -S -o tsc.ll -O0 -flto -std=gnu99 tsc.c
   clang -S -o dummy.ll -O0 -flto -std=gnu99 dummy.c
   opt -std-compile-opts tsc.ll -S -o tsc.1.ll
   opt -std-compile-opts dummy.ll -S -o dummy.1.ll
   llvm-link tsc.1.ll dummy.1.ll -S -o total.ll
   opt -std-link-opts total.ll -S -o total.1.ll
   llc total.1.ll
   gcc -o z total.1.s

The program z shows the problem. Note that it is essential to have clang use
-O0 (not -O3).

Ciao, Duncan.

From: "Duncan Sands" <duncan.sands@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu
Sent: Friday, October 5, 2012 2:50:06 PM
Subject: Re: TSVC/Equivalencing-dbl

Hi Hal,

>>> From: "Duncan Sands" <duncan.sands@gmail.com>
>>> To: "Hal Finkel" <hfinkel@anl.gov>
>>> Cc: llvmdev@cs.uiuc.edu
>>> Sent: Friday, October 5, 2012 12:10:03 PM
>>> Subject: Re: TSVC/Equivalencing-dbl
>>>
>>> Oops, I ran the testsuite wrong: read clang output for dragonegg
>>> output.
>>
>> Okay, can you resummarize? Do you mean that?
>>
>> gcc -O0:
>> S1421 0.00 16000
>>
>> gcc -O0 under valgrind:
>> S1421 0.00 17208.404325315
>>
>> clang:
>> S1421 0.00 17208.404325315
>
> exactly. For "clang" this is only when building like the testsuite does
> (i.e. with link-time optimization + llc): if you directly do:
>    clang tsc.c dummy.c -std=gnu99 -O3
> then you get 16000.
>
>> This is all on Darwin, right?
>
> No, this is on x86-64 (ubuntu) linux.

OIC, interesting!

>> I would certainly tend to suspect an 80-bit-intermediate issue,
>> but, both gcc and clang give 16000 on PowerPC (which has no
>> 80-bit).
>
> Not sure what you are saying here. The issue is that the x86 internally
> uses 80 bits for the 64-bit (double) type, so as long as everything is
> in registers you get lots more precision, but the moment you store to
> memory only 64 bits are stored. The fact that gcc and clang give the
> same on powerpc confirms that it is coming from x86 using an extra 16
> bits of precision beyond what you would expect.
>
>> It could be a rounding issue, but would Darwin really have a
>> different default rounding mode?
>
> As I'm seeing this on linux, I guess not :slight_smile:
>
>> The computation being performed here is [in s1421() in tsc.inc]:
>> for (int i = 0; i < LEN/2; i++) {
>>     b[i] = xx[i] + a[i];
>> }
>>
>> So *if* we're adding up the same numbers in the same order, the
>> answer should be the same everywhere :wink:
>
> No, why would it be the same everywhere? If the whole thing is done in
> double registers, an x86 processor will maintain 80 bits of precision
> even though these are 64 bit (double) types, while if things are loaded
> and stored to memory at every step instead then only 64 bits will be
> used. This can lead to very different results.

Right.

>> Can you put in some print statements and confirm?
>
> Not sure what you want me to confirm, but anyway I now have 1/2 an
> hour to look into this some more :slight_smile:

For test s1421, we have:
                for (int i = 0; i < LEN/2; i++) {
                        b[i] = xx[i] + a[i];
                }

In this case, xx is set to the second half of the b array, and a is initialized to 1/(i+1)^2. The b array, however, does not seem to be explicitly initialized for this test. When all of the tests are run in order, it is initialized for the last test in the previous group, s353... so maybe I screwed this up in breaking apart the tests.
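
To make the dependence on b's prior contents concrete, here is a minimal C sketch of the situation described above. The array length (LEN = 32000), the helper names, and the checksum (a plain sum over the second half of b) are assumptions for illustration only; the thread does not state how TSVC computes the S1421 checksum.

```c
#include <assert.h>

#define LEN 32000   /* assumed array length; not stated in the thread */

static double a[LEN];
static double b[LEN];

/* Initialise a[i] = 1/(i+1)^2, as described for this test. */
void init_a(void)
{
    for (int i = 0; i < LEN; i++) {
        double d = (double)(i + 1);
        a[i] = 1.0 / (d * d);
    }
}

/* Hypothetical stand-in for the missing initialisation of b. */
void fill_b(double value)
{
    for (int i = 0; i < LEN; i++) {
        b[i] = value;
    }
}

/* The loop quoted from s1421(), with xx aliasing the second half of b. */
void s1421_kernel(void)
{
    assert(LEN % 2 == 0);
    double *xx = &b[LEN / 2];
    for (int i = 0; i < LEN / 2; i++) {
        b[i] = xx[i] + a[i];
    }
}

/* Assumed checksum: the sum of the second half of b (the xx view). */
double checksum_xx(void)
{
    double sum = 0.0;
    for (int i = LEN / 2; i < LEN; i++) {
        sum += b[i];
    }
    return sum;
}

/* Run the whole sequence with b pre-filled with a given value. */
double run_s1421(double fill)
{
    init_a();
    fill_b(fill);
    s1421_kernel();
    return checksum_xx();
}
```

With these assumed values, run_s1421(1.0) returns exactly 16000, while different starting contents of b generally give a different number. Skipping fill_b altogether would leave the result dependent on whatever an earlier test happened to store in b.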

Thanks again,
Hal

Hi Hal,

To get my understanding right: is this a test-case problem, or is there a problem with x86 code generation? I can spend some time looking into the problem.

Thanks,
Shivaram

Shivaram,

Thanks! I'm double-checking the way in which the arrays are initialized; I'll follow up in the next day or so.

-Hal

Hi,

There was an out-of-bounds array access in the test S1421. This has been fixed, and the TSVC maintainers have uploaded the fix to the TSVC site. With this fix and Hal's fix to properly initialize the arrays in the broken tests, the test should work fine now.

Regards,
Shivaram