I just remembered an anomalous result that I stumbled upon whilst tweaking the
command-line options to llvm-gcc. Specifically, the -msse3 flag does a great
job improving the performance of floating point intensive code on the
SciMark2 benchmark but it also degrades the performance of the int-intensive
Monte Carlo part of the test:
$ llvm-gcc -Wall -lm -O3 *.c -o scimark2
$ ./scimark2
Using 2.00 seconds min time per kernel.
Composite Score: 432.84
FFT Mflops: 358.90 (N=1024)
SOR Mflops: 473.45 (100 x 100)
MonteCarlo: Mflops: 210.54
Sparse matmult Mflops: 354.25 (N=1000, nz=5000)
LU Mflops: 767.04 (M=100, N=100)
$ llvm-gcc -Wall -lm -O3 -msse3 *.c -o scimark2
$ ./scimark2
Composite Score: 548.53
FFT Mflops: 609.87 (N=1024)
SOR Mflops: 497.92 (100 x 100)
MonteCarlo: Mflops: 126.62
Sparse matmult Mflops: 604.02 (N=1000, nz=5000)
LU Mflops: 904.19 (M=100, N=100)
The relevant code is:
double Random_nextDouble(Random R)
{
    int k;
    int I = R->i;
    int J = R->j;
    int *m = R->m;

    k = m[I] - m[J];
    if (k < 0) k += m1;
    R->m[J] = k;

    if (I == 0)
        I = 16;
    else
        I--;
    R->i = I;

    if (J == 0)
        J = 16;
    else
        J--;
    R->j = J;

    if (R->haveRange)
        return R->left + dm1 * (double) k * R->width;
    else
        return dm1 * (double) k;
}
double MonteCarlo_integrate(int Num_samples)
{
    Random R = new_Random_seed(SEED);
    int under_curve = 0;
    int count;

    for (count = 0; count < Num_samples; count++)
    {
        double x = Random_nextDouble(R);
        double y = Random_nextDouble(R);
        if (x*x + y*y <= 1.0)
            under_curve++;
    }

    Random_delete(R);
    return ((double) under_curve / Num_samples) * 4.0;
}
The -msse3 flag? Does the -msse2 flag have a similar effect?
-Eli
Yes:
$ llvm-gcc -Wall -lm -O3 -msse2 *.c -o scimark2
$ ./scimark2
Composite Score: 525.99
FFT Mflops: 538.35 (N=1024)
SOR Mflops: 472.29 (100 x 100)
MonteCarlo: Mflops: 120.92
Sparse matmult Mflops: 585.14 (N=1000, nz=5000)
LU Mflops: 913.27 (M=100, N=100)
But -msse does not:
$ llvm-gcc -Wall -lm -O3 -msse *.c -o scimark2
$ ./scimark2
Composite Score: 540.08
FFT Mflops: 535.04 (N=1024)
SOR Mflops: 469.99 (100 x 100)
MonteCarlo: Mflops: 197.38
Sparse matmult Mflops: 587.77 (N=1000, nz=5000)
LU Mflops: 910.22 (M=100, N=100)
That was x64 and I get similar results for x86.
Is there some kind of contention between the integer and SSE registers?
Hi Jon,
I'm seeing exactly identical .s files with -msse2 and -msse3 on the scimark version I have. Can you please send the output of:
llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.s
llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.s
llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.ll -emit-llvm
llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.ll -emit-llvm
Thanks,
-Chris
Can I just check that you had noticed that my timings for those (sse2 vs sse3) were the same, and that the difference was occurring between -msse and -msse2 (see below)?
The x86 output for those is attached (they give the same results here too), along with the output for -O3 and -O3 -msse, which give different results here. Here are the performance results I just got when redoing this on x86:
MonteCarlo: Mflops: 212.20 -O3
MonteCarlo: Mflops: 211.37 -O3 -msse
MonteCarlo: Mflops: 123.70 -O3 -msse2
MonteCarlo: Mflops: 127.22 -O3 -msse3
Ok, thanks Jon! I diff'd the files and the -msse2 and -msse3 code is identical, so we're not doing anything wrong with -msse3 :).
OTOH, the perf drop from sse -> sse2 is concerning. The difference here is that we do double math in SSE regs instead of FPStack regs. In this case, using the fp stack avoids some cross-class register copying. We could improve the code generator to notice and handle this, I added this note to the x86 backend with some details:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20090202/073254.html
This is a long-known issue, but a great example of it.
Two other points of interest:
. I just retimed on x64 and could not reproduce the difference, so this only
afflicts x86 and not x64, contrary to what I said previously.
Right, this occurs because of the x86-32 ABI. x86-64 should not be affected.
. Pulling the whole benchmark into a single compilation unit changes the
performance results completely (still x86):
$ llvm-gcc -O3 -msse3 -lm all.c -o all
$ ./all
Composite Score: 570.07
FFT Mflops: 599.40 (N=1024)
SOR Mflops: 476.97 (100 x 100)
MonteCarlo: Mflops: 278.17
Sparse matmult Mflops: 582.54 (N=1000, nz=5000)
LU Mflops: 913.27 (M=100, N=100)
$ gcc -O3 -msse3 -lm all.c -o all
$ ./all
Composite Score: 539.20
FFT Mflops: 516.05 (N=1024)
SOR Mflops: 472.29 (100 x 100)
MonteCarlo: Mflops: 167.25
Sparse matmult Mflops: 633.20 (N=1000, nz=5000)
LU Mflops: 907.20 (M=100, N=100)
Note that llvm-gcc is achieving almost 280 Mflops on MonteCarlo here, far
higher than any of its competitors, and it is outperforming gcc overall.
Great! Do you see the same results with LTO? Inlining Random_nextDouble from random.c into MonteCarlo.c should be a big win.
-Chris