-msse3 can degrade performance

I just remembered an anomalous result that I stumbled upon whilst tweaking the
command-line options to llvm-gcc. Specifically, the -msse3 flag does a great
job of improving the performance of floating-point-intensive code on the
SciMark2 benchmark, but it also degrades the performance of the
integer-intensive Monte Carlo part of the test:

$ llvm-gcc -Wall -lm -O3 *.c -o scimark2
$ ./scimark2
Using 2.00 seconds min time per kenel.
Composite Score: 432.84
FFT Mflops: 358.90 (N=1024)
SOR Mflops: 473.45 (100 x 100)
MonteCarlo: Mflops: 210.54
Sparse matmult Mflops: 354.25 (N=1000, nz=5000)
LU Mflops: 767.04 (M=100, N=100)

$ llvm-gcc -Wall -lm -O3 -msse3 *.c -o scimark2
$ ./scimark2
Composite Score: 548.53
FFT Mflops: 609.87 (N=1024)
SOR Mflops: 497.92 (100 x 100)
MonteCarlo: Mflops: 126.62
Sparse matmult Mflops: 604.02 (N=1000, nz=5000)
LU Mflops: 904.19 (M=100, N=100)

The relevant code is:

  double Random_nextDouble(Random R)
  {
      int k;
  
      int I = R->i;
      int J = R->j;
      int *m = R->m;
  
      k = m[I] - m[J];
      if (k < 0) k += m1;
      R->m[J] = k;
  
      if (I == 0)
          I = 16;
      else I--;
      R->i = I;
  
      if (J == 0)
          J = 16 ;
      else J--;
      R->j = J;
  
      if (R->haveRange)
          return R->left + dm1 * (double) k * R->width;
      else
          return dm1 * (double) k;
  
  }

  double MonteCarlo_integrate(int Num_samples)
  {
      Random R = new_Random_seed(SEED);

      int under_curve = 0;
      int count;

      for (count=0; count<Num_samples; count++)
      {
          double x= Random_nextDouble(R);
          double y= Random_nextDouble(R);

          if ( x*x + y*y <= 1.0)
                under_curve ++;
      }

      Random_delete(R);

      return ((double) under_curve / Num_samples) * 4.0;
  }

The -msse3 flag? Does the -msse2 flag have a similar effect?

-Eli

Yes:

$ llvm-gcc -Wall -lm -O3 -msse2 *.c -o scimark2
$ ./scimark2
Composite Score: 525.99
FFT Mflops: 538.35 (N=1024)
SOR Mflops: 472.29 (100 x 100)
MonteCarlo: Mflops: 120.92
Sparse matmult Mflops: 585.14 (N=1000, nz=5000)
LU Mflops: 913.27 (M=100, N=100)

But -msse does not:

$ llvm-gcc -Wall -lm -O3 -msse *.c -o scimark2
$ ./scimark2
Composite Score: 540.08
FFT Mflops: 535.04 (N=1024)
SOR Mflops: 469.99 (100 x 100)
MonteCarlo: Mflops: 197.38
Sparse matmult Mflops: 587.77 (N=1000, nz=5000)
LU Mflops: 910.22 (M=100, N=100)

That was x64 and I get similar results for x86.

Is there some kind of contention between the integer and SSE registers?

Hi Jon,

I'm seeing exactly identical .s files with -msse2 and -msse3 on the scimark version I have. Can you please send the output of:

llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.s
llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.s

llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.ll -emit-llvm
llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.ll -emit-llvm

Thanks,

-Chris

Can I just check that you had noticed that my timings for those (sse2 vs sse3)
were the same and that the difference was occurring between -msse and -msse2
(see below)?

The x86 output for -msse2 and -msse3 is attached (the two give the same
results here too), as well as the output for -O3 and -O3 -msse, which give
different results here. Here are the performance results I just got when
redoing this on x86:

MonteCarlo: Mflops: 212.20 -O3
MonteCarlo: Mflops: 211.37 -O3 -msse
MonteCarlo: Mflops: 123.70 -O3 -msse2
MonteCarlo: Mflops: 127.22 -O3 -msse3

Ok, thanks Jon! I diff'd the files and the -msse2 and -msse3 code is identical, so we're not doing anything wrong with -msse3 :).

OTOH, the perf drop from -msse to -msse2 is concerning. The difference here is that we do double math in SSE regs instead of FPStack regs. In this case, using the fp stack avoids some cross-class register copying. We could improve the code generator to notice and handle this; I added a note to the x86 backend with some details:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20090202/073254.html

This is a long-known issue, but a great example of it.
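
To make the register-class issue concrete, here is a minimal sketch of the call pattern involved. It is not the SciMark2 code: next_unit_double, count_under_curve and the NOINLINE macro are hypothetical names, the generator is a plain linear congruential one, and the comments describe the usual x86-32 convention of returning a double in st(0), with the exact copies left up to the code generator.

```c
/* Hypothetical stand-in for a generator living in another compilation unit:
   NOINLINE (a GCC/Clang extension, assumed available) keeps the call. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Assumes a 32-bit unsigned int. */
static unsigned int lcg_state = 12345u;

NOINLINE double next_unit_double(void)
{
    /* Integer part of the work, done in general-purpose registers. */
    lcg_state = lcg_state * 1103515245u + 12345u;

    /* int -> double conversion and multiply. When double math is done in
       SSE registers, the result is produced in an XMM register, but the
       x86-32 convention hands a double back in st(0), so the value has to
       cross register classes on the way out. */
    return (double)(lcg_state >> 8) * (1.0 / 16777216.0);
}

int count_under_curve(int num_samples)
{
    int under = 0;
    int i;

    for (i = 0; i < num_samples; i++) {
        /* Each result arrives in st(0); doing x*x + y*y in SSE registers
           means moving it across register classes again in the caller. */
        double x = next_unit_double();
        double y = next_unit_double();
        if (x * x + y * y <= 1.0)
            under++;
    }
    return under;
}
```

Compiling a file like this with -S, once with -msse and once with -msse2, and diffing the two .s files should show whether the extra copies appear around the call on a given compiler.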

Two other points of interest:

. I just retimed on x64 and could not reproduce the difference, so this only
afflicts x86, not x64 as well, contrary to what I said previously.

Right, this occurs because of the x86-32 ABI. x86-64 should not be affected.

. Pulling the whole benchmark into a single compilation unit changes the
performance results completely (still x86):

$ llvm-gcc -O3 -msse3 -lm all.c -o all
$ ./all
Composite Score: 570.07
FFT Mflops: 599.40 (N=1024)
SOR Mflops: 476.97 (100 x 100)
MonteCarlo: Mflops: 278.17
Sparse matmult Mflops: 582.54 (N=1000, nz=5000)
LU Mflops: 913.27 (M=100, N=100)
$ gcc -O3 -msse3 -lm all.c -o all
$ ./all
Composite Score: 539.20
FFT Mflops: 516.05 (N=1024)
SOR Mflops: 472.29 (100 x 100)
MonteCarlo: Mflops: 167.25
Sparse matmult Mflops: 633.20 (N=1000, nz=5000)
LU Mflops: 907.20 (M=100, N=100)

Note that llvm-gcc is achieving almost 280 MFLOPS on MonteCarlo here, far
higher than any of its competitors, and it is outperforming gcc overall.

Great! Do you see the same results with LTO? Inlining Random_nextDouble from random.c to MonteCarlo.c should be a big win.

-Chris
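
The single-compilation-unit effect can be sketched as follows. This is a simplified illustration, not SciMark2's Random type: tiny_rng, tiny_next_double and integrate_quarter_circle are hypothetical names and the generator is a plain linear congruential one. The point is only that a static generator in the same file as the loop can be inlined, so its result never has to cross a call boundary.

```c
/* Hypothetical, simplified generator state; assumes a 32-bit unsigned int. */
typedef struct {
    unsigned int state;
} tiny_rng;

/* static: visible only in this file, so the optimizer is free to inline
   it into the loop below and keep the result in a register. */
static double tiny_next_double(tiny_rng *r)
{
    r->state = r->state * 1103515245u + 12345u;
    /* The top 24 bits scaled by 2^-24 give a value in [0, 1). */
    return (double)(r->state >> 8) * (1.0 / 16777216.0);
}

/* Same shape as MonteCarlo_integrate: estimate pi from the fraction of
   random points that fall inside the unit quarter circle.
   num_samples must be greater than zero. */
double integrate_quarter_circle(unsigned int seed, int num_samples)
{
    tiny_rng r;
    int under_curve = 0;
    int count;

    r.state = seed;
    for (count = 0; count < num_samples; count++) {
        double x = tiny_next_double(&r);
        double y = tiny_next_double(&r);
        if (x * x + y * y <= 1.0)
            under_curve++;
    }
    return ((double)under_curve / num_samples) * 4.0;
}
```

Whether the call is actually inlined depends on the compiler and optimization level; inspecting the -S output for a remaining call to tiny_next_double is one way to check.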