Performance vs other VMs

The release of a new code generator in Mono 2.2 prompted me to benchmark the
performance of various VMs using the SciMark2 benchmark on an 8x 2.1GHz
64-bit Opteron, and I have published the results here:

  http://flyingfrogblog.blogspot.com/2009/01/mono-22.html

The LLVM results were generated using llvm-gcc 4.2.1 on the C version of
SciMark2 with the following command-line options:

  llvm-gcc -Wall -lm -O2 -funroll-loops *.c -o scimark2

Mono was up to 12x slower than LLVM before and is now only 2.2x slower on
average. Interestingly, the JVM scores slightly higher than LLVM on this
benchmark on average and beats LLVM on two of the five individual tests.

The individual scores are particularly enlightening. Specifically:

- LLVM outperforms all other VMs by a significant margin on FFT, Monte Carlo
  and sparse matrix multiply.

- LLVM is beaten by the JVM on successive over-relaxation (SOR) and LU
  decomposition.

In the context of the SOR test, I suspect the JVM is using alias information
to perform optimizations that LLVM and llvm-gcc probably do not.
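
For concreteness, here is a simplified sketch (an illustration, not the exact
SciMark2 source) of the kind of stencil loop the SOR kernel runs. Each update
of G[i][j] reads the rows above and below, and unless the compiler can prove
that those rows do not alias the row being written, it has to be conservative
about caching or reordering the loads:

  /* Simplified SOR-style sweep (illustrative sketch, not the SciMark2
     source).  Without aliasing information the compiler must assume that
     a store through Gi may also modify Gim1 or Gip1. */
  void sor_sweep(int M, int N, double omega, double **G)
  {
      double omega_over_four = omega * 0.25;
      double one_minus_omega = 1.0 - omega;
      int i, j;

      for (i = 1; i < M - 1; i++) {
          double *Gi   = G[i];
          double *Gim1 = G[i - 1];
          double *Gip1 = G[i + 1];
          for (j = 1; j < N - 1; j++)
              Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
                    + one_minus_omega * Gi[j];
      }
  }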

I am not sure what causes the performance discrepancy on LU. Perhaps the JVM
is generating SSE instructions. Does llvm-gcc generate SSE instructions under
any circumstances?
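
One way to check would be to compile a single kernel to assembly and look for
SSE mnemonics (a rough recipe; LU.c refers to the file of that name in the
SciMark2 C distribution). Scalar double arithmetic shows up as addsd/mulsd,
packed (vectorised) arithmetic as addpd/mulpd:

  llvm-gcc -S -O2 -funroll-loops LU.c
  grep -E 'mul[sp]d|add[sp]d' LU.s

Note that on x86-64 SSE scalar instructions are already the default for
floating point, so the interesting question is probably whether either
compiler emits the packed forms here.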

Interesting, but can you add plain C compiled with the good old-fashioned GCC
or similar to serve as a point of reference as well?

This is the highest composite score I have been able to get with gcc 4.3.2:

$ gcc -Wall -lm -O3 -march=barcelona -funroll-all-loops *.c -o scimark2
$ ./scimark2
Composite Score: 708.63
FFT Mflops: 573.76 (N=1024)
SOR Mflops: 481.74 (100 x 100)
MonteCarlo: Mflops: 129.06
Sparse matmult Mflops: 775.57 (N=1000, nz=5000)
LU Mflops: 1583.00 (M=100, N=100)

One reason is, perhaps, that the version of llvm-gcc that I am using does not
recognise -march=barcelona for this CPU but gcc does.

This is not quite a fair comparison. The other virtual machines must be doing
garbage collection, while the LLVM version, being compiled from C, benefits
from manual memory management.

Here is a run of scimark2 with verbose GC enabled. You'll see that there are
two garbage collection cycles totalling around 0.003 seconds, and that both
happen before the timer starts running. There is almost no dynamic memory
allocation in this code, and modern garbage collectors are also very efficient
(sometimes better than hand deallocation).

java -verbose:gc jnt/scimark2/commandline
[GC 511K->202K(1984K), 0.0018845 secs]
[GC 714K->415K(1984K), 0.0015513 secs]

SciMark 2.0a

Composite Score: 327.3062235870194
FFT (1024): 127.42845375506063
SOR (100x100): 677.3128255261597
Monte Carlo : 29.4337095721763
Sparse matmult (N=1000, nz=5000): 300.2107071278524
LU (100x100): 502.14542195384803

java.vendor: Apple Inc.
java.version: 1.5.0_16
os.arch: i386
os.name: Mac OS X
os.version: 10.5.6

That is an insignificant advantage in this particular case (SciMark2): the
memory for each test is preallocated and is not part of the measurement, and
the heap and stack are both tiny during the computations, so there is little
for a collector to traverse.
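
To make the structure concrete, the measurement pattern looks roughly like
this (a hypothetical sketch, not the actual SciMark2 harness): all allocation
happens before the timer starts, so neither malloc/free cost nor a GC cycle
shows up in the reported score.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Hypothetical sketch of the measurement pattern: setup and teardown,
     including all memory allocation, sit outside the timed region. */
  int main(void)
  {
      int N = 1000000, i;
      double *x = malloc(N * sizeof *x);      /* untimed setup */
      double sum = 0.0;
      clock_t start, stop;

      for (i = 0; i < N; i++)
          x[i] = (double)i;

      start = clock();                        /* only the kernel is timed */
      for (i = 0; i < N; i++)
          sum += x[i] * x[i];
      stop = clock();

      printf("kernel time: %.6f s (sum = %g)\n",
             (double)(stop - start) / CLOCKS_PER_SEC, sum);

      free(x);                                /* untimed teardown */
      return 0;
  }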

I am interested in the comparative results for LLVM because I consider them
to represent how fast my LLVM-based VM might be compared to other
garbage-collected VMs.

However, LLVM has a serious disadvantage compared to the other VMs here
because it has no aliasing guarantees to work with. For example, it does not
know that the row subarrays in the successive over-relaxation test cannot
overlap.
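
One way to express that in C is to restrict-qualify the row pointers in the
inner loop. The sketch below shows where the qualifiers would go (an
illustration of the idea rather than the exact SciMark2 code; it needs
-std=c99 or the GNU __restrict spelling):

  /* Illustrative sketch: restrict-qualified locals assert that, within
     this block, each row is accessed only through its own pointer and
     therefore cannot overlap the others, so loads from Gim1/Gip1 need
     not be reissued after every store through Gi. */
  double * restrict Gi   = G[i];
  double * restrict Gim1 = G[i - 1];
  double * restrict Gip1 = G[i + 1];

  for (j = 1; j < N - 1; j++)
      Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
            + one_minus_omega * Gi[j];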

The LLVM 2.1 release notes say that llvm-gcc got alias analysis and understood
the "restrict" keyword, but when I add it to the C code for SciMark2 it makes
no difference. Can anyone else get this to work?

It works for me. LLVM doesn't yet perform many of the optimizations that
typically benefit from this type of information being available, though.

Dan