benchmarking LLVM

Hi all,

  I took a look at the LLVM benchmarks from the nightly tester and
  ran the Shootout tests on my own. Below are a few observations.

1. The results on my AMD AthlonXP sometimes differ from those on the
Xeon used by the LLVM team. In particular, both Shootout
and Shootout-C++ show a great speedup with LLVM (in
comparison to GCC) on the ackermann test on my AthlonXP.
But here:
  http://llvm.cs.uiuc.edu/testresults/X86/
GCC is reported to be quicker on the
Benchmarks/Shootout-C++/ackermann test.

2. Tests based on function calls (e.g. ackermann, fibonacci)
tend to be quicker with LLVM than with GCC
:slight_smile:

3. Some generic tests (like nested loops) are slower with LLVM
than with GCC. Could anyone check why
Benchmarks/Shootout*/nestedloop is so slow with LLVM?
Nested loops are a quite generic construction, so if this
test is improved it might have a big effect on other
tests too.

> I took a look at the LLVM benchmarks from the nightly tester and
> ran the Shootout tests on my own. Below are a few observations.
>
> 1. The results on my AMD AthlonXP sometimes differ from those on the
> Xeon used by the LLVM team. In particular, both Shootout
> and Shootout-C++ show a great speedup with LLVM (in
> comparison to GCC) on the ackermann test on my AthlonXP.
> But here:
>   http://llvm.cs.uiuc.edu/testresults/X86/
> GCC is reported to be quicker on the
> Benchmarks/Shootout-C++/ackermann test.

This test seems PARTICULARLY noisy for some reason. Take a look at this:

[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.130u 0.010s 0:03.14 100.0% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.280u 0.000s 0:03.28 100.0% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.210u 0.000s 0:03.21 100.0% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.340u 0.010s 0:03.34 100.2% 0+0k 0+0io 196pf+0w

[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
3.890u 0.000s 0:03.89 100.0% 0+0k 0+0io 120pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
3.790u 0.000s 0:03.78 100.2% 0+0k 0+0io 120pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
4.000u 0.010s 0:04.00 100.2% 0+0k 0+0io 120pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
3.830u 0.000s 0:03.83 100.0% 0+0k 0+0io 120pf+0w

... the native version swings from 3.13-3.34s and the llc-ls version
swings from 3.83-4.0s. The C backend is also noisy, swinging from
3.98-4.58s. FWIW, these tests are on an Intel P4 Xeon 3.06GHz machine with
HT enabled.

In any case, it appears that we're slower than GCC on this one. If you
wanted to dig into the test to see what is going wrong (is the LLVM code
missing an optimization, or is it bad code generation?), that would be
quite helpful.

> 2. Tests based on function calls (e.g. ackermann, fibonacci)
> tend to be quicker with LLVM than with GCC
> :slight_smile:

Yeah I guess so! A 31x speedup on ary (66s -> 2.11s for llc-ls), 10x
speedup on ary2, and 4.8x speedup on except is not bad at all. :slight_smile: We
tend to do pretty well on C++ programs though.

> 3. Some generic tests (like nested loops) are slower with LLVM than
> with GCC. Could anyone check why Benchmarks/Shootout*/nestedloop is so
> slow with LLVM? Nested loops are a quite generic construction, so if
> this test is improved it might have a big effect on other tests too.

You're right, this one is pretty bad. Looking at the innermost loop in
the "benchmark":

   for (f=0; f<n; f++)
     x++;

It looks like we're generating this machine code:

.LBBmain_10: # no_exit.5
        mov %EDX, %EBX
        mov %EBP, %EDX
        inc %EBP
        mov %EBX, %EDX
        inc %EBX
        cmp %EBP, %ESI
        jl .LBBmain_10 # no_exit.5

The code looks somewhat reasonable except for the three extra copies in
it. GCC destroys us on this loop though, producing:

.L30:
        incl %eax
        decl %edx
        jne .L30

There are two things we need to get this nice code:

1. The register allocator needs a better coalescing heuristic or a
   better PHI elimination pass to help it. Alkis is the one to talk to
   about this, but he's out of town (should be back soon?). This is a
   serious problem though, and needs to be addressed before 1.3.

   This problem is far reaching (not just limited to this testcase), and
   is a good example of how a (toy) microbenchmark like this can expose a
   real problem.

2. We need to implement a loop indvar reversal pass that counts down with
   the loop induction variable instead of up. That will allow the
   instruction selector to branch on the condition codes produced by the
   'dec' instruction (eliminating the cmp instruction), and allow usage
   of hardware support on PPC and other targets. This optimization is
   related to loop strength reduction which we also do not have yet. Note
   that we do have all of the infrastructure needed to implement this, if
   anyone is interested in doing so.

In this particular loop, we could do much better by just compiling the
loop into:

  x += n;

Doing so would not be hard at all. All we have to do is teach the trip
count analysis code that a loop like 'for (f=0; f<n; f++)' executes 'n'
times and existing optimizations will do the hard work (this will actually
end up turning this benchmark into a constant time operation). Currently
the trip count computation code only handles loops that execute a constant
number of iterations.

This has been on my todo list for a long time, but I've been busy with
many other things. :slight_smile: I'd much rather get at least problem #1 addressed
first in any case.

Thanks for looking into this. We have (obviously) spent very limited time
looking into performance issues, it's great that you're able to do so!

-Chris

> This test seems PARTICULARLY noisy for some reason. Take a look at this:
> [...]
> ... the native version swings from 3.13-3.34s and the llc-ls version
> swings from 3.83-4.0s. The C backend is also noisy, swinging from
> 3.98-4.58s. FWIW, these tests are on an Intel P4 Xeon 3.06GHz machine with
> HT enabled. In any case, it appears that we're slower than GCC on this one.

What is very funny is that the situation on my AMD is not similar
at all to what you describe. On my side the tests show a GREAT speedup
on ackermann, i.e. 2 to 3 times. Sorry, I can't access my
Linux box now to get concrete values.

> If you wanted to dig into the test to see what is going wrong (is the
> LLVM code missing an optimization, or is it bad code generation?),
> that would be quite helpful.

I will only be able to access my Linux box next Friday :frowning:

> 2. Tests based on function calls (e.g. ackermann, fibonacci)
> tend to be quicker with LLVM than with GCC
> :slight_smile:

> Yeah I guess so! A 31x speedup on ary (66s -> 2.11s for llc-ls), 10x
> speedup on ary2, and 4.8x speedup on except is not bad at all. :slight_smile: We
> tend to do pretty well on C++ programs though.

Hm, I see that the nightly tester shows new values now, and again
the results on your side are very different from what I
got here...

The ary* tests were quite similar between GCC and LLVM on my side.

> 1. The register allocator needs a better coalescing heuristic or a
>    better PHI elimination pass to help it. Alkis is the one to talk to
>    about this, but he's out of town (should be back soon?). This is a
>    serious problem though, and needs to be addressed before 1.3.

I'd agree -- if it matters :slight_smile:

> Thanks for looking into this. We have (obviously) spent very limited time
> looking into performance issues, it's great that you're able to do so!

I am sorry I can't help more. I wish more people would join this nice
project.

Still, I am wondering how it can happen that the results are
so different on my side and on yours. I would not expect a very
big difference just because of the AMD <-> Xeon difference...

I'd propose checking exit codes in the benchmarking suite.
Maybe a suspiciously quick result is just an aborted execution?..
(Anyway, I still do not see anything very impressive for the ary*
tests on my side.)

Is there anyone else who could run the Shootout tests as well?

Thank you.

> This test seems PARTICULARLY noisy for some reason. Take a look at this:
> [...]
> ... the native version swings from 3.13-3.34s and the llc-ls version
> swings from 3.83-4.0s. The C backend is also noisy, swinging from
> 3.98-4.58s. FWIW, these tests are on an Intel P4 Xeon 3.06GHz machine with
> HT enabled. In any case, it appears that we're slower than GCC on this one.

> What is very funny is that the situation on my AMD is not similar
> at all to what you describe. On my side the tests show a GREAT speedup
> on ackermann, i.e. 2 to 3 times. Sorry, I can't access my
> Linux box now to get concrete values.

It's quite possible that it is the difference between the two chips. The
microarchitectures are very different between the two. If there is
something else going on here we do want to know about it though...

> If you
> wanted to dig into the test to see what is going wrong (is the LLVM code
> missing an optimization, or is it bad code generation?), that would be
> quite helpful.

> I will only be able to access my Linux box next Friday :frowning:

LLVM works under Cygwin and Mac OS X. If you have access to either of
those, it should work for you :slight_smile:

> Thanks for looking into this. We have (obviously) spent very limited time
> looking into performance issues, it's great that you're able to do so!

> I am sorry I can't help more. I wish more people would join this nice
> project.

Everything helps. :slight_smile:

> Still, I am wondering how it can happen that the results are
> so different on my side and on yours. I would not expect a very
> big difference just because of the AMD <-> Xeon difference...

I'm not sure. I wouldn't expect such a big difference either, but if you are
using the same llvmgcc and llc as we are (CVS HEAD) you should be running
exactly the same machine code.

> I'd propose checking exit codes in the benchmarking suite.
> Maybe a suspiciously quick result is just an aborted execution?..
> (Anyway, I still do not see anything very impressive for the ary*
> tests on my side.)
>
> Is there anyone else who could run the Shootout tests as well?

I don't understand what you're saying here, can you please rephrase it?

-Chris