dragonegg svn benchmarks

The Polyhedron 2005 benchmark results for dragonegg svn at r141492
using FSF gcc 4.6.2svn measured on x86_64-apple-darwin11 are listed below.
The benchmarks used the optimization flags...

-msse4 -ffast-math -funroll-loops -O3

in all cases. The use of -fplugin-arg-dragonegg-enable-gcc-optzns to allow
for autovectorization from the FSF gcc front-end produces only a single run-time
regression, fatigue, which is PR10892.
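
For reference, dragonegg was loaded as a gcc plugin in the usual way, i.e.
roughly (plugin path assumed)...

  gfortran-fsf-4.6 -fplugin=/path/to/dragonegg.so -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n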

Run time

Benchmark gfortran dragonegg dragonegg+optnz

Hi Jack,

> The Polyhedron 2005 benchmark results for dragonegg svn at r141492
> using FSF gcc 4.6.2svn measured on x86_64-apple-darwin11 are listed below.
> The benchmarks used the optimization flags...
>
>   -msse4 -ffast-math -funroll-loops -O3
>
> in all cases. The use of -fplugin-arg-dragonegg-enable-gcc-optzns to allow
> for autovectorization from the FSF gcc front-end produces only a single run-time
> regression, fatigue, which is PR10892.

thanks for these numbers. I suggest you also try -O4. This does heavier LLVM
optimization when used with -fplugin-arg-dragonegg-enable-gcc-optzns, and typically
seems to result in faster code. You can also use -O6, which does even more
LLVM optimization, but seems to slow things down (I haven't analysed why yet).

Ciao, Duncan.

PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
the following levels:

Command line option   LLVM optimizers run at
-------------------   ----------------------
-O1                   tiny amount of optimization
-O2 or -O3            -O1
-O4 or -O5            -O2
-O6 or better         -O3
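
So, for example, an invocation along these lines (plugin path and file name
assumed) runs the GCC optimizers at -O3 and the LLVM optimizers at -O2:

  gfortran -fplugin=./dragonegg.so -fplugin-arg-dragonegg-enable-gcc-optzns -O4 test.f90 -o test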

Hi Duncan,

Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.

-Chris

Hi Chris,

> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
> the following levels:
>
> Command line option   LLVM optimizers run at
> -------------------   ----------------------
> -O1                   tiny amount of optimization
> -O2 or -O3            -O1
> -O4 or -O5            -O2
> -O6 or better         -O3
>
> Hi Duncan,
>
> Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.

note that this is done only when the GCC optimizers are run. The basic
observation is that running the LLVM optimizers at -O3 after running the
GCC optimizers (at -O3) results in slower code! I mean slower than what
you get by running the LLVM optimizers at -O1 or -O2. I haven't found time
to analyse this curiosity yet. It might simply be that the LLVM inlining
level is too high given that inlining has already been done by GCC. Anyway,
I didn't want to run LLVM at -O3 because of this. The next question was:
which is better, LLVM at -O1 or at -O2? My first experiments showed that
code quality was essentially the same. Since at -O1 you get a nice compile
time speedup, I settled on using -O1. Also -O1 makes some sense if the GCC
optimizers did a good job and all that is needed is to clean up the mess that
converting to LLVM IR can produce. However, later experiments showed that -O2
does seem to consistently result in slightly better code, so I've been thinking
of using -O2 instead. This is one reason I encouraged Jack to use -O4 in his
benchmarks (i.e. GCC at -O3, LLVM at -O2) - to see if they show the same thing.

Ciao, Duncan.

PS: Dragonegg is a nice platform for understanding what the GCC optimizers
do better than LLVM. It's a pity no-one seems to have used it for this.

On Wed, Oct 12, 2011 at 12:40 AM, Duncan Sands wrote:

> The basic observation is that running the LLVM optimizers at -O3 after running
> the GCC optimizers (at -O3) results in slower code! I mean slower than what
> you get by running the LLVM optimizers at -O1 or -O2. I haven't found time
> to analyse this curiosity yet. It might simply be that the LLVM inlining
> level is too high given that inlining has already been done by GCC. Anyway,
> I didn't want to run LLVM at -O3 because of this.

If you inline too much you will get slower code because you make
poorer use of the instruction cache in most modern processors.

C99 and C++ allow one to declare a function inline at the point where it
is defined. For earlier C standards, I believe GCC has an attribute
that allows one to mark a function inline at its definition as a
language extension.
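
Something like this, if I remember the GCC spellings correctly:

   /* C99 / C++: the inline keyword goes on the definition */
   static inline int square(int x) { return x * x; }

   /* GCC extension usable in pre-C99 code: __inline__, optionally
      combined with always_inline to force inlining of every call */
   static __inline__ __attribute__((always_inline)) int cube(int x)
   {
       return x * x * x;
   }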

Lots of other languages do inlining, for example I understand Java
JITs will inline JIT-compiled native code even though the Java
language itself doesn't support inlining.

For modern processors with code caches it would be better to inline
functions at the point they are used rather than where they are
declared. That way one has the choice of better cache usage or
avoiding function call overhead. For example, with a hypothetical
inline-at-the-call-site syntax:

int foo( float bar );

int baz( void )
{
   return foo( 3 ) inline; // This call will be fast
}

int boo( void )
{
   return foo( 5 ); // This will make a hot spot at foo's definition
}

Profile-guided optimization could take care of this without needing
any language extensions. I understand that this is what the Java
HotSpot JIT does.
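
With GCC that is roughly the -fprofile-generate / -fprofile-use dance
(file names here are made up):

   gcc -O2 -fprofile-generate hot.c -o hot   # instrumented build
   ./hot < training-input                    # run to collect profile data
   gcc -O2 -fprofile-use hot.c -o hot        # rebuild guided by the profile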

Don Quixote

> Hi Chris,
>
>> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
>> the following levels:
>>
>> Command line option   LLVM optimizers run at
>> -------------------   ----------------------
>> -O1                   tiny amount of optimization
>> -O2 or -O3            -O1
>> -O4 or -O5            -O2
>> -O6 or better         -O3
>>
>> Hi Duncan,
>>
>> Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.
>
> note that this is done only when the GCC optimizers are run. The basic
> observation is that running the LLVM optimizers at -O3 after running the
> GCC optimizers (at -O3) results in slower code! I mean slower than what
> you get by running the LLVM optimizers at -O1 or -O2. I haven't found time
> to analyse this curiosity yet. It might simply be that the LLVM inlining
> level is too high given that inlining has already been done by GCC. Anyway,
> I didn't want to run LLVM at -O3 because of this. The next question was:
> which is better, LLVM at -O1 or at -O2? My first experiments showed that
> code quality was essentially the same. Since at -O1 you get a nice compile
> time speedup, I settled on using -O1. Also -O1 makes some sense if the GCC
> optimizers did a good job and all that is needed is to clean up the mess that
> converting to LLVM IR can produce. However, later experiments showed that -O2
> does seem to consistently result in slightly better code, so I've been thinking
> of using -O2 instead. This is one reason I encouraged Jack to use -O4 in his
> benchmarks (i.e. GCC at -O3, LLVM at -O2) - to see if they show the same thing.

Duncan,
   My preliminary runs of the pb05 benchmarks at -O4, -O5 and -O6 using
-fplugin-arg-dragonegg-enable-gcc-optzns didn't show any significant run-time
performance changes compared to -fplugin-arg-dragonegg-enable-gcc-optzns -O3.
I'll rerun those and post the tabulated results this weekend. I am using
-ffast-math -funroll-loops as well in the optimization flags. Perhaps I should
repeat the benchmarks without those flags.
   IMHO, the more important thing is to fish out the remaining regressions
in the llvm vectorization code by defaulting -fplugin-arg-dragonegg-enable-gcc-optzns
on in dragonegg svn once llvm 3.0 has branched. Hopefully this will get us wider
testing of the llvm vectorization support and some additional smaller test cases
that expose the remaining bugs in that code.
              Jack

The Polyhedron 2005 benchmark results for dragonegg svn at r141775
using FSF gcc 4.6.2svn measured on x86_64-apple-darwin11 are listed below.
The benchmarks used the optimization flags...

a) gfortran-fsf-4.6 -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n
b) de-gfortran46 -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n
c) de-gfortran46 -msse4 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
d) de-gfortran46 -msse4 -ffast-math -funroll-loops -O4 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
e) de-gfortran46 -msse4 -ffast-math -funroll-loops -O5 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
f) de-gfortran46 -msse4 -ffast-math -funroll-loops -O6 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

and no run-time regressions were observed in any of the cases.

Run time (seconds)

Benchmark gfortran dragonegg de+optnz de+optnz+O4 de+optnz+O5 de+optnz+O6

Hi Jack,

> IMHO, the more important thing is to fish out the remaining regressions
> in the llvm vectorization code by defaulting -fplugin-arg-dragonegg-enable-gcc-optzns
> on in dragonegg svn once llvm 3.0 has branched. Hopefully this will get us wider
> testing of the llvm vectorization support and some additional smaller test cases
> that expose the remaining bugs in that code.

turning on the GCC optimizers by default essentially means giving up on the LLVM
IR optimizers: one way of reading your benchmark results is that the LLVM IR
optimizers don't do anything useful that the GCC optimizers haven't done
already. The fact that LLVM -O3 and -O2 don't produce better code than -O1
suggests that all that is needed is a little bit of optimization to clean up
the inevitable messy bits produced by the gimple -> LLVM IR conversion, but
that otherwise GCC already did all the interesting transforms. Should this be
considered an LLVM bug or a dragonegg feature?

An LLVM bug: if the GCC optimizers work better than LLVM's then LLVM should be
improved until LLVM's are better. Turning on the GCC optimizers by default just
hides the weaknesses of LLVM's optimizers, and reduces the pressure to improve
things.

A dragonegg feature: users want their code to run fast. Turning on the GCC
optimizers results in faster code, ergo the GCC optimizers should be turned
on by default. That way you get faster compile times and fast code.

I have some sympathy for both viewpoints...

Ciao, Duncan.

> Hi Jack,
>
>> IMHO, the more important thing is to fish out the remaining regressions
>> in the llvm vectorization code by defaulting -fplugin-arg-dragonegg-enable-gcc-optzns
>> on in dragonegg svn once llvm 3.0 has branched. Hopefully this will get us wider
>> testing of the llvm vectorization support and some additional smaller test cases
>> that expose the remaining bugs in that code.
>
> turning on the GCC optimizers by default essentially means giving up on the LLVM
> IR optimizers: one way of reading your benchmark results is that the LLVM IR
> optimizers don't do anything useful that the GCC optimizers haven't done
> already. The fact that LLVM -O3 and -O2 don't produce better code than -O1
> suggests that all that is needed is a little bit of optimization to clean up
> the inevitable messy bits produced by the gimple -> LLVM IR conversion, but
> that otherwise GCC already did all the interesting transforms. Should this be
> considered an LLVM bug or a dragonegg feature?
>
> An LLVM bug: if the GCC optimizers work better than LLVM's then LLVM should be
> improved until LLVM's are better. Turning on the GCC optimizers by default just
> hides the weaknesses of LLVM's optimizers, and reduces the pressure to improve
> things.
>
> A dragonegg feature: users want their code to run fast. Turning on the GCC
> optimizers results in faster code, ergo the GCC optimizers should be turned
> on by default. That way you get faster compile times and fast code.

Duncan,
    My main concern is that we test the vectorization support in llvm as hard as
possible post llvm 3.0. Considering that llvm is unlikely to get autovectorization
support in the near term, it seems that FSF gcc/dragonegg is the best approach
to hunting for vectorization issues in llvm. Might we be able to split the difference
here and create a variant of -fplugin-arg-dragonegg-enable-gcc-optzns which only
enables the limited set of FSF gcc optimizations (like -ftree-vectorize) required
to enable FSF gcc's autovectorization under dragonegg? For instance, couldn't
dragonegg just honor -ftree-vectorize when it, or -O3, is passed as a compiler flag?
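
That is, something like this (an illustration of the proposed behaviour, not
what dragonegg currently does)...

  de-gfortran46 -msse4 -ffast-math -funroll-loops -O2 -ftree-vectorize %n.f90 -o %n

...where only GCC's vectorizer runs on the gimple and everything else is left
to the LLVM optimizers.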
                      Jack

> Hi Jack,
>
>> IMHO, the more important thing is to fish out the remaining regressions
>> in the llvm vectorization code by defaulting -fplugin-arg-dragonegg-enable-gcc-optzns
>> on in dragonegg svn once llvm 3.0 has branched. Hopefully this will get us wider
>> testing of the llvm vectorization support and some additional smaller test cases
>> that expose the remaining bugs in that code.
>
> turning on the GCC optimizers by default essentially means giving up on the LLVM
> IR optimizers: one way of reading your benchmark results is that the LLVM IR
> optimizers don't do anything useful that the GCC optimizers haven't done
> already. The fact that LLVM -O3 and -O2 don't produce better code than -O1
> suggests that all that is needed is a little bit of optimization to clean up
> the inevitable messy bits produced by the gimple -> LLVM IR conversion, but
> that otherwise GCC already did all the interesting transforms. Should this be
> considered an LLVM bug or a dragonegg feature?
>
> An LLVM bug: if the GCC optimizers work better than LLVM's then LLVM should be
> improved until LLVM's are better. Turning on the GCC optimizers by default just
> hides the weaknesses of LLVM's optimizers, and reduces the pressure to improve
> things.
>
> A dragonegg feature: users want their code to run fast. Turning on the GCC
> optimizers results in faster code, ergo the GCC optimizers should be turned
> on by default. That way you get faster compile times and fast code.

> Duncan,
>     My main concern is that we test the vectorization support in llvm as hard as
> possible post llvm 3.0. Considering that llvm is unlikely to get autovectorization
> support in the near term,

I just came across this: http://www.cdl.uni-saarland.de/projects/wfv/ --
It says that it will be released in the "near future".

Also, Intel's ISPC (http://ispc.github.com/) generates vector
instructions.

[I don't have an opinion on the default dragonegg options].

-Hal

Duncan, et al.,

I am interested in getting dragonegg to work on PowerPC. Obviously the
stuff in src/x86 needs to be replaced/replicated for PowerPC, but if you
have a few minutes, can you provide your thoughts on what has to be
changed between x86 and PPC?

Thanks in advance,
Hal

Hi Hal,

> I am interested in getting dragonegg to work on PowerPC. Obviously the
> stuff in src/x86 needs to be replaced/replicated for PowerPC, but if you
> have a few minutes, can you provide your thoughts on what has to be
> changed between x86 and PPC?

you should probably start by doing this: copy gcc/config/rs6000/llvm-rs6000.cpp
to (in the dragonegg source) src/ppc/Target.cpp. Extract the LLVM bits of
rs6000.h into include/ppc/dragonegg/Target.h. Be inspired by the corresponding
x86 Target.cpp and Target.h. Try to compile.

Ciao, Duncan.

>> I am interested in getting dragonegg to work on PowerPC. Obviously the
>> stuff in src/x86 needs to be replaced/replicated for PowerPC, but if you
>> have a few minutes, can you provide your thoughts on what has to be
>> changed between x86 and PPC?
>
> you should probably start by doing this: copy gcc/config/rs6000/llvm-rs6000.cpp
> to (in the dragonegg source) src/ppc/Target.cpp. Extract the LLVM bits of
> rs6000.h into include/ppc/dragonegg/Target.h. Be inspired by the corresponding
> x86 Target.cpp and Target.h. Try to compile.

I meant: copy from llvm-gcc.
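
In other words, roughly (tree layouts assumed)...

  cp llvm-gcc/gcc/config/rs6000/llvm-rs6000.cpp dragonegg/src/ppc/Target.cpp

...with the LLVM-specific bits of rs6000.h then extracted by hand into
dragonegg/include/ppc/dragonegg/Target.h.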

Ciao, Duncan.