-fplugin-arg-dragonegg-enable-gcc-optzns status

Current dragonegg svn has all of the -fplugin-arg-dragonegg-enable-gcc-optzns bugs for
usage with -ffast-math -O3 addressed except for those related to PR2314. Using the -fno-tree-vectorize
option, we can evaluate the current state of -fplugin-arg-dragonegg-enable-gcc-optzns with
the Polyhedron 2005 benchmarks compared to stock dragonegg and stock gcc 4.5.4. The runtime
benchmarks below show that we average slightly faster than stock gcc 4.5.4 and significantly
faster than stock dragonegg through the use of -fplugin-arg-dragonegg-enable-gcc-optzns.

x86_64 darwin

A) gcc 4.5.4svn using -msse3 -ffast-math -O3 -fno-tree-vectorize
B) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns
C) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize

Run time (seconds)

Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg

ac           9.58          9.13               12.30
aermod      20.88         16.10               17.62
air          6.16          6.59                7.70
capacita    35.68         39.94               46.22
channel      2.03          2.04                1.96
doduc       28.28         28.43               30.41
fatigue      8.13          7.19               10.40
gas_dyn     10.10          9.83               11.73
induct      20.17         20.76               48.76
linpk       15.42         15.65               15.69
mdbx        11.42         11.73               12.07
nf          27.99         28.60               29.39
protein     38.36         39.08               39.98
rnflow      27.28         28.19               31.90
test_fpu    11.43         11.17               11.50
tfft         1.91          1.95                2.16

Mean        12.72         12.62               14.71
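
The Mean row of the run-time table above is not labelled as arithmetic or geometric. As a quick check (a sketch only, using the column A figures copied from the table), the geometric mean reproduces the reported 12.72, while the arithmetic mean comes out near 17.18:

```python
import math

# Column A) run times in seconds, copied from the run-time table above.
stock_gcc = [9.58, 20.88, 6.16, 35.68, 2.03, 28.28, 8.13, 10.10,
             20.17, 15.42, 11.42, 27.99, 38.36, 27.28, 11.43, 1.91]


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


arithmetic = sum(stock_gcc) / len(stock_gcc)
geometric = geometric_mean(stock_gcc)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"geometric mean:  {geometric:.2f}")
```

A geometric mean keeps long-running benchmarks such as protein or capacita from dominating the summary, which is probably why it is used for the run times.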

Once vector_select() is implemented we can retest without -fno-tree-vectorize.

Hi,

Here's a quick update regarding the vector-select. I started committing my vector-select patch[1] little by little. The general approach is to implement Integer-Promotions legalization on vectors (rather than vector-widening). This enables the widening of <4 x i1> masks into <4 x i32> masks, which are used by the SIMD instruction set.
I started with some type-legalization refactoring. Next, I added a new flag to enable the new kind of type-legalization and a few tests. After that, I added the LegalizeTypes implementation of PromoteInteger for the new vector SDNodes (buildvector, extract, etc) and the changes to copyFromParts/copyToParts (needed for argument passing and inter basicblock variables). I added some tests for arithmetic vector code.
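
To illustrate why a widened mask is convenient, here is a small Python sketch of a lane-wise select driven by a <4 x i32> mask. It is not the LLVM implementation; the helper names are made up for the example, and it assumes that a true lane is represented as all ones and a false lane as all zeros.

```python
LANE_MASK = 0xFFFFFFFF  # one 32-bit lane


def promote_mask(mask_i1):
    """Widen a <4 x i1> mask to <4 x i32>: 1 -> all ones, 0 -> all zeros."""
    return [LANE_MASK if bit else 0 for bit in mask_i1]


def vector_select(mask_i32, a, b):
    """Pick a[i] where the mask lane is all ones, b[i] where it is zero.

    Only bitwise AND/OR/NOT are used, which is what makes a full-width
    mask a natural fit for SIMD hardware.
    """
    return [(m & x) | (~m & LANE_MASK & y)
            for m, x, y in zip(mask_i32, a, b)]


mask = promote_mask([1, 0, 0, 1])
result = vector_select(mask, [10, 20, 30, 40], [1, 2, 3, 4])
print(result)  # [10, 2, 3, 40]
```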

My next patch is going to augment the load/store code for loading and storing the modified vectors. A <4 x i8> vector is promoted to <4 x i32> in registers, but still needs to be saved as <4 x i8> in memory. After this patch goes in, we can do two things. First, we can consider removing the special flag and enabling the new legalization strategy for all code. Second, we can implement the vector select. The vector select part would be easy. I am not sure how long it would take me to finish this patch, because I am only working on this in the late evenings.
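
The register/memory mismatch can be pictured with a short Python sketch. This is only an illustration of the idea, not the legalizer code: the function names are hypothetical, and zero-extension on load is an assumption chosen to keep the example simple.

```python
import struct


def store_promoted_v4i8(lanes_i32):
    """Truncate each 32-bit lane to 8 bits and pack into 4 bytes of memory."""
    return struct.pack("4B", *(lane & 0xFF for lane in lanes_i32))


def load_promoted_v4i8(memory):
    """Read 4 bytes and give each one its own (zero-extended) 32-bit lane."""
    return list(struct.unpack("4B", memory))


# In registers the vector occupies four 32-bit lanes; the upper 24 bits of
# each lane carry no meaning for a <4 x i8> value.
lanes = [0x101, 0x7F, 0xFF, 0x0]
memory = store_promoted_v4i8(lanes)
reloaded = load_promoted_v4i8(memory)
print(len(memory), reloaded)  # 4 [1, 127, 255, 0]
```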

Cheers,
Nadav

[1] - http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20110502/120445.html

Hi Jack, thanks for these numbers. Can you also please measure compile times?
I'm thinking of enabling gcc optimizations by default, but I don't want to
increase compile times, which means choosing a value for the
-fplugin-arg-dragonegg-llvm-ir-optimize option that is low enough to get good
compile times, yet high enough to get fast code. It would be great if you could
play around with this to find a good choice.

Best wishes, Duncan.

Duncan,
    Below are the tabulated compile times and executable sizes.

A) gcc 4.5.4svn using -msse3 -ffast-math -O3 -fno-tree-vectorize
B) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns
C) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize

Compile time (seconds)

Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg

ac           0.61          1.65                0.32
aermod      31.24         25.83               21.02
air          1.74          1.49                0.81
capacita     0.83          0.80                0.44
channel      0.34          0.33                0.25
doduc        3.09          2.63                1.63
fatigue      1.04          1.08                0.84
gas_dyn      0.91          0.95                0.75
induct       3.18          2.57                1.73
linpk        0.34          0.30                0.21
mdbx         1.08          1.01                0.59
nf           0.39          0.41                0.28
protein      1.55          1.29                0.97
rnflow       1.76          1.73                1.26
test_fpu     1.38          1.40                1.05
tfft         0.31          0.28                0.19

mean         3.11          2.73                2.02

Executable size (bytes)

Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg

ac            26344         30896               26704
aermod      1145924       1043816             1052056
air           57404         57700               53532
capacita      40864         41008               37064
channel       22448         22664               22664
doduc        127340        124108              120124
fatigue       61152         65352               65664
gas_dyn      647864         58768 !!!           59024
induct       162360        180440              175312
linpk         18112         18848               18864
mdbx          53464         57652               49516
nf            22560         23784               24080
protein       74320         74440               74816
rnflow        66040         71488               71648
test_fpu      52624         58224               58320
tfft          18416         18456               18600

Stock dragonegg compiles in 26% less time than dragonegg with optzns
(mean 2.02 s versus 2.73 s), while dragonegg with optzns compiles in
12% less time than stock gcc 4.5.4 (2.73 s versus 3.11 s). The most
interesting executable size difference is gas_dyn, which runs fastest
with optzns but whose executable is 11x larger with stock gcc 4.5.4
than with either stock dragonegg or dragonegg with optzns. This is
likely much improved in gcc 4.6 with the new -fwhole-file default.
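
For reference, the percentages can be recomputed from the mean row of the compile-time table. The sketch below (helper names are mine) shows that the 26% and 12% figures take the slower compiler as the baseline; measured against stock dragonegg instead, the same gap is about 35%.

```python
def percent_less(slower, faster):
    """Time saved by `faster`, as a percentage of the `slower` time."""
    return 100.0 * (slower - faster) / slower


def percent_more(slower, faster):
    """Extra time taken by `slower`, as a percentage of the `faster` time."""
    return 100.0 * (slower - faster) / faster


# Mean compile times in seconds, from the mean row of the table above.
mean_gcc, mean_optzns, mean_dragonegg = 3.11, 2.73, 2.02

print(f"{percent_less(mean_optzns, mean_dragonegg):.1f}")  # dragonegg vs optzns
print(f"{percent_more(mean_optzns, mean_dragonegg):.1f}")  # optzns vs dragonegg
print(f"{percent_less(mean_gcc, mean_optzns):.1f}")        # optzns vs stock gcc
```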

Hi Jack, thanks for doing this.

> Below are the tabulated compile times and executable sizes.
>
> A) gcc 4.5.4svn using -msse3 -ffast-math -O3 -fno-tree-vectorize
> B) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns
> C) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize

These numbers really surprised me: the GCC code generators must be really slow
if the entire set of LLVM IR and codegen optimizations takes less time to run
than GCC codegen (since with -fplugin-arg-dragonegg-enable-gcc-optzns the only
part of GCC being disabled is codegen, i.e. RTL). I was assuming that I would
need to reduce the LLVM optimization level to get decent speed. Are you sure
that you built GCC with checking disabled (or --enable-checking=release)?
Can you please also redo this (along with execution times), adding the option
-fplugin-arg-dragonegg-llvm-ir-optimize=2. I expect that to always result in
a decent compile time win for dragonegg wrt stock gcc-4.5. If it doesn't have
a significant impact on execution speed, then I'd be tempted to use the formula
   LLVM optimization level = (1 + GCC optimization level) / 2
as the default, i.e. GCC -O3 -> LLVM -O2, GCC -O2 -> LLVM -O1, GCC -O1 -> LLVM
-O1, GCC -O0 -> LLVM -O0, GCC -O5 -> LLVM -O3.
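
The listed mappings only work out if the division truncates (so that -O0 stays at 0 and -O2 maps to 1). A minimal Python sketch of the proposed default, with a function name invented for the example:

```python
def llvm_opt_level(gcc_opt_level):
    """Proposed default: (1 + GCC level) / 2, using integer division."""
    return (1 + gcc_opt_level) // 2


for gcc_level in (0, 1, 2, 3, 5):
    print(f"GCC -O{gcc_level} -> LLVM -O{llvm_opt_level(gcc_level)}")
```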

Best wishes, Duncan.

> Hi Jack, thanks for doing this.
>
>> Below are the tabulated compile times and executable sizes.
>>
>> A) gcc 4.5.4svn using -msse3 -ffast-math -O3 -fno-tree-vectorize
>> B) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns
>> C) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize
>
> These numbers really surprised me: the GCC code generators must be really slow
> if the entire set of LLVM IR and codegen optimizations takes less time to run
> than GCC codegen (since with -fplugin-arg-dragonegg-enable-gcc-optzns the only
> part of GCC being disabled is codegen, i.e. RTL). I was assuming that I would
> need to reduce the LLVM optimization level to get decent speed. Are you sure
> that you built GCC with checking disabled (or --enable-checking=release)?

I built gcc-4.5.4 from svn with --enable-checking=yes. I'll rebuild gcc-4.5.4 with
--enable-checking=release and repeat the benchmarks.

> Can you please also redo this (along with execution times), adding the option
> -fplugin-arg-dragonegg-llvm-ir-optimize=2. I expect that to always result in
> a decent compile time win for dragonegg wrt stock gcc-4.5. If it doesn't have
> a significant impact on execution speed, then I'd be tempted to use the formula
>   LLVM optimization level = (1 + GCC optimization level) / 2
> as the default, i.e. GCC -O3 -> LLVM -O2, GCC -O2 -> LLVM -O1, GCC -O1 -> LLVM
> -O1, GCC -O0 -> LLVM -O0, GCC -O5 -> LLVM -O3.

I'll try this after I repeat the initial benchmarks with --enable-checking=release.
        Jack

> I'll try this after I repeat the initial benchmarks with --enable-checking=release.

Thanks! Don't forget to do a release build of LLVM too, i.e. configure with
--enable-optimized --disable-assertions. Building dragonegg will then use the
same options.

Ciao, Duncan.

I get about the same thing with --enable-checking=release applied to gcc-4.5.4...

Compile time (seconds)

Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg

ac           0.86          0.44                0.31
aermod      31.13         25.81               20.94
air          1.74          1.48                0.81
capacita     0.86          0.74                0.44
channel      0.35          0.32                0.23
doduc        3.08          2.63                1.63
fatigue      1.04          1.05                0.89
gas_dyn      0.94          0.94                0.75
induct       3.30          2.52                1.84
linpk        0.33          0.28                0.20
mdbx         1.09          1.02                0.60
nf           0.41          0.40                0.28
protein      1.56          1.28                0.98
rnflow       1.75          1.70                1.24
test_fpu     1.38          1.41                1.05
tfft         0.31          0.28                0.19

mean         3.13          2.64                2.02

I wouldn't put a lot of faith in the compile time measurements
because unlike the actual benchmark runs, pb05 doesn't attempt to
repeat the compilations until it has converged on a low error
measurement for the compilation time.
            Jack
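
For anyone who wants steadier compile-time numbers, one option is to repeat the measurement until the standard error of the mean is small relative to the mean. The Python sketch below is only an illustration of that idea; it is not how pb05 decides convergence, and the function name, tolerance and run limits are all assumptions.

```python
import statistics
import time


def time_until_stable(run_once, rel_tol=0.05, min_runs=3, max_runs=20):
    """Call run_once() repeatedly; stop when stderr/mean <= rel_tol.

    Returns (mean_seconds, number_of_runs). Gives up after max_runs.
    """
    samples = []
    while len(samples) < max_runs:
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
        if len(samples) >= min_runs:
            mean = statistics.mean(samples)
            stderr = statistics.stdev(samples) / len(samples) ** 0.5
            if mean > 0 and stderr / mean <= rel_tol:
                break
    return statistics.mean(samples), len(samples)


# Stand-in workload; in practice run_once would launch the compiler,
# for example through subprocess.run([...]).
mean_seconds, runs = time_until_stable(lambda: sum(range(100_000)))
print(f"{mean_seconds:.6f} s over {runs} runs")
```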

Duncan,
    Here are the complete benchmarks rerun against gcc 4.5.4 built with...

Using built-in specs.
COLLECT_GCC=gfortran-fsf-4.5
COLLECT_LTO_WRAPPER=/sw/lib/gcc4.5/libexec/gcc/x86_64-apple-darwin11.0.0/4.5.4/lto-wrapper
Target: x86_64-apple-darwin11.0.0
Configured with: ../gcc-4.5.4/configure --prefix=/sw --prefix=/sw/lib/gcc4.5 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.5/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.5 --enable-lto --enable-checking=release
Thread model: posix
gcc version 4.5.4 20110608 (prerelease) (GCC)

x86_64 darwin

A) gcc 4.5.4svn using -msse3 -ffast-math -O3 -fno-tree-vectorize
B) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns
C) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize
D) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns -fplugin-arg-dragonegg-llvm-ir-optimize=2
E) gcc 4.5.4svn/dragonegg using -msse3 -ffast-math -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-llvm-ir-optimize=2

Run Time (seconds)
Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/    D) gcc 4.5.4/        E) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg        dragonegg/optzns/    dragonegg/optimize=2
                                                               optimize=2

ac           9.58          9.11               12.28             9.12                12.73
aermod      20.99         16.18               17.86            16.30                17.89
air          6.06          6.58                7.69             6.51                 7.64
capacita    35.76         39.86               46.10            39.58                45.89
channel      2.03          2.04                1.96             2.04                 1.96
doduc       28.16         28.50               30.34            28.53                30.42
fatigue      8.12          7.09               10.34             7.06                10.25
gas_dyn     10.16          9.92               11.67             9.96                11.81
induct      20.14         20.76               48.75            20.78                48.75
linpk       15.43         15.41               15.64            15.41                15.64
mdbx        11.41         11.72               12.11            11.72                12.07
nf          27.90         28.52               29.26            28.42                29.13
protein     38.65         38.72               41.31            38.75                39.49
rnflow      27.22         28.18               31.81            28.15                31.98
test_fpu    11.49         11.23               11.57            11.17                11.52
tfft         1.91          1.95                2.15             1.95                 2.16

Mean        12.72         12.60               14.73            12.59                14.72

Compile Time (seconds)
Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/    D) gcc 4.5.4/        E) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg        dragonegg/optzns/    dragonegg/optimize=2
                                                               optimize=2

ac           0.86          0.44                0.31             0.41                 0.28
aermod      31.13         25.81               20.94            25.44                20.87
air          1.74          1.48                0.81             1.46                 0.78
capacita     0.86          0.74                0.44             0.71                 0.42
channel      0.35          0.32                0.23             0.30                 0.23
doduc        3.08          2.63                1.63             2.60                 1.58
fatigue      1.04          1.05                0.89             0.90                 0.70
gas_dyn      0.94          0.94                0.75             0.84                 0.62
induct       3.30          2.52                1.84             2.36                 1.66
linpk        0.33          0.28                0.20             0.28                 0.20
mdbx         1.09          1.02                0.60             0.99                 0.59
nf           0.41          0.40                0.28             0.40                 0.28
protein      1.56          1.28                0.98             1.21                 0.82
rnflow       1.75          1.70                1.24             1.61                 1.13
test_fpu     1.38          1.41                1.05             1.31                 0.95
tfft         0.31          0.28                0.19             0.28                 0.19

Executable Size (bytes)
Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/    D) gcc 4.5.4/        E) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg        dragonegg/optzns/    dragonegg/optimize=2
                                                               optimize=2

ac            26344         30896               26704            30896                26824
aermod      1145924       1043816             1052056          1027680              1031880
air           57404         57700               53532            53556                53532
capacita      40864         41008               37064            41008                37064
channel       22448         22664               22664            22664                22664
doduc        127340        124108              120124           124372               120484
fatigue       61152         65352               65664            61256                61568
gas_dyn      647864         58768               59024            54672                54960
induct       162360        180440              175312           168304               163176
linpk         18112         18848               18864            18848                18896
mdbx          53464         57652               49516            57652                49516
nf            22560         23784               24080            23784                24080
protein       74320         74440               74816            70344                66624
rnflow        66040         71488               71648            67416                67616
test_fpu      52624         58224               58320            54128                54256
tfft          18416         18456               18600            18456                18600

Duncan,
   FYI, I always build llvm/clang with cmake using...

cmake -DLLVM_BUILD_32_BITS:BOOL=OFF -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_INSTALL_PREFIX=%p/opt/llvm-%v -DLLVM_ENABLE_ASSERTIONS=OFF -DCMAKE_BUILD_TYPE=Release ..

      Jack

Compile Time (seconds)
Benchmark   A) stock      B) gcc 4.5.4/       C) gcc 4.5.4/    D) gcc 4.5.4/        E) gcc 4.5.4/
            gcc 4.5.4     dragonegg/optzns    dragonegg        dragonegg/optzns/    dragonegg/optimize=2
                                                               optimize=2

ac           0.86          0.44                0.31             0.41                 0.28
aermod      31.13         25.81               20.94            25.44                20.87
air          1.74          1.48                0.81             1.46                 0.78
capacita     0.86          0.74                0.44             0.71                 0.42
channel      0.35          0.32                0.23             0.30                 0.23
doduc        3.08          2.63                1.63             2.60                 1.58
fatigue      1.04          1.05                0.89             0.90                 0.70
gas_dyn      0.94          0.94                0.75             0.84                 0.62
induct       3.30          2.52                1.84             2.36                 1.66
linpk        0.33          0.28                0.20             0.28                 0.20
mdbx         1.09          1.02                0.60             0.99                 0.59
nf           0.41          0.40                0.28             0.40                 0.28
protein      1.56          1.28                0.98             1.21                 0.82
rnflow       1.75          1.70                1.24             1.61                 1.13
test_fpu     1.38          1.41                1.05             1.31                 0.95
tfft         0.31          0.28                0.19             0.28                 0.19

mean         3.13          2.64                2.02             2.57                 1.96

Duncan,
    These numbers were from release builds for both FSF gcc 4.5.4 and llvm. It seems that
-fplugin-arg-dragonegg-llvm-ir-optimize=2 provides a small reduction in compile time that
partly offsets the increase from -fplugin-arg-dragonegg-enable-gcc-optzns
at -O3 -ffast-math. It also appears that with -fplugin-arg-dragonegg-llvm-ir-optimize=2,
the addition of -fplugin-arg-dragonegg-enable-gcc-optzns accounts for 24% of the mean
compile time with -O3 -ffast-math (2.57 s versus 1.96 s without it), which is very close
to the 23% share (2.64 s versus 2.02 s) seen without
-fplugin-arg-dragonegg-llvm-ir-optimize=2. We should rebenchmark pb05 with -O2 -ffast-math
to see if -fplugin-arg-dragonegg-enable-gcc-optzns has the same impact on compile times.
IMHO, if -fplugin-arg-dragonegg-enable-gcc-optzns has less effect at -O2, it might make sense
to default -fplugin-arg-dragonegg-enable-gcc-optzns on in dragonegg. That is, if the compile time
regressions are mainly at -O3, that would be tolerable because the run-time of the resulting
binaries should matter more there.
            Jack

Hi Jack,

> Here are the complete benchmarks rerun against gcc 4.5.4 built with...

thanks for these great numbers. It is interesting to see that dropping the LLVM
IR optimization level to 2 makes no difference to the run-times. As a radical
experiment I just committed a patch to dragonegg (commit 132846) that disables
all heavy LLVM optimizations when the GCC optimizers are enabled. A few small
cleanups are run on each function, but otherwise only LLVM codegen (and codegen
optimizations) are done. I did some measurements and this results in very fast
compile times. But how does it impact run-time? Can you please benchmark
run times with -fplugin-arg-dragonegg-enable-gcc-optzns and this patch applied
(plus don't use the -fplugin-arg-dragonegg-llvm-ir-optimize option since that
turns on heavy LLVM IR optimizations again). If it has no impact on run-times
then that would suggest that LLVM's IR level optimizers are not doing any useful
optimization: GCC already got everything. If it does have an impact then that
suggests that LLVM is picking up stuff that GCC missed. I can't wait to see!

Thanks a lot, Duncan.

Hi Jack,

I did some compile time benchmarking using gcc-as-one-big-file (750000 SLOC) and
bzip2-as-one-big-file (7000 SLOC). At -O3, adding
-fplugin-arg-dragonegg-enable-gcc-optzns didn't change the compile time (gcc) or
decreased it slightly (bzip2). At -O2 it increased the compile time of gcc by 8%
and decreased the compile time of bzip2 by 11%. At -O1 it increased the compile
time of gcc by 9% and did not change the compile time of bzip2.

With my latest patch that turns off module level LLVM optimizations if GCC
optimizations are enabled: at -O3, adding
-fplugin-arg-dragonegg-enable-gcc-optzns decreased the compile time of gcc by
27% and decreased the compile time of bzip2 by 36%. At -O2 it decreased the
compile time of gcc by 20% and decreased the compile time of bzip2 by 36%. At
-O1 it decreased the compile time of gcc by 12% and decreased the compile time
of bzip2 by 26%.

The above two paragraphs represent two extreme situations; the trick now is to
find a level between the two where run-time performance is excellent while
compile times are still decent.

Ciao, Duncan.

PS: All compiling was done with -fno-tree-vectorize.
PPS: "Adding -fplugin-arg-dragonegg-enable-gcc-optzns" is short-hand for:
using the dragonegg plugin (-fplugin=dragonegg.so) along with the
-fplugin-arg-dragonegg-enable-gcc-optzns option. I.e. I am comparing stock
gcc-4.5 with dragonegg+gcc-optimizations.
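
To make the compared configurations concrete, the Python sketch below assembles the command line for each of the benchmark configurations A) to E) used in this thread. The flags are the ones quoted in the messages; the compiler name gfortran-fsf-4.5 comes from the gcc -v output, while the source file name ac.f90 and the bare dragonegg.so plugin path are placeholders that will differ per installation.

```python
import shlex

BASE_FLAGS = ["-msse3", "-ffast-math", "-O3", "-fno-tree-vectorize"]
PLUGIN = "-fplugin=dragonegg.so"  # placeholder path to the plugin
OPTZNS = "-fplugin-arg-dragonegg-enable-gcc-optzns"
IR_OPT2 = "-fplugin-arg-dragonegg-llvm-ir-optimize=2"

# Extra flags each configuration adds on top of BASE_FLAGS.
CONFIGS = {
    "A": [],                         # stock gcc
    "B": [PLUGIN, OPTZNS],           # dragonegg + GCC optimizers
    "C": [PLUGIN],                   # stock dragonegg
    "D": [PLUGIN, OPTZNS, IR_OPT2],  # B with LLVM IR optimization level 2
    "E": [PLUGIN, IR_OPT2],          # C with LLVM IR optimization level 2
}


def command_line(config, source, compiler="gfortran-fsf-4.5"):
    """Return the argv list for one configuration and one source file."""
    return [compiler, *BASE_FLAGS, *CONFIGS[config], source]


for name in sorted(CONFIGS):
    print(f"{name}) {shlex.join(command_line(name, 'ac.f90'))}")
```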