Polyhedron 2005 results for dragonegg 3.3svn

Below are the results for the Polyhedron 2005 benchmarks compiled with llvm/compiler-rt/dragonegg 3.3svn at r182439 against current
FSF gcc 4.7.3svn and 4.8.1svn. The only major bug remaining in the dragonegg 3.3svn support for gcc 4.8.x is http://llvm.org/bugs/show_bug.cgi?id=15980,
which results in unresolved symbols for _iround and _iroundf in the aermod and rnflow testcases. Those two failures appear below as a run time
of -1.00 and an executable size of 0 in the gcc 4.8 dragonegg columns, and they skew the geometric mean of the run time for those columns to much higher values.
              Jack

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3

de-gfortran47: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs
de-gfortran48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfortran47+optzns: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
de-gfortran48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran47: /sw/bin/gfortran-fsf-4.7
gfortran48: /sw/bin/gfortran-fsf-4.8

Run time (secs)

Benchmark de-gfortran47 de-gfortran48 de-gfortran47+optzns de-gfortran48+optzns gfortran47 gfortran48
ac 11.39 11.39 8.09 8.14 8.18 8.05
aermod 16.35 -1.00 14.50 -1.00 16.45 16.23
air 6.88 6.79 5.42 5.25 5.83 5.73
capacita 39.85 39.85 34.71 33.39 32.51 33.02
channel 2.05 2.03 2.15 1.98 1.83 1.83
doduc 27.10 27.24 26.75 26.36 25.91 25.76
fatigue 8.85 8.88 7.72 5.56 8.26 5.60
gas_dyn 11.76 11.45 4.51 4.20 3.88 3.59
induct 24.01 24.00 11.86 11.85 12.08 12.21
linpk 15.43 15.44 15.40 15.77 15.37 15.64
mdbx 11.92 11.92 11.30 11.28 11.18 11.42
nf 29.57 29.82 29.50 29.46 27.21 27.25
protein 36.15 35.10 35.93 34.13 31.88 31.81
rnflow 27.02 -1.00 26.77 -1.00 24.67 21.21
test_fpu 11.49 11.34 9.11 9.30 7.90 8.01
tfft 1.92 1.92 1.92 1.90 1.86 1.90

Geom. Mean 13.19 21.26 10.99 17.31 10.60 10.22
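
For anyone re-deriving the summary row, here is a minimal Python sketch of a geometric mean that drops the failed runs (the -1.00 entries) before averaging. How the benchmark harness itself treats failures is not stated in this thread, so the filtering step and the function names are assumptions, and a mean over 14 benchmarks is not strictly comparable with one over all 16. The two columns are copied from the run time table above.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values, computed via logarithms."""
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))

def geometric_mean_of_successes(values):
    """Drop non-positive sentinels (failed runs reported as -1.00) first.

    Returns (mean over successful runs, number of failed runs).
    """
    ok = [v for v in values if v > 0]
    return geometric_mean(ok), len(values) - len(ok)

# Run time columns from the table above, in benchmark order.
de_gfortran47 = [11.39, 16.35, 6.88, 39.85, 2.05, 27.10, 8.85, 11.76,
                 24.01, 15.43, 11.92, 29.57, 36.15, 27.02, 11.49, 1.92]
de_gfortran48 = [11.39, -1.00, 6.79, 39.85, 2.03, 27.24, 8.88, 11.45,
                 24.00, 15.44, 11.92, 29.82, 35.10, -1.00, 11.34, 1.92]

print("de-gfortran47:", round(geometric_mean(de_gfortran47), 2))
mean48, failed = geometric_mean_of_successes(de_gfortran48)
print("de-gfortran48:", round(mean48, 2), "with", failed, "failed runs excluded")
```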

Compile time (secs)

Benchmark de-gfortran47 de-gfortran48 de-gfortran47+optzns de-gfortran48+optzns gfortran47 gfortran48
ac 0.62 0.31 2.20 1.38 2.88 2.08
aermod 35.19 35.52 43.50 42.89 42.75 55.97
air 1.16 1.17 2.72 2.36 4.48 4.28
capacita 0.52 0.55 1.02 0.99 1.90 1.89
channel 0.26 0.26 0.47 0.47 0.65 0.75
doduc 1.74 1.76 3.78 3.54 6.03 5.68
fatigue 0.91 0.91 1.33 1.49 1.97 2.04
gas_dyn 0.70 0.69 1.40 1.38 3.39 2.44
induct 1.95 1.73 2.87 2.98 4.08 4.42
linpk 0.25 0.24 0.53 0.71 0.92 1.25
mdbx 0.66 0.67 1.30 1.14 2.16 1.90
nf 0.39 0.39 0.80 0.74 2.12 1.67
protein 1.12 1.11 2.01 1.77 4.39 3.62
rnflow 1.26 1.26 2.93 2.74 6.43 5.47
test_fpu 0.91 0.91 2.27 2.22 5.28 4.26
tfft 0.22 0.21 0.39 0.44 0.59 0.78

Executable (bytes)

Benchmark de-gfortran47 de-gfortran48 de-gfortran47+optzns de-gfortran48+optzns gfortran47 gfortran48
ac 26776 26792 47160 34928 59120 42784
aermod 1023024 0 1052728 0 1392840 1286136
air 61940 61948 65964 61876 110768 106680
capacita 41344 41144 45440 45040 77920 73248
channel 22736 22744 26696 22552 34704 34656
doduc 128376 128384 140580 136296 205320 189040
fatigue 65648 65640 69808 73848 90240 82040
gas_dyn 54840 54936 63144 71304 123680 99184
induct 163064 158792 163192 166920 179080 170872
linpk 18680 18688 22896 34920 42640 50936
mdbx 49492 49508 57692 53604 90232 78032
nf 23880 23888 32088 32104 84072 67744
protein 74960 75048 87144 83128 131976 115688
rnflow 67704 0 88248 0 205584 176912
test_fpu 50000 50008 70440 78456 179464 142608
tfft 18568 18576 18416 22544 30680 34832

Duncan,
    With r182593, the dragonegg 3.3 branch now completely passes the Polyhedron 2005 benchmarks
using the FSF gcc 4.8.1svn compiler. Thanks.
         Jack

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3

de-gfortran47: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs
de-gfortran48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfortran47+optzns: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
de-gfortran48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran47: /sw/bin/gfortran-fsf-4.7
gfortran48: /sw/bin/gfortran-fsf-4.8

Run time (secs)

Benchmark de-gfortran47 de-gfortran48 de-gfortran47+optzns de-gfortran48+optzns gfortran47 gfortran48
ac 11.39 11.39 8.09 8.14 8.18 8.05
aermod 16.35 16.00 14.50 15.28 16.45 16.23
air 6.88 6.77 5.42 5.28 5.83 5.73
capacita 39.85 39.83 34.71 33.47 32.51 33.02
channel 2.05 2.05 2.15 1.99 1.83 1.83
doduc 27.10 27.37 26.75 26.31 25.91 25.76
fatigue 8.85 8.81 7.72 5.60 8.26 5.60
gas_dyn 11.76 11.50 4.51 4.21 3.88 3.59
induct 24.01 24.04 11.86 11.85 12.08 12.21
linpk 15.43 15.48 15.40 15.83 15.37 15.64
mdbx 11.92 11.91 11.30 11.27 11.18 11.42
nf 29.57 30.04 29.50 29.59 27.21 27.25
protein 36.15 35.21 35.93 34.16 31.88 31.81
rnflow 27.02 25.92 26.77 22.20 24.67 21.21
test_fpu 11.49 11.47 9.11 9.30 7.90 8.01
tfft 1.92 1.92 1.92 1.89 1.86 1.90

Geom. Mean 13.19 13.10 10.99 10.52 10.60 10.22

Compile time (secs)

Benchmark de-gfortran47 de-gfortran48 de-gfortran47+optzns de-gfortran48+optzns gfortran47 gfortran48
ac 0.62 0.29 2.20 0.71 2.88 2.08
aermod 35.19 20.44 43.50 42.90 42.75 55.97
air 1.16 1.11 2.72 2.40 4.48 4.28
capacita 0.52 0.52 1.02 1.04 1.90 1.89
channel 0.26 0.23 0.47 0.50 0.65 0.75
doduc 1.74 1.74 3.78 3.53 6.03 5.68
fatigue 0.91 0.87 1.33 1.49 1.97 2.04
gas_dyn 0.70 0.63 1.40 1.39 3.39 2.44
induct 1.95 1.77 2.87 2.99 4.08 4.42
linpk 0.25 0.21 0.53 0.72 0.92 1.25
mdbx 0.66 0.61 1.30 1.24 2.16 1.90
nf 0.39 0.35 0.80 0.74 2.12 1.67
protein 1.12 1.03 2.01 1.79 4.39 3.62
rnflow 1.26 1.19 2.93 2.72 6.43 5.47
test_fpu 0.91 0.85 2.27 2.22 5.28 4.26
tfft 0.22 0.18 0.39 0.46 0.59 0.78

Executable (bytes)

Benchmark de-gfortran47 de-gfortran48 de-gfortran47+optzns de-gfortran48+optzns gfortran47 gfortran48
ac 26776 26792 47160 34928 59120 42784
aermod 1023024 1023064 1052728 1031576 1392840 1286136
air 61940 61948 65964 61876 110768 106680
capacita 41344 41144 45440 45040 77920 73248
channel 22736 22744 26696 22552 34704 34656
doduc 128376 128384 140580 136296 205320 189040
fatigue 65648 65640 69808 73848 90240 82040
gas_dyn 54840 54936 63144 71304 123680 99184
induct 163064 158792 163192 166920 179080 170872
linpk 18680 18688 22896 34920 42640 50936
mdbx 49492 49508 57692 53604 90232 78032
nf 23880 23888 32088 32104 84072 67744
protein 74960 75048 87144 83128 131976 115688
rnflow 67704 67712 88248 96152 205584 176912
test_fpu 50000 50008 70440 78456 179464 142608
tfft 18568 18576 18416 22544 30680 34832

Hi Jack, do the results improve significantly with the attached patch applied?
It enables IR level fast math optimizations and the loop vectorizer. Note that
some loop vectorizations only kick in if fast-math is enabled too.

Best wishes, Duncan.

fm.diff (2.75 KB)

Duncan,
    As requested, appended are the updated Polyhedron 2005 benchmark results with both RC1 and RC3 llvm 3.3 testing.
There is a small improvement in the dragonegg results (without -fplugin-arg-dragonegg-enable-gcc-optzns) in RC3. I assume
we still only have partial coverage of all of the -ffast-math optimizations performed by FSF gcc in llvm's fast-math
support, correct?
                      Jack

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3

de-gfc47: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs
de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfc47+optzns: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran47: /sw/bin/gfortran-fsf-4.7
gfortran48: /sw/bin/gfortran-fsf-4.8

Run time (secs)

Benchmark de-gfc47(RC1) de-gfc47(RC3) de-gfc48(RC1) de-gfc48(RC3) de-gfc47+optzns(RC1) de-gfc47+optzns(RC3) de-gfc48+optzns(RC1) de-gfc48+optzns(RC3) gfortran47 gfortran48
ac 11.39 11.66 11.39 11.58 8.09 8.07 8.14 8.14 8.18 8.05
aermod 16.35 16.47 16.00 16.44 14.50 14.61 15.28 14.43 16.45 16.23
air 6.88 6.87 6.77 6.77 5.42 5.42 5.28 5.27 5.83 5.73
capacita 39.85 37.80 39.83 37.86 34.71 34.81 33.47 33.53 32.51 33.02
channel 2.05 2.06 2.05 2.06 2.15 2.15 1.99 1.99 1.83 1.83
doduc 27.10 27.43 27.37 27.39 26.75 27.03 26.31 26.24 25.91 25.76
fatigue 8.85 8.84 8.81 8.88 7.72 7.75 5.60 5.42 8.26 5.60
gas_dyn 11.76 8.25 11.50 7.94 4.51 4.52 4.21 4.20 3.88 3.59
induct 24.01 24.45 24.04 24.04 11.86 11.90 11.85 11.85 12.08 12.21
linpk 15.43 15.48 15.48 15.49 15.40 15.47 15.83 15.81 15.37 15.64
mdbx 11.92 12.14 11.91 12.15 11.30 11.29 11.27 11.27 11.18 11.42
nf 29.57 30.08 30.04 30.11 29.50 29.82 29.59 29.86 27.21 27.25
protein 36.15 36.15 35.21 35.17 35.93 36.02 34.16 34.06 31.88 31.81
rnflow 27.02 27.08 25.92 26.12 26.77 26.83 22.20 22.21 24.67 21.21
test_fpu 11.49 11.55 11.47 11.52 9.11 9.11 9.30 9.30 7.90 8.01
tfft 1.92 1.94 1.92 1.92 1.92 1.92 1.89 1.90 1.86 1.90

Geom. Mean 13.19 12.95 13.10 12.83 10.99 11.02 10.52 10.47 10.60 10.22

Compile time (secs)

Benchmark de-gfc47(RC1) de-gfc47(RC3) de-gfc48(RC1) de-gfc48(RC3) de-gfc47+optzns(RC1) de-gfc47+optzns(RC3) de-gfc48+optzns(RC1) de-gfc48+optzns(RC3) gfortran47 gfortran48
ac 0.62 1.63 0.29 0.93 2.20 1.02 0.71 0.73 2.88 2.08
aermod 35.19 35.57 20.44 35.86 43.50 43.39 42.90 43.08 42.75 55.97
air 1.16 1.23 1.11 1.26 2.72 2.68 2.40 2.35 4.48 4.28
capacita 0.52 0.60 0.52 0.62 1.02 0.94 1.04 0.96 1.90 1.89
channel 0.26 0.28 0.23 0.30 0.47 0.45 0.50 0.47 0.65 0.75
doduc 1.74 1.89 1.74 1.91 3.78 3.71 3.53 3.55 6.03 5.68
fatigue 0.91 0.91 0.87 0.91 1.33 1.30 1.49 1.49 1.97 2.04
gas_dyn 0.70 0.87 0.63 0.88 1.40 1.37 1.39 1.39 3.39 2.44
induct 1.95 1.83 1.77 1.83 2.87 2.81 2.99 3.02 4.08 4.42
linpk 0.25 0.32 0.21 0.32 0.53 0.52 0.72 0.73 0.92 1.25
mdbx 0.66 0.73 0.61 0.75 1.30 1.26 1.24 1.15 2.16 1.90
nf 0.39 0.55 0.35 0.55 0.80 0.80 0.74 0.74 2.12 1.67
protein 1.12 1.18 1.03 1.20 2.01 1.99 1.79 1.77 4.39 3.62
rnflow 1.26 1.55 1.19 1.55 2.93 2.84 2.72 2.73 6.43 5.47
test_fpu 0.91 1.12 0.85 1.13 2.27 5.06 2.22 2.23 5.28 4.26
tfft 0.22 0.24 0.18 0.22 0.39 0.40 0.46 0.46 0.59 0.78

Executable (bytes)

Benchmark de-gfc47(RC1) de-gfc47(RC3) de-gfc48(RC1) de-gfc48(RC3) de-gfc47+optzns(RC1) de-gfc47+optzns(RC3) de-gfc48+optzns(RC1) de-gfc48+optzns(RC3) gfortran47 gfortran48
ac 26776 30896 26792 30912 47160 47160 34928 34928 59120 42784
aermod 1023024 1035312 1023064 1031248 1052728 1052728 1031576 1031568 1392840 1286136
air 61940 61940 61948 61948 65964 65964 61876 61876 110768 106680
capacita 41344 45440 41144 41144 45440 45440 45040 45040 77920 73248
channel 22736 22600 22744 22608 26696 22600 22552 22552 34704 34656
doduc 128376 120188 128384 120196 140580 140580 136296 136296 205320 189040
fatigue 65648 69744 65640 69736 69808 69808 73848 73848 90240 82040
gas_dyn 54840 58936 54936 59032 63144 63144 71304 71304 123680 99184
induct 163064 163064 158792 162888 163192 167288 166920 171024 179080 170872
linpk 18680 22896 18688 22904 22896 22896 34920 34920 42640 50936
mdbx 49492 57684 49508 57700 57692 57692 53604 53604 90232 78032
nf 23880 32080 23888 27984 32088 32088 32104 32104 84072 67744
protein 74960 79056 75048 79144 87144 87144 83128 83128 131976 115688
rnflow 67704 79992 67712 80000 88248 88248 96152 96152 205584 176912
test_fpu 50000 62296 50008 62304 70440 70440 78456 78456 179464 142608
tfft 18568 18568 18576 18576 18416 18416 22544 22544 30680 34832

Hi Jack,

> Hi Jack, I pulled the loop vectorizer and fast math changes into the 3.3 branch,
> so hopefully they will be part of 3.3 rc3 (and 3.3 final!). It would be great
> if you could redo the benchmarks with rc3.
>
> Duncan,
>      As requested, appended are the updated Polyhedron 2005 benchmark results with both RC1 and RC3 llvm 3.3 testing.

Thanks for doing this. As rc3 hasn't been tagged yet, I assume you used the latest
3.3svn?

> There is a small improvement in the dragonegg results (without -fplugin-arg-dragonegg-enable-gcc-optzns) in RC3. I assume
> we still only have partial coverage of all of the -ffast-math optimizations performed by FSF gcc in llvm's fast-math
> support, correct?

These results are very disappointing. I was hoping to see a big improvement
somewhere, instead of no real improvement anywhere (except for gas_dyn) or a
regression (e.g. mdbx). I think LLVM now has a reasonable array of fast-math
optimizations. I will try to find time to poke at gas_dyn and induct: since
turning on gcc's optimizations there halves the run-time, LLVM's IR optimizers
are clearly missing something important.

Ciao, Duncan.

Duncan,
   Appended is another set of benchmark runs where I attempted to decouple the
fast math optimizations from the vectorization by passing -fno-tree-vectorize.
I am unclear if dragonegg really honors -fno-tree-vectorize to disable the llvm
vectorization.

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran48: /sw/bin/gfortran-fsf-4.8

Run time (secs)

Benchmark de-gfc48 de-gfc48+optzns gfortran48

ac 11.33 8.10 8.02
aermod 16.03 14.45 16.13
air 6.80 5.28 5.73
capacita 39.89 35.21 34.96
channel 2.06 2.29 2.69
doduc 27.35 26.13 25.74
fatigue 8.83 4.82 4.67
gas_dyn 11.41 9.79 9.60
induct 23.95 21.75 21.14
linpk 15.49 15.48 15.69
mdbx 11.91 11.28 11.39
nf 29.92 29.57 27.99
protein 36.34 33.94 31.91
rnflow 25.97 25.27 22.78
test_fpu 11.48 10.91 9.64
tfft 1.92 1.91 1.91

Geom. Mean 13.12 11.70 11.64

Assuming that the de-gfc48+optzns run really has disabled the llvm vectorization,
I am hoping that additional benchmarking of de-gfc48+optzns with individual
-ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)
may give us a clue as to the origin of the performance delta between the stock
dragonegg results with -ffast-math and those with -fplugin-arg-dragonegg-enable-gcc-optzns.
      Jack

Duncan,
   In case it helps, I benchmarked disabling individual -ffast-math optimizations (with partial results
appended). The most important optimization to the benchmark runtimes seems to be -funsafe-math-optimizations
(as can be seen from the runtime regression caused by -fno-unsafe-math-optimizations). Does llvm currently
support all of the features of FSF gcc's -funsafe-math-optimizations?
            Jack

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran48: /sw/bin/gfortran-fsf-4.8
de-gfc48+nounsafe+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns -fno-unsafe-math-optimizations
de-gfc48+math-errno+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns -fmath-errno
de-gfc48+math-signans+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns -fsignaling-nans

Run time (secs)

Benchmark de-gfc48 de-gfc48+optzns gfortran48 de-gfc48+nounsafe+optzns de-gfc48+math-errno+optzns de-gfc48+math-signans+optzns

ac 11.33 8.10 8.02 9.20 8.10 8.10
aermod 16.03 14.45 16.13 14.83 14.20 14.51
air 6.80 5.28 5.73 6.84 5.26 5.31
capacita 39.89 35.21 34.96 36.72 35.21 35.51
channel 2.06 2.29 2.69 2.30 2.29 2.30
doduc 27.35 26.13 25.74 29.90 26.42 26.99
fatigue 8.83 4.82 4.67 5.60 4.87 4.82
gas_dyn 11.41 9.79 9.60 12.97 10.56 12.13
induct 23.95 21.75 21.14 22.34 21.39 21.91
linpk 15.49 15.48 15.69 15.49 15.49 15.52
mdbx 11.91 11.28 11.39 11.85 11.27 11.83
nf 29.92 29.57 27.99 29.67 29.67 29.47
protein 36.34 33.94 31.91 34.23 33.62 33.97
rnflow 25.97 25.27 22.78 27.99 28.00 28.00
test_fpu 11.48 10.91 9.64 10.95 10.94 10.93
tfft 1.92 1.91 1.91 1.91 1.90 1.91

Geom. Mean 13.12 11.70 11.64 12.62 11.82 12.01

Hi Jack, thanks for separating out the effects of LLVM's and GCC's
vectorizers.

> Duncan,
>     Appended is another set of benchmark runs where I attempted to decouple the
> fast math optimizations from the vectorization by passing -fno-tree-vectorize.
> I am unclear if dragonegg really honors -fno-tree-vectorize to disable the llvm
> vectorization.

Yes, it does disable LLVM vectorization.

> Tested on x86_apple-darwin12
>
> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

Maybe -march=native would be a good addition.

> de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
> gfortran48: /sw/bin/gfortran-fsf-4.8
>
> Run time (secs)

What is the standard deviation for each benchmark? If each run varies by +-5%
then that means that the changes in runtime of around 3% measured below don't
mean anything.
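
One way to answer the question above is to time each benchmark several times and compare the gap between two configurations with the run-to-run scatter. The sketch below does that with Python's statistics module. The timings are made-up placeholders, not measurements from this thread, and the two-standard-error threshold is a rough rule of thumb rather than a formal significance test.

```python
import math
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation, relative deviation)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # requires at least two samples
    return mean, stdev, stdev / mean

def difference_exceeds_noise(a, b, k=2.0):
    """Rough check: is |mean(a) - mean(b)| larger than k combined standard errors?"""
    mean_a, sd_a, _ = summarize(a)
    mean_b, sd_b, _ = summarize(b)
    combined_se = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return abs(mean_a - mean_b) > k * combined_se

# Hypothetical repeated timings (seconds) for one benchmark under two builds.
without_vectorizer = [11.33, 11.41, 11.28, 11.37, 11.30]
with_vectorizer = [11.58, 11.49, 11.66, 11.52, 11.61]

for label, runs in (("without", without_vectorizer), ("with", with_vectorizer)):
    mean, stdev, rel = summarize(runs)
    print(f"{label:8s} mean={mean:.2f}s stdev={stdev:.3f}s ({100 * rel:.1f}%)")
print("difference exceeds noise:",
      difference_exceeds_noise(without_vectorizer, with_vectorizer))
```

With only a single run per configuration, as in the tables in this thread, no such estimate is possible, which is the point of the question.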

Comparing with your previous benchmarks, I see:

> Benchmark de-gfc48 de-gfc48+optzns gfortran48
>
> ac 11.33 8.10 8.02

Turning on LLVM's vectorizer gives a 2% slowdown.

> aermod 16.03 14.45 16.13

Turning on LLVM's vectorizer gives a 2.5% slowdown.

> air 6.80 5.28 5.73
> capacita 39.89 35.21 34.96

Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5% speedup from
its vectorizer.

> channel 2.06 2.29 2.69

GCC gets a 30% speedup from its vectorizer which LLVM doesn't get. On the
other hand, without vectorization LLVM's version runs 23% faster than GCC's, so
while GCC's vectorizer puts GCC in the lead, the final speed difference is
closer to GCC being about 10% faster.

> doduc 27.35 26.13 25.74
> fatigue 8.83 4.82 4.67

GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
This is a good one to look at, because all the difference between GCC
and LLVM is coming from the mid-level optimizers: turning on GCC optzns
in dragonegg speeds up the program to GCC levels, so it is possible to
get LLVM IR with and without the effect of GCC optimizations, which should
make it fairly easy to understand what GCC is doing right here.

> gas_dyn 11.41 9.79 9.60

Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a comparable
speedup from its vectorizer.

> induct 23.95 21.75 21.14

GCC gets a 40% speedup from its vectorizer which LLVM doesn't get. Like
fatigue, this is a case where we can get IR showing all the improvements that
the GCC optimizers made.

> linpk 15.49 15.48 15.69
> mdbx 11.91 11.28 11.39

Turning on LLVM's vectorizer gives a 2% slowdown.

> nf 29.92 29.57 27.99
> protein 36.34 33.94 31.91

Turning on LLVM's vectorizer gives a 3% speedup.

> rnflow 25.97 25.27 22.78

GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.

> test_fpu 11.48 10.91 9.64

GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.

> tfft 1.92 1.91 1.91
>
> Geom. Mean 13.12 11.70 11.64

Ciao, Duncan.
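
The channel figures quoted in the message above can be re-derived from the run time tables in this thread. The sketch below assumes each percentage is the run-time reduction relative to the slower of the two runs; the message does not say which convention was used, so treat that as an assumption. The timings are the channel entries for gfortran48 (1.83 s with the vectorizer, 2.69 s with -fno-tree-vectorize) and de-gfc48 (2.06 s either way).

```python
def percent_faster(slow, fast):
    """Run-time reduction of `fast` relative to `slow`, in percent."""
    return 100.0 * (slow - fast) / slow

# channel run times (seconds) taken from the tables earlier in the thread
gcc_no_vec = 2.69    # gfortran48, -fno-tree-vectorize
gcc_vec = 1.83       # gfortran48, vectorizer enabled
llvm_no_vec = 2.06   # de-gfc48, -fno-tree-vectorize
llvm_vec = 2.06      # de-gfc48 (RC3), vectorizer enabled

print(f"GCC gain from its vectorizer:     {percent_faster(gcc_no_vec, gcc_vec):.0f}%")
print(f"LLVM vs GCC, no vectorization:    {percent_faster(gcc_no_vec, llvm_no_vec):.0f}%")
print(f"GCC vs LLVM, vectorizers enabled: {percent_faster(llvm_vec, gcc_vec):.0f}%")
```

These come out close to the 30%, 23% and 10% quoted in the message.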

Jack,

Can you please file a bug report and attach the BC files for the major loops that we miss ?

Thanks,
Nadav

Actually, this kind of opportunity, as outlined below, was one of my contrived motivating
examples for fast-math. But in the last year we have not seen such opportunities in the real applications we care about.

     t1 = x1/y
     ...
     t2 = x2/y

  I think this is better handled by GVN/PRE: blindly converting x/y => x * (1/y) is not necessarily
beneficial. Or maybe we can blindly perform the transformation at an early stage and, later on,
convert it back if the reciprocals are not CSEed away.
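
To make the profitability concern concrete, here is a toy Python sketch that rewrites divisions as multiplications by a reciprocal only when at least two divisions share a denominator, so that the extra reciprocal pays for itself. The tuple representation, the function name and the min_uses threshold are hypothetical. This is a stand-in heuristic for illustration only: it is neither the GVN/PRE approach suggested above nor anything LLVM actually implements.

```python
from collections import Counter

def hoist_shared_reciprocals(divisions, min_uses=2):
    """Rewrite dst = num / den into dst = num * (1/den) for shared denominators.

    divisions: list of (dst, num, den) value names.
    Returns a list of ("fdiv" | "fmul", dst, lhs, rhs) operations.
    """
    uses = Counter(den for _, _, den in divisions)
    recip_name = {}
    ops = []
    for dst, num, den in divisions:
        if uses[den] < min_uses:
            # A lone division: turning it into fdiv + fmul would only add work.
            ops.append(("fdiv", dst, num, den))
            continue
        if den not in recip_name:
            # Emit the reciprocal once, at its first use.
            recip_name[den] = "recip_" + den
            ops.append(("fdiv", recip_name[den], "1.0", den))
        ops.append(("fmul", dst, num, recip_name[den]))
    return ops

ops = hoist_shared_reciprocals([("t1", "x1", "y"), ("t2", "x2", "y"), ("t3", "x3", "z")])
for op in ops:
    print(op)
```

In this example the two divisions by y become one fdiv plus two fmuls, while the single division by z is left alone.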

[Resending without the bitcode attached, which was too big for the mailing
list].

Hi Nadav,

> Jack,
>
> Can you please file a bug report and attach the BC files for the major loops
> that we miss?

I took a look and it's not clear what vectorization has to do with it; it seems
to be a missed fast-math optimization. I've attached bitcode where only LLVM
optimizations are run (fatigue0.ll) and where GCC optimizations are run before
LLVM optimizations (fatigue1.ll). The hottest instruction is the same in both:

fatigue0.ll:
    %329 = fsub fast double %327, %328, !dbg !1077

fatigue1.ll:
    %1504 = fsub fast double %1501, %1503, !dbg !1148

However in the GCC version it is twice as hot as in the LLVM only version,
i.e. in the LLVM only version instructions elsewhere are consuming a lot of
time. In the LLVM only version there are 9 fdiv instructions in that basic
block while GCC has only one. From the profile it looks like each of them is
consuming quite some time, and all together they chew up a lot of time. I
think this explains the speed difference.

All of the fdiv's have the same denominator:
    %260 = fdiv fast double %253, %259
...
    %262 = fdiv fast double %219, %259
...
    %264 = fdiv fast double %224, %259
...
    %266 = fdiv fast double %230, %259
and so on. It looks like GCC takes the reciprocal
    %1445 = fdiv fast double 1.000000e+00, %1439
and then turns the fdiv's into fmul's.
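
A small Python sketch of why this rewrite needs fast-math: dividing by a value and multiplying by its rounded reciprocal do not always give bit-identical doubles, so a compiler can only substitute one for the other when it is allowed to relax IEEE semantics. The function names are illustrative only; Python floats are IEEE doubles, the same type as in the IR above.

```python
def divide_directly(numerators, denominator):
    """One division per element, as in the LLVM-only IR."""
    return [n / denominator for n in numerators]

def divide_via_reciprocal(numerators, denominator):
    """One division up front, then a multiplication per element."""
    recip = 1.0 / denominator
    return [n * recip for n in numerators]

numerators = [float(n) for n in range(1, 101)]
denominator = 49.0

exact = divide_directly(numerators, denominator)
fast = divide_via_reciprocal(numerators, denominator)

mismatches = sum(1 for a, b in zip(exact, fast) if a != b)
worst = max(abs(a - b) / abs(a) for a, b in zip(exact, fast))
print(mismatches, "of", len(numerators), "results differ")
print("largest relative difference:", worst)
```

The differences are confined to the last bits of the result, which is usually acceptable under -ffast-math but not under strict IEEE rules.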

I'm not sure what the best way to implement this optimization in LLVM is. Maybe
Shuxin has some ideas.

So it looks like a missed fast-math optimization rather than anything to do with
vectorization, which is strange as GCC only gets the big speedup when
vectorization is turned on.

Ciao, Duncan.

Hi Shuxin,

> Actually, this kind of opportunity, as outlined below, was one of my contrived
> motivating examples for fast-math. But in the last year we have not seen such
> opportunities in the real applications we care about.
>
>      t1 = x1/y
>      ...
>      t2 = x2/y
>
>   I think this is better handled by GVN/PRE: blindly converting x/y => x * (1/y)
> is not necessarily beneficial. Or maybe we can blindly perform the transformation
> at an early stage and, later on, convert it back if the reciprocals are not CSEed away.

I've opened PR16218 to track this.

Ciao, Duncan.