pb05 results for current llvm/dragonegg

Attached are the Polyhedron 2005 benchmark results for current llvm/dragonegg svn
on x86_64-apple-darwin11 built against Xcode 4.3.2 and FSF gcc 4.6.3. The benchmarks
for -msse3 and -msse4 appear identical (at least for degg+optnz). This is fortunate
since there seems to be a bug in -msse4 on 2.33 GHz (T7600) Intel Core 2 Duo Merom
(http://llvm.org/bugs/show_bug.cgi?id=12434).
                   Jack

llvm/dragonegg r153877

dragonegg:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 %n.f90 -o %n

degg+vectorize:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-llvm-option=-vectorize %n.f90 -o %n

degg+optnz:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

gfortran:
gfortran-fsf-4.6 -msse3 -ffast-math -funroll-loops -O3 %n.f90 -o %n

Ave Run (secs)
               dragonegg degg+vectorize degg+optnz gfortran
ac 12.45 12.45 8.85 8.80
aermod 16.15 16.05 14.80 17.48
air 7.10 7.11 6.46 5.50
capacita 40.00 39.96 37.72 32.62
channel 2.16 2.15 1.99 1.84
doduc 29.13 28.41 27.48 26.74
fatigue 8.75 9.03 8.11 8.44
gas_dyn 11.72 11.80 4.47 4.26
induct 24.02 24.91 12.08 13.65
linpk 15.40 15.78 15.74 15.45
mdbx 11.80 12.22 11.86 11.20
nf 28.45 28.50 29.25 27.91
protein 38.15 39.26 37.87 32.49
rnflow 32.25 32.35 26.47 24.06
test_fpu 11.34 11.35 9.31 8.04
tftt 1.91 1.92 1.93 1.87

Geometric Mean 13.50 13.62 11.34 10.87
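As a quick cross-check (not part of the original report), the geometric mean row can be recomputed from the per-benchmark run times; a minimal Python sketch for the dragonegg column:

```python
import math

# Ave Run (secs) for the "dragonegg" column above, in benchmark order
# (ac, aermod, air, capacita, channel, doduc, fatigue, gas_dyn,
#  induct, linpk, mdbx, nf, protein, rnflow, test_fpu, tftt).
times = [12.45, 16.15, 7.10, 40.00, 2.16, 29.13, 8.75, 11.72,
         24.02, 15.40, 11.80, 28.45, 38.15, 32.25, 11.34, 1.91]

# Geometric mean: exp of the mean of the logs.
geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
print(f"{geo_mean:.2f}")  # ~13.50, matching the summary row above
```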

Compile (secs)
               dragonegg degg+vectorize degg+optnz gfortran
ac 0.33 0.38 0.72 1.27
aermod 25.91 27.58 32.34 43.91
air 1.07 1.25 1.52 2.25
capacita 0.49 0.52 0.89 1.71
channel 0.29 0.36 0.50 0.62
doduc 1.71 4.50 3.25 5.34
fatigue 0.84 0.97 1.19 1.76
gas_dyn 0.67 0.68 1.20 3.02
induct 1.60 2.14 2.82 3.99
linpk 0.22 0.24 0.47 0.78
mdbx 0.63 0.77 1.16 1.85
nf 0.37 0.40 0.70 1.66
protein 0.93 1.02 1.75 4.01
rnflow 1.20 1.25 2.63 5.44
test_fpu 0.88 0.92 2.13 4.39
tftt 0.21 0.24 0.34 0.56

Executable (bytes)
               dragonegg degg+vectorize degg+optnz gfortran
ac 26856 26856 39120 50968
aermod 1043700 1055988 1046288 1265640
air 62004 62004 53740 73988
capacita 41416 41416 45552 73896
channel 22808 22808 26768 34784
doduc 128448 128448 136996 197240
fatigue 69824 69824 69840 86080
gas_dyn 59112 59112 67416 119744
induct 163152 167248 167344 174976
linpk 18752 18752 27056 38648
mdbx 53692 53692 57884 82112
nf 23960 23960 32104 71800
protein 75032 75032 87208 132040
rnflow 71896 71896 96632 181120
test_fpu 54272 54272 78776 155072
tftt 18640 18640 18488 30768

Hi Jack

          dragonegg degg+vectorize degg+optnz  gfortran


ac 12.45 12.45 8.85 8.80
gas_dyn 11.72 11.80 4.47 4.26
induct 24.02 24.91 12.08 13.65
rnflow 32.25 32.35 26.47 24.06

Any idea what might cause such differences here?

Hi Jack,

   Attached are the Polyhedron 2005 benchmark results for current llvm/dragonegg svn
on x86_64-apple-darwin11 built against Xcode 4.3.2 and FSF gcc 4.6.3.

thanks for the numbers. How does this compare to LLVM 3.0 - were there any
regressions?

Ciao, Duncan.


Hi Anton,

               dragonegg degg+vectorize degg+optnz gfortran
ac 12.45 12.45 8.85 8.80
gas_dyn 11.72 11.80 4.47 4.26
induct 24.02 24.91 12.08 13.65
rnflow 32.25 32.35 26.47 24.06

Any idea what might cause such differences here?

I haven't analysed these, but as a general remark: if "degg+optnz" does much
better than "dragonegg" then that indicates a weakness in LLVM's IR level
optimizers, while if "gfortran" does much better than "degg+optnz" then that
indicates a weakness in LLVM's codegen. Applying this to the above suggests
that most of the differences are coming from LLVM's IR level optimizers not
doing a good job somewhere.
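This rule of thumb can be applied mechanically to the table: the dragonegg/degg+optnz ratio approximates the IR-level gap, and the degg+optnz/gfortran ratio the codegen gap. A small illustrative Python sketch (the names "ir_gap" and "codegen_gap" are mine, not terminology from the thread):

```python
# Run times (secs) from the table above: (dragonegg, degg+optnz, gfortran).
runs = {
    "ac":      (12.45,  8.85,  8.80),
    "gas_dyn": (11.72,  4.47,  4.26),
    "induct":  (24.02, 12.08, 13.65),
    "rnflow":  (32.25, 26.47, 24.06),
}

for name, (degg, optnz, gfort) in runs.items():
    ir_gap = degg / optnz        # > 1 suggests LLVM's IR-level optimizers lag
    codegen_gap = optnz / gfort  # > 1 suggests LLVM's codegen lags
    print(f"{name}: IR gap {ir_gap:.2f}x, codegen gap {codegen_gap:.2f}x")
```

For gas_dyn the IR gap (about 2.6x) dwarfs the codegen gap (about 1.05x), consistent with the conclusion above.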

Ciao, Duncan.

Hi Jack,

   Attached are the Polyhedron 2005 benchmark results for current llvm/dragonegg svn
on x86_64-apple-darwin11 built against Xcode 4.3.2 and FSF gcc 4.6.3.

thanks for the numbers. How does this compare to LLVM 3.0 - were there any
regressions?

The results from just before llvm/dragonegg 3.0 was released are at...

http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/044091.html

It does look as if the ac benchmark has regressed from 10.80 sec
in llvm/dragonegg 3.0 to 12.45 sec in llvm/dragonegg 3.1. These are
slightly different FSF gcc 4.6 releases (4.6.2svn vs 4.6.3), but I would
be shocked if that was the origin of the performance regression.
   The results for -fplugin-arg-dragonegg-enable-gcc-optzns don't seem
much improved in llvm 3.1, so I assume this means little progress was made
in eliminating the scalarization of vectorizations in this release. Did
we even get any code added to llvm that would allow us to identify instances
of these scalarizations through a compiler warning? Also, the current
-fplugin-arg-dragonegg-llvm-option=-vectorize option seems to do almost
nothing in terms of vectorization. Do we need to pass any additional flags
to actually achieve autovectorization via llvm (in the absence of -ftree-vectorize
and -fplugin-arg-dragonegg-enable-gcc-optzns)?
                 Jack

Hi Anton,

>> dragonegg degg+vectorize degg+optnz gfortran
>> ac 12.45 12.45 8.85 8.80
>> gas_dyn 11.72 11.80 4.47 4.26
>> induct 24.02 24.91 12.08 13.65
>> rnflow 32.25 32.35 26.47 24.06
> Any idea what might cause such differences here?

I haven't analysed these, but as a general remark: if "degg+optnz" does much
better than "dragonegg" then that indicates a weakness in LLVM's IR level
optimizers, while if "gfortran" does much better than "degg+optnz" then that
indicates a weakness in LLVM's codegen. Applying this to the above suggests
that most of the differences are coming from LLVM's IR level optimizers not
doing a good job somewhere.

Duncan,
   I can add a table column benchmarking...

de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

which would separate out the vectorization component. This might be more
informative in identifying weak points in LLVM's IR-level optimizers.
            Jack

> Hi Jack,
>
>> Attached are the Polyhedron 2005 benchmark results for current
>> llvm/dragonegg svn on x86_64-apple-darwin11 built against Xcode
>> 4.3.2 and FSF gcc 4.6.3.
>
> thanks for the numbers. How does this compare to LLVM 3.0 - were
> there any regressions?

The results from just before llvm/dragonegg 3.0 was released are at...

http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/044091.html

It does look as if the ac benchmark has regressed from 10.80 sec
in llvm/dragonegg 3.0 to 12.45 sec in llvm/dragonegg 3.1. These are
slightly different FSF gcc 4.6 releases (4.6.2svn vs 4.6.3), but I would
be shocked if that was the origin of the performance regression.
   The results for -fplugin-arg-dragonegg-enable-gcc-optzns don't
seem much improved in llvm 3.1, so I assume this means little progress
was made in eliminating the scalarization of vectorizations in this
release. Did we even get any code added to llvm that would allow us
to identify instances of these scalarizations through a compiler
warning? Also, the current
-fplugin-arg-dragonegg-llvm-option=-vectorize option seems to do
almost nothing in terms of vectorization. Do we need to pass any
additional flags to actually achieve autovectorization via llvm

Currently, we only have basic-block vectorization, so to get
autovectorization of loops (which is probably what we want here), the
loops need to be unrolled. I see that all categories include
-funroll-loops; does that do anything if we're not using gcc's
optimizations?

I generally run with both -unroll-allow-partial and -unroll-runtime so
that llvm's unroller will do as much as it can. Also, in many of these
cases, it looks like the vectorization is doing *something*, just not
anything overly helpful ;) -vectorize is new, so it is helpful to
get feedback on what is actually useful.

You might try including -bb-vectorize-aligned-only (sse3 does not
actually have unaligned loads/stores, right?). Other things to try
include -bb-vectorize-no-ints (determining when to vectorize integer
ops may be trickier than for floating-point ops); also, setting the
required chain depth to something less than the current default of 6
(for example, -bb-vectorize-req-chain-depth=3) will cause a lot more
vectorization.

-Hal


> > Hi Jack,
> >
> >> Attached are the Polyhedron 2005 benchmark results for current
> >> llvm/dragonegg svn on x86_64-apple-darwin11 built against Xcode
> >> 4.3.2 and FSF gcc 4.6.3.
> >
> > thanks for the numbers. How does this compare to LLVM 3.0 - were
> > there any regressions?
>
> The results from just before llvm/dragonegg 3.0 was released are at...
>
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/044091.html
>
> It does look as if the ac benchmark has regressed from 10.80 sec
> in llvm/dragonegg 3.0 to 12.45 sec in llvm/dragonegg 3.1. These are
> slightly different FSF gcc 4.6 releases (4.6.2svn vs 4.6.3), but I would
> be shocked if that was the origin of the performance regression.
> The results for -fplugin-arg-dragonegg-enable-gcc-optzns don't
> seem much improved in llvm 3.1, so I assume this means little progress
> was made in eliminating the scalarization of vectorizations in this
> release. Did we even get any code added to llvm that would allow us
> to identify instances of these scalarizations through a compiler
> warning? Also, the current
> -fplugin-arg-dragonegg-llvm-option=-vectorize option seems to do
> almost nothing in terms of vectorization. Do we need to pass any
> additional flags to actually achieve autovectorization via llvm

Currently, we only have basic-block vectorization, so to get
autovectorization of loops (which is probably what we want here), the
loops need to be unrolled. I see that all categories include
-funroll-loops; does that do anything if we're not using gcc's
optimizations?

I generally run with both -unroll-allow-partial and -unroll-runtime so
that llvm's unroller will do as much as it can. Also, in many of these
cases, it looks like the vectorization is doing *something*, just not
anything overly helpful ;) -vectorize is new, so it is helpful to
get feedback on what is actually useful.

You might try including -bb-vectorize-aligned-only (sse3 does not
actually have unaligned loads/stores, right?). Other things to try
include -bb-vectorize-no-ints (determining when to vectorize integer
ops may be trickier than for floating-point ops); also, setting the
required chain depth to something less than the current default of 6
(for example, -bb-vectorize-req-chain-depth=3) will cause a lot more
vectorization.

So these need to be passed on their own instances of -fplugin-arg-dragonegg-llvm-option=
I guess. I'll try...

de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-llvm-option=-vectorize -fplugin-arg-dragonegg-llvm-option=-unroll-allow-partial -fplugin-arg-dragonegg-llvm-option=-unroll-runtime -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-aligned-only -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-no-ints %n.f90 -o %n

Unfortunately it doesn't seem that dragonegg can currently parse something like...

-fplugin-arg-dragonegg-llvm-option=-bb-vectorize-req-chain-depth=3

% de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-llvm-option=-vectorize -fplugin-arg-dragonegg-llvm-option=-unroll-allow-partial -fplugin-arg-dragonegg-llvm-option=-unroll-runtime -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-aligned-only -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-no-ints -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-req-chain-depth=3 ac.f90 -o ac
f951: error: malformed option -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-req-chain-depth=3 (multiple '=' signs)

Duncan, any idea how to work around that for passing -bb-vectorize-req-chain-depth=3?
          Jack

Attached are the Polyhedron 2005 benchmark results for current llvm/dragonegg svn
on x86_64-apple-darwin11 built against Xcode 4.3.2 and FSF gcc 4.6.3. The benchmarks
for -msse3 and -msse4 appear identical (at least for degg+optnz). This is fortunate
since there seems to be a bug in -msse4 on 2.33 GHz (T7600) Intel Core 2 Duo Merom
(http://llvm.org/bugs/show_bug.cgi?id=12434). I've added two additional entries to
the table. The first, degg+novect+optnz, should show the optimizations achieved by
-fplugin-arg-dragonegg-enable-gcc-optzns in the absence of autovectorization by
FSF gcc; this exposes the optimization opportunities missed at the LLVM IR level
outside of autovectorization. The second entry is for the new LLVM autovectorization
option with all of its related options set. This shows mixed results, with some
benchmarks improved over the simple -fplugin-arg-dragonegg-llvm-option=-vectorize
and some worsened in performance.
                   Jack

llvm/dragonegg r153877

dragonegg:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 %n.f90 -o %n

degg+vectorize:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-llvm-option=-vectorize %n.f90 -o %n

degg+optnz:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

gfortran:
gfortran-fsf-4.6 -msse3 -ffast-math -funroll-loops -O3 %n.f90 -o %n

degg+novect+optnz:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

degg+fullvect+optnz:
de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-llvm-option=-vectorize -fplugin-arg-dragonegg-llvm-option=-unroll-allow-partial -fplugin-arg-dragonegg-llvm-option=-unroll-runtime -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-aligned-only -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-no-ints %n.f90 -o %n

Ave Run (secs)
               dragonegg degg+vectorize degg+optnz gfortran degg+novect+optnz degg+fullvect+optnz
ac 12.45 12.45 8.85 8.80 8.90 10.89
aermod 16.15 16.05 14.80 17.48 14.12 15.84
air 7.10 7.11 6.46 5.50 6.46 8.15
capacita 40.00 39.96 37.72 32.62 39.38 39.94
channel 2.16 2.15 1.99 1.84 2.15 2.56
doduc 29.13 28.41 27.48 26.74 28.27 29.05
fatigue 8.75 9.03 8.11 8.44 7.28 10.49
gas_dyn 11.72 11.80 4.47 4.26 10.02 11.63
induct 24.02 24.91 12.08 13.65 20.54 24.68
linpk 15.40 15.78 15.74 15.45 15.39 15.46
mdbx 11.80 12.22 11.86 11.20 11.82 11.50
nf 28.45 28.50 29.25 27.91 29.17 28.16
protein 38.15 39.26 37.87 32.49 39.08 38.62
rnflow 32.25 32.35 26.47 24.06 28.75 31.05
test_fpu 11.34 11.35 9.31 8.04 10.88 10.19
tftt 1.91 1.92 1.93 1.87 1.94 1.90

Geometric Mean 13.50 13.62 11.34 10.87 12.53 13.65

Compile (secs)
               dragonegg degg+vectorize degg+optnz gfortran degg+novect+optnz degg+fullvect+optnz
ac 0.33 0.38 0.72 1.27 0.71 0.39
aermod 25.91 27.58 32.34 43.91 25.13 23.62
air 1.07 1.25 1.52 2.25 1.36 1.34
capacita 0.49 0.52 0.89 1.71 0.71 0.98
channel 0.29 0.36 0.50 0.62 0.42 0.49
doduc 1.71 4.50 3.25 5.34 2.75 5.42
fatigue 0.84 0.97 1.19 1.76 1.00 1.24
gas_dyn 0.67 0.68 1.20 3.02 0.90 1.81
induct 1.60 2.14 2.82 3.99 2.53 2.15
linpk 0.22 0.24 0.47 0.78 0.30 0.46
mdbx 0.63 0.77 1.16 1.85 0.99 1.12
nf 0.37 0.40 0.70 1.66 0.42 1.22
protein 0.93 1.02 1.75 4.01 1.40 2.73
rnflow 1.20 1.25 2.63 5.44 1.72 2.85
test_fpu 0.88 0.92 2.13 4.39 1.26 2.38
tftt 0.21 0.24 0.34 0.56 0.30 0.27

Executable (bytes)
               dragonegg degg+vectorize degg+optnz gfortran degg+novect+optnz degg+fullvect+optnz
ac 26856 26856 39120 50968 39120 35144
aermod 1043700 1055988 1046288 1265640 1013488 1146196
air 62004 62004 53740 73988 53740 78392
capacita 41416 41416 45552 73896 41416 70096
channel 22808 22808 26768 34784 22672 34984
doduc 128448 128448 136996 197240 128868 173512
fatigue 69824 69824 69840 86080 65712 78016
gas_dyn 59112 59112 67416 119744 59160 91952
induct 163152 167248 167344 174976 176696 179552
linpk 18752 18752 27056 38648 18904 31200
mdbx 53692 53692 57884 82112 53788 70080
nf 23960 23960 32104 71800 23912 48568
protein 75032 75032 87208 132040 78912 132376
rnflow 71896 71896 96632 181120 67928 137528
test_fpu 54272 54272 78776 155072 50144 111640
tftt 18640 18640 18488 30768 18488 22744

Hi Jack,

Duncan, any idea how to work around that for passing -bb-vectorize-req-chain-depth=3?

it is being rejected by GCC's plugin options parser. I just implemented a hack
in dragonegg in which colons will be morphed into equals signs. So you should
now be able to pass -bb-vectorize-req-chain-depth:3 and have it work.
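The workaround amounts to a one-line rewrite of the option string. A Python sketch of the idea (dragonegg itself is C++, so this is only an illustration, not its actual code):

```python
# GCC's plugin option parser rejects -fplugin-arg-...-llvm-option values
# that contain a second '=' sign, so dragonegg accepts ':' in the value
# and morphs it back into '=' before handing the string to LLVM.
def morph_plugin_option(value):
    """Turn the colon form 'name:val' into the 'name=val' form LLVM expects."""
    return value.replace(":", "=")

print(morph_plugin_option("-bb-vectorize-req-chain-depth:3"))
# -bb-vectorize-req-chain-depth=3
```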

Ciao, Duncan.

Duncan,
   It would also be nice if -fplugin-arg-dragonegg-llvm-option= could allow multiple
entries surrounded by quotes. Yesterday when I tested...

de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-llvm-option=-vectorize -fplugin-arg-dragonegg-llvm-option=-unroll-allow-partial -fplugin-arg-dragonegg-llvm-option=-unroll-runtime -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-aligned-only -fplugin-arg-dragonegg-llvm-option=-bb-vectorize-no-ints %n.f90 -o %n

this was longer than the pbharness would allow, so I had to hard-code those options into the
de-gfortran46 compiler wrapper I use. It would be nice if we could group these together in
order to be more concise...

de-gfortran46 -msse3 -ffast-math -funroll-loops -O3 -fno-tree-vectorize -fplugin-arg-dragonegg-llvm-option="-vectorize -unroll-allow-partial -unroll-runtime -bb-vectorize-aligned-only -bb-vectorize-no-ints" %n.f90 -o %n

with the current -fplugin-arg-dragonegg-llvm-option or a new -fplugin-arg-dragonegg-llvm-options.
                Jack

Hi Jack,

    It would also be nice if -fplugin-arg-dragonegg-llvm-option= could allow multiple
entries surrounded by quotes. Yesterday when I tested...

I implemented this: options passed this way are now split on spaces.

Ciao, Duncan.

Hi Anton,

               dragonegg degg+vectorize degg+optnz gfortran
ac 12.45 12.45 8.85 8.80
gas_dyn 11.72 11.80 4.47 4.26
induct 24.02 24.91 12.08 13.65
rnflow 32.25 32.35 26.47 24.06

Any idea what might cause such differences here?

if I'm reading Jack's latest numbers right, for gas_dyn and induct the
difference is mainly due to GCC's vectorizer:

with GCC's vectorizer and other optimizations:

gas_dyn 4.47
induct 12.08

without GCC's vectorizer but with GCC's other optimizations:

gas_dyn 10.02
induct 20.54

without any GCC optimizations, only LLVM's optimizers:

gas_dyn 11.72
induct 24.02

So even without vectorization GCC is doing a better job, but not hugely
better.
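The three run times quoted above can be turned into multiplicative factors: going from plain dragonegg to degg+novect+optnz isolates GCC's scalar optimizations, and from there to degg+optnz isolates its vectorizer. A quick Python sketch of that arithmetic (my decomposition, using the figures above):

```python
# (LLVM-only, +GCC scalar optzns, +GCC vectorizer) run times in seconds,
# taken from the three sets of numbers quoted above.
times = {
    "gas_dyn": (11.72, 10.02, 4.47),
    "induct":  (24.02, 20.54, 12.08),
}

for name, (llvm_only, gcc_scalar, gcc_vect) in times.items():
    scalar_speedup = llvm_only / gcc_scalar  # from GCC's non-vectorizer optimizations
    vect_speedup = gcc_scalar / gcc_vect     # from GCC's vectorizer on top of those
    print(f"{name}: scalar {scalar_speedup:.2f}x, vectorizer {vect_speedup:.2f}x")
```

For gas_dyn the vectorizer accounts for roughly a 2.2x factor against about 1.2x for everything else, which is what "better, but not hugely better" looks like in numbers.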

Ciao, Duncan.

Hi Jack,

                dragonegg degg+vectorize degg+optnz gfortran degg+novect+optnz degg+fullvect+optnz
ac 12.45 12.45 8.85 8.80 8.90 10.89

for this one it looks like the major reason for the difference is that GCC turns
this floating point division:

       DATA d2/2147483647.D0/
...
       GGL = Ds/d2

into multiplication by the inverse (which is OK because of -ffast-math), whereas
LLVM does not. This results in the following difference in the LLVM IR:

GCC optimizers:

   %3 = fmul double %2, 0x3E00000000200000

LLVM optimizers:

   %3 = fdiv double %2, 0x41DFFFFFFFC00000

The code generators don't help out even though they are passed
-enable-unsafe-fp-math (aka -ffast-math):

GCC optimizers + LLVM codegen:

         mulsd .LCPI4_2(%rip), %xmm0

LLVM optimizers + LLVM codegen:

         divsd .LCPI2_1(%rip), %xmm0

I'm surprised that the code generators didn't get this.
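The two hex constants in the IR above really are the divisor and its correctly rounded reciprocal; this can be checked by decoding the IEEE-754 bit patterns, for example with this Python sketch (my code, not from the thread):

```python
import struct

def bits_to_double(bits):
    """Reinterpret a 64-bit integer as an IEEE-754 double."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]

divisor = bits_to_double(0x41DFFFFFFFC00000)     # constant in the fdiv
reciprocal = bits_to_double(0x3E00000000200000)  # constant in the fmul

print(divisor)                      # 2147483647.0, i.e. d2 in the source
print(reciprocal == 1.0 / divisor)  # True: fmul by the rounded reciprocal
```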

Ciao, Duncan.

With the attached patch to turn x/c into x*(1.0/c) in the code generators
if -ffast-math is enabled, "ac" with LLVM optimizers goes from 40% slower
to 5% slower when compared to "ac" compiled with the GCC optimizers.

Currently LLVM does very little in the way of -ffast-math optimizations.
There's clearly a lot of room for improvement here.

Ciao, Duncan.

recip.diff (1.21 KB)

Duncan, Jack, et al.,

I realized yesterday that the basic-block vectorizer had not been
vectorizing selects, so I've now corrected that, and I also added the
capability for vectorizing pointers and generating the single-index
vectorized-GEPs that Nadav added a few months ago.

In my autovectorization benchmark suite, the select vectorization
triggers only once, and the GEP vectorization does not trigger at all,
so it is possible that these changes will currently have little
practical effect. Regardless, I am curious whether either of these
things might have an impact on the quoted results (especially the
select vectorization, as I imagine that the fortran frontend might be
smart enough to generate those).

Jack, if you have a chance to re-run these benchmarks, I'd be interested
to know if the result for the case using the basic-block vectorizer has
changed (>= r154735).

FWIW, I also added some additional options to help troubleshoot
regressions: bb-vectorize-no-pointers, bb-vectorize-no-select,
bb-vectorize-no-gep.

Thanks again,
Hal