Comparing test-suite benchmark performance compiled without TBAA, with default TBAA, and with the new TBAA struct path

Hello,

I was interested in how much Type-Based Alias Analysis (TBAA) helps to optimize code. For that purpose, I've
compared three sets of benchmarks: compiled without TBAA, compiled with the default TBAA metadata format, and
compiled with the new TBAA metadata format.
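
For context, here is a minimal C sketch (the function name is illustrative, not from the test suite) of the kind
of code TBAA affects: under strict-aliasing rules, a store through an int pointer cannot modify a float object,
so the compiler may reuse an earlier load instead of reloading it.

```c
/* With TBAA, the compiler may assume the store to *i cannot modify *f,
 * so the second read of *f can be folded to the already-loaded x
 * (effectively x + x); without TBAA, *f must be reloaded. */
float add_twice(float *f, int *i) {
    float x = *f;
    *i = 0;
    return x + *f;
}
```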

As a set of benchmarks, I've used the LLVM test suite (http://llvm.org/docs/TestingGuide.html#test-suite-overview)
which has a lot of tests already identified as benchmarks. For statistical reliability, I've executed each test
at least 40 times; for tests with a very short execution time, where 40 runs weren't enough, I increased the
number of repetitions. I've used two types of measurements: execution time and the number of CPU instructions
executed in user mode. The first is the common approach, but its accuracy is limited and results may vary between
runs of the same program. The second reflects exactly what the tested program did and is very accurate: if a
program doesn't use random values, the count is almost identical for each run. However, different instructions
have different costs (for example, moving data between registers is much faster than loading from memory), so we
still need overall execution time for a complete picture. For details about testing, see Appendix A.
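
To make the last point concrete, here is a small illustrative C sketch (not from the test suite) where two
configurations execute nearly the same number of user-mode instructions but can differ a lot in wall-clock time:

```c
#include <stddef.h>

/* Summing 'count' elements executes nearly the same number of
 * instructions regardless of the stride, but a large stride defeats
 * the caches and can make the loop several times slower; this is why
 * instruction counts alone don't give the complete picture. */
long sum_elements(const long *a, size_t count, size_t stride) {
    long s = 0;
    for (size_t i = 0; i < count; ++i)
        s += a[i * stride];
    return s;
}
```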

### Some interesting results

5 tests compiled with the new TBAA didn't pass verification (3 of them had segmentation faults). I think this is
because the new TBAA struct path isn't fully supported yet. In any case, all these tests should be investigated:

-----------------------------------------------------------------|--------------------|
Test name                                                        |       Error        |
-----------------------------------------------------------------|--------------------|
MultiSource/Applications/ClamAV/clamscan.test | different output |
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test | segmentation fault |
MultiSource/Benchmarks/FreeBench/distray/distray.test | different output |
MultiSource/Benchmarks/Olden/bh/bh.test | segmentation fault |
SingleSource/Benchmarks/Misc-C++-EH/spirit.test | segmentation fault |
-----------------------------------------------------------------|--------------------|
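
Failures like these usually come either from a miscompile on the new metadata path or from benchmark code that
violates the strict-aliasing rules TBAA relies on. A classic example of the latter (purely illustrative, not
taken from these tests):

```c
/* Undefined behavior under strict aliasing: a float object is read
 * through an unsigned lvalue. TBAA may assume the two types don't
 * alias, so such code can start failing as TBAA becomes more precise.
 * A conforming version would use memcpy() instead. */
unsigned bits_of_float(float f) {
    return *(unsigned *)&f;
}
```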

4 tests compiled without TBAA show better results than the TBAA builds. Probably some transformations enabled by
TBAA block more effective optimizations later in the pipeline (at least in these cases). I think these tests
should also be investigated, though probably with a lower priority than the previous ones:

------------------------------------------------------------|--------|--------|--------|--------|
Test name                                                   | Execution time  |CPU instructions |
                                                            |Diff with No TBAA|Diff with No TBAA|
                                                            |Default |  New   |Default |  New   |
                                                            | TBAA,% | TBAA,% | TBAA,% | TBAA,% |
------------------------------------------------------------|--------|--------|--------|--------|
MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test | -1.11 | -1.15 | -4.48 | -4.48 |
MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test| -13.64 | -13.61 | -20.68 | -20.68 |
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test | -2.21 | -2.45 | -0.51 | -0.51 |
SingleSource/Benchmarks/Misc-C++/Large/sphereflake.test | -2.45 | -3.45 | -2.41 | -3.45 |
------------------------------------------------------------|--------|--------|--------|--------|

Typically, the execution time correlated with the number of executed CPU instructions. For the following tests,
however, that was not the case: some of them executed fewer instructions but had a longer execution time (or
vice versa):

-----------------------------------------------------------------|--------|--------|--------|--------|
Test name                                                        | Execution time  |CPU instructions |
                                                                 |Diff with No TBAA|Diff with No TBAA|
                                                                 |Default |  New   |Default |  New   |
                                                                 | TBAA,% | TBAA,% | TBAA,% | TBAA,% |
-----------------------------------------------------------------|--------|--------|--------|--------|
MultiSource/Benchmarks/7zip/7zip-benchmark.test | 0.51 | 0.20 | -0.74 | -0.74 |
MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test | 0.85 | 0.86 | -0.62 | -0.62 |
MultiSource/Benchmarks/DOE-ProxyApps-C/Pathfinder/PathFinder.test| 0.73 | 0.82 | -1.09 | -1.09 |
MultiSource/Benchmarks/FreeBench/pifft/pifft.test | -1.36 | -1.38 | 1.93 | 1.93 |
MultiSource/Benchmarks/Ptrdist/anagram/anagram.test | 17.81 | 17.89 | -16.35 | -16.35 |
SingleSource/Benchmarks/Shootout/Shootout-hash.test | -9.30 | -10.31 | 0.10 | 0.55 |
SingleSource/Benchmarks/Shootout/Shootout-lists.test | -2.97 | -2.97 | 8.82 | 8.82 |
-----------------------------------------------------------------|--------|--------|--------|--------|

There are many tests where TBAA enables better optimization. For several tests in this list, the new TBAA gives
even better results than the default one:

-----------------------------------------------------------------|--------|--------|--------|--------|
Test name                                                        | Execution time  |CPU instructions |
                                                                 |Diff with No TBAA|Diff with No TBAA|
                                                                 |Default |  New   |Default |  New   |
                                                                 | TBAA,% | TBAA,% | TBAA,% | TBAA,% |
-----------------------------------------------------------------|--------|--------|--------|--------|
Bitcode/Benchmarks/Halide/blur/halide_blur.test | 239.61 | 239.62 | 413.65 | 413.65 |
SingleSource/Benchmarks/Misc/himenobmtxpa.test | 64.58 | 64.97 | 219.74 | 219.74 |
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt.t| 46.74 | 47.04 | 48.01 | 48.01 |
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test | 41.32 | 41.57 | 54.97 | 54.97 |
SingleSource/Benchmarks/Dhrystone/dry.test | 20.02 | 20.02 | 11.54 | 11.54 |
SingleSource/Benchmarks/Dhrystone/fldry.test | 19.96 | 19.95 | 18.52 | 18.52 |
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test | 17.43 | 17.47 | 14.73 | 14.73 |
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.| 16.41 | 16.39 | 0 | 0 |
MultiSource/Benchmarks/TSVC/Equivalencing-dbl/Equivalencing-dbl.t| 14.71 | 15.09 | 38.99 | 38.99 |
MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test | 11.73 | 11.75 | 18.20 | 18.20 |
MultiSource/Benchmarks/FreeBench/neural/neural.test | 7.20 | 7.29 | 8.15 | 8.15 |
MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test | 5.78 | 5.91 | 6.35 | 6.35 |
MultiSource/Benchmarks/McCat/18-imp/imp.test | 5.30 | 5.24 | 2.73 | 2.73 |
MultiSource/Benchmarks/MiBench/network-dijkstra/network-dijkstra.| 4.41 | 3.85 | 0 | 0 |
SingleSource/Benchmarks/Misc/oourafft.test | 3.50 | 3.46 | 3.58 | 3.58 |
MultiSource/Applications/JM/lencod/lencod.test | 2.67 | 2.30 | 3.03 | 3.03 |
MultiSource/Benchmarks/MallocBench/espresso/espresso.test | 2.14 | 2.89 | 2.30 | 2.30 |
MultiSource/Benchmarks/Olden/bisort/bisort.test | 2.04 | 2.12 | 3.59 | 3.59 |
MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test | 1.98 | 2.03 | 2.66 | 2.66 |
MultiSource/Benchmarks/sim/sim.test | 1.89 | 1.85 | 2.06 | 2.06 |
MultiSource/Applications/JM/ldecod/ldecod.test | 1.86 | 1.76 | 3.59 | 3.59 |
MultiSource/Benchmarks/McCat/08-main/main.test | 1.73 | 1.74 | 3.46 | 3.46 |
MultiSource/Benchmarks/mafft/pairlocalalign.test | 1.74 | 1.75 | 2.80 | 2.80 |
SingleSource/Benchmarks/McGill/chomp.test | 1.29 | 3.32 | 8.75 | 8.75 |
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.| 1.18 | 1.06 | 0.87 | 0.87 |
MultiSource/Applications/sgefa/sgefa.test | 1.11 | 1.22 | 2.61 | 2.61 |
MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test | 1.06 | 1.22 | 0.72 | 0.72 |
MultiSource/Benchmarks/VersaBench/dbms/dbms.test | 1.05 | 1.28 | 2.00 | 2.00 |
MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test | 0.59 | 0.84 | 1.66 | 1.66 |
MultiSource/Benchmarks/McCat/05-eks/eks.test | 0.53 | 1.04 | 2.28 | 2.28 |
MultiSource/Applications/minisat/minisat.test | 0.90 | 0.76 | 1.49 | 1.81 |
SingleSource/Benchmarks/CoyoteBench/fftbench.test | 0.44 | 0.52 | 1.10 | 1.35 |
MultiSource/Benchmarks/Bullet/bullet.test | 0.36 | 0.47 | 0.33 | 2.34 |
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.| 0.26 | 0.43 | 0.79 | 0.79 |
MultiSource/Benchmarks/Olden/mst/mst.test | 0.21 | 0.40 | 0.69 | 0.69 |
MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test | 0.21 | 0.23 | 0 | 1.16 |
MultiSource/Benchmarks/McCat/09-vor/vor.test | 0 | 0.59 | 2.47 | 2.47 |
-----------------------------------------------------------------|--------|--------|--------|--------|

Full testing results are available in Appendix B.

## Appendix A: Testing details

### Environment and preparation

All testing was done on a dedicated desktop computer with an Intel i7-4770 CPU and 16 GB of memory.
- Operating system: 64-bit Ubuntu 16.04.1
- Linux kernel: 4.13.0-37
- test-suite SVN rev: 328330 2018-03-23 08:58:41 -0700 (Fri, 23 Mar 2018)
- LLVM SVN rev: 329926 2018-04-12 10:01:46 -0700 (Thu, 12 Apr 2018)
- clang SVN rev: 329924 2018-04-12 09:41:55 -0700 (Thu, 12 Apr 2018)

I followed the useful tips from https://www.llvm.org/docs/Benchmarking.html: shell only, no GUI, no other
services, and 2 CPUs isolated from other processes and used exclusively for testing.

The test-suite source files were moved to a ramdisk (because some tests read input files from the source
tree), and 3 builds were made in separate folders on the ramdisk with the following parameters:
1. No TBAA: CFLAGS and CXXFLAGS = " -O3 -relaxed-aliasing -static"
2. Default TBAA: CFLAGS and CXXFLAGS = " -O3 -static"
3. New TBAA: CFLAGS and CXXFLAGS = " -O3 -Xclang -new-struct-path-tbaa -static"

The following CMake option was specified for all three builds: -DTEST_SUITE_BENCHMARKING_ONLY=ON

### Execution

For each *.test file found in the first (No TBAA) folder, I did the following:
1. Execute the command line described in the file's "RUN:" section, with stdout and stderr redirected to a file.
2. Execute the verification programs described in the "VERIFY:" sections, with %o replaced by that file.
3. If verification was successful, execute the test from the RUN section repeatedly, at least 40 times (if
   the first execution took less than one second, the number of repetitions was 40/execution_time, but not
   more than 1000; see the sketch after this list), under the following conditions:
   - the test was executed on the shielded CPUs
   - stdout and stderr were redirected to /dev/null
   - execution time and the number of CPU instructions executed in user mode were collected with the perf utility
4. Steps #2 and #3 were then executed for the same test in the "default" and "new-tbaa" folders.
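
A small C sketch of the repetition rule from step #3 (the function name is mine; the original logic was scripted
around the perf runs):

```c
/* Hedged reconstruction of the rule described above: at least 40 runs;
 * for sub-second tests, roughly 40/execution_time runs (which keeps the
 * total measurement time near 40 seconds), capped at 1000 repetitions. */
static int num_repetitions(double first_run_seconds) {
    if (first_run_seconds >= 1.0)
        return 40;
    int n = (int)(40.0 / first_run_seconds);
    return n > 1000 ? 1000 : n;
}
```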

After testing, the data was reviewed. Tests with suspicious results, such as unusually large differences, were
re-executed several times with an increased number of repetitions.

## Appendix B: Full testing results