New test-suite result viewer/analyzer

I just put a little script into the LLVM test-suite under utils/compare.py.

It is useful for situations in which you want to analyze the results of a few test-suite runs on the command line; if you have hundreds of results or want graphs, you should rather use LNT.
The tool currently parses the JSON files produced by running "lit -o file.json" (so you should use the cmake/lit test-suite mode).
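
For reference, producing such a JSON file with the cmake/lit mode looks roughly like this (the paths, compiler choice and flags are placeholders for your own setup):

cmake -DCMAKE_C_COMPILER=/path/to/clang /path/to/test-suite
make
llvm-lit -v -j 1 -o base0.json .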

=== Basic usage ===

compare.py base0.json

Warning: 'test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test' has No metrics!
Tests: 508
Metric: exec_time

Program base0

INT2006/456.hmmer/456.hmmer 1222.90
INT2006/464.h264ref/464.h264ref 928.70
INT2006/458.sjeng/458.sjeng 873.93
INT2006/401.bzip2/401.bzip2 829.99
INT2006/445.gobmk/445.gobmk 782.92
INT2006/471.omnetpp/471.omnetpp 723.68
INT2006/473.astar/473.astar 701.71
INT2006/400.perlbench/400.perlbench 677.13
INT2006/483.xalancbmk/483.xalancbmk 502.35
INT2006/462.libquantum/462.libquantum 409.06
INT2000/164.gzip/164.gzip 150.25
FP2000/188.ammp/188.ammp 149.88
INT2000/197.parser/197.parser 135.19
INT2000/300.twolf/300.twolf 119.94
INT2000/256.bzip2/256.bzip2 105.71
             base0
count 506.000000
mean 20.563098
std 111.423325
min 0.003400
25% 0.011200
50% 0.339450
75% 4.067200
max 1222.896800

- All numbers are aligned on the decimal point (mail clients with variable-width fonts mess up the effect).
- Results are sorted by magnitude and limited to the 15 biggest ones by default. Common prefixes and suffixes in benchmark names are removed ('test-suite :: External/SPEC/' and '.test' in this case), and names that are still too long are shortened with '...' in the middle. All of this can be disabled with the --full flag.
- The pandas library prints some neat statistics below the results (see the sketch after this list for the idea behind it).
- The 'exec_time' metric is shown by default; use --metric XXX to select a different one.
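
If you are curious what happens under the hood, the gist of it is roughly the following (a minimal sketch for illustration, not compare.py's actual code; it assumes the usual lit JSON layout with a top-level 'tests' list and per-test 'metrics' dicts):

import json
import pandas as pd

def load_metric(filename, metric='exec_time'):
    # Collect one metric from a lit result file into a pandas Series;
    # tests without that metric are skipped.
    with open(filename) as f:
        data = json.load(f)
    values = {}
    for test in data['tests']:
        metrics = test.get('metrics', {})
        if metric in metrics:
            values[test['name']] = metrics[metric]
    return pd.Series(values)

series = load_metric('base0.json')
print(series.sort_values(ascending=False).head(15))  # the 15 biggest values first
print(series.describe())                             # count/mean/std/quartiles/min/max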

=== Compare multiple runs ===

compare.py --filter-short base0.json base1.json base2.json

Tests: 508
Short Running: 281 (filtered out)
Remaining: 227
Metric: exec_time

Program base0 base1 base2 diff

SingleSour...e/Benchmarks/Misc/himenobmtxpa 3.27 3.26 4.52 38.5%
MultiSource/Benchmarks/nbench/nbench 14.39 18.10 15.03 25.8%
SingleSour...Benchmarks/Shootout-C++/lists1 0.87 1.02 1.07 22.5%
MultiSourc...hmarks/MallocBench/cfrac/cfrac 2.95 2.44 2.41 22.3%
MultiSourc...chmarks/BitBench/five11/five11 8.69 10.21 8.67 17.9%
MultiSource/Benchmarks/Ptrdist/bc/bc 1.25 1.25 1.07 16.8%
SingleSour...out-C++/Shootout-C++-ackermann 1.22 1.17 1.35 16.2%
MultiSourc...chmarks/Prolangs-C++/life/life 4.23 3.76 3.75 12.8%
External/SPEC/CINT95/134.perl/134.perl 16.76 17.79 17.73 6.1%
MultiSourc...e/Applications/ClamAV/clamscan 0.80 0.82 0.77 5.9%
SingleSour...hootout-C++/Shootout-C++-sieve 3.04 3.21 3.21 5.8%
MultiSource/Applications/lemon/lemon 2.84 2.72 2.79 4.2%
SingleSour...Shootout-C++/Shootout-C++-hash 1.27 1.31 1.32 3.5%
SingleSour...h/stencils/fdtd-apml/fdtd-apml 16.15 15.61 15.66 3.5%
MultiSourc...e/Applications/sqlite3/sqlite3 5.62 5.81 5.62 3.3%
             base0 base1 base2 diff
count 226.000000 226.000000 225.000000 227.000000
mean 45.939256 45.985196 46.096998 0.013667
std 163.389800 163.494907 163.503512 0.042327
min 0.608000 0.600500 0.665200 0.000000
25% 2.432750 2.428300 2.432600 0.001370
50% 4.708250 4.697600 4.799300 0.002822
75% 9.674850 10.083075 9.698000 0.007492
max 1222.896800 1223.112600 1221.131300 0.385443

- Displays the results of the different result files next to each other, computes the relative difference between the smallest and biggest value for each program, and sorts by that difference (see the sketch below).
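
In pandas terms the 'diff' column boils down to something like this (a sketch with made-up numbers, not the tool's exact code):

import pandas as pd

def spread(row):
    # Relative difference between the biggest and the smallest value of a row.
    return (row.max() - row.min()) / row.min()

runs = pd.DataFrame({'base0': [3.27, 14.39],
                     'base1': [3.26, 18.10],
                     'base2': [4.52, 15.03]},
                    index=['himenobmtxpa', 'nbench'])
print(runs.apply(spread, axis=1))  # roughly 0.39 and 0.26; the 38.5%/25.8% above
                                   # come from the unrounded values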

=== A/B Comparisons and multiple runs ===

compare.py --filter-short base0.json base1.json base2.json vs try0.json try1.json try2.json

Tests: 508
Short Running: 283 (filtered out)
Remaining: 225
Metric: exec_time

Program lhs rhs diff

SingleSour.../Benchmarks/Linpack/linpack-pc 5.16 4.30 -16.5%
SingleSour...Benchmarks/Misc/matmul_f64_4x4 1.25 1.09 -12.8%
SingleSour...enchmarks/BenchmarkGame/n-body 1.86 1.63 -12.4%
MultiSourc...erolling-dbl/LoopRerolling-dbl 7.01 7.86 12.2%
MultiSource/Benchmarks/sim/sim 4.37 4.88 11.7%
SingleSour...UnitTests/Vectorizer/gcc-loops 3.89 3.54 -9.0%
SingleSource/Benchmarks/Misc/salsa20 9.30 8.54 -8.3%
MultiSourc...marks/Trimaran/enc-pc1/enc-pc1 1.00 0.92 -8.2%
SingleSource/UnitTests/Vector/build2 2.90 3.13 8.1%
External/SPEC/CINT2000/181.mcf/181.mcf 100.20 92.82 -7.4%
MultiSourc...VC/Symbolics-dbl/Symbolics-dbl 5.02 4.65 -7.4%
SingleSour...enchmarks/CoyoteBench/fftbench 2.73 2.53 -7.1%
External/SPEC/CFP2000/177.mesa/177.mesa 49.12 46.05 -6.2%
MultiSourc...VC/Symbolics-flt/Symbolics-flt 2.94 2.76 -6.2%
SingleSour...hmarks/BenchmarkGame/recursive 1.32 1.39 5.4%
               lhs rhs diff
count 225.000000 225.000000 225.000000
mean 46.018045 46.272968 -0.003343
std 163.377050 164.958080 0.028145
min 0.665200 0.658800 -0.164888
25% 2.424500 2.428200 -0.006311
50% 4.799300 4.650700 -0.000495
75% 9.684500 9.646200 0.004407
max 1221.131300 1219.680000 0.122050

- A/B comparison mode: Merges the metric values of the result files before the "vs" argument into the "lhs" set and the ones after it into the "rhs" set. By default it takes the smallest value per program on each side (--merge-max and --merge-average are available as well) and compares the resulting two sets, as sketched below.
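
Conceptually the merge step looks something like the following (a rough sketch with one column per run and made-up numbers; the real script differs in the details):

import pandas as pd

lhs_runs = pd.DataFrame({'base0': [5.16, 1.25],
                         'base1': [5.24, 1.27],
                         'base2': [5.31, 1.30]},
                        index=['linpack-pc', 'matmul_f64_4x4'])
rhs_runs = pd.DataFrame({'try0': [4.30, 1.09],
                         'try1': [4.41, 1.12],
                         'try2': [4.36, 1.11]},
                        index=['linpack-pc', 'matmul_f64_4x4'])

lhs = lhs_runs.min(axis=1)   # default: keep the smallest value per program
rhs = rhs_runs.min(axis=1)   # --merge-max/--merge-average would use max()/mean() instead
print(rhs / lhs - 1.0)       # negative values mean the rhs side got faster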

=== Filtering Flags ===
--filter-hash: Exclude programs whose 'hash' metric has the same value in all result files.
--filter-short: Exclude programs whose metric value is < 0.6 (typically used with 'exec_time').
--filter-blacklist blacklist.txt: Exclude programs listed in the blacklist.txt file (one name per line).
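
For example, to compare two result files while skipping short-running benchmarks as well as a hand-maintained exclusion list (the file names here are just placeholders):

compare.py --filter-short --filter-blacklist blacklist.txt base0.json try0.json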

You need Python 2.7 and the pandas library installed to get this running. Hope it helps your benchmarking/testing workflow.

- Matthias