trunk's optimizer generates slower code than 3.5

I submitted this problem report to clang's Bugzilla, but no one seems
to care, so I am sending it to the mailing list.

clang 3.7 svn (trunk 229055 at the time I reported this problem)
generates slower code than 3.5 (Apple LLVM version 6.0
(clang-600.0.56) (based on LLVM 3.5svn)) for the following code.

It is an "8 queens puzzle" solver written as an educational example.
Compiled by either clang 3.5 or 3.7, it gives the correct answer, but
clang 3.5 generates code which runs 20% faster than 3.6/3.7.

Also confirmed with the llvm 3.5.1 release and the llvm 3.6 release
branch on x86_64-apple-darwin14...

% clang-3.5 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
-fno-exceptions -o 8 8.c
% time ./8 9
352 solutions
3.603u 0.002s 0:03.60 100.0% 0+0k 0+0io 2pf+0w
% time ./8 10
724 solutions
104.217u 0.059s 1:44.30 99.9% 0+0k 0+0io 2pf+0w

% clang-3.6 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
-fno-exceptions -o 8 8.c
% time ./8 9
352 solutions
4.050u 0.001s 0:04.05 100.0% 0+0k 0+0io 2pf+0w
% time ./8 10
724 solutions
114.808u 0.041s 1:54.86 99.9% 0+0k 0+0io 2pf+0w
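
For reference, the relative slowdown implied by the user times in the
two transcripts above can be worked out directly. A small sketch in
Python, using only the four user-time figures quoted there (the helper
name slowdown_percent is made up for this illustration):

```python
def slowdown_percent(old_seconds, new_seconds):
    """Return how much longer the new run takes, as a percentage of the old run."""
    return (new_seconds - old_seconds) / old_seconds * 100.0


# User times in seconds, copied from the transcripts above:
# board size -> (clang-3.5 time, clang-3.6 time)
timings = {
    9: (3.603, 4.050),
    10: (104.217, 114.808),
}

for n, (t35, t36) in timings.items():
    print(f"./8 {n}: {slowdown_percent(t35, t36):.1f}% slower with clang-3.6")
```

On these particular runs that works out to about 12.4% for "./8 9" and
about 10.2% for "./8 10", so the 20% figure mentioned earlier
presumably comes from other measurements.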

The regressions in the performance of generated code, introduced
by the llvm 3.6 release, don't seem to be limited to this "8 queens
puzzle" solver test case. See...

where a big hit in the performance of the Sparse Matrix Multiply test
of the SciMark v2.0 benchmark was observed, as well as others.
Do you really want to release 3.6 with this level of performance regression?

Do any of the build-bots routinely run the SciMark v2.0 benchmark?
If so, might not an examination of those logs reveal the commit range
at which the optimizations in that benchmark degraded?
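
The kind of log examination suggested here could be sketched roughly
as follows. This is only an illustration in Python: the (revision,
score) pairs, the function name and the 10% threshold are all
assumptions, not the format of any real build-bot log.

```python
def find_regression_range(results, threshold=0.10):
    """Return (last_good_rev, first_bad_rev) for the first drop in score
    larger than `threshold` (a fraction), or None if there is no such drop.

    `results` is a list of (revision, score) pairs sorted by revision,
    where a higher score is better.
    """
    for (prev_rev, prev_score), (rev, score) in zip(results, results[1:]):
        if prev_score > 0 and (prev_score - score) / prev_score > threshold:
            return prev_rev, rev
    return None


# Hypothetical revisions and scores, for illustration only.
history = [(100, 50.0), (110, 50.2), (120, 41.0), (130, 41.1)]
print(find_regression_range(history))  # (110, 120)
```

Note that this compares only adjacent runs, so a slow degradation
spread over many commits would go unnoticed; bisecting the reported
range by hand would still be needed to find the exact commit.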

Using the SciMark 2.0 code from compiled with the

make CFLAGS="-O3 -march=native"

I am able to reproduce the 22% performance regression in the run time
of the Sparse matmult benchmark.
For 10 runs of the scimark2 benchmark, I get 998.439+/-0.4828 with
the release llvm clang 3.5.1 compiler
and 1217.363+/-1.1004 for the current clang 3.6svn from the 3.6 branch.
Not good.
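
For anyone repeating the measurement, figures of the form
"mean+/-spread" over a set of runs can be produced along these lines.
A minimal sketch in Python; it assumes the "+/-" value is a sample
standard deviation, which the message above does not actually state,
and the helper names are made up for this illustration.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of per-run scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def percent_difference(baseline, other):
    """Return the difference of `other` from `baseline` as a percentage of `baseline`."""
    return (other - baseline) / baseline * 100.0


# The two means quoted above (3.5.1 release vs. 3.6 branch).
print(f"{percent_difference(998.439, 1217.363):.1f}%")  # 21.9%
```

The two quoted means differ by about 21.9%, which matches the 22%
mentioned above.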

The same 22% performance regression also exists in current llvm/clang
trunk for the SciMark2 Sparse matmult benchmark.

Oops. I misspoke. The 22% performance regression is in fact eliminated
in current llvm/clang trunk. Hopefully this is due to a single fix
that can be backported rather than some large change in the code.

Filed as