Test case name:
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c
This is a 4x4 matrix multiplication; with small changes we can turn it into a
3x3 matrix multiplication to make things simpler to understand.

This is one very specific case. How does that behave on all other cases?
Normally, every big improvement comes with a cost, and if you only look at
the benchmark you're tuning to, you'll never see it. It may be that the
cost is small and that we decide to pay the price, but not until we know
what the cost is.
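
For context, the multiply-accumulate pattern at issue in the matmul test case looks roughly like the 3x3 sketch below. This is not the test-suite source: the function name matmul3 and the macro N are hypothetical, chosen only for illustration. The statement in the innermost loop is the kind of operation a compiler may lower either to a single vmla or to a separate vmul followed by vadd, depending on the target CPU.

```c
#define N 3  /* hypothetical size; the benchmark itself is 4x4 */

/* Computes c = a * b for N x N matrices of doubles. */
void matmul3(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++) {
                /* Multiply-accumulate: a candidate for vmla,
                   or for vmul + vadd if the backend avoids vmla. */
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
}
```

Compiling a function like this with each compiler for the same target and comparing the generated assembly is one way to see whether vmla is emitted.
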
I agree that we should approach this as a whole rather than in bits and
pieces. I was basically comparing the performance of clang- and gcc-generated
code for the benchmarks listed in llvm trunk. I found that wherever there were
floating point ops (specifically floating point multiplication), performance
with clang was bad. On analyzing those issues further, I came across the vmla
instruction emitted by gcc. The test cases hit by bad performance of clang are:
Test Case    No. of vmla instructions emitted by gcc (clang does not emit vmla for cortex-a8)