Sorry folks, I didn't specify the actual test case and results in detail
previously. The details are as follows:
Test case : a 4x4 matrix multiplication (it can be reduced to a 3x3 matrix
multiplication with small changes, to keep things simple to understand).
clang version : trunk (latest as of today, 19 Dec 2013)
GCC version : 4.5 (I checked with 4.8 as well)
Flags passed to both GCC and clang : -march=armv7-a -mfloat-abi=softfp -mcpu=cortex-a8
Optimization level used : -O3
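For reference, the invocations look roughly like this (toolchain names and file names are placeholders, not from the original runs):

```shell
# Cross-compile to assembly for ARMv7-A and look for vmla in the output.
clang -O3 -march=armv7-a -mfloat-abi=softfp -mcpu=cortex-a8 -S matmul.c -o matmul-clang.s
arm-linux-gnueabi-gcc -O3 -march=armv7-a -mfloat-abi=softfp -mcpu=cortex-a8 -S matmul.c -o matmul-gcc.s
grep vmla matmul-clang.s matmul-gcc.s
```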
No vmla instruction is emitted by clang, but GCC happily emits it.
This was tested on real hardware. Time taken for a 4x4 matrix
multiplication:
clang : ~14 secs
gcc : ~9 secs
Time taken for a 3x3 matrix multiplication:
clang : ~6.5 secs
gcc : ~5 secs
When the flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits
vmla instructions (GCC emits them by default).
Time for 4x4 matrix multiplication :
clang : ~8.5 secs
GCC : ~9 secs
Time for 3x3 matrix multiplication :
clang : ~3.8 secs
GCC : ~5 secs
Please let me know if I am missing something (the -ffast-math option doesn't
help in this case). On examining the assembly code for the various scenarios
above, I concluded what I stated above regarding the extra load/store ops.
Also, as Renato stated - "there is a pipeline stall between two
sequential VMLAs (possibly due to the need of re-use of some registers) and
this made code much slower than a sequence of VMLA+VMUL+VADD" - yet when I
use -mcpu=cortex-a15, clang emits vmla instructions back to back
(sequentially). Is there something different about Cortex-A15 regarding
pipeline stalls, such that we can ignore back-to-back vmla hazards there?