food for optimizer developers

I wrote a Fortran to C++ conversion program that I used to convert selected
LAPACK sources. Comparing runtimes with different compilers I get:

                            absolute  relative
ifort 11.1.072              1.790s    1.00
gfortran 4.4.4              2.470s    1.38
g++ 4.4.4                   2.922s    1.63
clang++ 2.8 (trunk 108205)  6.487s    3.62

- Why is the code generated by clang++ so much slower than the g++ code?

A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr():

  FEM_DO(i, 1, m) {
    temp = a(i, j + 1);
    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
    a(i, j) = stemp * temp + ctemp * a(i, j);
  }
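For readers unfamiliar with the fable runtime: FEM_DO(i, 1, m) is a Fortran-style 1-based loop, and the body applies a plane (Givens) rotation to two columns of 'a'. A plain-C++ sketch of the same computation (function and parameter names are illustrative, not from fable):

```cpp
#include <cassert>

// Plain-C++ sketch of the loop body above: a plane (Givens) rotation
// applied elementwise to two columns. 'col_j' and 'col_j1' stand for
// the columns a(*, j) and a(*, j + 1); the names are illustrative.
void rotate_columns(double* col_j, double* col_j1, int m,
                    double ctemp, double stemp) {
  for (int i = 0; i < m; ++i) {
    double temp = col_j1[i];
    col_j1[i] = ctemp * temp - stemp * col_j[i];
    col_j[i]  = stemp * temp + ctemp * col_j[i];
  }
}
```

With column-major storage the columns are contiguous, so this is a purely sequential access pattern.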

For the loop body, g++ (4.2) emits unsurprising code:
loop:
movsd (%rcx), %xmm2
movapd %xmm3, %xmm0
mulsd %xmm2, %xmm0
movapd %xmm4, %xmm1
mulsd (%rax), %xmm1
subsd %xmm1, %xmm0
movsd %xmm0, (%rcx)
movapd %xmm3, %xmm0
mulsd (%rax), %xmm0
mulsd %xmm4, %xmm2
addsd %xmm2, %xmm0
movsd %xmm0, (%rax)
incl %esi
addq $8, %rcx
addq $8, %rax
cmpl %esi, +0(%r13)
jge loop

clang++ (2.8) misses major optimizations when accessing the 'a' array, performing no fewer than three laborious address calculations per iteration:
loop:
movq %rax, %rdi
subq %rdx, %rdi
imulq %r14, %rdi
subq %rcx, %rdi
addq %rsi, %rdi
movq +0(%r13), %r8
movsd (%r8, %rdi, 8), %xmm3
mulsd %xmm1, %xmm3
movq %rbx, %rdi
subq %rdx, %rdi
imulq %r14, %rdi
subq %rcx, %rdi
addq %rsi, %rdi
movsd (%r8, %rdi, 8), %xmm4
movapd %xmm2, %xmm5
mulsd %xmm4, %xmm5
subsd %xmm3, %xmm5
movsd %xmm5, (%r8, %rdi, 8)
movq +32(%r13), %rdx
movq %rax, %rdi
subq %rdx, %rdi
movq +0(%r13), %r8
movq +8(%r13), %r14
imulq %r14, %rdi
movq +24(%r13), %rcx
subq %rcx, %rdi
addq %rsi, %rdi
movsd (%r8, %rdi, 8), %xmm3
mulsd %xmm2, %xmm3
mulsd %xmm1, %xmm4
addsd %xmm3, %xmm4
movsd %xmm4, (%r8, %rdi, 8)
incq %rsi
cmpl (%r15), %esi
jle loop

Presumably clang++, in its present state of development, is not smart enough to recognize the underlying simple sequential access pattern when the array is declared as
arr_ref<double, 2> a

I think clang has no trouble optimizing properly for arrays like this:
double a[800][800];
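That claim can be checked with a small self-contained sketch (sizes and names illustrative): with a fixed-size built-in array the stride is a compile-time constant, so the address arithmetic in the inner loop reduces to a single pointer increment.

```cpp
#include <cassert>

const int N = 8;
double a[N][N]; // fixed-size built-in array: stride known at compile time

// The same rotation as in dlasr(), but over rows of a plain 2-D array;
// 0-based indexing, names illustrative.
void rotate_rows(int j, int m, double ctemp, double stemp) {
  for (int i = 0; i < m; ++i) {
    double temp = a[j + 1][i];
    a[j + 1][i] = ctemp * temp - stemp * a[j][i];
    a[j][i]     = stemp * temp + ctemp * a[j][i];
  }
}
```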

Robert P.

This would make a *wonderful* bug report against the LLVM optimizer... http://llvm.org/bugs/ :-)

  - Doug

I wrote a Fortran to C++ conversion program that I used to convert selected
LAPACK sources. Comparing runtimes with different compilers I get:

                            absolute  relative
ifort 11.1.072              1.790s    1.00
gfortran 4.4.4              2.470s    1.38
g++ 4.4.4                   2.922s    1.63
clang++ 2.8 (trunk 108205)  6.487s    3.62

- Why is the code generated by clang++ so much slower than the g++ code?

A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()

FEM_DO(i, 1, m) {
   temp = a(i, j + 1);
   a(i, j + 1) = ctemp * temp - stemp * a(i, j);
   a(i, j) = stemp * temp + ctemp * a(i, j);
}

Please file a bug with the reduced .cpp testcase. My wild guess is that this is a failure because we don't have TBAA yet, which isn't being worked on. What flags are you passing to the compiler? Anything like -ffast-math? Note that ifort defaults to "fast and loose" numerics iirc.

-Chris

Rather, "which *is* being worked on". You can quickly verify this assumption by seeing if gcc generates similar code to llvm when you pass -fno-strict-aliasing to gcc.

-Chris

Chris Lattner wrote:

Please file a bug with the reduced .cpp testcase.

http://llvm.org/bugs/show_bug.cgi?id=7868

What flags are you passing to the compiler?

-O3 -ffast-math

Note that ifort defaults to "fast and loose" numerics iirc.

Which is exactly what I'm hoping to get from C++, too, one day,
if I ask for it via certain options.

I think speed will be the major argument against using the C++ code
generated by the fable converter. If the generated C++ code could somehow
be made to run nearly as fast as the original Fortran (compiled with ifort),
there would no longer be any good reason to keep developing in Fortran,
or to bother with the complexities of mixing languages.

Ralf

Douglas Gregor wrote:

I wrote a Fortran to C++ conversion program that I used to convert selected
LAPACK sources. Comparing runtimes with different compilers I get:

                            absolute  relative
ifort 11.1.072              1.790s    1.00
gfortran 4.4.4              2.470s    1.38
g++ 4.4.4                   2.922s    1.63
clang++ 2.8 (trunk 108205)  6.487s    3.62

- Why is the code generated by clang++ so much slower than the g++ code?

A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()

FEM_DO(i, 1, m) {
  temp = a(i, j + 1);
  a(i, j + 1) = ctemp * temp - stemp * a(i, j);
  a(i, j) = stemp * temp + ctemp * a(i, j);
}

For the loop body, g++ (4.2) emits unsurprising code.

clang++ (2.8) misses major optimizations when accessing the 'a' array, performing no fewer than three laborious address calculations per iteration.

Presumably clang++, in its present state of development, is not smart enough to recognize the underlying simple sequential access pattern when the array is declared as
arr_ref<double, 2> a

This would make a *wonderful* bug report against the LLVM optimizer... http://llvm.org/bugs/ :-)

I believe that would require the cooperation of the OP, because it is his Fortran -> C++ converter. Are you interested, Ralf?
I've started the ball rolling with a much reduced test case.

cat test.cpp
/*
Background:
<http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-August/010258.html>

Relevant files, including benchmark dsyev_test.cpp:
<http://cci.lbl.gov/lapack_fem/>

This file (test.cpp) is a reduced case of dsyev_test.cpp.
It sheds light on the performance issue with clang++.

$ clang++ -c -I. -O3 test.cpp -save-temps

Examine test.s, in which the two inner loops of interest
are easily identified by their 'subsd' instruction.
Contrary to expectation, assembly code for loops A and B
is different. Loop B contains laborious and redundant
address calculations.

$ clang --version
clang version 2.8 (trunk 110653)

By contrast, g++ (4.2) emits identical assembler for loops A and B.
*/

#include <fem/major_types.hpp>

namespace lapack_dsyev_fem {
  
  using namespace fem::major_types;
  
  void
  test(
     int x,
     int const& m,
     int const& n,
     arr_cref<double> c,
     arr_cref<double> s,
     arr_ref<double, 2> a,
     int const& lda)
  {
    c(dimension(star));
    s(dimension(star));
    a(dimension(lda, star));
    
    int i, j;
    double ctemp, stemp, temp;
    
    if ( x ) {
      for ( j = m - 1; j >= 1; j-- ) {
        ctemp = c(j);
        stemp = s(j);
      // loop A, identical with loop B below
        for ( i = 1; i <= n; i++ ) {
          temp = a(j + 1, i);
          a(j + 1, i) = ctemp * temp - stemp * a(j, i);
          a(j, i) = stemp * temp + ctemp * a(j, i);
        }
      }
    }
    else {
      for ( j = m - 1; j >= 1; j-- ) {
        ctemp = c(j);
        stemp = s(j);
        // loop B, identical with loop A above
        for ( i = 1; i <= n; i++ ) {
          temp = a(j + 1, i);
          a(j + 1, i) = ctemp * temp - stemp * a(j, i);
          a(j, i) = stemp * temp + ctemp * a(j, i);
        }
      }
    }
  }
  
} // namespace lapack_dsyev_fem
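Until the optimizer learns to hoist these loads itself, one possible workaround on the generator side would be to pull the array's base pointer and leading dimension into locals before the inner loop, so that stores through plain double* can no longer be suspected of modifying the arr_ref's internal fields. A minimal self-contained sketch, with a hypothetical ArrRef2 standing in for fem's arr_ref<double, 2>:

```cpp
#include <cassert>

// Hypothetical stand-in for fem's arr_ref<double, 2>: a base pointer
// plus a leading dimension, with 1-based, column-major indexing.
struct ArrRef2 {
  double* data;
  int lda;
  double& operator()(int i, int j) { return data[(i - 1) + (j - 1) * lda]; }
};

// Workaround sketch: base pointers and stride are hoisted into locals
// before the loop, so stores through 'aj'/'aj1' cannot be suspected of
// rewriting a.data or a.lda, and nothing needs to be reloaded inside.
void rotate(ArrRef2 a, int j, int n, double ctemp, double stemp) {
  double* aj  = &a(j, 1);
  double* aj1 = &a(j + 1, 1);
  const int lda = a.lda;
  for (int i = 0; i < n; ++i, aj += lda, aj1 += lda) {
    double temp = *aj1;
    *aj1 = ctemp * temp - stemp * *aj;
    *aj  = stemp * temp + ctemp * *aj;
  }
}
```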

Robert P.

Chris Lattner wrote:

My wild guess is that this is a failure because we don't have TBAA yet, which isn't being worked on. What flags are you passing to the compiler? Anything like -ffast-math? Note that ifort defaults to "fast and loose" numerics iirc.

Rather, "which *is* being worked on". You can quickly verify this assumption by seeing if gcc generates similar code to llvm when you pass -fno-strict-aliasing to gcc.

Passing -fno-strict-aliasing makes no difference to the code generated by g++; it is still twice as fast as the code from clang.

Robert P.

Hi Robert,

I believe that would require the cooperation of the OP, because it is his
Fortran -> C++ converter. Are you interested, Ralf?

Definitely. Let me know how I could help by changing the C++ code generator.

Ralf