Performance problem with SIMD support

I am comparing the performance of our code generated with Clang++ 3.3 and with G++ 4.5.1 for Linux x86_64 (Fedora 14).

Clang++ code generally performs a bit better than G++ code, but in a few test cases the difference is attributable to some SIMD code, which uses SSE2 and SSE4.1 instructions to accelerate some small functions that are used often.

In January I reported the lack of support for __builtin_ia32_blendvpd as a bug. I learned that we have to use _mm_blendv_pd from smmintrin.h instead of the __builtin_ form. This was the final comment from Eli Friedman: "The _mm_ forms are preferred because they are standardized; we consider the __builtin_ versions an implementation detail."

I converted all of our code to use the _mm_ forms for Clang++ builds. Now I have discovered that our code actually runs more slowly with the SIMD instructions than without.
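
For reference, a minimal sketch of what that conversion looks like (the function name here is made up; only the intrinsic comes from the discussion above):

    #include <smmintrin.h>   /* SSE4.1: _mm_blendv_pd */

    /* Hypothetical example of the conversion: the GCC-style builtin
     *   __m128d r = __builtin_ia32_blendvpd (a, b, mask);
     * becomes the standardized intrinsic form, which both compilers accept. */
    static inline __m128d blend_example (__m128d a, __m128d b, __m128d mask)
    {
        return _mm_blendv_pd (a, b, mask);
    }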

With G++ 4.5.1, the test case runs in 69 sec. with SIMD and 84 sec. without SIMD.

With Clang++ 3.3, the same test case runs in 73 sec. with SIMD and 64 sec. without SIMD.

We discovered that the function gcopy2 was at the top of the profiler's list, and fcopy2 and dcopy2 were also in the top 5. A stack trace pointed to our SIMD code as the caller, and this indicated we should try compiling without the SIMD code.

Before I spend too much more time with various possibilities, can anyone comment on this issue?

Perhaps we should be using __builtin_ functions, when they are available, and _mm_ functions only when the __builtin_ forms are not available.

Is there something that could be improved in Clang's SIMD support?

> With G++ 4.5.1, the test case runs in 69 sec. with SIMD and 84 sec. without
> SIMD.
>
> With Clang++ 3.3, the same test case runs in 73 sec. with SIMD and 64 sec.
> without SIMD.
>
> We discovered that the function gcopy2 was at the top of the profiler's
> list, and fcopy2 and dcopy2 were also in the top 5. A stack trace pointed
> to our SIMD code as the caller, and this indicated we should try compiling
> without the SIMD code.
>
> Before I spend too much more time with various possibilities, can anyone
> comment on this issue?

It'd be good to see a testcase that shows the problem. We're
definitely interested in optimizing this path.

> Perhaps we should be using __builtin_ functions, when they are available,
> and _mm_ functions only when the __builtin_ forms are not available.

We'd prefer not. The __builtin forms are basically equivalent to inline
asm. The idea behind using only the _mm_* versions is that the code is
also capable of being optimized.
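
To make that concrete, here is a small illustrative pair (not taken from the code under discussion): the intrinsic version is ordinary code the optimizer can fold, combine, or hoist, while the asm version is an opaque block it must leave untouched.

    #include <emmintrin.h>   /* SSE2 */

    /* Visible to the optimizer: _mm_mul_pd lowers to a plain vector multiply,
     * so it can be constant-folded, hoisted out of loops, or combined with
     * surrounding code. */
    static inline __m128d scale_intrinsic (__m128d x, __m128d s)
    {
        return _mm_mul_pd (x, s);
    }

    /* The same instruction written as inline asm is a black box: the compiler
     * only satisfies the operand constraints and otherwise cannot touch it. */
    static inline __m128d scale_asm (__m128d x, __m128d s)
    {
        asm ("mulpd %1, %0" : "+x" (x) : "x" (s));
        return x;
    }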

> Is there something that could be improved in Clang's SIMD support?

Probably if you're having this problem.

-eric

I'll see what I can do about a simpler test case.

I just talked with a colleague who is more familiar with our SIMD code, and he pointed out that the function called in this test case uses asm code for the SIMD instructions, not the intrinsic functions. My guess that the difference was due to the _mm_ functions was wrong. I apologize for jumping to the wrong conclusion.

This colleague thinks the problem might be related to data packing, which compilers could handle differently. I will try to send our code in a case that demonstrates the performance issue.

Another colleague pointed out that for G++ builds we compile the functions that use asm code with '-fno-dse'. Perhaps this option will help for Clang++, too.

I will try it out as soon as I can re-enable our SIMD code and rebuild everything.

This pinpoints the problem. We can compile the asm-based SIMD code with G++ using '-fno-dse', but Clang++ ignores the option and produces code that runs more slowly.

Is there any plan to support this option in Clang?

BTW, we are unable to compile the asm code with Clang in a debug build (-O0). We get a bunch of errors like this:

xxx.cc:504:2: error: ran out of registers during register allocation
         compute_factors (f0, f1, f2, df0, df1, df2,
         ^
xxx.cc:302:8: note: expanded from macro 'compute_factors'
         asm ( "movapd %3, %%xmm0 \n\t" /* xmm0 = f0 */ \
               ^

> This pinpoints the problem. We can compile the asm-based SIMD code with G++
> using '-fno-dse', but Clang++ ignores the option and produces code that runs
> more slowly.

That's for the dead store elimination pass. We do eliminate some dead
stores, and a testcase would be ideal.
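
One common pattern where dead store elimination and inline asm interact badly, shown here as a hypothetical sketch rather than a diagnosis of the code in question: if an asm statement reads memory it never declares as an operand, the stores feeding it can look dead to the compiler, and '-fno-dse' merely hides the missing constraint.

    #include <emmintrin.h>   /* SSE2 */

    /* Hypothetical sketch: tmp is written only so the asm can read it. */
    static inline __m128d load_low (double a)
    {
        double tmp = a;
        __m128d r;
        /* Declaring "m" (tmp) tells the compiler the asm reads tmp. Without
         * such an operand (or a "memory" clobber), the store above may be
         * eliminated as dead, and building with -fno-dse would only mask the
         * missing dependence. */
        asm ("movsd %1, %0" : "=x" (r) : "m" (tmp));
        return r;
    }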

> Is there any plan to support this option in Clang?
>
> BTW, we are unable to compile the asm code with Clang in a debug build
> (-O0). We get a bunch of errors like this:
>
> xxx.cc:504:2: error: ran out of registers during register allocation
>         compute_factors (f0, f1, f2, df0, df1, df2,
>         ^
> xxx.cc:302:8: note: expanded from macro 'compute_factors'
>         asm ( "movapd %3, %%xmm0 \n\t" /* xmm0 = f0 */ \

Means that you've got a lot of inline assembly that basically depends
upon the compiler picking decent memory operands for your inline asm.
A couple of comments here:

a) that's a lot of inline assembly then,
b) you'll be better off using the intrinsics, especially with clang

-eric
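
A minimal sketch of the intrinsics-based rewrite suggested in (b) above, using made-up names and arithmetic rather than the actual compute_factors macro:

    #include <emmintrin.h>   /* SSE2 */

    /* Hypothetical replacement for an asm block that moved f0 into %xmm0 by
     * hand: with intrinsics the compiler assigns registers (or memory
     * operands) itself, so running out of registers at -O0 is no longer an
     * issue. */
    static inline void compute_factor_pair (__m128d f0, __m128d df0,
                                            __m128d *out)
    {
        /* e.g. out = f0 * df0 + f0, written as ordinary expressions */
        *out = _mm_add_pd (_mm_mul_pd (f0, df0), f0);
    }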

>> This pinpoints the problem. We can compile the asm-based SIMD code with G++ using '-fno-dse', but Clang++ ignores the option and produces code that runs more slowly.

> That's for the dead store elimination pass. We do eliminate some dead
> stores, and a testcase would be ideal.

I am ready to send you a test case. The asm SIMD code runs slower than the equivalent C++ non-SIMD code. How do I send it to you?

>> BTW, we are unable to compile the asm code with Clang in a debug build
>> (-O0). We get a bunch of errors like this:
>>
>> xxx.cc:504:2: error: ran out of registers during register allocation
>>         compute_factors (f0, f1, f2, df0, df1, df2,
>>         ^
>> xxx.cc:302:8: note: expanded from macro 'compute_factors'
>>         asm ( "movapd %3, %%xmm0 \n\t" /* xmm0 = f0 */ \

> Means that you've got a lot of inline assembly that basically depends
> upon the compiler picking decent memory operands for your inline asm.
> A couple of comments here:
>
> a) that's a lot of inline assembly then,
> b) you'll be better off using the intrinsics, especially with clang

The test case also demonstrates the errors reported above when compiling with -g instead of -O2.

I submitted a bug report: http://llvm.org/bugs/show_bug.cgi?id=17195

I've responded to one part of it (the -O0 part), but the performance
problem can be looked at separately. You might want to look at the
generated code for each function as well.

-eric