instcombine does silly things with vector x+x

Consider the following function which doubles a <16 x i8> vector:

define <16 x i8> @test(<16 x i8> %a) {
       %b = add <16 x i8> %a, %a
       ret <16 x i8> %b
}

If I compile it for x86 with llc like so:

llc paddb.ll -filetype=asm -o=/dev/stdout

I get a two-op function that just does paddb %xmm0, %xmm0 and then
returns. llc does this regardless of the optimization level. Great!

If I let the instcombine pass touch it like so:

opt -instcombine paddb.ll | llc -filetype=asm -o=/dev/stdout

or like so:

opt -O3 paddb.ll | llc -filetype=asm -o=/dev/stdout

then the add gets converted to a vector left shift by 1, which then
lowers to a much slower function with about a hundred ops. No amount
of optimization after the fact will simplify it back to paddb.
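For reference, after instcombine runs, the IR looks roughly like the following (a sketch of the canonicalized form, not verbatim opt output):

```llvm
define <16 x i8> @test(<16 x i8> %a) {
  ; instcombine canonicalizes x + x into a shift left by a splat of 1
  %b = shl <16 x i8> %a, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
  ret <16 x i8> %b
}
```

It's this shl form that the x86 backend lowers so badly, since SSE2 has no per-byte vector shift instruction.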

I'm actually generating these ops in a JIT context, and I want to use
instcombine, as it seems like a useful pass. Any idea how I can
reliably generate the 128-bit SSE version of paddb? I thought I might
be able to force the issue with an intrinsic, but there only seem to
be intrinsics for the 64-bit version (llvm.x86.mmx.padd.b) and the
saturating 128-bit version (llvm.x86.sse2.padds.b). I would just give
up and use inline assembly, but it seems I can't JIT that.

I'm using the latest LLVM 3.1 from SVN. I get similar behavior at
llvm.org/demo using the following equivalent C code:

#include <emmintrin.h>
__m128i f(__m128i a) {
  return _mm_add_epi8(a, a);
}

Compiling this with no optimization gives better code than the optimized version.

Any ideas? Should I just not use this pass?

- Andrew

> then the add gets converted to a vector left shift by 1, which then
> lowers to a much slower function with about a hundred ops. No amount
> of optimization after the fact will simplify it back to paddb.

This sounds like a really serious X86 backend performance bug. Canonicalizing "x+x" to a shift is the "right thing to do", the backend should match it.

-Chris

Opened PR11266. I will try to make time to work on it.

Fixed in r143311.