Poor bitcode generated for variable vector extract

Clang appears to be generating quite poor bitcode for cases in which a variable element is extracted from a vector.

The checkin for svn 207801 (http://llvm.org/viewvc/llvm-project?view=revision&revision=207801) appears to address this issue in LLVM, but its commit message refers to a subsequent checkin "to teach clang" how to generate the proper bitcode. Was that subsequent checkin ever made? (I'm using rev 214406.)

Anyway, the specific problem I'm seeing is the following:

Given this C code with a *constant* element index (the "2" in the "return" statement):

typedef int v4si __attribute__((__vector_size__ (16)));
typedef union { v4si v; int e[4]; } v4si_u;
int myfunc3(int x, v4si v0) {
    v4si_u vu;
    vu.v = v0;
    return vu.e[2];
}

And this clang command line:

clang -S -emit-llvm -O3 a.c

I see the following bitcode, which looks great:

; Function Attrs: nounwind readnone
define i32 @myfunc3(i32 %x, <4 x i32> %v0) #0 {
  %vu.sroa.0.8.vec.extract = extractelement <4 x i32> %v0, i32 2
  ret i32 %vu.sroa.0.8.vec.extract
}

*However*, when I change the "return" statement to the following, so that the index of the extracted element is a variable:

    return vu.e[x];

I see very poor bitcode that stores the vector to memory and then loads the element back, rather than using an extractelement whose index operand is the variable (%x):

; Function Attrs: nounwind readnone
define i32 @myfunc3(i32 %x, <4 x i32> %v0) #0 {
  %vu = alloca %union.v4si_u, align 16
  %v = getelementptr inbounds %union.v4si_u* %vu, i32 0, i32 0
  store <4 x i32> %v0, <4 x i32>* %v, align 16, !tbaa !1
  %e = bitcast %union.v4si_u* %vu to [4 x i32]*
  %arrayidx = getelementptr inbounds [4 x i32]* %e, i32 0, i32 %x
  %0 = load i32* %arrayidx, align 4, !tbaa !4
  ret i32 %0
}
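
For comparison, here is a minimal sketch that puts the union-based extract next to a direct subscript on the vector type. It assumes the compiler accepts subscripting on GCC-style vector types, which both Clang and GCC document as an extension. The function names extract_via_union and extract_via_subscript are made up for illustration. The subscript form may map more directly onto extractelement with a variable index, but I have not verified what bitcode it produces at this revision.

```c
/* GCC-style vector of four 32-bit ints, as in the test case above. */
typedef int v4si __attribute__((__vector_size__(16)));
typedef union { v4si v; int e[4]; } v4si_u;

/* Variable-index extract through the union (the pattern that produces
   the store/load sequence shown above). x must be in the range 0..3. */
int extract_via_union(int x, v4si v0) {
    v4si_u vu;
    vu.v = v0;
    return vu.e[x];
}

/* Hypothetical alternative: subscript the vector directly, relying on
   the vector-subscript extension. x must be in the range 0..3. */
int extract_via_subscript(int x, v4si v0) {
    return v0[x];
}
```

Both functions should return the same element for any in-range index, so swapping one for the other in a test file is an easy way to compare the generated bitcode.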

Any thoughts on this issue?