Hi Chandler,
Thanks for fixing the problem with the insertps mask.
Generally, the new shuffle lowering looks promising; however, there
are some cases where the codegen is now worse, causing runtime
performance regressions in parts of our internal codebase.
You have already mentioned that the new shuffle lowering is missing
some features; for example, you explicitly said that it currently
lacks SSE4.1 blend support. Unfortunately, this seems to be one of
the main reasons for the slowdowns we are seeing.
Here is a list of what we have found so far that we think causes most
of the slowdown:
1) shufps is always emitted in cases where a single blendps would
suffice; blendps is preferable because it has better reciprocal
throughput (this is true on all modern Intel and AMD CPUs).
Things get worse when lowering shuffles whose mask indices refer to
elements from both input vectors within each lane. For example, a
shuffle mask of <0,5,2,7> could easily be lowered into a single
blendps; instead it gets lowered into two shufps instructions.
Example:
;;;
define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x float> %1
}
;;;
llc (-mcpu=corei7-avx):
  vblendps $10, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vshufps $-40, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,2],xmm1[1,3]
  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]
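As an aside, for any pure blend mask (where element i of the mask is
either i or i+4), the blendps immediate can be read directly off the
shuffle mask: bit i of the immediate is set exactly when mask element
i selects from the second input (i.e. is >= 4). Here is a further
reduced test case we would expect to lower to a single blend; the
function name and the expected assembly are hand-derived from the
SSE4.1 encoding, not actual llc output:
;;;
; mask <4,1,6,3>: elements 0 and 2 come from %B, so imm8 = 0b0101 = 5
define <4 x float> @blend2(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 6, i32 3>
  ret <4 x float> %1
}
;;;
Expected (hand-derived):
  vblendps $5, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1],xmm1[2],xmm0[3]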
2) On SSE4.1, we should try not to emit an insertps if the shuffle
mask identifies a blend. At the moment the new lowering logic emits
insertps very aggressively where a cheaper blendps would do.
Example:
;;;
define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 1, i32 6, i32 3>
  ret <4 x float> %1
}
;;;
llc (-mcpu=corei7-avx):
  vblendps $4, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
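For contrast, insertps remains the right choice when the inserted
element changes position between source and destination, i.e. when
the mask is not a blend. A sketch (again hand-derived from the
SSE4.1 encoding rather than actual llc output):
;;;
; mask <0,1,4,3>: %B[0] lands at position 2, so this is not a blend;
; insertps needs imm8 = (count_s=0 << 6) | (count_d=2 << 4) = 32
define <4 x float> @insert2(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 1, i32 4, i32 3>
  ret <4 x float> %1
}
;;;
Expected (hand-derived):
  vinsertps $32, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]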
3) When a shuffle performs an insert at index 0, we always generate
an insertps, while a movss would do a better job.
Example:
;;;
define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x float> %1
}
;;;
llc (-mcpu=corei7-avx):
  vmovss %xmm1, %xmm0, %xmm0
llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
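One caveat for such a peephole (our observation; not something we
have seen llc get wrong): movss only matches insertps in its
register-register form, because the load form zeroes the upper
elements instead of merging them:
  vmovss %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3] (merge)
  vmovss (%rdi), %xmm0       # xmm0 = mem[0],zero,zero,zero (load form zeroes upper lanes)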
I hope this is useful. We would be happy to contribute patches to
improve some of the above cases, but since this is obviously still a
work in progress, we don't want to introduce conflicts with your
work. Please let us know what you think.
We will keep looking at this and follow up with any further findings.
Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Inc.