Please benchmark the new x86 vector shuffle lowering; I'm planning to make it the default very soon!

Greetings all,

As you may have noticed, there is a new vector shuffle lowering path in the X86 backend. You can try it out with the ‘-x86-experimental-vector-shuffle-lowering’ flag to llc, or ‘-mllvm -x86-experimental-vector-shuffle-lowering’ to clang. Please test it out!
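
For example, to try it on a single file (the file names here are just placeholders):

llc -x86-experimental-vector-shuffle-lowering test.ll -o test.s
clang -O3 -S -mllvm -x86-experimental-vector-shuffle-lowering test.c -o test.s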

There may be some correctness bugs; I’m still fuzz testing it to shake them out, but I expect fairly few of those.

I don’t have any test cases which regress in performance with the new shuffle lowering. I have several which improve by 1-3%, and a couple which improve by 5-10%. YMMV.

There are still some missing features: AVX2 shuffles, SSE4.1 blends, and handling all possible uses of the “mov*” style shuffles. However, as indicated, I don’t have any test cases on any microarchitectures that are really showing regressions here. It’s entirely possible I just don’t have access to them, so please help me benchmark!

Provided there aren’t really terrible regressions in performance, I’d like to switch the default in a couple of days and start getting bug reports about what doesn’t work yet. I’ve already talked to a couple of the regular contributors to the x86 backend and they seem pretty happy, so I just wanted to send a wider-reaching email in case some folks had a chance to benchmark more.

Inevitably, there will be some regressions, but they can be handled and fixed like anything else provided they don’t cause lots of trouble for folks.

Thanks,
-Chandler

Hi Chandler,

I've done some informal benchmarking on an AMD Jaguar core (AMD family 16h)
with and without the experimental flag. The tests were a mixture of
FP and integer tests. I didn't see any significant performance
regression, with most of the differences being in the noise (less than
1%). One test, however, did show a performance improvement of ~4%.

Unfortunately, another team, while doing internal testing, has seen the
new path generating illegal insertps masks (the $256 and $416 immediates
below do not fit in the instruction's 8-bit immediate field). A sample here:

    vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
    vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
    vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
    vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
    vinsertps $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
    vinsertps $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]

We'll continue to look into this and do additional testing.

Thanks,
Rob.

Interesting. Let me know if you get a test case. The insertps code path was
added recently though and has been much less well tested. I'll start fuzz
testing it and should hopefully uncover the bug.

Hi Chandler,

While doing performance measurements on an Ivy Bridge, I ran into compile-time errors.

I saw a bunch of “cannot select” errors in the LLVM test suite with -march=core-avx-i.
E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at -O3 -march=core-avx-i with:

fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
  0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
    0x7f91b99a7210: v4i64 = undef [ID=15]
    0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
      0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
        0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
          0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
    0x7f91b99ace70: i64 = Constant<0> [ID=3]
In function: isamax0
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 3.6.0 (215249)
Target: x86_64-apple-darwin14.0.0

For some reason, I cannot reproduce the problem with the test case that clang gives me using -emit-llvm. Since the source is public, I guess you can try to reproduce on your side.
Indeed, if you run the test-suite with -march=core-avx-i you’ll likely see all those failures.

Let me know if you cannot and I’ll try harder to produce a test case.

Note: this is the same failure all over the place, i.e., we cannot select a bitcast from various types to v4i32 or v4i64.

Thanks,
-Quentin

FYI, this is all fixed. =] Sorry for the trouble, was a silly goof that should have been caught sooner.

I’m having trouble reproducing this. I’m trying to get LNT to actually run, but manually compiling the given source file didn’t reproduce it for me.

It might have been fixed recently (although I’d be surprised if so), but it would help to get the actual command line for which compiling this file in the test suite failed.

-Chandler

I’ve run the SingleSource test suite for core-avx-i and have no failures here so a preprocessed file + commandline would be very useful if this reproduces for you still.

Sure,

Here is the command line:

clang -cc1 -triple x86_64-apple-macosx -S -disable-free -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu core-avx-i -O3 -ferror-limit 19 -fmessage-length 114 -stack-protector 1 -mstackrealign -fblocks -fencode-extended-block-signature -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i

This was with trunk 215249.

Thanks,
-Quentin

tmp.i (119 KB)

I meant r217281, not trunk 215249.

Hi Chandler,

Forget about what I said.
It seems I have some weird dependencies in my build system.
My binaries are out of sync.

Let me sort that out; the problem is likely already fixed, and then I can resume the measurements.

Sorry for the noise.

Q.

Hi Chandler,

Thanks for fixing the problem with the insertps mask.

Generally the new shuffle lowering looks promising; however, there are
some cases where the codegen is now worse, causing runtime performance
regressions in parts of our internal codebase.

You have already mentioned how the new shuffle lowering is missing
some features; for example, you explicitly said that we currently lack
SSE4.1 blend support. Unfortunately, this seems to be one of the
main reasons for the slowdown we are seeing.

Here is a list of what we found so far that we think is causing most
of the slowdown:
1) shufps is always emitted in cases where we could emit a single
blendps; in these cases, blendps is preferable because it has better
reciprocal throughput (this is true on all modern Intel and AMD CPUs).

Things get worse when it comes to lowering shuffles where the shuffle
mask indices refer to elements from both input vectors in each lane.
For example, a shuffle mask of <0,5,2,7> could be easily lowered into
a single blendps; instead it gets lowered into two shufps
instructions.

Example:
;;;
define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vblendps $10, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
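
For reference, the blendps immediate above follows directly from the shuffle mask: with <0,5,2,7>, lanes 1 and 3 come from %B, so the immediate is 0b1010 = 10 (the $10 in the first output).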

2) On SSE4.1, we should try not to emit an insertps if the shuffle
mask identifies a blend. At the moment the new lowering logic is very
aggressively emitting insertps instead of cheaper blendps.

Example:
;;;
define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vblendps $11, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]

3) When a shuffle performs an insert at index 0, we always generate an
insertps, while a movss would do a better job.
;;;
define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vmovss %xmm1, %xmm0, %xmm0

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]

I hope this is useful. We would be happy to contribute patches to
improve some of the above cases, but we obviously know that this is
still a work in progress, so we don't want to introduce conflicts with
your work. Please let us know what you think.

We will keep looking at this and follow up with any further findings.

Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Inc.

Hi Chandler,

I had observed some improvements and regressions with the new lowering.

Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.

I’ll look into the regressions to provide test cases.

** Numbers **

Smaller is better. Only tests that run for at least one second are reported.
Reference is the default lowering, Test is the new lowering.
The Os numbers are overall neutral, but the O3 numbers mainly expose regressions.

Note: I can attach the raw numbers if you want.

* Os *

Benchmark_ID Reference Test Expansion Percent

Note: I can attach the raw numbers if you want.

That would be great. Please do.

-- Sean Silva

Alright, here they are :).

base-perf-Ox.txt: runtime for the default lowering.
new-perf-Ox.txt: runtime for the new lowering.

Each line in those files has the following format:

The units are:

  • min: Minimum of the 7 runs.
  • max: Maximum of the 7 runs.
  • avg: Average of the 7 runs.
  • total: Total of the 7 runs.
  • med: Median of the 7 runs.
  • SD: Standard deviation of the 7 runs.
  • SD%: Standard deviation of the 7 runs, as a percentage.

-Quentin

base-perf-O3.txt (60.2 KB)

base-perf-Os.txt (45 KB)

new-perf-O3.txt (60.2 KB)

new-perf-Os.txt (45 KB)

Hi Chandler,

Here is a test case for the biggest offender (oourafft.c).
To reproduce:
llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=true repro.ll
llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=false repro.ll

The main problem is that we miss:

vmovsd (%rdi,%rcx,8), %xmm2
vmovlhps %xmm2, %xmm2, %xmm2 ## xmm2 = xmm2[0,0]

=>

vmovddup (%rdi,%rcx,8), %xmm2

I do not know how problematic that is (it seems we catch up on the performance with just the previous transformation), but we also miss:

vsubpd %xmm1, %xmm0, %xmm2
vaddpd %xmm1, %xmm0, %xmm0
vshufpd $2, %xmm0, %xmm2, %xmm0 ## xmm0 = xmm2[0],xmm0[1]

=>

vaddsubpd %xmm1, %xmm0, %xmm0
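
For reference, here is roughly the IR shape behind these two patterns (just an illustrative sketch, not extracted from repro.ll):

;;;
; Splat load of a double: ideally this becomes a single vmovddup with the load folded in.
define <2 x double> @splat_load(double* %p) {
  %x = load double* %p
  %v = insertelement <2 x double> undef, double %x, i32 0
  %s = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer
  ret <2 x double> %s
}

; Subtract in lane 0, add in lane 1: ideally this becomes a single vaddsubpd.
define <2 x double> @addsub(<2 x double> %a, <2 x double> %b) {
  %sub = fsub <2 x double> %a, %b
  %add = fadd <2 x double> %a, %b
  %r = shufflevector <2 x double> %sub, <2 x double> %add, <2 x i32> <i32 0, i32 3>
  ret <2 x double> %r
}
;;;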

I’ll look into the other regressions.

Thanks,
-Quentin

repro.ll (2.21 KB)

First off, thanks for the *fantastic* testing and investigation. =]

Hi Chandler,

Here is a test case for the biggest offender (oourafft.c).
To reproduce:
llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=true repro.ll
llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=false repro.ll

The main problem is that we miss:
vmovsd (%rdi,%rcx,8), %xmm2
vmovlhps %xmm2, %xmm2, %xmm2 ## xmm2 = xmm2[0,0]
=>
vmovddup (%rdi,%rcx,8), %xmm2

I do not know how problematic that is (it seems we catch up on the
performance with just the previous transformation),

Actually, this is awesome, because this was also the main problem I saw. I
already wrote the fix, and just need to fix up the test cases and submit
it. =]

I think blendps is the other big missing piece as mentioned.

but we also miss:
vsubpd %xmm1, %xmm0, %xmm2
vaddpd %xmm1, %xmm0, %xmm0
vshufpd $2, %xmm0, %xmm2, %xmm0 ## xmm0 = xmm2[0],xmm0[1]
=>
vaddsubpd %xmm1, %xmm0, %xmm0

I’ll look into the other regressions.

Maybe wait until I can land the duplicate move support and the blendps
support? I'd rather see what the results are after that.

There is also some AVX-specific stuff that I've left FIXMEs for that I
could probably address to pull it up a bit.

FWIW, I've got the main test-suite reproducing your results for x86, but I
don't currently have a nice reproduction for SPEC, so digging into those
would help somewhat more.

Great!

Understood.

I’ll wait for the patches to support the duplicate move before doing that :).

Shoot me an email when they land, just to be sure ;).

Thanks,
-Quentin

Awesome, thanks for all the information!

See below:

You have already mentioned how the new shuffle lowering is missing
some features; for example, you explicitly said that we currently lack
SSE4.1 blend support. Unfortunately, this seems to be one of the
main reasons for the slowdown we are seeing.

Here is a list of what we found so far that we think is causing most
of the slowdown:
1) shufps is always emitted in cases where we could emit a single
blendps; in these cases, blendps is preferable because it has better
reciprocal throughput (this is true on all modern Intel and AMD CPUs).

Yep. I think this is actually super easy. I'll add support for blendps
shortly.

3) When a shuffle performs an insert at index 0, we always generate an
insertps, while a movss would do a better job.
;;;
define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vmovss %xmm1, %xmm0, %xmm0

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]

So, this is hard. I think we should do this in MC after register allocation
because movss is the worst instruction ever: it switches from blending with
the destination to zeroing the destination when the source switches from a
register to a memory operand. =[ I would like to not emit movss in the DAG
*ever*, and teach the MC combine pass to run after register allocation (and
thus spills) have been emitted. This way we can match both patterns: when
insertps is zeroing the other lanes and the operand is from memory, and
when insertps is blending into the other lanes and the operand is in a
register.
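
To make the asymmetry concrete (hand-written illustration, not compiler output):

    movss %xmm1, %xmm0  # xmm0 = xmm1[0],xmm0[1,2,3]  (register source: merges into the destination)
    movss (%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero (memory source: zeroes the upper lanes)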

Does that make sense? If so, would you be up for looking at this side of
things? It seems nicely separable.

Thanks Chandler!


I think it is a good idea and it makes sense to me.
I will start investigating on this and see what can be done.

Cheers,
Andrea