RFB: Would like to flip the vector shuffle legality flag

Greetings LLVM hackers and x86 vector shufflers!

I would like to flip on another chunk of the new vector shuffling, specifically the logic to mark ~all shuffles as “legal”.

This can be tested today with the flag “-x86-experimental-vector-shuffle-legality”. I would essentially like to make this the default (by removing the “false” path). Doing this will allow me to completely delete the old vector shuffle lowering.

I’ve got the patches prepped and ready to go, but it will likely have a significant impact on performance. Notably, a bunch of the remaining domain crossing bugs I’m seeing are due to this. The key thing to realize is that vector shuffle combining is much more powerful when we think all of these are legal, and so we combine away bad shuffles that would trigger domain crosses.
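
As a hypothetical illustration of the kind of combine involved (clang vector
extensions; __builtin_shufflevector corresponds directly to an IR
shufflevector): the combiner can fold two back-to-back permutes into a single
shuffle, but it will only create the folded node if its mask passes the
legality check, so treating ~all masks as legal lets this kind of fold fire
far more often.

typedef float v4f32 __attribute__((vector_size(16)));

v4f32 two_permutes(v4f32 a) {
  /* Rotate the lanes, then permute again; the two masks compose to the
     single mask <1,0,2,3>, so one shuffle suffices. */
  v4f32 t = __builtin_shufflevector(a, a, 3, 0, 1, 2);
  return __builtin_shufflevector(t, t, 2, 1, 3, 0);
}

(Function and type names above are just for illustration; compile with
something like "clang -O2 -mavx -S" to see the lowering.)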

All of my benchmarks have come back performance neutral overall with a few benchmarks improving. However, there may be some regressions that folks want to track down first. I’d really like to get those reported and prioritize among the vector shuffle work so we can nuke several thousand lines of code from X86ISelLowering.cpp. =D

Thanks!
-Chandler

PS: If you're feeling adventurous, the next big mode-flip flag I want to see changed is -x86-experimental-vector-widening-legalization, but that is a much deeper change to the entire vector legalization strategy, so I want to do it second and separately.

Hi Chandler,

> Greetings LLVM hackers and x86 vector shufflers!
>
> I would like to flip on another chunk of the new vector shuffling,
> specifically the logic to mark ~all shuffles as "legal".
>
> This can be tested today with the flag
> "-x86-experimental-vector-shuffle-legality". I would essentially like to
> make this the default (by removing the "false" path). Doing this will allow
> me to completely delete the old vector shuffle lowering.
>
> I've got the patches prepped and ready to go, but it will likely have a
> significant impact on performance. Notably, a bunch of the remaining domain
> crossing bugs I'm seeing are due to this. The key thing to realize is that
> vector shuffle combining is *much* more powerful when we think all of these
> are legal, and so we combine away bad shuffles that would trigger domain
> crosses.

That's good news!
Also, I really like your idea of making all shuffles legal by default.
I remember I did some experiments disabling the checks for legal
shuffles in the DAGCombiner to see how well the new shuffle lowering
coped with 'overly' aggressive shuffle combining. I was surprised to
see that, from eyeballing the generated code, it looked much cleaner
(although I didn't test it extensively). Our target is btver2, so I
also didn't look at what the codegen would have been for targets
without AVX/SSE4.1, where there might be fewer opportunities to match
a shuffle to a single target instruction during legalization.

> All of my benchmarks have come back performance neutral overall with a few
> benchmarks improving. However, there may be some regressions that folks want
> to track down first. I'd really like to get those reported and prioritize
> among the vector shuffle work so we can nuke several *thousand* lines of
> code from X86ISelLowering.cpp. =D

I'll see if I can get some numbers from our internal codebase and help
with reporting potential regressions.

Thanks,
Andrea

Thanks for the heads up, Chandler!
I'll also try to get some numbers locally.

Michael

I ran the benchmarking subset of the test-suite on a btver2 machine, optimizing for btver2 (so AVX codegen is enabled).

I don't see anything outside of the noise with -x86-experimental-vector-shuffle-legality=1.

Hi Chandler,

I ran the new flag through some of our internal benchmarks and overall this is neutral to slightly better.
I observed a couple of improvements, but also a couple of regressions.
I am working on tracking those down to produce reduced test cases. Not sure I'll have time to do that by the end of this week, though.

Cheers,
-Quentin

Hi Chandler,

I don't see any scary changes in performance outside of the noise (~1%) when I set "all shuffles as legal" to true. I haven't tested the vector widening yet.

A minor concern: in several situations I'm seeing that vpermilps has been replaced with shufps with repeated inputs (see below). I'm also seeing cases where 2 x shufps are used where a single shufps/permilps occurred before. I'll try to reduce some examples for Bugzilla.

old:
vshufps $19, %xmm3, %xmm2, %xmm2 # xmm2 = xmm2[3,0],xmm3[1,0]
vpermilps $45, %xmm2, %xmm2 # xmm2 = xmm2[1,3,2,0]

new:
vshufps $76, %xmm3, %xmm2, %xmm2 # xmm2 = xmm2[0,3],xmm3[0,1]
vshufps $120, %xmm2, %xmm2, %xmm2 # xmm2 = xmm2[0,2,3,1]

This may be fixable with improvements to DAGCombiner::visitVECTOR_SHUFFLE - it uses shuffle mask legality to decide whether to commute candidate shuffle masks, and with all shuffles legal we never end up commuting. Better optimisation may be possible if we can optimise more shuffle(shuffle(A, B, M0), shuffle(C, D, M1), M2) patterns as well.
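
For concreteness, that pattern at the source level looks something like the
following (illustrative only; clang vector extensions, made-up masks):

typedef float v4f32 __attribute__((vector_size(16)));

/* shuffle(shuffle(A, B, M0), shuffle(C, D, M1), M2): the outer shuffle picks
   from two shuffled intermediates, which is where commuting or merging the
   masks can expose a cheaper lowering. */
v4f32 nested_shuffle(v4f32 a, v4f32 b, v4f32 c, v4f32 d) {
  v4f32 ab = __builtin_shufflevector(a, b, 0, 4, 1, 5); /* M0 */
  v4f32 cd = __builtin_shufflevector(c, d, 2, 6, 3, 7); /* M1 */
  return __builtin_shufflevector(ab, cd, 0, 1, 4, 5);   /* M2 */
}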

Cheers, Simon.

Thanks all!

Simon, Quentin, do you feel comfortable removing the old legality testing now, or would you like to leave it in until you have more time to reduce test cases?

(I'm not actually trying to hurry the test cases; I'm fine with leaving it in place as long as necessary, just checking so I know how to proceed.)

> Thanks all!
>
> Simon, Quentin, do you feel comfortable removing the old legality testing now, or would you like to leave it in until you have more time to reduce test cases?

I would prefer to keep it around to reduce the test cases.

Q.

Same here – it looks mostly neutral.

I don’t think we see anything that’s significantly above or below noise.

Hi Chandler,

I’ve been looking at the regressions Quentin mentioned, and filed a PR
for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377

As for the others, I’m working on reducing them, but for now, here are
some raw observations, in case any of it rings a bell:

Another problem I’m seeing is that in some cases we can’t fold memory anymore:
vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
vblendps $0x1, %xmm2, %xmm0, %xmm0
becomes:
vmovaps -0xXX(%rdx), %xmm2
vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
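
A shuffle of roughly this shape reproduces the pattern (hypothetical
reduction in clang vector extensions, not the actual source): one operand is
a load whose lane 3 ends up in lane 0, while the other operand's lanes 1..3
stay in place, so the old lowering could fold the load into vpermilps and
finish with a blend.

typedef float v4f32 __attribute__((vector_size(16)));

v4f32 fold_case(const v4f32 *p, v4f32 x) {
  /* result = p[3], x[1], x[2], x[3] */
  return __builtin_shufflevector(*p, x, 3, 5, 6, 7);
}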

Also, I see differences that I'm a bit conflicted about when some loads
are shuffled:
vmovaps -0xXX(%rbp), %xmm3
...
vinsertps $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
becomes:
vpermilps $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
...
vinsertps $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]

Note that the second version does the shuffle in-place, in xmm2.

Some are blends (har har) of those two:
vpermilps $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
vpermilps $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
vblendps $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
becomes:
vmovaps -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
vpermilps $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
vshufps $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
vshufps $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]

I also see a lot of somewhat neutral (focusing on Haswell for now)
domain changes such as the following (xmm5 and xmm0 are initially
integers, and are dead after the store):
vpshufd $-0x5c, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,2,2]
vpalignr $0xc, %xmm0, %xmm5, %xmm0 ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
vmovdqu %xmm0, 0x20(%rax)
turning into:
vshufps $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
vshufps $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
vmovups %xmm0, 0x20(%rax)

-Ahmed

> Hi Chandler,
>
> I've been looking at the regressions Quentin mentioned, and filed a PR
> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>
> As for the others, I'm working on reducing them, but for now, here are
> some raw observations, in case any of it rings a bell:

Very cool, and thanks for the analysis!

> [snip: raw asm observations quoted in full above]

All of these stem from what I think is the same core weakness of the
current algorithm: we prefer the fully general shufps+shufps 4-way
shuffle/blend far too often. Here is how I would more precisely classify
the two things missing here:

- Check if either input is "in place" and we can do a fast single-input
shuffle with a fixed blend.
- Check if we can form a rotation and use palignr to finish a shuffle/blend.

There may be other patterns we're missing, but these two seem to jump out
based on your analysis, and may be fairly easy to tackle.
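
To make those two concrete, masks of roughly these shapes (illustrative
sketches in clang vector extensions, matching the asm patterns Ahmed posted,
not the reduced test cases) hit each case:

typedef float v4f32 __attribute__((vector_size(16)));
typedef int   v4i32 __attribute__((vector_size(16)));

/* Case 1: lanes 1..3 of a are already in place and only lane 0 comes from a
   (shuffled) b, so a single-input shuffle of b plus a fixed blend should
   beat the generic two-vshufps lowering. */
v4f32 in_place_blend(v4f32 a, v4f32 b) {
  return __builtin_shufflevector(a, b, 7, 1, 2, 3);
}

/* Case 2: a 4-byte-granularity rotation across the concatenation of the two
   inputs, i.e. the shape palignr produces directly. */
v4i32 rotation(v4i32 a, v4i32 b) {
  return __builtin_shufflevector(a, b, 3, 4, 5, 6);
}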

> [snip: intro and asm observations quoted above]
>
> All of these stem from what I think is the same core weakness of the
> current algorithm: we prefer the fully general shufps+shufps 4-way
> shuffle/blend far too often. Here is how I would more precisely classify
> the two things missing here:
>
> - Check if either input is "in place" and we can do a fast single-input
> shuffle with a fixed blend.

I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390

> - Check if we can form a rotation and use palignr to finish a shuffle/blend.

... and this would be http://llvm.org/bugs/show_bug.cgi?id=22391

I think this about covers the Haswell regressions I'm seeing. Now for some
pre-AVX fun!

-Ahmed

I filed a couple more, in case they’re actually different issues:

And that’s pretty much it for internal changes. I’m fine with flipping the switch; Quentin, are you?
Also, just to get an idea: do you (or someone else!) plan to tackle these in the near future?

I may get to one or two in the next month, but not more than that. I'm focused on the pass manager for now. If no one gets there first, I'll eventually circle back though, so they won't rot forever.

> I may get to one or two in the next month, but not more than that. I'm
> focused on the pass manager for now. If no one gets there first, I'll
> eventually circle back though, so they won't rot forever.

Alright, I'll give it a try in the next few weeks as well.

-Ahmed

Thanks to everyone doing the benchmarking!!! =D

FYI, I've fixed all of the regressions filed except for PR22391, along with a
*giant* pile of other improvements to the vector shuffle lowering. I even
have a fix up my sleeve for PR22391, but it needs refactoring in the code
that I don't really want to do while supporting both lowerings. It is time.

I'm going to start submitting the patches now to rip out the flag and all
the code supporting it.

-Chandler