AVX Shuffles & PatLeaf Help Needed

I'm working on debugging AVX shuffles and I ran into an interesting
problem.

The current isSHUFPMask predicate in X86ISelLowering needs to be
generalized to operate on 128-bit or 256-bit masks. There are
probably lots of other things to change too (LowerVECTOR_SHUFFLE_4wide,
etc.) but I'll worry about that later.

The generalized rule is:

1. For the low 64 bits of the result vector, the source can be from
   the low 128 bits of vector 1.

2. For the next 64 bits, the source can be from the low 128 bits of
   vector 2.

3. For the 3rd 64 bits, the source is the high 128 bits of vector 1.

4. For the high 64 bits, the source is the high 128 bits of vector 2.

For 128 bit vectors, steps 3 and 4 are ignored since there are no high
128 bits.

Determining the answer boils down to knowing how big a vector element
is. Then we can map operand values to ranges within 64-bit and 128-bit
chunks and determine the proper index ranges to look for. For example,
for 64-bit elements, result element zero must come from index 0 or 1.
For 32-bit elements, result element zero must come from index 0-3.

In isSHUFPMask all we have is the SDNode of the shufflevector index vector.
Unfortunately, this tells us nothing about the type of the result vector.
If we have two operands, we obviously have a v2i/f64 vector. For eight
operands, we have a v8i/f32 vector. But for four operands we could have
a v4f/i32 or a v4f/i64. So we can't know the vector element size and thus
we can't map shufflevector indices valid ranges for output vector elements.

If I change isSHUFPMask to take an extra argument which is the result
vector type (or element type, or something similar), how would I express
that extra argument in the .td file? Right now we do:

SHUFP_shuffle_mask:$src3

to add the predicate to check the 3rd (mask) operand for conformance to
something SHUFPS/D can handle.

Is there some way currently to add another "argument" to the PatLeaf
invocation?

                           -Dave

David, this is probably the wrong approach, based on the accreted awfulness of the X86 shuffle lowering code, which Eli and I have hacked on to improve somewhat. The correct approach is probably a rewrite based around what AltiVec does: Canonicalize to byte ops, and write all the patterns once rather than having to look for 6 different variants of the same pattern.

Nate

David, this is probably the wrong approach, based on the accreted awfulness
of the X86 shuffle lowering code,

Ha! I have no issue believing this statement. :slight_smile:

The correct approach is probably a rewrite based around what
AltiVec does: Canonicalize to byte ops, and write all the patterns once
rather than having to look for 6 different variants of the same pattern.

Can you expand on this with an example? There seems to be an awful lot of
shuffle patterns and predicates in PPCInstrAltivec.td. What do you mean by,
"Canonicalize to byte ops?" Can you walk me through how that works with
Altivec?

Since I'm rewriting all of the SSE patterns to clean them up and incorporate
AVX functionality anyway, a complete rewrite of shuffles is not additional
work. :slight_smile:

Thanks.

                            -Dave

Ah wait, I think I know what you mean. For x86, you mean rewrite the shuffle
operations in X86ISelLowering to operate on 32-bit elements always (the
shuffle instructions with immediate masks don't have 16-bit or 8-bit
variants), bitcasting the vector operands as appropriate and reformulating the
index vector to account for the bitcasts. Is that right?

I think I can see how to do this. I think before I start this I will try to
check in as muich AVX work as I can so that we all see the changes as they
happen. I have quite an extensive set backlogged but I think it's stable
enough to start releasing now.

                               -Dave

Hello, David

Can you expand on this with an example? There seems to be an awful lot of
shuffle patterns and predicates in PPCInstrAltivec.td. What do you mean by,
"Canonicalize to byte ops?" Can you walk me through how that works with
Altivec?

The basic idea is quite simple - lower everything to vNi8 and write
all the patterns using only these types.

Yeah, I figured that out after thinking a bit more. However, I think in this
case we only want to lower to vNi32 since there are no immediate-mask shuffles
in X86 that operate on smaller element types. Doing it at the byte level
would just be more confusing, I think.

PSHUFB is really a completely different instruction than PSHUFD, for example.

                              -Dave

Aside from consuming one of its inputs, which is a regalloc problem, it isn't really different. It's just a one-input immediate shuffle, where the immediate is not encoded in the instruction. From the perspective of the shuffle instruction, all the x86 shuffles are just various byte shuffles. Writing them all in one canonical form would substantially simplify the code, especially given layering AVX on top of the existing barely-understandable code would probably result in something almost unmaintainable.

Nate

But I don't think it can share a pattern with the immmediate-mask shuffles
because the operands are of different class (register vs. immediate).
I guess for PSHUFB the instruction works for all masks so we don't even
need a predicate to check for validity.

I'll think some more about this, but casting a vector to bytes is really
going to make debugging harder. It's easier to think about 4 or 8
index values than 16 or 32.

                                     -Dave