RFC: AVX Pattern Specification [LONG]

Here's the big RFC.

As I've gone through and designed patterns for AVX, I quickly realized that the
existing SSE pattern specification, while functional, is less than ideal in
terms of maintenance. In particular, a number of nearly-identical patterns
are specified all over for nearly-identical instructions. For example:

let Constraints = "$src1 = $dst" in {
multiclass basic_sse1_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                  SDNode OpNode, Intrinsic F32Int,
                                  bit Commutable = 0> {
  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst),
                                 (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

  // Scalar operation, reg+mem.
  def SSrm : SSI<opc, MRMSrcMem, (outs FR32:$dst),
                                 (ins FR32:$src1, f32mem:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]>;
                 
  // Vector operation, reg+reg.
  def PSrr : PSI<opc, MRMSrcReg, (outs VR128:$dst),
                                 (ins VR128:$src1, VR128:$src2),
               !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
               [(set VR128:$dst, (v4f32 (OpNode VR128:$src1,
                                                VR128:$src2)))]> {
    let isCommutable = Commutable;
  }

  // Vector operation, reg+mem.
  def PSrm : PSI<opc, MRMSrcMem, (outs VR128:$dst),
                                 (ins VR128:$src1, f128mem:$src2),
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
             [(set VR128:$dst, (OpNode VR128:$src1,
                                       (memopv4f32 addr:$src2)))]>;
}
} // Constraints = "$src1 = $dst"

These are all essentially the same except that ModRM formats, types and
register classes change. For patterns that access memory there are special
"memory access operators" like memopv4f32. But the base pattern of dest =
src1 op src2 is the same.

Worse yet:

let Constraints = "$src1 = $dst" in {
multiclass basic_sse2_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                  SDNode OpNode, Intrinsic F64Int,
                                  bit Commutable = 0> {
  // Scalar operation, reg+reg.
  def SDrr : SDI<opc, MRMSrcReg, (outs FR64:$dst),
                                 (ins FR64:$src1, FR64:$src2),
                 !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                 [(set FR64:$dst, (OpNode FR64:$src1, FR64:$src2))]> {
    let isCommutable = Commutable;
  }

  // Scalar operation, reg+mem.
  def SDrm : SDI<opc, MRMSrcMem, (outs FR64:$dst),
                                 (ins FR64:$src1, f64mem:$src2),
                 !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                 [(set FR64:$dst, (OpNode FR64:$src1, (load addr:$src2)))]>;
                 
  // Vector operation, reg+reg.
  def PDrr : PDI<opc, MRMSrcReg, (outs VR128:$dst),
                                 (ins VR128:$src1, VR128:$src2),
               !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
               [(set VR128:$dst, (v2f64 (OpNode VR128:$src1,
                                                VR128:$src2)))]> {
    let isCommutable = Commutable;
  }

  // Vector operation, reg+mem.
  def PDrm : PDI<opc, MRMSrcMem, (outs VR128:$dst),
                                 (ins VR128:$src1, f128mem:$src2),
                 !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
                 [(set VR128:$dst, (OpNode VR128:$src1,
                                           (memopv2f64 addr:$src2)))]>;
}
} // Constraints = "$src1 = $dst"

This looks identical to basic_sse1_fp_binop_rm except it's SD/PD instead of
SS/PS and types and register classes differ in a predictable way.

So we already have two "levels" of redundancy: one within a single multiclass
and then redundancy across multiclasses.

This gets even worse with more complicated patterns like converts.
Essentially the same complex pattern gets duplicated for the variously-sized
converts. Bug fixes in one place need to be replicated everywhere and it's
easy to miss one or two. This is the very definition of "maintenance
problem."

Moreover, the various SSE levels were implemented at different times and do
things subtly differently. For example:

SSE1 :

  def ANDNPSrr : PSI<0x55, MRMSrcReg,
                     (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                     "andnps\t{$src2, $dst|$dst, $src2}",
                     [(set VR128:$dst,
                       (v2i64 (and (xor VR128:$src1,
                                    (bc_v2i64 (v4i32 immAllOnesV))),
                               VR128:$src2)))]>;

SSE2 :

  def ANDNPDrr : PDI<0x55, MRMSrcReg,
                     (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                     "andnpd\t{$src2, $dst|$dst, $src2}",
                     [(set VR128:$dst,
                       (and (vnot (bc_v2i64 (v2f64 VR128:$src1))),
                        (bc_v2i64 (v2f64 VR128:$src2))))]>;

Note the use of xor vs. vnot and the different placement of the bc* fragments
and use of type specifiers. I wonder if we even match both of these.
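
(For reference, vnot is just the pattern fragment for the xor-with-all-ones
form from TargetSelectionDAG.td:

  def vnot : PatFrag<(ops node:$in), (xor node:$in, immAllOnesV)>;

so the two patterns above ought to describe the same operation, apart from
where the bitcasts sit.)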

And naming is not consistent:

def Int_CVTSS2SIrr : SSI<0x2D, MRMSrcReg, (outs GR32:$dst), (ins VR128:$src),
def MOVUPSrm_Int : PSI<0x10, MRMSrcMem, (outs VR128:$dst), (ins f128mem:$src),

Furthermore, the current scheme ties patterns to prefix encodings and Requires
predicates:

  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst),
                                 (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

From X86InstrFormats.td:

class SSI<bits<8> o, Format F, dag outs, dag ins, string asm,
          list<dag> pattern>
      : I<o, F, outs, ins, asm, pattern>, XS, Requires<[HasSSE1]>;

For AVX we would need a different set of format classes because while AVX
could reuse the existing XS class (it's recoded as part of the VEX prefix so
we still need the information XS provides), "Requires<[HasSSE1]>" is
certainly inappropriate. Initially I started factoring things out to separate
XS and other prefix classes from Requires<> but that didn't solve the pattern
problems I mentioned above.
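
That factoring would look something like the sketch below (class names made
up for illustration): keep the XS prefix information in a class with no
predicate, and layer the legacy Requires<> on top of it, so an AVX format
class could reuse the prefix part without dragging in HasSSE1:

// Prefix/encoding information only, no ISA predicate (hypothetical name).
class SS_Base<bits<8> o, Format F, dag outs, dag ins, string asm,
              list<dag> pattern>
      : I<o, F, outs, ins, asm, pattern>, XS;

// The legacy SSE1 spelling, re-expressed on top of the prefix-only class.
class SSI<bits<8> o, Format F, dag outs, dag ins, string asm,
          list<dag> pattern>
      : SS_Base<o, F, outs, ins, asm, pattern>, Requires<[HasSSE1]>;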

All of this complication gets multiplied with AVX because AVX recodes all of
the legacy SSE instructions using VEX to provide three-address forms. So if
we were to follow the existing scheme, we would duplicate *all* of
X86InstrSSE.td and edit patterns to match three-address modes and then add the
256-bit patterns on top of that, effectively duplicating X86InstrSSE.td a
second time.

This is not scalable.

So what I've done is a little experiment to see if I can unify all SSE and AVX
SIMD instructions under one framework. I'll leave MMX and 3dNow alone since
they're oddballs and hardly anyone uses them.

Essentially I've created a set of base pattern classes that are very generic.
These contain the basic asm string templates and dag patterns we want to
match. These classes are parameterized by things like register class,
operand type, ModRM format and "memory access operation." I've also created
patterns that take a fully specified asm string and/or dag pattern to provide
flexibility for "oddball" instructions.

Multiclasses sit on top of the patterns and aggregate various legal
combinations (e.g. SS, SD, PS, PD for basic arithmetic). There's a set
of base multiclasses and a set of derived multiclasses that aggregate
things into legal sets. For example, some SSE instructions are vector-only
while others have scalar and vector versions. Some instructions use the
XS, XD, TB and OpSize/TB prefixes while others use the TA, T8 and OpSize
prefixes.
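
As an illustrative sketch of that layer (again with made-up names, and
eliding the "$src1 = $dst" constraint, Requires<> predicates and AVX/VEX
forms a real multiclass would also carry), an SS/PS aggregation over the
base classes sketched above might look like:

multiclass SIMD_BinOp_SS_PS<bits<8> opc, string OpcodeStr, SDNode OpNode,
                            bit Commutable = 0> {
  // Scalar single-precision forms: FR32, f32, XS prefix.
  let isCommutable = Commutable in
  def SSrr : SIMD_BinOp_RR<opc,
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 FR32, f32, OpNode>, XS;
  def SSrm : SIMD_BinOp_RM<opc,
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 FR32, f32, f32mem, load, OpNode>, XS;

  // Packed single-precision forms: VR128, v4f32, TB prefix.
  let isCommutable = Commutable in
  def PSrr : SIMD_BinOp_RR<opc,
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                 VR128, v4f32, OpNode>, TB;
  def PSrm : SIMD_BinOp_RM<opc,
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                 VR128, v4f32, f128mem, memopv4f32, OpNode>, TB;
}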

The point of all of this is to write patterns and asm strings *once* for each
kind of instruction (binary arithmetic, convert, shuffle, etc.) and then use
multiclasses to generate all of the concrete patterns for SSE and AVX.

So for example, an ADD would be specified like this:

// Arithmetic ops with intrinsics and scalar equivalents
defm ADD :
sse1_sse2_avx_binary_scalar_xs_xd_vector_tb_ostb_node_intrinsic_rm_rrm<
   0x58, // Opcode
   "add", // asm base opcode name
   fadd, // SDNode name
   "add", // Intrinsic base name (we pre-concat int_x86_sse*/avx and
           // post-contact ps/pd, etc.)
   1 // Commutative

;

Now the multiclass name is rather unwieldy, I know. That can be changed so
don't worry too much about it. I'm more concerned about the overall scheme
and that it make sense to you all.

I have a Perl script that auto-generates the necessary multiclass combinations
as well as the needed base classes depending on what's in the top-level .td
file. For now, I've named that top-level file X86InstrSIMD.td.

The Perl script would only need to be run when X86InstrSIMD.td changes. Thus
its use would be similar to how we use autoconf today: we only run autoconf /
automake when we update the .ac files, not as part of the build process.

Initially, X86InstrSIMD.td would define only AVX instructions so it would not
impact existing SSE clients. My intent is that X86InstrSIMD.td essentially
become the canonical description of all SSE and AVX instructions and
X86InstrSSE.td would go away completely.

Of course we would not transition away from X86InstrSSE.td until
X86InstrSIMD.td is proven to cover all current uses of SSE correctly.

The pros of the scheme:

* Unify all "important" x86 SIMD instructions into one framework and provide
  consistency

* Specify patterns and asm strings *once* per instruction type / family
  rather than the current scheme of multiple patterns for essentially the
  same instruction

* Bugfixes / optimizations / new patterns instantly apply to all SSE levels
  and AVX

The cons:

* Transition from X86InstrSSE.td

* A more complex class hierarchy

* A class-generating tool / indirection

Personally, I think the pros far outweigh the cons but I realize that this
proposes a major change and there are probably cons I haven't considered (and
pros as well!).

So right now I'm looking for comments. This is the way I intend to go because
it's far easier in the long run considering maintenance and future extension.

I'll post an example as soon as I have time to package it up and get approval
on this end to release it. As of now I have simple arithmetic operations
implemented and the proposed scheme seems to work. Right now we're generating
simple arithmetic and having it correctly assembled by gas.

Thanks for your ideas and input.

                             -Dave

> Here's the big RFC.
>
> Of course we would not transition away from X86InstrSSE.td until
> X86InstrSIMD.td is proven to cover all current uses of SSE correctly.
>
> The pros of the scheme:
>
> * Unify all "important" x86 SIMD instructions into one framework and provide
>   consistency

While almost all of this sounds pretty great to me, since I'm also in the process of cleaning up how x86 vector shuffles and shifts are implemented, I'm a little worried by what "important" means. Aren't *all* the instructions important? Was the goal to convert 50% of the instructions over to the new class hierarchy and leave 50% as they currently are?

> * Specify patterns and asm strings *once* per instruction type / family
>   rather than the current scheme of multiple patterns for essentially the
>   same instruction

Sounds great; you'll probably need some kind of custom lowering to go with this to make sure instructions end up in whatever canonicalized form you've chosen.

> * Bugfixes / optimizations / new patterns instantly apply to all SSE levels
>   and AVX
>
> The cons:
>
> * Transition from X86InstrSSE.td

What's the timeline look like for this?

Nate

>> * Unify all "important" x86 SIMD instructions into one framework and provide
>>   consistency
>
> While almost all of this sounds pretty great to me, since I'm also in
> the process of cleaning up how x86 vector shuffles and shifts are
> implemented, I'm a little worried by what "important" means. Aren't
> *all* the instructions important? Was the goal to convert 50% of the
> instructions over to the new class hierarchy and leave 50% as they
> currently are?

"Important" in this context means all SSE instructions. I'm excluding MMX and
3dNow. I suppose we could add them but I just don't see much point.

One could also imagine moving the scalar patterns into this but frankly, the
scalar instruction set isn't changing much. The vector instruction set is
more likely to be extended which is why I put the focus there. I'm trying to
set us up better for the future.

>> * Specify patterns and asm strings *once* per instruction type / family
>>   rather than the current scheme of multiple patterns for essentially the
>>   same instruction
>
> Sounds great; you'll probably need some kind of custom lowering to go
> with this to make sure instructions end up in whatever canonicalized
> form you've chosen.

Maybe. More likely we will need some more pattern fragments. At least that's
what I've run into so far.

>> * Bugfixes / optimizations / new patterns instantly apply to all SSE levels
>>   and AVX
>>
>> The cons:
>>
>> * Transition from X86InstrSSE.td
>
> What's the timeline look like for this?

Good question. What I hope to do is add AVX stuff incrementally. Basic
arithmetic works now. But I want to vet this proposal fully before I start
adding things to the repository. So the timeline really depends on that
process. Once people are comfortable with moving forward, I'll start checking
things in as they become available.

Moving SSE over is a longer timeline. I would expect that wouldn't be
done until 2.7 at the very earliest and probably more like 2.8.

                                    -Dave

> Here's the big RFC.
>
> As I've gone through and designed patterns for AVX, I quickly realized that the
> existing SSE pattern specification, while functional, is less than ideal in
> terms of maintenance. In particular, a number of nearly-identical patterns
> are specified all over for nearly-identical instructions. For example:

Right. A lot of the X86 backend was written before the current set of tblgen features was available. In particular, multiclasses only got retrofitted in later to avoid some of the duplicated code. Where they are used, they aren't used as well as they could be.

> Moreover, the various SSE levels were implemented at different times and do
> things subtly differently. For example:
>
> Note the use of xor vs. vnot and the different placement of the bc* fragments
> and use of type specifiers. I wonder if we even match both of these.

Right, we have the same problem within the GPR operations. Ideally we'd have a multiclass for most arithmetic operations that expands out into 8/16/32/64-bit operations, perhaps even handling reg/reg, reg/imm, and reg/mem versions all at the same time. Similar things should probably be done for SSE1/2 since it adds double versions of all the float operations that SSE1 has.

> For AVX we would need a different set of format classes because while AVX
> could reuse the existing XS class (it's recoded as part of the VEX prefix so
> we still need the information XS provides), "Requires<[HasSSE1]>" is
> certainly inappropriate. Initially I started factoring things out to separate
> XS and other prefix classes from Requires<> but that didn't solve the pattern
> problems I mentioned above.

Right, a lot of these problems can be solved by some nice refactoring stuff. I'm also hoping that some of the complexity in defining shuffle matching code can be helped by making the definition of the shuffle patterns more declarative within the td file. It would be really nice to say that "this shuffle does a 1,0,3,2 shuffle and has cost 42" and have tblgen generate all the required matching code.

> All of this complication gets multiplied with AVX because AVX recodes all of
> the legacy SSE instructions using VEX to provide three-address forms. So if
> we were to follow the existing scheme, we would duplicate *all* of
> X86InstrSSE.td and edit patterns to match three-address modes and then add the
> 256-bit patterns on top of that, effectively duplicating X86InstrSSE.td a
> second time.
>
> This is not scalable.

I agree, I think it is unfortunate that AVX decided to do this at an architectural level :).

> So what I've done is a little experiment to see if I can unify all SSE and AVX
> SIMD instructions under one framework. I'll leave MMX and 3dNow alone since
> they're oddballs and hardly anyone uses them.

Ok. I agree that the similarity being factored here is across SSE1/2/AVX.

> Essentially I've created a set of base pattern classes that are very generic.
> These contain the basic asm string templates and dag patterns we want to
> match. These classes are parameterized by things like register class,
> operand type, ModRM format and "memory access operation." I've also created
> patterns that take a fully specified asm string and/or dag pattern to provide
> flexibility for "oddball" instructions.

Ok.

> The point of all of this is to write patterns and asm strings *once* for each
> kind of instruction (binary arithmetic, convert, shuffle, etc.) and then use
> multiclasses to generate all of the concrete patterns for SSE and AVX.

Very nice.

> So for example, an ADD would be specified like this:
>
> // Arithmetic ops with intrinsics and scalar equivalents
> defm ADD :
> sse1_sse2_avx_binary_scalar_xs_xd_vector_tb_ostb_node_intrinsic_rm_rrm<
>   0x58, // Opcode
>   "add", // asm base opcode name
>   fadd, // SDNode name
>   "add", // Intrinsic base name (we pre-concat int_x86_sse*/avx and
>          // post-concat ps/pd, etc.)
>   1      // Commutative
> >;

> Now the multiclass name is rather unwieldy, I know. That can be changed so
> don't worry too much about it. I'm more concerned about the overall scheme
> and that it make sense to you all.

This does look very nice.

> I have a Perl script that auto-generates the necessary multiclass combinations
> as well as the needed base classes depending on what's in the top-level .td
> file. For now, I've named that top-level file X86InstrSIMD.td.
>
> The Perl script would only need to be run when X86InstrSIMD.td changes. Thus
> its use would be similar to how we use autoconf today: we only run autoconf /
> automake when we update the .ac files, not as part of the build process.

While I agree that we want to refactor this, I really don't think that we should autogenerate .td files from perl. This has a number of significant logistical problems. What is it that perl gives you that we can't enhance tblgen to do directly?

> Initially, X86InstrSIMD.td would define only AVX instructions so it would not
> impact existing SSE clients. My intent is that X86InstrSIMD.td essentially
> become the canonical description of all SSE and AVX instructions and
> X86InstrSSE.td would go away completely.

Instead of slowly building it up and then cutting over, I'd prefer to incrementally move patterns into it, removing them from the other .td files at the same time. This should be a nice clean and continuous refactoring that makes the code monotonically better (smaller).

> The pros of the scheme:
>
> * Unify all "important" x86 SIMD instructions into one framework and provide
>   consistency

Yay!

> * Specify patterns and asm strings *once* per instruction type / family
>   rather than the current scheme of multiple patterns for essentially the
>   same instruction

Yay!

> * Bugfixes / optimizations / new patterns instantly apply to all SSE levels
>   and AVX

Yay!

> The cons:
> * Transition from X86InstrSSE.td
> * A more complex class hierarchy

I'm not worried about these.

> * A class-generating tool / indirection

I really don't like this :). If there is something higher level that you need, I think it would be very interesting to carefully consider what the root problem is and whether there is a good solution that we can directly implement in tblgen. It is pretty clear that we can *improve* the current situation with no tblgen enhancements, but I agree that AVX is a nice forcing function that will greatly benefit from a *much improved* target description.

-Chris

Hi David,

> This is not scalable.
>
> So what I've done is a little experiment to see if I can unify all SSE and AVX
> SIMD instructions under one framework. I'll leave MMX and 3dNow alone since
> they're oddballs and hardly anyone uses them.

I don't want to unnecessarily expand your scope, but while you're doing this, it might make sense to keep in mind the new Larrabee instructions as well. They operate on 512-bit registers, and there's a (slightly indirect) reference available here:

I'm not suggesting adding these now, just that they might be interesting to keep in mind while you're doing this work.

Stefanus

> Right, a lot of these problems can be solved by some nice refactoring
> stuff. I'm also hoping that some of the complexity in defining
> shuffle matching code can be helped by making the definition of the
> shuffle patterns more declarative within the td file. It would be
> really nice to say that "this shuffle does a 1,0,3,2 shuffle and has
> cost 42" and have tblgen generate all the required matching code.

That would be nice. Any ideas how this would work?

> I agree, I think it is unfortunate that AVX decided to do this at an
> architectural level :).

I'm not sure what else would be done.

>> So what I've done is a little experiment to see if I can unify all SSE and AVX
>> SIMD instructions under one framework. I'll leave MMX and 3dNow alone since
>> they're oddballs and hardly anyone uses them.
>
> Ok. I agree that the similarity being factored here is across SSE1/2/AVX.

And SSE3/4.

>> I have a Perl script that auto-generates the necessary multiclass combinations
>> as well as the needed base classes depending on what's in the top-level .td
>> file. For now, I've named that top-level file X86InstrSIMD.td.
>>
>> The Perl script would only need to be run when X86InstrSIMD.td changes. Thus
>> its use would be similar to how we use autoconf today: we only run autoconf /
>> automake when we update the .ac files, not as part of the build process.
>
> While I agree that we want to refactor this, I really don't think that
> we should autogenerate .td files from perl. This has a number of
> significant logistical problems. What is it that perl gives you that
> we can't enhance tblgen to do directly?

Well, mainly it's because we don't have whatever tblgen enhancements we need.
I'll have to think on this some and see if I can come up with some tblgen
features that could help.

I was writing a lot of these base classes by hand at first, but there are a
lot of them (they tend to be very small) and writing them is very mechanical.
So we probably can enhance tblgen somehow. I'm just not sure what that looks
like right now.

>> Initially, X86InstrSIMD.td would define only AVX instructions so it would not
>> impact existing SSE clients. My intent is that X86InstrSIMD.td essentially
>> become the canonical description of all SSE and AVX instructions and
>> X86InstrSSE.td would go away completely.
>
> Instead of slowly building it up and then cutting over, I'd prefer to
> incrementally move patterns into it, removing them from the other .td
> files at the same time. This should be a nice clean and continuous
> refactoring that makes the code monotonically better (smaller).

Ok, that sounds like a pretty good idea.

>> The cons:
>> * Transition from X86InstrSSE.td
>> * A more complex class hierarchy
>
> I'm not worried about these.

Ok.

>> * A class-generating tool / indirection
>
> I really don't like this :). If there is something higher level that
> you need, I think it would be very interesting to carefully consider
> what the root problem is and whether there is a good solution that we
> can directly implement in tblgen. It is pretty clear that we can
> *improve* the current situation with no tblgen enhancements, but I
> agree that AVX is a nice forcing function that will greatly benefit
> from a *much improved* target description.

Your point is well taken. Let me think on this a bit.

                             -Dave

Oh, I'm definitely keeping Larrabee in mind. I've looked at the primitives
library and also the Dr. Dobb's article on LRBni:

I'm designing all of this so it should be relatively easy to plug in the
non-mask/swizzle variants of the Larrabee instructions. Masks and swizzles
will require new patterns.

I haven't seen a Larrabee ISA manual yet. My hope is that the opcodes and
formats for SSE-like instructions will be the same (e.g. a 512-bit ADD will
still be 0x58). I think that's a safe assumption. But who knows? Maybe
their graphics guys don't talk to their CPU guys. If things are radically
different, we won't be able to re-use as much. But it should still be better
than what we have now.

Also, Larrabee is a great motivator for direct mask/predicate support in LLVM.
Anyone want to re-engage in that discussion?

                                    -Dave

What I really meant was SSE3/AVX, SSSE3/AVX and SSE4/AVX. SSS?E3 and SSE4
don't have too much in common with each other.

                              -Dave

>> Right, a lot of these problems can be solved by some nice refactoring
>> stuff. I'm also hoping that some of the complexity in defining
>> shuffle matching code can be helped by making the definition of the
>> shuffle patterns more declarative within the td file. It would be
>> really nice to say that "this shuffle does a 1,0,3,2 shuffle and has
>> cost 42" and have tblgen generate all the required matching code.
>
> That would be nice. Any ideas how this would work?

Nate is currently working on refactoring a bunch of shuffle related logic, which includes changing the X86 backend to canonicalize shuffles more like the ppc/altivec backend does. Once that is done, I think it would make sense for tblgen to generate some C++ code that looks like this:

// MatchVectorShuffle - Matches a shuffle node against the available instructions,
// returning the lowest cost one as well as the actual cost of it.
unsigned MatchVectorShuffle(ShuffleVectorSDNode *N) {
   unsigned LowestCost = ~0;

   if (N can be matched by movddup) {
     unsigned movddupcost = ... // can be either constant, or callback into subtarget info
     if (LowestCost > movddupcost) {
       LowestCost = movddupcost;
       operands = [whatever]
       opcode = X86::MOVDDUP;
     }
   }

   if (N can be matched by movhlps) {
     unsigned movhlpscost = ...
     if (LowestCost > movhlpscost) {
       LowestCost = movhlpscost;
       operands = [whatever]
       opcode = X86::MOVHLPS;
     }
   }
   ...
   return LowestCost;
}

The advantage of doing this is that it moves the current heuristics for match ordering (which is a poor way to model costs) into a declarative form in the .td file. This is particularly important because different chips have different costs!

This generated function could then be called by the actual isel pass itself as well as from DAGCombine. We'd like dagcombine to be able to merge two shuffles into one, but it should only do this when the cost of the resultant shuffle is less than the two original ones (a simple greedy algorithm).

This is vague and hand wavy, but hopefully the idea comes across. We have this in the .td files right now:

;; we already have this
def MOVDDUPrr : S3DI<0x12, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src),
                       "movddup\t{$src, $dst|$dst, $src}",
                       [(set VR128:$dst,(v2f64 (movddup VR128:$src, (undef))))]>;

def movddup : PatFrag<(ops node:$lhs, node:$rhs),
                       (vector_shuffle node:$lhs, node:$rhs), [{
   return X86::isMOVDDUPMask(cast<ShuffleVectorSDNode>(N));
}]>;

The goal is to replace the pattern fragment and the C++ code for X86::isMOVDDUPMask with something like:

def movddup : PatFrag<(ops node:$lhs, node:$rhs),
                       (vector_shuffle node:$lhs, node:$rhs,
                                       0, 1, 0, 1, Cost<42>)>;

Alternatively, the cost could be put on the instructions etc, whatever makes the most sense. Incidentally, I'm not sure why movddup is currently defined to take a LHS/RHS: the RHS should always be undef so it should be coded into the movddup def.
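
Something along these lines, untested, is what coding the undef in would look
like (the instruction defs that use movddup would then pass a single operand):

def movddup : PatFrag<(ops node:$src),
                      (vector_shuffle node:$src, (undef)), [{
   return X86::isMOVDDUPMask(cast<ShuffleVectorSDNode>(N));
}]>;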

Another possible syntax would be to add a special kind of shuffle node to give more natural and clean syntax. This is probably the better solution:

def movddup : Shuffle4<VR128, undef, 0, 1, 0, 1>, Cost<42>;

>> While I agree that we want to refactor this, I really don't think that
>> we should autogenerate .td files from perl. This has a number of
>> significant logistical problems. What is it that perl gives you that
>> we can't enhance tblgen to do directly?
>
> Well, mainly it's because we don't have whatever tblgen enhancements we need.
> I'll have to think on this some and see if I can come up with some tblgen
> features that could help.
>
> I was writing a lot of these base classes by hand at first, but there are a
> lot of them (they tend to be very small) and writing them is very mechanical.
> So we probably can enhance tblgen somehow. I'm just not sure what that looks
> like right now.

Ok.

> Your point is well taken. Let me think on this a bit.

Thanks Dave! I really appreciate you working in this area,

-Chris

What does "cost" mean here? Currently isel cost means complexity of the matched pattern. It's hard to compute this by hand so the current hack is to allow manual cost adjustments.

I think it makes sense for isel to use HW cost (instruction latency, code size) as a late tie breaker. In that case, shouldn't cost be part of instruction itinerary?

Evan

What latency? Each implementation has its own quirks and LLVM must be
flexible enough to handle them. So cost needs to be a function of
the CPU type as well as the instruction.

We do need a better cost/priority mechanism than AddedComplexity (the naming
alone of that is very confusing). Perhaps we can have some base cost values
per instruction and allow each CPU type to override them.

                                -Dave

For shuffles, I don't have a strong opinion. I just want dag combiner to be able to say "if these two shuffles have greater or equal cost to the equivalent combined shuffle, then merge the shuffles into one". It doesn't matter what units these are in. The other use is to break ties between multiple instructions that can match the same shuffle pattern. For these, the precise units also don't matter.

Looking further ahead to a world where we have vectorization, we will need very precise cost models for various vector operands, scalar operations etc. I don't think it necessarily makes sense to overconstrain a solution for shuffles in the short term though.

-Chris

>> While I agree that we want to refactor this, I really don't think that
>> we should autogenerate .td files from perl. This has a number of
>> significant logistical problems. What is it that perl gives you that
>> we can't enhance tblgen to do directly?
>
> Well, mainly it's because we don't have whatever tblgen enhancements we
> need. I'll have to think on this some and see if I can come up with some
> tblgen features that could help.
>
> I was writing a lot of these base classes by hand at first, but there are a
> lot of them (they tend to be very small) and writing them is very
> mechanical. So we probably can enhance tblgen somehow. I'm just not sure
> what that looks like right now.

So I've been thinking about this some more and the major obstacle here is that
the Perl generator has a lot of X86-specific knowledge coded into it.

For example, it knows:

* "SS" instructions need to use ths XS encoding class, but only for SSE1
  and AVX
* "SD" instructions need to use the XD encoding class, but only for SSE2
  and AVX
* A vector instruction never uses XS or XD encoding
* A scalar instruction never uses OpSize or TB
* AVX uses rrm and mrr encoding, SSE uses rm and mr
* rrm expands to rrr and rrm encoding, rm expands to rr and rm
* mr only expands to mr, mrr only expands to mrr
* and on and on...

I'm not sure how to conveniently encapsulate all of that detailed knowledge
in a set of TableGen classes and/or feature extensions. Basically, TableGen
would need to look at

defm ADD :
sse1_sse2_avx_avx3_binary_scalar_xs_xd_vector_tb_ostb_node_intrinsic_rm_rrm<
  0x58, "add", fadd, "add", 1

;

and understand all of the SS/PS/SD/PD/VEX/etc. combinations that implies. Or
at least have support to conveniently express inheritance from multiclasses
that provide the valid SS/PS, etc. combinations and (this is critical) a
convenient way to transform class/multiclass arguments to specialize them for
the various instruction sets/encodings. If we only want to specify patterns
and arguments once, we need a way to specify transformations on them and pass
the results down to base (multi)classes.

That's what the intermediate classes generated by the Perl script do. They
serve two functions:

* Provide valid format combinations (e.g. SS/PS/SD/PD, SS/SD, PS/PD, etc.)
* Provide argument transformation to specialize for various formats and
  encoding
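
To make the second point concrete (reusing the hypothetical SIMD_BinOp_* base
classes sketched earlier, and again with made-up names), one of the small
generated multiclasses might look something like this:

// Generated: encodes that an SSE2 scalar op means FR64/f64mem, the "sd" asm
// suffix, the XD prefix and Requires<[HasSSE2]>, and forwards the rest.
multiclass gen_sse2_scalar_xd_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
                                 bit Commutable> {
  let isCommutable = Commutable in
  def rr : SIMD_BinOp_RR<opc,
               !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
               FR64, f64, OpNode>, XD, Requires<[HasSSE2]>;
  def rm : SIMD_BinOp_RM<opc,
               !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
               FR64, f64, f64mem, load, OpNode>, XD, Requires<[HasSSE2]>;
}

The big top-level multiclass is then little more than a bundle of defms of
multiclasses like this, one per legal format/encoding combination, and
generating that bundle plus the argument rewriting is the mechanical part the
script currently takes care of.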

I've got a few ideas rolling in my head but I need to do some more thinking.

Oh, and I guess I'll go ahead and add XOP support too:

http://forums.amd.com/devblog/blogpost.cfm?threadid=112934&catid=208

                                -Dave