tablegen exponential behavior

I implemented a pattern matching of the dot product for arm64
and it seemed to work well for the basic case, i.e.,

class mulB<SDPatternOperator ldop> :
  PatFrag<(ops node:$Rn, node:$Rm, node:$offset),
          (mul (ldop (add node:$Rn, node:$offset)),
               (ldop (add node:$Rm, node:$offset)))>;
class mulBz<SDPatternOperator ldop> :
  PatFrag<(ops node:$Rn, node:$Rm),
          (mul (ldop node:$Rn), (ldop node:$Rm))>;

class DotProductI32<Instruction DOT, SDPatternOperator ldop> :
  Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)),
                (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))),
       (i64 (DOT (DUPv2i32gpr WZR),
                 (v8i8 (LD1Onev8b GPR64sp:$Rn)),
                 (v8i8 (LD1Onev8b GPR64sp:$Rm)))),
       sub_32)>, Requires<[HasDotProd]>;

  def : DotProductI32<SDOTv8i8, sextloadi8>;
  def : DotProductI32<UDOTv8i8, zextloadi8>;

Then when I extended it to 8 element vectors, the time spent by tblgen exploded:
from under 7 seconds (on A-72) on the AArch64 td files and the above patch
to more than half an hour when I decided to terminate the processes.

Here are the additional def'pats that produce the exponential behavior:

def VADDV_32 : OutPatFrag<(ops node:$R), (ADDPv2i32 node:$R, node:$R)>;

class DotProduct2I32<Instruction DOT, SDPatternOperator ldop> :
  Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 7)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 6)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 5)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 4)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)),
           (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)),
                (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))))))),
        (i64 (DOT (DUPv2i32gpr WZR),
                  (v8i8 (LD1Onev8b GPR64sp:$Rn)),
                  (v8i8 (LD1Onev8b GPR64sp:$Rm))))),
       sub_32)>, Requires<[HasDotProd]>;

  def : DotProduct2I32<SDOTv8i8, sextloadi8>;
  def : DotProduct2I32<UDOTv8i8, zextloadi8>;

linux-perf profile for the first minute executing llvm-tblgen shows
that most of the time is spent in isIsomorphicTo:

  28.25% llvm-tblgen llvm-tblgen [.]
  21.62% llvm-tblgen llvm-tblgen [.]
  15.25% llvm-tblgen [.] memcmp
  14.61% llvm-tblgen llvm-tblgen [.]

In call-graph mode `perf -g` points to GenerateVariants that generates
most of the calls to isIsomorphicTo:

+ 100.00% 0.00% llvm-tblgen llvm-tblgen [.] main
+ 100.00% 0.00% llvm-tblgen llvm-tblgen [.] llvm::TableGenMain
+ 99.85% 0.00% llvm-tblgen llvm-tblgen [.] (anonymous
+ 99.85% 0.00% llvm-tblgen llvm-tblgen [.] llvm::EmitDAGISel
+ 99.85% 0.00% llvm-tblgen llvm-tblgen [.]
+ 99.46% 98.01% llvm-tblgen llvm-tblgen [.]
     0.38% 0.00% llvm-tblgen llvm-tblgen [.] GenerateVariantsOf


add and mul are both commutable and tablegen tries to create patterns with every possible permutation of commutable operations.

Thanks Craig!
I looked at the implementation of tablegen and it is not easy to come up
with some limiting factor. The best I have found is to compute the
height of the
TreePatternNode and do not generate permutations if the height is
larger than 4 or 5.
Would there be interest in cleaning up that patch and submitting for review?

I have also found out that (on the codes that I care for) I don't need a long
def-pat matching dot products of 8 by 8 bytes, as the SLP vectorizer
transforms that code into vector add/mul that only needs a simple pattern.