[RFC][SelectionDAG] Code-size optimizations of MatcherTable

Hi all!
I have posted a series of patches that optimize the code size of the MatcherTable in SelectionDAG ISel. Please let me know if you are interested and would be willing to review these patches.

Background

SelectionDAG ISel is implemented in LLVM as a bytecode interpreter: instruction selection is the execution of this interpreter. Its opcodes are defined in llvm/include/llvm/CodeGen/SelectionDAGISel.h:

  // Opcodes used by the DAG state machine:
  enum BuiltinOpcodes {
    OPC_Scope,
    OPC_RecordNode,
    // ...
    OPC_CompleteMatch,
    OPC_Coverage
  };

All ISel patterns defined by a backend target are compiled into a byte array called the MatcherTable, which contains the opcodes together with their operands. For example:

  static const unsigned char MatcherTable[] = {
/*     0*/ OPC_SwitchOpcode /*5 cases */, 15, TARGET_VAL(ISD::INTRINSIC_WO_CHAIN),// ->19
/*     4*/  OPC_CheckChild0Integer, 12, 
/*     6*/  OPC_RecordChild1, // #0 = $x
/*     7*/  OPC_RecordChild2, // #1 = $y
/*     8*/  OPC_EmitInteger32, 2, 
/*    10*/  OPC_MorphNodeTo1, TARGET_VAL(::AddRRI), 0,
                MVT::i32, 3/*#Ops*/, 0, 1, 2, 
            // Src: (intrinsic_wo_chain:{ *:[i32] } 6:{ *:[iPTR] }, i32:{ *:[i32] }:$x, i32:{ *:[i32] }:$y) - Complexity = 8
            // Dst: (AddRRI:{ *:[i32] } Reg:{ *:[i32] }:$x, Reg:{ *:[i32] }:$y, 1:{ *:[i32] })
/*    19*/ /*SwitchOpcode*/ 17, TARGET_VAL(ISD::INTRINSIC_W_CHAIN),// ->39
/*    22*/  OPC_RecordNode, // #0 = 'intrinsic_w_chain' chained node
/*    23*/  OPC_CheckChild1Integer, 72|128,3/*456*/, 
/*    26*/  OPC_RecordChild2, // #1 = $src1
/*    27*/  OPC_RecordChild3, // #2 = $src2
/*    28*/  OPC_RecordChild4, // #3 = $imm
/*    29*/  OPC_EmitMergeInputChains1_0,
/*    30*/  OPC_MorphNodeTo1, TARGET_VAL(::MulRRI), 0|OPFL_Chain,
                MVT::i32, 3/*#Ops*/, 1, 2, 3, 
            // Src: (intrinsic_w_chain:{ *:[i32] } 228:{ *:[iPTR] }, Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2, i32:{ *:[i32] }:$imm) - Complexity = 8
            // Dst: (MulRRI:{ *:[i32] } Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2, i32:{ *:[i32] }:$imm)
/*    39*/ /*SwitchOpcode*/ 13, TARGET_VAL(ISD::ADD),// ->55
/*    42*/  OPC_RecordChild0, // #0 = $x
/*    43*/  OPC_RecordChild1, // #1 = $y
/*    44*/  OPC_EmitInteger32, 0, 
/*    46*/  OPC_MorphNodeTo1, TARGET_VAL(::AddRRI), 0,
                MVT::i32, 3/*#Ops*/, 0, 1, 2, 
            // Src: (add:{ *:[i32] } i32:{ *:[i32] }:$x, i32:{ *:[i32] }:$y) - Complexity = 3
            // Dst: (AddRRI:{ *:[i32] } Reg:{ *:[i32] }:$x, Reg:{ *:[i32] }:$y)
/*    55*/ /*SwitchOpcode*/ 13, TARGET_VAL(ISD::SUB),// ->71
/*    58*/  OPC_RecordChild0, // #0 = $src1
/*    59*/  OPC_RecordChild1, // #1 = $src2
/*    60*/  OPC_EmitInteger32, 0, 
/*    62*/  OPC_MorphNodeTo1, TARGET_VAL(::SubRRI), 0,
                MVT::i32, 3/*#Ops*/, 0, 1, 2, 
            // Src: (sub:{ *:[i32] } Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2) - Complexity = 3
            // Dst: (SubRRI:{ *:[i32] } Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2)
/*    71*/ /*SwitchOpcode*/ 13, TARGET_VAL(ISD::MUL),// ->87
/*    74*/  OPC_RecordChild0, // #0 = $x
/*    75*/  OPC_RecordChild1, // #1 = $y
/*    76*/  OPC_EmitInteger32, 0, 
/*    78*/  OPC_MorphNodeTo1, TARGET_VAL(::MulIRR), 0,
                MVT::i32, 3/*#Ops*/, 2, 0, 1, 
            // Src: (mul:{ *:[i32] } i32:{ *:[i32] }:$x, i32:{ *:[i32] }:$y) - Complexity = 3
            // Dst: (MulIRR:{ *:[i32] } Reg:{ *:[i32] }:$x, Reg:{ *:[i32] }:$y)
/*    87*/ 0, // EndSwitchOpcode
    0
  }; // Total Array size is 89 bytes

The total array size is reported in a comment at the end of the MatcherTable.

Why optimize the code size of the MatcherTable?

My initial motivation was to optimize the MatcherTable size of the RISCV target. RISCV has the biggest MatcherTable because of the RVV (RISC-V Vector) implementation (TL;DR: there are a lot of pseudo instructions due to RVV's register grouping feature).

A simple grep over the LLVM build directory:

~/llvm-project/build/lib/Target# grep -rin "Total Array size" . | sort -rn -k 8
./RISCV/RISCVGenDAGISel.inc:1118649:  }; // Total Array size is 2592357 bytes
./X86/X86GenDAGISel.inc:331901:  }; // Total Array size is 682337 bytes
./AArch64/AArch64GenDAGISel.inc:243534:  }; // Total Array size is 507122 bytes
./AMDGPU/AMDGPUGenDAGISel.inc:237193:  }; // Total Array size is 505636 bytes
./ARM/ARMGenDAGISel.inc:90710:  }; // Total Array size is 200842 bytes
./NVPTX/NVPTXGenDAGISel.inc:92337:  }; // Total Array size is 196179 bytes
./PowerPC/PPCGenDAGISel.inc:79255:  }; // Total Array size is 188930 bytes
./Hexagon/HexagonGenDAGISel.inc:91446:  }; // Total Array size is 177071 bytes
./LoongArch/LoongArchGenDAGISel.inc:44726:  }; // Total Array size is 83010 bytes
./VE/VEGenDAGISel.inc:36841:  }; // Total Array size is 71134 bytes
./Mips/MipsGenDAGISel.inc:28330:  }; // Total Array size is 53111 bytes
./SystemZ/SystemZGenDAGISel.inc:27412:  }; // Total Array size is 52915 bytes
./AMDGPU/R600GenDAGISel.inc:12579:  }; // Total Array size is 37571 bytes
./WebAssembly/WebAssemblyGenDAGISel.inc:13914:  }; // Total Array size is 25752 bytes
./MSP430/MSP430GenDAGISel.inc:4730:  }; // Total Array size is 9103 bytes
./Sparc/SparcGenDAGISel.inc:3593:  }; // Total Array size is 6531 bytes
./BPF/BPFGenDAGISel.inc:2341:  }; // Total Array size is 4105 bytes
./XCore/XCoreGenDAGISel.inc:2213:  }; // Total Array size is 3820 bytes
./AVR/AVRGenDAGISel.inc:1700:  }; // Total Array size is 3004 bytes
./Lanai/LanaiGenDAGISel.inc:1286:  }; // Total Array size is 2305 bytes

As you can see, the MatcherTable of RISCV (about 2592K) is almost four times as large as the second largest one (X86, about 682K). This already affects the compile time of the RISCV target.

But, of course, RISCV is not the only target that benefits from reducing the MatcherTable size.

What I have done

Most opcodes are followed by operand bytes, and I found that most operands have strongly skewed value distributions. For example, some opcodes need an MVT type operand, but this type is usually MVT::i32 or MVT::i64. Based on this observation, I have implemented the following optimizations to reduce the size.

Note: the overall reduction below is measured as the change in the size of an llc binary built with all in-tree targets.

Optimize OPC_EmitInteger and OPC_EmitStringInteger

These two opcodes need an MVT operand to indicate the integer type, but in practice it is always one of i8/i16/i32/i64. So we can instantiate OPC_EmitInteger and OPC_EmitStringInteger for i8/i16/i32/i64 (OPC_EmitInteger8, OPC_EmitInteger16, OPC_EmitInteger32, OPC_EmitInteger64, etc.), saving one byte (the type) per use.
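For illustration, the OPC_EmitInteger32, 2 entry from the example table above, compared with a generic encoding that carries an explicit type byte (the exact generic spelling here is only illustrative):

  // Generic form: opcode + explicit MVT byte + value (3 bytes)
  OPC_EmitInteger, MVT::i32, 2,
  // Specialized form: type folded into the opcode (2 bytes)
  OPC_EmitInteger32, 2,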

Optimize OPC_CheckType, OPC_EmitRegister and OPC_CheckChildType

As above, these opcodes need an MVT type, but it is usually MVT::i32 or MVT::i64. So we add space-optimized forms of these opcodes that implicitly encode the MVT type, saving one byte per use.
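To sketch the interpreter side of this idea (a self-contained illustration, not the actual SelectionDAGISel.cpp code; the enum and function names are hypothetical), the specialized forms simply imply the type and skip reading a type byte from the table:

  // Hypothetical opcodes and types, for illustration only.
  enum HypotheticalOpcode : unsigned char {
    OPC_CheckType,    // followed by an explicit MVT byte
    OPC_CheckTypeI32, // type implied: i32, no operand byte
    OPC_CheckTypeI64, // type implied: i64, no operand byte
  };
  enum HypotheticalVT : unsigned char { VT_i32, VT_i64 /* ... */ };

  // One handler serves both the generic and the space-optimized forms.
  bool checkType(const unsigned char *Table, unsigned &Idx,
                 unsigned char Opcode, HypotheticalVT NodeVT) {
    HypotheticalVT Expected;
    switch (Opcode) {
    case OPC_CheckTypeI32: Expected = VT_i32; break;
    case OPC_CheckTypeI64: Expected = VT_i64; break;
    default:               Expected = HypotheticalVT(Table[Idx++]); break;
    }
    return NodeVT == Expected; // fail the current pattern on mismatch
  }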

Optimize OPC_EmitConvertToTarget and OPC_EmitCopyToReg

These two opcodes need a RecNo operand, which is usually a small integer. New opcodes that implicitly encode RecNo values 0-7 are added, saving one byte per use.
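For illustration (the specialized opcode name is illustrative), the change for OPC_EmitConvertToTarget looks like:

  // Generic form: opcode + explicit RecNo byte (2 bytes)
  OPC_EmitConvertToTarget, 0,
  // Specialized form for RecNo 0-7: RecNo folded into the opcode (1 byte)
  OPC_EmitConvertToTarget0,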

Optimize OPC_CheckComplexPat, OPC_CheckPatternPredicate and OPC_CheckPredicate

These opcodes need an ID operand (actually an integer index) to indicate which ComplexPat or Predicate they evaluate. For most targets, only a small set of ComplexPats or Predicates account for most of the uses. So we record the number of uses of each ComplexPat or Predicate and sort them by use count. For the top 8, new opcodes that implicitly encode the index (0-7) are added, saving one byte per use.
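A minimal, self-contained sketch of this selection step (not the actual DAGISelMatcherEmitter code; the function name is illustrative):

  #include <algorithm>
  #include <cstddef>
  #include <utility>
  #include <vector>

  // Given the number of uses of each predicate/complex-pattern ID, return
  // the (up to) 8 most-used IDs; only these get the one-byte opcode forms.
  std::vector<unsigned> pickTop8(const std::vector<unsigned> &UsesPerID) {
    std::vector<std::pair<unsigned, unsigned>> Counted; // {ID, #uses}
    for (unsigned ID = 0; ID < UsesPerID.size(); ++ID)
      Counted.push_back({ID, UsesPerID[ID]});

    // Most-used first; stable so ties keep their original ID order.
    std::stable_sort(Counted.begin(), Counted.end(),
                     [](const auto &A, const auto &B) {
                       return A.second > B.second;
                     });

    std::vector<unsigned> Top;
    for (std::size_t I = 0; I < Counted.size() && I < 8; ++I)
      Top.push_back(Counted[I].first);
    return Top; // all other IDs keep the generic "opcode + ID byte" encoding
  }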

Optimize OPC_EmitNode and OPC_MorphNodeTo

These two opcodes need an EmitNodeInfo operand to indicate the attributes of a node, for example hasChain, hasGlueInput, etc. In most cases, no attribute bit or only one attribute bit is set, so we can encode it implicitly in the opcode and save one byte.
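For illustration with the second pattern from the example table above (the specialized opcode name is illustrative):

  // Generic form: explicit EmitNodeInfo byte
  OPC_MorphNodeTo1, TARGET_VAL(::MulRRI), 0|OPFL_Chain,
      MVT::i32, 3/*#Ops*/, 1, 2, 3,
  // Specialized form: the single "has chain" bit implied by the opcode,
  // saving the EmitNodeInfo byte
  OPC_MorphNodeTo1Chain, TARGET_VAL(::MulRRI),
      MVT::i32, 3/*#Ops*/, 1, 2, 3,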

Add OPC_MoveSibling

There are a lot of operation sequences that move from the current node to its parent and then into another child. So OPC_MoveSibling and its space-optimized forms are added to perform this "move to sibling" operation directly.

These new operations are generated when optimizing the matcher in ContractNodes. Currently, MoveParent+MoveChild is contracted to MoveSibling, and the sequence MoveParent+RecordChild+MoveChild is transformed into MoveSibling+RecordNode, as sketched below.
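For illustration, the two contractions look roughly like this (the exact encoding of MoveSibling is illustrative):

  // Before: move up to the parent, then down into child 3
  OPC_MoveParent,
  OPC_MoveChild, 3,
  // After: a single "move to sibling 3" operation
  OPC_MoveSibling, 3,

  // Before: move up, record child 3 of the parent, then move into child 3
  OPC_MoveParent,
  OPC_RecordChild3,
  OPC_MoveChild, 3,
  // After: move to sibling 3, then record the node we are now standing on
  OPC_MoveSibling, 3,
  OPC_RecordNode,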

Summary

Overall, these changes reduce the size of an llc binary built with all in-tree targets by about 756K.

The same grep of Total Array size after the changes:

~/llvm-project/build/lib/Target# grep -rin "Total Array size" . | sort -rn -k 8
./RISCV/RISCVGenDAGISel.inc:1114870:  }; // Total Array size is 2208263 bytes
./X86/X86GenDAGISel.inc:327015:  }; // Total Array size is 612478 bytes
./AArch64/AArch64GenDAGISel.inc:240233:  }; // Total Array size is 443212 bytes
./AMDGPU/AMDGPUGenDAGISel.inc:226002:  }; // Total Array size is 430613 bytes
./ARM/ARMGenDAGISel.inc:90174:  }; // Total Array size is 171971 bytes
./NVPTX/NVPTXGenDAGISel.inc:90076:  }; // Total Array size is 169906 bytes
./PowerPC/PPCGenDAGISel.inc:78781:  }; // Total Array size is 163270 bytes
./Hexagon/HexagonGenDAGISel.inc:90652:  }; // Total Array size is 155936 bytes
./LoongArch/LoongArchGenDAGISel.inc:44196:  }; // Total Array size is 73125 bytes
./VE/VEGenDAGISel.inc:36641:  }; // Total Array size is 65232 bytes
./Mips/MipsGenDAGISel.inc:27729:  }; // Total Array size is 46817 bytes
./SystemZ/SystemZGenDAGISel.inc:26984:  }; // Total Array size is 46740 bytes
./AMDGPU/R600GenDAGISel.inc:12381:  }; // Total Array size is 29793 bytes
./WebAssembly/WebAssemblyGenDAGISel.inc:13630:  }; // Total Array size is 22611 bytes
./MSP430/MSP430GenDAGISel.inc:4687:  }; // Total Array size is 8069 bytes
./Sparc/SparcGenDAGISel.inc:3577:  }; // Total Array size is 5769 bytes
./BPF/BPFGenDAGISel.inc:2349:  }; // Total Array size is 3709 bytes
./XCore/XCoreGenDAGISel.inc:2201:  }; // Total Array size is 3434 bytes
./AVR/AVRGenDAGISel.inc:1695:  }; // Total Array size is 2754 bytes
./Lanai/LanaiGenDAGISel.inc:1283:  }; // Total Array size is 2075 bytes

This is a ~15% reduction in RISCV's MatcherTable size; across all in-tree targets combined, the MatcherTables shrink by about 14%.

Compile-time impact

I have only tested the RISCV target, where I saw a stable ~1% compile-time reduction. Maybe @nikic can add all of the patches to the compile-time tracker? Thanks in advance!
Theoretically, there are two opposing effects:

  • Adding more opcodes increases the complexity of the interpreter, which may increase compile time.
  • A smaller MatcherTable gives better cache utilization, which may decrease compile time.

Anyway, I hope this work will be helpful.

Thanks!
Wang Pengcheng


This is great work, thank you for looking into this!

I’ve observed many inefficiencies in our various table-generated lookup tables before, but haven’t had the time to prioritize further improvements. This is great, and I think there is more opportunity in the other generated tables. In particular, I recall that the assembly and disassembly tables could be improved, but that could be stale information.

I think this is really fantastic work! Thanks for working on this. I think this is a really promising direction to reduce the binary size of the toolchain without introducing too much complexity. Reducing the size of the toolchain is a big benefit when distributing it to customers, and the efforts are really appreciated.

As @rnk mentioned, I also recall that many of the generated tables could probably be reduced. Since you have investigated this recently, do you know of other places where the size contribution stood out to you?

cc: @nickdesaulniers, @petrhosek

Yeah, I found that the DecoderTables used in disassembly can also be reduced and optimized. As for the AsmMatcher, its table is big too, but I don't know how to reduce it yet.