Hi all!
I posted a series of patches to optimize code size of MatcherTable in SelectionDAG ISel, let me know if you are interested in these and may help to review these patches.
Backgroud
The implementation of SelectionDAG ISel is a bytecode interpreter in LLVM and the process of SelectionDAG ISel is actually the execution of this bytecode interpreter. The opcodes of this interpreter are defined in llvm/include/llvm/CodeGen/SelectionDAGISel.h
:
// Opcodes used by the DAG state machine:
enum BuiltinOpcodes {
OPC_Scope,
OPC_RecordNode,
// ...
OPC_CompleteMatch,
OPC_Coverage
};
All ISel patterns defined by backend target will be compiled into a byte array called MatcherTable
, in which opcodes with their operands are contained. For example:
static const unsigned char MatcherTable[] = {
/* 0*/ OPC_SwitchOpcode /*5 cases */, 15, TARGET_VAL(ISD::INTRINSIC_WO_CHAIN),// ->19
/* 4*/ OPC_CheckChild0Integer, 12,
/* 6*/ OPC_RecordChild1, // #0 = $x
/* 7*/ OPC_RecordChild2, // #1 = $y
/* 8*/ OPC_EmitInteger32, 2,
/* 10*/ OPC_MorphNodeTo1, TARGET_VAL(::AddRRI), 0,
MVT::i32, 3/*#Ops*/, 0, 1, 2,
// Src: (intrinsic_wo_chain:{ *:[i32] } 6:{ *:[iPTR] }, i32:{ *:[i32] }:$x, i32:{ *:[i32] }:$y) - Complexity = 8
// Dst: (AddRRI:{ *:[i32] } Reg:{ *:[i32] }:$x, Reg:{ *:[i32] }:$y, 1:{ *:[i32] })
/* 19*/ /*SwitchOpcode*/ 17, TARGET_VAL(ISD::INTRINSIC_W_CHAIN),// ->39
/* 22*/ OPC_RecordNode, // #0 = 'intrinsic_w_chain' chained node
/* 23*/ OPC_CheckChild1Integer, 72|128,3/*456*/,
/* 26*/ OPC_RecordChild2, // #1 = $src1
/* 27*/ OPC_RecordChild3, // #2 = $src2
/* 28*/ OPC_RecordChild4, // #3 = $imm
/* 29*/ OPC_EmitMergeInputChains1_0,
/* 30*/ OPC_MorphNodeTo1, TARGET_VAL(::MulRRI), 0|OPFL_Chain,
MVT::i32, 3/*#Ops*/, 1, 2, 3,
// Src: (intrinsic_w_chain:{ *:[i32] } 228:{ *:[iPTR] }, Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2, i32:{ *:[i32] }:$imm) - Complexity = 8
// Dst: (MulRRI:{ *:[i32] } Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2, i32:{ *:[i32] }:$imm)
/* 39*/ /*SwitchOpcode*/ 13, TARGET_VAL(ISD::ADD),// ->55
/* 42*/ OPC_RecordChild0, // #0 = $x
/* 43*/ OPC_RecordChild1, // #1 = $y
/* 44*/ OPC_EmitInteger32, 0,
/* 46*/ OPC_MorphNodeTo1, TARGET_VAL(::AddRRI), 0,
MVT::i32, 3/*#Ops*/, 0, 1, 2,
// Src: (add:{ *:[i32] } i32:{ *:[i32] }:$x, i32:{ *:[i32] }:$y) - Complexity = 3
// Dst: (AddRRI:{ *:[i32] } Reg:{ *:[i32] }:$x, Reg:{ *:[i32] }:$y)
/* 55*/ /*SwitchOpcode*/ 13, TARGET_VAL(ISD::SUB),// ->71
/* 58*/ OPC_RecordChild0, // #0 = $src1
/* 59*/ OPC_RecordChild1, // #1 = $src2
/* 60*/ OPC_EmitInteger32, 0,
/* 62*/ OPC_MorphNodeTo1, TARGET_VAL(::SubRRI), 0,
MVT::i32, 3/*#Ops*/, 0, 1, 2,
// Src: (sub:{ *:[i32] } Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2) - Complexity = 3
// Dst: (SubRRI:{ *:[i32] } Reg:{ *:[i32] }:$src1, Reg:{ *:[i32] }:$src2)
/* 71*/ /*SwitchOpcode*/ 13, TARGET_VAL(ISD::MUL),// ->87
/* 74*/ OPC_RecordChild0, // #0 = $x
/* 75*/ OPC_RecordChild1, // #1 = $y
/* 76*/ OPC_EmitInteger32, 0,
/* 78*/ OPC_MorphNodeTo1, TARGET_VAL(::MulIRR), 0,
MVT::i32, 3/*#Ops*/, 2, 0, 1,
// Src: (mul:{ *:[i32] } i32:{ *:[i32] }:$x, i32:{ *:[i32] }:$y) - Complexity = 3
// Dst: (MulIRR:{ *:[i32] } Reg:{ *:[i32] }:$x, Reg:{ *:[i32] }:$y)
/* 87*/ 0, // EndSwitchOpcode
0
}; // Total Array size is 89 bytes
And you can see a Total Array size
info at the end of MatcherTable
.
Why would I try to optimize the code size of MatcherTable?
My first intention is to optimize the MatcherTable
size of RISCV target. RISCV target has the biggest MatcherTable
because of RVV(RISCV Vector) implementation (TL;DR: there are a lot of pseudo instructions because of RVV’s register grouping feature).
A simple grep of llvm build cache:
~/llvm-project/build/lib/Target# grep -rin "Total Array size" . | sort -rn -k 8
./RISCV/RISCVGenDAGISel.inc:1118649: }; // Total Array size is 2592357 bytes
./X86/X86GenDAGISel.inc:331901: }; // Total Array size is 682337 bytes
./AArch64/AArch64GenDAGISel.inc:243534: }; // Total Array size is 507122 bytes
./AMDGPU/AMDGPUGenDAGISel.inc:237193: }; // Total Array size is 505636 bytes
./ARM/ARMGenDAGISel.inc:90710: }; // Total Array size is 200842 bytes
./NVPTX/NVPTXGenDAGISel.inc:92337: }; // Total Array size is 196179 bytes
./PowerPC/PPCGenDAGISel.inc:79255: }; // Total Array size is 188930 bytes
./Hexagon/HexagonGenDAGISel.inc:91446: }; // Total Array size is 177071 bytes
./LoongArch/LoongArchGenDAGISel.inc:44726: }; // Total Array size is 83010 bytes
./VE/VEGenDAGISel.inc:36841: }; // Total Array size is 71134 bytes
./Mips/MipsGenDAGISel.inc:28330: }; // Total Array size is 53111 bytes
./SystemZ/SystemZGenDAGISel.inc:27412: }; // Total Array size is 52915 bytes
./AMDGPU/R600GenDAGISel.inc:12579: }; // Total Array size is 37571 bytes
./WebAssembly/WebAssemblyGenDAGISel.inc:13914: }; // Total Array size is 25752 bytes
./MSP430/MSP430GenDAGISel.inc:4730: }; // Total Array size is 9103 bytes
./Sparc/SparcGenDAGISel.inc:3593: }; // Total Array size is 6531 bytes
./BPF/BPFGenDAGISel.inc:2341: }; // Total Array size is 4105 bytes
./XCore/XCoreGenDAGISel.inc:2213: }; // Total Array size is 3820 bytes
./AVR/AVRGenDAGISel.inc:1700: }; // Total Array size is 3004 bytes
./Lanai/LanaiGenDAGISel.inc:1286: }; // Total Array size is 2305 bytes
As you can see, MatcherTable
of RISCV (about 2592K) is almost 4 times larger than the second largest one (X86, 682K). This has already influenced the compile-time of RISCV target.
But, of course, RISCV will not be the only target that benefit from the reduction of MatcherTable
size.
What I have done
Most of the opcodes have followed operands bytes and I found that most of the operands have their own preferences. For exmaple, some opcodes need a MVT
type operand, but this MVT
type is usually MVT::i32
or MVT::i64
. So, based on this, I have done some optimizations to reduce the size.
Note: We will show the reduction of the llc binary size with all in-tree targets.
Optimize OPC_EmitInteger
and OPC_EmitStringInteger
These two opcodes need a MVT operand to indicate the integer type but it is always one of i8/i16/i32/i64, so we can instantiate OPC_EmitInteger
and OPC_EmitStringInteger
with
i8/i16/i32/i64 (OPC_EmitInteger8
, OPC_EmitInteger16
, OPC_EmitInteger32
, OPC_EmitInteger64
, etc.) so that we can reduce one byte (the type).
Optimize OPC_CheckType
, OPC_EmitRegister
and OPC_CheckChildType
Same as above, these opcodes need a MVT type but it is usually MVT::i32
or MVT::i64
. So, we add space-optimized forms of these opcodes that implictly encode the MVT
type so that we can reduce one byte.
- [SelectionDAG] Add instantiated OPC_CheckType by wangpc-pp · Pull Request #73283 · llvm/llvm-project · GitHub reduces ~29K.
- [SelectionDAG] Add space-optimized forms of OPC_EmitRegister by wangpc-pp · Pull Request #73291 · llvm/llvm-project · GitHub reduces ~10K.
- [SelectionDAG] Add instantiated OPC_CheckChildType by wangpc-pp · Pull Request #73297 · llvm/llvm-project · GitHub reduces ~70K.
Optimize OPC_EmitConvertToTarget
and OPC_EmitCopyToReg
These two opcodes need a RecNo
operand, which is usually a small integer. New opcodes that implicitly indicate the RecNo
(0-7) are added so that we can reduce one byte.
- [SelectionDAG] Add space-optimized forms of OPC_EmitConvertToTarget by wangpc-pp · Pull Request #73286 · llvm/llvm-project · GitHub reduces ~13K.
- [SelectionDAG] Add space-optimized forms of OPC_EmitCopyToReg by wangpc-pp · Pull Request #73293 · llvm/llvm-project · GitHub reduces ~33K.
Optimize OPC_CheckComplexPat
, OPC_CheckPatternPredicate
and OPC_CheckPredicate
For these opcodes, they need an ID operand (an integer index actually)to indicate the ComplexPat
or Predicate
that they are evaluated on. For most targets, only a small set of ComplexPat
or Predicate
have a large number of uses. So, we record the usage of each ComplexPat
or Predicate
and sort them by uses. For the top 8, new opcodes that implicitly indicate the index number (0-7) are added so that we can reduce one byte.
- [SelectionDAG] Add space-optimized forms of OPC_CheckComplexPat by wangpc-pp · Pull Request #73310 · llvm/llvm-project · GitHub reduces ~89K.
- [SelectionDAG] Add space-optimized forms of OPC_CheckPatternPredicate by wangpc-pp · Pull Request #73319 · llvm/llvm-project · GitHub reduces ~93K.
- [SelectionDAG] Add space-optimized forms of OPC_CheckPredicate by wangpc-pp · Pull Request #73488 · llvm/llvm-project · GitHub reduces ~61K.
Optimize OPC_EmitNode
and OPC_MorphNodeTo
These two opcodes need a EmitNodeInfo
to indicate the attributes of a node, for example, hasChain
, hasGlueInput
, etc. For most cases, none or one attribute bit is set. So, we can encode it
implicitly to save one byte.
Add OPC_MoveSibling
There are a lot of operations to move current node to parent and then move to another child. So OPC_MoveSibling
and its space-optimized forms are added to do this “move to sibling” operation.
These new operations will be generated when optimizing matcher in ContractNodes
. Currently MoveParent+MoveChild
will be optimized to MoveSibling
and sequences MoveParent+RecordChild+MoveChild
will be transformed into MoveSibling+RecordNode
.
- [SelectionDAG] Add OPC_MoveSibling by wangpc-pp · Pull Request #73643 · llvm/llvm-project · GitHub reduces ~30K.
Summary
Overall these changes reduce the llc binary size with all in-tree targets by about 756K.
The grep of Total Array size
is:
~/llvm-project/build/lib/Target# grep -rin "Total Array size" . | sort -rn -k 8
./RISCV/RISCVGenDAGISel.inc:1114870: }; // Total Array size is 2208263 bytes
./X86/X86GenDAGISel.inc:327015: }; // Total Array size is 612478 bytes
./AArch64/AArch64GenDAGISel.inc:240233: }; // Total Array size is 443212 bytes
./AMDGPU/AMDGPUGenDAGISel.inc:226002: }; // Total Array size is 430613 bytes
./ARM/ARMGenDAGISel.inc:90174: }; // Total Array size is 171971 bytes
./NVPTX/NVPTXGenDAGISel.inc:90076: }; // Total Array size is 169906 bytes
./PowerPC/PPCGenDAGISel.inc:78781: }; // Total Array size is 163270 bytes
./Hexagon/HexagonGenDAGISel.inc:90652: }; // Total Array size is 155936 bytes
./LoongArch/LoongArchGenDAGISel.inc:44196: }; // Total Array size is 73125 bytes
./VE/VEGenDAGISel.inc:36641: }; // Total Array size is 65232 bytes
./Mips/MipsGenDAGISel.inc:27729: }; // Total Array size is 46817 bytes
./SystemZ/SystemZGenDAGISel.inc:26984: }; // Total Array size is 46740 bytes
./AMDGPU/R600GenDAGISel.inc:12381: }; // Total Array size is 29793 bytes
./WebAssembly/WebAssemblyGenDAGISel.inc:13630: }; // Total Array size is 22611 bytes
./MSP430/MSP430GenDAGISel.inc:4687: }; // Total Array size is 8069 bytes
./Sparc/SparcGenDAGISel.inc:3577: }; // Total Array size is 5769 bytes
./BPF/BPFGenDAGISel.inc:2349: }; // Total Array size is 3709 bytes
./XCore/XCoreGenDAGISel.inc:2201: }; // Total Array size is 3434 bytes
./AVR/AVRGenDAGISel.inc:1695: }; // Total Array size is 2754 bytes
./Lanai/LanaiGenDAGISel.inc:1283: }; // Total Array size is 2075 bytes
We can see ~15% reduction of RISCV’s MatcherTable
size and average ~14% reduction for other targets.
Compile-time impact
I have only tested RISCV target and I saw a stable ~1% compile-time reduction. Maybe @nikic can add all of the pacthes to the compile time tracker? Thanks in advance!
Theoretically, thare are some influences:
- Adding more opcodes will increase the complexity of interpreter, which may cause increasement of compile-time.
- We can get better cache utilization as the MatcherTable size is decreased, which may cause decreasement of compile-time.
Anyway, I hope these works will be helpful.
Thanks!
Wang Pengcheng