[RFC] Make GlobalISel Match Table entries 1 byte instead of 8

Hi!

Introduction

The GlobalISel match table, which powers both the InstructionSelector and Combiners, currently uses 64 bits per entry.
That is mostly wasted space: no entry actually needs the full 64 bits. In fact, we never need more than 2 bytes per value unless we’re dealing with match table indexes or immediates.

Proposal

I made a WIP that changes entries in the match table to be 1 byte each:

Data is encoded over multiple bytes when needed:

  • MatchTable Opcodes are uint8
  • MachineInstr IDs and Operand IDs are ULEB128.
    • The vast majority of patterns only need a handful of those, so they only use 1 byte. We have some outliers like PPC that have really, really, really long patterns where we need 255+ IDs.
  • LLT indexes are int8
    • Positive values are references to the LLT enum generated by the backend; negative values are for temporary LLTs derived from a register operand.
  • Predicate enums, target opcodes/RCs/RBs/registers are 2 bytes.
    • I believe no enum in any target needs more than 2 bytes.
  • Table indexes are 4 bytes
  • Immediates are 8 bytes (could be ULEB if we really wanted to optimize but they’re not used often so I don’t think it’s worth it)

Pros

This is mostly a code size optimization.

A quick test shows llc (MinSizeRel) going from 116 MB to 108 MB, so we gain about 7% in binary size.

Cons

There is a lot of test churn (42 files), and the tests become harder to read due to the use of macros to encode multi-byte values. GIMatchTableExecutorImpl.h also becomes slightly more complicated due to the need to call functions to extract values.

All of this is pretty minor. The need to manually update that many tests is annoying, but I can get it done; first, though, I’d like to investigate performance a bit more.

Performance

Before I submit a review, I would like to first do performance testing. Can anyone help me measure the impact this has on build times of LLVM itself, and on the build times of programs using GlobalISel?

I’m worried about longer LLVM build times, because the match table became more complex due to the macros used to encode values.

I’m also wondering how this affects the speed of the match table’s execution; I’m hoping it doesn’t get slower when entries end up spread across cache lines.


I like the idea! It’s now like the MatcherTable of SelectionDAG ISel.
Can you open a PR so that we can review and merge it? Also, I tried to optimize the size of the SelectionDAG MatcherTable; some of those optimizations may be useful to GISel too, I think.

Currently, only a few targets have GISel support (AArch64 and RISCV?), so I think we can notify the developers of those targets. And we have the compile-time tracker (@nikic).

Actually, I am optimistic about the execution time. You have reduced the MatcherTable by a lot of bytes, so we should get better cache utilization.

I’m still not done with the test updates, but review can start.

For building LLVM, you can probably time this yourself using something like perf stat -r 10 on the command that compiles the generated matcher.

For running LLVM, I could help do some ad hoc testing on AMDGPU, which has pretty complete GlobalISel coverage (with one or two caveats).

Good idea, I will try building the .inc file and measure it that way. Thanks.

For the AMDGPU GlobalISel testing, I can do that too (and on AArch64 as well), but I’m not sure it’s particularly reliable. I’d prefer to see numbers on a real project; I think AArch64 has some benchmarks for checking compile times, but I’m not sure where to find them.

Some quick numbers:

Before

Build AMDGPUInstructionSelector.cpp: 6.67s

Running AMDGPU/GlobalISel and AArch64 lit tests:

  10.44s: LLVM :: CodeGen/AMDGPU/GlobalISel/fdiv.f32.ll
  9.78s: LLVM :: CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
  7.02s: LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.ll
  6.08s: LLVM :: CodeGen/AMDGPU/GlobalISel/extractelement.ll
  5.48s: LLVM :: CodeGen/AMDGPU/GlobalISel/fdiv.f64.ll
  5.34s: LLVM :: CodeGen/AMDGPU/GlobalISel/extractelement.i8.ll

After

Build AMDGPUInstructionSelector.cpp: 8.20s (+23%)

  10.20s: LLVM :: CodeGen/AMDGPU/GlobalISel/fdiv.f32.ll
  9.84s: LLVM :: CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
  6.98s: LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.ll
  5.97s: LLVM :: CodeGen/AMDGPU/GlobalISel/extractelement.ll
  5.50s: LLVM :: CodeGen/AMDGPU/GlobalISel/fdiv.f64.ll
  5.35s: LLVM :: CodeGen/AMDGPU/GlobalISel/extractelement.i8.ll

Conclusions

Worse build times (expected), equivalent or slightly better runtime.

I’ll look at pre-encoding constants and only relying on macros when the value isn’t known ahead of time (e.g. an enum). Maybe that’ll help build times a bit?
Though it’s expected that builds get slower: the array is smaller in terms of bytes, but much bigger in terms of number of entries.

This is great, thanks for looking into this. I’ll run some compile time tests on Apple M-class machines to see if there’s impact.