Auto-generate the memory folding tables

Hi LLVM devs

As time goes on, manually adding instructions to the X86 memory folding tables has become a significant burden when supporting new ISA extensions, so I'd like to automate the generation of these tables and replace the existing hand-written tables in X86InstrFoldTables.cpp.
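For context, each row in these tables pairs a register-form opcode with its memory-form counterpart plus some TB_* flags. A simplified view of the entry type, based on X86InstrFoldTables.h (treat the field details as approximate; they have shifted between versions):

// Sorted on KeyOp so the lookup helpers can binary-search the tables.
struct X86MemoryFoldTableEntry {
  uint16_t KeyOp;  // opcode being looked up (register form when folding,
                   // memory form when unfolding)
  uint16_t DstOp;  // the paired opcode to rewrite to
  uint16_t Flags;  // TB_FOLDED_LOAD, TB_FOLDED_STORE, TB_NO_REVERSE,
                   // TB_NO_FORWARD, alignment requirements, ...
  bool operator<(const X86MemoryFoldTableEntry &RHS) const {
    return KeyOp < RHS.KeyOp;
  }
  friend bool operator<(const X86MemoryFoldTableEntry &TE, unsigned Opcode) {
    return TE.KeyOp < Opcode;
  }
};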

I noticed the following comment, added in commit 614f1924716f ([X86] Add comment about the sorting of the memory folding tables added in r335501):

// We also have a tablegen emitter that tries to autogenerate these tables
// by comparing encoding information. This can be enabled by passing
// X86_GEN_FOLD_TABLES=ON to cmake which will produce X86GenFoldTables.inc
// in the build area. There are currently some bugs in the autogenerated table
// that require a manual review to copy them from the autogenerated table into
// this table. It is unclear if we will ever be able to fully automate this
// because as new instructions are added into holes in the X86 opcode map they
// potentially pair up with old instructions and create new entries in the
// tables that would be incorrect. The manual review process allows us a chance
// to catch these before they become observable bugs.

The comment was added in 2018 and there is no Phabricator link for it, so I don't know where to find more details about it.

Knowing what these bugs are would be very useful for enabling the automation by default.

Could anyone provide some information about this? Any other suggestions are also welcome.

Best

Shengchen

Here’s a sampling.

These entries from X86InstrFoldTables.cpp have a different version in the generated table:

{ X86::MMX_MOVD64from64rr, X86::MMX_MOVD64from64rm, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::MMX_MOVD64grr, X86::MMX_MOVD64mr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::MOVPQIto64rr, X86::MOVPQI2QImr, TB_FOLDED_STORE | TB_NO_REVERSE },

{ X86::VMOVPQIto64Zrr, X86::VMOVPQI2QIZmr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::VMOVPQIto64rr, X86::VMOVPQI2QImr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::MMX_MOVD64to64rr, X86::MMX_MOVQ64rm, 0 },
{ X86::MOV64toPQIrr, X86::MOVQI2PQIrm, TB_NO_REVERSE },

These entries are missing from the generated table:

{ X86::MOV64toSDrr, X86::MOV64mr, TB_FOLDED_STORE | TB_NO_REVERSE | 0 },
{ X86::MOVDI2SSrr, X86::MOV32mr, TB_FOLDED_STORE | TB_NO_REVERSE | 0 },
{ X86::MOVSDto64rr, X86::MOVSDmr, TB_FOLDED_STORE | TB_NO_REVERSE | 0 },
{ X86::MOVSS2DIrr, X86::MOVSSmr, TB_FOLDED_STORE | 0 },

{ X86::TCRETURNri, X86::TCRETURNmi, TB_FOLDED_LOAD | TB_NO_FORWARD },
{ X86::TCRETURNri64, X86::TCRETURNmi64, TB_FOLDED_LOAD | TB_NO_FORWARD },
{ X86::VMOV64toSDZrr, X86::MOV64mr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::VMOV64toSDrr, X86::MOV64mr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::VMOVDI2SSZrr, X86::MOV32mr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::VMOVDI2SSrr, X86::MOV32mr, TB_FOLDED_STORE | TB_NO_REVERSE },
{ X86::MOV64toSDrr, X86::MOVSDrm_alt, TB_NO_REVERSE },
{ X86::MOVDI2SSrr, X86::MOVSSrm_alt, 0 },
{ X86::MOVSDrr, X86::MOVLPDrm, TB_NO_REVERSE },

This entry in the generated table is clearly wrong:
{ X86::SENDUIPI, X86::VMXON, TB_FOLDED_LOAD | 0 },

I think these entries in the generated table might be wrong, or at least the B case, because the register spill size is 16 bits but the load is only 8 bits.

{ X86::KMOVBkk, X86::KMOVBkm, 0 },
{ X86::KMOVDkk, X86::KMOVDkm, 0 },
{ X86::KMOVQkk, X86::KMOVQkm, 0 },
{ X86::KMOVWkk, X86::KMOVWkm, 0 },

The VPEXPAND entries don't have TB_NO_REVERSE in the generated table:

{ X86::VPEXPANDBZ128rr, X86::VPEXPANDBZ128rm, TB_NO_REVERSE },
{ X86::VPEXPANDBZ256rr, X86::VPEXPANDBZ256rm, TB_NO_REVERSE },
{ X86::VPEXPANDBZrr, X86::VPEXPANDBZrm, TB_NO_REVERSE },
{ X86::VPEXPANDDZ128rr, X86::VPEXPANDDZ128rm, TB_NO_REVERSE },
{ X86::VPEXPANDDZ256rr, X86::VPEXPANDDZ256rm, TB_NO_REVERSE },
{ X86::VPEXPANDDZrr, X86::VPEXPANDDZrm, TB_NO_REVERSE },
{ X86::VPEXPANDQZ128rr, X86::VPEXPANDQZ128rm, TB_NO_REVERSE },
{ X86::VPEXPANDQZ256rr, X86::VPEXPANDQZ256rm, TB_NO_REVERSE },
{ X86::VPEXPANDQZrr, X86::VPEXPANDQZrm, TB_NO_REVERSE },
{ X86::VPEXPANDWZ128rr, X86::VPEXPANDWZ128rm, TB_NO_REVERSE },
{ X86::VPEXPANDWZ256rr, X86::VPEXPANDWZ256rm, TB_NO_REVERSE },
{ X86::VPEXPANDWZrr, X86::VPEXPANDWZrm, TB_NO_REVERSE },

INSERTPS is in the generated table but not in X86InstrFoldTables.cpp, and I think that's intentional because the memory form doesn't have the Count_S field:
{ X86::INSERTPSrr, X86::INSERTPSrm, TB_NO_REVERSE },

Masked instructions are generated without TB_NO_REVERSE. You can’t unfold a load from a masked instruction because the unfold will produce an unmasked load. The TB_NO_REVERSE prevents that.
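If the generator grows a rule for this, it might look like the following in X86FoldTablesEmitter.cpp (a sketch only; the CannotUnfold member and the exact record-field name are my reading of the emitter and disassembler tables, so double-check them):

// Masked (EVEX {k}) register forms must not be unfolded: the unfolded load
// would be unmasked and could fault on lanes the mask disabled.
const Record *RegRec = RegInst->TheDef;
if (RegRec->getValueAsBit("hasEVEX_K"))
  Result.CannotUnfold = true; // printed as TB_NO_REVERSE in the .inc file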

I should also point out that autogenerating the table without also autogenerating the tests widens the test coverage hole we already have on these tables.

~Craig

Thanks, Craig! The information is very useful. I think the coverage hole of this table is a minor issue; we don't need a test for each entry in the table because it's for optimization. Enumerating every case that could be optimized in tests seems a little rigid, and I believe we only need to test the correctness of the design, e.g., TB_NO_REVERSE, TB_FOLDED_LOAD, TB_NO_FORWARD.

I guess my big concern is how we find out when new rules need to be added to the generator to create correct tables. There are multiple different bugs already in just the examples I gave. I didn’t go through all of the tables.

I think the memory folding/unfolding tables should not require unique keys; that is, one-to-many mappings should be allowed. We could define a priority for mappings that share a key, e.g.:

{ X86::MMX_MOVD64grr, X86::MMX_MOVD64mr, TB_FOLDED_STORE },
{ X86::MMX_MOVQ64rr, X86::MMX_MOVQ64mr, TB_FOLDED_STORE },

Both entries could be inserted into the folding table, and if we prefer to unfold X86::MMX_MOVD64mr to X86::MMX_MOVD64grr, we could put X86::MMX_MOVD64grr before X86::MMX_MOVQ64rr in the unfolding table:

{ X86::MMX_MOVD64mr, X86::MMX_MOVD64grr },
{ X86::MMX_MOVQ64mr, X86::MMX_MOVQ64rr },

I’m confused. All 4 of the instruction names in this example are unique. What’s the conflict?

{ X86::MMX_MOVD64grr, X86::MMX_MOVD64mr, TB_FOLDED_STORE },
{ X86::MMX_MOVQ64rr, X86::MMX_MOVQ64mr, TB_FOLDED_STORE },

TB_NO_REVERSE and TB_NO_FORWARD already serve as the priority mechanism.

Ah, I may have fainted from hunger yesterday :smiling_face_with_tear:

Let me check the details of TB_NO_REVERSE and TB_NO_FORWARD.

TB_NO_REVERSE prevents insertion into the reverse table. So that can be used to guarantee uniqueness on that table.

TB_NO_FORWARD only suppresses lookups; it isn't factored into the uniqueness checks for the forward table. That's fixable by replacing llvm::lower_bound with equal_range and getting rid of the adjacent_find checks.
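Roughly like this, reusing the existing heterogeneous operator< on the entry type (a sketch equivalent to the equal_range idea, not a tested patch):

static const X86MemoryFoldTableEntry *
lookupFoldTableImpl(ArrayRef<X86MemoryFoldTableEntry> Table, unsigned RegOp) {
  // Find the first entry with KeyOp >= RegOp, then scan every entry sharing
  // the key instead of asserting uniqueness with adjacent_find.
  auto I = llvm::lower_bound(Table, RegOp);
  for (; I != Table.end() && I->KeyOp == RegOp; ++I)
    if (!(I->Flags & TB_NO_FORWARD)) // skipped for folding, kept for unfolding
      return &*I;                    // earlier entries take priority
  return nullptr;
}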

Here's my plan:

  1. Gradually close the gap between the auto-generated folding table and the hand-written one.
  2. When the two are identical, I will use the auto-generated table by default and add a lit test to check each entry of the table (something like llvm/utils/update_llc_test_checks.py would be added).

That way we can prevent possible regressions.

@topperc I noticed that you added TB_NO_REVERSE to some folding table entries where the register form uses the REX prefix, but the memory form does not.
rG9f2f12700960

I think we can do better now because it is possible to know whether the target has the 64-bit feature, etc.
Could we check the predicates for these entries in the llvm::lookup*Table functions? (I noticed that we already do something similar with CheckVEXInstPredicate in X86GenEVEX2VEXTables.inc.)
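Something like this is what I have in mind (the STI overload and checkFoldPredicate are hypothetical; the generator would emit the predicate check the way it emits CheckVEXInstPredicate today):

const X86MemoryFoldTableEntry *
llvm::lookupUnfoldTable(unsigned MemOp, const X86Subtarget &STI) {
  const X86MemoryFoldTableEntry *Entry = lookupUnfoldTable(MemOp);
  // Filter out entries whose register form needs a feature the subtarget
  // lacks (e.g. a REX-encoded form outside 64-bit mode) instead of dropping
  // them from the table with TB_NO_REVERSE.
  if (Entry && !checkFoldPredicate(Entry->DstOp, STI)) // hypothetical hook
    return nullptr;
  return Entry;
}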

I think that’s doable, but as noted in that patch there’s no reason to ever unfold any of those instructions. Keeping TB_NO_REVERSE saves space in the unfold table.

Could we mark all instructions that only store a register or only load into a register as TB_NO_REVERSE through the generator?

Good idea! Let me do some experiments.
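A first cut in X86FoldTablesEmitter.cpp might be (same caveat as above about the member names; isMoveReg, mayLoad, and mayStore are existing tablegen/CodeGenInstruction bits):

// A pure register<->memory move gains nothing from unfolding: the unfolded
// pair would be the same load/store plus a register-to-register move, so
// mark the entry TB_NO_REVERSE up front.
if (RegInst->TheDef->getValueAsBit("isMoveReg") &&
    (MemInst->mayLoad || MemInst->mayStore))
  Result.CannotUnfold = true;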

Should instructions like MOVUPDmr and MOVUPSmr be unfoldable? The isMoveReg bit is set to 1 for MOVUPDrr and MOVUPSrr.

I don't know of any places where we unfold a store instruction, so I don't think it matters whether MOVUPDmr and MOVUPSmr are unfoldable or not.

What about pairs like these?

X86::VMOVAPDrr,            X86::VMOVAPDrm,
X86::VMOVAPSYrr,           X86::VMOVAPSYrm

X86::VMOVAPDrm and X86::VMOVAPSYrm are not store instructions but are MOV-like instructions.

I don't think there's a good reason to unfold them. I think the unfolds are in MachineLICM, to pull loads out of loops, and in InstrEmitter, when it needs to copy an EFLAGS-producing instruction.

For MachineLICM, unfolding VMOVAPDrm would leave VMOVAPDrr inside the loop, which doesn't make sense.

The InstrEmitter case only happens on instructions that write EFLAGS, I think. If the EFLAGS from one node are used twice and get clobbered between the two users, the producer will get copied so it can recreate the EFLAGS result. If the EFLAGS producer has a folded load, we need to unfold it first so that the load doesn't get copied. This case doesn't apply to VMOVAPDrm or any instruction that just loads a value into a register.