[RFC] Support long instruction fixup for X86

Issue

The instruction-size limit of 15 bytes still applies to APX instructions.
(APX SPEC: https://cdrdv2.intel.com/v1/dl/getContent/784266)

Note that it is possible for an EVEX-encoded legacy instruction to reach the 15-byte instruction length limit: 4
bytes of EVEX prefix + 1 byte of opcode + 1 byte of ModRM + 1 byte of SIB + 4 bytes of displacement + 4
bytes of immediate = 15 bytes in total, e.g.

addq    $184, -96, %rax   # encoding: [0x62,0xf4,0xfc,0x18,0x81,0x04,0x25,0xa0,0xff,0xff,0xff,0xb8,0x00,0x00,0x00]

If we added a segment prefix like fs, the length would be 16.
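
The byte count above can be sanity-checked with a small C++ sketch. Everything here (`EncodingParts`, `encodedLength`, `MaxX86InstLen`, and the two constants) is a hypothetical name made up for illustration, not an LLVM API; the sketch only adds up the component sizes listed above.

```cpp
// Hypothetical breakdown of one instruction encoding, in bytes per component.
struct EncodingParts {
  unsigned SegPrefix;  // 0 or 1 (segment override such as %fs)
  unsigned AdSize;     // 0 or 1 (address-size override)
  unsigned EvexPrefix; // 4 for an EVEX-encoded instruction
  unsigned Opcode;     // 1
  unsigned ModRM;      // 1
  unsigned SIB;        // 0 or 1
  unsigned Disp;       // 0, 1 or 4
  unsigned Imm;        // 0, 1 or 4
};

// Architectural instruction-length limit.
constexpr unsigned MaxX86InstLen = 15;

// Sum of all encoded components.
constexpr unsigned encodedLength(const EncodingParts &P) {
  return P.SegPrefix + P.AdSize + P.EvexPrefix + P.Opcode + P.ModRM + P.SIB +
         P.Disp + P.Imm;
}

// The addq example above: EVEX + opcode + ModRM + SIB + disp32 + imm32.
constexpr EncodingParts AddqAbs = {0, 0, 4, 1, 1, 1, 4, 4};
// The same instruction with an %fs segment override added.
constexpr EncodingParts AddqAbsFS = {1, 0, 4, 1, 1, 1, 4, 4};

static_assert(encodedLength(AddqAbs) == MaxX86InstLen, "exactly at the limit");
static_assert(encodedLength(AddqAbsFS) > MaxX86InstLen, "one byte over");
```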

In such a case, no additional prefix (ADSIZE or segment override) can be used. The following instructions can reach this limit:

  1. OP32mi_ND, OP32mi_NF, OP32mi_NF_ND, OP32mi_EVEX
  2. OP64mi32_ND, OP64mi32_NF, OP64mi32_NF_ND, OP64mi32_EVEX
  3. CCMP32mi, CTEST32mi, CCMP64mi, CTEST64mi32
  4. IMUL32rmi_NF, IMUL32rmi_EVEX, IMUL64rmi32_NF, IMUL64rmi32_EVEX

where OP can be AND, OR, XOR, ADD, SUB, ADC and SBB.
(See llvm/lib/Target/X86/X86InstrArithmetic.td in llvm/llvm-project on GitHub for the definitions.)

Let’s cut out some unlikely situations first.

  1. APX is supported only in 64-bit mode, where the compiler only uses 64-bit memory addressing (namely, the base and index registers are 64-bit, so no ADSIZE)
  2. The _EVEX variants of the above instructions are redundant encodings, and the compiler never emits them.

I believe we don’t need to handle the above 2 cases in LLVM. But the segment override prefix is a real issue for instruction size b/c it’s likely to be used when we access a TLS variable, etc. We encountered this issue when building Node.js with APX.

Proposal

To resolve the issue of the X86 instruction-size limit, I propose to split the “long” instruction into 2 instructions. This split should be done after (though not necessarily immediately after) register allocation b/c whether SIB is required depends on the base register.

The split is: first load the memory operand into a register, and then apply the desired prefix(es) to the shorter ri version of the same instruction class, e.g.

subq    $184, %fs:257(%rbx, %rcx), %rax

->

movq %fs:257(%rbx, %rcx),%rax
subq $184, %rax

This is preferred over

movq $184, %rax
subq %rax, %fs:257(%rbx, %rcx), %rax

b/c

  1. the former is shorter
  2. the base/index register may be the same as the dest register, in which case the latter requires a scratch register
  3. IMUL does not have an rmr variant
  4. HW supports immediate folding
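
Point 2 can be spelled out as a tiny C++ sketch. `Reg`, `MemRef`, and both functions are hypothetical names for illustration only, not LLVM code. The assumption modeled here is that the load-first form reads the address registers in the same instruction that writes the destination, whereas the immediate-first form writes the destination one instruction before the address is read.

```cpp
// Hypothetical register and memory-operand model for illustration.
enum Reg { RAX, RBX, RCX, RDX };
struct MemRef {
  Reg Base;
  Reg Index;
};

// Load-first: "mov mem, dst; op $imm, dst". The address is consumed by the
// load itself, so dst may alias the base or index register.
constexpr bool loadFirstNeedsScratch(MemRef, Reg) { return false; }

// Immediate-first: "mov $imm, dst; op dst, mem, dst". Writing dst first would
// clobber the address if dst aliases the base or index register.
constexpr bool immFirstNeedsScratch(MemRef M, Reg Dst) {
  return Dst == M.Base || Dst == M.Index;
}

// dest = %rax with address (%rbx, %rcx): neither form needs a scratch register.
static_assert(!immFirstNeedsScratch(MemRef{RBX, RCX}, RAX), "no aliasing");
// dest = %rbx with address (%rbx, %rcx): only the immediate-first form does.
static_assert(immFirstNeedsScratch(MemRef{RBX, RCX}, RBX), "dst aliases base");
static_assert(!loadFirstNeedsScratch(MemRef{RBX, RCX}, RBX), "load-first is safe");
```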

Relevant topic

I plan to implement TargetInstrInfo::getInstSizeInBytes() for X86 for the split.
Like AArch64, it returns the maximum number of bytes.

Question
Where should I do this split?
a. Rename “X86 Byte/Word Instruction Fixup” and do it there
b. X86 pseudo instruction expansion pass
c. Create a new pass

CC @RKSimon @topperc @phoebe @XinWang10 @e-kud

Why do we need the change given the forward condition is quite clear?

if (IsEVEX() && hasAddr32() && hasIMM32() && (hasSegPrefix() || hasSIB()))
  DoSplit();

I think the condition is

if (IsEVEX() && hasSIB() && hasIMM32() && hasDisp32() && (hasSegPrefix() || hasAdSize()))
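
As a rough cross-check of this condition against the byte arithmetic from the first post, here is a self-contained C++ sketch. `InstTraits`, `needsSplit`, and `lengthOf` are hypothetical names, not existing LLVM helpers, and `lengthOf` only models EVEX-encoded instructions with a 1-byte opcode and a ModRM byte.

```cpp
// Hypothetical summary of the encoding features of one instruction.
struct InstTraits {
  bool IsEVEX;
  bool HasSIB;
  bool HasDisp32;
  bool HasImm32;
  bool HasSegPrefix;
  bool HasAdSize;
};

// The condition as written above.
constexpr bool needsSplit(const InstTraits &T) {
  return T.IsEVEX && T.HasSIB && T.HasImm32 && T.HasDisp32 &&
         (T.HasSegPrefix || T.HasAdSize);
}

// Byte count under the simplified model: 1 opcode byte and 1 ModRM byte always.
constexpr unsigned lengthOf(const InstTraits &T) {
  return (T.HasSegPrefix ? 1u : 0u) + (T.HasAdSize ? 1u : 0u) +
         (T.IsEVEX ? 4u : 0u) + 1u + 1u + (T.HasSIB ? 1u : 0u) +
         (T.HasDisp32 ? 4u : 0u) + (T.HasImm32 ? 4u : 0u);
}

// %fs override with SIB, disp32 and imm32: 16 bytes, must be split.
constexpr InstTraits WithFS = {true, true, true, true, true, false};
// Same but without SIB: 15 bytes, still encodable.
constexpr InstTraits NoSIB = {true, false, true, true, true, false};

static_assert(needsSplit(WithFS) && lengthOf(WithFS) == 16, "over the limit");
static_assert(!needsSplit(NoSIB) && lengthOf(NoSIB) == 15, "fits in 15 bytes");
```

Under this model, a segment prefix alone (without SIB) still fits in 15 bytes, which is why SIB, disp32 and imm32 all have to be present before an extra prefix pushes the instruction over the limit. The sketch does not cover the case where both a segment prefix and ADSIZE are present at once.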

Why do we need the change

I would not call it a “change”. The interface TargetInstrInfo::getInstSizeInBytes() is already there but not implemented for X86. It can be used in some optimizations like IfConversion.

I’m not sure if X86FixupBWInsts would be a good place to add this.

We’ve had potential needs for a general load/store unfolding pass in the past (for cases where MachineLICM can’t help us), maybe now is the time?

Do you mean the general unfolding pass would be responsible for unfolding the instruction if it exceeds 15 bytes? What are the other potential needs?

There may be others, but these are the ones I can remember offhand:

1, 2 and 3 have different challenges from this RFC. For NDD instructions, we can use a MOV ndd_reg, [mem] for the load op. But for RMW or vector instructions, I am not sure we can find a safe register as the destination of the load after register allocation.

And for 1, we probably have a better way to resolve the issue:

  1. Add a tuning feature like slowRMW, and add the predicate !slowRMW to the RMW patterns
  2. Check the slow opcode in X86InstrInfo::foldMemoryOperandImpl to disable the folding.

[X86][CodeGen] Support long instruction fixup for APX NDD instructions by KanRobert · Pull Request #83578 · llvm/llvm-project (github.com)