[RFC] Support long instruction fixup for X86

KanRobert · January 26, 2024, 8:01am

Issue

The instruction-size limit of 15 bytes still applies to APX instructions.
(APX SPEC: https://cdrdv2.intel.com/v1/dl/getContent/784266)

Note that it is possible for an EVEX-encoded legacy instruction to reach the 15-byte instruction length limit: 4
bytes of EVEX prefix + 1 byte of opcode + 1 byte of ModRM + 1 byte of SIB + 4 bytes of displacement + 4
bytes of immediate = 15 bytes in total, e.g.

addq    $184, -96, %rax   # encoding: [0x62,0xf4,0xfc,0x18,0x81,0x04,0x25,0xa0,0xff,0xff,0xff,0xb8,0x00,0x00,0x00]

If we added a segment prefix like fs, the length would be 16.
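
The byte count above can be checked with a small sketch. `EncodingFields`, `encodedLength`, `exceedsLimit` and `evexMemImm32` are hypothetical names invented for this illustration, not LLVM APIs, and the model only covers the fields listed in the breakdown above.

```cpp
#include <cassert>

// Hypothetical description of the encoded fields of one x86 instruction.
// This is not an LLVM type; it only mirrors the breakdown in the post.
struct EncodingFields {
  unsigned LegacyPrefixBytes; // segment override, ADSIZE, ...: one byte each
  unsigned EvexPrefixBytes;   // 4 for an EVEX-encoded instruction, else 0
  unsigned OpcodeBytes;
  bool HasModRM;
  bool HasSIB;
  unsigned DispBytes;         // 0, 1 or 4 in this simplified model
  unsigned ImmBytes;
};

constexpr unsigned MaxX86InstrBytes = 15;

// Sum the sizes of all encoded fields.
inline unsigned encodedLength(const EncodingFields &F) {
  assert((F.DispBytes == 0 || F.DispBytes == 1 || F.DispBytes == 4) &&
         "unexpected displacement size in this model");
  return F.LegacyPrefixBytes + F.EvexPrefixBytes + F.OpcodeBytes +
         (F.HasModRM ? 1u : 0u) + (F.HasSIB ? 1u : 0u) + F.DispBytes +
         F.ImmBytes;
}

inline bool exceedsLimit(const EncodingFields &F) {
  return encodedLength(F) > MaxX86InstrBytes;
}

// EVEX + opcode + ModRM + SIB + disp32 + imm32, plus N legacy prefix bytes.
inline EncodingFields evexMemImm32(unsigned LegacyPrefixBytes) {
  return EncodingFields{LegacyPrefixBytes, 4, 1, true, true, 4, 4};
}
```

With no legacy prefix, `encodedLength(evexMemImm32(0))` gives 15, matching the `addq` example; one segment prefix gives 16, which `exceedsLimit` reports as over the limit.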

In such a case, no additional prefix (ADSIZE or segment override) can be used. The limit can be reached by the following instructions:

OP32mi_ND, OP32mi_NF, OP32mi_NF_ND, OP32mi_EVEX
OP64mi32_ND, OP64mi32_NF, OP64mi32_NF_ND, OP64mi32_EVEX
CCMP32mi, CTEST32mi, CCMP64mi, CTEST64mi32
IMUL32rmi_NF, IMUL32rmi_EVEX, IMUL64rmi32_NF, IMUL64rmi32_EVEX

where OP can be AND, OR, XOR, ADD, SUB, ADC and SBB.
(See llvm/lib/Target/X86/X86InstrArithmetic.td on the main branch of llvm/llvm-project on GitHub for the definitions.)

Let’s cut out some unlikely situations first.

APX is supported only in 64-bit mode, where the compiler uses only 64-bit memory addressing (that is, the base and index registers are 64-bit, so there is no ADSIZE prefix).
The _EVEX variants of the above instructions are redundant encodings, and the compiler never emits them.

I believe we don’t need to handle the above two cases in LLVM. The segment override prefix, however, is a real issue for instruction size because it is likely to be used when we access a TLS variable, among other cases. We encountered this issue when building Node.js with APX.

Proposal

To resolve the X86 instruction-size limit issue, I propose splitting the “long” instruction into two instructions. The split should be done after register allocation (not necessarily immediately after) because whether a SIB byte is required depends on the base register.

The split is: first load the memory operand into a register, and then apply the desired prefix(es) to the shorter ri version of the same instruction class, e.g.

subq    $184, %fs:257(%rbx, %rcx), %rax

->

movq %fs:257(%rbx, %rcx), %rax
subq $184, %rax

This is preferred over

movq $184, %rax
subq %rax, %fs:257(%rbx, %rcx), %rax

because:

the former is shorter
the base/index register may be the same as the destination register, in which case the latter requires a scratch register
IMUL does not have an rmr variant
HW supports immediate folding
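
The preferred split can be sketched on a toy textual representation. This is only an illustration under stated assumptions: `NDDMemImm` and `splitLongInstr` are hypothetical names, operands are plain AT&T-syntax strings rather than LLVM `MachineOperand`s, and the load mnemonic is derived from the operation's size suffix.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified NDD instruction "op $imm, mem, dst",
// meaning dst = mem OP imm. Not an LLVM data structure.
struct NDDMemImm {
  std::string Mnemonic; // e.g. "subq"
  long Imm;             // e.g. 184
  std::string Mem;      // e.g. "%fs:257(%rbx, %rcx)"
  std::string Dst;      // e.g. "%rax"
};

// Split into a load of the memory operand into the destination register,
// followed by the shorter register-immediate (ri) form of the operation.
// The load reads the base/index registers before it writes Dst, so no
// scratch register is needed even when Dst aliases one of them.
inline std::vector<std::string> splitLongInstr(const NDDMemImm &I) {
  assert(!I.Mnemonic.empty() && "mnemonic must carry a size suffix");
  // Reuse the size suffix of the operation ('q' or 'l') for the load.
  const std::string Load = std::string("mov") + I.Mnemonic.back();
  return {
      Load + " " + I.Mem + ", " + I.Dst,
      I.Mnemonic + " $" + std::to_string(I.Imm) + ", " + I.Dst,
  };
}
```

Calling `splitLongInstr({"subq", 184, "%fs:257(%rbx, %rcx)", "%rax"})` yields the two-instruction sequence shown in the proposal: the `movq` load carrying the segment prefix, then `subq $184, %rax`.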

Relevant topic

I plan to implement TargetInstrInfo::getInstSizeInBytes() for X86 for the split.
Like AArch64, it returns the maximum number of bytes.

Question
Where should I do this split?
a. Rename “X86 Byte/Word Instruction Fixup” and do it there
b. X86 pseudo instruction expansion pass
c. Create a new pass

KanRobert · January 26, 2024, 8:04am

CC @RKSimon @topperc @phoebe @XinWang10 @e-kud

phoebe · January 26, 2024, 8:51am

Why do we need the change given the forward condition is quite clear?

if (IsEVEX() && hasAddr32() && hasIMM32() && (hasSegPrefix() || hasSIB()))
  DoSplit();

KanRobert · January 26, 2024, 9:39am

I think the condition is

if (IsEVEX() && hasSIB() && hasIMM32() && hasDisp32() && (hasSegPrefix() || hasAdSize()))
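
As a self-contained sketch, the condition can be written as a predicate over a summary of the instruction. `InstrShape` and `needsSplit` are hypothetical names for this illustration; in LLVM the individual facts would have to be derived from the actual instruction and its operands after register allocation.

```cpp
#include <cassert>

// Hypothetical summary of one instruction's encoding properties.
struct InstrShape {
  bool IsEVEX;
  bool HasSIB;
  bool HasImm32;
  bool HasDisp32;
  bool HasSegPrefix;
  bool HasAdSize;
};

// EVEX (4) + opcode (1) + ModRM (1) + SIB (1) + disp32 (4) + imm32 (4)
// already fills all 15 bytes, so any extra prefix byte forces a split.
inline bool needsSplit(const InstrShape &I) {
  const bool AtLimit = I.IsEVEX && I.HasSIB && I.HasImm32 && I.HasDisp32;
  return AtLimit && (I.HasSegPrefix || I.HasAdSize);
}

// Usage example: the 15-byte addq from the first post, without and with %fs.
inline void checkExamples() {
  assert(!needsSplit({true, true, true, true, false, false}));
  assert(needsSplit({true, true, true, true, true, false}));
}
```

Note that this predicate requires both a SIB byte and a 32-bit displacement, so an instruction with a segment prefix but no SIB byte (15 bytes in total) is left alone.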

Why do we need the change

I would not call it a “change”. The interface TargetInstrInfo::getInstSizeInBytes() is already there but not implemented for X86. It can be used in some optimizations like IfConversion.

RKSimon · January 29, 2024, 1:45pm

I’m not sure if X86FixupBWInsts would be a good place to add this.

We’ve had potential needs for a general load/store unfolding pass in the past (for cases where MachineLICM can’t help us), maybe now is the time?

KanRobert · January 29, 2024, 3:00pm

Do you mean the general unfolding pass would be responsible for unfolding the instruction if it exceeds 15 bytes? What are the other potential needs?

RKSimon · January 29, 2024, 4:19pm

There may be others, but these are the ones I can remember offhand:

1. Unfolding RMW cases on some targets where they are particularly slow: [X86] Avoid RMW ADC/SBB operations on some targets (llvm/llvm-project issue #40176)
2. Unfolding vector constant loads (if we have spare registers to avoid a spill) when building for optsize/minsize, if it would allow X86FixupVectorConstants to “compress” the constant
3. Unfolding compress stores on znver4: [x86] Work around slow compress store instruction on znver4 (llvm/llvm-project issue #72530)

KanRobert · February 2, 2024, 3:38am

1, 2 and 3 have different challenges from this RFC. For NDD instructions, we can use a MOV ndd_reg, [mem] for the load op. But for RMW or vector instructions, I am not sure we can find a safe register as the destination of the load after register allocation.

KanRobert · February 2, 2024, 8:39am

And for 1, we probably have a better way to resolve the issue:

Add tuning feature like slowRMW, and add predicate !slowRMW for the RMW patterns
Check the slow opcode in X86InstrInfo::foldMemoryOperandImpl to disable the folding.

KanRobert · May 19, 2024, 2:42am

[X86][CodeGen] Support long instruction fixup for APX NDD instructions by KanRobert · Pull Request #83578 · llvm/llvm-project (github.com)
